The ShATR Multiple Simultaneous Speaker Corpus
What is ShATR? | Downloading | Contributions | Documentation | Credits
ShATR is a corpus of overlapped speech collected by the University of Sheffield Speech and Hearing Research Group in collaboration with ATR in order to support research into computational auditory scene analysis. The task involved four participants working in pairs to solve two crosswords; a fifth participant acted as a hint-giver. Eight channels of audio data were recorded from the following sensors: one close microphone per speaker, one omnidirectional microphone, and the two channels of a binaurally wired mannequin. Around 41% of the corpus contains overlapped speakers. In addition, a variety of other audio data was collected from each participant. The entire corpus, which has a duration of around 37 minutes, has been segmented and transcribed at 5 levels, from subtasks down to phones. All non-speech sounds have also been marked.
ShATR was recorded in 1994 and has been available at a nominal cost on CD-ROM since then. In April 2002, a decision was taken to make the entire corpus available on the web at this site.
You can find out more about ShATR with this detailed description.
In order to compare representations and analyses of ShATR, we encourage you to submit your analyses to us. Further, to focus efforts on the same segment, we have arbitrarily selected a single minute of the corpus for preferential treatment (see below).
You are welcome to download both transcriptions and audio files. Some additional single-speaker audio data is also available.
ShATR has been transcribed/segmented at 5 levels:
Word- and phone-level transcriptions were produced automatically using the ABBOT recogniser and have not been manually corrected.
Transcriptions are simple ASCII text files containing lines of the form:
label startTime duration
where startTime and duration are given in samples at a sampling frequency of 48 kHz. Note that audio data for channels 6, 7 and 8 is supplied at 48 kHz (to support work in sound localisation), while channels 1-5 have been downsampled to 16 kHz. Consequently, you should divide the figures given in the transcription files by 3 to obtain a correspondence between samples and audio segments for channels 1-5.
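As an illustration of the format, here is a minimal sketch (in Python; it is not part of the distribution) that reads a transcription file and converts the 48 kHz sample counts to seconds and to 16 kHz sample indices for channels 1-5. The filename used is a hypothetical placeholder.

# Minimal sketch for reading a ShATR transcription file (the filename is a
# hypothetical placeholder). Transcription times are 48 kHz sample counts.
TRANS_RATE = 48000      # transcription sample rate
DOWNSAMPLE_FACTOR = 3   # channels 1-5 are at 16 kHz (48 kHz / 3)

def read_transcription(path):
    """Yield (label, start, duration) tuples; times in 48 kHz samples."""
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue                      # skip blank or malformed lines
            label, start, dur = parts[0], int(parts[1]), int(parts[2])
            yield label, start, dur

for label, start, dur in read_transcription("spkr1.trans"):
    seconds = start / TRANS_RATE              # start time in seconds
    start16k = start // DOWNSAMPLE_FACTOR     # sample index for channels 1-5
    print(label, seconds, start16k)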
You can inspect and download transcription files from the table below. Transcriptions are provided independently for each speaker.
For ease of downloading, the main part of the corpus (i.e. the crossword-solving activity) has been split into 37 one-minute segments in each of the 8 channels. Recall that channels 1-5 were obtained from close-talking microphones for the 5 participants. These have been downsampled to 16 kHz from an original sampling rate of 48 kHz, and each file occupies about 1.83 MB. Channel 6 is from the omnidirectional microphone, and hence would be a good source for testing single-channel speech segregation algorithms. Channels 7 and 8 were obtained from the left and right ear microphones attached to a mannequin. Channels 6-8 are provided at the original sampling rate of 48 kHz, with a file size of around 5.5 MB per segment. All files are single-channel .wav files.
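If you wish to compare a close-microphone channel with one of the 48 kHz channels sample by sample, one option is to downsample the latter by a factor of 3. The following sketch uses SciPy and hypothetical segment filenames; it is an illustration only, not part of the corpus tools.

# Sketch: align a 16 kHz close-mic segment with a 48 kHz omni segment.
# The filenames below are hypothetical placeholders.
from scipy.io import wavfile
from scipy.signal import resample_poly

fs_close, close = wavfile.read("ch1_min19.wav")   # channel 1, 16 kHz
fs_omni, omni = wavfile.read("ch6_min19.wav")     # channel 6, 48 kHz

# Downsample the omni channel by 3 (48 kHz -> 16 kHz) so that sample indices
# line up with channels 1-5 (and with transcription times divided by 3).
omni_16k = resample_poly(omni, up=1, down=3)

print(fs_close, len(close), fs_omni, len(omni_16k))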
It is expected that most researchers will require relatively few segments from the table below, but if you need access to the entire corpus, mail us and we may decide to add (very large!) zip or tar files for each channel. For the sake of comparison between analyses, we have selected a particular minute (19, highlighted in green below) as one on which you may wish to focus your efforts. We will be happy to collect representations and analyses of this portion on this website (see below).
You will note that 5 files in the table below are highlighted in red. These are segments (all in channel 8) that we are currently unable to uncompress from the original Shorten format. We are providing these segments as (possibly unreadable) Shorten files, and would be extremely happy if any of you manage to convert them to wav format for us!
A note on audio quality: we have noticed that channels 7 and 8 contain a narrowband component at around 16 kHz, at a level 20-30 dB below that of the main speech energy. We are unsure of its origin. It would be straightforward to remove it by filtering, or simply to ignore it, since few algorithms operate at those frequencies; however, you should be aware of its presence.
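One possible way to suppress that component is a narrow notch filter. The sketch below uses SciPy with an assumed centre frequency of 16 kHz, an arbitrary Q, and a hypothetical filename; treat it as a starting point rather than a prescribed fix.

# Sketch: notch out the ~16 kHz narrowband component in channel 7 or 8.
# The centre frequency and Q are assumptions; the filename is a placeholder.
from scipy.io import wavfile
from scipy.signal import iirnotch, filtfilt

fs, x = wavfile.read("ch8_min19.wav")        # 48 kHz mannequin channel
b, a = iirnotch(w0=16000.0, Q=30.0, fs=fs)   # narrow notch centred at 16 kHz
x_clean = filtfilt(b, a, x.astype(float))    # zero-phase filtering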
Speakers 1, 2, 3 and 5 are native British English speakers; speaker 4 uses American English.
spkr 1 | spkr 2 | spkr 3 | spkr 4 | spkr 5 | omni | left | right
Each speaker provided 5 sets of additional audio data. This material may be of use for speaker adaptation or for building word models for keyword-spotting tasks. Speakers occupied the same location relative to the mannequin as that adopted in the crossword task. Note that full data from the mannequin and omnidirectional microphone is provided for the shibboleth sentences only; the other sets contain sound material from the single-speaker microphone only. Thus, studies of speaker location might use the shibboleth sentences as test material (before setting out on the harder problem of multi-source speaker location, for which the crossword task is ideal!).
The additional data consisted of:
(*) Due to an oversight, no "common words" data for speaker 4 was recorded.
Of these, only the TIMIT sentences have been transcribed:
For the purpose of comparing representations and analyses, this section will contain contributed figures, preferably for minute 19 of the corpus. Suggested contributions:
Please email a URL.
Prior to its web-release, ShATR was described in a couple of relatively obscure papers:
A revised version of the latter paper was published in Computational Auditory Scene Analysis (eds. D. F. Rosenthal and H. G. Okuno) Mahwah, NJ: Lawrence Erlbaum, pp. 321-333.
Some technical documentation describing the details of the recording environment is available.
The following people contributed to the planning, development, collection, annotation and subsequent web-release of ShATR: Jon Barker, Guy Brown, Martin Cooke, Malcolm Crawford, Inge-Marie Eigsti, Phil Green, Brian Karlsen, Hiroaki Kato, Hideki Kawahara, Steve Renals, Masako Tanaka and Minoru Tsuzaki. Thanks are due also to David Kirby of the BBC for much help and support during the early stages of configuring the recording equipment. We also acknowledge help, advice and criticism from many colleagues in the speech, hearing and linguistics communities.
This work was supported by SERC Image Interpretation Initiative Research Grant GR/H53174; a study visit grant from ATR (Advanced Telecommunications Research), Kyoto, to Guy Brown and Malcolm Crawford; Royal Society and Royal Academy of Engineering awards to Phil Green; and a Royal Society grant to Martin Cooke.
Last update: 18th April, 2002 by Martin Cooke