The ShATR Multiple Simultaneous Speaker Corpus



What is ShATR?

ShATR is a corpus of overlapped speech collected by the University of Sheffield Speech and Hearing Research Group in collaboration with ATR to support research into computational auditory scene analysis. The task involved four participants working in pairs to solve two crosswords, with a fifth participant acting as a hint-giver. Eight channels of audio data were recorded from the following sensors: one close microphone per speaker, one omnidirectional microphone, and the two channels of a binaurally wired manikin. Around 41% of the corpus contains overlapping speech. A variety of other audio data was also collected from each participant. The entire corpus, which has a duration of around 37 minutes, has been segmented and transcribed at 5 levels, from subtasks down to phones, and all nonspeech sounds have been marked.

ShATR was recorded in 1994 and has been available since then on CD-ROM at a nominal cost. In April 2002, we decided to make the entire corpus available on the web at this site.

You can find out more about ShATR in this detailed description.

To allow comparison of representations and analyses of ShATR, we encourage you to submit your analyses to us. Further, to focus efforts on the same material, we have (arbitrarily) selected a single minute of the corpus for particular attention (see below).

Downloading

You are welcome to download both transcriptions and audio files. Some additional single-speaker audio data is also available.

Transcriptions

ShATR has been transcribed/segmented at 5 levels: structural (subtask), nonspeech, sentence, word and phone.

Word- and phone-level transcriptions were produced automatically using the ABBOT recogniser and have not been manually corrected.

Transcriptions are simple ASCII text files containing lines of the form:

   label  startTime  duration

where startTime and duration are given in samples at a sampling frequency of 48 kHz. Note that audio data for channels 6, 7 and 8 is supplied at 48 kHz (to support work in sound localisation), while channels 1-5 have been downsampled to 16 kHz. Consequently, you should divide the figures given in the transcription files by 3 to obtain the corresponding sample indices for channels 1-5.
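As a minimal illustration of this convention, the Python sketch below parses transcription lines and converts the 48 kHz sample counts to seconds and to 16 kHz sample indices for channels 1-5. The file name (and its extension) is an assumption for the sake of the example, not part of the corpus.

   # Sketch: parse ShATR transcription lines ("label startTime duration",
   # times given as sample counts at 48 kHz) and convert them to seconds
   # and to 16 kHz sample indices for the downsampled channels 1-5.

   SRC_RATE = 48000                       # transcription times are 48 kHz samples
   CLOSE_MIC_RATE = 16000                 # channels 1-5 are supplied at 16 kHz
   RATIO = SRC_RATE // CLOSE_MIC_RATE     # divide by 3 for channels 1-5

   def parse_transcription(path):
       """Yield (label, start_sec, dur_sec, start_16k, dur_16k) per line."""
       with open(path) as f:
           for line in f:
               parts = line.split()
               if len(parts) < 3:
                   continue                     # skip blank or malformed lines
               label, start, dur = parts[0], int(parts[1]), int(parts[2])
               yield (label,
                      start / SRC_RATE,         # start time in seconds
                      dur / SRC_RATE,           # duration in seconds
                      start // RATIO,           # 16 kHz start sample
                      dur // RATIO)             # 16 kHz duration in samples

   # Hypothetical usage with the sentence-level file for speaker 1:
   for entry in parse_transcription("sentence1.txt"):
       print(entry)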

You can inspect and download transcription files from the table below. Transcriptions are provided independently for each speaker.


level       spkr1        spkr2        spkr3        spkr4        spkr5
structural  struct1      struct2      struct3      struct4      struct5
nonspeech   nonspeech1   nonspeech2   nonspeech3   nonspeech4   nonspeech5
sentence    sentence1    sentence2    sentence3    sentence4    sentence5
word        word1        word2        word3        word4        word5
phone       phone1       phone2       phone3       phone4       phone5

Audio files

For ease of downloading, the main part of the corpus (i.e. the crossword-solving activity) has been split into 37 one-minute segments in each of the 8 channels. Recall that channels 1-5 were obtained from close-talking microphones for the 5 participants; these have been downsampled from the original 48 kHz to 16 kHz, and each file occupies about 1.83 MB. Channel 6 is from the omnidirectional microphone, and is therefore a good source for testing single-channel speech segregation algorithms. Channels 7 and 8 were obtained from the left and right ear microphones attached to the manikin. Channels 6-8 are provided at the original sampling rate of 48 kHz, with a file size of around 5.5 MB. All files are single-channel .wav files.
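If you need to compare a close-microphone channel with the omnidirectional or binaural channels, one simple approach is to bring everything to a common rate. The Python sketch below reads one 16 kHz close-mic minute and the corresponding 48 kHz omni minute and decimates the latter by 3; the file names follow the c<channel>m<minute>.wav pattern of the table below and are assumed for illustration, so treat this as a sketch rather than a supplied tool.

   # Sketch: load a close-mic minute (16 kHz) and the matching omni minute
   # (48 kHz), then downsample the omni channel so the two can be compared
   # sample-for-sample. File names are assumed from the table below.

   from scipy.io import wavfile
   from scipy.signal import resample_poly

   rate_close, close = wavfile.read("c1m19.wav")   # channel 1, minute 19 (16 kHz)
   rate_omni, omni = wavfile.read("c6m19.wav")     # channel 6, minute 19 (48 kHz)
   assert rate_close == 16000 and rate_omni == 48000

   omni_16k = resample_poly(omni, up=1, down=3)    # exact 3:1 decimation

   print(len(close) / rate_close, "s of close-mic audio")
   print(len(omni_16k) / 16000.0, "s of downsampled omni audio")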

It is expected that most researchers will require relatively few segments from the table below, but if you need access to the entire corpus, mail us and we may add (very large!) zip or tar files for each channel. For the sake of comparison between analyses, we have selected a particular minute (19, highlighted in green below) as one on which you may wish to focus your efforts. We will be happy to collect representations and analyses of this portion on this website (see below).

You will note that 5 files in the table below are highlighted in red. These are segments (all in channel 8) that we are currently unable to uncompress from the original Shorten format. We are providing these segments as (possibly unreadable) Shorten files, and would be extremely happy if any of you manage to convert them to wav format for us!

A note on audio quality: we have noticed that channels 7 and 8 contain a narrowband component at around 16 kHz, at a level 20-30 dB below that of the main speech energy. We are unsure of its origin. It would be straightforward to remove by filtering, or simply to ignore, since few algorithms operate at those frequencies; you should, however, be aware of its presence.
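If the component does interfere with a particular analysis, a narrow notch filter around 16 kHz is one way to suppress it. The Python/SciPy sketch below illustrates the idea on a 48 kHz channel-7 file; the centre frequency, Q factor and file name are assumptions on our part, so inspect a spectrogram first and adjust to taste.

   # Sketch: notch out the narrowband component reported near 16 kHz in the
   # 48 kHz binaural channels (7 and 8). Centre frequency and Q are guesses.

   from scipy.io import wavfile
   from scipy.signal import iirnotch, filtfilt

   rate, x = wavfile.read("c7m19.wav")              # assumed file choice (48 kHz)
   b, a = iirnotch(w0=16000.0, Q=30.0, fs=rate)     # narrow notch at ~16 kHz
   y = filtfilt(b, a, x.astype(float))              # zero-phase filtering

   wavfile.write("c7m19_notched.wav", rate, y.astype(x.dtype))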

Speakers 1, 2, 3 and 5 are native British English speakers; speaker 4 uses American English.

minute  spkr1  spkr2  spkr3  spkr4  spkr5  omni   left   right
        (ch1)  (ch2)  (ch3)  (ch4)  (ch5)  (ch6)  (ch7)  (ch8)
1 c1m1 c2m1 c3m1 c4m1 c5m1 c6m1 c7m1 c8m1
2 c1m2 c2m2 c3m2 c4m2 c5m2 c6m2 c7m2 c8m2
3 c1m3 c2m3 c3m3 c4m3 c5m3 c6m3 c7m3 c8m3
4 c1m4 c2m4 c3m4 c4m4 c5m4 c6m4 c7m4 c8m4
5 c1m5 c2m5 c3m5 c4m5 c5m5 c6m5 c7m5 c8m5
6 c1m6 c2m6 c3m6 c4m6 c5m6 c6m6 c7m6 c8m6
7 c1m7 c2m7 c3m7 c4m7 c5m7 c6m7 c7m7 c8m7
8 c1m8 c2m8 c3m8 c4m8 c5m8 c6m8 c7m8 c8m8
9 c1m9 c2m9 c3m9 c4m9 c5m9 c6m9 c7m9 c8m9
10 c1m10 c2m10 c3m10 c4m10 c5m10 c6m10 c7m10 c8m10
11 c1m11 c2m11 c3m11 c4m11 c5m11 c6m11 c7m11 c8m11
12 c1m12 c2m12 c3m12 c4m12 c5m12 c6m12 c7m12 c8m12
13 c1m13 c2m13 c3m13 c4m13 c5m13 c6m13 c7m13 c8m13
14 c1m14 c2m14 c3m14 c4m14 c5m14 c6m14 c7m14 c8m14
15 c1m15 c2m15 c3m15 c4m15 c5m15 c6m15 c7m15 c8m15
16 c1m16 c2m16 c3m16 c4m16 c5m16 c6m16 c7m16 c8m16
17 c1m17 c2m17 c3m17 c4m17 c5m17 c6m17 c7m17 c8m17
18 c1m18 c2m18 c3m18 c4m18 c5m18 c6m18 c7m18 c8m18
19 c1m19 c2m19 c3m19 c4m19 c5m19 c6m19 c7m19 c8m19
20 c1m20 c2m20 c3m20 c4m20 c5m20 c6m20 c7m20 c8m20
21 c1m21 c2m21 c3m21 c4m21 c5m21 c6m21 c7m21 c8m21
22 c1m22 c2m22 c3m22 c4m22 c5m22 c6m22 c7m22 c8m22
23 c1m23 c2m23 c3m23 c4m23 c5m23 c6m23 c7m23 c8m23
24 c1m24 c2m24 c3m24 c4m24 c5m24 c6m24 c7m24 c8m24
25 c1m25 c2m25 c3m25 c4m25 c5m25 c6m25 c7m25 c8m25
26 c1m26 c2m26 c3m26 c4m26 c5m26 c6m26 c7m26 c8m26
27 c1m27 c2m27 c3m27 c4m27 c5m27 c6m27 c7m27 c8m27
28 c1m28 c2m28 c3m28 c4m28 c5m28 c6m28 c7m28 c8m28
29 c1m29 c2m29 c3m29 c4m29 c5m29 c6m29 c7m29 c8m29
30 c1m30 c2m30 c3m30 c4m30 c5m30 c6m30 c7m30 c8m30
31 c1m31 c2m31 c3m31 c4m31 c5m31 c6m31 c7m31 c8m31
32 c1m32 c2m32 c3m32 c4m32 c5m32 c6m32 c7m32 c8m32
33 c1m33 c2m33 c3m33 c4m33 c5m33 c6m33 c7m33 c8m33
34 c1m34 c2m34 c3m34 c4m34 c5m34 c6m34 c7m34 c8m34
35 c1m35 c2m35 c3m35 c4m35 c5m35 c6m35 c7m35 c8m35
36 c1m36 c2m36 c3m36 c4m36 c5m36 c6m36 c7m36 c8m36
37 c1m37 c2m37 c3m37 c4m37 c5m37 c6m37 c7m37 c8m37

Additional audio data

Each speaker provided 5 sets of additional audio data. This material may be of use for speaker adaptation or for building word models for keyword-spotting tasks. Speakers occupied the same location relative to the manikin as in the crossword task. Note that full data from the manikin and the omnidirectional microphone is provided for the shibboleth (TIMIT) sentences only; the other sets contain material from the single-speaker microphone only. Thus, studies of speaker location might use the shibboleth sentences as test material (before setting out on the harder problem of multi-source speaker location, for which the crossword task is ideal!).
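As a small illustration of the localisation use just mentioned, the Python sketch below estimates a broadband interaural time difference by cross-correlating the left and right ear channels of one speaker's shibboleth recording. The file names (7_1 / 8_1, with an assumed .wav extension) follow the table below, and the lag range is a plausible choice for head-sized delays; treat it as a sketch rather than a supplied tool.

   # Sketch: estimate an interaural time difference (ITD) from the left and
   # right ear channels of one shibboleth recording by cross-correlation.
   # File names are assumed; a +/- 1 ms lag range comfortably covers
   # head-sized delays.

   import numpy as np
   from scipy.io import wavfile

   rate_l, left = wavfile.read("7_1.wav")           # left ear, speaker 1
   rate_r, right = wavfile.read("8_1.wav")          # right ear, speaker 1
   assert rate_l == rate_r == 48000

   n = min(len(left), len(right))
   left = left[:n].astype(float)
   right = right[:n].astype(float)

   max_lag = int(0.001 * rate_l)                    # +/- 1 ms in samples
   lags = np.arange(-max_lag, max_lag + 1)
   corr = [np.dot(left[max(0, -k):n - max(0, k)],
                  right[max(0, k):n - max(0, -k)]) for k in lags]

   itd_samples = int(lags[int(np.argmax(corr))])
   print("Estimated ITD: %.3f ms" % (1000.0 * itd_samples / rate_l))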

The additional data consisted of:


set           spkr1  spkr2  spkr3  spkr4  spkr5
alphabet      1ea    2ea    3ea    4ea    5ea
passage       1ej    2ej    3ej    4ej    5ej
numbers       1en    2en    3en    4en    5en
TIMIT         1es    2es    3es    4es    5es
common words  1ew    2ew    3ew    *      5ew
omni          6_1    6_2    6_3    6_4    6_5
left          7_1    7_2    7_3    7_4    7_5
right         8_1    8_2    8_3    8_4    8_5

(*) Due to an oversight, no "common words" data for speaker 4 was recorded.

Of these, only the TIMIT sentences have been transcribed:


level  spkr1     spkr2     spkr3     spkr4     spkr5
word   timit_w1  timit_w2  timit_w3  timit_w4  timit_w5
phone  timit_p1  timit_p2  timit_p3  timit_p4  timit_p5

Contributions

For the purpose of comparison amongst representations and analyses, this section will contain contributed figures, preferably for minute 19 of the corpus. Suggested contributions:

Please email a URL.

Documentation

Prior to its web-release, ShATR was described in a couple of relatively obscure papers:

A revised version of the latter paper was published in Computational Auditory Scene Analysis (eds. D. F. Rosenthal and H. G. Okuno) Mahwah, NJ: Lawrence Erlbaum, pp. 321-333.

Some technical documentation describing the details of the recording environment is available.

Credits

The following people contributed to the planning, development, collection, annotation and subsequent web-release of ShATR: Jon Barker, Guy Brown, Martin Cooke, Malcolm Crawford, Inge-Marie Eigsti, Phil Green, Brian Karlsen, Hiroaki Kato, Hideki Kawahara, Steve Renals, Masako Tanaka and Minoru Tsuzaki. Thanks are due also to David Kirby of the BBC for much help and support during the early stages of configuring the recording equipment. We also acknowledge help, advice and criticism from many colleagues in the speech, hearing and linguistics communities.

This work was supported by SERC Image Interpretation Initiative Research Grant GR/H53174; a study visit grant from ATR (Advanced Telecommunications Research Institute), Kyoto, to Guy Brown and Malcolm Crawford; Royal Society and Royal Academy of Engineering awards to Phil Green; and a Royal Society grant to Martin Cooke.


Last update: 18th April, 2002 by Martin Cooke