Small vocabulary track

Task overview   Data   Baseline software tools   Instructions

Task overview

The task considers the problem of recognising commands being spoken in a noisy living room from recordings made using a binaural mannikin. As in the 2011 CHiME Challenge, the target utterances are taken from the small-vocabulary Grid corpus. However, while the target speaker was previously located at a fixed position of 2m directly in front of the mannikin, he/she is now allowed to make small head movements within a square zone of +/- 10cm around that position.

Data

Recording and mixing process

The target utterances consist of 34 speakers reading simple 6-word sequences of the form <command:4><color:4><preposition:4><letter:25><number:10><adverb:4> where the numbers in brackets indicate the number of choices at each point.
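As a quick sanity check, the size of this grammar can be computed directly from the choice counts above (a minimal sketch; the counts are taken from the format description):

```python
# Number of choices at each position of the 6-word Grid sentence:
# command, color, preposition, letter, number, adverb
choices = [4, 4, 4, 25, 10, 4]

total_sentences = 1
for n in choices:
    total_sentences *= n

print(total_sentences)  # 64000 distinct word sequences
```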

Each utterance has been convolved with a set of binaural room impulse responses (BRIRs) simulating speaker movements and reverberation. The target speaker is static at the beginning of each utterance, then he/she moves once, and finally he/she is static again. Movements follow a straight left-right line at fixed front-back distance from the mannikin and each movement is at most 5cm at a speed of at most 15cm/s. These movements have been simulated by interpolating a set of fixed BRIRs recorded at closely spaced positions in a way that has been shown to provide a reasonable approximation to actual time-varying BRIRs.
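The interpolation idea can be sketched as follows. This is illustrative only: `interpolate_brir` and the toy impulse responses are made up, and the challenge data were generated with the organisers' validated interpolation scheme, not this one.

```python
import numpy as np

def interpolate_brir(h_a, h_b, alpha):
    """Linear crossfade between two fixed BRIRs recorded at adjacent
    positions (0 <= alpha <= 1). Illustrative helper only."""
    return (1.0 - alpha) * h_a + alpha * h_b

# Toy single-channel impulse responses; real BRIRs are two-channel
# (left/right ear) and thousands of taps long.
h_pos_a = np.array([1.0, 0.5, 0.25])
h_pos_b = np.array([0.8, 0.4, 0.2])

h_mid = interpolate_brir(h_pos_a, h_pos_b, 0.5)  # halfway between positions

speech = np.zeros(16000)  # 1 s of audio at 16 kHz
speech[0] = 1.0           # unit impulse, just to demonstrate the convolution
reverberated = np.convolve(speech, h_mid)
```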

The reverberated utterances have then been mixed with binaural recordings of genuine room noise made over a period of days in the same family living room. The temporal placement of the utterances within the noise background has been controlled in a manner which produces mixtures at 6 different ranges of SNR without rescaling the speech and noise signals: -6, -3, 0, 3, 6, 9 dB.

More details about the background noise and the BRIR recording process can be found here, and some audio demos are available here.


Training, development and test data

All data are provided as 16-bit WAV files sampled at 16 kHz. The noisy utterances are provided both in isolated form and in embedded form: in the training set, each embedded utterance is surrounded by 5 s of background noise before and after; in the development and test sets, the utterances are mixed into continuous 5 min background noise recordings.
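Loading one of these stereo WAV files into an array can be done with Python's standard library; `read_binaural_wav` below is just an illustrative helper, sketched under the stated assumption of 16-bit little-endian PCM.

```python
import wave

import numpy as np

def read_binaural_wav(path):
    """Read a 16-bit PCM WAV file into an array of shape
    (num_samples, num_channels), plus its sample rate.
    Illustrative helper; assumes little-endian 16-bit samples."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        rate = w.getframerate()  # 16000 Hz for the challenge data
        frames = w.readframes(w.getnframes())
        data = np.frombuffer(frames, dtype="<i2").reshape(-1, w.getnchannels())
    return data, rate
```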

(For ftp downloads log in with user name 'anonymous')

Training set: 500 utterances from each of 34 speakers

Development set: 600 utterances at each of 6 ranges of SNR

The noise-free reverberated utterances of the development set are provided for benchmarking purposes only, e.g., for computing the SNR achieved by the denoising front-end, and shall not be exploited to obtain the output transcripts in any way.
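For instance, the SNR achieved by a front end can be estimated by treating the difference between the enhanced signal and the noise-free reverberated utterance as residual noise. The helper `snr_db` and the random toy signals below are illustrative, not the official evaluation code.

```python
import numpy as np

def snr_db(clean, processed):
    """SNR in dB of `processed` relative to `clean`, treating the
    difference as residual noise. Illustrative; not the official metric."""
    residual = processed - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(0)
s_rev = rng.standard_normal(16000)   # stand-in for a noise-free reverberated utterance
noise = rng.standard_normal(16000)   # stand-in for background noise
mixture = s_rev + noise

input_snr = snr_db(s_rev, mixture)   # roughly 0 dB for equal-power signals
```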

Test set: 600 utterances at each of 6 ranges of SNR

In addition to the above data, we also provide 7 h of noise background recordings (ftp, 1.1 GB) that are not part of the training set. You are welcome to use these data, provided that you also report the results obtained using the official training and development sets.

All data are provided under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license and were authored by the challenge organisers (for the BRIRs and the mixtures), J. Barker, M. Cooke, S. Cunningham and X. Shao (for the original speech data) and H. Christensen, J. Barker, N. Ma and P. Green (for the original noise data).

If you are interested in participating, please contact us so we can monitor interest and send you further updates about the challenge. If you eventually use the data in any published research, please cite:

  • Vincent, E., Barker, J., Watanabe, S., Le Roux, J., Nesta, F. and Matassoni, M., "The second CHiME Speech Separation and Recognition Challenge: Datasets, tasks and baselines", in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), Vancouver, 2013
  • Barker, J.P., Vincent, E., Ma, N., Christensen, H. and Green, P.D., "The PASCAL CHiME Speech Separation and Recognition Challenge", Computer Speech and Language, 27(3):621-633, 2013

Filenaming conventions and annotation files

Isolated utterances are named <GridUtt> or s<Speaker>_<GridUtt>, where <GridUtt> is a 6-character code encoding the word sequence and <Speaker> is the speaker ID. Embedded utterances are split into 5-minute segments named CR_lounge_<Date>_<Time>.s<SegmentNumber>, where <Time> is the start time of the recording and <SegmentNumber> is the offset in seconds from the beginning of the recording.
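These conventions can be parsed with simple regular expressions. The patterns below are a sketch based on the description only: the exact <Date>/<Time> layouts are assumed to be numeric, and the sample names in the comments are made up.

```python
import re

def parse_isolated(name):
    """Parse 's<Speaker>_<GridUtt>' (e.g. a made-up 's1_bgaa9a') into
    its parts; illustrative helper, returns None on mismatch."""
    m = re.fullmatch(r"s(\d+)_(\w{6})", name)
    if m is None:
        return None
    return {"speaker": int(m.group(1)), "grid_utt": m.group(2)}

def parse_embedded(name):
    """Parse 'CR_lounge_<Date>_<Time>.s<SegmentNumber>'; the numeric
    <Date>/<Time> layout assumed here is a guess, not documented."""
    m = re.fullmatch(r"CR_lounge_(\d+)_(\d+)\.s(\d+)", name)
    if m is None:
        return None
    return {"date": m.group(1), "time": m.group(2), "offset_s": int(m.group(3))}
```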

The data are accompanied by one annotation file per speaker or per SNR. Each line encodes the available information about one utterance in the format <Utt> <NoiseSegment> <StartSample> <Duration> <SNR> <Y> <XStart> <XEnd> <TStart> <TEnd> where

  • <Utt> is the utterance filename
  • <NoiseSegment> the noise background segment filename
  • <StartSample> the position (start sample) of the utterance within the noise background segment
  • <Duration> the duration of the utterance in samples
  • <SNR> the range of SNR in dB (available for training only)
  • <Y> the front-back distance from the microphones in meters
  • <XStart> the initial left-right position in meters compared to the front direction of the mannikin (before the move)
  • <XEnd> the final left-right position in meters compared to the front direction of the mannikin (after the move)
  • <TStart> the starting time of the move in samples after the beginning of the utterance
  • <TEnd> the ending time of the move in samples after the beginning of the utterance

Baseline software tools

Recognition systems will be evaluated on their ability to correctly recognise the letter and digit tokens. Baseline scoring, decoding and training tools based on HTK can be downloaded here (ftp). Documentation is included with the download but can also be viewed online here.

These tools include 3 baseline speaker-dependent recognition systems trained on clean, reverberated and noisy data, respectively, and a number of scripts allowing you to

  • train a baseline recognition system from the training data after processing by your own denoising front end,
  • transcribe utterances in the development and test sets using one of the 3 provided systems or your own trained system,
  • score the resulting transcriptions in terms of keyword recognition rates.

While extensive testing has been performed, please don't hesitate to contact us if you encounter any problems installing or using these tools.
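Since only the letter and digit tokens are scored, the metric reduces to a keyword recognition rate over the fourth and fifth words of each sentence. Below is a minimal sketch of that idea, not the provided HTK scoring script; the transcripts in the test are made up.

```python
def keyword_recognition_rate(references, hypotheses):
    """Percentage of letter/number keywords recognised correctly,
    assuming aligned 6-word reference and hypothesis transcripts.
    Illustrative only; official scoring uses the provided HTK tools."""
    correct = total = 0
    for ref, hyp in zip(references, hypotheses):
        for i in (3, 4):  # letter and number positions in the Grid sentence
            total += 1
            correct += int(ref[i] == hyp[i])
    return 100.0 * correct / total
```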

Running the baseline system should produce the following results (keyword recognition rates, in %):

Development test set

          -6 dB   -3 dB    0 dB    3 dB    6 dB    9 dB
  clean   11.83   12.33   16.50   17.50   21.75   23.50
  rev     32.08   36.33   50.33   64.00   75.08   83.50
  noisy   49.67   57.92   67.83   73.67   80.75   82.67

Evaluation test set

          -6 dB   -3 dB    0 dB    3 dB    6 dB    9 dB
  clean   10.58   11.17   13.33   17.75   21.17   24.42
  rev     32.17   38.33   52.08   62.67   76.08   83.83
  noisy   49.33   58.67   67.50   75.08   78.83   82.92

Instructions

In order to reach a broad audience we have tried to avoid setting rules that might artificially disadvantage one research community over another. However, to keep the task as close to an application scenario as possible, and to allow systems to be broadly comparable, there are some guidelines that we would like participants to follow.

Which information can I use?

You are encouraged to use the embedded data in any way that may help, e.g. to learn about the acoustic environment in general, or about the immediate acoustic context of each utterance. Also, the recognition system is allowed to assume that the speaker identity is known and to use a corresponding model. Finally, information about the movements of the target speaker can be used, although participants are encouraged to consider conditions in which the head position is not explicitly known.

Which information shall I not use?

The systems should not exploit the test set for training or tuning. All parameters should be tuned on the training set or the development set; once you are satisfied with your system's tuning, run it only once on the final test set.

Can I use different features, a different recogniser or more data?

You are entirely free in the development of your system, from the front end to the back end and beyond, and you may even use extra data, e.g., data derived from the provided 7 h of noise background. However, if you change the features or the recogniser compared to the baseline, or if you use extra data, you should provide enough information, results and comparisons for readers to understand where your system's performance gains come from. For example, if your system is made of multiple blocks, we encourage you to evaluate and report the influence of each block on performance separately. If you use extra data, you should also report the results obtained from the official training and development sets alone.

How do I submit my results?

Detailed instructions are available here.