Medium vocabulary track


Task overview

The task considers the problem of recognizing utterances spoken in a noisy living room, from recordings made using a binaural manikin. The task uses the same setup as the 2011 CHiME Challenge in terms of reverberation and noise conditions, but the target utterances are here taken from the speaker-independent medium (5k) vocabulary subset of the Wall Street Journal (WSJ0) corpus, a well-known corpus of read speech.

Data

Mixing process

The target utterances are speech utterances from the Linguistic Data Consortium's CSR-I (WSJ0) dataset. As in the 2011 CHiME Challenge, each utterance has been convolved with a fixed Binaural Room Impulse Response (BRIR) corresponding to a frontal position at a distance of 2 m, then mixed with binaural recordings of genuine room noise made over a period of days in the same family living room. The temporal placement of the utterances within the noise background has been controlled so as to produce mixtures at 6 different ranges of SNR (-6, -3, 0, 3, 6 and 9 dB) with limited rescaling of the speech and noise signals.
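For readers who want a concrete picture of this process, the sketch below convolves a clean utterance with a BRIR and adds a noise excerpt scaled to a target SNR. It is only an illustration: the official mixtures control SNR mainly through the temporal placement of the utterance within the noise, with limited rescaling, and the file names and scaling formula here are assumptions, not the organisers' actual mixing scripts.

    # Illustrative sketch of BRIR convolution and noise mixing (not the
    # official mixing scripts; file names are hypothetical).
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import fftconvolve

    def mix_at_snr(speech_file, brir_file, noise_file, target_snr_db):
        fs, speech = wavfile.read(speech_file)   # mono clean WSJ0 utterance
        _, brir = wavfile.read(brir_file)        # stereo BRIR, shape (T, 2)
        _, noise = wavfile.read(noise_file)      # stereo noise, shape (N, 2)
        speech = speech.astype(np.float64)
        brir = brir.astype(np.float64)
        noise = noise.astype(np.float64)

        # Convolve the utterance with the left- and right-ear impulse responses.
        reverberated = np.stack(
            [fftconvolve(speech, brir[:, ch]) for ch in range(2)], axis=1)

        # Take a noise segment of matching length (assumes the noise is longer).
        noise = noise[:reverberated.shape[0], :]

        # Scale the noise so that the mixture reaches the requested SNR in dB.
        speech_power = np.mean(reverberated ** 2)
        noise_power = np.mean(noise ** 2)
        gain = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
        return fs, reverberated + gain * noise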

More details about the background noise and BRIR recording process can be found here. Some audio demos using the Grid corpus as target speech are available here.

Training, development and test data

All data are provided as 16 bit stereo WAV files sampled at 16 kHz. The noisy utterances are provided both in isolated form and in embedded form; the latter includes 5 s of background noise before and after the utterance. The present dataset is provided under agreement with the Linguistic Data Consortium (LDC), and its redistribution is not allowed.
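As a quick check of the file format, the following sketch loads an embedded utterance and trims the surrounding context to recover the central noisy segment, assuming exactly 5 s of background on each side; the file name is hypothetical.

    # Minimal example of loading an embedded 16 kHz, 16-bit stereo WAV file
    # and trimming the 5 s noise context (hypothetical file name).
    from scipy.io import wavfile

    fs, embedded = wavfile.read("s1_embedded.wav")
    assert fs == 16000 and embedded.ndim == 2     # 16 kHz, stereo

    context = 5 * fs                              # 5 s of context, in samples
    isolated = embedded[context:-context, :]      # central noisy utterance
    left, right = isolated[:, 0], isolated[:, 1]  # binaural channels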

Training set: 7138 reverberated utterances from a total of 83 speakers, forming the WSJ0 SI-84 training set, plus the same utterances each mixed with noise at one randomly chosen SNR.

Development set: 409 noisy utterances from 10 other speakers, forming the “no verbal punctuation” (NVP) part of the WSJ0 speaker-independent 5k vocabulary development set at each of 6 ranges of SNR.

Test set: 330 noisy utterances from 12 other speakers, forming the Nov’92 ARPA WSJ evaluation set (NVP, 5k vocabulary) at each of 6 ranges of SNR.
The test set follows the same specifications as the development set and has been available since 29th October.

Different BRIRs are used for each dataset and these are the same as in the 2011 CHiME Challenge.

In addition to the above data, we also provide 7 hours of noise background (1.1 GB, available via ftp) and an optional set of noisy utterances with a larger vocabulary, derived from the WSJ0 speaker-independent 20k vocabulary development and test sets at each of 6 ranges of SNR. These data are not part of the challenge, but you are welcome to use them provided that you also report the results obtained using the official training and development sets.

Obtaining the data

The full dataset (consisting of the training, development test and evaluation test sets) is provided to challenge participants who hold an LDC license for the WSJ0 dataset (either LDC93S6A or LDC93S6B). If you hold a license for the WSJ0 dataset, please contact us so that we can check your license information and tell you how to download the full dataset.

A "public" subset of the development test set is made available here to all participants for the purpose of evaluation only under agreement with the LDC. It consists of 240 stereo audio files which are noisy versions of 40 utterances by one female speaker and one male speaker of the WSJ0 development set (si_dt_05) at 6 different SNRs.

(For ftp downloads log in with user name 'anonymous')

You are welcome to download these data, but if you are interested in participating, please contact us so that we can monitor interest and send you further updates about the challenge.

Licenses and references

The mixed data were authored by the challenge organisers and are provided under the same restrictions of use as the original CSR-I (WSJ0) dataset. The noise background data are provided under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license and were authored by H. Christensen, J. Barker, N. Ma and P. Green. If you use these data in any published research, please cite the associated publication.

The original CSR-I (WSJ0) dataset is Copyright (c) Linguistic Data Consortium. Its authors are John Garofalo, David Graff, Doug Paul, and David Pallett. More information can be found on the LDC website for LDC93S6A or LDC93S6B.

Baseline software tools

The task is to transcribe all test utterances. Success will be measured in terms of Word Error Rate (WER), i.e., the number of word substitutions, insertions and deletions as a fraction of the number of target words.
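As an informal illustration of this metric (the official scoring uses the HTK-based tools described below), the word error rate can be computed from a word-level edit distance between the reference and the hypothesis. The example below is a generic sketch, not the challenge scoring script.

    # Generic WER sketch: minimum number of word substitutions, insertions
    # and deletions, divided by the number of reference (target) words.
    def wer(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i                          # i deletions
        for j in range(len(hyp) + 1):
            d[0][j] = j                          # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6, about 0.17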

Baseline scoring, decoding and retraining tools based on HTK and on Keith Vertanen's recipes will be provided to participants who hold a WSJ0 license. Documentation will be included with the download but can also be viewed online here.

These tools include 3 baseline speaker-independent recognition systems trained on clean, reverberated and noisy data, respectively, and a number of scripts that allow you to decode the test data, score the results and retrain the models.

While extensive testing has been performed, please do not hesitate to contact us if you encounter any problems installing or using these tools.

Instructions

In order to reach a broad audience we have tried to avoid setting rules that might artificially disadvantage one research community over another. However, to keep the task as close to an application scenario as possible, and to allow systems to be broadly comparable, there are some guidelines that we would like participants to follow.

Which information can I use?

You are encouraged to use the embedded data in any way that may help, e.g. to learn about the acoustic environment in general, or the immediate acoustic context of each utterance. However, you should not train models of the noise background within a given test utterance on other test utterances. Because the noise signals in different utterances temporally overlap, this would lead to strong overfitting.
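As an example of staying within this rule, a system may estimate noise statistics from an utterance's own embedded context, such as the 5 s of background that precedes it. The sketch below averages the short-time magnitude spectrum of that context; the function name and STFT settings are assumptions for illustration only.

    # Estimate a per-utterance noise magnitude spectrum from the 5 s of
    # background preceding the utterance in its embedded recording
    # (hypothetical helper, illustrative settings).
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    def context_noise_spectrum(embedded_file, context_s=5.0):
        fs, x = wavfile.read(embedded_file)
        context = x[: int(context_s * fs), 0].astype(np.float64)  # left channel
        _, _, spec = stft(context, fs=fs, nperseg=512)
        return np.mean(np.abs(spec), axis=1)  # average magnitude per frequency bin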

Which information shall I not use?

The systems should not exploit:

Joint processing of all the test utterances is allowed, but the fact that the BRIRs are identical across test utterances must not be explicitly exploited.

All parameters should be tuned on the training set or the development set. Once you are satisfied with your system's tuning, run it only once on the final test set.

Can I use different features, a different recogniser or more data?

You are entirely free in the development of your system, from the front end to the back end and beyond, and you may even use extra data, e.g., data derived from the provided 7 hours of noise background. However, if you change the features or the recogniser compared to the baseline, or if you use extra data, you should provide enough information, results and comparisons for readers to understand where your system's performance gains come from. For example, if your system is made of multiple blocks, we encourage you to evaluate and report the influence of each block on performance separately. If you use extra data, you should also report the results obtained with the official training and development sets alone.

How do I submit my results?

Detailed instructions are available here.