To reach a broad audience, we have tried to avoid setting rules that might artificially disadvantage one research community over another. However, to keep the task as close to an application scenario as possible, and to allow systems to be broadly comparable, there are some guidelines that we expect participants to follow.

Which information can I use?

You are allowed to use the fact that the four classes of acoustic environments (BUS, CAF, PED, STR) are shared across datasets.

You are also allowed to use the environment and speaker labels in the training data, and the speaker labels in the development and test data.

You are encouraged to use the embedded training and development data and the corresponding noise-only recordings in any way that may help, e.g., to learn models of the acoustic environments and use them to recognize the test environment and/or to enhance the signal. The embedded test data may also be used, but only within the immediate acoustic context of each test utterance, that is, the 5 s preceding the utterance. Note that these 5 s may also contain speech, which is not always annotated.
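As a rough illustration of the 5 s context rule, the sketch below computes which samples of an embedded test recording precede an utterance and may therefore be used. The function name, the 16 kHz default sampling rate and the sample-index convention are assumptions made for this example; they are not taken from the challenge tools.

```python
def context_window(start_sample, sample_rate=16000, context_seconds=5.0):
    """Return (begin, end) sample indices of the allowed context.

    The window covers the `context_seconds` immediately preceding the
    utterance start and is clipped at the beginning of the recording.
    The 16 kHz default is an assumption for illustration only.
    """
    n_context = int(round(context_seconds * sample_rate))
    begin = max(0, start_sample - n_context)
    return begin, start_sample
```

Anything in the embedded test recording outside this window (other than the utterance itself) would fall outside the allowed context.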

Which information shall I not use?

The systems should not exploit the following information in order to transcribe a given test utterance:

Automatic identification of the environment of the test utterance and the immediate acoustic context is allowed, though. The rationale is that a commercial ASR system to be deployed on a tablet should work in any environment just after the tablet has been switched on.

Similarly, manual refinement of the speech start and end times and manual annotation of the unannotated speech data are not allowed, but automatic refinement and automatic detection of speech in the 5 s context are allowed.

All parameters should be tuned on the training set or the development set. The system should not use different tuning parameters for different noisy environments or different data types (real or simulated). For example, the baseline script tunes the system with a single language model weight, which is optimized on the average WER over all recognition results in the development set, including all noisy environments and data types.
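The single-weight tuning rule can be sketched as follows. This is a minimal illustration, not the baseline script: the function names and the input layout are hypothetical, and the "average WER" is assumed here to be the pooled WER (total errors divided by total reference words) over all development conditions.

```python
def pooled_wer(counts):
    """Pooled WER (%) from (num_errors, num_reference_words) pairs."""
    counts = list(counts)
    errors = sum(e for e, _ in counts)
    words = sum(n for _, n in counts)
    return 100.0 * errors / words


def select_lm_weight(counts_by_weight):
    """Pick ONE language model weight for all environments and data types.

    counts_by_weight maps each candidate weight to a dict
    {(environment, data_type): (num_errors, num_reference_words)}
    measured on the development set (hypothetical layout).
    """
    return min(
        counts_by_weight,
        key=lambda w: pooled_wer(counts_by_weight[w].values()),
    )
```

The point of the design is that the selection criterion sees all conditions at once, so no per-environment or per-data-type weight can be chosen.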

Which results should I report?

For every tested system, you should report 4 WERs (%), namely:

For instance, here are the WERs (%) achieved by the baseline GMM and DNN models (the WERs on test data will be available later). All these results are obtained by training on noisy multicondition data (channel 5) and testing on data enhanced by BeamformIt. They were obtained for one run on one machine. If you run the baseline yourself, you will probably obtain slightly different results due to random initialisation and to machine-specific issues.

WER (%)
Track  Model      Dev. real  Dev. simulated  Eval. real  Eval. simulated
1ch    GMM        22.16      24.48           37.54       33.30
1ch    DNN+sMBR   14.67      15.67           27.68       24.13
1ch    DNN+RNNLM  11.57      12.98           23.70       20.84
2ch    GMM        16.22      19.15           29.03       27.57
2ch    DNN+sMBR   10.90      12.36           20.44       19.04
2ch    DNN+RNNLM   8.23       9.50           16.58       15.33
6ch    GMM        13.03      14.30           21.83       21.30
6ch    DNN+sMBR    8.14       9.07           15.00       14.23
6ch    DNN+RNNLM   5.76       6.77           11.51       10.90

Such results make it possible to assess whether simulated data reliably predict ASR performance on real data, for development and/or for test. This currently appears to be approximately true. You are encouraged to improve the simulation baseline so that results on simulated data predict results on real data even more closely.
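One simple way to examine how well simulated data predict real data is to compare the simulated and real WERs system by system, and to check whether both rank the systems in the same order. The sketch below does this with the baseline WERs listed above; the helper names and the choice of these two criteria are assumptions made for illustration, not an official measure.

```python
# Baseline WERs (%) copied from the table above:
# (track, model) -> (dev real, dev simulated, eval real, eval simulated)
BASELINE = {
    ("1ch", "GMM"): (22.16, 24.48, 37.54, 33.30),
    ("1ch", "DNN+sMBR"): (14.67, 15.67, 27.68, 24.13),
    ("1ch", "DNN+RNNLM"): (11.57, 12.98, 23.70, 20.84),
    ("2ch", "GMM"): (16.22, 19.15, 29.03, 27.57),
    ("2ch", "DNN+sMBR"): (10.90, 12.36, 20.44, 19.04),
    ("2ch", "DNN+RNNLM"): (8.23, 9.50, 16.58, 15.33),
    ("6ch", "GMM"): (13.03, 14.30, 21.83, 21.30),
    ("6ch", "DNN+sMBR"): (8.14, 9.07, 15.00, 14.23),
    ("6ch", "DNN+RNNLM"): (5.76, 6.77, 11.51, 10.90),
}


def ranking(column):
    """Systems ordered from lowest to highest WER in the given column."""
    return sorted(BASELINE, key=lambda system: BASELINE[system][column])


def same_ranking(real_column, simulated_column):
    """True if real and simulated data rank all systems identically."""
    return ranking(real_column) == ranking(simulated_column)


def gaps(real_column, simulated_column):
    """Simulated minus real WER per system, in percentage points."""
    return {
        system: round(wers[simulated_column] - wers[real_column], 2)
        for system, wers in BASELINE.items()
    }
```

With these figures, `same_ranking(0, 1)` holds on the development set, whereas on the evaluation set (`same_ranking(2, 3)`) the 6ch GMM and 1ch DNN+RNNLM systems swap places.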

Ultimately, only the results of the best system on the real test set will be taken into account in the final WER ranking of all systems. The best system is taken to be the one that performs best on the real development set.

For that system, you should report 16 WERs (one for each combination of development/test set, real/simulated data, and environment). Participants should also provide the recognized transcriptions for all sets, with time alignment information when applicable (if the transcription format is not standard, it must be described).

For instance, here are the WERs achieved by the baseline DNN+RNNLM system.

WER (%)
Track  Environment  Dev. real  Dev. simulated  Eval. real  Eval. simulated
1ch    BUS          15.13      11.90           35.93       16.49
1ch    CAF          11.81      15.90           24.60       23.91
1ch    PED           7.42       9.94           19.94       20.25
1ch    STR          11.90      14.19           14.36       22.71
2ch    BUS          10.90       8.19           25.37       10.66
2ch    CAF           7.96      12.15           15.97       18.21
2ch    PED           5.22       7.12           13.53       15.61
2ch    STR           8.82      10.55           11.45       16.85
6ch    BUS           7.39       6.02           16.86        7.68
6ch    CAF           5.77       8.10           10.18       11.54
6ch    PED           3.72       5.49            9.83       10.31
6ch    STR           6.18       7.48            9.19       14.06
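The 16 per-environment WERs correspond to every combination of set, data type and environment. The sketch below enumerates these combinations and checks a result dictionary for missing entries before submission; the labels and function names are hypothetical and do not reflect the naming used by the challenge tools.

```python
from itertools import product

# Illustrative labels only (assumed, not the official naming).
SETS = ("development", "evaluation")
DATA_TYPES = ("real", "simulated")
ENVIRONMENTS = ("BUS", "CAF", "PED", "STR")


def required_conditions():
    """All (set, data type, environment) combinations to report."""
    return list(product(SETS, DATA_TYPES, ENVIRONMENTS))


def missing_conditions(reported):
    """Conditions absent from `reported`.

    `reported` maps (set, data_type, environment) to a WER in percent.
    """
    return [c for c in required_conditions() if c not in reported]
```

An empty list from `missing_conditions` indicates that all 16 WERs for the system are present.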

Can I use different features, a different recogniser or more data?

You are entirely free in the development of your system, from the front end to the back end and beyond, and you may even use extra data, including clean data, additional noisy data created by running the provided simulation baseline (or an improved version thereof), or any other data.

However, you should provide enough information, results and comparisons, such that one can understand where the performance gains obtained by your system come from. For example, if your system is made of multiple blocks, we encourage you to separately evaluate and report the influence of each block on performance.


The interface between front end and back end is taken to be either at the signal level or at the feature level, depending on whether your front end operates in the signal or feature domain.

Only the results obtained using the official training and development sets (including possible modifications of the acoustic simulation baseline as specified above) and one of the baseline language models will be taken into account in the final WER ranking of all systems.