Frequently Asked Questions (FAQ)

In the baseline code, beamforming is performed only on the data in the development set, not on the data in the training set. Is this intentional?

Yes. This setup was found to be beneficial in the CHiME-3 challenge. You are of course allowed to enhance the training data too, provided that your enhancement system does not rely on external data.

Can we use weighted combinations of selected channels for data augmentation (e.g. averaging channel 1 and channel 2)?

Yes.
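For illustration, such an augmentation step amounts to a sample-wise average of two channels. The sketch below is ours, not part of the baseline; the function name `average_channels` is a hypothetical helper, and it assumes the channels are already time-aligned int16 arrays of equal length. The sum is computed in a wider dtype to avoid int16 overflow.

```python
import numpy as np

def average_channels(ch_a: np.ndarray, ch_b: np.ndarray) -> np.ndarray:
    """Average two time-aligned single-channel int16 signals.

    The sum is taken in int32 to avoid overflow, then cast back to
    int16. Hypothetical helper for illustration only.
    """
    assert ch_a.shape == ch_b.shape, "channels must have equal length"
    mixed = (ch_a.astype(np.int32) + ch_b.astype(np.int32)) // 2
    return mixed.astype(np.int16)
```

In practice you would read the two channels from the corresponding WAV files (e.g. with `soundfile` or `scipy.io.wavfile`) before averaging; the arithmetic itself is as above.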

Are we allowed to use external text, such as the Wall Street Journal text materials?

No.

Are we allowed to use information from floor plans (e.g. room sizes, Kinect positions) in the training routine?

Yes, you are allowed to use the map of the recording environment showing the positions of the Kinects for training and development data. Note that this information will not be available for the evaluation data.

Are we allowed to use unsupervised adaptation performed by a model trained on the entire dev/eval set?

No, you can use the entire dev set but not the entire eval set. For the multiple-array track you can use the whole single test session in which the utterance occurs. For the single array track, you are allowed to use the whole single test session from the reference Kinect only.

Can I use the segment information to identify speaker overlap in the test set?

Yes.

What information will be provided for the evaluation set?

The annotation for the evaluation set will provide the same information as that for the development set, except for the transcriptions.

Can we realign begin and end times of each utterance?

Yes, you are allowed to automatically refine the utterance start and end times provided that your approach is fully automatic (no manual reannotation) and it relies on the provided signals only.

When I re-run ./run.sh the scripts do not run correctly.

This can happen when experiments are redone without removing all temporary files first. We recommend moving or removing the ./data directory before rerunning experiments.
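A minimal sketch of this cleanup step (the helper name `clear_stale_data` is our own; the baseline simply expects the stale ./data directory to be out of the way before run.sh is re-run):

```python
import shutil
from pathlib import Path

def clear_stale_data(data_dir: str = "./data", backup: bool = True) -> None:
    """Move (or delete) a stale data directory before re-running run.sh.

    By default the directory is renamed with a ".bak" suffix so earlier
    results are preserved; pass backup=False to remove it outright.
    Hypothetical helper for illustration only.
    """
    path = Path(data_dir)
    if not path.exists():
        return
    if backup:
        shutil.move(str(path), str(path) + ".bak")
    else:
        shutil.rmtree(path)
```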

Why does the Python code in the array alignment baseline contain syntax errors?

It doesn't. You are probably using the wrong Python version. Check that you are using Python 3.6 or later.
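A quick way to confirm the interpreter version from within Python (the helper name `check_python_version` is ours, for illustration):

```python
import sys

def check_python_version(minimum=(3, 6)) -> bool:
    """Return True if the running interpreter meets the minimum
    (major, minor) version; the baseline needs Python 3.6 or later."""
    return sys.version_info[:2] >= minimum
```

Running `python3 --version` on the command line gives the same information.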

If we change the number of tied phonetic targets used by the baseline (with a fixed language model and fixed lexicon), will our system still be ranked within category A?

Yes, because the outputs of the acoustic model remain frame-level tied phonetic targets and the lexicon and language model are unchanged.

If we use a combination (such as ROVER) of ASR systems using tied phonetic targets, fixed language model and fixed lexicon, will our system still be ranked within category A?

Yes, provided that the combination algorithm does not implicitly rely on a modified language model.

Are we allowed to use measured impulse responses from a Kinect device for processing and enhancement of the speech signals?

Any processing should rely on the provided signals only (no external speech, impulse responses, or noise data). You can use measured impulse responses from a Kinect device and report the corresponding results if you wish. However, these results will not be taken into account in the final WER ranking of all systems and you should still report the results obtained without using this information.

How will the reference array be chosen in the evaluation set?

The reference array for a given session and location will be fixed in the same way in the evaluation set as in the development set.

Are we allowed to augment the training data using synthetic speech data generated by text-to-speech (TTS)?

If you have a way of fully training a TTS system on the CHiME-5 data and generating utterances without relying on any external speech or text data, then this is allowed.