Task overview

The CHiME-3 scenario is automatic speech recognition (ASR) for a multi-microphone tablet device used in everyday, noisy environments. It is a significant step forward in realism compared with the previous CHiME challenges.

The challenge has the following features. To maintain compatibility with the 2nd CHiME challenge, it re-uses the WSJ evaluation framework. Utterances are provided embedded in continuous audio, with ground-truth voice activity detection (VAD) annotations.

The recording setup

WSJ prompts are read from a tablet device. Speech is captured by 6 microphones embedded in the frame and recorded at 24 bits and 48 kHz by a TASCAM DR-680 multi-track field recorder. A separate close-talking microphone channel was also recorded. The audio was subsequently downsampled to 16 kHz at 16 bits for distribution. (The 48 kHz data is available on request.)
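As a quick sanity check on downloaded audio, the sketch below reads a file header and confirms that it matches the 16-bit, 16 kHz distribution format described above. It assumes the audio is stored as uncompressed WAV files, which the text does not state; `describe_wav` and `is_distribution_format` are hypothetical helper names introduced here for illustration.

```python
import wave


def describe_wav(path):
    """Read basic format information from a WAV file header."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits_per_sample": 8 * wav.getsampwidth(),
            "sample_rate_hz": sample_rate,
            "duration_s": wav.getnframes() / sample_rate,
        }


def is_distribution_format(info):
    """True if the file is 16-bit, 16 kHz, as described for the distributed data."""
    return info["bits_per_sample"] == 16 and info["sample_rate_hz"] == 16000
```

A file obtained from the 48 kHz data would fail this check, which makes it a cheap guard against mixing the two versions.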

Figure: Photograph showing the recording device in use.

The microphone configuration

The figure below shows the recording device and indicates the positions of the microphones. The microphones are numbered 1 to 6, corresponding to channels 1 to 6 in the audio available for download. The microphones labeled in green face forward and are mounted flush with the front of the frame. Microphone 2, labeled in blue, faces backwards and is mounted flush with the back of the 1.0 cm thick frame. The microphones are Audio-Technica ATR3350 omnidirectional lavalier microphones.

Figure: The recording device and the positions of the 6 microphones.

All 6 tablet microphones were recorded on the same TASCAM unit and are therefore sample-synchronised. The close-talking microphone (channel 0) was a Beyerdynamic condenser headset microphone recorded on a separate TASCAM unit daisy-chained to the first. Synchronisation between the close-talking microphone and the tablet microphones is therefore only approximate, to within ±20 ms.
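At the 16 kHz distribution rate, a 20 ms uncertainty corresponds to 320 samples in either direction. If tighter alignment between channel 0 and a tablet channel is needed, one possible approach is to search that range for the lag that maximises the cross-correlation. The sketch below is only an illustration of this idea, not part of the challenge tools; `estimate_lag` and its parameters are hypothetical names, and the signals are assumed to be plain sequences of float samples.

```python
def estimate_lag(reference, other, sample_rate=16000, max_offset_ms=20.0):
    """Estimate the lag of `other` relative to `reference`, in samples.

    A positive result means `other` is delayed with respect to `reference`.
    Only lags within +/- max_offset_ms are searched.
    """
    max_lag = int(round(sample_rate * max_offset_ms / 1000.0))
    n = min(len(reference), len(other))
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Restrict to indices i where both reference[i] and other[i + lag] exist.
        start = max(0, -lag)
        stop = min(n, n - lag)
        score = sum(reference[i] * other[i + lag] for i in range(start, stop))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

This brute-force search is slow in pure Python for long signals, so in practice it would be applied to a short excerpt containing speech. The correlation is not normalised, which is adequate when the search range is small compared with the excerpt length.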

Note that, as with a commercial device, some of the tablet microphones occasionally failed to record properly, either because of hardware issues or because they were masked by the user's hands or clothes. These situations can often be detected from the low signal power in the corresponding channels.
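One simple way to act on this observation is to compare the power of each tablet channel with the median power across channels and flag any channel that falls far below it. The sketch below assumes each channel is a sequence of float samples scaled to the range -1 to 1; the function names and the 20 dB margin are illustrative assumptions, not values taken from the challenge.

```python
import math
from statistics import median


def channel_power_db(samples):
    """Mean power of one channel in dB relative to full scale (amplitude 1.0)."""
    if not samples:
        return float("-inf")
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def find_weak_channels(channels, margin_db=20.0):
    """Return 1-based indices of channels more than margin_db below the median power.

    `channels` is a list ordered as microphones 1 to 6.
    """
    powers = [channel_power_db(ch) for ch in channels]
    reference = median(powers)
    return [i + 1 for i, p in enumerate(powers) if p < reference - margin_db]
```

Because microphone 2 faces backwards, it may legitimately carry less speech energy than the others, so the margin should be set generously and checked against real recordings before channels are discarded.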

The recording procedure

The original live recordings were made by 12 US talkers (6 male and 6 female). For each talker, recordings were made first in a soundproof booth and then in each of the four noisy target environments. About 100 sentences were read in each location. The talkers used a simple interface that presented WSJ prompts on the tablet. It was stressed that each sentence had to be read correctly and without interruption. Talkers were allowed as many attempts as necessary to read each sentence. They were asked to use the tablet in whatever way felt natural and comfortable, but they were encouraged to adjust their reading position after every 10 utterances, e.g. holding the tablet (most typical), resting it on their lap or laying it on a table.

The recordings have been divided into training, dev test and eval test sets. Each set features different talkers and different instances of the same noise environments, e.g. all data sets feature the café noise environment, but different specific cafés are used in each set. For full details of the data sets, please continue to the next page.