Separating and recognising speech in everyday listening conditions

One of the chief difficulties of building distant-microphone speech recognition systems for use in `everyday' applications, is that we live our everyday lives in acoustically cluttered environments. A speech recognition system designed to operate in a family home, for example, must contend with competing noise from televisions and radios, children playing, vacuum cleaners, outdoors noises from open windows — an array of sound sources of enormous variation. Such domains may be complex and seemingly unpredictable but nevertheless contain structure that can be learnt and exploited given enough data.

annotated ratemap The figure to the right shows a time-frequency representation of a twenty second segment of domestic living room data that forms the background for the PASCAL 'CHiME' challenge. The figure illustrates some of the huge challenges presented by the target scenario: there is a noise floor that, although quasi-stationary, can change abruptly in response to unpredictable changes of state in the room (doors opening, appliances being turned on or off); on top of the background there are abrupt impact noises such as footsteps and doors banging that can mask even highly energetic portions of the speech signal; there may be multiple speakers in the room producing overlapping speech; the positions of the competing sound sources can change over time, etc, etc.

There are many partial solutions to the problem of noise-robust ASR: systems either attempt to filter out the noise, directly model the combination of the noise and the speech or use signal-level cues such as location and pitch to un-mix the sources. In narrowly specified conditions all of these approaches have been shown to be useful; but in conditions closer to real application scenarios they individually fail. The PASCAL 'CHiME' challenge aims to encourage the development of more flexible approaches that combine the strengths of source separation techniques with those of speech and noise modelling.

Compared to the demands of a genuine domestic speech recognition application, the PASCAL CHiME challenge has been somewhat simplified to make the competition more widely accessible, while still retaining the essential difficulty posed by additive noise. To this end, the ASR task is speaker dependent and employs a small but phonetically confusable vocabulary. Further, the microphone-speaker geometry remains fixed, focusing the task on the problems of additive noise rather than on the problem of channel variability that is also encountered in distant microphone speech processing. However, no compromise has been made with the noise background — the recordings come from a genuine living room (in a house with two small children!) measured over a period of several weeks. The SNRs employed in the challenge range between 9 dB down to -6 dB; a range resulting from the natural variation of the noise background rather than artificial scaling. In contrast to other noise robust speech recognition tasks, the utterances to be recognised, although supplied in a pre-segmented form, are locatable in a continuous audio stream that can also be downloaded, i.e., there are hours worth of acoustic context for each utterance. We hope participants will be able to exploit the continuous background data, for example, by learning spatial and spectral properties of the commonly occurring competing sound sources.

Recordings have been made with binaural microphones. Although, larger microphone arrays would be more effective, the binaural configuration has particular interest: we hope by constraining the task to just two microphones, we will better enable engineers building solutions to robust ASR to trade insights with scientist trying to better understand human hearing.

We hope you find this challenge both useful and fun. Details of how to take part are on the 'instructions' page.

Good luck!