The English Consistent Confusion Corpus

A collection of noise-induced listening errors.

About the corpus

The English Consistent Confusion Corpus is a large-scale collection of noise-induced British English speech misperceptions. These misperceptions were elicited by asking listeners to transcribe English words mixed with complex noise backgrounds.

The corpus has been distilled from over 300,000 listener responses and includes responses to over 9,000 individual noisy speech tokens. Of these, a subset of over 3,000 tokens induces 'consistent confusions', i.e. these tokens are misheard in the same way by a significant number of listeners.
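To make the idea of a 'consistent confusion' concrete, here is a minimal sketch that checks whether one noisy token was misheard in the same way by enough listeners. The function name and the `min_listeners` threshold are assumptions made for illustration; the criterion actually used to build the corpus is given in the technical details and may differ.

```python
from collections import Counter


def consistent_confusion(target, responses, min_listeners=3):
    """Return the most frequent wrong response to `target`, or None.

    `min_listeners` is a hypothetical threshold, not the corpus criterion.
    """
    target = target.strip().lower()
    # Normalise responses, dropping blanks and correct answers.
    cleaned = (r.strip().lower() for r in responses)
    errors = Counter(r for r in cleaned if r and r != target)
    if not errors:
        return None
    word, count = errors.most_common(1)[0]
    # The confusion is 'consistent' only if enough listeners agree on it.
    return word if count >= min_listeners else None
```

For example, calling `consistent_confusion("ship", ["sheep", "sheep", "ship", "sheep", "chip"])` returns `"sheep"`, because three listeners agree on the same wrong word.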

Full technical details here.

Download

The confusion corpus can be downloaded as a single zip file [165 MB].

It is made up of the following components:

1. A spreadsheet in csv format describing each misperception.

2. Waveforms of the speech and masker pairs that led to misperceptions.

3. Continuous masker waveforms from which the individual masker fragments were chosen (plus transcription files for maskers containing natural speech).
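As a starting point for working with the download, the sketch below reads the first csv file found inside a zip archive without unpacking it. The function name, the UTF-8 encoding, and the choice to take the first csv entry are assumptions; the file names and column headers inside the archive are not described here, so the code does not rely on any of them.

```python
import csv
import io
import zipfile


def read_csv_rows(zip_path):
    """Return the rows of the first .csv entry in the archive as dicts."""
    with zipfile.ZipFile(zip_path) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise FileNotFoundError("no csv file found in " + str(zip_path))
        with archive.open(csv_names[0]) as raw:
            # Assumes UTF-8 text; adjust the encoding if the file differs.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))
```

Each returned row is a dictionary keyed by whatever column headers the spreadsheet uses, so printing the keys of the first row is a quick way to inspect its layout.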

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

If using the data please provide attribution by citing:

Ricard Marxer, Jon Barker, Martin Cooke and Maria Luisa Garcia Lecumberri, "A corpus of noise-induced word misperceptions for English", JASA Express Letters (submitted)

Contact Us

consistentconfusion@gmail.com

Ricard Marxer, University of Sheffield, UK

Jon Barker, University of Sheffield, UK

Martin Cooke, Ikerbasque, Bilbao, Spain

Maria Luisa Garcia Lecumberri, University of the Basque Country, Spain

Further Links

A Spanish Consistent Confusion corpus - an equivalent dataset in Spanish

Intelligibility Under the Microscope - Interspeech 2016 Special Session