The English Consistent Confusion Corpus is a large-scale collection of noise-induced British English speech misperceptions. These misperceptions were elicited by asking listeners to transcribe English words mixed with complex noise backgrounds.
The corpus has been distilled from over 300,000 listener responses and includes responses to over 9,000 individual noisy speech tokens. Of these, over 3,000 tokens induce 'consistent confusions', i.e. they are misheard in the same way by a significant number of listeners.
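To make the notion of a 'consistent confusion' concrete, the sketch below flags tokens whose most frequent incorrect response is shared by enough listeners. It is only an illustration: the `(token_id, target, response)` tuple layout, the function name and both thresholds are assumptions, not the criteria used to build the corpus.

```python
from collections import Counter, defaultdict


def consistent_confusions(responses, min_listeners=3, min_proportion=0.5):
    """Flag tokens that many listeners misheard in the same way.

    `responses` is an iterable of (token_id, target, response) tuples.
    Both thresholds are hypothetical values chosen for illustration.
    """
    heard_by_token = defaultdict(list)
    targets = {}
    for token_id, target, response in responses:
        heard_by_token[token_id].append(response.strip().lower())
        targets[token_id] = target.strip().lower()

    flagged = {}
    for token_id, heard in heard_by_token.items():
        target = targets[token_id]
        # Count only the responses that differ from the word actually spoken.
        errors = Counter(word for word in heard if word != target)
        if not errors:
            continue
        confusion, count = errors.most_common(1)[0]
        if count >= min_listeners and count / len(heard) >= min_proportion:
            # (target, most common confusion, listeners agreeing, total listeners)
            flagged[token_id] = (target, confusion, count, len(heard))
    return flagged


# Toy data, not taken from the corpus.
toy_responses = [
    ("t1", "pin", "bin"), ("t1", "pin", "bin"),
    ("t1", "pin", "bin"), ("t1", "pin", "pin"),
    ("t2", "cat", "cat"), ("t2", "cat", "cap"),
    ("t2", "cat", "cut"), ("t2", "cat", "cat"),
]
print(consistent_confusions(toy_responses))
```

In this toy data only `t1` is flagged, because three of its four listeners reported the same wrong word, whereas the errors for `t2` do not agree with each other.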
The confusion corpus can be downloaded as a single zip file (165 MB).
It is made up of the following components:
1. A spreadsheet in CSV format describing each misperception.
2. Waveforms of the speech and masker pairs that led to misperceptions.
3. Continuous masker waveforms from which the individual masker fragments were chosen (plus transcription files for maskers containing natural speech).
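As a starting point for exploring these components, the following sketch lists the column headers of any CSV files found inside the downloaded archive without unpacking it. The function name, the UTF-8 encoding and the archive file name in the usage note are assumptions; the actual file names and column layout are not specified here.

```python
import csv
import io
import zipfile


def csv_headers(zip_path):
    """Return {member name: header row} for every .csv file in the archive."""
    headers = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Assumption: the spreadsheet is UTF-8 encoded text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                # An empty file yields an empty header list.
                headers[name] = next(csv.reader(text), [])
    return headers
```

Calling `csv_headers("corpus.zip")`, where `corpus.zip` is a placeholder for wherever the download was saved, shows which columns the spreadsheet provides before any further processing is written against them.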
This work is licensed under a Creative Commons Attribution 4.0 International License.
If using the data please provide attribution by citing:
Ricard Marxer, Jon Barker, Martin Cooke and Maria Luisa Garcia Lecumberri, "A corpus of noise-induced word misperceptions for English", JASA Express Letters (submitted)
Ricard Marxer, University of Sheffield, UK
Jon Barker, University of Sheffield, UK
Martin Cooke, Ikerbasque, Bilbao, Spain
Maria Luisa Garcia Lecumberri, University of the Basque Country, Spain
A Spanish Consistent Confusion corpus - an equivalent dataset in Spanish
Intelligibility Under the Microscope - Interspeech 2016 Special Session