RESPITE: Annual Report 2001: Scientific Highlights: Dealing with Missing Data
The Missing Data approach aims to base recognition only on those spectral-temporal regions which have been identified as carrying reliable speech evidence.
Even at very low SNRs, some spectral-temporal regions contain reasonably 'clean' speech. The figure below illustrates the idea. The left panel shows a spectral-temporal representation of speech mixed with factory noise. The centre panel shows the same mixture viewed through a so-called 'a priori mask', which uses knowledge of the unmixed signals to select the regions where the speech has more energy than the noise. The right-hand panel shows how this 'ideal' mask can be approximated using noise estimation techniques.
[Figure panels, left to right: Speech + Factory Noise; A priori Mask; Estimated Mask]
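The two kinds of mask can be sketched in a few lines of Python. This is a minimal illustration, not the RESPITE implementation. The function names, the synthetic data, and the noise estimator are all assumptions: the estimator takes the first few frames to be noise only, treats the noise as stationary, and assumes speech and noise energies add.

```python
import numpy as np

def a_priori_mask(speech_energy, noise_energy):
    """Mark the time-frequency cells where speech has more energy than noise.

    Needs the unmixed signals, so it is only available in controlled experiments.
    """
    return speech_energy > noise_energy

def estimated_mask(mixture_energy, n_noise_frames=10):
    """Approximate the mask from the mixture alone.

    Assumption (not from the report): the first n_noise_frames frames hold
    noise only, so their per-channel mean serves as a stationary noise estimate.
    """
    noise_estimate = mixture_energy[:, :n_noise_frames].mean(axis=1, keepdims=True)
    # If energies add, estimated speech = mixture - noise_estimate, and
    # "speech > noise" becomes "mixture > 2 * noise_estimate".
    return mixture_energy > 2.0 * noise_estimate

# Synthetic example: 4 frequency channels x 50 frames (hypothetical data).
rng = np.random.default_rng(0)
noise = rng.uniform(0.8, 1.2, size=(4, 50))
speech = np.zeros((4, 50))
speech[1, 20:40] = 10.0          # a burst of speech energy in one channel
mixture = speech + noise

ideal = a_priori_mask(speech, noise)
approx = estimated_mask(mixture)
agreement = float((ideal == approx).mean())
print(f"cells marked reliable: {int(ideal.sum())}, mask agreement: {agreement:.2f}")
```

With stationary noise like this the estimated mask tracks the a priori mask closely. With non-stationary noise, a fixed noise estimate of this kind would be expected to degrade.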
This year we have concentrated on fine-tuning and evaluating our missing data recognition system. The system was one of those competing in a large comparison session at the recent Eurospeech conference in Denmark. Over twenty papers were submitted to the session, and each used the same noisy digit recognition task for evaluation. Among recognition systems employing models trained on clean speech, the RESPITE missing data system ranked in the top three.
The figure below shows the steady progress in recognition performance that has been made over recent years using the missing data technique. The difference between the 1998 system (green) and the 2001 system (blue) is the result of work conducted within the RESPITE project. The black line represents what we believe to be the theoretical limit of the technique. At moderate noise levels we are fast approaching this limit.
Last year we reported some novel ideas for developing speech recognition systems that can operate reliably in the presence of highly non-stationary competing noise sources. The technique, envisaged in the RESPITE proposal and termed 'multisource decoding', employs a new speech recognition engine that operates within the missing data framework. It integrates bottom-up techniques, which use local features to identify 'coherent fragments' of spectro-temporal energy, with the top-down hypothesis search of conventional speech recognition.
Work on these ideas has continued, and the prototype system we developed last year has been extended to incorporate the latest advances in our missing data systems. The system has been partially evaluated on the AURORA noisy connected digit recognition task and has produced some promising results:
Research funding has been obtained to further develop these ideas.