Robust Speech Recognition with Missing Data
While current speech recognition devices can achieve good performance in favourable conditions, they are not 'robust': recognition accuracy falls off rapidly in even modest amounts of steady noise. In contrast, human listeners maintain performance even when there is more energy in the noise than in the speech. Furthermore, unpredictable acoustic events (doors slamming, windows opening, passing cars...) can be handled by the auditory system without trouble. This lack of robustness limits the application of speech recognition technology to quiet, controlled environments, precluding for instance voice control of mobile 'phones or taking the minutes of a meeting.
It is natural to think of Auditory Scene Analysis (ASA) as a pre-processing stage for robust automatic speech recognition: first separate out the speech evidence from that for other sound sources, then present this evidence to the recogniser. In contrast to other methods for achieving robustness, ASA requires no model of the noise, relying only on low-level grouping principles which reach down to the properties of the speech production and perception systems, and to the physics of sound.
However, ASA will never recover all the speech evidence. There will be some spectro-temporal regions which are dominated by other sources. For instance, here's a clean speech spectrogram:
There is an analogy with visual occlusion: just as it is possible to recognise objects which are partly hidden by other objects, it is possible to recognise speech without having the complete spectrum available. This is the essence of the 'missing data' approach, which breaks into two problems:
1. Decide which spectro-temporal regions are reliable, i.e. dominated by the speech.
2. Recognise the speech from this incomplete evidence.
Problem 1 can be thought of as placing a 'mask' over the spectral data. In simple situations, techniques which estimate the local signal-to-noise ratio can be used to define the mask. This is illustrated in the 'SNR Mask' above. For comparison, the 'a priori mask' is obtained by cheating: using the clean speech and noise before mixing to work out the true local SNR. A better solution is to use primitive Auditory Scene Analysis to define the mask, or to combine ASA and SNR estimation.
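A minimal sketch of an SNR-based mask might look as follows. The function name, the spectral-subtraction noise estimate and the 0 dB threshold are illustrative assumptions, not the group's actual implementation:

```python
import numpy as np

def snr_mask(mixture_spec, noise_est, threshold_db=0.0):
    """Mark a time-frequency pixel as reliable when its estimated local
    SNR exceeds threshold_db.

    mixture_spec, noise_est : energy spectrograms of identical shape
    (frames x frequency channels).  Returns a boolean mask of the same
    shape: True = reliable (speech-dominated), False = missing.
    """
    eps = 1e-12  # avoid log(0) and negative energies after subtraction
    # Crude speech-energy estimate by spectral subtraction (an assumption)
    speech_est = np.maximum(mixture_spec - noise_est, eps)
    local_snr_db = 10.0 * np.log10(speech_est / (noise_est + eps))
    return local_snr_db > threshold_db
```

The a priori mask described above is the same computation with the true clean-speech and noise energies substituted for the estimates.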
There are two approaches to problem 2. One can either estimate the missing values and then proceed as normal (missing data imputation) or use the distribution of the remaining values alone (marginalisation). In SpandH, both these techniques have been formulated for statistical recognisers based on Continuous Density Hidden Markov Models. Marginalisation generally outperforms imputation. Both techniques can be improved by the additional use of counter-evidence: even if we don't know the true speech value for some time-frequency pixel, we can put a bound on it, since the speech energy cannot be greater than the energy in the mixture. Thus speech sounds which require more energy than the total available can be rejected.
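Bounded marginalisation for a single diagonal-Gaussian state can be sketched like this. The function and variable names are assumptions for illustration; the key idea from the text is that a missing dimension contributes the probability mass below the observed mixture energy rather than a point density:

```python
import math

def bounded_marginal_loglik(x, mask, mean, var):
    """Per-frame log-likelihood under one diagonal-Gaussian state.

    x    : observed (mixture) spectral values, one per frequency channel
    mask : True where the pixel is reliable (speech-dominated)
    mean, var : the state's Gaussian parameters, per channel
    """
    ll = 0.0
    for xi, reliable, mu, v in zip(x, mask, mean, var):
        if reliable:
            # Reliable pixel: ordinary Gaussian log-density
            ll += -0.5 * math.log(2 * math.pi * v) - (xi - mu) ** 2 / (2 * v)
        else:
            # Missing pixel: integrate the Gaussian from -inf up to the
            # observed mixture energy xi (speech cannot exceed the mixture).
            z = (xi - mu) / math.sqrt(v)
            cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
            ll += math.log(max(cdf, 1e-300))  # guard against log(0)
    return ll
```

Plain marginalisation corresponds to replacing the bounded integral with the full integral (i.e. adding 0 for each missing dimension); the bound is what lets the recogniser reject speech sounds demanding more energy than is available.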
Here are some results, obtained in November 2000. The task is connected digit recognition, and the speech is mixed with Lynx helicopter noise in varying proportions. Word recognition accuracy is plotted against global signal-to-noise ratio.