Robust Speech Recognition with Missing Data

While current speech recognition devices can achieve good performance in favourable conditions, they are not 'robust': recognition accuracy falls off rapidly in even modest amounts of steady noise. In contrast, human listeners maintain performance even when there is more energy in the noise than in the speech. Furthermore, the auditory system handles unpredictable acoustic events (doors slamming, windows opening, passing cars...) without trouble. Lack of robustness limits the application of speech recognition technology to quiet, controlled environments, precluding, for instance, voice control of mobile phones or taking the minutes of a meeting.

It is natural to think of Auditory Scene Analysis (ASA) as a pre-processing stage for robust automatic speech recognition: first separate out the speech evidence from that for other sound sources, then present this evidence to the recogniser. In contrast to other methods for achieving robustness, ASA requires no model of the noise, relying only on low-level grouping principles which reach down to the properties of the speech production and perception systems, and to the physics of sound.

However, ASA will never recover all the speech evidence. There will be some spectro-temporal regions which are dominated by other sources. For instance, here's a clean speech spectrogram: 
and the left figure below is the same speech with added helicopter noise:


There is an analogy with visual occlusion: just as it is possible to recognise objects which are partly hidden by other objects, it is possible to recognise speech without having the complete spectrum available. This is the essence of the 'missing data' approach:

  1. Identify the reliable evidence.
  2. Recognise on the basis of this incomplete evidence.

Problem 1 can be thought of as placing a 'mask' over the spectral data. In simple situations, techniques which estimate the local signal-to-noise ratio can be used to define the mask. This is illustrated in the 'SNR Mask' above. For comparison, the 'a priori mask' is obtained by cheating - using the clean speech and noise before mixing to work out the true local SNR. A better solution is to use primitive Auditory Scene Analysis to define the mask, or to combine ASA and SNR estimation.
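The two kinds of mask described above can be sketched as follows. This is a minimal illustration, not the SpandH implementation: it assumes spectral energies stored as NumPy arrays of shape (channels, frames), additive speech and noise energies, stationary noise that can be estimated from a few leading speech-free frames, and a 0 dB threshold. The function names and parameters are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log of zero


def a_priori_mask(speech_energy, noise_energy, threshold_db=0.0):
    """'Cheating' mask: uses the clean speech and noise before mixing.

    A cell is marked reliable when the true local SNR exceeds the threshold.
    """
    local_snr_db = 10.0 * np.log10((speech_energy + EPS) / (noise_energy + EPS))
    return local_snr_db > threshold_db


def estimated_snr_mask(mixture_energy, n_noise_frames=10, threshold_db=0.0):
    """Mask from an estimated local SNR, using only the noisy mixture.

    Assumption: the first n_noise_frames frames contain noise only, and the
    noise is steady enough for their mean to stand in for every later frame.
    """
    noise_est = mixture_energy[:, :n_noise_frames].mean(axis=1, keepdims=True)
    # Subtract the noise estimate to get a crude speech estimate (floored at zero).
    speech_est = np.maximum(mixture_energy - noise_est, 0.0)
    local_snr_db = 10.0 * np.log10((speech_est + EPS) / (noise_est + EPS))
    return local_snr_db > threshold_db
```

Both functions return a boolean array of the same shape as the input, with True marking the time-frequency cells treated as reliable speech evidence. The stationary-noise assumption is exactly what fails for unpredictable acoustic events, which is one reason to bring in ASA grouping cues.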

There are two approaches to problem 2. One can either estimate the missing values and then proceed as normal (missing data imputation) or use the distribution of the remaining values alone (marginalisation). In SpandH, both these techniques have been formulated for statistical recognisers based on Continuous Density Hidden Markov Models. Marginalisation generally outperforms imputation. Both techniques can be improved by the additional use of counter-evidence: even if we don't know the true speech value for some time-frequency pixel, we can put a bound on it, because the speech energy cannot be greater than the energy in the mixture. Thus speech sounds which require more energy than the total available can be rejected.
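As a rough sketch of these ideas, the code below scores one spectral frame against a single HMM state modelled as a diagonal-covariance Gaussian. It is an illustration under stated assumptions rather than the formulation used in the publications: features are taken to be non-negative spectral energies (so the lower bound is 0), each state has one Gaussian rather than a mixture, and imputation simply uses the state mean capped at the observed mixture value. All function names are hypothetical.

```python
import math
import numpy as np

SQRT2 = math.sqrt(2.0)


def gaussian_log_pdf(x, mean, std):
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def gaussian_cdf(x, mean, std):
    # erfc keeps precision in the lower tail, where 1 + erf would cancel.
    return 0.5 * math.erfc(-(x - mean) / (std * SQRT2))


def frame_log_likelihood(x, mask, mean, std, use_bounds=True):
    """Log-likelihood of one frame given one state, using the mask.

    Reliable channels contribute their Gaussian log-density.
    Unreliable channels are integrated out: with use_bounds=False they are
    dropped entirely (plain marginalisation); with use_bounds=True they
    contribute the probability that the speech value lies in [0, observed].
    """
    total = 0.0
    for xi, reliable, mu, sd in zip(x, mask, mean, std):
        if reliable:
            total += gaussian_log_pdf(xi, mu, sd)
        elif use_bounds:
            mass = gaussian_cdf(xi, mu, sd) - gaussian_cdf(0.0, mu, sd)
            total += math.log(max(mass, 1e-300))  # floor avoids log(0)
    return total


def impute_frame(x, mask, mean):
    """Fill unreliable channels with the state mean, capped at the mixture value."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    return np.where(np.asarray(mask, dtype=bool), x, np.minimum(mean, x))
```

With plain marginalisation, two states that differ only in a masked channel score identically. With bounds, a state whose mean in that channel lies far above the observed mixture energy is penalised, which is the counter-evidence effect described above.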

Here are some results, obtained in November 2000. The task is connected digit recognition, and the speech is mixed with Lynx helicopter noise in varying proportions. Word recognition accuracy is plotted against global signal-to-noise ratio.

  • The curve with legend 'raw filter bank' shows that performance degrades rapidly if no attempt is made to correct for the noise.
  • The curve with legend 'A priori' shows the potential of the missing data technique: it uses a mask obtained by 'cheating', using knowledge of the clean speech and noise before mixing.
  • The 'MFCC+CMN' curve uses standard robustness techniques: Mel Frequency Cepstral Coefficients and Cepstral Mean Normalisation.
  • 'MD Discrete' shows missing data results from marginalisation with bounds.
  • 'MD Fuzzy' shows a further win from 'softening' the missing data decision, using an estimate of the probability that a given time-frequency 'pixel' is speech or noise.
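One way to picture the 'softening' is to replace the binary mask with a per-pixel probability and blend the two likelihood terms accordingly. The sketch below is an assumption-laden illustration, not the exact method behind the 'MD Fuzzy' curve: the sigmoid mapping from estimated local SNR to probability, its slope and centre, the single diagonal Gaussian per state, and the division of the bounded term by the observed value (to keep both terms in comparable units) are all choices made here for the example.

```python
import math

SQRT2 = math.sqrt(2.0)


def speech_probability(local_snr_db, slope=0.5, centre_db=0.0):
    """Map an estimated local SNR (dB) to a probability that the pixel is speech.

    slope and centre_db are hypothetical tuning parameters.
    """
    return 1.0 / (1.0 + math.exp(-slope * (local_snr_db - centre_db)))


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gaussian_cdf(x, mean, std):
    return 0.5 * math.erfc(-(x - mean) / (std * SQRT2))


def soft_frame_log_likelihood(x, p_speech, mean, std):
    """Frame log-likelihood with a soft (probabilistic) mask.

    Each channel blends the 'speech present' density with the bounded
    'speech masked' term, weighted by the probability that it is speech.
    """
    total = 0.0
    for xi, p, mu, sd in zip(x, p_speech, mean, std):
        present = gaussian_pdf(xi, mu, sd)
        mass = gaussian_cdf(xi, mu, sd) - gaussian_cdf(0.0, mu, sd)
        masked = mass / max(xi, 1e-12)  # assumed normalisation over [0, xi]
        total += math.log(max(p * present + (1.0 - p) * masked, 1e-300))
    return total
```

Setting every probability to 1 or 0 recovers the two hard-decision cases, so the discrete mask is a special case of the soft one.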

Missing Data Projects

MD is a key part of the SpandH contribution to SPHEAR and RESPITE.

Inaugural Lecture

Key Publications
  • P.D. Green, J. Barker, M.P. Cooke and L. Josifovski, Handling Missing and Unreliable Information in Speech Recognition, Proc. AISTATS 2001, Key West.
  • M.P. Cooke, P.D. Green, L. Josifovski and A. Vizinho, Robust automatic speech recognition with missing and unreliable acoustic data, to appear in Speech Communication, 2000.
  • J. Barker, M.P. Cooke and D.P.W. Ellis, Decoding speech in the presence of other sound sources, ICSLP-2000, Beijing.
  • J. Barker, L. Josifovski, M.P. Cooke and P.D. Green, Soft decisions in missing data techniques for robust automatic speech recognition, ICSLP-2000, Beijing.
  • M. Cooke, P. Green, L. Josifovski and A. Vizinho, Robust ASR with unreliable data and minimal assumptions, in "Robust Methods for Speech Recognition in Adverse Conditions", Tampere, 1999.
  • L. Josifovski, M. Cooke, P. Green and A. Vizinho, State based imputation of missing data for robust speech recognition and speech enhancement, Eurospeech'99, Budapest, 1999.
  • A. Vizinho, P. Green, M. Cooke and L. Josifovski, Missing data theory, spectral subtraction and signal-to-noise estimation for robust ASR: An integrated study, Eurospeech'99, Budapest, 1999.