RESPITE: Annual Report 1999: Scientific Highlights: Dealing with Missing Data

Dealing with Missing Data

The Missing Data approach bases recognition only on those spectral-temporal regions that have been identified as carrying reliable speech evidence.

Even at very low SNRs, some spectral-temporal regions contain reasonably `clean' speech. The idea is illustrated in the figure below. The left panel shows a spectral-temporal representation of speech mixed with factory noise. The centre panel shows the same mixture through a so-called `a priori mask', which uses knowledge of the unmixed signals to select the regions where the speech has more energy than the noise. The right-hand panel shows how this `ideal' mask can be approximated using noise estimation techniques.

[Figure: three panels (Speech + Factory Noise; A priori Mask; Estimated Mask), viewable at SNR = 20 dB, 10 dB and 0 dB]
As the SNR is reduced from 20 dB to 0 dB, fewer and fewer reliable points remain in the masks. Missing data speech recognition techniques must therefore cope with the sparse representations that remain once the unreliable data have been removed.
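The two kinds of mask can be sketched in a few lines of Python. This is an illustration only, not the project's implementation: the function names are hypothetical, and the noise estimator (the mean of a few leading frames assumed to contain noise only, followed by a local SNR threshold) is an assumed stand-in for whichever noise estimation technique is actually used.

```python
def a_priori_mask(speech, noise):
    """Mark a point reliable (1) where the speech energy exceeds the noise energy.

    speech, noise: lists of frames, each frame a list of per-channel energies,
    computed from the two signals separately before mixing.
    """
    return [[1 if s > n else 0 for s, n in zip(s_frame, n_frame)]
            for s_frame, n_frame in zip(speech, noise)]


def estimated_mask(mixture, n_noise_frames=2, threshold_db=0.0):
    """Approximate the a priori mask from the mixture alone.

    Assumption: the first n_noise_frames frames contain noise only and the
    noise is roughly stationary, so their per-channel mean is the noise estimate.
    """
    n_channels = len(mixture[0])
    noise_est = [sum(frame[c] for frame in mixture[:n_noise_frames]) / n_noise_frames
                 for c in range(n_channels)]
    factor = 10.0 ** (threshold_db / 10.0)  # local SNR threshold as an energy ratio
    mask = []
    for frame in mixture:
        row = []
        for y, n in zip(frame, noise_est):
            speech_est = max(y - n, 0.0)  # subtract the noise estimate, floor at zero
            row.append(1 if speech_est > n * factor else 0)
        mask.append(row)
    return mask


def reliable_fraction(mask):
    """Fraction of spectral-temporal points marked reliable."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total
```

With toy energies such as `speech = [[0, 0], [4, 1], [1, 6]]` and a constant `noise = [[2, 2]] * 3`, the a priori mask keeps only the two points where speech dominates, so `reliable_fraction` is one third. Raising the noise level shrinks that fraction, which is the sparsity effect described above.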

Missing Data Recognition Methods (Sheffield)

Two missing data methods have been formalised in a probabilistic framework and tested within the context of statistical speech recognition (with Continuous Density Hidden Markov Models). The `Marginalisation' method essentially integrates over all possible values of the missing data, whereas `State-based Imputation' estimates the missing data values conditioned on the present data and the recognition hypothesis. Both approaches benefit greatly from the use of data limits: in an energy-based acoustic representation, the missing speech values must lie between 0 and the total energy in the mixture.
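A minimal sketch of both ideas for a single frame follows. It makes simplifying assumptions that are not stated in the report: each state is modelled by one diagonal-covariance Gaussian rather than a mixture, and the function names are hypothetical. Under a diagonal covariance the missing channels are independent of the present ones given the state, so the imputed value here is simply the state mean clipped to the data limits, which is a crude stand-in for a proper bounded estimate.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gauss_cdf(x, mean, var):
    """Cumulative distribution of a univariate Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def bounded_marginal_likelihood(obs, mask, means, variances):
    """Frame likelihood for one state, marginalising the unreliable channels.

    Reliable channels (mask == 1) contribute the density at the observed value.
    Unreliable channels contribute the probability mass between 0 and the
    observed mixture energy, i.e. the integral over the data limits.
    """
    like = 1.0
    for y, m, mu, var in zip(obs, mask, means, variances):
        if m:
            like *= gauss_pdf(y, mu, var)
        else:
            like *= gauss_cdf(y, mu, var) - gauss_cdf(0.0, mu, var)
    return like


def impute(obs, mask, means):
    """Fill unreliable channels with the state mean, clipped to [0, observed]."""
    return [y if m else min(max(mu, 0.0), y)
            for y, m, mu in zip(obs, mask, means)]
```

For example, with `obs = [1.0, 2.0]`, `mask = [1, 0]`, unit variances and means of 1.0, the first channel contributes the Gaussian density at its mean and the second contributes the mass within one standard deviation of the mean. In a recogniser this per-state quantity would replace the usual emission likelihood during decoding.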

The following figure illustrates that, when simple noise estimates are used to identify the reliable data, it is possible to achieve recognition performance close to that obtained by `cheating' (using knowledge of the speech and noise separately before mixing) down to around 10 dB SNR.

[Figure: missing data results graph]

In a series of experiments on connected digit recognition using the AURORA task, we have shown, for instance, that robust performance can be obtained for speech mixed with helicopter noise at 0 dB SNR. In this case, only xx% of the spectral-temporal `pixels' are taken to be reliable.