RESPITE: Annual Report 2000: Scientific Highlights: Dealing with Missing Data
The Missing Data approach bases recognition only on those spectral-temporal regions that have been identified as carrying reliable speech evidence.
Even at very low SNRs, some spectral-temporal regions contain reasonably `clean' speech. The idea is illustrated in the figure below. The left panel shows a spectral-temporal representation of speech mixed with factory noise. The centre panel shows the same mixture through a so-called `a priori mask', which uses knowledge of the unmixed signals to select the regions where the speech has more energy than the noise. The right-hand panel shows how this `ideal' mask can be approximated using noise estimation techniques.
Figure (three panels, left to right): Speech + Factory Noise; A priori Mask; Estimated Mask.
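The a priori mask described above can be sketched in a few lines. This is only an illustration: the function name `a_priori_mask`, the list-of-frames layout and the toy energy values are assumptions, not part of the RESPITE system. It marks a time-frequency cell as reliable when the (separately known) speech energy exceeds the noise energy.

```python
def a_priori_mask(speech_energy, noise_energy):
    """Return a binary mask with 1 where speech energy exceeds noise energy.

    Both arguments are lists of frames; each frame is a list of
    per-channel energies for the unmixed speech and noise signals.
    """
    return [
        [1 if s > n else 0 for s, n in zip(speech_frame, noise_frame)]
        for speech_frame, noise_frame in zip(speech_energy, noise_energy)
    ]


# Toy example: 2 frames x 3 frequency channels (values are made up).
speech = [[4.0, 0.5, 2.0], [0.1, 3.0, 0.2]]
noise = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
mask = a_priori_mask(speech, noise)
print(mask)  # [[1, 0, 1], [0, 1, 0]]
```

Such a mask needs the unmixed signals, so it can only serve as a reference; an estimated mask has to replace the true noise energy with a noise estimate.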
As reported last year, two missing data methods have been formalised in a probabilistic framework and tested within the context of statistical speech recognition (with Continuous Density Hidden Markov Models). The `Marginalisation' method essentially integrates over all possible values of the missing data, whereas `State-based Imputation' estimates the missing data values conditioned on the present data and the recognition hypothesis. Both approaches benefit greatly from using data limits: in an energy-based acoustic representation, missing speech values must lie between 0 and the total energy in the mixture.
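As a rough sketch of how the data limits enter the computation, the code below scores one observation frame against a single state modelled by one diagonal-covariance Gaussian. That model, and all function names, are simplifying assumptions made for illustration; the system described here uses Continuous Density HMMs, and its exact formulation is not given in this report. Present components contribute the usual Gaussian density; missing components contribute the probability mass between 0 and the observed mixture energy. The imputation helper returns the most probable value of each missing component under the same bounded Gaussian, which is the state mean clipped to the limits.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def bounded_marginal_log_likelihood(obs, mask, means, stds):
    """Log-likelihood of one frame with bounded marginalisation.

    obs   : observed mixture energies (assumed non-negative)
    mask  : 1 where the component is treated as reliable speech, else 0
    means, stds : parameters of the (assumed) diagonal Gaussian state
    """
    total = 0.0
    for y, present, mean, std in zip(obs, mask, means, stds):
        if present:
            total += gaussian_log_pdf(y, mean, std)
        else:
            # Missing speech value lies somewhere in [0, y]: integrate over it.
            mass = gaussian_cdf(y, mean, std) - gaussian_cdf(0.0, mean, std)
            total += math.log(max(mass, 1e-300))  # guard against log(0)
    return total


def impute_missing(obs, mask, means):
    """Replace missing components by the state mean clipped to [0, y]."""
    return [
        y if present else min(max(mean, 0.0), y)
        for y, present, mean in zip(obs, mask, means)
    ]


frame = [2.0, 5.0]      # hypothetical mixture energies
reliable = [1, 0]       # second channel is dominated by noise
score = bounded_marginal_log_likelihood(frame, reliable, [2.0, 1.0], [1.0, 1.0])
filled = impute_missing(frame, reliable, [2.0, 7.0])
```

Without the limits, the integral over a missing component would run over the whole real line and contribute a factor of 1, so the missing channels would carry no evidence at all. With the limits, a state whose mean lies far above the observed mixture energy is penalised.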
This year we have concentrated on enhancing the sophistication of the marginalisation-based system. The progress we have made is reflected in the improved recognition results shown below:
While the overall performance gain has been due to a host of small system improvements, one noteworthy development has been the application of `fuzzy decisions'. Hard decisions about whether or not to trust the information in a spectro-temporal region have been replaced by soft probabilistic decisions. The figure below illustrates how a time-frequency SNR estimate was previously transformed into a `mask' which classified spectro-temporal regions as either present or missing. The thresholding function may instead be adapted to soften the mask. Softening the mask in this way means we avoid having to make potentially damaging incorrect decisions about the nature of the data at an early stage of processing.
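The difference between a hard and a softened thresholding function can be sketched as follows. The sigmoid shape and the parameter names `threshold_db` and `slope` are assumptions chosen for illustration; the report does not specify the function actually used.

```python
import math


def hard_mask(snr_db, threshold_db=0.0):
    """Binary decision: 1.0 if the local SNR estimate exceeds the threshold."""
    return [[1.0 if v > threshold_db else 0.0 for v in frame] for frame in snr_db]


def soft_mask(snr_db, threshold_db=0.0, slope=0.5):
    """Soft decision: a sigmoid of the local SNR estimate, in (0, 1).

    Larger `slope` values approach the hard step function.
    """
    def sigmoid(v):
        z = slope * (v - threshold_db)
        z = max(-50.0, min(50.0, z))  # keep math.exp in a safe range
        return 1.0 / (1.0 + math.exp(-z))

    return [[sigmoid(v) for v in frame] for frame in snr_db]


# Hypothetical local SNR estimates in dB: 1 frame x 4 channels.
snr = [[-10.0, -1.0, 1.0, 10.0]]
print(hard_mask(snr))
print(soft_mask(snr))
```

Cells whose SNR estimate lies close to the threshold receive values near 0.5 in the soft mask, so a small estimation error no longer flips them between fully present and fully missing.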
Even missing data systems start to struggle in conditions where the noise and the signal have comparable energy, particularly when the noise is highly non-stationary. To address this problem, we have designed and implemented a new speech recognition engine that uses the missing data framework to combine two processes. Bottom-up techniques, based on local features, identify `coherent fragments' of spectro-temporal energy. The top-down hypothesis search of conventional speech recognition is then extended to search also across the possible assignments of each fragment as speech or interference. Initial tests of the system have produced highly promising results.
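The idea of searching over fragment assignments can be illustrated with a brute-force sketch. This is not the decoder described above, which integrates the search with the recognition hypothesis search; here every speech/interference labelling is simply enumerated, a missing data mask is built for each, and a caller-supplied scoring function (a hypothetical stand-in for the recogniser's likelihood) picks the best. All names are assumptions.

```python
from itertools import product


def mask_from_labels(fragments, labels, n_frames, n_channels):
    """Build a binary mask: cells of fragments labelled as speech are set to 1."""
    mask = [[0] * n_channels for _ in range(n_frames)]
    for cells, is_speech in zip(fragments, labels):
        if is_speech:
            for t, f in cells:
                mask[t][f] = 1
    return mask


def best_fragment_labelling(fragments, n_frames, n_channels, score):
    """Try every speech/interference labelling and keep the best-scoring one.

    fragments : list of fragments, each a list of (frame, channel) cells
    score     : callable mapping a mask to a number (higher is better)
    Returns (best_score, labels, mask).
    """
    best = None
    for labels in product([False, True], repeat=len(fragments)):
        mask = mask_from_labels(fragments, labels, n_frames, n_channels)
        s = score(mask)
        if best is None or s > best[0]:
            best = (s, labels, mask)
    return best


# Toy example: two fragments on a 2 x 2 grid and a made-up scoring function
# that rewards the first row and penalises the cell at (1, 0).
toy_fragments = [[(0, 0), (0, 1)], [(1, 0)]]
toy_score = lambda m: m[0][0] + m[0][1] - m[1][0]
result = best_fragment_labelling(toy_fragments, 2, 2, toy_score)
print(result)  # (2, (True, False), [[1, 1], [0, 0]])
```

Exhaustive enumeration grows as 2^N in the number of fragments, which is why a practical system has to fold the assignment search into the decoder rather than score each labelling separately.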
This work is reported in the paper: Decoding Speech in the Presence of Other Sound Sources.