|RESPITE: Annual Report 1999: Scientific Highlights: Identifying Reliable Information|
In natural (i.e. noisy) conditions, some of the acoustic evidence for a speech source will be corrupted by other sources, just as in a visual scene an object may be partly hidden by other objects. One theme of RESPITE is to develop ways of identifying the reliable speech evidence, so that subsequent recognition algorithms can be directed towards the data that matters.
Human listeners are remarkably adept at `Auditory Scene Analysis', the perceptual separation of different sound sources. Following earlier studies by RESPITE partners and others, we are building a software toolkit for Computational Auditory Scene Analysis which will enable these advances to be deployed within practical recognition systems. The toolkit is now ready for its first software release. It provides a dataflow mechanism and a library of processing blocks implementing key Auditory Scene Analysis algorithms and many simple signal-processing operations. An easy-to-use scripting language allows complicated systems to be constructed efficiently from this library of simple blocks. The toolkit also contains blocks for token-passing Viterbi decoding and missing-data recognition, so auditory scene analysis systems can be designed, built and evaluated within a single software framework. Here is a screen shot [262k] of the toolkit in action.
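The dataflow idea can be illustrated with a short sketch. This is not the toolkit's actual API (all class names here are invented for illustration); it only shows the principle of composing a system from a library of simple processing blocks.

```python
# Illustrative sketch of a dataflow mechanism: each processing block
# transforms a data frame, and blocks are chained into pipelines so that
# complex systems are assembled from simple reusable parts.
# All names are hypothetical, not the toolkit's real interface.

import numpy as np

class Block:
    """Base class: a block transforms one data frame into another."""
    def process(self, data):
        raise NotImplementedError

class Gain(Block):
    """Scale the signal by a constant factor."""
    def __init__(self, factor):
        self.factor = factor
    def process(self, data):
        return data * self.factor

class Rectify(Block):
    """Half-wave rectification, a common auditory front-end stage."""
    def process(self, data):
        return np.maximum(data, 0.0)

class Pipeline(Block):
    """Run data through a sequence of blocks in order."""
    def __init__(self, *blocks):
        self.blocks = list(blocks)
    def process(self, data):
        for block in self.blocks:
            data = block.process(data)
        return data

# Assemble a tiny system from simple blocks, script-style:
system = Pipeline(Gain(2.0), Rectify())
out = system.process(np.array([-1.0, 0.5, 2.0]))
print(out)  # [0. 1. 4.]
```

Because every block shares the same interface, a pipeline is itself a block and can be nested inside larger systems, which is what makes scripting complicated systems from a small library tractable.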
Signal-to-noise ratio (SNR) estimation provides another way to identify reliable data in the time-frequency grid: portions of the signal for which the SNR is too low can be treated as missing. During the first year of the project, several local (in time and in frequency) noise level estimation techniques have been tested, compared and improved: Hirsch histograms, energy clustering, low-envelope following, and weighted averaging. Combined with a new pre-processing algorithm based on harmonic filtering, these methods are able to follow fast changes in the noise spectrum. Indeed, harmonic filtering allows the system to estimate the noise level even during voiced parts of speech. A description of the methods, their comparison and their application to robust speech recognition can be found here.
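As a rough sketch of the idea (not the project's code), the low-envelope follower mentioned above can be written as a per-channel tracker whose estimate rises slowly while speech energy is present and snaps back down at the noise floor; a local-SNR threshold then marks low-SNR time-frequency cells as missing. The rise factor and threshold below are arbitrary illustrative choices.

```python
# Sketch of low-envelope noise tracking plus an SNR-based missing-data
# mask. Parameter values are illustrative, not those used in RESPITE.

import numpy as np

def low_envelope_noise(power, rise=1.05):
    """Track the low envelope of a (frames x channels) power spectrogram.

    The estimate rises slowly when the observed power exceeds it (speech
    present) and snaps down to the observed power otherwise, so it stays
    close to the underlying noise floor.
    """
    noise = np.empty_like(power)
    est = power[0].copy()
    for t, frame in enumerate(power):
        est = np.where(frame > est, est * rise, frame)
        noise[t] = est
    return noise

def missing_data_mask(power, noise, snr_threshold_db=0.0):
    """Cells whose local SNR falls below the threshold are unreliable."""
    snr_db = 10.0 * np.log10(np.maximum(power, 1e-12) /
                             np.maximum(noise, 1e-12))
    return snr_db > snr_threshold_db  # True = reliable cell

# Toy example: a steady noise floor of 1.0 with a burst of "speech" energy.
power = np.ones((100, 4))
power[40:60] += 20.0
mask = missing_data_mask(power, low_envelope_noise(power))
```

In this toy example the burst frames come out as reliable (their local SNR is well above the threshold) while the noise-only frames are marked missing, which is exactly the information a missing-data recogniser needs.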
Multiband ASR requires an estimate of the reliability of the data in each frequency subband.
In the theoretical framework we are developing at IDIAP, a weight must be estimated for each subband combination. This weight represents the probability that the expert for that combination is the one most likely to perform correct phoneme-level recognition. The phoneme posterior probabilities output by each MLP expert are combined as a linear weighted sum, after which a word sequence is generated from the combined phoneme posteriors in the same way as in the baseline system.
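The combination step itself is simple; a minimal sketch, with made-up numbers, might look like this:

```python
# Sketch of the linear weighted-sum combination described above: each
# subband expert (an MLP) outputs phoneme posteriors, and a weight per
# expert reflects the probability that it is the best-performing one.
# The experts, classes and weights here are illustrative only.

import numpy as np

def combine_posteriors(expert_posteriors, weights):
    """expert_posteriors: (n_experts x n_phonemes); weights sum to 1."""
    return np.asarray(weights) @ np.asarray(expert_posteriors)

# Two experts over three phoneme classes:
p1 = [0.7, 0.2, 0.1]   # one expert, fairly confident
p2 = [0.4, 0.4, 0.2]   # another expert, less certain
combined = combine_posteriors([p1, p2], [0.6, 0.4])
print(combined)  # [0.58 0.28 0.14]
```

Since the weights sum to one and each expert's posteriors sum to one, the combined vector is again a valid posterior distribution, ready for the decoder.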
Some of the combination methods tested estimate only a fixed set of weights, optimised over the whole training set. Other methods are adaptive, estimating a new set of weights every 10 ms or so. Objective functions tested for fixed-weight estimation include linear and nonlinear LMSE (least mean square error) and ML (maximum likelihood). Local data reliability functions tested for adaptive weight estimation include: local SNR (signal-to-noise ratio) estimation, characteristics of the local signal energy profile, a harmonicity index, classification entropy, and data likelihood.
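To make one of these reliability functions concrete, here is a hedged sketch of weighting by classification entropy: an expert whose posterior distribution is sharp (low entropy) is trusted more than one whose output is nearly flat. This is an illustration of the idea, not the project's implementation.

```python
# Entropy-based adaptive weighting sketch: experts with low-entropy
# (confident) posterior distributions receive larger weights.
# The inverse-entropy rule and the example posteriors are illustrative.

import numpy as np

def entropy_weights(expert_posteriors, eps=1e-12):
    """Weight each expert by the inverse of its posterior entropy."""
    p = np.asarray(expert_posteriors)
    entropy = -np.sum(p * np.log(p + eps), axis=1)
    inv = 1.0 / (entropy + eps)
    return inv / inv.sum()

sharp = [0.90, 0.05, 0.05]  # confident expert -> low entropy
flat  = [0.34, 0.33, 0.33]  # confused expert -> high entropy
w = entropy_weights([sharp, flat])
assert w[0] > w[1]          # the confident expert gets the larger weight
```

In an adaptive scheme such weights would be recomputed on every frame (or every few frames), so the combination tracks which subbands are currently trustworthy.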
The "front-end" principle consists of feeding an enhanced (or segregated) signal to the recogniser, which is a conventional full-band ANN/HMM system. This "enhance or segregate, then recognise" approach is globally sequential: after reconstruction of the spectrogram, a waveform is re-synthesised to feed the recognition module, which runs independently. The re-synthesised signal can be listened to, so the front-end could also be used for noise reduction in contexts other than automatic speech recognition. We have tested enhancement based on the harmonicity cue. Compared with J-RASTA pre-processing used alone, we observe a significant improvement in recognition in loud stationary noise after enhancing the signal according to the local degree of harmonicity, measured by subband auto-correlation. The enhancement boosts the vocalic segments of the speech, at the price of a relative attenuation of the unvoiced segments; it thus reinforces the parts of speech that are most resistant to loud noise. A segregation algorithm using a similar principle, but based on the TDOA (time difference of arrival), has been applied to segregate binary mixtures.
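The harmonicity cue itself can be sketched briefly. A voiced signal is quasi-periodic within a subband, so its normalised autocorrelation has a strong peak at the pitch lag; the height of that peak serves as a harmonicity index that can then scale the subband. The lag range and signals below are illustrative assumptions, not the project's actual parameters.

```python
# Sketch of a harmonicity index from normalised autocorrelation:
# a periodic (voiced-like) signal scores high, white noise scores low.
# Lag range and signal lengths are made-up illustrative values.

import numpy as np

def harmonicity_index(x, min_lag=40, max_lag=200):
    """Height of the largest normalised autocorrelation peak within a
    plausible pitch-lag range (roughly 80-400 Hz at 16 kHz sampling)."""
    x = x - x.mean()
    energy = np.dot(x, x)
    if energy == 0.0:
        return 0.0
    r = np.array([np.dot(x[:-lag], x[lag:]) / energy
                  for lag in range(min_lag, max_lag)])
    return float(max(r.max(), 0.0))

t = np.arange(2048)
voiced = np.sin(2 * np.pi * t / 100.0)            # period: 100 samples
noise = np.random.default_rng(0).standard_normal(2048)
print(harmonicity_index(voiced) > harmonicity_index(noise))  # True
```

Applying a gain driven by this index boosts the strongly harmonic (vocalic) segments relative to the unvoiced ones, which matches the behaviour described above.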