|RESPITE: Annual Report 1999: Scientific Highlights: The Multistream Formalism|
Recognition `experts' processing different kinds of information tend to make different, and often complementary, errors. Multistream recognition comprises a broad range of techniques for the combination of multiple experts in various ways such that the overall error rate decreases.
Multiple data streams may arise as data from different sensory modalities, or as independent feature sets extracted from the same sensory data. Multiband ASR is a particular case of multistream recognition in which the acoustic speech signal, first spectrally decomposed in a standard way, is then divided into a number of subbands, with each subband treated as an independent source of information.
While the separate processing of frequency subbands can improve ASR performance in noise, this can also result in reduced performance with clean speech. At IDIAP we have developed a model which overcomes this problem by assigning an expert not only to each subband, but to each subband combination. We call this the `full combination' approach to multiband ASR. By introducing a latent variable which gives the unknown position of the most reliable data, the contribution of each data window to the fullband phoneme posterior probability (or likelihood) is decomposed into a weighted sum of posteriors (or likelihoods) for each subband combination.
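The decomposition above can be sketched numerically. The snippet below is a minimal illustration, not the IDIAP implementation: it assumes three subbands, enumerates all 2^3 subband combinations, and forms the fullband phoneme posterior as a weighted sum of per-combination posteriors, with the weights playing the role of the latent reliability variable. The phoneme set, the random stand-in posteriors, and the weights are all illustrative.

```python
import numpy as np
from itertools import chain, combinations

# Hypothetical setup: 3 subbands give 2^3 = 8 subband combinations
# (including the empty set, whose expert would fall back to the prior).
n_bands = 3
phonemes = ["a", "e", "o"]          # illustrative phoneme classes

def all_combinations(bands):
    """Enumerate every subset of the subband indices."""
    return list(chain.from_iterable(combinations(bands, r)
                                    for r in range(len(bands) + 1)))

combos = all_combinations(range(n_bands))

rng = np.random.default_rng(0)
# Stand-in for the per-combination expert outputs P(phoneme | x_c):
# each row is a posterior distribution over the phoneme classes.
posteriors = rng.dirichlet(np.ones(len(phonemes)), size=len(combos))

# Reliability weight for each combination (the latent variable giving
# the position of the reliable data); the weights sum to one.
weights = rng.dirichlet(np.ones(len(combos)))

# Full-combination posterior: weighted sum over all combination experts.
full_posterior = weights @ posteriors
```

Because each expert's output is a proper distribution and the weights sum to one, the combined result is itself a valid posterior over the phoneme set.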
Experiments with a large multispeaker database of free-format numbers, and various types of noise, have shown that some of these multiband methods are significantly more robust to noise than the best state-of-the-art ASR systems, provided that a reasonable proportion of the acoustic features remains uncorrupted.
A key idea in the RESPITE project is the combination of multiple versions of the speech signal to improve recognition accuracy. These versions might be obtained from different microphones, by filtering the signal from a single microphone, or by applying a variety of algorithms to obtain different representations of a single audio recording. The theory is that if each version or `information stream' performs best for different speakers, different portions of the speech, or different acoustic environments, then an ideal combination of the streams gives the best of all worlds, shifting reliance between streams as conditions change to favour one or another.
How best to combine these information streams is an interesting and complex theoretical problem. Our approach is as follows: initially, the probability of each speech sound is calculated from each information stream in isolation; the overall probability is then obtained by multiplying the component probabilities together. (This makes the implicit approximation that the information streams are conditionally independent.)
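The multiplication step can be illustrated with a small numerical sketch. The per-stream posteriors below are invented for illustration; the products are formed in the log domain for numerical stability and then renormalised, which is the standard way to realise this combination in practice.

```python
import numpy as np

# Hypothetical per-stream posteriors P(phoneme | stream_i) for three
# streams over four phoneme classes (rows: streams, columns: phonemes).
stream_posteriors = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.60, 0.20, 0.10, 0.10],
])

# Under the conditional-independence approximation, combine by
# multiplying the per-stream probabilities (summing in the log domain),
# then renormalise to obtain a proper posterior.
log_combined = np.log(stream_posteriors).sum(axis=0)
combined = np.exp(log_combined - log_combined.max())
combined /= combined.sum()
```

Note how the product sharpens the decision: each stream only mildly favours the first class, but the combined posterior concentrates most of its mass there.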
By training a second statistical model on the output of this first combination stage, we obtained the final speech probabilities to pass to the standard Hidden Markov Model recogniser. We might have expected that all the useful information would have been extracted in the first stage of probability estimation, but our results show that this second stage of modelling gives us very considerable improvements.
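The second stage described above is a form of stacked estimation: a further statistical model is trained on the first-stage posteriors. The sketch below stands in for that idea with a plain softmax (multinomial logistic) regression fitted by gradient descent; the frame counts, labels, and training data are all synthetic placeholders, not the models used in the project.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical first-stage outputs: combined phoneme posteriors for
# 200 frames over 4 classes, plus a true phoneme label per frame.
n_frames, n_classes = 200, 4
first_stage = rng.dirichlet(np.ones(n_classes), size=n_frames)
labels = rng.integers(0, n_classes, size=n_frames)
onehot = np.eye(n_classes)[labels]

# Second-stage model: softmax regression on the first-stage posteriors,
# standing in for the second statistical model described above.
W = np.zeros((n_classes, n_classes))
b = np.zeros(n_classes)

for _ in range(500):                        # plain gradient descent
    logits = first_stage @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = probs - onehot
    W -= 0.1 * first_stage.T @ grad / n_frames
    b -= 0.1 * grad.mean(axis=0)

second_stage = probs        # refined posteriors passed on to the HMM
```

The second model can learn systematic biases in the first-stage estimates, which is one way to understand why the extra stage helps even though the first stage has already seen all of the acoustic evidence.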
A speech recognition system using this approach was submitted to the European Telecommunications Standards Institute's Aurora evaluation, which aims to find the best speech representation for use in distributed speech recognition, i.e. where part of the recognition processing occurs in a mobile telephone handset and the intermediate results are transmitted to a remote server. The evaluation involved recognising continuously spoken digits in a range of noisy backgrounds, and the participants were major telecommunications research laboratories operating in Europe. Our system, which was submitted in collaboration with Qualcomm, eliminated 60% of the word errors made by the standard reference system. This was by far the best result: the next best system achieved 42%, and the remaining participants reduced errors by less than 30%. All the other systems were based on a single information stream and a single stage of statistical modelling; the dramatic success of our submission indicates the improved robustness possible for speech recognisers employing multistream techniques.
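The figures quoted above are relative word-error reductions against the reference system. The arithmetic, with illustrative error counts (not the actual Aurora numbers), is simply:

```python
# "Eliminated 60% of the word errors" is a relative reduction against
# the reference system's error count. Illustrative numbers only:
baseline_errors = 1000   # words misrecognised by the reference system
system_errors = 400      # words misrecognised by the submitted system

relative_reduction = (baseline_errors - system_errors) / baseline_errors
print(f"{relative_reduction:.0%}")   # -> 60%
```

A 60% relative reduction therefore means the system made 40% as many word errors as the reference, regardless of the absolute error rate.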
Detailed technical information about these experiments can be found on the multistream results page.