RESPITE: Annual Report 1999: Scientific Highlights: The Multistream Formalism


The Multistream Formalism

Recognition `experts' processing different kinds of information tend to make different, and often complementary, errors. Multistream recognition comprises a broad range of techniques for combining multiple experts so that the overall error rate decreases.

Multiple data streams may arise as data from different sensory modalities, or as independent feature sets extracted from the same sensory data. Multiband ASR is a particular case of multistream recognition in which the acoustic speech signal, first spectrally decomposed in a standard way, is then divided into a number of subbands, with each subband treated as an independent source of information.
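As a concrete illustration, the subband split can be sketched as slicing a spectral feature matrix along its frequency axis. The shapes and the use of NumPy here are illustrative assumptions, not the actual RESPITE front-end:

```python
import numpy as np

# A minimal sketch of multiband decomposition: a spectral feature matrix
# of shape (frames, channels) is split along the frequency axis into
# contiguous subbands, each then treated as an independent stream.
def split_into_subbands(spectrogram, n_subbands):
    """Split the channel axis into n_subbands contiguous bands."""
    return np.array_split(spectrogram, n_subbands, axis=1)

# Example: 100 frames x 20 spectral channels, split into 4 subbands of 5.
features = np.random.rand(100, 20)
subbands = split_into_subbands(features, 4)
```

Each element of `subbands` would then be fed to its own recognition expert.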

The Full Combination Approach to Multiband ASR (IDIAP)

While the separate processing of frequency subbands can improve ASR performance in noise, this can also result in reduced performance with clean speech. At IDIAP we have developed a model which overcomes this problem by assigning an expert not only to each subband, but to each subband combination. We call this the `full combination' approach to multiband ASR. By introducing a latent variable which gives the unknown position of the most reliable data, the contribution of each data window to the fullband phoneme posterior probability (or likelihood) is decomposed into a weighted sum of posteriors (or likelihoods) for each subband combination.
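The decomposition above can be sketched in a few lines: the fullband posterior is a weighted sum, over every subband combination, of that combination's posterior, with the weights playing the role of the latent reliability variable. The data structures and toy values below are assumptions for illustration only:

```python
import itertools
import numpy as np

def full_combination_posterior(subband_posteriors, weights):
    """Full-combination decomposition: the fullband posterior is the
    weighted sum, over all subband combinations, of each combination's
    posterior. subband_posteriors maps each combination (a frozenset of
    band indices) to a posterior vector; weights maps the same
    combinations to the probability that that combination is the
    reliable region (weights sum to one)."""
    return sum(w * subband_posteriors[c] for c, w in weights.items())

# Toy example with 2 subbands: all 4 combinations, including the empty
# set (which in practice falls back to the prior).
combos = [frozenset(s) for r in range(3)
          for s in itertools.combinations(range(2), r)]
posteriors = {c: np.array([0.7, 0.3]) for c in combos}  # made-up values
weights = {c: 0.25 for c in combos}                     # uniform weights
p = full_combination_posterior(posteriors, weights)
```

With identical toy posteriors and uniform weights the combined result is unchanged; the interesting behaviour appears when the weights favour the uncorrupted combinations.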

Experiments with a large multispeaker database of free-format numbers, and various types of noise, have shown that some of these multiband methods are significantly more robust to noise than the best state-of-the-art ASR systems, provided that a reasonable proportion of acoustic features remain uncorrupted.

Multiple information streams to improve speech recognition (ICSI)

A key idea in the RESPITE project is the combination of multiple versions of the speech signal to improve recognition accuracy.  These versions might be obtained from different microphones, or by filtering the signal from a single microphone, or by applying a variety of algorithms to obtain different representations of a single audio recording.  The theory is that if each version or 'information stream' performs best for different speakers, different portions of the speech, or in different acoustic environments, then an ideal combination of the streams gives the best of all worlds - relying on different information streams as the conditions change to favour them.

Rather than developing new information streams specifically for combination, we decided to experiment with combining speech representations that we were already using.  We had four of these, each sufficiently distinct to be promising candidates for combination.  Two of the feature streams (called Perceptual Linear Prediction and Modulation-filtered SpectroGram) were already being used by several project partners.  The remaining two were contributed by our colleague Hynek Hermansky and his students at the Oregon Graduate Institute, and were based on looking at the signal over long time windows of up to 1 second.

Combining information is an interesting and complex theoretical problem.  Our approach is as follows: initially, the probability of each speech sound is calculated from each information stream in isolation; the overall probability is then obtained by multiplying the component probabilities together.  (This makes the implicit approximation that the information streams are conditionally independent.)

By training a second statistical model on the output of this first combination stage, we obtained the final speech probabilities to pass to the standard Hidden Markov Model recogniser.  We might have expected that all the useful information would have been extracted in the first stage of probability estimation, but our results show that this second stage of modelling gives us very considerable improvements.
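The second stage can be sketched as a trained model taking the concatenated first-stage stream posteriors as input. The linear-plus-softmax layer and its made-up parameters below stand in for whatever statistical model is actually trained; everything here is a hypothetical illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def second_stage(stream_posteriors, W, b):
    """Second statistical model: the per-stream posteriors are
    concatenated into one feature vector and mapped through a trained
    layer (here a linear map + softmax as a stand-in; W and b are
    hypothetical trained parameters) to give the final posteriors
    passed to the Hidden Markov Model recogniser."""
    x = np.concatenate(stream_posteriors)   # stack first-stage outputs
    return softmax(W @ x + b)

# Toy demonstration with made-up parameters: two streams of 3-class
# posteriors in, 3 final classes out.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 6)), np.zeros(3)
p = second_stage([np.array([0.5, 0.3, 0.2]),
                  np.array([0.6, 0.3, 0.1])], W, b)
```

The point of this stage is that the second model can learn systematic biases and correlations between streams that the simple product of the first stage cannot capture.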

Multistream wins the Distributed Speech Recognition evaluation

A speech recognition system using this approach was submitted to the European Telecommunications Standards Institute's Aurora evaluation, which aims to find the best speech representation for distributed speech recognition - i.e. where part of the recognition processing occurs in a mobile telephone handset and the intermediate results are transmitted to a remote server.  The evaluation involved recognising continuously-spoken digits in a range of noisy backgrounds, and the participants were major telecommunications research laboratories operating in Europe.  Our system, submitted in collaboration with Qualcomm, eliminated 60% of the word errors made by the standard reference system.  This was by far the best result - the next best system achieved a 42% reduction, and the remaining participants reduced errors by less than 30%.  All the other systems were based on a single information stream and a single stage of statistical modelling; the dramatic success of our submission indicates the improved robustness possible for speech recognisers employing multistream techniques.

Detailed technical information about these experiments can be found in the multistream results page.

CASA labelling (Grenoble + IDIAP)

The aim of the "labelling" principle is to inform the recognition level about the SNR present in the peripheral time-frequency representation. The first solution (the CASA front-end) is rather classical, whereas this second CASA approach grew out of work on missing data and multistream recognition, which adapt the recognition level itself. We propose a generic method based on mapping the relationship between the hidden SNR and an observable value O extracted from the signal (harmonicity [2], localisation [5]). This generalises to any signal parameter that varies with the SNR, including "non-CASA" indices such as the entropy of the posteriors. We then evaluate the probability P(SNR>T|O) that the SNR exceeds a threshold T. This is a probability of being "clean enough", which can be computed locally in the time-frequency plane. The CASA labelling given to a multistream recogniser is therefore related to the SNR, but it is not an SNR estimate. In this framework, both the detection step and the link between partial recognition and multistream recognition are well formalised, since this probability is used to compute a weighted average of the posteriors produced by the full set of partial recognisers (this corresponds to the so-called "full combination" method).
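A minimal sketch of the labelling step, assuming a monotone mapping from an observable index to P(SNR>T|O): the logistic curve and its parameters below are made up for illustration, whereas in practice this mapping would be estimated from data relating the observable to the hidden SNR.

```python
import numpy as np

def p_clean_enough(obs, a=8.0, c=0.5):
    """P(SNR > T | O): a hypothetical monotone mapping (a logistic
    curve with made-up parameters a and c) from an observable index O,
    such as harmonicity, to the probability that the local SNR exceeds
    the threshold T. The real mapping is learned, not assumed."""
    return 1.0 / (1.0 + np.exp(-a * (obs - c)))

# Toy observables for three time-frequency regions: strongly harmonic,
# weakly harmonic, and intermediate.
harmonicity = np.array([0.9, 0.2, 0.6])
labels = p_clean_enough(harmonicity)
```

These local "clean enough" probabilities are exactly what a full-combination recogniser needs as weights over its partial recognisers.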