RESPITE: Annual Report 2001: Scientific Highlights: The Multistream Formalism

The Multistream Formalism


"Tandem" Multistream Recognition

This year we have continued to investigate the Tandem Acoustic Modelling approach developed earlier in the project, in which a neural network is used as a first stage of processing before a conventional Gaussian mixture model (GMM) speech recogniser. Our investigations focused on understanding the large improvements shown by this system, by examining a range of variants designed to test specific hypotheses. For instance, we tried replacing the neural network in the first stage with a second GMM to estimate the initial posterior probabilities. The results give us some insight into why the neural net performs better, as the figure below illustrates:

We see that even for a clean speech sample (left column), a GMM trained on the same data as the neural net has considerable trouble, for instance completely missing the vowel of "two". At 5 dB signal-to-noise ratio (the right side), the difference between the neural net and the GMM is very pronounced: the GMM tends to mark noisy segments as background ("sil"), while the neural net is much more precise.
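The two-stage structure described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the single-layer "network" with random weights, the feature and class dimensions, and the function names are all hypothetical. Taking the logarithm of the posteriors and decorrelating them with a principal component transform before passing them to the GMM recogniser is an assumption here, since the report does not describe that step.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def first_stage_posteriors(frames, weights, bias):
    """Stand-in for the first-stage neural net: one posterior vector per frame.
    A real system would use a trained network; these weights are hypothetical."""
    return softmax(frames @ weights + bias)

def tandem_features(posteriors, eps=1e-10):
    """Turn posteriors into features for a GMM recogniser (assumed recipe):
    take logs, remove the mean, then rotate onto the principal axes so the
    feature dimensions are decorrelated."""
    log_post = np.log(posteriors + eps)
    centered = log_post - log_post.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    return centered @ eigvecs[:, order]

# Hypothetical data: 200 frames of 13-dimensional acoustic features, 5 classes.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))
weights = rng.normal(size=(13, 5))
bias = np.zeros(5)

posteriors = first_stage_posteriors(frames, weights, bias)
features = tandem_features(posteriors)  # these would be modelled by the GMM stage
```

The decorrelation step is included because Gaussian mixtures are commonly used with diagonal covariances, which suit uncorrelated feature dimensions; whether the system described here was configured that way is not stated in this report.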

RESPITE is a Winner!

Our Tandem system was one of the participants in a large comparison session at the recent Eurospeech conference in Denmark, in which over 20 papers were presented describing systems for the same noisy digits task. All the foremost speech research labs in the world participated, including IBM, Lucent and Microsoft. Our Tandem system achieved the best overall word error rate, and the largest margin of improvement (50% better than the standard baseline) among systems trained on noisy examples, as shown in the chart below:

Multiband With Contaminated Training Data

Last year we introduced a new noise-robust approach that combines a particular training procedure (data contamination) with a particular architecture (the multi-band paradigm). In this framework, we expected to remove the main drawback of each component: the dependence of the corpus contamination approach on the spectral characteristics of the noise, and the inefficiency of the multi-band architecture in the presence of wideband noise.
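As an illustration of the data contamination idea, the sketch below mixes a noise signal into a clean signal at a chosen signal-to-noise ratio. It assumes the usual power-based definition of SNR; the function names, the synthetic signals and the 5 dB target are hypothetical, and the report does not state which noises or SNR levels were used to contaminate the training data.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Noise power required for the requested SNR (in dB).
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise

def measured_snr_db(speech, noisy):
    """SNR in dB recovered from the clean signal and its contaminated version."""
    residual = noisy - speech
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))

# Hypothetical signals: a 440 Hz tone stands in for speech, white noise for the contaminant.
rng = np.random.default_rng(1)
speech = np.sin(2.0 * np.pi * 440.0 * np.arange(8000) / 8000.0)
noise = rng.normal(size=8000)

noisy = mix_at_snr(speech, noise, snr_db=5.0)
snr_check = measured_snr_db(speech, noisy)  # should match the 5 dB target
```

In a multi-band setting, contaminated data of this kind would be used to train each sub-band classifier separately; that part of the pipeline is not shown here.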

Recently, this method has been tested on the AURORA 2 task and compared to other robust methods:

The matching noise conditions correspond to a reference system in which training was performed under the same noise conditions as the tests (cf. the multi-condition training of the AURORA 2 task).

Two different configurations of our multiband approach have been compared:

The following figures show that, even for the smaller configuration, our approach leads to very good performance on different kinds of additive noise, without any a priori knowledge of the noise characteristics.

Figures 1-4: Comparative results on four kinds of noise (subway, babble, in-car and exhibition hall).

Multistream Combination Based On A Model For The Use Of Context In Human Perception

We studied whether a two-stage model of the human perception of speech and text categories (the first stage using only local sensory data, the second using higher-level data from a wider context) could usefully be adapted as a new variant of HMM/ANN-based ASR. Test performance of the ECPC (Error Correction Posteriors Combination) model was not, however, significantly different from that of the usual FC model.

Multistream ASR Techniques Tested On EEG Recognition

Like speech recognition in noise, thought recognition from EEG data (electric field measurements from the scalp surface) has to contend with a high degree of unexplained signal variation. Initial multiband classifier-decoder experiments, though opening an interesting new field of potential application, showed that direct application of multiband ASR to EEG recognition does not give a significant improvement over a fullband classifier baseline, and that HMMs are not able to model the time structure in raw EEG data. We believe that the limited data streams available were not sufficiently complementary.