|RESPITE: Annual Report 2000: Scientific Highlights: The Multistream Formalism|
Last year a new approach to extracting the speech information from sound was developed. We call this approach "tandem modelling". It combines the neural network classifier used in connectionist speech recognisers with the Gaussian mixture models used in standard ASR, resulting in a system that performs better than either. The basic idea is shown below:
Compared to conventional recognition, connectionist recognition uses a neural network in place of Gaussian models as the classifier mapping acoustic features into a probabilistic description of speech sounds. Connecting the two classifier types in series, as in tandem modelling, is on the face of it a strange thing to do, since the Gaussian model must re-learn the association between the network outputs and the linguistic classes, even though the network has already tried to perform that very mapping. However, differences between the two types of classifier allow each to exploit different aspects of the signal, giving an overall gain when they are used together.
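The tandem idea can be sketched in a few lines. This is an illustrative reconstruction, not the project's actual code: the log compression and PCA decorrelation steps are the standard way network posteriors are conditioned before Gaussian modelling, and the function name and 1e-10 floor are our own choices here.

```python
import numpy as np

def tandem_features(posteriors, pca_matrix):
    """Turn MLP phone posteriors into tandem features for a GMM system.

    posteriors: (n_frames, n_phones) softmax outputs of the network
    pca_matrix: decorrelating transform estimated on training data
    """
    # Log compression makes the highly skewed posterior distribution
    # more Gaussian-like, which suits the downstream mixture models.
    log_post = np.log(posteriors + 1e-10)
    # Mean-removal plus PCA decorrelates the dimensions, so that
    # diagonal-covariance Gaussians are a reasonable fit.
    return (log_post - log_post.mean(axis=0)) @ pca_matrix
```

The Gaussian system then treats these vectors exactly as it would treat cepstral features, which is why no recogniser internals need to change.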
One of the key attractions of the tandem approach is that it has allowed us to use multiple simultaneous feature representations as input to a single recogniser, as shown below:
The system shown above was used on a large-vocabulary, spontaneous-speech task (the NRL's SPINE1). Using both PLP and MSG feature calculations, combining them at the outputs of the networks, and feeding the result as features into CMU's SPHINX recogniser gave a 30% relative improvement in word error rate over conventional Mel-frequency cepstral coefficients in a baseline, context-independent configuration. These results are described in more detail in the paper: Tandem acoustic modeling in large-vocabulary recognition.
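Combining the two feature streams at the network outputs might look like the sketch below. The source does not specify the exact combination rule, so this shows one common choice, averaging in the log-probability domain and renormalising; the function name is hypothetical.

```python
import numpy as np

def combine_streams(post_plp, post_msg):
    """Merge frame-level phone posteriors from two networks
    (e.g. PLP and MSG front ends) in the log domain.

    Both inputs: (n_frames, n_phones) arrays of posteriors.
    """
    # Averaging log-probabilities (a geometric mean) lets either
    # stream veto a phone it finds very unlikely.
    log_mix = 0.5 * (np.log(post_plp + 1e-10) + np.log(post_msg + 1e-10))
    mix = np.exp(log_mix)
    # Renormalise each frame so the merged values are posteriors again.
    return mix / mix.sum(axis=1, keepdims=True)
```

The merged posteriors are then passed through the same log/PCA conditioning as a single-stream tandem system before reaching the Gaussian recogniser.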
We have also applied this approach to the recognition of noisy digits within the system developed by our industrial partner DaimlerChrysler, again showing very considerable improvements.
The bottom table in this figure shows that combining base PLP spectral features with their first and second difference features (covering successively larger time scales) in the full-combination (FC) multistream framework gives significantly better results than stream combination using the HMM baseline, even though both systems use the same Gaussian mixture models and equal stream weights.
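With equal stream weights, the full-combination estimate reduces to a simple average over the posteriors obtained from every subset of the streams. The sketch below assumes those per-subset posteriors have already been computed (obtaining them is the expensive step, as discussed next); the function name is our own.

```python
import numpy as np

def fc_posterior(subset_posteriors):
    """Equal-weight full-combination (FC) estimate.

    subset_posteriors: iterable of phone-posterior vectors, one per
    subset of the feature streams (each estimated from the marginal
    over the streams missing from that subset).
    """
    stacked = np.stack(list(subset_posteriors))
    combined = stacked.mean(axis=0)
    # Renormalise to guard against rounding in the inputs.
    return combined / combined.sum()
```

With d streams there are 2^d subsets to average over, which is why the marginal computation discussed below dominates the cost.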
It is not possible to calculate the marginal output from an MLP expert, so for this purpose the usual MLP expert was replaced with a Gaussian Mixture Expert (GME). In this network the hidden units are Gaussian probability density functions which model the data distribution, while the non-linear output layer uses Bayes' rule to convert these likelihoods into phoneme probabilities. While this HMM/GME system eliminates the need to train multiple experts, two problems remain with this approach. One is that the phoneme-level classification performance of the GME is significantly lower than that of the MLP it replaces. Another is that it is still necessary to calculate 2^d marginal phoneme probabilities for each phoneme (one per subset of the d streams), which can be prohibitively time-consuming.
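The forward pass of such a Gaussian Mixture Expert can be sketched as follows. This is a minimal illustration with diagonal covariances and hand-supplied parameters (in the real system they would be trained); the class and attribute names are hypothetical.

```python
import numpy as np

class GaussianMixtureExpert:
    """Hidden units are Gaussian densities modelling the data; the
    output layer applies Bayes' rule to turn class-conditional
    likelihoods into phoneme posteriors."""

    def __init__(self, means, variances, phone_weights, priors):
        self.means = means                  # (n_gauss, dim)
        self.variances = variances          # (n_gauss, dim), diagonal
        self.phone_weights = phone_weights  # p(gaussian | phone), (n_phones, n_gauss)
        self.priors = priors                # p(phone), (n_phones,)

    def posteriors(self, x):
        # Likelihood of x under each Gaussian hidden unit.
        diff = x[None, :] - self.means
        log_lik = -0.5 * np.sum(diff**2 / self.variances
                                + np.log(2 * np.pi * self.variances), axis=1)
        lik = np.exp(log_lik)
        # p(x | phone) as a mixture over that phone's Gaussians.
        class_lik = self.phone_weights @ lik
        # Bayes' rule: p(phone | x) is proportional to p(x | phone) p(phone).
        joint = class_lik * self.priors
        return joint / joint.sum()
```

Because the hidden units are densities over the input, marginalising out a missing feature stream amounts to dropping the corresponding dimensions from each Gaussian, which is exactly what an MLP's hidden units do not permit.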
One of the most effective techniques for improving the noise robustness of speech recognition systems is to train the system on data corrupted by noise at different SNRs. This approach gives very good performance when the noise used for training is similar to the noise encountered in testing, but fails when the two noises are too different. The method can therefore only be used when we have good a priori knowledge of the spectral characteristics of the noise.
To remove this drawback, we propose to use a multi-band architecture. This is based on the observation that, within a sufficiently narrow frequency band, different noises differ practically only in their energy level, not in their spectral shape. Therefore, if the model associated with each frequency band is trained on data corrupted by any one kind of noise at different SNRs, we can expect, provided the bands are narrow enough, that it will be insensitive to other kinds of noise.
For each frequency band, we develop a system to estimate noise-robust acoustic features. These features are computed from parameters specific to that band (such as the critical-band energies within it). For each band, we train an MLP on data corrupted by white noise at different SNRs (see figure above). These MLPs produce acoustic features according to the non-linear discriminant analysis (NLDA) technique (see figure). The robust acoustic features are then concatenated and passed to the recognition system (a hybrid HMM/MLP in our case).
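The per-band extraction and concatenation steps can be sketched as below. Under the NLDA view, a band's robust features are the hidden-layer activations of its discriminatively trained MLP; the single-layer form, tanh non-linearity, and all names here are simplifying assumptions for illustration.

```python
import numpy as np

def nlda_features(band_energies, w_hidden, b_hidden):
    """One band's noise-robust features: the hidden-layer activations
    of that band's MLP (the NLDA transform)."""
    return np.tanh(band_energies @ w_hidden + b_hidden)

def multiband_features(per_band_energies, per_band_params):
    """Run each band's critical-band energies through its own MLP and
    concatenate the results into one observation vector for the
    hybrid HMM/MLP recogniser."""
    return np.concatenate([nlda_features(e, w, b)
                           for e, (w, b) in zip(per_band_energies,
                                                per_band_params)])
```

Because each band's network only ever sees that band's energies, noise confined to one region of the spectrum leaves the other bands' features untouched.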
This approach gave an average improvement of up to 26% over other noise-robust methods such as J-RASTA or spectral subtraction on a connected-digits task contaminated by six different kinds of noise at various SNRs.