RESPITE:Events : Meeting, Sep 2000:Presentations: Stéphane Dupont

Multiband with contaminated training data

Stéphane Dupont

Presented by Christophe Ris

One of the most effective techniques for improving the noise robustness of speech recognition systems is to train the system on data corrupted by noise at different SNRs. This approach performs very well when the noise used for training is similar to the noise encountered in testing, but fails when the two are too different. The method can therefore only be used when we have good a priori knowledge of the spectral characteristics of the noise.
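Contaminating training data in this way amounts to scaling a noise recording so that the mixture reaches a chosen SNR. The following is a minimal sketch of that step, not the code used for the experiments: the function names, the synthetic sine-wave "speech", and the SNR list (5, 10, 20) are assumptions made for illustration only.

```python
import math
import random

def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # Choose gain g so that 10*log10(P_speech / (g^2 * P_noise)) == snr_db.
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

def contaminate(utterances, noise, snrs_db=(5, 10, 20)):
    """Yield (snr_db, noisy_utterance) for every utterance at every SNR.

    The SNR list is a hypothetical choice, not taken from the experiments.
    """
    for utterance in utterances:
        for snr_db in snrs_db:
            yield snr_db, mix_at_snr(utterance, noise, snr_db)

# Synthetic stand-ins: a sine wave for speech, Gaussian white noise for noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
white = [rng.gauss(0.0, 1.0) for _ in range(8000)]

noisy = mix_at_snr(speech, white, 10)
added = [n - s for n, s in zip(noisy, speech)]
print("measured SNR (dB):", 10 * math.log10(power(speech) / power(added)))
```

Because the gain is derived from the measured powers of the two signals, the same routine works for any noise type; only the noise recording passed in changes.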

To remove this drawback, we propose to use the multi-band architecture. It is based on the observation that, within narrow frequency bands, different noises differ practically only in their energy level, not in their spectral shape. Therefore, if the models associated with each frequency band are trained on data corrupted by any one kind of noise at different SNRs, we can expect them, provided the frequency bands are narrow enough, to be insensitive to other kinds of noise.
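The observation about narrow bands can be illustrated with a toy calculation. The sketch below compares a flat (white) log spectrum with a steadily tilted one, removes the mean level inside each band, and reports the largest remaining shape difference. The 28 channels and the 2 dB per channel tilt are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def split(values, n_bands):
    """Split a list into n_bands contiguous, near-equal chunks."""
    n = len(values)
    edges = [round(i * n / n_bands) for i in range(n_bands + 1)]
    return [values[lo:hi] for lo, hi in zip(edges, edges[1:])]

def shape(log_band):
    """Remove the band's mean level, keeping only its spectral shape (dB)."""
    mean = sum(log_band) / len(log_band)
    return [v - mean for v in log_band]

def shape_mismatch_db(log_spec_a, log_spec_b, n_bands):
    """Largest per-channel shape difference between two noises over all bands."""
    worst = 0.0
    bands_a = split(log_spec_a, n_bands)
    bands_b = split(log_spec_b, n_bands)
    for band_a, band_b in zip(bands_a, bands_b):
        for x, y in zip(shape(band_a), shape(band_b)):
            worst = max(worst, abs(x - y))
    return worst

N_CHANNELS = 28  # hypothetical number of spectral channels
white = [0.0] * N_CHANNELS                       # flat log spectrum (dB)
tilted = [-2.0 * k for k in range(N_CHANNELS)]   # falls by 2 dB per channel

for n_bands in (1, 2, 4, 7):
    print(n_bands, "bands:", shape_mismatch_db(white, tilted, n_bands), "dB")
```

In this toy setting the mismatch falls from 27 dB with a single full band to 3 dB with seven bands: once the level is factored out, the two noises look nearly alike inside each narrow band.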

For each frequency band, we develop a system that estimates noise-robust acoustic features. These features are computed from parameters specific to the frequency band (such as the critical-band energies inside the band). For each band, we train an MLP on data corrupted by white noise at different SNRs. These MLPs produce acoustic features according to the non-linear discriminant analysis (NLDA) technique [1], that is, the output of the last hidden layer. The robust acoustic features are then concatenated and passed to the recognition system (a hybrid HMM/MLP in our case).
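The feature extraction path described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the actual system: the class `BandMLP`, the 28 input energies per frame, and the 5 hidden units per band are hypothetical. The weights are random placeholders, whereas in the real system each network would first be trained on noise-contaminated data.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class BandMLP:
    """Stand-in for one per-band MLP (hypothetical sizes, untrained weights).

    Only the hidden layer is modelled: NLDA features are the hidden-layer
    outputs, so the output layer used during training is omitted here.
    """
    def __init__(self, n_in, n_hidden, rng):
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                        for _ in range(n_hidden)]
        self.biases = [0.0] * n_hidden

    def hidden(self, inputs):
        """Return the hidden-layer activations for one band's parameters."""
        return [sigmoid(sum(w * v for w, v in zip(row, inputs)) + b)
                for row, b in zip(self.weights, self.biases)]

def split_bands(frame, n_bands):
    """Split one frame of spectral energies into contiguous frequency bands."""
    n = len(frame)
    edges = [round(i * n / n_bands) for i in range(n_bands + 1)]
    return [frame[lo:hi] for lo, hi in zip(edges, edges[1:])]

def multiband_features(frame, mlps):
    """Concatenate the hidden-layer outputs of every band's MLP."""
    features = []
    for band, mlp in zip(split_bands(frame, len(mlps)), mlps):
        features.extend(mlp.hidden(band))
    return features

rng = random.Random(0)
N_BANDS, N_ENERGIES, N_HIDDEN = 7, 28, 5   # 28 and 5 are assumptions
mlps = [BandMLP(N_ENERGIES // N_BANDS, N_HIDDEN, rng) for _ in range(N_BANDS)]

frame = [rng.uniform(0.0, 1.0) for _ in range(N_ENERGIES)]  # fake log energies
features = multiband_features(frame, mlps)
print(len(features), "features per frame")
```

The concatenated vector would then be fed to the recognizer in place of the usual acoustic features. Keeping each network restricted to its own band is what confines the effect of a noise to the bands it occupies.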

This approach ('new - 7 bands' in the tables) has been compared to the baseline system (log-RASTA), J-RASTA, spectral subtraction, multi-band J-RASTA (4 bands), and multi-band spectral subtraction (4 bands) on Numbers'95. Results are averaged over six kinds of noise (Gaussian white noise, Noisex helicopter noise, Madras car noise, Daimler inside-car noise, shopping mall, exhibition hall). For each method we used the configuration that led to the best performance (J value, noise level estimation method, number of sub-bands, ...):
 
 

SNR                          5     10     20   average
log-rasta                 35.8   22.3   13.9      24.0
J-rasta                   23.0   15.4   10.0      16.1
Spectral sub.             22.0   15.0   10.6      15.9
J-rasta multiband         23.1   14.1    8.7      15.3
Spectral sub. multiband   23.8   14.9    9.3      16.0
new - 7 bands             16.9   10.9    7.5      11.8

Word error rate on Numbers'95, averaged over six kinds of noise, for the different methods.
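To put the figures in this first table in relative terms, the short calculation below recomputes the per-method averages and the relative word error rate reduction of the new approach. The numbers are copied from the table; the helper names are illustrative.

```python
# Word error rates from the table, at SNR 5, 10 and 20.
wer = {
    "log-rasta": [35.8, 22.3, 13.9],
    "J-rasta": [23.0, 15.4, 10.0],
    "Spectral sub.": [22.0, 15.0, 10.6],
    "J-rasta multiband": [23.1, 14.1, 8.7],
    "Spectral sub. multiband": [23.8, 14.9, 9.3],
    "new - 7 bands": [16.9, 10.9, 7.5],
}

def mean(values):
    return sum(values) / len(values)

def relative_reduction(baseline, system):
    """Relative error reduction of `system` over `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline

average = {name: mean(rates) for name, rates in wer.items()}

for name in wer:
    if name != "new - 7 bands":
        gain = relative_reduction(average[name], average["new - 7 bands"])
        print(f"{name}: {gain:.1f}% relative reduction")
```

On these averages the new approach gives about a 51% relative reduction over the log-RASTA baseline and about 27% over J-RASTA.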

The next table shows the influence of the number of sub-bands on the performance of the new approach. Note that the '1-band' system differs slightly from the baseline system: robust features were extracted from the spectral features using NLDA (system trained on white noise) and then passed to the recognizer.
 
 

SNR                  5     10     20   average
J-rasta           23.0   15.4   10.0      16.1
new - 1 band      21.9   15.1   10.2      15.9
new - 2 bands     18.1   12.4    9.1      13.2
new - 4 bands     16.9   11.9    8.4      12.4
new - 7 bands     16.9   10.9    7.5      11.8

Word error rate on Numbers'95, averaged over six kinds of noise, for different numbers of sub-bands.

This table shows results on Resource Management using different methods. Speech was corrupted by the Noisex helicopter noise. The last row corresponds to a hybrid system trained on this particular noise at different SNRs.
 
 

SNR                          clean     20     12   average
plp                            6.4   71.3   96.2      58.0
plp / log-rasta                7.8   49.5   87.3      48.2
plp / J-rasta                  8.4   28.2   63.0      33.2
cbe - spectral sub.            7.5   27.1   63.8      32.8
new - 7 bands                  6.1   10.9   35.3      17.4
plp / J-rasta contaminated     8.4   14.1   21.0      14.5

Word error rate on Resource Management with Noisex helicopter noise, for the different methods.

References

   [1] V. Fontaine, C. Ris, J-M. Boite, "Nonlinear Discriminant analysis for improved speech recognition", In Proc. EUROSPEECH'97, Rhodes, Greece, 1997.

Jon Barker
Last modified: Mon Sep 18 15:29:19 BST 2000