RESPITE:Events : Meeting, Sep 2000:Presentations: Barker et al.

Soft Decisions in Missing Data Techniques for Robust Automatic Speech Recognition

Jon Barker, Martin Cooke, Ljubomir Josifovski and Phil Green

In previous work [1,2] we developed the theory of the Missing Data approach to robust Automatic Speech Recognition and demonstrated its promise. In this technique, spectro-temporal regions uncontaminated by noise are identified, and CDHMM recognition methods are adapted to make use of this partial information. For a given acoustic energy vector, we compute the joint probability that (1) the reliable components were generated from their marginal distribution and (2) the true values of the unreliable components lie between zero and the value observed in the speech-and-noise mixture.
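
As an illustration of the calculation just described, the following is a minimal sketch rather than the recogniser used in [1,2]. It assumes a single diagonal-covariance Gaussian per state instead of a full CDHMM mixture, and the function names (gaussian_pdf, gaussian_cdf, missing_data_likelihood) are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gaussian_cdf(x, mean, var):
    """Cumulative distribution of a univariate Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def missing_data_likelihood(obs, mask, means, variances):
    """Likelihood of one energy vector given a hard reliability mask.

    obs       -- observed energies (speech plus noise), one per channel
    mask      -- True where a channel is judged reliable
    means     -- per-channel Gaussian means (assumed single-Gaussian state)
    variances -- per-channel Gaussian variances (diagonal covariance)
    """
    likelihood = 1.0
    for x, reliable, mu, var in zip(obs, mask, means, variances):
        if reliable:
            # Term (1): reliable component scored by its marginal density.
            likelihood *= gaussian_pdf(x, mu, var)
        else:
            # Term (2): the true value lies between zero and the observation.
            likelihood *= gaussian_cdf(x, mu, var) - gaussian_cdf(0.0, mu, var)
    return likelihood
```

With a diagonal covariance the marginal over the reliable components is just the product of the per-channel densities, which is why the loop can treat each channel independently.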

In this paper we replace the discrete decision that a time-frequency pixel is either reliable or unreliable with an estimate of the probability that the pixel is reliable. We adapt the probability calculation so that, for each vector component, this estimate provides the weighting factors for term (1) and term (2). The weighting factor is obtained by a sigmoidal transformation of the reliability estimate.
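
One way to read this weighting is sketched below. It is an assumption-laden illustration, not the paper's exact formula: it uses a single diagonal Gaussian per state, weights term (1) by w and term (2) by 1 - w, and introduces hypothetical sigmoid parameters slope and centre that the abstract does not specify.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gaussian_cdf(x, mean, var):
    """Cumulative distribution of a univariate Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def sigmoid(r, slope=1.0, centre=0.0):
    """Map a raw reliability estimate r to a weight in (0, 1).

    slope and centre are hypothetical tuning parameters.
    The two branches avoid overflow in math.exp for large |z|.
    """
    z = slope * (r - centre)
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_missing_data_likelihood(obs, reliability, means, variances,
                                 slope=1.0, centre=0.0):
    """Likelihood of one energy vector using soft reliability weights."""
    likelihood = 1.0
    for x, r, mu, var in zip(obs, reliability, means, variances):
        w = sigmoid(r, slope, centre)
        present = gaussian_pdf(x, mu, var)                                # term (1)
        bounded = gaussian_cdf(x, mu, var) - gaussian_cdf(0.0, mu, var)   # term (2)
        # Assumed combination: convex mix of the two terms per component.
        likelihood *= w * present + (1.0 - w) * bounded
    return likelihood
```

As the weight approaches 1 or 0 the expression reduces to the hard-decision terms (1) and (2) respectively, so the discrete mask is recovered as a limiting case.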

This soft decision approach integrates smoothly with algorithms that derive a continuous-valued estimate of the degree to which a local spectro-temporal region belongs to a group. For example, grouping of frequency channels based on common periodicity no longer requires hard decisions such as the presence or absence of a peak in the channel autocorrelation function at a given lag [3], but can use the value of the channel autocorrelation function itself as a reliability estimate.
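
The periodicity example can be sketched as follows. This is only an illustration under stated assumptions: the autocorrelation is normalised by its zero-lag value and negative values are clipped to zero so that the result lies in [0, 1]. Neither choice is specified in the abstract, and the function names are hypothetical.

```python
import math


def normalised_autocorrelation(signal, lag):
    """Autocorrelation of `signal` at `lag`, divided by the zero-lag energy."""
    energy = sum(s * s for s in signal)
    if energy == 0.0:
        return 0.0
    acf = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
    return acf / energy


def channel_reliabilities(channels, lag):
    """One continuous reliability estimate per frequency channel.

    channels -- list of per-channel sample sequences (e.g. filterbank outputs)
    lag      -- candidate pitch period in samples
    Negative correlations are clipped to zero (an assumption of this sketch).
    """
    return [max(0.0, normalised_autocorrelation(ch, lag)) for ch in channels]


# Two synthetic channels: one with period 20 samples, one with period 13.
periodic = [math.sin(2.0 * math.pi * n / 20.0) for n in range(200)]
other = [math.sin(2.0 * math.pi * n / 13.0) for n in range(200)]
estimates = channel_reliabilities([periodic, other], lag=20)
```

Here the channel whose period matches the lag receives a value close to 1, while the mismatched channel receives a lower value. No threshold on the presence of a peak is needed.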

We also outline three further improvements to previously published missing data work: temporal constraints, word-boundary penalties and improved silence modelling. These improve on the results reported in [2], and the use of soft decisions provides a further gain at low SNRs (from 39.3% to 49.0% at 0 dB). For example, on speech from the TIDIGIT database contaminated by factory noise, we obtain the following word accuracies over a range of SNRs, using models trained on clean speech:
 
 

Word accuracy (%) by SNR (dB)

System                   0      5      10     15     20     clean
Reported in [2]          34.0   59.0   81.0   91.0   94.0   96.0
Improvements as above    39.3   70.1   87.4   96.1   97.6   98.4
Soft Decisions           49.0   74.8   88.1   95.7   96.9   98.2

References

   [1] A. Vizinho, P.D. Green, M.P. Cooke and L. Josifovski (1999), 'Missing data theory, spectral subtraction and signal-to-noise estimation for robust ASR: An integrated study', Proc. Eurospeech 99, pp 2407-2410.
   [2] M.P. Cooke, P.D. Green, L. Josifovski and A. Vizinho, 'Robust automatic speech recognition with missing and unreliable acoustic data', to appear in Speech Communication.
   [3] R. Meddis and M. Hewitt (1992), 'Modelling the identification of concurrent vowels with different fundamental frequencies', JASA, 91:233-245.


Jon Barker
Last modified: Mon Jan 29 17:02:27 GMT 2001