RESPITE:Events : Meeting, Sep 2000:Presentations: Barker et al.

Decoding Speech in the Presence of Other Sound Sources

Jon Barker, Martin Cooke and Dan Ellis

Previous studies in both speech perception and ASR have amply demonstrated that it is possible to recognise speech on the basis of limited spectro-temporal evidence, e.g. [1,2]. In this work, we develop a novel Viterbi-style decoder which efficiently explores all possible combinations of such evidence.
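The core of recognition from limited spectro-temporal evidence is that a state's acoustic likelihood can be computed from the reliable components alone, marginalising the rest, as in the missing-data approach of [1]. A minimal sketch (for a diagonal-covariance Gaussian state model; the function and variable names are illustrative, not the actual decoder's scoring code):

```python
import math

def missing_data_loglik(x, mask, mean, var):
    """Log-likelihood of observation x under a diagonal Gaussian state model,
    using only the components flagged reliable in mask; unreliable dimensions
    are marginalised out (i.e. simply omitted from the product)."""
    ll = 0.0
    for xi, reliable, mu, v in zip(x, mask, mean, var):
        if reliable:
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
    return ll
```

With an all-reliable mask this reduces to the usual full Gaussian log-likelihood; as components are masked out, the score is computed over a lower-dimensional marginal.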
Our recognition process involves two key stages:
  1. First, a "primitive grouping" stage is applied to the spectro-temporal representation to bind together regions ("groups") that are judged to be dominated by a common sound source. This stage could be achieved using Computational Auditory Scene Analysis (CASA) techniques [3,4]: spectro-temporal regions may be judged to be dominated by a common source on the grounds of shared properties such as common amplitude modulation, common onset, or harmonicity.

  2. Second, we employ a modified HMM Viterbi token-passing speech decoder designed to simultaneously estimate the most likely set of groups forming the speech signal and the speech state sequence. The decoder employs Missing Data techniques [1,5] to evaluate P(q|x, mask), and a modified Viterbi algorithm to efficiently consider all masks that can be generated from the set of groups supplied. Briefly, decoder tokens are duplicated whenever a new group commences: one copy of the token is assigned the "this group is speech" interpretation and the other the "this group is background" interpretation. When a group terminates, tokens representing complementary interpretations with respect to the terminating group are compared.
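The token split/merge bookkeeping in step 2 can be sketched as follows. This is a hypothetical illustration, not the authors' decoder: the Token class, group identifiers and scores are invented, and acoustic scoring between events is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    log_prob: float
    labels: dict = field(default_factory=dict)  # group id -> "speech" | "background"

def on_group_start(tokens, group_id):
    """When a new group commences, duplicate every live token: one copy
    carries the 'this group is speech' interpretation, the other the
    'this group is background' interpretation."""
    out = []
    for t in tokens:
        for label in ("speech", "background"):
            out.append(Token(t.log_prob, {**t.labels, group_id: label}))
    return out

def on_group_end(tokens, group_id):
    """When a group terminates, tokens whose interpretations differ only in
    the terminating group's label are compared; the better-scoring one
    survives, and that group's label is dropped from future comparisons."""
    best = {}
    for t in tokens:
        key = tuple(sorted((g, l) for g, l in t.labels.items() if g != group_id))
        if key not in best or t.log_prob > best[key].log_prob:
            best[key] = t
    for t in best.values():
        t.labels.pop(group_id, None)
    return list(best.values())
```

The key property is that the token population grows only while groups overlap in time, so the decoder explores every speech/background labelling of the groups without enumerating them all explicitly.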
In our initial experiments, a "mock" grouping stage has been performed by processing estimated local SNR data. Simple noise-estimation techniques are used to produce spectro-temporal local SNR estimates, which are thresholded to give a binary mask of regions judged to be either signal or noise. Each mask is then divided into four frequency subbands, and connected regions within each band are labelled as separate groups.
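The subband splitting and connected-region labelling can be sketched as below. This is a toy reconstruction under stated assumptions, not the actual experimental code: the mask is assumed precomputed (SNR estimation and thresholding are omitted), and 4-connectivity is assumed for "connected regions".

```python
def group_mask(mask, n_bands=4):
    """Label connected regions of a binary time-frequency mask as groups,
    treating each of n_bands frequency subbands independently, so a region
    spanning a band boundary is split into separate groups.
    mask: list of frequency rows, each a list of 0/1 over time frames.
    Returns a parallel array of labels: 0 = background, >0 = group id."""
    n_freq, n_frames = len(mask), len(mask[0])
    labels = [[0] * n_frames for _ in range(n_freq)]
    next_id = 1
    band_edges = [round(b * n_freq / n_bands) for b in range(n_bands + 1)]
    for lo, hi in zip(band_edges, band_edges[1:]):
        for f in range(lo, hi):
            for t in range(n_frames):
                if mask[f][t] and labels[f][t] == 0:
                    # Flood-fill this 4-connected region, staying inside the band.
                    stack = [(f, t)]
                    labels[f][t] = next_id
                    while stack:
                        i, j = stack.pop()
                        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                            ni, nj = i + di, j + dj
                            if (lo <= ni < hi and 0 <= nj < n_frames
                                    and mask[ni][nj] and labels[ni][nj] == 0):
                                labels[ni][nj] = next_id
                                stack.append((ni, nj))
                    next_id += 1
    return labels
```

Confining the flood fill to each subband is what turns one large time-frequency blob into several smaller groups, giving the decoder finer-grained building blocks to accept or reject.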
Grouping the mask in this manner and employing the modified decoding algorithm described above provides a significant gain in recognition performance at low SNRs compared to using the unprocessed mask and a standard missing data decoder. For example, on speech from the TIDIGITS database contaminated by factory noise we obtain the following word accuracies over a range of SNRs, using models trained on clean speech:
 
 
  Factory noise SNR (dB)      0      5     10     15     20  clean
  Standard mask            39.3   70.1   87.4   96.1   97.6   98.4
  Grouped mask             47.3   78.1   89.1   95.3   97.0   98.1
  A priori mask            88.0   92.0   96.0   97.0   97.6   98.4

The table above also shows results using an "a priori" mask: a mask generated using a priori knowledge of the noise signal to calculate the true local SNRs. We believe that some of the potential shown by the a priori results may be realised by improving the grouping stage of our recognition process. We hope that CASA techniques will provide the decoder with a more robust set of building blocks from which to construct the speech source. In our present work we are experimenting with the decoding of CASA groupings based on simple periodicity cues.

References

   [1] Cooke, M.P., Green, P.D., Josifovski, L. and Vizinho, A., 'Robust automatic speech recognition with missing and unreliable acoustic data', to appear in Speech Communication.
   [2] Warren, R.M., Riener, K.R., Bashford, J.A. and Brubaker, B.S. (1995), 'Spectral redundancy: Intelligibility of sentences heard through narrow spectral slits', Perception and Psychophysics, 57(2), 175-182.
   [3] Brown, G.J. and Cooke, M.P. (1994), 'Computational auditory scene analysis', Computer Speech and Language, 8, 297-336.
   [4] Ellis, D.P.W. (1996), 'Prediction-Driven Computational Auditory Scene Analysis', Ph.D. thesis, MIT.
   [5] Vizinho, A., Green, P.D., Cooke, M.P. and Josifovski, L. (1999), 'Missing data theory, spectral subtraction and signal-to-noise estimation for robust ASR: An integrated study', Proc. Eurospeech '99, pp. 2407-2410.


Jon Barker
Last modified: Mon Sep 18 15:25:44 BST 2000