Decoding Speech in the Presence of Other Sound Sources
Jon Barker, Martin Cooke and Dan Ellis
Previous studies in both speech perception and ASR have amply demonstrated
that it is possible to recognise speech on the basis of limited spectro-temporal
evidence, e.g. [1,2]. In this work, we develop a novel Viterbi-style decoder
which efficiently explores all possible combinations of such evidence.
Our recognition process involves two key stages:
First, a "primitive grouping" stage is applied to the spectro-temporal
representation to bind together regions ("groups") that are judged to be
dominated by a common sound source. This stage could be achieved using
Computational Auditory Scene Analysis (CASA) techniques, e.g. [3,4]: spectro-temporal
regions may be judged to be dominated by a common source on the grounds of
similarity cues such as common amplitude modulation, common onset, or harmonicity.

Second, we employ a modified HMM Viterbi token-passing speech decoder designed
to estimate simultaneously the most likely set of groups forming the speech
signal and the corresponding speech state sequence. The decoder employs Missing
Data techniques [1,5] to evaluate P(q|x, mask), and a modified Viterbi algorithm
to consider efficiently all masks that can be generated from the supplied set
of groups. Briefly, decoder tokens are duplicated whenever a new group
commences: one copy of the token is assigned the "this group is speech"
interpretation, and the other is assigned the "this group is background"
interpretation. When a group terminates, tokens representing complementary
interpretations with respect to the terminating group are compared, and only
the more likely interpretation is retained.

In our initial experiments, a "mock" grouping stage has been performed by
processing estimated local SNR data: simple noise estimation techniques
produce spectro-temporal local SNR estimates, which are thresholded to give a
binary mask of regions judged to be either signal or noise. The mask is then
cut into four frequency subbands, and connected regions within each band are
labeled as separate groups.
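As an illustration, the mock grouping stage might be sketched as below. This is our own toy reconstruction, not the authors' implementation: the function names, the choice of 4-connectivity within each subband, and the threshold default are all assumptions.

```python
def snr_to_mask(snr, threshold=0.0):
    """Mark each spectro-temporal cell as speech-dominated (True) or not."""
    return [[cell > threshold for cell in frame] for frame in snr]

def label_groups(mask, n_subbands=4):
    """Label connected regions within each frequency subband.

    `mask` is indexed [time][channel]; returns a same-shaped grid of
    group ids (0 = background) and the number of groups found.
    """
    n_frames, n_chans = len(mask), len(mask[0])
    band_size = n_chans // n_subbands
    labels = [[0] * n_chans for _ in range(n_frames)]
    next_id = 0
    for t in range(n_frames):
        for f in range(n_chans):
            if mask[t][f] and labels[t][f] == 0:
                # Flood-fill a new group, staying inside f's subband.
                band = min(f // band_size, n_subbands - 1)
                lo = band * band_size
                hi = n_chans if band == n_subbands - 1 else lo + band_size
                next_id += 1
                stack = [(t, f)]
                while stack:
                    ti, fi = stack.pop()
                    if not (0 <= ti < n_frames and lo <= fi < hi):
                        continue
                    if not mask[ti][fi] or labels[ti][fi] != 0:
                        continue
                    labels[ti][fi] = next_id
                    stack.extend([(ti + 1, fi), (ti - 1, fi),
                                  (ti, fi + 1), (ti, fi - 1)])
    return labels, next_id
```

Because connectivity is confined to each subband, a broadband event is deliberately split into several groups, leaving the decoder free to accept some subbands as speech and reject others.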
Grouping the mask in this manner and employing the modified decoding
algorithm described above is shown to provide a significant gain in recognition
performance at low SNRs compared to using the unprocessed mask and a standard
missing data decoder. For example, on speech from the TIDIGITS database
contaminated by factory noise we obtain the following word accuracies over
a range of SNRs, using models trained on clean speech:
[Table: word accuracies vs. factory noise SNR]
The table above also shows results using an "a priori mask" - this is
a mask that has been generated using a priori knowledge of the noise signal
to calculate true local SNRs. We believe that some of the potential shown
by the a priori results may be realised by improving the grouping stage
of our recognition process. We hope that by using CASA techniques we will
be able to provide the decoder with a more robust set of building blocks
from which to construct the speech source. In our present work we are experimenting
with the decoding of CASA groupings based on simple periodicity cues.
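The token-duplication scheme described above can be sketched in miniature as follows. This is our own simplified illustration, not the authors' decoder: the scores are stand-in log-likelihoods, and the real decoder would additionally require tokens to share the same HMM state before merging them.

```python
from dataclasses import dataclass

@dataclass
class Token:
    speech: frozenset = frozenset()   # groups assigned to the speech source
    score: float = 0.0                # stand-in for the Viterbi log-likelihood

def on_group_start(tokens, group):
    """Duplicate every token: one copy treats `group` as speech, one as background."""
    out = []
    for tok in tokens:
        out.append(Token(tok.speech | {group}, tok.score))  # "group is speech"
        out.append(Token(tok.speech, tok.score))            # "group is background"
    return out

def on_group_end(tokens, group):
    """Compare tokens whose interpretations differ only in the finished group,
    keeping the better-scoring one: once a group has ended, its label can no
    longer affect future likelihoods, so the pair can be safely merged."""
    best = {}
    for tok in tokens:
        key = tok.speech - {group}
        if key not in best or tok.score > best[key].score:
            best[key] = tok
    return list(best.values())
```

Duplication alone would give 2^G tokens for G groups; merging at each group offset is what keeps the search over all mask combinations tractable.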
[1] Cooke, M.P., Green, P.D., Josifovski, L. and Vizinho, A., 'Robust
automatic speech recognition with missing and unreliable acoustic data',
to appear in Speech Communication.
[2] Warren, R.M., Riener, K.R., Bashford, J.A. and Brubaker, B.S. (1995),
'Spectral redundancy: Intelligibility of sentences heard through narrow
spectral slits', Perception and Psychophysics, 57(2), 175-182.
[3] Brown, G.J. and Cooke, M.P. (1994), 'Computational auditory scene
analysis', Computer Speech & Language, 8, 297-336.
[4] Ellis, D.P.W. (1996), 'Prediction-Driven Computational Auditory Scene
Analysis', Ph.D. Thesis, MIT.
[5] Vizinho, A., Green, P.D., Cooke, M.P. and Josifovski, L. (1999),
'Missing data theory, spectral subtraction and signal-to-noise estimation
for robust ASR: An integrated study', Proc. Eurospeech 99.
Last modified: Mon Sep 18 15:25:44 BST 2000