Informing multisource decoding for speech in the presence of other sound sources

Investigator: Ning Ma
Supervisor: Phil Green

This research project is sponsored by EPSRC.

The project aims to generalise Automatic Speech Recognition (ASR) decoding algorithms to natural listening conditions, where the speech to be recognised is one of many sound sources that change unpredictably in space and time [1]. It is well understood that humans combine source-driven bottom-up processes with model-driven top-down processes when perceiving complex acoustic signals such as speech [3]. The primitive bottom-up processes group together acoustic evidence likely to arise from a single source, without deciding the identity of that source. The segregation decision is delayed until decoding, where the top-down processes are applied.
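The delayed-decision idea above can be sketched as a search over segregation hypotheses: every speech/background labelling of the bottom-up fragments is a candidate, and the top-down decoder picks the best-scoring one. This is a minimal illustration only; the fragment names and the toy scoring function are hypothetical stand-ins for real spectro-temporal fragments and HMM likelihoods.

```python
from itertools import product

# Hypothetical fragments produced by bottom-up grouping. In practice each
# fragment is a set of time-frequency regions dominated by one source.
fragments = ["frag_a", "frag_b", "frag_c"]

def acoustic_match(speech_set):
    """Placeholder top-down score: how well the chosen speech fragments
    fit the speech models (e.g. an HMM likelihood). Toy values only."""
    toy = {frozenset({"frag_a", "frag_c"}): 0.9}
    return toy.get(frozenset(speech_set), 0.1)

# Delay the segregation decision: enumerate every speech/background
# labelling and let the decoder select the best-scoring hypothesis.
best = max(
    (set(f for f, is_speech in zip(fragments, labels) if is_speech)
     for labels in product([False, True], repeat=len(fragments))),
    key=acoustic_match,
)
print(sorted(best))  # -> ['frag_a', 'frag_c']
```

Exhaustive enumeration is exponential in the number of fragments; the actual decoder organises this search efficiently, but the hypothesis space is the same.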

There is evidence that pure top-down information in HMMs is insufficient to select the appropriate speech fragments [2]. By analogy with the language-model/acoustic-model distinction in ASR, it is straightforward to introduce probabilities representing the degree to which a fragment belongs to the speech group, which we call the 'speechiness' of a fragment. In this way, constraints that persist over a longer term (e.g. location, pitch, speaker identity) can inform the multisource decoding.
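The language-model analogy can be made concrete: each segregation hypothesis gets a log prior, built from per-fragment speechiness probabilities, added to the acoustic log score. The sketch below uses invented fragment names, toy probabilities, and a placeholder acoustic score purely for illustration.

```python
import math
from itertools import product

# Hypothetical per-fragment speechiness probabilities, e.g. derived from
# longer-term cues such as pitch continuity, location or speaker identity.
speechiness = {"frag_a": 0.8, "frag_b": 0.2, "frag_c": 0.7}

def acoustic_log_score(speech_set):
    """Placeholder acoustic-model log score for a hypothesis (toy values)."""
    toy = {frozenset({"frag_a"}): -1.0,
           frozenset({"frag_a", "frag_c"}): -1.2}
    return toy.get(frozenset(speech_set), -5.0)

def combined_log_score(speech_set):
    # Analogy with language-model weighting: add a log prior that each
    # fragment's speech/background label agrees with its speechiness.
    prior = sum(
        math.log(p if f in speech_set else 1.0 - p)
        for f, p in speechiness.items()
    )
    return acoustic_log_score(speech_set) + prior

fragments = list(speechiness)
best = max(
    (frozenset(f for f, s in zip(fragments, labels) if s)
     for labels in product([False, True], repeat=len(fragments))),
    key=combined_log_score,
)
print(sorted(best))  # -> ['frag_a', 'frag_c']
```

Here the speechiness prior pulls the decoder towards the hypothesis {frag_a, frag_c} even though {frag_a} alone has a slightly better acoustic score, which is exactly how longer-term constraints would inform the decoding.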

[1] J. Barker, M. P. Cooke and D. P. W. Ellis (in press), 'Decoding speech in the presence of other sources', accepted for Speech Communication.
[2] J. Barker, M. Cooke and D. Ellis (2002), 'Temporal integration as a consequence of multi-source decoding', ISCA Workshop on Temporal Integration in the Perception of Speech (TIPS), Aix-en-Provence, 8-10 April 2002.
[3] A. S. Bregman (1990), Auditory Scene Analysis, MIT Press.