Overview: Multisource decoding for speech in the presence of other sound sources

Phil Green, Martin Cooke, Guy Brown and Jon Barker, SpandH, Dept of Computer Science, University of Sheffield

EPSRC-funded project: Jan 2002-Dec 2004


The aim is to generalise Automatic Speech Recognition decoding algorithms for natural listening conditions, where the speech to be recognised is one of many sound sources which change unpredictably in space and time. The multisource decoder will be informed by processes which group together acoustic evidence likely to arise from a single source, without deciding the identity of the source. It will decode with respect to stochastic models of known sources (typically, though not limited to, speech units). The decoder will select, and identify, the acoustic evidence which best matches model sequences. In addition to its role in robust ASR, the multisource decoder will provide a framework for investigating the integration of bottom-up and top-down processes in sound understanding. For example, the decoder will function even in the absence of initial grouping, and could be used to study the effect of additional grouping constraints. Likewise, it will be possible to add models for sources other than speech if they are available.


The proposal builds on two strands of research developed, in parallel, at Sheffield since 1990. One strand is missing data theory. As adapted for ASR by the proposers (Green et al., 1995; Cooke et al. 1997; Josifovski et al. 1999; Cooke et al., in press) and others (Lippmann, 1997a; Morris et al. 1998), this theory describes how both partial evidence and counter-evidence for sound sources can be handled within the conventional probabilistic formalism for ASR. Recent results using the missing data approach on the AURORA task (Pearce & Hirsch, 2000) show a level of performance which compares favourably with any reported elsewhere (Barker et al., 2000a) and yet has fewer constraints. In this work simple noise estimation techniques are used to identify speech evidence. There is no requirement to re-train models for different noise conditions.

The second research strand, auditory scene analysis (Bregman, 1990; Darwin & Carlyon, 1995; Cooke & Ellis, in press) and CASA, its computational counterpart (Cooke, 1991; Brown and Cooke, 1994), describes how arbitrary sound source combinations are segregated and regrouped. This proposal extends the recent work of Brown and Dr Deliang Wang of Ohio State University (Wang & Brown, 1999) which takes a parallel and distributed approach to the CASA problem. Groups of acoustic components are coded by populations of neurons which have an oscillatory firing pattern. In this scheme, the acoustic features of a single sound source are represented by a population of synchronized oscillators; in turn, this population is desynchronised from oscillator populations that represent other sound sources.

We now intend to use CASA techniques to group acoustic evidence for missing data recognition, because in unpredictable listening conditions simple noise estimates are insufficient. Indeed, this was the original motivation for the missing data work. Recent experiments have demonstrated a significant recognition gain from combining CASA and noise estimates (see next section).

The multisource decoder brings these research strands together: it will use CASA and noise estimation to make initial, probabilistic allocations of spectro-temporal regions to source-specific groups. Its operation will depend on missing data techniques: when some subset of groups is hypothesised to be part of the speech source, the remainder are considered to be missing. A prototype decoder (Barker et al., 2000b) has confirmed both feasibility and performance gains, although more theoretical and implementation work is required to fully realise the benefits.

State of the Art

Modern speech recognition devices are capable of producing good results for carefully-spoken material captured in a quiet room with a close-talking microphone. However, their performance deteriorates rapidly in more natural speaking and listening conditions. Lippmann (1997b) presents evidence from a wide range of speech recognition tasks which indicates that human error rates are an order of magnitude smaller than those obtained by ASR algorithms for clean speech, and two orders of magnitude smaller for typical noise conditions. The problem of robustness is the largest factor limiting the take-up of ASR and not surprisingly it is the subject of much contemporary research (see Gong 1995, Furui 1997, for reviews).

Conventional approaches to robust ASR seek to reduce the mismatch between training and test conditions. This can be done in several ways, alone or in combination. One can look for signal processing which is less sensitive to noise (e.g. RASTA, Hermansky and Morgan, 1994), one can `clean up' the noisy data (e.g. spectral subtraction, Lockwood & Boudy, 1992), or one can adapt the recogniser's statistics to more closely resemble those of the noisy speech (e.g. PMC, Gales and Young, 1992; 1993). The last two methods make use of statistical models of the noise, and are therefore only applicable when such models can be estimated.

In contrast, the methodology adopted in this proposal is based on the premise that there will be some reliable spectro-temporal regions which are dominated by the speech source and are essentially uncorrupted, while the remaining regions are dominated by other sources. This observation of source or target dominance simply follows from the compression (cube root or log) typically applied to energy estimates. Previously, we have shown how probability estimation supporting conventional continuous density HMMs can be adapted to handle the missing data case (Green et al., 1995, Cooke et al., 1997, Morris et al. 1998).
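As an illustration of the missing data idea, the sketch below scores one spectral frame against a single diagonal-covariance Gaussian using only the channels marked reliable. It assumes plain marginalisation, in which unreliable channels are integrated out and so contribute nothing to the log-likelihood; the use of counter-evidence mentioned earlier is not modelled. The function names and the toy values are hypothetical and are not taken from the cited systems.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def missing_data_loglik(obs, mask, means, variances):
    """Log-likelihood of one frame under a diagonal Gaussian.

    Only channels whose mask entry is True (reliable) are scored.
    With a diagonal covariance, integrating an unreliable channel
    over its whole range gives 1, so it is simply left out of the sum.
    """
    return sum(
        log_gauss(x, m, v)
        for x, reliable, m, v in zip(obs, mask, means, variances)
        if reliable
    )

# Toy frame: channel 1 is swamped by noise and far from the model mean.
frame = [0.0, 10.0]
means = [0.0, 0.0]
variances = [1.0, 1.0]
score_all = missing_data_loglik(frame, [True, True], means, variances)
score_reliable = missing_data_loglik(frame, [True, False], means, variances)
```

In this toy case the score computed from the reliable channel alone is much higher than the score that trusts both channels, which is the behaviour the missing data approach relies on.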

Reliable regions must first be identified, in order to define a `missing data mask'. Prior knowledge of the true local Signal to Noise Ratio (from the clean speech and noise before mixing) generates an `a priori' mask (see Figure 1, middle panel). Recognition rates with these oracles approach human levels of robustness, providing a compelling proof of concept. Figure 1 (right panel) shows a mask obtained from the speech/noise mixture by using simple noise statistics to obtain a local SNR estimate. Recently, we have obtained improvements with `fuzzy' masks which express a degree of belief that each time-frequency point is associated with the target source (Barker et al., 2000a). Figure 2 summarises results on a connected digit recognition task, as of January 2001.
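The mask definitions above can be sketched as follows, assuming the clean speech and noise energies are available separately for every time-frequency point (the `a priori' case). The 0 dB threshold, the sigmoid used for the fuzzy mask, and its slope are illustrative assumptions rather than the values used in the reported experiments.

```python
import math

def local_snr_db(speech_energy, noise_energy, floor=1e-12):
    """Local SNR in dB for one time-frequency point (energies floored to avoid log(0))."""
    return 10.0 * math.log10(max(speech_energy, floor) / max(noise_energy, floor))

def a_priori_mask(speech, noise, threshold_db=0.0):
    """Binary mask: True where the local SNR exceeds the threshold.

    speech and noise are lists of frames, each frame a list of channel energies.
    """
    return [
        [local_snr_db(s, n) > threshold_db for s, n in zip(s_frame, n_frame)]
        for s_frame, n_frame in zip(speech, noise)
    ]

def fuzzy_mask(speech, noise, threshold_db=0.0, slope=0.5):
    """Soft mask in (0, 1): a sigmoid of the local SNR (assumed form)."""
    return [
        [
            1.0 / (1.0 + math.exp(-slope * (local_snr_db(s, n) - threshold_db)))
            for s, n in zip(s_frame, n_frame)
        ]
        for s_frame, n_frame in zip(speech, noise)
    ]

# One frame, two channels: speech dominates the first, noise the second.
speech = [[4.0, 1.0]]
noise = [[1.0, 4.0]]
hard = a_priori_mask(speech, noise)
soft = fuzzy_mask(speech, noise)
```

A mask estimated from the mixture alone would replace the true noise energies with a noise estimate; the thresholding step would stay the same.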

By identifying reliable regions on the basis of noise estimates we can only deal with those parts of the noise background which are stationary or slowly-changing. Thus, in Figure 2, results for factory noise are inferior to those for helicopter noise because the former contains unpredictable components, such as hammer blows. In less restricted (and more natural) listening conditions, what is required is the ability to separate evidence from different sound sources: Auditory Scene Analysis (ASA).

Listeners are remarkably adept at ASA (Bregman 1990). The proposers and others have developed computational models for low-level ASA which require no noise estimates, using constraints such as harmonicity, component synchrony and source location (see Cooke & Ellis, in press, for a review). Recent studies (so far unpublished but plotted in Figure 2) show further improvements with masks obtained by combining SNR estimates with harmonicity constraints derived from correlograms.

Work package WP1 in this proposal is concerned with neural oscillators for Computational Auditory Scene Analysis. Although current systems for CASA represent a useful advance in sound separation technology, they suffer from two major drawbacks: their performance is far below that of human listeners, and they tend to be computationally expensive. Wang and Brown (1999) have argued that both of these limitations can be addressed by giving CASA a firmer neurobiological foundation. The human auditory system is a sound separator par excellence; also, neural systems are parallel and distributed, and a model based on such an approach will be better suited to real-time operation.

Accordingly, Wang and Brown have developed the neural oscillator approach to CASA. Auditory scene analysis may be conceptualized as a two-stage process. In the first stage (segmentation) the acoustic mixture is decomposed into a collection of sensory elements. In the second stage (grouping), elements that are likely to have arisen from the same environmental event are combined into a perceptual structure termed a stream (an auditory stream roughly corresponds to an object in vision). The neural oscillator approach to CASA mirrors this two-stage processing with a two-layer architecture (Figure 3a). In each layer, oscillators are arranged in a time-frequency grid. In the segmentation layer, lateral connections between oscillators extract `segments' - harmonics and formants that evolve over time and frequency. In the grouping layer, segments are grouped according to common fundamental frequency (F0). The latter is derived from the correlogram, a computational model of auditory pitch analysis. The result of the separation process is encoded in the grouping layer by synchronized populations of neural oscillators. For example, Figure 3 shows two `snapshots' of activity in the grouping layer following separation of a mixture of speech and a trill telephone. At one instant (b) the active group of oscillators corresponds to the speech source. After a short time, these oscillators become inactive and a different group of oscillators, representing the telephone source, becomes active (c). This successive `pop-out' of streams continues in a periodic fashion. The model has been systematically evaluated using a corpus of voiced speech mixed with interfering sounds, and produces improvements in terms of signal-to-noise ratio for every mixture.

Where some sources in the auditory scene are predictable, they can be handled by statistical models, whose estimates should be combined with grouping cues from CASA for unpredictable sources. This is the topic of work package WP4. A prototype scheme for combining SNR masks and harmonicity masks was used to obtain the improved `fuzzy harmonicity' results in Figure 2.

Primitive grouping, as implemented by the oscillator network, is not the whole ASA story. The auditory system also makes use of expectations about the way sound sources evolve. Bregman (1990) calls this `schema-driven' grouping, and relevant computational work is reported by Ellis (1997) and Okuno et al. (1999). Our approach to modelling schema-driven grouping and combining it with primitive grouping is `multisource decoding'. Initially, primitive grouping identifies time-frequency fragments which are associated with a single source, but does not decide the identity of the source. The task of the multisource decoder is to identify that subset of fragments which should be allocated to the speech source by finding the best match over all subsets to some sequence of speech models. It performs higher-level grouping by virtue of recognition, and recognition by virtue of grouping. The multisource decoder relies on missing data recognition: a fragment considered to be part of the speech source is treated as reliable data and the remainder as unreliable. Perhaps the closest precursor to multisource decoding in the literature is `HMM Decomposition' (Varga & Moore 1990), but there the assumption is that models are available for all the sources: there is no concept of missing data.

It is possible to generalise conventional Viterbi decoding to handle the multisource case using the following algorithm. Working forward in time, the onset of a new fragment causes all existing decodings to split into pairs, corresponding to the hypotheses that the new fragment is or is not a continuation of an existing source decoding. When a fragment terminates, hypotheses which differ only in their interpretation of this fragment are in competition and only one survives. A prototype version of the multisource decoder was reported in Barker et al. (2000b), and achieves performance gains over standard missing data recognition (20% relative improvement in word error rate for factory noise at 0 dB) with naive primitive grouping based only on local SNR and an artificial frequency banding.
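The split-and-merge bookkeeping described above can be sketched as follows. This is a deliberately reduced illustration, not the prototype decoder: it omits the HMM state dimension (a full decoder would carry a Viterbi token per model state inside each hypothesis) and assumes a caller-supplied `frame_score(t, active)` that returns a log score for frame `t` given the set of currently active fragments labelled as speech. All names, the fragment timings, and the toy scores are hypothetical.

```python
def multisource_decode(fragments, n_frames, frame_score):
    """Choose the speech/background labelling of fragments with the best total score.

    fragments: dict mapping a fragment name to (onset, offset); the fragment is
        active for frames onset .. offset-1, and onset < offset is assumed.
    frame_score(t, active): log score of frame t, where `active` is the frozenset
        of currently active fragments that the hypothesis labels as speech.
    Returns (best_score, frozenset of all fragments labelled as speech).
    """
    # Key: speech-labelled fragments that are still active.
    # Value: (accumulated score, every fragment labelled speech so far).
    hyps = {frozenset(): (0.0, frozenset())}
    for t in range(n_frames):
        # Fragment offsets: hypotheses that differ only in their labelling
        # of an ended fragment now compete, and only the best survives.
        ended = {f for f, (_, off) in fragments.items() if off == t}
        if ended:
            merged = {}
            for active, (score, chosen) in hyps.items():
                key = active - ended
                if key not in merged or score > merged[key][0]:
                    merged[key] = (score, chosen)
            hyps = merged
        # Fragment onsets: every hypothesis splits into a pair.
        for f, (on, _) in sorted(fragments.items()):
            if on == t:
                split = {}
                for active, (score, chosen) in hyps.items():
                    split[active] = (score, chosen)                # f is background
                    split[active | {f}] = (score, chosen | {f})    # f is speech
                hyps = split
        # Accumulate the score of frame t under each hypothesis.
        hyps = {a: (s + frame_score(t, a), c) for a, (s, c) in hyps.items()}
    return max(hyps.values(), key=lambda item: item[0])

# Toy example: each speech-labelled fragment adds a fixed per-frame gain.
gains = {"A": 1.0, "B": -0.5, "C": 2.0}
fragments = {"A": (0, 3), "B": (1, 4), "C": (3, 5)}
best_score, speech = multisource_decode(
    fragments, 5, lambda t, active: sum(gains[f] for f in active)
)
```

Because the frame score in this sketch depends only on the labels of fragments that are still active, merging at each offset loses nothing relative to an exhaustive search over all fragment subsets, while the number of live hypotheses is bounded by the number of simultaneously active fragments rather than the total number of fragments.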

The prototype decoder achieves these performance gains despite its rather superficial approaches to some deeper issues. Most important of these is the generalised treatment of the mask itself afforded by the decoder. Rather than viewing decoding as finding the most likely interpretation given a series of observations, it can be viewed as finding the most likely interpretation given the observations and the mask. The fragments making up the mask are not uniformly likely, although this is the way they are treated in the prototype. For instance, a mask fragment generated by a given hypothesis about the fundamental frequency is more likely to be followed by one with a similar hypothesis (an implementation of sequential good continuation). WP2 describes our plans for this and other issues in the development of multisource decoding theory. WP3 deals with implementation concerns for the multisource decoder. WP5 formalises the progressive evaluation of the work.