Multisource Decoding
Although the current neural oscillator model will be used in the first instance to generate coherent source fragments required by the multisource decoder, three developments will be explored:
Conventional spoken language decoding, in which one seeks the most likely symbol sequence given a sequence of observations, is generalised in multisource decoding by the introduction of a mask of coherent time-frequency regions. The decoding is now additionally conditioned on the subset of regions chosen as present data. This opens up opportunities to employ a new layer of probabilistic constraints during decoding. For instance, certain combinations of regions are, a priori, more likely than others, due to such things as sequential continuity or harmonicity constraints (one might consider the mask regions as analogous to words in a language model). In our work on the prototype decoder, we have been aware of these possibilities but have yet to realise them. The purpose of this work package is to develop the requisite extensions to the (probabilistic) decoding formalism to incorporate the mask term in a general way.
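The joint score described above can be sketched in miniature. The following is a purely illustrative toy (the Gaussian data model, the region indexing, and the contiguity prior are all assumptions, not the project's actual formulation): the decoder scores a hypothesis jointly over the data selected as present and a prior over masks, here one that penalises masks lacking sequential continuity.

```python
import math

def masked_log_likelihood(observations, mask, model_log_pdf):
    """Log-likelihood of only the time-frequency regions selected as present."""
    return sum(model_log_pdf(observations[r]) for r in mask)

def multisource_score(observations, mask, model_log_pdf, mask_log_prior):
    """Joint score: data term conditioned on the mask, plus a mask prior term."""
    return masked_log_likelihood(observations, mask, model_log_pdf) + mask_log_prior(mask)

# Trivial stand-in data model: a unit-variance Gaussian around zero.
def log_gauss(x, mu=0.0, var=1.0):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

# Illustrative prior: penalise masks whose regions are not temporally contiguous.
def contiguity_prior(mask):
    m = sorted(mask)
    gaps = sum(1 for a, b in zip(m, m[1:]) if b - a > 1)
    return -2.0 * gaps

obs = {0: 0.1, 1: 0.2, 2: 3.0}  # region index -> feature value
best = max([(0, 1), (0, 2), (0, 1, 2)],
           key=lambda m: multisource_score(obs, m, log_gauss, contiguity_prior))
```

Here the mask prior plays the role the text assigns to it: region (2) fits the model poorly and selecting it alone also incurs a continuity penalty, so the decoder prefers the contiguous, well-fitting mask (0, 1).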
As part of this work package, a metric for assessment of mask accuracy will be developed. We are aware, for example, that false positives (i.e. classifying a feature as present when it is missing) should be penalised more than false negatives (i.e. classifying a feature as missing when it is present), and seek a principled mechanism for incorporating this information in the decoding process.
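A minimal sketch of such an asymmetric metric follows; the cost values are illustrative placeholders, not figures from the proposal, and a principled scheme would derive them rather than fix them by hand.

```python
def mask_score(estimated, reference, fp_cost=2.0, fn_cost=1.0):
    """Asymmetric mask accuracy (higher is better).

    False positives (a feature classified as present when it is actually
    missing) are costed more heavily than false negatives (a present
    feature classified as missing).  Costs are illustrative assumptions.
    """
    fp = sum(1 for e, r in zip(estimated, reference) if e and not r)
    fn = sum(1 for e, r in zip(estimated, reference) if r and not e)
    return -(fp_cost * fp + fn_cost * fn)
```

Under this metric a mask with one spurious inclusion scores worse than a mask with one omission, matching the asymmetry described above.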
This work package is intended to support decoder developments related to WP2 and to further explore a partially unsolved problem in the current decoder implementation related to hypothesis merging. The problem arises from differences in the way that paired alternative decodings are evaluated (one branch uses probabilities, the other likelihoods, as described in Barker et al. 2000b). Some progress has been made using ad hoc methods to ameliorate these differences, but a more principled solution is sought.
This work package will also involve considerable software development. For example, the present software assumes that the set of fragments is mutually exclusive (i.e. no fragments overlap): with multiple evidence streams this may not be the case.
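Detecting such overlaps is a small but necessary piece of that development. As a sketch (data structures are assumed: each fragment is represented as a set of time-frequency points), the following finds every pair of fragments that violate the mutual-exclusivity assumption:

```python
def overlapping_fragments(fragments):
    """Return pairs of fragment indices that share a time-frequency point.

    The present software assumes this never happens; with multiple
    evidence streams it may, so such clashes must first be detected
    (and then, e.g., split or merged by a policy to be decided).
    """
    seen = {}        # time-frequency point -> index of first fragment owning it
    clashes = set()
    for i, frag in enumerate(fragments):
        for point in frag:
            if point in seen and seen[point] != i:
                clashes.add((seen[point], i))
            else:
                seen[point] = i
    return sorted(clashes)
```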
Missing data masks can be generated by a number of different processes, including those to be developed in WP1 together with approaches based on SNR estimation. These masks can then be combined so that the combination produces better results than either mask individually. Recent work has applied this technique to the combination of masks derived from SNR estimates and harmonic energy extracted from a correlogram representation. This work package will investigate several approaches to cue and mask combination:
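One elementary combination, offered only as a sketch (the soft-OR rule and the threshold are assumptions, not the combination actually used in the cited work), lets a strong harmonicity cue rescue points whose local SNR estimate is borderline:

```python
import numpy as np

def combine_masks(snr_mask, harmonic_mask, threshold=0.5):
    """Combine two soft masks with values in [0, 1] and binarise.

    Uses a probabilistic OR: a point survives if either cue supports
    it strongly enough.  Rule and threshold are illustrative.
    """
    soft = 1.0 - (1.0 - snr_mask) * (1.0 - harmonic_mask)
    return soft >= threshold
```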
Work packages 1, 3 and 4 will be evaluated using monaural and binaural versions of the AURORA task (and any subsequent developments thereof), which involves the recognition of digit sequences in a variety of noises and across a range of noise levels (-5 dB to 20 dB SNR). AURORA (Pearce & Hirsch 2000) is currently the principal international task in robust ASR. Consequently, it provides a rich set of external baselines in addition to an internal gauge for comparison of the alternatives proposed in the work programme here.
Exploration of the possibilities offered by the missing data and multisource paradigm provides a rich vein for doctoral studies, and we have obtained two EPSRC studentships. Student projects will enhance the work programme, but the success of the project will not depend on student progress. A selection of possible themes for doctoral studies is given below.
There are two basic strategies for the integration of top-down influences in the oscillator/multisource decoder architecture:
Neural oscillator models offer an interesting framework for investigating the interaction of top-down and bottom-up grouping cues. We envisage a different approach employing a three-layer architecture. Predictions from the decoder will be represented as oscillations in the highest layer of the network, and information about the grouping of features will be encoded in the lowest layer. In the middle layer, the two sources of information will interact to give a grouping decision. This project will also extend work currently in progress on incorporating attentional mechanisms into oscillator networks. This will allow attention to be shifted to a group of components based on a prediction of its characteristics (e.g., speech or non-speech) from the multisource decoder.
The current multisource decoder framework demonstrates remarkable robustness while making minimal assumptions about the noise. However, in some circumstances we may have knowledge/models of the identity of one of the competing environmental sound sources (e.g. we may know that the environment involves two speakers). In these cases we would like to extend the multisource approach to search for the best fragment/source assignment employing models for each of the sources we believe to be present. The simultaneous speaker problem is particularly interesting because in conversations recorded in an otherwise clean environment there are still periods of speaker overlap that can cause serious problems for conventional recognition techniques.
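The fragment/source assignment search can be illustrated in its simplest, exhaustive form. This is a toy (model names and brute-force enumeration are assumptions; the real decoder would search this space jointly with the word sequence, not separately): each fragment is labelled with one of the known source models, and the labelling with the best total log-likelihood is kept.

```python
from itertools import product

def best_assignment(fragments, log_like_models):
    """Exhaustively assign each fragment to one of the source models.

    Returns the labelling (a tuple of model indices, one per fragment)
    with the highest total log-likelihood, and its score.  Purely a
    sketch: enumeration is exponential in the number of fragments.
    """
    best_score, best_labels = float("-inf"), None
    for labels in product(range(len(log_like_models)), repeat=len(fragments)):
        score = sum(log_like_models[m](frag) for frag, m in zip(fragments, labels))
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels, best_score
```

For two simultaneous speakers, the two models would be the speaker models and each fragment would be claimed by whichever speaker explains it better, subject to the joint score.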
An alternative to having missing data masks pre-defined by primitive processes is to define them independently for each HMM state at each time. The optimal mask for each state can be obtained by deciding, for each frequency channel, whether it is present or missing by choosing the maximum of the positive evidence and the counter-evidence in that channel. One could then perform a single, conventional decoding using the likelihoods obtained under these state-optimal masks. This would be a purely schema-driven scheme. Primitive grouping could then be re-introduced, without multisource decoding, by applying constraints on the choice of channel assignment, for instance that all channels in a primitive group must be either present or missing.
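The per-channel decision for one state and one frame can be sketched as follows (function names and the form of the evidence terms are assumptions for illustration):

```python
import math

def state_optimal_mask_loglik(frame, present_logpdf, counter_loglik):
    """State-optimal mask for one frame: per channel, keep whichever is
    larger of the present-data evidence and the counter-evidence.

    Returns the mask and the resulting log-likelihood contribution.
    """
    mask, total = [], 0.0
    for x in frame:
        present = present_logpdf(x)   # evidence the channel fits this state
        missing = counter_loglik(x)   # counter-evidence (channel treated as masked)
        keep = present >= missing
        mask.append(keep)
        total += present if keep else missing
    return mask, total
```

A channel that fits the state's output distribution well is kept; a channel the state explains poorly is declared missing, exactly the maximisation described above.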
The missing data approach illustrates the degree of redundancy in a time-frequency representation of speech. We have also observed that selecting noisy estimates is typically more deleterious than omitting correct estimates. An open question is: what is the optimal strategy for selecting a subset of time-frequency points to utilise for ASR, given some estimate of the noise (and a confidence measure for this estimate)? We anticipate that an answer to this question will invoke information-theoretic criteria involving mutual information.
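As a first step towards such a criterion, one could rank channels by an empirical estimate of their mutual information with the class label. The sketch below is purely illustrative (the histogram estimator, bin count, and the absence of the noise-confidence term are all assumptions); a real scheme would fold in the noise estimate and its confidence as the text requires.

```python
import numpy as np

def channel_mi(features, labels, bins=4):
    """Crude mutual information (in bits) between one discretised spectral
    channel and the class label, as a channel-selection ranking criterion.

    Uses a quantile-binned histogram estimator; everything here is an
    illustrative assumption, not the project's eventual criterion.
    """
    edges = np.quantile(features, np.linspace(0, 1, bins + 1)[1:-1])
    f = np.digitize(features, edges)
    mi = 0.0
    for fv in np.unique(f):
        for lv in np.unique(labels):
            p_joint = np.mean((f == fv) & (labels == lv))
            if p_joint > 0:
                p_f, p_l = np.mean(f == fv), np.mean(labels == lv)
                mi += p_joint * np.log2(p_joint / (p_f * p_l))
    return mi
```

A channel whose values track the label carries high mutual information and would be retained; a channel statistically independent of the label scores near zero and could be dropped.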