Identifying Reliable Information
Both the Multistream and the missing data recognition architectures rely on processes that judge the reliability of the acoustic evidence presented to the recogniser. Developing techniques for identifying reliable speech information is therefore a major part of the RESPITE project. This year considerable progress has been made on both the multistream and the missing data fronts.
Improved Stream Weighting for Multistream Systems
For multistream systems, identifying reliable information amounts to assigning a dynamic 'weighting' to each stream of evidence. Throughout the course of the project the sophistication of the techniques applied to obtain this weighting has steadily increased.
This year a new model has been proposed for estimating 'full-combination (FC) expert weights' by Maximum Likelihood. Previously, both weight estimation and expert combination were performed prior to decoding. Based on a simple proof that the weights W which maximise the likelihood P(Q|W,X) for any given state sequence Q and observation sequence X are always 1 or 0, selection of 0/1 weights in the new model is integrated into the (Viterbi) decoding process. Initial tests on the Aurora benchmark show that this model can give significantly improved robustness to wideband noise. For details see:
The figure below illustrates the current Maximum Likelihood, Full-Combination, Multistream recognition architecture, which has evolved throughout the course of the RESPITE project.
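To make the idea concrete, the following is a minimal sketch of Viterbi decoding with 0/1 full-combination expert weights. Because the weights that maximise the likelihood for any fixed state sequence are always 0 or 1, selecting the weights during decoding reduces to picking the best expert per frame and state. The array shapes and function name are illustrative assumptions, not the actual RESPITE implementation:

```python
import numpy as np

def viterbi_fc(log_liks, log_trans, log_init):
    """Viterbi decoding with 0/1 full-combination expert weights.

    log_liks: array (num_experts, T, num_states) of per-expert
    state log-likelihoods. Weight selection is folded into the
    decoder: at each (frame, state) the best-scoring expert is
    chosen, i.e. weight 1 on that expert and 0 on the rest.
    """
    # 0/1 weighting: take the best expert per (frame, state).
    b = log_liks.max(axis=0)                  # (T, num_states)
    T, S = b.shape
    delta = log_init + b[0]                   # initial scores
    psi = np.zeros((T, S), dtype=int)         # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + b[t]
    # Backtrack the best state sequence.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

In a soft-weighting scheme the combination would be a weighted sum of expert likelihoods; the proof above justifies replacing that sum with a simple max inside the decoder.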
Improved Reliability Measures for Missing Data Systems
For missing data systems a reliability has to be assigned to every spectro-temporal point that is to be presented to the recogniser. In early work, points were labelled with a binary mask as being either wholly reliable or wholly unreliable. Although this discrete unreliable/reliable split is a good approximation to the behaviour of real noisy speech, inevitable errors in the estimation of such binary decisions lead to poor results. Last year we reported on performance improvements arising from 'soft' labellings, where the points are assigned a probability of being reliable.
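The role a soft mask plays in the recogniser can be sketched as follows for a single diagonal Gaussian state model; this is a simplified illustration (the real systems use GMM state models and more careful marginalisation), and the flat-density value is a hypothetical modelling choice:

```python
import numpy as np

def soft_missing_data_loglik(x, mask, mu, var):
    """Frame log-likelihood under soft missing-data masks (sketch).

    x: spectral feature vector; mask: per-channel reliability
    probabilities in [0, 1]. A fully reliable channel contributes
    its Gaussian density; a fully unreliable channel is
    marginalised out, here approximated by a flat density.
    """
    gauss = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    # Hypothetical flat density over an assumed feature range of width 10.
    flat = 1.0 / 10.0
    # Soft combination: weight the evidence by its reliability probability.
    return float(np.sum(np.log(mask * gauss + (1.0 - mask) * flat)))
```

With a binary mask this reduces to the original hard reliable/unreliable split; intermediate mask values let estimation errors degrade the likelihood gracefully instead of discarding or trusting a channel outright.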
This year further progress has been made in the design of the front-end which approximates these reliability probabilities. Previous systems relied solely on simple noise estimates and assumed that the noise is reasonably stationary. This year we have complemented these estimates with reliability measures based on techniques which exploit the harmonicity of the speech signal. The new techniques actively search out the 'voiced' regions of the speech signal and mark them as reliable evidence.
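A minimal sketch of harmonicity-based voicing detection is given below, using a normalised autocorrelation peak in the pitch range; the frame length, hop and threshold are illustrative values, and the published front-end is considerably more refined:

```python
import numpy as np

def voiced_mask(signal, sr, frame_len=400, hop=160, thresh=0.4):
    """Mark frames as 'voiced' (reliable) when the normalised
    autocorrelation has a strong peak at a lag in the pitch range.
    A simplified sketch of harmonicity-based reliability estimation.
    """
    lo, hi = sr // 400, sr // 60   # lags for F0 between 400 Hz and 60 Hz
    mask = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode='full')[frame_len - 1:]
        if ac[0] <= 0:             # silent frame: no energy, not voiced
            mask.append(False)
            continue
        ac = ac / ac[0]            # normalise by zero-lag energy
        mask.append(bool(ac[lo:hi].max() > thresh))
    return np.array(mask)
```

A strongly periodic signal (e.g. a vowel) produces a near-unity peak at the pitch period, while broadband noise shows no dominant lag, so its frames fall below the threshold and are treated as unreliable.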
Details can be found in the following paper:
Audio Visual Speech Recognition
Visual cues are an important part of the speech signal. Humans unconsciously use cues from the movements of the speaker's lips and face to help them understand what is being said. This information becomes increasingly important in noisy environments where the acoustic information may be unreliable.
Within the RESPITE project, ICP have been developing sophisticated audio-visual speech recognition techniques. Their system judges the reliability of information at three separate entry points along the acoustic-phonetic pathway:
- the sensor level, possibly multimodal, and its time-frequency representation
- the primitive level, using harmonicity and localisation cues
- the phonetic level, possibly multimodal.

Within this framework, AV recognition experiments have been conducted using a new weighting scheme and the voicing index as an SNR estimator. Various experiments, some of them carried out at ICP, show that in the general case (i.e. whatever the interference), level-specific techniques are not effective: they are generally specialised for a given task (e.g. segregating two localised speech sources). Developing combinations of methods based on different principles is therefore a promising direction. Orthogonally, an audio-visual experiment based on spectrally reduced speech suggests that, in a multistream recognition system, the streams could carry a representation of the articulatory features.
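The SNR-driven stream weighting can be sketched as follows; the logistic mapping and its slope are illustrative assumptions standing in for the actual weighting scheme, which derives its SNR estimate from the voicing index:

```python
import math

def av_log_likelihood(audio_ll, visual_ll, snr_db):
    """Combine audio and visual stream log-likelihoods with a weight
    driven by an SNR estimate (sketch: a hypothetical logistic mapping;
    the published scheme uses a voicing index as the SNR estimator).
    """
    # High SNR -> trust the audio stream; low SNR -> shift weight to vision.
    w_audio = 1.0 / (1.0 + math.exp(-0.5 * snr_db))
    return w_audio * audio_ll + (1.0 - w_audio) * visual_ll
```

The key design point is that the weight adapts frame by frame, so the visual stream dominates exactly where the acoustic evidence has become unreliable.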
- F. Berthommier and S. Choi, Evaluation of CASA and BSS models for cocktail-party speech segregation, Proceedings of ICA'01, San Diego, 2001.
- F. Berthommier, Audio-visual recognition of spectrally reduced speech, Proceedings of AVSP'01, Aalborg, 2001.
- M. Heckmann, F. Berthommier and K. Kroschel, Noise adaptive stream weighting in audio-visual speech recognition, submitted to JASP.