Supported by EPSRC (project GR/T04823/01)
This proposal concerns the development of novel techniques for exploiting visual speech information (e.g. lip and face movements) in the design of automatic speech recognition systems. The approaches explored are motivated by the desire for reliable speech recognition in the presence of highly non-stationary noise sources (e.g. other speakers). The basis of the project will be a new approach to robust automatic speech recognition developed by the proposer that operates by 'piecing together' fragments of speech that can be observed in low-noise regions of a time-frequency representation. The proposal extends this approach into the audio-visual domain. The audio-visual system will exploit the correlation that exists between the audio and visual aspects of speech to resolve ambiguities in the acoustic fragment labelling that arise in the present system when attempting to recognise speech in the presence of noises with speech-like characteristics (e.g. simultaneous speakers). Extending the current system from audio-only to audio-visual is nontrivial; questions about how best to integrate the audio and visual data streams, and how best to exploit the visual information, must be carefully addressed. The research will necessitate integrating several areas of expertise, including computational auditory modelling, audio-visual speech perception, and robust automatic speech recognition.
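The core idea of the fragment-based approach can be illustrated in a minimal sketch: compute a time-frequency representation, mark the cells whose energy stands sufficiently far above an estimated noise floor as "reliable", and group connected reliable cells into spectro-temporal fragments. This is an illustrative simplification, not the project's actual decoder: the function names, the 3 dB threshold, and the use of a noise-only recording to estimate the floor are all assumptions made for the sake of a runnable example.

```python
import numpy as np

def stft_mag(x, frame_len=256, hop=128):
    """Magnitude spectrogram via a simple framed FFT (Hann window)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, n_frames)

def reliable_mask(spec, noise_floor, snr_db=3.0):
    """Mark time-frequency cells whose local SNR against the estimated
    noise floor exceeds snr_db (a crude 'low-noise region' test)."""
    eps = 1e-12
    local_snr = 20.0 * np.log10((spec + eps) / (noise_floor[:, None] + eps))
    return local_snr > snr_db

def label_fragments(mask):
    """4-connected component labelling of the reliable mask; each
    connected component is treated as one spectro-temporal fragment."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for f in range(mask.shape[0]):
        for t in range(mask.shape[1]):
            if mask[f, t] and labels[f, t] == 0:
                current += 1
                stack = [(f, t)]
                while stack:
                    i, j = stack.pop()
                    if (0 <= i < mask.shape[0] and 0 <= j < mask.shape[1]
                            and mask[i, j] and labels[i, j] == 0):
                        labels[i, j] = current
                        stack += [(i + 1, j), (i - 1, j),
                                  (i, j + 1), (i, j - 1)]
    return labels, current

# Synthetic mixture: a 1 kHz tone buried in white noise, 8 kHz sampling.
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)
noise = 0.3 * rng.standard_normal(len(t))
spec = stft_mag(tone + noise)

# Noise floor per frequency bin, here from the noise-only signal
# (in practice it would have to be estimated from the mixture itself).
floor = stft_mag(noise).mean(axis=1)

mask = reliable_mask(spec, floor)
labels, n_fragments = label_fragments(mask)
print(f"{n_fragments} fragments; {mask.mean():.1%} of cells marked reliable")
```

With this setup the tone occupies one frequency bin (1000 Hz falls exactly on bin 32 at a 31.25 Hz resolution), so the dominant fragment is a horizontal band of reliable cells; the remaining small fragments are chance noise excursions above the threshold. Resolving which fragments belong to the target speaker, rather than to a competing speech-like source, is precisely the ambiguity the proposal aims to address with visual information.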
The project is now complete and the final report is available here.