Workplan

In the task descriptions below, the lead participant is listed first.

Abbreviations:
USFD - University of Sheffield, UK
RUB - Ruhr-University Bochum, DE
DCAG - DaimlerChrysler AG, DE
HUT - Helsinki University of Technology, FI
IDIAP - Institut Dalle Molle d'Intelligence Artificielle Perceptive, CH
LIV - University of Liverpool, UK
PATRAS - University of Patras, GR

Task 1: How are sound mixtures perceptually organised, and how can this Auditory Scene Analysis be used in speech recognition?

  • Task 1.1: Neural Oscillators for Auditory Scene Analysis (USFD): Three developments will be explored: estimation of multiple F0s using harmonic cancellation rather than enhancement; F0 tracking using continuity constraints; and incorporation of binaural (ITDs and IIDs) and monaural (spectral) cues to spatial location.
  • Task 1.2 Modelling grouping integration by multisource decoding (USFD, LIV): Missing data masks can be generated using a number of different CASA processes and SNR estimation. These masks can then be combined so that the combined mask gives better results than any single one. This task will investigate several approaches to cue and mask combination: incorporation of noise estimation into oscillator-based grouping, sequential application of cues, and mask-level integration (a toy illustration follows this list).
  • Task 1.3 Active/Passive speech perception (LIV, USFD, HUT): Extension of existing work on the active/passive dichotomy. LIV will co-operate with USFD to extend the multi-source decoder paradigm to model the results, and with HUT to study active/passive processing in the brain.
  • Task 1.4 Envelope information and binaural processing (LIV, RUB, PATRAS): LIV will implement the use of envelope information (within and between channels) in an artificial system (CTK) and model its effects with respect to human listeners' data. They will study the effect of envelope information in other conditions and consider the respective contributions of envelope and other cues (pitch, ILD, ITD) to binaural processing, in conjunction with RUB and PATRAS.
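
As an illustrative sketch of the mask generation and integration described in Task 1.2 (not a project deliverable: the function names and the simple SNR criterion are assumptions made for this example), the Python fragment below builds a binary missing-data mask from a local noise estimate and combines several such masks:

    import numpy as np

    def snr_mask(spec, noise_est, threshold_db=0.0):
        """Binary missing-data mask: a spectro-temporal cell counts as
        'reliable' when its estimated local SNR exceeds the threshold."""
        snr_db = 10.0 * np.log10(np.maximum(spec, 1e-12) /
                                 np.maximum(noise_est, 1e-12))
        return snr_db > threshold_db

    def combine_masks(masks, mode="all"):
        """Mask-level integration: 'all' keeps only cells every cue marks
        reliable (conservative); 'any' keeps cells any cue marks reliable."""
        stacked = np.stack(masks)
        return stacked.all(axis=0) if mode == "all" else stacked.any(axis=0)

The project will investigate richer combination schemes (for example, sequential application of cues), but logical intersection and union of per-cue masks are the natural baselines.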

Task 2: How does the auditory system handle reverberant conditions, and how can models of this processing be used for speech enhancement?

  • Task 2.1 Researching the precedence effect (RUB, PATRAS): The precedence effect appears to involve processes both in the auditory periphery and in higher auditory centres. RUB and PATRAS will research both aspects and their relationship, through psychoacoustic experimentation and perceptual modelling.
  • Task 2.2 Reliability of auditory cues in multi-source scenarios (RUB, LIV, USFD): Participants will carry out psychoacoustic localisation and sound-source separation experiments to determine which information human listeners evaluate in adverse conditions and how they combine and weight cues. The findings will be integrated into segregation models in order to improve CASA performance.
  • Task 2.3 Perceptual models of room reverberation with application to speech recognition (PATRAS, RUB): Participants will establish perceptually-based DSP models for the acoustic transmission paths which define one or more speech sources within a reverberant and potentially noisy environment (see the sketch after this list). Modelling of the perceptual interaction of multiple source signals in such reverberant spaces will also be addressed.
  • Task 2.4 Speech enhancement for reverberant environments (PATRAS): This research will identify perceptually-significant room-response features and apply them both to the autodirectivity of compact microphone arrays and to the pre-processing of reverberant speech (via perceptually-based inverse filtering and echo cancellation) prior to ASR in Task 5.1. Multiple sources and additive noise will also be addressed during pre-processing and may be handled by the acoustic signal optimisation achieved by the interface.
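
To make the kind of transmission-path model discussed in Task 2.3 concrete, here is a minimal sketch (all names and parameter values are invented for illustration): it generates a toy exponentially decaying room impulse response for a target RT60 and computes its direct-to-reverberant energy ratio, a simple perceptually relevant descriptor of a reverberant path:

    import numpy as np

    def synthetic_rir(fs=16000, rt60=0.5, length_s=0.8, seed=0):
        """Toy room impulse response: exponentially decaying noise whose
        decay rate is set by the target RT60 (time for a 60 dB drop)."""
        rng = np.random.default_rng(seed)
        n = int(fs * length_s)
        t = np.arange(n) / fs
        decay = np.exp(-6.9078 * t / rt60)  # ln(10**3) gives -60 dB at t = rt60
        rir = rng.standard_normal(n) * decay
        rir[0] = 1.0                        # direct-path component
        return rir

    def direct_to_reverberant_db(rir, fs=16000, early_ms=4.0):
        """Direct-to-reverberant energy ratio, splitting the RIR at early_ms."""
        split = int(fs * early_ms / 1000)
        direct = np.sum(rir[:split] ** 2)
        reverb = np.sum(rir[split:] ** 2)
        return 10.0 * np.log10(direct / max(reverb, 1e-12))

Convolving clean speech with such a response gives controlled reverberant test material; the project's perceptual models will of course go well beyond this statistical caricature.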

Task 3: How is speech production related to speech perception and cerebral speech processing, and how can this knowledge be integrated into speech recognition systems?

  • Task 3.1 Glottal excitation estimation (HUT): HUT will develop digital signal processing (DSP) methods to estimate the glottal excitation automatically and directly from the acoustic speech pressure waveform, and to parameterise the estimated excitation waveforms effectively (an illustrative sketch follows this list). The application of these methods in clinical practice will also be addressed.
  • Task 3.2 Voice production studies (HUT, USFD): Using the DSP algorithms from Task 3.1, the goal is to study the dynamic limits of human voice production (in terms of lung pressure, fundamental frequency and the AC characteristics of the glottal flow), focusing on problematic speech material such as female voices and utterances produced in either soft or extremely loud phonation.
  • Task 3.3 Voice production and cortical speech processing (HUT, IDIAP, RUB): Brain imaging techniques, such as magnetoencephalography (MEG), will be used to analyse how voice production phenomena are reflected in brain activity. With RUB, HUT will link models of spatial hearing with the analysis of brain functions evoked by three-dimensional stimuli.
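
As a textbook-style illustration of the glottal excitation estimation named in Task 3.1 (this is plain LPC inverse filtering, not HUT's actual algorithm; all names are invented for the sketch), the fragment below removes an all-pole vocal-tract estimate from a voiced frame and integrates the residual to approximate the glottal flow:

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from scipy.signal import lfilter

    def lpc(frame, order):
        """All-pole vocal-tract estimate via the autocorrelation method."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        a = solve_toeplitz(r[:order], r[1:order + 1])
        return np.concatenate(([1.0], -a))  # A(z) = 1 - sum_k a_k z^-k

    def glottal_estimate(frame, order=12):
        """Crude glottal-source estimate: inverse-filter with the LPC model,
        then integrate leakily to cancel the lip-radiation differentiation."""
        a = lpc(frame * np.hamming(len(frame)), order)
        residual = lfilter(a, [1.0], frame)
        return lfilter([1.0], [1.0, -0.99], residual)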

Task 4: How can Automatic Speech Recognition algorithms take advantage of the work in Themes 1 and 2, for use in natural listening conditions?

  • Task 4.1 Developments in multisource decoding (USFD, IDIAP, HUT): In the multisource paradigm, decoding is conditioned on the subset of regions chosen as reliable data (see the sketch after this list). This opens up opportunities to employ a new layer of probabilistic constraints during decoding: certain combinations of regions are, a priori, more likely than others, due to such factors as sequential continuity and constraints which follow from models of speech production. To deploy this knowledge, it will be necessary to generalise the (probabilistic) decoding formalism to incorporate the mask term in a general way.

  • Task 4.2 Informing Speech Recognition (LIV, DCAG, IDIAP): The aim is to combine classical and new noise estimation methods with a predictive element, allowing the prediction and removal of time-varying background noise. Novel noise estimation techniques will also be used to inform missing-data techniques to obtain better recognition. In Blind Source Separation, the intention is to develop semi-blind algorithms which address the problems of echo compensation, noise reduction and de-reverberation.
  • Task 4.3 Advanced ASR Algorithms (IDIAP, USFD): Here, IDIAP will adapt multi-stream recognition, HMM2 and DBM to accommodate work in Themes 1, 2 and 3, and test them on the applications in Theme 5.
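
The core of the multisource/missing-data decoding referred to in Tasks 4.1 and 4.2 is that acoustic likelihoods are computed only over the feature dimensions a mask marks as reliable, with the remainder marginalised away. A minimal sketch for a diagonal-covariance GMM state model (the function name and array shapes are assumptions of this example):

    import numpy as np
    from scipy.special import logsumexp

    def missing_data_loglik(x, mask, weights, means, variances):
        """Missing-data GMM log-likelihood: unreliable dimensions are
        marginalised out (simply omitted), reliable ones scored normally.
        Shapes: weights (K,), means/variances (K, D), x/mask (D,)."""
        r = mask.astype(bool)
        per_component = (np.log(weights)
                         - 0.5 * np.sum(np.log(2 * np.pi * variances[:, r])
                                        + (x[r] - means[:, r]) ** 2
                                        / variances[:, r], axis=1))
        return logsumexp(per_component)

Task 4.1's generalisation amounts to treating the mask itself probabilistically, so that the decoder scores combinations of reliable regions under a prior rather than taking one fixed mask as given.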

Task 5: How can the results of other themes be exploited in speech recognition applications which require robust performance in adverse conditions and/or processing of sound mixtures?

  • Task 5.1 Speech recognition evaluation in multi-speaker conditions (DCAG, IDIAP, USFD): DCAG will lead the evaluation of speech recognition systems produced within the project, with a view to incorporating successful features into their own systems. Robustness evaluation will, in the first instance, use DCAG data (recorded in real car conditions) as well as the AURORA international reference database. For later work on multiple-source problems, several large audio databases (mainly broadcast news) containing a variety of sound mixtures are available at IDIAP and will be used throughout Theme 5.
  • Task 5.2 Signal and speech detection in sound mixtures (IDIAP): Participants will address the problem of sound organisation in multimedia domains requiring ASR.
  • Task 5.3 Speech technology assessment by simulated acoustic environments (RUB, IDIAP): The plan here is to extend RUB's "transmission channel simulation" technology to the case of hands-free communication systems, in order to test recogniser performance on simulated deteriorated speech data (a simple mixing sketch follows this list).
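
For the robustness evaluations in Tasks 5.1 and 5.3, deteriorated test data can be simulated by mixing clean recordings with noise at controlled SNRs. A minimal sketch (the function name and interface are invented for this example):

    import numpy as np

    def mix_at_snr(speech, noise, snr_db, seed=0):
        """Scale a noise recording so that the mixture has the requested
        global SNR, then add it to the speech signal."""
        rng = np.random.default_rng(seed)
        if len(noise) < len(speech):                    # loop short noise
            noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
        start = rng.integers(0, len(noise) - len(speech) + 1)
        noise = noise[start:start + len(speech)]
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2)
        gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
        return speech + gain * noise

RUB's transmission-channel simulation goes further, modelling the channel itself rather than only additive noise, but SNR-controlled mixing gives a simple first test condition.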
