In the task descriptions below, the lead participant is the first in the list.
USFD - University of Sheffield, GB
RUB - Ruhr-University Bonn, DE
DCAG - DaimlerChrysler AG, DE
HUT - Helsinki University of Technology, FI
IDIAP - Institut Dalle Molle d'Intelligence Artificielle Perceptive, CH
LIV - University of Liverpool, GB
PATRAS - University of Patras, GR
Task 1: How are sound
mixtures perceptually organised (Auditory Scene Analysis), and how can
this knowledge be used in speech recognition?
Task 1.1: Neural Oscillators for Auditory Scene Analysis (USFD):
Three developments will be explored: estimation of multiple F0s using harmonic
cancellation rather than enhancement; F0 tracking using continuity constraints;
and incorporation of binaural (ITDs and IIDs) and monaural (spectral) cues
to spatial location.
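The harmonic-cancellation idea can be sketched as follows (an illustrative toy, not the project's algorithm; all function names and parameters are invented): a comb filter y[n] = x[n] - x[n - T] removes the harmonic series with period T, so the true period is the candidate whose cancellation leaves the least residual power, and repeating on the residual yields multiple F0s.

```python
import numpy as np

def cancellation_residual(x, period):
    """Power left after the cancellation comb filter y[n] = x[n] - x[n - period]."""
    y = x[period:] - x[:-period]
    return np.mean(y ** 2)

def estimate_f0s(x, fs, fmin=80.0, fmax=400.0, n_sources=2):
    """Pick the period whose cancellation leaves the least power, cancel
    that harmonic series, then repeat the search on the residual."""
    f0s = []
    residual = x.copy()
    periods = range(int(fs / fmax), int(fs / fmin) + 1)
    for _ in range(n_sources):
        best = min(periods, key=lambda T: cancellation_residual(residual, T))
        f0s.append(fs / best)
        residual = residual[best:] - residual[:-best]  # cancel this source
    return f0s

# Two concurrent harmonic complexes at 100 Hz and 160 Hz.
fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
mix = sum(np.sin(2 * np.pi * 100 * k * t) for k in (1, 2, 3)) \
    + sum(np.sin(2 * np.pi * 160 * k * t) for k in (1, 2, 3))
print(sorted(round(f) for f in estimate_f0s(mix, fs)))  # → [100, 160]
```

Cancellation, unlike enhancement, lets the second source be sought in a residual from which the first has been removed rather than merely de-emphasised.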
Task 1.2 Modelling grouping integration by multisource decoding (USFD,
LIV): Missing data masks can be generated using a number of different
CASA processes and SNR estimation. Masks can then be combined so that the
combination produces better results than any single mask. This task will
investigate several approaches to cue and mask combination: incorporation
of noise estimation into oscillator-based grouping, sequential application
of cues, and mask-level combination.
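As a toy sketch of mask-level combination (the data and the "harmonicity" and "SNR" masks below are invented for illustration): two binary missing-data masks over time-frequency regions, where 1 marks a reliable (speech-dominated) region, can be fused conservatively or permissively.

```python
import numpy as np

def combine_masks_and(m1, m2):
    """Conservative fusion: keep a region only if both cues agree it is reliable."""
    return np.logical_and(m1, m2).astype(int)

def combine_masks_or(m1, m2):
    """Permissive fusion: keep a region if either cue marks it reliable."""
    return np.logical_or(m1, m2).astype(int)

# Rows = frequency channels, columns = time frames (toy 2x4 spectrogram).
harmonicity_mask = np.array([[1, 1, 0, 0],
                             [0, 1, 1, 0]])
snr_mask         = np.array([[1, 0, 0, 1],
                             [0, 1, 0, 0]])

print(combine_masks_and(harmonicity_mask, snr_mask))
print(combine_masks_or(harmonicity_mask, snr_mask))
```

The AND rule risks discarding usable evidence; the OR rule risks admitting noise-dominated regions, which is why the weighting of the two cues is itself a research question here.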
Task 1.3 Active/Passive speech perception (LIV, USFD, HUT):
Extension of existing work on the active/passive dichotomy. LIV
will co-operate with USFD to extend the multi-source decoder paradigm to
model the results, and with HUT to study active/passive processing in the
auditory system.
Task 1.4. Envelope information and binaural processing (LIV, RUB,
PATRAS): LIV will implement the use of envelope information (within
and between channels) in an artificial system (CTK) and model its
effects with respect to human listeners' data. They will study the effect
of envelope information in other conditions and consider the respective
contributions of envelope and other cues (pitch, ILD, ITD) to binaural
processing, in conjunction with RUB and PATRAS.
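Within-channel envelope extraction of the kind referred to above can be sketched generically (a rectify-and-smooth scheme, not the CTK implementation; the signal and parameters are invented):

```python
import numpy as np

def amplitude_envelope(x, fs, win_ms=10.0):
    """Crude within-channel envelope: half-wave rectify, then smooth with a
    moving average (a stand-in for a proper lowpass filter)."""
    rectified = np.maximum(x, 0.0)
    win = max(1, int(fs * win_ms / 1000.0))
    kernel = np.ones(win) / win
    return np.convolve(rectified, kernel, mode="same")

# A 500 Hz carrier amplitude-modulated at 4 Hz: the envelope tracker should
# recover the slow modulation, not the carrier.
fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
carrier = np.sin(2 * np.pi * 500 * t)
modulator = 0.5 * (1 + np.sin(2 * np.pi * 4 * t))
env = amplitude_envelope(carrier * modulator, fs)
```

The recovered envelope tracks the 4 Hz modulator closely; it is such slow modulation patterns, compared across channels, that carry the grouping information studied in this task.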
Task 2: How does the
Auditory System handle reverberant conditions, and how can models
of this processing be used for speech enhancement?
Task 2.1 Researching the precedence effect (RUB, PATRAS): The Precedence
Effect seems to involve processes both in the auditory periphery and higher
auditory centres. RUB and PATRAS will research both aspects and their
relationship, including psychoacoustic experimentation and perceptual modelling.
Task 2.2 Reliability of auditory cues in multi-source scenarios (RUB,
LIV, USFD): Participants will carry out psychoacoustic localisation
and sound source separation experiments to find out which information is
evaluated in adverse conditions and how cues are combined/weighted by human
listeners. The findings will be integrated into segregation models in order
to improve CASA performance.
Task 2.3 Perceptual models of room reverberation with application to
speech recognition (PATRAS, RUB): Participants will establish
perceptually-based DSP models for the acoustic transmission paths which
define one or more speech sources within a reverberant and potentially
noisy environment. Modelling of the perceptual interaction for multiple
source signals in such reverberant spaces will also be addressed.
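A crude statistical stand-in for such a transmission path (a toy, not the perceptually-based DSP models proposed here) treats the room impulse response as a direct path followed by exponentially decaying noise, with reverberant speech synthesised by convolution:

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_room_response(fs, rt60=0.4, length_s=0.5):
    """Toy acoustic transmission path: exponentially decaying Gaussian noise,
    a common statistical stand-in for the late reverberant tail."""
    n = int(fs * length_s)
    t = np.arange(n) / fs
    decay = np.exp(-6.91 * t / rt60)  # 60 dB of decay over rt60 seconds
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct path
    return rir

fs = 8000
dry = rng.standard_normal(fs)        # 1 s of stand-in "speech"
rir = simple_room_response(fs)
wet = np.convolve(dry, rir)[: len(dry)]  # reverberant version
```

Perceptually-based models would replace the random tail with a structure reflecting what listeners actually resolve (early reflections, decay rate, direct-to-reverberant ratio) rather than the full sample-level response.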
Task 2.4 Speech enhancement for reverberant environments (PATRAS):
This research will build on work identifying perceptually significant
room response features, to be applied both to the autodirectivity of compact
microphone arrays and to pre-processing of reverberant speech (via perceptually-based
inverse filtering and echo cancellation) prior to use for ASR in Task 5.1.
Multiple sources and additive noise will also be addressed during pre-processing,
and may be dealt with by the acoustic signal optimisation achieved by the
microphone arrays.
Task 3: How is speech
production related to speech perception and cerebral speech processing,
and how can this knowledge be integrated into speech recognition systems?
Task 3.1 Glottal excitation estimation (HUT): HUT will develop digital
signal processing methods to automatically estimate the glottal excitation
directly from the acoustic speech pressure waveform and to parameterise
the estimated excitation waveforms effectively. The application of these
methods in clinical practice will also be addressed.
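One classical baseline for this task is plain LPC inverse filtering (a sketch only, not HUT's method; the toy "vowel" below is synthetic): fit an all-pole vocal-tract model and filter the speech through its inverse, leaving a residual that approximates the excitation.

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC via the normal equations
    (Levinson-Durbin is the usual choice; np.linalg.solve keeps this short)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))  # inverse filter A(z) = 1 - sum a_k z^-k

def inverse_filter(x, order=10):
    """Remove the vocal-tract envelope: pass x through the LPC inverse filter."""
    return np.convolve(x, lpc(x, order), mode="same")

def excess_peakiness(x):
    """Normalised fourth moment: high for impulse-like signals."""
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2

# Toy "vowel": a 100 Hz impulse train (glottal pulses) filtered by a single
# resonance near 700 Hz standing in for the vocal tract.
fs = 8000
excitation = np.zeros(fs // 2)
excitation[:: fs // 100] = 1.0
r, f_res = 0.97, 700.0
b1, b2 = 2 * r * np.cos(2 * np.pi * f_res / fs), -r * r
speech = np.zeros_like(excitation)
for n in range(len(speech)):
    speech[n] = excitation[n]
    if n >= 1:
        speech[n] += b1 * speech[n - 1]
    if n >= 2:
        speech[n] += b2 * speech[n - 2]

residual = inverse_filter(speech, order=4)  # impulse-like, pulse-per-period
```

The residual is far peakier than the speech itself, recovering the pulse-per-period structure; real glottal inverse filtering must additionally separate lip radiation and cope with the source-tract interaction that this toy ignores.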
Task 3.2 Voice production studies (HUT, USFD): Using the DSP algorithms
developed in Task 3.1, the goal is to study the dynamic limits of human
voice production (in terms of lung pressure, fundamental frequency, and the
AC characteristics of the glottal flow) by focusing on problematic speech
material such as female voices and utterances produced in either soft or
extremely loud phonation.
Task 3.3 Voice production and cortical speech processing (HUT, IDIAP,
RUB): Brain imaging techniques, such as magnetoencephalography (MEG),
will be used to analyse how voice production phenomena are reflected in
brain activity. With RUB, HUT will link models of spatial hearing
with the analysis of brain functions evoked by three-dimensional stimuli.
Task 4: How can Automatic
Speech Recognition algorithms take advantage of the work in Themes
1 and 2, for use in natural listening conditions?
Task 4.1 Developments in MultiSource Decoding (USFD, IDIAP, HUT):
In the multisource paradigm, decoding is conditioned on the subset of regions
chosen as reliable data. This opens up opportunities to employ a new layer
of probabilistic constraints during decoding: certain combinations of regions
are, a priori, more likely than others, due to such factors as sequential
continuity and constraints which follow from models of speech production.
To deploy this knowledge, it will be necessary to generalise the (probabilistic)
decoding formalism to incorporate the mask term in a general way.
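The likelihood side of the multisource decoding described in Task 4.1 can be sketched generically (a missing-data marginalisation under a diagonal-Gaussian state model; all numbers are invented): scoring a frame only over the regions a mask marks reliable keeps the score robust to a corrupted channel, and a prior over masks would then weight competing region subsets.

```python
import numpy as np

def marginal_loglik(x, mask, mean, var):
    """Missing-data likelihood: score a frame against a diagonal Gaussian
    using only the components the mask marks reliable, marginalising
    the unreliable components out."""
    m = mask.astype(bool)
    d = x[m] - mean[m]
    return -0.5 * np.sum(d ** 2 / var[m] + np.log(2 * np.pi * var[m]))

# A frame that matches the state model except in one noise-masked channel.
mean = np.array([1.0, 2.0, 3.0, 4.0])
var = np.ones(4)
frame = mean.copy()
frame[2] += 10.0                       # channel 2 dominated by noise
mask_all = np.ones(4)                  # treat everything as reliable
mask_reliable = np.array([1, 1, 0, 1])  # exclude the corrupted channel

ll_full = marginal_loglik(frame, mask_all, mean, var)
ll_masked = marginal_loglik(frame, mask_reliable, mean, var)
```

The masked score stays high while the full score collapses; generalising the decoder to sum such scores over a priori weighted masks is the formalism this task targets.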
Task 4.2 Informing Speech Recognition (LIV, DCAG, IDIAP): The
aim is to combine classical and new noise estimation methods with a predictive
element to allow the prediction and removal of time varying background
noise. Novel noise estimation techniques will also be used to inform missing
data techniques to obtain better recognition. In Blind Source Separation,
the intention is to develop semi-blind algorithms which address the problems
of echo compensation, noise reduction and de-reverberation.
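The noise-estimation-plus-prediction idea can be sketched minimally (a first-order recursive tracker feeding spectral subtraction; the threshold, smoothing constant, and data are invented):

```python
import numpy as np

def spectral_subtract(frames_power, noise_init, alpha=0.9, floor=0.05):
    """Track a slowly varying noise power spectrum with a recursive
    (first-order predictive) update during low-energy frames, then
    subtract the estimate from every frame, with a spectral floor."""
    noise = noise_init.copy()
    cleaned = []
    for p in frames_power:
        if p.sum() < 2.0 * noise.sum():          # crude "noise-only" decision
            noise = alpha * noise + (1 - alpha) * p
        cleaned.append(np.maximum(p - noise, floor * p))
    return np.array(cleaned), noise

# Ten noise-only frames (true noise power 1.0 per bin), then a speech frame.
frames_power = np.array([np.full(4, 1.0)] * 10 + [np.full(4, 6.0)])
cleaned, noise_est = spectral_subtract(frames_power, np.full(4, 0.8))
```

The tracker converges towards the true noise level during the pause and leaves the speech frame to be attenuated rather than tracked; the same running noise estimate can equally label low-SNR regions for the missing-data techniques mentioned above.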
Task 4.3 Advanced ASR Algorithms (IDIAP, USFD): Here, IDIAP will
adapt multi-stream recognition, HMM2 and DBM to accommodate work in Themes
1, 2 and 3, and test on the applications in Theme 5.
Task 5: How can the
results of other themes be exploited in speech recognition applications
which require robust performance in adverse conditions and/or processing
of sound mixtures?
Task 5.1 Speech recognition evaluation in multi-speaker conditions (DCAG,
IDIAP, USFD): DCAG will lead the work on evaluation of speech recognition
systems produced within the project, with a view to incorporating successful
features into their own systems. Robustness evaluation will, in the first
instance, use DCAG data (recorded in real car conditions), as well as the
AURORA international reference database. For later work on multiple-source
problems, several large audio databases (mainly broadcast news) containing
a variety of sound mixtures are available at IDIAP and will be used throughout
the project.
Task 5.2 Signal and speech detection in sound mixtures (IDIAP):
Participants will address the problem of sound organisation in multimedia
domains requiring ASR.
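A toy front end for this kind of sound organisation (a simple energy-based activity detector; signal and parameters are invented) illustrates the first step of locating material worth passing to ASR:

```python
import numpy as np

def detect_activity(x, fs, frame_ms=20, threshold_db=-30.0):
    """Flag frames whose energy exceeds a threshold relative to the peak
    frame - the simplest way to locate active sound in a long recording."""
    n = int(fs * frame_ms / 1000)
    n_frames = len(x) // n
    frames = x[: n_frames * n].reshape(n_frames, n)
    energy = np.mean(frames ** 2, axis=1)
    db = 10 * np.log10(energy / energy.max() + 1e-12)
    return db > threshold_db

# One second of faint hum with a 440 Hz tone between 0.4 s and 0.6 s.
fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
signal = np.where((t > 0.4) & (t < 0.6),
                  np.sin(2 * np.pi * 440 * t),
                  0.001 * np.sin(2 * np.pi * 50 * t))
active = detect_activity(signal, fs)  # 10 of the 50 frames flagged
```

Real multimedia material of course demands far more than energy (speech/music/noise discrimination, overlapping sources), which is exactly what this task addresses.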
Task 5.3 Speech technology assessment by simulated acoustic
environments (RUB, IDIAP): The plan here is to expand RUB's "transmission
channel simulation" technology to the case of hands-free communication
systems, in order to test recogniser performance under simulated deteriorated
acoustic conditions.