ESPRIT Reactive Long Term Research
Recognition of Speech by Partial
Information Techniques (RESPITE)
RESPITE will extend and apply two novel technologies – missing
data theory and multi-stream theory – to the problem
of robust automatic speech recognition (ASR), with particular
application to cellular phones and in-car environments. It will also
support studies whose purpose is to inform this endeavour.
The specific measurable objectives are to:
1. develop techniques for identifying reliable data;
2. advance the theory of multi-stream processing;
3. advance the theory of missing and masked data handling;
4. inform the above by obtaining new perceptual data on speech;
5. combine missing data and multi-stream processing with existing
robust ASR methods;
6. evaluate all this within a framework of demonstrator ASR
applications to cellular phones and in cars.
For the recognition-based objectives (2, 3 and 5) we will use
well-established corpus-based evaluation techniques for ASR (for
instance word accuracy), which will allow the benefit of each of the
above innovations to be quantified in comparison to standard
approaches. These studies will be made on standard reference data and
on in-house data (see Task 1.1 in section 2.2.2).
For the demonstrators (objective 6), error rates can be measured
as the user attempts to accomplish her/his task under varying
conditions.
Yardsticks for identifying reliable data (objective 1) can be
based on comparisons between the algorithms' outputs and predefined
'optimal' labelling of signal regions, and on recognition results that
employ the data deemed to be reliable compared to those based on
indiscriminate use of the whole signal.
The success of the perceptual studies (4) can be evaluated
indirectly by the extent to which their results are deployed within
recognition schemes and the resulting effect on performance. The
studies will, in addition, have scientific merit in their own right.
In this sense, the measures of success are those of experimental
science: has an experiment been designed which will elicit the
information required? Have results been obtained which are
statistically significant? Are these results reproducible? Can the
results be understood in terms of the model which provoked the
experiment?
In addition to the person-months accounted for here, substantial
additional resources will be committed to the project by its authors
and their colleagues.
There are 6 work packages:
- WP0 Management
- WP1 Resources and Basic Technologies
- WP2 Identifying Reliable Information
- WP3 Recognition Techniques
- WP4 Application Demonstrators and Evaluation
- WP5 Dissemination and Take Up
In outline, the relation between these is as follows:
- WP0 covers the coordinator's role and resources and is active
throughout the programme.
- WP1 provides the platform for the main body of the work and
occupies months 1-12.
- WP2 is required for Missing Data recognition and desirable for
Multi Stream recognition. It runs throughout the project lifetime
but will produce results incrementally, from month 6.
- WP3 develops the central Missing Data and Multi-Stream
technology and their integration. It will be active throughout the
programme.
- WP4 is concerned with assessment and deployment of the
technology and again will be active throughout the programme.
- WP5 covers the publication and technology take up aspects of
the project and again will be active throughout the programme.
2.2 Detailed work plan
For each WP we specify the executing partners and the manager.
WP0 Management
- Work Package Manager: Sheffield
- Executing Partners: Sheffield
- Project management is documented in section 3.
WP1 Resources and Basic Technologies
- Work Package Manager: FPMs
- Executing Partners: Daimler-Benz, MATRA, IDIAP, FPMs
- WP1 encapsulates the work involved in establishing common
resources, a common software framework and baseline recognition
systems for comparison with research prototypes.
Task T1.1 Database management
- Task Coordinator: Daimler-Benz
- Executing Partners: Daimler-Benz, MATRA, IDIAP, FPMs
- Speech recognition technology is dependent on the availability
of substantial corpora of spoken material for training and for
evaluation. In the case of RESPITE, we are fortunate in that much
of the data we need has already been collected. We intend to make
use of the following resources:
- Standard evaluation databases for robust ASR, allowing direct
comparison with results already reported.
- A GSM speech database recently collected at IDIAP on two
different (low end and high end) cellular phones. It is also
intended to complement those calls with about 50 calls recorded in
a quiet room simultaneously via the GSM line and directly onto DAT.
- Extensive recordings of speech in cars recently made by
Daimler-Benz, addressing the hands-free speaking condition.
- MATRA will provide access to a large multi-lingual speech
database, recorded in car environments and over GSM networks,
which is to be collected within the framework of the new
SpeechDat-Car project.
- The man-months allocated to T1.1 cover the work involved in
organising and processing material from these databases in such a
way that they can be used by all partners as the basis for RESPITE
research. For instance, work at FPMs will involve converting the
databases to the STRUT format.
Task T1.2 Baseline recognition systems
- Task co-ordinator: FPMs
- Executing partners: FPMs, ICSI, Sheffield, IDIAP
- We will first establish baseline results for our databases
using 'reference' speech recognition research systems. Two kinds
of platform are of interest:
- Hidden Markov Model (HMM) systems, exemplified by HTK, a
commercial package from Entropic which has been used as the basis
for Sheffield's missing data work.
- Hybrid systems which combine Artificial Neural Nets and HMMs,
exemplified by STRUT (developed by FPMs) and the ICSI system.
- These two methodologies have strengths and drawbacks: HMM
systems are more generally accepted as a reference but are less
amenable to some of the modifications we will need to make. Rather
than attempting to coerce partners into using a single system at
the outset, we will therefore pursue the pluralistic approach of
establishing baseline results using the three recognition systems
(HTK, STRUT and ICSI) mentioned above. Baseline systems,
configured for the evaluation databases, will be duplicated across
the partners. For RESPITE portability, STRUT's data structure
design and programming interface will be improved. Key features of
the ICSI system are that it already includes code for multi-stream
recognition, and front-end visualisation tools.
- It is anticipated that as research progresses, we will migrate
away from multiple recognition systems to a single system,
possibly drawing pieces from each of the baseline recognisers. We
will make this decision when we address the design of the first
demonstrators (see T4.2). We will not invest undue effort into
porting the research results to systems where they are not
required within the project.
WP2 Identifying reliable information
- Work Package Manager: ICP
- Executing Partners: Sheffield, ICP, ICSI, IDIAP, FPMs
- This work package is concerned with the identification of the
regions within the signal to which the recognition techniques of
WP3 should be paying the most attention and, conversely,
indicating which features and streams should be regarded as
contaminated or 'missing'. To this end, we will investigate a
variety of techniques, based both on conventional statistical
signal processing and on the study and modelling of human hearing.
Task T2.1 Computational Auditory Scene Analysis
- Task coordinator: ICP
- Executing partners: Sheffield, ICP, ICSI
- The field of 'computational auditory scene analysis' (CASA)
encompasses everything from the peripheral frequency analysis of
the cochlea through to abstract constraints such as 'expectations'
of familiar sounds. Task T2.1 will pursue the following studies:
- Reference CASA implementation. ICP, ICSI and Sheffield will
combine their existing expertise to produce software which will
detect local (bottom-up) sound-organization cues such as
harmonicity, common onset/offset, common modulation, and, for
binaural signals, spatial location. These cues will be used to
separate evidence from different sound sources. The effectiveness
of this software will be evaluated using SNR and/or
recognition-based metrics for assessing sound-source separation as
outlined in section 2.1.
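As an illustration of the kind of bottom-up cue involved, a harmonicity detector can be sketched from the normalised autocorrelation of a signal frame. The function, sampling rate and pitch range below are illustrative assumptions, not part of the planned reference implementation:

```python
import numpy as np

def harmonicity(frame, fs=8000, fmin=80.0, fmax=400.0):
    """Peak of the normalised autocorrelation in the pitch range.

    Values near 1 suggest a strongly periodic (harmonic) frame;
    values near 0 suggest noise. Illustrative sketch only.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]                       # normalise by frame energy
    lo, hi = int(fs / fmax), int(fs / fmin)  # candidate pitch lags
    return float(ac[lo:hi + 1].max())

fs = 8000
t = np.arange(int(0.032 * fs)) / fs       # one 32 ms frame
rng = np.random.default_rng(0)
# A harmonic signal (200 Hz fundamental plus one harmonic) vs. noise.
voiced = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
noise = rng.standard_normal(t.size)

h_voiced = harmonicity(voiced, fs)
h_noise = harmonicity(noise, fs)
```

Frames scoring highly on such a cue would be grouped as evidence for a single periodic source; the reference implementation will combine several cues (onset, modulation, location) rather than rely on any one.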
- Comparison of source separation techniques. In a comparative
study, we will also consider approaches to source separation not
directly motivated by auditory processing, namely blind source
separation and model decomposition.
- Interfacing CASA to speech recognition. We will research the
coupling between CASA processing and recognition. In addition to
performing sound-source separation, CASA can provide other
information of use to a recogniser, such as the possible locations
of speaker changes or overlap. By the same token, the analysis
performed within the recogniser can provide constraints – such
as the preferred interpretation of an ambiguous segment
– that could be useful to the CASA processor. In the longer
term (for the second generation of demonstrators, T4.2), we will
investigate closer coupling of CASA and recognition, so that CASA
is no longer seen as a front end which identifies reliable
evidence for use by an adapted conventional recogniser. This goal
links to the developments in recognition architectures proposed in
T3.2 and T3.4.
Task T2.2 Other information-location techniques
- Task Coordinator: FPMs
- Executing partners: IDIAP, Sheffield, ICP, ICSI, FPMs
- Sound source separation is not the only way in which the most
useful features within the signal can be identified. Since
alternative techniques are likely to contribute complementary
results, we will investigate the following areas:
- Signal-to-noise ratio (SNR) estimation. We will improve
existing schemes adopted by FPMs, ICSI and MATRA for estimating
background noise, to make the process more robust and adaptive to
changing noise conditions.
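By way of illustration, one simple scheme in this family tracks the noise floor with asymmetric smoothing: the estimate rises slowly and falls quickly, so it follows the background rather than the speech. The smoothing constants and test signal below are illustrative assumptions, not the partners' actual schemes:

```python
import numpy as np

def track_noise(power, alpha_up=0.999, alpha_down=0.5):
    """Adaptive noise-floor tracker for one spectral channel.

    Heavy smoothing (alpha_up) when power grows, light smoothing
    (alpha_down) when it drops, so speech bursts barely raise the
    floor while falling backgrounds are followed quickly.
    """
    noise = np.empty_like(power)
    est = power[0]
    for i, p in enumerate(power):
        a = alpha_up if p > est else alpha_down
        est = a * est + (1 - a) * p
        noise[i] = est
    return noise

def snr_db(power, noise, floor=1e-10):
    """Frame SNR in dB from observed power and a noise estimate."""
    return 10 * np.log10(np.maximum(power - noise, floor)
                         / np.maximum(noise, floor))

# Constant background of power 1.0 with a burst of 'speech' energy.
power = np.ones(200)
power[80:120] += 20.0
noise = track_noise(power)
snr = snr_db(power, noise)
```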
- Confidence and entropy measures. The information quality in a
channel (measured by entropy or statistical confidence measures)
will be used to identify reliable features.
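For instance, the entropy of a stream's phone-posterior vector gives one such quality measure: a sharply peaked posterior indicates a confident, and hence presumably reliable, channel. A minimal sketch, in which the class count and threshold are illustrative assumptions:

```python
import numpy as np

def posterior_entropy(posteriors, eps=1e-12):
    """Entropy (in bits) of a phone-posterior vector.

    Low entropy: the stream commits to one class (reliable).
    Entropy near log2(K): the stream is uninformative.
    """
    p = np.asarray(posteriors, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log2(p + eps)).sum())

K = 40                                    # e.g. 40 phone classes
peaked = np.full(K, 0.01 / (K - 1))       # confident posterior
peaked[7] = 0.99
uniform = np.full(K, 1.0 / K)             # uninformative posterior

h_peaked = posterior_entropy(peaked)
h_uniform = posterior_entropy(uniform)
reliable = h_peaked < 0.5 * np.log2(K)    # simple reliability rule
```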
- Experiments in human speech perception. We will conduct
experiments to inform the design of time-frequency decompositions
to use in speech recognition. In particular, we will extend
investigations into the way that subband envelopes at different
timescales (i.e. the 'modulation spectra') affect the
intelligibility of speech.
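The modulation spectrum referred to above can be illustrated by Fourier-analysing the slowly varying amplitude envelope of a subband signal. This is a sketch under simplifying assumptions; the frame rate and the 4 Hz test modulation (a typical syllable rate) are illustrative:

```python
import numpy as np

def modulation_spectrum(subband, fs, frame_rate=100):
    """Magnitude spectrum of a subband's amplitude envelope.

    The envelope is the mean rectified signal in short frames
    (frame_rate frames per second), mean-removed and then
    Fourier-analysed. For speech, energy typically concentrates
    at a few Hz.
    """
    hop = fs // frame_rate
    n = len(subband) // hop
    env = np.abs(subband[:n * hop]).reshape(n, hop).mean(axis=1)
    env = env - env.mean()                 # remove DC
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(n, d=1.0 / frame_rate)
    return freqs, spec

fs = 8000
t = np.arange(2 * fs) / fs                 # 2 s of signal
# 1 kHz carrier, amplitude-modulated at 4 Hz.
x = (1 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
freqs, spec = modulation_spectrum(x, fs)
peak_mod_freq = freqs[spec.argmax()]
```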
WP3 Recognition Techniques
- Work Package Manager: IDIAP
- Executing Partners: all
- In WP3 we investigate new recognition techniques, including
those based on the missing data and multi-stream approaches, and
extend these techniques to take advantage of the results of WP2.
All methods and open issues discussed below will be tested on the
common databases defined in Task T1.1. Results will be compared to
"standard" (state-of-the-art) noise robust speech recognition
techniques also benefiting, when possible, from findings of WP2.
Task T3.1 Developments in noise robust speech recognition
- Task Coordinator: MATRA
- Executing Partners: MATRA, IDIAP, Daimler-Benz
- This task will test variants of the new approaches within the
framework of standard techniques (e.g. by weighting sub-band
emission probabilities). It will be based on the existing
noise-robust recognition systems at MATRA and Daimler-Benz.
Task T3.2 Missing Data Recognition
- Task Coordinator: Sheffield
- Executing Partners: Sheffield, IDIAP, FPMs, ICP
- The following topics will be investigated:
- Porting missing data techniques to other platforms: for
instance FPMs will implement missing data techniques in STRUT.
- Encoding spectral dynamics. We will make use of temporal
derivatives and time-domain interpolation to improve phone
discrimination.
- Exploiting masked regions. Missing data performance can be
improved by exploiting counter-evidence: the probability that a
given model could have been masked by the observed values. A joint
perceptual-modelling study will be pursued here in order to better
understand the relative weight attached to limited positive
evidence and masked counter-evidence in auditory-phonetic
interpretation.
- Decoder architecture for multiple sources. We will extend the
traditional hypothesis search space to handle multiple
simultaneous sources. Each evidence group provided by CASA will be
regarded either as the continuation of existing hypotheses, in
which case it is treated as partial evidence, or as the start of a
new acoustic source, in which case it not only triggers new paths,
but can be used as counter-evidence for the ongoing sources.
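The marginalisation and counter-evidence ideas above can be sketched for a single diagonal-Gaussian state: reliable features are scored with the usual density, while masked features either drop out of the product entirely or contribute the probability that the clean value lies below the observed masking energy. An illustrative sketch with invented numbers, not the project's decoder:

```python
import numpy as np
from math import erf, sqrt, pi, log

def gauss_logpdf(x, mu, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (log(2 * pi * var) + (x - mu) ** 2 / var)

def gauss_logcdf(x, mu, var):
    """Log of P(clean feature <= x) under the model Gaussian."""
    c = 0.5 * (1 + erf((x - mu) / sqrt(2 * var)))
    return log(max(c, 1e-300))

def missing_data_loglik(x, mask, mu, var, bounded=True):
    """Missing-data log-likelihood of one diagonal-Gaussian state.

    mask[d] True  -> feature d is reliable: score it normally.
    mask[d] False -> feature d is masked: marginalise it out
    (bounded=False), or treat the observation as an upper bound on
    the clean value -- the counter-evidence idea (bounded=True).
    """
    ll = 0.0
    for xd, m, mud, vd in zip(x, mask, mu, var):
        if m:
            ll += gauss_logpdf(xd, mud, vd)
        elif bounded:
            ll += gauss_logcdf(xd, mud, vd)
    return ll

mu = np.array([0.0, 5.0, 2.0])
var = np.array([1.0, 1.0, 1.0])
x = np.array([0.1, 9.0, 1.5])             # feature 1 badly corrupted
mask = np.array([True, False, True])      # flag it as unreliable

ll_masked = missing_data_loglik(x, mask, mu, var)
ll_naive = missing_data_loglik(x, np.array([True] * 3), mu, var)
```

Masking the corrupted feature keeps the corrupted observation from dominating the score, while the bounded variant still penalises models whose clean values would exceed the observed masking energy.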
Task T3.3 Multi-stream Recognition
- Task Coordinator: IDIAP
- Executing Partners: IDIAP, ICSI, FPMs, ICP
- This task will investigate multi-band and multiple time scale
speech recognition. As with the missing data task, we will begin
by making existing multi-stream software available to all
partners. In the case of multi-band speech recognition, the
following issues will be investigated:
- Features: whether novel signal processing techniques are
better suited to multi-band recognition than conventional features.
- Level of combination: whether the sub-band information streams
should be combined on a per state level, or fluidly over a phone
or a syllable.
- Method of combination: the correct strategy for combining the
sub-bands. Different likelihood or posterior probability based
combination approaches will be tested. Furthermore, the
reliability measures defined in Task T2.2 will also be tested for
this purpose.
- Choosing sub-bands: what is the optimal number of sub-bands
and what should the cutoffs be? This is still an open issue and so
far the only way to get insights into it is to test different
possibilities. Obviously, we would like to have as many sub-bands
as possible, while keeping enough information in each band and
minimizing the correlation across bands. Findings from Task T2.2
will be useful here.
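One candidate recombination strategy, a reliability-weighted linear combination of per-stream log scores, can be sketched as follows. The weights stand in for the reliability measures of Task T2.2; the class scores and numbers are illustrative assumptions:

```python
import numpy as np

def combine_streams(stream_loglikes, weights):
    """Reliability-weighted recombination of per-stream log scores.

    stream_loglikes: shape (n_streams, n_classes), the log score
    each sub-band stream assigns to each class for one frame.
    weights: one non-negative reliability weight per stream,
    normalised here to sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(stream_loglikes)

# Three classes, two sub-band streams; stream 2 is confidently wrong.
clean = np.log([0.8, 0.15, 0.05])
noisy = np.log([0.05, 0.9, 0.05])

equal = combine_streams([clean, noisy], [0.5, 0.5])
weighted = combine_streams([clean, noisy], [0.9, 0.1])

best_equal = int(np.argmax(equal))        # equal weights: misled
best_weighted = int(np.argmax(weighted))  # down-weighting recovers
```

With equal weights the corrupted stream drags the decision to the wrong class; down-weighting it by a reliability estimate recovers the correct one, which is the motivation for coupling this task to the measures of Task T2.2.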
- For multiple time scale speech recognition, the following
issues will be investigated:
- Feature extraction: There are a number of experiments that
need to be done to determine the best form for long-term signal
representation, and to make use of some results of psycho-acoustic
experiments. This will include critical band filtering (with the
right trade off between temporal and spectral resolution) and the
modulation spectrum. More generally, our explorations may show
problems with some aspects of the long-time features we are
planning to use, and we will modify them as suggested by our
diagnostics and intuition.
- Incorporating multiple time scale units: As for subband based
ASR, this task will also involve research into levels of
combination and methods for combination.
- We will also need to modify or rewrite parts of our existing
decoders to incorporate the new models. The solutions that will be
considered and implemented here are referred to as 'HMM
combination' and the 'two-level dynamic programming algorithm'.
Task T3.4 Combining Recognition Techniques
- Task Coordinator: Sheffield
- Executing Partners: Sheffield, IDIAP, FPMs, ICP
- We will investigate a number of ways in which the techniques
we are developing might interact:
- The essence of multi-stream is several parallel feature
streams and recognisers which are eventually combined; the essence
of missing-data is modifying the probability estimation based on
additional data on feature reliability. If not all the required
features for a single stream are always available, one can
implement missing data recognition within each stream. Similarly,
the multi-stream recombination computation could be formulated in
missing data fashion.
- If missing data recognition is to be deployed within hybrid
ANN/HMM systems, it will be necessary to adapt the techniques to
produce phone probability vectors. This might be done on the basis
of already-trained statistical models or with ANN architectures
(such as Radial Basis Functions).
- Multi-stream processing might exploit the output of a scene
analysing module (Task T2.1) by treating each group as a separate
stream, with recombination points whenever new groups appear. Any
path through the resulting lattice represents a particular
assignment of groups to streams.
- In situations where some characteristics of the noise are
known, it of course makes sense to use this knowledge. For
instance in a car there will be predictable engine noise together
with other unknown noises. We will therefore investigate ways of
combining techniques like those developed by MATRA with the
missing data and multi-stream approaches developed here.
WP4 Application demonstrators and evaluation
- Work Package Manager: MATRA
- Executing Partners: all
- Each of the tasks in WP2 and WP3 has an accompanying evaluation
schedule making use of the databases introduced in task T1.1. In
addition, there is a need to develop techniques within the
framework of complete applications. Hence in WP4 we will specify
and build application demonstrators in which we will deploy
recognition modules as they become available. The demonstrators
will involve specific applications within the in-car/telephone
domains, for instance voice interaction with navigation systems.
- While the databases of T1.1 provide material for recogniser
training and for testing in the form of raw recognition
performance, the demonstrators provide different challenges - the
integration of recognition techniques into a live system with a
habitable interface and perhaps limited computing resources.
Assessment here should be based on performance while carrying out
a task, as outlined in section 2.1. We will also be able to obtain
subjective measures of performance across a wide set of conditions.
Task 4.1 Definition of application demonstrators
- Task Coordinator: Daimler-Benz
- Executing Partners: Daimler-Benz, MATRA, IDIAP
- The RESPITE industrial partners and IDIAP will define several
demonstrator applications for the in-car and cellular 'phone
domains. It will be necessary to specify the software
interfaces required for each demonstrator, so that all partners
can contribute. It will also be necessary to establish recognition
performance targets for each demonstrator. IDIAP will test the
resulting systems on their GSM database, which is particularly
well suited to testing robustness to noise and variability in
channel conditions.
Task T4.2 Demonstrator design and evaluation
- Task Coordinator: MATRA
- Executing Partners: all
- This task formalises the process of building and evaluating
demonstrator recognition applications deploying RESPITE
techniques. The model is to produce an assessment of the problems
involved in doing this after 12 months, and two generations of
demonstrators scheduled for month 24 and month 33.
- This task consists of three activities:
- integration: the various algorithmic components will be
integrated into in-house real-time interfaces developed by MATRA.
- application development: software design, implementation and
testing.
- validation: each year, all partners will contribute to the
validation of the updated system.
WP5 Dissemination of Results and Exploitation
Dissemination of results and exploitation is dealt with in detail
in section 9. Briefly,
- Scientific results will be disseminated by all the usual
channels (journals, conferences and workshops).
- A RESPITE web site will document project progress and provide
accessibility to the community.
- Training measures will be put in place for future researchers.
- An international workshop will be held around month 27.
- Exploitation paths are straightforward, through the product
range of the industrial partners and the direct link of
These pages are maintained by Jon Barker.
Last modified: Mon Dec 20 16:21:06 GMT 1999