13 December 2006
Junichi Yamagishi
Centre for Speech Technology Research, Edinburgh University, UK
Model Adaptation Approach to Speech Synthesis with Diverse Voice and Styles
In human-computer interaction and dialogue systems, it is often
necessary that text-to-speech synthesis be able to generate
natural-sounding speech in an arbitrary speaker's voice and with various
speaking styles and/or emotional expressions. For this purpose we have
developed an average-voice-based speech synthesis method using
statistical average voice models and model adaptation techniques. In
this talk, we give an overview of the speech synthesis system and
present its current performance through several experimental results.
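As background, the core of such model adaptation can be sketched (an illustrative, textbook-style MLLR-type mean transform; whether this exact formulation is used in this system is an assumption) as a linear transformation of the average-voice Gaussian means, estimated from a small amount of target-speaker data:

\hat{\mu}_m = A \mu_m + b

where \mu_m is the mean of mixture component m in the average voice model and the transform (A, b) is shared across many components, so that it can be estimated reliably from limited adaptation data.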
PDF Slides
29 November 2006
Saeed Vaseghi
Department of Electronics and Computer Engineering, School of Engineering and Design, Brunel University, UK
Speech Enhancement: Noise Reduction, Bandwidth Extension and Lost Packet Recovery
The presentation describes the outcomes of a joint EPSRC-supported
research collaboration between Brunel, UEA Norwich, ISVR Southampton and
Queen's Belfast. During the course of this work a DeNoise ToolBox was
developed capable of noise reduction, bandwidth extension and
prediction/interpolation of lost speech segments. The speech enhancement
tools are based on several methodologies including Bayesian inference and
the use of LP_HNM (linear prediction combined with harmonic plus noise
models) with Kalman filters.
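For readers unfamiliar with harmonic plus noise models, a generic decomposition (an illustrative form; the exact LP_HNM parameterisation used in the toolbox may differ) writes a voiced speech segment as

s(t) = \sum_{k=1}^{K} a_k \cos(2 \pi k f_0 t + \phi_k) + n(t)

where the sum of harmonics at multiples of the fundamental frequency f_0 models the voiced component and n(t) the noise-like residual; a Kalman filter can, for example, be used to track the slowly varying model parameters over time.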
15 November 2006
Yasser H. Abdel-Haleem
Department of Computer Science, University of Sheffield, UK
Conditional Random Fields for Continuous Speech Recognition
Acoustic modelling based on Hidden Markov Models (HMMs) is employed
by state-of-the-art stochastic speech recognition systems. Although
HMMs are a natural choice to warp the time axis and model the temporal
phenomena in the speech signal, they do not model the spectral
phenomena well. This is a consequence of their conditional
independence assumptions, which are inadequate for sequential
processing.
In this work, a new acoustic modelling paradigm based on
Conditional Random Fields (CRFs) is investigated and developed. This
paradigm addresses some of the weak aspects of HMMs while maintaining
many of the good aspects, which have made them successful. In
particular, the acoustic modelling problem is reformulated in a
data-driven, sparse, augmented space to increase discrimination. In
addition, acoustic context modelling is explicitly integrated to
handle the sequential phenomena of the speech signal. In a phone
recognition task, test results have shown significant improvements
over comparable HMM systems.
While the theoretical motivations behind this work are attractive,
a practical implementation of these complex systems is demanding. The
main goal of this work is to present an efficient framework for
estimating these models that ensures scalability and generality for
large vocabulary speech recognition systems.
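As background for readers new to CRFs, a linear-chain CRF (the standard textbook form, not necessarily the exact augmented-space model developed in this work) defines the conditional probability of a label sequence y given the observation sequence x as

p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_t \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t) \right)

where the f_k are feature functions, the \lambda_k their weights, and Z(x) a normaliser computed over all label sequences for the given x. Because the model is normalised conditionally on the whole observation sequence, it avoids the restrictive conditional independence assumptions of HMMs discussed above.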
26 September 2006
Sadaoki Furui
Department of Computer Science, Tokyo Institute of Technology, Japan
Why is automatic recognition of spontaneous speech so difficult?
Although speech derived from reading texts, and similar types of
speech, e.g. that from reading newspapers or from news
broadcasts, can be recognized with high accuracy, recognition
accuracy drastically decreases for spontaneous speech. This is due
to the fact that spontaneous speech and read speech are
significantly different acoustically as well as linguistically.
This talk reports analysis and recognition of spontaneous speech
using a large-scale spontaneous speech database "Corpus of
Spontaneous Japanese (CSJ)". Recognition results in this experiment
show that recognition accuracy significantly increases as a function
of the size of acoustic as well as language model training data and
the improvement levels off at approximately 7M words of training
data. This means that a very large corpus is needed to encompass
the huge linguistic and acoustic variations which occur in
spontaneous speech. Spectral analysis using various styles of
utterances in the CSJ shows that the spectral
distribution/difference of phonemes is significantly reduced in
spontaneous speech compared to read speech. Experimental results
also show that there is a strong correlation between mean spectral
distance between phonemes and phoneme recognition accuracy. This
indicates that spectral reduction is one major reason for the
decrease of recognition accuracy of spontaneous speech.
Comparative analysis of statistical language models for written
language, including newspaper articles, and spontaneous speech shows
that there is a significant difference between written language and
spontaneous speech in terms of observation frequency of each
part-of-speech and perplexity.
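For reference, the perplexity mentioned above is the standard language-model measure (included here only as a textbook definition):

PP = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \ldots, w_{i-1}) \right)

i.e. the inverse geometric mean of the probabilities the model assigns to a test text of N words; higher perplexity indicates that the model finds the text less predictable.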
6 September 2006
Matt Gibson
Department of Computer Science, University of Sheffield, UK
Hypothesis Spaces For Minimum Bayes Risk Training In Large Vocabulary Speech Recognition
The Minimum Bayes Risk (MBR) parameter estimation
framework has been a successful strategy for training hidden
Markov models in large vocabulary speech recognition.
Practical implementations of MBR must select an appropriate
"hypothesis space". Typically the hypothesis space is the set
of all word sequences but recent work in minimum phone error
(MPE) training has reported improved generalisation when
using the set of all phone sequences.
This work further examines the generalisation of MBR
training using a range of hypothesis spaces defined via the
system components of word, phone, physical triphone, physical
state and physical mixture component.
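As background, the MBR criterion (written here in a generic form; the losses and lattice-based approximations used in practice involve further detail) chooses parameters to minimise the expected loss over a hypothesis space \mathcal{H}:

\hat{\lambda} = \arg\min_{\lambda} \sum_r \sum_{h \in \mathcal{H}} P_{\lambda}(h \mid O_r) \, L(h, h_r)

where O_r is the r-th training utterance, h_r its reference transcription and L a loss such as word or phone error. MPE training corresponds to taking phone sequences as the hypothesis space with a phone-accuracy-based loss; the work above compares this choice with word, triphone, state and mixture-component spaces.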
5 July 2006
Iain Murray
Applied Computing, University of Dundee, UK
Expressive Speech Synthesis
Is newer really better? Modern synthetic speech systems offer
excellent intelligibility and naturalness, but with current
synthesiser technology it is more difficult to add variability into
the synthesis, such as alternative "voice personalities" and
emotion effects. This talk will look at ways forward for more
expressive speech synthesis.
14 June 2006
Oscar Saz
Department of Computer Science, University of Sheffield, UK
Visiting PhD Student from the University of Zaragoza, Spain
Speech Technologies at the University of Zaragoza
This is a short talk about the Speech Technologies
Group at the University of Zaragoza, Spain, focussing
on clinical applications of speech technologies and
research in the area of pathological speech.
PPT Slides
24 May 2006
Colin Breithaupt
Department of Computer Science, University of Sheffield, UK
Visiting researcher from the Institute of Communication Acoustics, Ruhr-University Bochum, Germany
Speech feature analysis for robust automatic speech recognition
Automatic speech recognition in noisy environments can be made
robust by the use of noise reduction frontends that reduce the noise
level in the noisy time signal. Nevertheless, it can be observed that
improved noise reduction in terms of the segmental
signal-to-noise ratio does not inevitably lead to better
recognition results. Noise reduction filters seem to be limited in
their ability to increase the recognition rate by reducing the
noise. A statistical analysis reveals that the limiting factor is the
increased variance of the estimated clean speech features that comes with the
noise reduction.
I will try to show that recognition based on hidden Markov models
is more sensitive to an increase in the variance of the speech
features than to a shift of the feature mean as caused by
residual additive noise. This leads to a trade-off between noise
reduction and increased feature variance.
Additionally, speech features that have a lower variance than the
training data lead to higher recognition scores as long as their mean
value is similar to that of the training data. A statistical analysis
shows that, in contrast to spectral features, cepstral features
exhibit a reduced variance in the case of additive noise. It seems as though
Mel-frequency cepstral coefficients are a good representation
of the speech signal if simple model adaptation or compensation techniques
based on cepstral mean correction are applied.
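As an illustration of the cepstral mean correction mentioned above, here is a minimal sketch of per-utterance cepstral mean normalisation (a toy example under stated assumptions, not the specific compensation scheme evaluated in the talk):

import numpy as np

def cepstral_mean_normalise(mfcc):
    """Subtract the per-utterance mean from each cepstral coefficient.

    mfcc: array of shape (num_frames, num_coeffs), e.g. 13 MFCCs per frame.
    Removing the mean compensates for a shift of the feature mean, such as
    that caused by residual additive noise, as discussed above.
    """
    return mfcc - mfcc.mean(axis=0, keepdims=True)

# Example: normalise a random "utterance" of 200 frames x 13 coefficients.
frames = np.random.randn(200, 13)
normalised = cepstral_mean_normalise(frames)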
10 May 2006
Peter Howell
Department of Psychology, University College London, UK
A model of fluency breakdown based on data from speakers who stutter
The sorts of fluency breakdowns that occur in stuttered
speech also occur in fluent speakers' speech. Accounting for how
these breakdowns occur in stuttered speech may 1) provide an
answer to how similar events arise in fluent speech and 2)
indicate what is different about fluent speech. Several
different modelling approaches have been proposed. The theories
contrast in 1) whether they focus on the language or the motor
system and 2) whether perceptual information is essential for
appropriate control (through feedback and feedforward
mechanisms). Selected versions of these models are critiqued.
Howell's EXPLAN model is then described which maintains 1) that
problems in synchronizing language and motor processes give
rise to fluency problems and 2) links between the perceptual
and production mechanisms are not essential. Evidence in
support of the model is presented including speech control
across development in fluent speakers and stutterers,
measurement of loci of difficulty in speech, the effect of
speech rate on fluency control, the interaction between
difficulty and speech rate, timing functions of the
speech-motor system and the effects of altered auditory
feedback on speech control.
3 May 2006
Philip Jackson
Centre for Vision Speech and Signal Processing, University of Surrey, UK
Amplitude modulation of frication by voicing: acoustics and perception
The talk describes a speech production study that has motivated
the development of a signal-processing technique for measuring
modulated noise, and a series of perceptual tests.
The two principal sources of sound in speech, voicing and
frication, occur simultaneously in voiced fricatives as well as at
the /VF/ boundary in phonologically voiceless fricatives. Instead
of simply overlapping, the two sources interact. This talk presents
an acoustic study of one such interaction effect: the amplitude
modulation of the frication component when voicing is present.
Corpora of sustained and fluent-speech English fricatives were
recorded and analyzed using a signal-processing technique designed
to extract estimates of modulation depth.
To investigate the modulation's contribution to fricative
auditory quality, AM white noise, with a simultaneous sinusoidal
component at the modulating frequency, provided stimuli for
perceptual tests. Two AM detection-threshold experiments were
conducted to establish the effect of varying the relative amplitude
and phase of the tone. In the first experiment, tone and noise
stimuli were separated within each trial by short pauses; in the
second, the tone played continuously throughout the trial.
Preliminary results of the AM detection threshold experiments will
be discussed in relation to AM's influence on quality and
categorisation of fricatives.
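A minimal sketch of how the depth of amplitude modulation of a noise signal might be measured (one plausible envelope-based approach, offered purely as an illustration; the function name and the use of the analytic-signal envelope are assumptions, and the technique developed for this study may differ):

import numpy as np
from scipy.signal import hilbert

def modulation_depth(frication, f0, fs):
    """Estimate the depth of amplitude modulation at frequency f0 (Hz).

    frication: 1-D array containing the (high-pass filtered) frication signal
    fs: sampling rate in Hz
    Returns the magnitude of the envelope's f0 component relative to its mean.
    """
    envelope = np.abs(hilbert(frication))              # amplitude envelope
    t = np.arange(len(envelope)) / fs
    # Project the envelope onto a complex exponential at the voicing frequency.
    c = np.mean(envelope * np.exp(-2j * np.pi * f0 * t))
    return 2.0 * np.abs(c) / np.mean(envelope)

# Example: white noise modulated at 120 Hz with 40% depth; the estimate
# should come out in the vicinity of 0.4.
fs, f0 = 16000, 120.0
t = np.arange(fs) / fs
noise = np.random.randn(fs) * (1.0 + 0.4 * np.cos(2 * np.pi * f0 * t))
print(modulation_depth(noise, f0, fs))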
PPT Slides
hi-res fricative plot
26 April 2006
Mahesan Niranjan
Department of Computer Science, University of Sheffield, UK
Introduction to Sequential Monte Carlo methods and their use in estimating formants
Sequential Monte Carlo methods (or Particle
Filters) offer a framework for learning and inference in
nonlinear, non-stationary environments. They have recently
been applied to a range of interesting problems in signal
processing, robotics, computer vision and control. They
allow the propagation of probabilistic models in an on-line
fashion. This talk will be a tutorial on particle filters
using my current work on formant estimation as an
example.
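For those new to the topic, here is a minimal bootstrap particle filter for a generic scalar state-space model (a toy sketch to illustrate the idea; the random-walk dynamics and noise levels are assumptions, not the formant-tracking model used in the talk):

import numpy as np

def bootstrap_particle_filter(observations, num_particles=500,
                              process_std=0.1, obs_std=0.5):
    """Track a latent state x_t with random-walk dynamics from noisy observations.

    Assumed model (for illustration only):
        x_t = x_{t-1} + process noise,   y_t = x_t + observation noise.
    Returns the posterior mean estimate of x_t at each time step.
    """
    particles = np.random.randn(num_particles)          # initial particle cloud
    estimates = []
    for y in observations:
        # Propagate particles through the (in general nonlinear) dynamics.
        particles = particles + process_std * np.random.randn(num_particles)
        # Weight each particle by the observation likelihood.
        weights = np.exp(-0.5 * ((y - particles) / obs_std) ** 2)
        weights /= weights.sum()
        estimates.append(np.sum(weights * particles))
        # Resample to avoid weight degeneracy.
        idx = np.random.choice(num_particles, num_particles, p=weights)
        particles = particles[idx]
    return np.array(estimates)

# Example: a slowly drifting latent value observed in noise.
true_state = np.cumsum(0.05 * np.random.randn(100))
noisy_obs = true_state + 0.5 * np.random.randn(100)
tracked = bootstrap_particle_filter(noisy_obs)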
PPT Slides
29 March 2006
James Carmichael
Department of Computer Science, University of Sheffield, UK
Quantifying Speech Disorder Diagnosis on the Cheap - Computerising the Frenchay Dysarthria Assessment Tests
Consistent diagnosis of speech disorders is frequently
hampered by a lack of consensus among experts vis-à-vis the
interpretation of assessment data. Inconsistent
classification/reclassification by a given evaluator for the same
data is not uncommon and can result in incorrect evaluation. This
presentation reports on the implementation of a computer-based
objective metric system for the diagnosis of dysarthria - a
speech disorder resulting from impaired control of the
articulators. A series of acoustic measurement algorithms is
proposed, one of which utilises population sampling techniques to
generate statistical models for a specified vocabulary. These
models are then used to return likelihood scores which correlate
with subjective assessments of the intelligibility of a given
speaker. We discuss the constraints of implementing such an
application as a practical real-world tool to be used by
clinicians, who often have access to only minimal
computational resources.
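One way such likelihood-based scores can be produced is sketched below (a hypothetical illustration using a Gaussian mixture model over acoustic features; the function name, feature choice and model are assumptions, and the metrics actually proposed in the talk may differ):

import numpy as np
from sklearn.mixture import GaussianMixture

def intelligibility_score(reference_features, test_features, num_components=8):
    """Score a test speaker's productions against a reference population model.

    reference_features: (N, D) acoustic features (e.g. MFCCs) pooled from
        reference speakers producing the specified vocabulary.
    test_features: (M, D) features from the speaker being assessed.
    Returns the average per-frame log-likelihood of the test data under the
    reference model; lower scores suggest productions further from the norm.
    """
    reference_model = GaussianMixture(n_components=num_components,
                                      covariance_type='diag')
    reference_model.fit(reference_features)
    return reference_model.score(test_features)

# Example with random placeholder data (13-dimensional features).
ref = np.random.randn(1000, 13)
test = np.random.randn(50, 13)
print(intelligibility_score(ref, test))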
PPT Slides
Associated sound clips:
22 March 2006
Dennis Norris
Cognition and Brain Sciences Unit, Cambridge, UK
Perceptual learning in speech
The speech perception system must be flexible in responding
to the variability in speech sounds caused by differences among
speakers and by language change over the lifespan of the
listener. One potentially valuable source of information that
can help listeners to adapt to this variability is to make use
of lexical information to retune perceptual categories - if you
know what the word is, you can use this knowledge to map new
speech sounds onto existing perceptual categories. A number of
experiments show that listeners can perform this perceptual
retuning very rapidly. I will argue that the data refute the
currently popular view that lexical representations are purely
'episodic' (based on 'exemplars').
PPT Slides.
Associated sound clips and papers:
8 March 2006
John Bridle
Novauris Technologies Ltd, UK
It Keeps them on the Knife: Interpretations of HMMs with 'Dynamic Observations'
For many years it has been standard practice to use augmented
'observation' vectors for HMMs for speech recognition. It makes the
conditional independence assumption even less credible, but it
improves accuracy. Is there still a generative stochastic model of
speech patterns in there? What does the output look like? Does it
matter? Are there clues to better speech recognition? This talk
explores some possible attitudes to these questions, but does not
necessarily provide many answers. We touch on the Trajectory HMM
formulation, products of mixtures of Gaussians, and Gibbs sampling.
Illustrated with real-time graphics.
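For readers unfamiliar with the augmented "dynamic" observations mentioned above, here is a minimal sketch of the standard delta-coefficient computation typically appended to static features (the usual regression formula; the window length and any particular system's details are assumptions):

import numpy as np

def delta_features(static, window=2):
    """Compute delta (dynamic) coefficients for a (num_frames, num_coeffs) array.

    Uses the standard regression formula
        d_t = sum_{k=1..window} k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with frames replicated at the utterance edges. The deltas are appended to
    the static coefficients to form the augmented observation vector.
    """
    padded = np.pad(static, ((window, window), (0, 0)), mode='edge')
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = np.zeros_like(static)
    for k in range(1, window + 1):
        deltas += k * (padded[window + k:window + k + len(static)]
                       - padded[window - k:window - k + len(static)])
    return deltas / denom

# Augmented observations: static MFCCs plus their deltas.
mfcc = np.random.randn(200, 13)
augmented = np.hstack([mfcc, delta_features(mfcc)])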
PPT Slides
1 March 2006
Torsten Dau
Centre for Applied Hearing Research, Technical University of Denmark
Modeling spectro-temporal processing in the auditory system
Many real-life auditory stimuli have intensity peaks and valleys as a
function of time in which intensity trajectories are highly correlated
across frequency. This is true of speech, of interfering noise such as
"cafeteria" noise and of many other kinds of environmental stimuli.
Across-frequency comparisons of temporal envelopes are a general
feature of auditory pattern analysis and play an important role in
extracting signals from noise backgrounds, or in separating competing
sources of sound. For example, comodulation of different frequency
bands in background noise facilitates the detection of tones in noise,
a phenomenon known as comodulation masking release (CMR). The
perception of complex sounds is critically dependent on the faithful
representation of the signal's modulations in the auditory
system. While various modulation-detection and masking data as well as
speech-intelligibility data can be accounted for nicely by current
models, effects associated with auditory object perception have not
yet been described adequately. As an example, the influence of
concurrent and sequential streaming in CMR and in binaural unmasking
will be illustrated. These experiments provide constraints for future
models of auditory signal processing. If the model's transformation
of acoustic stimuli in complex environments into their internal
spectro-temporal representation is correct, such a model could be
of interest for technical applications in speech and audio coding,
automatic speech recognition, and digital hearing aids.
23 February 2006
Robert Kirchner
Department of Linguistics, University of Alberta, Edmonton, Canada
Exemplar-Based Speech Processing and Phonological Learning
Over the past decade, evidence has accumulated, from a range of
research domains, that phonological patterns, including patterns
relating to fine phonetic detail, are token-frequency-sensitive,
motivating an exemplar-based speech processing model (see e.g. Bybee
2001, Pierrehumbert 2002). In this talk, I review this evidence,
against the background of standard phonological assumptions. I then
consider how an exemplar-based model might capture not only these
frequency effects, but also the core observations of phonological
theory: in brief, the model must be capable of detecting phonological
patterns over phonetic signals, and extending these patterns to novel
words (i.e. phonological learning). The move to such an approach also
promises to make phonological theory more immediately relevant to the
concerns of phonetics, psychology, and computer science, perhaps
spurring greater interdisciplinary collaboration on spoken language
processing. In this spirit, I present some attempts on my part at
developing and implementing such a model, and invite feedback from
those with a deeper understanding of automatic speech processing
techniques.
Bybee, J. (2001) Phonology and Language Use. Cambridge University
Press, Cambridge UK.
Pierrehumbert, J. (2002) Word-specific phonetics, in C. Gussenhoven
& N. Warner (eds.), Laboratory Phonology VII, Berlin: Mouton de
Gruyter. 101-140.
22 February 2006
Martin Russell
Department of Electronic, Electrical and Computer Engineering, University of Birmingham, UK
A Data-Driven Analysis of Vowels in the ABI (Accents of the British Isles) Speech Corpus
The ABI (Accents of the British Isles) corpus was collected by the
University of Birmingham in 2003. It comprises speech from nearly 300
speakers from 14 different towns in the British Isles, representing 14
distinct regional accents. The corpus was collected on location in
each town, and consists of approximately 20 minutes of recordings from
10 male and 10 female subjects, aged between 18 and 60, who were born
in that town and had lived there all of their lives. In total there
are nearly 100 hours of recordings. The recordings include read
material (simple commands, digit and letter sequences, SCRIBE sentences,
short passages, and simple syllables to explore the pronunciation of
vowels) as well as a small amount of spontaneous speech.
In the talk I will describe the data collection process and some of
the lessons which were learnt. I will present the results of a
data-driven analysis of the vowel sounds produced by the subjects from
the different locations in the ABI corpus. I will describe some more
recent experiments aimed at understanding the underlying acoustic
factors which account for variation between accents. Finally I will
outline future plans for extending the corpus.
15 February 2006
Christoph Draxler
University of Munich, Germany
SpeechRecorder - a Platform-Independent Tool for Speech Recordings via the WWW
SpeechRecorder is a free and platform-independent application for performing speech recordings. The contents of a recording session are defined in an XML-formatted recording script; the prompt material can be text, images, or audio. The software supports multiple displays:
the speaker view contains only the prompt item and recording control elements; the experimenter view contains the speaker view plus a signal display, level meters, and the entire recording script.
Recordings can be made either to the local hard disk or to a server
via the WWW. This allows geographically distributed recording of
high-quality speech.
Three projects using SpeechRecorder will be presented:
- BITS Speech Synthesis database,
- Ph@ttSessionz speech database of adolescent speakers, collected in schools all over Germany, and
- MVP recordings of aphasia patients in a clinical environment.
This talk complements the NLP seminar of the previous day.
14 February 2006
Christoph Draxler
University of Munich, Germany
WebTranscribe - A framework for web-based speech annotation
WebTranscribe is a platform-independent, web-based annotation
framework for the creation and exploitation of speech databases. The
framework consists of an editor front-end that runs on client
computers, and a server for data administration and DBMS storage. The
annotation capabilities are implemented via editor plug-ins, and a
typical annotation configuration consists of several such plug-ins
(visual display of the speech signal, annotation text editor, editing
buttons, meta-data display panel, etc.). WebTranscribe is implemented
as a Java Web Start application to eliminate browser
incompatibilities. A number of localized editor plug-ins have been
developed, and the software is freely available.
There is a Spandh seminar on the following day that complements
this.
8 February 2006
Simon King
Centre for Speech Technology Research, University of Edinburgh, UK
Dynamic Bayesian Networks: the new framework for ASR research?
This talk has three parts. I will start with a brief introduction
to DBNs for those who are not familiar with them and explain how the
model structure expresses a complete recognizer (from observations to
words and from output densities to language model) in a single
graphical representation. It thus exposes model components which are
usually hardwired into the source code, making it much easier to
experiment with novel configurations.
In the second part of the talk I will introduce a topic neglected
in the classic textbooks on DBNs: triangulation. This is the process
that takes us from a DBN to the structure that is needed for
inference: the junction tree. The Baum-Welch algorithm for HMMs is
essentially a manual derivation of the junction tree. A good
triangulation can make even complex models computationally feasible;
experimental results for a variety of models will be shown to support
this.
In the last part of the talk I will present a case study in which a
DBN is used for articulatory feature extraction. This model
outperforms a set of parallel HMMs on the same task by introducing
dependencies between the states, something only possible with a
DBN.
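As a concrete illustration of what expressing a recogniser in a single graphical representation means, a standard HMM written as a DBN factorises (a textbook example, not the specific models shown in the talk) as

p(o_{1:T}, q_{1:T}) = p(q_1)\, p(o_1 \mid q_1) \prod_{t=2}^{T} p(q_t \mid q_{t-1})\, p(o_t \mid q_t)

where the transition term p(q_t \mid q_{t-1}) and the output density p(o_t \mid q_t) appear as explicit, separately replaceable components; richer DBNs add further variables (words, articulatory features, noise conditions) and their conditional distributions to the same graph.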
PDF Slides