SpandH seminar abstracts 2006

13 December 2006

Junichi Yamagishi

Centre for Speech Technology Research, Edinburgh University, UK

Model Adaptation Approach to Speech Synthesis with Diverse Voice and Styles

In human-computer interaction and dialogue systems, text-to-speech synthesis often needs to generate natural-sounding speech in an arbitrary speaker's voice and with various speaking styles and/or emotional expressions. We have developed an average-voice-based speech synthesis method using statistical average voice models and model adaptation techniques for this purpose. In this talk, we give an overview of the speech synthesis system and show its current performance through several experimental results.

PDF Slides

29 November 2006

Saeed Vaseghi

Department of Electronics and Computer Engineering, School of Engineering and Design, Brunel University, UK

Speech Enhancement: Noise Reduction, Bandwidth Extension and Lost Packet Recovery

The presentation describes the outcomes of a joint EPSRC-supported research collaboration between Brunel, UEA Norwich, ISVR Southampton and Queen's Belfast. During the course of this work a DeNoise ToolBox was developed, capable of noise reduction, bandwidth extension and prediction/interpolation of lost speech segments. The speech enhancement tools are based on several methodologies including Bayesian inference and the use of LP_HNM (linear prediction combined with harmonic plus noise models) with Kalman filters.

15 November 2006

Yasser H. Abdel-Haleem

Department of Computer Science, University of Sheffield, UK

Conditional Random Fields for Continuous Speech Recognition

Acoustic modelling based on Hidden Markov Models (HMMs) is employed by state-of-the-art stochastic speech recognition systems. Although HMMs are a natural choice to warp the time axis and model the temporal phenomena in the speech signal, they do not model the spectral phenomena well. This is a consequence of their conditional independence assumptions, which are inadequate for sequential processing.

In this work, a new acoustic modelling paradigm based on Conditional Random Fields (CRFs) is investigated and developed. This paradigm addresses some of the weak aspects of HMMs while maintaining many of the good aspects, which have made them successful. In particular, the acoustic modelling problem is reformulated in a data driven, sparse, augmented space to increase discrimination. In addition, acoustic context modelling is explicitly integrated to handle the sequential phenomena of the speech signal. In a phone recognition task, test results have shown significant improvements over comparable HMM systems.

While the theoretical motivations behind this work are attractive, a practical implementation of these complex systems is demanding. The main goal of this work is to present an efficient framework for estimating these models that ensures scalability and generality for large vocabulary speech recognition systems.

26 September 2006

Sadaoki Furui

Department of Computer Science, Tokyo Institute of Technology, Japan

Why is automatic recognition of spontaneous speech so difficult?

Although speech derived from reading texts, and similar types of speech, e.g. from reading newspapers or from news broadcasts, can be recognized with high accuracy, recognition accuracy drastically decreases for spontaneous speech. This is because spontaneous speech and read speech are significantly different acoustically as well as linguistically. This talk reports analysis and recognition of spontaneous speech using a large-scale spontaneous speech database, the "Corpus of Spontaneous Japanese (CSJ)". Recognition results in this experiment show that recognition accuracy significantly increases as a function of the size of acoustic as well as language model training data and the improvement levels off at approximately 7M words of training data. This means that a very large corpus is needed to encompass the huge linguistic and acoustic variations which occur in spontaneous speech. Spectral analysis using various styles of utterances in the CSJ shows that the spectral distribution/difference of phonemes is significantly reduced in spontaneous speech compared to read speech. Experimental results also show that there is a strong correlation between mean spectral distance between phonemes and phoneme recognition accuracy. This indicates that spectral reduction is one major reason for the decrease of recognition accuracy of spontaneous speech. Comparative analysis of statistical language models for written language, including newspaper articles, and spontaneous speech shows that there is a significant difference between written language and spontaneous speech in terms of observation frequency of each part-of-speech and perplexity.

6 September 2006

Matt Gibson

Department of Computer Science, University of Sheffield, UK

Hypothesis Spaces For Minimum Bayes Risk Training In Large Vocabulary Speech Recognition

The Minimum Bayes Risk (MBR) parameter estimation framework has been a successful strategy for training hidden Markov models in large vocabulary speech recognition. Practical implementations of MBR must select an appropriate "hypothesis space". Typically the hypothesis space is the set of all word sequences but recent work in minimum phone error (MPE) training has reported improved generalisation when using the set of all phone sequences.

This work further examines the generalisation of MBR training using a range of hypothesis spaces defined via the system components of word, phone, physical triphone, physical state and physical mixture component.

5 July 2006

Iain Murray

Applied Computing, University of Dundee, UK

Expressive Speech Synthesis

Is newer really better? Modern synthetic speech systems offer excellent intelligibility and naturalness, but with current synthesiser technology it is more difficult to add variability into the synthesis, such as alternative "voice personalities" and emotion effects. This talk will look at ways forward for more expressive speech synthesis.

14 June 2006

Oscar Saz

Department of Computer Science, University of Sheffield, UK
Visiting PhD Student from the University of Zaragoza, Spain

Speech Technologies at the University of Zaragoza

This is a short talk about the Speech Technologies Group at the University of Zaragoza, Spain, focussing on clinical applications of speech technologies and research in the area of pathological speech.

PPT Slides

24 May 2006

Colin Breithaupt

Department of Computer Science, University of Sheffield, UK
Visiting researcher from the Institute of Communication Acoustics, Ruhr-University Bochum, Germany

Speech feature analysis for robust automatic speech recognition

Automatic speech recognition in noisy environments can be made robust by the use of noise reduction front-ends that reduce the noise level in the noisy time signal. Nevertheless, it can be observed that improved noise reduction, in the sense of the segmental signal-to-noise ratio, does not inevitably lead to better recognition results. Noise reduction filters seem to be limited in their ability to increase the recognition rate by reducing the noise. A statistical analysis reveals that the limiting factor is the increased variance of the estimated clean speech features that comes with the noise reduction.

I will try to show that recognition based on hidden Markov models is more sensitive to an increase in the variance of the speech features than to a shift of the feature mean as caused by residual additive noise. This leads to a trade-off between noise reduction and increased feature variance.

Additionally, speech features that have a lower variance than the training data lead to higher recognition scores as long as their mean value is similar to that of the training data. A statistical analysis shows that, in contrast to spectral features, cepstral features exhibit a reduced variance in the case of additive noise. It seems as though Mel-frequency cepstral coefficients are a good representation of the speech signal if simple model adaptation or compensation techniques based on cepstral mean correction are applied.
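As a rough illustration of what cepstral mean correction can look like, the sketch below subtracts the per-coefficient mean over an utterance from every frame, so that a constant offset in the cepstral domain is removed. This is a generic textbook form and not necessarily the compensation used in the talk; the function name `cepstral_mean_normalise` and the toy frame values are assumptions made for this example.

```python
from typing import List


def cepstral_mean_normalise(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean over the utterance from every frame.

    `frames` is a list of feature vectors (one per time frame); the name and
    layout are hypothetical, chosen only for this illustration.
    """
    if not frames:
        return []
    n_coeffs = len(frames[0])
    # Mean of each cepstral coefficient across all frames of the utterance.
    means = [sum(f[k] for f in frames) / len(frames) for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Toy "cepstral" frames, and the same frames shifted by a constant offset
# (standing in for a fixed mean shift of the features).
clean = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]
offset = [0.5, -1.5]
shifted = [[c + o for c, o in zip(frame, offset)] for frame in clean]

norm_clean = cepstral_mean_normalise(clean)
norm_shifted = cepstral_mean_normalise(shifted)
```

After normalisation the clean and shifted utterances coincide up to rounding, which is the sense in which a mean shift is compensated; the feature variance is left unchanged by this step.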

10 May 2006

Peter Howell

Department of Psychology, University College London, UK

A model of fluency breakdown based on data from speakers who stutter

The sorts of fluency breakdowns that occur in stuttered speech also occur in fluent speakers' speech. Accounting for how these breakdowns occur in stuttered speech may 1) provide an answer to how similar events arise in fluent speech and 2) indicate what is different about fluent speech. Several different modelling approaches have been proposed. The theories contrast in 1) whether they focus on the language or the motor system and 2) whether perceptual information is essential for appropriate control (through feedback and feedforward mechanisms). Selected versions of these models are critiqued. Howell's EXPLAN model is then described, which maintains 1) that problems in synchronizing language and motor processes give rise to fluency problems and 2) that links between the perceptual and production mechanisms are not essential. Evidence in support of the model is presented including speech control across development in fluent speakers and stutterers, measurement of loci of difficulty in speech, the effect of speech rate on fluency control, the interaction between difficulty and speech rate, timing functions of the speech-motor system and the effects of altered auditory feedback on speech control.

3 May 2006

Philip Jackson

Centre for Vision Speech and Signal Processing, University of Surrey, UK

Amplitude modulation of frication by voicing: acoustics and perception

The talk describes a speech production study that has motivated the development of a signal-processing technique for measuring modulated noise, and a series of perceptual tests.

The two principal sources of sound in speech, voicing and frication, occur simultaneously in voiced fricatives as well as at the /VF/ boundary in phonologically voiceless fricatives. Instead of simply overlapping, the two sources interact. This talk presents an acoustic study of one such interaction effect: the amplitude modulation of the frication component when voicing is present. Corpora of sustained and fluent-speech English fricatives were recorded and analyzed using a signal-processing technique designed to extract estimates of modulation depth.

To investigate the modulation's contribution to fricative auditory quality, AM white noise, with a simultaneous sinusoidal component at the modulating frequency, provided stimuli for perceptual tests. Two AM detection-threshold experiments were conducted to establish the effect of varying the relative amplitude and phase of the tone. In the first experiment, tone and noise stimuli were separated within each trial by short pauses; in the second, the tone played continuously throughout the trial. Preliminary results of the AM detection-threshold experiments will be discussed in relation to AM's influence on quality and categorisation of fricatives.
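A stimulus of the kind described above, amplitude-modulated white noise plus a tone at the modulating frequency, could be generated along the following lines. This is only a minimal sketch: the function name `am_noise_with_tone`, the sinusoidal envelope, and all default values (duration, sample rate, modulation frequency, depth, tone amplitude and phase) are assumptions for illustration, not the settings used in the experiments.

```python
import math
import random
from typing import List


def am_noise_with_tone(duration_s: float = 0.5,
                       sample_rate: int = 16000,
                       mod_freq_hz: float = 125.0,
                       mod_depth: float = 0.5,
                       tone_amp: float = 0.1,
                       tone_phase: float = 0.0,
                       seed: int = 0) -> List[float]:
    """Return Gaussian white noise with a sinusoidal amplitude envelope,
    plus a sinusoid at the modulating frequency (all values hypothetical)."""
    rng = random.Random(seed)
    n_samples = int(duration_s * sample_rate)
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        noise = rng.gauss(0.0, 1.0)
        # Envelope 1 + m*sin(2*pi*f*t); m is the modulation depth.
        envelope = 1.0 + mod_depth * math.sin(2.0 * math.pi * mod_freq_hz * t)
        # Tone at the modulating frequency, with adjustable relative phase.
        tone = tone_amp * math.sin(2.0 * math.pi * mod_freq_hz * t + tone_phase)
        samples.append(envelope * noise + tone)
    return samples


stimulus = am_noise_with_tone()
```

Varying `tone_amp` and `tone_phase` while holding the noise fixed corresponds to the manipulation of relative amplitude and phase mentioned in the abstract; setting `mod_depth` and `tone_amp` to zero leaves plain white noise.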

PPT Slides
hi-res fricative plot

26 April 2006

Mahesan Niranjan

Department of Computer Science, University of Sheffield, UK

Introduction to Sequential Monte Carlo methods and their use in estimating formants

Sequential Monte Carlo methods (or Particle Filters) offer a framework for learning and inference in nonlinear, non-stationary environments. They have recently been applied to a range of interesting problems in signal processing, robotics, computer vision and control. They allow the propagation of probabilistic models in an on-line fashion. This talk will be a tutorial on particle filters using my current work on formant estimation as an example.
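For readers new to the idea, a bootstrap particle filter can be sketched in a few lines. The example below tracks a scalar state that follows a Gaussian random walk and is observed in Gaussian noise; this toy model, the function name `bootstrap_particle_filter`, and all parameter values are assumptions for illustration and are unrelated to the formant model discussed in the talk.

```python
import math
import random
from typing import List


def bootstrap_particle_filter(observations: List[float],
                              n_particles: int = 500,
                              process_std: float = 0.5,
                              obs_std: float = 1.0,
                              seed: int = 0) -> List[float]:
    """Return the posterior-mean state estimate after each observation."""
    rng = random.Random(seed)
    # Broad initial belief about the hidden state (assumed prior).
    particles = [rng.gauss(0.0, 5.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Predict: propagate each particle through the random-walk dynamics.
        particles = [x + rng.gauss(0.0, process_std) for x in particles]
        # Weight: Gaussian likelihood of the observation given each particle.
        weights = [math.exp(-0.5 * ((y - x) / obs_std) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            # All likelihoods underflowed; fall back to uniform weights.
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Resample: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Thirty identical observations; the estimate should settle near the observed value.
estimates = bootstrap_particle_filter([5.0] * 30)
```

Each observation is processed as it arrives (predict, weight, resample), which is what makes the method suitable for on-line use; the model here is linear and Gaussian only to keep the sketch short, whereas the appeal of particle filters lies in the nonlinear, non-Gaussian case.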

PPT Slides

29 March 2006

James Carmichael

Department of Computer Science, University of Sheffield, UK

Quantifying Speech Disorder Diagnosis on the Cheap - Computerising the Frenchay Dysarthria Assessment Tests

Consistent diagnosis of speech disorders is frequently hampered by a lack of consensus among experts vis-à-vis the interpretation of assessment data. Inconsistent classification/reclassification by a given evaluator for the same data is not uncommon and can result in incorrect evaluation. This presentation reports on the implementation of a computer-based objective metric system for the diagnosis of dysarthria, a speech disorder resulting from impaired control of the articulators. A series of acoustic measurement algorithms are proposed, one of which utilises population sampling techniques to generate statistical models for a specified vocabulary. These models are then used to return likelihood scores which correlate with subjective assessments of the intelligibility of a given speaker. We discuss the constraints of implementing such an application as a practical real-world tool to be used by clinicians, who often have access to only minimal computational resources.

PPT Slides

Associated sound clips:

22 March 2006

Dennis Norris

Cognition and Brain Sciences Unit, Cambridge, UK

Perceptual learning in speech

The speech perception system must be flexible in responding to the variability in speech sounds caused by differences among speakers and by language change over the lifespan of the listener. One potentially valuable way for listeners to adapt to this variability is to use lexical information to retune perceptual categories: if you know what the word is, you can use this knowledge to map new speech sounds onto existing perceptual categories. A number of experiments show that listeners can perform this perceptual retuning very rapidly. I will argue that the data refute the currently popular view that lexical representations are purely 'episodic' (based on 'exemplars').

PPT Slides
Associated sound clips and papers:

8 March 2006

John Bridle

Novauris Technologies Ltd, UK

It Keeps them on the Knife: Interpretations of HMMs with 'Dynamic Observations'

For many years it has been standard practice to use augmented 'observation' vectors for HMMs for speech recognition. It makes the conditional independence assumption even less credible, but it improves accuracy. Is there still a generative stochastic model of speech patterns in there? What does the output look like? Does it matter? Are there clues to better speech recognition? This talk explores some possible attitudes to these questions, but does not necessarily provide many answers. We touch on the Trajectory HMM formulation, products of mixtures of Gaussians, and Gibbs sampling. Illustrated with real-time graphics.

PPT Slides

1 March 2006

Torsten Dau

Centre for Applied Hearing Research, Technical University of Denmark

Modeling spectro-temporal processing in the auditory system

Many real-life auditory stimuli have intensity peaks and valleys as a function of time in which intensity trajectories are highly correlated across frequency. This is true of speech, of interfering noise such as "cafeteria" noise and of many other kinds of environmental stimuli. Across-frequency comparisons of temporal envelopes are a general feature of auditory pattern analysis and play an important role in extracting signals from noise backgrounds, or in separating competing sources of sound. For example, comodulation of different frequency bands in background noise facilitates the detection of tones in noise, a phenomenon known as comodulation masking release (CMR). The perception of complex sounds is critically dependent on the faithful representation of the signal's modulations in the auditory system. While various modulation-detection and masking data as well as speech-intelligibility data can be accounted for nicely by current models, effects associated with auditory object perception have not yet been described adequately. As an example, the influence of concurrent and sequential streaming in CMR and in binaural unmasking will be illustrated. These experiments provide constraints for future models of auditory signal processing. If the developed transformation from the acoustic stimuli in complex environments into their internal spectro-temporal representation is correct, such a model could be interesting for technical applications in speech and audio coding, automatic speech recognition, and digital hearing-aids.

23 February 2006

Robert Kirchner

Department of Linguistics, University of Alberta, Edmonton, Canada

Exemplar-Based Speech Processing and Phonological Learning

Over the past decade, evidence has accumulated, from a range of research domains, that phonological patterns, including patterns relating to fine phonetic detail, are token-frequency-sensitive, motivating an exemplar-based speech processing model (see e.g. Bybee 2001, Pierrehumbert 2002). In this talk, I review this evidence, against the background of standard phonological assumptions. I then consider how an exemplar-based model might capture not only these frequency effects, but also the core observations of phonological theory: in brief, the model must be capable of detecting phonological patterns over phonetic signals, and extending these patterns to novel words (i.e. phonological learning). The move to such an approach also promises to make phonological theory more immediately relevant to the concerns of phonetics, psychology, and computer science, perhaps spurring greater interdisciplinary collaboration on spoken language processing. In this spirit, I present some attempts on my part at developing and implementing such a model, and invite feedback from those with a deeper understanding of automatic speech processing techniques.

Bybee, J. (2001) Phonology and Language Use. Cambridge University Press, Cambridge UK.
Pierrehumbert, J. (2002) Word-specific phonetics, in C. Gussenhoven & N. Warner (eds.), Laboratory Phonology VII, Berlin: Mouton de Gruyter. 101-140.

22 February 2006

Martin Russell

Department of Electronic, Electrical and Computer Engineering, University of Birmingham, UK

A Data-Driven Analysis of Vowels in the ABI (Accents of the British Isles) Speech Corpus

The ABI (Accents of the British Isles) corpus was collected by the University of Birmingham in 2003. It comprises speech from nearly 300 speakers from 14 different towns in the British Isles, representing 14 distinct regional accents. The corpus was collected on location in each town, and consists of approximately 20 minutes of recordings from each of 10 male and 10 female subjects, aged between 18 and 60, who were born in that town and had lived there all of their lives. In total there are nearly 100 hours of recordings. The recordings include read simple commands, digit and letter sequences, SCRIBE sentences, short passages, simple syllables to explore the pronunciation of vowels, and a small amount of spontaneous speech.

In the talk I will describe the data collection process and some of the lessons which were learnt. I will present the results of a data-driven analysis of the vowel sounds produced by the subjects from the different locations in the ABI corpus. I will describe some more recent experiments aimed at understanding the underlying acoustic factors which account for variation between accents. Finally I will outline future plans for extending the corpus.

15 February 2006

Christoph Draxler

University of Munich, Germany

SpeechRecorder - a Platform-Independent Tool for Speech Recordings via the WWW

SpeechRecorder is a free and platform-independent application for performing speech recordings. The contents of a recording session are defined in an XML-formatted recording script; the prompt material can be text, images, or audio. The software supports multiple displays: the speaker view contains only the prompt item and recording control elements, while the experimenter view contains the speaker view plus a signal display, level meters, and the entire recording script. Recordings can be made either to the local hard disk or to a server via the WWW. This allows geographically distributed recording of speech in high quality.

Three projects using SpeechRecorder will be presented:

  1. BITS Speech Synthesis database,
  2. Ph@ttSessionz speech database of adolescent speakers, collected in schools all over Germany, and
  3. MVP recordings of aphasia patients in a clinical environment.

This talk complements the NLP seminar of the previous day.

14 February 2006

Christoph Draxler

University of Munich, Germany

WebTranscribe - A framework for web-based speech annotation

WebTranscribe is a platform-independent, web-based annotation framework for the creation and exploitation of speech databases. The framework consists of an editor front-end that runs on client computers, and a server for data administration and DBMS storage. The annotation capabilities are implemented via editor plug-ins, and a typical annotation configuration consists of several such plug-ins (visual display of the speech signal, annotation text editor, editing buttons, meta-data display panel, etc.). WebTranscribe is implemented as a Java Web Start application to eliminate browser incompatibilities. A number of localized editor plug-ins have been developed, and the software is freely available.

There is a SpandH seminar on the following day that complements this.

8 February 2006

Simon King

Centre for Speech Technology Research, University of Edinburgh, UK

Dynamic Bayesian Networks: the new framework for ASR research?

This talk has three parts. I will start with a brief introduction to DBNs for those who are not familiar with them and explain how the model structure expresses a complete recognizer (from observations to words and from output densities to language model) in a single graphical representation. It thus exposes model components which are usually hardwired into the source code, making it much easier to experiment with novel configurations.

In the second part of the talk I will introduce a topic neglected in the classic textbooks on DBNs: triangulation. This is the process that takes us from a DBN to the structure that is needed for inference: the junction tree. The Baum-Welch algorithm for HMMs is essentially a manual derivation of the junction tree. A good triangulation can make even complex models computationally feasible; experimental results for a variety of models will be shown to support this.

In the last part of the talk I will present a case study in which a DBN is used for articulatory feature extraction. This model outperforms a set of parallel HMMs on the same task by introducing dependencies between the states, something only possible with a DBN.

PDF Slides