SpandH Seminar Abstracts 2012

Simon Godsill

Signal Processing and Communications (SigProC) Laboratory, University of Cambridge

Bayesian statistical methods in audio and music processing

In this talk I will survey approaches to high-level and low-level modelling of sound signals, focussing on applications such as noise reduction for general audio ('low-level') and score transcription for musical signals ('high-level'). In all of these approaches, structured prior models can be formulated in terms of stochastic linkages between the various elements of the models, which encode notions such as sparsity, regularity and connectedness, and solutions can be explored using state-of-the-art computational methodologies, including Markov chain Monte Carlo (MCMC) and particle filtering.
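As a toy illustration of the particle filtering mentioned above (not the speaker's actual models — the AR(1) signal prior, parameter values and dimensions here are hypothetical), the following sketch runs a bootstrap particle filter to denoise a latent signal observed in Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: latent clean signal x_t follows an AR(1) prior,
# observation y_t = x_t + noise. A bootstrap particle filter estimates x_t.
a, q, r = 0.95, 0.1, 0.5           # AR coefficient, process var, noise var
T, N = 200, 500                    # time steps, number of particles

x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(0, np.sqrt(q))
y = x + rng.normal(0, np.sqrt(r), T)

particles = rng.normal(0, 1, N)
estimate = np.zeros(T)
for t in range(T):
    particles = a * particles + rng.normal(0, np.sqrt(q), N)  # propagate prior
    logw = -0.5 * (y[t] - particles) ** 2 / r                 # log-likelihood weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    estimate[t] = np.dot(w, particles)                        # posterior-mean estimate
    particles = rng.choice(particles, N, p=w)                 # resample

# The filtered estimate should track the clean signal better than the raw observations.
print(np.mean((estimate - x) ** 2) < np.mean((y - x) ** 2))
```

In this linear-Gaussian toy case a Kalman filter would be exact; the particle filter is shown because, as the abstract notes, it extends to the structured, non-Gaussian priors used for real audio.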

Simon Godsill is Professor of Statistical Signal Processing and a Fellow of Corpus Christi College, Cambridge. His research interests include audio signal processing, sensor fusion, multiple object tracking and Bayesian methods.


David Martinez Gonzalez


iVector-Based Approaches for Spoken Language Identification

The three main current approaches to language identification are acoustic, phonotactic and prosodic. The acoustic approach models the frequency information of the signal; the phonotactic approach is based on the statistics of occurrence of the phonemes recognised by a phoneme recogniser; and the prosodic approach models suprasegmental information in the speech. State-of-the-art acoustic and prosodic techniques are based on variations of well-known Gaussian mixture models (GMMs), such as joint factor analysis (JFA) or iVectors. These techniques assume that GMM supervectors live in a space of reduced dimension, which makes it possible to compensate for channel variations. In phonotactic approaches, a language model is built from the counts of the recognised phonemes. In this talk we will focus on iVector approaches, and their performance will be assessed on different databases covering a wide range of languages.
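The iVector assumption above can be sketched numerically. In this illustrative toy (all dimensions and matrices are hypothetical and randomly generated, not trained as in a real system), a GMM mean supervector s is modelled as lying near a low-dimensional subspace, s = m + Tw, where w is the iVector:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: a UBM with C Gaussians of dimension F gives a
# supervector of length C*F; the total-variability model assumes
#   s = m + T w,   with w the low-dimensional iVector.
C, F, D = 64, 20, 10               # Gaussians, feature dim, iVector dim
m = rng.normal(size=C * F)         # UBM mean supervector
T = rng.normal(size=(C * F, D))    # total-variability matrix (learned in practice)

w_true = rng.normal(size=D)
s = m + T @ w_true                 # utterance-dependent supervector

# With T known and no residual noise, the iVector is recovered by least squares;
# real systems compute a Bayesian point estimate from Baum-Welch statistics.
w_hat, *_ = np.linalg.lstsq(T, s - m, rcond=None)
print(np.allclose(w_hat, w_true))
```

The point of the sketch is the dimensionality reduction: a 1280-dimensional supervector is summarised by a 10-dimensional vector, and channel compensation and scoring for language identification then operate in that small space.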

Biography: I graduated as a Telecommunication Engineer in 2006 and obtained my Master's degree in 2009. Since then, I have been working on my PhD thesis, advised by Professor Eduardo Lleida. My research interests are focused on language recognition (LID) and voice pathology detection. I investigate new acoustic approaches for LID, based on joint factor analysis (JFA) and iVectors, and new prosodic approaches, analysing different ways to extract suprasegmental information from the voice signal that is useful for recognising languages. Lately, I have been studying how these techniques can be applied in the field of voice pathology detection. During this time I have visited Brno University of Technology (Brno, Czech Republic) in 2010 and SRI International (Menlo Park, CA, USA) in 2011, where I had the great opportunity to work alongside researchers such as Lukas Burget and Nicolas Scheffer. Currently (2012), I am doing an internship at the University of Sheffield, in the SpandH group with Phil Green, working on pathological voices.


Ray Meddis

Hearing Research Laboratory, University of Essex

Auditory profiles for hearing dummies

The Hearing Dummy project aims to optimise hearing aid fitting parameters by developing an individualised computer model of a patient's hearing. This can be used in conjunction with computerised hearing aid algorithms to explore, in the absence of the patient, the alternative benefits of different device parameters. Before a hearing dummy can be constructed, it is necessary to collect data concerning threshold and, more importantly, supra-threshold aspects of the patient's hearing across the frequency spectrum. This talk will describe the procedures we use to generate an auditory profile aimed at assessing sensitivity, tuning, and compression characteristics. The computer model is then adjusted so that it produces the same profile when tested using the same procedures as those used with the patient. The hearing aid algorithm is then tuned so that the resulting profile is as close as possible to normal hearing when retested in conjunction with the hearing dummy. Finally, the patient can be assessed with the hearing aid to check the model predictions. The process will be illustrated using our new "biologically-inspired" hearing aid algorithm.

Biography: Ray Meddis studied Psychology at University College, London and then became lecturer in Psychology at Bedford College, University of London. He joined Essex University from Loughborough University, where he had been Director of the Speech and Hearing Laboratory for ten years. Ray Meddis' work involves the creation of computer models of low-level hearing that pay special attention to the physiological processes involved. He has developed models of the auditory periphery and the response of individual neurons in the auditory brainstem to acoustic stimulation. He has also produced models of pitch perception and the segregation of simultaneous sounds. The work has been supported with extensive grants from Research Council sources. He is currently Emeritus Professor and Director of the Hearing Research Laboratory at Essex University.


Arnab Ghoshal

Centre for Speech Technology Research, University of Edinburgh

Acoustic modeling with Subspace Gaussian Mixture Models

Conventional automatic speech recognition (ASR) systems use hidden Markov models (HMMs) whose emission densities are modeled by mixtures of Gaussians. The subspace Gaussian mixture model (SGMM) is a recently proposed acoustic modeling approach for ASR, which provides a compact representation of the Gaussian parameters in an acoustic model by using a relatively small number of globally-shared full covariance matrices and phonetic subspaces for the Gaussian means and mixture weights. The means and weights are not directly modeled, as in conventional systems, but are constrained to these subspaces. Defining multiple globally-shared subspaces for the means makes it possible to model multiple Gaussian densities at each state with a single vector; the weights ensure that these Gaussians can be selectively "turned on" or "turned off". Such a model has been demonstrated to outperform conventional HMM-GMM systems, while having substantially fewer parameters. The model also defines speaker subspaces which capture speaker variability. The speaker-specific contribution is modeled by a single speaker vector of relatively low dimensionality, which defines a particular point in these subspaces and can be estimated from little adaptation data, making the model suitable for rapid speaker adaptation. Moreover, it is possible to train the shared parameters using data from different domains or languages. It is shown that the multilingually trained shared parameters can improve speech recognition accuracy. Additionally, for languages with limited transcribed audio, the multilingually trained shared parameters are used directly, with only the state-specific parameters trained on the target-language data. This is shown to be an effective strategy for training acoustic models for languages with limited resources.
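The core SGMM parameterisation described above — deriving a state's Gaussian means and mixture weights from a single state vector via globally shared projections — can be sketched as follows (all sizes and random parameters here are hypothetical, purely to show the shapes involved):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: I shared Gaussians, feature dim F, subspace dim S.
I, F, S = 8, 13, 5
M = rng.normal(size=(I, F, S))     # globally shared mean-projection matrices
w = rng.normal(size=(I, S))        # globally shared weight-projection vectors

# Each HMM state j is described by a single low-dimensional state vector v_j;
# all I Gaussian means and the mixture weights for that state derive from it.
v_j = rng.normal(size=S)
means = M @ v_j                    # shape (I, F): one mean per shared Gaussian
logits = w @ v_j
weights = np.exp(logits) / np.exp(logits).sum()  # softmax mixture weights

print(means.shape, round(weights.sum(), 6))
```

This is why the model is compact: per state, only the S-dimensional vector v_j is stored, while the large matrices M, the weight projections w, and the full covariances are shared across all states.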

Biography: Arnab Ghoshal is a research fellow at the Centre for Speech Technology Research, University of Edinburgh. Prior to joining the University of Edinburgh, he was a Marie Curie Fellow at Saarland University, Saarbrücken. He received Ph.D. and M.S.E. degrees in Electrical and Computer Engineering from Johns Hopkins University, Baltimore, in 2009 and 2005, respectively. During the summers of 2003 and 2004, he worked as an intern in the Speech Technology Group at Microsoft Research, Redmond. His current research interests include acoustic modeling for large-vocabulary automatic speech recognition, pronunciation modeling, and speaker adaptation. He is one of the principal developers of Kaldi, a free, open-source toolkit for speech recognition research.


Chris Sumner

MRC Institute of Hearing Research, Nottingham, UK

Spectral and temporal neural processing: relationships to auditory perception

The relationship between neural activity in the brain, neural processing and perception is relevant for a wide range of hearing-related matters such as speech communication, hearing impairment and even automatic speech recognition. Work in my group is directed at understanding these relationships using computational, physiological and behavioural methods.

One area of research has been how cortical neurons process tone sequences and how this relates to perception. Neuronal responses to two-tone sequences suggest cortical adaptation contributes to perceptual forward-masking. Cortical neurons also demonstrate a frequency- and rate-dependent competitive suppression consistent with properties of auditory perceptual stream segregation. However, cortical adaptation is also ‘frequency-specific’: within single neurons, a response to a given frequency is most strongly adapted by tones of the same frequency, the opposite of a competitive suppression. Thus cortical adaptation may contribute to several perceptual phenomena, and it appears to reflect both competitive and ‘frequency-specific’ processes.

We are also working on mechanisms of spectral integration and resolution. We have shown that binaural-level interactions are also ‘frequency-specific’. Cortical neurons preferentially compute differences for similar frequencies in either ear, despite broad monaural frequency tuning. This suggests that within frequency-channel binaural comparisons precede later spectral integration of spatial information. Perhaps the most fundamental aspect of spectral processing is that brought about by frequency analysis in the ear: the auditory filter. Previous physiological evidence is ambiguous as to whether perceptual frequency selectivity is determined entirely by the cochlea. Using contemporary psychophysical masking methods to estimate ‘neuronal auditory filter’ characteristics, we find that even in the auditory cortex where pure tone receptive fields are very broadly tuned, the neuronal auditory filter width is a close match for perceptual and peripheral bandwidths. Finally, comparisons of psychophysics and physiological measurements are often subject to caveats of species differences and anaesthesia. We are tackling this via integrated study of frequency selectivity, combining psychophysical and peripheral measurements with cortical recordings in awake animals.

24 April 2012, 12:00-13:00, Regent Court, G30

Bernhard Seeber

MRC Institute of Hearing Research, Nottingham, UK

Assessing and improving hearing with cochlear implants in noisy spaces

Cochlear implants (CIs) can restore speech understanding in quiet, but most implant users complain that noise and reverberation make speech understanding difficult or even impossible. In normal hearing, several mechanisms contribute to our ability to hear out one source in a potpourri of sources. This so-called auditory scene analysis uses monaural and binaural cues which are often degraded in electric hearing. I will review our research on auditory scene analysis by CI users, with a particular focus on the effect of reverberation on binaural hearing. The results of our studies suggest ways to improve the coding of binaural cues in implants, which will be discussed.

17 April 2012, 12:00-13:00, Regent Court, G30

Tim Kempton

Department of Computer Science, University of Sheffield, UK

Machine-Assisted Phonemic Analysis

There is a consensus among many linguists that half of all languages risk disappearing by the end of the century. Documentation is agreed to be a priority. This includes the process of phonemic analysis to discover the contrastive sounds of a language, with the resulting benefits of further linguistic analysis, literacy, and access to speech technology. A machine-assisted approach to phonemic analysis has the potential to greatly speed up the process and make the analysis more objective.

Good computer tools are already available to help in a phonemic analysis, but these primarily provide search and sort database functionality, rather than automated analysis. In computational phonology there have been very few studies on the automated discovery of phonological patterns from surface level data such as narrow phonetic transcriptions or acoustics.

This thesis addresses the lack of research in this area. The key scientific question underpinning the work in this thesis is “To what extent can a machine algorithm contribute to the procedures needed for a phonemic analysis?”. A secondary question is “What insights does such a quantitative evaluation give about the contribution of each of these procedures to a phonemic analysis?”

It is demonstrated that a machine-assisted approach can make a measurable contribution to a phonemic analysis for all the procedures investigated: phonetic similarity, phone recognition & alignment, complementary distribution, and minimal pairs. The evaluation measures introduced in this thesis allow a comprehensive quantitative comparison between these phonemic analysis procedures. Given the best available data and the machine-assisted procedures described, there is a strong indication that phonetic similarity is the most important piece of evidence in a phonemic analysis.
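Two of the procedures named above — complementary distribution and minimal pairs — can be sketched mechanically on a toy corpus (the transcriptions and helper functions below are invented for illustration, not the thesis's actual algorithms):

```python
from itertools import combinations

# Toy corpus of hypothetical narrow transcriptions (tuples of phones).
# Aspirated "pʰ" appears only word-initially, plain "p" only after "s":
# the classic English-style allophony pattern.
corpus = [("pʰ", "ɪ", "t"), ("s", "p", "ɪ", "t"),
          ("pʰ", "æ", "n"), ("s", "p", "æ", "n")]

def environments(phone):
    """Collect (left, right) contexts of a phone; '#' marks a word boundary."""
    envs = set()
    for word in corpus:
        for i, ph in enumerate(word):
            if ph == phone:
                left = word[i - 1] if i > 0 else "#"
                right = word[i + 1] if i < len(word) - 1 else "#"
                envs.add((left, right))
    return envs

def complementary(a, b):
    # Disjoint environment sets suggest allophones of a single phoneme.
    return not (environments(a) & environments(b))

def minimal_pairs(a, b):
    # Word pairs identical except for a single a/b substitution: evidence of contrast.
    pairs = []
    for w1, w2 in combinations(corpus, 2):
        if len(w1) == len(w2):
            diffs = [(x, y) for x, y in zip(w1, w2) if x != y]
            if diffs == [(a, b)] or diffs == [(b, a)]:
                pairs.append((w1, w2))
    return pairs

print(complementary("p", "pʰ"))   # True: the two phones never share an environment
print(minimal_pairs("p", "pʰ"))   # []: no minimal pair, so no contrastive evidence
```

On this corpus the two procedures agree that [p] and [pʰ] do not contrast, which is the kind of evidence a phonemic analysis combines (alongside phonetic similarity and recognition/alignment) to posit a single phoneme /p/.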

The tools and techniques developed in this thesis have resulted in tangible benefits to the analysis of two under-resourced languages and it is expected that many more languages will follow.