Simon Godsill
Signal Processing and Communications (SigProC) Laboratory, University of Cambridge
Bayesian statistical methods in audio and music processing
In this talk I will survey approaches to high-level and low-level modelling of sound signals, focussing on applications such as noise
reduction for general audio (`low-level') and score transcription for musical signals (`high-level'). In all of these approaches, structured prior models can be formulated in terms
of stochastic linkages between the various elements of the models, encoding notions such as sparsity, regularity and connectedness, and solutions can be explored using state-of-the-art computational methodologies, including MCMC and particle filtering.
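As a rough illustration of the kind of computation involved (not taken from the talk itself), the following is a minimal sketch of a bootstrap particle filter applied to a toy noise-reduction problem: a clean AR(1) "signal" is tracked from noisy observations. The model, parameters and data are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy state-space model for a noisy audio-like signal:
#   x_t = a * x_{t-1} + process noise   (clean signal, AR(1))
#   y_t = x_t + observation noise       (recorded, noisy signal)
a, q, r = 0.95, 0.1, 0.5          # AR coefficient, process variance, noise variance
T, N = 200, 500                   # number of samples, number of particles

# Simulate some data (stand-in for a real noisy recording)
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(0, np.sqrt(q))
y = x + rng.normal(0, np.sqrt(r), size=T)

# Bootstrap particle filter: estimate the clean signal from y
particles = rng.normal(0, 1, size=N)
estimate = np.zeros(T)
for t in range(T):
    # propagate particles through the prior (AR(1) dynamics)
    particles = a * particles + rng.normal(0, np.sqrt(q), size=N)
    # weight by the likelihood of the noisy observation
    w = np.exp(-0.5 * (y[t] - particles) ** 2 / r)
    w /= w.sum()
    estimate[t] = np.sum(w * particles)          # posterior mean estimate
    # resample to avoid weight degeneracy
    particles = particles[rng.choice(N, size=N, p=w)]

print("noise variance before:", np.var(y - x), "after:", np.var(estimate - x))
```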
Simon Godsill is Professor of Statistical Signal Processing and a Fellow of Corpus Christi College, Cambridge. His research interests include audio signal processing, sensor fusion, multiple object tracking and Bayesian methods. See http://www-sigproc.eng.cam.ac.uk/~sjg/.
Host: charles.fox@sheffield.ac.uk
David Martinez Gonzalez
SpandH
iVector-Based Approaches for Spoken Language Identification
The three main current approaches to language identification are considered
to be acoustic, phonotactic and prosodic. The acoustic approach models the
frequency information of the signal; the phonotactic approach is based on
the statistics of occurrence of the phonemes produced by a phoneme
recogniser; and the prosodic approach models suprasegmental information in the
speech. State-of-the-art acoustic and prosodic techniques are based on
variations of well-known Gaussian mixture models (GMMs), such as joint
factor analysis (JFA) or iVectors. These techniques assume that GMM
supervectors lie in a space of reduced dimension, which makes it possible to
compensate for channel variations. In phonotactic approaches a language
model is built from the counts of the recognised phonemes. In this talk we
will focus on iVector approaches, and their performance will be assessed on
different databases covering a wide range of languages.
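As a hypothetical illustration of an iVector back-end (the iVector extraction itself, via a total-variability model over GMM supervectors, is not shown), the sketch below scores a test iVector against per-language average iVectors using cosine similarity. All dimensions, language labels and data are placeholders, not from the talk.

```python
import numpy as np

def cosine_score(v, w):
    """Cosine similarity between two iVectors."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# Pretend iVectors (dimension 400) have already been extracted from the
# GMM supervectors by a total-variability model (not shown here).
rng = np.random.default_rng(1)
dim = 400
languages = ["en", "es", "zh"]
train = {lang: rng.normal(loc=i, scale=1.0, size=(50, dim))
         for i, lang in enumerate(languages)}   # 50 training iVectors per language

# Simple back-end: one model per language, the mean of its training iVectors
models = {lang: np.mean(vecs, axis=0) for lang, vecs in train.items()}

# Score a test iVector against each language model and pick the best
test_ivector = rng.normal(loc=1, scale=1.0, size=dim)   # "es"-like by construction
scores = {lang: cosine_score(test_ivector, m) for lang, m in models.items()}
print(max(scores, key=scores.get), scores)
```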
Biography: I graduated as a Telecommunication Engineer in 2006 and obtained my Master's
degree in 2009. Since then, I have been working on my PhD thesis, advised by
Professor Eduardo Lleida. My research interests are focused on language
recognition (LID) and voice pathology detection. I investigate new acoustic
approaches for LID, based on joint factor analysis (JFA) and iVectors, and
new prosodic approaches, analysing different ways to extract suprasegmental
information from the voice signal that is useful for recognising languages.
Lately, I have been studying how these techniques can be applied in the
field of voice pathology detection. During this time I have visited Brno University of Technology (Brno, Czech
Republic) in 2010, and SRI International (Menlo Park, CA, USA) in 2011,
where I had the great opportunity to work with researchers as
successful as Lukas Burget and Nicolas Scheffer. Currently (2012), I am doing an
internship at the University of Sheffield, in the SpandH group with Phil Green,
working on pathological voices.
Host: charles.fox@sheffield.ac.uk
Ray Meddis
Hearing Research Laboratory, University of Essex
Auditory profiles for hearing dummies
The Hearing Dummy project aims to optimise hearing aid fitting
parameters by developing an individualised computer model of a
patient's hearing. This can be used in conjunction with computerised
hearing aid algorithms to explore, in the absence of the patient, the
relative benefits of different device parameters. Before a hearing
dummy can be constructed, it is necessary to collect data concerning
threshold and, more importantly, supra-threshold aspects of the
patient's hearing across the frequency spectrum. This talk will
describe the procedures we use to generate an auditory profile aimed
at assessing sensitivity, tuning, and compression characteristics. The
computer model is then adjusted so that it produces the same profile
when tested using the same procedures as those used with the patient.
The hearing aid algorithm is then tuned so that the resulting profile
is as close as possible to normal hearing when retested in conjunction
with the hearing dummy. Finally, the patient can be assessed with the
hearing aid to check the model predictions. The process will be
illustrated using our new "biologically-inspired" hearing aid
algorithm.
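A minimal sketch of the fit-then-tune loop described above, using a toy two-parameter threshold model as a stand-in for the real physiological model of the auditory periphery, and made-up patient thresholds. Function names and numbers are hypothetical and not from the Hearing Dummy project.

```python
import numpy as np
from scipy.optimize import minimize

# Frequencies (kHz) at which the auditory profile is measured
freqs = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])

# Hypothetical patient profile: absolute thresholds in dB HL (made-up numbers)
patient_profile = np.array([15.0, 20.0, 30.0, 45.0, 60.0, 70.0])

def model_profile(params):
    """Toy 'hearing dummy': thresholds predicted from two parameters, a flat
    loss and a high-frequency slope (a stand-in for the real auditory model)."""
    flat_loss, hf_slope = params
    return flat_loss + hf_slope * np.log2(freqs / freqs[0])

# Step 1: adjust the dummy so that it reproduces the patient's profile
fit = minimize(lambda p: np.sum((model_profile(p) - patient_profile) ** 2),
               x0=[10.0, 5.0])
dummy_params = fit.x

# Step 2: tune a simple hearing-aid gain per frequency so that the aided
# profile (thresholds minus gain) is as close as possible to normal hearing
def aided_profile(gains):
    return model_profile(dummy_params) - gains

tune = minimize(lambda g: np.sum(aided_profile(g) ** 2),
                x0=np.zeros(len(freqs)))
print("fitted dummy parameters:", dummy_params)
print("prescribed gains (dB):", np.round(tune.x, 1))
```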
Biography: Ray Meddis studied Psychology at University College, London and then
became lecturer in Psychology at Bedford College, University of
London. He joined Essex University from Loughborough University where
he had been Director of the Speech and Hearing Laboratory for ten
years. Ray Meddis' work involves the creation of computer models of
low-level hearing that pay special attention to the physiological
processes involved. He has developed models of the auditory periphery
and the response of individual neurons in the auditory brainstem to
acoustic stimulation. He has also produced models of pitch perception
and the segregation of simultaneous sounds. The work has been
supported by extensive grants from Research Council sources. He is currently Emeritus Professor and Director of the Hearing Research Laboratory at Essex University.
Host: charles.fox@sheffield.ac.uk
Arnab Ghoshal
Centre for Speech Technology Research, University of Edinburgh
Acoustic modeling with Subspace Gaussian Mixture Models
Conventional automatic speech recognition (ASR) systems use hidden Markov models (HMMs) whose emission densities are modeled by mixtures of Gaussians. The subspace Gaussian mixture model (SGMM) is a recently proposed acoustic modeling approach for ASR, which provides a compact representation of the Gaussian parameters in an acoustic model by using a relatively small number of globally shared full covariance matrices and phonetic subspaces for the Gaussian means and mixture weights. The means and weights are not directly modeled, as in conventional systems, but are constrained to these subspaces. Defining multiple globally shared subspaces for the means makes it possible to model multiple Gaussian densities at each state with a single vector; the weights ensure that these Gaussians can be selectively "turned on" or "turned off". Such a model has been demonstrated to outperform conventional HMM-GMM systems while having substantially fewer parameters. The model also defines speaker subspaces which capture speaker variability. The speaker-specific contribution is modeled by a single speaker vector of relatively low dimensionality, which defines a particular point in these subspaces and can be estimated from little adaptation data, making the model suitable for rapid speaker adaptation. Moreover, it is possible to train the shared parameters using data from different domains or languages, and the multilingually trained shared parameters are shown to improve speech recognition accuracy. Additionally, for languages with limited transcribed audio, the multilingually trained shared parameters can be used directly, with only the state-specific parameters trained on the target language data. This is shown to be an effective strategy for training acoustic models for languages with limited resources.
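The sketch below illustrates the parameter tying described above: each state's Gaussian means and mixture weights are expanded from a single low-dimensional state vector via globally shared projections (mean mu_ji = M_i v_j, with log-linear weights). Dimensions and values are placeholders; sub-states and the speaker subspace are omitted, and likelihood evaluation with the shared covariances is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)

D = 39      # feature dimension (e.g. MFCCs with deltas)
S = 50      # phonetic subspace / state-vector dimension
I = 400     # number of globally shared Gaussians
J = 3000    # number of HMM states

# Globally shared parameters (random placeholders here):
M = rng.normal(size=(I, D, S))        # one mean-projection matrix per Gaussian
w = rng.normal(size=(I, S))           # weight-projection vectors
Sigma = np.stack([np.eye(D)] * I)     # shared full covariances (identity here;
                                      # used for likelihoods, not shown below)

# Each state j is described by a single low-dimensional vector v_j
v = rng.normal(size=(J, S))

def state_gaussians(j):
    """Expand state j's vector into its Gaussian means and mixture weights:
         mu_{ji} = M_i v_j
         w_{ji}  = exp(w_i . v_j) / sum_i' exp(w_i' . v_j)
    """
    means = M @ v[j]                        # shape (I, D)
    logits = w @ v[j]                       # shape (I,)
    weights = np.exp(logits - logits.max()) # softmax, numerically stable
    weights /= weights.sum()
    return means, weights

means, weights = state_gaussians(0)
print(means.shape, weights.shape, weights.sum())
```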
Biography: Arnab Ghoshal is a research fellow at the Centre for Speech Technology Research, University of Edinburgh. Prior to joining the University of Edinburgh, he was a Marie Curie Fellow at Saarland University, Saarbrücken. He received Ph.D. and M.S.E. degrees in Electrical and Computer Engineering from Johns Hopkins University, Baltimore, in 2009 and 2005, respectively. During the summers of 2003 and 2004, he worked as an intern in the Speech Technology Group at Microsoft Research, Redmond. His current research interests include acoustic modeling for large-vocabulary automatic speech recognition, pronunciation modeling, and speaker adaptation. He is one of the principal developers of Kaldi (http://kaldi.sf.net/), a free, open-source toolkit for speech recognition research.
Host: charles.fox@sheffield.ac.uk