SpandH Seminar Abstracts

Ke Chen

University of Manchester

Extracting Speaker Specific Information with a Deep Neural Architecture

Abstract: Speech signals convey different yet mixed information, ranging from linguistic content to speaker-specific components, and each component should ideally be used exclusively in its corresponding speech information processing task. Because these information components are entangled, it is extremely difficult to extract any one of them in isolation. As a result, nearly all existing speech representations carry all types of speech information yet are used for different tasks; for example, the same speech representations are often used in both speech and speaker recognition. However, the interference of irrelevant information conveyed in such representations prevents either kind of system from achieving better performance. In this seminar, I am going to talk about our work in developing a deep neural architecture to extract speaker-specific information from a common speech representation, MFCCs, covering the motivation for speech information component analysis, the architecture and training of our proposed regularised Siamese deep network for speaker-specific representation learning, experiments on various speaker-related tasks with benchmark data sets, and a discussion of relevant issues and future directions in speech information component analysis. Host:

Keiichi Tokuda

Google / Nagoya Institute of Technology

Flexible speech synthesis in karaoke, anime, smartphones, video games, digital signage, TV and radio programs, etc.

Abstract: This talk will give an overview of the statistical approach to flexible speech synthesis. To construct human-like talking machines, speech synthesis systems must be able to generate speech in an arbitrary speaker's voice, in various speaking styles and different languages, with varying emphasis and focus, and/or with emotional expression. The main advantage of the statistical approach is that such flexibility can easily be realized using mathematically well-defined algorithms. In this talk, the system architecture is outlined and then recent results and demos will be presented. Host:

Stephen Cox

University of East Anglia

Read my Lips: Reflections on nearly Ten Years of Research at the University of East Anglia in Automatic Lip Reading

Abstract: Automatic lip reading has attracted nothing like the research interest enjoyed by automatic speech recognition, for the obvious reason that it is an imperfect form of communication used mainly by hearing-impaired people. But it is an absorbing computational task that would have several useful applications (e.g. improving audio-visual speech recognition, medical and disability applications, crime fighting) if its challenges could be met, and lip reading is a fascinating study in its own right. In this talk, I will review the challenges of automatic lip reading, comment on human lip-reading performance, describe the approaches UEA has taken to the problem and the current state of the art, and comment on possible directions for future work. Host:

Stuart Green

Zoo Digital Ltd

Opportunities for applied speech technologies in film and broadcast

Abstract: ZOO Digital, headquartered in Sheffield, provides a range of services to content owners and platform operators in the film and broadcast industries. ZOO’s services are built on the use of its proprietary cloud-based systems that deliver clear competitive advantages through process automation and workflow management. This talk will provide an inside perspective on entertainment industry production processes and practices for subtitling, captioning and dubbing. Commercial ASR tools are already in widespread use for captioning. A close examination of current practices for entertainment localisation and access services may lead to new opportunities for application of speech technologies to deliver productivity gains. Host:

Raymond Ng and Erfan Loweimi


The USFD Systems for the IWSLT 2014 Evaluation (Ng)

Abstract: A team of 9 from the University of Sheffield (USFD) participated in the International Workshop on Spoken Language Translation (IWSLT) in 2014. In this talk, we will introduce the USFD systems for IWSLT. The automatic speech recognition (ASR) system comprises two multi-pass deep neural network systems with adaptation and rescoring techniques. The machine translation (MT) system is a phrase-based system. The USFD primary system incorporates state-of-the-art ASR and MT techniques and gives a word error rate of 14.3% and a BLEU score of 31.33 on an English-to-French speech-to-text translation task with the IWSLT 2012 data. The USFD contrastive system explores the integration of ASR and MT by using a quality estimation system to rescore the ASR outputs, optimising towards better translation. This gives a further 0.54 BLEU improvement on the same test data.

Phase information in speech recognition (Loweimi)

Abstract: It is widely believed that the speech phase spectrum does not carry a notable amount of information. This belief, along with the mathematical difficulties in interpreting and processing the phase due to phase wrapping, is the main hurdle to applying this spectrum in practice. In this talk, we deal with the first issue, i.e. the information content of the speech phase spectrum, and try to evaluate the amount and nature of that information. By information, we do not mean the average uncertainty (entropy) as defined in information theory, but speech-specific information such as linguistic content, speaker characteristics and so on. To measure this kind of information, we discard the magnitude spectrum after Fourier analysis of the signal and then reconstruct the signal from its phase spectrum alone. The phase-only reconstructed speech is then compared with the original via perceptual and psychoacoustic distance measures to evaluate the amount of such information. The same set of experiments is carried out by discarding the phase spectrum and reconstructing the signal from its magnitude spectrum alone. Comparing the magnitude-only and phase-only signals under different conditions helps in estimating the information content of each spectrum. The talk will proceed by analysing the results based on the Hilbert transform relations and the minimum-phase/all-pass signal decomposition. It will be demonstrated which aspects of the signal are shared by both spectra and which are exclusively captured by each. It will then be shown that, contrary to the prevailing belief in the speech community, the phase spectrum includes a remarkable amount of information, some of which the magnitude spectrum fails to encode. To corroborate this signal-processing-based argument, we perform a number of ASR tests and confirm its validity in a speech recognition context. Host:
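The phase-only and magnitude-only reconstruction the abstract describes can be sketched as follows. This is a minimal single-window illustration (the talk's experiments presumably use frame-based analysis and perceptual distance measures, which are omitted here); the function name and signal are illustrative only.

```python
import numpy as np

def reconstruct(signal, keep="phase"):
    """Reconstruct a signal from only the phase (or only the magnitude)
    of its Fourier transform. Phase-only: set the magnitude of every
    bin to 1 and keep the original phase. Magnitude-only: keep the
    original magnitude and set the phase of every bin to 0."""
    spectrum = np.fft.rfft(signal)
    if keep == "phase":
        modified = np.exp(1j * np.angle(spectrum))  # unit magnitude, original phase
    else:
        modified = np.abs(spectrum)                 # original magnitude, zero phase
    return np.fft.irfft(modified, n=len(signal))

x = np.random.randn(512)            # stand-in for one analysis window of speech
phase_only = reconstruct(x, keep="phase")
mag_only = reconstruct(x, keep="magnitude")
```

Comparing `phase_only` and `mag_only` against `x` under different window lengths is the kind of experiment the abstract refers to.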

Tobias May

Technical University of Denmark

A monaural cocktail-party processor: Speech segregation in background noise

Abstract: One of the most striking abilities of the human auditory system is the capability to focus on a desired target source and to segregate it from interfering background noise. Despite substantial progress in the field of computational auditory scene analysis (CASA) over the past decades, machine-based approaches that attempt to replicate human speech recognition abilities are still far from matching human robustness to the detrimental influence of competing sources and interfering noise. Recent studies have employed supervised learning strategies which exploit a priori knowledge about the distribution of acoustic features extracted from speech and noise signals during an initial training stage. This talk will present a computational speech segregation system that is based on the supervised learning of monaural cues in order to "blindly" estimate the ideal binary mask (IBM) from a noisy speech mixture. It will be shown that auditory-inspired processing principles can improve the performance of computational speech segregation systems. In addition, the ability of the segregation system to generalize to unseen acoustic conditions is evaluated by systematically varying the mismatch between the acoustic conditions used for training and testing. Host:
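The ideal binary mask that the segregation system tries to estimate is defined from oracle knowledge of the clean speech and noise: a time-frequency unit is retained when its local SNR exceeds a local criterion (LC). A minimal sketch, with random spectrograms standing in for real time-frequency representations:

```python
import numpy as np

def ideal_binary_mask(speech_spec, noise_spec, lc_db=0.0):
    """Compute the ideal binary mask (IBM): keep a time-frequency unit
    when the local speech-to-noise ratio exceeds the local criterion.
    speech_spec and noise_spec are magnitude spectrograms of the clean
    speech and the noise alone, which are only available in oracle
    conditions; a segregation system estimates this mask from the mixture."""
    eps = 1e-12  # avoid log of zero
    local_snr_db = 20.0 * np.log10((speech_spec + eps) / (noise_spec + eps))
    return (local_snr_db > lc_db).astype(float)

# illustrative stand-ins: 64 frequency channels x 100 time frames
speech = np.abs(np.random.randn(64, 100))
noise = np.abs(np.random.randn(64, 100))
mask = ideal_binary_mask(speech, noise)
```

Applying `mask` to the mixture spectrogram attenuates noise-dominated units while passing speech-dominated ones.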

Michael I Mandel

Ohio State

Detailed models for understanding speech in noise

Abstract: The human ability to understand speech in noise far outstrips current automatic approaches, despite recent technological breakthroughs. This talk presents two projects that use detailed models of speech to begin to close this gap. The first project investigates the human ability to understand speech in noise using a new data-driven paradigm. Our listening test is able to identify the specific spectro-temporal "glimpses" of individual speech utterances that are crucial to their intelligibility. By formulating intelligibility prediction as a classification problem, we are also able to successfully extrapolate intelligibility predictions to new recordings of the same and similar words. The second project aims to reconstruct damaged or obscured speech using a detailed prior model, a full large vocabulary continuous speech recognizer. Posed as an optimization problem, this system finds the latent clean speech features that minimize a combination of the distance to the reliable regions of the noisy observation and the negative log likelihood under the recognizer. This new approach to missing data speech recognition reduces both recognition errors and the distance between the estimated speech and the original clean speech. Bio: Michael I Mandel is a Research Scientist in Computer Science and Engineering at the Ohio State University working at the intersection of machine learning, signal processing, and psychoacoustics. He earned his BSc in Computer Science from the Massachusetts Institute of Technology in 2004 and his MS and PhD with distinction in Electrical Engineering from Columbia University in 2006 and 2010 as a Fu Foundation School of Engineering and Applied Sciences Presidential Scholar. From 2009 to 2010 he was an FQRNT Postdoctoral Research Fellow in the Machine Learning laboratory at the Université de Montréal. 
From 2010 to 2012 he was an Algorithm Developer at Audience Inc, a company that has shipped over 350 million noise suppression chips for cell phones. For the summer of 2014, he is a Visiting Professor in the department of Signal and Image Processing at Télécom ParisTech. Host:

Oscar Saz, Charles Fox and Heidi Christensen


Natural Speech Technology

Abstract: Many SpandH members have requested some internal seminars in addition to our external speaker series. We are currently looking for volunteers to present overviews of current SpandH projects, and to get the ball rolling the first of these will cover the Natural Speech Technology project, with three speakers taking turns to present key aspects of the project. Acoustic factorisation, eigenvoices and rapid adaptation using context and metadata: these techniques will be applied in acoustically complex scenarios, where the diversity of speakers and acoustic backgrounds is high. The evaluation will be done on different datasets, one of them an artificially corrupted speech corpus, and the other based on actual broadcast media sources. Environment models: much natural speech occurs in noisy, overlapping environments. Using multiple microphone arrays in our instrumented meeting room, we can try to find filters that combine and process signals to isolate and recognize individual speakers. In addition to conventional beamformers we are experimenting with variants of Likelihood Maximising Beamforming, which uses prior knowledge about speech signals to guide the search for beamformers. HomeService aims to take the state-of-the-art speech recognition developed by the NST research team and put it to use in people's homes. This will provide people with disabilities who cannot, or choose not to, use conventional means of interacting with technology (such as a keyboard or mouse) with an interface to devices such as their TV or PC. This talk will describe the novel "ASR-in-the-cloud" system architecture developed during the project as well as present results from the initial phase of the user trial. Host:

Tim Jurgens


Auditory models for better rehabilitative devices

Abstract: Cochlear implant (CI) users and hearing-impaired listeners show large inter-individual variability in speech-in-noise recognition performance. Some of them can obtain speech reception thresholds (SRTs) that are close to those of normal-hearing listeners, while others show SRTs of +10 dB signal-to-noise ratio or more, which severely impedes their ability to converse, especially in social situations. Models that reproduce the ability of the individual patient to understand speech in noise are highly desirable in order to tune and optimize algorithms in hearing devices for these patients. This talk will introduce models of speech recognition for the acoustically and electrically stimulated auditory system. It will be shown how different parameters in these models affect modelled speech performance and how it might be possible to relate these parameters to individual characteristics of the patient. First results of recently started projects with this modelling approach will be shown. These include (1) the optimization of different signal processing strategies in such an individualized model and (2) the combination of an electric and an acoustic auditory model to predict the speech recognition benefit of hybrid CI users, i.e. users with low-frequency acoustic hearing and electric stimulation in the same ear. Host: Guy Brown

Patrick Naylor

Imperial College

Acoustic Signal Processing and Applications to Speech Dereverberation

Abstract: The impact of acoustic propagation on speech signals captured at various points in space by microphones can be to degrade speech quality and speech intelligibility and to reduce the accuracy of automatic speech recognition. The most important degrading effects are the addition of noise and the addition of reverberation. Example applications where such degradations are problematic include interactive TV, SkypeTV, in-car communications, meeting transcription systems and conferencing systems. In this talk I shall discuss some multichannel signal processing approaches to address the problem of reverberation in speech. In general, speech dereverberation can be achieved by first performing multichannel blind estimation of the acoustic propagation channel and then processing the reverberant signal with a multichannel equalizer corresponding to the inverse of the channel. This approach to dereverberation will be reviewed and some of the practical difficulties highlighted. Current and new approaches for the approximate inversion of the acoustic channel will be described. Bio: Patrick Naylor received the BEng degree in Electronic and Electrical Engineering from the University of Sheffield, U.K., in 1986 and the PhD degree from Imperial College London, U.K., in 1990. Since 1990 he has been a member of academic staff in the Department of Electrical and Electronic Engineering at Imperial College London. His research interests are in the areas of speech, audio and acoustic signal processing. He has worked in particular on adaptive signal processing and speech processing, and has recently produced the first research textbook on dereverberation. Important topics in his work are microphone array signal processing, blind multichannel acoustic system identification and equalization, single and multi-channel speech enhancement and speech production modelling with particular focus on the analysis of the voice source signal.
He is a director of the UK Centre for Law Enforcement Audio Research, a government-funded centre tasked to undertake advanced research and to support the law enforcement agencies. In addition to his academic research, he enjoys several fruitful links with industry in the UK and the USA. Host:

Jeff Adams

Speech & NLP at Amazon: Unique Challenges, Unique Resources

Abstract: Amazon's new speech & language group has an interesting set of tasks to work on. From aligning e-books and audiobooks, to indexing home-made video reviews of consumer products, and a dozen other applications in between, there is a lot to do for the industry's newest research lab. Fortunately, the lab has access to a wealth of resources, from Amazon's vast media holdings, to Amazon Mechanical Turk for annotation, to Amazon's Web Service infrastructure; the environment has been compared to being a startup with infinite resources. In this presentation, I will summarize some of the tasks we are taking on, and some prospects for the future. I will also address opportunities for internships and employment with this new group, including some tips on what to expect when interviewing. Bio: Jeff Adams has 18 years of experience in speech research management at a variety of firms. Under Jeff's leadership, speech scientists at Kurzweil Applied Intelligence, Lernout and Hauspie, Dragon, ScanSoft, Nuance, and Yap achieved unprecedented breakthroughs in accuracy with leading-edge commercial large-vocabulary speech recognition products such as Dragon NaturallySpeaking, L&H VoiceXpress, Kurzweil Voice, Dictaphone EXSpeech & PowerScribe, and Yap's state-of-the-art voicemail ASR service. Jeff continues that tradition at Amazon, where he is leading efforts to build a new speech & language lab. Jeff studied mathematics at Brigham Young University (B.S., summa cum laude, 1984), the University of California (M.A., 1989), and the University of Oregon. He worked as a mathematician for the Department of Defense for several years before applying his skills to speech recognition. Host:

Angela Josupeit

Oldenburg University, Germany

Modeling of Speech Localization in a Multitalker Environment using Binaural and Harmonic Cues

Angela is a PhD student from the large Hearing and Audiology group at Oldenburg University who will be visiting us next week along with her supervisor Volker Hohmann. The group conducts basic research towards understanding the properties of the auditory system and successfully applies this knowledge to the design of intelligent hearing aids. Their binaural hearing aid technology, developed with Siemens, was awarded the Deutscher Zukunftspreis 2012 -- the most prestigious award in Germany for technology and innovation (i.e. a 4* REF Impact Case Study ;-). Host: Jon Barker

Pete Howell


Screening school-aged children for risk of stuttering and other speech disorders

This talk describes the scientific research that led to a screening procedure for stuttering based on disruptions to speech. A brief summary is given of findings from systematic review work that identified how the symptom set should be extended so that the screen can be used to identify other speech disorders. As part of this, I present some empirical work on assessment of the procedures used, and validation that the extended screen applies to other speech disorders. The evidence showing that disruptions to speech can be assessed in a language the assessor does not know is summarized. Then the findings from the first group of 500 children who were screened are reported. The validations planned for the follow-ups are described (this includes assessment of cases by a qualified SLT). Finally, I will discuss where this work sits with respect to other clinical activities and discuss other practical issues such as how risk should be communicated to teachers, parents and children. Host:

Alexa Wright

University of Westminster

Conversation Piece: Speech technology in art

Alexa is a visual artist who works with photography, sound, video and interactive installation. She is a Reader in Visual Culture at University of Westminster. In this talk she will present some of her artworks, in particular ‘Conversation Piece’, made in 2009. This is a speech-based interactive installation that mimics social relations with human users. By emulating, but not quite replicating human social interaction, Conversation Piece exposes some of the mechanics of human-to-human communication. For each user the illusion of meaningful social exchange is mediated by the extent to which he or she projects personality into the synthesized voice, and how much he or she chooses to engage with the virtual character we called ‘Heather’. Alexa is keen to find computer scientists interested in collaborating on a similar project! See for more details. Host:

Rogier van Dalen

University of Cambridge

Efficient segmental features for speech recognition

Speech recognisers are usually based on hidden Markov models. These make the Markov assumption, which does not model the temporal aspects of speech well. It would be better to use segmental features, for example, for each word. However, the optimal combination of segmentation of the utterance into words and word sequence must be found. Features must therefore be extracted for each possible segment of audio. For many types of features, this becomes slow. In the first part of this talk, I will discuss how to extract a class of segmental features, in a "generative score-space", efficiently. If all segmentations are considered while decoding, the same should be done in training. In the second part, I will discuss a version of minimum Bayes risk training that marginalises out the segmentation. This is joint work with Mark Gales and Anton Ragni.

Rogier van Dalen is a Research Associate in Speech Recognition at the University of Cambridge, where he also did his PhD. His research interest is applying machine learning techniques to speech recognition. Host:

Max Little


The Parkinson's Voice Initiative

Neurological disorders such as Parkinson's destroy the ability to move; there are over 6 million worldwide with the disease, but no cure. Until we have a cure, and indeed, to find a cure, we need objective tests. Unfortunately, there are no biomarkers (e.g. blood tests). Current objective symptom tests for Parkinson's are expensive, time-consuming, and logistically difficult, so mostly, they are not done outside trials. What is exciting though: voice is affected as much by Parkinson's as limb movements, so we have developed the technology to test for symptoms using voice recordings alone. This could enable some radical breakthroughs, because voice-based tests are as accurate as clinical tests, but additionally, they can be administered remotely, and patients can do the tests themselves. Also, they are high speed (take less than 30 seconds), and are ultra low cost (they don't involve expert staff time). So, they are massively scalable. Host:

Richard Smith

Smith Watkins Ltd

Good Vibrations! The Physics of Brass Instrument Design

A talk with demonstrations showing how scientific methods can help the design of musical instruments and at the same time demolish some of the myths perpetuated by musicians.

This will be of particular interest to speech researchers working on vocal tract and oral cavity modelling who would like to see how their assumptions and models compare to those used in instrument design. Richard is currently working on ideas about how instrument acoustics interact with the oral cavity, and how ultra-high notes can be produced. His company designed the custom instruments used at the recent Royal Wedding and Royal Jubilee.

More details at -- see the "library" section for publications and scientific overview articles.

'Unconventional, maybe; eccentric, perhaps; but then few scientists in their field can claim to have charted new territories of knowledge like Richard Smith.'--Yorkshire Post

Richard Smith wrote a doctoral thesis on trumpet acoustics before joining Boosey and Hawkes, where he worked for 12 years as chief designer and technical manager responsible for the world-famous Besson brass range, including the original trumpets used by Derek Watkins and John Wallace, trombones for Roy Williams and Don Lusher, and the cornets used by most brass and military bands. Richard's research work into acoustics, testing and development of brass instruments has been widely publicised in the scientific literature and on TV and radio, and he has travelled in Europe, the United States and Japan, testing instruments with top professional symphonic and session players, and presenting papers at international conferences on acoustics and instrument design. In 2000, Richard's cornet 'The Soloist' was awarded Millennium Product Status by the U.K. Design Council, recognising its enduring place among the best of British design, creativity and innovation, as 'a brass cornet with a unique system of interchangeable leadpipes, providing several instruments in one body that match changing playing conditions and genres.' These awards were granted to only 1,000 British products and services, deemed to be challenging existing conventions and solving key problems in an environmentally and ethically sound manner, as judged by a panel drawn from design, business, science and the arts. In 2008, Richard was made an Honorary Fellow of the College of Science and Engineering (University of Edinburgh), in recognition of his collaborative work within the School of Physics on the measurement and understanding of the acoustics of brass instruments. He continues to maintain a close association with Edinburgh University, and is furthering the training of the next generation of British brass instrument designers and makers through a series of apprenticeship schemes.
Richard moved to North Yorkshire in 2005 and in 2010, he celebrated, with Derek, 25 years of designing and building specialist brass instruments. Host:

Dan Stowell

Queen Mary University of London

Tracking multiple intermittent sources in noise: inferring a mixture of Markov renewal processes

Consider the sound of birdsong, or footsteps. They are intermittent sounds, having as much structure in the gaps between events as in the events themselves. And often there's more than one bird, or more than one person - so the sound is a mixture of intermittent sources. Standard tracking techniques (e.g. Markov models, autoregressive models) are a poor fit to such situations. We describe a simple signal model (the Markov renewal process (MRP)) for these intermittent data, and introduce a novel inference technique that can infer the presence of multiple MRPs even in heavy noise. We illustrate the technique via a simulation of auditory streaming phenomena, and an experiment to track a mixture of singing birds. Host:
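The signal model the abstract describes can be illustrated with a toy sampler: each source is a Markov chain over event types, with an inter-event gap whose distribution depends on the state. The transition matrix, exponential gap distribution and parameter values below are illustrative assumptions, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mrp(transition, gap_scales, n_events, state=0):
    """Sample one Markov renewal process: each event emits the current
    state, then the chain transitions and waits an exponentially
    distributed gap whose scale depends on the new state. A mixture of
    intermittent sources is simply the union of several such event
    streams, which is what the inference problem must disentangle."""
    times, states, t = [], [], 0.0
    for _ in range(n_events):
        times.append(t)
        states.append(int(state))
        state = rng.choice(len(transition), p=transition[state])
        t += rng.exponential(gap_scales[state])
    return np.array(times), np.array(states)

transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])        # 2 event types, sticky transitions
times, states = sample_mrp(transition, gap_scales=[0.1, 0.5], n_events=50)
```

Superimposing several independent draws of `sample_mrp` (plus spurious noise events) gives the kind of observation sequence the talk's inference technique is designed for.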

Steve Renals

University of Edinburgh

(Deep) neural nets in speech recognition

In this talk I'll present some of our recent work in using deep neural networks (DNNs) for speech recognition. Amongst other things the talk will include:

- a discussion of the similarities and differences between the recently developed deep neural network approaches, and the neural network approaches used for speech recognition in the 80s, 90s, and 00s;

- MLAN, an approach to incorporate out-of-domain data using posterior features;

- supervised and unsupervised ways to make use of multilingual acoustic training data;

- comparison of tandem (DNN outputs used as features) and hybrid (DNN outputs used directly as probability estimates) approaches, and their combinations.

The talk will include results of experiments on Globalphone, BBC broadcasts, and TED talks.

This is joint work with Peter Bell, Arnab Ghoshal, and Pawel Swietojanski. Host:

Oscar Saz

Carnegie Mellon University

Speech recognition and evaluation in the presence of severe phonological errors

In this talk, I will address some of the work I carried out on the recognition and evaluation of a corpus of speech from children with cognitive disorders. These speakers present such heavy phonological errors in their speech, due to learning delays, that their lexicons are completely different from the normal pronunciation of words. I will address how to automatically learn new dictionaries to improve recognition rates for these speakers, and different methods to detect these pronunciation errors with the goal of developing computer-assisted speech therapy tools. Finally, I will make a comparison with more recent work on language learning tools for non-native speakers and how, in this case, it might be more necessary to focus on the comprehension level than on the phonological level. Host:

Marcelo Rivolta

BMS, Sheffield University

Repairing the ear with stem cells

The presentation will discuss recent advances using stem cells in the search for a treatment for hearing loss. A method has been developed to generate ear sensory cells from human embryonic stem cells (hESCs). By exposing hESCs in a dish to the chemical signals that induce the formation of the ear in vivo, we have generated ear stem cells that can produce sensory hair cell-like cells and auditory neurons. We have taken the hESC-derived ear stem cells and explored whether they could repair a deaf ear in a gerbil model of auditory neuropathy (that is, when the cochlear nerve is damaged). When hESC-derived ear cells were transplanted into cochleae that had lost their auditory neurons, the cells survived, engrafted and differentiated. Moreover, they sent out projections making connections with the hair cells and with the brain. Most remarkably, they elicited a functional recovery: when the ear of the transplanted animals was stimulated with sound, we could record brain activity using a test called Auditory Brainstem-evoked Responses (ABR). The significance of this work and future steps will also be reviewed. Host:

Chris Mitchell

Audio Analytic Ltd

Sound Recognition in Physical Security Applications

Audio Analytic is a start-up company based in Cambridge, UK that primarily sells sound recognition software into the physical security market place; Aggression, Gunshot, Car Alarm and Glass Break detection. Applying sound recognition techniques within the physical security market place has a number of distinct technological, process and system level challenges. These challenges although specific to the discipline of machine listening are experienced in the general sense when applying a new concept to a market place. The talk, presenting by Dr. Christopher Mitchell, CEO & Founder of Audio Analytic, will discuss the company's experiences in applying cutting edge research in sound recognition to industrial applications, the lessons learnt and how to effectively and successfully transfer lab technology to industrial use. Host: