SpandH Seminar Abstracts

Hideki Kawahara

Emeritus Professor, Wakayama University; APSIPA Distinguished Lecturer 2015-2016; Visiting Research Scientist at Google UK Ltd. (London)

Making Speech Tangible: For Better Understanding of Human Speech Communication

Abstract: This talk presents the underlying concepts, technologies and applications of STRAIGHT, a framework for speech analysis, modification and resynthesis, which was originally designed to facilitate speech perception research. The talk also introduces two recent advances which may provide new strategies for speech communication research. The first is "temporally variable multi-aspect morphing of arbitrarily many voices." The second is "SparkNG: Speech Production and Auditory perception Research Kernel the Next Generation." Speech plays essential roles in human communication by providing rich side information channels which modify and expand linguistic content. While the recent resurgence of machine learning technologies has made speech-based communication with smart machines practical and popular, these rich side information channels, which make speech unique, are not yet well explored. It is crucially important that smart machines share a common basis with humans regarding these channels, and that this basis rests on a deep understanding of human speech communication. I hope that "making speech tangible," by introducing tools which enable quantitative and precise, as well as intuitive and direct, manipulation of speech parameters, leads to a better understanding of human speech communication.

Host: Erfan Loweimi

Sarah Al-Shareef

Internal - viva pre

Conversational Arabic Automatic Speech Recognition

Abstract: Colloquial Arabic (CA) is the set of spoken variants of modern Arabic which exist in the form of regional dialects and are generally considered to be mother tongues in those regions. CA has limited textual resources because it exists only as a spoken language, without a standardised written form. Normally the Modern Standard Arabic (MSA) writing convention is employed, which has limitations in representing CA phonetically. Without phonetic dictionaries the pronunciation of CA words is ambiguous, and can only be obtained through word and/or sentence context. Moreover, CA inherits the complex word structure of MSA, in which words can be created by attaching affixes to a word. In automatic speech recognition (ASR), the commonly used approaches to modelling acoustic, pronunciation and word variability are language independent. However, one can observe significant differences in performance between English and CA, with the latter yielding up to three times higher error rates. This thesis investigates the main causes of the under-performance of CA ASR systems. The work focused on two directions: first, to investigate the impact of limited lexical coverage and insufficient training data for written CA on language modelling; second, to obtain better models for the acoustics and pronunciations by learning to transfer between written and spoken forms. Several original contributions resulted from each direction. Data-driven classes from decomposed text are shown to reduce the out-of-vocabulary rate. A novel colloquialisation system to import additional data is introduced; automatic diacritisation to restore the missing short vowels was found to yield good performance; and a new acoustic set for describing CA was defined. The proposed methods improved ASR performance in terms of word error rate in a CA conversational telephone speech ASR task.

Host: Iñigo Casanueva

Raymond Ng

Internal - Interspeech 2015 pre

A Study on The Stability and Effectiveness of Features in Quality Estimation for Spoken Language Translation

Abstract: A quality estimation (QE) approach informed with machine translation (MT) and speech recognition (ASR) features has recently been shown to improve the performance of a spoken language translation (SLT) system in an in-domain scenario. When domain mismatch is progressively introduced in the MT and ASR systems, the SLT system’s performance naturally degrades. The use of QE to improve SLT performance has not been studied in this context. In this paper we investigate the effectiveness of QE under this setting. Our experiments showed that across moderate levels of domain mismatch, QE led to consistent translation improvements of around 0.4 in BLEU score. The QE system relies on 116 features derived from the ASR and MT system input and output. Feature analysis was conducted to understand the information sources contributing the most to performance improvements. LDA dimension reduction was used to summarise effective features into sets as small as 3 without affecting the SLT performance. By inspecting the principal components, eight features, including the acoustic model scores and count-based word statistics on the bilingual text, were found to be critically important, leading to a further boost of around 0.1 BLEU score over the full set of features. These findings provide interesting possibilities for further work by incorporating the effective QE features in SLT system training or decoding.

Host: Iñigo Casanueva

Iñigo Casanueva


Knowledge Transfer Between Speakers for Personalised Dialogue Management

Abstract: Model-free reinforcement learning has been shown to be a promising data-driven approach for automatic dialogue policy optimization, but a relatively large number of dialogue interactions is needed before the system reaches reasonable performance. Recently, Gaussian process based reinforcement learning methods have been shown to reduce the number of dialogues needed to reach optimal performance, and pre-training the policy with data gathered from different dialogue systems has further reduced this number. Following this idea, a dialogue system designed for a single speaker can be initialised with data from other speakers, but if the dynamics of the speakers are very different the model will perform poorly. When data gathered from different speakers is available, selecting the data from the most similar ones might improve the performance. We propose a method which automatically selects the data to transfer by defining a similarity measure between speakers, and uses this measure to weight the influence of the data from each speaker in the policy model. The methods are tested by simulating users with different severities of dysarthria interacting with a voice-enabled environmental control system.

Host: Iñigo Casanueva

Iván López Espejo

Dept. of Signal Theory, Telematics and Communications, University of Granada, Spain

Robust Automatic Speech Recognition on Mobile Devices with Small Microphone Array

Abstract: Automatic speech recognition (ASR) technology has experienced a new upswing in recent years thanks to the rapid development of portable electronic devices (e.g. smartphones or tablets). These devices often integrate small microphone arrays designed for noise reduction. While this novel feature is mainly being used for speech enhancement, little advantage has so far been taken of it for noise-robust ASR. In this presentation we explain some of the advances that our research group has made on noise-robust ASR on mobile devices with small microphone arrays. We will show that the multichannel information can be exploited to this end by briefly presenting our latest publications.

Host: Yulan Liu

Dorothea Kolossa

Cognitive Signal Processing Group, Institute of Communication Acoustics, Ruhr-Universität Bochum

Statistical Models for Robust Speech Recognition and Model‐Based Speech Processing

Abstract: Human beings are highly effective at integrating multiple sources of uncertain information, and mounting evidence points to this integration being practically optimal in a Bayesian sense. Yet, in speech processing systems, the two central tasks of speech signal enhancement and of speech or phonetic-state recognition are often performed almost in isolation, with only estimates of mean values being exchanged between them. This talk describes concepts for enhancing the interface of these two systems, considering a range of appropriate probabilistic representations. Examples will illustrate how such interfaces can improve the quality of both components: On the one hand, more reliable pattern recognition can be attained, while on the other hand, enhanced signal quality is achieved when feeding back information from a pattern recognition stage to the signal preprocessing. This latter idea will be demonstrated using the example of twin HMMs, audiovisual speech models that help to recover lost acoustic information by exploiting video data. Overall, it will be shown how broader, probabilistic interfaces between signal processing and pattern recognition can help to achieve better performance in real-world conditions, and to more closely approximate the Bayesian ideal of using all sources of information in accordance with their respective degree of reliability.

Host: Ning Ma

Raymond Ng

Internal - ICASSP 2015 pre

Quality Estimation for ASR K-best List Rescoring in Spoken Language Translation

Abstract: Spoken language translation (SLT) combines automatic speech recognition (ASR) and machine translation (MT). During the decoding stage, the best hypothesis produced by the ASR system may not be the best input candidate to the MT system, but making use of multiple sub-optimal ASR results in SLT has been shown to be too complex computationally. This paper presents a method to rescore the k-best ASR output so as to improve translation quality. A translation quality estimation model is trained on a large number of features which aim to capture complementary information from both ASR and MT on translation difficulty and adequacy, as well as syntactic properties of the SLT inputs and outputs. Based on the predicted quality score, the ASR hypotheses are rescored before they are fed to the MT system. ASR confidence is found to be crucial in guiding the rescoring step. In an English-to-French speech-to-text translation task, the coupling of ASR and MT systems led to an increase of 0.5 BLEU points in translation quality.
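
As an illustration only, the rescoring step described in the abstract can be sketched as re-ranking a k-best list by a predicted quality score. Everything below is an assumption made for the example: the `Hypothesis` fields, the `toy_quality` function and its weights are hypothetical stand-ins, not the features or the trained quality estimation model used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    """One entry of an ASR k-best list (fields are hypothetical)."""
    text: str
    asr_confidence: float  # assumed ASR confidence in [0, 1]
    fluency: float         # assumed stand-in for an MT-side feature in [0, 1]


def rescore_k_best(
    hyps: List[Hypothesis],
    predict_quality: Callable[[Hypothesis], float],
) -> List[Hypothesis]:
    """Return a new list ordered by predicted translation quality, best first."""
    if not hyps:
        raise ValueError("k-best list is empty")
    return sorted(hyps, key=predict_quality, reverse=True)


def toy_quality(h: Hypothesis) -> float:
    # Toy stand-in for a trained quality estimation model:
    # an equal-weight mix of the two hypothetical features.
    return 0.5 * h.asr_confidence + 0.5 * h.fluency


# k-best list in ASR order (highest ASR confidence first).
k_best = [
    Hypothesis("the whether is nice today", 0.80, 0.40),
    Hypothesis("the weather is nice today", 0.75, 0.90),
    Hypothesis("the weather is mice today", 0.60, 0.30),
]

reranked = rescore_k_best(k_best, toy_quality)
best_for_mt = reranked[0]  # hypothesis that would be passed on to the MT system
print(best_for_mt.text)
```

In this toy list the ASR 1-best is not the hypothesis ranked first after rescoring, which is the situation the abstract describes.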

Host: Yulan Liu

Yulan Liu

Internal - ICASSP 2015 pre

An Investigation into Speaker Informed DNN Front-end for LVCSR

Abstract: Deep neural networks (DNNs) have become a standard method in many ASR tasks. Recently there has been considerable interest in “informed training” of DNNs, where the DNN input is augmented with auxiliary codes, such as i-vectors, speaker codes, speaker separation bottleneck (SSBN) features, etc. This paper compares different speaker informed DNN training methods in an LVCSR task. We discuss the mathematical equivalence between speaker informed DNN training and “bias adaptation”, which uses speaker dependent biases, and give a detailed analysis of influential factors such as dimension, discrimination and stability of auxiliary codes. The analysis is supported by experiments on a meeting recognition task using a bottleneck feature based system. Results show that i-vector based adaptation is also effective in bottleneck feature based systems (not just hybrid systems). However, all tested methods show poor generalisation to unseen speakers. We introduce a system based on speaker classification followed by speaker adaptation of biases, which yields equivalent performance to an i-vector based system, with 10.4% relative improvement over the baseline on seen speakers. The new approach can serve as a fast alternative, especially for short utterances.
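
The equivalence mentioned in the abstract can be illustrated for a single input layer. This is a sketch under an assumption not stated in the abstract, namely that the auxiliary code $c_s$ of speaker $s$ is simply concatenated to the acoustic input $x_t$; the symbols $W_x$, $W_c$, $b$ and $\sigma$ are notation introduced here for the example.

```latex
\begin{align}
h_t &= \sigma\!\left( \begin{bmatrix} W_x & W_c \end{bmatrix}
        \begin{bmatrix} x_t \\ c_s \end{bmatrix} + b \right) \\
    &= \sigma\!\left( W_x x_t + W_c c_s + b \right) \\
    &= \sigma\!\left( W_x x_t + b_s \right),
    \qquad b_s = W_c c_s + b .
\end{align}
```

Under this assumption the code contributes only the term $W_c c_s$, which is constant for a given speaker, so augmenting the input acts like replacing the shared bias $b$ with a speaker dependent bias $b_s$.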

Host: Yulan Liu

Stephen Elliott

Institute of Sound and Vibration Research, University of Southampton, Southampton SO17 1BJ, UK

Feedback Control of Sound in Aircraft and in The Ear

Abstract: The low frequency sound and vibration inside aircraft is now attenuated using commercial active control systems. These typically operate using many shakers acting on the structure to modify its vibration and hence reduce excitation of the sound field. As the structure becomes larger, the number of actuators and sensors required for effective control rises significantly. Conventional, fully coupled control systems then become costly in terms of weight and sensitivity to individual failures. An alternative strategy is to distribute the control over multiple local controllers, which has been shown to be effective in a number of cases. Recent work will be presented on tuning these local control loops to maximise the power they absorb from the structure, which may allow the mass production of generic active control modules that include an actuator, sensor and self-tuning controller. The workings of the inner ear also provide a remarkable natural example of distributed active control, whose objective is to enhance the motion within the cochlea. A simple model for this cochlear amplifier, in which each of the outer hair cells acts as a local control loop, will be described and its use illustrated in predicting the otoacoustic emissions generated by the ear. These emissions are used clinically to screen the hearing of young children, and so it is important to understand how they are generated within the cochlea.

Bio: Steve Elliott graduated with a first class joint honours BSc in physics and electronics from the University of London in 1976, and received the PhD degree from the University of Surrey in 1979 for a dissertation on musical acoustics. After a short period as a Research Fellow at the ISVR and as a temporary Lecturer at the University of Surrey, he was appointed Lecturer at the Institute of Sound and Vibration Research (ISVR), University of Southampton, in 1982. He was made Senior Lecturer in 1988, Professor in 1994, and served as Director of the ISVR from 2005 to 2010. His research interests have been mostly concerned with the connections between the physical world, signal processing and control, mainly in relation to the active control of sound using adaptive filters and the active feedback control of vibration. This work has resulted in the practical demonstration of active control in propeller aircraft, cars and helicopters. His current research interests include modular systems for active feedback control and modelling the active processes within the cochlea. Professor Elliott has published over 250 papers in refereed journals and 500 conference papers and is co-author of Active Control of Sound (with P A Nelson 1992), Active Control of Vibration (with C R Fuller and P A Nelson 1996) and author of Signal Processing for Active Control (2001). He is a Fellow of the Acoustical Society of America, the IET and the IOA and a senior member of the IEEE. He was jointly awarded the Tyndall Medal from the Institute of Acoustics in 1992 and the Kenneth Harris James Prize from the Institution of Mechanical Engineers in 2000. He was made a Fellow of the Royal Academy of Engineering in 2009.

Host: Yulan Liu

Ke Chen

University of Manchester

Extracting Speaker Specific Information with a Deep Neural Architecture

Abstract: Speech signals convey different yet mixed information, ranging from linguistic to speaker-specific information components, and each of them should be exclusively used in a specific speech information processing task. Due to the entanglement of different information components, it is extremely difficult to extract any specific information. As a result, nearly all existing speech representations carry all types of speech information but are used for different tasks. For example, the same speech representations are often used in both speech and speaker recognition. However, the interference of irrelevant information conveyed in such representations hinders either of such systems from producing better performance. In this seminar, I am going to talk about our work in developing a deep neural architecture to extract speaker-specific information from a common speech representation, MFCCs, including the motivation for speech information component analysis, the architecture and training of our proposed regularised Siamese deep network for speaker-specific representation learning, experiments in various speaker-related tasks on benchmark data sets, and discussions of relevant issues and the future direction of speech information component analysis.

Host:

Keiichi Tokuda

Google / Nagoya Institute of Technology

Flexible speech synthesis in karaoke, anime, smart phones, video games, digital signage, TV and radio programs, etc.

Abstract: This talk will give an overview of the statistical approach to flexible speech synthesis. For constructing human-like talking machines, speech synthesis systems are required to be able to generate speech in an arbitrary speaker's voice, with various speaking styles in different languages, varying emphasis and focus, and/or emotional expressions. The main advantage of the statistical approach is that such flexibility can easily be realized using mathematically well-defined algorithms. In this talk, the system architecture is outlined and then recent results and demos will be presented.

Host: