SpandH Seminar Abstracts

Dr Michael Stone

Marston Senior Research Fellow in Audiology/Hearing Sciences, Manchester Centre for Audiology and Deafness, School of Health Sciences, University of Manchester

The role of envelope cues in the perception of speech in background sounds

The envelope of an acoustic signal represents the short-term fluctuation in level of an acoustic source. In speech it can be seen as the information applied to the carrier (voicing or frication) by the movement of the tongue, teeth and lips. With age and hearing loss, listeners become more reliant on envelope cues because of their restricted access to other cues. Additionally, many forms of audio signal processing operate by modifying the envelope within different frequency bands. Consequently, understanding how listeners access envelope cues is important for designing effective signal processing schemes. This talk looks at some perceptual properties of acoustic envelopes that define their importance, illustrated with audio demonstrations.
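
As a concrete illustration of what an envelope is (my own sketch, not material from the talk), the amplitude envelope of a signal can be approximated by full-wave rectification followed by low-pass smoothing:

```python
import numpy as np

def envelope(signal, fs, smooth_ms=20.0):
    """Crude amplitude envelope: full-wave rectify, then smooth with a
    moving average whose length acts as a low-pass filter."""
    rectified = np.abs(signal)
    win = max(1, int(fs * smooth_ms / 1000.0))
    return np.convolve(rectified, np.ones(win) / win, mode="same")

# A 100 Hz carrier modulated at 4 Hz, roughly the syllable rate of speech
fs = 8000
t = np.arange(fs) / fs
modulator = 0.5 * (1.0 + np.sin(2 * np.pi * 4 * t))
env = envelope(np.sin(2 * np.pi * 100 * t) * modulator, fs)
# env tracks the slow 4 Hz modulator, not the 100 Hz carrier
```

The recovered envelope follows the slow modulation while the carrier is smoothed away, which is exactly the information envelope-based signal processing manipulates band by band.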

Bio: Michael joined the Manchester Centre for Audiology and Deafness group at the University of Manchester in 2014, from Professor Brian Moore’s Auditory Perception Group at the University of Cambridge. One strand of his research concerns the contribution of modulations to speech intelligibility, for example using "pure" energetic masking to isolate its effect from that of modulation masking on speech. Another strand is the early detection of a particular form of noise-induced hearing damage, sometimes called "hidden hearing loss". His collaborations extend to supporting other projects in the centre on the management of paediatric hearing loss, as well as external links with computer science on the use of data analytics in hearing aid dispensing.

Host: Ning Ma

Dr Nic Lane

Senior Lecturer at University College London (UCL) and Principal Scientist at Nokia Bell Labs

Squeezing Deep Learning onto Wearables, Phones and Things

In just a few short years, breakthroughs from the field of deep learning have transformed how computational models perform a wide variety of tasks such as recognizing a face, tracking emotions or monitoring physical activities. Unfortunately, deep models and algorithms typically exert severe demands on local device resources and this conventionally limits their adoption within mobile and embedded platforms. Because sensor perception and reasoning are so fundamental to this class of computation, I believe the evolution of devices like phones, wearables and things will be crippled until we reach a point where current -- and future -- deep learning innovations can be simply and efficiently integrated into these systems. In this talk, I will describe our progress towards developing general-purpose support for deep learning on resource-constrained mobile and embedded devices. Primarily, this requires a radical reduction in the resources (viz. energy, memory and computation) consumed by these models -- especially at inference time. I will highlight various, largely complementary, approaches we have invented to achieve this goal including: sparse layer representations, dynamic forms of compression, and scheduling partitioned model architectures. Collectively, these techniques rethink how deep learning algorithms can execute not only to better cope with mobile and embedded device conditions; but also to increase the utilization of commodity processors (e.g., DSPs, GPUs, CPUs) -- as well as emerging purpose-built deep learning accelerators.
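
One family of techniques mentioned, sparse layer representations, can be illustrated with a toy magnitude-pruning sketch (my example, not the speaker's code): zeroing the smallest weights of a layer shrinks the memory and compute it needs at inference time, since the zeros can be stored and multiplied sparsely.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero the smallest-magnitude entries, keeping the top
    (1 - sparsity) fraction of weights by absolute value."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

rng = np.random.default_rng(0)
layer = rng.normal(size=(256, 128))       # a dense layer's weight matrix
sparse_layer = prune_by_magnitude(layer)  # ~90% zeros, largest weights kept
```

Real systems typically fine-tune the network after pruning to recover accuracy; this sketch only shows the resource-reduction step itself.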

Bio: Nic Lane holds dual academic and industrial appointments as a Senior Lecturer (Associate Professor) at University College London (UCL), and a Principal Scientist at Nokia Bell Labs. At UCL, Nic is part of the Digital Health Institute and UCL Interaction Center, while at Bell Labs he leads DeepX -- an embedded-focused deep learning unit at the Cambridge location that is part of the broader Pervasive Sensing and Systems department. Before moving to England, Nic spent four years at Microsoft Research based in Beijing. There he was a Lead Researcher within the Mobile and Sensing Systems group (MASS). Nic's research interests revolve around the systems and modeling challenges that arise when computers collect and reason about people-centric sensor data. At heart, Nic is an experimentalist and likes to build prototypes of next-generation wearable and embedded sensing devices based on well-founded computational models. His work has received multiple best paper awards, including ACM/IEEE IPSN 2017 and two from ACM UbiComp (2012 and 2015). Nic's recent academic service includes serving on the PC for leading venues in his field (e.g., UbiComp, MobiSys, SenSys, WWW, CIKM), and this year he is the PC-chair of HotMobile 2017. Nic received his PhD from Dartmouth College in 2011.

Host: Erfan Loweimi

Dr William Whitmer

MRC Institute of Hearing Research (Scottish Section)

What we Talk about when we Talk about Speech Intelligibility Benefits

Increases in speech intelligibility are conventionally reported as either (a) the change in the relative levels of the target speech and noise(s) - the signal-to-noise ratio (SNR) - for a given percentage of utterances heard (e.g., 79%) or (b) the change in the percentage of utterances heard for a given SNR. What counts as an important change has been judged solely on the basis of the statistical distribution of scores, not on what is a noticeable or convincing improvement in (a) SNR or (b) intelligibility. Through a series of experiments, we have determined the just noticeable difference (JND) in SNR and intelligibility, and inferred the just meaningful difference (JMD): the scale of improvement necessary to prompt an individual to seek intervention. For JNDs, participants of varying hearing ability and age judged which of two speech-in-noise samples was clearer. The SNR JND ranged from 2.4-4.4 dB depending on stimulus type. The corresponding intelligibility JNDs, estimated from psychometric functions, ranged from 14-33% across stimuli. For JMDs, participants also listened to paired examples of sentences in same-spectrum noise: one at a reference SNR and the other at a variably higher SNR. In different experiments, different participants performed various tasks: (a) better/worse rating, (b) conversation tolerance, (c) device-swap, or (d) clinical importance. The SNR JMD needed to reliably motivate participants to seek intervention was approximately 6 dB. The SNR and intelligibility JNDs provide perceptual benchmarks for performance beyond statistical relevance; the SNR JMD further adds clinical relevance to speech-intelligibility improvements. We have also recently studied the JND in speech intelligibility by having participants compare changes in SNR from their individual 50% thresholds across different noises. The data suggest a gap between the psychophysical intelligibility we commonly measure and the subjective intelligibility that individuals experience.
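
As a sketch of how intelligibility values like these are typically read off a psychometric function (my illustration with made-up data, not the study's own), one fits a logistic function to proportion-correct scores and inverts it at the percentage of interest:

```python
import numpy as np
from scipy.optimize import curve_fit

def psychometric(snr_db, midpoint, slope):
    """Logistic psychometric function: P(correct) as a function of SNR."""
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint)))

# Hypothetical proportion-correct scores at six SNRs
snrs = np.array([-12.0, -9.0, -6.0, -3.0, 0.0, 3.0])
scores = np.array([0.08, 0.22, 0.48, 0.75, 0.91, 0.97])

(midpoint, slope), _ = curve_fit(psychometric, snrs, scores, p0=[-6.0, 0.5])

# SNR needed for 79% correct, by inverting the fitted logistic
snr_79 = midpoint + np.log(0.79 / 0.21) / slope
```

A JND in intelligibility can then be mapped to a JND in SNR (or vice versa) through the fitted function's slope.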

Host: Ning Ma

Dr Bob Sturm

School of Electronic Engineering and Computer Science, Queen Mary University of London

Horses in Machine Learning

Applied machine learning research entails the application of machine learning methods to specific problem domains and the evaluation of the models that result. Too often, however, evaluation is guided not by principles of relevance, reliability, and validity, but misguided by prevalence (e.g., using a particular method or dataset/benchmark because everyone uses it), convenience (e.g., many software packages provide a variety of statistical tests on any collection of numbers), and intuition (e.g., "more data is better"). I illustrate these using two examples in the domain of music informatics: 1) support vector machines with deep scattering features applied to music audio genre classification, evaluated using 10-fold cross-validation in the GTZAN benchmark dataset (Anden and Mallat, "Deep scattering spectrum," IEEE Trans. Signal Process. 62(16): 4114–4128, Aug 2014); 2) deep neural networks with energy envelope periodicity features applied to music audio rhythm recognition, evaluated using repeated 80/20 train/test partitioning in the benchmark BALLROOM dataset (Pikrakis, “A deep learning approach to rhythm modelling with applications,” in Proc. Int. Workshop Machine Learning and Music, 2013).
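
A concrete instance of the prevalence problem (my toy sketch, not from the talk): GTZAN-style random cross-validation lets tracks by the same artist fall into both training and test folds, so a classifier can score well by recognising artists rather than genres. Grouping folds by artist removes the leakage by construction.

```python
import numpy as np

def leaks(fold_of, artist_of, test_fold):
    """True if some artist appears both inside and outside the test fold."""
    in_fold = {a for a, f in zip(artist_of, fold_of) if f == test_fold}
    out_fold = {a for a, f in zip(artist_of, fold_of) if f != test_fold}
    return bool(in_fold & out_fold)

artist_of = [i // 10 for i in range(100)]   # 10 artists, 10 tracks each

# Random 10-fold assignment, as in naive cross-validation
rng = np.random.default_rng(0)
random_folds = rng.integers(0, 10, size=100)
random_leaky = any(leaks(random_folds, artist_of, f) for f in range(10))

# Artist-grouped folds: all of an artist's tracks share one fold
grouped_folds = np.array(artist_of)
grouped_leaky = any(leaks(grouped_folds, artist_of, f) for f in range(10))
```

With random folds the leakage is essentially guaranteed; a benchmark score obtained that way says little about generalisation to unseen artists.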

Bio: Bob Sturm is currently a Lecturer in Digital Media at the Centre for Digital Music in the School of Electronic Engineering and Computer Science, Queen Mary University of London. He specialises in audio and music signal processing, machine listening, and evaluation. He organised the 2016 HORSE workshop at QMUL, which focused on sanity in applied machine learning.

Host: Erfan Loweimi

Professor Kirill V Horoshenkov

Department of Mechanical Engineering, University of Sheffield

Acoustic Condition Identification and Classification in Pipes Using Machine Learning Methods

Airborne acoustic waves provide a highly reliable means to remotely measure the hydraulic, operational and structural characteristics of buried pipes that are otherwise hard to access. This talk describes an acoustical technology which can be used to detect a range of conditions in underground pipe networks. It explains how various pattern recognition methods can be applied to discriminate between the acoustic signatures recorded in pipes in order to detect a change. The talk also presents field work whose results demonstrate that the proposed acoustical technology allows very rapid, remote inspection of urban water infrastructure and can partly replace the more conventional, slower CCTV inspection.
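
The change-detection idea can be sketched roughly as follows (my illustration, not the actual system): summarise each recording as a coarse spectral signature and flag a condition change when a new measurement's signature drifts away from a known-good reference.

```python
import numpy as np

def spectral_signature(signal, n_bands=32):
    """Band-averaged magnitude spectrum, normalised to unit length."""
    bands = np.array_split(np.abs(np.fft.rfft(signal)), n_bands)
    sig = np.array([b.mean() for b in bands])
    return sig / (np.linalg.norm(sig) + 1e-12)

def changed(reference, measurement, threshold=0.1):
    """Flag a change when the signatures' Euclidean distance is large."""
    d = np.linalg.norm(spectral_signature(reference) -
                       spectral_signature(measurement))
    return d > threshold

fs = 16000
t = np.arange(fs) / fs
baseline = np.sin(2 * np.pi * 440 * t)                       # known-good pipe
obstructed = baseline + 0.8 * np.sin(2 * np.pi * 1200 * t)   # extra resonance
```

A deployed system would of course use learned classifiers over such features rather than a fixed distance threshold, but the signature comparison captures the core of the approach.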

Bio: Kirill Horoshenkov is a professor of Acoustics at the University of Sheffield. He is an expert in outdoor sound propagation, acoustic materials and instrumentation. In recognition of his contribution to the field of acoustics, he was awarded the prestigious Tyndall Medal by the Institute of Acoustics in 2006. Horoshenkov holds a degree in Acoustics and Ultrasonic Engineering from Moscow State Institute of Radio-engineering Electronics and Automation (1989) and a PhD in Acoustics from the University of Bradford (1997). He was an academic at the University of Bradford from 1995 to 2013 and currently holds the position of Professor of Acoustics at Sheffield. He has been involved actively in the professional life of the acoustic community in the UK and overseas, serving as a member of the Engineering Division Committee of the Institute of Acoustics (IOA), a member of the Engineering and Physical Sciences Research Council (EPSRC) Peer Review College, and an Associate Editor of the Journal of the Acoustical Society of America, Applied Acoustics and Acta Acustica united with Acustica. He has authored/co-authored three books and over 140 journal and conference papers, and has successfully submitted four patent applications.

Host: Erfan Loweimi

Professor Andrew Lambourne

School of Computing, Creative Technology and Engineering, Leeds Beckett University

Computers and language – is it all smoke and mirrors?

We’re all familiar with computer applications which recognise speech, translate text, or speak to us. In fact, if you string those functions together you can fairly easily build a machine which listens in one language, translates what it heard, and speaks it in another. Fantastic! Humans are redundant, and true AI has arrived. Or has it? Twenty years ago it amazed me to discover that speech recognition systems are not constructed around an explicit model of the grammar of the language. Instead, they work by having been trained on the probabilities that patterns and sequences of sounds correspond to equivalent sequences of words. And this method and this training have, of course, been provided by humans. So, innately, the machines have no intelligence at all – they are simply number-crunching the product of human endeavour, be it in the construction of the mathematical or neural network modelling techniques, or of the language material on which the training is based. Much the same applies to machine translation. Having put them firmly back in their place, we can now start to reassess these so-called intelligent speech and language processing systems rather more realistically, and understand that what we are seeing is probably better called “Simulated Intelligence” at this stage. In order to move towards real AI, and to processing natural language in its richest sense, computers will probably need to be “grounded” in the real-world concepts of context, content, relationships and discourse. This is a huge leap compared to what we have now, and one way of illuminating it is to investigate just how well these state-of-the-art technologies cope when faced with real-world challenges. Where do they fail, how, and why? If we can understand the ways in which they currently fail, we can shed useful light on the next research and design steps which might actually lead us closer to true AI.

Host: Erfan Loweimi

Nataliya Keberle and Hennadii Dobrovolskyi

Department of IT, Zaporozhye National University, Zaporozhye, Ukraine

Language-Independent Pronunciation Quality Assessment by Comparison with Sample

The task of pronunciation quality assessment by comparison with a reference example usually requires a large training set of such examples, including both correct and mispronounced utterances. A machine learning algorithm is then used to distinguish correct pronunciation from inaccurate pronunciation. There are at least two shortcomings to this approach. First, large collections of properly annotated voice recordings are rare even for widely spoken human languages. Second, mispronounced utterances always depend on the native language of the speaker and therefore cannot cover all possible error patterns. We propose an approach that assesses pronunciation quality by comparison with a small set of high-quality reference utterances. The key points of the proposed method are artificial extension of the dataset, time warping with a silence model, splitting voice recordings into pseudo-phonemes and pseudo-syllables, and calculation of MFCC features. A student's utterance is then classified as correct or mispronounced using bagging.
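
The comparison-with-sample core can be sketched with a minimal dynamic time warping alignment (a simplification of mine: no silence model or pseudo-syllable splitting): a student's attempt that matches the reference content aligns at low cost even when spoken at a different speed.

```python
import numpy as np

def dtw_cost(ref, test):
    """Length-normalised DTW alignment cost between two feature
    sequences (rows = frames), with Euclidean frame distance."""
    n, m = len(ref), len(test)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - test[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m] / (n + m)

rng = np.random.default_rng(1)
reference = rng.normal(size=(50, 13))      # e.g. 13 MFCCs per frame
slower = np.repeat(reference, 2, axis=0)   # same content, half speed
different = rng.normal(size=(50, 13))      # unrelated utterance
```

Thresholding the alignment cost (or feeding it, with other alignment statistics, to a classifier ensemble) then separates acceptable attempts from mispronounced ones.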

Host: Mauro Nicolao

Jose Gonzalez Lopez and Phil Green

SPandH Internal

Silent Speech: Reconstructing Speech from Sensor Data by Machine Learning

Total removal of the larynx is often required to treat laryngeal cancer: every year some 17,500 people in Europe and North America lose the ability to speak in this way. Current methods for restoring speech include the electro-larynx, which produces an unnatural, electronic voice, oesophageal (belching) speech, which is difficult to learn, and fistula valve speech, which is considered to be the current gold standard but requires regular hospital visits for valve replacement and produces a masculine voice unpopular with female patients. All these methods sacrifice the patient's spoken identity. Here we introduce a technique which has the potential to restore the power of speech by sensing movement of the remaining speech articulators and using machine learning algorithms to derive a transformation which converts this sensor data into an acoustic signal - 'Silent Speech'. The sensing technique, developed by our collaborators at the University of Hull and called 'Permanent Magnetic Articulography', involves attaching small, unobtrusive magnets to the lips and tongue and monitoring changes in the magnetic field induced by their movement. We report experiments with several machine learning techniques and show that the Silent Speech generated, which may be delivered in real time, is intelligible and sounds natural. The identity of the speaker is recognisable. Previous work involving speech recognition followed by synthesis is, at best, like having an interpreter. In contrast, our 'Direct Synthesis' is like getting your voice back. This work is supported by the NIHR Invention for Innovation scheme.
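
In spirit (a toy sketch of my own; the actual work uses richer models), direct synthesis is a learned regression from articulator-sensor frames to acoustic-feature frames, trained on parallel recordings:

```python
import numpy as np

rng = np.random.default_rng(0)
true_map = rng.normal(size=(9, 13))          # unknown sensor-to-acoustic mapping
sensor_frames = rng.normal(size=(500, 9))    # e.g. 9 magnetic-sensor channels
acoustic_frames = sensor_frames @ true_map + 0.01 * rng.normal(size=(500, 13))

# Least-squares estimate of the mapping from the parallel data
mapping, *_ = np.linalg.lstsq(sensor_frames, acoustic_frames, rcond=None)

# At 'silent speech' time, new sensor frames are converted to acoustics
predicted = sensor_frames @ mapping
```

Because the mapping is learned from the patient's own parallel data, the synthesised acoustics carry the speaker's identity, which is the key contrast with recognise-then-synthesise pipelines.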

Christian Füllgrabe

MRC Institute of Hearing Research, Nottingham

Beyond audibility - The role of supra-threshold auditory and cognitive processing in speech perception across the adult lifespan

Anecdotal evidence and experimental investigations indicate that older people experience increased speech-perception difficulties, especially in noisy environments. Since peripheral hearing sensitivity declines with age, lower speech intelligibility is generally attributed to a reduction in audibility. However, aided speech perception in hearing-impaired listeners frequently falls short of the performance level that would be expected based on the audibility of the speech signal. Given that many of these listeners are older, poor performance may be partly caused by age-related changes in supra-threshold auditory and/or cognitive processing that are not captured by an audiometric assessment. The presentation will discuss experimental evidence obtained from clinically normal-hearing adult listeners showing that auditory temporal processing, cognition, and speech-in-noise perception are indeed linked and, independently of hearing loss, decline across the adult lifespan. These findings highlight the need to take these audibility-unrelated factors into account in the prediction and rehabilitation of speech intelligibility.

Alessandro Di Nuovo

Centre for Automation and Robotics Research, Sheffield Hallam University

Number Understanding Modelling in a Behavioural Embodied Robot

The talk will present recent cognitive developmental robotics studies of deep artificial neural network architectures that model the learning of associations between (motor) finger counting, (visual) object counting and (auditory) number words, to explore whether finger counting, and the association of number words or digits with each finger, could serve to bootstrap the representation of number.

The results obtained in the experiments with the iCub humanoid robotic platform show that learning the number word sequences together with finger sequencing helps the fast building of the initial representation of numbers in the robot. Just as has been found with young children, through the use of finger counting and verbal counting strategies, such a robotic model develops finger and word representations that subsequently sustain the robot’s learning of the basic arithmetic operation of addition.

The ambition of the current work is to exploit embodied mathematical processing, considered an archetypal example of abstract and symbolic processing, as a fundamental cognitive capability of the next generation of interactive robots with human-like learning behaviours. This would positively influence the acceptance of robots in socially interactive environments, thereby increasing the socio-economic applications of future robots, in particular for tasks once thought too delicate to automate, especially in the fields of social care, companionship, children's therapy, domestic assistance, entertainment, and education.


  • A. Di Nuovo, V. M. De La Cruz, and A. Cangelosi, “Grounding fingers, words and numbers in a cognitive developmental robot,” in Computational Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), 2014 IEEE Symposium on, 2014, pp. 9–15.
  • A. Di Nuovo, V. M. De La Cruz, A. Cangelosi, and S. Di Nuovo, “The iCub learns numbers: An embodied cognition study,” in International Joint Conference on Neural Networks (IJCNN 2014), 2014, pp. 692–699.
  • V. M. De La Cruz, A. Di Nuovo, S. Di Nuovo, and A. Cangelosi, “Making fingers and words count in a cognitive robot,” Front. Behav. Neurosci., vol. 8, p. 13, Feb. 2014.
  • A. Di Nuovo, V. M. De La Cruz, and A. Cangelosi, “A Deep Learning Neural Network for Number Cognition: A bi-cultural study with the iCub,” in IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob) 2015, 2015, pp. 320–325.
  • A. Cangelosi, A. Morse, A. Di Nuovo, M. Rucinski, F. Stramandinoli, D. Marocco, V. De La Cruz, and K. Fischer, “Embodied language and number learning in developmental robots,” in Conceptual and Interactive Embodiment: Foundations of Embodied Cognition, vol. 2, Routledge, 2016, pp. 275–293.

Rosanna Milner

Internal - Interspeech 2016 pre

DNN-based speaker clustering for speaker diarisation

Abstract: Speaker diarisation, the task of answering "who spoke when?", is often considered to consist of three independent stages: speech activity detection, speaker segmentation and speaker clustering. These represent the separation of speech and non-speech, the splitting into speaker-homogeneous speech segments, followed by grouping together those which belong to the same speaker. This paper is concerned with speaker clustering, which is typically performed by bottom-up clustering using the Bayesian information criterion (BIC). We present a novel semi-supervised method of speaker clustering based on a deep neural network (DNN) model. A speaker separation DNN trained on independent data is used to iteratively relabel the test data set. This is achieved by reconfiguration of the output layer, combined with fine tuning in each iteration. A stopping criterion involving posteriors as confidence scores is investigated. Results are shown on a meeting task (RT07) for single distant microphones and compared with standard diarisation approaches. The new method achieves a diarisation error rate (DER) of 14.8%, compared to a baseline of 19.9%.
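
For context, the BIC baseline the paper improves on can be sketched as follows (the standard formulation, in my own code): two segment clusters are merged when modelling their frames with a single Gaussian is preferred over two separate ones.

```python
import numpy as np

def delta_bic(x, y, penalty=1.0):
    """Positive value favours keeping clusters x and y separate
    (single full-covariance Gaussian per cluster)."""
    d = x.shape[1]
    z = np.vstack([x, y])
    logdet = lambda a: np.linalg.slogdet(np.cov(a, rowvar=False))[1]
    gain = 0.5 * (len(z) * logdet(z)
                  - len(x) * logdet(x) - len(y) * logdet(y))
    n_params = d + d * (d + 1) / 2          # mean plus full covariance
    return gain - penalty * 0.5 * n_params * np.log(len(z))

rng = np.random.default_rng(0)
spk1_a = rng.normal(0.0, 1.0, size=(200, 5))   # two segments, same "speaker"
spk1_b = rng.normal(0.0, 1.0, size=(200, 5))
spk2 = rng.normal(4.0, 1.0, size=(200, 5))     # a different "speaker"
```

Bottom-up clustering repeatedly merges the pair with the most negative ΔBIC until none remains, which is the behaviour the DNN-based relabelling replaces.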

Yulan Liu

Internal - Interspeech 2016 pre

The Sheffield Wargame Corpus - Day Two and Day Three

Abstract: Improving the performance of distant speech recognition is of considerable current interest, driven by a desire to bring speech recognition into people’s homes. Standard approaches to this task aim to enhance the signal prior to recognition, typically using beamforming techniques on multiple channels. Only a few real-world recordings are available that allow experimentation with such techniques. This has become even more pertinent with recent work on deep neural networks aiming to learn beamforming from data. Such approaches require large multi-channel training sets, ideally with location annotation for moving speakers, which is scarce in existing corpora. This paper presents a new, freely available extended corpus of English speech recordings in a natural setting, with moving speakers. The data is recorded with diverse microphone arrays, and uniquely, with ground truth location tracking. It extends the 8.0 hour Sheffield Wargames Corpus released at Interspeech 2013 with a further 16.6 hours of fully annotated data, including 6.1 hours of female speech to improve gender balance. Additional blog-based language model data is provided alongside, as well as a Kaldi baseline system. Results are reported with a standard Kaldi configuration, and a baseline meeting recognition system.

Thomas Hain

Internal - Interspeech 2016 pre

webASR 2 - Improved cloud based speech technology

Abstract: This paper presents the most recent developments of the webASR service, the world’s first web-based fully functioning automatic speech recognition platform for scientific use. Initially released in 2008, the functionalities of webASR have recently been expanded with 3 main goals in mind: facilitate access through a RESTful architecture, which allows easy use through either the web interface or an API; allow the use of input metadata, when available from the user, to improve system performance; and increase the coverage of available systems beyond speech recognition. Several new systems for transcription, diarisation, lightly supervised alignment and translation are currently available through webASR. The results on a series of well-known benchmarks (RT’09, IWSLT’12 and MGB’15 evaluations) show how these webASR systems provide state-of-the-art performance across these tasks.

Salil Deena

Internal - Interspeech 2016 pre

Combining Feature and Model-Based Adaptation of RNNLMs for Multi-Genre Broadcast Speech Recognition

Abstract: Recurrent neural network language models (RNNLMs) have consistently outperformed n-gram language models when used in automatic speech recognition (ASR). This is because RNNLMs provide robust parameter estimation through the use of a continuous-space representation of words, and can generally model longer context dependencies than n-grams. The adaptation of RNNLMs to new domains remains an active research area and the two main approaches are: feature-based adaptation, where the input to the RNNLM is augmented with auxiliary features; and model-based adaptation, which includes model fine-tuning and introduction of adaptation layer(s) in the network. This paper explores the properties of both types of adaptation on multi-genre broadcast speech recognition. Two hybrid adaptation techniques are proposed, namely the fine-tuning of feature-based RNNLMs and the use of a feature-based adaptation layer. A method for the semi-supervised adaptation of RNNLMs, using topic model-based genre classification, is also presented and investigated. The gains obtained with RNNLM adaptation on a system trained on 700h of speech are consistent for RNNLMs trained on both a small (10M words) and a large (660M words) text set, with 10% perplexity and 2% relative word error rate improvements on a 28.3h test set.

Ning Ma

Internal - Interspeech 2016 pre

Speech Localisation in a Multitalker Mixture by Humans and Machines

Abstract: Speech localisation in multitalker mixtures is affected by the listener’s expectations about the spatial arrangement of the sound sources. This effect was investigated via experiments with human listeners and a machine system, in which the task was to localise a female-voice target among four spatially distributed male-voice maskers. Two configurations were used: either the masker locations were fixed or the locations varied from trial-to-trial. The machine system uses deep neural networks (DNNs) to learn the relationship between binaural cues and source azimuth, and exploits top-down knowledge about the spectral characteristics of the target source. Performance was examined in both anechoic and reverberant conditions. Our experiments show that the machine system outperformed listeners in some conditions. Both the machine and listeners were able to make use of a priori knowledge about the spatial configuration of the sources, but the effect for headphone listening was smaller than that previously reported for listening in a real room.
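
One of the binaural cues such DNNs learn from can be computed directly (an illustrative sketch, not the paper's feature pipeline): the interaural time difference (ITD), estimated from the peak of the cross-correlation between the two ear signals.

```python
import numpy as np

def estimate_itd(left, right, fs, max_itd_s=0.001):
    """ITD in seconds from the cross-correlation peak within +/- max_itd_s.
    Positive means the right-ear signal lags the left-ear signal."""
    max_lag = int(max_itd_s * fs)
    def xcorr(lag):
        if lag >= 0:
            return float(np.dot(left[:len(left) - lag], right[lag:]))
        return float(np.dot(left[-lag:], right[:len(right) + lag]))
    return max(range(-max_lag, max_lag + 1), key=xcorr) / fs

rng = np.random.default_rng(0)
fs = 16000
source = rng.normal(size=fs)
left = source
right = np.concatenate([np.zeros(8), source[:-8]])  # right ear delayed 0.5 ms
itd = estimate_itd(left, right, fs)
```

In a localisation system such cues are computed per frequency band and per frame, and the DNN maps them (together with level differences) to source azimuth.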

Raymond Ng

Internal - Interspeech 2016 pre

Combining Weak Tokenisers for Phonotactic Language Recognition in a Resource-constrained Setting

Abstract: In the phonotactic approach to language recognition, a phone tokeniser is normally used to transform the audio signal into acoustic tokens. The language identity of the speech is modelled by the occurrence statistics of the decoded tokens. The performance of this approach depends heavily on the quality of the tokeniser, and a high-quality tokeniser in matched conditions is not always available for a language recognition task. This study investigated the performance of a phonotactic language recogniser in a resource-constrained setting, following the NIST LRE 2015 specification. An ensemble of phone tokenisers was constructed by applying unsupervised sequence training on different target languages, followed by a score-based fusion. This method gave a 5-7% relative performance improvement over the baseline system on the LRE 2015 eval set. The gain was retained when the ensemble phonotactic system was further fused with an acoustic iVector system.
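
The phonotactic core can be sketched minimally (a toy of mine, not the paper's system): model each language by the occurrence statistics of phone-token bigrams, then score a test token sequence against each language's profile.

```python
import math
from collections import Counter

def bigram_profile(tokens):
    """Relative frequencies of adjacent token pairs."""
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def avg_logprob(profile, tokens, floor=1e-6):
    """Average log-probability of the test bigrams under a profile;
    unseen bigrams get a small floor probability."""
    bigrams = list(zip(tokens, tokens[1:]))
    return sum(math.log(profile.get(bg, floor)) for bg in bigrams) / len(bigrams)

# Toy 'languages' with different phonotactics over phones a and b
profile_x = bigram_profile(list("aab" * 100))
profile_y = bigram_profile(list("abb" * 100))
test_utt = list("aab" * 10)
```

An ensemble of tokenisers simply produces several such score streams, which are then combined by score-level fusion.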

Mortaza Doulaty

Internal - Interspeech 2016 pre

Automatic Genre and Show Identification of Broadcast Media

Abstract: Huge amounts of digital video are being produced and broadcast every day, leading to giant media archives. Effective techniques are needed to make such data more accessible. Automatic meta-data labelling of broadcast media is an essential task for multimedia indexing, where it is standard to use multi-modal input for such purposes. This paper describes a novel method for automatic detection of media genre and show identities using acoustic features, textual features or a combination thereof. Furthermore, the inclusion of available meta-data, such as time of broadcast, is shown to lead to very high performance. Latent Dirichlet Allocation is used to model both acoustics and text, yielding fixed-dimensional representations of media recordings that can then be used in classification with Support Vector Machines. Experiments are conducted on more than 1200 hours of TV broadcasts from the British Broadcasting Corporation (BBC), where the task is to categorise the broadcasts into 8 genres or 133 show identities. On a 200-hour test set, accuracies of 98.6% and 85.7% were achieved for genre and show identification respectively, using a combination of acoustic and textual features with meta-data.
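
The shape of this pipeline can be sketched with off-the-shelf components (a schematic of my own; the paper's features, topic counts and classifier settings differ): topic-model word counts into a fixed-dimensional vector per recording, then classify genre with a support vector machine.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy subtitle snippets for two genres
subtitles = [
    "goal match team score player referee",
    "team player goal win match league",
    "recipe cook oven butter flour sugar",
    "cook flour oven recipe dish sugar",
]
genres = ["sport", "sport", "cookery", "cookery"]

counts = CountVectorizer().fit_transform(subtitles)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_vectors = lda.fit_transform(counts)       # fixed-dimensional per show

classifier = SVC(kernel="linear").fit(topic_vectors, genres)
```

The same trick applies to the acoustic stream by first quantising acoustic features into discrete "words" before the topic model.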

Yanmeng Guo

Internal - Interspeech 2016 pre

A robust dual-microphone speech source localization algorithm for reverberant environments


Erfan Loweimi

Internal - Interspeech 2016 pre

Use of Generalised Nonlinearity in Vector Taylor Series Noise Compensation for Robust Speech Recognition

Abstract: Designing good normalisation to counter the effect of environmental distortions is one of the major challenges for automatic speech recognition (ASR). The Vector Taylor Series (VTS) method is a powerful and mathematically well-principled technique that can be applied in both the feature and model domains to compensate for both additive and convolutional noises. One of the limitations of this approach, however, is that it is tied to MFCC (and log-filterbank) features and does not extend to other representations, such as PLP, PNCC and phase-based front-ends, that use a power transformation rather than log compression. This paper aims at broadening the scope of the VTS method by deriving a new formulation that assumes a power transformation is used as the nonlinearity during feature extraction. It is shown that the conventional VTS, in the log domain, is a special case of the new extended framework. In addition, the new formulation introduces one more degree of freedom, which makes it possible to tune the algorithm to better fit the data to the statistical requirements of the ASR back-end. Compared with MFCC and conventional VTS, the proposed approach provides up to 12.2% and 2.0% absolute performance improvement on average on the Aurora-4 tasks, respectively.
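
In outline (my notation, which may differ from the paper's): with clean speech, channel and noise power spectra $X$, $H$ and $N$, the noisy observation passes through a compressive nonlinearity $f$ before cepstral analysis, and VTS linearises the resulting relation around the clean-speech mean:

```latex
% Noisy power spectrum: additive noise after a convolutional channel
Y = XH + N
% Feature nonlinearity: log compression (MFCC-style) or a power law
f(t) = \log t \quad \text{or} \quad f(t) = t^{\alpha}
% Observed feature; the log case gives the familiar VTS relation
y = f(XH + N), \qquad \log Y = \log(XH) + \log\!\bigl(1 + N/(XH)\bigr)
% First-order VTS expansion about the clean-speech mean \mu
y \approx f(\mu) + f'(\mu)\,(XH + N - \mu)
% With f(t) = t^{\alpha}, the exponent \alpha is the extra degree of
% freedom that can be tuned to the statistics of the ASR back-end.
```
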

Iñigo Casanueva

Internal - Interspeech 2016 pre

Dialogue State Tracking Personalisation for Users with Speech Disorders

Abstract: Voice controlled environmental control interfaces can be a life-changing technology for users with disabilities. However, these users often suffer from speech disorders (e.g. dysarthria), making ASR very challenging. Acoustic model adaptation can improve the performance of the ASR, but the error rate will still be high for severely dysarthric speakers. POMDP-based dialogue management can improve the performance of these interfaces due to its robustness against high ASR error rates and its ability to find the optimal dialogue policy in each environment (e.g. the optimal policy depending on the dysarthria severity of the speaker or on the amount of acoustic data used to adapt the ASR). The dialogue state tracker (the module in charge of encoding all the information seen in the dialogue so far into a fixed-length vector) is a key component of the dialogue manager. However, very little research has been done so far on adapting the state tracker to individual users interacting with a system over a long period of time. This talk explains how slot-based state trackers can be adapted to specific users and how ASR behaviour information can be used to improve state tracking generalisation to unseen dialogue states.
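As background for what a slot-based tracker does, here is a minimal illustrative sketch (not the system from the talk; the slot values and the update rule are placeholder assumptions): the tracker maintains a distribution over values for each slot and folds noisy ASR evidence into it turn by turn, which is what makes the dialogue manager robust to high ASR error rates.

```python
# Minimal sketch of belief updating for one slot of a dialogue state tracker.
# Slot values ("lights_on"/"lights_off") and the update rule are illustrative.
def update_slot(belief, asr_hypotheses):
    """Fold ASR evidence for one slot into the current belief.

    belief: dict mapping slot value -> probability
    asr_hypotheses: list of (value, confidence) pairs from the recogniser
    """
    observed = sum(conf for _, conf in asr_hypotheses)
    # Mass not claimed by the ASR stays with the prior belief
    new = {v: p * (1.0 - observed) for v, p in belief.items()}
    for value, conf in asr_hypotheses:
        new[value] = new.get(value, 0.0) + conf
    total = sum(new.values())
    return {v: p / total for v, p in new.items()}

belief = {"none": 1.0}                      # no evidence yet
belief = update_slot(belief, [("lights_on", 0.6), ("lights_off", 0.2)])
belief = update_slot(belief, [("lights_on", 0.7)])
best = max(belief, key=belief.get)
print(best)  # consistent evidence makes "lights_on" dominate
```

Personalising such a tracker means adapting how each user's characteristic ASR confusions are weighted in this update, which is the subject of the talk.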

Dr Tony Tew

Audio Lab, Department of Electronics, University of York

Around the head in 80 ways

Abstract: The complex shape (morphology) of the outer ears and their uniqueness to each listener continue to pose challenges for the successful introduction of binaural spatial audio on a large scale. This talk will outline approaches being taken in the Audio Lab Research Group at York to address some of these problems. Morphoacoustic perturbation analysis (MPA) is a powerful technique for relating features of head-related transfer functions to their morphological origins. The principles of MPA will be described and some initial validation results presented. One way in which MPA may assist with estimating individualised HRTFs will be discussed. Alternative approaches for determining the perceptual performance of a binaural audio system will be considered. An obvious problem is how to compare a virtual sound rendered binaurally with the equivalent real 3D sound without the listener knowing which they are hearing. This discussion will lead into a brief outline of recent efforts in broadcasting to improve the quality of experience for listeners to binaural spatial audio.
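For readers unfamiliar with binaural rendering, the basic operation behind the virtual sounds discussed above is convolving a mono source with the listener's left- and right-ear head-related impulse responses (HRIRs, the time-domain counterpart of HRTFs). The HRIRs below are toy stand-ins, not measured data; real HRIRs are individual to each listener, which is exactly the problem MPA addresses.

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono signal with left/right HRIRs to get a binaural pair."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return left, right

# Toy stand-ins for measured HRIRs: the right ear receives a delayed,
# attenuated copy, crudely mimicking the interaural time and level
# differences of a source on the listener's left.
hrir_l = np.array([1.0, 0.3])
hrir_r = np.array([0.0, 0.0, 0.6, 0.2])   # ~2-sample interaural delay

mono = np.random.randn(1000)
left, right = render_binaural(mono, hrir_l, hrir_r)
print(len(left), len(right))  # 1001 1003 (full convolution lengths)
```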

Biography: Tony Tew is a senior lecturer in the Department of Electronics at the University of York. He has a particular interest in auditory acoustics, spatial hearing and applications of binaural signal processing. Collaborators on the work presented in this talk include the University of Sydney, Orange Labs, BBC R&D and Meridian Audio, with additional support from EPSRC and the Australian Research Council.

Host: Guy Brown

Professor Yannis Stylianou

Professor of Speech Processing at the University of Crete and Group Leader of the Speech Technology Group at Toshiba Cambridge Research Lab, UK.

Speech Intelligibility and Beyond

Abstract: Speech is highly variable in terms of its clarity and intelligibility. Especially in adverse listening contexts (noisy environments, hearing loss, level of language acquisition, etc.), speech intelligibility can be greatly reduced. The first question we will discuss is: can we modify speech before presenting it in the listening context, with the goal of increasing its intelligibility? Although simply increasing the speech volume is the usual solution in such situations, it is well known that this is not optimal, both in terms of signal distortions and of listener comfort. In this talk, I will present advances in speech signal processing that have been shown to greatly improve the intelligibility of speech in various conditions without increasing its volume. I will show results for normal-hearing people in the near and far field, listeners with mild to moderate hearing losses, children with a certain degree of learning disability, etc., and discuss possible applications. We will also discuss ways to evaluate intelligibility, objectively and subjectively, and comment on relatively recent results from two large-scale international evaluations, including the Hurricane Challenge. The results I will show are partially based on my group's work in an FP7 FET-OPEN project: The Listening Talker. Finally, we will ask a second question: is it sufficient to increase the intelligibility of speech without paying attention to the effort or the cognitive load of the listener? This will not be answered during the talk, but we plan to address it in a new Horizon 2020 Marie Curie ETN project (2016-2019) which is about to start. So I will only put the question on the table and advertise the project, hoping to find in the audience interested candidates for PhD study (for example, on the beautiful island of Crete in Greece).
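One family of level-neutral modifications of the kind described above reallocates energy toward perceptually important spectral regions while keeping the overall level fixed. The sketch below illustrates the principle only (the band edges, gain and method are illustrative assumptions, not any of the published algorithms from the talk): boost a mid-frequency band, then renormalise so the RMS level is unchanged.

```python
import numpy as np

def reallocate_energy(x, sample_rate, lo=1000.0, hi=4000.0, gain_db=6.0):
    """Boost a frequency band, then rescale so the overall RMS is unchanged.

    A crude sketch of level-neutral intelligibility enhancement; band edges
    and gain are illustrative, not taken from any published algorithm.
    """
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    band = (freqs >= lo) & (freqs <= hi)
    spectrum[band] *= 10 ** (gain_db / 20.0)
    y = np.fft.irfft(spectrum, n=len(x))
    # Renormalise: same RMS as the input, so the signal is no louder overall
    y *= np.sqrt(np.mean(x ** 2) / np.mean(y ** 2))
    return y

fs = 16000
x = np.random.randn(fs)          # one second of noise as a stand-in signal
y = reallocate_energy(x, fs)
print(np.isclose(np.mean(x ** 2), np.mean(y ** 2)))  # True: level preserved
```

The point of the renormalisation step is exactly the claim in the abstract: intelligibility gains must come from redistributing energy, not from turning the volume up.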

Biography: Yannis Stylianou is Professor of Speech Processing at the University of Crete, Department of Computer Science (CSD UOC), and Group Leader of the Speech Technology Group at Toshiba Cambridge Research Lab, UK. Until 2012, he was also an Associate Researcher in the Signal Processing Laboratory of the Institute of Computer Science (ICS) at FORTH. During the academic year 2011-2012 he was a visiting Professor at AHOLAB, University of the Basque Country, in Bilbao, Spain. He received the Diploma of Electrical Engineering from the National Technical University of Athens (N.T.U.A.) in 1991, and the M.Sc. and Ph.D. degrees in Signal Processing from the Ecole Nationale Supérieure des Télécommunications (ENST), Paris, France, in 1992 and 1996, respectively. From 1996 until 2001 he was with AT&T Labs Research (Murray Hill and Florham Park, NJ, USA) as a Senior Technical Staff Member. In 2001 he joined Bell Labs, Lucent Technologies, in Murray Hill, NJ, USA (now Alcatel-Lucent). Since 2002 he has been with the Computer Science Department at the University of Crete, and since January 2013 he has also been with Toshiba Labs in Cambridge, UK. His current research focuses on speech signal processing algorithms for speech analysis, statistical signal processing (detection and estimation), and time-series analysis/modelling. He has (co-)authored more than 170 scientific publications and holds about 20 UK and US patents, which have received more than 4400 citations (excluding self-citations), with an H-index of 31. He co-edited the book “Progress in Non Linear Speech Processing”, Springer-Verlag, 2007. He has been the P.I. and scientific director of several European and Greek research programmes and has participated as a leader in US research programmes. Among other projects, he was P.I. of the FET-OPEN project LISTA: “The Listening Talker”, whose goal was to develop scientific foundations for spoken language technologies based on human communicative strategies.
In LISTA, he was in charge of speech modelling and speech modification, with the aim of developing novel techniques for generating artificial and natural spoken output. He has created a lab for voice function assessment, equipped with high-quality instruments for speech and voice recordings (e.g. a high-speed camera), for basic research in speech and voice as well as for services, in collaboration with the Medical School at the University of Crete. He has served on the Board of the International Speech Communication Association (ISCA), on the IEEE Multimedia Communications Technical Committee, as a member of the IEEE Speech and Language Technical Committee, and on the Editorial Board of Elsevier's Digital Signal Processing journal. He is on the Editorial Board of the Journal of Electrical and Computer Engineering (Hindawi JECE), and an Associate Editor of the EURASIP Journal on Audio, Speech, and Music Processing (ASMP) and of EURASIP Research Letters in Signal Processing (RLSP). He was an Associate Editor for IEEE Signal Processing Letters, Vice-Chair of COST Action 2103: "Advanced Voice Function Assessment" (VOICE), and on the Management Committee of COST Action 277: "Nonlinear Speech Processing".

Host: Erfan Loweimi

Dr Cleopatra Pike

Institute of Sound Recording, University of Surrey

Compensation for spectral envelope distortion in auditory perception

Abstract: Modifications by the transmission channel (loudspeakers, listening rooms, vocal tracts) can distort and colour sounds, impairing recognition. Human perception appears to be robust to channel distortions, and a number of perceptual mechanisms appear to compensate for channel acoustics. Lab tests mimicking ‘real-world’ listening show that compensation reduces the colouration caused by the channel to a moderate to large extent. These tests also indicate the psychological and physiological mechanisms that may be involved in this compensation. These mechanisms will be discussed, and further work to uncover how humans remove distortions caused by transmission channels will be put forward.
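A machine-listening analogue of the perceptual compensation described here is cepstral mean normalisation, widely used in speech recognition (offered as context, not as a technique from the talk): a fixed linear channel multiplies the spectrum, so in the log-spectral domain it becomes an additive constant that subtracting the long-term mean removes.

```python
import numpy as np

def cepstral_mean_normalise(frames):
    """Subtract the per-dimension mean over time from log-spectral frames.

    For a fixed linear channel, log|Y| = log|X| + log|H|: the channel is an
    additive constant in the log domain, removed by mean subtraction.
    """
    return frames - frames.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 20))     # 100 frames, 20 log-spectral bins
channel = rng.normal(size=(1, 20))     # fixed channel colouration
coloured = clean + channel             # channel adds a constant log offset

# After normalisation the channel term has cancelled, so the coloured and
# clean signals become identical.
diff = cepstral_mean_normalise(coloured) - cepstral_mean_normalise(clean)
print(np.allclose(diff, 0.0))  # True
```

Whether human compensation works on a similar long-term-average principle is one of the questions the perceptual mechanisms in this talk bear on.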

Biography: Cleo’s interest in audio perception and the hearing system began with her studies in Music Production at the Academy of Contemporary Music. In order to pursue this interest further she obtained an MSc in psychological research in 2009 and a PhD in psychoacoustics in 2015. Cleo’s PhD involved measuring the extent to which human listeners adapt to transmission channel acoustics (e.g. loudspeakers, rooms, and vocal tracts) and examining the psychological and neural mechanisms involved in this. Cleo has also worked as a research statistician and a research methods and statistics lecturer at Barts and The London School of Medicine, part of Queen Mary University of London. Cleo's ultimate research aim is to ascertain how knowledge of human hearing processes can be used to benefit machine listening algorithms and the construction of listening environments, such as concert halls.

Host: Amy Beeston