
SpandH Seminar Abstracts


Phil Green and Jose Gonzales

Speech and Hearing Group, Department of Computer Science, University of Sheffield

Silent Speech: Reconstructing Speech from Sensor Data by Machine Learning



Christian Füllgrabe

MRC Institute of Hearing Research, Nottingham

Beyond audibility - The role of supra-threshold auditory and cognitive processing in speech perception across the adult lifespan

Anecdotal evidence and experimental investigations indicate that older people experience increased speech-perception difficulties, especially in noisy environments. Since peripheral hearing sensitivity declines with age, lower speech intelligibility is generally attributed to a reduction in audibility. However, aided speech perception in hearing-impaired listeners frequently falls short of the performance level that would be expected based on the audibility of the speech signal. Given that many of these listeners are older, poor performance may be partly caused by age-related changes in supra-threshold auditory and/or cognitive processing that are not captured by an audiometric assessment. The presentation will discuss experimental evidence obtained from clinically normal-hearing adult listeners showing that auditory temporal processing, cognition, and speech-in-noise perception are indeed linked and, independently of hearing loss, decline across the adult lifespan. These findings highlight the need to take these audibility-unrelated factors into account in the prediction and rehabilitation of speech intelligibility.



Alessandro Di Nuovo

Centre for Automation and Robotics Research, Sheffield Hallam University

Number Understanding Modelling in a Behavioural Embodied Robot

The talk will present recent cognitive developmental robotics studies on deep artificial neural network architectures that model the learning of associations between (motor) finger counting, (visual) object counting and (auditory) number words, together with sequence learning, in order to explore whether finger counting, and the association of number words or digits with each finger, could serve to bootstrap the representation of number.
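To make the idea of learning cross-modal number associations more concrete, here is a highly simplified Python (PyTorch) sketch of a network with motor (finger), visual (object count) and auditory (number word) inputs feeding a shared representation. The layer sizes, encodings and prediction target are invented for illustration; this is not the architecture used in the cited iCub studies.

```python
# Highly simplified sketch of a multi-modal number-association network.
# All dimensions and encodings below are assumptions, not the cited models.
import torch
import torch.nn as nn

class NumberAssociationNet(nn.Module):
    def __init__(self, finger_dim=10, visual_dim=20, word_vocab=10, shared_dim=64):
        super().__init__()
        self.finger_enc = nn.Linear(finger_dim, shared_dim)   # motor (finger) input
        self.visual_enc = nn.Linear(visual_dim, shared_dim)   # visual (object count) input
        self.word_enc = nn.Embedding(word_vocab, shared_dim)  # auditory (number word) input
        self.word_out = nn.Linear(shared_dim, word_vocab)     # predict the number word

    def forward(self, fingers, visual, word_idx):
        # Shared "number" representation built from the three modality encodings.
        shared = torch.tanh(self.finger_enc(fingers)
                            + self.visual_enc(visual)
                            + self.word_enc(word_idx))
        return self.word_out(shared)

net = NumberAssociationNet()
logits = net(torch.rand(4, 10), torch.rand(4, 20), torch.randint(0, 10, (4,)))
print(logits.shape)  # (4, 10) scores over number words
```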

The results obtained in experiments with the iCub humanoid robotic platform show that learning the number word sequences together with finger sequencing helps to quickly build the robot’s initial representation of numbers. Just as has been found with young children, through the use of finger counting and verbal counting strategies such a robotic model develops finger and word representations that subsequently support the robot’s learning of the basic arithmetic operation of addition.

The ambition of the current work is to exploit embodied mathematical processing, considered an archetypal example of abstract and symbolic processing, as a fundamental cognitive capability of the next generation of interactive robots with human-like learning behaviours. This will positively influence the acceptance of robots in socially interactive environments, thereby increasing the socio-economic applications of future robots, in particular in tasks once thought too delicate to automate, especially in the fields of social care, companionship, children’s therapy, domestic assistance, entertainment, and education.

Bibliography

  • A. Di Nuovo, V. M. De La Cruz, and A. Cangelosi, “Grounding fingers, words and numbers in a cognitive developmental robot,” in 2014 IEEE Symposium on Computational Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), 2014, pp. 9–15.
  • A. Di Nuovo, V. M. De La Cruz, A. Cangelosi, and S. Di Nuovo, “The iCub learns numbers: An embodied cognition study,” in International Joint Conference on Neural Networks (IJCNN 2014), 2014, pp. 692–699.
  • V. M. De La Cruz, A. Di Nuovo, S. Di Nuovo, and A. Cangelosi, “Making fingers and words count in a cognitive robot,” Front. Behav. Neurosci., vol. 8, p. 13, 2014.
  • A. Di Nuovo, V. M. De La Cruz, and A. Cangelosi, “A Deep Learning Neural Network for Number Cognition: A bi-cultural study with the iCub,” in IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob) 2015, 2015, pp. 320–325.
  • A. Cangelosi, A. Morse, A. Di Nuovo, M. Rucinski, F. Stramandinoli, D. Marocco, V. De La Cruz, and K. Fischer, “Embodied language and number learning in developmental robots,” in Conceptual and Interactive Embodiment: Foundations of Embodied Cognition, vol. 2, Routledge, 2016, pp. 275–293.


Rosanna Milner

Internal - Interspeech 2016 pre

DNN-based speaker clustering for speaker diarisation

Abstract: Speaker diarisation, the task of answering "who spoke when?", is often considered to consist of three independent stages: speech activity detection, speaker segmentation and speaker clustering. These represent the separation of speech and non-speech, the splitting into speaker-homogeneous speech segments, followed by grouping together those which belong to the same speaker. This paper is concerned with speaker clustering, which is typically performed by bottom-up clustering using the Bayesian information criterion (BIC). We present a novel semi-supervised method of speaker clustering based on a deep neural network (DNN) model. A speaker separation DNN trained on independent data is used to iteratively relabel the test data set. This is achieved by reconfiguration of the output layer, combined with fine-tuning in each iteration. A stopping criterion involving posteriors as confidence scores is investigated. Results are shown on a meeting task (RT07) for single distant microphones and compared with standard diarisation approaches. The new method achieves a diarisation error rate (DER) of 14.8%, compared to a baseline of 19.9%.
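As a rough illustration of the iterative relabelling idea (not the paper's actual system), the Python sketch below re-fits a small classifier on the current cluster labels, relabels segments from its posteriors, and stops when the mean posterior confidence no longer improves. The features, the MLP stand-in for the speaker separation DNN, and the stopping threshold are all assumptions.

```python
# Illustrative sketch of iterative relabelling for speaker clustering.
# Features, classifier and threshold are invented stand-ins.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
segments = rng.normal(size=(200, 20))        # hypothetical per-segment features
n_speakers = 4                               # assumed number of clusters

# Initial labels, e.g. from a bottom-up (BIC-style) clustering baseline.
labels = KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit_predict(segments)

prev_confidence = 0.0
for iteration in range(10):
    # "Reconfigure the output layer and fine-tune": here crudely approximated by
    # re-fitting a small classifier on the current labelling.
    dnn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
    dnn.fit(segments, labels)

    posteriors = dnn.predict_proba(segments)             # per-segment speaker posteriors
    labels = dnn.classes_[posteriors.argmax(axis=1)]     # relabel the test data

    # Stopping criterion using posteriors as confidence scores.
    confidence = posteriors.max(axis=1).mean()
    if confidence - prev_confidence < 1e-3:
        break
    prev_confidence = confidence

print("final cluster sizes:", np.bincount(labels))
```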



Yulan Liu

Internal - Interspeech 2016 pre

The Sheffield Wargame Corpus - Day Two and Day Three

Abstract: Improving the performance of distant speech recognition is of considerable current interest, driven by a desire to bring speech recognition into people’s homes. Standard approaches to this task aim to enhance the signal prior to recognition, typically using beamforming techniques on multiple channels. Only a few real-world recordings are available that allow experimentation with such techniques. This has become even more pertinent with recent work with deep neural networks aiming to learn beamforming from data. Such approaches require large multi-channel training sets, ideally with location annotation for moving speakers, which is scarce in existing corpora. This paper presents a new and freely available extended corpus of English speech recordings in a natural setting, with moving speakers. The data is recorded with diverse microphone arrays and, uniquely, with ground-truth location tracking. It extends the 8.0-hour Sheffield Wargames Corpus released at Interspeech 2013 with a further 16.6 hours of fully annotated data, including 6.1 hours of female speech to improve the gender balance. Additional blog-based language model data is provided alongside, as well as a Kaldi baseline system. Results are reported with a standard Kaldi configuration and a baseline meeting recognition system.



Thomas Hain

Internal - Interspeech 2016 pre

webASR 2 - Improved cloud based speech technology

Abstract: This paper presents the most recent developments of the webASR service (www.webasr.org), the world’s first web-based fully functioning automatic speech recognition platform for scientific use. Initially released in 2008, the functionalities of webASR have recently been expanded with 3 main goals in mind: facilitate access through a RESTful architecture, which allows for easy use through either the web interface or an API; allow the use of input metadata, when available from the user, to improve system performance; and increase the coverage of available systems beyond speech recognition. Several new systems for transcription, diarisation, lightly supervised alignment and translation are currently available through webASR. The results on a series of well-known benchmarks (RT’09, IWSLT’12 and MGB’15 evaluations) show how these webASR systems provide state-of-the-art performance across these tasks.
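For readers unfamiliar with RESTful services, the sketch below shows what scripted access to such a platform could look like in Python. The endpoint paths, field names and authentication scheme are placeholders only, not the actual webASR API; consult www.webasr.org for the real interface.

```python
# Hypothetical sketch of driving a REST-style transcription service from a script.
# Endpoints, field names and the API key mechanism are placeholders, not webASR's API.
import requests

BASE_URL = "https://www.webasr.org/api"      # placeholder base URL
API_KEY = "YOUR_API_KEY"                     # placeholder credential

# Upload an audio file and request a transcription job (illustrative only).
with open("meeting.wav", "rb") as audio:
    response = requests.post(
        f"{BASE_URL}/jobs",                  # placeholder endpoint
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": audio},
        data={"task": "transcription"},      # placeholder task name
        timeout=60,
    )
response.raise_for_status()
job_id = response.json()["id"]               # placeholder response schema

# Poll for the result.
result = requests.get(f"{BASE_URL}/jobs/{job_id}",
                      headers={"Authorization": f"Bearer {API_KEY}"},
                      timeout=60).json()
print(result)
```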



Salil Deena

Internal - Interspeech 2016 pre

Combining Feature and Model-Based Adaptation of RNNLMs for Multi-Genre Broadcast Speech Recognition

Abstract: Recurrent neural network language models (RNNLMs) have consistently outperformed n-gram language models when used in automatic speech recognition (ASR). This is because RNNLMs provide robust parameter estimation through the use of a continuous-space representation of words, and can generally model longer context dependencies than n-grams. The adaptation of RNNLMs to new domains remains an active research area, and the two main approaches are: feature-based adaptation, where the input to the RNNLM is augmented with auxiliary features; and model-based adaptation, which includes model fine-tuning and the introduction of adaptation layer(s) in the network. This paper explores the properties of both types of adaptation on multi-genre broadcast speech recognition. Two hybrid adaptation techniques are proposed, namely the fine-tuning of feature-based RNNLMs and the use of a feature-based adaptation layer. A method for the semi-supervised adaptation of RNNLMs, using topic model-based genre classification, is also presented and investigated. The gains obtained with RNNLM adaptation on a system trained on 700h of speech are consistent for RNNLMs trained on both a small (10M words) and a large (660M words) set, with 10% perplexity and 2% relative word error rate improvements on a 28.3h test set.
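A minimal Python (PyTorch) sketch of the two adaptation styles mentioned above is given below: the RNNLM input is augmented with an auxiliary feature vector (e.g. a topic or genre posterior), and a small adaptation layer can be fine-tuned on its own for model-based adaptation. The dimensions and layer choices are assumptions; this is not the paper's model.

```python
# Toy feature-adapted RNNLM: word embedding concatenated with an auxiliary feature,
# plus an adaptation layer that can be fine-tuned in isolation. All sizes are assumed.
import torch
import torch.nn as nn

class FeatureAdaptedRNNLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, aux_dim=30, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # The recurrent layer sees the word embedding concatenated with the auxiliary feature.
        self.rnn = nn.LSTM(emb_dim + aux_dim, hidden_dim, batch_first=True)
        # Adaptation layer: a small extra transform that can be fine-tuned on its own.
        self.adapt = nn.Linear(hidden_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, words, aux):
        # words: (batch, time) word indices; aux: (batch, aux_dim) utterance-level feature
        emb = self.embedding(words)
        aux_rep = aux.unsqueeze(1).expand(-1, emb.size(1), -1)
        h, _ = self.rnn(torch.cat([emb, aux_rep], dim=-1))
        return self.output(torch.relu(self.adapt(h)))

model = FeatureAdaptedRNNLM()
# Model-based adaptation: freeze everything except the adaptation layer.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("adapt")

logits = model(torch.randint(0, 10000, (2, 12)), torch.rand(2, 30))
print(logits.shape)  # (2, 12, 10000)
```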



Ning Ma

Internal - Interspeech 2016 pre

Speech Localisation in a Multitalker Mixture by Humans and Machines

Abstract: Speech localisation in multitalker mixtures is affected by the listener’s expectations about the spatial arrangement of the sound sources. This effect was investigated via experiments with human listeners and a machine system, in which the task was to localise a female-voice target among four spatially distributed male-voice maskers. Two configurations were used: either the masker locations were fixed or the locations varied from trial to trial. The machine system uses deep neural networks (DNNs) to learn the relationship between binaural cues and source azimuth, and exploits top-down knowledge about the spectral characteristics of the target source. Performance was examined in both anechoic and reverberant conditions. Our experiments show that the machine system outperformed listeners in some conditions. Both the machine and listeners were able to make use of a priori knowledge about the spatial configuration of the sources, but the effect for headphone listening was smaller than that previously reported for listening in a real room.
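The machine-system component can be pictured as a classifier from binaural cues to a discrete azimuth grid. The toy Python sketch below illustrates this with random stand-in features; the feature dimensionality, azimuth grid and network size are assumptions, not the actual system.

```python
# Toy DNN mapping frame-level binaural cues to a discrete azimuth.
# Features, azimuth grid and network size are invented for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
azimuths = np.arange(-90, 91, 5)             # assumed 5-degree azimuth grid

# Hypothetical training data: per-frame binaural cue vectors and their azimuth index.
X_train = rng.normal(size=(5000, 64))        # e.g. cross-correlation + ILD features
y_train = rng.integers(0, len(azimuths), size=5000)

dnn = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=100, random_state=0)
dnn.fit(X_train, y_train)

# At test time, posteriors can be pooled over frames and, as in the abstract,
# combined with top-down knowledge about the target's spectral characteristics.
X_test = rng.normal(size=(100, 64))
frame_posteriors = dnn.predict_proba(X_test)
estimated = azimuths[dnn.classes_[frame_posteriors.mean(axis=0).argmax()]]
print("estimated azimuth:", estimated, "degrees")
```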



Raymond Ng

Internal - Interspeech 2016 pre

Combining Weak Tokenisers for Phonotactic Language Recognition in a Resource-constrained Setting

Abstract: In the phonotactic approach to language recognition, a phone tokeniser is normally used to transform the audio signal into acoustic tokens. The language identity of the speech is modelled by the occurrence statistics of the decoded tokens. The performance of this approach depends heavily on the quality of the audio tokeniser, and a high-quality tokeniser in matched conditions is not always available for a language recognition task. This study investigated the performance of a phonotactic language recogniser in a resource-constrained setting, following the NIST LRE 2015 specification. An ensemble of phone tokenisers was constructed by applying unsupervised sequence training on different target languages, followed by a score-based fusion. This method gave a 5-7% relative performance improvement over the baseline system on the LRE 2015 evaluation set. This gain was retained when the ensemble phonotactic system was further fused with an acoustic iVector system.
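The core phonotactic pipeline can be illustrated with a toy Python sketch: each tokeniser produces a phone sequence, an utterance is represented by its token n-gram statistics, a per-language score is computed, and scores from the weak tokenisers are fused by a weighted sum. The phone sequences, language profiles and fusion weights below are invented for illustration.

```python
# Toy phonotactic scoring with score-based fusion of two weak tokenisers.
from collections import Counter

def bigram_counts(phones):
    """Occurrence statistics of decoded tokens: here simple bigram counts."""
    return Counter(zip(phones, phones[1:]))

def language_score(phones, language_profile):
    """Score an utterance against a language's bigram profile (toy dot product)."""
    counts = bigram_counts(phones)
    return sum(counts[bg] * language_profile.get(bg, 0.0) for bg in counts)

# Hypothetical decoded phone sequences from two different tokenisers.
tokeniser_outputs = [["a", "b", "a", "c", "a", "b"],
                     ["A", "B", "A", "A", "C", "B"]]

# Hypothetical per-tokeniser bigram profiles for one target language.
language_profiles = [{("a", "b"): 1.0, ("b", "a"): 0.5},
                     {("A", "B"): 0.8, ("A", "A"): 0.4}]

# Score-based fusion: weighted combination of the per-tokeniser scores.
fusion_weights = [0.6, 0.4]
fused = sum(w * language_score(p, m)
            for w, p, m in zip(fusion_weights, tokeniser_outputs, language_profiles))
print("fused score for target language:", fused)
```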



Mortaza Doulaty

Internal - Interspeech 2016 pre

Automatic Genre and Show Identification of Broadcast Media

Abstract: Huge amounts of digital video are being produced and broadcast every day, leading to giant media archives. Effective techniques are needed to make such data more accessible. Automatic meta-data labelling of broadcast media is an essential task for multimedia indexing, where it is standard to use multi-modal input for such purposes. This paper describes a novel method for the automatic detection of media genre and show identity using acoustic features, textual features, or a combination thereof. Furthermore, the inclusion of available meta-data, such as time of broadcast, is shown to lead to very high performance. Latent Dirichlet Allocation is used to model both acoustics and text, yielding fixed-dimensional representations of media recordings that can then be used in Support Vector Machine based classification. Experiments are conducted on more than 1200 hours of TV broadcasts from the British Broadcasting Corporation (BBC), where the task is to categorise the broadcasts into 8 genres or 133 show identities. On a 200-hour test set, accuracies of 98.6% and 85.7% were achieved for genre and show identification respectively, using a combination of acoustic and textual features with meta-data.
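A minimal sketch of this pipeline, using scikit-learn and invented data, is shown below: bag-of-token counts (which in the paper come from acoustic tokens and/or text) are mapped to a fixed-dimensional LDA topic posterior, and an SVM is trained on those posteriors to predict the genre. The counts, topic number and genre labels are assumptions for illustration only.

```python
# Toy LDA + SVM genre classification pipeline with invented data.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n_shows, vocab_size, n_topics, n_genres = 300, 500, 20, 8

# Hypothetical term/token counts per recording and their genre labels.
counts = rng.poisson(1.0, size=(n_shows, vocab_size))
genres = rng.integers(0, n_genres, size=n_shows)

# Fixed-dimensional representation of each recording: its LDA topic posterior.
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
topic_posteriors = lda.fit_transform(counts)

# SVM classification on the topic posteriors (meta-data features could be appended here).
svm = SVC(kernel="linear").fit(topic_posteriors, genres)
print("training accuracy:", svm.score(topic_posteriors, genres))
```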



Yanmeng Guo

Internal - Interspeech 2016 pre

A robust dual-microphone speech source localization algorithm for reverberant environments




Erfan Loweimi

Internal - Interspeech 2016 pre

Use of Generalised Nonlinearity in Vector Taylor Series Noise Compensation for Robust Speech Recognition

Abstract: Designing good normalisation to counter the effect of environmental distortions is one of the major challenges for automatic speech recognition (ASR). The Vector Taylor Series (VTS) method is a powerful and mathematically well-principled technique that can be applied in both the feature and model domains to compensate for both additive and convolutional noise. One of the limitations of this approach, however, is that it is tied to MFCC (and log-filterbank) features and does not extend to other representations, such as PLP, PNCC and phase-based front-ends, that use a power transformation rather than log compression. This paper aims at broadening the scope of the VTS method by deriving a new formulation that assumes a power transformation is used as the non-linearity during feature extraction. It is shown that the conventional VTS, in the log domain, is a special case of the new extended framework. In addition, the new formulation introduces one more degree of freedom, which makes it possible to tune the algorithm to better fit the data to the statistical requirements of the ASR back-end. Compared with MFCC and conventional VTS, the proposed approach provides up to 12.2% and 2.0% absolute performance improvements on average on Aurora-4 tasks, respectively.
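The following LaTeX sketch outlines the general idea under simplifying assumptions (additive noise only, channel term and filterbank/DCT omitted); it is a sketch of the concept, not the paper's exact derivation.

```latex
% Let g(.) be the front-end compressive nonlinearity, with clean-speech and noise
% features x = g(|X|^2) and n = g(|N|^2). The noisy feature (mismatch function) is
\[
  y = g\!\big(g^{-1}(x) + g^{-1}(n)\big).
\]
% First-order VTS expansion around the means (\mu_x, \mu_n):
\[
  y \approx y_0 + J_x\,(x - \mu_x) + J_n\,(n - \mu_n),
  \qquad
  J_x = \left.\frac{\partial y}{\partial x}\right|_{(\mu_x,\mu_n)},
  \quad
  J_n = \left.\frac{\partial y}{\partial n}\right|_{(\mu_x,\mu_n)}.
\]
% Conventional log compression, g(z) = \log z, gives the familiar mismatch function
\[
  y = x + \log\!\big(1 + e^{\,n - x}\big),
\]
% whereas a power-law front-end, g(z) = z^{p}, gives
\[
  y = \big(x^{1/p} + n^{1/p}\big)^{p},
\]
% with the exponent p providing the extra degree of freedom mentioned in the abstract
% (the log-domain formulation being a special case of the extended framework,
% as the paper shows).
```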

Iñigo Casanueva

Internal - Interspeech 2016 pre

Dialogue State Tracking Personalisation for Users with Speech Disorders

Abstract: Voice-controlled environmental control interfaces can be a life-changing technology for users with disabilities. However, these users often suffer from speech disorders (e.g. dysarthria), making ASR very challenging. Acoustic model adaptation can improve the performance of the ASR, but the error rate will still be high for severely dysarthric speakers. POMDP-based dialogue management can improve the performance of these interfaces due to its robustness against high ASR error rates and its ability to find the optimal dialogue policy in each environment (e.g. the optimal policy depending on the dysarthria severity of the speaker or on the amount of acoustic data used to adapt the ASR). The dialogue state tracker (the module in charge of encoding all the information seen in the dialogue so far into a fixed-length vector) is a key component of the dialogue manager. However, very little research has been done so far on adapting the state tracker to unique users interacting with a system over a long period of time. This talk explains how slot-based state trackers can be adapted to specific users and how ASR behaviour information can be used to improve state tracking generalisation to unseen dialogue states.
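As a toy illustration of a slot-based tracker and of where per-user adaptation could enter, the Python sketch below keeps a per-slot score over values, updates it from (slot, value, ASR confidence) tuples each turn, and scales confidences by a per-user reliability factor. All names and numbers are invented; this is not the talk's actual model.

```python
# Toy slot-based dialogue state tracker with a per-user ASR reliability weight.
from collections import defaultdict

class SlotTracker:
    def __init__(self, asr_reliability=1.0):
        # asr_reliability: per-user scaling of ASR confidences (personalisation stand-in).
        self.asr_reliability = asr_reliability
        self.beliefs = defaultdict(lambda: defaultdict(float))  # slot -> value -> score

    def update(self, slu_hypotheses):
        """slu_hypotheses: list of (slot, value, asr_confidence) tuples for one turn."""
        for slot, value, conf in slu_hypotheses:
            self.beliefs[slot][value] += self.asr_reliability * conf

    def state_vector(self, slot, values):
        """Fixed-length encoding of the belief over a known value set."""
        total = sum(self.beliefs[slot].values()) or 1.0
        return [self.beliefs[slot][v] / total for v in values]

tracker = SlotTracker(asr_reliability=0.6)   # e.g. a severely dysarthric speaker
tracker.update([("device", "lamp", 0.7), ("device", "radio", 0.2)])
tracker.update([("device", "lamp", 0.5)])
print(tracker.state_vector("device", ["lamp", "radio", "tv"]))
```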



Dr Tony Tew

Audio Lab, Department of Electronics, University of York

Around the head in 80 ways

Abstract: The complex shape (morphology) of the outer ears and their uniqueness to each listener continue to pose challenges for the successful introduction of binaural spatial audio on a large scale. This talk will outline approaches being taken in the Audio Lab Research Group at York to address some of these problems. Morphoacoustic perturbation analysis (MPA) is a powerful technique for relating features of head-related transfer functions to their morphological origins. The principles of MPA will be described and some initial validation results presented. One way in which MPA may assist with estimating individualised HRTFs will be discussed. Alternative approaches for determining the perceptual performance of a binaural audio system will be considered. An obvious problem is how to compare a virtual sound rendered binaurally with the equivalent real 3D sound without the listener knowing which they are hearing. This discussion will lead into a brief outline of recent efforts in broadcasting to improve the quality of experience for listeners to binaural spatial audio.

Biography: Tony Tew is a senior lecturer in the Department of Electronics at the University of York. He has a particular interest in auditory acoustics, spatial hearing and applications of binaural signal processing. Collaborators on the work presented in this talk include the University of Sydney, Orange Labs, BBC R&D and Meridian Audio, with additional support from EPSRC and the Australian Research Council.

Host: Guy Brown (g.j.brown@sheffield.ac.uk)



Professor Yannis Stylianou

Professor of Speech Processing at the University of Crete and Group Leader of the Speech Technology Group at Toshiba Cambridge Research Lab, UK.

Speech Intelligibility and Beyond

Abstract: Speech is highly variable in terms of its clarity and intelligibility. Especially in adverse listening contexts (noisy environments, hearing loss, level of language acquisition, etc.), speech intelligibility can be greatly reduced. The first question we will discuss is: can we modify speech, before presenting it to the listening context, with the goal of increasing its intelligibility? Although simply increasing the speech volume is the usual solution in such situations, it is well known that this is not optimal, both in terms of signal distortion and of listener comfort. In this talk, I will present advances in speech signal processing that have been shown to greatly improve the intelligibility of speech in various conditions without increasing its volume. I will show results for normal-hearing people in near and far field, listeners with mild to moderate hearing loss, children with a certain degree of learning disability, etc., and discuss possible applications. We will also discuss ways to evaluate intelligibility, objectively and subjectively, and comment on relatively recent results from two large-scale international evaluations, including the Hurricane Challenge (http://listening-talker.org/hurricane/). The results I will show are partly based on my group's work in an FP7 FET-OPEN project: The Listening Talker. Finally, we will ask a second question: is it sufficient to increase the intelligibility of speech without paying attention to the effort or the cognitive load of the listener? This will not be answered during the talk, but we plan to address it during a new Horizon 2020 Marie Curie ETN project (2016-2019) which is about to start. So I will only put the question on the table and advertise the project, hoping to find interested PhD candidates in the audience (who might, for example, work on the beautiful island of Crete in Greece).

Biography: Yannis Stylianou is Professor of Speech Processing at the University of Crete, Department of Computer Science (CSD UOC), and Group Leader of the Speech Technology Group at Toshiba Cambridge Research Lab, UK. Until 2012, he was also an Associated Researcher in the Signal Processing Laboratory of the Institute of Computer Science (ICS) at FORTH. During the academic year 2011-2012 he was a visiting Professor at AHOLAB, University of the Basque Country, in Bilbao, Spain. He received the Diploma of Electrical Engineering from the National Technical University of Athens (N.T.U.A.) in 1991 and the M.Sc. and Ph.D. degrees in Signal Processing from the Ecole National Superieure des Telecommunications (ENST), Paris, France, in 1992 and 1996, respectively. From 1996 until 2001 he was with AT&T Labs Research (Murray Hill and Florham Park, NJ, USA) as a Senior Technical Staff Member. In 2001 he joined Bell Labs, Lucent Technologies, in Murray Hill, NJ, USA (now Alcatel-Lucent). Since 2002 he has been with the Computer Science Department at the University of Crete, and since January 2013 he has also been with Toshiba Labs in Cambridge, UK. His current research focuses on speech signal processing algorithms for speech analysis, statistical signal processing (detection and estimation), and time-series analysis/modelling. He has (co-)authored more than 170 scientific publications and holds about 20 UK and US patents, which have received more than 4400 citations (excluding self-citations), with an H-index of 31. He co-edited the book “Progress in Non Linear Speech Processing” (Springer-Verlag, 2007). He has been the P.I. and scientific director of several European and Greek research programs and has participated as a leader in USA research programs. Among other projects, he was P.I. of the FET-OPEN project LISTA, “The Listening Talker”, whose goal was to develop scientific foundations for spoken language technologies based on human communicative strategies. In LISTA, he was in charge of speech modelling and speech modification, in order to suggest novel techniques for the spoken output generation of artificial and natural speech. He has created a lab for voice function assessment, equipped with high-quality instruments for speech and voice recordings (e.g., a high-speed camera), for the purpose of basic research in speech and voice as well as for services, in collaboration with the Medical School at the University of Crete. He has served on the Board of the International Speech Communication Association (ISCA), on the IEEE Multimedia Communications Technical Committee, as a member of the IEEE Speech and Language Technical Committee, and on the Editorial Board of Elsevier's Digital Signal Processing journal. He is on the Editorial Board of the Journal of Electrical and Computer Engineering (Hindawi JECE), and Associate Editor of the EURASIP Journal on Speech, Audio, and Music Processing (ASMP) and of the EURASIP Research Letters in Signal Processing (RLSP). He was Associate Editor for the IEEE Signal Processing Letters, Vice-Chair of the COST Action 2103 "Advanced Voice Function Assessment" (VOICE), and on the Management Committee of the COST Action 277 "Nonlinear Speech Processing".

Host: Erfan Loweimi (eloweimi1@sheffield.ac.uk)



Dr Cleopatra Pike

Institute of Sound Recording, University of Surrey

Compensation for spectral envelope distortion in auditory perception

Abstract: Modifications by the transmission channel (loudspeakers, listening rooms, vocal tracts) can distort and colour sounds, preventing recognition. Human perception appears to be robust to channel distortions, and a number of perceptual mechanisms appear to cause compensation for channel acoustics. Lab tests mimicking ‘real-world’ listening show that compensation reduces colouration caused by the channel by a moderate to large extent. These tests also indicate the psychological and physiological mechanisms that may be involved in this compensation. These mechanisms will be discussed, and further work to uncover how humans remove distortions caused by transmission channels will be put forward.

Biography: Cleo’s interest in audio perception and the hearing system began with her studies in Music Production at the Academy of Contemporary Music. In order to pursue this interest further she obtained an MSc in psychological research in 2009 and a PhD in psychoacoustics in 2015. Cleo’s PhD involved measuring the extent to which human listeners adapt to transmission channel acoustics (e.g. loudspeakers, rooms, and vocal tracts) and examining the psychological and neural mechanisms involved in this. Cleo has also worked as a research statistician and a research methods and statistics lecturer at Barts and The London School of Medicine, part of Queen Mary University of London. Cleo's ultimate research aim is to ascertain how human hearing processes can be used to benefit machine listening algorithms and the construction of listening environments, such as concert halls.

Host: Amy Beeston (a.beeston@sheffield.ac.uk)