
SpandH Seminar Abstracts


Dr Michael Stone

Marston Senior Research Fellow in Audiology/Hearing Sciences, Manchester Centre for Audiology and Deafness, School of Health Sciences, University of Manchester

The role of envelope cues in the perception of speech in background sounds

The envelope of an acoustic signal represents the short-term fluctuation in level of an acoustic source. In speech it can be seen as representing the information applied to the carrier (voicing or frication) by the movement of the tongue, teeth and lips. With age and hearing loss, listeners become more reliant on envelope cues because of their restricted access to other cues. Additionally, many forms of audio signal processing operate by modifying the envelope within different frequency bands. Consequently, understanding how listeners access envelope signals is important in order to design effective signal processing schemes. This talk looks at some perceptual properties of acoustic envelopes that define their importance, illustrated with audio demonstrations.
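
As a rough illustration of the kind of envelope discussed here, the following sketch (not from the talk) band-pass filters a signal, takes the Hilbert envelope in each band, and then low-pass filters it to keep only the slow modulations; the band edges and smoothing cut-off are arbitrary choices.

```python
# Illustrative sketch only: temporal envelopes in a few frequency bands via
# band-pass filtering and the Hilbert transform (band edges are arbitrary).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_envelopes(x, fs, bands=((100, 500), (500, 1500), (1500, 4000)), lp_hz=50):
    """Return a (n_bands, n_samples) array of smoothed band envelopes."""
    envs = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))                    # instantaneous amplitude
        sos_lp = butter(2, lp_hz, btype="lowpass", fs=fs, output="sos")
        envs.append(sosfiltfilt(sos_lp, env))          # keep only slow modulations
    return np.array(envs)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    # 1 kHz carrier with a 4 Hz amplitude modulation, i.e. a simple "envelope"
    x = np.sin(2 * np.pi * 1000 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
    print(band_envelopes(x, fs).shape)                 # (3, 16000)
```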

Bio: Michael joined the Manchester Centre for Audiology and Deafness at the University of Manchester in 2014, from Professor Brian Moore’s Auditory Perception Group at the University of Cambridge. One strand of his research concerns the contribution of modulations to speech intelligibility, such as using "pure" energetic masking to isolate its effect from that of modulation masking on speech. Another strand is the early detection of a particular form of noise-induced hearing damage, sometimes called "hidden hearing loss". His collaborations extend to supporting other projects in the centre on the management of paediatric hearing loss, as well as external links with computer science on the use of data analytics in hearing aid dispensing.

Host: Ning Ma (n.ma@sheffield.ac.uk)



Dr Nic Lane

Senior Lecturer at University College London (UCL) and Principal Scientist at Nokia Bell Labs

Squeezing Deep Learning onto Wearables, Phones and Things

In just a few short years, breakthroughs from the field of deep learning have transformed how computational models perform a wide variety of tasks such as recognizing a face, tracking emotions or monitoring physical activities. Unfortunately, deep models and algorithms typically exert severe demands on local device resources and this conventionally limits their adoption within mobile and embedded platforms. Because sensor perception and reasoning are so fundamental to this class of computation, I believe the evolution of devices like phones, wearables and things will be crippled until we reach a point where current -- and future -- deep learning innovations can be simply and efficiently integrated into these systems. In this talk, I will describe our progress towards developing general-purpose support for deep learning on resource-constrained mobile and embedded devices. Primarily, this requires a radical reduction in the resources (viz. energy, memory and computation) consumed by these models -- especially at inference time. I will highlight various, largely complementary, approaches we have invented to achieve this goal including: sparse layer representations, dynamic forms of compression, and scheduling partitioned model architectures. Collectively, these techniques rethink how deep learning algorithms can execute, not only to better cope with mobile and embedded device conditions, but also to increase the utilization of commodity processors (e.g., DSPs, GPUs, CPUs) -- as well as emerging purpose-built deep learning accelerators.
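
As a rough, generic illustration of one way to cut model resources (not necessarily any of the techniques in the talk), the sketch below applies simple magnitude-based weight pruning to a layer's weight matrix; the layer size and target sparsity are arbitrary.

```python
# Illustrative sketch only: magnitude-based pruning of one weight matrix, a
# generic way to sparsify a layer. The talk's own methods (sparse layer
# representations, dynamic compression, partitioned scheduling) go well
# beyond this.
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights so that roughly `sparsity` of them are zero."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k]          # k-th smallest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)
    w_sparse, mask = prune_by_magnitude(w, sparsity=0.9)
    print("non-zero fraction after pruning:", round(mask.mean(), 3))   # about 0.1
```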

Bio: Nic Lane holds dual academic and industrial appointments as a Senior Lecturer (Associate Professor) at University College London (UCL), and a Principal Scientist at Nokia Bell Labs. At UCL, Nic is part of the Digital Health Institute and UCL Interaction Center, while at Bell Labs he leads DeepX -- an embedded-focused deep learning unit at the Cambridge location that is part of the broader Pervasive Sensing and Systems department. Before moving to England, Nic spent four years at Microsoft Research based in Beijing. There he was a Lead Researcher within the Mobile and Sensing Systems group (MASS). Nic's research interests revolve around the systems and modeling challenges that arise when computers collect and reason about people-centric sensor data. At heart, Nic is an experimentalist and likes to build prototypes of next-generation wearable and embedded sensing devices based on well-founded computational models. His work has received multiple best paper awards, including ACM/IEEE IPSN 2017 and two from ACM UbiComp (2012 and 2015). Nic's recent academic service includes serving on the PC for leading venues in his field (e.g., UbiComp, MobiSys, SenSys, WWW, CIKM), and this year he is the PC chair of HotMobile 2017. Nic received his PhD from Dartmouth College in 2011.

Host: Erfan Loweimi (eloweimi1@sheffield.ac.uk)



Dr William Whitmer

MRC Institute of Hearing Research (Scottish Section)

What we Talk about when we Talk about Speech Intelligibility Benefits

Increases in speech intelligibility are conventionally reported as either (a) the change in relative levels of the target speech and noise(s) - the signal-to-noise ratio (SNR) - for a given percentage of utterances (e.g., 79%) or (b) the change in the percentage of utterances heard for a given SNR. What counts as an important change has conventionally been judged solely on the basis of the statistical distribution of scores, not on what is a noticeable or convincing improvement in (a) SNR or (b) intelligibility. Through a series of experiments, we have determined the just noticeable difference (JND) in SNR and intelligibility, as well as inferred the just meaningful difference (JMD): the scale of improvement necessary to prompt an individual to seek intervention. For JNDs, participants of varying hearing ability and age judged which of two speech-in-noise samples was clearer. The SNR JND ranged from 2.4 to 4.4 dB depending on stimulus type. The corresponding intelligibility JNDs, estimated from psychometric functions, ranged from 14 to 33% across stimuli. For JMDs, participants also listened to paired examples of sentences in same-spectrum noise: one at a reference SNR and the other at a variably higher SNR. In different experiments, different participants performed various tasks: (a) better/worse rating, (b) conversation tolerance, (c) device-swap, or (d) clinical importance. The SNR JMD was determined to be approximately 6 dB to reliably motivate participants to seek intervention. The SNR and intelligibility JNDs provide perceptual benchmarks for performance beyond statistical relevance. The SNR JMD further adds clinical relevance to speech-intelligibility improvements. We have also recently studied the JND in speech intelligibility by having participants compare changes in SNR from their individual 50% thresholds across different noises. The data suggest a gap between the psychophysical intelligibility we commonly use and the subjective intelligibility that individuals experience.
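
To make the link between the two kinds of JND concrete, the sketch below (not from the study) reads an intelligibility change off an assumed logistic psychometric function for a given SNR step; the slope and midpoint are invented for illustration and are not the values fitted in these experiments.

```python
# Hedged illustration: converting an SNR JND into an intelligibility JND via a
# logistic psychometric function. Slope and midpoint below are made-up values,
# not the ones estimated in the experiments described above.
import numpy as np

def psychometric(snr_db, midpoint_db=-4.0, slope_per_db=0.3):
    """Proportion correct as a logistic function of SNR (assumed shape)."""
    return 1.0 / (1.0 + np.exp(-slope_per_db * (snr_db - midpoint_db)))

ref_snr = -4.0   # reference SNR (dB), the 50% point of the assumed function
snr_jnd = 3.0    # a just-noticeable SNR change, within the 2.4-4.4 dB range reported above
delta_intel = psychometric(ref_snr + snr_jnd) - psychometric(ref_snr)
print(f"Intelligibility change for a {snr_jnd} dB step: {100 * delta_intel:.1f} percentage points")
```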

Host: Ning Ma (n.ma@sheffield.ac.uk)



Dr Bob Sturm

School of Electronic Engineering and Computer Science, Queen Mary University of London

Horses in Machine Learning

Applied machine learning research entails the application of machine learning methods to specific problem domains and the evaluation of the models that result. Too often, however, evaluation is guided not by principles of relevance, reliability, and validity, but misguided by prevalence (e.g., using a particular method or dataset/benchmark because everyone uses it), convenience (e.g., many software packages provide a variety of statistical tests on any collection of numbers), and intuition (e.g., "more data is better"). I illustrate these using two examples in the domain of music informatics: 1) support vector machines with deep scattering features applied to music audio genre classification, evaluated using 10-fold cross-validation in the GTZAN benchmark dataset (Anden and Mallat, "Deep scattering spectrum," IEEE Trans. Signal Process. 62(16): 4114–4128, Aug 2014); 2) deep neural networks with energy envelope periodicity features applied to music audio rhythm recognition, evaluated using repeated 80/20 train/test partitioning in the benchmark BALLROOM dataset (Pikrakis, “A deep learning approach to rhythm modelling with applications,” in Proc. Int. Workshop Machine Learning and Music, 2013).

Bio: Bob Sturm is currently a Lecturer in Digital Media at the Centre for Digital Music (http://c4dm.eecs.qmul.ac.uk/) in the School of Electronic Engineering and Computer Science, Queen Mary University of London. He specialises in audio and music signal processing, machine listening, and evaluation. He organised the 2016 HORSE workshop at QMUL (http://c4dm.eecs.qmul.ac.uk/horse2016), which focused on sanity in applied machine learning.

Host: Erfan Loweimi (eloweimi1@sheffield.ac.uk)



Professor Kirill V Horoshenkov

Department of Mechanical Engineering, University of Sheffield

Acoustic Condition Identification and Classification in Pipes Using Machine Learning Methods

Airborne acoustic waves can provide a highly reliable means of remotely measuring the hydraulic, operational and structural characteristics of buried pipes that are otherwise hard to access. This talk describes an acoustical technology which can be used to detect a range of conditions in underground networks of pipes. It explains how various pattern recognition methods can be applied to discriminate between the acoustic signatures recorded in pipes in order to detect a change. The talk also presents some field work, the results of which demonstrate that the proposed acoustical technology allows for very rapid, remote inspection of urban water infrastructure and can partly replace more conventional and slower CCTV inspection.

Bio: Kirill Horoshenkov is Professor of Acoustics at the University of Sheffield. He is an expert in outdoor sound propagation, acoustic materials and instrumentation. In recognition of his contribution to the field of acoustics, he was awarded the prestigious Tyndall Medal by the Institute of Acoustics in 2006. Horoshenkov holds a degree in Acoustics and Ultrasonic Engineering from Moscow State Institute of Radio-engineering Electronics and Automation (1989) and a PhD in Acoustics from the University of Bradford (1997). He was an academic at the University of Bradford from 1995 to 2013 and currently holds the position of Professor of Acoustics at Sheffield. He has been actively involved in the professional life of the acoustic community in the UK and overseas, serving as a member of the Engineering Division Committee of the Institute of Acoustics (IOA), a member of the Engineering and Physical Sciences Research Council (EPSRC) Peer Review College, and Associate Editor of the Journal of the Acoustical Society of America, Applied Acoustics and Acta Acustica united with Acustica. He has authored/co-authored three books and over 140 journal and conference papers, and has successfully submitted four patent applications.

Host: Erfan Loweimi (eloweimi1@sheffield.ac.uk)



Professor Andrew Lambourne

School of Computing, Creative Technology and Engineering, Leeds Beckett University

Computers and language – is it all smoke and mirrors?

We’re all familiar with computer applications which recognise speech, translate text, or speak to us. In fact, if you string those functions together you can fairly easily build a machine which listens in one language, translates what it heard, and speaks it in another. Fantastic! Humans are redundant, and true AI has arrived. Or has it? 20 years ago it amazed me to discover that speech recognition systems are not constructed around an explicit model of the grammar of the language. Instead, they work by having been trained on the probabilities that patterns and sequences of sounds are likely to correspond to equivalent sequences of words. And this method and this training have of course been provided by humans. So, innately, the machines have no intelligence at all – they are simply number-crunching the product of human endeavour, be it in the construction of the mathematical or neural network modelling techniques, or of the language material on which the training is based. And much the same applies to machine translation. Having put them firmly back in their place, we can now start to reassess these so-called intelligent speech and language processing systems rather more realistically, and understand that what we are seeing is probably better called “Simulated Intelligence” at this stage. In order to move forward towards real AI, and to processing Natural Language in its richest sense, computers will probably need to be “grounded” in the real-world concepts of context, content, relationships and discourse. This is a huge quantum leap compared to what we have now, and one way of illuminating it is by investigating just how well these state-of-the-art technologies cope when faced with real-world challenges. Where do they fail, how, and why? If we can understand the ways in which they currently fail, we can shed useful light on the next research and design steps which might actually lead us closer to true AI.

Host: Erfan Loweimi (eloweimi1@sheffield.ac.uk)

Nataliya Keberle and Hennadii Dobrovolskyi

Department of IT, Zaporozhye National University, Zaporozhye, Ukraine

Language-Independent Pronunciation Quality Assessment by Comparison with Sample

The task of pronunciation quality assessment by comparison with a reference example usually requires a large training set of such examples, including both correct and mispronounced utterances. Some form of machine learning algorithm is then used to distinguish correct pronunciation from inaccurate pronunciation. There are at least two shortcomings to this approach. First, large collections of properly annotated voice recordings are rare, even for widely used human languages. Second, the mispronounced utterances always depend on the native language of the speaker, so they cannot cover all possible error patterns. We propose an approach to assess pronunciation quality by comparison with a small set of high-quality reference utterance examples. The key points of the proposed method are artificial extension of the dataset, time warping with a silence model, splitting voice recordings into pseudo-phonemes and pseudo-syllables, and calculation of MFCC features. A student's utterance is then classified as correct or mispronounced using a bagging method.
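
A minimal sketch of the comparison step only, assuming MFCC matrices are already available from some front-end: dynamic time warping aligns the learner and reference sequences and the normalised alignment cost serves as a crude score. The silence model, pseudo-phoneme segmentation and bagging classifier described above are not reproduced here.

```python
# Illustrative sketch: align two MFCC sequences with dynamic time warping (DTW)
# and use the path-normalised cost as a rough pronunciation score.
import numpy as np

def dtw_cost(ref, test):
    """ref, test: (n_frames, n_mfcc) arrays. Returns path-normalised DTW cost."""
    n, m = len(ref), len(test)
    dist = np.linalg.norm(ref[:, None, :] - test[None, :, :], axis=-1)   # frame-wise distances
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

# Usage: mfcc_ref and mfcc_student would come from any MFCC front-end;
# random stand-ins are used here just to show the shapes involved.
mfcc_ref = np.random.randn(120, 13)       # reference utterance
mfcc_student = np.random.randn(140, 13)   # learner utterance
print("lower cost = closer to the reference:", round(dtw_cost(mfcc_ref, mfcc_student), 3))
```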

Host: Mauro Nicolao (m.nicolao@sheffield.ac.uk)

Jose Gonzalez Lopez and Phil Green

SPandH Internal

Silent Speech: Reconstructing Speech from Sensor Data by Machine Learning

Total removal of the larynx is often required to treat laryngeal cancer: every year some 17,500 people in Europe and North America lose the ability to speak in this way. Current methods for restoring speech include the electro-larynx, which produces an unnatural, electronic voice, oesophageal (belching) speech, which is difficult to learn, and fistula valve speech, which is considered to be the current gold standard but requires regular hospital visits for valve replacement and produces a masculine voice unpopular with female patients. All these methods sacrifice the patient's spoken identity. Here we introduce a technique which has the potential to restore the power of speech by sensing movement of the remaining speech articulators and using machine learning algorithms to derive a transformation which converts this sensor data into an acoustic signal - 'Silent Speech'. The sensing technique, developed by our collaborators at the University of Hull and called 'Permanent Magnetic Articulography', involves attaching small, unobtrusive magnets to the lips and tongue and monitoring changes in the magnetic field induced by their movement. We report experiments with several machine learning techniques and show that the Silent Speech generated, which may be delivered in real time, is intelligible and sounds natural. The identity of the speaker is recognisable. Previous work involving speech recognition followed by synthesis is, at best, like having an interpreter. In contrast, our 'Direct Synthesis' is like getting your voice back. This work is supported by the NIHR Invention for Innovation scheme.
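
As a conceptual sketch only of the 'Direct Synthesis' idea, the code below treats it as a frame-level regression from articulatory sensor features to acoustic features, using a generic scikit-learn regressor on random stand-in data; the real PMA sensing, feature design, models and vocoder are not shown.

```python
# Conceptual sketch: "direct synthesis" as a regression from articulator sensor
# frames to acoustic frames (e.g. spectral parameters a vocoder could render).
# All data below are random stand-ins, not PMA recordings.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_frames, n_sensor, n_acoustic = 2000, 9, 25      # assumed dimensions, for illustration only
X = rng.normal(size=(n_frames, n_sensor))         # articulatory sensor frames (stand-in)
W = rng.normal(size=(n_sensor, n_acoustic))
Y = np.tanh(X @ W) + 0.1 * rng.normal(size=(n_frames, n_acoustic))   # acoustic frames (stand-in)

model = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=300, random_state=0)
model.fit(X[:1500], Y[:1500])                     # learn the sensor-to-acoustic mapping
print("held-out R^2 on stand-in data:", round(model.score(X[1500:], Y[1500:]), 3))
```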



Christian Füllgrabe

MRC Institute of Hearing Research, Nottingham

Beyond audibility - The role of supra-threshold auditory and cognitive processing in speech perception across the adult lifespan

Anecdotal evidence and experimental investigations indicate that older people experience increased speech-perception difficulties, especially in noisy environments. Since peripheral hearing sensitivity declines with age, lower speech intelligibility is generally attributed to a reduction in audibility. However, aided speech perception in hearing-impaired listeners frequently falls short of the performance level that would be expected based on the audibility of the speech signal. Given that many of these listeners are older, poor performance may be partly caused by age-related changes in supra-threshold auditory and/or cognitive processing that are not captured by an audiometric assessment. The presentation will discuss experimental evidence obtained from clinically normal-hearing adult listeners showing that auditory temporal processing, cognition, and speech-in-noise perception are indeed linked and, independently of hearing loss, decline across the adult lifespan. These findings highlight the need to take these factors, unrelated to audibility, into account in the prediction and rehabilitation of speech intelligibility.



Alessandro Di Nuovo

Centre for Automation and Robotics Research, Sheffield Hallam University

Number Understanding Modelling in a Behavioural Embodied Robot

The talk will present recent cognitive developmental robotics studies that use deep artificial neural network architectures to model the learning of associations between (motor) finger counting, (visual) object counting and (auditory) number words, together with sequence learning, in order to explore whether finger counting and the association of number words or digits with each finger could serve to bootstrap the representation of number.

The results obtained in the experiments with the iCub humanoid robotic platform show that learning the number word sequences together with finger sequencing helps the robot to quickly build an initial representation of numbers. Just as has been found with young children, through the use of finger counting and verbal counting strategies, such a robotic model develops finger and word representations that subsequently sustain the robot's learning of the basic arithmetic operation of addition.

The ambition of the current work is to exploit embodied mathematical processing, considered an archetypal example of abstract and symbolic processing, as a fundamental cognitive capability of the next generation of interactive robots with human-like learning behaviours. This will positively influence the acceptance of robots in socially interactive environments, thereby increasing the socio-economic applications of future robots, in particular for tasks once thought too delicate to automate, especially in the fields of social care, companionship, children's therapy, domestic assistance, entertainment, and education.

Bibliography

  • A. Di Nuovo, V. M. De La Cruz, and A. Cangelosi, “Grounding fingers, words and numbers in a cognitive developmental robot,” in Computational Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), 2014 IEEE Symposium on, 2014, pp. 9–15.
  • A. Di Nuovo, V. M. De La Cruz, A. Cangelosi, and S. Di Nuovo, “The iCub learns numbers: An embodied cognition study,” in International Joint Conference on Neural Networks (IJCNN 2014), 2014, pp. 692–699.
  • V. M. De La Cruz, A. Di Nuovo, S. Di Nuovo, and A. Cangelosi, “Making fingers and words count in a cognitive robot,” Front. Behav. Neurosci., vol. 8, p. 13, 2014.
  • A. Di Nuovo, V. M. De La Cruz, and A. Cangelosi, “A Deep Learning Neural Network for Number Cognition: A bi-cultural study with the iCub,” in IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob) 2015, 2015, pp. 320–325.
  • A. Cangelosi, A. Morse, A. Di Nuovo, M. Rucinski, F. Stramandinoli, D. Marocco, V. De La Cruz, and K. Fischer, “Embodied language and number learning in developmental robots,” in Conceptual and Interactive Embodiment: Foundations of Embodied Cognition, vol. 2, Routledge, 2016, pp. 275–293.


Rosanna Milner

Internal - Interspeech 2016 pre

DNN-based speaker clustering for speaker diarisation

Abstract: Speaker diarisation, the task of answering "who spoke when?", is often considered to consist of three independent stages: speech activity detection, speaker segmentation and speaker clustering. These represent the separation of speech and non-speech, the splitting into speaker homogeneous speech segments, followed by grouping together those which belong to the same speaker. This paper is concerned with speaker clustering, which is typically performed by bottom-up clustering using the Bayesian information criterion (BIC). We present a novel semi-supervised method of speaker clustering based on a deep neural network (DNN) model. A speaker separation DNN trained on independent data is used to iteratively relabel the test data set. This is achieved by reconfiguration of the output layer, combined with fine tuning in each iteration. A stopping criterion involving posteriors as confidence scores is investigated. Results are shown on a meeting task (RT07) for single distant microphones and compared with standard diarisation approaches. The new method achieves a diarisation error rate (DER) of 14.8%, compared to a baseline of 19.9%.
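
A conceptual sketch of the iterative relabelling loop only, using a generic scikit-learn classifier on per-segment features with stand-in data; the actual system starts from a speaker-separation DNN trained on independent data and reconfigures its output layer, which is not reproduced here.

```python
# Conceptual sketch: iteratively retrain a classifier on the current segment
# labels, relabel segments from its posteriors, and stop when labels are stable.
import numpy as np
from sklearn.neural_network import MLPClassifier

def iterative_relabel(segment_feats, initial_labels, max_iters=10):
    labels = np.asarray(initial_labels).copy()
    post = None
    for _ in range(max_iters):
        clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
        clf.fit(segment_feats, labels)                  # retrain ("fine-tune") on current labels
        post = clf.predict_proba(segment_feats)         # per-segment speaker posteriors
        new_labels = clf.classes_[post.argmax(axis=1)]
        if np.array_equal(new_labels, labels):          # stopping criterion: labels unchanged
            break
        labels = new_labels
    return labels, post.max(axis=1)                     # cluster labels and confidence scores

# Stand-in data: 60 segments from 3 speakers, with about 20% of initial labels wrong.
rng = np.random.default_rng(1)
true = rng.integers(0, 3, 60)
feats = true[:, None] + 0.3 * rng.normal(size=(60, 16))
noisy = np.where(rng.random(60) < 0.2, rng.integers(0, 3, 60), true)
labels, conf = iterative_relabel(feats, noisy)
print("agreement with true clusters:", round((labels == true).mean(), 2))
```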



Yulan Liu

Internal - Interspeech 2016 pre

The Sheffield Wargame Corpus - Day Two and Day Three

Abstract: Improving the performance of distant speech recognition is of considerable current interest, driven by a desire to bring speech recognition into people's homes. Standard approaches to this task aim to enhance the signal prior to recognition, typically using beamforming techniques on multiple channels. Only a few real-world recordings are available that allow experimentation with such techniques. This has become even more pertinent with recent work using deep neural networks to learn beamforming from data. Such approaches require large multi-channel training sets, ideally with location annotation for moving speakers, which is scarce in existing corpora. This paper presents a new, freely available extended corpus of English speech recordings in a natural setting, with moving speakers. The data is recorded with diverse microphone arrays and, uniquely, with ground-truth location tracking. It extends the 8.0-hour Sheffield Wargames Corpus released at Interspeech 2013 with a further 16.6 hours of fully annotated data, including 6.1 hours of female speech to improve gender balance. Additional blog-based language model data is provided alongside, as well as a Kaldi baseline system. Results are reported with a standard Kaldi configuration, and a baseline meeting recognition system.



Thomas Hain

Internal - Interspeech 2016 pre

webASR 2 - Improved cloud based speech technology

Abstract: This paper presents the most recent developments of the webASR service (www.webasr.org), the world's first web-based fully functioning automatic speech recognition platform for scientific use. Initially released in 2008, the functionalities of webASR have recently been expanded with three main goals in mind: facilitate access through a RESTful architecture, which allows for easy use through either the web interface or an API; allow the use of input metadata, when available from the user, to improve system performance; and increase the coverage of available systems beyond speech recognition. Several new systems for transcription, diarisation, lightly supervised alignment and translation are currently available through webASR. Results on a series of well-known benchmarks (the RT'09, IWSLT'12 and MGB'15 evaluations) show how these webASR systems provide state-of-the-art performance across these tasks.



Salil Deena

Internal - Interspeech 2016 pre

Combining Feature and Model-Based Adaptation of RNNLMs for Multi-Genre Broadcast Speech Recognition

Abstract: Recurrent neural network language models (RNNLMs) have consistently outperformed n-gram language models when used in automatic speech recognition (ASR). This is because RNNLMs provide robust parameter estimation through the use of a continuous-space representation of words, and can generally model longer context dependencies than n-grams. The adaptation of RNNLMs to new domains remains an active research area and the two main approaches are: feature-based adaptation, where the input to the RNNLM is augmented with auxiliary features; and model-based adaptation, which includes model fine-tuning and the introduction of adaptation layer(s) in the network. This paper explores the properties of both types of adaptation on multi-genre broadcast speech recognition. Two hybrid adaptation techniques are proposed, namely the fine-tuning of feature-based RNNLMs and the use of a feature-based adaptation layer. A method for the semi-supervised adaptation of RNNLMs, using topic model-based genre classification, is also presented and investigated. The gains obtained with RNNLM adaptation on a system trained on 700 hours of speech are consistent for RNNLMs trained on both a small (10M words) and a large (660M words) text set, with 10% perplexity and 2% relative word error rate improvements on a 28.3-hour test set.
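
A hedged sketch of the feature-based side of such adaptation: an auxiliary vector (for example a topic or genre posterior) is concatenated to the word embedding at every time step of an LSTM language model. Layer sizes are illustrative and this is not the paper's exact architecture.

```python
# Illustrative sketch in PyTorch: an RNN language model whose input is the word
# embedding concatenated with an utterance-level auxiliary feature vector.
import torch
import torch.nn as nn

class AuxRNNLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, aux_dim=8, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim + aux_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, words, aux):
        # words: (batch, seq) word ids; aux: (batch, aux_dim) e.g. genre/topic posteriors
        emb = self.embed(words)                                   # (batch, seq, emb_dim)
        aux_rep = aux.unsqueeze(1).expand(-1, emb.size(1), -1)    # repeat per time step
        h, _ = self.lstm(torch.cat([emb, aux_rep], dim=-1))
        return self.out(h)                                        # next-word logits

model = AuxRNNLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (4, 20)), torch.rand(4, 8))
print(logits.shape)   # torch.Size([4, 20, 10000])
```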



Ning Ma

Internal - Interspeech 2016 pre

Speech Localisation in a Multitalker Mixture by Humans and Machines

Abstract: Speech localisation in multitalker mixtures is affected by the listener's expectations about the spatial arrangement of the sound sources. This effect was investigated via experiments with human listeners and a machine system, in which the task was to localise a female-voice target among four spatially distributed male-voice maskers. Two configurations were used: either the masker locations were fixed or the locations varied from trial to trial. The machine system uses deep neural networks (DNNs) to learn the relationship between binaural cues and source azimuth, and exploits top-down knowledge about the spectral characteristics of the target source. Performance was examined in both anechoic and reverberant conditions. Our experiments show that the machine system outperformed listeners in some conditions. Both the machine and listeners were able to make use of a priori knowledge about the spatial configuration of the sources, but the effect for headphone listening was smaller than that previously reported for listening in a real room.
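
A conceptual sketch of the classification idea only, with random stand-in cues: a small network maps per-band binaural features to azimuth classes and frame posteriors are pooled over an utterance. The cross-correlation/ILD front-end and the top-down spectral model of the target used in the actual system are not reproduced.

```python
# Conceptual sketch: azimuth estimation as frame-level classification of
# binaural cues, with posteriors pooled over an utterance. Cues are stand-ins.
import numpy as np
from sklearn.neural_network import MLPClassifier

azimuths = np.arange(-90, 91, 5)                   # candidate azimuth classes (degrees)
n_frames, n_cues = 6000, 64                        # e.g. ITD/ILD-like cues over 32 bands (assumed)

rng = np.random.default_rng(0)
y = rng.integers(0, len(azimuths), n_frames)
X = 0.2 * y[:, None] + rng.normal(size=(n_frames, n_cues))   # cues depending weakly on azimuth

dnn = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=100, random_state=0)
dnn.fit(X, y)

# One "utterance": 100 frames from a single stand-in source at azimuths[24] = 30 degrees.
test = 0.2 * 24 + rng.normal(size=(100, n_cues))
utt_post = dnn.predict_proba(test).mean(axis=0)    # pool frame-level posteriors
print("estimated azimuth:", azimuths[dnn.classes_[utt_post.argmax()]], "degrees")
```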



Raymond Ng

Internal - Interspeech 2016 pre

Combining Weak Tokenisers for Phonotactic Language Recognition in a Resource-constrained Setting

Abstract: In the phonotactic approach to language recognition, a phone tokeniser is normally used to transform the audio signal into acoustic tokens. The language identity of the speech is modelled by the occurrence statistics of the decoded tokens. The performance of this approach depends heavily on the quality of the audio tokeniser. A high-quality tokeniser in matched conditions is not always available for a language recognition task. This study investigated the performance of a phonotactic language recogniser in a resource-constrained setting, following the NIST LRE 2015 specification. An ensemble of phone tokenisers was constructed by applying unsupervised sequence training on different target languages, followed by a score-based fusion. This method gave a 5-7% relative performance improvement over the baseline system on the LRE 2015 evaluation set. This gain was retained when the ensemble phonotactic system was further fused with an acoustic iVector system.



Mortaza Doulaty

Internal - Interspeech 2016 pre

Automatic Genre and Show Identification of Broadcast Media

Abstract: Huge amounts of digital videos are being produced and broadcast every day, leading to giant media archives. Effective techniques are needed to make such data more accessible. Automatic meta-data labelling of broadcast media is an essential task for multimedia indexing, where it is standard to use multi-modal input for such purposes. This paper describes a novel method for automatic detection of media genre and show identities using acoustic features, textual features or a combination thereof. Furthermore, the inclusion of available meta-data, such as time of broadcast, is shown to lead to very high performance. Latent Dirichlet Allocation is used to model both acoustics and text, yielding fixed-dimensional representations of media recordings that can then be used in Support Vector Machine based classification. Experiments are conducted on more than 1200 hours of TV broadcasts from the British Broadcasting Corporation (BBC), where the task is to categorise the broadcasts into 8 genres or 133 show identities. On a 200-hour test set, accuracies of 98.6% and 85.7% were achieved for genre and show identification respectively, using a combination of acoustic and textual features with meta-data.
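
A hedged sketch of the text side only: topic posteriors from Latent Dirichlet Allocation used as fixed-dimensional features for an SVM classifier, on synthetic stand-in counts. The paper additionally models acoustic tokens with LDA, works on real BBC data and fuses metadata, none of which is shown here.

```python
# Illustrative sketch: LDA topic posteriors as fixed-dimensional features for an
# SVM genre classifier. Word counts below are random stand-ins, not BBC data.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_docs, vocab, n_genres = 600, 1000, 8
genre = rng.integers(0, n_genres, n_docs)
base = rng.dirichlet(np.ones(vocab) * 0.1, size=n_genres)          # one word distribution per genre
counts = np.vstack([rng.multinomial(200, base[g]) for g in genre])  # stand-in bag-of-words counts

topics = LatentDirichletAllocation(n_components=20, random_state=0).fit_transform(counts)
clf = SVC().fit(topics[:500], genre[:500])                          # SVM on topic posteriors
print("held-out accuracy on stand-in data:", round(clf.score(topics[500:], genre[500:]), 3))
```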



Yanmeng Guo

Internal - Interspeech 2016 pre

A robust dual-microphone speech source localization algorithm for reverberant environments




Erfan Loweimi

Internal - Interspeech 2016 pre

Use of Generalised Nonlinearity in Vector Taylor Series Noise Compensation for Robust Speech Recognition

Abstract: Designing good normalisation to counter the effect of environmental distortions is one of the major challenges for automatic speech recognition (ASR). The Vector Taylor series (VTS) method is a powerful and mathematically well-principled technique that can be applied in both the feature and model domains to compensate for both additive and convolutional noise. One of the limitations of this approach, however, is that it is tied to MFCC (and log-filterbank) features and does not extend to other representations such as PLP, PNCC and phase-based front-ends that use a power transformation rather than log compression. This paper aims at broadening the scope of the VTS method by deriving a new formulation that assumes a power transformation is used as the non-linearity during feature extraction. It is shown that the conventional VTS, in the log domain, is a special case of the new extended framework. In addition, the new formulation introduces one more degree of freedom, which makes it possible to tune the algorithm to better fit the data to the statistical requirements of the ASR back-end. Compared with MFCC and conventional VTS, the proposed approach provides up to 12.2% and 2.0% absolute performance improvements on average on Aurora-4 tasks, respectively.
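
For reference, a sketch of the conventional log-domain VTS mismatch function that the abstract describes as a special case, written here from the general VTS literature rather than taken from this paper; VTS then linearises this expression around the clean-speech, channel and noise means with a first-order Taylor expansion.

```latex
% Standard log-domain VTS mismatch function (general literature, not this paper):
% cepstra of noisy speech y, clean speech x, channel h and additive noise n,
% with DCT matrix C. The paper replaces the log non-linearity by a power
% transformation and recovers this expression as a special case.
\[
  \mathbf{y} \;=\; \mathbf{x} + \mathbf{h}
  + \mathbf{C}\,\log\!\left( \mathbf{1} + \exp\!\left( \mathbf{C}^{-1}\left(\mathbf{n} - \mathbf{x} - \mathbf{h}\right) \right) \right)
\]
```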

Iñigo Casanueva

Internal - Interspeech 2016 pre

Dialogue State Tracking Personalisation for Users with Speech Disorders

Abstract: Voice controlled environmental control interfaces can be a life changing technology for users with disabilities. However, often these users suffer from speech disorders (e.g. dysarthria), making ASR very challenging. Acoustic model adaptation can improve the performance of the ASR, but the error rate will still be high for severe dysarthric speakers. POMDP-based dialogue management can improve the performance of these interfaces due to its robustness against high ASR error rates and its ability to find the optimal dialogue policy in each environment (e.g. the optimal policy depending on the dysarthria severity of the speaker or on the amount of acoustic data used to adapt the ASR). The dialogue state tracker (the module in charge of encoding all the information seen in the dialogue so far into a fixed-length vector) is a key component of the dialogue manager. However, very little research has been done so far on adapting the state tracker to unique users interacting with a system over a long period of time. This talk explains how slot-based state trackers can be adapted to specific users and how ASR behaviour information can be used to improve state tracking generalisation to unseen dialogue states.



Dr Tony Tew

Audio Lab, Department of Electronics, University of York

Around the head in 80 ways

Abstract: The complex shape (morphology) of the outer ears and their uniqueness to each listener continue to pose challenges for the successful introduction of binaural spatial audio on a large scale. This talk will outline approaches being taken in the Audio Lab Research Group at York to address some of these problems. Morphoacoustic perturbation analysis (MPA) is a powerful technique for relating features of head-related transfer functions to their morphological origins. The principles of MPA will be described and some initial validation results presented. One way in which MPA may assist with estimating individualised HRTFs will be discussed. Alternative approaches for determining the perceptual performance of a binaural audio system will be considered. An obvious problem is how to compare a virtual sound rendered binaurally with the equivalent real 3D sound without the listener knowing which they are hearing. This discussion will lead into a brief outline of recent efforts in broadcasting to improve the quality of experience for listeners to binaural spatial audio.

Biography: Tony Tew is a senior lecturer in the Department of Electronics at the University of York. He has a particular interest in auditory acoustics, spatial hearing and applications of binaural signal processing. Collaborators on the work presented in this talk include the University of Sydney, Orange Labs, BBC R&D and Meridian Audio, with additional support from EPSRC and the Australian Research Council.

Host: Guy Brown (g.j.brown@sheffield.ac.uk)



Professor Yannis Stylianou

Professor of Speech Processing at the University of Crete and Group Leader of the Speech Technology Group at Toshiba Cambridge Research Lab, UK.

Speech Intelligibility and Beyond

Abstract: Speech is highly variable in terms of its clarity and intelligibility. Especially in adverse listening contexts (a noisy environment, hearing loss, level of language acquisition, etc.), speech intelligibility can be greatly reduced. The first question we will discuss is: can we modify speech, before presenting it in the listening context, with the goal of increasing its intelligibility? Although simply increasing the speech volume is the usual solution in such situations, it is well known that this is not optimal, both in terms of signal distortion and of the listener's comfort. In this talk, I will present advances in speech signal processing that have been shown to greatly improve the intelligibility of speech in various conditions without increasing its volume. I will show results for normal-hearing people in near and far field, listeners with mild to moderate hearing losses, and children with a certain degree of learning disability, and discuss possible applications. We will also discuss ways to evaluate intelligibility, objectively and subjectively, and comment on relatively recent results from two large-scale international evaluations, including the Hurricane Challenge (http://listening-talker.org/hurricane/). The results I will show are partially based on my group's work in an FP7 FET-OPEN project: The Listening Talker. Finally, we will pose a second question: is it sufficient to increase the intelligibility of speech without paying attention to the effort or the cognitive load of the listener? This will not be answered during the talk, but we plan to address it in a new Horizon 2020 Marie Curie ETN project (2016-2019) which is about to start. So I will only put the question on the table and advertise the project, hoping to find in the audience interested candidates for a PhD (for example, on the beautiful island of Crete in Greece).

Biography: Yannis Stylianou is Professor of Speech Processing at the University of Crete, Department of Computer Science (CSD UOC), and Group Leader of the Speech Technology Group at Toshiba Cambridge Research Lab, UK. Until 2012, he was also an Associated Researcher in the Signal Processing Laboratory of the Institute of Computer Science (ICS) at FORTH. During the academic year 2011-2012 he was a visiting Professor at AHOLAB, University of the Basque Country, in Bilbao, Spain. He received the Diploma of Electrical Engineering from the National Technical University of Athens (N.T.U.A.) in 1991 and the M.Sc. and Ph.D. degrees in Signal Processing from the Ecole Nationale Superieure des Telecommunications (ENST), Paris, France, in 1992 and 1996, respectively. From 1996 until 2001 he was with AT&T Labs Research (Murray Hill and Florham Park, NJ, USA) as a Senior Technical Staff Member. In 2001 he joined Bell Labs Lucent Technologies in Murray Hill, NJ, USA (now Alcatel-Lucent). Since 2002 he has been with the Computer Science Department at the University of Crete, and since January 2013 he has also been with Toshiba Labs in Cambridge, UK. His current research focuses on speech signal processing algorithms for speech analysis, statistical signal processing (detection and estimation), and time-series analysis/modelling. He has (co-)authored more than 170 scientific publications and holds about 20 UK and US patents, which have received more than 4400 citations (excluding self-citations), with an H-index of 31. He co-edited the book “Progress in Non Linear Speech Processing”, Springer-Verlag, 2007. He has been the P.I. and scientific director of several European and Greek research programmes and has participated as a leader in US research programmes. Among other projects, he was P.I. of the FET-OPEN project LISTA: “The Listening Talker”, whose goal was to develop scientific foundations for spoken language technologies based on human communicative strategies. In LISTA, he was in charge of speech modelling and speech modification, in order to suggest novel techniques for spoken output generation of artificial and natural speech. He has created a lab for voice function assessment equipped with high-quality instruments for speech and voice recordings (e.g. a high-speed camera), for the purpose of basic research in speech and voice as well as for services, in collaboration with the Medical School at the University of Crete. He has served on the Board of the International Speech Communication Association (ISCA), on the IEEE Multimedia Communications Technical Committee, as a member of the IEEE Speech and Language Technical Committee, and on the Editorial Board of Elsevier's Digital Signal Processing journal. He is on the Editorial Board of the Journal of Electrical and Computer Engineering (Hindawi JECE), and is an Associate Editor of the EURASIP Journal on Audio, Speech, and Music Processing (ASMP) and of EURASIP Research Letters in Signal Processing (RLSP). He was an Associate Editor for IEEE Signal Processing Letters, Vice-Chair of COST Action 2103: "Advanced Voice Function Assessment" (VOICE), and on the Management Committee of COST Action 277: "Nonlinear Speech Processing".

Host: Erfan Loweimi (eloweimi1@sheffield.ac.uk)



Dr Cleopatra Pike

Institute of Sound Recording, University of Surrey

Compensation for spectral envelope distortion in auditory perception

Abstract: Modifications by the transmission channel (loudspeakers, listening rooms, vocal tracts) can distort and colour sounds, preventing recognition. Human perception appears to be robust to channel distortions, and a number of perceptual mechanisms appear to produce compensation for channel acoustics. Lab tests mimicking ‘real-world’ listening show that compensation reduces colouration caused by the channel to a moderate-to-large extent. These tests also indicate the psychological and physiological mechanisms that may be involved in this compensation. These mechanisms will be discussed, and further work to uncover how humans remove distortions caused by transmission channels will be put forward.

Biography: Cleo’s interest in audio perception and the hearing system began with her studies in Music Production at the Academy of Contemporary Music. In order to pursue this interest further, she obtained an MSc in psychological research in 2009 and a PhD in psychoacoustics in 2015. Cleo’s PhD involved measuring the extent to which human listeners adapt to transmission channel acoustics (e.g. loudspeakers, rooms, and vocal tracts) and examining the psychological and neural mechanisms involved in this. Cleo has also worked as a research statistician and a research methods and statistics lecturer at Barts and The London School of Medicine, part of Queen Mary University of London. Cleo's ultimate research aim is to ascertain how human hearing processes can be used to benefit machine listening algorithms and the construction of listening environments, such as concert halls.

Host: Amy Beeston (a.beeston@sheffield.ac.uk)