SpandH Seminar Abstracts 2010

7 December 2010

Amy Beeston

Department of Computer Science, University of Sheffield

Compensation for reverberation

Reverberation adversely affects artificial listening devices, and automatic speech recognition (ASR) systems often suffer high error rates with even minimal reflected sound energy. Contrary to this, human listeners exhibit constancy for hearing just as they do for seeing: they appear to compensate for reverberation, and can improve recognition of spoken words in reverberant rooms by deriving information from contextual sound [Watkins (2005). J Acoust Soc Am. 118, 249-262].

In contrast to traditional engineering methods for dereverberation, this talk discusses an auditory modelling approach. Based on principles of human hearing, an efferent suppression feedback loop is used to factor out the increased noise floor that accompanies late reflections [Ferry & Meddis (2007). J Acoust Soc Am. 122, 3519-3526]. Used alongside a simple template-based speech recogniser, the computational model displays human-like compensation patterns in an experiment from Watkins' classic 'sir-stir' paradigm that investigates time-reversal of speech and reverberation conditions.

To bridge the gap from the 'sir-stir' paradigm to typical ASR studies, current listening experiments extend Watkins' method to use a wider variety of talkers, test and context words. Crucially, these experiments have replicated compensation for reverberation with naturalistic speech. Combining psychoacoustic experiment design with the modelling approach, participants' test results provide data that is useful for tuning parameters of the computational model described above. It is hoped that such work may similarly benefit the eventual development of a 'constancy' front-end for reverberant ASR.

Slides (pdf)

22 October 2010

Tim Jurgens

Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany

Microscopic modelling of speech recognition for normal-hearing and hearing-impaired listeners

In this talk a "microscopic" model of human speech recognition is presented. It is microscopic in the sense that, first, the recognition of single phonemes rather than whole sentences is modelled and, second, the particular spectro-temporal structure of speech is processed in a way that is presumably very similar to the processing that takes place in the human auditory system. This contrasts with macroscopic models of human speech recognition (such as the Speech Intelligibility Index, SII), which usually use the spectral structure only. The microscopic model is evaluated using phoneme recognition experiments with normal-hearing listeners in noise and sensorineural hearing-impaired listeners in quiet. Parameters that describe the hearing deficits of the hearing-impaired listeners beyond the pure-tone audiogram are assessed using supra-threshold measurement techniques. Furthermore, the talk systematically investigates how different forms of sensorineural hearing deficit (primarily realized as an adjustment of cochlear compression) affect the results of the speech recognition model.

15 September 2010

John Culling

School of Psychology, Cardiff University

Mapping speech intelligibility in noisy rooms

We have developed a method for predicting the benefit to speech intelligibility when speech and interfering noise come from different directions. The predictions have been validated against our own data as well as a wide range of studies from the literature for both anechoic and reverberant conditions. The method is based upon binaural room impulse responses, which can either be measured from a real room using an acoustic manikin or predicted from an acoustic model of the room, combined with head-related impulse responses. In the latter case one can map out the room, determining the relative intelligibility at each point in space as well as the potential benefit of orienting the head. The technique could be useful to architects when planning rooms for which speech communication is critical. It can also be used to compare the patterns of performance for one- and two-eared listeners, such as unilateral and bilateral cochlear implantees. These analyses indicate that the benefit of bilateral implantation is much greater than has been reported previously.

Slides (ppt)

14 July 2010

Charles Fox

Adaptive Behaviour Research Group, University of Sheffield

Mean-field and Monte Carlo Musical Scene Analysis

Laptops have become a common sight on the popular musical stage but are still usually played by humans or set to run a predetermined program of notes at fixed tempo. The latter is highly restrictive for musicians, forcing them to play along with the machine rather than improvise with tempo and structure. I will present two steps towards more flexible machine musicianship: perception of rhythmic and grammatical structures using Bayesian networks. Rhythm perception can be performed using a model-base of known rhythms, which are deformed (Cootes) and tested against preprocessed percussive events such as bass, snare and hi-hat drum hits. Rhythm deformations have many local minima, so annealed variational message passing is used, along with blackboard-system priming and variational posteriors for model selection. Musical structure perception can be performed using stochastic context-sensitive grammars -- sensitive to global factors such as musical key -- and an extended hierarchical Bayesian blackboard system. Such a blackboard system draws on classical 1980s AI work such as Hofstadter's Copycat but updates it to modern machine learning by constructing Bayesian network structures on the blackboard and integrating annealed Gibbs sampling with the priming and pruning mechanisms. Finally I will briefly discuss how my Bayesian blackboard research is being continued at Sheffield in a new domain, for tactile localisation and mapping (SLAM) on a whiskered mobile robot.

Slides (pdf)

11 June 2010

Matt Gibson

Machine Intelligence Laboratory, Department of Engineering, University of Cambridge

Unsupervised adaptation of HMM-based synthesis models

Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to first estimate the transcription of the adaptation data. By defining a mapping between HMM-based synthesis models and ASR-style models, this talk presents an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for supplementary acoustic models. Further, it is shown how this mapping enables a simplified unsupervised adaptation method for HMM-based speech synthesis models.

Slides (ppt)

21 April 2010

Sarah Creer

Department of Computer Science and School of Health and Related Research, University of Sheffield

Building personalised synthetic voices for individuals with severe speech impairment

Many individuals with severe speech impairment, due to conditions such as Parkinson's disease or motor neurone disease, use voice output communication aids (VOCAs), which have synthesised or pre-recorded voice output.

The limited choice of output voice that current VOCAs offer cannot accurately represent the personal identity of the person who is using the communication aid. Building a personalised voice for a VOCA user provides a way of portraying the individual's personal identity that is taken from them once they cannot use their own voice. Currently available techniques for personalising speech synthesis require a large amount of data input, which is difficult to produce for individuals with severe speech impairment. These techniques also do not provide a solution for those individuals whose voices have begun to show the effects of dysarthria.

This work reports that Hidden Markov Model (HMM)-based speech synthesis is a promising approach for 'voice banking', where individuals can store recordings of their voice from which to build a speech synthesiser using approximately 6-7 minutes of speech. Further, this method can be extended to reconstruct voices of individuals whose speech has already begun to deteriorate, showing similarity to target speakers without recreating the impairment in the synthesised speech output.

Slides (ppt)

31 March 2010

Patti Adank

Neuroscience and Aphasia Research Unit, School of Psychological Sciences, University of Manchester

The role of vocal imitation in speech comprehension

Imitation of speech helps infants to learn their native language and even in adulthood, speakers spontaneously imitate each other's speech patterns. Furthermore, there is increasing support for the notion that adult speakers continuously generate speech motor programs during speech comprehension. This motor involvement may allow people to more efficiently process noisy or ambiguous speech signals. Here, we tested the hypothesis that vocal imitation of spoken sentences results in improved perception of noisy or ambiguous sentences. This was established using sentences produced in an unfamiliar accent. Six groups of listeners were tested on their comprehension performance using an auditory staircase procedure while listening to sentences in the unfamiliar accent. One group received no training, the second listened to a series of accented sentences, the third listened to the accented sentences and repeated them in their own accent, the fourth transcribed the accented sentences as they heard them, the fifth listened to the accented sentences and imitated them, and the sixth group listened to the accented sentences and imitated them while not being able to hear their own vocalisations. After training, listeners were once more tested on their comprehension performance with the staircase procedure. The results will be discussed with respect to possible roles for imitation in action perception.

Slides (ppt)

24 March 2010

Jindong Liu

Centre for Hybrid Intelligent Systems, Department of Computing, Engineering & Technology, University of Sunderland

Effect of Inhibition in a Computational Model of the Inferior Colliculus on Sound Localisation

A spiking neural network is proposed to model the human auditory pathway, including the medial superior olive (MSO), the lateral superior olive (LSO) and the inferior colliculus (IC), with regard to azimuth sound localisation. The system reflects the biological fact that the IC is the central hub where two localisation cues are combined: an interaural time difference (ITD) cue from the MSO and an interaural level difference (ILD) cue from the LSO. The neurons in the IC, MSO and LSO are tonotopically arranged to reflect the frequency organisation of the auditory pathway. Based on biological evidence, inhibition from the ipsilateral LSO as well as excitation from the ipsilateral MSO and contralateral LSO are modelled as the inputs to the IC. Inspired by the presence of GABAergic inhibitory neurons in the IC, an assumption is made that IC neurons are internally inhibited in order to produce a clear and sharp spatial representation of sound in the IC. A number of experiments are carried out to validate the proposed system, and four types of IC system are implemented to compare their performance. The experiments show that our system is more robust than traditional methods to variation in sound type and sound source distance, in terms of maintaining a sharp spatial representation of sound in the IC.
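
For readers unfamiliar with the two localisation cues named above, the following is a minimal sketch of how an ITD and an ILD can be read off a pair of ear signals. It is a conventional signal-processing illustration (cross-correlation peak for the ITD, energy ratio in decibels for the ILD), not the spiking neural network described in the talk. The function name `estimate_itd_ild`, its parameters, the sign conventions and the toy signal are all assumptions made for this example.

```python
import math


def estimate_itd_ild(left, right, sample_rate, max_lag):
    """Return (itd_seconds, ild_db) for two equal-length ear signals.

    Assumed conventions for this sketch:
      itd > 0 means the right channel lags the left (sound reached the left ear first);
      ild > 0 means the left channel carries more energy.
    Both channels are assumed to have non-zero energy.
    """
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    # Search the lag (in samples) that maximises the cross-correlation.
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    itd = best_lag / sample_rate

    # ILD as the ratio of channel energies, expressed in decibels.
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    ild = 10.0 * math.log10(energy_left / energy_right)
    return itd, ild


# Toy input: a 500 Hz tone burst at 16 kHz, zero-padded on both sides.
burst = [math.sin(2 * math.pi * 500 * i / 16000) for i in range(64)]
left = [0.0] * 16 + burst + [0.0] * 16
# The right ear hears the same burst 5 samples later and at half the amplitude.
right = [0.0] * 5 + [0.5 * x for x in left[:-5]]

itd, ild = estimate_itd_ild(left, right, sample_rate=16000, max_lag=10)
print(f"ITD = {itd * 1e6:.1f} microseconds, ILD = {ild:.2f} dB")
```

In this toy case the ITD reflects the five-sample delay and the ILD reflects the halved amplitude. A biologically motivated model such as the one in the talk would compute comparable quantities per frequency channel rather than over the broadband signal.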

Slides (ppt)

17 February 2010

Simon Makin, Anthony Watkins and Andrew Raimond

School of Psychology and Clinical Language Sciences, University of Reading

Spectral- and temporal-envelope room-acoustic cues in attentional tracking

Listeners can attend to a desired talker in the presence of other simultaneous talkers. A difference in spatial position between sources is known to aid this 'tracking', and previous research has shown the effectiveness of cues arising from a difference in bearing between the 'target' and interfering signals (Bronkhorst and Plomp, 1988). However, there is reason to question the utility of such cues in real-room reverberation (Kidd et al., 2005).

It has also been shown that listeners can track talkers using differences in their voice characteristics (Darwin and Hukin, 2000). Here, we ask how well cues arising from location (including bearing and distance differences) compete with differences between voice characteristics when listeners are tracking messages in real-room reverberation.

Real-room measurements of Binaural Room Impulse Responses (BRIRs) were used to simulate different locations of talkers in a room. Listeners decided which of two simultaneous target words belonged in an attended 'context' phrase when it was played simultaneously with a different 'distracter' context. Talker difference was in competition with location difference, so listeners' responses indicate which cue-type they were tracking. Further experiments used processed BRIRs to eliminate certain temporal cues.

Location difference was found to override talker difference in dichotic conditions. Results in diotic conditions indicate that, while there is some binaural advantage, this is primarily due to interaural level differences (ILDs), with interaural timing differences (ITDs) playing a relatively minor role, and that location can compete with talker differences in diotic conditions, especially if there is a distance separation. Effects of distance seem due to temporal-envelope differences.
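
As an illustration of how BRIRs are commonly used to simulate a talker at a given location, the sketch below convolves a dry (anechoic) signal with a left-ear and a right-ear impulse response to obtain the two ear signals. This shows the general technique only; the helper names `convolve` and `spatialise` and the tiny impulse responses are assumptions for the example, not the measurements or processing used in the study.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[i + k] += s * h
    return out


def spatialise(dry, brir_left, brir_right):
    """Place a dry signal at the location at which the BRIR pair was measured."""
    return convolve(dry, brir_left), convolve(dry, brir_right)


# Hypothetical, very short impulse responses: a direct path plus one reflection
# at the left ear, and a delayed, attenuated direct path at the right ear.
dry = [1.0, 0.5]
brir_left = [1.0, 0.0, 0.25]
brir_right = [0.0, 0.5]

ear_left, ear_right = spatialise(dry, brir_left, brir_right)
```

Measured BRIRs are typically thousands of samples long, so FFT-based convolution would normally replace the direct loop shown here.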

Slides (pdf)