SpandH Seminar Abstracts 2009

2 December 2009

Guillaume Aimetti

Department of Computer Science, University of Sheffield

A computational model of early language acquisition: Towards a general statistical learning mechanism

Conventional automatic speech recognition (ASR) systems can achieve near-perfect recognition; however, when training and test conditions are mismatched, accuracy deteriorates significantly and falls far short of human speech processing abilities (Holmes, 2002). The ACORNS project (Acquisition of COmmunication and Recognition Skills) aims to advance the state of the art in ASR by modelling early language acquisition, focusing in particular on pre-verbal infant word learning. Current cognitive theories take an empiricist stance, suggesting that young infants use general statistical mechanisms that exploit the regularities in their environment to acquire language skills.

In this talk I will present a computational model, consistent with the empiricist view, which successfully discovers word-like units from cross-modal (acoustic and pseudo-visual) input and builds continuously evolving internal representations through cross-situational association. In its current state, the algorithm is a novel approach to segmenting speech directly from the acoustic signal in an unsupervised manner, which frees it from dependence on a predefined lexicon. I will also briefly introduce current work on an ecologically valid method for automatically discovering an optimal set of minimal contrasting acoustic units.
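The abstract gives no implementation details, so the following is only a toy illustration of cross-situational association, not the speaker's model. It assumes that word-like units have already been segmented into discrete tokens (the model described above works directly on acoustic input) and that each scene comes with a set of candidate referents; the function names and example scenes are hypothetical. Counting co-occurrences across scenes lets a consistently paired referent win out over incidental ones.

```python
from collections import defaultdict

def cross_situational_learn(scenes):
    """Accumulate word/referent co-occurrence counts across scenes.

    Each scene is a (words, referents) pair of sets. Illustrative
    assumption: words are already discrete tokens rather than audio.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for words, referents in scenes:
        for w in words:
            for r in referents:
                counts[w][r] += 1
    return counts

def best_referent(counts, word):
    """Return the referent that co-occurred most often with `word`, or None."""
    candidates = counts.get(word, {})
    return max(candidates, key=candidates.get) if candidates else None

# Hypothetical scenes: each word is ambiguous within a single scene,
# but consistent pairings emerge across scenes.
scenes = [
    ({"look", "ball"}, {"BALL", "DOG"}),
    ({"nice", "ball"}, {"BALL", "CUP"}),
    ({"the", "dog"}, {"DOG", "CUP"}),
    ({"big", "dog"}, {"DOG", "BALL"}),
]
counts = cross_situational_learn(scenes)
print(best_referent(counts, "ball"))  # BALL
print(best_referent(counts, "dog"))   # DOG
```

No single scene disambiguates "ball" or "dog"; only the accumulated counts do, which is the core intuition behind cross-situational learning.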

Slides (pdf), Audio files (zip), Listen Again (mp3)

18 November 2009

Christopher Peters

Computer Games Technology, Department of Computing & the Digital Environment, Coventry University, UK

Synthetic Characters: Behaviour Modelling, Perception and Interaction

The study of synthetic characters, often represented on screen as embodied three-dimensional humanoid forms, presents many challenges and opportunities, not only for creating plausible character animations but also for further investigating and illuminating the human condition. From concepts such as shared attention to crowd perception, this work has implications for domains as diverse as computer graphics, robotics and cognitive science.

This presentation describes work undertaken by the speaker in a number of national and European projects, focusing on the use of synthetic characters to investigate aspects of human-computer interaction, as well as the implications for researchers working in the domains mentioned above.

Listen Again (mp3)

11 November 2009

Matthew Robertson

Speech and Hearing Group, Department of Computer Science, University of Sheffield, UK

Modelling the performance of hearing impaired listeners using an auditory model and speech in noise test

Hearing impairment affects almost nine million people in the UK. Although hearing aids have been available for some time, many go unused. People cite many reasons for not using them, but a common one is a lack of benefit in fluctuating noise environments, such as bars or restaurants. This project aims to use a model of the human auditory periphery as a front end for an automatic speech recognition (ASR) system, which would allow us to model the performance of human listeners on a speech-in-noise task. The auditory model could then be "broken" to match a specific hearing-impaired listener, and used to tune hearing aids to better suit that individual's impairment.

One reason why hearing-impaired listeners may report problems listening to speech in modulated noise is that they are less able to take advantage of dips in the background noise. Normal-hearing listeners can use these dips to improve speech recognition against fluctuating noise backgrounds. This lack of "glimpsing" may be explained by a reduced ability to use temporal fine structure (TFS) information. TFS refers to the rapid oscillations of a signal's waveform within each frequency band, as distinct from the slower fluctuations of its envelope. TFS information has been shown to be of benefit when listening in fluctuating noise environments. We are currently using TFS information to generate missing data masks for use with the ASR system. We will report studies that use these masks for recognition, as well as attempts to model the effect of hearing impairment on the availability of glimpses.
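The abstract derives its masks from TFS information, and that method is not described here. For orientation only, the sketch below builds the more conventional local-SNR mask used in much missing-data ASR work, which assumes separate access to speech and noise energies in each time-frequency cell. The 0 dB threshold, the function names and the toy energies are illustrative assumptions. The proportion of reliable cells gives a crude measure of how many glimpses are available.

```python
import math

def snr_mask(speech_energy, noise_energy, threshold_db=0.0, eps=1e-12):
    """Binary missing-data mask from per-cell speech and noise energies.

    Both inputs are lists of frames, each frame a list of per-channel
    energies with matching shapes. A cell is marked reliable (1) when its
    local SNR exceeds threshold_db, and unreliable (0) otherwise.
    """
    mask = []
    for s_frame, n_frame in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_frame, n_frame):
            # eps avoids division by zero and log10(0) for silent cells
            snr_db = 10.0 * math.log10((s + eps) / (n + eps))
            row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask

def glimpse_proportion(mask):
    """Fraction of time-frequency cells marked reliable."""
    cells = [c for row in mask for c in row]
    return sum(cells) / len(cells) if cells else 0.0

# Toy energies: 2 frames x 2 channels (hypothetical values).
speech = [[4.0, 0.5], [1.0, 9.0]]
noise = [[1.0, 1.0], [2.0, 1.0]]
mask = snr_mask(speech, noise)
print(mask)                      # [[1, 0], [0, 1]]
print(glimpse_proportion(mask))  # 0.5
```

In a recognizer, cells marked 0 would be treated as missing (marginalized or bounded) rather than matched against the acoustic models.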

Slides (ppt), Listen Again (mp3)

5 November 2009

Francisco Lacerda

Phonetics Lab, Department of Linguistics, Stockholm University

An ecological model of early language acquisition

The early stages of the infant's language acquisition process will be discussed in terms of a model of lexical emergence based on the acoustic characteristics of infant-directed speech and the ecological settings of typical adult-infant interaction. The model's initial assumption is that the potential amount of sensory information available to the infant in its ecological setting is so vast that, given an underlying uniform random distribution, the probability of a given pattern being repeated by chance is vanishingly small. The model proposes that the infant's linguistic referential function initially emerges from a very limited number of repetitions of an acoustic pattern co-occurring with other correlated sensory information. Further lexical and grammatical categories can, at least to some extent, be derived by recursive application of essentially the same principle.
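To make the "vanishingly small" argument concrete, the sketch below works through a hypothetical setting in which each observation is drawn uniformly from N equally likely patterns. It computes the binomial probability that one specific pattern turns up at least k times in n observations, P = sum over i from k to n of C(n, i) p^i (1 - p)^(n - i) with p = 1/N. The numbers are invented for illustration and are not taken from the model; the sum is evaluated in log space so that large n does not overflow.

```python
import math

def prob_at_least_k(n_patterns, n_obs, k):
    """P(a specific pattern occurs >= k times in n_obs independent draws).

    Assumes (illustratively) that each draw is uniform over n_patterns
    distinguishable patterns, so the count is Binomial(n_obs, 1/n_patterns).
    Terms are summed in log space to avoid overflow for large n_obs.
    """
    if n_patterns < 2:
        raise ValueError("need at least two patterns")
    p = 1.0 / n_patterns
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n_obs + 1):
        log_term = (math.lgamma(n_obs + 1) - math.lgamma(i + 1)
                    - math.lgamma(n_obs - i + 1)
                    + i * log_p + (n_obs - i) * log_q)
        total += math.exp(log_term)
    return total

# Hypothetical numbers: a million patterns, a thousand observations.
print(prob_at_least_k(10**6, 1000, 3))  # about 1.66e-10
```

Under these assumed numbers, even three chance recurrences are extremely unlikely, which is why a handful of correlated repetitions can be informative.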

Slides (pdf)

21 October 2009

Steven Greenberg

Silicon Speech, Santa Venetia, CA, USA

Time Perspective in Spoken Language Processing

This presentation discusses how the "hidden" dimension of time may help the brain decode spoken language. A perceptual study using time-compressed speech and silent gaps illustrates how the concept of "decoding time" offers potential insight into the neural mechanisms involved in going from sound to meaning. Hierarchical Oscillatory Networks could be used to model the extremely complex process of speech decoding (and, ultimately, comprehension).
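The abstract does not give the stimulus parameters, so the following is only a hypothetical sketch of how such stimuli can be assembled. It assumes the speech has already been time-compressed (pitch-preserving compression itself is outside the sketch) and then splits the samples into fixed-duration chunks separated by silence. The chunk and gap durations, the sample rate and the function name are assumptions for illustration.

```python
def insert_silent_gaps(samples, sample_rate, chunk_ms, gap_ms):
    """Split a (pre-compressed) signal into chunks separated by silence.

    samples: list of float samples; sample_rate in Hz.
    chunk_ms, gap_ms: chunk and silent-gap durations in milliseconds.
    """
    chunk_len = int(round(sample_rate * chunk_ms / 1000))
    gap_len = int(round(sample_rate * gap_ms / 1000))
    if chunk_len < 1:
        raise ValueError("chunk must contain at least one sample")
    out = []
    for start in range(0, len(samples), chunk_len):
        if start > 0:
            out.extend([0.0] * gap_len)  # silent interval between chunks
        out.extend(samples[start:start + chunk_len])
    return out

# Hypothetical values: 40 ms chunks separated by 80 ms of silence at 16 kHz.
compressed = [0.1] * 16000  # stand-in for one second of compressed speech
stimulus = insert_silent_gaps(compressed, 16000, chunk_ms=40, gap_ms=80)
print(len(stimulus))
```

Varying the chunk and gap durations independently is what lets such a paradigm probe how much time the listener needs to decode each stretch of speech.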

Slides (ppt), Sounds and Figures (zip)

21 October 2009

Maurizio Filippone

Department of Computer Science, University of Sheffield, UK

The Probabilistic Approach in Data Modeling

In this short tutorial, we will review some basic concepts of probabilistic methods for data modeling. Data modeling aims to build a description of some observable quantities of a system, with the final goal of gaining some understanding of the system under study and/or predicting future observations. Probabilistic methods provide a parametric description of the system, in the sense that the observations and all (or some) of the parameters are treated as random variables. They are therefore assigned probability distributions and can be analyzed with statistical tools. Such a framework provides a principled way of dealing with many crucial data modeling tasks, such as model selection, predictions with confidence intervals and statistical testing. We will briefly discuss the advantages of this approach, along with its computational complexity, which is one of the main issues when dealing with very complex systems. Finally, we will show a simple application of a probabilistic method to a speech analysis problem.
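As a small illustration of treating a parameter as a random variable, the sketch below uses a generic textbook example, not the speech application shown in the tutorial. It places a Gaussian prior on the unknown mean of Gaussian observations with known noise variance. The posterior is then Gaussian in closed form, which yields both an interval for the parameter and a predictive interval for a new observation. The prior settings and data are illustrative assumptions.

```python
import math

def gaussian_posterior(observations, noise_var, prior_mean=0.0, prior_var=1.0):
    """Posterior over the mean mu of Gaussian data with known noise variance.

    Assumed model:
        mu       ~ N(prior_mean, prior_var)
        x_i | mu ~ N(mu, noise_var), independently
    Returns (posterior mean, posterior variance) of mu.
    """
    n = len(observations)
    precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + sum(observations) / noise_var)
    return post_mean, post_var

def interval(mean, var, z=1.96):
    """Central ~95% interval of a Gaussian with the given mean and variance."""
    half = z * math.sqrt(var)
    return mean - half, mean + half

data = [1.0, 2.0, 3.0]  # hypothetical observations
mu_mean, mu_var = gaussian_posterior(data, noise_var=1.0)
print(mu_mean, mu_var)                    # 1.5 0.25
print(interval(mu_mean, mu_var))          # interval for the parameter mu
print(interval(mu_mean, mu_var + 1.0))    # predictive interval for a new x
```

The predictive interval is wider than the parameter interval because it adds the observation noise to the remaining uncertainty about mu.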

Slides (pdf), Listen Again (mp3)

14 October 2009

Jim Hieronymus

NASA Ames Research Center, Mountain View, CA

Exploiting Chinese Character Models to Improve Speech Recognition Performance

Written Chinese is based on characters, which are syllabic in nature. Since languages have syllabotactic rules that govern the construction of syllables and their allowed sequences, Chinese character sequence models can be used as a first-level approximation of allowed syllable sequences. N-gram character sequence models were trained on 4.3 billion characters. Characters are used as a first-level recognition unit, with multiple pronunciations per character. For comparison, the CUHTK Mandarin word-based system was used to recognize words, which were then converted to character sequences. For 1-best recognition, the character-only system's error rates were slightly worse than those of word-based character recognition. However, combining the two systems using log-linear combination gives better results than either system separately. An equally weighted combination gave consistent absolute character error rate (CER) reductions of 0.1-0.2% over the standard word-based system.
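A minimal sketch of an equally weighted log-linear combination, assuming each system supplies a log-probability score for every hypothesis in a shared candidate list (for example, an N-best list with word hypotheses converted to character sequences). The talk may well combine scores at a different level, such as lattices; the hypothesis labels and scores below are invented for illustration.

```python
def log_linear_combine(scores_a, scores_b, weight_a=0.5, weight_b=0.5):
    """Pick the hypothesis maximizing a weighted sum of two systems' log scores.

    scores_a, scores_b: dicts mapping hypothesis -> log-probability.
    Only hypotheses scored by both systems are considered.
    """
    common = scores_a.keys() & scores_b.keys()
    if not common:
        raise ValueError("no hypotheses in common")
    return max(common, key=lambda h: weight_a * scores_a[h] + weight_b * scores_b[h])

# Hypothetical log scores for three candidate character sequences.
char_system = {"h1": -9.5, "h2": -9.0, "h3": -12.0}
word_system = {"h1": -8.5, "h2": -11.0, "h3": -8.0}

print(max(char_system, key=char_system.get))           # h2
print(max(word_system, key=word_system.get))           # h3
print(log_linear_combine(char_system, word_system))    # h1
```

In this toy case each system alone prefers a different hypothesis, while the combination picks one that both rate reasonably well; the weights could also be tuned on held-out data rather than fixed at 0.5.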

14 October 2009

Jim Hieronymus

NASA Ames Research Center, Mountain View, CA

Spoken Dialogue Systems for Space and Lunar Exploration

Building spoken dialogue systems for space applications requires systems that are flexible, portable to new applications, robust to noise and able to discriminate between speech intended for the system and conversations with other astronauts and systems. Our systems achieve flexibility by using general typed unification grammars as language models, which can be specialized using example data. These grammars are designed so that most sensible ways of expressing a request are correctly recognized semantically. The language models are tuned with extensive user feedback and, where available, data. The International Space Station and the EVA suits are noisy (76 and 70 dB SPL respectively). This noise is best minimized by using active noise-canceling microphones, which permit accurate speech recognition. Finally, open-microphone speech recognition is important for hands-free, always-available operation. The EVITA system has been tested in prototype lunar space suits in the Arizona desert. Using an active noise-canceling head-mounted microphone in a pressurized suit, the lowest word error rate was 3.6% for a 2,500-word vocabulary. A short clip of the surface suit spoken dialogue system being used in a field test will be shown.
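For reference, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognizer output and a reference transcript, divided by the number of reference words. A minimal sketch follows; the example utterances are invented. Applied to sequences of characters instead of words, the same computation gives a character error rate.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical utterance: one substitution ("the" -> "a") and one deletion ("now").
print(word_error_rate("open the hatch now", "open a hatch"))  # 0.5
```

Because insertions are counted, WER can exceed 100% when the recognizer produces many spurious words.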

8 April 2009

Jaydip Ray

Consultant ENT Surgeon, Sheffield Teaching Hospitals & Sheffield Children's Hospital, Sheffield, UK

Modern otology and its links with the scientific community

Otology is a rapidly advancing and expanding super-speciality within ENT surgery that screens and treats patients of all age groups, from newborns to the elderly. A significant proportion of cases are medical conditions, such as hearing and balance disorders or allergies, whilst the remainder are purely surgical. The speciality has embraced modern technology (fibreoptics, lasers, electronics, computer science, speech recognition, speech processing, radioisotopes, imaging) as well as other biosciences such as genetics and immunology in order to advance. It also uses various implanted devices to rehabilitate both structure and function, for cosmetic as well as functional purposes.

One of the best examples of this harmonious collaboration between medicine, surgery and computer science is the ongoing development of implanted hearing devices such as cochlear implants, as well as image-guided surgery.

This talk aims to provide a broad overview of the work we do and to explore areas of common interest.

29 January 2009

Mark Wibrow

University of Sheffield, UK

Acoustic Cues for Sarcasm? How Interesting

Just under 10% of human utterances are ironic. Irony is a type of figurative language that may pose problems for machine dialog systems, because the intended meaning of an ironic utterance is not the same as its literal meaning.

This talk will focus on a particular type of irony: sarcasm. English speakers have an intuitive idea of a sarcastic tone of voice. Yet analyses of sarcasm from pragmatics, social psychology and neuropsychology, whilst acknowledging the use of this tone in speech, focus on other cues for recognising conflicting intended and literal meanings, which are integrated under the influence of a "theory of mind" (the ability to infer the attitudes and beliefs of others).

Some recent (and not so recent) research on English sarcasm will be discussed in order to establish whether humans can and do use the sarcastic tone of voice to supplement the identification of sarcastic utterances, and whether this tone has systematic acoustic properties which could be exploited in machine dialog systems, in both spoken language comprehension and production.

Slides (pdf)

29 January 2009

Sharif Alghowail

University of Sheffield, UK

Keyboard Acoustic Emanations

Slides (ppt)

29 January 2009

Yi Hu

University of Sheffield, UK

The Techniques in Multiple-speaker Localisation