SpandH seminar abstracts 2007

31 October 2007

Ian Howard

Biological & Machine Learning Lab, Department of Engineering, University of Cambridge, UK

A Computational Model of Infant Speech Development

Almost all theories of child speech development assume that an infant learns speech sounds by direct imitation, performing an acoustic matching of adult output to his own speech. Some theories also postulate an innate link between perception and production. We present a computer model which has no requirement for acoustic matching on the part of the infant and which treats speech production and perception as separate processes with no innate link. Instead we propose that the infant initially explores his speech apparatus and reinforces his own actions on the basis of sensory salience, developing vocal motor schemes [1]. As the infant's production develops, he will start to generate utterances which are sufficiently speech-like to provoke a linguistic response from his mother. Such interactions are particularly important, because she is better qualified than he is to judge the quality of his speech. Her response to his vocal output is beneficial in a number of ways. Because she is an experienced speaker, her perceptual system can effectively evaluate the infant's output within the phonological system of the ambient language L1. Simply generating a salient response will tend to encourage the infant's production of a given utterance. More significantly, during imitative exchanges in which the mother reformulates the infant's speech, the infant can learn equivalence relations between his motor activity and his mother's acoustic output using simple associative mechanisms, and thus can solve the correspondence problem. Notice that the infant does not learn equivalence relations between his own acoustic output and that of his mother based on acoustic similarity. Any similarity-based matching need only be performed by his mother.

[1] McCune, L., Vihman M.M. 1987. Vocal Motor Schemes. Papers and Reports in Child Language Development, Stanford University Department of Linguistics 26, 72-79.

Speech Communication paper (draft)
PPT slides

17 October 2007

Peter van Hengel

Sound Intelligence, The Netherlands

Verbal Aggression Detection in Complex Acoustical Environments

A major problem in the developing market for CCTV systems is the monitoring of all the camera images. Recent studies reveal that some 70% of incidents in view of surveillance cameras are not noticed by the camera operators. This is not surprising given the number of cameras to be monitored simultaneously. Comparison with a human observer makes it clear that adding sound as a trigger to camera surveillance should significantly improve efficiency and detection rate, while at the same time decreasing the impact on privacy. A verbal aggression detection system has been developed by Sound Intelligence in cooperation with the University of Groningen's Department of Artificial Intelligence. This system has an attentional mechanism, based on human neural processes, which identifies 'interesting' sounds. These sounds are then analysed for cues of aggression. If sufficient evidence for verbal aggression has been found, a signal is given to the camera operator. A six-month study in Groningen's city centre shows a detection rate of 100% with a false alarm rate of 0.15%. The system is currently being introduced into the English market.

13 September 2007

Jort Gemmeke

Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands

On the relation between statistical properties of spectrographic masks and their ability to reduce acoustic mismatch

The application of missing feature techniques (MFT) can significantly improve the noise robustness of automatic speech recognition (ASR). When using MFT for decoding, adequately estimated spectrographic masks are crucial. For the development of better mask estimation techniques it would be extremely valuable to have an analysis technique that can indicate which properties of a mask need to be changed to serve its main purpose: reducing the acoustic mismatch between a noisy feature vector and its corresponding acoustic model. As a first step toward such an analysis technique we present a framework which enables us to compare two mask creation techniques. We show that, using a suitable distance function, we can relate the statistical properties of mask types to their effectiveness in reducing the acoustic mismatch at frame level.
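The abstract does not specify the distance function used; as a minimal illustrative sketch (not the author's actual framework), one plausible frame-level mismatch measure compares the noisy features to a clean acoustic model mean over only those spectral components a binary mask labels as reliable. The function names and the squared-error form are assumptions for illustration.

```python
import numpy as np

def masked_mismatch(noisy_frame, clean_model_mean, mask):
    """Frame-level acoustic mismatch restricted to the spectral
    components a binary mask labels as reliable (speech-dominated).

    noisy_frame, clean_model_mean : 1-D arrays of log-spectral features
    mask : 1-D boolean array, True where the component is reliable
    """
    reliable = np.asarray(mask, dtype=bool)
    if not reliable.any():
        return float('inf')  # no reliable evidence in this frame
    diff = noisy_frame[reliable] - clean_model_mean[reliable]
    return float(np.mean(diff ** 2))  # mean squared mismatch

def compare_masks(noisy, clean_mean, mask_a, mask_b):
    """Average the mismatch over all frames for two candidate masks,
    so two mask estimation techniques can be compared directly."""
    d_a = np.mean([masked_mismatch(f, clean_mean, m)
                   for f, m in zip(noisy, mask_a)])
    d_b = np.mean([masked_mismatch(f, clean_mean, m)
                   for f, m in zip(noisy, mask_b)])
    return d_a, d_b
```

Under such a measure, an oracle mask that excludes noise-dominated channels yields a lower mismatch than an all-reliable mask, which is the kind of statistical relationship the talk examines.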

PPT slides

6 September 2007

Maria Wolters

CSTR, University of Edinburgh, UK

Adapting Dialogue Systems to Older People

In this talk, I present an overview of a Wizard-of-Oz experiment that explored two guidelines for designing spoken dialogue systems for older people:

  1. To minimise cognitive load, use few options
  2. Confirm selections to support memory

Cognitive abilities change as people age. For some people, this change occurs more rapidly than for others. The areas affected most are speed of processing, fluid intelligence, and working memory, while crystallised intelligence is well-preserved. In order to make spoken dialogue systems more usable for older people, one might assume that they need to accommodate this cognitive ageing.

Our experiments show that for straightforward tasks such as appointment scheduling that follow well-known schemata, accommodation may not be necessary. We found no effects of dialogue strategy on user performance and user satisfaction, although individual users expressed strong, divergent preferences. There was a considerable efficiency effect, though.

Future work includes exploring different domains, such as flight booking, working with different user groups, such as older people with mild memory impairment, and investigating differences in interaction style between older and younger users on the corpus collected in this experiment.

PDF slides

3 September 2007

Takatoshi Okuno

Audiological R&D lab, Rion Ltd, Tokyo, Japan

Development of Frequency Selectivity Map (FSMap) depiction system for hearing impairment

Background: Audiological assessment of hearing impairment and the fitting of hearing aids require accurate measurement of an auditory profile, which may comprise loudness recruitment, reduced frequency selectivity, reduced temporal resolution, etc. In other words, it is difficult to evaluate the complicated auditory characteristics of sensorineural hearing loss using only an audiogram. A system that can measure such an auditory profile has long been awaited in clinical practice and hearing aid fitting.

Method: It is known that frequency selectivity is related to the bandwidth of the auditory filters. In order to investigate reduced frequency selectivity, a system has been developed that enables an individual auditory filter of a hearing-impaired listener to be measured within 3 minutes. The measured auditory filters are used to draw a Frequency Selectivity Map (FSMap) with colour gradation: auditory filter data for several frequencies and sensation levels are combined into a single map. As a possible auditory profile, the system calculates the ratio between the Equivalent Rectangular Bandwidth (ERB) of an individual hearing-impaired listener and the average ERB of normal-hearing listeners. The auditory filters of 31 hearing-impaired listeners were measured using the proposed system.
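The core quantity behind the FSMap, the ratio of an impaired listener's ERB to the normal-hearing average, can be sketched as below. The abstract does not give the normative data used, so as a stand-in assumption the sketch uses the well-known Glasberg and Moore ERB formula for normal hearing; the function names are illustrative only.

```python
def erb_normal(f_hz):
    """Approximate ERB (Hz) of a normal-hearing auditory filter at
    centre frequency f_hz, per the Glasberg & Moore formula
    ERB = 24.7 (4.37 F + 1), F in kHz. Used here as a stand-in for
    the clinical normative data, which the abstract does not specify."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def fsmap_ratios(measured_erbs, freqs_hz):
    """Ratio of the listener's measured ERB to the normal-hearing ERB
    at each probe frequency; values > 1 indicate broader filters,
    i.e. reduced frequency selectivity. These ratios, over a grid of
    frequencies and sensation levels, are what the FSMap colours."""
    return {f: measured_erbs[f] / erb_normal(f) for f in freqs_hz}
```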

Results: The frequency selectivity of listeners with sensorineural hearing loss was reduced compared with that of normal-hearing listeners. The obtained FSMaps revealed different degrees of frequency selectivity among individuals with similar audiogram contours.

Conclusion: Reduced frequency selectivity, i.e. how much poorer a listener's frequency selectivity is than that of normal-hearing listeners, is indicated quantitatively and intuitively by the FSMap. The results of this study suggest that the FSMap has the potential to become a new practical auditory profile.

22 August 2007

Jonathan Laidler

Department of Computer Science, University of Sheffield, UK

Model-driven detection of clean speech patches in noise

Listeners may be able to recognise speech in adverse conditions by "glimpsing" time-frequency regions where the target speech is dominant. Previous computational attempts to identify such regions have been source-driven, using primitive cues. This talk describes a model-driven approach in which the likelihood of spectro-temporal patches of a noisy mixture representing speech is given by a generative model. The focus is on patch size and patch modelling. Small patches lead to a lack of discrimination, while large patches are more likely to contain contributions from other sources. A "cleanness" measure reveals that a good patch size is one which extends over a quarter of the speech frequency range and lasts for 40 ms. Gaussian mixture models are used to represent patches. A compact representation based on a 2D discrete cosine transform leads to reasonable speech/background discrimination.
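The compact patch representation mentioned above can be sketched as follows: a spectro-temporal patch is projected onto a 2-D DCT basis and only the low-order coefficients, which capture the patch's coarse shape, are kept as features for the generative (Gaussian mixture) model. This is an illustrative reconstruction, not the author's code; the function names and the choice of 3×3 coefficients are assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * x + 1) / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] /= np.sqrt(2.0)
    return m

def patch_features(patch, n_coef=3):
    """Compact representation of a spectro-temporal patch: the
    low-order coefficients of a 2-D DCT. A GMM would then be trained
    on these feature vectors to score how speech-like a patch is."""
    cf = dct_matrix(patch.shape[0])   # frequency-axis basis
    ct = dct_matrix(patch.shape[1])   # time-axis basis
    coefs = cf @ patch @ ct.T         # separable 2-D DCT
    return coefs[:n_coef, :n_coef].ravel()
```

Because the DCT concentrates the energy of smooth spectro-temporal structure into a few coefficients, a handful of them suffices for the speech/background discrimination described in the talk.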

25 July 2007

Sarah Creer

Department of Computer Science, University of Sheffield, UK

Modern Speech Synthesis

The aim of this work is to demonstrate what speech synthesis techniques are available for building synthetic voices for potential use in communication aids. Synthetic speech is generally viewed as sounding robotic and unnatural, and there is frequently a mismatch between the voice and the communication aid user. Current technology provides techniques for building more personalised output for the communication aid user, using a minimal amount of data to produce natural- and intelligible-sounding speech.

Current VOCAs only allow personalisation of gender, language and, to a limited extent, age. The voice is an identifier of the person to whom it belongs, providing clues about the gender, age, size, ethnicity and geographical identity of the person, along with identifying them as that particular individual to family members, friends and, once interaction has begun, to new communication partners. Maintaining the individual's identity will help the ease of interaction and the maintenance of social relationships. Attempts to provide a more closely matching voice for the user can be demonstrated with the VIVOCA device, currently under development at the University of Sheffield and Barnsley Hospital. The VIVOCA takes the speech of a speaker with dysarthria, recognises it and sends a transcription to a text-to-speech synthesiser, which then outputs what the client is saying.

The VIVOCA users will be dysarthric speakers in the Barnsley and Sheffield area, and so, to preserve the geographic identity of the clients, attempts have been made to provide speech synthesisers with a local accent. To build voices for speech synthesis, different techniques have been used. These techniques require data from a speaker with a local accent. To achieve the best quality of recordings and a consistent and natural set of data, a professional speaker was sought. Ian McMillan, a Barnsley poet and broadcaster, became the first of a proposed library of locally accented recordings of a phonetically balanced dataset from which to build a synthetic voice. Using this data, different techniques, such as concatenative synthesis and hidden Markov model based synthesis, have been used to build synthetic voices.

Research will also take place into how to extend this voice building into further personalising dysarthric speakers' synthetic voices, particularly for those clients with progressive speech disorders who will eventually rely on communication aids as their primary mode of communication. Attempts will be made using data from dysarthric speakers who still have some voice function to adapt the synthetic voice so that it will sound more like that individual.

7 June 2007

Piers Messum

Department of Phonetics and Linguistics, University College London, UK

No role for imitation in learning to pronounce

In due course, a child learns to pronounce words by imitation, copying the sequence of speech sounds that he or she hears used by adult models. However, there is a conceptual precursor to this that we can call plain "learning to pronounce", where a child learns to produce speech sounds that listeners take to be equivalent to those that they themselves produce. It is almost universally assumed that this process is also imitative, but there has been no critical examination of this belief. In fact, there is no evidence to support it and some good reasons to doubt it.

There is an alternative, by which a child might learn to pronounce through vocal mirroring on the part of his or her caregivers. This is plausible in the light of other mirroring interactions, explains some communicative behaviors of mothers and infants, and solves some of the fundamental problems in speech.

Paper of the talk
PhD thesis

23 May 2007

Robin Hofe

Department of Computer Science, University of Sheffield, UK

Tongues, Trunks and Tentacles: Energetics in Physiology and Speech

One of the amazing things about spoken language is the huge amount of variation that occurs within the speech signal. Humans are perfectly able to cope with that variation or to even use it to convey additional information or to increase intelligibility for the listener. On the other hand, the same variation causes great problems in technological applications. One possible way to explain some of the variation is the H&H theory by Björn Lindblom. To investigate the theoretical basis of that theory, a biomimetic vocal tract model is currently being developed. It will be used to clarify the energetics of human articulatory movements and provide an experimental speech simulation tool.

PPT slides
Audio and video associated with the slides

2 May 2007

Yanchen Lu

Department of Computer Science, University of Sheffield, UK

Mechanisms for Human Speech Acquisition and Processing

Auditory distance perception is an important component of spatial hearing. Humans often give a biased estimate of perceived distance with respect to the physical source distance: typically, listeners underestimate the distance of far-away sound sources and overestimate that of sources closer than 1 metre. Intensity, reverberation, interaural differences and spectrum are believed to be the major cues for distance judgements. Studies suggest that these static cues cannot support an absolute judgement without prior information about the sound source or the environment. However, a moving listener can exploit the temporal context of dynamic auditory cues, motion parallax and acoustic tau, to infer absolute distance. In this presentation, I will introduce a sequential model that obtains distance estimates from these dynamic cues within a particle filtering framework. This sequential model highlights the importance of temporal reasoning in this perception task by demonstrating its superiority over an instantaneous model.
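The sequential estimation idea can be illustrated with a toy bootstrap particle filter. This is not the presenter's model: as simplifying assumptions, the only cue is received level following an inverse-square fall-off, the source is static with known reference level, and the listener's movement toward the source is known, so motion parallax enters through the changing distance. All names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_distance(observations, listener_pos, n_particles=2000):
    """Bootstrap particle filter over the initial source distance d0.
    listener_pos[t] is how far the listener has moved toward the source
    by time t, so the instantaneous distance is d0 - listener_pos[t].
    observations[t] is the received level in dB (-20 log10 distance)."""
    d0 = rng.uniform(0.5, 20.0, n_particles)       # prior over distance (m)
    weights = np.full(n_particles, 1.0 / n_particles)
    for y, x in zip(observations, listener_pos):
        d = np.maximum(d0 - x, 1e-3)               # current distances
        pred = -20.0 * np.log10(d)                 # predicted level (dB)
        weights *= np.exp(-0.5 * (y - pred) ** 2)  # Gaussian likelihood, 1 dB
        weights /= weights.sum()
        # resample to concentrate particles on plausible distances
        idx = rng.choice(n_particles, n_particles, p=weights)
        d0 = d0[idx] + rng.normal(0.0, 0.05, n_particles)  # jitter
        weights = np.full(n_particles, 1.0 / n_particles)
    return float(np.mean(d0))
```

Accumulating evidence across the listener's movement is what turns a relative cue into an absolute distance estimate, which is the temporal-reasoning point the talk makes against instantaneous models.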

PDF slides

25 April 2007

Russell Mason

Institute of Sound Recording, University of Surrey, Guildford, UK

The role of head movement in perception and measurement of spatial impression

In most listening situations, humans actively investigate the sound field by moving their heads, especially if the sound field is ambiguous or confusing. However, measurements tend to use a single static receiver position, which may not adequately reflect the listener's perception. A project is currently being undertaken to investigate the type of head movements made by listeners in different situations, how best to capture binaural signals to represent these movements, and how to interpret the results of objective analyses incorporating a range of head positions. The results of the first experiment will be discussed in detail, showing that the head movements made by listeners are strongly dependent on the listening task.

28 March 2007

Mike Carey

University of Birmingham, UK

Mechanisms for Human Speech Acquisition and Processing

The methods that humans use for processing speech and language are an exciting area of scientific enquiry. Since humans are, almost without exception, far superior to machines at this task, human speech processing is of considerable interest to engineers designing speech processing systems. A key feature of the human system is that it is learnt: we are not born talking. Hence any system proposed as an analogue of human speech processing must take this into account. However, there are strong differences of opinion between those, like Chomsky, who believe some evolutionary pre-adaptation of the human brain is necessary for speech processing, and those, like Piaget, who believe it is solely a consequence of human brain processing power. It is also important to keep in mind that language is acquired through speech, not text.

The approach described in this talk is to address this problem "bottom-up", starting with the newborn infant's problem of discriminating between speech and noise. We then address a possible method for acquiring the ability to discriminate between significant speech features. Finally, we describe how timescale variability is accommodated using the multi-layered neural network model required to recognize phonemes and words.

PPT Slides

7 March 2007

Thomas Poulsen

Department of Computer Science, University of Sheffield, UK

Sound Localization Through Evolutionary Learning Applied to Spiking Neural Networks

There is much ongoing work aimed at understanding the neural functionality involved in hearing. Typically this research attempts to "reverse-engineer" the neural processes in different animals, which has provided many invaluable insights, but it is nevertheless an extremely difficult task given the complexities involved. Simulations, however, offer a different approach to this problem, providing insight into neural encoding in much simpler structures. In my talk, I will present a simulation-based approach to demonstrate that temporally encoded neural networks can be evolved for the task of sound localization.
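The evolutionary component of such an approach can be illustrated with a toy sketch. This is not a spiking simulation and not the speaker's method: it assumes, for illustration only, a single Jeffress-style coincidence detector whose response is strongest when its internal delay cancels the interaural time difference (ITD), and it evolves that delay with a minimal mutation-and-selection loop.

```python
import random

random.seed(1)

def fitness(delay_ms, itd_ms):
    """Proxy for a coincidence detector's response: the neuron fires
    most strongly when its internal delay cancels the ITD exactly."""
    return -abs(delay_ms - itd_ms)

def evolve_delay(itd_ms, pop_size=30, generations=60):
    """Tiny evolutionary loop: keep the half of the population whose
    coincidence response is strongest, and refill it with Gaussian
    mutations of the survivors (elitist, so the best never worsens)."""
    pop = [random.uniform(-1.0, 1.0) for _ in range(pop_size)]  # delays (ms)
    for _ in range(generations):
        pop.sort(key=lambda d: fitness(d, itd_ms), reverse=True)
        survivors = pop[:pop_size // 2]
        children = [d + random.gauss(0.0, 0.05) for d in survivors]
        pop = survivors + children
    return max(pop, key=lambda d: fitness(d, itd_ms))
```

In the real work the genome would encode parameters of a temporally encoded spiking network and fitness would come from simulating its localization performance, but the selection pressure works the same way.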

21 February 2007

Sue Harding

Department of Computer Science, University of Sheffield, UK

Auditory Gist Perception and Attention

In auditory scene analysis research, it is typically assumed that the sensory signal is first broken down into features which are grouped into auditory streams according to their low-level properties and then attention is applied to select one of the streams. On the other hand, research into visual attention has suggested that the gist of a visual scene is perceived before attention is focused on the details of a particular object. In this talk, I will describe some evidence that the gist of an auditory scene is also perceived before the detail, and I will illustrate this with comparisons between vision and hearing. Some ideas for gist representations will also be discussed.

2 February 2007

Kalle Palomäki

Adaptive Informatics Research Centre, Helsinki University of Technology, Finland

Speech recognition activities in the Adaptive Informatics Research Centre

This talk gives an overview of automatic speech recognition activities in the Adaptive Informatics Research Centre at Helsinki University of Technology. The group's research activities are centred mostly on unlimited-vocabulary, morpheme-based recognition of the Finnish language. In the past the group's focus has mostly been on developing better language models for unlimited-vocabulary recognition, while the acoustic modelling has come from a rather standard large-vocabulary ASR system. Recently, however, we have launched more activities in acoustic modelling and are setting up a team that concentrates on noise-robust ASR and the use of auditory models within it. Two specific starting points will be, first, to develop new missing data algorithms suitable for large-vocabulary ASR and, second, to conduct human vs. machine speech recognition tests to better exploit knowledge about human speech recognition in ASR. In summary, the purpose of this talk is to give an overview of our group's activities and to initiate discussion about potential collaboration topics.

31 January 2007

Roger Moore

Department of Computer Science, University of Sheffield, UK

Sensorimotor Overlap in Living Organisms

On 11th January 2007, the University of Birmingham hosted a one-day open meeting on 'Unified Models for Speech Recognition and Synthesis', and this talk is a repeat of the opening invited presentation. The talk was founded on the premise that it is not only interesting to speculate on the technological advantages of combining information between the input and output modalities of practical speech technology systems, but also informative to consider the evidence for such sensorimotor overlap in living organisms. In fact, recent research in the neurosciences offers strong evidence for quite intimate connections between sensory and motor behaviour, and this has given rise to some very interesting insights into the benefits that such organisational structures can provide. This talk summarises these findings and seeks to illuminate the wider implications of attempting to unify representations of interpretive and generative behaviour in living or automatic systems.