SpandH seminar abstracts 2005

7 December 2005

Ray Meddis

Hearing Research Laboratory, Department of Psychology, University of Essex, UK

A computer model of absolute threshold performance in human listeners

Behavioral absolute threshold for a pure tone stimulus in quiet cannot easily be defined in physiological terms. Auditory nerve (AN) rate threshold is an inappropriate comparison because psychophysical thresholds depend on the duration of the stimuli used. Traditional temporal integration theories explain many psychophysical observations but it remains unclear where the integration takes place. This modeling project starts from the observation that auditory nerve first-spike latency is reduced as the stimulus is increased in both duration and level. This could be related to the dependence of behavioral threshold on duration because more intense stimuli can be detected at shorter durations. Unfortunately, near threshold, there is no simple way to distinguish the first spike after stimulus onset from spontaneous activity. However, if first-order relay neurons are set to respond only to coincident AN action potentials across a number of fibers, they can be shown to have thresholds that depend appropriately on stimulus duration. A computer model of the auditory periphery is described that has realistic first-spike latency properties. It also has firing thresholds that decline with increased stimulus duration in a manner similar to psychophysical observations. This is a temporal integration model where the integration takes place using pre-synaptic calcium in inner hair cells.
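The coincidence idea in this abstract can be illustrated with a small probabilistic sketch. It is not the speaker's model: it ignores spontaneous activity and the inner hair cell calcium stage, and it simply assumes independent fibers that each spike with probability rate × window in every coincidence window. The fiber count, coincidence requirement `k`, window length and 50% criterion are all hypothetical values chosen for illustration. Under those assumptions, the driven rate needed for a relay neuron to see at least one coincidence falls as the stimulus gets longer.

```python
from math import comb


def coincidence_prob(rate_hz, n_fibers=20, k=4, window_s=0.001):
    """Probability that at least k of n_fibers spike inside one coincidence window."""
    # Per-fiber spike probability in one window (assumed independent fibers).
    p = min(1.0, rate_hz * window_s)
    return sum(
        comb(n_fibers, j) * p**j * (1.0 - p) ** (n_fibers - j)
        for j in range(k, n_fibers + 1)
    )


def detection_prob(rate_hz, duration_s, n_fibers=20, k=4, window_s=0.001):
    """Probability of at least one coincidence during the whole stimulus."""
    n_windows = max(1, round(duration_s / window_s))
    p_win = coincidence_prob(rate_hz, n_fibers, k, window_s)
    return 1.0 - (1.0 - p_win) ** n_windows


def threshold_rate(duration_s, criterion=0.5, n_fibers=20, k=4, window_s=0.001):
    """Lowest driven rate (spikes/s) at which detection reaches the criterion."""
    lo, hi = 0.0, 1.0 / window_s
    for _ in range(60):  # bisection; detection_prob rises monotonically with rate
        mid = 0.5 * (lo + hi)
        if detection_prob(mid, duration_s, n_fibers, k, window_s) >= criterion:
            hi = mid
        else:
            lo = mid
    return hi


for duration_ms in (10, 50, 100, 500):
    rate = threshold_rate(duration_ms / 1000.0)
    print(duration_ms, "ms ->", round(rate, 1), "spikes/s")
```

Firing rate stands in for stimulus level here, so the printed values only show the direction of the effect, not realistic thresholds.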

30 November 2005

Christopher Newell

Scarborough School of Arts, University of Hull, UK

Do we need synthetic speech to sound natural, in order to sound expressive?

One goal of synthetic speech research is to produce human-like speech indistinguishable from the real thing. In order to do this, the voice will be required to be individual and expressive. In this research we argue that achieving human-like speech is not the only way to produce an individual and expressive synthetic voice. Any voice, including a machine voice, is interpreted by users within a context. Framing a machine-like disembodied voice within the context of live theatrical or musical performance circumvents problems of naturalness and allows the voice to develop unconstrained expressive possibilities similar to singing. This may operate as a legitimate substitute for the synthesis of human vocal expressiveness.

Power point slides
Associated sound clips:

BERIO.WAV (23 MB)

loudguy2.wav (23 MB)

msnd_aud.wav (4 MB)

msnd_britten.wav (5 MB)

msnd_tts.mp3 (500 KB)

r2d2dead.wav (61 KB)

Radiohead - Fitter Happier.mp3 (2 MB)

23 November 2005

Yan-Chen Lu

Department of Computer Science, University of Sheffield, UK

My Multimedia Innovation Journey

This talk will present my recent work in the field of multimedia communications to create an IP surveillance system on a System on a Chip (SoC) platform. The presentation will describe work ranging from algorithm design to integration onto physical circuits.

A digital signal processing algorithm can be implemented as software on a PC, as middleware on an embedded platform and as an integrated circuit. The chosen implementation is determined by cost and specification. I will show how these realizations differ by introducing my multimedia coding implementations. I will also discuss the principles of popular auditory and visual compression techniques to derive the corresponding design considerations. PCs are powerful computing platforms that enable fast, fully-functioning verification. However, they are expensive in cost and power compared to digital signal processors (DSPs). In an optimized porting, it is important to manage the data path and control according to the DSP architecture. Application specific integrated circuits enable data path design optimization with minimal effort in data-flow control, sacrificing the flexibility of software code. A SoC platform uses the IP-reuse concept and is integrated on a single chip. It can construct a complex system within a reasonable design-time which is crucial for a commercial product.

Power point slides

17 November 2005

Alan Newell

Research in Applied Computing, University of Dundee, UK

Systems for Older and Disabled People

This seminar will describe research into developing computer systems to support older and disabled people.

Research Projects currently include:

Innovative requirements gathering techniques with older and disabled people
Accessibility and usability of IT systems for a wide range of users
Specially designed email and web browsers
Reminiscence and conversational support for people with dementia
Applications of digital television for older and disabled people
Communication systems for non-speaking people
Systems to support people with dyslexia
Gesture and fall recognition, and lifestyle modeling
The use of theatre in human computer interaction research and usability studies

The research follows the philosophy of "ordinary and extra-ordinary human machine interaction", which is based on the parallels between a non-disabled (ordinary) person performing an extra-ordinary (high workload) task and a disabled (extra-ordinary) person performing an ordinary task, and other environments which "disable" ordinary people. This work has led to the new concepts of "User Sensitive Inclusive Design" and "Design for Dynamic Diversity".

For more information click here.

16 November 2005

Sue Denham

Centre for Theoretical and Computational Neuroscience, University of Plymouth, UK

Modelling the representation and classification of natural sounds

Our recent work has been directed towards developing a representation of acoustic stimuli suitable for real time classification of natural sounds using networks of spiking neurons. The challenge is to find a representation which encapsulates time varying spectrotemporal patterns in patterns of network activity which can be read out and classified at the slower timescales typical for cortical responses. The model we propose is based upon biologically plausible processing including cochlear filtering, the extraction of transients, and convolution with cortex-like spectrotemporal response fields (STRFs). Derivation of useful STRFs is achieved in a putative developmental stage through exposure to speech, and the properties of the resulting response fields show a surprising similarity to those measured experimentally. Salient events are evident in the response of the ensemble of STRFs. The resulting representation is capable of supporting multiple interpretations of the input, as is a characteristic of human perception, e.g. awareness of the words and speaker identity.

Power point slides
Representative papers covering topics in the talk:

Coath and Denham (2005). Robust sound classification through the representation of similarity using response fields derived from stimuli during early experience. Biological Cybernetics, 93, 22-30.

Coath, Brader, Fusi and Denham (2005). Multiple views of the response of an ensemble of spectro-temporal features supports concurrent classification of utterance, prosody, sex and speaker identity. Network: Computation in Neural Systems

2 November 2005

Piers Messum

University College London, UK

Learning to talk, but not by imitation

What moves a child's pronunciation in the direction of the adult norm? The standard answer is that children imitate the speech models that surround them and that they get better at this with age and experience. Such imitation must take different forms, of course. So for temporal phenomena (like 'rhythm' or changes in vowel lengths) the child must abstract 'rules' which determine these effects in different contexts, while for speech sounds he abstracts the features which make sounds distinctive and then produces these with his own voice.

The assumption of imitation underlies much research on speech, but it is problematic. I will present an alternative account of phonological development, where imitation plays only a minimal role. In its place, canalising pressures and reinforcement learning mechanisms are sufficient to account for progress in pronunciation. The canalising pressures arise from the embodiment of speech. That the production apparatus is a child's body rather than an adult one is the most significant factor.

Power point slides

19 October 2005

Esmeralda Uraga

Department of Computer Science, University of Sheffield, UK

Experiments using acoustic and articulatory data for speech recognition

In conventional acoustic modeling approaches for speech recognition, acoustic features are mapped to discrete symbols, such as phonemes or other subword units. Acoustic representations of speech are obtained from feature extraction techniques that make little or no use of knowledge about the underlying speech production mechanism. It has been suggested that using articulatory representations of speech should allow for better recognition systems. However, previous attempts to improve the performance of speech recognition systems using direct information about the movement of articulators have been unsuccessful.

This study compares the performance of several acoustic and articulatory speech recognition systems evaluated on a multi-channel acoustic-articulatory corpus (MOCHA). The results show that speech recognition systems based on articulatory representations of speech outperform MFCC-based systems in a speaker-dependent phone recognition task. We have found that articulatory representations of speech provide a comprehensive description of speech which is not only equal to that of the acoustic signal but also contains information which may not be present in a standard Mel Cepstrum based representation.

12 October 2005

Stephen Cox

School of Computing Sciences, University of East Anglia, Norwich, UK

Automatic Musical Genre Classification

The combination of efficient high quality music coding schemes, cheap portable players that can store thousands (soon to be millions) of tracks and growing access to broadband from home has caused a sea change in the way that recordings of music are acquired and used. Next year, income from Internet subscription services to music will exceed sales of CDs and vinyl recordings for the first time, and it is very likely that this gap will widen in the future. This phenomenon raises a set of interesting questions about how users can identify and organise the music that they want to hear. Whilst much music available on the Internet has metadata associated with it that provides information on (for instance) the associated genre, artist(s), song title etc., much does not, and it would be very useful to be able to obtain such metadata automatically from the music signal itself. Furthermore, a measure of the musical similarity between two pieces would be very useful to aid users in e.g. organising their collections and constructing playlists. In addition to these practical applications, computational processing of music is a fascinating area of artificial intelligence. In this talk, I will describe our work in one aspect of this field, automatic classification of musical genre. I will discuss both how this field (and related fields) can build on existing work in speech processing and how they need to develop new techniques and paradigms. Descriptions of experiments in automatic musical genre classification will be given, along with results, and new research opportunities and directions will be discussed.

Power point slides

5 October 2005

Richard Lyon

Chief Scientist, Foveon Inc., USA

History and Future of Electronic Color Photography: Where Vision and Silicon Meet

In the late twentieth century, developments in electronic color photography employed color separation techniques recycled from corresponding developments in silver halide photography of the late nineteenth century. Multi-shot cameras, beam-splitter cameras, and screen-plate or filter-mosaic cameras all had their day with film about a century ago, and with electronic sensors more recently. The multi-layer color sensing technique that dominated the twentieth century, originally commercialized as Kodachrome, is now recapitulated in the multi-layer silicon sensor introduced for the twenty-first century as the Foveon X3 technology. These techniques for color photography take their cues from human color vision, but ultimately must listen to the silicon.

28 September 2005

Martin Cooke

Department of Computer Science, University of Sheffield, UK

Non-native speech perception in noise

Spoken communication in a non-native language is especially difficult in the presence of noise. However, conflicting reports have appeared concerning the degree to which non-natives suffer in noise relative to natives. This study compared English and Spanish listeners' perceptions of English intervocalic consonants as a function of both non-native phonological competence and masker type. Three backgrounds (stationary noise, multi-talker babble and competing speech) provided varying amounts of energetic and informational masking. Competing English and Spanish speech maskers were used to examine the effect of masker language. Non-native performance fell short of that of native listeners in quiet, but a larger performance differential was found for all masking conditions. Both groups performed better in competing speech than in stationary noise, and both suffered most in babble. Since babble is a less effective energetic masker than stationary noise, these results suggest that non-native listeners are more adversely affected by both energetic and informational masking. The most competent Spanish listeners in quiet were also the best in noise, but they also showed the steepest drop in performance for the most difficult maskers. A small effect of language background was evident: English listeners performed better when the competing speech was Spanish.

28 September 2005

Khademul Islam Molla

Keikichi Hirose Laboratory, University of Tokyo, Japan

Separation of mixed audio signals in the time-frequency domain

Power point slides

26 July 2005

Nobuaki Minematsu

Department of Information and Communication Engineering, University of Tokyo, Japan

Theorem of the Invariant Structure and its Derivation of Speech Gestalt

Speech communication has several steps: production, encoding, transmission, decoding, and hearing. At every step, acoustic distortions are inevitably introduced by differences in speakers, microphones, rooms, hearing characteristics, etc. These are non-linguistic factors and completely irrelevant to speech recognition. Although the spectrogram always carries these factors, almost all speech applications have been built on this "noisy" representation. Recently, a novel representation of speech acoustics has been proposed, called the acoustic universal structure. What is represented here is only the interrelations among speech events; the absolute properties of the events are discarded completely. It is very interesting that the non-linguistic factors are removed effectively from speech, much as cepstral smoothing of the spectrogram removes pitch information from speech.

In this talk, the theoretical background of the new speech representation is described in detail from the viewpoints of linguistics, psychology, acoustics and mathematics, with some results of recognition experiments and perceptual experiments. It will be shown that the new representation can be viewed as speech Gestalt. Finally, some strategic similarity of speech processing between autistic people and current speech recognizers is discussed.

12 May 2005

Sarah Hawkins

Phonetics Laboratory, University of Cambridge, UK

Perceptual coherence and speech understanding, what a speech perception model should look like

I will discuss acoustic-phonetic and perceptual data showing that small differences in the speech signal reflect systematic phonological and/or grammatical differences, and are perceptually salient. Instead of just enhancing a particular phonological contrast, at least some of this systematic phonetic variation seems to contribute to speech intelligibility by making the overall signal more 'perceptually coherent'. Perceptual coherence in speech encompasses properties of the signal that are probably natural consequences of vocal-tract dynamics (such as changes in excitation type as the articulators move from a vowel to an obstruent consonant), as well as patterns that are language- and accent-specific (such as, in English, widespread effects on vowel formant frequencies due to the presence of an /r/ in the utterance). Some effects last less than 50 ms, while others can last for over 500 ms. Still others may last longer. Perceptual coherence could facilitate perception by adding useful redundancy to particular phonological contrasts, but another possibility is that its main value is in grouping the signal so that it sounds as if it is coming from a single source, and provides consistent information over long time scales. I will discuss these possibilities, relating them to a polysystemic approach to speech perception, Polysp. Compared with standard phonetic and psycholinguistic models of speech perception, Polysp de-emphasizes the role of phonology and contrasts in lexical form, and emphasizes understanding speech within the general communicative situation, linguistic and non-linguistic.

Non-essential background reading on the general approach, and Polysp. All three are available (with printers' errors corrected!) from here.

Hawkins, S. (2003). Contribution of fine phonetic detail to speech understanding. Proceedings of the 15th International Congress of Phonetic Sciences. 293-296.

Hawkins, S. (2003). Roles and representations of systematic fine phonetic detail in speech understanding. Journal of Phonetics, 31, 373-405.

Hawkins, S., & Smith, R. H. (2001). Polysp: A polysystemic, phonetically-rich approach to speech understanding. Italian Journal of Linguistics-Rivista di Linguistica, 13, 99-188.

5 May 2005

Roger Moore

20/20 Speech Ltd., UK

How Good Does Automatic Speech Recognition Have to Be? ... and when will it be that good?

Automatic Speech Recognition (ASR) is often hailed as the most 'natural' interface between humans and machines, and it has recently been cited as a technology likely to have huge market success over the next few years. However, are these views founded on a realistic assessment of what the technology can actually do, or are they based on wishful thinking prompted by the vision of intelligent conversational machines that is often portrayed in science fiction?

This talk will review the state-of-the-art in automatic speech recognition, and will illustrate that current performance is simply not good enough for many applications. A comparison will be made between ASR performance and that of a human listener, as well as alternative methods of human-machine interaction. It will then be shown how the 'goodness' of an ASR may be characterised by modelling an equivalence to a hearing impaired human, and this will be used to predict the future capabilities of ASR. However, attention is drawn to ASR's reliance on ever increasing amounts of training data, and this is contrasted with the amount of speech exposure involved in human spoken language acquisition. The talk concludes with a discussion of the implications for the future of automatic speech recognition.

PDF slides

16 March 2005

Sarah Simpson

Department of Computer Science, University of Sheffield, UK

Consonant identification in N-talker babble as a function of N

Consonant identification rates were measured for VCV tokens gated with N-talker babble noise and babble-modulated noise for an extensive range of values for N. In the natural babble condition, intelligibility was a non-monotonic function of N, with a broad performance minimum from N = 6 to N = 128. Identification rates in babble-modulated noise fell gradually with N. The contributions of factors such as energetic masking, linguistic confusion, attentional load, peripheral adaptation and stationarity to the perception of consonants in N-talker babble are discussed.
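As a rough illustration of how an N-talker babble masker might be assembled, the sketch below sums N talker waveforms sample by sample and rescales the sum so that the target token sits at a fixed signal-to-noise ratio for every N. The abstract does not give the mixing procedure or the SNR, so the -6 dB value, the function names and the Gaussian noise used in place of real speech recordings are all assumptions made for illustration.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def make_babble(talkers):
    """Sum N talker waveforms sample by sample (truncated to the shortest)."""
    n = min(len(t) for t in talkers)
    return [sum(t[i] for t in talkers) for i in range(n)]


def scale_to_snr(target, masker, snr_db):
    """Rescale the masker so that target-to-masker level equals snr_db."""
    gain = rms(target) / (rms(masker) * 10 ** (snr_db / 20.0))
    return [gain * v for v in masker]


def measured_snr_db(target, masker):
    """Target-to-masker level difference in dB."""
    return 20.0 * math.log10(rms(target) / rms(masker))


# Stand-in signals: Gaussian noise replaces real VCV tokens and talkers.
rng = random.Random(0)
target = [rng.gauss(0.0, 1.0) for _ in range(2000)]

for n_talkers in (1, 2, 8, 32):
    talkers = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(n_talkers)]
    masker = scale_to_snr(target, make_babble(talkers), snr_db=-6.0)
    print(n_talkers, "talkers, SNR =", round(measured_snr_db(target, masker), 2), "dB")
```

Holding the overall SNR constant across N is one plausible design choice: it keeps long-term masker energy fixed, so changes in identification rate would reflect the changing character of the babble and not its level.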

9 March 2005

Ryuichiro Higashinaka

Department of Computer Science, University of Sheffield, UK

Incorporating Discourse Features into Confidence Scoring of Intention Recognition Results in Spoken Dialogue Systems

This paper proposes a method for the confidence scoring of intention recognition results in spoken dialogue systems. To achieve tasks, a spoken dialogue system has to recognize user intentions. However, because of speech recognition errors and ambiguity in user utterances, it sometimes has difficulty recognizing them correctly. Confidence scoring allows errors to be detected in intention recognition results and has proved useful for dialogue management. Conventional methods use the features obtained from speech recognition results for single utterances for confidence scoring. However, this may be insufficient since the intention recognition result is a result of discourse processing. We propose incorporating discourse features for a more accurate confidence scoring of intention recognition results. Experimental results show that incorporating discourse features significantly improves the confidence scoring.

This is an ICASSP presentation.

16 February 2005

Kalle Palomäki

Helsinki University of Technology, Finland

Spatial Processing in Human Auditory Cortex: The Effects of 3D, ITD and ILD Stimulation Technique

Here, the perception of auditory spatial information as indexed by behavioral measures is linked to brain dynamics as reflected by the N1m response recorded with whole-head magnetoencephalography (MEG). Broadband noise stimuli with realistic spatial cues corresponding to eight direction angles in the horizontal plane were constructed via custom-made, individualized binaural recordings (BAR) and generic head-related transfer functions (HRTF). For comparison purposes, stimuli with impoverished acoustical cues were created via interaural time and level differences (ITDs & ILDs) and their combinations. MEG recordings in ten subjects revealed that the amplitude of the N1m exhibits directional tuning to sound location, with the right-hemispheric N1m responses being particularly sensitive to the amount of spatial cues in the stimuli. The BAR, HRTF and combined ITD+ILD stimuli resulted both in a larger dynamic range and in a more systematic distribution of the N1m amplitude across stimulus angle than did the ITD or ILD stimuli alone. The right-hemispheric sensitivity to spatial cues was further emphasized with the latency and source location of the N1m systematically varying as a function of the amount of spatial cues present in the stimuli. In behavioral tests, we measured the ability of the subjects to localize BAR and HRTF stimuli in terms of azimuthal error and front-back confusions. We found that behavioral performance correlated positively with the amplitude of the N1m. Thus, the activity taking place already in the auditory cortex predicts behavioral sound detection of spatial stimuli, and the amount of spatial cues embedded in the signal is reflected in the activity of this brain area.
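The "impoverished" ITD and ILD stimuli mentioned above can be pictured with a minimal sketch that turns a mono noise burst into a stereo pair by delaying and/or attenuating one ear. This is not the authors' stimulus-generation code. The function name, the sign convention (positive values favour the right ear), the whole-sample delay and the example cue values are all assumptions made for illustration.

```python
import random


def apply_itd_ild(mono, fs, itd_s=0.0, ild_db=0.0):
    """Return (left, right) channels carrying the requested interaural cues.

    Positive itd_s delays the left ear; positive ild_db attenuates the left ear.
    Negative values do the same to the right ear.
    """
    delay = round(abs(itd_s) * fs)       # ITD rounded to whole samples
    pad = [0.0] * delay
    leading = list(mono) + pad           # ear nearer the source
    lagging = pad + list(mono)           # ear farther from the source
    left, right = (lagging, leading) if itd_s >= 0 else (leading, lagging)

    atten = 10 ** (-abs(ild_db) / 20.0)  # linear gain for the far ear
    if ild_db >= 0:
        left = [atten * v for v in left]
    else:
        right = [atten * v for v in right]
    return left, right


# Example: 50 ms of broadband noise at 48 kHz, lateralized to the right.
fs = 48000
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(int(0.05 * fs))]

itd_only = apply_itd_ild(noise, fs, itd_s=0.0005)
ild_only = apply_itd_ild(noise, fs, ild_db=10.0)
combined = apply_itd_ild(noise, fs, itd_s=0.0005, ild_db=10.0)
print(len(combined[0]), len(combined[1]))
```

Unlike HRTF or individualized binaural recordings, this manipulation leaves the spectrum of each ear signal unchanged, which is why such stimuli carry fewer spatial cues.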