SpandH Seminar Abstracts 2011

13 December 2011, 12:00-13:00, G30 Regent Court

Maggie Vance

Department of Human Communication Sciences, The University of Sheffield

Measuring speech perception in young children and children with language difficulties: Effects of background noise and phonetic contrast

The research literature indicates that speech perception skills are significantly correlated with the level of language development in young children. Deficits in speech perception have been identified in at least some children with language impairments. This deficit is particularly marked when listening against background noise. There is currently little published material available that is suitable to assess the speech perception of young children who are speaking English.

Data will be presented from two studies using a novel test of speech perception that includes a background noise condition. Study 1 evaluates speech perception in young typically developing children and explores how closely it is related to their language development. Study 2 investigates the speech perception skills of young children with and without language difficulties, and also compares performance across different phonetic contrasts. Issues in developing material for assessment of speech discrimination in children with language impairments will be discussed.

20 October 2011, 12:00-13:00, G22 Regent Court

David Attwater

Senior Scientist, Enterprise Integration Group

Chasing the dream -- Man-machine conversation and the real world

The speech research community has been broadly pursuing the goal of the 'artificial human' for well over 20 years now. In spite of significant developments in the technology and marketplace, deployments of natural spontaneous conversational solutions are still rare and many fail. Our recent experience shows companies are now removing spoken solutions based on customer feedback. Is this a failure of technology or a failure to understand human behaviour?

The key questions facing the community today are:

  • Do we need to solve the general AI problem before conversational speech solutions are practically useful?
  • To what extent is this a desirable goal in the first place?
  • Is the additional cost and maintenance of conversational systems worth it?

In this talk we will explore how users' theory of mind dominates their interactions with machines. Pragmatic examples based on customer testing in the real world will be presented.

David Attwater joined EIG in 2003 with responsibilities for user interface usability, design, research and development. David also manages the EIG UK Office. Before joining EIG he was head of the spoken dialog research team at the British Telecom Research Labs.

David began his career researching the use of speech recognition for extremely large directory assistance applications. He then led a team researching how-may-I-help-you applications in the United Kingdom, including corpus-based development and testing of spontaneous UK English acoustic models and semantic classifiers. His research interests include natural dialog modeling techniques and emulation of spontaneous turn-taking in human-machine dialog.

David now spends much of his time testing enterprise automated dialog systems with end-customers. This includes testing new spoken dialogues on live customer calls using Wizard of Oz simulations. David has several international patents pending and a number of publications. He holds two engineering degrees from the University of York, UK.

Slides (pdf)

16 August 2011, 12:00-13:00, G30 Regent Court

Juan A. Morales-Cordovilla

Department of Signal Theory, Telematics and Communications, University of Granada, Spain

Equivalences and Limits of Pitch-based Techniques for Robust Speech Recognition

This talk discusses the performance limits of pitch-based techniques, that is, techniques which use the pitch in some way to carry out robust ASR (Automatic Speech Recognition) under noise conditions and which employ minimal assumptions about the noise. To do so, we will identify the four basic robust mechanisms that these techniques employ for recognizing voiced frames and determine the optimum mechanisms (by means of some equivalences). The corresponding limit results will be obtained experimentally by applying MD (Missing Data) oracle masks and ideal pitch. Experimental results with the Aurora-2 database will show that Tunnelling and Harmonicity mask estimations for MD recognition are close to the limits of the pitch-based robust ASR techniques, although they would require additional speech or noise information in order to achieve the performance of MD oracle masks.

18 July 2011, 14:00-15:00, G30 Regent Court

Chiori Hori

Spoken Language Communication Laboratory, National Institute of Information and Communications Technology, Japan

Introduction of U-STAR activity

The fact that the world has many different languages is one cause of the barriers to mutual understanding. Communicating with each other across different languages without any obstacles has been a long-held dream for humankind. Speech-to-speech translation (S2ST) technologies are a convincing means to realize the dream. It is very difficult for individual organizations alone to build S2ST systems covering all topics and languages, but by connecting over the network the ASR, MT and TTS modules developed by each organization and distributed all over the world, we can create S2ST systems that break the world's language barriers.

The research collaboration started in the Asian region as the A-STAR consortium. On July 29th, 2009, A-STAR launched "the first Asian network-based speech-to-speech translation system", which can perform real-time, location-free, multi-party communication between speakers of different Asian languages. To accelerate and develop the consortium's activities, it was of great importance for member countries to have international standardization in the area of speech-to-speech translation. We submitted draft texts for new ITU-T Recommendations, and these new Recommendations, H.625 and F.745, covering a protocol and data format for transmission between S2ST modules, were successfully approved in October 2010. A-STAR expanded its activities and became the U-STAR consortium in 2010.

We believe that the activities of the consortium will help the progress of collaborative international research in this important area so that we can contribute to multilingual communities around the world. The Universal Speech Translation Advanced Research (U-STAR) Consortium has been established as an international research collaboration entity with the goal of developing a universal network-based speech-to-speech translation system. The objective of the consortium is to create a basic infrastructure for spoken language communication that overcomes the language barriers that exist in the world. Currently, the participating members come from 14 countries (15 institutes).

You can get more information about U-STAR from the following page:

14 June 2011

Warren Mansell

School of Psychological Sciences, University of Manchester

Perceptual Control Theory as a Framework for Computer Modelling Across the Social Sciences

Perceptual Control Theory (PCT) is a psychological framework developed from control system engineering during the 1950s and 60s (Powers et al., 1960; Powers, 1973). The theory is supported by a range of empirical and theoretical papers, and yet its central premise - 'Behaviour is the control of perception' - remains dominated by the same assumptions of linear 'Stimulus-Response' or 'Input-Compute-Output' psychology that primed its development. In this talk I will describe examples of computer modelling using PCT across a wide range of domains (e.g. robotics, economics, sociology, speech and language). I will then describe and demonstrate the PCT simulator I have developed to model actively learning multi-layered control systems. One hypothesis we are testing is that psychological distress (i.e. mental health problems) represents chronic loss of control due to internal goal conflict, and that this is relieved by promoting changes in higher level, or deeper, systems. This would explain why psychotherapies that access 'deeper meanings', 'longer term goals' and 'broader values' are effective. To date, we have demonstrated that learning (termed 'reorganisation' in PCT) of higher level systems allows a more enduring optimisation of control than learning within the lower order, conflicted, systems. Related research will be discussed, as well as how PCT could be utilised to improve existing computer models of psychological systems.

Key Reference:
Mansell, W. (2011). Control of Perception Should be Operationalised as a Fundamental Property of the Nervous System. Topics in Cognitive Science, 3, 257-261.

7 June 2011

Parayitam Laxminarayana

Research and Training Unit for Navigational Electronics, Osmania University, India

ASR Performance Over Wired and Wireless Networks

Automatic Speech Recognition (ASR) systems are slowly being integrated into Voice over IP (VoIP) and GSM/3G wireless systems, which use combinations of different speech codecs. GSM and VoIP systems are mature enough in their use of different speech codecs for voice compression at various bit rates. Recently, all these systems have started using wideband speech and audio codecs for increased intelligibility and naturalness of the speech.

It is well known that speech quality degrades when speech is encoded and decoded in several stages over a transmission channel. This in turn considerably affects the performance of speech recognition on remote servers. In current voice applications, speech often has to traverse several different wired and wireless networks, and these networks do not necessarily use a common codec for voice data transfer. When different speech coding schemes are used in the network, the speech recognition accuracy degrades further.

In this presentation, the results of investigations into ASR accuracy using different standard narrowband and wideband speech codecs currently under deployment, and with different transcoding combinations, will be discussed. The effects on ASR accuracy of packet drop conditions in packet networks, combined with the speech codecs, are also investigated. A comparative study of Mean Opinion Score (MOS) values, which represent the speech quality for different speech codecs, is presented to study their relation to the ASR accuracies under different conditions.

10 May 2011

Peter Wallis

Department of Computer Science, University of Sheffield

Social Engagement with Robots & Agents (SERA) - project report

Social Engagement with Robots and Agents - SERA - was an EC-funded project that took a Companions-style system and actually deployed it. We knew there would be problems - in the wild, people tend to swear at their computers quite a bit - and the aim was to identify the unknown unknowns for future projects. This talk is a revised version of the Sheffield presentation at the final review. One unexpected unknown was how to do far-field conversational ASR. It turns out that Language Models are key to any workable speech recognition system and conversational English is simply not the same as writing a letter. We think that a dialog system should have a better idea of what is coming next than any Language Model and that tighter integration of ASR and Dialog Management would be a productive way forward. A second unknown is that we really don't know what to do with the data we have collected. The SERA project deliberately used "a range of techniques" to look at the video data and all of them were found wanting in one way or another. Moving from raw data to a new system design is an engineering problem for which it seems the social and human sciences can offer little help. The talk starts with a description of the challenge of collecting real data and our solutions, followed by a summary of the data collected and future work.

7 February 2011

Angelo Cangelosi

Centre for Robotics and Neural Systems, School of Computing and Mathematics, University of Plymouth

Embodied Language Learning with the Humanoid Robot iCub

Recent theoretical and experimental research on action and language processing clearly demonstrates the strict interaction between language and action, and the role of embodiment in cognition. These studies have important implications for the design of communication and linguistic capabilities in cognitive systems and robots, and have led to the new interdisciplinary approach of Cognitive Developmental Robotics. In the European FP7 project "ITALK" we follow this integrated view of action and language to develop cognitive capabilities in the humanoid robot iCub. This will be achieved through experiments on object manipulation learning, and on cooperation and communication between robots and humans (Cangelosi et al., 2010). During the talk we will present ongoing results from iCub experiments. These include iCub experiments on embodiment biases in early word acquisition ("Modi" experiment; Morse et al., 2010), studies on word order cues for lexical development and the sensorimotor bases of action words (Marocco et al., 2010), and recent experiments on action and language compositionality. The talk will also introduce the simulation software of the iCub robot, an open source software tool to perform cognitive modeling experiments in simulation (Tikhanoff et al., in press).


  • Cangelosi A., Metta G., Sagerer G., Nolfi S., Nehaniv C.L., Fischer K., Tani J., Belpaeme T., Sandini G., Fadiga L., Wrede B., Rohlfing K., Tuci E., Dautenhahn K., Saunders J., Zeschel A. (2010). Integration of action and language knowledge: A roadmap for developmental robotics. IEEE Transactions on Autonomous Mental Development, 2(3), 167-195.
  • Marocco D., Cangelosi A., Fischer K., Belpaeme T. (2010). Grounding action words in the sensorimotor interaction with the world: Experiments with a simulated iCub humanoid robot. Frontiers in Neurorobotics, 4:7.
  • Morse A.F., Belpaeme T., Cangelosi A., Smith L.B. (2010). Thinking with your body: Modelling spatial biases in categorization using a real humanoid robot. Proceedings of 2010 Annual Meeting of the Cognitive Science Society. Portland, pp 1362-1368.
  • Tikhanoff V., Cangelosi A., Metta G. (in press). Language understanding in humanoid robots: iCub simulation experiments. IEEE Transactions on Autonomous Mental Development.

Slides (pdf)

12 January 2011

Kalle Palomaki

Adaptive Informatics Research Centre, Aalto University, Finland

Our recent studies in noise robust ASR in a large vocabulary task

This talk presents some of our recent work on noise robust ASR in a large vocabulary task. In the first part, we show the results of our recent comparison of noise robust techniques using the Finnish-language Speecon corpus, with real public place and car recordings (Keronen et al., 2010). The techniques compared are parallel model combination, missing data imputation and multicondition training.

The second part, which is the main focus of the talk, concerns our recent study on uncertainty measures in sparse imputation of missing data (Gemmeke et al., 2010). In missing data imputation the estimates are usually considered equally reliable, while in reality the estimation accuracy varies from feature to feature. In the work presented here we use uncertainty measures to characterise the expected accuracy of a sparse imputation (SI) based missing data method. In experiments on noisy large vocabulary speech data, using observation uncertainties derived from the proposed measures improved the speech recognition performance on features estimated with SI. With the best measures, relative error reductions of up to 15% were achieved compared to the baseline system using SI without uncertainties.

  • Keronen S., Remes U., Palomaki K. J., Virtanen T. and Kurimo M. (2010) Comparison of noise robust methods in large vocabulary speech recognition, Eusipco 2010.
  • Gemmeke J., Remes U. and Palomaki K. J. (2010) Observation uncertainty measures for sparse imputation, Interspeech 2010.