The work is divided into five workpackages (WPs), plus project management.
WP1: Smart Meeting Room, Data Collection and Annotation
This WP is concerned with the specification of the smart room environment and of data collection and annotation protocols, resulting in the M4 meeting corpus.
WP2: Multimodal Recognition
WP2 deals with the development of multimodal recognizers that transform raw audio and video streams to higher level streams. The work will focus on the development of existing work (within the partners) in speech recognition (including task- and language-independent approaches) and action/gesture recognition, porting to the M4 domain. It will also involve investigations regarding multimodal person identification, emotion and intention recognition, and source localization and tracking. The higher level streams generated in WP2 will form the basis for the integration and information access operations of WP3.
WP3: Multimodal Integration
WP3 focuses on the principled integration of multiple streams, and the development of information access methods to enable retrieval, browsing and summarization from an archive of multistream meeting data. WP3 is a key element of M4 since it forms a bridge between the multimodal recognition level (WP2) and the application demonstrator (WP4).
WP4: Demonstration and Evaluation
This WP consists of the construction of an offline demonstration system for the MultiModal Meeting Manager, along with formal and informal evaluation of the system as a whole, and its component technologies.
WP5: Dissemination, Exploitation and Evaluation
A key aspect of this WP is the Industrial Advisory Board set up by the project, with representatives from industrial areas which could exploit the results of M4.
Background and Innovation
Smart rooms are environments equipped with multimodal sensors and computers that are designed to enable their inhabitants to work more efficiently as individuals and in groups. A smart room may automatically identify its inhabitants, transcribe what they say, infer emotional states, and facilitate the exchange of information. The Digital Desk (Xerox EuroPARC), in which gestures and paper images were recorded and merged with computer generated information, was an early, influential piece of work in this area. More recently, projects such as the NIST Smart Spaces Laboratory and the MIT Intelligent Room have focused on the development of software architectures to deal with information from various multimodal sensors.
This project builds on the ideas of smart rooms, and is concerned with the construction of a system for recording, structuring, browsing and querying an archive of meetings, using the outputs of a set of multimodal sensors.
There are some ongoing projects that address similar issues, mainly in the USA. These efforts have a different content and emphasis to the work proposed here, and may be regarded as complementary. BBN and ICSI have ongoing projects to record and browse meetings based purely on the speech streams of the participants. The BBN Rough'n'Ready project focuses on a structuring of the audio based on the speech transcription, supplemented with name, topic and speaker identification. The ICSI Meeting project has a focus on data collection and annotation, as a basis for research in spontaneous speech recognition, prosody and dialogue modelling.
The Interactive Systems Laboratory (ISL) project on Meeting Record Creation and Access at Carnegie Mellon University is perhaps the closest project to what is proposed here. It is also concerned with recording and browsing meetings based on audio and video information. However, M4 has several innovative aspects which differentiate it from the ISL project:
- A framework for the integration of multimodal data streams, including the development of a multimodal syntax and associated multisource decoding algorithms;
- Specification of an intelligent information management framework using prespecified text information (such as an agenda) to guide the structuring of the multimodal information;
- Localization and tracking of the meeting focus using sound and vision information;
- Automatic recognition of intention and emotion based on audio and video features;
- Summarization, both multimodal and textual;
- Gesture and action recognition.
We shall also work on speech recognition and multimodal information extraction/retrieval.
The recently approved IST project FAME (Facilitating Agent for Multicultural Exchange) addresses some issues related to the research proposed here. However, that project is concerned with developing an intelligent system that makes use of multimodal information streams to facilitate cross-cultural human-human conversation, rather than enabling browsing and information access.
Integration and Management of Multimodal Data Streams
M4 will require the development of new multi-channel processing approaches able to deal with the multiple, non-stationary, asynchronous data streams that are characteristic of multimodal communication (eg, audio and video streams). Some of the M4 partners have started investigating principled ways of combining such data streams using multi-stream approaches, typically based on composite hidden Markov models (HMMs), to account for stream asynchrony and different levels of information integration. These approaches process each data stream using loosely independent modules, generating local probabilities that may be combined for recognition later in the process at some segmental levels ("temporal anchor points"). These temporal anchor points are determined automatically during training to maximise a given criterion. The different stream-based HMMs can be trained jointly based on a maximum likelihood criterion. This new approach will be investigated and tested on the multimodal database collected in M4.
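The recombination step at a temporal anchor point can be illustrated with a small sketch. It is not the partners' composite HMM: it assumes each stream has already produced per-frame log-probabilities for every candidate class within one segment, and it uses a weighted sum of length-normalised log-scores as the combination rule. The function names, the normalisation and the stream weights are all illustrative assumptions.

```python
import math

def segment_score(frame_logprobs):
    # Length-normalised log-score of one class in one stream (assumption:
    # averaging makes streams with different frame rates comparable).
    return sum(frame_logprobs) / len(frame_logprobs)

def combine_at_anchor(streams, weights):
    # streams: one dict per stream, mapping class -> list of frame log-probs
    # observed between two anchor points. Streams may have different lengths,
    # which is how asynchrony inside a segment is tolerated in this sketch.
    classes = streams[0].keys()
    return {
        c: sum(w * segment_score(s[c]) for s, w in zip(streams, weights))
        for c in classes
    }

def decode_segment(streams, weights):
    # Pick the class with the highest combined score for this segment.
    combined = combine_at_anchor(streams, weights)
    return max(combined, key=combined.get)

# Hypothetical example: three audio frames and one video frame in the segment.
audio = {"speech": [math.log(0.6), math.log(0.7), math.log(0.8)],
         "silence": [math.log(0.4), math.log(0.3), math.log(0.2)]}
video = {"speech": [math.log(0.3)], "silence": [math.log(0.7)]}
label = decode_segment([audio, video], weights=[0.7, 0.3])
```

With the audio stream weighted more heavily the segment is labelled as speech; shifting the weight towards the video stream changes the decision, which is the behaviour that trained stream weights would control.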
In multimodal communication it is often the case that the semantic information is spread across different modalities. A critical challenge facing next-generation human-computer interfaces (and directly relevant to M4) concerns the development of effective language processing techniques for utterances distributed over multiple input modes such as speech, gesture and facial expression. This is not a straightforward problem, since standard approaches usually applied to a single mode (eg, finite state models of word sequences in speech recognition) cannot be easily generalized to handle multiple modes in which the syntactic/semantic information can be distributed over several streams. The development of new multimodal decoding, parsing and understanding approaches will be an important aspect of the M4 project.
The representation of multi-channel (multimodal) syntactic constraints is a related problem. We intend to use new forms of multimodal syntaxes to constrain the recognition and indexing modules. This will mainly be done by studying and modelling the structures of typical meetings collected in the framework of WP1 and matching them to the formalism used in WP2 (Multimodal Recognition), for instance via a multimodal stochastic finite state automaton.
Finally we address the software problem of combining the outputs of the various multimodal streams within an intelligent information management framework, structured by the textual side information. This problem is closely related to that of multimodal information access. Here, we will also make extensive use of the knowledge captured during the definition and standardisation of the Synchronized Multimedia Integration Language (SMIL), a W3C Recommendation.
Multimodal Information Access
The final demonstrator system will rely on a set of well-integrated information access tools that enable browsing, navigation and summarization of the set of multimedia streams. Such tools are important since it is difficult for users to grasp the full content via direct access to audio, video or the various higher level streams (eg, transcriptions). A multimodal indexing system will be at the centre of the information access tools, building on the partners' experience of text, speech, image and video retrieval. In conjunction with the management system, data descriptions will be created and we propose here to exploit a cycle whereby data descriptions are iteratively proposed and validated via user relevance feedback to perform an efficient and high level retrieval. Clearly, insights on data description schemes such as that given by MPEG-7 or multimedia document abstractions such as that proposed in RDF will form solid bases on which we will build our data representations. The former will further provide us with directions for defining a unified retrieval context.
Although we shall be dealing with unsegmented, multistream data, the existence of textual side information offers clues to structuring and segmentation, as well as the possibility of training/adaptation data for trainable information access modules. In addition to retrieval-based tasks across an archive, we shall also investigate extraction-based tasks within a meeting. In particular, we plan to develop methods for multimodal entity extraction (eg, names, numbers, times, etc.) using both stochastic finite state models and maximum entropy models, and employing automatic algorithms for multimodal feature selection.
Summarization will be a key process in the efficient presentation of meeting data to users. Meeting summarization is a challenging task: recent research on summarizing dialogues indicates that it will be more difficult than summarizing newspaper articles. Textual summaries will be largely based on generated transcriptions, but may be constrained by further audio information such as prosody or non-speech events, along with other information sources such as person identification, emotion annotations and meeting focus locations. The main framework to be employed for textual summarization is statistical. In this framework, the aim is to find the most likely summary given the source streams. Initial approaches will be based on sentence extraction, which TNO has used for multi-document summarization. Because the notion of "sentence" is not clear in conversational speech, the system will be based on speaker turns, or will make use of a statistical approach for sentence boundary detection in speech developed by USFD. We plan to investigate models which estimate the probability of a summary given the source streams directly using maximum entropy models. In addition to text summaries of meetings and meeting segments, we shall construct multimodal summaries based on audio and video key frames. This work will make use of some of the partners' experience with SMIL, in particular UniGE.
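As a rough illustration of extraction over speaker turns, the sketch below ranks turns by the average corpus frequency of their content words and keeps the top k in their original order. This frequency score is a simple stand-in chosen for illustration, not the statistical or maximum entropy model proposed above; the stopword list, the function names and the example turns are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; a real system would use a fuller one).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "we", "is", "in",
             "that", "it", "for", "on", "i", "so", "ok", "yes"}

def tokenize(text):
    # Lowercase, keep alphabetic tokens, drop stopwords.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize_turns(turns, k):
    # turns: list of (speaker, text) pairs in meeting order.
    tokens_per_turn = [tokenize(text) for _, text in turns]
    freq = Counter(w for toks in tokens_per_turn for w in toks)

    def score(toks):
        # Average frequency of the turn's content words; empty turns score zero.
        return sum(freq[w] for w in toks) / len(toks) if toks else 0.0

    ranked = sorted(range(len(turns)),
                    key=lambda i: score(tokens_per_turn[i]), reverse=True)
    keep = sorted(ranked[:k])  # restore chronological order
    return [turns[i] for i in keep]

turns = [
    ("A", "We need to fix the budget for the demonstrator"),
    ("B", "Yes ok"),
    ("A", "The demonstrator budget is too small"),
    ("C", "I had coffee"),
]
summary = summarize_turns(turns, k=2)
```

Here the two turns that mention the recurring terms "budget" and "demonstrator" are selected. Additional streams such as prosody or meeting focus could enter as extra terms in the score.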
The high speech recognition word error rate (WER) that is to be expected when dealing with conversational speech presents some problems for these information access approaches. These problems are only partially mitigated by the integration of other modalities. Background text corpora, typically obtained from the web, may be used to expand queries and transcripts to compensate for this. The high WER also complicates cross-lingual access. We plan to investigate two approaches to this, one involving the translation of complex index terms (phrases) followed by monolingual search, the second based on query translation.
Source Localization and Tracking
Human listeners make use of binaural cues (eg, interaural time difference) to localise sound sources in space. To make use of such information, and other aspects of computational auditory scene analysis, we propose to make binaural recordings using a dummy head, or manikin, in addition to the other microphones. Sound localization algorithms, to be used alone or in combination with video information, will be developed to track sources over time. An innovative aspect of the project arises from a combination of computational auditory scene analysis approaches, and independent component analysis (ICA). Frequency domain ICA has a close relationship with equalisation-cancellation models of binaural signal detection; the recombination problem of separated independent components is related to the problem of auditory scene analysis.
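The interaural time difference cue can be sketched as a cross-correlation search between the two ear signals, followed by a far-field conversion to azimuth. This is a minimal illustration rather than the localization algorithm to be developed in the project: the ear spacing, the speed of sound and the sign convention (positive values mean the source is towards the left ear) are assumptions.

```python
import math

def estimate_itd(left, right, sample_rate, max_lag):
    # Find the lag (in samples) at which the right-ear signal best matches the
    # left-ear signal. A positive lag means the right ear received the sound later.
    n = len(left)
    best_lag, best_score = 0, float("-inf")
    # Visit lags in order of increasing magnitude so ties favour the smallest delay.
    for lag in sorted(range(-max_lag, max_lag + 1), key=abs):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate

def itd_to_azimuth(itd, ear_distance=0.18, speed_of_sound=343.0):
    # Far-field approximation: itd = ear_distance * sin(azimuth) / speed_of_sound.
    # The ratio is clamped to [-1, 1] so noisy estimates do not break asin.
    x = max(-1.0, min(1.0, itd * speed_of_sound / ear_distance))
    return math.degrees(math.asin(x))

# Hypothetical example: a click reaches the left ear 3 samples before the right ear.
left = [0.0] * 32
right = [0.0] * 32
left[10] = 1.0
right[13] = 1.0
itd = estimate_itd(left, right, sample_rate=16000, max_lag=8)
azimuth = itd_to_azimuth(itd)
```

In practice such an estimate would be computed per frequency band and per frame before being tracked over time or combined with video information.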
A major innovation in this area will be the use of novel audio-visual localization methods, where audio and video processing methods are combined in order to realize a reliable source localization. In this way, it will be possible to continuously identify the focus of the meeting discussion and to detect changes in this focus. A variety of different modes that could be used for tracking (eg, face localization, sound localization, motion evaluation) will be combined in a stochastic framework in order to make the tracking algorithm robust. The tracker should remain stable even if a single mode is unreliable for a while (eg, when motion information is disturbed by other moving people), because it is still supported by the other modes in a probabilistic and error-tolerant manner.
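One simple way to see why a probabilistic combination tolerates an unreliable mode is inverse-variance fusion of Gaussian position estimates. The sketch below is an assumption-laden illustration, not the project's tracker: each mode is assumed to report a one-dimensional position with a variance expressing its current reliability, and no temporal dynamics are modelled (a Kalman or particle filter would add those).

```python
def fuse_estimates(estimates):
    # estimates: list of (position, variance) pairs, one per mode.
    # The product of independent Gaussians gives a precision-weighted mean,
    # so a mode reporting a large variance has little influence on the result.
    precision = sum(1.0 / var for _, var in estimates)
    mean = sum(pos / var for pos, var in estimates) / precision
    return mean, 1.0 / precision

# Hypothetical frame: face and sound localization agree, while the motion cue
# is disturbed by another moving person and therefore reports a high variance.
face = (1.0, 0.1)
sound = (1.2, 0.1)
motion = (5.0, 100.0)
position, variance = fuse_estimates([face, sound, motion])
```

The fused position stays close to the two reliable modes, and the fused variance is smaller than that of any single mode.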
Recognition of Intention and Emotion
Methods for intention and emotion recognition are still in a very early stage of research and development, but are considered to be mature enough for inclusion in the spectrum of modalities investigated in this project: indeed M4 can provide a realistic testbed for work in this area. Again, audio-visual methods will be developed that combine the information obtained from voice (mainly based on prosodic features) and from video (mainly based on facial expressions).
Facial expression recognition is difficult because it is often based on visual cues segmented from a face image (eg, eyebrows and lips). We plan to employ algorithms that do not rely on these cues but are capable of processing face images and even dynamical facial expression from video without prior necessity for segmentation. Intention recognition will be performed in combination with action recognition and will rely on the consideration of different types of gestures, including static hand gestures (eg, pointing), dynamic hand gestures (eg, waving), and body gestures (eg, raising, turning, moving), along with prosodic features derived from the audio.
Another aspect of this work will be the combination and selection of a variety of multimodal features for these tasks. We propose to address this problem using two approaches: maximum entropy modelling; and an ROC-based approach which obtains optimal feature subset / classifier combination at a particular operating point on an ROC curve. This latter approach has proven to be successful for the extraction of information from spoken messages (using prosodic and lexical features), and will be extended to multimodal features in M4.
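The idea of choosing a feature subset / classifier combination at a particular ROC operating point can be sketched as follows. This is an illustrative reading of the approach, not its published form: each candidate is represented only by the scores it assigns to a labelled validation set, the operating point is expressed as a maximum false-positive rate, and the candidate names and data are hypothetical.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    # Best true-positive rate reachable by thresholding the scores while
    # keeping the false-positive rate at or below max_fpr.
    pos = sum(labels)
    neg = len(labels) - pos
    best = 0.0
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        if fp / neg <= max_fpr:
            best = max(best, tp / pos)
    return best

def select_candidate(candidates, labels, max_fpr):
    # candidates: dict mapping a feature-subset/classifier name to its scores.
    return max(candidates, key=lambda name: tpr_at_fpr(candidates[name], labels, max_fpr))

# Hypothetical validation set: 1 = event present, 0 = event absent.
labels = [1, 1, 0, 0]
candidates = {
    "prosodic_only": [0.9, 0.4, 0.5, 0.1],
    "prosodic_plus_lexical": [0.9, 0.8, 0.3, 0.2],
}
chosen = select_candidate(candidates, labels, max_fpr=0.0)
```

A different operating point may favour a different candidate, which is why the selection is tied to the target point rather than to a single summary figure such as the area under the curve.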
Robust Conversational Speech Recognition
In spite of their limitations, current state-of-the-art recognizers, as available from several of the M4 partners, exhibit acceptable recognition performance (WER of 20-50%), which is often adequate for indexing and retrieval purposes. M4 will involve the exploitation and enhancement of existing large vocabulary speech recognition systems towards conversational speech recognition. Some of the partners have been, or still are, involved in successful audio indexing projects (including ambitious tasks such as sport news indexing). In addition to improving the quality of the acoustic models, one of the important research issues is the maintenance of the system, including automatic adaptation of the lexicon and grammar models. In our case, starting from a large, speaker-independent speech recognition system, we aim at optimizing the recognizer through a better definition of the lexicon and grammar model, through the additional information provided from other modes, especially the textual side information. Acoustic modelling work will concentrate on improving robustness and coping with spontaneous speech, including missing data and multiband approaches, and associated multisource decoding algorithms.
A strong reliance on context-dependent phoneme models presents significant difficulties for multilingual recognition. The phonetic environment is heavily language and task dependent. Further, a mapping from context-dependent to context-independent phonemes is many-to-one and as such it requires the additional knowledge source of the lexicon and the language model. That makes context-dependent phonemes unsuitable units for language-independent and task-independent transcription of the acoustic signal into strings of symbols. We propose to address the problems of coarticulation and recognition in changing acoustic environments using the multiband approach, based on time-frequency localized estimates of probabilities of sub-word classes. Such features show great promise in processing noisy speech from realistic environments, as recently demonstrated in the SPINE and Aurora evaluations.
Gesture and Action Recognition
The methods employed for this task have to be robust against the real-world conditions present at meetings, such as possible motion in the background, more than one person in the current focus of the camera, or varying lighting conditions. Furthermore, actions can be carried out at arbitrary moments and with arbitrary duration. Determining the beginning of an action or gesture by motion detection is hardly possible, because the meeting participants will be able to move around without any restrictions and thus might be in motion almost all the time. HMM-based pattern recognition techniques will be employed.
In this project, we also consider the online definition of an interaction stream consisting of any information that can help in analysing events within the meeting. One example is pointing towards a feature in a slide with an electronic device (mouse, laser), which implies that this feature is important to the current topic. Explicitly recording such information will therefore help in disambiguating some components of the analysis (eg, emotion).