M4 Annual Report 2003
The M4 project is concerned with the construction of a demonstration system to enable structuring, browsing and querying of an archive of automatically analysed meetings. The archived meetings take place in a room equipped with multimodal sensors, specifically microphones and cameras. As described in the 2002 report, such a meeting room has been constructed at IDIAP, and is being used to generate multimodal meeting recordings for M4.
The objectives of the project may be illustrated by a proposed meeting browser. Consider a meeting (of 5 or 6 people), following an agenda. Offline, the meeting is segmented into agenda items. The browser follows the structure of the agenda. Clicking on an agenda item brings up a set of ways to view that topic. These might include: a textual summary; a diagrammatic discussion flow indicating which participants were involved; a set of audio or video key frames to give the essence of the discussion. It would also be possible to query the rest of the archive either by example from that segment, or through an explicit query.
Summary of 2003 activities
This report covers months 10-21 of the M4 project. The research activities have focussed on processing meetings data recorded at the IDIAP smart meeting room, and made available across the project via the media file server. Research in speech processing has focussed on the use of microphone arrays and the development of speech recognizers for the meeting data. Work on the video streams has concerned face detection, person detection and tracking, and action recognition. A significant achievement has been the development of multimodal person tracking, which combines the precise tracking offered by vision-only approaches with the robustness offered by audio-based approaches. Multimodal integration activities have involved automatic access to information based on features relating to speaker turns, the outputs of visual recognizers and other audio information such as prosody. A particular achievement has been the development of an approach to meeting structuring based on the segmentation of a meeting into group events.
The research outputs of M4 will be integrated in a showcase demonstrator for browsing meetings. We have developed a prototype version of such a demonstrator, which will eventually become part of the media file server.
The Media File Server
In M4 we have recorded a number of meetings in the smart meeting room facility at IDIAP. Each meeting generates a large amount of data: multiple audio channels (from lapel microphones, the microphone array and the binaural manikin), video streams from 3 cameras, plus other information (PC interaction, instrumented whiteboard). The Media File Server supports browsing, playing and retrieving multimodal data files across the web. Researchers can also upload annotations relating to the media files. In addition, the media file server can combine selected audio, video and textual channels and play them back together (see below).
|Fig 1: The media file server: playback of three video streams, plus audio, the meeting structure and the inferred meeting structure.|
The meeting room recordings are characterized by multiple overlapping speakers, recorded by personal lapel microphones and microphone arrays. Microphone arrays are used for both location and recognition. A new technique for location-based speaker segmentation, using the microphone array and based on delays between pairs of microphones, has been developed and proven to be very accurate. Recognition experiments using beamformed microphone array data have also given good results on simultaneous speakers, when compared with lapel microphones. Other ongoing work includes the porting of an HMM-based large vocabulary speech recognition system to the meeting room domain, and the development of a language-independent front end based on a phone recognition system that uses a time-frequency representation, referred to as TRAPs.
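To make the idea of delay-based segmentation concrete, the following is a minimal sketch and not the project's actual technique: it estimates the delay between one pair of microphones by plain cross-correlation and labels each frame with the speaker whose expected delay is closest. The function names, the frame length, the per-speaker delay table and the synthetic white-noise signals are all assumptions made for illustration.

```python
import random


def estimate_delay(ref, other, max_lag):
    """Return the lag in samples at which `other` best matches `ref`.

    A positive lag means the sound reached the `ref` microphone first.
    """
    n = len(ref)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate ref[t] with other[t + lag] over the overlapping samples.
        score = sum(ref[t] * other[t + lag]
                    for t in range(max(0, -lag), min(n, n - lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def label_frames(mic_a, mic_b, frame_len, max_lag, speaker_delays):
    """Label each frame with the speaker whose expected delay is closest."""
    labels = []
    for start in range(0, len(mic_a) - frame_len + 1, frame_len):
        lag = estimate_delay(mic_a[start:start + frame_len],
                             mic_b[start:start + frame_len], max_lag)
        labels.append(min(speaker_delays,
                          key=lambda name: abs(speaker_delays[name] - lag)))
    return labels


def shifted(signal, delay):
    """Delay `signal` by `delay` samples (negative advances), zero-padded."""
    n = len(signal)
    return [signal[t - delay] if 0 <= t - delay < n else 0.0 for t in range(n)]


# Synthetic data (assumption): white noise stands in for speech.
rng = random.Random(0)
first = [rng.gauss(0.0, 1.0) for _ in range(400)]   # talker nearer mic A
second = [rng.gauss(0.0, 1.0) for _ in range(400)]  # talker nearer mic B
mic_a = first + second
mic_b = shifted(first, 3) + shifted(second, -2)

# Hypothetical calibration: expected delay in samples for each seat.
speaker_delays = {"speaker_1": 3, "speaker_2": -2}
labels = label_frames(mic_a, mic_b, frame_len=200, max_lag=5,
                      speaker_delays=speaker_delays)
print(labels)
```

A real array would combine delays from several microphone pairs and would have to cope with reverberation and overlapping speech, which this single-pair sketch ignores.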
Tracking people in meetings
We have been developing ways to track people (or parts of people, such as faces) in the smart room. One particular application is to track the current speaker, without the tracker being distracted. Modern computer vision algorithms - based on particle filters - can achieve precise tracking of objects, but are rather sensitive to how they are initialized. It is also not always easy to track the speaker in a group of people based on visual information alone. On the other hand, audio-based localization of sound sources (using a microphone array) is robust and is less sensitive to initialization. In the smart meeting room we have both audio and video information, and we can combine them to achieve a precise and robust audio-visual tracker that is able to switch between speakers and can work across multiple cameras. The operation of the audio-video person tracker is illustrated below.
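The way the two modalities complement each other can be sketched with a toy one-dimensional particle filter. This is an illustrative simplification under stated assumptions, not the M4 tracker: positions lie on a line, vision contributes a sharp likelihood peak at every detected face, audio contributes one broad peak at the estimated sound source, and a fixed fraction of particles is re-seeded from the audio estimate so the tracker can jump to a new speaker. All names and parameter values are hypothetical, and the importance-weight correction for the mixed proposal is deliberately omitted.

```python
import math
import random


def gaussian(x, mean, sigma):
    """Unnormalised Gaussian likelihood."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def track_step(particles, face_positions, audio_position, rng,
               audio_fraction=0.2, motion_sigma=0.05,
               video_sigma=0.05, audio_sigma=0.5):
    """One predict / weight / resample cycle; returns (particles, estimate)."""
    proposed = []
    for x in particles:
        if rng.random() < audio_fraction:
            # Re-seed from the audio estimate so the tracker can jump
            # to a new speaker.
            proposed.append(rng.gauss(audio_position, audio_sigma))
        else:
            # Random-walk motion model.
            proposed.append(rng.gauss(x, motion_sigma))

    weights = []
    for x in proposed:
        # Vision: sharp peaks at every face (cannot tell who is speaking).
        video = sum(gaussian(x, face, video_sigma) for face in face_positions)
        # Audio: one broad peak around the sound source.
        audio = gaussian(x, audio_position, audio_sigma)
        weights.append(video * audio + 1e-300)  # floor avoids all-zero weights

    resampled = rng.choices(proposed, weights=weights, k=len(proposed))
    return resampled, sum(resampled) / len(resampled)


# Simulated meeting (assumption): two seated people, the speaker changes
# half way through, and the audio localisation is noisy.
rng = random.Random(1)
faces = [1.0, 3.0]  # positions in metres
particles = [rng.uniform(0.0, 4.0) for _ in range(300)]
estimates = []
for frame in range(40):
    speaker = faces[0] if frame < 20 else faces[1]
    audio_position = rng.gauss(speaker, 0.2)
    particles, estimate = track_step(particles, faces, audio_position, rng)
    estimates.append(estimate)

print(round(estimates[19], 2), round(estimates[39], 2))
```

In this sketch the product of the two likelihoods is large only near the face of the active speaker, so the estimate inherits the precision of the visual cue while the audio cue decides which person is followed.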
Other examples of visual processing in M4 include the automatic detection and tracking of faces, the estimation of pose (i.e. which way someone is looking, useful to determine the focus of attention) and recognition of actions such as pointing, standing up and sitting down.
|Fig 2: Audio-visual speaker tracking. Only the speaker is tracked; the tracker is able to ignore other people in the video.|
Automatic determination of meeting structure
A meeting may be accessed by its structure (how people interact, what they do), as well as by what the participants say. For example, figure 3 illustrates the structure of a meeting based on speaking turns. We have examined a variety of approaches to automatically structuring meetings. In one approach we have defined a meeting as a series of group actions (monologue, discussion, presentation, consensus, disagreement, ...) and have trained models to automatically segment meetings in terms of these group actions, using audio features (such as speech activity, keywords, intonation) and visual features (such as head detection).
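As a rough illustration of segmenting a meeting into group actions, the sketch below decodes a sequence of per-window observations with a small hidden Markov model and merges the result into segments. It is not the project's trained model: the two states, the single observation feature (whether one or many people spoke in a window) and every probability are hand-set assumptions chosen only to show the mechanics.

```python
import math

STATES = ["monologue", "discussion"]
# Hypothetical hand-set probabilities; a real system would train these.
START = {"monologue": 0.5, "discussion": 0.5}
TRANS = {"monologue": {"monologue": 0.9, "discussion": 0.1},
         "discussion": {"monologue": 0.1, "discussion": 0.9}}
EMIT = {"monologue": {"one": 0.9, "many": 0.1},
        "discussion": {"one": 0.3, "many": 0.7}}


def viterbi(observations):
    """Return the most probable state sequence for the observations."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
              for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor of state s at this time step.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][obs]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def to_segments(path):
    """Merge runs of equal labels into (action, start, end), end exclusive."""
    segments = []
    start = 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            segments.append((path[start], start, i))
            start = i
    return segments


# Speech-activity observations for ten consecutive windows (made up).
windows = ["one", "one", "many", "one", "one",
           "many", "many", "one", "many", "many"]
segments = to_segments(viterbi(windows))
print(segments)
```

The high self-transition probability makes the decoder ignore isolated windows that disagree with their neighbours, so brief interruptions do not split a segment.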
The M4 demonstrator
The M4 demonstrator builds on the other components of the project to provide a meeting browser. In particular, the media file server, which forms the basis of the data distribution and annotation for the project researchers, will also form the basis for the M4 demonstrator. On top of this we will construct browsing interfaces that enable the incorporation of the various recognizers and information access approaches developed in the project. An initial browsing prototype has been developed, using SVG technology, and a screenshot is shown below.
User group, promotion and awareness
The project's industrial advisory board contains representatives from six European companies in the electronics, computing and communications sector. Progress and information about the project have been, and will continue to be, promoted via the project website and through other channels. The project had a stand at the IST Event 2003 in Milano, which generated considerable interest, and has been featured on the IST results website. We are pursuing ambitious research issues, and already many papers have appeared in leading international conferences and journals. In particular, members of the M4 project are organizing an invited special session at the IEEE ICASSP-2003 conference on the topic of Smart Meeting Rooms.
During the final half of the project we will be focussing on further development of the multimodal recognizers for the challenging meeting room task, the development of information access approaches that combine different modalities in a principled way, and the integration of all the technologies developed in the project in a demonstrator showcase.