M4 Annual Report 2004
The M4 project is concerned with the construction of a demonstration system to enable structuring, browsing and querying of an archive of automatically analysed meetings. The archived meetings take place in a room equipped with multimodal sensors, specifically microphones and cameras. As described in the 2002 report, such a meeting room has been constructed at IDIAP and is being used to generate multimodal meeting recordings for M4.
The objectives of the project may be illustrated by a proposed meeting browser. Consider a meeting (of 5 or 6 people) following an agenda. Offline, the meeting is segmented into agenda items. The browser follows the structure of the agenda. Clicking on an agenda item brings up a set of ways to view that topic. These might include: a textual summary; a diagrammatic discussion flow indicating which participants were involved; a set of audio or video key frames to give the essence of the discussion. It would also be possible to query the rest of the archive either by example from that segment or through an explicit query.
Summary of 2004 activities
This report covers months 22-33 of the M4 project. The main achievements of the project are:
- Development of an instrumented meeting room based on several cameras, personal microphones and a microphone array.
- The collection and annotation of a database of natural meetings recorded in the instrumented meeting room.
- Development of a meeting browser to access information from an archive of meetings.
- Development of a low-cost portable room based on a single camera and a hyperbolic mirror.
- Development of a miniature pan-tilt USB-2 camera with automatic tracking software.
- Implementation and evaluation of techniques for speaker localization, segmentation and cross-talk detection.
- Development of conversational speech recognition for meeting recordings.
- Visual and audio-visual tracking of people in meeting rooms.
- Identification and tracking of heads and hands together with head pose recognition.
- Multimodal segmentation and recognition of group actions based on multistream models.
- Automatic multi-camera video editing.
- Automatic addressee detection.
- Indexing and retrieval of a meetings archive.
- Extractive summarization of meetings.
- Dissemination: numerous invited presentations and publications; co-sponsorship of the MLMI-04 workshop in Switzerland attended by about 200 people.
The aim of the Ferret browser is to demonstrate M4 project outputs such as meeting media, segmentations, transcripts and ASR results. Ferret offers an interactive service for navigating within meeting recordings and quickly finding and playing back segments of interest within these recordings.
Fig 1: Ferret browser
To do this, Ferret shows many different kinds of data in parallel alongside the video: audio, slides and whiteboard content. For example, the illustration above shows a timeline with several speaker segmentations indicating exactly when each person was speaking. Normally, playing the meeting causes each of several videos to play along with the audio track while a cursor wends its way down the timeline. At the same time, the appropriate slide is displayed, the transcript is scrolled and the whiteboard display is updated. Additionally, clicking on a slide, the transcript or a speaker segment takes you directly to that point in time, and all the streams continue playing in sync from there. This makes it easy to find what was said while a certain slide was shown, or to search for a word in the transcript and then go straight to that point in the meeting.
Typically, a user chooses a meeting to watch, then chooses exactly which data to show in the Ferret browser; they can even add their own data while it is running, from anywhere on the Internet. This data can include speaker turn intervals, meeting actions, a level-of-interest representation, textual transcripts, slides or custom data. The user can zoom in on particular parts of interest using the zoom button on the left; by zooming out, the user can see, for example, who talked most.
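The synchronisation behaviour described above can be sketched in a few lines. This is our own simplified illustration, not Ferret's actual implementation: clicking any annotated element (slide, transcript line, speaker segment) yields a timestamp, and every stream is sought to that same instant.

```python
class Stream:
    """A stand-in for one synchronised media or annotation stream."""
    def __init__(self, name):
        self.name = name
        self.position = 0.0  # seconds from the start of the meeting

    def seek(self, t):
        self.position = t

class MeetingPlayer:
    """Holds all parallel streams and keeps them aligned on one clock."""
    def __init__(self, streams):
        self.streams = streams

    def jump_to(self, t):
        # A click on a slide, transcript line or speaker segment maps to a
        # time t; all streams then continue playing in sync from there.
        for s in self.streams:
            s.seek(t)

player = MeetingPlayer([Stream("video-1"), Stream("audio"), Stream("slides")])
player.jump_to(123.5)
```

The key design point is that annotations and media share a single meeting timeline, so any annotation is also a random-access index into the recording.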
Ferret is publicly accessible, along with more detailed documentation.
The Browser Evaluation Test (BET)
The aim of the Browser Evaluation Test (BET) is to develop objective performance metrics for meeting browsers.
Currently there is no standard evaluation procedure for meeting browsers. For many published browsers, evaluation is absent or based on informal user feedback. Where more objective data has been collected, perhaps by asking users to carry out certain tasks, these tasks are often not consistent across different studies. Generally, the questions asked of users vary widely and are often loosely defined, so the final scores are open to considerable interpretation. Most importantly, however, it is not currently possible to compare browsers and browsing techniques objectively.
In many other fields of research, an objective measure of system performance, together with a standard corpus and a set of reference tasks, can be of enormous benefit in helping researchers compare techniques, allowing the field to make progress. For example, in speech recognition the use of standardized tasks, metrics and corpora has made possible the construction of real-time, large-vocabulary systems that would not have been feasible ten years ago. The Text REtrieval Conference (TREC) has also used standard corpora, tasks and metrics with great success, with average precision doubling from 20% to 40% over the last seven years.
The BET is a method for assessing browser performance on meeting recordings. The metric we use is the number of "observations of interest" found in the minimum amount of time. Observations of interest are statements about the meeting collected by independent observers prior to performing an evaluation. When testing a browser, subjects are presented with questions drawn from the observations, enabling browsers to be assessed in terms of both speed and accuracy.
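The scoring idea can be sketched as follows. This is our own illustration of the metric, not the official BET scoring code, and the numbers are made up: a browser is credited both for the fraction of observation-derived questions answered correctly and for how quickly correct answers were found.

```python
def bet_score(correct_answers, total_questions, time_spent_s):
    """Return (accuracy, observations of interest found per minute).

    correct_answers: questions answered correctly using the browser.
    total_questions: questions drawn from the observations of interest.
    time_spent_s:    total time the subject spent browsing, in seconds.
    """
    accuracy = correct_answers / total_questions
    found_per_minute = correct_answers / (time_spent_s / 60.0)
    return accuracy, found_per_minute

# Hypothetical subject: 8 of 10 questions answered correctly in 10 minutes.
acc, per_min = bet_score(correct_answers=8, total_questions=10,
                         time_spent_s=600)
```

Reporting both numbers keeps speed and accuracy separate, so a fast-but-sloppy browser and a slow-but-thorough one are distinguishable.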
We have successfully applied the BET in a trial run and expect ourselves and others to use it as a fundamental tool in the coming years. More information is available.
Portable meeting room and pan-tilt camera
The Small Meeting Room (SMR) at Brno University of Technology, Czech Republic, has been specified as an alternative to the project's main meeting room at IDIAP, Switzerland. The SMR is low-cost, mobile and built from off-the-shelf products. Up to 4 participants can be recorded. The core of the SMR is a standard notebook equipped with 2 Hi-Fi sound cards. The audio is recorded by 4 lapel microphones and the video is captured using a standard digital camera together with a hyperbolic mirror, which provides a 360-degree view with a single video camera. Tools have been developed to unwrap the image distorted by the hyperbolic mirror; to stabilize the image (eliminating possible vibrations of the video camera, mirror or stand when attached to the "meeting table"); and to geometrically correct the output image so that physically straight edges appear straight.
Fig 2: Small Meeting Room video capture set up.
Fig 3: Meeting room scene captured via the hyperbolic mirror and digital camera (left) and an unwrapped image (right).
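The unwrapping step maps the annular mirror image into a rectangular panorama. The following is a simplified sketch of that polar-to-rectangular mapping (function and parameter names are illustrative, not the SMR tools' actual code): each output column corresponds to an angle around the mirror centre, each output row to a radius between the inner and outer mirror rim, and pixels are looked up by nearest neighbour.

```python
import math

def unwrap(image, cx, cy, r_inner, r_outer, out_w, out_h):
    """Unwrap an omnidirectional image into an out_h x out_w panorama.

    image: 2D list indexed [y][x]; (cx, cy): mirror centre in pixels;
    r_inner, r_outer: radii (pixels) bounding the useful mirror annulus.
    """
    panorama = [[0] * out_w for _ in range(out_h)]
    for col in range(out_w):
        theta = 2.0 * math.pi * col / out_w        # angle around the mirror
        for row in range(out_h):
            # radius grows linearly from the inner to the outer rim
            r = r_inner + (r_outer - r_inner) * row / (out_h - 1)
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            if 0 <= y < len(image) and 0 <= x < len(image[0]):
                panorama[row][col] = image[y][x]   # nearest-neighbour lookup
    return panorama
```

A production version would use the calibrated hyperbolic mirror profile rather than a linear radius mapping, and interpolate rather than take the nearest pixel, but the coordinate transform is the essence of the tool.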
Thirteen meetings (189 minutes in total) have been recorded and annotated (speaker turns and orthographic speech transcription) and are publicly available.
EPFL has completed development of the miniature pan-tilt camera. It is fully digital and features a USB 2.0 system interface that greatly simplifies installation and improves performance. The camera is based on a 1.3-megapixel sensor that delivers high-quality uncompressed digital video. Only one cable is needed to carry power, the digital video signal and remote-control instructions. The prototype phase is now complete, and the industrial version of the camera will soon be available. An automatic "cameraman" tracking application based on a face-colour algorithm is provided with the camera; it pilots the USB 2.0 camera position automatically and in real time. The algorithm is built mainly from open-source libraries and software tools and runs under both Windows and Linux.
Fig 4: Miniature pan-tilt camera.
Within M4, researchers at IDIAP, Edinburgh and TU München have developed models to automatically segment meetings into actions performed by the group of participants. These models are based on audio-visual features. A variety of model structures has been used (including two-level HMMs and dynamic Bayesian networks), all sharing a sophisticated approach to modelling multiple asynchronous feature streams. These approaches have proven powerful in segmenting meetings into phases such as discussion, monologue and presentation.
Fig 5: Action-based meeting structure.
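The segmentation idea can be illustrated with a toy single-stream HMM decoded by the Viterbi algorithm. This is only a sketch: the real M4 models use two-level HMMs and dynamic Bayesian networks over multiple asynchronous audio-visual streams, and all states, observations and probabilities below are made up for illustration.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            # best predecessor for state s at this frame
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p)
                             for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

states = ("discussion", "monologue")
start = {"discussion": 0.5, "monologue": 0.5}
# Sticky transitions model the fact that meeting phases persist over time.
trans = {"discussion": {"discussion": 0.9, "monologue": 0.1},
         "monologue": {"discussion": 0.1, "monologue": 0.9}}
# Observations: "many" = several speakers active, "one" = a single speaker.
emit = {"discussion": {"many": 0.8, "one": 0.2},
        "monologue": {"many": 0.2, "one": 0.8}}

labels = viterbi(["many", "many", "one", "one"], states, start, trans, emit)
# labels: ['discussion', 'discussion', 'monologue', 'monologue']
```

Contiguous runs of the same decoded state form the meeting segments; the sticky self-transitions discourage implausibly rapid switching between actions.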
We have developed a probabilistic method to track multiple people, and tested it in the meeting room. One of the main problems here is person occlusion. Our method is able to deal with the uncertainty of the problem (given that no depth information is provided), and efficiently explores the possible hypotheses (e.g., person A might be in front of person B, or vice versa) to eventually favour the most likely one.
Fig 6: Multi-person visual tracking. Three people are being tracked. Although the moving person (blue square) is momentarily occluded by the other people, the tracker is able to recover.
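The occlusion reasoning described above can be sketched as hypothesis enumeration. This is an illustration of the principle, not the actual M4 tracker: with no depth information, every front-to-back ordering of the overlapping people is a hypothesis, each is scored by how well it explains the observed image, and the most likely ordering is kept.

```python
import itertools

def best_ordering(people, likelihood):
    """Return the most likely front-to-back ordering of overlapping people.

    people:     list of person identifiers.
    likelihood: callable scoring an ordering (front person first) by how
                well it explains the image evidence.
    """
    return max(itertools.permutations(people), key=likelihood)

# Toy likelihood: pretend the image evidence strongly suggests "A" is in
# front (in a real tracker this would compare rendered hypotheses against
# the observed pixels).
score = lambda order: 1.0 if order[0] == "A" else 0.1
front_to_back = best_ordering(["A", "B"], score)
```

A practical tracker would embed this inside a probabilistic filter over time rather than enumerating orderings per frame, which is how the tracker recovers the occluded person once they reappear.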
Experiments grounded in rigorous acoustic theory have been conducted on isolating and tracking the (possibly simultaneous) speech of various meeting participants. Wavefield extrapolation enables high-accuracy tracking of sources anywhere in the meeting room. Synthetic experiments with various acquisition geometries show excellent results when using a large linear array of microphones. Sources positioned at least 10 cm apart can be separated anywhere in the meeting room, and non-target sources are suppressed by 25 dB or better. Preliminary in vivo experiments in the meeting room at TNO in Soesterberg (using only 48 microphones) confirmed good speaker-tracking performance.
Fig 7: Computed source activity for a particular microphone array geometry (high activity in red) with room and participant configuration superimposed.
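The report's method is based on wavefield extrapolation; as a simpler illustration of the underlying principle (aligning microphone signals by their propagation delays so the target source adds coherently while other sources do not), here is a toy delay-and-sum beamformer with integer-sample delays. It is our own sketch, not the TNO implementation.

```python
def delay_and_sum(signals, delays):
    """Steer a microphone array toward one source by delay compensation.

    signals: list of equal-length per-channel sample lists.
    delays:  per-channel advance (in samples) compensating the extra
             propagation time from the target source to that microphone.
    Returns the averaged, time-aligned output signal.
    """
    n = len(signals[0])
    out = [0.0] * n
    for sig, d in zip(signals, delays):
        for i in range(n):
            j = i + d                 # advance this channel by d samples
            if 0 <= j < n:
                out[i] += sig[j]
    return [v / len(signals) for v in out]

# Two channels observing the same impulse; channel 2 hears it one sample
# later. Compensating that delay makes the impulse add coherently.
aligned = delay_and_sum([[0, 1, 0, 0], [0, 0, 1, 0]], [0, 1])
```

Sources away from the steered position stay misaligned and partially cancel, which is the mechanism behind the suppression of non-target speakers; wavefield extrapolation generalises this to accurate focusing at any point in the room.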
We have also investigated physiologically- and psychophysically-inspired techniques for the audio-visual localisation and tracking of sound sources. We currently use the video from a single camera coupled with high quality recordings made from a binaural manikin (a model of a human head and torso in which high quality microphones are embedded within the ear canals).
Fig 8: Binaural manikin and video camera used for audio-visual localisation and tracking of sound sources.
Approximations of the auditory periphery and cochlear responses are used to estimate audio sound-source location. Simple image-analysis techniques extract the video features (object, motion and face location). Two neural oscillator networks perform segmentation and grouping of the audio and video activity. The system can extract video and audio features, successfully group audio and video activity occurring at the same position, and segregate incongruous audio and video data on a frame-by-frame basis. A psychophysically plausible audio-visual tracking mechanism and attentional process that includes momentum is in development, so that sources can be tracked through brief visual occlusions.
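A central cue for binaural localisation of this kind is the interaural time difference: sound from one side arrives at the nearer ear slightly earlier. As an illustration of that cue (not the project's cochlear model), the lag can be estimated as the shift that best aligns the left- and right-ear signals under cross-correlation.

```python
def itd_lag(left, right, max_lag):
    """Estimate the interaural time difference in samples.

    Returns the lag of `right` relative to `left` (within +/- max_lag)
    that maximises the cross-correlation of the two ear signals.
    A positive lag means the sound reached the left ear first.
    """
    def corr(lag):
        return sum(l * right[i + lag]
                   for i, l in enumerate(left)
                   if 0 <= i + lag < len(right))
    return max(range(-max_lag, max_lag + 1), key=corr)

# Toy impulse arriving at the right ear one sample after the left ear.
lag = itd_lag([0, 0, 1, 0, 0], [0, 0, 0, 1, 0], max_lag=2)
```

Given the lag, the microphone spacing and the speed of sound, the azimuth of the source follows from simple geometry; fusing that estimate with the video face location is what the oscillator networks' grouping step exploits.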
When more than two participants are involved in a conversation, the question of who is the addressee of an utterance becomes crucial for dialogue understanding. Our goal is to automatically identify the addressee of each utterance in multi-party meeting dialogues. As a computational model we use Bayesian networks. To build the network, we identified the mechanisms people use to address one another, and from these extracted a set of verbal, non-verbal and contextual features that observers find relevant for identifying the participants the speaker is talking to. This year we have focused mostly on collecting data for training and testing the model. For this purpose, we developed a set of tools for annotating information relevant to addressing, such as dialogue acts, named entities, some linguistic markers, gaze direction and pointing gestures. The resulting corpus can be used not only for studying addressing behaviour but also for studying conversational organization and structure, participants' involvement in a meeting, etc.
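The way the network combines evidence can be sketched with a naive-Bayes simplification (a deliberately reduced stand-in for the project's Bayesian network, with made-up feature names and probabilities): verbal and non-verbal observations such as "speaker gazes at X" or "utterance mentions X's name" are fused into a posterior over candidate addressees.

```python
def addressee_posterior(candidates, features, prior, likelihood):
    """Posterior P(addressee | observed features), naive-Bayes style.

    prior[c]:            P(addressee = c).
    likelihood[f][c]:    P(feature f observed | addressee = c).
    features:            the features actually observed for this utterance.
    """
    scores = {}
    for c in candidates:
        p = prior[c]
        for f in features:
            p *= likelihood[f][c]   # features assumed conditionally independent
        scores[c] = p
    total = sum(scores.values())
    return {c: p / total for c, p in scores.items()}

candidates = ["A", "B"]
prior = {"A": 0.5, "B": 0.5}
likelihood = {"gaze_at_A":        {"A": 0.7, "B": 0.2},
              "name_A_mentioned": {"A": 0.6, "B": 0.1}}
post = addressee_posterior(candidates, ["gaze_at_A", "name_A_mentioned"],
                           prior, likelihood)
```

A full Bayesian network additionally models dependencies between the features and contextual variables such as the previous speaker, which the independence assumption here ignores.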
User group, promotion and awareness
The project industrial advisory board contains representatives from six European companies in the electronics, computing and communications sectors. Progress and information about the project have been, and will continue to be, promoted via the project website and through other channels. The project had a stand at the IST Event 2003 in Milan which generated considerable interest, and has been featured on the IST results website. We are pursuing ambitious research issues and many papers have appeared in leading international conferences and journals. In particular, members of the M4 project organized an invited special session at the IEEE ICASSP-2003 conference on the topic of Smart Meeting Rooms. In addition, M4 co-organised the Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI'04) on the topic of meeting data modelling and analysis.