M4 Annual Report 2002
The M4 project is concerned with the construction of a demonstration system to enable structuring, browsing and querying of an archive of automatically analysed meetings. The archived meetings will have taken place in a room equipped with multimodal sensors, specifically microphones and cameras.
The objectives of the project may be illustrated by a proposed meeting browser. Consider a meeting (of 5 or 6 people), following an agenda. Offline, the meeting is segmented into agenda items. The browser follows the structure of the agenda. Clicking on an agenda item brings up a set of ways to view that topic. These might include: a textual summary; a diagrammatic discussion flow indicating which participants were involved; a set of audio or video key frames to give the essence of the discussion. It would also be possible to query the rest of the archive either by example from that segment, or through an explicit query.
Summary of 2002 activities
The M4 project kicked off in March 2002. During the first months of the project, the partners focused on developing the smart meeting room in which meetings are recorded, and on porting existing audio recognition, video recognition and information access technologies to the meeting room domain.
M4 is driven by the needs of the planned demonstrator, so another focus of the project has been to define the requirements of this demonstrator more precisely; these in turn will define the requirements for the audio, video and information access components of the project.
The project brings together partners specializing in several different areas: speech and audio processing, video processing, and information access. An important aspect of the first few months of the partnership has been establishing common ground between partners in terms of data acquisition, dissemination of knowledge about tools, and so on.
Smart meeting room
Multimodal meeting recordings are carried out in a smart room, which has been fully specified and installed (at IDIAP). The current installation includes 24 microphones, 3 fixed cameras, and synchronization and acquisition equipment. The setup has been fully debugged, and meeting recordings are proceeding.
For audio-visual recordings, the installation is dimensioned for a room with six participants. Two cameras each capture a front-on view of three people, and a third camera looks down the table towards the front of the room (whiteboard, projector screen and the meeting as a whole). Between 12 and 16 microphones are arranged in table-top arrays. Additionally, a binaural manikin has been installed (providing two audio channels), and miniature pan/tilt cameras with a USB interface will be available from early 2003. Figure 1 shows an overview of the room, and Figure 2 shows a still from one of the video channels during a recorded meeting.
Source separation, localization and tracking
Meetings are characterized by overlapping speech from multiple speakers and by non-speech audio. Even recordings made using lapel microphones suffer from considerable crosstalk: it is sometimes possible to follow a complete meeting using just the signal from a single lapel microphone. We are currently working on methods for speech detection in this domain, using both single and multiple microphones.
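The crosstalk problem can be made concrete with a deliberately simple sketch: because a wearer's own speech is usually loudest on their own lapel channel, frame-by-frame energy comparison across channels already gives a crude "who is speaking" labelling. This is an illustrative baseline only, not the project's actual detector; the frame length and silence threshold are arbitrary choices.

```python
import numpy as np

def frame_energies(channels, frame_len=400):
    """Short-time energy per frame for each lapel channel.

    channels: array of shape (n_channels, n_samples).
    Returns an array of shape (n_channels, n_frames).
    """
    n_ch, n_samp = channels.shape
    n_frames = n_samp // frame_len
    frames = channels[:, :n_frames * frame_len].reshape(n_ch, n_frames, frame_len)
    return (frames ** 2).mean(axis=2)

def dominant_speaker(channels, frame_len=400, silence_thresh=1e-4):
    """Label each frame with the loudest channel, or -1 for silence.

    Crosstalk means every lapel mic picks up every speaker, but the
    wearer's own speech is normally strongest on their own channel,
    so a per-frame energy argmax is a workable (if crude) baseline.
    """
    e = frame_energies(channels, frame_len)
    labels = e.argmax(axis=0)
    labels[e.max(axis=0) < silence_thresh] = -1   # no channel active
    return labels
```

A real system must of course handle overlapping speech, where more than one channel is active in the same frame; the single-label output above is exactly what breaks down in that case.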
Complementary to the above work, the project is also investigating far field audio capture using microphone arrays and a binaural manikin. Initial experiments with microphone arrays have produced promising results for both speaker localization and tracking, and speech recognition.
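A standard building block for the speaker localization mentioned above is time-delay estimation between microphone pairs, for which GCC-PHAT (generalized cross-correlation with phase transform) is a common choice. The sketch below is a minimal illustration of that technique, not the project's implementation; the sample rate and search range are assumptions.

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` (in seconds) via GCC-PHAT.

    The phase transform whitens the cross-spectrum, so the peak of the
    resulting cross-correlation depends on arrival-time difference rather
    than on the spectral colour of the speech. Pairwise delays across a
    microphone array can then be intersected to localize the speaker.
    """
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12          # phase transform: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Rearrange so lags run from -max_shift to +max_shift, then peak-pick.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = cc.argmax() - max_shift
    return shift / fs
```

Given the known geometry of a table-top array, each pairwise delay constrains the source to a hyperboloid; combining several pairs yields a position estimate that can be tracked over time.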
Multimodal information access
The final demonstrator system will rely on a set of well-integrated information access tools that enable browsing, navigation and summarization of the set of multimedia streams. Such tools are important because it is difficult for users to grasp the full content through direct access to the audio, video or various higher-level streams (eg, transcriptions). A multimodal indexing system will be at the centre of the information access tools, building on the partners' experience of text, speech, image and video retrieval.
During the first months of the project we have ported an existing speech retrieval system (designed for single-channel news broadcasts) to the multi-channel meetings domain. Information access techniques based on prosodic features (the intonation and timing of speech) are also being ported to the meeting domain, as are a multi-document summarizer and a system for generating short "headlines".
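In outline, retrieval over transcribed meetings works like text retrieval: meeting segments (e.g. agenda items) are indexed by the words of their transcriptions and ranked against a query. The following is a generic tf-idf sketch under that assumption; the segment identifiers and weighting scheme are illustrative and do not describe the ported system.

```python
import math
from collections import Counter, defaultdict

def build_index(segments):
    """Build an inverted index with tf-idf weights over meeting segments.

    segments: dict mapping a segment id (e.g. an agenda item) to its
    transcription text.
    """
    df = Counter()          # document frequency of each term
    tf = {}                 # per-segment term counts
    for seg_id, text in segments.items():
        counts = Counter(text.lower().split())
        tf[seg_id] = counts
        df.update(counts.keys())
    n = len(segments)
    index = defaultdict(dict)
    for seg_id, counts in tf.items():
        for term, c in counts.items():
            # Smoothed idf downweights terms common to many segments.
            index[term][seg_id] = c * math.log((1 + n) / (1 + df[term]))
    return index

def search(index, query):
    """Rank segments by summed tf-idf weight of the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for seg_id, w in index.get(term, {}).items():
            scores[seg_id] += w
    return sorted(scores, key=scores.get, reverse=True)
```

The meeting-domain complications sit on top of this outline: the "text" comes from error-prone multi-channel recognition, and the ranking can draw on prosodic and video evidence as well as words.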
User group, promotion and awareness
The project has set up an industrial advisory board with representatives from six European companies in the electronics, computing and communications sectors. Yearly workshops will be organized, possibly in collaboration with other related national and European projects. Progress and information about the project have been, and will continue to be, promoted via the project website and through other channels. Since the project is pursuing ambitious research questions, the partners will submit scientific papers to leading international conferences and journals. Members of the M4 project are coordinating an invited special session on the topic of smart meeting rooms at the IEEE ICASSP-2003 conference.
Plans for 2003
The main focus for the coming months will be the recording of 2-3 meetings in the smart meeting room. These meetings will form a testbed both for the definition of information access tasks and their associated annotations, and for the existing recognizers and trackers to be ported to the M4 domain. Reports focused on these meetings will provide a firm basis on which to develop initial consensus standards for the M4 demonstrator platform and the desired interactions.
In addition to the work areas discussed above, the project will focus on several other areas:
- Porting of existing speech recognition systems to the meetings domain.
- Definition of video and multimodal events, gestures and actions to be annotated and modelled. Existing video and multimodal recognizers will be ported to the meetings domain.
- Development of algorithms for multimodal fusion, particularly in the area of audio-video tracking.
- A baseline architecture for multimodal information access, integrating transcribed speech, prosodic and video information.