Funded by EPSRC, ROPA GR/23954, from 1 June 2001 to 31 May 2002
This project is concerned with the development of methods for the automatic summarization of spoken language utilizing prosodic features such as energy, duration, pitch and pauses. The main components of the proposed research are an investigation into which prosodic features add useful information for speech summarization and the development of methods, based on statistical pattern recognition, relating the prosodic features present in a message to the content of that message. The expected results of the project will be a prototype system for the summarization of voicemail messages, and an evaluation of the prospects of the proposed approach for more general information extraction tasks.
Speech is a very rich communication medium and recently there have been efforts to find ways of incorporating prosodic cues in order to extend the capabilities of spoken dialogue and audio browsing/retrieval systems. An important aspect of this approach is the combination of prosodic, acoustic and language information to achieve results that are more robust than those of single sources. Humans use prosody to disambiguate similar words, to group words into meaningful phrases, and to mark the importance of words or phrases. The acoustic correlates of prosody are among the cues least affected by noise, so it is likely that human listeners use prosody as a redundant cue to help them correctly recognize speech in noisy environments. Spontaneous and read speech differ in regard to prosodic structure, with the former having shorter prosodic units.
In this project, we are concerned with speech summarization, in particular the generation of short text summaries of a user's incoming voicemail messages. This is a potentially important component of integrated voice/data communication, and we have applied such a facility in a Short Message Service (SMS) based system. SMS has several unique features that can be summarized as message storage if the recipient is not available, confirmation of delivery to the sender and simultaneous transmission with voice, data and fax services. Voicemail summarization differs from conventional text summarization or abstracting, since it does not assume a perfect transcription and is concerned with summarizing brief spoken messages (average duration about 40s) into terse summaries (140 characters in the case of SMS transmission). Given this level of compression, "document flow" is less important compared with the need to transmit the principal content words in the message. We describe the system's ability to generate summaries of two test sets, having trained and validated using messages from the IBM Voicemail corpus.
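The compression constraint can be illustrated with a minimal sketch. It assumes that each transcribed word already carries an importance score from some classifier, and it greedily keeps the highest-scoring words that fit into the 140-character limit, emitting them in spoken order. The function name `compress_to_sms` and the greedy strategy are assumptions made for illustration and are not taken from the system described here.

```python
def compress_to_sms(words, scores, max_chars=140):
    """Keep the highest-scoring words that fit in max_chars, in spoken order.

    Illustrative sketch only: the scores are assumed to come from a
    word-level classifier; the greedy selection is a hypothetical choice.
    """
    # Rank word positions by descending score (ties broken by position).
    ranked = sorted(range(len(words)), key=lambda i: (-scores[i], i))
    kept = []
    used = 0
    for i in ranked:
        # Every word after the first costs one extra character for a space.
        cost = len(words[i]) + (1 if kept else 0)
        if used + cost <= max_chars:
            kept.append(i)
            used += cost
    # Restore the original word order so the summary follows the message.
    return " ".join(words[i] for i in sorted(kept))
```

Because the selected words are re-sorted by position, the summary preserves the order of the message even though "document flow" is not modelled explicitly.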
In many applications, such as speech summarization, the cost of different types of errors is not known at the time of designing the system. Additionally the costs may change over time. Finally, some costs cannot be specified quantitatively: in speech summarization such costs include coherence degradation, readability deterioration and topical under-representation. Thus, we resort to specifying the classifier in the form of an adjustable threshold and a receiver operating characteristic (ROC) curve obtained by setting the threshold to various possible values (Provost and Fawcett, 2001).
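As a minimal sketch of how such a curve is obtained, the following function sweeps the threshold over every distinct classifier score and records one operating point per threshold. The function name `roc_points`, the 0/1 label encoding and the "score at or above threshold means include" rule are assumptions for illustration; the input must contain at least one important and one unimportant word.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs, one per threshold.

    scores: classifier outputs, higher meaning "more likely important".
    labels: 1 for an important word, 0 otherwise (assumed encoding).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # A threshold above every score selects nothing: the (0, 0) corner.
    points = [(0.0, 0.0)]
    # Use each distinct score as a threshold, from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points
```

Each returned point corresponds to one setting of the adjustable threshold, so the operating point can be chosen later, once the error costs are known.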
Many tens of lexical and prosodic features can be identified and calculated as inputs to classifiers. It is desirable to select a subset of these features and to discard the remainder. This is useful if some features carry little information for the particular task, or if there are very strong correlations between sets of inputs so that the same information is repeated in several features. Furthermore, one might wish to reduce the dimensionality simply to make the classification calculations quicker, to save storage space or to permit rapid feature extraction.
Classifiers may be combined by random switching to achieve any operating point on the convex hull of their ROC curves. Such a combination is referred to as the Maximum Realizable ROC (MRROC) classifier. Scott, Niranjan and Prager derived the Parcel algorithm, which sequentially selects features and classifiers to maximize the MRROC. This implies that different trade-offs in the ROC curve require different optimal feature sets and classifiers. The objective of Parcel is to produce an MRROC that has the largest possible area underneath it, i.e., to maximize the Wilcoxon statistic associated with the classification system defined by the MRROC. This is achieved by searching for, and retaining, those features and classifiers that extend the convex hull defined by the MRROC. The Parcel algorithm seeks not to select a single best feature subset, but rather to select as many different subsets as are necessary to produce satisfactory performance across all costs.
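The geometric part of this idea can be sketched in a few lines. The code below is not the Parcel algorithm itself (it performs no feature or classifier search); it only shows, under assumed names, how the convex hull of a set of ROC operating points is formed, how the area beneath it is measured, and how random switching reaches a point between two hull vertices. The functions `mrroc_hull`, `area_under` and `switching_probability` are hypothetical helpers introduced for illustration.

```python
def mrroc_hull(points):
    """Upper convex hull of ROC points, with the trivial (0,0) and (1,1) added.

    points: iterable of (false-positive rate, true-positive rate) pairs.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the line
        # joining the point before it to p, i.e. it does not extend the hull.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def area_under(hull):
    """Trapezoidal area under a piecewise-linear curve given as sorted points."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(hull, hull[1:]))


def switching_probability(a, b, target_fpr):
    """Probability of using classifier b (else a) to hit target_fpr on average.

    a, b: adjacent hull vertices with a[0] < target_fpr < b[0].
    """
    return (target_fpr - a[0]) / (b[0] - a[0])
```

For example, given the operating points (0.1, 0.6), (0.3, 0.5) and (0.4, 0.9), the second point falls below the hull and is discarded. Switching between the first and third with probability of about 0.5 yields an expected false-positive rate of 0.25. A candidate feature set or classifier is worth retaining only if adding its points increases the area returned by `area_under`.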
We have applied the Parcel feature subset selection algorithm to evaluate which of the several, often correlated, lexical and prosodic features are potentially optimal as classifier inputs for voicemail summarization. The architecture of our system is shown in the following figure:
Two rates can be calculated for any series of classifications: the true-positive (sensitivity) and the false-positive (1-specificity) rates. A true-positive occurs when an important word is correctly included in the summary, and a false-positive when a non-important word is incorrectly included in the summary.
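A small worked example may make the two definitions concrete. The sketch below assumes that each word of a message has a reference label (important or not) and a decision (included in the summary or not); the function name `summary_rates` and the boolean encoding are illustrative assumptions.

```python
def summary_rates(important, included):
    """Return (true-positive rate, false-positive rate) for one summary.

    important: per-word booleans from the reference annotation.
    included:  per-word booleans, True if the word is in the summary.
    Assumes at least one important and one non-important word.
    """
    pairs = list(zip(important, included))
    tp = sum(1 for imp, inc in pairs if imp and inc)      # important, kept
    fp = sum(1 for imp, inc in pairs if not imp and inc)  # unimportant, kept
    positives = sum(1 for imp in important if imp)
    negatives = len(important) - positives
    return tp / positives, fp / negatives
```

For a five-word message with two important words, a summary that keeps one important word and one of the three unimportant words has a true-positive rate of 1/2 and a false-positive rate of 1/3.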
The left figure depicts the ROC curves produced using single features on the validation set. For simplicity, only the best (potentially optimal) types of features are shown; collection frequency, NE scoring, duration, energy, pitch onset, pitch amplitude and pitch range offer maximum discrimination.
The right figure depicts the MRROC curves produced by Parcel on the validation set using lexical features only, prosodic features only and a combination of lexical and prosodic features. Lexical features as classifier inputs clearly dominate prosodic features at all thresholds. The combination of lexical and prosodic features gives performance superior to that of any single constituent classification system.
Results measuring the quality of summary artifacts using a weighted Slot Error Rate (SER) metric show that combined lexical and prosodic features are at least as robust as combined lexical features alone across all operating conditions.
The architecture of the proposed system encompasses three distinct phases of processing:
The following table shows the human transcription, the automatic transcription and the automatic summarization of the spoken message vm1dev26.
Human transcription: HI MARY IT'S MARY ANN SHAW I JUST HAVE A QUICK QUESTION FOR YOU THE DEFENSIVE DRIVING COURSE THAT IS TOMORROW AND THURSDAY CAN YOU LET ME KNOW IF THAT'S IN THE HAWTHORNE ONE OR HAWTHORNE TWO JUST WANTED TO MAKE SURE I'M NOT SURE THE WAY THEY YOU KNOW SET UP THEIR ROOMS SO IF YOU COULD GIVE ME A CALL I'M ON TIELINE EIGHT TWO SIX SIXTEEN OH TWO BYE
Automatic transcription: HEY GARY IT'S MARY ANN SURE I SHOULD HAVE A QUICK QUESTION FOR YOU THE DEFENSIVE DRIVING COURSE IT IS TOMORROW AND THURSDAY CAN YOU LET ME KNOW THANKS IN THE HAWTHORNE WONDERFUL POINT TWO JUST WANTED TO MAKE SURE I'M NOT SURE THE WAY YOU KNOW SET UP THERE SO IF YOU COULD GIVE ME A CALL ON TIELINE EIGHT TWO SIX SIXTEEN OH TWO BYE
The following figures show retrieval of the summary of vm1dev26 on the display of a WAP phone. An optional connection to the voicemail system, allowing the user to listen to the particular message, can be provided by the WTA.
K. Koumpis and S. Renals
Costis Koumpis Last modified: Tue Jan 15 15:31:07 GMT 2002