Automation of Summary Evaluation by the Pyramid Method
A Harnly, A Nenkova, R Passonneau, and O Rambow, 2005. Automation of summary evaluation by the Pyramid Method. Recent Advances in Natural Language Processing (RANLP).
The manual Pyramid method for summary evaluation, which focuses on the task of determining if a summary expresses the same content as a set of manual models, has shown sufficient promise that the Document Understanding Conference 2005 effort will make use of it. However, an automated approach would make the method far more useful for developers and evaluators of automated summarization systems. We present an experimental environment for testing automated evaluation of summaries, pre-annotated for shared information. We reduce the problem to a combination of similarity measure computation and clustering. The best results are achieved with a unigram overlap similarity measure and single-link clustering, which yields high correlation to manual pyramid scores (r=0.942, p=0.01), and shows better correlation than the n-gram overlap automatic approaches of the ROUGE system.
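The abstract names the two components of the approach, a unigram overlap similarity measure and single-link clustering, but does not define them here. The following is a minimal sketch under stated assumptions, not the authors' implementation: overlap is taken to be the Jaccard ratio of lowercased word sets, single-link clustering is cut at a fixed threshold, and the function names, the threshold value of 0.3, and the example spans are all hypothetical.

```python
import re
from itertools import combinations


def unigrams(text):
    """Return the set of lowercased word tokens in a text span."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unigram_overlap(a, b):
    """Jaccard overlap of two unigram sets (an assumed normalization)."""
    ua, ub = unigrams(a), unigrams(b)
    if not ua or not ub:
        return 0.0
    return len(ua & ub) / len(ua | ub)


def single_link_clusters(spans, threshold=0.3):
    """Group spans so that any pair at or above the threshold shares a cluster.

    Single-link clustering cut at a fixed similarity threshold is equivalent
    to taking connected components of the thresholded similarity graph,
    computed here with a small union-find. The threshold is illustrative.
    """
    parent = list(range(len(spans)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(spans)), 2):
        if unigram_overlap(spans[i], spans[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, span in enumerate(spans):
        clusters.setdefault(find(i), []).append(span)
    return list(clusters.values())


# Invented spans for illustration only; they are not taken from the paper.
spans = [
    "the committee approved the budget",
    "the budget was approved by the committee",
    "rain is expected tomorrow",
]
for cluster in single_link_clusters(spans):
    print(cluster)
```

In this sketch the first two spans share four of six distinct words and fall into one cluster, while the third shares none and stays alone. A different normalization (for example, dividing by the size of the shorter span) or a different threshold would change the groupings.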
Full Text
- Aaron Harnly
- Ani Nenkova
- Rebecca Passonneau
- Owen Rambow
Department of Computer Science, Center for Computational Learning Systems
Columbia University
New York, NY, USA
Introduction
Automatic summarization is usually evaluated through comparison to human summarization choices for the same texts. Traditionally, the comparison is done by eliciting human judgments on content. When humans write short, abstractive summaries based on their reading of multiple documents, they select the content they think belongs in a summary and put it in their own words. While many words and phrases may be similar to those another human summarizer would employ, people can use different forms of the same words (inflectional or derivational variants), different word order, different syntactic structure, and paraphrases. See, for example, the spans of words in bold below, which come from five different summaries of the same set of documents about a Swissair crash off Nova Scotia in 1998 and all express the fact that the cause of the crash has not been determined.



