aaron.harnly.net

Automation of Summary Evaluation by the Pyramid Method

A Harnly, A Nenkova, R Passonneau, and O Rambow, 2005. Automation of summary evaluation by the Pyramid Method. Recent Advances in Natural Language Processing (RANLP).

The manual Pyramid method for summary evaluation, which focuses on the task of determining if a summary expresses the same content as a set of manual models, has shown sufficient promise that the Document Understanding Conference 2005 effort will make use of it. However, an automated approach would make the method far more useful for developers and evaluators of automated summarization systems. We present an experimental environment for testing automated evaluation of summaries, pre-annotated for shared information. We reduce the problem to a combination of similarity measure computation and clustering. The best results are achieved with a unigram overlap similarity measure and single-link clustering, which yields high correlation to manual pyramid scores (r=0.942, p=0.01), and shows better correlation than the n-gram overlap automatic approaches of the ROUGE system.
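The reduction described above, a pairwise similarity measure followed by clustering, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact overlap formula, so the sketch uses Jaccard overlap of lowercased word sets, and the function names, the threshold value, and the example spans are all hypothetical. Single-link clustering cut at a fixed threshold is equivalent to taking connected components of the graph whose edges join pairs at or above the threshold, which is what the union-find computes.

```python
import re
from itertools import combinations


def unigrams(text):
    """Return the set of lowercased word tokens in a text span."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unigram_overlap(a, b):
    """Jaccard overlap of unigram sets (an assumed formula, not the paper's)."""
    ua, ub = unigrams(a), unigrams(b)
    if not ua or not ub:
        return 0.0
    return len(ua & ub) / len(ua | ub)


def single_link_clusters(spans, threshold):
    """Single-link clustering cut at a fixed similarity threshold.

    Two spans land in the same cluster if a chain of pairs, each with
    similarity >= threshold, connects them.
    """
    parent = list(range(len(spans)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(spans)), 2):
        if unigram_overlap(spans[i], spans[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, span in enumerate(spans):
        clusters.setdefault(find(i), []).append(span)
    return list(clusters.values())


# Invented example spans (not taken from the paper's data).
spans = [
    "the cause of the crash is unknown",
    "the cause of the crash has not been determined",
    "the plane crashed off Nova Scotia",
    "the jet crashed off the coast of Nova Scotia",
]
for cluster in single_link_clusters(spans, threshold=0.35):
    print(cluster)
```

With the hypothetical threshold of 0.35, the two spans about the undetermined cause fall into one cluster and the two about the crash location into another. The threshold is a free parameter of this sketch and would need tuning against annotated data.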

Full Text

  • Aaron Harnly
  • Ani Nenkova
  • Rebecca Passonneau
  • Owen Rambow

Department of Computer Science, Center for Computational Learning Systems, Columbia University, New York, NY, USA

Introduction

Automatic summarization is usually evaluated through comparison to human summarization choices for the same texts. Traditionally, the comparison is done by eliciting human judgments on content. When humans write short, abstractive summaries based on their reading of multiple documents, they select content they think belongs in a summary, and put it in their own words. While many words and phrases may be similar to those another human summarizer would employ, people can use different forms of the same words (inflectional or derivational variants), different word order, syntactic structure, and paraphrases. See for example the spans of words in bold below, coming from five different summaries of the same set of documents about a Swissair crash off of Nova Scotia in 1998, all expressing the fact that the cause of the crash has not been determined.
