
Experiments in Single and Multi-Document Summarization Using MEAD

Dragomir R. Radev
School of Information and Department of EECS University of Michigan Ann Arbor, MI 48109

Sasha Blair-Goldensohn
School of Information University of Michigan Ann Arbor, MI 48109

Zhu Zhang
School of Information University of Michigan Ann Arbor, MI 48109

ABSTRACT
In this paper, we describe four experiments in text summarization. The first experiment involves the automatic creation of 120 multi-document summaries and 308 single-document summaries from a set of 30 clusters of related documents. We present official results from a multi-site manual evaluation of the quality of the summaries. The second experiment is about the identification by human subjects of cross-document structural relationships such as identity, paraphrase, elaboration, and fulfillment. The third experiment focuses on a particular cross-document structural relationship, namely subsumption. The last experiment asks human judges to determine which of the input articles in a given cluster were used to produce individual sentences of a manual summary. We present numerical evaluations of all four experiments. All automatic summaries have been produced by MEAD, a flexible summarization system under development at the University of Michigan.

1. INTRODUCTION

The University of Michigan's summarization system, named MEAD, was initially developed to produce multi-document extractive summaries. The main idea behind MEAD is the use of the centroid-based feature [7], which identifies sentences that are highly relevant to an entire cluster of related documents. Version 2.0 of MEAD was developed in 2001 and addresses DUC-specific constraints such as absolute summary length, very short summaries, as well as the requirement to produce both single-document and multi-document summaries. In this paper we present a brief description of the MEAD system, including two deployed web-based applications: NewsInEssence and WebInEssence. We then turn to the version of MEAD as used in DUC 2001, focusing on the results of the evaluation. We then briefly describe three user studies which were undertaken in 2001 with the goal of understanding how information provenance, cross-document subsumption, and the identification of cross-document structural relationships can be used in the production of better multi-document summaries.

2. MEAD: A CENTROID-BASED SUMMARIZER

MEAD is based on work described in [7]. It is based on sentence extraction. For each sentence in a cluster of related documents, MEAD computes three features and uses a linear combination of the three to determine which sentences are most salient. The three features used are centroid score, position, and overlap with first sentence (which may happen to be the title of a document). The input to MEAD is plain text and a compression rate. MEAD uses the LT-POS software, developed at the University of Edinburgh [3], to mark sentence boundaries automatically. For each sentence, MEAD then computes three values:

- the centroid score [7], which is a measure of the centrality of a sentence to the overall topic of a cluster (or document, in the case of a single-document cluster),
- the position score, which decreases linearly as the sentence gets farther from the beginning of a document, and
- the overlap-with-first score, which is the inner product of the TF*IDF-weighted vector representations of a given sentence and the first sentence (or title, if there is one) of the document.

All three features are normalized to the range [0,1]. The overall score for a sentence is a linear combination of the three features. For this paper, only one combination of feature weights is used: a weight value of 3 is used to produce shorter multi-document summaries (50- and 100-word summaries); for 200- and 400-word summaries, the weight was set to 4. A trainable version of MEAD was subsequently developed and will be briefly mentioned in the conclusion of this paper.

MEAD discards sentences that are too similar to other sentences. A parameter, redundancymax, is used to decide which sentences are too similar (based on a cosine similarity). For the DUC experiments, a similarity threshold of .7 was used for multi-document summaries. That value was raised to .95 for single-document summaries. Any sentence that is not discarded due to high similarity and which gets a high score (within the specified compression rate) is included in the summary.

In addition to the above values, MEAD uses a large number of other parameters that can be set by the user. Some of them need to be mentioned. The first one, shortestsentmin, indicates the minimum sentence length (in words) that will be included in a summary. The second parameter, shortestfirstsentmin, specifies the minimal length in words of the first sentence to be included in a summary. The defaultidf value indicates what IDF should be given to words that are seen for the first time and for which an IDF value is therefore not known. The values for these three parameters that were used in the evaluation are 9, 15, and 5, respectively. Another parameter, wordutilpower, specifies the power to which the number of words in a sentence will be raised before doing the score-per-word division. Examples: set to 0 to have scores divided by 1, so all sentences are scored the same regardless of length; set to .5 to have a bias toward longer sentences; set to 1 to divide by the number of words in the sentence, so the sentence score is inversely proportional to length; set to 1.5 to have a bias toward shorter sentences. In general, lower settings favor longer sentences, higher ones favor shorter sentences.

A version of MEAD was used in the development of two Web-based summarizers: WebInEssence and NewsInEssence. WebInEssence [6] works on arbitrary web pages, while NewsInEssence [5] specializes in clusters of related news stories extracted in real time from Web sources.
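The scoring and redundancy-removal scheme described above can be sketched in a few lines. This is a simplified illustration, not MEAD's actual implementation; the function names, the toy vectors, and the choice of attaching the weight to the overlap feature are our assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mead_score(centroid, position, first_overlap, w=3.0):
    # Each feature is assumed normalized to [0, 1].  The paper uses a
    # weight of 3 for 50/100-word summaries and 4 for 200/400-word ones;
    # which feature the weight multiplies is an assumption in this sketch.
    return centroid + position + w * first_overlap

def extract(sentences, redundancy_max=0.7, n_wanted=2):
    """Greedy extraction: take the highest-scoring sentences, skipping any
    whose cosine similarity to an already chosen one exceeds redundancy_max
    (.7 for multi-document, .95 for single-document in the DUC runs)."""
    ranked = sorted(sentences, key=lambda s: s["score"], reverse=True)
    chosen = []
    for s in ranked:
        if len(chosen) == n_wanted:
            break
        if all(cosine(s["vec"], c["vec"]) <= redundancy_max for c in chosen):
            chosen.append(s)
    return [s["id"] for s in chosen]

# Toy cluster: sentence 2 nearly duplicates sentence 1, so it is skipped.
sents = [
    {"id": 1, "vec": {"bse": 1.0, "cattle": 1.0}, "score": mead_score(0.9, 1.0, 0.8)},
    {"id": 2, "vec": {"bse": 1.0, "cattle": 0.9}, "score": mead_score(0.8, 0.9, 0.7)},
    {"id": 3, "vec": {"beef": 1.0, "export": 1.0}, "score": mead_score(0.5, 0.8, 0.1)},
]
print(extract(sents))  # → [1, 3]
```

In MEAD itself, the per-sentence values fed into the combination would come from the centroid, position, and first-sentence-overlap computations described above, with the additional length and IDF parameters applied on top.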

System   Grammaticality   Cohesion   Organization   Total
L        432              212        220             864
M        382              235        259             876
N        423              232        258             913
O        439              249        270             958
MEAD     426              224        252             902
R        418              250        284             952
S        418              220        233             871
T        407              271        303             981
U        380              152        129             661
W        363              172        148             683
Y        284              201        205             690
Z        380              209        225             814

Table 3: Multi-document evaluation

CST proposes a taxonomy of the informational relationships between documents in clusters of related documents. Some of the relationships are direct descendants of those used in SUMMONS [8], except that in CST these relationships are domain-independent. CST posits that by identifying these cross-document links, one can produce superior multi-document summaries. The concept of using CST for multi-document summaries relates to that of using Rhetorical Structure Theory (RST) [1] for single-document summarization [2]. However, while Marcu relied on cue phrases in implementing algorithms to discover the valid RST trees for a single document, such a technique is not very plausible for discovering CST links between documents. For instance, the cue phrase "although statementX, statementY" might indicate the RST relationship concession in some circumstances. Marcu is able to use these phrases for guidance because of the conventions of writing and the valid assumption that authors tend to write documents using certain rhetorical techniques. However, in the case of multiple documents and CST inter-document relationships (links), we cannot expect to encounter a reliable analog to the cue phrase. This is because separate documents, even when they are related to a common topic, are not (generally) written with an overarching structure in mind. Particularly in the case of news article clusters, we are most often looking at articles which are written by different authors working from partially overlapping information as it becomes available. So, except in cases of explicit citation, we cannot expect to find a static phrase in one document which reliably indicates a particular relationship to some phrase in another document. Nonetheless, with the proliferation of available online news sources, it becomes increasingly attractive to be able to map the inter-document relationships proposed by CST in an automated fashion.

As argued in [4], being able to produce a set of CST arcs which map between a set of documents in a news cluster would enable multi-document summaries that are not only generally superior, in terms of reduced redundancy and other desirable features, but also tailored to individual preferences. How, then, to approach the problem of discovering CST relationships in a set of documents? We present an exploratory experiment in which human subjects were asked to find these relationships over a multi-document news cluster. It is our hope that the results of this experiment will be an early step in the eventual development of automated CST parsing techniques.
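Before any automated mapping is possible, the links themselves need a representation. A minimal sketch follows; it is our illustration, not part of MEAD or [4], and the class names and fields are assumptions. It treats a CST link as a typed arc between text spans of two documents, so a set of links over a cluster forms a graph a summarizer could query.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    doc: str           # document identifier
    sentences: tuple   # 1-based sentence numbers; a span may cover several

@dataclass(frozen=True)
class CSTLink:
    relation: str      # e.g. "subsumption", "elaboration", "contradiction"
    s1: Span
    s2: Span

# Two hypothetical links over a cluster containing documents A and B.
links = [
    CSTLink("subsumption", Span("A", (1, 2)), Span("B", (2,))),
    CSTLink("elaboration", Span("A", (4,)), Span("B", (2,))),
]

# Query the link graph: every relation that points at sentence 2 of B.
related = [l.relation for l in links if l.s2 == Span("B", (2,))]
print(related)  # → ['subsumption', 'elaboration']
```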

3. RESULTS FROM DUC 2001

We produced 120 multi-document summaries from the 30 clusters provided by DUC (30 clusters * 4 compression rates: 50-word, 100-word, 200-word, and 400-word summaries) as well as 308 single-document summaries. Some sample summaries are included in Figure 1. These four summaries are from the same cluster, DUC cluster d05. We present our results as given to us by the DUC evaluators. Table 1 includes our performance on 11 criteria. The following two tables, 2 and 3, show MEAD's performance on three of the criteria: overall peer grammaticality, overall peer cohesion, and overall peer organization.

System   Grammaticality   Cohesion   Organization   Total
O        554              422        452            1428
MEAD     558              390        424            1372
Q        556              381        397            1334
R        569              439        478            1486
S        547              367        380            1294
T        534              410        460            1404
V        553              455        479            1487
W        521              393        409            1323
X        535              375        413            1323
Y        481              369        403            1253
Z        493              334        363            1190

Table 2: Single-document evaluation

4. A STUDY OF CROSS-DOCUMENT STRUCTURAL RELATIONSHIPS

In this section, we present an experiment in which subjects were asked to analyze a set of documents using a set of proposed relationships from Cross-Document Structure Theory (CST) [4]. We then present the experimental results and consider the implications for further work in CST.

50 words
Mad cow disease, or bovine spongiform encephalopathy, or BSE, was diagnosed only in 1986. THE CONDITION known in cattle as mad cow disease, spongiform encephalopathies, has been found in Britain's sparsely-scattered antelope population, the government has admitted. He believes that BSE can trigger human brain disease.

100 words
Mad cow disease has killed 10,000 cattle, restricted the export market for Britain's cattle industry and raised fears about the safety of eating beef. Mad cow disease, or bovine spongiform encephalopathy, or BSE, was diagnosed only in 1986. THE CONDITION known in cattle as mad cow disease, spongiform encephalopathies, has been found in Britain's sparsely-scattered antelope population, the government has admitted. He believes that BSE can trigger human brain disease. "Our worst predictions are coming true," he said. The German government yesterday announced the launch of a new research project to examine whether the cattle disease bovine spongiform encephalopathy can be transmitted to human beings.

200 words
Mad cow disease, or bovine spongiform encephalopathy, or BSE, was diagnosed only in 1986. Mad cow disease has killed 10,000 cattle, restricted the export market for Britain's cattle industry and raised fears about the safety of eating beef. Some experts believe that cattle contracted the disease as a result of eating food contaminated with the remains of sheep infected with a BSE-like disease called scrapie. THE CONDITION known in cattle as mad cow disease, spongiform encephalopathies, has been found in Britain's sparsely-scattered antelope population, the government has admitted. GOVERNMENT veterinary and health experts were yesterday putting out reassuring messages about bovine spongiform encephalopathy, or mad cow disease, in the face of growing public anxiety. Dr Kenneth Calman, the government's chief medical officer, yesterday repeated the official advice that beef can be eaten safely: "There is no scientific evidence of a causal link between BSE in cattle and CJD in humans." The epidemic of bovine spongiform encephalopathy or mad cow disease, which has killed more than 100,000 animals in the UK, is causing a new wave of public concern. The German government yesterday announced the launch of a new research project to examine whether the cattle disease bovine spongiform encephalopathy can be transmitted to human beings.

400 words
Mad cow disease, or bovine spongiform encephalopathy, or BSE, was diagnosed only in 1986. Mad cow disease has killed 10,000 cattle, restricted the export market for Britain's cattle industry and raised fears about the safety of eating beef. Mad cow disease, an enigmatic nervous disorder that has killed thousands of cattle in Britain, is causing trade friction in Europe and is threatening the $3.7-billion British beef industry. Some experts believe that cattle contracted the disease as a result of eating food contaminated with the remains of sheep infected with a BSE-like disease called scrapie. THE CONDITION known in cattle as mad cow disease, spongiform encephalopathies, has been found in Britain's sparsely-scattered antelope population, the government has admitted. GOVERNMENT veterinary and health experts were yesterday putting out reassuring messages about bovine spongiform encephalopathy, or mad cow disease, in the face of growing public anxiety. Dr Kenneth Calman, the government's chief medical officer, yesterday repeated the official advice that beef can be eaten safely: "There is no scientific evidence of a causal link between BSE in cattle and CJD in humans." Both BSE and CJD are caused by mysterious particles of infectious protein called prions. The epidemic of bovine spongiform encephalopathy or mad cow disease, which has killed more than 100,000 animals in the UK, is causing a new wave of public concern. He believes that BSE can trigger human brain disease. One argument put forward by the health department is that CJD has such a long incubation period, typically 10 to 20 years, that clinical symptoms would not yet have appeared, even if BSE had triggered any cases of CJD. Scientists trying to understand the epidemic face an unusual problem: BSE, scrapie and CJD are caused by a bizarre, infectious agent, the prion, which does not follow the normal rules of microbiology. Language: English Article Type: CSO [Article by Nigel Hawkes, Science Editor: Zoo Antelope Catch Mad Cow Disease] [Text] Scientists at London zoo have discovered that a strain of mad cow disease affecting a type of antelope can be transmitted much more easily than was thought. The German government yesterday announced the launch of a new research project to examine whether the cattle disease bovine spongiform encephalopathy can be transmitted to human beings. Several German scientists have expressed concern that BSE, popularly known as mad cow disease because of the way it debilitates the brains of cattle, may be transmissible to humans who eat contaminated beef or take medicines made with ingredients from contaminated animals.

Figure 1: Sample multi-document summaries produced by MEAD at four compression rates

Metric                                                                              Avg. all peers (L-Z)   Avg. MEAD   StDev
Overall peer grammaticality                                                         3.53                   3.58        0.72
Overall peer cohesion                                                               2.30                   2.50        1.19
Overall peer organization                                                           2.46                   2.79        1.21
Unmarked peer units (PUs) that ought to be in model in place of something there     0.28                   0.27        0.86
Unmarked PUs that don't deserve to be in the model, but related to the subject      2.79                   2.70        1.59
Unmarked PUs that are unrelated to the subject of the model                         0.40                   0.29        0.93
Number of model units (MU)                                                          8.80                   8.69        6.28
Number of peer units (PU)                                                           5.90                   5.72        4.69
Number of unique PUs marked expressing some of the same content as one or more MUs  2.96                   3.31        2.79
Number of peer units marked for this MU                                             0.35                   0.40        0.82
Extent to which marked PUs express meaning of the current MU                        0.61                   0.73        1.30

Table 1: Comparison of MEAD with the rest of DUC participants

4.1 The experiment

The experiment which we conducted required subjects to read a set of news articles and write down the inter-document relationships which they observed. Specifically, the articles were on the subject of an airplane crash of a flight from Egypt to Bahrain in August 2000. They were written by several different news organizations and retrieved from online news web sites in the days following the accident. The cluster contained eight articles in total. Six of the articles focus generally on the crash and its direct aftermath; one mentions the crash while focusing on the history of the particular model of jet plane involved; and one focuses on the toll of the crash in Egypt, where many passengers were from.

The subjects, eight graduate students and one professor at the University of Michigan, were given the articles and a set of instructions. The instructions specified five sets of article pairs formed by random pairings of the eight articles mentioned above. Each article was included in at least one pair; no article was included in more than two pairs. For each pair, the subjects were instructed to first read the articles carefully. They were then instructed to look for and note down any occurrences of relationships like those in Figure 2. (Subjects were also provided with the examples shown in Figure 2 to illustrate each relationship type.) It was stated in the instructions that the relationships comprised only a proposed list, not to be considered exhaustive. Subjects were invited to make up new relationship types if they observed cross-document relationships which did not correspond to those in Figure 2. Although subjects were given examples of the proposed relationships at the sentence level, the instructions also explicitly stated that it was possible for a relationship to hold with one or both text spans being more than one sentence long. There was no provision for subjects to mark relationships with one or both text spans less than a full sentence in length.

Subjects were instructed not to note down examples of these relationships across spans within a single document. Also, subjects were instructed that it was possible for more than one relationship to exist across the same pair of text spans, and to note down as many relationships as they observed for each pair of text spans. No guidelines were given to subjects about how many relationships to identify per article pair. Rather, they were simply instructed to continue writing down relations "until you are reasonably certain that no further interesting relationships hold for a given document pair."

Articles    Total CST Relationships Identified
7 and 63     92
81 and 87   100
30 and 97    76
41 and 81    31
30 and 47   110

Table 5: Total Identifications of CST Relationships by Article Pair

In the second and third rows of Table 6, the numbers of sentence pairs listed are the distinct sentence pairs for which either one judge or multiple judges observed a relationship, respectively. That is, if one judge observed relationship X between sentences 1-2 of document A and sentence 2 of document B, this would count as two sentence pairs. However, if an identical observation was made save that the first text span was limited to sentence 1, this would count as one sentence pair in the context of Table 6. Furthermore, in the context of Table 6, when counting a pair as being observed to have a relationship by multiple judges, it is not necessary that (a) the relationship observed be the same one or (b) the judges have marked a relationship for the exact same text spans. For example:

- Judge John identified relationship X between Doc A/Sents 1-2 and Doc B/Sent 2
- Judge Kyle identified relationship Y between Doc A/Sent 1 and Doc B/Sent 2

In the context of Table 6, this equates to one sentence pair identified by multiple judges (A/1-B/2), and one sentence pair identified by a single judge (A/2-B/2).

Judges Finding a Relationship   Number of Sentence Pairs
No Judges                       4,291
One Judge                         200
Multiple Judges                    88

Table 6: Sentence Pairs by Number of Judges Marking a CST Relationship

As can be seen in Table 6, there are 88 sentence pairs (as just defined) for which multiple judges identify at least one CST relationship. Table 7 describes the breakdown of these 88 pairs in terms of inter-judge agreement. Although subjects were permitted to mark more than one relation per sentence pair, they are counted as in agreement here if at least one of the relations they mark agrees with one of the relations marked by another judge.
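The bookkeeping behind Table 6 can be made concrete with a small sketch. The observations below are invented, not the study data, and collapsing text spans into a single canonical pair key is a simplifying assumption on our part.

```python
from collections import defaultdict

def tally(observations, total_pairs):
    """observations: (judge, pair, relation) tuples, where pair is a
    canonical key for the two text spans.  As in the text, a pair counts
    as marked by multiple judges even if the judges disagree on the
    relation; merging overlapping spans (A/1-2 vs. A/1) is simplified away."""
    judges_per_pair = defaultdict(set)
    for judge, pair, _relation in observations:
        judges_per_pair[pair].add(judge)
    one = sum(1 for js in judges_per_pair.values() if len(js) == 1)
    multiple = sum(1 for js in judges_per_pair.values() if len(js) > 1)
    return {"none": total_pairs - one - multiple, "one": one, "multiple": multiple}

obs = [
    ("John", ("A/1-2", "B/2"), "X"),
    ("Kyle", ("A/1-2", "B/2"), "Y"),  # different relation, still "multiple"
    ("Kyle", ("A/3", "B/5"), "X"),
]
print(tally(obs, total_pairs=10))  # → {'none': 8, 'one': 1, 'multiple': 1}
```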

4.2 Results
A summary of the raw results of the experiment is shown in Table 4. Table 5 indicates the total number of links observed per article pair. Articles 41 and 47 are the articles mentioned above which focus on the airplane model and the Egyptian perspective, respectively. Table 6 describes the sentence pairs for which judges noted relationships. The total number of sentence pairs for all five assigned article pairs was 4,579; that is, the sum over the five pairs of m_i * n_i, where i is the number of the article pair, m_i is the number of sentences in the first article in the pair, and n_i is the number of sentences in the second article in the pair. Of course, by combining sentences to form longer text spans, a hugely larger number of text-span pairs is possible. Therefore, the other numbers in Table 6 should be carefully understood.
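The pair total is simply the sum, over the assigned article pairs, of the product of the two articles' sentence counts. A sketch, with placeholder counts since the per-article sizes are not listed in the paper:

```python
def total_sentence_pairs(pair_sizes):
    """pair_sizes: (m_i, n_i) sentence counts for the two articles in each
    assigned pair; every cross-document sentence pair is counted once."""
    return sum(m * n for m, n in pair_sizes)

# Placeholder counts; with the study's real five pairs this sums to 4,579.
print(total_sentence_pairs([(30, 31), (28, 25)]))  # → 1630
```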

- Judge Frank identifies relationships X and Y for a given sentence pair
- Judge Horace identifies (only) relationship X for the same pair

In Table 7, these judges would be counted as agreeing.

Identity. The same text appears in more than one location. (Repetition)
  S1: Tony Blair was elected for a second term today.
  S2: Tony Blair was elected for a second term today.

Equivalence (Paraphrase). Two text spans have the same information content.
  S1: Derek Bell is experiencing a resurgence in his career.
  S2: Derek Bell is having a comeback year.

Translation. Same information content in different languages.
  S1: Shouts of "Viva la revolucion!" echoed through the night.
  S2: The rebels could be heard shouting, "Long live the revolution."

Subsumption. S1 contains all information in S2, plus additional information not in S2.
  S1: With 3 wins this year, Green Bay has the best record in the NFL.
  S2: Green Bay has 3 wins this year.

Contradiction. Conflicting information.
  S1: There were 122 people on the downed plane.
  S2: 126 people were aboard the plane.

Historical Background. S1 gives historical context to information in S2.
  S1: This was the fourth time a member of the Royal Family has gotten divorced.
  S2: The Duke of Windsor was divorced from the Duchess of Windsor yesterday.

Citation. S1 explicitly cites document S2.
  S1: Prince Albert then went on to say, "I never gamble."
  S2: An earlier article quoted Prince Albert as saying "I never gamble."

Modality. S1 presents a qualified version of the information in S2, e.g., using "allegedly".
  S1: Sean "Puffy" Combs is reported to own several multimillion dollar estates.
  S2: Puffy owns four multimillion dollar homes in the New York area.

Attribution. S1 presents an attributed version of information in S2, e.g., using "According to CNN,".
  S1: According to a top Bush advisor, the President was alarmed at the news.
  S2: The President was alarmed to hear of his daughter's low grades.

Summary. S1 summarizes S2.
  S1: The Mets won the Title in seven games.
  S2: After a grueling first six games, the Mets came from behind tonight to take the Title.

Follow-up. S1 presents additional information which has happened since S2.
  S1: 102 casualties have been reported in the earthquake region.
  S2: So far, no casualties from the quake have been confirmed.

Indirect speech. S1 indirectly quotes something which was directly quoted in S2.
  S1: Mr. Cuban then gave the crowd his personal guarantee of free Chalupas.
  S2: "I'll personally guarantee free Chalupas," Mr. Cuban announced to the crowd.

Elaboration / Refinement. S1 elaborates or provides details of some information given more generally in S2.
  S1: 50% of students are under 25; 20% are between 26 and 30; the rest are over 30.
  S2: Most students at the University are under 30.

Fulfillment. S1 asserts the occurrence of an event predicted in S2.
  S1: After traveling to Austria Thursday, Mr. Green returned home to New York.
  S2: Mr. Green will go to Austria Thursday.

Description. S1 describes an entity mentioned in S2.
  S1: Greenfield, a retired general and father of two, has declined to comment.
  S2: Mr. Greenfield appeared in court yesterday.

Reader Profile. S1 and S2 provide similar information written for a different audience.
  S1: The Durian, a fruit used in Asian cuisine, has a strong smell.
  S2: The dish is usually made with Durian.

Change of perspective. The same entity presents a differing opinion or presents a fact in a different light.
  S1: Giuliani criticized the Officers' Union as too demanding in contract talks.
  S2: Giuliani praised the Officers' Union, which provides legal aid and advice to members.

Figure 2: Proposed CST relationships and examples
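Of the relationships above, subsumption is the one most directly exploitable by a summarizer (see Section 5.1). As a purely hypothetical baseline, not an algorithm from this paper, one could approximate it with lexical containment: if S1 covers nearly all of S2's words and adds words of its own, S1 may subsume S2. The sketch below uses the Green Bay example from Figure 2 (without punctuation, since the naive whitespace split does not strip it).

```python
def may_subsume(s1, s2, coverage=0.9):
    """Rough lexical test: S1 may subsume S2 if S1 contains almost all of
    S2's words plus at least one word of its own.  The threshold and the
    whole heuristic are illustrative assumptions."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    covered = len(w1 & w2) / len(w2)
    return covered >= coverage and len(w1 - w2) > 0

s1 = "With 3 wins this year Green Bay has the best record in the NFL"
s2 = "Green Bay has 3 wins this year"
print(may_subsume(s1, s2), may_subsume(s2, s1))  # → True False
```

A real detector would need to handle synonymy, paraphrase, and numeric detail, which is exactly why the human-annotated data of Sections 4 and 5.1 is being collected.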

Relationship Type           1    2    3    4    5    6    7    8    9   Sum    Avg
Identity                    1    0    0    2    1    1    1    0    1     7   0.78
Equivalence                 8    2    2   36    5    7    5    4    1    70   7.78
Translation                 0    0    0    0    0    0    0    0    0     0   0.00
Subsumption                16    3    2    7    3    1    3    3    0    38   4.22
Contradiction               4    4    0    7    4    5    0    4    1    29   3.22
Historical Background      35    3    0    4    1    0    1    0    0    44   4.89
Citation                    0    0    0    0    0    0    0    0    0     0   0.00
Modality                    0    1    0    0    0    0    0    1    0     2   0.22
Attribution                 0    0    1    8    4    0    2    0    0    15   1.67
Summary                     1    0    0    0    0    0    0    0    0     1   0.11
Follow-up                   8    6    2   13    4    4    2    3    0    42   4.67
Indirect speech             1    1    1    0    1    0    0    2    0     6   0.67
Elaboration / Refinement    6   15    2   22   17    9    5    3    6    85   9.44
Fulfillment                 1    0    0    1    2    0    0    0    0     4   0.44
Description                44   10    0    5    5    0    0    0    0    64   7.11
Reader Profile              0    0    0    0    1    0    0    0    0     1   0.11
Change of Perspective       0    0    0    0    0    0    0    1    0     1   0.11
Total                     125   45   10  105   48   27   19   21    9   409  45.44

Table 4: Identifications of CST Relationships by Subject and Type

Discrete Relationship Types Observed   Judges in Agreement   Sentences
Only one                               All                   16
More than one                          At least two          35
More than one                          None                  37

Table 7: Judge Agreement on Relationship Types among Sentence Pairs Linked by Multiple Judges

4.3 Observations

Because our data comes from observations about (a subset of) a single news cluster, it would clearly be premature to draw conclusions about the natural frequencies of these relationships based on the data in Table 5. Nonetheless, we can at least speculate that human subjects are capable of identifying some subset of these relationships when reading articles from this news cluster. On average, subjects identified approximately 45 occurrences of the proposed relationships per article. Interestingly, some relationships were identified much more frequently than others. The relationships Elaboration/Refinement, Equivalence, and Description were identified most frequently. Other relationships, such as Translation, Citation, Summary, Reader Profile and Change of Perspective, were observed never or only by one subject. Although subjects were encouraged in the study instructions to name new relationships, none did so. As noted above, we need more data before we can say whether the lack of identifications for these unobserved or rarely observed relationships is due to a true lack of frequency or some other factor. For instance, some of the proposed relationship names, like Modality, may not be intuitive enough for judges to feel comfortable identifying them, even though examples were given.

However, the most encouraging data concerns the relatively high level of overlap when multiple judges made an observation for a sentence pair. In 51 of 88 cases where more than one judge marked a sentence pair, at least two judges concurred about at least one relationship holding for the pair. Although approximately two-thirds of the marked pairs were marked by only one judge, the overall data sparseness (in comparison to the number of possible sentence pairs, only about 1/100th of pairs were marked) makes this ratio less discouraging. Further analysis of the data is still needed.

The level of judge agreement would seem to indicate that at least some of the proposed CST relationships are recognizable with a suitable degree of correspondence by humans. Before attempting to build automated means of detecting CST hierarchies for a document cluster, a better understanding of which relationships can be empirically demonstrated must be found. Another key step is to gather further data. In order to do so, an automated markup tool in the style of Alembic Workbench or SEE would be extremely helpful. Not only is there a great deal of transcription (and associated possibilities for error) involved in running this experiment on paper, but a number of subjects expressed the belief that an automated tool like this would allow them to provide better and more consistent data.

5. TWO MORE USER STUDIES

We will now briefly note two additional experiments in progress. The first one deals with cross-document subsumption, while the second one is about information provenance.

5.1 Cross-document subsumption


In this experiment, we asked five paid judges to find all pairs of sentences in a given cluster of documents such that one of the sentences in the pair subsumes the other. Subsumption is just one of many cross-document structural relationships that were discussed in the previous section. We chose it for further analysis as it appears to be most closely related to generic multi-document summarization. The main idea is that if sentence A subsumes sentence B, then B need not be included in the summary if A is to be included. For a detailed discussion of cross-document subsumption, refer to [7].

The five judges were given a subset of queries from the Johns Hopkins Workshop corpus (see the final section of this paper). The documents in each query-induced cluster of relevant documents are from the Hong Kong News corpus distributed by the Linguistic Data Consortium. A total of 12 clusters (consisting of 10 articles each) were given to two judges each. Table 8 indicates the number of subsumptions found for each cluster. These numbers were computed by John Blitzer from Cornell University, who is currently performing further analysis of the subsumption data.

Cluster number   Associated query                                        Number of subsumptions
112              Autumn and sports carnivals                             434
125              Narcotics Rehabilitation                                 49
199              Intellectual Property Rights                            111
241              Fire safety, building management concerns                15
323              Battle against disc piracy                              258
398              Flu results in Health Controls                           52
447              Housing (Amendment) Bill Brings Assorted Improvements   103
551              Natural disaster victims aided                          649
827              Health education for youngsters                         142
883              Public health concerns cause food-business closings      90
1014             Traffic Safety Enforcement                              323
1197             Museums: exhibits/hours                                 228

Features of the articles. In Table 10, we summarize some features of the documents in the news cluster as we try to nd ways to correlate them with the summarization process. Table 11 summarizes the level of interjudge agreement among the six judges out of 103 sentences. High (5 more more judges agree) Medium (3 or 4 judges agree) Low (no more than 2 judges agree) Table 11: Interjudge agreement For experiments of this kind, it is very important to measure the degree of agreement among human judges. According to the criteria and statistics in the table above, it is reasonable to pursue this experiment further. If the human abstractor uses a strategy similar to sentence extraction, judges in our experiment tend to have very high agreement among them; on the contrary, if the human summarizer regenerates the summary totally according to his own understanding of the news cluster, our judges usually have trouble agreeing with each other. Some other interesting observations were made by the participants: 76 22 5

Table 8: Subsumptions identified
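The subsumption counts in Table 8 come from manual annotation. Purely as an illustration of the relationship itself, a candidate subsumption pair can be flagged with an asymmetric word-overlap heuristic: sentence A subsumes sentence B when A covers most of B's words but not vice versa. This is a sketch, not the annotation procedure used in the experiment, and the threshold value is an arbitrary assumption:

```python
def subsumes(a: str, b: str, threshold: float = 0.8) -> bool:
    """Heuristic: `a` subsumes `b` if most of b's words appear in a,
    and the coverage is asymmetric (a is the more informative sentence)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return False
    cover = len(wa & wb) / len(wb)    # fraction of b covered by a
    reverse = len(wa & wb) / len(wa)  # fraction of a covered by b
    return cover >= threshold and cover > reverse

# toy example (invented sentences, not corpus data)
s1 = "the fire department issued new building safety rules on monday"
s2 = "new building safety rules"
print(subsumes(s1, s2))  # True: s1 covers all of s2's words
print(subsumes(s2, s1))  # False: the relation is asymmetric
```

Real annotators of course judge meaning rather than word overlap; identical sentences, for instance, fall under the separate identity relation, which this heuristic excludes by requiring asymmetry.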

- Article #7 is only partially relevant to the topic.

- Some human assessors tend to sequence their paragraphs in such a way that each paragraph corresponds to one or two source articles; others have a more integrated style, using many documents for each paragraph in the summary.

5.2 Information provenance


In this experiment, we wanted to verify two hypotheses: (1) that information in a multi-document summary can be traced back to the article (or articles) which were used to produce it, and (2) that human subjects can reach high levels of agreement in determining information provenance. The participants (six in all) were presented with a cluster of ten news articles and four 400-word multi-document summaries (written by four human assessors). The summaries were not necessarily produced through sentence extraction. They exist at various compression rates (50, 100, 200, and 400 words); only the 400-word summaries are used in this experiment. The topic of the news cluster is day care issues in the U.S. The summaries and source articles were provided as part of the DUC training data. Each summary contains a certain number of sentences, and the information in each sentence is supposed to come from one or more articles in the cluster. The task of the participants was to identify the information source for each sentence in the summaries. If two articles provide overlapping or even totally redundant information for the same summary sentence, both should be identified. Summarizing large amounts of information is a highly intelligent human behavior, so the participants were also asked to report any patterns they noticed about how humans generate summaries.

Preliminary results. Some preliminary results are shown in Table 9, where we are interested in how many sentences in the final summaries each article contributes to. Notice that the contribution of the articles is not evenly distributed, according to the statistics in Table 9.
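The provenance judgments in this experiment were made by hand. An automatic approximation might score each summary sentence against every article with a bag-of-words cosine and keep all articles above a cutoff, naturally allowing several sources per sentence when articles are redundant. A minimal sketch; the 0.3 cutoff and the toy articles are assumptions, not DUC data:

```python
from collections import Counter
import math

def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def provenance(sentence: str, articles: dict, threshold: float = 0.3):
    """Return ids of all articles similar enough to the summary sentence;
    more than one id may qualify when articles overlap in content."""
    sv = Counter(sentence.lower().split())
    return sorted(aid for aid, text in articles.items()
                  if cosine(sv, Counter(text.lower().split())) >= threshold)

articles = {
    1: "the city opened three new day care centers this spring",
    2: "day care fees will rise next year officials said",
    3: "traffic on the harbor bridge was halted by fog",
}
print(provenance("three new day care centers opened", articles))  # → [1]
```

In practice, stopword removal and stemming would sharpen the scores considerably; the raw-token version here keeps the sketch self-contained.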

- Summaries produced by different human assessors tend to focus on different subsets of the document cluster.
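The high/medium/low bands in Table 11 reduce each summary sentence to the size of its largest group of agreeing judges. Assuming, as a simplification, that each of the six judges gives a single answer per sentence, the banding can be sketched as:

```python
from collections import Counter

def agreement_band(judgements) -> str:
    """Band a sentence by the largest group of judges giving the same
    answer: high (>= 5 agree), medium (3 or 4), low (no more than 2)."""
    top = Counter(judgements).most_common(1)[0][1]
    if top >= 5:
        return "high"
    if top >= 3:
        return "medium"
    return "low"

# six judges name the source article for one summary sentence
print(agreement_band([3, 3, 3, 3, 3, 7]))  # high: five judges agree
print(agreement_band([3, 3, 3, 7, 7, 1]))  # medium: three judges agree
print(agreement_band([1, 2, 3, 4, 5, 6]))  # low: no two judges agree
```

Since the actual task allowed multiple source articles per sentence, the real tally is slightly more involved, but the three-way banding is the same.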

6. CONCLUSION
We presented work done at the University of Michigan as part of the 2001 DUC evaluation. We described our summarizer, MEAD, which is based on centroid-based features, and included the results of our participation in DUC. We also presented some results from three preliminary experiments. These results are likely to be used in the development of next year's release of MEAD. The first step toward the new version was already taken at the summer workshop held at Johns Hopkins University.

6.1 The Johns Hopkins summer workshop


During the summer of 2001, a team of 10 researchers from 6 countries met at the Center for Language and Speech Processing at Johns Hopkins University (www.clsp.jhu.edu) and worked together for eight weeks (preliminary work was done in advance) to achieve the following goals:

- develop a public-domain trainable summarization system by rewriting MEAD from scratch and including in the new architecture a module that allows salience decisions to be made based on cross-document relationships such as the ones posited by CST,

- develop a summarization evaluation system,

Article | Subject A | Subject B | Subject C | Subject D | Subject E | Subject F | Avg. contribution
1       |  6        | 12        | 11        |  9        | 10        |  8        |  9.33
2       | 11        | 11        | 14        |  8        | 11        |  7        | 10.33
3       | 14        | 18        | 17        | 16        | 19        | 18        | 17.00
4       | 17        | 15        | 18        |  9        | 14        | 17        | 15.00
5       |  5        |  3        |  9        |  4        |  4        |  7        |  5.33
6       | 11        | 11        | 14        | 10        | 10        | 10        | 11.00
7       | 11        | 11        | 11        |  6        | 10        |  3        |  8.67
8       | 11        | 12        | 23        |  7        | 11        | 10        | 12.33
9       |  5        |  6        |  6        |  4        |  6        |  6        |  5.50
10      |  8        |  6        |  7        |  5        |  7        |  7        |  6.67

Table 9: Information provenance

Article | Length in words | Length in sentences | Number of proper nouns | Occurrences of "daycare" | Early/Late
1       | 1171            | 54                  | 10                     | 1                        | E
2       | 1080            | 45                  | 50                     | 0                        | E
3       | 1670            | 84                  | 85                     | 0                        | E
4       |  817            | 34                  | 10                     | 0                        | M
5       |  460            | 19                  | 41                     | 0                        | M
6       | 1245            | 58                  | 45                     | 0                        | M
7       | 1002            | 42                  | 30                     | 0                        | M
8       | 1369            | 63                  | 68                     | 0                        | L
9       | 1799            | 82                  | 55                     | 0                        | L
10      |  382            | 10                  | 12                     | 0                        | L

Table 10: Article features
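The last column of Table 9 is simply the mean of the six judges' counts for each article; for instance, for articles 1, 3, and 5 (row values copied from the table):

```python
# per-article sentence counts (Subjects A-F) copied from Table 9
counts = {
    1: [6, 12, 11, 9, 10, 8],
    3: [14, 18, 17, 16, 19, 18],
    5: [5, 3, 9, 4, 4, 7],
}
averages = {a: round(sum(c) / len(c), 2) for a, c in counts.items()}
print(averages)  # → {1: 9.33, 3: 17.0, 5: 5.33}
```

These reproduce the "Avg. contribution" entries for those rows (9.33, 17.00, 5.33).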

- develop a large annotated corpus for further research in text summarization, and

- perform a meta-evaluation of a large variety of evaluation metrics, including co-selection (precision, recall, F-measure, Kappa), content-based metrics (cosine, binary cosine, longest common subsequence, and word overlap), relative utility, and relevance preservation.

The resulting system is quite robust: it was used to produce several hundred million summaries at ten different compression rates, in two languages (English and Chinese), both generic and query-based, and both single- and multi-document. The evaluation was carried out on 10 summarization systems (including a trainable version of MEAD) in a variety of settings. All the goals of the meeting were achieved. During the fall of 2001, the summarizer (MEAD), the evaluation toolkit, and the annotations to the corpus will be released to the community. The corpus itself is being made available by the LDC.

7. REFERENCES

[1] William C. Mann and Sandra A. Thompson. Rhetorical Structure Theory: towards a functional theory of text organization. Text, 8(3):243-281, 1988.

[2] Daniel Marcu. From Discourse Structures to Text Summaries. In Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, pages 82-88, Madrid, Spain, July 1997.

[3] Andrei Mikheev. Document centered approach for text normalization. In Proceedings of the 23rd ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 2000.

[4] Dragomir Radev. A common theory of information fusion from multiple text sources, step one: Cross-document structure. In Proceedings, 1st ACL SIGDIAL Workshop on Discourse and Dialogue, Hong Kong, October 2000.

[5] Dragomir R. Radev, Sasha Blair-Goldensohn, Zhu Zhang, and Revathi Sundara Raghavan. Interactive, domain-independent identification and summarization of topically related news articles. In 5th European Conference on Research and Advanced Technology for Digital Libraries, Darmstadt, Germany, 2001.

[6] Dragomir R. Radev, Weiguo Fan, and Zhu Zhang. NewsInEssence: A personalized web-based multi-document summarization and recommendation system. In NAACL Workshop on Automatic Summarization, Pittsburgh, PA, 2001.

[7] Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In ANLP/NAACL Workshop on Summarization, Seattle, WA, April 2000.

[8] Dragomir R. Radev and Kathleen R. McKeown. Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3):469-500, September 1998.
