a document’s sentence would be chosen for the summary. We hypothesize that since the document collection only contains documents relevant to the query, the topic description terms do not add any additional information. Clearly, more analysis is required to determine why the topic descriptions did not help in the generation of the summaries.

4 From Extract to Abstract

The output from the Sentence Extraction component is a ranked set of sentences selected by the QR algorithm. The HMM tends to select longer sentences. This means that for a 100-word summary, needed for tasks 2, 3, and 4, the QR algorithm would usually select only 2 or 3 sentences from all those first selected by the HMM. We felt that so few sentences would not supply enough of the information we would like to see in a summary.

In order to include more sentences in the summary, we decided to eliminate parts of the top selected sentences that do not usually convey the most important information. Occasionally we lose something we should have kept but, in general, we gain. Shortening the selected sentences permits the inclusion of additional sentences, potentially gaining additional information. To accommodate this, the QR algorithm ranks sentences with about 300 words rather than the needed 100 words.

Full parsing and comprehension are too costly to pursue. We have done some initial investigation into using elementary discourse units (EDUs) (Carlson et al., 2002) to determine sentence structure, component parts, and the importance and relevance of those parts, and would like to use EDUs for the purpose of creating an abstract. Unfortunately, automatic parsing of EDUs is still not strong enough to meet our needs.

Instead, we chose to develop patterns using “shallow parsing” techniques, keying off of lexical cues. The sentences passed by the Sentence Extraction component were run through a part-of-speech (POS) tagger. Each sentence, in order of its ranking by the Sentence Extraction component, was matched against the various patterns. The following eliminations were made, when appropriate:

• sentences that begin with an imperative;
• sentences that contain a personal pronoun at or near the start;
• gerund clauses;
• restricted relative-clause appositives;
• intra-sentential attribution;
• lead adverbs.

4.1 Sentence Elimination

We eliminate two kinds of sentences in our summaries: imperatives and those “beginning” with pronominals. We determined that imperatives rarely contain novel information; in order for them to be effective and understood, they must reference information the reader already knows.
Sentences that have a personal pronoun close to the start of the sentence seem to fall into two categories: 1) they are preceded by a proper noun, in which case, they are fine to use; or 2) they do not have their reference within the sentence in which they appear. In the latter case, the antecedent generally is in a preceding sentence. Ideally, when a sentence begins with a pronoun, we should do analysis and identify its antecedent. This is a very difficult task. We have instead taken the approach of eliminating any sentence that begins with a pronoun unless its preceding sentence in the original document has already been included in the summary. Because this is a one-pass process, even if we later include the sentence with the needed reference, the pronominal sentence will not be re-included.² Elimination of these sentences may cause the loss of some important information, but definitely improves the readability of generated summaries.

The sentence eliminating rules are listed below; a code sketch of both rules follows the list.

1. imperatives: if the first word is tagged as a verb “base form” (VB in our POS tagger), the sentence is considered an imperative and eliminated from use in the summary.

2. pronominals: if a personal pronoun (PRP in our POS tagger) appears within the first eight (8) words³ of a sentence and it is not preceded by a proper noun, we check to see that the sentence immediately preceding it in the original document has been selected. If not, the sentence containing the pronominal is eliminated.

² We need to do more evaluation to determine if this was a correct decision.
³ Eight was chosen heuristically; we know of no evidence that proves it to be the correct number of words.
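To make these rules concrete, the following is a minimal sketch of how they could be checked over POS-tagged sentences. It is illustrative only; the helper names and the simplified proper-noun test are ours rather than the exact patterns we use.

    # Illustrative sketch of the two sentence-elimination rules, over sentences
    # represented as lists of (token, tag) pairs with Penn Treebank tags.
    # Helper names and the window constant are illustrative.

    PRONOUN_WINDOW = 8  # heuristic window (see footnote 3)

    def is_imperative(tagged):
        """Rule 1: the first word is tagged as a base-form verb (VB)."""
        return bool(tagged) and tagged[0][1] == "VB"

    def has_unresolved_pronoun(tagged):
        """Rule 2 trigger: a personal pronoun (PRP) within the first eight
        words that is not preceded by a proper noun (NNP/NNPS)."""
        seen_proper_noun = False
        for token, tag in tagged[:PRONOUN_WINDOW]:
            if tag in ("NNP", "NNPS"):
                seen_proper_noun = True
            elif tag == "PRP" and not seen_proper_noun:
                return True
        return False

    def keep_sentence(position, tagged, selected_positions):
        """Return False if the sentence should be eliminated. `position` is the
        sentence's index in its document; `selected_positions` holds the indices
        of sentences already placed in the summary (one pass, in rank order)."""
        if is_imperative(tagged):
            return False
        if has_unresolved_pronoun(tagged) and (position - 1) not in selected_positions:
            return False
        return True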
4.2 Clause Elimination

Three different kinds of clauses were eliminated: gerund clauses, restricted relative-clause appositives, and intra-sentential attribution. In addition to the patterns identified to locate the clauses to be removed, we utilized a simple heuristic that if the number of tokens to be deleted was greater than or equal to the number of tokens to be retained, the elimination was not performed.

Gerunds often comment on, rather than advance, a narration and therefore tend to be incidental. To eliminate a gerund clause, it must 1) be at the start of the sentence or immediately follow a comma, and 2) have the gerund (VBG) as the lead word or as the second word following a preposition (IN) or “while” or “during”. The end of the clause is identified by a comma or a period. The following is a sentence from the training data with a gerund phrase that would be removed:

    More than 800 lives were lost when the 21,794 tonne ferry, sailing from the Estonian capital Tallinn to Stockholm, sank within minutes early yesterday morning in the Baltic Sea 40 km south west of the Finnish island of Uto.
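The gerund pattern and the deletion-size heuristic can be sketched in the same style. The function names are illustrative, and punctuation cleanup around the removed span is omitted:

    def find_gerund_clause(tagged):
        """Return the (start, end) token span of a candidate gerund clause,
        or None. The clause must begin at the sentence start or just after a
        comma, with the gerund (VBG) as its first word or as the second word
        after a preposition (IN), "while", or "during"; it ends at the next
        comma or period (the delimiter at `end` is not part of the span)."""
        starts = [0] + [i + 1 for i, (tok, _) in enumerate(tagged) if tok == ","]
        for start in starts:
            head = tagged[start:start + 2]
            if not head:
                continue
            gerund_first = head[0][1] == "VBG"
            gerund_second = (len(head) == 2 and head[1][1] == "VBG" and
                             (head[0][1] == "IN" or
                              head[0][0].lower() in ("while", "during")))
            if not (gerund_first or gerund_second):
                continue
            for end in range(start + 1, len(tagged)):
                if tagged[end][0] in (",", "."):
                    return start, end
        return None

    def drop_gerund_clause(tagged):
        """Remove the clause unless the size heuristic vetoes it: if at least
        as many tokens would be deleted as retained, leave the sentence alone."""
        span = find_gerund_clause(tagged)
        if span is None:
            return tagged
        start, end = span
        deleted = end - start
        if deleted >= len(tagged) - deleted:
            return tagged
        return tagged[:start] + tagged[end:]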
Restricted relative-clause appositives usually provide background information. Because they are always delimited by punctuation, they can be removed relatively easily. The patterns for these clauses look for specific words playing specific part-of-speech roles in the sentence: “who”, “when”, “where”, “which”, and “in which”, and require the clause to follow a comma and end with a comma or period. A sentence from the training data with a restricted relative-clause appositive to be removed is:

    The Menendez family lived in the Princeton Area until 1986, when they moved to California.
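A similar sketch for the appositive pattern, here over the raw token sequence (the part-of-speech checks on the trigger words, and the trigger list itself, are simplified):

    RELATIVE_TRIGGERS = ("who", "when", "where", "which")

    def find_relative_appositive(tokens):
        """Return the (start, end) span of a clause that follows a comma,
        begins with who/when/where/which or "in which", and ends at the next
        comma or period; None if no such clause is found."""
        for i, tok in enumerate(tokens):
            if tok != ",":
                continue
            j = i + 1
            starts_clause = (
                (j < len(tokens) and tokens[j].lower() in RELATIVE_TRIGGERS) or
                (j + 1 < len(tokens) and tokens[j].lower() == "in" and
                 tokens[j + 1].lower() == "which"))
            if not starts_clause:
                continue
            for end in range(j + 1, len(tokens)):
                if tokens[end] in (",", "."):
                    return j, end
        return None

The same deletion-size heuristic is applied before a located clause is actually removed.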
While attributions can be informative, we decided that they could be sacrificed in order to include other, hopefully more important, information in the summary. Identifying intra-sentential attributions is not always easy. Our rules find some, but not all, cases. We developed a list of about 50 verbs (and their various forms) that are used in attributions. A verb must be found in this list for the clause to be considered for removal.

When an attribution occurs at the start of a sentence, we require it to terminate with “that”, without any preceding punctuation. (We have not yet determined how to find the proper end of the many attributions that occur without this word.) For an attribution that occurs at the end of a sentence, it must follow the last comma of the sentence. The last word of the sentence must then be one of our specified attribution verbs. A sentence from the training data with an attribution to be removed is:

    The federal Government’s highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents.
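The two attribution patterns can be sketched as follows; ATTRIBUTION_VERBS is a tiny stand-in for our list of about 50 verbs and their forms, and the function names are illustrative:

    # Tiny stand-in for the list of ~50 attribution verbs and their forms.
    ATTRIBUTION_VERBS = {"said", "say", "says", "reported", "reports",
                         "announced", "stated", "claimed", "added"}

    def strip_leading_attribution(tokens):
        """Remove a sentence-initial attribution that ends with "that",
        contains an attribution verb, and has no punctuation before "that"."""
        for i, tok in enumerate(tokens):
            if tok in (",", ";", ":"):   # punctuation before "that": keep sentence as is
                return tokens
            if tok.lower() == "that":
                lead = {t.lower() for t in tokens[:i]}
                if lead & ATTRIBUTION_VERBS:
                    return tokens[i + 1:]   # drop the attribution and "that"
                return tokens
        return tokens

    def strip_trailing_attribution(tokens):
        """Remove a sentence-final attribution: the material after the last
        comma, provided the last word of the sentence is an attribution verb."""
        words = [t for t in tokens if t not in (".", ",")]
        if words and words[-1].lower() in ATTRIBUTION_VERBS and "," in tokens:
            last_comma = len(tokens) - 1 - tokens[::-1].index(",")
            return tokens[:last_comma] + ["."]
        return tokens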
4.3 Lead Adverb Elimination

For DUC 2002, we eliminated sentence lead words such as “And” and “But” since they did not add substantial information and often hindered the flow of the summary. For the same reason, in DUC 2003 we expanded this and eliminated any adverb (RB) that began a sentence.
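This rule is the simplest to sketch, using the same tagged-sentence representation as above:

    def drop_lead_adverb(tagged):
        """If the sentence begins with an adverb (RB), drop that word."""
        if tagged and tagged[0][1] == "RB":
            return tagged[1:]
        return tagged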
4.4 Impact

We performed a “post mortem” analysis of the summaries we generated to see what problems we had created when we eliminated parts of sentences. (We did not evaluate the impact of eliminating entire sentences.) Both the number of bad sentences (fragments, missing words, etc.), identified as “problems” in Table 2, and the number of clauses that should have been removed but weren’t, identified as “misses” in the table, were relatively small. In some cases, the problems were due to poor sentence splitting and not to any of the heuristics we applied.

Table 2: Results of the “post mortem” analysis, by task.

Task   Number of Problems   Number of Misses   Total Number of Sentences
2      1                    6                  91
3      5                    6                  102
4      5                    2                  113

5 Results

Overall, our system, system 16, was comparable to the top systems as rated by mean coverage, as given by NIST. Figure 1 shows the rankings of the humans, machine (peer) systems, and baseline systems for each of tasks 2, 3, and 4. The systems are sorted by average mean coverage. Our system ranked third and second among the peer systems for tasks 2 and 3, respectively. For task 4, it was the top scoring peer system but was edged out by baseline 5!

However, the top scoring peer systems are extremely close. More precisely, a one-way ANOVA test on the top six ranked systems gives p values of 0.82, 0.87, and 0.92 for tasks 2, 3, and 4, respectively. Under the null hypothesis, i.e., that the top scoring systems have the same mean and variance with regard to mean coverage, p gives the probability that these average mean coverages would be produced. Thus, mean coverage does not separate the top scoring systems.
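For reference, the computation is a standard one-way ANOVA over the per-document-set mean coverage scores of the top systems. The snippet below illustrates it with scipy on placeholder scores (the values are not the DUC 2003 data):

    from scipy.stats import f_oneway

    # scores[i] holds the per-document-set mean coverage scores of the i-th
    # of the top six systems (placeholder values for illustration only)
    scores = [
        [0.31, 0.42, 0.28, 0.35],
        [0.29, 0.44, 0.30, 0.33],
        [0.33, 0.40, 0.27, 0.36],
        [0.30, 0.41, 0.29, 0.34],
        [0.32, 0.43, 0.26, 0.37],
        [0.28, 0.45, 0.31, 0.32],
    ]

    f_stat, p_value = f_oneway(*scores)
    # A large p (such as the 0.82-0.92 reported above) means the differences in
    # average mean coverage are consistent with all systems sharing the same mean.
    print(f"F = {f_stat:.3f}, p = {p_value:.3f}")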
Note that in tasks 3 and 4, one of the baselines was among the top six scoring systems. By comparison, the top scoring baseline for task 2 ranked twelfth, and the corresponding p value was 0.07, meaning we can be 93% certain that the top 12 systems do not have the same mean coverage score.

Table 3 shows how our system fared on the twelve questions which measure readability on a scale from 0 to 4. We present the average score on each of the questions for system 16, the average peer system (excluding ours), and the average of the 10 humans. Overall, our system performed fairly well.

Table 3: Average scores on the twelve readability questions (scale 0–4) for system 16, the mean of the peer systems (excluding ours), and the mean of the 10 humans, on tasks 2, 3, and 4.

              Task 2                      Task 3                      Task 4
Question  System 16  Peers  Humans    System 16  Peers  Humans    System 16  Peers  Humans
    1        0.0      0.5    0.0         0.3      0.5    0.0         0.0      0.2    0.0
    2        0.0      0.1    0.0         0.1      0.2    0.0         0.1      0.0    0.0
    3        0.0      0.1    0.0         0.0      0.1    0.0         0.0      0.0    0.0
    4        0.1      0.2    0.1         0.2      0.2    0.0         0.2      0.2    0.0
    5        0.1      0.2    0.0         0.3      0.2    0.0         0.2      0.2    0.1
    6        0.0      0.0    0.0         0.0      0.1    0.0         0.0      0.0    0.0
    7        0.0      0.2    0.1         0.1      0.2    0.0         0.1      0.1    0.0
    8        0.1      0.3    0.1         0.3      0.3    0.0         0.5      0.4    0.0
    9        0.0      0.0    0.0         0.1      0.0    0.0         0.0      0.0    0.0
   10        0.0      0.0    0.0         0.0      0.0    0.0         0.0      0.0    0.0
   11        0.3      0.3    0.0         0.1      0.1    0.0         0.0      0.1    0.0
   12        0.6      0.6    0.2         0.6      0.6    0.1         0.7      0.6    0.1

Two observations should be made. On question 8, noun resolution, we performed comparably to humans on task 2 yet worse than the average peer system on tasks 3 and 4. We will need to investigate this anomaly. On question 12, which measures the number of sentences which appear to be misplaced, all systems seem to perform at about the same level, which is still significantly below human performance.

6 Conclusion and Future Efforts

In conclusion, our system performed better than last year, but there is still a wide gap between human and machine performance. The remaining question is whether the improvements we would like to make to our system will help narrow this gap or not.

We will continue to improve our preprocessing by testing state-of-the-art POS taggers and sentence splitters as they become available.

The utility of query terms for sentence selection was disappointing. We will seek better ways to use these in our HMM/QR framework.

Many more heuristics should be developed to eliminate less useful parts of selected sentences. As mentioned in
[Figure 1: Mean coverage scores of humans (A–J), peer systems (numeric IDs), and baselines (B2–B5), sorted by average mean coverage, for tasks 2, 3, and 4. Left-to-right order (decreasing mean coverage): task 2: H D C A B I E G F J 13 6 16 20 23 14 12 18 26 22 11 B3 21 10 17 B2 19 15; task 3: D C B A E H F G I J 20 16 17 B3 18 22 10 23 11 21 B2 13 15; task 4: B H C I E J D G A F B5 16 23 14 22 13 B4 17 20 10 19.]
References

L. Carlson, D. Marcu, and M.E. Okurowski. 2002. “Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory”. In Discourse and Dialogues. Kluwer Academic Press.

J.M. Conroy and D.P. O’Leary. March 2001. “Text Summarization via Hidden Markov Models and Pivoted QR Matrix Decomposition”. Technical report, University of Maryland, College Park, Maryland.

J.M. Conroy, J.D. Schlesinger, D.P. O’Leary, and M.E. Okurowski. 2001. “Using HMM and Logistic Regression to Generate Extract Summaries for DUC”. In DUC 01 Conference Proceedings. http://duc.nist.gov/.

T. Dunning. 1993. “Accurate Methods for Statistics of Surprise and Coincidence”. Computational Linguistics, 19:61–74.

W.N. Francis and H. Kucera. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin Company, Boston, MA.

C.Y. Lin and E. Hovy. 2002. “The Automatic Acquisition of Topic Signatures for Text Summarization”. In DUC 02 Conference Proceedings. http://duc.nist.gov/.
A. Mikheev. 2000. “Tagging Sentence Boundaries”. In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 264–271, Seattle, WA. Morgan Kaufmann.
J.D. Schlesinger, M.E. Okurowski, J.M. Conroy, D.P. O’Leary, A. Taylor, J. Hobbs, and H.T. Wilson. 2002. “Understanding Machine Performance in the Context of Human Performance for Multi-document Summarization”. In DUC 02 Conference Proceedings. http://duc.nist.gov/.

J.D. Schlesinger, J.M. Conroy, M.E. Okurowski, and D.P. O’Leary. 2003. “Machine and Human Performance