Performance of a Three-Stage System for Multi-Document Summarization

Daniel M. Dunlavy
University of Maryland
[email protected]

John M. Conroy, Judith D. Schlesinger
IDA/Center for Computing Sciences
{conroy, judith}@super.org

Sarah A. Goodman, Mary Ellen Okurowski
Department of Defense
{[email protected], [email protected]}

Dianne P. O'Leary
University of Maryland
[email protected]

Hans van Halteren
University of Nijmegen
[email protected]
1 Introduction

Our participation in DUC 2003 was limited to Tasks 2, 3, and 4. Although the tasks differed slightly in their goals, we applied the same approach in each case: preprocess the data for input to our system, apply our single-document and multi-document summarization algorithms, and post-process the data for DUC evaluation. We did not use the topic descriptions for Task 2 or the viewpoint descriptions for Task 3, and used only the novel sentences for Task 4.

The preprocessing of the data for our needs consisted of term identification, part-of-speech (POS) tagging, sentence boundary detection, and SGML DTD processing. With the exception of sentence boundary detection for Task 4 (the test data was sentence-delimited using SGML tags), each of these preprocessing tasks was performed on all of the documents. Details of each of these tasks are presented in Section 2.

The summarization algorithms were enhanced versions of those presented by members of our group in past DUC evaluations (Conroy et al., 2001; Schlesinger et al., 2002). The enhancements to the previous system are detailed in Section 3.

Previous post-processing consisted of removing lead adverbs such as "And" or "But" to make our summaries flow more easily. For DUC 2003, we added more extensive editing, eliminating part or all of selected sentences. This post-processing is described in Section 4.

2 Preprocessing

Many of the preprocessing tasks were performed using tools created by the Edinburgh Language Technology Group (http://www.ltg.ed.ac.uk/). Specifically, various components of that group's LT TTT (v1.0) parsing system were used. These tools were chosen due to their flexibility in handling both SGML and ASCII text documents, as well as their capability in handling most of the preprocessing tasks required by our summarization tools. The remaining tasks were performed using tools created in Perl.

The main tool used for identifying the terms or tokens of a sentence, tagging each term with its part of speech, and detecting and tagging the sentence boundaries was LT POS (Mikheev, 2000), a tool in the LT TTT suite. LT POS is a probabilistic part-of-speech tagger and sentence splitter based on a combination of hidden Markov and maximum entropy models. The default models, trained on the Brown corpus (Francis and Kucera, 1982), were used in our system.

Prior to this year, we used the SRA NetOwl software with named-entity recognition and aliasing to identify terms. NetOwl also does stemming. We moved away from NetOwl largely to experiment with a simplified preprocessing system. Our goal is to rebuild the preprocessing starting with a simple process and to add "enhancements" only when they improve summarization.

To benchmark the change in the way we identified terms, we used 82 DUC 2001 documents for which we had tagged sentences that could serve as sources for the NIST human-generated abstracts. Two HMMs were trained, the first to utilize the NetOwl preprocessing and the second to use with the new, simpler preprocessor. Then a single-document extract summary of approximately 100 words was generated for each document, using the terms generated by each of the two methods. The outcome of a ten-fold cross-validation was that the new simple method gave an average precision of 60%, while the more complicated NetOwl gave an average precision of 58%.

2.1 Parsing Files using DTDs

Using the SGML document type definition (DTD) for a document allowed us to determine the set of all possible
SGML tags that exist in documents of that type. Using these tag sets, we distinguished which sentences 1) were candidates for extract summaries, 2) contained key terms or phrases that would aid in creating a summary, and 3) contained no useful information for the task of summarization. We created a new attribute, stype, for the SGML tag denoting a sentence boundary, <s>, in order to denote each of these three types of sentences. The possible values for this new attribute are 1, 0, and −1, respectively. Table 1 presents the values of stype used for sentences embedded into the SGML tags encountered in the several types of documents used in the evaluation. Tags not shown are assigned stype = −1.

    Task  DTD               Filename      SGML Tag           stype
    ----  ----------------  ------------  -----------------  -----
    2,3   ACQUAINT          acquaint.dtd  <TEXT>               1
                                          <HEADLINE>           0
    4     FBIS              fbis.dtd      <TEXT>               1
                                          <TI>                 0
                                          <H1>, ..., <H8>      0
          Federal Register  fr.dtd        <TEXT>               1
                                          <SUMMARY>            1
                                          <SUPPLEM>            1
                                          <FOOTNOTE>           1
                                          <DOCTITLE>           0
          Financial Times   ft.dtd        <TEXT>               1
                                          <HEADLINE>           0
          LA Times          latimes.dtd   <TEXT>               1
                                          <HEADLINE>           0
                                          <SUBJECT>            0
                                          <GRAPHIC>            0

    Table 1: Mapping SGML tags to stype values.

Choosing to embed information into the document itself instead of creating a processing module in our summarization algorithm allows us flexibility in using the information throughout the various stages of our system. Furthermore, it will allow us to expand the types of sentence classification without affecting the summarization system.
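To make the tagging concrete, the short sketch below shows one way the stype attribute could be attached to sentence elements. The dictionary contents follow Table 1, but the function and variable names are illustrative; they are not the Perl and LT TTT tools used in the system described here.

    # Illustrative sketch only (the actual system used Perl and LT TTT tools):
    # attach an stype attribute to each <s> element based on the SGML tag that
    # encloses the sentence, following the mapping of Table 1.

    # Mapping for the Financial Times DTD (ft.dtd); the other DTDs in Table 1
    # would get analogous dictionaries.  Tags not listed default to -1.
    FT_STYPE = {"TEXT": 1, "HEADLINE": 0}

    def stype_for(enclosing_tag, mapping, default=-1):
        """Return the stype value for a sentence inside `enclosing_tag`."""
        return mapping.get(enclosing_tag.upper(), default)

    def annotate(sentences, mapping):
        """`sentences` is a list of (enclosing_tag, sentence_text) pairs.
        Returns <s> elements carrying the new stype attribute."""
        return ['<s stype="%d">%s</s>' % (stype_for(tag, mapping), text)
                for tag, text in sentences]

    if __name__ == "__main__":
        sample = [("HEADLINE", "Ferry sinks in the Baltic"),
                  ("TEXT", "More than 800 lives were lost."),
                  ("GRAPHIC", "Map of the region.")]
        for element in annotate(sample, FT_STYPE):
            print(element)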
3 Sentence Extraction

Our summarization algorithm uses a hidden Markov model (HMM) to select sentences for a single-document extract summary. A pivoted QR algorithm is added to select from those to generate multi-document extracts. Details of both algorithms and how they are used for sentence scoring and selection are given in (Conroy and O'Leary, March 2001), (Conroy et al., 2001), (Schlesinger et al., 2002), and (Schlesinger et al., 2003). Improvements we made to our algorithm for this year are included here.
The HMM uses features based upon terms, which we define as delimited strings consisting of the letters a-z, minus a stop list; i.e., everything but Roman letters is considered to be a delimiter. (All text is first converted to lower case.) The preprocessing tools (Section 2) identify the terms for the HMM.

The features we used for the HMM for DUC 2003 are different from prior years. While we previously used the number of terms in a sentence, we now use "subject" and "signature" terms:

• the number of signature terms, n_sig, in the sentence—value is o_2(i) = log(n_sig + 1).

• the number of subject terms, n_subj, in the sentence—value is o_1(i) = log(n_subj + 1).

• the position of the sentence in the document—built into the state-structure of the HMM.

The signature terms are the terms that are more likely to occur in the document (or document set) than in the corpus at large. To identify these terms, we use the log-likelihood statistic suggested by Dunning (1993) and first used in summarization by Lin and Hovy (2002). The statistic is equivalent to a mutual information statistic and is based on a 2-by-2 contingency table of counts for each term.
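For reference, the statistic for a single term can be written out in a few lines. The sketch below is a generic version of Dunning's test with an assumed chi-square-style cutoff; it is not the system's actual term-selection code.

    import math

    def log_likelihood_ratio(k1, n1, k2, n2):
        """Dunning's log-likelihood statistic (G^2) for one term.

        k1: occurrences of the term in the document (set); n1: total term
        count there.  k2, n2: the same counts for the background corpus.
        The four counts define the 2-by-2 contingency table.
        """
        def binom_ll(k, n, p):
            # binomial log-likelihood, with the convention 0 * log 0 = 0
            eps = 1e-12
            return k * math.log(max(p, eps)) + (n - k) * math.log(max(1.0 - p, eps))

        p = (k1 + k2) / (n1 + n2)      # pooled rate (null hypothesis)
        p1, p2 = k1 / n1, k2 / n2      # separate rates (alternative)
        return 2.0 * (binom_ll(k1, n1, p1) + binom_ll(k2, n2, p2)
                      - binom_ll(k1, n1, p) - binom_ll(k2, n2, p))

    def signature_terms(doc_counts, corpus_counts, cutoff=10.83):
        """Terms over-represented in the document whose G^2 exceeds a cutoff
        (10.83 is the 0.001 critical value of chi-square with one degree of
        freedom; the cutoff actually used in practice is a tuning choice)."""
        n1 = sum(doc_counts.values())
        n2 = sum(corpus_counts.values())
        chosen = []
        for term, k1 in doc_counts.items():
            k2 = corpus_counts.get(term, 0)
            over_represented = k1 / n1 > (k1 + k2) / (n1 + n2)
            if over_represented and log_likelihood_ratio(k1, n1, k2, n2) > cutoff:
                chosen.append(term)
        return chosen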
The subject terms are a special subset of the signature terms. These are terms that occur in sentences with stype = 0, for example, headline and subject heading sentences.

The features are normalized component-wise to have mean zero and variance one. In addition, the features for sentences with stype 0 and -1 are coerced to be -1, which forces these sentences to have an extremely low probability of being selected as summary sentences.
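A compact sketch of this feature construction is given below. The input format (per-sentence counts and stype values) and the variable names are assumptions made for the example, not the format used by the actual preprocessing tools.

    import math
    import numpy as np

    def observation_matrix(sentences):
        """sentences: list of dicts with 'n_sig', 'n_subj', and 'stype' keys.
        Returns one row of log-count features per sentence, normalized to
        mean zero and unit variance, with stype 0 / -1 rows pinned at -1."""
        X = np.array([[math.log(s["n_sig"] + 1.0),
                       math.log(s["n_subj"] + 1.0)] for s in sentences])
        X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # component-wise z-score
        stype = np.array([s["stype"] for s in sentences])
        X[stype <= 0] = -1.0  # headline/subject (0) and excluded (-1) sentences
        return X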
The above process of extract generation was used for both Tasks 2 and 3 of DUC 2003. For Task 4, the novelty task, we made the design decision that our extracts would be taken from only the novelty sentence set. (While this strategy is defensible and perhaps prudent, it prevented us from generating a summary for document set 323, which did not have any novel sentences. We conferred with Paul Over, who indicated this was an error in the TREC data.) We achieved this by overriding the sentence type of all sentences that were not marked as novel to be type -1. Thus the HMM would only give high scores to sentences that were labeled as novel.

The model was trained using the novelty data given by NIST. We focused on only the novel sentences in this set. To strengthen the model further, we sorted the novel sentences by hand for 24 of the document sets. This process removed many sentences which were no longer relevant in isolation. These data were then used to train the HMM to score the sentences and determine which features should be included.

In particular, the training data helped determine the number of states for the HMM. The upshot was that a small state space, consisting of five states, two summary states and three non-summary states, was optimal. Empirically, the number of summary states roughly corresponds to the median length in sentences of the human-selected sentences per document.
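The HMM itself and its training procedure are specified in the cited papers. As a rough stand-in only, the sketch below uses the hmmlearn package to fit a five-state Gaussian HMM and scores each sentence by its posterior probability of lying in a summary state. Which learned states correspond to "summary" states is an assumption here; in the real system that association comes from labeled training sentences rather than an unsupervised fit.

    from hmmlearn import hmm

    SUMMARY_STATES = [0, 1]  # assumed: 2 summary states, 3 non-summary states

    def fit_sentence_hmm(X, lengths, n_states=5, seed=0):
        """X: stacked (total_sentences, n_features) observations for the
        training documents; lengths: number of sentences in each document."""
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                n_iter=100, random_state=seed)
        model.fit(X, lengths)
        return model

    def summary_scores(model, X_doc):
        """Posterior probability that each sentence of one document was
        emitted by a summary state (higher means more extract-worthy)."""
        posteriors = model.predict_proba(X_doc)  # shape (n_sentences, n_states)
        return posteriors[:, SUMMARY_STATES].sum(axis=1)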
Another feature we considered for our system was using query terms derived from the topic descriptions. We attempted to use this information in two ways. The first was to simply add an additional feature to the HMM. This approach actually decreased the precision of the system! (In these experiments we assume the summary length is known and, therefore, precision and recall are identical.) The second method we considered was using the derived query terms in conjunction with a retrieval system to rank each document. We hoped to use these document scores in conjunction with HMM sentence scores to generate the extract sentences. Unfortunately, the IR scores did not correlate strongly with the likelihood that

a document's sentence would be chosen for the summary. We hypothesize that since the document collection only contains documents relevant to the query, the topic description terms do not add any additional information. Clearly, more analysis is required to determine why the topic descriptions did not help in the generation of the summaries.

4 From Extract to Abstract

The output from the Sentence Extraction component is a ranked set of sentences selected by the QR algorithm. The HMM tends to select longer sentences. This means that for a 100-word summary, needed for tasks 2, 3, and 4, the QR algorithm would usually select only 2 or 3 sentences from all those first selected by the HMM. We felt that so few sentences would not supply enough of the information we would like to see in a summary.

In order to include more sentences in the summary, we decided to eliminate parts of the top selected sentences that do not usually convey the most important information. Occasionally we lose something we should have kept but, in general, we gain. Shortening the selected sentences permits the inclusion of additional sentences, potentially gaining additional information. To accommodate this, the QR algorithm ranks enough sentences to total about 300 words rather than the needed 100 words.

Full parsing and comprehension are too costly to pursue. We have done some initial investigation into using elementary discourse units (EDUs) (Carlson et al., 2002) to determine sentence structure, component parts, and the importance and relevance of those parts, and would like to use EDUs for the purpose of creating an abstract. Unfortunately, automatic parsing of EDUs is still not strong enough to meet our needs.

Instead, we chose to develop patterns using "shallow parsing" techniques, keying off of lexical cues. The sentences passed by the Sentence Extraction component were run through a part-of-speech (POS) tagger. Each sentence, in order of its ranking by the Sentence Extraction component, was matched against the various patterns. The following eliminations were made, when appropriate:

• sentences that begin with an imperative;

• sentences that contain a personal pronoun at or near the start;

• gerund clauses;

• restricted relative-clause appositives;

• intra-sentential attribution;

• lead adverbs.

4.1 Sentence Elimination

We eliminate two kinds of sentences in our summaries: imperatives and those "beginning" with pronominals. We determined that imperatives rarely contain novel information; in order for them to be effective and understood, they must reference information the reader already knows.

Sentences that have a personal pronoun close to the start of the sentence seem to fall into two categories: 1) they are preceded by a proper noun, in which case they are fine to use; or 2) they do not have their reference within the sentence in which they appear. In the latter case, the antecedent generally is in a preceding sentence.
    Task   Number of Problems   Number of Misses   Total Number of Sentences
      2             1                   6                      91
      3             5                   6                     102
      4             5                   2                     113

    Table 2: Heuristic Errors

Ideally, when a sentence begins with a pronoun, we should do analysis and identify its antecedent. This is a very difficult task. We have instead taken the approach of eliminating any sentence that begins with a pronoun unless its preceding sentence in the original document has already been included in the summary. Because this is a one-pass process, even if we later include the sentence with the needed reference, the pronominal sentence will not be re-included. (We need to do more evaluation to determine if this was a correct decision.) Elimination of these sentences may cause the loss of some important information, but definitely improves the readability of generated summaries.

The sentence-elimination rules are:

1. imperatives: if the first word is tagged as a verb "base form" (VB in our POS tagger), the sentence is considered an imperative and eliminated from use in the summary.

2. pronominals: if a personal pronoun (PRP in our POS tagger) appears within the first eight (8) words of a sentence (eight was chosen heuristically; we know of no evidence that proves it to be the correct number of words) and it is not preceded by a proper noun, we check to see that the sentence immediately preceding it in the original document has been selected. If not, the sentence containing the pronominal is eliminated.
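Both rules above operate on POS-tagged sentences and can be sketched in a few lines. The tag names follow the Penn Treebank conventions mentioned in the rules, while the function names and the loose reading of "preceded by a proper noun" are assumptions of this illustration rather than the system's exact logic.

    def is_imperative(tagged):
        """Rule 1.  `tagged` is a list of (word, POS) pairs for one sentence;
        a sentence whose first word is tagged VB is treated as an imperative."""
        return bool(tagged) and tagged[0][1] == "VB"

    def pronominal_block(tagged, prev_selected, window=8):
        """Rule 2.  A personal pronoun (PRP) within the first `window` words,
        with no proper noun (NNP/NNPS) anywhere before it, blocks the sentence
        unless the immediately preceding original sentence was selected."""
        proper_noun_seen = False
        for word, pos in tagged[:window]:
            if pos in ("NNP", "NNPS"):
                proper_noun_seen = True
            elif pos == "PRP" and not proper_noun_seen:
                return not prev_selected
        return False

    def keep_sentence(tagged, prev_selected):
        return not is_imperative(tagged) and not pronominal_block(tagged, prev_selected)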
4.2 Clause Elimination

Three different kinds of clauses were eliminated: gerund clauses, restricted relative-clause appositives, and intra-sentential attribution. In addition to the patterns identified to locate the clauses to be removed, we utilized a simple heuristic: if the number of tokens to be deleted was greater than or equal to the number of tokens to be retained, the elimination was not performed.

Gerunds often comment on, rather than advance, a narration and therefore tend to be incidental. To be eliminated, a gerund clause must 1) be at the start of the sentence or immediately follow a comma, and 2) have the gerund (VBG) as the lead word or as the second word following a preposition (IN) or "while" or "during". The end of the clause is identified by a comma or a period. The following is a sentence from the training data with a gerund phrase that would be removed:

    More than 800 lives were lost when the 21,794 tonne ferry, sailing
    from the Estonian capital Tallinn to Stockholm, sank within minutes
    early yesterday morning in the Baltic Sea 40 km south west of the
    Finnish island of Uto.

Restricted relative-clause appositives usually provide background information. Because they are always delimited by punctuation, they can be removed relatively easily. The patterns for these clauses look for specific words playing specific part-of-speech roles in the sentence: "who", "when", "where", "which", and "in which". They require the clause to follow a comma and end with a comma or period. A sentence from the training data with a restricted relative-clause appositive to be removed is:

    The Menendez family lived in the Princeton Area until 1986, when they
    moved to California.
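The sketch below shows how these two clause patterns and the size heuristic above can be expressed over a POS-tagged sentence. Punctuation cleanup around a removed span and the part-of-speech checks on the cue words are simplified, so this is an approximation of the patterns described, not the patterns themselves.

    RELATIVE_CUES = ("who", "when", "where", "which")

    def gerund_clause_at(tagged, i):
        """A removable gerund clause starting at token index i: VBG as the
        lead word, or VBG right after a preposition (IN), "while", or "during"."""
        if i >= len(tagged):
            return False
        word, pos = tagged[i]
        if pos == "VBG":
            return True
        if i + 1 < len(tagged) and (pos == "IN" or word.lower() in ("while", "during")):
            return tagged[i + 1][1] == "VBG"
        return False

    def relative_clause_at(tagged, i):
        """A removable relative-clause appositive starting at token index i."""
        words = [w.lower() for w, _ in tagged[i:i + 2]]
        return bool(words) and (words[0] in RELATIVE_CUES or words[:2] == ["in", "which"])

    def clause_end(tagged, start):
        """The clause runs to the next comma or period."""
        for j in range(start, len(tagged)):
            if tagged[j][0] in (",", "."):
                return j
        return len(tagged)

    def remove_clauses(tagged):
        """Single left-to-right pass; a clause is dropped only if it deletes
        fewer tokens than the sentence would retain (the size heuristic)."""
        kept, i = [], 0
        while i < len(tagged):
            after_comma = i > 0 and tagged[i - 1][0] == ","
            match = ((i == 0 or after_comma) and gerund_clause_at(tagged, i)) or \
                    (after_comma and relative_clause_at(tagged, i))
            if match:
                end = clause_end(tagged, i)
                deleted = end - i
                if deleted < len(tagged) - deleted:
                    i = end + 1  # skip the clause and its closing delimiter
                    continue
            kept.append(tagged[i])
            i += 1
        return kept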
retained, the elimination was not performed. of our specified attribution verbs. A sentence from the
Gerunds often comment on, rather than advance, a training data with an attribution to be removed is:
narration and therefore tend to be incidental. To eliminate The federal Government’s highway safety watch-
a gerund clause, it must 1) be at the start of the sentence dog said Wednesday that the Ford Bronco II ap-
or immediately follow a comma, and 2) have the gerund pears to be involved in more fatal roll-over accidents
than other vehicles in its class and that it will seek
(VBG) as the lead word or as the second word following
to determine if the vehicle itself contributes to the
a preposition (IN) or “while” or “during”. The end of accidents.
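The two attribution cases can be sketched as follows over a tokenized sentence. The handful of verbs listed is a stand-in for the roughly 50-verb list mentioned above, and the token-level conventions are assumptions of the example.

    # A few stand-ins for the roughly 50 attribution verbs (and their forms).
    ATTRIBUTION_VERBS = {"said", "says", "say", "stated", "reported", "announced",
                         "added", "noted", "claimed", "told", "declared"}

    def strip_leading_attribution(tokens):
        """If the sentence opens with an attribution that ends in "that" with
        no punctuation before it, drop everything through "that"."""
        for i, tok in enumerate(tokens):
            if tok in {",", ";", ":", "."}:  # punctuation before "that": no match
                return tokens
            if tok.lower() == "that":
                has_verb = any(t.lower() in ATTRIBUTION_VERBS for t in tokens[:i])
                return tokens[i + 1:] if has_verb else tokens
        return tokens

    def strip_trailing_attribution(tokens):
        """If the text after the sentence's last comma ends in an attribution
        verb, drop that final clause and keep a period."""
        if "," not in tokens:
            return tokens
        last_comma = len(tokens) - 1 - tokens[::-1].index(",")
        tail = [t for t in tokens[last_comma + 1:] if t not in {".", "!", "?"}]
        if tail and tail[-1].lower() in ATTRIBUTION_VERBS:
            return tokens[:last_comma] + ["."]
        return tokens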
4.3 Lead Adverb Elimination

For DUC 2002, we eliminated sentence lead words such as "And" and "But" since they did not add substantial information and often hindered the flow of the summary.
                    ------ Task 2 ------   ------ Task 3 ------   ------ Task 4 ------
    Question        System  Peers  Humans  System  Peers  Humans  System  Peers  Humans
                      16    (mean) (mean)    16    (mean) (mean)    16    (mean) (mean)
       1              0.0    0.5    0.0      0.3    0.5    0.0      0.0    0.2    0.0
       2              0.0    0.1    0.0      0.1    0.2    0.0      0.1    0.0    0.0
       3              0.0    0.1    0.0      0.0    0.1    0.0      0.0    0.0    0.0
       4              0.1    0.2    0.1      0.2    0.2    0.0      0.2    0.2    0.0
       5              0.1    0.2    0.0      0.3    0.2    0.0      0.2    0.2    0.1
       6              0.0    0.0    0.0      0.0    0.1    0.0      0.0    0.0    0.0
       7              0.0    0.2    0.1      0.1    0.2    0.0      0.1    0.1    0.0
       8              0.1    0.3    0.1      0.3    0.3    0.0      0.5    0.4    0.0
       9              0.0    0.0    0.0      0.1    0.0    0.0      0.0    0.0    0.0
      10              0.0    0.0    0.0      0.0    0.0    0.0      0.0    0.0    0.0
      11              0.3    0.3    0.0      0.1    0.1    0.0      0.0    0.1    0.0
      12              0.6    0.6    0.2      0.6    0.6    0.1      0.7    0.6    0.1

    Table 3: Average Error Rates on Readability Questions (Scale: 0–4)

For the same reason, in DUC 2003 we expanded this and eliminated any adverb (RB) that began a sentence.

4.4 Impact

We performed a "post mortem" analysis of the summaries we generated to see what problems we had created when we eliminated parts of sentences. (We did not evaluate the impact of eliminating entire sentences.) Both the number of bad sentences (fragments, missing words, etc.), identified as "problems" in Table 2, and the number of clauses that should have been removed but weren't, identified as "misses" in the table, were relatively small. In some cases, the problems were due to poor sentence splitting and not to any of the heuristics we applied.

5 Results

Overall, our system, system 16, was comparable to the top systems as rated by mean coverage, as given by NIST. Figure 1 shows the rankings of the humans, machine (peer) systems, and baseline systems for each of tasks 2, 3, and 4. The systems are sorted by average mean coverage. Our system ranked third and second among the peer systems for tasks 2 and 3, respectively. For task 4, it was the top scoring peer system but was edged out by baseline 5!

However, the top scoring peer systems are extremely close. More precisely, using a one-way ANOVA test on the top six ranked systems gives p values of 0.82, 0.87, and 0.92 for tasks 2, 3, and 4, respectively. Under the null hypothesis, i.e., that the top scoring systems have the same mean and variance with regard to mean coverage, p gives the probability that these average mean coverages would be produced. Thus, mean coverage does not separate the top scoring systems.

Note that in tasks 3 and 4, one of the baselines was among the top six scoring systems. By comparison, the top scoring baseline for task 2 ranked twelfth, and the corresponding p value was 0.07, meaning we can be 93% certain that the top 12 systems do not have the same mean coverage score.
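The test reported here is a standard one-way ANOVA over the per-document-set mean coverage scores. The call below, using SciPy, illustrates the computation; the score arrays and the helper name scores_for are placeholders, since the DUC data themselves are not reproduced here.

    from scipy.stats import f_oneway

    def anova_p_value(per_system_scores):
        """per_system_scores: one array of per-document-set mean coverage
        scores for each of the systems being compared.  A large p value
        means the systems' mean coverages are statistically
        indistinguishable at the usual significance levels."""
        return f_oneway(*per_system_scores).pvalue

    # Hypothetical usage, with scores_for() standing in for the DUC data:
    # p = anova_p_value([scores_for(s) for s in top_six_systems])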
Table 3 shows how our system fared on the twelve questions which measure readability on a scale from 0 to 4. We present the average score on each of the questions for system 16, the average peer system (excluding ours), and the average of the 10 humans. Overall, our system performed fairly well.

Two observations should be made. On question 8, noun resolution, we performed comparably to humans on task 2 yet worse than the average peer system on tasks 3 and 4. We will need to investigate this anomaly. On question 12, which measures the number of sentences which appear to be misplaced, all systems seem to perform at about the same level, which is still significantly below human performance.

6 Conclusion and Future Efforts

In conclusion, our system performed better than last year, but there is still a wide gap between human and machine performance. The remaining question is whether the improvements we would like to make to our system will help narrow this gap.

We will continue to improve our preprocessing by testing state-of-the-art POS taggers and sentence splitters as they become available.

The utility of query terms for sentence selection was disappointing. We will seek better ways to use these in our HMM/QR framework.

Many more heuristics should be developed to eliminate less useful parts of selected sentences. As mentioned in Section 4.2, many attributions are still not identified, and we would like to find a way to do this more thoroughly.
    [Figure 1: Mean Coverage. Each panel plots the mean coverage score (y-axis)
    for humans, baselines, and peer systems, sorted from highest to lowest:
    (a) Task 2 (Humans A–J, Baselines B2–B3, Peer Systems 6–26), order:
        H D C A B I E G F J 13 6 16 20 23 14 12 18 26 22 11 B3 21 10 17 B2 19 15
    (b) Task 3 (Humans A–J, Baselines B2–B3, Peer Systems 10–23), order:
        D C B A E H F G I J 20 16 17 B3 18 22 10 23 11 21 B2 13 15
    (c) Task 4 (Humans A–J, Baselines B4–B5, Peer Systems 10–23), order:
        B H C I E J D G A F B5 16 23 14 22 13 B4 17 20 10 19]


We would also like to eliminate most adverbs that occur, but it is difficult to determine when they are needed and when they can be removed.

Additionally, we looked at eliminating all sentences with passive construction but found that we were not yet able to identify only the sentences we want to eliminate.

Finally, we would like to eliminate the various types of parenthetic phrases—(...), [...], –...–, etc.—but have so far found it difficult to identify those which contain information that should be included in a summary.

References

L. Carlson, D. Marcu, and M.E. Okurowski. 2002. "Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory". In Discourse and Dialogues. Kluwer Academic Press.

J.M. Conroy and D.P. O'Leary. March 2001. "Text Summarization via Hidden Markov Models and Pivoted QR Matrix Decomposition". Technical report, University of Maryland, College Park, Maryland.

J.M. Conroy, J.D. Schlesinger, D.P. O'Leary, and M.E. Okurowski. 2001. "Using HMM and Logistic Regression to Generate Extract Summaries for DUC". In DUC 01 Conference Proceedings. http://duc.nist.gov/.

T. Dunning. 1993. "Accurate Methods for the Statistics of Surprise and Coincidence". Computational Linguistics, 19:61–74.

W. N. Francis and H. Kucera. 1982. Frequency Analysis of English Usage: Lexicon and Grammar. Houghton Mifflin Company, Boston, MA.

C.Y. Lin and E. Hovy. 2002. "The Automatic Acquisition of Topic Signatures for Text Summarization". In DUC 02 Conference Proceedings. http://duc.nist.gov/.

A. Mikheev. 2000. "Tagging Sentence Boundaries". In Proceedings of the First Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 264–271, Seattle, WA. Morgan Kaufmann.

J.D. Schlesinger, M.E. Okurowski, J.M. Conroy, D.P. O'Leary, A. Taylor, J. Hobbs, and H.T. Wilson. 2002. "Understanding Machine Performance in the Context of Human Performance for Multi-document Summarization". In DUC 02 Conference Proceedings. http://duc.nist.gov/.

J.D. Schlesinger, J.M. Conroy, M.E. Okurowski, and D.P. O'Leary. 2003. "Machine and Human Performance for Single and Multidocument Summarization". IEEE Intelligent Systems, 18(1):46–54, Jan/Feb.
