
Efficient Attentions for Long Document Summarization

Luyang Huang 1 Shuyang Cao1 Nikolaus Parulian2 Heng Ji2 Lu Wang1


1 Computer Science and Engineering, University of Michigan, Ann Arbor, MI
2 Department of Computer Science, University of Illinois at Urbana-Champaign, IL
1 {lyhuang, caoshuy, wangluxy}@umich.edu
2 {nnp2, hengji}@illinois.edu

Abstract

The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose HEPOS, a novel efficient encoder-decoder attention with head-wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self-attentions. Combined with HEPOS, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state-of-the-art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.

1 Introduction

Long documents, such as scientific papers and government reports, often discuss substantial issues at length, and thus are time-consuming to read, let alone to comprehend. Generating abstractive summaries can help readers quickly grasp the main topics, yet prior work has mostly focused on short texts (containing hundreds of words), e.g., news articles (Gehrmann et al., 2018; Liu and Lapata, 2019; Zhang et al., 2019).

Model training efficiency and summary quality present a pair of challenges for long document summarization. State-of-the-art systems (Lewis et al., 2020; Zhang et al., 2019) are built upon Transformer (Vaswani et al., 2017), which uses attentions to compute pairwise relations between tokens. Such a framework has quadratic time and memory complexities, and is too costly for long documents.[1] Solutions have been proposed to reduce the calculation of encoder self-attentions (Wang et al., 2020c; Zaheer et al., 2020) by selectively attending to neighboring tokens (Beltagy et al., 2020; Child et al., 2019) or relevant words (Kitaev et al., 2020; Tay et al., 2020a). Yet, these methods do not apply to encoder-decoder attentions in summarization models, which collaborate and dynamically pinpoint salient content in the source as the summary is decoded. Truncation is commonly used to circumvent the issue. However, training on curtailed content further aggravates "hallucination" in existing abstractive models (Maynez et al., 2020).

[1] For instance, to fine-tune BART on documents of 10K tokens with a batch size of 1, 70GB of memory is needed for encoder attentions, and 8GB for encoder-decoder attentions.

We argue that summarizing long documents (e.g., with thousands of words or more) requires efficient handling of both types of attentions. To this end, we propose an efficient encoder-decoder attention with head-wise positional strides (HEPOS), where the attention heads follow a strided pattern and have varying starting positions. HEPOS reduces computational and memory costs while (1) maintaining the power of emphasizing important tokens, and (2) preserving the global context per head. HEPOS successfully doubles the processed input sequence size when combined with any encoder. To the best of our knowledge, we are the first to study efficient encoder-decoder attentions and to provide a systematic comparison of diverse encoder attentions for the task of summarization.[2]

[2] Our code is released at https://github.com/luyang-huang96/LongDocSum.

For evaluation, we collect a new large-scale dataset, GovReport, consisting of about 19.5k U.S. government reports with expert-written abstractive summaries.[3] GovReport has two important features: (1) It contains significantly longer documents (9.4k words) and summaries (553 words) than existing datasets, such as PubMed and arXiv (Cohan et al., 2018) (see Table 2); (2) Salient
content is spread throughout the documents, as opposed to cases where summary-worthy words are more heavily concentrated in specific parts of the document. These properties make GovReport an important benchmark for producing long document summaries with multiple paragraphs.

[3] GovReport can be downloaded from https://gov-report-data.github.io.

We conduct experiments on GovReport and on scientific papers in PubMed and arXiv. First, when summarizing documents of the same length, HEPOS attention yields significantly better ROUGE scores than a non-trivial comparison that projects attentions into a low-rank space (Wang et al., 2020c). Second, when trained on the same GPU, HEPOS attention, combined with sparse encoder attentions, is able to read more than 10K words and obtains significantly higher ROUGE scores on GovReport and new state-of-the-art results on PubMed, compared with full encoder-decoder attention models, which can process at most 5K input words. Human judges further rate the summaries generated by our models as more informative and faithful.

We further propose a new evaluation metric for faithfulness, inspired by APES (Eyal et al., 2019), a fill-in-the-blank QA metric for summary evaluation. With questions generated from references, our metric, APES_src, compares QA answers obtained by reading the source and the system summary. It is shown to be better correlated with human judgment than the original metric and an entailment-based scorer (Kryscinski et al., 2020).

The rest of the paper is organized as follows. We describe efficient encoder attentions in prior work in § 2, and formulate our proposed encoder-decoder attention in § 3. The GovReport data is presented in § 4. We then share details on evaluation metrics (§ 5) and experimental results (§ 6). Additional related work is listed in § 7, with conclusions in § 8.

Model                    Complexity    # New Para.
Full                     O(n^2)        —

Encoder Self-attentions
I. Fixed Patterns
Sliding Window (2020)    O(nw)         0
Adaptive Span (2019)     O(n*ŵ)        O(1)
Global Tokens (2020)     O(2ng)        0
Stride (2019)            O(n^2/s)      0
Random (2020)            O(nr)         0
II. Low-rank
Linformer (2020c)        O(nk)         O(n)
III. Learnable Patterns
LSH (2020)               O(l*n*b_l)    0
Sinkhorn (2020a)         O(2n*b_s)     0

Encoder-decoder Attentions
Hepos (ours)             O(mn/s_h)     0
Linformer                O(mk)         O(n)

Table 1: Summary of efficient Transformer attentions on memory complexity and newly learned parameters, compared with full attentions at each layer. n and m are the lengths of the input and the output, respectively. See § 2 and § 3 for model-specific hyperparameters.

2 Prior Work on Efficient Encoder Attentions

Transformer models are built upon multi-head attentions in multiple layers. The attention is calculated as Attention(Q, K, V) = softmax(QK^T / √d_k) V, where Q, K, and V are the query, key, and value matrices, each consisting of n vectors for a document with n tokens, hence the quadratic memory footprint. Here, we present an overview of representative methods for efficient encoder self-attentions (henceforth "encoder attentions") that can be built upon large pre-trained seq2seq models, e.g., BART (Lewis et al., 2020). We follow the naming convention of Tay et al. (2020b), and summarize their memory complexities and numbers of newly learned parameters in Table 1.
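To make the quadratic footprint concrete, here is a minimal PyTorch sketch of full scaled dot-product attention (ours, for illustration; not from the released code). The n × n score matrix is exactly what the efficient variants below avoid materializing:

    import math
    import torch

    def full_attention(Q, K, V):
        """Full scaled dot-product attention.

        Q, K, V: (n, d_k) tensors for a document of n tokens.
        The scores tensor is (n, n), so memory grows quadratically in n.
        """
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n, n) matrix
        weights = torch.softmax(scores, dim=-1)
        return weights @ V  # (n, d_k)

    # For n = 10,000 tokens, the (n, n) float32 score matrix alone takes
    # ~400MB per head per layer, before gradients and activations.
    n, d_k = 512, 64
    Q, K, V = (torch.randn(n, d_k) for _ in range(3))
    out = full_attention(Q, K, V)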
2.1 Fixed Patterns

Fixed patterns are used to limit the scope of attentions. In our experiments, in addition to window-based attentions, we also combine them with global tokens, stride patterns, or random attentions.

Sliding window attentions (Beltagy et al., 2020) aim to capture the local context, which is critical for language understanding (Liu* et al., 2018; Child et al., 2019). Concretely, each query token attends to w/2 neighboring tokens on both its left and right, yielding a memory complexity of O(nw).

Adaptive span is proposed by Sukhbaatar et al. (2019) to learn attention windows at different layers. This is implemented by learning a masking function for each head independently. In practice, the adaptive span attention has a complexity of O(n*ŵ), where ŵ is the maximum value of the predicted spans over all heads. Besides, it introduces O(1) new parameters for learning the spans.

Global tokens (Beltagy et al., 2020) are often added to sliding windows to let pre-selected tokens attend to the full sequence, in order to build global representations. Importantly, global attention operations are symmetric, i.e., a global token is also attendable to all tokens in the sequence. We select the first g tokens as global tokens, as leading sentences are often important for summarization. The memory complexity is O(2ng) due to the symmetric attentions.

Stride patterns are proposed by Child et al. (2019) to capture long-term interactions, where each query attends to every s-th token, with s as the stride size. This yields a complexity of O(n^2/s).

Random attention is motivated by the fact that randomly constructed graphs with Θ̃(n) edges can spectrally approximate complete graphs (Zaheer et al., 2020). Zaheer et al. (2020) propose to allow each query to attend to r random keys, resulting in a complexity of O(nr). For efficient implementations, input tokens are first segmented into blocks; tokens in one block attend to tokens in another randomly selected block.
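To make the fixed patterns above concrete, here is a minimal, unofficial sketch (ours, not from the paper's code release) of a boolean mask combining a sliding window with global tokens; w and g follow the definitions above:

    import torch

    def window_global_mask(n: int, w: int, g: int) -> torch.Tensor:
        """Boolean (n, n) mask: True where attention is allowed.

        Each query attends to w/2 neighbors on each side (sliding window),
        and the first g tokens attend to, and are attended by, all tokens.
        """
        idx = torch.arange(n)
        # Sliding window: |i - j| <= w // 2
        mask = (idx[:, None] - idx[None, :]).abs() <= w // 2
        # Global tokens: symmetric full attention for the first g positions
        mask[:g, :] = True   # global tokens attend everywhere
        mask[:, :g] = True   # all tokens attend to global tokens
        return mask

    mask = window_global_mask(n=1024, w=256, g=128)
    # scores.masked_fill(~mask, float("-inf")) before the softmax would
    # realize this pattern, though a practical implementation computes
    # only the allowed entries instead of masking a dense matrix.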
2.2 Low-rank Methods

Wang et al. (2020c) show that self-attention matrices are low-rank. They propose Linformer, which linearly projects the key and value matrices into a low-dimensional space, e.g., from n to k, to achieve an O(nk) complexity. It also introduces O(n) new parameters for learning the projection matrices.
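As an illustration of the low-rank idea (our sketch under the O(nk) scheme described above, not Linformer's released code), the sequence dimension of K and V can be compressed from n to k before the usual attention:

    import math
    import torch
    import torch.nn as nn

    class LowRankAttention(nn.Module):
        """Linformer-style sketch: keys/values are compressed along the
        sequence axis from n to k, so scores are (n, k), not (n, n)."""

        def __init__(self, n: int, k: int, d: int):
            super().__init__()
            self.proj_k = nn.Linear(n, k, bias=False)  # O(n) new parameters
            self.proj_v = nn.Linear(n, k, bias=False)
            self.d = d

        def forward(self, Q, K, V):  # each of shape (n, d)
            K_low = self.proj_k(K.transpose(0, 1)).transpose(0, 1)  # (k, d)
            V_low = self.proj_v(V.transpose(0, 1)).transpose(0, 1)  # (k, d)
            scores = Q @ K_low.transpose(0, 1) / math.sqrt(self.d)  # (n, k)
            return torch.softmax(scores, dim=-1) @ V_low            # (n, d)

    attn = LowRankAttention(n=1024, k=256, d=64)
    out = attn(*(torch.randn(1024, 64) for _ in range(3)))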
2.3 Learnable Patterns

Recently, learnable sparse attentions have been proposed to better capture both local and global contexts than attentions based on fixed patterns.

Locality-sensitive hashing (LSH) attentions use a random-projection hashing function to hash similar queries and keys into the same buckets in l rounds (Kitaev et al., 2020). Attentions are then computed among tokens within each bucket. For a bucket size of b_l, the complexity of LSH attention is O(l*n*b_l).

Sinkhorn attentions first segment a sequence into blocks, which are then rearranged by a learned Sinkhorn sorting network (Tay et al., 2020a). Given the new permutation, each query attends to b_s tokens within the same block to maintain the local context, and to another b_s tokens in a neighboring block to capture global interactions. Its complexity is O(2n*b_s).

2.4 Other Attentions

We also describe several notable methods that are not suitable for our experiments and are excluded from this study: recurrence over input segments is tailored for an autoregressive decoder only (Dai et al., 2019); memory methods use a separate memory module to attend to full sequences (Lee et al., 2019), which shares a similar theoretical foundation with global tokens; and kernel methods over attentions require training models from scratch (Choromanski et al., 2020; Katharopoulos et al., 2020).

3 Encoder-decoder Attention with Head-wise Positional Strides (Hepos)

The efficient design of encoder-decoder attentions with head-wise positional strides (HEPOS) allows models to consume longer sequences. Concretely, our design is motivated by two observations: (1) attention heads are redundant (Voita et al., 2019), and (2) any individual head rarely attends to several tokens in a row (Clark et al., 2019). Therefore, as illustrated in Fig. 1, HEPOS uses separate encoder-decoder heads on the same layer to cover different subsets of source tokens at fixed intervals. Each head starts at a different position, and all heads collectively attend to the full sequence.

[Figure 1: A toy example of our HEPOS attention, with a stride of 2 and four attention heads. Dark colors indicate that heads 1 and 3 attend to the first and third tokens ("Job" and "home") in the input, while heads 2 and 4 look at the second and fourth words ("in" and "care").]

Given a stride size of s_h, for the h-th head, the attention value between decoder query q_j (at step j) and encoder key vector k_i (for the i-th input token) can be formulated as:

    a^h_ji = softmax(q_j k_i),  if (i - h) mod s_h = 0
           = 0,                 otherwise                   (1)

In HEPOS attention, each query token attends to n/s_h tokens per head, yielding a memory complexity of O(mn/s_h), where m is the output length.

For comparison, Linformer (§ 2.2) can be straightforwardly adapted for encoder-decoder attentions by instead using decoder queries for the attention calculation. We do not adapt pattern-based attentions (§ 2.1 and § 2.3), since they rely on local token grouping, which makes it difficult to pinpoint salient content.
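The official implementation is in the code release linked above; as an unofficial sketch, Eq. 1 amounts to a per-head selection of encoder positions:

    import math
    import torch

    def hepos_attention(Q, K, V, num_heads: int, s_h: int):
        """HEPOS encoder-decoder attention sketch (unofficial).

        Q: (m, d) decoder queries; K, V: (n, d) encoder keys/values.
        Head h only sees source positions i with (i - h) mod s_h == 0,
        so each head stores m * n / s_h scores instead of m * n.
        Head-specific projections are omitted for brevity.
        """
        m, d = Q.shape
        n = K.size(0)
        outputs = []
        for h in range(num_heads):
            keep = (torch.arange(n) - h) % s_h == 0    # head-specific positions
            K_h, V_h = K[keep], V[keep]                # (n / s_h, d)
            scores = Q @ K_h.T / math.sqrt(d)          # (m, n / s_h)
            outputs.append(torch.softmax(scores, -1) @ V_h)
        return torch.stack(outputs)                    # (num_heads, m, d)

    # With s_h = 4 and 16 heads, heads 0-3 start at offsets 0-3, so the
    # heads jointly cover every source token (heads h and h + s_h see the
    # same positions once num_heads exceeds s_h).
    out = hepos_attention(torch.randn(8, 64), torch.randn(1024, 64),
                          torch.randn(1024, 64), num_heads=16, s_h=4)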
4 GovReport Dataset

We introduce a new large-scale dataset, GovReport, containing 19,466 long reports published by the U.S. Government Accountability Office (GAO)[4] to fulfill requests by congressional members, and by the Congressional Research Service (CRS)[5], covering research on a broad range of national policy issues. A human-written summary is provided along with each report. During data collection, we remove boilerplate from the crawled files, and keep the section and paragraph structure of the documents and summaries. Additional data cleaning and processing details are in Appendix A.

[4] www.gao.gov
[5] crsreports.congress.gov

We obtain 12,228 GAO reports and 7,238 CRS reports, whose high quality is evidenced by human inspection of 200 parsed reports. The collected GAO and CRS reports have on average 6.9 and 4.6 sections, respectively. We split train, validation, and test sets by publication date on each dataset, and end up with 17,519 training samples, 974 validation documents, and 973 test samples.

Notably, summaries of GAO reports are written by experts, and are often structured into three aspects in order: "Why GAO did this study"—the motivation and problem(s) under discussion; "What GAO found"—the findings of the report; and "What GAO recommends"—suggestions and solutions to the problem(s). All but three GAO summaries include "What GAO Found". The percentages of GAO summaries that contain "Why GAO did this study" and "What GAO recommends" are 94.8% and 29.0%, respectively. For comparison, structured summaries are also observed on PubMed (Cohan et al., 2018) samples. Though they do not contain explicit aspect labels, the summaries can often be broken down into "Introduction", "Methods", "Results", and "Conclusion" via keyword matching. Details about the keyword choices for each aspect are provided in Table 11 in Appendix D.

Comparison with Existing Long Document Summarization Datasets. In Table 2, we compare GovReport with several existing long document summarization datasets, including PubMed and arXiv (Cohan et al., 2018), which consist of scientific publications; BillSum (Kornilova and Eidelman, 2019), a collection of congressional bills; and BigPatent (Sharma et al., 2019), a corpus of U.S. patent documents.

Dataset     # Doc      Summary          Doc      Comp.  Den.
                       # word   # sent  # word
PubMed      133,215    202.4    6.8     3049.0   16.2   5.8
arXiv       215,913    272.7    9.6     6029.9   39.8   3.8
BillSum     23,455     207.7    7.2     1813.0   13.6   4.1
BigPatent   1,341,362  116.5    3.7     3573.2   36.3   2.4
GovReport   19,466     553.4    17.8    9409.4   19.0   7.3

Table 2: Statistics of GovReport and existing long document summarization datasets. Comp.: compression ratio; Den.: extractive fragment density (Grusky et al., 2018). All values are means over the whole dataset, except for the "# Doc" column. Documents and summaries in GovReport are significantly longer.

First, documents and summaries in GovReport are significantly longer than in prior datasets. Next, we inspect the distribution of summary-worthy bigrams in the source by dividing each document into ten equisized partitions. For each partition, we count the number of unique bigrams that also appear in the reference, accumulated from the start of the document to the end of that partition. Fig. 2 shows that key information is spread throughout documents in GovReport, with new salient bigrams being steadily added as more content is consumed. For arXiv and BigPatent, only about 10% of new salient bigrams are accumulated in the second half of the documents, reflecting the heavy positional bias in these two datasets. In contrast, in GovReport and BillSum, more than 18% of new summary-worthy bigrams appear in the later half of the articles, showing a more even distribution. A similar trend is observed on unigrams. Note, however, that BillSum has the shortest documents among the five datasets.

[Figure 2: Percentage of unique salient bigrams accumulated from the start to X% of the source (x-axis: position in the source (%); y-axis: coverage of salient bigrams (%)), for PubMed, arXiv, BillSum, BigPatent, and GovReport. Key information is spread over the documents in GovReport, highlighting the importance of understanding longer text.]
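This analysis is easy to reproduce. Below is a sketch of the cumulative salient-bigram computation (our illustration; simple whitespace tokenization stands in for the paper's preprocessing):

    def bigrams(tokens):
        return set(zip(tokens, tokens[1:]))

    def salient_bigram_coverage(doc_tokens, ref_tokens, partitions=10):
        """Cumulative % of reference bigrams seen after each tenth of the doc."""
        ref_bi = bigrams(ref_tokens)
        size = max(1, len(doc_tokens) // partitions)
        coverage, seen = [], set()
        for p in range(1, partitions + 1):
            prefix = doc_tokens[: p * size] if p < partitions else doc_tokens
            seen |= bigrams(prefix) & ref_bi  # salient bigrams seen so far
            coverage.append(100.0 * len(seen) / max(1, len(ref_bi)))
        return coverage  # one cumulative percentage per partition

    doc = "the quick brown fox jumps over the lazy dog".split()
    ref = "quick brown dog jumps".split()
    print(salient_bigram_coverage(doc, ref, partitions=3))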
5 Summary Evaluation with Cloze QA

This work aims to evaluate whether processing more text improves both the informativeness and the faithfulness of abstractive summaries. In addition to ROUGE (Lin, 2004) and human evaluation, we extend an existing QA-based metric (Eyal et al., 2019) and consider an entailment-based scorer.

QA-based Evaluation. We present a new faithfulness evaluation metric by extending the APES score (Eyal et al., 2019). We follow APES to construct a set of cloze questions, {q}, from each reference summary by masking entities. Events, dates, and numbers are also masked, as they are prevalent in our data. Each masked phrase becomes the gold-standard answer a_ref for a question q. We do not generate natural language questions (Durmus et al., 2020; Wang et al., 2020a), due to the lack of accurate question generation models for the domains of government reports and scientific papers.

QA models are trained by reading a question and a context to label the answer span in the context. We construct the context by greedily selecting sentences that maximize the improvement of ROUGE-2 recall when compared with the reference summary. If the answer a_ref cannot be found in the context, the sample is excluded from training. We train all QA models by fine-tuning BERT (Devlin et al., 2019) to predict the answer span.

To evaluate the faithfulness of a system summary, APES uses the QA model to read the summary and a question q to label an answer a_sys. It then calculates a unigram F1 score by comparing a_sys and a_ref. Different from APES, we further use the QA model to read the context (sentences selected from the source) and give an answer a_cxt to the question q. We compute a unigram F1 by comparing a_sys and a_cxt, denoted as APES_src. Given that existing summarization models rarely rewrite names or numbers correctly, our metric can better capture faithfulness by using a gold-standard answer constructed from the source article rather than from the human-written abstract.

To extract entities and events, we deploy a state-of-the-art IE framework, OneIE (Lin et al., 2020), on GovReport. On PubMed, we retrain OneIE on the Genia 2011 (BioNLP, 2011), Genia 2013 (BioNLP, 2013), and PubMed (Wei et al., 2019) datasets to extract domain-specific entities and events, such as entities of type Gene and Disease. We additionally include numbers and dates extracted by spaCy (Honnibal and Montani, 2017).

Entailment-based Evaluation. We further consider FactCC (Kryscinski et al., 2020), which evaluates the factual consistency of a system summary by predicting an entailment score between the source and the summary. We reproduce their method on our datasets.

Additional details for implementing the evaluation models and the entity extraction models are given in Appendix B.
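To make the scoring step concrete, APES_src reduces to a unigram F1 between the answer predicted from the system summary and the answer predicted from the source-side context. A sketch (ours; answer_fn is a hypothetical stand-in for the fine-tuned BERT QA model described above):

    from collections import Counter

    def unigram_f1(pred: str, gold: str) -> float:
        """Token-overlap F1 between two answer strings."""
        p, g = pred.lower().split(), gold.lower().split()
        overlap = sum((Counter(p) & Counter(g)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(p), overlap / len(g)
        return 2 * precision * recall / (precision + recall)

    def apes_src(questions, answer_fn, summary, context):
        """APES_src sketch: compare the summary-side answer a_sys
        against the source-side answer a_cxt for each cloze question."""
        scores = [unigram_f1(answer_fn(q, summary), answer_fn(q, context))
                  for q in questions]
        return sum(scores) / max(1, len(scores))

    # The original APES would instead compare answer_fn(q, summary)
    # against the gold masked phrase a_ref from the reference summary.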
6 Experimental Results

In this section, we start by describing training details in § 6.1. We then compare attention variants on documents of the same length (§ 6.2) and study whether reading more text generates more informative summaries (§ 6.3). We further report human evaluation on summary informativeness and faithfulness, as well as automatic faithfulness scores (§ 6.4). Finally, we investigate whether the automatic metrics correlate with human judgment (§ 6.5).

6.1 Training Details

We fine-tune BART (Lewis et al., 2020) for all experiments. We implement our models with PyTorch (Paszke et al., 2019) and Fairseq (Ott et al., 2019). Additional position embeddings are initialized randomly for models that handle longer inputs. The learning rate is set to 1e-4, and learning rate warm-up is applied for the first 10,000 steps. The Adafactor (Shazeer and Stern, 2018) optimizer with a gradient clipping of 0.1 is used. All models are trained on two Quadro RTX 6000 GPUs with 24GB of memory each, or one Quadro RTX 8000 with 48GB. We set a batch size of 2 per step and accumulate gradients every 32 steps. During test, we adopt a beam size of 4 and a length penalty of 2 (Wu et al., 2016) on all datasets.
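For reference, a batch size of 2 with 32-step accumulation gives an effective batch size of 64. A generic sketch of that loop (ours; the actual experiments use Fairseq's trainer with Adafactor, and model/loader here are hypothetical stand-ins):

    import torch

    ACCUM_STEPS = 32          # gradients accumulated over 32 micro-batches
    CLIP_NORM = 0.1           # gradient clipping value used in the paper

    def train_epoch(model, loader, optimizer):
        """Micro-batches of 2 with accumulation -> effective batch size 64."""
        optimizer.zero_grad()
        for step, (src, tgt) in enumerate(loader, start=1):
            loss = model(src, tgt)            # seq2seq cross-entropy loss
            (loss / ACCUM_STEPS).backward()   # scale so the sum matches one batch
            if step % ACCUM_STEPS == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM)
                optimizer.step()              # Adafactor in the actual setup
                optimizer.zero_grad()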
6.2 Comparing Attention Variants

Comparisons. We first experiment with articles that are all truncated at 1024 tokens. For encoder attentions, we consider the following variants: (1) sliding WINDOW; (2) adaptive span (ADASPAN); (3) GLOBAL tokens; (4) STRIDE; (5) RANDOM tokens; (6) Linformer (LIN.); (7) locality-sensitive hashing (LSH); and (8) SINKHORN. We ensure the models are comparable by setting hyperparameters to satisfy w = ŵ = k = l*b_l = 2*b_s = 256, so that the models have similar memory complexity. For LSH attentions, we select l = 4 rounds of hashing. Following prior work (Zaheer et al., 2020), we combine GLOBAL, STRIDE, and RANDOM with WINDOW and ADASPAN, where we set g = n^2/s = r = 128 for a fair comparison. We adapt Linformer to encoder-decoder attentions to compare with HEPOS, where we use s_h = n/k = 4 for all experiments. Finally, we report results using FULL, i.e., the original encoder and encoder-decoder attentions.
System                GovReport (new)       PubMed
                      R-1    R-2    R-L     R-1    R-2    R-L
FULL                  52.83  20.50  50.14   45.36  18.74  40.26
Encoder variants w/ full enc-dec attn.
I. Fixed Patterns
WINDOW                50.78  18.59  48.10   42.74  16.83  37.96
 + GLOBAL             51.24  19.01  48.58   43.44  17.07  38.55
 + STRIDE             51.53  19.14  48.68   43.73  17.25  38.82
 + RANDOM             51.49  18.90  48.75   43.38  16.87  38.45
ADASPAN               50.76  18.69  48.13   43.42  17.16  38.60
 + GLOBAL             50.33  18.56  47.80   43.24  17.01  38.42
 + STRIDE             51.56  19.19  48.57   43.71  17.25  38.76
 + RANDOM             51.39  18.89  48.74   43.28  16.87  38.45
II. Low-Rank Methods
LIN.                  50.70  18.48  47.85   43.65  17.12  38.71
III. Learnable Patterns
LSH                   51.95  19.36  48.85   44.74  18.07  39.76
SINKHORN              53.00* 20.05* 50.25*  45.10  18.40* 40.11*
Enc-dec variants w/ full encoder attn.
LIN.                  47.79  14.93  45.15   45.16  17.66  40.25
HEPOS (ours)          51.05* 19.44* 48.51*  45.80* 18.61* 40.69*
Enc-dec variants w/ Sinkhorn encoder attn.
LIN.                  42.90  12.86  40.32   44.84  17.65  39.98
HEPOS (ours)          51.34* 19.09* 48.73*  44.85  18.19* 39.91

Table 3: Results of evaluating encoder and encoder-decoder attentions on inputs of the same length. The best ROUGE scores among fixed patterns, learnable patterns, and enc-dec attentions are highlighted in the original paper. *: significantly better than the comparison(s) using the same encoder or enc-dec attention (approximate randomization test, p < 0.0005).

Results. Among all encoder variants, learnable patterns perform the best, approaching the performance of full attentions on both GovReport and PubMed, as shown in Table 3. Within the learnable patterns, Sinkhorn attention consistently obtains better ROUGE scores. Moreover, combining techniques in fixed patterns is more effective than simply using window-based sparse attentions, though with an increased memory cost.

For encoder-decoder attentions, HEPOS consistently yields higher ROUGE scores than Linformer on both datasets, using either a full or a Sinkhorn encoder. Notably, coupled with a Sinkhorn attention, our model's performance matches the variant using full encoder attention, implying the effectiveness of HEPOS at both identifying the salient content and capturing the global context.

6.3 Reading More Input Boosts Informativeness

We investigate whether processing more words generates more informative summaries.

Comparisons include recent top-performing abstractive models: PEGASUS (Zhang et al., 2019), a large pre-trained summarization model with truncated inputs; TLM (Pilault et al., 2020), DANCER (Gidiotis and Tsoumakas, 2020), and SEAL (Zhao et al., 2020), all of which use hybrid extract-then-abstract methods; and BIGBIRD (Zaheer et al., 2020), which combines sliding window, global, and random token attentions in the encoder.
For encoder variants, we pick the best performing model from the fixed patterns to combine with full encoder-decoder attention, i.e., sliding window with stride (STRIDE), the low-rank method (LIN.), and the learnable patterns (LSH and SINKHORN). We then combine the learnable patterns with HEPOS to support processing more text. All models consume as long an input as memory allows.

System (MAXLEN)       GovReport             PubMed
                      R-1    R-2    R-L     R-1    R-2    R-L
Baselines
PEGASUS (1024)        –      –      –       45.97  20.15  41.34
TLM (full)            –      –      –       42.13  16.27  39.21
SEAL (full)           –      –      –       46.50  20.10  42.20
DANCER (full)         –      –      –       46.34  19.97  42.42
BIGBIRD (3072)        –      –      –       46.32  20.65  42.33
Encoder variants w/ full enc-dec attn.
FULL (1024)           52.83  20.50  50.14   45.36  18.74  40.26
STRIDE (4096)         54.29  20.80  51.35   46.95  19.98  41.67
LIN. (3072)           44.84  13.87  41.94   43.69  16.35  38.66
LSH (4096)            54.75  21.36  51.27   47.54  20.79  42.22
SINKHORN (5120)       55.45  21.45  52.48   47.96  20.78  42.53
Encoder variants w/ HEPOS enc-dec attn. (ours)
LSH (7168)            55.00  21.13  51.67   48.12  21.06  42.72
SINKHORN (10240)      56.86  22.62  53.82   47.93  20.74  42.58

Table 4: ROUGE scores for models trained on the same GPU. SINKHORN with HEPOS enc-dec attention and LSH with HEPOS both read more text and obtain significantly better scores than the other models on GovReport and PubMed (p < 0.0005).

System (MAXLEN)       R-1    R-2    R-L
Baselines
PEGASUS (1024)        44.21  16.95  38.83
TLM (full)            41.62  14.69  38.03
SEAL (full)           44.3   18.0   39.3
DANCER (full)         45.01  17.60  40.56
BIGBIRD (3072)        46.63  19.02  41.77
Encoder variants w/ HEPOS enc-dec attn. (ours)
LSH (7168)            48.24  20.26  41.78
SINKHORN (10240)      47.87  20.00  41.50

Table 5: Automatic evaluation on arXiv. Our best model yields better ROUGE scores than previous state-of-the-art models.

Results. Overall, models that read more text obtain higher ROUGE scores, according to the results on GovReport and PubMed in Table 4. First, the different encoder variants with full encoder-decoder attentions attain better results than the full attention baseline, except Linformer. Second, adding HEPOS encoder-decoder attention almost doubles the number of words that can be processed and further improves the performance. This highlights the importance of handling both encoder attentions and encoder-decoder attentions efficiently. Notably, HEPOS with an LSH encoder achieves new state-of-the-art results on PubMed, outperforming BigBird, which only uses sparse attentions on the encoder. We also report the performance of our two best models with HEPOS on arXiv in Table 5, where they outperform all competitive abstractive models.

As can be seen from the sample summaries in Fig. 3, our model that reads in 10k tokens generates a more informative summary than the full attention model that only processes 1k tokens. Fig. 4 further shows that ROUGE-2 scores are consistently lifted when reading more input, with similar trends observed on ROUGE-1 and ROUGE-L. More sample outputs are presented in Appendix C.

[Figure 3: Sample summaries for a government report. The model with truncated input generates unfaithful content; HEPOS attention with a Sinkhorn encoder covers more salient information.

Human-written Summary: "In fiscal year 2018, Medicaid covered approximately 75 million individuals at an estimated cost of $629 billion, $393 billion of which were federal funds. (...) While CMS is generally required to disallow, or recoup, federal funds from states for eligibility-related improper payments if the state's eligibility error rate exceeds 3 percent, it has not done so for decades, because the method it used for calculating eligibility error rates was found to be insufficient for that purpose. To address this, in July 2017, CMS issued revised procedures through which it can recoup funds for eligibility errors, beginning in fiscal year 2022. (...)"

Model w/ full attn.: "Medicaid is a federal-state program that provides health care coverage to low-income individuals and families. (...) CMS officials stated that they have provided states with guidance on how to use data from SSA's automated system for eligibility determinations, (...) CMS officials said that they did not have guidance on when states should use SSA data to evaluate eligibility based on nonfinancial or financial criteria. (...)"

Model w/ HEPOS enc-dec attn. (ours): "The Patient Protection and Affordable Care Act (PPACA) expanded Medicaid coverage to millions of low-income adults and children with disabilities and their eligible dependents. (...) The selected states also reported that they did not have adequate processes to address these issues. CMS has taken steps to improve its oversight of the Medicaid program, including issuing guidance to states on the use of MAGI-exempt bases for determining eligibility, but these efforts have not been fully implemented. (...)"]

[Figure 4: ROUGE-2 when summarizing articles truncated at different lengths (2k–10k) by the best models: LSH (7168) + HEPOS on PubMed and SINKHORN (10240) + HEPOS on GovReport. Reading more consistently improves ROUGE-2.]

6.4 Reading More Input Improves Faithfulness

Here we first show human evaluation results on informativeness and unfaithful errors in the generated summaries. We sample 100 documents from GovReport and PubMed (50 each) with structured references that are labeled with aspects as described in § 4 and Appendix D. Each sample is evaluated by two fluent English speakers, who had cumulatively annotated tens of thousands of sentences for the same tasks before this work. Annotators are asked to label each summary sentence with an aspect and then to decide whether it contains any type of error. Three types of unfaithful errors are considered: (i) hallucination—fabricating content not present in the input; (ii) deletion—incorrectly deleting crucial entities, events, or clauses; and (iii) false concatenation—inappropriately concatenating components from different sentences. A 1 is given if any judge determines that a certain type of error exists in the sentence, and a 0 otherwise.

After reading the full summaries, each judge also scores aspect-level informativeness—whether the summary covers the important information of an aspect when compared with the reference.
System (MaxLen)       Inf.↑   Hal.↓   Del.↓   Concat.↓
GovReport
Encoder variants w/ full enc-dec attn.
FULL (1024)           3.29    15.2%   3.5%    9.5%
SINKHORN (5120)       3.32    11.0%   2.3%    9.4%
Encoder variants w/ HEPOS enc-dec attn. (ours)
SINKHORN (10240)      3.53    11.5%   3.4%    8.8%
PubMed
Encoder variants w/ full enc-dec attn.
FULL (1024)           3.27    20.1%   2.8%    14.3%
SINKHORN (5120)       3.94    4.8%    1.6%    9.6%
Encoder variants w/ HEPOS enc-dec attn. (ours)
SINKHORN (10240)      4.18    3.5%    2.2%    9.1%

Table 6: Human evaluation on informativeness (Inf.) (1-to-5), and percentages of unfaithful errors due to hallucination (Hal.), deletion (Del.), and false concatenation (Concat.). Inter-rater agreement with Krippendorff's α for all columns: 0.59, 0.59, 0.53, and 0.60.

All system summaries and references are presented in a random order. Human evaluation guidelines and sample summaries for the different aspects are included in Appendix D.

Results. Overall, reading more text significantly improves informativeness as well as reduces fabricated content. From Table 6, we observe that HEPOS attention, combined with a SINKHORN encoder, obtains better informativeness scores than comparisons that read in less text on both datasets. This echoes the results from automatic evaluation in the previous section. Moreover, both models that use efficient attentions reduce unfaithfulness, especially hallucination errors, when compared with the full attention model, which only reads 1024 tokens. As the models read more content, they learn to surface more factual and richer content in the summaries, as seen in Fig. 3.

Next, we explore whether reading more helps correctly reflect the content in documents' later sections. We plot aspect-level human ratings of informativeness and unfaithful errors on PubMed and GovReport in Fig. 5 and Fig. 6. We report percentages of sentences with unfaithful errors by majority voting (i.e., at least one error is found by both annotators in the sentence). As can be seen, our models consistently improve informativeness and reduce errors across sections, especially for "Results" and "Conclusions" on PubMed and for "What GAO recommends" on GovReport—these sections often appear in the later part of the source documents. In particular, we find that the full attention model tends to produce fabricated numbers in the resulting summaries, whereas our models are able to correct them.

[Figure 5: Aspect-level informativeness and percentages of sentences containing unfaithful errors as labeled by both human judges on PubMed, comparing Full(1k)+Full, Sinkhorn(5k)+Full, and Sinkhorn(10k)+Hepos. Models with efficient attentions reduce errors for later sections in the sources, e.g., "Results" and "Conclusion".]

[Figure 6: Aspect-level informativeness and percentages of sentences with unfaithful errors on GovReport, for the same three models.]

Lastly, we report the entailment-based FactCC score and the QA scores APES and APES_src for the top performing models in Table 7. The results again show that consuming longer input leads to more faithful summaries, though the differences are less pronounced.

6.5 Correlations between Human and Automatic Metrics

Finally, we study whether the faithfulness evaluation metrics correlate with human judgment. As shown in Table 8, on both government reports and scientific papers, the QA metrics are better correlated with human ratings, with our newly proposed APES_src being the stronger of the two.
System (MaxLen)       GovReport                PubMed
                      F.     APES   APES_src   F.     APES   APES_src
FULL (1024)           58.9   42.7   42.7       74.6   43.2   31.5
Encoder variants w/ full enc-dec attn.
STRIDE (4096)         55.3   43.1   42.5       72.7   43.8   31.9
LIN. (3072)           48.4   35.7   36.3       67.7   39.3   29.5
LSH (4096)            55.7   44.0   43.6       73.2   46.7   35.1
SINKHORN (5120)       57.0   43.6   42.1       72.9   46.8   35.4
Encoder variants w/ HEPOS enc-dec attn. (ours)
LSH (7168)            59.6   44.0   44.2       73.3   47.5   35.6
SINKHORN (10240)      60.1   44.0   44.3       71.9   46.2   34.8

Table 7: Evaluation with FactCC (F.), APES, and the new APES_src metric; higher numbers indicate more faithful summaries.

Metric      GovReport          PubMed
            Inf.↑   Err.↓      Inf.↑   Err.↓
FactCC      0.07    -0.08      0.10    -0.14
APES        0.16    -0.15      0.25    -0.31
APES_src    0.21    -0.23*     0.32*   -0.32

Table 8: Pearson correlation between human ratings and metrics. We use aggregated unfaithful errors (Err.). *: significantly better than the other metrics based on William's test (Williams, 1959) (p < 0.05).

After inspection, we find that human-written summaries contain paraphrases or acronyms that APES cannot capture via strict lexical matching. For instance, for the question "Diabetes may worsen ___ in patients", the reference answer is "death rate", whereas the answers from the source and the system summary are both "mortality". APES_src captures this, but APES does not.
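The correlation computation itself is standard; for completeness, a minimal sketch (ours, with hypothetical toy values) using SciPy:

    from scipy.stats import pearsonr

    # Hypothetical toy values: one metric score and one human informativeness
    # rating per system summary (the paper computes this per dataset).
    metric_scores = [0.42, 0.31, 0.55, 0.48, 0.29]
    human_ratings = [3.5, 2.8, 4.1, 3.9, 2.5]

    r, p_value = pearsonr(metric_scores, human_ratings)
    print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")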

7 Additional Related Work

Summarizing long inputs has been investigated in many domains, including books (Mihalcea and Ceylan, 2007), patents (Trappey et al., 2009), movie scripts (Gorinski and Lapata, 2015), and scientific publications (Qazvinian and Radev, 2008). However, the datasets are often too small to train neural models. Cohan et al. (2018) publish two large-scale datasets by collecting articles from arXiv and PubMed. Popular methods rely on extractive summarizers that identify salient sentences based on positional information (Dong et al., 2020) or on combined global and local contexts (Xiao and Carenini, 2019), where each sentence is represented as aggregated word embeddings. However, extractive summaries are often redundant and incoherent, highlighting the need for handling long documents via abstractive summarization.

To that end, extract-then-abstract methods have been proposed. For example, Pilault et al. (2020) first extract relevant sentences and then rewrite them into paper abstracts. Our work is in line with building end-to-end abstractive summarization models for long input. Cohan et al. (2018) design a hierarchical encoder to read different sections separately, and then use combined attentions over words and sections to generate the summary. Multiple agents can also be created to read segments separately and then collaboratively write an abstract (Celikyilmaz et al., 2018). However, both works truncate articles to 2K words. Although efficient encoder attentions have been studied by Zaheer et al. (2020) for abstractive summarization, at most 3K tokens can be consumed by their models. Our HEPOS encoder-decoder attention is able to process more than 10K tokens, significantly improving summary informativeness and faithfulness.

8 Conclusion

We investigate efficient attentions for long document summarization. We propose a novel encoder-decoder attention, HEPOS, based on head-wise positional strides, which can effectively identify salient content. Models based on HEPOS attention can process at least twice as many words and produce more informative summaries with fewer unfaithful errors, according to both automatic and human evaluation. We further show that our new cloze QA metric correlates better with human judgment than prior faithfulness evaluation metrics.

Acknowledgements

This research is supported in part by Oracle for Research Cloud Credits, National Science Foundation through Grant IIS-1813341, and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. We thank three anonymous reviewers for their valuable suggestions and comments.
References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer.

BioNLP. 2011. Genia event extraction (genia).

BioNLP. 2013. Genia event extraction for nfkb knowledge base.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2020. Rethinking attention with performers.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Yue Dong, Andrei Romascanu, and Jackie CK Cheung. 2020. HipoRank: Incorporating hierarchical and positional information into graph-based unsupervised long document extractive summarization. arXiv preprint arXiv:2005.00513.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3938–3948, Minneapolis, Minnesota. Association for Computational Linguistics.

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.

Alexios Gidiotis and Grigorios Tsoumakas. 2020. A divide-and-conquer approach to the summarization of long documents. arXiv: Computation and Language.

Philip John Gorinski and Mirella Lapata. 2015. Movie script summarization as graph-based scene extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1066–1076, Denver, Colorado. Association for Computational Linguistics.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana. Association for Computational Linguistics.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations.

Anastassia Kornilova and Vladimir Eidelman. 2019. BillSum: A corpus for automatic summarization of US legislation. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48–56, Hong Kong, China. Association for Computational Linguistics.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. 2019. Set Transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3744–3753, Long Beach, California, USA. PMLR.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. 2020. A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Rada Mihalcea and Hakan Ceylan. 2007. Explorations in automatic book summarization. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 380–389, Prague, Czech Republic. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Jonathan Pilault, Raymond Li, Sandeep Subramanian, and Chris Pal. 2020. On extractive and abstractive neural document summarization with transformer language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9308–9319, Online. Association for Computational Linguistics.

Vahed Qazvinian and Dragomir R. Radev. 2008. Scientific paper summarization using citation summary networks. arXiv preprint arXiv:0807.1560.

Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Florence, Italy. Association for Computational Linguistics.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604, Stockholmsmässan, Stockholm, Sweden. PMLR.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 331–335, Florence, Italy. Association for Computational Linguistics.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020a. Sparse Sinkhorn attention.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020b. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.

Amy J.C. Trappey, Charles V. Trappey, and Chun-Yi Wu. 2009. Automatic patent document summarization for collaborative knowledge systems and services. Journal of Systems Science and Systems Engineering, 18(1):71–94.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Christopher Walker, Stephanie Strassel, Stephanie Medero, and Kazuaki Maeda. 2006. ACE 2005 multilingual training corpus.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020a. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, Online. Association for Computational Linguistics.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Kinney, Ziyang Liu, William Merrill, et al. 2020b. CORD-19: The COVID-19 open research dataset. ArXiv.

Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020c. Linformer: Self-attention with linear complexity.

Chih-Hsuan Wei, Alexis Allot, Robert Leaman, and Zhiyong Lu. 2019. PubTator Central: automated concept annotation for biomedical full text articles. Nucleic Acids Research, 47(W1):W587–W593.

E.J. Williams. 1959. Regression Analysis. Wiley Series in Probability and Statistics: Applied Probability and Statistics Section. Wiley.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Wen Xiao and Giuseppe Carenini. 2019. Extractive summarization of long documents by combining global and local context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3011–3021, Hong Kong, China. Association for Computational Linguistics.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. CoRR, abs/1912.08777.

Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. SEAL: Segment-wise extractive-abstractive long-form text summarization.


A GovReport Dataset Collection and Processing

For GAO reports, their summaries are organized as highlights. We collect GAO reports that include corresponding highlights and were published before July 7, 2020. The reports and highlights are published as PDF files; most of the highlights are also reorganized and shown on the web page as HTML. Since PDF parsing is more prone to errors than web parsing, we only keep the reports whose highlights can be obtained from the corresponding web page, to ensure the quality of the extracted gold-standard summaries. For the reports, we first convert the PDF files to HTML using PDFMiner.[6] We then parse the HTML into sections and paragraphs with handcrafted parsing rules. We remove the reports that do not have cover pages, as our rules are constructed for documents with them. We further remove parsed documents with empty sections, non-capitalized section titles, or a single section, since these are common patterns of incorrectly parsed documents. Failed parsing would also result in short documents; therefore, we examine the reports with shorter lengths and filter out the 10% shortest reports.

[6] https://github.com/euske/pdfminer
We collect CRS reports that were published before May 20, 2020 from EveryCRSReport[7], where the original PDF files are already parsed into HTML. We only keep documents with expert-written summaries. We then gather the texts from the HTML files.

[7] https://www.everycrsreport.com

B Experiment Details

FactCC Training Data Construction. Kryscinski et al. (2020) generate training data by applying rule-based transformations to sentences from the source documents. We instead leverage reference summaries, where we train a FactCC model by reading a summary sentence (i.e., the claim) and a context to predict the corresponding label. A context is constructed by greedily selecting sentences that maximize the improvement of its ROUGE-2 when compared against the reference summary sentence. Following FactCC, we apply sentence negation, entity swap, and number swap to summary sentences to construct negative claims, and use the original sentences as positive claims. During testing, we first find the context for each system summary sentence. The model then predicts a sentence-level faithfulness score by reading the system summary sentence and the context.
faithfulness score by reading the system summary
The second model for scientific domain in-
sentence and the context.
formation extraction is trained on the Genia
Evaluation Model Training. We fine-tune 2011 (BioNLP, 2011), Genia 2013 (BioNLP,
BERT (Devlin et al., 2019) for both FactCC and 2013), and PubMed (Wei et al., 2019) datasets.
QA models. We include an additional classification It extracts entity such as Gene, Variant, Disease,
head to predict entailment label or answer spans Chemical, or Species, and events such as Gene
based on the [CLS] token. For GovReport dataset, Expression, Binding, Protein Modification, or Posi-
we consider a base version of BERT with uncased tive Regulation, etc. The full list of entity and event
tokens. For PubMed, we use a BERT model which types can be found in Table 9. To train this model,
is fine-tuned on PubMed abstracts to obtain better we fine-tune the BioBERT pre-trained model (Lee
performance8 . et al., 2020) on the COVID-19 Open Research
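The classification setup can be sketched with the Hugging Face transformers API; the hyperparameters, label set, and claim/context pair below are illustrative, not the exact training configuration.

    # Hedged sketch of the evaluation model: a BERT encoder with a
    # classification head over [CLS], via Hugging Face transformers.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "bert-base-uncased"  # for GovReport; the PubMed variant is
    # "monologg/biobert_v1.0_pubmed_pmc" per the footnote above.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)  # e.g., faithful vs. unfaithful for FactCC

    # The claim (summary sentence) and its context form a sentence pair;
    # the sequence classifier predicts from the [CLS] position.
    inputs = tokenizer("The agency issued 12 recommendations.",  # claim
                       "GAO is making 12 recommendations ...",   # context
                       truncation=True, return_tensors="pt")
    logits = model(**inputs).logits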
Entity Extraction Model. We use OneIE to extract entities from the reference summary (Lin et al., 2020). OneIE is a unified framework that combines entity, relation, and event extraction in one model. The model leverages pre-trained BERT weights as the sentence embedding to produce entities, relations, and events from a sentence. Two OneIE models are built.

The first model, for government reports, is trained on the Automatic Content Extraction (ACE) 2005 dataset (Walker et al., 2006). This model can extract entities from general conversation contexts, such as People, Location, or Organization, and events such as Movement, Conflict, or Justice.

The second model, for scientific domain information extraction, is trained on the Genia 2011 (BioNLP, 2011), Genia 2013 (BioNLP, 2013), and PubMed (Wei et al., 2019) datasets. It extracts entities such as Gene, Variant, Disease, Chemical, or Species, and events such as Gene Expression, Binding, Protein Modification, or Positive Regulation. The full list of entity and event types can be found in Table 9. To train this model, we fine-tune the BioBERT pre-trained model (Lee et al., 2020) on the COVID-19 Open Research (CORD-19) dataset (Wang et al., 2020b). This model is applied to the PubMed data.

                       Genia 2011   Genia 2013   PubMed
Entity Type
Anaphora                        -          105        -
Entity                        480          121        -
CellLine                        -            -      614
Chemical                        -            -   14,051
Disease                         -            -   62,228
Mutation                        -            -      164
Protein                    11,539        3,562   15,577
Species                         -            -   52,954
Event Type
Binding                       880          167        -
Gene Expression             2,076          666        -
Localization                  264           44        -
Negative Regulation           338          273        -
Phosphorylation               175          105        -
Positive Regulation         1,123          311        -
Protein Catabolism            100           23        -
Protein Modification            -            8        -
Regulation                    292           72        -
Transcription                 580           97        -
Ubiquitination                  -            4        -

Table 9: Dataset statistics for training the OneIE biomedical extraction model. While the Genia 2011 and 2013 datasets focus more on event extraction, PubMed covers more entity types.

C Additional Sample Outputs

We include two samples from GovReport and PubMed to further illustrate that our model with HEPOS attention generates more faithful and informative summaries, in Fig. 7 and Fig. 8.

D Human Evaluation Guideline

In human evaluation, annotators are asked to evaluate the system summaries generated for a report or a paper. In addition to the summaries, annotators are provided with the report or the paper to be summarized and a corresponding human-written reference.
reference. Human judges evaluate each system reference (only misses minor topics), and is
summary sentence by sentence. The annotation free of unfaithful errors.
consists of three tasks, which are described below.
Task 1: Aspect Labeling. First, annotators are • 4: Summary covers major key points (e.g., 80
asked to decide which aspect each sentence be- percent) and may miss one or two key points
longs to. For government reports, each sentence in the reference. Summary can contain one
should be categorized into three aspects: (1) Why unfaithful error.
GAO did this study, (2) What GAO found, and • 3: Summary covers roughly half of the key
(3) What GAO recommends. For scientific papers, points in the reference or contains 2 or 3 un-
summaries have four aspects: (1) Introduction and faithful errors.
Literature, (2) Methods, (3) Results, and (4) Dis-
cussion and Conclusion. Table 10 and Table 11 • 2: Summary only covers 1 or 2 key points
contain example reference summaries with labeled and misses many important topics (e.g. > 80
aspects. percent) in the reference, or contains more
Task 2: Sentence-level Faithfulness Error La- than 3 major unfaithful errors, e.g. summary
beling. Next, annotators will judge whether each fabricates or distorts some facts.
sentence contains any unfaithful content. Unfaith-
ful content is categorized into three types. A “0” • 1: Summary is irrelevant and does not cover
or “1” label will be given to each type, where “0” any content in the reference.
indicates the sentence is free of such type of error,
and “1” otherwise.
Concretely, unfaithful content is the fabricated
or contradictory content which is not present or
contradicts the facts in the source article. It can
also be ambiguous expression which distorts the
meaning. Here are detailed descriptions for the
three types of errors:

• Hallucination error refers to fabricated con-


tent that cannot be found or inferred from the
source.

• Misconstruction error that is due to deletion


of entities, events, or clauses, resulting in sen-
tences that are incomplete, missing context,
or ungrammatical.

• Misconstruction error that is caused by


false concatenation of content from different
places in the source.

Task 3: Aspect-level Summary Quality Rat-


ing. After reading the full summary, annotators
will evaluate the informativeness of the summary
for each aspect— whether the summary provides a
necessary and enough coverage of information in
the reference. For instance, whether the summary
covers all the salient points in “Why GAO did this
study".
Here are detailed descriptions of informative-
ness:

• 5: Summary covers enough key points in the


Summary by Human:
What GAO found
The Food and Drug Administration (FDA) and U.S. Department of Agriculture (USDA) have begun collaborating on regulatory oversight of cell-cultured meat. For example, in 2019, the agencies signed an interagency agreement and created three working groups to carry out the terms of the agreement. However, the agreement and working groups could more fully incorporate practices to enhance and sustain collaboration, such as defining outcomes. For example, the agreement identifies the development of labeling principles as an outcome, but does not describe how the agencies will track and monitor progress toward this outcome, and the working groups identify a lead agency but not members’ roles. (...)
Developing and updating written guidance and agreements is also a leading practice for interagency collaboration. By fully incorporating leading practices into their efforts to collaborate, the agencies could minimize potential overlap and fragmentation, use resources in a more efficient manner, and better ensure the public and other key stakeholders have clarity about the agencies’ oversight responsibilities. (...)

Model w/ full attn.:
What GAO found
The frequency of inspections the agencies conduct varies, as follows: FDA. FDA conducts risk-based inspections of high-risk and non-high-risk food facilities. FDA’s authority requires a risk-based approach, in which inspection rates vary depending on the level of risk associated with a food product. (...)
In 2010, FDA began conducting inspections of meat slaughter facilities to determine whether they were complying with the agency’s longstanding interpretation of its statutes requiring it to do so. In addition, USDA conducted inspections of slaughter facilities that process catfish to determine if they are complying with USDA’s requirement to inspect catfish slaughter facilities. USDA also conducts inspections of processing facilities to verify that they are following USDA’s requirements to inspect these facilities. In 2011, USDA began inspecting processing facilities for catfish for the first time. (...)

Model w/ HEPOS enc-dec attn.:
What GAO found
The Departments of Health and Human Services (HHS) and Agriculture have taken several steps to coordinate their oversight of the development, commercialization, and regulation of cell cultured meat and seafood, but they have not fully incorporated leading practices for interagency collaboration. For example, HHS and USDA have established interagency working groups that have met to discuss issues such as coordination and coordination. However, the agencies have not yet fully incorporated some leading practices, such as establishing clear roles and responsibilities, into their interagency agreements. (...)

Figure 7: Sample summaries for a government report. The model with truncated input generates unfaithful content. Our HEPOS encoder-decoder attention with Sinkhorn encoder attention covers more salient information in the “What GAO found” aspect.
Original Paper:
(...) anatomical abnormalities included complex pelvicalyceal anatomy in 98, pelviureteric junction obstruction (pujo) in 35, ectopic, horseshoe, cross-fused ectopic kidneys in 40, and duplex system with nonfunctioning upper or lower moiety in 14 patients. (...)
of the 1404 patients from rural areas, 547 (39%) were managed by open surgery as compared to 362 (22%) of the patients from urban areas. (...)
the mean size of the kidney stone was 5.05 ± 5.88 cm, ureter 1.95 ± 1.33 cm, and bladder 7.6 ± 3.2 cm. (...)
overall success rate was 90% in pcnl, 96% in urs, 100% in cystolithotripsy, and 81% in swl.

Summary by Human:
Results
of 3969 surgeries, 2794 (70%) were minimally invasive surgery (mis) techniques, to include swl 19%, pcnl 16%, urs 18.9%, and pucl + pccl 16%, and 1175 (30%) were open surgeries.
the main factors necessitating open surgery were large stone burden 37%, anatomical abnormalities 16%, stones with renal failure 34%, gross hydronephrosis with thin cortex 58%, urinary tract infection (uti) 25%, and failed mis 18%. nearly 50% of the surgeries were necessitated by economic constraints and long distance from center where one-time treatment was preferred by the patient. stone-free rates by open surgeries were pyelolithotomy 91%, ureterolithotomy 100%, and cystolithotomy 100%, with complication rate of up to 3%.

Model w/ full attn.:
Results
results: the mean age of the patients was 6.5 years. the mean follow-up period was 12.5 months. mean stone size was 1.5 cm. the majority of the stones were lager stones (n=454). the stone free rate was 100%. open surgery was performed in children with large bulk of stones, anatomical abnormalities, marked obstructive cortical atrophy and scarring, gross hydronephrosis, or urinary tract infection (uti).

Model w/ HEPOS enc-dec attn.:
Results
the mean age of the patients was 6.35 ± 3.7 years, with a male-to-female ratio of 2.8:1, and a range of 25 days (<1 month) to 15 years. anatomical abnormalities included complex pelvicalyceal anatomy in 98, pelviureteric junction obstruction (pujo) in 35, ectopic, horseshoe, cross-fused ectopic kidneys in 40, and duplex system with nonfunctioning upper or lower moiety in 14 patients. the mean size of the kidney stone was 5.05 ± 5.88 cm3. of the 1404 patients from rural areas, 547 (39%) were managed by surgery as compared to 362 (22%) patients from urban areas. overall success rate was 90% in pcnl, 96% in urs, 100% in cystolithotripsy, and 81% in swl.

Figure 8: Sample summaries for a scientific paper. The model with truncated input generates fabricated facts. Our HEPOS encoder-decoder attention with LSH encoder attention is more faithful for the aspect of “results”.
Aspect: Why GAO Did This Study
Example: To protect data that are shared with state government agencies, federal agencies have established cybersecurity requirements and related compliance assessment programs. Specifically, they have numerous cybersecurity requirements for states to follow when accessing, storing, and transmitting federal data. GAO was asked to evaluate federal agencies’ cybersecurity requirements and related assessment programs for state agencies. The objectives were to determine the extent to which (...)

Aspect: What GAO Found
Example: Although the Centers for Medicare and Medicaid Services (CMS), Federal Bureau of Investigation (FBI), Internal Revenue Service (IRS), and Social Security Administration (SSA) each established requirements to secure data that states receive, these requirements often had conflicting parameters. Such parameters involve agencies defining specific values like the number of consecutive unsuccessful logon attempts prior to locking out the user. Among the four federal agencies, the percentage of total requirements with conflicting parameters ranged from 49 percent to 79 percent. Regarding variance with National Institute of Standards and Technology guidance, GAO found that the extent to which the four agencies did not fully address guidance varied from 9 percent to 53 percent of total requirements. The variances were due in part to the federal agencies’ insufficient coordination in establishing requirements. (...)

Aspect: What GAO Recommends
Example: GAO is making 12 recommendations to the four selected agencies and to OMB. Three agencies agreed with the recommendations and one agency (IRS) partially agreed or disagreed with them. OMB did not provide comments. GAO continues to believe all recommendations are warranted.

Table 10: Sample reference summary with aspects in a GAO report.
Aspect: Introduction and Literature
Keywords: introduction, case, objectives, purposes, objective, purpose, background, literature, related work
Example: background: the present study was carried out to assess the effects of community nutrition intervention based on advocacy approach on malnutrition status among school-aged children in shiraz, iran.
Example: introduction. low serum vitamin d levels are associated with increased postural sway. vitamin d varies seasonally. this study investigates whether postural sway varies seasonally and is associated with serum vitamin d and falls.

Aspect: Methods
Keywords: materials and methods, techniques, methodology, materials, research design, study design
Example: materials and methods: this case-control nutritional intervention has been done between 2008 and 2009 on 2897 primary and secondary school boys and girls (7-13 years old) based on advocacy approach in shiraz, iran. the project provided nutritious snacks in public schools over a 2-year period along with advocacy oriented actions in order to implement and promote nutritional intervention. for evaluation of effectiveness of the intervention, growth monitoring indices of pre- and post-intervention were statistically compared.

Aspect: Results
Keywords: results, experiments, observations
Example: results: the frequency of subjects with body mass index lower than 5% decreased significantly after intervention among girls (p = 0.02). however, there were no significant changes among boys or total population. (...)

Aspect: Discussion and Conclusion
Keywords: discussion, limitation, conclusions, concluding
Example: conclusion: this study demonstrates the potential success and scalability of school feeding programs in iran. community nutrition intervention based on the advocacy process model is effective on reducing the prevalence of underweight specifically among female school aged children.

Table 11: Sample reference summary with aspects labeled in a PubMed article. Keywords are used to match different parts of the summaries to the four aspects.
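For concreteness, here is a minimal Python sketch of this keyword matching, using the keywords from Table 11; matching at the level of summary section headers, and the tie-breaking behavior, are our assumptions rather than the exact script.

    # Illustrative keyword matching of summary section headers to aspects.
    ASPECT_KEYWORDS = {
        "Introduction and Literature": ["introduction", "case", "objectives",
            "purposes", "objective", "purpose", "background", "literature",
            "related work"],
        "Methods": ["materials and methods", "techniques", "methodology",
            "materials", "research design", "study design"],
        "Results": ["results", "experiments", "observations"],
        "Discussion and Conclusion": ["discussion", "limitation",
            "conclusions", "concluding"],
    }

    def label_aspect(section_header):
        """Return the first aspect whose keyword appears in the header."""
        header = section_header.lower()
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if any(kw in header for kw in keywords):
                return aspect
        return None

    # e.g., label_aspect("materials and methods :") -> "Methods"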