arXiv:1905.05818v1 [cs.CL] 14 May 2019
Ontology-Aware Clinical Abstractive Summarization
Sean MacAvaney∗
IRLab, Georgetown University
[email protected]

Sajad Sotudeh∗
IRLab, Georgetown University
[email protected]

Arman Cohan
Allen Institute for Artificial Intelligence
[email protected]

Nazli Goharian
IRLab, Georgetown University
[email protected]

Ish Talati
Department of Radiology, Georgetown University
[email protected]

Ross W. Filice
MedStar Georgetown University Hospital
[email protected]
Figure 1: Abbreviated example of radiology note and its summary.

FINDINGS:
LIVER: Liver is echogenic with slightly coarsened echotexture and mildly nodular contour. No focal lesion. Right hepatic lobe measures 14 cm in length.
BILE DUCTS: No biliary ductal dilatation. Common bile duct measures 0.06 cm.
GALLBLADDER: Partially visualized gallbladder shows multiple gallstones without pericholecystic fluid or wall thickening.
Proximal TIPS: 108 cm/sec, previously 82 cm/sec; Mid TIPS: 123 cm/sec, previously 118 cm/sec; Distal TIPS: 85 cm/sec, previously 86 cm/sec;
PORTAL VENOUS SYSTEM: [...]

IMPRESSION (Summary):
1. Stable examination. Patent TIPS.
2. Limited evaluation of gallbladder shows cholelithiasis.
3. Cirrhotic liver morphology without biliary ductal dilatation.
ABSTRACT
Automatically generating accurate summaries from clinical reports
could save a clinician’s time, improve summary coverage, and reduce errors. We propose a sequence-to-sequence abstractive summarization model augmented with domain-specific ontological information to enhance content selection and summary generation.
We apply our method to a dataset of radiology reports and show
that it significantly outperforms the current state-of-the-art on this
task in terms of ROUGE scores. Extensive human evaluation conducted by a radiologist further indicates that this approach yields
summaries that are less likely to omit important details, without
sacrificing readability or accuracy.
ACM Reference Format:
Sean MacAvaney, Sajad Sotudeh, Arman Cohan, Nazli Goharian, Ish Talati, and Ross W. Filice. 2019. Ontology-Aware Clinical Abstractive Summarization. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3331184.3331319
1 INTRODUCTION
Clinical note summaries are critical to the clinical process. After
writing a detailed note about a clinical encounter, practitioners
often write a short summary called an impression (example shown
in Figure 1). This summary is important because it is often the
primary document of the encounter considered when reviewing
a patient’s clinical history. The summary allows for a quick view
of the most important information from the report. Automated
summarization of clinical notes could save clinicians’ time, and
has the potential to capture important aspects of the note that the
author might not have considered [7]. If high-quality summaries
are generated frequently, the practitioner may only need to review
the summary and occasionally make minor edits.
∗ Both authors contributed equally to this research.
Recently, neural abstractive summarization models have shown
successful results [1, 11, 13, 14]. While promising in general domains, existing abstractive models can suffer from deficiencies in
content accuracy and completeness [18], which is a critical issue
in the medical domain. For instance, when summarizing a clinical
note, it is crucial to include all the main diagnoses in the summary
accurately. To overcome this challenge, we propose an extension to
the pointer-generator model [14] that incorporates domain-specific
knowledge for more accurate content selection. Specifically, we
link entities in the clinical text with a domain-specific medical ontology (e.g., RadLex¹ or UMLS²), and encode them into a separate
context vector, which is then used to aid the generation process.
We train and evaluate our proposed model on a large collection
of real-world radiology findings and impressions from a large
urban hospital, MedStar Georgetown University Hospital. Results
using the ROUGE evaluation metric indicate statistically significant
improvements over existing state-of-the-art summarization models.
Further extensive human evaluation by a radiology expert demonstrates that our method produces more complete summaries than
the top-performing baseline, while not sacrificing readability or
accuracy.
In summary, our contributions are: 1) An approach for incorporating domain-specific information into an abstractive summarization model, allowing for domain-informed decoding; and 2)
Extensive automatic and human evaluation on a large collection of
radiology notes, demonstrating the effectiveness of our model and
providing insights into the qualities of our approach.
1.1 Related Work
Recent trends in abstractive summarization are based on sequence-to-sequence (seq2seq) neural networks with the incorporation of
¹ RadLex version 3.10, http://www.radlex.org/Files/radlex3.10.xlsx
² https://www.nlm.nih.gov/research/umls/
attention [13], a copying mechanism [14], reinforcement learning objectives [8, 12], and coverage tracking [14]. While successful, a few
recent studies have shown that neural abstractive summarization
models can have high readability, but fall short in generating accurate and complete content [6, 18]. Content accuracy is especially
crucial in the medical domain. In contrast with prior work, we focus on
improving summary completeness using a medical ontology. Gigioli et al. [8] used a reinforced loss for abstractive summarization in
the medical domain, although their focus was headline generation
from medical literature abstracts. Here, we focus on summarization
of clinical notes where content accuracy and completeness are more
critical. The most relevant work to ours is by Zhang et al. [19] where
an additional section from the radiology report (background) is
used to improve summarization. Extensive automated and human
evaluation and analyses demonstrate the benefits of our proposed
model in comparison with existing work.
2 MODEL
Pointer-generator network (PG). Standard neural approaches for abstractive summarization follow the seq2seq framework, where an encoder network reads the input and a separate decoder network (often augmented with an attention mechanism) learns to generate the summary [17]. Bidirectional LSTMs (BiLSTMs) [9] are often used as the encoder, paired with an LSTM decoder. A more recent successful summarization model, the pointer-generator network, also allows the decoder to directly copy text from the input in addition to generating from a vocabulary [14]. Given a report x = {x_1, x_2, ..., x_n}, the encoded input sequence h = BiLSTM(x), and the current decoding state s_t = LSTM(x')[t], where x' is the input to the decoder (i.e., the gold-standard summary token at training time or the previously generated token at inference time), the model computes attention weights over the input terms: a = softmax(h^⊤ W_1 s^⊤). The attention scores are used to compute a context vector c, a weighted sum over the input, c = Σ_{i}^{n} a_i h_i, which is used along with the output of the decoder LSTM to either generate the next term from a known vocabulary or copy the token from the input sequence with the highest attention value. We refer the reader to See et al. [14] for additional details on the pointer-generator architecture.
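To make this step concrete, here is a minimal PyTorch sketch of the attention and context-vector computation described above; tensor names and dimensions are our assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def attention_context(h, s_t, W1):
    """Compute a = softmax(h^T W1 s^T) and c = sum_i a_i h_i (sketch).

    h:   (n, d_h)  encoder states, one row per input token
    s_t: (d_s,)    current decoder state
    W1:  (d_h, d_s) learned bilinear attention weights
    """
    scores = h @ W1 @ s_t            # (n,) one attention score per input token
    a = F.softmax(scores, dim=0)     # attention distribution over input tokens
    c = (a.unsqueeze(1) * h).sum(0)  # (d_h,) context: weighted sum of states
    return a, c
```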
Ontology-aware pointer-generator (Ontology PG). In this work, we propose an extension of the pointer-generator network that leverages domain-specific knowledge encoded in an ontology to improve clinical summarization. We introduce a new encoded sequence u = {u_1, ..., u_{n'}} that is the result of linking an ontology U to the input text. In other words, u = F_U(x), where F_U is a mapping function, e.g., a simple mapping function that only outputs a word sequence if it appears in the ontology and otherwise skips it. We then use a second BiLSTM to encode these additional ontology terms, similar to the way the original input is encoded: h_u = BiLSTM(u). We then calculate an additional context vector c' that incorporates the domain-ontology information:

a' = softmax(h_u^⊤ W_2 s^⊤);   c' = Σ_{i}^{n'} a'_i h_i^u    (1)
The second context vector acts as additional global information to aid the decoding process, akin to how Zhang et al. [19] include background information from the report. We modify the decoder LSTM to include the ontology-aware context vector in the decoding process. Recall that an LSTM network controls the flow of its previous state and the current input using several gates (input gate i, forget gate f, and output gate o), where each gate is a vector computed from an additive combination of the previous LSTM state and the current input. For example, for the forget gate we have f_t = σ(W_f [s_{t−1}; x'_t] + b), where s_{t−1} is the previous decoder state, x'_t is the decoder input, and ";" denotes concatenation (for more details on LSTMs, refer to [9]). The ontology-aware context vector c' is passed as additional input to this function for all the LSTM gates; e.g., for the forget gate we then have f_t = σ(W_f [s_{t−1}; x'_t; c'] + b). This intuitively guides the information flow in the decoder using the ontology information.
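As an illustration of this gate modification, the following PyTorch sketch implements a single decoder step whose gates all receive the ontology context vector c'; the module and dimension names are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class OntologyAwareLSTMCell(nn.Module):
    """Decoder LSTM cell whose gate inputs are [s_{t-1}; x'_t; c'] (sketch)."""

    def __init__(self, emb_dim, hid_dim, ctx_dim):
        super().__init__()
        in_dim = hid_dim + emb_dim + ctx_dim     # [s_{t-1}; x'_t; c']
        self.W_i = nn.Linear(in_dim, hid_dim)    # input gate
        self.W_f = nn.Linear(in_dim, hid_dim)    # forget gate
        self.W_o = nn.Linear(in_dim, hid_dim)    # output gate
        self.W_g = nn.Linear(in_dim, hid_dim)    # candidate cell state

    def forward(self, s_prev, cell_prev, x_t, c_ont):
        z = torch.cat([s_prev, x_t, c_ont], dim=-1)
        i = torch.sigmoid(self.W_i(z))           # input gate
        f = torch.sigmoid(self.W_f(z))           # forget gate, as in Section 2
        o = torch.sigmoid(self.W_o(z))           # output gate
        g = torch.tanh(self.W_g(z))              # candidate cell state
        cell = f * cell_prev + i * g             # updated cell state
        s = o * torch.tanh(cell)                 # new decoder state s_t
        return s, cell
```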
3 EXPERIMENTAL SETUP
We train and evaluate our model on a dataset of 41,066 real-world radiology reports from MedStar Georgetown University Hospital, covering a variety of imaging modalities (e.g., x-rays, CT scans). The dataset is randomly split into 80-10-10 train-dev-test splits. Each report describes the clinical findings of a specific diagnostic case and includes an impression summary (as shown in Figure 1). The findings sections are 136.6 tokens long on average, and the impression sections are 37.1 tokens on average.
Performing cross-institutional evaluation is challenging and beyond the scope of this work due to the varying nature of reports
between institutions. For instance, the public Indiana University
radiology dataset [4] consists only of chest x-rays, and has much
shorter reports (average length of findings: 40.0 tokens; average
length of impressions: 10.5 tokens). Thus, in this work, we focus
on summarization within a single institution.
Ontologies. We employ two ontologies in this work. UMLS is a
general medical ontology maintained by the US National Library of
Medicine and includes various procedures, conditions, symptoms,
body parts, etc. We use QuickUMLS [15] (a fuzzy UMLS concept
matcher) with a Jaccard similarity threshold of 0.7 and a window
size of 3 to extract UMLS concepts from the radiology findings.
We also evaluate using an ontology focused on radiology, RadLex,
which is a widely-used ontology of radiological terms maintained by
the Radiological Society of North America. It consists of 68,534 radiological concepts organized according to a hierarchical structure.
We use exact n-gram matching to find important radiological entities, only considering RadLex concepts at a depth of 8 or greater.³
In pilot studies, we found that the entities between depths 8 and
20 tend to represent concrete entities (e.g., ‘thoracolumbar spine
region’) rather than abstract categories (e.g., ‘anatomical entity’).
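The mapping function F_U from Section 2 could look like the following sketch under these two ontologies: QuickUMLS is the concept matcher cited above (it requires a local UMLS installation), while the RadLex matcher is a plain n-gram lookup; the helper names, path argument, and pre-filtered concept set are our assumptions.

```python
from quickumls import QuickUMLS  # pip install quickumls; needs a local UMLS install

def umls_terms(text, quickumls_path):
    """Extract UMLS concept mentions with the settings described above."""
    matcher = QuickUMLS(quickumls_path, threshold=0.7, window=3,
                        similarity_name='jaccard')
    # match() returns one group of candidate concepts per matched span
    return [m['ngram'] for group in matcher.match(text) for m in group]

def radlex_terms(tokens, radlex_ngrams, max_n=5):
    """Exact n-gram lookup; radlex_ngrams is a set of lowercased concept
    names pre-filtered to tree depth >= 8 (assumed preprocessing)."""
    found = []
    for n in range(max_n, 0, -1):                 # prefer longer matches
        for i in range(len(tokens) - n + 1):
            ngram = ' '.join(tokens[i:i + n]).lower()
            if ngram in radlex_ngrams:
                found.append(ngram)
    return found
```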
Comparison. We compare our model to well-established extractive baselines as well as state-of-the-art abstractive summarization models.
- LSA [16]: An extractive vector-space summarization model based on Singular Value Decomposition (SVD).
- LexRank [5]: An extractive method that employs graph-based centrality ranking of sentences.⁴
- Pointer-Generator (PG) [14]: An abstractive seq2seq attention summarization model that incorporates a copy mechanism to directly copy text from the input where appropriate.
- Background-Aware Pointer-Generator (Back. PG) [19]: An extension of PG specifically designed to improve radiology note summarization by encoding the Background section of the report to aid the decoding process.⁵

³ The maximum tree depth is 20.
⁴ For LSA and LexRank, we use the Sumy implementation (https://pypi.python.org/pypi/sumy) with the top 3 sentences.
⁵ Using the authors' code at github.com/yuhaozhang/summarize-radiology-findings
Figure 2: Average attention weight comparison between our approach (RadLex PG) and the baseline (PG). Color differences show which terms each model attends to more while generating the summary; RadLex concepts at depth 8 or greater are marked with *. Our approach attends to more RadLex terms throughout the document, allowing for more complete summaries. The two findings excerpts shown in the figure:

(a) "there is no fracture within either hip* or the visualized bony pelvis*. there is mild narrowing of the right hip* joint* with marginal osteophytes. limited evaluation of the left hip* is unremarkable."

(b) "no dense airspace* consolidation*. no pleural* effusion* or pneumothorax. cardiac silhouette is normal. mildly prominent pulmonary vascularity."
Table 1: ROUGE results on MedStar Georgetown University Hospital's development and test sets. Both the UMLS and RadLex ontology PG models are statistically better than the other models (paired t-test, p < 0.05).

                    |     Development       |         Test
Model               | RG-1  | RG-2  | RG-L  | RG-1  | RG-2  | RG-L
--------------------|-------|-------|-------|-------|-------|------
LexRank [5]         | 27.60 | 13.85 | 25.79 | 28.02 | 14.26 | 26.24
LSA [16]            | 28.04 | 14.68 | 26.15 | 28.16 | 14.71 | 26.27
PG [14]             | 36.60 | 21.73 | 35.40 | 37.17 | 22.36 | 35.45
Back. PG [19]       | 36.58 | 21.86 | 35.39 | 36.95 | 22.37 | 35.68
UMLS PG (ours)      | 37.41 | 22.23 | 36.10 | 37.98 | 23.14 | 36.67
RadLex PG (ours)    | 37.64 | 22.45 | 36.33 | 38.42 | 23.29 | 37.02
Parameters and training. We use 100-dimensional GloVe embeddings pre-trained over a large corpus of 4.5 million radiology reports [19], a 2-layer BiLSTM encoder with a hidden size of 100, and a 1-layer LSTM decoder with a hidden size of 200. At inference time, we use beam search with a beam size of 5. We use a dropout of 0.5 in all models, and train to optimize negative log-likelihood loss using the Adam optimizer [10] with a learning rate of 0.001.
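For quick reference, the setup above can be captured as a configuration sketch (the key names are ours, not from a released codebase):

```python
# Hedged summary of the training configuration described above.
CONFIG = {
    "embeddings": "GloVe-100d (pre-trained on 4.5M radiology reports [19])",
    "encoder": {"type": "BiLSTM", "layers": 2, "hidden_size": 100},
    "decoder": {"type": "LSTM", "layers": 1, "hidden_size": 200},
    "dropout": 0.5,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "loss": "negative log-likelihood",
    "beam_size": 5,  # inference-time beam search
}
```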
4 RESULTS AND ANALYSIS

4.1 Experimental results
Table 1 presents ROUGE evaluation results of our model compared with the baselines (against human-written impressions). The extractive summarization methods (LexRank and LSA) perform particularly poorly. This may be because these approaches are limited to selecting sentences from the text, and the most central sentences may not be the most important for building an effective impression summary. Interestingly, the Back. PG approach (which uses the background section of the report to guide the decoding process) is ineffective on our dataset. This may be due to differences in conventions across institutions, such as what information is included in a report's background and what is considered important to include in its impression. We observe that our ontology-aware models (UMLS PG and RadLex PG) significantly outperform all other approaches (paired t-test, p < 0.05) on both the development and test sets. The RadLex model slightly outperforms the UMLS model, suggesting that the radiology-specific ontology is beneficial (though the difference between UMLS and RadLex is not statistically significant). We also experimented with incorporating both ontologies in the model simultaneously, but this resulted in slightly lower performance (1.26% lower than the best model on ROUGE-1). To verify that including ontological concepts in the decoder helps the model identify and focus on more radiology terms, we examined the attention weights. In Figure 2, we show attention plots for two reports, comparing the attention of our approach and PG. The plots show that our approach spreads attention weight across radiological terms throughout the findings, potentially helping the model to capture a more complete summary.
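The significance comparisons reported above could be computed along these lines (a sketch; `scores_a` and `scores_b` are assumed per-report ROUGE values for two systems):

```python
from scipy.stats import ttest_rel

def paired_significance(scores_a, scores_b, alpha=0.05):
    """Paired t-test over per-report ROUGE scores of two systems (sketch)."""
    t_stat, p_value = ttest_rel(scores_a, scores_b)
    return p_value < alpha, t_stat
```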
4.2 Expert human evaluation
While our approach surpasses state-of-the-art results on our dataset in terms of ROUGE scores, we recognize the limitations of the ROUGE framework for evaluating summarization [2, 3]. To gain better insight into how and why our methodology performs better, we also conducted an expert human evaluation. A domain expert (a radiologist familiar with the process of writing radiological findings and impressions) evaluated 100 reports. Each report consists of the radiology findings, one manually-written impression, one impression generated using PG, and one impression generated using our ontology PG method (with RadLex). In each sample, the order of the impressions is shuffled to avoid bias between samples. Samples were randomly chosen from the test set, one from each of 100 evenly-spaced bins sorted by our system's ROUGE-1 score (see the sketch after the list below). The radiologist was asked to score each impression on a scale of 1 (worst) to 5 (best) in terms of the following:
- Readability. Impression is understandable (5) or gibberish (1).
- Accuracy. Impression is fully accurate (5), or contains critical
errors (1).
- Completeness. Impression contains all important information
(5), or is missing important points (1).
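The bin-based sample selection described above might look like the following (a sketch; the exact binning procedure is not specified beyond "100 evenly-spaced bins"):

```python
import numpy as np

def one_report_per_bin(rouge1_scores, n_bins=100, seed=0):
    """Pick one test report from each of n_bins evenly-spaced ROUGE-1 bins."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(rouge1_scores)
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    bin_ids = np.digitize(scores, edges[1:-1])   # bin index in [0, n_bins)
    picks = []
    for b in range(n_bins):
        members = np.where(bin_ids == b)[0]
        if members.size:                          # skip empty bins
            picks.append(int(rng.choice(members)))
    return picks
```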
We present our manual evaluation results using histograms and arrow plots in Figure 3. The histograms indicate the score distributions of each approach, and the arrows indicate how the scores changed. The starting point of an arrow indicates the score of the impression we compare against (either the human-written impression or the summary generated by PG), and the head of an arrow indicates the score of our approach. The number next to each arrow indicates how many reports made that transition. The figure shows that our approach improves completeness considerably, while maintaining readability and accuracy. The major improvement in completeness is between scores of 3 and 4, where there is a net gain of 10 reports.
[Figure 3: Histograms and arrow plots depicting differences between impressions of 100 manually-scored radiology reports. Panels show Readability (a, d), Accuracy (b, e), and Completeness (c, f) score distributions (1 to 5) for PG (a-c) and RadLex PG (d-f), each compared with the manually-written impressions. Although challenges remain to reach human parity for all metrics, our approach makes strong gains to address the problem of report completeness (c, f), as compared to the next leading summarization approach (PG).]

Completeness is particularly important because it is where
existing summarization models, such as PG, are currently lacking compared to human performance. Despite the remaining gap between human and generated completeness, our approach yields considerable gains toward human-level completeness. Our model is nearly as accurate as human-written summaries, making critical errors (scores of 1 or 2) in only 5% of the cases evaluated, compared to 8% of cases for PG. No critical errors were found in the human-written summaries, although these summaries go through a manual review process to ensure accuracy.
The expert annotator also conducted a blind qualitative analysis to better understand when our model does better and how it can be further improved. In line with the completeness score improvements, the annotator noted that in many cases our approach identifies pertinent points associated with RadLex terms that the PG model misses. In some cases, such as when the author picked only one main point, our approach was able to pick up important items that the author missed. Interestingly, it was also able to include specific measurement details better than the PG network, even though these measurements do not appear in RadLex. Although readability is generally strong, our approach sometimes generates repetitive sentences and syntactic errors more often than humans do. These could be addressed in future work with additional post-processing heuristics, such as removing repetitive n-grams as done in [12]. In terms of accuracy, our approach sometimes mixes up the 'left' and 'right' sides. This often occurs with findings that mention both sides of a specific body part. Multi-level attention (e.g., [1]) could address this by forcing the model to focus on important segments of the text.
There were also some cases where our model under-performed in terms of accuracy and completeness due to synonymy that is not captured by RadLex. For instance, in one case our model failed to identify torsion, likely because the findings section referred to it as twisting (a term that does not appear in RadLex).
5 CONCLUSION
In this work, we present an approach for informing clinical summarization models of ontological information. This is accomplished by
providing an encoding of ontological terms matched in the original
text as an additional feature to guide the decoding. We find that our
system exceeds state-of-the-art performance at this task, producing
summaries that are more comprehensive than those generated by
other methods, while not sacrificing readability or accuracy.
ACKNOWLEDGMENTS
We thank the additional residents, Lee McDaniel and Marlie Philiossaint, for their contributions to the evaluations, as well as the anonymous reviewers for their useful feedback.
REFERENCES
[1] Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim,
Walter Chang, and Nazli Goharian. 2018. A Discourse-Aware Attention Model
for Abstractive Summarization of Long Documents. In NAACL-HLT.
[2] Arman Cohan and Nazli Goharian. 2016. Revisiting Summarization Evaluation
for Scientific Articles. Proc. of 11th Conference on LREC (2016), 806–813.
[3] John M. Conroy and Hoa Trang Dang. 2008. Mind the Gap: Dangers of Divorcing
Evaluations of Summary Content from Linguistic Quality. In COLING.
[4] Dina Demner-Fushman, Marc D Kohli, Marc B Rosenman, Sonya E Shooshan,
Laritza Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald.
2015. Preparing a collection of radiology examinations for distribution and
retrieval. Journal of the American Medical Informatics Association (2015).
[5] Güneş Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based Lexical Centrality As Salience in Text Summarization. J. Artif. Int. Res. (2004), 457–479.
[6] Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-Up
Abstractive Summarization. In EMNLP.
[7] Esteban F Gershanik, Ronilda Lacson, and Ramin Khorasani. 2011. Critical finding
capture in the impression section of radiology reports. In AMIA.
[8] Paul Gigioli, Nikhita Sagar, Anand S. Rao, and Joseph Voyles. 2018. Domain-Aware
Abstractive Text Summarization for Medical Documents. In IEEE BIBM.
[9] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural
computation 9, 8 (1997), 1735–1780.
[10] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
[11] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive Text Summarization using Sequence-to-Sequence RNNs and Beyond. In CoNLL.
[12] Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A Deep Reinforced
Model for Abstractive Summarization. CoRR (2017).
[13] Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A Neural Attention
Model for Abstractive Sentence Summarization. In EMNLP.
[14] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point:
Summarization with Pointer-Generator Networks. ACL (2017).
[15] Luca Soldaini and Nazli Goharian. 2016. Quickumls: a fast, unsupervised approach
for medical concept extraction. In MedIR Workshop, SIGIR.
[16] Josef Steinberger and Karel Ježek. 2004. Using latent semantic analysis in text
summarization and summary evaluation. In ISIM.
[17] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning
with neural networks. In NIPS. 3104–3112.
[18] Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2017. Challenges in
Data-to-Document Generation. In EMNLP.
[19] Yuhao Zhang, Daisy Yi Ding, Tianpei Qian, Christopher D. Manning, and Curtis P.
Langlotz. 2018. Learning to Summarize Radiology Findings. In EMNLP Workshop
on Health Text Mining and Information Analysis.