Generation by Selection and Repair As A Method For Adapting Text For The Individual Reader
Eduard Hovy
USC / Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292-6695, U.S.A.
Figure 2: The HealthDoc home page, tailored by WebbeDoc for a highly technical layperson
Figure 3: The HealthDoc home page, tailored by WebbeDoc for a project funder
Figure 4: The HealthDoc home page, tailored by WebbeDoc for a computational linguist
textual customization, WebbeDoc can tailor the document’s style and form of presentation; it can select, according to the user profile, the most appropriate artwork, font, colour, and general layout.
A WebbeDoc master document can also incorporate hypertext variations, which can be embedded within larger textual variations, or linked to specific lexical variations. For example, WebbeDoc can select from among a set of near-synonyms according to the user profile, with each near-synonym linked to a stylistically and semantically appropriate hypertext variation.
WebbeDoc represents the first phase in the implementation of the ideas and mechanisms of HealthDoc. The project Web page that it customizes is itself a master document, but in this initial implementation, we have implemented a form of “generation by selection only”: the structure of the master document is tightly constrained so that, after selection, no repairs will be needed to produce a coherent and stylistically adequate text.
The key to WebbeDoc’s ability to produce tailored documents by selection from a single master document is the manner of representation of the master document: a WebbeDoc master document has a well-defined structure of ordering relations, rhetorical relations, and other linguistic information, such as coreference links. In the first prototype, the master document was built manually according to our model of a master document, with additional structural constraints imposed so that piecewise selection and recombination would not create any infelicities such as abrupt changes of topic, unnecessary duplications of noun phrases, or unresolvable pronouns.
But to compose a master document of this style and internal complexity required the efforts of computational linguists, rhetoricians, and Web document designers; obviously this is not realistic for the average Web user! In a realistic and usable implementation, WebbeDoc would need an authoring tool and a sentence planner that could work in real-time to repair and polish the selected text; we can’t expect the average Web document author to pre-compile all the possible combinations in advance. Therefore, to develop such a system, a number of research issues must be addressed, including representation of the master document; authoring and knowledge-based document management; and sentence planning for automated post-editing.
The next step: Adapting Web pages for the individual user by selecting and repairing text
Representing a master document
Text Specification Language, or TSL, is the language used to represent master documents in the parent HealthDoc system. We anticipate that WebbeDoc master documents will have a hybrid representation: part TSL (for the portions that will be subject to syntactic or stylistic repair), and part “frozen” English text (for the portions that need never be revised). We have defined TSL to incorporate structures represented in Sentence Plan Language (SPL), the specification language for the Penman text generation system (Penman Natural Language Group 1989), whose KPML derivation (Bateman 1995) is used in HealthDoc.[5] An SPL expression is an abstract specification of a sentence, which Penman can convert to the corresponding surface form. This permits expression of the content of the document. The basic SPL structures are annotated with information for selection and repair to produce the corresponding TSL representation.
[5] TSL can actually incorporate multiple representations of a sentence; for example, the WebbeDoc system currently uses TSL with both English and HTML representations.
The selection information makes reference to a general user profile that describes the possible characteristics of the potential readers: a list of all selection features is stored with their set of possible values. For example, the personal characteristics of WebbeDoc readers might be described by a set of selection features and possible values as follows:
    :reader-role (layperson physician computer-expert)
    :reader-age (child adult senior)
Other kinds of selection features, such as reading level and preferred style of presentation, will, for the moment, be represented in a similar manner:
    :technical-level (low-technical high-technical)
    :formality (formal informal)
A selection condition is a boolean expression composed of particular values of selection features; for example:
    :condition '(AND (OR layperson physician) low-technical)
Such selection conditions can be included as annotations at any level in the TSL so that the system can make selections at any level of linguistic granularity.[6]
[6] The tradeoffs between amount of variation, grain size of variation, and effort of document authoring and repair are a matter for empirical investigation, and will eventually constitute one of HealthDoc’s theoretical contributions.
But this information isn’t enough. We also require the internal discourse structure to be represented explicitly, to guide repairs to the structure of the text. Therefore, TSL contains several kinds of additional annotations, including topic ordering information, coreference links, and rhetorical relations between sentences. As stylistic and pragmatic customization becomes more complex, additional representations will probably be needed. In addition to these kinds of annotations, WebbeDoc’s TSL will contain information on formatting and document presentation that would be marked up for inclusion according to specific user preferences.[7]
[7] Indeed, we anticipate that there will be a distinct “repair” module for document formatting in the sentence planner used with WebbeDoc.
The model of a master document
A master document is constructed according to a formal model; the model that we describe here is the most general, intended for the overall HealthDoc system, which does selection and repair of a master document. (The current version of WebbeDoc, which does generation by selection only, with no repairs involved, uses a more constrained model of a master document.)
We define the general model of a master document (MD) as follows:
An MD has a coherent high-level communicative goal, such as to inform, to command, to persuade, to impress. For example, the purpose of the current WebbeDoc MD is to inform (and impress) the reader about the goals and technical achievements of the HealthDoc project.
An MD has a coherent topic structure, with a division into topics, sub-topics, and so on. The smallest topic unit of an MD at the moment is a sub-sub-topic; however, we believe the form of the “smallest topic unit” will vary with the particular document. For example, a master document giving someone information on the treatment of their diabetes[8] might start with a definition of the two different types of diabetes, followed by the identification of the reader’s particular type of diabetes, and then a description of the medical characteristics of the two kinds of diabetes.
[8] An example master document that gives basic information on the treatment of diabetes can be found at: http://logos.uwaterloo.ca/ healthdo/About/Demos/diabetes.html.
Each sub-topic corresponds to a section of the document that satisfies a more specific communicative goal, such as to justify or elaborate upon. In the diabetes example, one sub-topic elaborates on the two types of diabetes, first identifying the reader’s particular form of diabetes, then describing the characteristics of the two different forms. Essentially, a sub-topic is a semantically coherent piece of the document.
Each sub-topic is a collection of variation sets that are connected by ordering relations, rhetorical relations, coreference links, and formatting relations. A variation set is a set of textual variations such that each variation fulfills the same communicative goal, but has a semantic content and pragmatic form tailored to a particular audience. Each variation in a variation set is characterized by a logical condition and a semantically coherent piece of text. The logical condition uses terms that range over sets of mutually exclusive features. We interpret “mutual exclusion” to mean that the conditions assigned to the variations in a variation set define a clean partition of the set, so that exactly one of the variations must be chosen. In the diabetes example, the sub-topic on the two types of diabetes might contain a sequence of two variation sets, the first identifying which condition the reader has and the second explaining the medical details of the different conditions.
Ordering relations may exist between the variation sets that make up a sub-topic. These relations indicate the preferred order of the sequence of variations that have been selected to form the working document, and thereby specify the ordering of
sub-topics prior to the invocation of the sentence planner.
Preferred order can vary by reader. In our example, we’ve assumed that the order of topics should be to first identify the reader’s particular type of diabetes, then elaborate on the medical characteristics of the relevant type of diabetes. However, some readers might prefer to be presented first with the overall description of the two different conditions before focussing on their specific problem.
Rhetorical relations may exist between the variation sets that make up a sub-topic. The rhetorical relations that we are currently using are taken from Rhetorical Structure Theory (RST) (Mann and Thompson 1988). In the current version of WebbeDoc, the same rhetorical relation must exist between any two members of adjacent variation sets. In the example we have been using, any choice from the variation set describing the medical details of each form of diabetes would have the same rhetorical relation, elaboration, to any choice from the first variation set, which identifies the particular form that the reader has.
Coreference links may be defined between any two variation sets. A sequence of coreference links in the diabetes document could include: diabetes, insulin-dependent diabetes, your condition.
Formatting information may be defined at each topic and sub-topic level. Formatting information may also be defined between and within variation sets, including illustrations, choice of colour, design of layout, and so on.
Authoring a master document
WebbeDoc master documents may be based on the natural-language text of pre-existing material, or they may be created from scratch (or some combination of the two). Either alternative requires the involvement of a human.
The author of a WebbeDoc master document would normally be a professional technical writer or Web-document designer, who will need to understand the nature of customized and customizable texts, but who should not be assumed to have any special knowledge or understanding of TSL or the innards of WebbeDoc.
The authoring tool, therefore, should be no more difficult for the author to use than, say, the more-sophisticated features of a typical word processor. The text is therefore written in English, and will be translated to TSL by the authoring tool. (The English source text is retained in the TSL for use in subsequent authoring sessions, for example if the document is updated or amended.)
It is the writer’s job to decide upon the basic elements of the text, the formatting, ordering, rhetorical, and coreferential links between them, and the conditions under which each element should be included in the output. The elements of the text are then typed into the authoring tool in English, and are marked up by the writer with conditions for inclusion, links for cohesion and coreference, and annotations for ordering and formatting of the document layout.
The tool then translates the text into TSL, including the conversion of the English text into SPL. The latter process is essentially one of semi-automated parsing, so that whenever an ambiguity cannot be resolved, the writer is queried in an easy-to-understand form. The design and development of the authoring tool and its user interface is part of the current phase of the overall HealthDoc project (fall 1996 to spring 1997). The user interface is being developed by Parsons (1997), while Banks (1997) is implementing the English-to-SPL conversion (for more details on the underlying model of conversion, see DiMarco and Banks (1997)).
Functions of sentence planning and automated post-editing
In general, selecting material from pre-existing text and then editing it to recover coherence and cohesion can involve a wide range of problems in various aspects of sentence planning. For example, both syntactic and semantic aggregation may be needed, as well as chunking of whole and partial propositions. Pronouns and other forms of reference need to be chosen. And, of course, aggregation and sentence restructuring will affect the rhetorical relations between the elements of the text.
Our current work is focusing on the development of two key modules of the sentence planner: for discourse structuring and for aggregation.
It is unlikely that every ordering of the blocks of text that are organized into a master document will produce a coherent sequence of selected pieces of text. To ensure that any resulting document makes sense, the discourse structuring module uses the rhetorical relations that hold among the textual units to produce a sequence that is most likely to be coherent. Its rules are derived from a set of Rhetorical Structure Theory relations (Mann and Thompson 1988) whose Nucleus and Satellite ordering requirements are implemented as constraints; the module applies a constraint satisfaction algorithm to find all satisfactory ordering(s) of the input expressions. Of these, one is selected at random. See Marcu (1997) for details. In later work, an additional module will be built to determine the linguistic phrasing of the discourse relation.
The aggregation module eliminates redundancy in TSL expressions by grouping together entities that are arguments of the same rhetorical relation, verbal process, etc. Each aggregation rule recognizes an exact match of some portions of two input TSL expressions and returns a single, fused, expression. The actions
of the aggregation module will generally affect the resulting syntactic structure.
A critical problem is the distribution of repair tasks among the planning modules, as there are often strong interactions. The responsibilities of each module and the overlaps between them are an area of on-going research for our sentence-planning group.
Related work
A number of other projects have also used a combination of natural language generation techniques and hypertext capabilities to provide texts tailored to the individual reader. In particular, the IDAS project (Reiter, Mellish, and Levine 1995) comes closest to the goals of the HealthDoc project in recognizing the need to tailor both textual and non-textual information, including visual formatting, hypertext input, and graphics output. IDAS also emphasizes the need for explicit authoring tools in the document generation process, but here the focus is on authoring at the knowledge-base level, while the HealthDoc authoring tool deals with an actual draft of the document (which can then be translated into a deeper representation). The difference in level of “granularity” of authoring reflects the basic difference in the levels of tailoring done by the two systems: IDAS provides the user with a means of navigating through the whole “hyperspace” of possible texts, but HealthDoc can be seen as providing different variations of the texts at any point in the hyperspace. Consequently, HealthDoc aims to provide a much finer-grained degree of tailoring than does IDAS, which correlates with the difference in their intended types of applications (health information requiring subtle distinctions at all levels of the discourse versus technical documentation needing only three different basic styles).
While IDAS relies mainly on canned texts, other systems use more-dynamic text generation: the Migraine system (Carenini, Mittal, and Moore 1994) uses an approach to text planning that adaptively selects and structures the information to be given to a particular reader. However, Migraine relies on a large number of context-sensitive and user-sensitive text plans so that its methods of tailoring must of necessity be very specific to its particular domain. The PEBA-II system (Milosavljevic and Dale 1996) uses more-general text plans, as well as text templates, that it can choose from to adapt information to the individual reader, but the tailoring done is very specific, focussing on the user’s familiarity with a topic. The PIGLET system (Cawsey, Binsted, and Jones 1995) also uses a combination of text plans and text templates, and, like IDAS and Migraine, allows the user to be self-guided in selecting the issues to explore. But its tailoring is also quite specific in nature, mainly concerned with emphasizing material that is relevant to the particular patient. The ILEX-0 system (Knott, Mellish, Oberlander, and O’Donnell 1996) is similar to the PIGLET model in its anticipation of all the possible texts that might be generated, but also includes annotations (e.g., a condition on a piece of canned text) to allow some local customization. However, very free and flexible use of annotations could lead to problems of repetitive text and inappropriate use of referring expressions, the kinds of problems requiring textual repair that HealthDoc’s sentence planner is intended to handle.
Like HealthDoc, these systems aim to adapt the style or content of the texts they generate to the characteristics of the individual user, but HealthDoc’s approach is more general. HealthDoc allows not only the potential inclusion of explicit text plans and text templates, but a very flexible yet principled means of organizing both textual and non-textual variations, including tailored hypertext, into a ‘master document’. Importantly, a HealthDoc, or WebbeDoc, master document can be used by a sentence planner to perform very fine-grained revision and tailoring for a user.
Conclusion
We believe that generation by selection and repair is suitable for applications that exhibit most of the following characteristics:
Pre-existing semantic content: There is a set of variations of a document, or alternative forms, that have been created by a domain specialist. In this case, construction of the master document is greatly facilitated, especially with the use of an authoring tool that supports the assembly and checking of the representations.
Similar discourse structure: The alternative forms all have roughly the same overall structure. This characteristic, which minimizes discourse structure planning or repair, is quite common, especially in cases in which one variation is an expansion of another in level of detail.
Granularity at sentence and lexical levels: Sentences are generally indivisible units, and lexical alternation is limited to simple word and phrase substitutions. These tendencies, which minimize the amount of repair required, help to enforce stylistic uniformity over the range of alternatives.
Well-defined criteria for variation: A clear set of criteria for choosing between alternatives has been identified. Fortunately, in many applications, alternative documents state their intended readership very clearly.
Adaptive-hypertext applications will frequently have these characteristics. Once the core techniques (representation of the master document and methods of repair by sentence planning) are further developed, this model of selection and repair may become one of the most attractive and popular approaches to developing useful Web-based systems that can tailor a document, whether ordinary text or hypertext, to the individual reader.
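As an illustration of the selection step described under "Representing a master document" and "The model of a master document", the following is a minimal Python sketch, not the HealthDoc or WebbeDoc implementation. It assumes that a selection condition can be written as a nested tuple mirroring the AND/OR notation of the paper, and that a user profile is simply the set of feature values that hold for the reader. The names holds, Variation, is_clean_partition, select, and generate_by_selection are hypothetical, as are the sample texts. The sketch covers "generation by selection only"; no repair is attempted.

```python
from dataclasses import dataclass
from itertools import product

# Selection features and their possible values, taken from the examples in the text.
SELECTION_FEATURES = {
    "reader-role": ("layperson", "physician", "computer-expert"),
    "reader-age": ("child", "adult", "senior"),
    "technical-level": ("low-technical", "high-technical"),
}


def holds(condition, profile):
    """Evaluate a boolean selection condition against a set of feature values.

    A condition is either a single feature value (a string) or a tuple
    ("AND" | "OR", operand, ...). This tuple encoding is an assumption.
    """
    if isinstance(condition, str):
        return condition in profile
    operator, *operands = condition
    if operator == "AND":
        return all(holds(c, profile) for c in operands)
    if operator == "OR":
        return any(holds(c, profile) for c in operands)
    raise ValueError(f"unknown operator: {operator!r}")


@dataclass(frozen=True)
class Variation:
    condition: object  # feature value or nested AND/OR tuple
    text: str          # semantically coherent piece of text


def all_profiles():
    """Enumerate every combination of one value per selection feature."""
    for values in product(*SELECTION_FEATURES.values()):
        yield frozenset(values)


def is_clean_partition(variation_set):
    """True if exactly one variation is selected for every possible profile."""
    return all(
        sum(holds(v.condition, p) for v in variation_set) == 1
        for p in all_profiles()
    )


def select(variation_set, profile):
    """Return the single variation whose condition holds for this profile."""
    matches = [v for v in variation_set if holds(v.condition, profile)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


def generate_by_selection(variation_sets, profile):
    """Pick one variation from each set, in the given order, and join the texts."""
    return " ".join(select(vs, profile).text for vs in variation_sets)


# Hypothetical master-document fragment: two variation sets in preferred order.
CONDITION = ("AND", ("OR", "layperson", "physician"), "low-technical")
INTRO = [
    Variation(CONDITION, "HealthDoc writes health leaflets for each reader."),
    Variation(("OR", "computer-expert", "high-technical"),
              "HealthDoc generates tailored documents from a master document."),
]
DETAIL = [
    Variation("low-technical", "It picks the wording that suits you."),
    Variation("high-technical",
              "It selects variations whose conditions match the user profile."),
]

if __name__ == "__main__":
    reader = {"layperson", "adult", "low-technical"}
    print(generate_by_selection([INTRO, DETAIL], reader))
```

Checking is_clean_partition when a master document is authored, rather than when it is read, is one way to catch a variation set whose conditions overlap or leave some reader uncovered; whether the authoring tool does this is not stated in the text.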
Acknowledgements
The HealthDoc Project is supported by a grant from Technology Ontario, administered by the Information Technology Research Centre. Vic DiCiccio was instrumental in helping us to obtain the grant, and has been invaluable in subsequent administration. Some material in the section on the functions of sentence planning was written by Daniel Marcu; it is used here with his permission. The other members of the HealthDoc Project have also contributed to the work described here, especially Kim Parsons, Mary Ellen Foster, and Phil Edmonds.
References
Banks, Steven (1997). Master’s thesis. Department of Computer Science, University of Waterloo, expected Spring 1997.
Bateman, John Arnold (1995). “KPML: The KOMET–Penman multilingual linguistic resource development environment.” Proceedings, 5th European Workshop in Natural Language Generation, Leiden, May 1995, 219–222.
Brusilovsky, Peter (1996). “Methods and techniques of adaptive hypermedia.” User Modeling and User-Adapted Interaction, 6 (2–3), Special Issue on Adaptive Hypertext and Hypermedia, 1996, 87–129.
Carenini, Giuseppe; Mittal, Vibhu O.; and Moore, Johanna D. (1994). “Generating patient-specific interactive natural language explanations.” Proceedings, Eighteenth Annual Symposium on Computer Applications in Medical Care, Washington D.C., November 1994, 5–9.
Cawsey, Alison; Binsted, Kim; and Jones, Ray (1995). “Personalised explanations for patient education.” Proceedings of the Fifth European Workshop on Natural Language Generation, 1995, 59–74.
DiMarco, Chrysanne and Banks, Steven (1997). “Using subsumption classification on a stylistic hierarchy as the basis of a multi-stage conversion of natural language text to sentence plans.” In preparation.
DiMarco, Chrysanne and Foster, Mary Ellen (1997). “The automated generation of Web documents that are tailored to the individual reader.” To appear in Proceedings, 1997 AAAI Spring Symposium on Natural Language Processing for the World Wide Web, Stanford University, March 1997.
DiMarco, Chrysanne; Hirst, Graeme; Wanner, Leo; and Wilkinson, John (1995). “HealthDoc: Customizing patient information and health education by medical condition and personal characteristics.” Workshop on Artificial Intelligence in Patient Education, Glasgow, August 1995.
Hovy, Eduard and Wanner, Leo (1996). “Managing sentence planning requirements.” Proceedings, ECAI-96 Workshop on Gaps and Bridges: New Directions in Planning and Natural Language Generation, Budapest, August 1996.
Knott, Alistair; Mellish, Chris; Oberlander, Jon; and O’Donnell, Mick (1996). “Sources of flexibility in dynamic hypertext generation.” Proceedings, Eighth International Natural Language Generation Workshop, Herstmonceux Castle, June 1996, 151–160.
Mann, William C. and Thompson, Sandra A. (1988). “Rhetorical Structure Theory: Toward a functional theory of text organization.” Text, 8(3), 1988, 243–281.
Marcu, Daniel (1997). “From local to global coherence: A bottom-up approach to text planning.” Submitted for publication.
Milosavljevic, Maria and Dale, Robert (1996). “Strategies for comparison in encyclopaedia descriptions.” Proceedings, Eighth International Natural Language Generation Workshop, Herstmonceux Castle, UK, June 1996, 161–170.
Parsons, Kimberley J. (1997). Master’s thesis, Department of Computer Science, University of Waterloo, expected Spring 1997.
Penman Natural Language Group (1989). “The Penman primer”, “The Penman user guide”, and “The Penman reference manual.” Information Sciences Institute, University of Southern California.
Reiter, Ehud; Mellish, Chris; and Levine, John (1995). “Automatic generation of technical documentation.” Applied Artificial Intelligence, 9, 1995, 259–287.
Wanner, Leo and Hovy, Eduard (1996). “The HealthDoc sentence planner.” Proceedings of the Eighth International Workshop on Natural Language Generation, Brighton, UK, June 1996.