Building a Discourse-Annotated Dutch Text Corpus
Nynke van der Vliet* , Ildikó Berzlánovich* , Gosse Bouma* , Markus Egg†
and Gisela Redeker*
* University of Groningen, †Humboldt University, Berlin.
Abstract
We are compiling a corpus of Dutch texts annotated with discourse structure and lexical cohesion,
containing initially 80 texts from expository and persuasive genres. We are using this resource for
corpus-based studies of discourse relations, discourse markers, cohesion, and genre differences. We
are also exploring the possibilities of automatic text segmentation and semi-automatic discourse annotation. This paper discusses our design choices in text selection and segmentation and in the annotation of discourse structure and lexical cohesion.
1 Introduction
Discourse researchers from descriptive, cognitive, formal, and computational backgrounds unanimously subscribe to the view that texts are structured entities that exhibit
coherence and cohesion (for a recent overview see Taboada and Mann (2006b)). Coherence refers to the way sentences or utterances combine to convey the informational
and intentional (e.g., expressive or persuasive) meanings of the text. Cohesion refers to
elements (conjunctions and other so-called “cue phrases”) that signal how utterances
or larger text parts are related to each other, and to the way lexical elements like pronouns and definite noun phrases refer back to other items in the discourse (Halliday and
Hasan, 1976). The main goal of our corpus-building effort is to provide the basis for
investigating discourse structure, relational and lexical cohesion, and their interactions
with genre, i.e., to support the modeling of textual organization.
Much of the theoretical and empirical research on relational coherence has focused
on local coherence relations and their linguistic signaling (e.g., Sanders et al. (1992,
1993); Knott and Sanders (1998); Webber et al. (2003), Prasad et al. (2008)). Configurational issues concerning the hierarchical composition of larger stretches of text that
arise from recursive application of coherence relations, have received some attention in
computational linguistics, but lack a substantial empirical foundation. Various structures have been proposed, in particular, binary trees (e.g., Carlson et al. (2002); Stede
(2004)), n-ary trees (e.g., Mann and Thompson (1988); Webber (2004), Polanyi et al.
(2004); Thione et al. (2004)), and less constrained graph structures (Danlos (2004);
Wolf and Gibson (2005)).
The interplay of relational discourse structure with referential and lexical cohesion
has been investigated with a focus on the use and interpretation of anaphoric expressions (Fox (1987); Grosz et al. (1995); Kehler (2002); Poesio et al. (2004)); much less
attention has been devoted to the role of lexical cohesion in co-determining the overall
textual organization (but see Hasan (1984) and Hoey (1991)).
Proceedings of the Workshop "Beyond Semantics: Corpus-based Investigations of Pragmatic and Discourse Phenomena",
pages 157-171, Göttingen, Germany, 23-25 February 2011. Bochumer Linguistische Arbeitsberichte (3).
Textual organization cannot be studied without consideration of the variability between text genres (see, e.g., Eggins and Martin (1997), Webber (2009)). In particular,
some texts are organized around a central purpose, e.g. a claim that is argued for or
a request or proposal the text is intended to support, while descriptive or expository
texts are usually organized around a central theme, moving through sub-themes or aspects. This difference is relevant for both, the relational structure and the role of lexical
cohesion. The corpus therefore covers a range of genres.
By annotating relational and lexical organization in a variety of text types, this
project will create a Dutch language resource for corpus-based discourse research,
computational modeling, and applications like question answering and summarization.
2 Corpus design
Our aim is to provide a reliably annotated “gold standard” resource covering a range of
genres. The emphasis on quality and richness of the manual annotation limits the size
of the corpus, as careful annotation work is extremely time consuming.
2.1 Text selection
The corpus covers a range of text genres, including, in particular, expository texts,
whose main purpose is to present information to the reader, and persuasive texts that
aim to affect the readers intentions or actions. The texts vary in length between a
minimum of approximately 190 words and a maximum of approximately 400 words.
Longer texts become unwieldy for relational analysis, and top-level relations tend to be
rather uninformative juxtapositions (Taboada and Mann, 2006b).
The corpus consists of 40 expository texts and 40 persuasive texts. For the expository subcorpus, 20 texts have been selected from online encyclopedias on astronomy1
and 20 from a popular scientific news website2 . The persuasive texts are 20 fundraising letters from humanitarian organizations and 20 commercial advertisements from
lifestyle and news magazines.
Encyclopedia entries as well as popular scientific news are learned exposition, i.e.,
texts that are strictly informational in purpose, but moderately technical in content and
style, and that take the general public as their audience. In this way, we excluded scientific exposition, which is more abstract and technical in style and targets professional
scientific audience (e.g., academic prose). Fundraising letters and advertisements are
prototypical persuasive genres that have received much attention in the literature (e.g.,
Bhatia (1998), Kamalski (2007)). They have a clear and focused purpose and are directed at a general audience.
1 http://www.astronomie.nl;
http://www.sterrenwacht-mercurius.nl/encyclopedie.php5
2 http://www.scientias.nl/category/astronomie
158
2.2 Annotation
The starting point of our annotation work is a syntax-based segmentation of the texts
into clausal atomic units, which has been developed in an extended training phase involving consistency checking aided by a collection of examples (see section 3 below).
We then add annotations for discourse structure, relational cohesion, and lexical cohesion, which we are briefly introducing here (for details see sections 4 and 5).
For the analysis of relational discourse structures, we chose the widely used Rhetorical Structure Theory (RST) (Mann and Thompson, 1988; Taboada and Mann, 2006a)
in its “extended classic” variant. The XML annotation is created using O’Donnell’s
RSTTool3 (O’Donnell, 1997). The definitions of the RST relations are available from
the RST website4 .
Previous research has shown how combining genre analysis and RST analysis enriches our understanding of discourse structure (e.g., Taboada and Lavid (2003), Gruber
and Muntigl (2005)). We are therefore overlaying the RST-trees with a segmentation of
the global text units according to the genre-specific moves they realize (Upton and Cohen, 2009). The mapping of the sequence of moves onto the RST-trees adds relational
and hierarchical information.
Three subsystems of cohesion contribute to the organization of a text: relational cohesion (lexical or phrasal elements that signal coherence relations), referential cohesion
(anaphoric chains, spatial/temporal chaining and ellipsis), and lexical cohesion arising
from the semantic network of the lexical items in the text (Halliday and Hasan (1976);
Halliday and Matthiessen (2004)). In this project we focus on relational cohesion and
lexical cohesion.
The analysis of relational cohesion will include all lexical or phrasal elements (discourse markers) in the text that signal coherence relations at local and global levels of
discourse. We are currently developing our methodology for this analysis.
The analysis of lexical cohesion starts by identifying all content words (nouns, verbs,
adjectives, adverbs) and then locating their neighboring lexical associates in other discourse units. The XML annotation is created with an MMAX-based tool (Müller and
Strube, 2001).
All annotations are done separately by at least two annotators and then discussed.
Inter-annotator agreement using Kappa shows a high level of agreement on the segmentation: .97 for the encyclopedia texts and .99 for the fundraising letters. We computed
inter-annotator agreement for the RST analysis for two fundraising letters and two encyclopedia texts, using the methods proposed in Marcu et al. (1999). On average, the
agreement was .88 on the spans and .82 on the nuclearity. The agreement on the RST
relation labels was only .57. We suspect (and hope to confirm with the complete data
set) that this is not a general deficiency of our annotation but a problem that can mainly
3 available
from http://www.wagsoft.com/RSTTool/.
4 http://www.sfu.ca/rst/
159
be attributed to a few rather confusable relations such as Joint versus Conjunction. As
Marcu et al. (1999) point out, these Kappa values are comparable with the agreement
in other corpora.5
The annotation of all 80 texts in the core corpus will be complete by March 2011.
Manuals detailing the segmentation and annotation rules will be made available along
with the corpus.
3 Segmentation
An essential step in discourse analysis is the identification of suitable Elementary Discourse Units (EDUs). Various definitions of EDUs exist, ranging from a fine-grained
segmentation to segmentation at sentence level. In classic Rhetorical Structure Theory (RST), clauses are considered to be EDUs, except for subject and object clauses,
complement clauses, and restrictive relative clauses (Mann and Thompson, 1988).
For the annotation of the RST Discourse Tree Bank, Carlson and Marcu (2001) use a
fine-grained segmentation in which they also treat complements of attribution verbs and
phrases that begin with a strong discourse marker (e.g. because of, in spite of, according
to) as separate EDUs. Relative clauses, nominal postmodifiers, or clauses that break up
other legitimate EDUs are treated as embedded discourse units. Based on this, Lüngen
et al. (2006) developed segmentation guidelines for German text, but in contrast to
Carlson and Marcu (2001) they exclude restrictive relative clauses, conditional clauses,
and proportional clauses (clauses combined by comparative connectives). Grabski and
Stede (2006) suggest to also include prepositional phrases as EDUs. Tofiloski et al.
(2009) adhere more closely to the original RST proposals (Mann and Thompson, 1988)
and segment coordinated clauses, adjunct clauses and non-restrictive relative clauses.
To our mind, these differences follow from attempts to include semantic considerations
in the definition of EDUs (i.e., including at least some proposition-denoting yet nonclausal segments among the EDUs).
For Dutch, as far as we know, such an elaborate investigation of what should count as
an EDU has not yet been done. RST annotations of Dutch text have used the segmentation of the original RST proposals (Abelen et al., 1993) or taken clauses containing a
finite verb (den Ouden et al., 1998) or whole sentences (Timmerman, 2007) as EDUs.
3.1 Segmentation principles
The segmentation we use for the Dutch corpus is fairly coarse. The EDUs are independent or subordinate clauses or other complete utterances (independent fragments). The
definition of an elementary discourse unit is guided by the question of whether a discourse relation could hold between the unit and another segment. EDUs are typically
5 Brown
corpus (Francis and Kucera, 1979), MUC corpus (Chinchor, 2001), WSJ corpus (Carlson et al., 2002)
160
propositions or segments that constitute speech acts of their own. The segmentation
principles are based on syntax and punctuation rather than semantic criteria.
Like Tofiloski et al. (2009), we treat simple sentences (1), coordinated clauses (2),
subordinate clauses (3) and non-restrictive relative clauses (4) as EDUs.
(1)
[Elke donatie is waardevol!]
[Each donation is valuable!]
(2)
[Cavine kreeg aidsremmers][en dat maakte een levensgroot verschil.]
[Cavine got aids medication][and that made a huge difference.]
(3)
[Omdat de EU binnenkort beslist over nieuwe regels,][voeren we de druk op de
politiek nu hoog op]
[Because the EU will decide on new regulations soon][we are now strongly
increasing our pressure on politics.]
(4)
[Dit gat wordt veroorzaakt door een van de maantjes van Saturnus, Mimas,][die
de ringen verstoort.]
[This gap is caused by one of the moons of Saturn, Mimas,][which disturbs the
rings.]
In contrast to Tofiloski et al. (2009), we consider coordinated elliptical clauses (i.e.
clauses that share a verb that is elided in one of the clauses, as in (5)) as separate EDUs,
because the two clauses that share a verb can be seen as two separate predicates. This
also applies to clauses that share a noun phrase as subject, as in (6). In Carlson and
Marcu (2001), clauses with an ellipted subject are segmented as EDUs as well, whereas
clauses with an ellipted verb are only treated as EDUs when there are strong rhetorical
cues marking the discourse structure as in (7)6 .
(5)
[De planeet draait in 58.6 dagen om haar as] [en in 88.0 dagen om de zon.]
[The planet turns around its axis in 58.6 days ][and around the sun in 88.0
days.]
(6)
[De operatie duurde 15 minuten][en kostte 35 euro.]
[The surgery took 15 minutes][and cost 35 euros.]
(7)
[Back then, Mr. Pinter was not only the angry young playwright,] [but also
the first] [to use silence and sentence fragments and menacing stares, almost
to the exclusion] [of what we preciously understood to be theatrical dialog.]
(wsj 1936)
Non-restrictive relative clauses as in (8) and embedded clauses between parentheses
as in (9) are considered to be embedded discourse units. Restrictive relative clauses,
subject and object clauses, and complement clauses are not treated as separate EDUs
(following classic RST). Contrary to Carlson and Marcu (2001), Lüngen et al. (2006),
6 Example
from Carlson and Marcu (2001)
161
and Jasinskaja et al. (2007), we do not recognize non-clausal appositives as in (10) as
separate EDUs.
(8)
[Echter gedurende de nacht, [die op Mercurius maanden lang kan duren,] daalt
de temperatuur tot zo’n -185 graden Celsius.]
[However during the night, [which can last for months on Mercury,] the temperature decreases to about -185 degrees Celsius.]
(9)
[De binnenste maan [(van 2002 tot 2005 is dat Epithemeus)] beweegt iets sneller
dan de buitenste] [en haalt die ander langzaam (met 450 meter per minuut) in.]
The innermost moon [(from 2002 to 2005 this is Epithemeus)] moves a bit faster
than the outermost] [and slowly (with 450 meters per minute) catches up with
the other.]
(10)
[Het tweede type terrein, het laagland, telt relatief nog minder kraters dan het
hoogland.]
[The second terrain type, the lowland, contains even fewer craters than the
highland.]
Our segmentation uses punctuation in connection with syntax. Periods, exclamation
marks and question marks are EDU boundaries, except for periods that are used in
abbreviations, acronyms, dates and so forth. Independent fragments (subclausal expressions ending with a period) as in (11) are considered to be EDUs.
(11)
[Leuke hebbedingetjes.]
[Nice gadgets.]
Colon or semicolon are only treated as separation markers when the subsequent material is a clause as in (12). If it is a non-clausal expression, as in (13), it is not segmented.
The same rule applies for text structures between hyphens or parentheses: clauses as in
(9) or participle structures as in (14) are segmented as EDUs, but non-clausal material
as in (15) is not segmented.
(12)
[Daar knapt ze zichtbaar van op;][ze begint ook weer te praten!]
[From that, she recuperates visibly; ][she even starts to talk again!]
(13)
[In 2005 zijn nog twee maantjes van Pluto ontdekt: Nix en Hydra.]
[In 2005, two more small moons of Pluto were discovered: Nix and Hydra.]
(14)
[Wat er binnen deze bol [(horizon genoemd)] gebeurt weten we niet.]
[What happens inside this globe [(called horizon)] we don’t know.]
(15)
[De krater Pan (inslagkrater), de grootste krater, is 100 kilometer in doorsnede]
[en minstens 8 kilometer diep.]
[The crater Pan (impact crater), the biggest crater, is 100 kilometers in diameter][and at least 8 kilometers deep.]
162
4 Discourse structure
The annotation of discourse structure is intended to capture the hierarchical structures
arising from coherence relations between discourse units, but also the genre-specific
structures that can help in understanding genre differences in discourse structure.
4.1 Rhetorical Structure Theory
There is wide agreement that discourse is hierarchically structured, and many current
theories assume that this structure arises from the recursive application of coherence
relations. Discourse-annotated corpora are particularly useful for investigating the
realizations, linguistic marking, and genre-specific uses of coherence relations (e.g.,
Webber (2009); Taboada et al. (2009); see also the discussion in Taboada and Mann
(2006a,b)) and we are researching such questions with our corpus. In addition, however, we are also interested in the configurational characteristics of discourse structure.
We thus differ from annotation efforts like the Penn Discourse TreeBank (Prasad et al.,
2008) that focus mainly on coherence relations and on implicit and explicit connectives.
For us, it is essential to represent the full hierarchical structure of our texts.
Rhetorical Structure Theory (RST; Mann and Thompson (1988)) has proven successful for the analysis of whole texts and has been widely applied (for an overview
see Taboada and Mann (2006a,b)) to texts of various languages and used for the annotation of large text corpora (Carlson et al. (2002), Stede (2004)).
We base our analyses on the set of 30 relations as defined in “extended classic”
RST. We do not follow Carlson and Marcu (2001), who use a much larger set of relation labels (mostly necessitated by their more fine-grained segmentation) (for a critical
discussion of both variants of RST, see Stede (2008)).
In particular, we do not use Carlson and Marcu’s (2001) Attribution and Same relations, which we consider problematic. Attribution is defined in Carlson and Marcu
(2001) as the relation between a direct or indirect quotation and its attributing phrase
or clause. This relation is arguably of a categorically different kind than coherence
relations (Tofiloski et al. (2009), Skadhauge and Hardt (2005)). In classic RST, complement clauses and speech parentheticals are not considered as separate EDUs. This
means that speech-reporting EDUs can enter coherence relations as speech events or
by virtue of the speech that is reported (in particular when the quotation is continued in
subsequent EDUs). This flexibility fits in well with the idea that semantic relations in
discourse are often underspecified (Egg and Redeker, 2008).
The pseudo-relation Same is introduced by Carlson and Marcu (2001) to link two
discontinuous parts of an EDU that is interrupted by another, parenthetically embedded,
EDU. In classic RST, parenthetical EDUs are extracted and placed after their host EDU,
thus obviating the need for a pseudo-relation (see, e.g., Redeker and Egg (2006)).7
7 Borisova
and Redeker (2010) point out problems involving the Same relation in the Discourse GraphBank (Wolf et al. (2003)).
163
4.1.1 Discourse trees or graphs?
Rhetorical Structure Theory assumes that the discourse structure of a text can be represented as an ordered tree. In this tree all text parts are in some way connected to
the root of the tree, the most central text part. However, it has been claimed that tree
structures are not sufficient to represent discourse structure (Asher (2008); Lee et al.
(2008); Wolf and Gibson (2005)). Wolf and Gibson (2005) show that crossed dependencies (i.e. structures in which discourse units ABCD (not necessarily adjacent) have
relations AC and BD) and multiple-parent structures (where a unit enters more than one
coherence relation and is thus dominated by more than one node) occur abundantly in
their Discourse GraphBank (Wolf et al. (2003)). They argue that these constellations,
which violate the tree-structure constraints, are necessary to describe the text structures
in their corpus, and that a more complex graph structure is thus required to represent
the discourse structure of a text.
Webber (2006) and Egg and Redeker (2008, 2010), however, argue that the chain
graphs in the Discourse GraphBank conflate discourse constituency and anaphoric dependency. Egg and Redeker (2008) point out that the analyses discussed in Wolf and
Gibson (2005) have plausible tree-based alternatives and Egg and Redeker (2010) further support this argument with data from the Discourse GraphBank. While this question is not yet settled, we do find that trees are adequate data structures to represent the
constituent structure of discourse for the texts in our corpus and thus use RST-trees to
annotate discourse structures.
4.1.2 Non-binary trees
Given the assumption that discourse structure can be adequately represented by trees, it
is tempting to consider the still stronger assumption that would only allow binary trees,
which are much simpler and computationally more tractable. This restriction is indeed
often implemented in discourse parsers (e.g. Marcu (2000); Soricut and Marcu (2003);
Reitter (2003)). In our project, we choose plausibility and validity of our analyses over
computational tractability and allow non-binary structures in our RST trees.
RST-trees do contain mostly binary relations (in particular the asymmetric nucleussatellite relations),8 but they also admit non-binary structures with multiple nuclei or
multiple satellites relating to one nucleus. In the first case, several nuclei are involved
in one multinuclear relation, e.g., List, Sequence or Joint. Binary representations of
such structures (proposed, e.g., by Egg and Redeker (2008)) involve a stacking of binary relations, implying a hierarchical ordering (left- or right-branching or pairwise
clustering) among the list constituents. These binary representations do not reflect the
8 RST distinguishes two kinds of relations: The asymmetric mononuclear relations like Elaboration or Justify relate a nucleus (centrally important)
and a satellite (additional information, which could in many cases be left out without rendering the text incoherent). The symmetric multinuclear
relations like List or Joint relate discourse entitites of equal status.
164
equal importance of the items in the multinuclear relation.
In the second kind of non-binary structures, several nucleus-satellite relations share
the same nucleus, e.g., when the central request of a fundraising letter is supported
by various preceding or succeeding Motivation and/or Justify satellites, as described in
Abelen et al. (1993), or when several separate Elaborations provide details about the
contents of one nucleus. A binary representation of these structures requires that one of
the satellites of the shared nucleus is included in the nucleus of another satellite, which
is in many cases not plausible.9
We consider the regular occurrence of non-binary structures sufficient reason to assume that discourse structure representations require non-binary trees.
4.2 Moves
For comparisons of the global text structure across genres, we identify the genrespecific major building blocks of the texts using move analysis (Upton and Cohen,
2009). We identify the functional components, so-called moves, in the text. A move is
realized by at least one EDU. Contrary to, e.g. Biber et al. (2007), we do not recognize
moves below EDU level and do not allow embedding of moves. The moves in our analysis create a linear, non-hierarchical partition of the EDUs in the text. Each genre has
a particular set of move types that occur regularly in texts of that genre. Some move
types are obligatory. Any move type may be realized more than once in a particular
text. In the encyclopedia entries, we identify the move types name, define and describe. For the fundraising letters, we follow Upton (2002), who identified seven move
types labeled get attention, introduce the cause and/or establish credentials of organization, solicit response, offer incentive, reference insert, express gratitude, conclude
with pleasantries. The move structure of advertisements is based on Bhatia (2005)
and contains the following move types: get attention, justify the product or service
by establishing a niche, detail the product or service, establish credentials, endorsement/testimonial, offer incentive, use pressure tactics, solicit response, and reference
to external material. Finally, the starting point for determining the move structure of
the popular scientific news will be van Dijk’s superstructure of news (van Dijk, 1988),
which is a hierarchical structure containing the main genre elements of news in general.
5 Cohesion
Parallel to the discourse structure annotation, we are annotating the corpus for relational
cohesion and lexical cohesion.
9 An alternative explanation that first collects all satellites in a List or Joint segment, which then as a whole functions as the sole satellite of the
respective nucleus is only feasible in a subgroup of these cases, in which all satellites occur on the same side of the nucleus (before or after it) and are
related to the nucleus in terms of the same relation.
165
5.1 Relational cohesion
Relational cohesion concerns the lexical or phrasal elements (discourse markers) in a
text that signal coherence relations, both at the local and global levels of discourse.
Some relations are often signaled by discourse markers, e.g. the conjunction relation
(and, also), but others are implicit and do not contain clear cues (Taboada, 2006).
In a pilot study we have analyzed the distribution and explicit signaling of coherence
relations in 20 encyclopedia entries and 20 fundraising letters. Intra-sentential relations
are much more often signaled than inter-sentential relations (69% vs 16%), presumably
reflecting the fact that intra-sentential clause combining usually involves an obligatory
conjunction or adverb, while there is no such syntactic requirement for marking intersentential relations.
Future work will include the annotation of discourse markers (conjunctions and conjunctive adverbs) and their scopes, comparable to the annotations in the Penn Discourse
Treebank (Prasad et al., 2008), with the dual aim of theoretical investigations and the
development of a semi-automatic parsing tool for coherence relations.
5.2 Lexical cohesion
In our analysis of lexical cohesion, we aim to cover all types of semantic relations
among lexical items in the text (see section 5.2.2 below; for recent work on an overview
of approaches to lexical cohesion, see Tanskanen (2006)). We include only relations
across elementary discourse units (EDUs), not within EDUs. This allows us to investigate the alignment between discourse structure and lexical cohesion, as both structures
are based on the same units. At a finer level, we also study the co-occurrence of lexical
cohesion types with coherence relations.
5.2.1 Selection of lexical items
As we are interested in the contribution of lexical cohesive relations, we exclude pronouns and do not follow referential chains through the text. The class of items for
participating in lexical cohesion includes content words (nouns, verbs, adjectives, and
adverbs of place, time, and frequency) and proper names. Proper names are treated as
one unit. The elements of multi-word units (except for proper names) are treated as
separate lexical items, while compounds are taken as indecomposable single units.
5.2.2 Categories of lexical cohesive relations
The categories we distinguish for lexical cohesive relations are listed in Table 1. By repetition we mean word repetition. The lexical items in full repetition have fully identical
word form or they differ only in their inflectional suffix, whereas lexical items in partial repetition have different derivational suffixes in their word form. Under the heading
166
Category
Full repetition
Repetition
Partial repetition
Hyponymy
Hyperonymy
Co-hyponymy
Meronymy
Systematic semantic relations Holonymy
Co-meronymy
Synonymy
Antonymy
Collocation
Example
planet - planet
planet - planetary
sun - star
gas - hydrogen
Venus - Mercury
planet - solar system
solar system - sun
Earth - sun
life - existence
light - heavy
light - star
Table 1: Categories of lexical cohesion
systematic semantic relations we include the traditional lexical semantic relations. The
lexical cohesive relation collocation is formed between two lexical items which tend
to occur in similar lexical environments because they describe things that tend to occur
in similar situations or contexts in the world (Morris and Hirst, 1991). Note that this
use of the term implies a meaning relation between the lexical items in contrast to its
use in corpus linguistics, where collocation refers to the mere co-occurrence of words
(Stubbs, 2001), which is not a sufficient criterion for lexical cohesion.
We identify relations arising from lexical meaning (e.g., planet - Earth) and ignore accidental meaning relations that arise from context. In addition, we identify
relations that are easy for the reader to identify with general background knowledge
and for which no further knowledge or textual context is necessary for their identification (e.g., we identify the relation of astronomer with Kepler, but not with Richard
Walker, although the textual context helps us understand that Richard Walker is also
an astronomer). This question is strongly related to the issue of register-sensitive and
domain-sensitive relations. Although we aim to identify general relations, i.e., relations
which are not specific of a certain register or domain, the annotators have to face the
difficulties of drawing the line between general and context-dependent.
5.2.3 Lexical cohesion links as a graph structure
Lexical cohesive links build up graph structures in the text. In our analysis any candidate item can enter into a lexical cohesive relation with any other candidate items as
long as there is a meaning relation between them. For each lexical item in a text, we
identify its lexical links—if any—to preceding lexical items (lemmas), ignoring any
links among the words inside an EDU. If a lexical item is linked to more than one preceding item, all of those relations are registered as cohesive links. Similarly, if a lexical
item enters into cohesive relations with more than one item occurring in succeeding
EDUs, all those links are counted.
167
In this way, we build up networks that represent the lexical cohesive structure of a
text. By assigning graph structures to lexical cohesion, we differ from previous studies that identified lexical cohesive chains in text (e.g., Hasan (1984), Morris and Hirst
(1991)) and follow those that identify networks (Hoey, 1991). Modeling lexical cohesion with graph structures provides a much richer representation than the lexical
cohesive chains model. It also allows us to measure the centrality of a lexical item by
its centrality in the network.
6 Conclusion
The resource we are building aims at a high standard of empirical validity (very careful
annotation based on detailed, explicit rules) and coverage across a theoretically motivated selection of text genres. With a core of 80 texts, the corpus is rather small for
computational applications, but still large enough for distributional analyses and structural comparisons.
We have been using the initially completed parts of this corpus to investigate genre
differences in the use of discourse relations and in the occurrence of lexical cohesion
relations and the interaction of these two aspects of textual organization (Berzlánovich
and Redeker, 2011). As our discourse structure annotation follows the widely used
“classic” RST, we expect our corpus to support cross-linguistic research through its
comparability with RST-based corpora in other languages.
Our segmentation rules are surface oriented (based on syntax and punctuation) and
have been implemented in an automatic segmenter (van der Vliet, 2010). Future work
will include the annotation of discourse markers with the dual aim of theoretical investigations and the development of a semi-automatic parsing tool for coherence relations.
With an eye on crosslinguistic research on discourse and discourse markers in the spirit
of Knott and Sanders (1998), we will strive for compatibility with the annotation in
the Penn Discourse TreeBank (Prasad et al., 2008), but will more freely allow markers
to signal global coherence relations among larger text spans (which is discouraged by
PDTB’s Minimality Principle (Prasad et al. (2007): 19), according to which annotators
have to select the minimally necessary segments).
Finally, we also envisage combining our lexical cohesion analysis with computational coreference resolution (Hendrickx et al., 2008) and testing our network model of
lexical cohesion against approaches based on lexical chaining (see, e.g., Barzilay and
Elhadad (1997)).
Acknowledgments
The work reported here is supported by grant 360-70-280 of the Netherlands Organization for Scientific Research (NWO). For online documentation of the program Modeling discourse organization see www.let.rug.nl/mto. We are grateful to three
anonymous reviewers for their valuable comments on an earlier version of this paper.
168
References
Eric Abelen, Gisela Redeker, and Sandra A. Thompson. The rhetorical structure of US-American and
Dutch fund-raising letters. Text, 13(3):323–350, 1993.
Nicholas Asher. Troubles on the right frontier. In Peter Kühnlein and Anton Benz, editors, Constraints
in Discourse. Benjamins, Amsterdam, 2008.
Regina Barzilay and Michael Elhadad. Using lexical chains for text summarization. In Proceedings
of the ACL Workshop on Intelligent Scalable Text Summarization, volume 17. Madrid, Spain, 1997.
Ildikó Berzlánovich and Gisela Redeker. Genre-dependent interaction of coherence and lexical cohesion in written discourse. In Corpus Linguistics and Linguistic Theory, 2011. To appear.
Vijay K. Bhatia. Generic patterns in fundraising discourse. New directions for philanthropic fundraising, (22):95–110, 1998.
Vijay K. Bhatia. Generic patterns in promotional discourse. In Helana Halmari and Tuija Virtanen,
editors, Persuasion across genres: A linguistic approach, pages 213–228. Benjamins, Amsterdam,
2005.
Douglas Biber, Ulla Connor, and Thomas A. Upton. Discourse on the move: Using corpus analysis
to describe discourse structure. Benjamins, Amsterdam, 2007.
Irina Borisova and Gisela Redeker. Same and Elaboration relations in the Discourse Graphbank. In
Proceedings of the 11th annual SIGdial Meeting on Discourse and Dialogue, Tokyo, 2010.
Lynn Carlson and Daniel Marcu. Discourse tagging reference manual. ISI Technical Report ISI-TR545, 2001.
Lynn Carlson, Daniel Marcu, and Mary E. Okurowski. RST Discourse Treebank. Linguistic Data
Consortium, 2002.
Nancy Chinchor. Message Understanding Conference (MUC) 7. Linguistic Data Consortium,
Philadelphia, 2001.
Laurence Danlos. Discourse dependency structures as constrained DAGs. In M. Strube and C. Sidner, editors, Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue, Cambridge,
Massachusetts, USA, pages 127–135, 2004.
Hanny J .N. den Ouden, Carel H. van Wijk, Jacques M.B. Terken, and Leo .G.M. Noordman. Reliability of discourse structure annotation. IPO Annual Progress Report, 33:129–138, 1998.
Markus Egg and Gisela Redeker. Underspecified discourse representation. In P. Kühnlein and
A. Benz, editors, Constraints in Discourse (CID), Dortmund, June 3-5, 2005, pages 117–138. Benjamins, Amsterdam, 2008.
Markus Egg and Gisela Redeker. How complex is discourse structure? In Proceedings of LREC’10,
Malta, 17-23 May 2010, pages 1619–1623, ELRA, 2010.
Suzanne Eggins and Jim R. Martin. Genres and registers of discourse. In T.A. van Dijk, editor,
Discourse as Structure and Process, volume 1, pages 230–257, 1997.
Barbara A. Fox. Discourse structure and anaphora: Written and conversational English. Cambridge
University Press, 1987.
Nelson Francis and Henry Kucera. Brown Corpus Manual. Brown University, 1979.
Michael Grabski and Manfred Stede. Bei: Intraclausal coherence relations illustrated with a German
preposition. Discourse Processes, 41(2):195–219, 2006.
Barbara J. Grosz, Scott Weinstein, and Aravind K. Joshi. Centering: A framework for modeling the
local coherence of discourse. Computational Linguistics, 21(2):203–225, 1995.
Helmut Gruber and Peter Muntigl. Generic and rhetorical structures of texts: Two sides of the same
coin? Folia Linguistica, 39(1-2):75–113, 2005.
Michael A.K. Halliday and Ruqaiya Hasan. Cohesion in English. Longman, London, 1976.
Michael A.K. Halliday and Christian M.I.M. Matthiessen. An introduction to functional grammar.
Arnold, London, 2004.
Ruqaiya Hasan. Coherence and cohesive harmony. In J. Flood, editor, Understanding reading comprehension: Cognition, language and the structure of prose, pages 181–219. International Reading
Association, Newark, 1984.
169
Iris Hendrickx, Gosse Bouma, Frederik Coppens, Walter Daelemans, Veronique Hoste, Geert Kloosterman, Anne-Marie Mineur, Joeri van der Vloet, and Jean-Luc Verschelde. A coreference corpus
and resolution system for Dutch. In Proceedings of the Sixth International Language Resources
and Evaluation (LREC’08), Marrakech,28-30 May 2008, 2008.
Michael Hoey. Patterns of lexis in text. Oxford University Press, 1991.
Katja Jasinskaja, Jörg Mayer, Jutta Boethke, Annika Neumann, Andreas Peldszus, and Kepa
Rodrı́guez. Discourse tagging guidelines for German radio news and newspaper commentaries.
Technical report, Universität Potsdam, 2007.
Judith M.H. Kamalski. Coherence marking, comprehension and persuasion. On the processing and
representation of discourse. LOT Dissertation Series, 158, 2007.
Andrew Kehler. Coherence, reference, and the theory of grammar. Stanford, CA: CSLI, 2002.
Alistair Knott and Ted Sanders. The classification of coherence relations and their linguistic markers:
An exploration of two languages. Journal of Pragmatics, 30(2):135–175, 1998.
Alan Lee, Rashmi Prasad, Aravind Joshi, and Bonnie Webber. Departures from tree structures in
discourse: Shared arguments in the Penn Discourse Treebank. In Proceedings of the Constraints in
Discourse Workshop (CID08), Potsdam, Germany, 2008.
Harald Lüngen, Csilla Puskàs, Maja Bärenfänger, Mirco Hilbert, and Henning Lobin. Discourse
segmentation of German written text. In Proceedings of the 5th International Conference on Natural
Language Processing (FinTAL 2006), 2006.
William C. Mann and Sandra A. Thompson. Rhetorical structure theory: Toward a functional theory
of text organization. Text, 8(3):243–281, 1988.
Daniel Marcu. The rhetorical parsing of unrestricted texts: A surface-based approach. Computational
Linguistics, 26(3):395–448, 2000.
Daniel Marcu, Estibaliz Amorrortu, and Magdalena Romera. Experiments in constructing a corpus
of discourse trees. In Proceedings of the ACL99 Workshop on Standards and Tools for Discourse
Tagging, pages 48–57, 1999.
Jane Morris and Graeme Hirst. Lexical cohesion computed by thesaural relations as an indicator of
the structure of text. Computational Linguistics, 17(1):21–48, 1991.
Christoph Müller and Michael Strube. MMAX: A tool for the annotation of multi-modal corpora.
In Proceedings of the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue
Systems, pages 45–50, 2001.
Michael O’Donnell. RST-Tool: An RST analysis tool. In Proc. of the 6th European Workshop on
Natural Language Generation, Duisburg, 1997.
Massimo Poesio, Rosemary Stevenson, Barbara D. Eugenio, and Janet Hitzeman. Centering: A
parametric theory and its instantiations. Computational Linguistics, 30(3):309–363, 2004.
Livia Polanyi, Martin van den Berg, Chris Culy, Gian L. Thione, and David Ahn. A rule based
approach to discourse parsing. In Proceedings of SIGDIAL ’04. Boston, MA, 2004.
Rashmi Prasad, Eleni Miltsakaki, Nikhil Dinesh, Alan Lee, Aravind Joshi, Livio Robaldo, and Bonnie
Webber. The Penn Discourse TreeBank 2.0. Annotation manual. Technical report, Institute for
Research in Cognitive Science, University of Pennsylvania, 2007.
Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie
Webber. The Penn Discourse Treebank 2.0. In Proceedings of the Sixth International Language
Resources and Evaluation, 2008.
Gisela Redeker and Markus Egg. Says who? On the treatment of speech attributions in discourse
structure. In Proceedings of Constraints in Discourse II, pages 140–146, 2006.
David Reitter. Simple signals for complex rhetorics: On rhetorical analysis with rich-feature support
vector models. LDV-Forum, GLDV Journal for Computational Linguistics and Language Technology, 18:38–52, 2003.
Ted J. M. Sanders, Wilbert P. M. Spooren, and Leo G. M. Noordman. Towards a taxonomy of
coherence relations. Cognitive Linguistics, 15:1–35, 1992.
Ted J. M. Sanders, Wilbert P. M. Spooren, and Leo G. M. Noordman. Coherence relations in a
170
cognitive theory of discourse representation. Journal of Pragmatics, 4:93–133, 1993.
Peter R. Skadhauge and Daniel Hardt. Syntactic identification of attribution in the RST treebank. In
Workshop On Linguistically Interpreted Corpora, 2005.
Radu Soricut and Daniel Marcu. Sentence level discourse parsing using syntactic and lexical information. In Proceedings of HLT/NAACL 2003, pages 228–235, 2003.
Manfred Stede. The Potsdam Commentary Corpus. In Proceedings of the 2004 ACL Workshop on
Discourse Annotation, pages 96–102, 2004.
Manfred Stede. Disambiguating rhetorical structure. Research on Language & Computation, 6(3):
311–332, 2008.
Michael Stubbs. Computer-assisted text and corpus analysis: lexical cohesion and communicative
competence. The handbook of discourse analysis, pages 54–75, 2001.
Maite Taboada. Discourse markers as signals (or not) of rhetorical relations. Journal of Pragmatics,
38(4):567–592, 2006.
Maite Taboada and Julia Lavid. Rhetorical and thematic patterns in scheduling dialogues: A generic
characterization. Functions of Language, 10(2):147–178, 2003.
Maite Taboada and William C. Mann. Applications of rhetorical structure theory. Discourse Studies,
8(4):567–588, 2006a.
Maite Taboada and William C. Mann. Rhetorical structure theory: Looking back and moving ahead.
Discourse Studies, 8(3):423–459, 2006b.
Maite Taboada, Julian Brooke, and Manfred Stede. Genre-based paragraph classification for sentiment analysis. In Proceedings of the SIGDIAL 2009 Conference: The 10th Annual Meeting of the
Special Interest Group on Discourse and Dialogue, pages 62–70, 2009.
Sanna-Kaisa Tanskanen. Collaborating towards coherence: Lexical cohesion in English discourse.
Benjamins, Amsterdam, 2006.
Gian Lorenzo Thione, Martin van der Berg, Chris Culy, and Livia Polanyi. LiveTree: An integrated
workbench for discourse processing. In B. Webber and D. Byron, editors, ACL 2004 Workshop on
Discourse Annotation, Barcelona, Spain, pages 110–117, 2004.
Sander E. J. Timmerman. Automatic recognition of structural relations in Dutch text. MA thesis,
University of Twente, 2007.
Milan Tofiloski, Julian Brooke, and Maite Taboada. A syntactic and lexical-based discourse segmenter. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 77–80, 2009.
Thomas A. Upton. Understanding direct mail letters as a genre. International Journal of Corpus
Linguistics, 7(1):65–85, 2002.
Thomas A. Upton and Mary Ann Cohen. An approach to corpus-based discourse analysis: The move
analysis as example. Discourse Studies, 11(5):585–605, 2009.
Nynke van der Vliet. Syntax-based discourse segmentation of Dutch text. In Marija Slavkovik, editor,
Proceedings of the 15th Student Session, ESSLLI, pages 203–210, 2010.
Teun A. van Dijk. News as discourse. Erlbaum, Hillsdale, 1988.
Bonnie Webber. D-LTAG: extending lexicalized TAG to discourse. Cognitive Science, 28(5):751–
779, 2004.
Bonnie Webber. Accounting for discourse relations: constituency and dependency. In M. Dalrymple,
editor, Intelligent linguistic architectures, pages 339–360, 2006.
Bonnie Webber. Genre distinctions for discourse in the Penn TreeBank. In Proceedings of the Joint
Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on
Natural Language Processing of the AFNLP: Volume 2-Volume 2, pages 674–682, 2009.
Bonnie Webber, Matthew Stone, Aravind Joshi, and Alistair Knott. Anaphora and discourse structure.
Computational Linguistics, 29:545–587, 2003.
Florian Wolf and Edward Gibson. Representing discourse coherence: A corpus-based study. Computational Linguistics, 31(2):249–287, 2005.
Florian Wolf, Edward Gibson, Amy Fisher, and Meredith Knight. A procedure for collecting a
database of texts annotated with coherence relations. Technical report, MIT, Cambridge, MA, 2003.
171