
Coordination Structures in Dependency Treebanks

Martin Popel, David Mareček, Jan Štěpánek, Daniel Zeman, Zdeněk Žabokrtský
Charles University in Prague, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics (ÚFAL)
Malostranské náměstí 25, CZ-11800 Praha, Czechia
{popel|marecek|stepanek|zeman|zabokrtsky}@ufal.mff.cuni.cz

Abstract

Paratactic syntactic structures are notoriously difficult to represent in dependency formalisms. This has painful consequences, such as a high frequency of parsing errors related to coordination. In other words, coordination remains an open problem in the dependency analysis of natural languages. This paper tries to shed some light on this area by bringing a systematizing view of the various formal means developed for encoding coordination structures. We introduce a novel taxonomy of such approaches and apply it to treebanks across a typologically diverse range of 26 languages. In addition, we present empirical observations on the convertibility between selected styles of representation.

1 Introduction

In the last decade, dependency parsing has been receiving increasing attention. One of the reasons is the growing availability of dependency treebanks, whether they are the results of genuine dependency annotation projects or automatic conversions of previously existing phrase-structure treebanks. In both cases, a number of decisions have to be made during the construction or conversion of a dependency treebank. The traditional notion of dependency does not always provide unambiguous solutions, e.g. when it comes to attaching function words. Worse, the dependency representation is at a loss when it comes to representing paratactic linguistic phenomena such as coordination, whose nature is symmetric (two or more conjuncts play the same role), as opposed to the head-modifier asymmetry of dependencies.[1]

[1] We use the term modifier (or child) for all types of dependent nodes, including arguments.

The dominating solution in treebank design is to introduce artificial rules for encoding coordination structures within dependency trees using the same means that express dependencies, i.e., by using edges and by labeling nodes or edges. Obviously, any tree-shaped representation of a coordination structure (CS) must be perceived only as a “shortcut”, since the relations present in coordination structures form an undirected cycle, as illustrated already by Tesnière (1959). For example, if a noun is modified by two coordinated adjectives, there is a (symmetric) coordination relation between the two conjuncts and two (asymmetric) dependency relations between the conjuncts and the noun. However, as there is no obvious linguistic intuition telling us which tree-shaped CS encoding is better, and since the degree of freedom has several dimensions, one can find a number of distinct conventions introduced in particular dependency treebanks. Variations exist both in topology (tree shape) and in labeling. The main goal of this paper is to give a systematic survey of the solutions adopted in these treebanks.

Naturally, the interplay of dependency and coordination links in a single tree leads to serious parsing issues.[2] The present study does not try to decide which coordination style is the best from the parsing point of view.[3] However, we believe that our survey will substantially facilitate experiments in this direction in the future, at least by exploring and describing the space of possible candidates.
[2] CSs have been reported to be one of the most frequent sources of parsing errors (Green and Žabokrtský, 2012; McDonald and Nivre, 2007; Kübler et al., 2009; Collins, 2003). Their impact on the quality of dependency-based machine translation can also be substantial; as documented on an English-to-Czech dependency-based translation system (Popel and Žabokrtský, 2009), 39% of serious translation errors caused by wrong parsing have to do with coordination.

[3] There might be no such answer, as different CS conventions might serve best for different applications or for different parser architectures.

The rest of the paper is structured as follows. Section 2 describes some known problems related to CSs. Section 3 shows possible “styles” for representing CSs. Section 4 lists the treebanks whose CS conventions we studied. Section 5 presents empirical observations on CS convertibility. Section 6 concludes the paper.

2 Related work

Let us first recall the basic well-known characteristics of CSs. In the simplest case of a CS, a coordinating conjunction joins two (usually syntactically and semantically compatible) words or phrases called conjuncts. Even this simplest case is difficult to represent within a dependency tree because, in the words of Lombardo and Lesmo (1998):

Dependency paradigms exhibit obvious difficulties with coordination because, differently from most linguistic structures, it is not possible to characterize the coordination construct with a general schema involving a head and some modifiers of it.

Proper formal representation of CSs is further complicated by the following facts:

• CSs with more than two conjuncts (multi-conjunct CSs) exist and are frequent.
• Besides “private” modifiers of individual conjuncts, there are modifiers shared by all conjuncts, such as in “Mary came and cried”. Shared modifiers may appear alongside private modifiers of particular conjuncts.
• Shared modifiers can be coordinated, too: “big and cheap apples and oranges”.
• Nested (embedded) coordinations are possible: “John and Mary or Sam and Lisa”.
• Punctuation (commas, semicolons, three dots) is frequently used in CSs, mostly in multi-conjunct coordinations or in juxtapositions that can be interpreted as CSs without conjunctions (e.g. “Don’t worry, be happy!”).
• In many languages, a comma or another punctuation mark may play the role of the main coordinating conjunction.
• The coordinating conjunction may be a multi-word expression (“as well as”).
• Deficient CSs with a single conjunct exist.
• Abbreviations like “etc.” comprise both the conjunction and the last conjunct.
• Coordination may form very intricate structures when combined with ellipsis. For example, a conjunct can be elided while its arguments remain in the sentence, as in the following traditional example: “I gave the books to Mary and the records to Sue.”
• The border between paratactic and hypotactic surface means of expressing coordination relations is fuzzy. Some languages can use enclitics instead of conjunctions/prepositions, e.g. Latin “Senatus Populusque Romanus”. Purely hypotactic surface means such as the preposition in “John with Mary” occur too.[4]
• Careful semantic analysis of CSs discloses additional complications: if a node is modified by a CS, it might happen that it is the node itself (and not its modifiers) that should be semantically considered a conjunct. Note the difference between “red and white wine” (which is synonymous with “red wine and white wine”) and “red and white flag of Poland”. Similarly, “five dogs and cats” has a different meaning than “five dogs and five cats”.

[4] As discussed by Stassen (2000), all languages seem to have some strategy for expressing coordination. Some of them lack the paratactic surface means (the so-called WITH-languages), but the hypotactic surface means are present almost always.
Some of these issues were recognized already by Tesnière (1959). In his solution, conjuncts are connected by vertical edges directly to the head and by horizontal edges to the conjunction (which constitutes a cycle in every CS). Many different models have been proposed since, of which the following are the most frequently used:

• MS = Mel’čuk style used in the Meaning-Text Theory (MTT): the first conjunct is the head of the CS, with the second conjunct attached as a dependent of the first one, the third conjunct under the second one, etc. The coordinating conjunction is attached under the penultimate conjunct, and the last conjunct is attached under the conjunction (Mel’čuk, 1988).
• PS = Prague Dependency Treebank (PDT) style: all conjuncts are attached under the coordinating conjunction (along with shared modifiers, which are distinguished by a special attribute) (Hajič et al., 2006).
• SS = Stanford parser style:[5] the first conjunct is the head and the remaining conjuncts (as well as conjunctions) are attached under it.

[5] We use the already established MS-PS-SS distinction to facilitate the literature overview; as shown in Section 3, the space of possible coordination styles is much richer.

One can find various arguments supporting the particular choices. MTT possesses a complex set of linguistic criteria for identifying the governor of a relation (see Mazziotta (2011) for an overview), which leads to MS. MS is preferred in the rule-based dependency parsing system of Lombardo and Lesmo (1998). PS is advocated by Štěpánek (2006), who claims that it can represent shared modifiers using a single additional binary attribute, while MS would require a more complex co-indexing attribute. The argumentation of Tratz and Hovy (2011) follows a similar direction:

We would like to change our [MS] handling of coordinating conjunctions to treat the coordinating conjunction as the head [PS] because this has fewer ambiguities than [MS]. . .

We conclude that the influence of the choice of coordination style is a well-known problem in dependency syntax. Nevertheless, published works usually focus only on a narrow ad-hoc selection of a few coordination styles, without giving any systematic perspective.

Choosing a file format presents a different problem. Despite various efforts to standardize linguistic annotation,[6] no commonly accepted standard exists. The primitive format used for the CoNLL shared tasks is widely used in dependency parsing, but its weaknesses have already been pointed out (cf. Straňák and Štěpánek (2010)). Moreover, particular treebanks vary in their contents even more than in their format, e.g. each treebank has its own way of representing prepositions or its own granularity of syntactic labels.

[6] For example, TEI (TEI Consortium, 2013), PML (Hana and Štěpánek, 2012), SynAF (ISO 24615, 2010).

3 Variations in representing coordination structures

Our analysis of variations in representing coordination structures is based on observations from a set of dependency treebanks for 26 languages.[7]
[7] The primary data sources are the following: Ancient Greek: Ancient Greek Dependency Treebank (Bamman and Crane, 2011), Arabic: Prague Arabic Dependency Treebank 1.0 (Smrž et al., 2008), Basque: Basque Dependency Treebank (a larger version than CoNLL 2007, generously provided by the IXA Group) (Aduriz and others, 2003), Bulgarian: BulTreeBank (Simov and Osenova, 2005), Czech: Prague Dependency Treebank 2.0 (Hajič et al., 2006), Danish: Danish Dependency Treebank (Kromann et al., 2004), Dutch: Alpino Treebank (van der Beek and others, 2002), English: Penn TreeBank 3 (Marcus et al., 1993), Finnish: Turku Dependency Treebank (Haverinen et al., 2010), German: Tiger Treebank (Brants et al., 2002), Greek (modern): Greek Dependency Treebank (Prokopidis et al., 2005), Hindi, Bengali and Telugu: Hyderabad Dependency Treebank (Husain et al., 2010), Hungarian: Szeged Treebank (Csendes et al., 2005), Italian: Italian Syntactic-Semantic Treebank (Montemagni and others, 2003), Latin: Latin Dependency Treebank (Bamman and Crane, 2011), Persian: Persian Dependency Treebank (Rasooli et al., 2011), Portuguese: Floresta sintá(c)tica (Afonso et al., 2002), Romanian: Romanian Dependency Treebank (Călăcean, 2008), Russian: Syntagrus (Boguslavsky et al., 2000), Slovene: Slovene Dependency Treebank (Džeroski et al., 2006), Spanish: AnCora (Taulé et al., 2008), Swedish: Talbanken05 (Nilsson et al., 2005), Tamil: TamilTB (Ramasamy and Žabokrtský, 2012), Turkish: METU-Sabanci Turkish Treebank (Atalay et al., 2003).

In accordance with the usual conventions, we assume that each sentence is represented by one dependency tree, in which each node corresponds to one token (word or punctuation mark). Apart from that, we deliberately limit ourselves to CS representations that have the shape of connected subgraphs of dependency trees. We limit our inventory of means of expressing CSs within dependency trees to (i) tree topology (the presence or absence of a directed edge between two nodes, Section 3.1), and (ii) node labeling (additional attributes stored inside nodes, Section 3.2).[8] Further, we expect that the set of possible variations can be structured along several dimensions, each of which corresponds to a certain simple characteristic (such as choosing the leftmost conjunct as the CS head, or attaching shared modifiers below the nearest conjunct). Even though it does not make sense to create the full Cartesian product of all dimensions, because some values cannot be combined, this allows us to explore the space of possible CS styles systematically.[9]

[8] Edge labeling can be trivially converted to node labeling in tree structures.

[9] The full Cartesian product of variants in Figure 1 would result in 216 topological variants, but only 126 are applicable (the inapplicable combinations are marked with “—” in Figure 1). Those 126 topological variants can be further combined with the labeling variants defined in Section 3.2.

3.1 Topological variations

We distinguish the following dimensions of topological variations of CS styles (see Figure 1).

[Figure 1: Different coordination styles, variations in tree topology. Example phrase: “(lazy) dogs, cats and rats”. Style codes are described in Section 3.1. The figure illustrates with example trees the values of each dimension and the number of treebanks using them: main family – Prague (fP) [14 treebanks], Moscow (fM) [5], Stanford (fS) [6]; choice of head – head on left (hL) [10], head on right (hR) [14], mixed head (hM, a mixture of hL and hR) [1]; attachment of shared modifiers – below the nearest conjunct (sN) [15], below the head (sH) [11]; attachment of the coordinating conjunction – below the previous conjunct (cP) [2], below the following conjunct (cF) [1], between two conjuncts (cB) [8], conjunction as the head (cH, the only applicable style for the Prague family) [14]; placement of punctuation – pP [7], pF [1] and pB [15], analogous to cP, cF and cB (but applicable also to the Prague family).]

Family – configuration of conjuncts. We divide the topological variations into three main groups, labeled as the Prague (fP), Moscow (fM), and Stanford (fS) families.[10] This first dimension distinguishes the configuration of conjuncts: in the Prague family, all the conjuncts are siblings governed by one of the conjunctions (or a punctuation mark fulfilling its role); in the Moscow family, the conjuncts form a chain where each node in the chain depends on the previous (or following) node; in the Stanford family, the conjuncts are siblings except for the first (or last) conjunct, which is the head.[11]

[10] The names are chosen purely as a mnemonic device, so that the Prague Dependency Treebank belongs to the Prague family, the Mel’čuk style belongs to the Moscow family, and the Stanford parser style belongs to the Stanford family.

[11] Note that for CSs with just two conjuncts, fM and fS may look exactly the same (depending on the attachment of conjunctions and punctuation as described below).

Choice of head – leftmost or rightmost. In the Prague family, the head can be either the leftmost[12] (hL) or the rightmost (hR) conjunction or punctuation mark. Similarly, in the Moscow and Stanford families, the head can be either the leftmost (hL) or the rightmost (hR) conjunct. A third option (hM) is to mix hL and hR based on some criterion, e.g. the Persian treebank uses hR for coordination of verbs and hL otherwise. For the experiments in Section 5, we choose the head which is closer to the parent of the whole CS, with the motivation of making the edge between the CS head and its parent shorter, which may improve parser training.

[12] For simplicity, we use the terms left and right even if their meaning is reversed for languages with right-to-left writing systems such as Arabic or Persian.

Attachment of shared modifiers. Shared modifiers may appear before the first conjunct or after the last one. Therefore, it seems reasonable to attach shared modifiers either to the CS head (sH), or to the nearest (i.e. first or last) conjunct (sN).

Attachment of coordinating conjunctions. In the Moscow family, conjunctions may either be part of the chain of conjuncts (cB), or they may be put outside the chain and attached to the previous (cP) or following (cF) conjunct.
In the Stanford family, conjunctions may either be attached to the CS head (and therefore stand between conjuncts, cB), or they may be attached to the previous (cP) or the following (cF) conjunct. The cB option in both the Moscow and Stanford families treats conjunctions in the same way as conjuncts (with respect to topology only). In the Prague family, there is just one option available (cH) – one of the conjunctions is the CS head while the others are attached to it.

Attachment of punctuation. Punctuation tokens separating conjuncts (commas, semicolons, etc.) could be treated in the same way as conjunctions. However, in most treebanks they are treated differently, so we consider them as well. The values pP, pF and pB are analogous to cP, cF and cB, except that punctuation may also be attached to the conjunction in the case of pP and pF (otherwise, a comma before the conjunction would be non-projectively attached to the conjunct following the conjunction).

The three established styles mentioned in Section 2 can be defined in terms of the newly introduced abbreviations: PS = fPhRsHcHpB, MS = fMhLsNcBp?, and SS = fShLsNcBp?.[13]

[13] The question marks indicate that the original Mel’čuk and Stanford parser styles ignore punctuation.
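To make these codes concrete, the following minimal sketch (an editorial illustration, not part of the paper's tooling; the token indexing and the artificial root convention are our own assumptions) encodes the Figure 1 example phrase as head-index arrays under the three established styles.

```python
# Editorial illustration: the phrase "lazy dogs , cats and rats" hung below an
# artificial root (index 0) and encoded as head (parent) indices under the
# three established styles.  Token indices: 1=lazy 2=dogs 3=, 4=cats 5=and 6=rats.
TOKENS = ["<root>", "lazy", "dogs", ",", "cats", "and", "rats"]

STYLES = {
    # PS = fPhRsHcHpB: the rightmost conjunction "and" heads the CS; the conjuncts,
    # the shared modifier "lazy" (sH) and the comma (pB) all hang below it.
    "PS (fPhRsHcHpB)": [5, 5, 5, 5, 0, 5],
    # MS = fMhLsNcB: "dogs" heads the chain dogs <- cats <- and <- rats; "lazy"
    # goes to the nearest conjunct (sN).  Mel'cuk leaves punctuation unspecified
    # (p?); for this sketch the comma is attached to the preceding conjunct.
    "MS (fMhLsNcBp?)": [2, 0, 2, 2, 4, 5],
    # SS = fShLsNcB: "dogs" heads the CS; the other conjuncts, the conjunction
    # and (again only for this sketch) the comma are attached directly to it.
    "SS (fShLsNcBp?)": [2, 0, 2, 2, 2, 2],
}

for style, heads in STYLES.items():
    print(style)
    for i, head in enumerate(heads, start=1):
        print(f"  {TOKENS[i]:>5} -> {TOKENS[head]}")
```

Note how the same six tokens receive three different tree shapes: PS puts everything under the conjunction, MS builds a chain headed by the first conjunct, and SS attaches everything to the first conjunct directly.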
3.2 Labeling variations

Most state-of-the-art dependency parsers can produce labeled edges. However, the parsers produce only one label per edge. To fully capture CSs, we need more than one label, because there are several aspects involved (see the initial assumptions in Section 3): we need to identify the coordinating conjunction (its POS tag might not be enough), the conjuncts, the shared modifiers, and the punctuation that separates conjuncts. Besides that, there should be a label classifying the dependency relation between the CS and its parent. Some of this information can be retrieved from the topology of the tree and the “main label” of each node, but not everything. The additional information can be attached to the main label, but such an approach obscures the logical structure.

In the Prague family, there are two possible ways to label a conjunction and conjuncts:

Code dU (“dependency labeled at the upper level of the CS”). The dependency relation of the whole CS to its parent is represented by the label of the conjunction, while the conjuncts are marked with a special label for conjuncts (e.g. ccof in the Hyderabad Dependency Treebank).

Code dL (“lower level”). The CS is represented by a coordinating conjunction (or punctuation if there is no conjunction) with a special label (e.g. Coord in PDT). Each conjunct then has its own label that reflects the dependency relation towards the parent of the whole CS; therefore, conjuncts of the same CS can have different labels, e.g. “Who[SUBJ] and why[ADV] did it?”

Most Prague family treebanks use sH, i.e. shared modifiers are attached to the head (the coordinating conjunction). Each child of the head has to belong to one of three sets: conjuncts, shared modifiers, and punctuation or additional conjunctions. In PDT, conjuncts, punctuation and additional conjunctions are recognized by specific labels; any other children of the head are shared modifiers.

In the Stanford and Moscow families, one of the conjuncts is the head. In practice, it is never labeled as a conjunct explicitly, because the fact that it is a conjunct can be deduced from the presence of conjuncts among its children. Usually, the other conjuncts are labeled as conjuncts; conjunctions and punctuation also have a special label. This type of labeling corresponds to the dU type. Alternatively (as found in the Turkish treebank, dL), all conjuncts in the Moscow chain have their own dependency labels, and the fact that they are conjuncts follows from the COORDINATION labels of the conjunction and punctuation nodes between them.

To represent shared modifiers in the Stanford and Moscow families, an additional label is again needed to distinguish between private and shared modifiers, since they cannot be distinguished topologically. Moreover, if nested CSs are allowed, a binary label (“shared” versus “private”) is not sufficient, because it also has to indicate to which conjuncts the shared modifier belongs.[14]

[14] This is not needed in the Prague family, where shared modifiers are attached to the conjunction, provided that each shared modifier is shared by conjuncts that form a full subtree together with their coordinating conjunctions; no exceptions were found during the annotation process of the PDT.

We use the following binary flag codes for capturing which CS participants are distinguished in the annotation: m01 = shared modifiers annotated; m10 = conjuncts annotated; m11 = both annotated; m00 = neither annotated.

4 Coordination Structures in Treebanks

In this section, we identify the CS styles defined in the previous section as used in the primary treebank data sources. The statistical observations (such as the amount of annotated shared modifiers) presented here, as well as the experiments on CS-style convertibility presented in Section 5.2, are based on the normalized shapes of the treebanks as contained in the HamleDT 1.0 treebank collection (Zeman et al., 2012).[15] Some of the treebanks were downloaded individually from the web, but most of them came from previously published collections for dependency parsing campaigns: six languages from CoNLL-2006 (Buchholz and Marsi, 2006), seven languages from CoNLL-2007 (Nivre et al., 2007), two languages from CoNLL-2009 (Hajič and others, 2009), and three languages from ICON-2010 (Husain et al., 2010). Obviously, there is a certain risk that the CS-related information contained in the source treebanks was slightly biased by the properties of the CoNLL format upon conversion. In addition, many of the treebanks were natively dependency-based (cf. the 2nd column of Table 1), but some were originally based on constituents, and thus specific converters to the CoNLL format had to be created (for instance, the Spanish phrase-structure trees were converted to dependencies using a procedure described by Civit et al. (2006); similarly, treebank-specific converters have been used for other languages). Again, there is some risk that the CS-related information contained in treebanks resulting from such conversions is slightly different from what was intended in the primary annotation.

[15] A subset of the treebanks whose license terms permit redistribution is available directly at http://ufal.mff.cuni.cz/hamledt/.

[Figure 2: Annotation styles of a few treebanks do not fit well into the multidimensional space defined in Section 3.1. Examples: Danish “hunde, katte og rotter”, Romanian “câini, pisici şi şobolani”, Hungarian “kutyák, macskák és patkányok”.]

There are several other languages (e.g. Estonian or Chinese) which are not included in our study, despite the fact that constituency treebanks exist for them. The reason is that the choice of their CS style would be biased, because no independent converters exist – we would have to convert them to dependencies ourselves. We also know about several more dependency treebanks that we have not processed yet.
Table 1 shows the 26 languages whose treebanks we have studied from the viewpoint of their CS styles. It gives the basic quantitative properties of the treebanks, their CS style in terms of the taxonomy introduced in Section 3, as well as statistics related to CSs: the average number of CSs per 100 tokens, the average number of conjuncts per CS, the average number of shared modifiers per CS,[16] and the percentage of nested CSs among all CSs. The reader can return to Figure 1 to see the basic statistics on the “popularity” of individual design decisions among the developers of dependency treebanks or constituency treebank converters.

[16] All non-Prague family treebanks are marked sN and m00 or m10 (i.e. shared modifiers are not marked in the original annotation, but attached to the head conjunct), because we found no counterexamples (modifiers attached to a conjunct other than the nearest one). The HamleDT normalization procedure contains a few heuristics to detect shared modifiers, but it cannot recover the missing distinction reliably, so the numbers in the “SMs/CS” column are mostly underestimated.
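As a rough illustration of how such per-treebank figures can be obtained (this sketch is ours, not the HamleDT code; it assumes trees already normalized to the Prague style and a CoNLL-like node representation with id, head, deprel and is_member fields, plus PDT-style AuxX/AuxY labels on separating punctuation and extra conjunctions):

```python
# Editorial sketch: compute the CS statistics reported in Table 1 from trees
# normalized to the Prague style, where the head of every CS is labeled
# "Coord", conjuncts carry an is_member flag, and separating punctuation /
# additional conjunctions carry the PDT labels AuxX / AuxY.
def cs_statistics(trees):
    tokens = cs_count = conjuncts = shared = 0
    for tree in trees:                       # each tree: list of node dicts
        tokens += len(tree)
        for node in tree:
            if node["deprel"] != "Coord":
                continue
            cs_count += 1
            children = [n for n in tree if n["head"] == node["id"]]
            conjuncts += sum(1 for c in children if c["is_member"])
            # in the sH style, the remaining children of the Coord node
            # (neither conjuncts nor punctuation/extra conjunctions)
            # are the shared modifiers
            shared += sum(1 for c in children
                          if not c["is_member"] and c["deprel"] not in ("AuxX", "AuxY"))
    if tokens == 0 or cs_count == 0:
        return {"CSs/100tok": 0.0, "CJs/CS": 0.0, "SMs/CS": 0.0}
    return {"CSs/100tok": 100 * cs_count / tokens,
            "CJs/CS": conjuncts / cs_count,
            "SMs/CS": shared / cs_count}
```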
CS styles of most treebanks are easily classifiable using the codes introduced in Section 3, plus a few additional codes:

• p0 = punctuation was removed from the treebank.
• fM* = the Persian treebank uses a mix of fM and fS: fS for coordination of verbs and fM otherwise.

Figure 2 shows three other anomalies:

• fS* = the Danish treebank employs a mixture of fS and fM, where the last conjunct is attached indirectly via the conjunction.
• fP* = the Romanian treebank omits punctuation tokens, and multi-conjunct coordinations get split.
• fT = the Hungarian Szeged treebank uses a “Tesnière family” – disconnected graphs for CSs, where the conjuncts (and the conjunction and punctuation) are attached directly to the parent of the CS, so the other style dimensions are not applicable (hX, cX, pX).

Language        Orig.  Set    Sents.  Tokens   CS style code    Labeling  CSs/100tok  CJs/CS  SMs/CS  Nested CS [%]  RT UAS
Ancient Greek   dep    prim.  31,316  461,782  fP hR sH cH pB   dL m11    6.54        2.17    0.16    10.3           97.86
Arabic          dep    C07     3,043  116,793  fP hL sH cH pB   dL m00    3.76        2.42    0.13    10.6           96.69
Basque          dep    prim.  11,225  151,593  fP hR sN cH pP   dU m00    3.37        2.09    0.03     5.1           99.32
Bengali         dep    I10     1,129    7,252  fP hR sH cH pP   dU m11    4.87        1.71    0.05    24.1           99.97
Bulgarian       phr    C06    13,221  196,151  fS hL sN cB pB   dU m10    2.99        2.19    0.00     0.0           99.74
Czech           dep    C07    25,650  437,020  fP hR sH cH pB   dL m11    4.09        2.16    0.20    14.6           99.42
Danish          dep    C06     5,512  100,238  fS* hL sN cP pB  dU m10    3.68        1.93    0.13     7.5           99.76
Dutch           phr    C06    13,735  200,654  fP hR sN cH pP   dU m10    2.06        2.17    0.05     3.3           99.47
English         phr    C07    40,613  991,535  fP hR sH cH pB   dU m10    2.07        2.33    0.05     6.3           99.84
Finnish         dep    prim.   4,307   58,576  fS hL sN cB pB   dU m10    4.06        2.41    0.00     6.4           99.70
German          phr    C09    38,020  680,710  fM hL sN cP pP   dU m10    2.79        2.09    0.01     0.0           99.73
Greek           dep    C07     2,902   70,223  fP hR sH cH pB   dL m11    3.25        2.48    0.18     7.2           99.43
Hindi           dep    I10     3,515   77,068  fP hR sH cH pP   dU m11    2.45        1.97    0.04    10.3           98.35
Hungarian       phr    C07     6,424  139,143  fT hX sN cX pX   dL m00    2.37        1.90    0.01     2.2           99.84
Italian         dep    C07     3,359   76,295  fS hL sN cB pB   dU m10    3.32        2.02    0.03     3.8           99.51
Latin           dep    prim.   3,473   53,143  fP hR sH cH pB   dL m11    6.74        2.24    0.41    12.3           97.45
Persian         dep    prim.  12,455  189,572  fM* hM sN cB pP  dU m00    4.18        2.10    0.18     3.7           99.82
Portuguese      phr    C06     9,359  212,545  fS hL sN cB pB   dU m10    2.51        1.95    0.26    11.1           99.16
Romanian        dep    prim.   4,042   36,150  fP* hR sN cH p0  dU m10    1.80        2.00    0.00     0.0          100.00
Russian         dep    prim.  34,895  497,465  fM hL sN cB p0   dU m10    4.02        2.02    0.07     3.9           99.86
Slovene         dep    C06     1,936   35,140  fP hR sH cH pB   dL m00    4.31        2.49    0.00    10.8           98.87
Spanish         phr    C09    15,984  477,810  fS hL sN cB pB   dU m10    2.79        1.98    0.14    12.7           99.24
Swedish         phr    C06    11,431  197,123  fM hL sN cF pF   dU m10    3.94        2.19    0.13     0.7           99.66
Tamil           dep    prim.     600    9,581  fP hR sH cH pB   dL m11    1.66        2.46    0.22     3.8           99.67
Telugu          dep    I10     1,450    5,722  fP hR sH cH pP   dU m11    3.48        1.59    0.06     5.0          100.00
Turkish         dep    C07     5,935   69,695  fM hR sN cB pB   dL m10    3.81        2.04    0.00    34.3           99.23

Table 1: Overview of analyzed treebanks. Orig. = original data type (dep = dependency, phr = phrase structure); prim. = primary source; C06–C09 = CoNLL 2006–2009; I10 = ICON 2010; SM = shared modifier; CJ = conjunct; Nested CS = portion of CSs participating in nested CSs (both as the inner and the outer CS); RT UAS = unlabeled attachment score of the roundtrip experiment described in Section 5. Style codes are defined in Sections 3 and 4.

5 Empirical Observations on Convertibility of Coordination Styles

The various styles cannot represent the CS-related information to the same extent. For example, it is not possible to represent nested CSs in the Moscow and Stanford families without significantly changing the number of possible labels.[17] The dL style (which is most easily applicable to the Prague family) can represent coordination of different dependency relations. This is again not possible in the other styles without adding e.g. a special “prefix” denoting the relations. We can see that the Prague family has a greater expressive power than the other two families: it can represent complex CSs using just one additional binary label, distinguishing between shared modifiers and conjuncts. A similar additional label is needed in the other styles to distinguish between shared and private modifiers.

[17] Mel’čuk uses “grouping” to nest CSs – cf. related solutions involving co-indexing or bubble trees (Kahane, 1997). However, these approaches were not used in any of the researched treebanks. To combine grouping with shared modifiers, each group in a tree would have to have a different identifier.

Because of the different expressive power, converting a CS from one style to another may lead to a loss of information. For example, as there is no way of representing shared modifiers in the Moscow family without an additional attribute, converting a CS with shared modifiers from the Prague to the Moscow family makes the modifiers private. When converting back, one can use certain heuristics to handle the most obvious cases, but sometimes the modifiers will stay private (very often, the nature of a modifier depends on context or is debatable even for humans, e.g. “Young boys and girls”).

5.1 Transformation algorithm

We developed an algorithm to transform one CS style into another. Two subtasks must be solved by the algorithm: identification of the individual CSs and their participants, and transformation of the individual CSs. Obviously, the individual CSs cannot be transformed independently, because of coordination nesting. For instance, when transforming a nested coordination from the Prague style to the Moscow style (e.g. to fMhL), the leftmost conjunct in the inner (lower) coordination must climb up to become the head of the inner CS, but then it must climb up once again to become the head of the outer (upper) CS too. This shows that inner CSs must be transformed first. We tackle this problem by a depth-first recursion.
When going down the tree, we only recognize all the participants of the CSs, classify them, and gather them in a separate data structure (one for each visited CS). The following four types of CS participants are distinguished: coordinating conjunctions, conjuncts, shared modifiers, and punctuation tokens that separate conjuncts.[18] No change to the tree is performed during these descent steps. When returning from the recursion (i.e., when climbing from a node back up to its parent), we test whether the abandoned node is the topmost node of some CS. If so, this CS is transformed, which means that its participants are re-attached and relabelled according to the target CS style. This procedure naturally guarantees that the inner CSs are transformed first and that all CSs have been transformed by the time the recursion returns to the root.

[18] Conjuncts are explicitly marked in most styles. Coordinating conjunctions can usually be identified with the help of dependency labels and POS tags. Punctuation separating conjuncts can be detected with high accuracy using simple rules. If shared modifiers are not annotated (code m00 or m10), one can imagine rule-based heuristics or special classifiers trained to distinguish shared modifiers. For the experiments in this section, we use the HamleDT gold annotation attribute marking shared modifiers.
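The sketch below is an editorial illustration of the ascent step (it is not the actual HamleDT/Treex implementation and uses a made-up Node class): after the recursive descent, a Prague-style CS (head conjunction labeled Coord, conjuncts flagged as members) is rewritten into a Moscow-style chain (fMhLsNcB). Labels are left untouched, and shared modifiers simply become private modifiers of the first conjunct, which is precisely the information loss discussed at the beginning of this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    form: str
    idx: int = 0                  # word-order index
    deprel: str = "dep"           # "Coord" marks the head conjunction of a Prague-style CS
    is_member: bool = False       # True for conjuncts in the Prague style
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def attach(self, new_parent: "Node") -> None:
        """Re-hang this node under a new parent."""
        if self.parent is not None:
            self.parent.children.remove(self)
        self.parent = new_parent
        new_parent.children.append(self)

def prague_to_moscow(node: Node) -> None:
    """Rewrite fPhRsHcH subtrees into fMhLsNcB chains (labels, punctuation and
    nested shared-modifier scoping are ignored in this simplified sketch)."""
    for child in list(node.children):        # descent: inner CSs are handled first
        prague_to_moscow(child)
    if node.deprel != "Coord":
        return
    conjuncts = sorted((c for c in node.children if c.is_member), key=lambda c: c.idx)
    others = [c for c in node.children if not c.is_member]   # shared modifiers, punctuation
    if not conjuncts or node.parent is None:
        return
    first = conjuncts[0]
    first.attach(node.parent)                # leftmost conjunct becomes the CS head (hL)
    for prev, nxt in zip(conjuncts, conjuncts[1:-1]):
        nxt.attach(prev)                     # chain the middle conjuncts
    node.attach(conjuncts[-2] if len(conjuncts) > 1 else first)  # conjunction joins the chain (cB)
    if len(conjuncts) > 1:
        conjuncts[-1].attach(node)           # last conjunct hangs below the conjunction
    for extra in others:
        extra.attach(first)                  # shared modifiers become private: information loss

# "dogs, cats and rats" in the Prague style: "and" is the head, everything below it.
root = Node("<root>", idx=0)
coord = Node("and", idx=5, deprel="Coord")
coord.attach(root)
for i, form in ((2, "dogs"), (4, "cats"), (6, "rats")):
    Node(form, idx=i, is_member=True).attach(coord)
Node(",", idx=3).attach(coord)
prague_to_moscow(root)                       # now: dogs <- cats <- and <- rats (fMhL chain)
```

Running the converse transformation on the output and comparing the resulting parent of every token with the original tree is, in essence, the roundtrip evaluation whose UAS figures are reported in the last column of Table 1.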
5.2 Roundtrip experiment

The number of possible conversion directions obviously grows quadratically with the number of styles. So far, we have limited ourselves to conversions from/to the style of the HamleDT treebank collection, which contains all the treebanks under study already converted into a common scheme. The common scheme is based on the conventions of PDT, whose CS style is fPhRsHcHpB.[19] We selected nine styles (3 families times 3 head choices) and transformed all the HamleDT-scheme treebanks to these nine styles and back, which we call a roundtrip. The resulting averaged unlabeled attachment scores (UAS, evaluated against the HamleDT scheme) in the last column of Table 1 indicate that the percentage of transformation errors (i.e. tokens attached to a different parent after the roundtrip) is lower than 1% for 20 out of the 26 languages.[20]

[19] As documented in Zeman et al. (2012), the normalization procedures used in HamleDT embrace many other phenomena as well (not only those related to coordination), and involve both structural transformations and dependency relation relabeling.

[20] Table 1 shows that the Latin and Ancient Greek treebanks have on average more than 6 CSs per 100 tokens and more than 2 conjuncts per CS, and Latin also has the highest number of shared modifiers per CS. Therefore the percentage of nodes affected by the roundtrip is the highest for these languages, and the lower roundtrip UAS is not surprising.

A manual inspection revealed two main error sources. First, as noted above, the Stanford and Moscow families have lower expressive power than the Prague family, so the inverse transformation was naturally ambiguous and the transformation heuristics were not capable of identifying the correct variant every time. Second, we also encountered inconsistencies in the original treebanks (which we were not trying to fix in HamleDT for now).

6 Conclusions and Future Work

We described a (theoretically very large) space of possible representations of CSs within the dependency framework. We pointed out a range of details that make CSs a really complex phenomenon; anyone dealing with CSs in treebanking should take these observations into account. We proposed a taxonomy of the approaches that have been argued for in the literature or employed in real treebanks. We studied 26 existing treebanks of different languages. For each value of each dimension in Figure 1, we found at least one treebank where the value is used; even so, several treebanks take their own unique path that cannot be clearly classified under the taxonomy (the taxonomy could indeed be extended, at the price of being less clearly arranged).

We discussed the convertibility between the various styles and implemented a universal tool that transforms between any two styles of the taxonomy. The tool achieves a roundtrip accuracy close to 100%. This is important because it opens the door to easily switching coordination styles for parsing experiments, phrase-to-dependency conversion, etc.

While the focus of this paper is to explore and describe the expressive power of various annotation styles, we did not address the learnability of the styles by parsers. That is a complementary point of view, and thus a natural direction of future work for us.

Acknowledgments

We thank the providers of the primary data resources. The work on this project was supported by the Czech Science Foundation grants no. P406/11/1499 and P406/2010/0875, and by research resources of the Charles University in Prague (PRVOUK). This work has been using language resources developed and/or stored and/or distributed by the LINDAT-Clarin project of the Ministry of Education of the Czech Republic (project LM2010013). Further, we would like to thank Jan Hajič, Ondřej Dušek and four anonymous reviewers for many useful comments on the manuscript of this paper.

References

Itziar Aduriz et al. 2003. Construction of a Basque dependency treebank. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories.

Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. “Floresta sintá(c)tica”: a treebank for Portuguese. In LREC, pages 1698–1703.

Nart B. Atalay, Kemal Oflazer, and Bilge Say. 2003. The annotation process in the Turkish treebank. In Proceedings of the 4th Intern. Workshop on Linguistically Interpreted Corpora (LINC).

David Bamman and Gregory Crane. 2011. The Ancient Greek and Latin dependency treebanks. In Language Technology for Cultural Heritage, Theory and Applications of Natural Language Processing, pages 79–98. Springer Berlin Heidelberg.

Igor Boguslavsky, Svetlana Grigorieva, Nikolai Grigoriev, Leonid Kreidlin, and Nadezhda Frid. 2000. Dependency treebank for Russian: Concept, tools, types of information. In Proceedings of the 18th Conference on Computational Linguistics – Volume 2, pages 987–991. Association for Computational Linguistics, Morristown, NJ, USA.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, Sozopol.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL, pages 149–164.

Montserrat Civit, Maria Antònia Martí, and Núria Bufí. 2006. Cat3LB and Cast3LB: From constituents to dependencies. In FinTAL, volume 4139 of Lecture Notes in Computer Science, pages 141–152. Springer.

Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.
Dóra Csendes, János Csirik, Tibor Gyimóthy, and András Kocsor. 2005. The Szeged treebank. In TSD, volume 3658 of Lecture Notes in Computer Science, pages 123–131. Springer.

Mihaela Călăcean. 2008. Data-driven dependency parsing for Romanian. Master’s thesis, Uppsala University, August.

Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdeněk Žabokrtský, and Andreja Žele. 2006. Towards a Slovene dependency treebank. In LREC 2006, pages 1388–1391, Genova, Italy. European Language Resources Association (ELRA).

Nathan Green and Zdeněk Žabokrtský. 2012. Hybrid combination of constituency and dependency trees into an ensemble dependency parser. In Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data, pages 19–26, Avignon, France. Association for Computational Linguistics.

Jan Hajič, Jarmila Panevová, Eva Hajičová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, Marie Mikulová, Zdeněk Žabokrtský, and Magda Ševčíková-Razímová. 2006. Prague Dependency Treebank 2.0. CD-ROM, Linguistic Data Consortium, LDC Catalog No.: LDC2006T01, Philadelphia.

Jan Hajič et al. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009), June 4–5, Boulder, Colorado, USA.

Jirka Hana and Jan Štěpánek. 2012. Prague markup language framework. In Proceedings of the Sixth Linguistic Annotation Workshop, pages 12–21, Stroudsburg, PA, USA. Association for Computational Linguistics.

Katri Haverinen, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Filip Ginter, and Tapio Salakoski. 2010. Treebanking Finnish. In Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), pages 79–90.

Samar Husain, Prashanth Mannem, Bharat Ambati, and Phani Gadde. 2010. The ICON-2010 tools contest on Indian language dependency parsing. In Proceedings of the ICON-2010 Tools Contest on Indian Language Dependency Parsing, Kharagpur, India.

ISO 24615. 2010. Language resource management – Syntactic annotation framework (SynAF).

Sylvain Kahane. 1997. Bubble trees and syntactic representations. In Proceedings of the 5th Meeting of the Mathematics of Language, DFKI, Saarbrücken.

Matthias T. Kromann, Line Mikkelsen, and Stine Kern Lynge. 2004. Danish dependency treebank.

Sandra Kübler, Erhard Hinrichs, Wolfgang Maier, and Eva Klett. 2009. Parsing coordinations. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 406–414, Athens, Greece, March. Association for Computational Linguistics.

Vincenzo Lombardo and Leonardo Lesmo. 1998. Unit coordination and gapping in dependency theory. In Processing of Dependency-Based Grammars: Proceedings of the Workshop. COLING-ACL, Montreal.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330.

Nicolas Mazziotta. 2011. Coordination of verbal dependents in Old French: Coordination as a specified juxtaposition or apposition. In Proceedings of the International Conference on Dependency Linguistics (DepLing 2011).

Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 122–131.

Igor A. Mel’čuk. 1988. Dependency Syntax: Theory and Practice. State University of New York Press.
Simonetta Montemagni et al. 2003. Building the Italian syntactic-semantic treebank. In Building and Using Parsed Corpora, Language and Speech series, pages 189–210, Dordrecht. Kluwer.

Jens Nilsson, Johan Hall, and Joakim Nivre. 2005. MAMBA meets TIGER: Reconstructing a Swedish treebank from antiquity. In Proceedings of the NODALIDA Special Session on Treebanks.

Joakim Nivre, Johan Hall, Sandra Kübler, Ryan McDonald, Jens Nilsson, Sebastian Riedel, and Deniz Yuret. 2007. The CoNLL 2007 shared task on dependency parsing. In Proceedings of the CoNLL 2007 Shared Task. EMNLP-CoNLL, June.

Martin Popel and Zdeněk Žabokrtský. 2009. Improving English-Czech Tectogrammatical MT. The Prague Bulletin of Mathematical Linguistics, (92):1–20.

Prokopis Prokopidis, Elina Desipri, Maria Koutsombogera, Harris Papageorgiou, and Stelios Piperidis. 2005. Theoretical and practical issues in the construction of a Greek dependency treebank. In Proceedings of the 4th Workshop on Treebanks and Linguistic Theories (TLT), pages 149–160.

Loganathan Ramasamy and Zdeněk Žabokrtský. 2012. Prague dependency style treebank for Tamil. In Proceedings of LREC 2012, pages 23–25, İstanbul, Turkey. European Language Resources Association.

Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, and Behrouz Minaei-Bidgoli. 2011. A syntactic valency lexicon for Persian verbs: The first steps towards Persian dependency treebank. In 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 227–231, Poznań, Poland.

Kiril Simov and Petya Osenova. 2005. Extending the annotation of BulTreeBank: Phase 2. In The Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), pages 173–184, Barcelona, December.

Otakar Smrž, Viktor Bielický, Iveta Kouřilová, Jakub Kráčmar, Jan Hajič, and Petr Zemánek. 2008. Prague Arabic dependency treebank: A word on the million words. In Proceedings of the Workshop on Arabic and Local Languages (LREC 2008), pages 16–23, Marrakech, Morocco. European Language Resources Association.

Leon Stassen. 2000. And-languages and with-languages. Linguistic Typology, 4(1):1–54.

Jan Štěpánek. 2006. Capturing a Sentence Structure by a Dependency Relation in an Annotated Syntactical Corpus (Tools Guaranteeing Data Consistence) (in Czech). Ph.D. thesis, Charles University in Prague, Faculty of Mathematics and Physics, Prague, Czech Republic.

Pavel Straňák and Jan Štěpánek. 2010. Representing layered and structured data in the CoNLL-ST format. In Alex Fang, Nancy Ide, and Jonathan Webster, editors, Proceedings of the Second International Conference on Global Interoperability for Language Resources, pages 143–152, Hong Kong, China. City University of Hong Kong.

Mariona Taulé, Maria Antònia Martí, and Marta Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In LREC. European Language Resources Association.

TEI Consortium. 2013. TEI P5: Guidelines for Electronic Text Encoding and Interchange.

Lucien Tesnière. 1959. Éléments de syntaxe structurale. Paris.

Stephen Tratz and Eduard Hovy. 2011. A fast, accurate, non-projective, semantically-enriched parser. In Proceedings of EMNLP, pages 1257–1268, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Leonoor van der Beek et al. 2002. Chapter 5. The Alpino dependency treebank. In Algorithms for Linguistic Processing NWO PIONIER Progress Report, Groningen, The Netherlands.
Daniel Zeman, David Mareček, Martin Popel, Loganathan Ramasamy, Jan Štěpánek, Zdeněk Žabokrtský, and Jan Hajič. 2012. HamleDT: To parse or not to parse? In Proceedings of LREC 2012, pages 2735–2741, İstanbul, Turkey. European Language Resources Association.