
Linguistic Issues in Language Technology – LiLT
Volume 7, Issue 8, January 2012
Published by CSLI Publications

Semantic Annotation for the Digital Humanities – Using Markov Logic Networks for Annotation Consistency Control

Anette Frank, Thomas Bögel, Oliver Hellwig⋆, Nils Reiter
Department of Computational Linguistics, Heidelberg University
⋆ South Asia Institute, Heidelberg University

Abstract

This contribution investigates novel techniques for error detection in automatic semantic annotations, as an attempt to reconcile error-prone NLP processing with the high quality standards required for empirical research in Digital Humanities. We demonstrate the state-of-the-art performance of semantic NLP systems on a corpus of ritual texts and report performance gains we obtain using domain adaptation techniques. Our main contribution is to explore new techniques for annotation consistency control, as an attempt to reconcile error-prone NLP processing with high quality requirements. The novelty of our approach lies in its attempt to leverage multi-level semantic annotations by defining interaction constraints between local word-level semantic annotations and global discourse-level annotations. These constraints are defined using Markov Logic Networks, a logical formalism for statistical relational inference that allows for violable constraints. We report first results.

Copyright © 2012, CSLI Publications.

1 Introduction

The work described in this paper is embedded in an interdisciplinary project that aims at analyzing regularities and variances in the event structures of Nepalese rituals.[1] The focus of this project is on investigating the event structure of rituals by applying computational linguistic analysis techniques to written descriptions of rituals. For scholars working in applied research in Digital Humanities, it is important that any evidence derived from computational analysis is accurate and reliable. Thus, for our project – as for others with similar foundations – it is of utmost importance to produce high-quality automatic annotations.

[1] The project is part of the collaborative research center (Sonderforschungsbereich, SFB) "SFB 619: Ritual Dynamics" at Heidelberg University; http://www.ritualdynamik.de.

But despite the many prospects that computational analysis can offer for empirical research in the Humanities, assuring near-to-perfect quality of computational analysis is still beyond the limits of state-of-the-art systems. As has been shown in current work on domain adaptation (e.g. Daumé III (2007)), including our own, there is potential for improving the quality of current NLP tools by applying domain adaptation techniques. However, the gap in performance between current system outputs and (near-to-)perfect annotation quality is still considerable.

In this contribution we investigate novel techniques for annotation error detection to guide manual annotation control or to acquire training material for domain adaptation. In contrast to most earlier work, which concentrates on the detection of part-of-speech (PoS) or parsing errors, our focus is on semantic annotation. The novelty of our approach lies in its attempt to leverage multi-level semantic annotation for annotation consistency control and error detection. We will interface local word-level semantic annotations with global discourse-level annotations.
Concretely, we will define interaction constraints between annotations produced by a word sense disambiguation (WSD) and a coreference resolution (CR) system. These constraints will be defined using Markov Logic Networks (MLN, Richardson and Domingos (2006)), a first-order predicate logic formalism for statistical relational inference that allows the definition of violable constraints.

The paper is organized as follows. Section 2 reviews previous work on error detection for linguistic annotations. Section 3 presents an evaluation of the performance of various semantic processors: word sense disambiguation (WSD), frame-semantic labeling (SRL) and coreference resolution (CR) systems, which we adapted to the domain of ritual texts. Section 4 motivates our approach for cross-level semantic annotation consistency control and introduces a method for consistency checking using Markov Logic Networks (MLNs). Section 5 reports the results of our first experiments for annotation error detection. Section 6 summarizes our findings.

2 Related Work

Methods for detecting annotation errors have been developed early on in the context of treebank construction, to enhance the quality of linguistic annotations for training supervised systems for PoS tagging or parsing (see e.g. Dickinson and Meurers (2003)). Prevalent techniques include observations based on corpus statistics, such as checking for deviations in PoS assignments over identical n-grams in a given corpus (Dickinson and Meurers, 2003; Loftsson, 2009), or inferring infrequent or "negative" n-grams from clean corpora to detect PoS annotation errors (Kvĕtoň and Oliva, 2002). Other techniques make use of manually defined or learned context-sensitive error detection (and correction) rules (Dickinson and Meurers, 2003). Detecting errors in syntactically annotated corpora works in similar ways, by extracting grammar rules or trees from labeled corpora and comparing the obtained rules or structures and their frequencies to those obtained from validated annotated corpora. Methods range from comparing full or partial structures to trees including surface frontiers, to the use of strict or partial overlap criteria (cf. e.g. Dickinson (2010)).

There is little research, to date, that investigates methods for detecting errors in semantic annotation. Given the difficulty of the task, most annotation projects rely on the four-eye principle to detect disagreements among annotators. Yu et al. (2008) are the first to investigate methods for detecting mistaken agreements between annotators in assigning word senses. Here, all agreed-upon annotations are compared against the assignments of a supervised WSD system. Clearly, this method requires a substantial amount of annotated instances for training the WSD system, thus it is only suited for high-volume annotations for known lexical items. Yu et al. (2008)'s system identifies 40% of the erroneous annotations in the data. This performance is insufficient for fully automatic error correction. But it achieves a cost-effectiveness ratio that seems high enough to propose suspicious instances for manual control.
Dickinson and Lee (2008) apply a data-driven approach to detect errors in predicate-argument structure annotations that relates to earlier work using n-gram error detection for identifying syntax errors. Basically, the method identifies identical pieces of text with variational annotation. It thus requires a substantial amount of labeled data with overlapping surface strings.

To our knowledge, there is no prior work that makes use of multiple annotation layers to detect inconsistencies in manual or automatic semantic annotations. Also, no attempts have been made to integrate statistical observations with logical constraints to define inter-level dependencies between annotations for this purpose. In our work, this will be attempted using Markov Logic Networks, as explained in Section 4.

3 Multi-Level Semantic Annotation

This section describes the textual data, the preprocessing steps and the individual semantic analysis components we use for analyzing the event structure of rituals, including the performance they achieve when applied to ritual texts.

3.1 Corpus of ritual descriptions

Our corpus of ritual texts consists of ritual descriptions obtained from two types of sources: part of the corpus is supplied from modern ethnographic observations of rituals in Nepal, another from Sanskrit manuals about Nepalese rituals, which are translated to English by ritual experts.[2] The complete ritual corpus currently consists of 48 descriptions of rituals, ranging from 3 to 339 sentences, and comprises 97,460 tokens.

[2] Most texts are drawn from the works of Gutschow and Michaels (2008) and Gutschow and Michaels (2005). An extension of the current corpus is planned on the basis of the upcoming volume about marriage rituals.

While most texts deal with rites de passage[3] in Nepal and are, therefore, rather consistent at the topic level, there are clear differences in their language styles. The translations from Sanskrit texts consist mainly of short sentences with an average sentence length of 18 words. They frequently use a terse, condensed language with many imperatives and nominal constructions, which reflects the style of the underlying Sanskrit originals: "Now, the rules for the rice-feeding ceremony." "Hand over the flower basket." The style of the ethnographic descriptions may be characterized as "scientific prose" with longer sentences and nested substructures (average sentence length: 26 words): "The pair of pots designated for a female spirit is likewise painted and the lump of clay worshipped as Śiva or Agni during the Buddhists' Girl's Marriage to the bel fruit fashioned out of clay."

[3] The term rite de passage denotes rituals that are performed during the transition between important states in a person's life. Our ritual corpus comprises descriptions of classical Indian transitory rituals (saṃskāra, e.g., first feeding of solid food and beginning of the Veda study) as well as typical Nepalese rituals such as the "Marriage to the Ihi fruit".

Both text types contain numerous terms that are specific to South Asian material and religious culture, such as names of gods (Śiva, Agni) or indigenous fruits (bel fruit). In order to facilitate processing, these terms are replaced by approximate translations to English provided by ritual experts prior to processing. Since we store the original terms as annotations, we can re-insert them after processing (cf. Reiter et al. (2011) for further details on the text characteristics of this corpus).
3.2 NLP architecture

All systems are integrated in a full-fledged natural language processing architecture based on UIMA[4], with analysis results stored as stand-off annotations. The architecture comprises various processors for the major preprocessing steps – tokenization, PoS tagging, lemmatization and chunking – as well as the semantic and discourse-level analysis components discussed below: word sense disambiguation, semantic role labeling and coreference resolution.

[4] http://uima.apache.org

Given the low quality we obtained using off-the-shelf processors trained on common text genres (i.e., newspaper), we experimented with various domain adaptation techniques for PoS tagging and chunking (cf. Reiter et al. (2011) for details). For PoS tagging, a standard model trained on the Wall Street Journal (WSJ) achieved an accuracy of approximately 90% on a manually annotated test set consisting of 672 sentences. We were able to improve on that by retraining the model on the concatenation of the WSJ and oversampled ritual data. This way, the PoS tagger achieved a performance of just under 97%. In a similar fashion, we improved the model for chunking from an f-score of 0.86 (when trained on the WSJ) to an f-score of over 0.88 when trained on the concatenation of the WSJ and oversampled ritual data.

In the following, we discuss processing components for three levels of semantic and discourse-level annotation and their adaptation to the ritual domain: word sense disambiguation using WordNet (Fellbaum, 1998) as sense inventory; semantic role labeling based on FrameNet (Baker et al., 1998); and coreference resolution.

3.3 Word Sense Disambiguation

For word sense annotation we employ the graph-based UKB system of Agirre and Soroa (2009). While supervised WSD systems rely on manually labelled training data, UKB explores the graph structure of WordNet using the PageRank algorithm for assigning senses to target words in a given context.

WSD performance. To build a gold standard for testing UKB's performance, we randomly chose 50 sentences from all ritual descriptions. These sentences were annotated independently by two annotators with word senses from WordNet 2.0. Both annotators have a computational linguistics background. Differences between the two annotations have been adjudicated.[5] This resulted in 462 annotated nouns, verbs, adjectives and adverbs, forming our gold standard for WSD.

[5] In two cases WordNet 2.0 did not contain appropriate concepts for annotation: "bel fruit" (Sanskrit bilva; a fruit used for worshipping Śiva) and "block print". These words were left unannotated.

We assessed the performance of UKB using precision and recall as evaluation metrics, calculated for individual word types and micro-averaged over all types. As the semantic annotation of verbs will be mainly covered by FrameNet annotations, we specifically report on the performance of WordNet sense disambiguation for nouns and adjectives, next to performance on all words. Here and in all the following experiments, the WSD system selects candidate synsets based on the PoS tags provided by our own domain-adapted, probabilistic PoS tagger.
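For concreteness, the following minimal Python sketch shows one common way of computing coverage, precision and recall for WSD output against such a gold standard, micro-averaging over annotated instances; the sense keys are invented, and the project's own scoring (e.g., its per-type aggregation) may differ in detail.

    # Micro-averaged WSD scoring against a gold standard (illustrative sketch;
    # the sense keys below are invented and not taken from the project data).
    def score_wsd(gold, system):
        """gold: {instance_id: gold sense}; system: {instance_id: sense or None}."""
        attempted = [i for i in gold if system.get(i) is not None]
        correct = sum(1 for i in attempted if system[i] == gold[i])
        coverage  = len(attempted) / len(gold)
        precision = correct / len(attempted) if attempted else 0.0
        recall    = correct / len(gold)
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        return coverage, precision, recall, f_score

    gold   = {1: "pot#n#2", 2: "clay#n#1", 3: "worship#v#1"}
    system = {1: "pot#n#2", 2: "clay#n#2", 3: None}
    print(score_wsd(gold, system))   # (0.67, 0.5, 0.33, 0.4), rounded

Coverage is the share of gold instances for which the system returns any sense at all, which is why precision and recall can diverge, as in Table 1 below.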
The performance results for different system configurations are summarized in Table 1. We assigned the most frequent sense (MFS) from WordNet 2.0 as a baseline. This baseline achieves a precision of 53.9% and a recall of 54.5% for all words. For 5.7% of the tokens, the baseline implementation does not return a word sense. This loss in coverage is mainly caused by erroneous PoS assignments.

TABLE 1  Evaluation results: WSD without and with domain adaptation

                          MFS     UKB_WN2.0   UKB+rit-node
Nouns       Coverage      94.5    93.3        93.3
            Precision     59.8    60.2        64.1
            Recall        60.0    53.7        57.3
            F-Score       59.9    56.8        60.5
Adjectives  Coverage      88.4    86.9        86.9
            Precision     48.3    51.2        49.8
            Recall        49.3    49.3        47.8
            F-Score       48.8    50.2        48.8
All Words   Coverage      94.3    93.1        93.1
            Precision     53.9    54.2        56.4
            Recall        54.5    49.9        51.8
            F-Score       54.2    51.9        54.0

We first tested the performance of UKB 0.1.6 using standard WordNet (2.0). The system achieves a precision of 54.2% and a recall of 49.9% (for all words) and thus performs below the MFS baseline (the loss in recall outranks the gain in precision), which is not unusual for unsupervised WSD systems. The coverage drops by a small amount to 93.1%.

Domain adaptation for WSD. In order to adapt UKB to the ritual domain, we enriched the WordNet database with domain-specific sense information. We acquired senses that may be characteristic for the ritual domain from the Digital Corpus of Sanskrit (DCS, Hellwig (2011)). This corpus is designed as a general-purpose philological resource that covers Sanskrit texts from 500 BCE until 1900 CE without any special focus on the ritual domain. In this corpus, approximately 400,000 tokens have been manually annotated with word senses from WordNet 2.0. Using this annotated corpus for domain sense acquisition was motivated by the supposition that even general passages from Sanskrit literature may contain a significant number of senses that are relevant for the ritual domain. We linked all 3,294 word senses that were annotated in this corpus to a newly introduced non-lexicalized pseudo-synset rit-topic. As UKB calculates the PageRank between sense-related words in the WordNet database, introducing this node increases the chances that senses specific to Nepalese culture receive a higher rank (cf. Reddy et al. (2010) for a similar approach).

As seen in Table 1, linking domain-related senses to a pseudo-synset results in an improvement of 2.2 points in precision and 1.9 points in recall for all words, when compared to UKB_WN2.0. Moreover, the domain-adapted UKB system now closely matches the MFS baseline in F-Score. Note further that for nouns the domain-adapted WSD system obtains the best results (P: 64.1%, F: 60.5), and outperforms the MFS baseline in terms of precision (+4.3) and f-score (+0.6), with only a slight loss in recall (R: 57.3%; -2.7) and coverage remaining stable. This is in line with our general aim of producing precise annotations.

3.4 Frame-semantic labeling

Semi-automatic frame-semantic labeling. We added frame-semantic annotation to the ritual descriptions in a semi-automatic way. First, a learner trained on small amounts of annotated data was used to assign frames in unannotated descriptions. The assigned frames were checked by two annotators, and differences were adjudicated by one supervisor. In a second step, semantic roles were assigned manually to the adjudicated frames by two annotators, and were again checked for consistency by the supervisor. Depending on the complexity and the ambiguity of a frame, we observed an inter-annotator agreement between κ = 0.619 (frame Manipulation) and κ = 1.0 (frame Cutting) for frame annotation.
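Cohen's κ, as used for these agreement figures, corrects raw agreement between the two annotators for the agreement expected by chance from their individual label distributions. A minimal sketch of the computation in Python (the label sequences are invented toy data, not our annotation):

    # Cohen's kappa for two annotators (toy frame labels, purely illustrative).
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        dist_a, dist_b = Counter(labels_a), Counter(labels_b)
        p_chance = sum((dist_a[l] / n) * (dist_b[l] / n)
                       for l in set(labels_a) | set(labels_b))
        return (p_observed - p_chance) / (1 - p_chance)

    a = ["Placing", "Cutting", "Placing", "Manipulation", "Placing"]
    b = ["Placing", "Cutting", "Manipulation", "Manipulation", "Placing"]
    print(round(cohens_kappa(a, b), 3))   # 0.688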
For role annotation, we observed a global κ = 0.469, which indicates rather low agreement. However, a closer look at the data reveals that 89.4% of the differences in role annotations occur when one annotator annotates a role that the other annotator does not recognize. Using this double annotation approach, we built up a manually checked gold corpus that contains 1,505 frames of 12 different types and 3,061 roles of 94 different types.

Automatic semantic annotation quality. To reduce the need for time-consuming manual annotation, we experimented with existing semantic role labeling systems. We evaluated the probabilistic SRL system Semafor (Das et al., 2010), which has been trained with FrameNet (1.5) data, against the manually annotated gold corpus described above. Semafor achieved P: 49.6%, R: 34.4% and F: 40.6 for frame labeling.[6]

[6] Eight cases with multiword gold targets were excluded from consideration in automatic evaluation, as it is unclear whether partial matches can be considered as meaning preserving.

Error analysis shows that the accuracy of Semafor varies strongly depending on the frames. Semafor performs poorly with frames that carry culture-specific notions or are evoked by unusual lexemes in the ritual descriptions. For the frame Text_creation, for instance, Semafor yields R: 0.2%, P: 0.9% and F: 0.3, because it labels target words such as chant consistently with the frame Communication_manner, while our group decided to annotate the frame Text_creation in these cases.[7] The low recall can be explained by the fact that verbs such as recite, which are missing in FrameNet, are annotated manually with the frame Text_creation in the gold corpus. On the other hand, we observe good accuracy for less specialized frames such as Placing (P: 82.2%, R: 76.2%, F: 79.1). An analysis of coverage gaps according to Palmer and Sporleder (2010) shows that about 75% of all errors in frame assignment are caused by insufficient training material in FrameNet.[8]

[7] Chantings in rituals are usually not meant as a form of communication.
[8] Using the notation introduced in (Palmer and Sporleder, 2010, p. 932f), the detailed numbers are as follows: NOTR-LU: 8.5% (83 instances; including those cases in which the annotation report of FrameNet gives less than three annotated instances), NOTR-TGT: 10.1% (99), UNDEF-LU: 17.7% (174), UNDEF-TGT: 36.9% (362).

The evaluation of semantic roles was restricted to the roles of those frames that were annotated correctly by Semafor. On these 1,182 roles, Semafor achieved P: 58.2%, R: 62.1% and F: 60.1, allowing both partial and perfect overlap of spans; P: 52.0%, R: 55.5%, F: 53.7 if restricted to perfect matches.[9] As major sources of error, we identified non-local roles and non-core roles that are missing in Semafor's output, domain-specific vocabulary of our texts, and syntactic peculiarities such as imperatives. On the whole, we are confident that system annotations for frames and roles can be improved by retraining Semafor on our labeled domain data.

[9] Precision could be slightly underestimated due to a number of roles (80) in Semafor's output that are not annotated in the gold standard, but could still be correct.

3.5 Coreference Resolution

Coreference Resolution using BART. We chose BART (Versley et al., 2008) as our primary tool for coreference resolution. BART implements a classical approach towards coreference resolution based on the classification of mention pairs, as described in Soon et al. (2001). Integrated preprocessing components (PoS tagging, constituent parsing, etc.) are used to extract mentions and their features. The system includes precompiled models for anaphora and coreference resolution using a standard feature set for pair-wise classification trained on the MUC6 data set (Chinchor and Sundheim, 2003). Best results were achieved using the precompiled MaxEnt model.
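The mention-pair architecture referred to above classifies pairs of mentions as coreferent or not and links positively classified pairs into chains. The sketch below illustrates that general scheme (Soon et al., 2001) with invented toy features, hand-set weights in place of a trained classifier, and closest-first linking; it is not BART's actual feature set or implementation.

    # Schematic mention-pair coreference linking (illustration only, not BART).
    from dataclasses import dataclass

    @dataclass
    class Mention:
        idx: int                 # linear position in the text
        text: str                # surface string
        is_pronoun: bool = False

    def features(antecedent, anaphor):
        # Toy feature vector for a candidate (antecedent, anaphor) pair.
        return {
            "string_match": antecedent.text.lower() == anaphor.text.lower(),
            "pronoun_anaphor": anaphor.is_pronoun,
            "close": (anaphor.idx - antecedent.idx) <= 3,
        }

    def pair_score(feats):
        # Hand-set weights standing in for a trained pairwise classifier.
        weights = {"string_match": 2.0, "pronoun_anaphor": 0.5, "close": 1.0}
        return sum(weights[k] for k, v in feats.items() if v)

    def resolve(mentions, threshold=1.0):
        # Closest-first linking: each mention picks the nearest antecedent
        # whose pair score exceeds the threshold; links are merged into chains.
        parent = {m.idx: m.idx for m in mentions}
        def find(i):
            while parent[i] != i:
                i = parent[i]
            return i
        for j, anaphor in enumerate(mentions):
            for antecedent in reversed(mentions[:j]):      # closest first
                if pair_score(features(antecedent, anaphor)) > threshold:
                    parent[find(anaphor.idx)] = find(antecedent.idx)
                    break
        chains = {}
        for m in mentions:
            chains.setdefault(find(m.idx), []).append(m.text)
        return list(chains.values())

    mentions = [Mention(0, "the priest"), Mention(2, "he", True),
                Mention(5, "the girl"), Mention(7, "the priest")]
    print(resolve(mentions))
    # [['the priest', 'he', 'the priest'], ['the girl']]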
Domain adaptation techniques. Given extremely poor results when using BART as an off-the-shelf coreference resolver (cf. Reiter et al. (2011)), we tested several strategies to enhance its performance on ritual texts. First, to reduce noise from preprocessing, we adapted BART's integrated preprocessing pipeline to include our own domain-adapted components for PoS tagging and chunking. Two further enhancements are used to tailor the system to our targeted domain and interests. (i) After mention detection, a WordNet lookup filters out mentions of specific semantic classes. This allows us to concentrate on the most important and most frequent entity types: persons and gods (as opposed to non-animated objects). Also, (ii) we included domain-specific knowledge to improve the predictions of BART's semantic agreement features: we extended BART's internal database for names and procedures with a new category for gods and added gender information for items frequently occurring in ritual texts to the existing knowledge databases.

Evaluating automatic coreference annotation quality. We evaluated BART's performance on manually annotated gold standards using MUC and B3 as evaluation metrics. Both metrics compare chains of mentions produced by the system with corresponding chains in the gold standard to measure coreference resolution performance. MUC counts missing links between mentions in the system's output relative to the gold standard (cf. Vilain et al. (1995)). Despite known shortcomings of MUC, it is still widely used. Bagga and Baldwin (1998) resolve these issues with the introduction of the B3 metric, which judges each mention individually, resulting in a stricter and more realistic evaluation metric for most scenarios.
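For reference, a compact sketch of the B3 computation just described (Bagga and Baldwin, 1998): each mention is scored by the overlap between the system chain and the gold chain containing it, and the per-mention scores are averaged. The chains are toy data; treating unlinked mentions as singletons is one of several possible conventions.

    # B-cubed (B3) coreference scoring over toy chains (illustrative sketch).
    def b_cubed(gold_chains, system_chains):
        def chain_of(mention, chains):
            for c in chains:
                if mention in c:
                    return c
            return {mention}                 # unlinked mention -> singleton
        mentions = set().union(*gold_chains)
        p_sum = r_sum = 0.0
        for m in mentions:
            g, s = chain_of(m, gold_chains), chain_of(m, system_chains)
            overlap = len(g & s)
            p_sum += overlap / len(s)        # per-mention precision
            r_sum += overlap / len(g)        # per-mention recall
        p, r = p_sum / len(mentions), r_sum / len(mentions)
        return p, r, 2 * p * r / (p + r)

    gold   = [{"priest_1", "he_2", "priest_5"}, {"girl_3", "she_4"}]
    system = [{"priest_1", "he_2"}, {"girl_3", "she_4", "priest_5"}]
    print(b_cubed(gold, system))   # approx. (0.73, 0.73, 0.73)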
This holds both Semantic Annotation for the Digital Humanities / 11 F 49.52 59.64 35.86 32.21 36.82 23.74 46.81 40.46 58.2 38.16 38.56 33.73 48.8 59.62 36.13 29.26 36.92 23.80 50.51 40.2 61.5 37.05 38.49 34.32 P F St a m nd od ar el d P B3 R MUC R all person object 41.9 55.73 25.78 60.53 64.15 58.41 D ge om m nd ain od er el entity subtypes all person object 39.9 55.59 25.6 62.82 63.89 61.83 TABLE 3 Results for CR with entity type restrictions and gender database Standard model Domain gender model TABLE 4 P R F 49.1 47.7 78.9 80.4 60.5 59.9 Identification of mentions for the standard gender model of BART (upper part) and the domainadapted model (lower part). This result fits well with our main interest in analyzing event chains from rituals, where coreference information for the main actors is of primary importance, and our general interest in achieving high-quality annotations. (ii) For the domain-specific enhancements to the gender model, we observe clear gains in recall in comparison to the non-adapted model.10 This goes along with a slight drop in precision across all categories. For the person entity subtype, however, the gender model does not have a clear impact, with pretty stable recall and precision. Table 4 shows that in the model with enhanced gender features more mentions can be linked to entity chains (recall of mention identification rises to 80.4%). This explains the general improvement in recall, with a trend towards a drop in precision, due to misclassifications. In this respect, the person entity type shows robust behavior, with almost identical overall performance. We may still expect improved performance of the domain model when analyzing larger data sets. Overall, our person-restricted domain-adapted models achieve clearly improved precision, with a boost of 18.05 points (MUC) and 8.16 points (B3 ) when compared to the unadapted standard BART model (cf. Table 2), with solid gains in f-scores (8.13 and 3.06 points, respectively). 10 Table 3 (domain gender model) highlights results that outperform the corresponding variant of the standard model. 12 / LiLT volume 7, issue 8 4 January 2012 Exploiting Multiple Layers for Consistency Control As seen above, we can achieve significant improvements in labeling accuracy for WSD and CR by applying different domain adaptation strategies. For frame-semantic annotation, we identified issues of domain-specific senses that can be addressed by retraining Semafor on the domain corpora that were labeled semi-automatically. Still, it soon became clear in our interdisciplinary project that for the ritual scientists it is crucial that any observations obtained from data analysis are reliable. As we have seen, this cannot be realistically achieved by the current state of the art in NLP. Manual annotation, on the other hand, seems out of reach for a substantial amount of data. As a way to counterbalance error-prone automatic annotation with measures to ensure high annotation quality, we investigated methods for consistency control that can help identify erroneous annotations in the data, for targeted manual correction or to acquire valuable training data for improving automatic labelers. As outlined in Section 2, methods for error detection have by now concentrated on morphological and syntactic analysis. The few attempts reported on consistency checking in semantics are confined to a single level of annotation (Yu et al., 2008) or mainly draw on techniques for syntactic error detection (Dickinson and Lee, 2008). 
Our work focuses on error detection techniques that leverage multiple levels of (discourse-)semantic annotation.

Intra- and inter-level consistency. In general, consistency control can be addressed from two perspectives: relying on evidence obtained for a single level of annotation, or else deriving consistency constraints from known interactions or dependencies across levels that can be used to detect outliers in annotations. We refer to these opposing views as intra- and inter-level consistency.

Classical methods for intra-level consistency control are voting or classifier combination using alternative labelers. This is a well-known, effective technique for improving system results in generic classification tasks. It is evaluated in Loftsson (2009) for PoS error detection and could be applied to any level of analysis, including semantics. Other methods rely on frequency distributions obtained from corpora. The focus of our work is on inter-level consistency control. In particular, we exploit dependencies between local and global annotation decisions, by interfacing word-level and discourse-level semantic annotation.

Discourse-level semantic dependencies. Our approach starts from a discourse perspective and the observation that coherence at the discourse level affects disambiguation decisions that are typically taken at the word or sentence level, such as WSD or SRL. This dependency is at the heart of the one-sense-per-discourse (OSD) hypothesis (Gale et al., 1992) that was successfully exploited for WSD (Yarowsky, 1995). As we focus on the semantic annotation of discourse in the form of ritual descriptions, we can exploit discourse-level constraints for semantic annotation and vice versa, to detect erroneous annotations.

Specifically, we will exploit dependencies between coreference resolution (CR) and word sense disambiguation (WSD). CR establishes coreference chains, consisting of a set of so-called mentions, typically common nouns, pronouns or proper names. This set is also referred to as a (discourse) entity, as all mentions jointly refer to a single entity. The task of WSD is to select a specific sense from the set of possible senses of a word that is appropriate in the given context.

A natural assumption for the dependency between CR and WSD is that all common nouns contained in an entity are closely sense-related. Following the OSD hypothesis, this should be trivially true for multiple occurrences of the same common noun. For lexically distinct nouns, we can still assume that for coreferring, but ambiguous nouns, their contextually correct senses are closely related. We will test this hypothesis by defining two consistency constraints that determine sense selection and the assignment of mentions to a discourse entity. They predict:

Cons.ws: for a mention m in a given entity e, sense selection (i.e. WSD) chooses a sense s that is close to a "central" concept representation c for entity e.

Cons.cr: for a given mention m with contextually assigned sense s, m is assigned to an entity e whose "central" concept c is closely related to or compatible with s.

We compute such a central or "centroid" semantic representation for discourse entities using the graph-theoretical notion of a key player. The key player is a measure that determines a central node in a graph by choosing a single node that is closest to all other nodes (Navigli and Lapata, 2010).
In our case, we compute a key player sense for an entity from a semantic graph we build from all word senses of all its mentions, using WordNet. The edges of the graph correspond to the sense relations defined in WordNet, choosing the shortest distances between connected senses (cf. Bögel (2011)). We then use the distance[11] d between word senses s and the key player sense c of an entity to estimate the consistency of sense assignment to a mention and assignment of a mention to an entity, according to the constraints defined above:

If d is small, Cons.ws predicts sense assignment s to a mention m in e to be consistent with discourse-level decisions captured in an existing entity e. If an alternative sense s′ is closer to c, the decision of the WSD system needs to be revised, and s′ is a candidate sense to consider for m. If, on the other hand, Cons.cr finds a mention m in e whose assigned sense s is not close enough to c, or closer to the centroid c′ of another entity e′, the decision of the CR system needs to be revised, and e′ is a candidate entity to consider for m.

That is, we can compute semantic distances between assigned or possible senses of a mention and the centroid concepts of established discourse entities to detect violations of the coherence constraints. Any instances that incur (more or less severe) violations of these constraints should point us to outliers in semantic annotations.

[11] The distance measure is based on counting edges between nodes. Of course, other measures of similarity or distance could be used.
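To make this concrete, the sketch below computes a key-player centroid for one entity from WordNet path distances between the candidate senses of its mentions and flags assigned senses that lie too far from that centroid, in the spirit of Cons.ws. NLTK's WordNet interface and path similarity stand in for the project's own sense graph and distance measure; the example mentions, sense choices and threshold are invented.

    # Sketch: key-player centroid of an entity and distance-based outlier
    # flagging. NLTK WordNet stands in for the project's own sense graph;
    # the toy entity, assigned senses and threshold are invented.
    from nltk.corpus import wordnet as wn

    def sense_distance(s1, s2, no_path=30):
        sim = s1.path_similarity(s2)        # path_similarity = 1 / (path_len + 1)
        return round(1 / sim - 1) if sim else no_path

    def key_player(senses):
        # Centroid sense: the candidate closest to all other candidates overall.
        return min(senses, key=lambda s: sum(sense_distance(s, o)
                                             for o in senses if o is not s))

    def flag_suspicious(candidates, assigned, threshold=3):
        # candidates: {mention: [possible synsets]}; assigned: {mention: synset}.
        # Returns the centroid and the mentions whose assigned sense lies
        # further than `threshold` from it (candidates for manual control).
        all_senses = [s for cands in candidates.values() for s in cands]
        c = key_player(all_senses)
        flagged = [m for m, s in assigned.items()
                   if sense_distance(s, c) > threshold]
        return c, flagged

    # Toy entity with two common-noun mentions and their WordNet noun senses;
    # the "assigned" first senses play the role of the WSD system's output.
    candidates = {"pot": wn.synsets("pot", "n"), "vessel": wn.synsets("vessel", "n")}
    assigned = {"pot": wn.synsets("pot", "n")[0], "vessel": wn.synsets("vessel", "n")[0]}
    centroid, flagged = flag_suspicious(candidates, assigned)
    print(centroid, flagged)

A symmetric check against the centroids of competing entities would correspond to Cons.cr.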
Defining dependencies using Markov Logic Networks. Markov Logic is a formalism that uses weighted first-order predicate logic formulas. Formulas with a low weight are "cheaper" to violate, while the violation of formulas with higher weights is more expensive. Weights can be specified manually or learned from data. Next to weighted formulas, the system also allows the encoding of "hard rules" (or hard constraints) that cannot be violated. They receive a very high weight (∞). Figures 1 to 4 show the declarations and rule sets we defined to implement the constraints formulated above. All rules and definitions follow the syntax of Alchemy[12], with !p denoting negation of a literal p, c! enforcing uniqueness on a variable assignment for c, and +c enforcing estimation of weights over formulas of individually grounded variables c.

[12] http://alchemy.cs.washington.edu

We define three variants of rule sets (Sets I, II and III) for two types of inference rules: one targeted at predicting the assignment of a mention to an entity according to Cons.cr, the other targeted at predicting the assignment of a sense to a mention m according to Cons.ws. That is, we define the target predicate in_m_e in the first case, and the predicate has_sense in the second. Figure 1 displays the modeling predicates in the declaration part used for all rule sets.

// Declarations, used in all rule sets
has_sense(ment,sen!)   // each mention is assigned exactly one sense
poss_sense(ment,sen)   // mentions have possible senses
in_m_e(ment,ent!)      // each mention is assigned to exactly one entity
centroid(ent,cen!)     // each entity has exactly one centroid
dist(sen,cen,int)      // distance betw. sense and centroid in path length

FIGURE 1  Predicate declarations common to all rule sets

Set I (Figure 2) makes use of a distance threshold (dx = 0, ..., 3) between m's (possible) senses and the entity centroid c for the assignment of a possible sense s to a mention (Cons.ws) or of a mention m to an entity (Cons.cr).

// CR
has_sense(m,s) ^ centroid(e,c) ^ dist(s,c,d) ^ d <= dx => in_m_e(m,e)
has_sense(m,s) ^ centroid(e,c) ^ dist(s,c,d) ^ d > dx  => !in_m_e(m,e)
// WSD
poss_sense(m,s) ^ in_m_e(m,e) ^ centroid(e,c) ^ dist(s,c,d) ^ d <= dx => has_sense(m,s)
poss_sense(m,s) ^ in_m_e(m,e) ^ centroid(e,c) ^ dist(s,c,d) ^ d > dx  => !has_sense(m,s)

FIGURE 2  Rule Set I for CR and WSD, dx = 0, ..., n

In order to determine plausible distance thresholds, Set II (Figure 3) learns rule weights for the entire range of individual distances observed in our data (0-30), for both positive and negative assignments. Rules for Set II are similar to Set I, but specify individual distances distance(s,c,+d). Due to the plus sign, the rules are compiled to individual rules for each distance value and thus, for each distance, we obtain an individual rule weight.

// CR
has_sense(m,s) ^ centroid(e,c) ^ distance(s,c,+d) => in_m_e(m,e)
has_sense(m,s) ^ centroid(e,c) ^ distance(s,c,+d) => !in_m_e(m,e)
// WSD
poss_sense(m,s) ^ in_m_e(m,e) ^ centroid(e,c) ^ distance(s,c,+d) => has_sense(m,s)

FIGURE 3  Rule Set II for CR and WSD

Set III (Figure 4) offers a formulation that does not resort to a fixed distance threshold or a spread of distinct distance ranges as in Set I and Set II. Instead, we define a discriminative rule set that assigns a mention to an entity e′ if the distance of its sense s to the centroid of e′ is smaller than the distance between s and the centroid of the automatically established entity e; otherwise, the assignment of m to e is preserved. In a similar way we define the (alternative) assignment of a sense s′ to a mention m if the distance between the mention's entity centroid c and s′ is smaller than the distance to an alternative sense s.

// Predicates only used in set III
inferred_in_m_e(ment, ent!)
inferred_has_sense(ment, sen!)

// CR
// 1. if a mention is closer to another entity's centroid, change decision
in_m_e(m,e) ^ has_sense(m,s) ^ centroid(e,c) ^ distance(s,c,d) ^
  centroid(e',c') ^ e != e' ^ distance(s,c',d') ^ d' < d => inferred_in_m_e(m,e')
// 2. if mention isn't closer to another entity's centroid, keep decision
in_m_e(m,e) ^ has_sense(m,s) ^ centroid(e,c) ^ distance(s,c,d) ^
  centroid(e',c') ^ e != e' ^ distance(s,c',d') ^ d' >= d => inferred_in_m_e(m,e)

// WSD
// 1. if a sense is closer to the centroid than another sense, keep decision
in_m_e(m,e) ^ has_sense(m,s) ^ poss_sense(m,s') ^ s != s' ^
  centroid(e,c) ^ distance(s,c,d) ^ distance(s',c,d') ^ d <= d' => inferred_has_sense(m,s)
// 2. if another sense is closer to the centroid, change the decision
in_m_e(m,e) ^ has_sense(m,s) ^ poss_sense(m,s') ^ s != s' ^
  centroid(e,c) ^ distance(s,c,d) ^ distance(s',c,d') ^ d > d' => inferred_has_sense(m,s')

FIGURE 4  Rule Set III for CR and WSD

All the above rules are defined as soft constraints. A small number of hard constraints model the WSD and CR task, stating, e.g., that each mention is assigned a single sense and assigned to a single entity.
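As a rough intuition for how weighted and hard formulas interact during inference – in Markov Logic, a candidate world is scored by the summed weights of its satisfied ground formulas, and worlds violating a hard formula are excluded (Richardson and Domingos, 2006) – consider the following toy scorer in Python. The two rules and their weights are invented and drastically simplified relative to the Alchemy rule sets above.

    # Toy illustration of Markov Logic scoring: candidate assignments are
    # ranked by the summed weights of satisfied soft rules; violating a hard
    # rule excludes a world. Rules and weights are invented, not the paper's.
    import math

    def score(world, soft_rules, hard_rules):
        if not all(rule(world) for rule in hard_rules):
            return -math.inf                     # hard constraint violated
        return sum(w for w, rule in soft_rules if rule(world))

    # Two candidate worlds for a mention m with possible senses s1 and s2,
    # where only s1 is close to the centroid of m's entity.
    worlds = {
        "keep s1":      {"sense": "s1", "close_s1": True, "close_s2": False},
        "switch to s2": {"sense": "s2", "close_s1": True, "close_s2": False},
    }
    soft = [(1.2, lambda w: w["sense"] == "s1" and w["close_s1"]),
            (1.2, lambda w: w["sense"] == "s2" and w["close_s2"])]
    hard = [lambda w: w["sense"] in ("s1", "s2")]    # sense must be a possible sense of m
    for name, w in worlds.items():
        print(name, score(w, soft, hard))            # the higher-scoring world wins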
In addition, we experimented with a variant of Set III, Set III_hard, in which the same rules have been defined as hard constraints.

5 Experiments and Evaluation

Data processing. We processed ritual descriptions using UKB+rit-node as the WSD system and BART in its best configuration (gender model restricted to persons/gods) as the coreference resolution system. For this task, entities were filtered to only contain common nouns. This way, centroids are sharply defined, being entirely based on nominal senses. We exported the resulting data into a collection of MLN predicates. A small set of data has been annotated manually to serve as development and test sets. Table 5 shows an overview of the data sets we used.

TABLE 5  Information about training, development and test data sets

WSD                                          train     dev     test
# tokens                                     3795      53      141
# types                                      602       11      22
tokens/type                                  6.18      4.8     6.4
sense ambiguity (chain) (avg/median)         2.92/2    2.21/2  2.56/2
sense ambiguity (all nouns) (avg/median)     3.62/3    3.38/3  3.5/3

CR                                           train     dev     test
# mentions                                   2656      41      78
# chains (entities)                          156       14      19
# NNs/chain                                  19.05     2.86    4.05

Evaluation measures. Since our main goal is error detection, we report precision, recall and f-score for the detection of mistakes in automatic annotations. Ideally, we want high precision (i.e. a small number of false positives) and high recall (i.e. a small number of false negatives), to be able to propose potential annotation mistakes for manual control. To gain better insight into the data, we further evaluated classical performance measures for WSD and CR for the inferred against gold annotations using the MLN constraints for the best rule set.

Experiments. We evaluated the performance of the consistency constraints defined above in a number of experiments, using the Alchemy implementation of Markov Logic. We first determined plausible distance thresholds for Set I by inducing rule weights for individual distances (Set II) on a large set of automatic annotations. Based on this, we selected dx = 0, ..., 3 for WSD and dx = 0, ..., 2 for CR in Set I. Running evaluation experiments on the development data sets for Sets I, II and III did not yield significant differences, given the very small data set size. For evaluation on the final test set we therefore report results for all settings.

Results. Table 6 presents the results for error detection. The experiments with rule sets I and II use learned rule weights, the experiment with Set III uses hard constraints (i.e., rules with theoretically infinite weight).[13]

[13] Our experiments on Set III using learned weights could not be completed due to repeated process failures. In prior experiments we had obtained evaluation results close to Set I.

TABLE 6  Experiment results for error detection

                 WSD                      CR
             P      R      F          P      R      F
I d=0        34.8   100    51.6       67.5   82.6   74.3
I d=1        34.8   100    51.6       69.6   86.3   77.1
I d=2        34.8   100    51.6       69.5   83.9   76.0
I d=3        34.8   100    51.6       –      –      –
II           34.1   96.8   50.4       68.5   82.6   74.9
III_hard     34.8   100    51.6       64.9   31.3   42.3

For WSD, we achieve the same results on all rule sets. In particular, precision is in need of improvement. Compared to the other rule sets, Set II achieves less recall and precision.
Setting a higher threshold leads to a loss in both precision and recall. The latter figures look promising for automatic error detection as support for targeted annotation control. Set III yields lowest f-score for CR, which is mainly due to a very low recall. This indicates that the decision to reattach a mention to a new entity can not be based on distance alone. Instead, approaches using rule weights learned on data and thus tailored to distribution in the data achieve much better performance. Overall, the figures in Table 6 show mixed results. For WSD, precision for error detection is devastatingly low. For CR, by contrast, we obtain very promising results of 69.6% precision at 77.1 points fscore that seem to reach a level of realistic cost-effectiveness to support manual annotation control. Comparison of classical performance results for the sense and mention assignments predicted by the MLN inference rules in contrast to the original system assignments, however, show that automatic error correction is by far out of reach: The labeling performance of the predicted output of MLN inference drops by 5.06 (MUC)/6.41 (B3 ) points f-score for CR and over 50 points f-score for WSD. Future work will investigate more refined constraint sets to obtain overall higher precision levels, in particular for WSD. 6 Conclusions and Future Work To summarize, the contributions of our paper are two-fold: (i) We discussed performance issues in automatic semantic annotation of ritual texts and showed that domain adaptation can improve the annotation quality for WSD and CR. For frame-semantic annotation we could identify performance problems that can be addressed by retraining the semantic role labeling system on our semi-automatically annotated domain corpora, similarly to the domain adaptation methods employed Semantic Annotation for the Digital Humanities / 19 for preprocessing. (ii) To further reduce the gap between automatic annotation quality and the high quality standards required for empirical research in Digital Humanities, we investigated a novel approach to error detection using Markov Logic as formal framework. Our approach to consistency control for semantic annotation explores inter-level dependencies between local (WSD) and discourse-level (CR) annotation decisions. Our experiments show promising results for detection of mistakes in automatic CR annotations, while error detection on sense assignment could not be achieved at a realistic level of performance. In this paper, we could only come up with first investigations of this novel technique – with ample space for improvement. First, our evaluation results are based on a small evaluation data set. Larger data sets are required to support statistically significant results and conclusions. Also, our current rule sets for consistency control rely on static and still noisy centroids computed on top of automatic CR and sense annotations. This severely restricts the induction of novel, more homogeneous discourse entities. Our current experiments do not yet exploit the full power of Markov Logic Networks in that constraints for CR and WSD error detection are compiled in distinct rule sets. Future work will investigate joint processing of these constraints. We will also integrate the computation of centroids into the MLN inference process, so that changes in sense and mention assignment can more directly affect the computation of consistency constraints. 
Further improvements could be gained by including surface information for mentions and by formulating constraints that implement the one-sense-per-discourse hypothesis. Finally, we will pursue a deeper investigation of models similar to Set III that make error detection less dependent on the optimization of distance thresholds.

Acknowledgments

This research has been funded by the German Research Foundation (DFG) and is part of the collaborative research center on ritual dynamics (Sonderforschungsbereich SFB-619, Ritualdynamik).[14] We thank our student researchers Julio Cezar Rodrigues and Britta Zeller for assisting with the experiments and providing gold standard linguistic annotations, as well as Borayin Maitreya Larios and Nils Jakob Liersch, who provided frame-semantic annotations of the ritual texts. We further thank the anonymous reviewers for comments and suggestions.

[14] http://www.ritualdynamik.de

References

Agirre, Eneko and Aitor Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL), pp. 33–41. Athens, Greece.

Bagga, Amit and Breck Baldwin. 1998. Algorithms for Scoring Coreference Chains. In Proceedings of the LREC 1998 Linguistic Coreference Workshop, pp. 536–566. Granada, Spain.

Baker, Collin F., Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, vol. 1, pp. 86–90.

Bögel, Thomas. 2011. Entity-based Coreference Resolution combined with Discourse-New Detection. Bachelor's thesis, Heidelberg University.

Chinchor, Nancy and Beth Sundheim. 2003. Message Understanding Conference (MUC) 6. Philadelphia: Linguistic Data Consortium.

Das, Dipanjan, Nathan Schneider, Desai Chen, and Noah A. Smith. 2010. Probabilistic Frame-Semantic Parsing. In Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 948–956. Los Angeles, California.

Daumé III, Hal. 2007. Frustratingly Easy Domain Adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 256–263. Prague, Czech Republic.

Dickinson, Markus. 2010. Detecting Errors in Automatically-Parsed Dependency Relations. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala, Sweden.

Dickinson, Markus and Chong Min Lee. 2008. Detecting Errors in Semantic Annotation. In Proceedings of the Sixth International Conference on Language Resources and Evaluation. Marrakech, Morocco.

Dickinson, Markus and Detmar Meurers. 2003. Detecting Errors in Part-of-Speech Annotation. In Proceedings of the 10th Conference of the European Chapter of the ACL (EACL). Budapest, Hungary.

Fellbaum, Christiane. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Gale, William A., Kenneth W. Church, and David Yarowsky. 1992. A Method for Disambiguating Word Senses in a Large Corpus. Computers and the Humanities 26(5–6):415–439.

Gutschow, Niels and Axel Michaels. 2005. Handling Death. The Dynamics of Death and Ancestor Rituals Among the Newars of Bhaktapur, vol. 3 of Ethno-Indology. Heidelberg Studies in South Asian Rituals. Harrassowitz Verlag.

Gutschow, Niels and Axel Michaels. 2008. Growing Up. Hindu and Buddhist Initiation Rituals among Newar Children in Bhaktapur, vol. 6 of Ethno-Indology. Heidelberg Studies in South Asian Rituals. Harrassowitz Verlag.
Hellwig, Oliver. 2011. DCS - The Digital Corpus of Sanskrit. Heidelberg.

Kvĕtoň, Pavel and Karel Oliva. 2002. (Semi-)Automatic Detection of Errors in PoS-Tagged Corpora. In Proceedings of the 19th International Conference on Computational Linguistics (Coling). Taipei, Taiwan.

Loftsson, Hrafn. 2009. Correcting a POS-Tagged Corpus Using Three Complementary Methods. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL), pp. 523–531. Athens, Greece.

Navigli, Roberto and Mirella Lapata. 2010. An Experimental Study of Graph Connectivity for Unsupervised Word Sense Disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(4):678–692.

Palmer, Alexis and Caroline Sporleder. 2010. Evaluating FrameNet-style Semantic Parsing: The Role of Coverage Gaps in FrameNet. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling), pp. 928–936. Beijing, China.

Reddy, Siva, Abhilash Inumella, Diana McCarthy, and Mark Stevenson. 2010. IIITH: Domain Specific Word Sense Disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 387–391. Uppsala, Sweden.

Reiter, Nils, Oliver Hellwig, Anette Frank, Irina Gossmann, Borayin Maitreya Larios, Julio Rodrigues, and Britta Zeller. 2011. Adapting NLP Tools and Frame-Semantic Resources for the Semantic Analysis of Ritual Descriptions. In C. Sporleder, A. van den Bosch, and K. Zervanou, eds., Language Technology for Cultural Heritage, Foundations of Human Language Processing and Technology. Springer.

Richardson, Matthew and Pedro Domingos. 2006. Markov Logic Networks. Machine Learning 62:107–136.

Soon, Wee Meng, Daniel Chung Yong Lim, and Hwee Tou Ng. 2001. A Machine Learning Approach to Coreference Resolution of Noun Phrases. Computational Linguistics 27(4):521–544.

Versley, Yannick, Simone Paolo Ponzetto, Massimo Poesio, Vladimir Eidelman, Alan Jern, Jason Smith, Xiaofeng Yang, and Alessandro Moschitti. 2008. BART: A Modular Toolkit for Coreference Resolution. In Proceedings of the ACL-08: HLT Demo Session, pp. 9–12. Columbus, Ohio.

Vilain, Marc, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A Model-Theoretic Coreference Scoring Scheme. In Proceedings of the 6th Conference on Message Understanding (MUC), pp. 45–52. Morristown, NJ, USA.

Yarowsky, David. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189–196. Cambridge, Massachusetts, USA.

Yu, Liang-Chih, Chung-Hsien Wu, and Eduard H. Hovy. 2008. OntoNotes: Corpus Cleanup of Mistaken Agreement Using Word Sense Disambiguation. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling), pp. 1057–1064. Manchester, UK.