In the new millennium, work on epigraphic material has experienced an unforeseen upswing, at least in Indology. After the compilation of fundamental corpora at the beginning of the twentieth century, literary and religious-historical topics had tended to dominate after the Second World War. And this despite completely changed travel conditions: where our scholarly predecessors still had to work from paper rubbings or from photographs, modern tourism with its unlimited and by now affordable long-distance flights would have made it easy to inspect the inscriptions in the original. Moreover, the content of inscriptions suits contemporary research questions, since they frequently contain statements on politics and society that the producers of the 'fine' arts hardly considered worth mentioning. Precisely where fashionable topics such as 'subaltern', 'the other' or 'equity' are concerned, valuable statements can be gleaned from epigraphic texts. A correct reading naturally presupposes two things: a) absolute certainty about the wording of the material, that is, the ability to read the script, and b) the ability to understand that wording, in plain terms, knowledge of the language. Once the content has been understood, it can be interpreted. This requires a methodology, that is, a palette of research questions and the tools that go with them. These questions include external matters: a) When was a text composed, by whom, and for whom? and b) Why was the text placed where we find it in the original, or where was it before it was relocated? The second question led to a DFG project that allowed the first author, over ten years beginning in 1992, to visit and photographically document the edicts of the Maurya ruler Aśoka from the 3rd century BCE in India, Pakistan and Nepal, in order to be able to say why rocks and pillars were inscribed where we find them. The result is an inventory,1 which, viewed as a whole, also permitted answers concerning the circumstances of a first 'public' script in the Maurya empire, in use from about 250 BCE onwards.2
Word and major constituent order in Vedic and classical Sanskrit have been an important area of research since at least Bergaigne 1878 and Delbrück 1878. To name just a few important studies, one may cite the foundational overview by Delbrück 1888, Bloomfield's investigation of the mantra variants under this angle (Bloomfield 1913), Lahiri's very useful wide-ranging study, which also deals with material from the Aitareyabrāhmaṇa (Lahiri 1933), Gonda's general observations (Gonda 1952), and Klein's two rich articles devoted to the Rigveda (Klein 1991; Klein 1994). The temporal dimension, which is of special interest for the present paper, has recently been addressed by Reinöhl 2016, who adduces evidence for a diachronically increasing degree of configurationality in Indo-Aryan languages that comes along with a more regulated word order. A more in-depth survey of related studies can be found in Holland 1980, pp. 1-32. Deshpande and Hock 1991, Hock 2013a and Hock 2013b provide a detailed bibliography with further pointers to the relevant literature. This contribution in honor of Prof. Mislav Ježić addresses one peculiar subproblem from the area just sketched: Is there any relevant diachronic development to be observed in the placement of direct objects? Any study dealing with the question of language change in Vedic is confronted with serious methodological problems caused by the structure of its textual corpus. The oldest parts of the Vedic corpus consist almost exclusively of poetry, whereas most texts of the younger strata are in prose. As a consequence, temporal and register boundaries coincide, which makes it difficult to disentangle the mutual influence of these factors. A radical solution to this problem has been chosen by those researchers who, like Delbrück 1888 before them, assumed that poetry should be neglected in general studies of word order, as it is incomparable to prose texts due to the freedom offered by 'poetic licence'. Hock 2000, however, has pointed out that this line of thought is neither convincing nor fruitful, as it effectively excludes large parts of the Vedic corpus from our investigations. On the other hand, the difference in register should not be ignored either, as is in fact done by those who take the strongly differing word order in Vedic poetry and prose at face value and directly construct a development from a free word order in Rigvedic times to a rigid subject-object-verb (SOV) schema in later Brāhmaṇa prose.1 By preparing statistics that compare Ṛgvedic word order with that in prose passages of Kālidāsa's Śakuntalā as well as with Aśoka's Rock
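Since the study hinges on where direct objects stand relative to their verbs, the following minimal sketch shows one way such counts could be collected from dependency-annotated sentences. The head indices and relation names follow common UD conventions, but the sentence data are placeholders, not material from the paper.

```python
def count_object_placement(sentences):
    """Count how often a direct object precedes or follows its verbal head.

    Each sentence is a list of (position, head_position, relation) triples,
    with 1-based positions as in CoNLL-U.
    """
    before = after = 0
    for sent in sentences:
        for pos, head, rel in sent:
            if rel == "obj":
                if pos < head:
                    before += 1
                else:
                    after += 1
    return before, after

# Placeholder annotation of two short sentences (not real treebank data).
sentences = [
    [(1, 3, "nsubj"), (2, 3, "obj"), (3, 0, "root")],   # object before the verb
    [(1, 2, "nsubj"), (2, 0, "root"), (3, 2, "obj")],   # object after the verb
]
print(count_object_placement(sentences))   # (1, 1)
```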
This paper describes the first data-driven parser for Vedic Sanskrit, an ancient Indo-Aryan language in which a corpus of important religious and philosophical texts has been composed. We report and critically discuss experiments with the input feature representations, paying special attention to the performance of contextualized word embeddings and to the influence of morpho-syntactic representations on the parsing quality. In addition, we provide an in-depth discussion of the parsing errors that covers structural traits of the predicted trees as well as linguistic and extra-textual influence factors. In its optimal configuration, the proposed model achieves 87.61 unlabeled and 81.84 labeled attachment score on a held-out set of test sentences, demonstrating good performance for an under-resourced language.
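For readers unfamiliar with the attachment scores quoted above, the sketch below shows how unlabeled (UAS) and labeled (LAS) attachment scores are conventionally computed from gold and predicted head/label pairs; the example data are invented.

```python
def attachment_scores(gold, pred):
    """gold/pred: one (head_index, relation_label) pair per token."""
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return 100 * uas, 100 * las

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj")]
print([round(s, 2) for s in attachment_scores(gold, pred)])
# [100.0, 66.67]: all heads correct, one relation label wrong
```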
This document contains annotation decisions and examples for the Vedic Treebank (Hellwig et al. 2020). It is a revised and expanded version of the Guidelines v1 issued with the first release of the Vedic Treebank, and is conceived as a vademecum for future annotators as well as revisers, but also as an entry point and reference for whoever might want to consult our data. The brief overview of Universal Dependencies in Section 1 maps UD types to the most typical cases found in our data. Section 2 takes the opposite approach: starting from linguistic constructions found in our texts, it explains which UD structures we used for annotating them. Expressions of the form A ← obj ← B indicate that A depends on B, and that their syntactic relation is obj. In this version we introduce sublabels; these are all optional and primarily meant to help the annotators adapt to the usage of the main labels used in UD, cf. Section 1. For the sake of simplicity, sandhi phenomena in the examples are resolved.
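As a minimal illustration of the arrow notation described above, the sketch below encodes a dependency edge as a (dependent, relation, head) triple and renders it back in the guidelines' format; the tokens are placeholders, not actual treebank data.

```python
# One dependency edge in the guidelines' notation: dependent, relation, head.
edge = ("A", "obj", "B")          # placeholder tokens

def render(dep, rel, head):
    """Render an edge in the arrow notation used in the guidelines."""
    return f"{dep} \u2190 {rel} \u2190 {head}"

print(render(*edge))              # A ← obj ← B
```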
This paper introduces a latent variable model for ancient languages that aims at quantifying the influence that early authoritative works exert on their literary successors in terms of lexis. The model jointly estimates the amount of word reuse, based on uni- and bigrams of words, and the date of composition of each text. We apply the model to a corpus of pre-Renaissance Latin texts composed between the 3rd c. BCE and the 14th c. CE. Our evaluation focusses on the structures of word reuse detected by the model, its temporal predictions, and the quality of the inferred diachronic distributions of words; the last aspect is assessed using a newly designed task from the field of computational etymology.
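The latent variable model itself is beyond the scope of a short sketch, but the n-gram ingredient it builds on can be illustrated: under simplifying assumptions, the snippet below computes which proportion of a later text's uni- and bigrams also occur in an earlier authoritative work. The "later" text is invented, and the actual model estimates reuse jointly with composition dates rather than by simple overlap.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def reuse_ratio(source_tokens, later_tokens, n):
    """Share of the later text's n-grams that also appear in the source."""
    source = set(ngrams(source_tokens, n))
    later = Counter(ngrams(later_tokens, n))
    shared = sum(c for g, c in later.items() if g in source)
    return shared / sum(later.values())

source = "arma virumque cano troiae qui primus ab oris".split()
later = "arma virumque cano et alia multa cano".split()
print(reuse_ratio(source, later, 1), reuse_ratio(source, later, 2))
```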
The paper introduces a multi-level annotation of the Ṛgveda, a fundamental Sanskrit text composed in the 2nd millennium BCE that is important for South Asian and Indo-European linguistics as well as Cultural Studies. We describe the individual annotation levels, including phonetics, morphology, lexicon, and syntax, and show how these different levels of annotation are merged to create a novel annotated corpus of Vedic Sanskrit. Vedic Sanskrit is a complex, but computationally under-resourced language. Therefore, creating this resource required considerable domain adaptation of existing computational tools, which is discussed in this paper. Because parts of the annotations are selective, we propose a bi-directional LSTM based sequential model to supplement missing verb-argument links.
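The abstract mentions a bi-directional LSTM sequence model for supplementing verb-argument links. The PyTorch sketch below shows only the generic shape of such a BiLSTM sequence labeler; vocabulary size, label set and dimensions are arbitrary placeholders, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):           # (batch, seq_len)
        x = self.emb(token_ids)             # (batch, seq_len, emb_dim)
        h, _ = self.lstm(x)                 # (batch, seq_len, 2 * hidden)
        return self.out(h)                  # one label distribution per token

model = BiLSTMTagger(vocab_size=5000, num_labels=10)
logits = model(torch.randint(0, 5000, (2, 12)))
print(logits.shape)                         # torch.Size([2, 12, 10])
```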
The paper applies a deep recurrent neural network to the task of sentence boundary detection in Sanskrit, an important, yet under-resourced ancient Indian language. The deep learning approach improves the F scores set by a metrical baseline and by a Conditional Random Field classifier by more than 10%.
This paper introduces the first treebank of Vedic Sanskrit, a morphologically rich ancient Indian language that is of central importance for linguistic and historical research. The selection of the more than 3,700 sentences contained in this treebank reflects the development of metrical and prose texts over a period of 600 years. We discuss how these sentences are annotated in the Universal Dependencies scheme and which syntactic constructions required special attention. In addition, we describe a syntactic labeler based on neural networks that supports the initial annotation of the treebank, and whose evaluation can be helpful for setting up a full syntactic parser of Vedic Sanskrit.
The paper describes a novel approach that performs joint splitting of compounds and of Sandhis in Sanskrit texts. Sanskrit is a strongly compounding, morphologically and phonetically complex Indo-Aryan language. The interacting levels of its linguistic processes complicate the computer-based analysis of its corpus. The paper proposes an algorithm that is able to resolve Sanskrit compounds and “phonetically merged” (sandhied) words using only gold transcripts of correct string splits, but no further lexical or morphological resources. In this way, it may also prove to be useful for other Indo-Aryan languages, for which no or only limited digital resources are available.
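To make the training signal concrete: the toy function below derives character-level split labels from a parallel pair of raw and segmented text, assuming for simplicity that the raw string is a plain concatenation of the segments. Genuine sandhi alters sounds at the word joints, which is what makes the actual task, and the paper's algorithm, substantially harder.

```python
def split_labels(raw: str, segments: list[str]) -> list[int]:
    """Mark each character of `raw` with 1 if a segment boundary follows it.

    Simplifying assumption: `raw` equals the concatenation of `segments`;
    real sandhi changes characters at the joints and requires alignment.
    """
    assert "".join(segments) == raw
    labels = [0] * len(raw)
    pos = 0
    for seg in segments[:-1]:
        pos += len(seg)
        labels[pos - 1] = 1
    return labels

print(split_labels("devaputra", ["deva", "putra"]))
# [0, 0, 0, 1, 0, 0, 0, 0, 0] -> split after the fourth character
```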
This paper introduces and evaluates a Bayesian mixture model that is designed for dating texts based on the distributions of linguistic features. The model is applied to the corpus of Vedic Sanskrit, the historical structure of which is still unclear in many details. The evaluation concentrates on the interaction between time, genre and linguistic features, detecting those features whose distributions are clearly coupled with historical time. The evaluation also highlights the problems that arise when quantitative results need to be reconciled with philological insights.
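The full Bayesian mixture model is not reproduced here, but the underlying idea, that feature distributions differ across historical periods and can therefore be used for dating, can be sketched with a much simpler multinomial scoring of candidate periods. All counts, period labels and feature names below are invented.

```python
import math

# Invented feature counts per assumed period (rows) and feature (columns).
period_counts = {
    "early": {"injunctive": 40, "periphrastic_perfect": 2, "iti": 10},
    "late":  {"injunctive": 5,  "periphrastic_perfect": 30, "iti": 60},
}

def score(period, text_counts, alpha=1.0):
    """Log-probability of the text's feature counts under a smoothed
    multinomial estimated from the period's counts (add-alpha smoothing)."""
    counts = period_counts[period]
    total = sum(counts.values()) + alpha * len(counts)
    return sum(n * math.log((counts[f] + alpha) / total)
               for f, n in text_counts.items())

text = {"injunctive": 1, "periphrastic_perfect": 4, "iti": 7}
print(max(period_counts, key=lambda p: score(p, text)))   # 'late'
```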
The paper presents a method for WordNet supersense tagging of Sanskrit, an ancient Indian language with a corpus grown over four millennia. The proposed method merges lexical information from Sanskrit texts with lexicographic definitions from Sanskrit-English dictionaries, and compares the performance of two machine learning methods for this task. Evaluation concentrates on Vedic, the oldest layer of Sanskrit. This level of Sanskrit contains numerous rare words that are no longer used in the later language and whose word senses can, therefore, not be induced from their occurrences in other texts. The paper studies how to efficiently transfer knowledge from later forms of Sanskrit and from modern Western dictionaries for this special task of supersense disambiguation.
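One way to picture the role of English dictionary glosses in supersense tagging is the toy heuristic below: it maps the words of a gloss to WordNet supersenses (lexicographer file names) via NLTK and votes for the most frequent one. This is only an illustrative baseline that assumes NLTK's WordNet data is installed; it is not one of the machine learning methods the paper actually compares, and the gloss is invented.

```python
from collections import Counter
from nltk.corpus import wordnet as wn    # requires nltk.download('wordnet')

def supersense_from_gloss(gloss: str) -> str:
    """Vote over the supersenses of the first WordNet sense of each gloss word."""
    votes = Counter()
    for word in gloss.lower().split():
        synsets = wn.synsets(word)
        if synsets:
            votes[synsets[0].lexname()] += 1   # e.g. 'noun.animal'
    return votes.most_common(1)[0][0] if votes else "unknown"

# Invented dictionary-style gloss, purely for illustration.
print(supersense_from_gloss("cow or bull"))    # typically 'noun.animal'
```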
Old World: Journal of Ancient Africa and Eurasia, 2021
In this paper we introduce an extended version of the Vedic Treebank (VTB, Hellwig et al. 2020) which comes with revised and extended annotation guidelines. In order to assess the quality of our annotations as well as the usability and limits of the guidelines, we performed an inter-annotator agreement test. The results show that agreement between annotators is hampered by various factors, most prominently by an insufficient understanding of the content due to the cultural and temporal gap and by incomplete knowledge of Vedic grammar. An in-depth discussion of disagreeing annotations demonstrates that the setup of the workflow, too, has a major influence on inter-annotator agreement. We suggest some measures that can help increase transparency and annotation consistency, in line with current knowledge of the language, when annotating Vedic Sanskrit or ancient language varieties in general.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018
The paper introduces end-to-end neural network models that tokenize Sanskrit by jointly splitting compounds and resolving phonetic merges (Sandhi). Tokenization of Sanskrit depends on local phonetic and distant semantic features that are incorporated using convolutional and recurrent elements. Contrary to most previous systems, our models do not require feature engineering or external linguistic resources, but operate solely on parallel versions of raw and segmented text. The models discussed in this paper clearly improve over previous approaches to Sanskrit word segmentation. As they are language agnostic, we demonstrate that they also outperform the state of the art for the related task of German compound splitting.
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, 2022
Corpus-based studies of diachronic syntactic changes are typically guided by the results of previous qualitative research. When such results are missing or, as is the case for Vedic Sanskrit, are restricted to small parts of a transmitted corpus, an exploratory framework that detects such changes in a data-driven fashion can support the research process. In this paper, we introduce an infinite relational model (Kemp et al., 2006) that groups syntactic constituents based on their structural similarities and their diachronic distributions. We propose a simple way to control for register and intellectual affiliation, and discuss our findings for four syntactic structures in Vedic texts.
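As background for the model class mentioned above: infinite relational models place a nonparametric prior over an unbounded number of groups. The sketch below samples group assignments from the Chinese restaurant process underlying such priors; it shows only the prior, not the likelihood over syntactic relations that the actual model combines it with.

```python
import random

def crp_assign(n_items, alpha=1.0, seed=0):
    """Sample cluster assignments from a Chinese restaurant process prior:
    item i joins cluster k with probability proportional to its size,
    or opens a new cluster with probability proportional to alpha."""
    random.seed(seed)
    assignments, counts = [], []
    for _ in range(n_items):
        weights = counts + [alpha]
        r, acc, k = random.random() * sum(weights), 0.0, 0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):
            counts.append(1)        # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

print(crp_assign(10))
```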
This contribution investigates novel techniques for error detection in automatic semantic annotations, as an attempt to reconcile error-prone NLP processing with the high quality standards required for empirical research in the Digital Humanities. We demonstrate the state-of-the-art performance of semantic NLP systems on a corpus of ritual texts and report performance gains obtained using domain adaptation techniques. Our main contribution is to explore new techniques for annotation consistency control. The novelty of our approach lies in its attempt to leverage multi-level semantic annotations by defining interaction constraints between local word-level semantic annotations and global discourse-level annotations. These constraints are defined using Markov Logic Networks, a logical formalism for statistical relational inference that allows for violable constraints. We report first results.
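The notion of violable constraints can be made concrete with a deliberately simplified scorer: each soft constraint carries a weight, and an annotation accumulates the weights of the constraints it violates. Markov Logic Networks turn such weighted violations into probabilities and perform inference over them, which this toy example does not attempt; all labels and weights are invented.

```python
def violation_cost(annotation, soft_constraints):
    """Sum the weights of violated soft constraints; lower is more consistent."""
    return sum(weight for weight, holds in soft_constraints if not holds(annotation))

# Invented word-level and discourse-level labels for one text span.
annotation = {"word": "LIBATION", "discourse": "RECITATION"}

soft_constraints = [
    # A libation action should not occur inside a pure recitation passage.
    (2.0, lambda a: not (a["word"] == "LIBATION" and a["discourse"] == "RECITATION")),
    # Word-level labels should never be left unspecified.
    (0.5, lambda a: a["word"] != "UNSPECIFIED"),
]

print(violation_cost(annotation, soft_constraints))   # 2.0
```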
Deriving historical dates or datable stratifications for texts in Classical Sanskrit, such as the epics Mahābhārata and Rāmāyaṇa, is a considerable challenge for text-historical research. This paper provides empirical evidence for subtle but noticeable diachronic changes in the fundamental linguistic structures of Classical Sanskrit, and argues that Classical Sanskrit shows enough diachronic variation for dating texts on the basis of linguistic developments. Building on this evidence, it evaluates machine learning algorithms that predict approximate dates of composition for Sanskrit texts. The paper introduces the required background, discusses the relevance of linguistic features for temporal classification, and presents a text-historical evaluation of Book 6 of the Mahābhārata, whose historical stratification is disputed in Indological research.
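The abstract does not spell out the algorithms, but the overall setup, predicting an approximate composition date from frequencies of linguistic features, can be sketched as a simple regression. The feature values and dates below are fabricated solely to show the data flow; the paper evaluates its own set of machine learning methods.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Rows: texts; columns: relative frequencies of some linguistic features.
# All numbers are invented for illustration.
X = np.array([[0.12, 0.03, 0.40],
              [0.10, 0.05, 0.35],
              [0.06, 0.09, 0.22],
              [0.04, 0.11, 0.18]])
y = np.array([-300.0, -100.0, 100.0, 300.0])   # dates (negative = BCE)

model = Ridge(alpha=1.0).fit(X, y)
print(model.predict([[0.07, 0.08, 0.25]]))     # approximate date for a new text
```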
Disputed authorship and text transfer are notorious problems in the textual transmission of Sanskrit, especially for large anonymous texts such as the Mahābhārata. Stratification methods for such texts have so far mainly relied on manuscriptology, higher textual criticism, and scattered historical evidence. This paper introduces a quantitative method for text stratification that uses frequent linguistic features for inducing authorial layers in Sanskrit texts. The proposed method is tested with texts whose authorial composition is known, and then applied to the Bhīṣmaparvan of the Mahābhārata.
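Independent of the specific method proposed in the paper, the general idea of inducing authorial layers from frequent linguistic features can be illustrated by clustering text chunks on their feature frequencies. The chunk vectors below are invented, and k-means merely stands in for the (different) technique the paper actually uses.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row: frequencies of a few frequent linguistic features in one text chunk.
chunks = np.array([[0.30, 0.05, 0.10],
                   [0.28, 0.06, 0.12],
                   [0.10, 0.20, 0.02],
                   [0.09, 0.22, 0.03]])

layers = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(chunks)
print(layers)   # e.g. [0 0 1 1]: two candidate authorial layers
```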
The paper presents strategies for evaluating the influence of Pāṇini's Aṣṭādhyāyī on the vocabulary of Sanskrit. Using a corpus linguistic approach, it examines how the Pāṇinian sample words are distributed over post-Pāṇinian Sanskrit, and whether we can determine any lexicographic influence of the Aṣṭādhyāyī on later Sanskrit. The primary focus of the paper lies on data exploration, because the underlying corpus shows imbalances in the data distribution.
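A corpus-linguistic distribution study of the kind described above boils down to counting how often the sample words occur in texts of different periods. The toy snippet below shows that bookkeeping with entirely invented tokens and period labels.

```python
from collections import Counter

# Invented corpus slices and sample words, purely to illustrate the bookkeeping.
corpus = {
    "early_post_paninian": "gargya kumbha deva gargya soma".split(),
    "late_post_paninian":  "deva soma soma kumbha deva deva".split(),
}
sample_words = {"gargya", "kumbha"}

for period, tokens in corpus.items():
    counts = Counter(tokens)
    hits = sum(counts[w] for w in sample_words)
    print(period, round(hits / len(tokens), 3))   # relative frequency per period
```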
Studien zu einem quantitativen Kommunikationsmodell für transkulturelle Forschungsfragen am Beispiel englischer Reiseberichte aus Indien