Papers by Gerold Schneider
The Common European Framework of Reference for Languages (CEFR) defines six levels of learner proficiency and links them to particular communicative abilities. The CEFRLex project aims at compiling lexical resources that link single words and multi-word expressions to particular CEFR levels. The resources are intended to reflect second language learner needs, as they are compiled from CEFR-graded textbooks and other learner-directed texts. In this work, we investigate the applicability of CEFRLex resources for building language learning applications. Our main concerns were that vocabulary in language learning materials might be sparse, i.e. that not all vocabulary items that belong to a particular level would also occur in materials for that level, and, on the other hand, that vocabulary items might be used in lower-level materials if required by the topic (e.g. with a simpler paraphrase or translation). Our results indicate that the English CEFRLex resource is in accordance with the external resources that we jointly employ as a gold standard. Together with other values obtained from monolingual and parallel corpora, we can indicate which entries need to be adjusted to obtain values that are even more in line with this gold standard. We expect that this finding also holds for the other languages.
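A minimal sketch of the coverage check discussed above, assuming hypothetical tab-separated inputs: a CEFRLex-style word list mapping lemmas to CEFR levels, and one plain-text file of learning materials per level. File names, formats, and the tokenizer are illustrative placeholders, not the actual CEFRLex distributions.

```python
# Check, per CEFR level, how many listed vocabulary items occur in materials
# graded at that level, and how many already appear in lower-level materials.
import re

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def load_wordlist(path):                     # "lemma<TAB>level" per line (assumed format)
    lemma_level = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            lemma, level = line.rstrip("\n").split("\t")[:2]
            lemma_level[lemma.lower()] = level
    return lemma_level

def load_material_vocab(path):               # one graded plain-text file per level
    with open(path, encoding="utf-8") as f:
        return set(re.findall(r"[a-z]+", f.read().lower()))

lemma_level = load_wordlist("cefrlex_en.tsv")                       # hypothetical path
materials = {lv: load_material_vocab(f"materials_{lv}.txt") for lv in LEVELS}

for lv in LEVELS:
    listed = {w for w, l in lemma_level.items() if l == lv}
    attested = listed & materials[lv]
    below = {w for w in listed
             if any(w in materials[lower] for lower in LEVELS[:LEVELS.index(lv)])}
    print(f"{lv}: {len(attested)}/{len(listed)} listed items occur at this level; "
          f"{len(below)} already occur in lower-level materials")
```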
We use n-gram language models to investigate how far language approximates an optimal code for human communication in terms of Information Theory [1], and what differences there are between learner proficiency levels. Although the language of lower-level learners is simpler, it is less optimal in terms of information theory and, as a consequence, more difficult to process.
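To illustrate the kind of measure involved, the sketch below estimates a bigram model on a reference text and computes per-word cross-entropy (average surprisal, in bits) for a learner text. Add-one smoothing, the tokenizer, and the toy data are simplifying assumptions, not the models used in the study.

```python
# Per-word cross-entropy of a text under an add-one-smoothed bigram model.
# Lower values indicate a more predictable (more "optimal") encoding.
import math
import re
from collections import Counter

def tokens(text):
    return ["<s>"] + re.findall(r"\w+", text.lower()) + ["</s>"]

def train_bigram(text):
    toks = tokens(text)
    return Counter(zip(toks, toks[1:])), Counter(toks), len(set(toks))

def cross_entropy(text, bigrams, unigrams, vocab_size):
    toks = tokens(text)
    logprob = 0.0
    for prev, cur in zip(toks, toks[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        logprob += math.log2(p)
    return -logprob / (len(toks) - 1)        # bits per token

reference = "the cat sat on the mat . the dog sat on the rug ."
learner   = "the cat sit on mat ."
bi, uni, v = train_bigram(reference)
print(f"{cross_entropy(learner, bi, uni, v):.2f} bits per token")
```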
BMC Bioinformatics, Oct 3, 2011
Background: This article describes the approaches taken by the OntoGene group at the University of Zurich in dealing with two tasks of the BioCreative III competition: classification of articles which contain curatable protein-protein interactions (PPI-ACT) and extraction of experimental methods (PPI-IMT). Results: Two main achievements are described in this paper: (a) a system for document classification which crucially relies on the results of an advanced pipeline of natural language processing tools; (b) a system which is capable of detecting all experimental methods mentioned in scientific literature and listing them with a competitive ranking (AUC iP/R > 0.5). Conclusions: The results of the BioCreative III shared evaluation clearly demonstrate that significant progress has been achieved in the domain of biomedical text mining in the past few years. Our own contribution, together with the results of other participants, provides evidence that natural language processing techniques have by now become an integral part of advanced text mining approaches.
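Purely as a point of reference for the classification task (ranking articles by whether they contain curatable interactions), the sketch below shows a generic bag-of-words baseline with scikit-learn. It is not the OntoGene system, which relies on a full NLP pipeline; the toy data and feature choices are illustrative assumptions only.

```python
# Generic TF-IDF + logistic-regression baseline for abstract classification.
# NOT the OntoGene pipeline; it only illustrates the PPI-ACT task setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "We show that protein A binds protein B in a yeast two-hybrid assay.",
    "The weather station recorded unusually high temperatures in July.",
]
train_labels = [1, 0]                        # 1 = contains curatable PPI

vec = TfidfVectorizer(ngram_range=(1, 2))
clf = LogisticRegression()
clf.fit(vec.fit_transform(train_texts), train_labels)

test = ["Co-immunoprecipitation confirmed the interaction of X with Y."]
print(clf.predict_proba(vec.transform(test))[0, 1])   # ranking score
```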
Transactions of the Philological Society, Nov 1, 2022
Meeting of the Association for Computational Linguistics, Aug 9, 2013
We describe a biological event detection method implemented for the Genia Event Extraction task of BioNLP 2013. The method relies on syntactic dependency relations provided by a general NLP pipeline, supported by statistics derived from Maximum Entropy models for candidate trigger words, for potential arguments, and for argument frames.
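Maximum Entropy classification corresponds to multinomial logistic regression; the sketch below trains such a model for candidate trigger words from toy token features. The feature template, event types, and training examples are assumptions made for illustration, not the ones used in the submitted system.

```python
# Maximum Entropy (multinomial logistic regression) trigger-word classifier.
# Features and training examples are illustrative placeholders.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(token, prev_token, pos):
    return {"word": token.lower(), "prev": prev_token.lower(),
            "pos": pos, "suffix3": token[-3:].lower()}

X_dicts = [
    features("expression", "the", "NN"),
    features("binds", "protein", "VBZ"),
    features("mice", "in", "NNS"),
]
y = ["Gene_expression", "Binding", "None"]    # event types as classes

vec = DictVectorizer()
maxent = LogisticRegression(max_iter=1000)
maxent.fit(vec.fit_transform(X_dicts), y)

cand = features("phosphorylation", "the", "NN")
print(maxent.predict(vec.transform([cand]))[0])
```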
We present an approach towards the automatic detection of names of proteins, genes, species, etc. in biomedical literature and their grounding to widely accepted identifiers. The annotation is based on a large term list that contains the common expressions of the terms, a normalization step that matches the terms with their actual representation in the texts, and a disambiguation step that resolves the ambiguity of matched terms. We describe various characteristics of the terms found in existing term resources and of the terms that are used in biomedical texts. We evaluate our results against a corpus of manually annotated protein mentions and achieve a precision of 57% and a recall of 72%.
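A minimal sketch of the normalization-and-lookup idea, followed by the precision/recall computation used in the evaluation. The normalization rules, identifiers, and gold annotations below are simplified placeholders, not the actual term resources or corpus.

```python
# Normalize term-list entries and text tokens in the same way, look up
# matches, and score against gold annotations with precision and recall.
import re

def normalize(term):                          # simplified normalization step
    return re.sub(r"[\s\-_/]+", "", term.lower())

term_list = {"IL-2": "UniProt:P60568", "interleukin 2": "UniProt:P60568",
             "p53": "UniProt:P04637"}          # placeholder identifiers
index = {normalize(t): ident for t, ident in term_list.items()}

text = "We measured IL2 and P53 levels after stimulation."
predicted = {(tok, index[normalize(tok)])
             for tok in re.findall(r"[\w\-]+", text) if normalize(tok) in index}

gold = {("IL2", "UniProt:P60568"), ("P53", "UniProt:P04637")}
tp = len(predicted & gold)
precision = tp / len(predicted) if predicted else 0.0
recall = tp / len(gold) if gold else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")
```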
HAL (Le Centre pour la Communication Scientifique Directe), Jun 21, 2019
This study uses data-driven methods to detect and interpret differences between the High German used as the standard language of written communication in Switzerland and German High German. The comparison is based on a comparable web corpus of two million sentences, one million from Switzerland and one million from Germany. We describe differences at the levels of lexis, morphosyntax, and syntax, and compare them to previously described differences. We show that data-driven methods manage to detect a wide range of differences.
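One standard data-driven way to surface such differences at the level of lexis is keyword extraction with a log-likelihood ratio over two word-frequency lists; the sketch below shows the computation. The toy sentences merely stand in for the one-million-sentence Swiss and German web corpora.

```python
# Log-likelihood keyword extraction: which words are over-represented
# in corpus A (e.g. Swiss Standard German) relative to corpus B?
import math
import re
from collections import Counter

def counts(text):
    return Counter(re.findall(r"[a-zäöüß]+", text.lower()))

def log_likelihood(a, b, total_a, total_b):
    # expected frequencies under the null hypothesis of no difference
    e_a = total_a * (a + b) / (total_a + total_b)
    e_b = total_b * (a + b) / (total_a + total_b)
    ll = 0.0
    if a > 0: ll += 2 * a * math.log(a / e_a)
    if b > 0: ll += 2 * b * math.log(b / e_b)
    return ll

swiss  = counts("das velo steht vor dem haus das tram kommt gleich")
german = counts("das fahrrad steht vor dem haus die bahn kommt gleich")
na, nb = sum(swiss.values()), sum(german.values())

keywords = sorted(set(swiss) | set(german),
                  key=lambda w: -log_likelihood(swiss[w], german[w], na, nb))
for w in keywords[:5]:
    print(w, swiss[w], german[w])
```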
And, of course, different languages are useful for different purposes. Don't use R instead of the languages you already know, but in addition to them. There are, for example, APIs from Perl, Prolog and Python to R which allow you to integrate the different languages with each other. Unlike other statistics programs, R is a free, open-source language, and it has a very active community committed to programming extensions to R. These extensions are known as libraries, and usually, if you have a clear idea of what you want to do in R but can't find a predefined function to do it, you will find that someone has programmed a library which serves your need. We are grateful to guyjantic on Imgur for visualizing the allure of R over other programs.
For social scientists, it is increasingly important to explore large text collections without time-consuming human intervention. We present a language technology tool kit that allows researchers of the NCCR Democracy Module 1 to extract information on various forms of governance from a comprehensive multilingual corpus. The tool kit, called SIFT, allows searching for governance entities and measuring their salience, tonality, issue context and media frames. In substantive terms, our tool pipeline enables scholars of governance to extend their research focus to the previously neglected area of the public communication of democratic legitimacy and accountability of various forms of governance.
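A rough sketch of how salience and tonality can be measured for a governance entity in a document collection. The entity, sentiment lexicon, and toy articles are illustrative assumptions, not the actual SIFT components or resources.

```python
# Salience: share of articles mentioning the entity.
# Tonality: average sentiment-lexicon score of articles mentioning it.
import re

articles = [
    "The WTO was praised for its transparent dispute settlement.",
    "Critics attacked the WTO over a lack of accountability.",
    "The city council debated a new tram line.",
]
sentiment = {"praised": 1, "transparent": 1, "attacked": -1, "lack": -1}
entity = "WTO"

mentioning = [a for a in articles if re.search(rf"\b{entity}\b", a)]
salience = len(mentioning) / len(articles)
scores = [sum(sentiment.get(w, 0) for w in re.findall(r"\w+", a.lower()))
          for a in mentioning]
tonality = sum(scores) / len(scores) if scores else 0.0
print(f"salience={salience:.2f} tonality={tonality:+.2f}")
```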
The investigation of specific features of Irish English has a long tradition. Yet, with the arrival of large corpora and corpus tools, new avenues of research have opened up for the discipline. The present paper investigates features commonly ascribed to Irish English on the basis of the ICE Ireland corpus in comparison with ICE corpora representing other varieties of English. We use several corpus tools to access the ICE corpora: first, an offline concordance program, AntConc V 3.3 (Anthony 2004); second, Corpus Navigator, an online corpus query tool allowing researchers to query regular expressions on the surface texts; third, we are in the process of writing a version of Dependency Bank (Lehmann and Schneider 2012) which contains a selection of ICE corpora and which will be called ICE online. This research methodology allows us to reassess how specific to Irish English the investigated features are in comparison with other international varieties of English, and it illustrates that even simple corpus-based search patterns can produce powerful results.
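To illustrate the kind of surface search pattern meant here, the sketch below runs a regular-expression concordance for the after-perfect ("be after V-ing"), a feature commonly ascribed to Irish English, over plain-text corpus files. The directory layout and the exact pattern are illustrative assumptions, not the queries used in the paper.

```python
# Simple regex concordance: find "be after V-ing" (the after-perfect)
# in plain-text corpus files and print matches with some context.
import glob
import re

PATTERN = re.compile(r"\b(?:am|is|are|was|were|'m|'s|'re)\s+after\s+\w+ing\b",
                     re.IGNORECASE)

for path in glob.glob("ICE-IRELAND/*.txt"):           # hypothetical layout
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for m in PATTERN.finditer(text):
        left = text[max(0, m.start() - 40):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + 40].replace("\n", " ")
        print(f"{path}: ...{left}[{m.group(0)}]{right}...")
```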
The detection of mentions of protein-protein interactions in the scientific literature has recently emerged as a core task in biomedical text mining. We present effective techniques for this task, which have been developed using the IntAct database as a gold standard, and which have been evaluated in two text mining competitions.
In this paper, we present a declarative formalism for writing rule sets to convert constituent trees into dependency graphs. The formalism is designed to be independent of the annotation scheme and provides a highly task-related syntax, abstracting away from the underlying graph data structures. We have implemented the formalism in our search tool and used a preliminary version to create a rule set that converts more than 97% of the TIGER corpus.
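The formalism itself is declarative; to make the underlying idea concrete, here is a procedural sketch in Python of head-rule-driven conversion: a head child is chosen per constituent, and every non-head child's lexical head is attached to the head child's lexical head. The bracketed input format and the head rules are simplified assumptions, unrelated to the actual rule syntax of the formalism.

```python
# Head-rule-driven conversion of a bracketed constituent tree into
# dependencies: choose a head child per node, attach the other children's
# lexical heads to it. Tree format and head rules are simplified.
import re

HEAD_RULES = {"S": ["VP", "NP"], "VP": ["VBZ", "VBD", "VB", "VP"],
              "NP": ["NN", "NNS", "NP"], "PP": ["IN"]}

def parse(tokens):
    # tokens of a bracketed tree -> (label, children) or (pos, word)
    assert tokens.pop(0) == "("
    label = tokens.pop(0)
    if tokens[0] != "(":                       # preterminal: (POS word)
        word = tokens.pop(0)
        tokens.pop(0)                          # closing ")"
        return (label, word)
    children = []
    while tokens[0] == "(":
        children.append(parse(tokens))
    tokens.pop(0)                              # closing ")"
    return (label, children)

def head_of(label, children):
    for cand in HEAD_RULES.get(label, []):
        for child in children:
            if child[0] == cand:
                return child
    return children[0]                         # fallback: leftmost child

def convert(node, deps):
    label, rest = node
    if isinstance(rest, str):                  # preterminal: return its word
        return rest
    head_child = head_of(label, rest)
    head_word = convert(head_child, deps)
    for child in rest:
        if child is not head_child:
            deps.append((head_word, convert(child, deps)))
    return head_word

tree = "(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))"
tokens = re.findall(r"\(|\)|[^\s()]+", tree)
deps = []
root = convert(parse(tokens), deps)
print(root, deps)       # sleeps [('cat', 'the'), ('sleeps', 'cat')]
```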
Background: The BioCreative series of competitive evaluations of text mining systems provides a major test bed for novel techniques in biomedical text mining. Results from the previous and current competitions are of fundamental importance for further development in the area. Results: The OntoGene group participated in all tasks of the current edition. Preliminary results seem satisfactory; however, a detailed analysis cannot be performed without a comparison with the results of the other participants.
A major reason why LFG employs c-structure is that it is context-free. According to Tree-Adjoining Grammar (TAG), the only context-sensitive operation that is needed to express natural language is Adjoining, from which LFG functional uncertainty has been shown to follow. Functional uncertainty, which is expressed on the level of f-structure, would then be the only extension needed to an otherwise context-free processing of natural language. We suggest that if f-structures can be derived context-freely, full-fledged c-structures are not strictly needed in LFG, and that chunks and dependencies may be sufficient for a formal grammar theory. In order to substantiate this claim, we combine a model that projects f-structures from chunks with statistical techniques and present a parser that outputs LFG f-structure-like representations. The parser is representationally minimal, deep-linguistic, robust, and fast, and has been evaluated and applied. The parser addresses context-sensitive constructions by treating the vast majority of long-distance dependencies by approximation with finite-state patterns, by post-processing, and by LFG functional uncertainty.
Syntactic alternations like the dative shift are well researched. But most decisions which speakers take are more complex than binary choices. Multifactorial lexicogrammatical approaches and a large inventory of syntactic patterns are needed to supplement current approaches. We use the term semantic alternation for the many ways in which a relation between entities, conveying broadly the same meaning, can be expressed. We use a well-resourced domain, biomedical research texts, for a corpus-driven approach. As entities we use proteins, and as relations we use interactions between them, using Text Mining training data. We discuss three approaches: first, manually designed syntactic patterns; second, a corpus-based semi-automatic approach; and third, a machine-learning language model. The machine-learning approach learns the probability that a syntactic configuration expresses a relevant interaction from an annotated corpus. The inventory of configurations defines the envelope of variation and its multitude of forms.
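The third, machine-learning approach can be illustrated with a simple relative-frequency estimate: from an annotated corpus, count how often each syntactic configuration (abbreviated here as a dependency-path string between two protein mentions) links proteins that truly interact, and smooth the estimate. The path notation and counts below are toy assumptions, not the actual model or data.

```python
# Estimate P(interaction | syntactic configuration) by relative frequency
# with add-one smoothing over toy annotated examples.
from collections import Counter

# (dependency path between the two protein mentions, does it express an interaction?)
annotated = [
    ("PROT1 <nsubj interacts prep_with> PROT2", True),
    ("PROT1 <nsubj interacts prep_with> PROT2", True),
    ("PROT1 <nsubj binds dobj> PROT2",          True),
    ("PROT1 <conj_and PROT2",                   False),
    ("PROT1 <nsubj binds dobj> PROT2",          False),
]

pos = Counter(path for path, label in annotated if label)
total = Counter(path for path, _ in annotated)

def p_interaction(path):
    return (pos[path] + 1) / (total[path] + 2)      # add-one smoothing

for path in sorted(total, key=p_interaction, reverse=True):
    print(f"{p_interaction(path):.2f}  {path}")
```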