Papers by Heike Zinsmeister
KONVENS, 2019
In this paper, we present the HUIU system (a collaboration of University of Hamburg and Indiana University) for the GermEval 2019 shared task 2, subtask 1: the coarse-grained classification of tweets into the classes OFFENSE or OTHER. Our system uses linear SVMs with character n-grams (5 ≤ n ≤ 10), POS n-grams (3 ≤ n ≤ 9) and the tweet's length in number of tokens as features. We obtain a macro-averaged F-score of 65.32 on the test data.
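The character n-gram and tweet-length features named in this abstract can be illustrated with a minimal sketch. It is not the HUIU implementation: the function names, the feature-key format, and the whitespace tokenization are assumptions made for illustration, and the POS n-grams and the SVM itself are left out because they would need a tagger and a learning library.

```python
from collections import Counter


def char_ngrams(text, n_min=5, n_max=10):
    """Count all character n-grams of text for n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            # Prefix the key with n so n-grams of different lengths never collide.
            counts[f"char{n}:{text[start:start + n]}"] += 1
    return counts


def tweet_features(tweet):
    """Combine character n-gram counts with the tweet length in tokens.

    Assumption: tokens are whitespace-separated; the paper's tokenizer
    is not specified in the abstract.
    """
    features = char_ngrams(tweet)
    features["length_in_tokens"] = len(tweet.split())
    return features


if __name__ == "__main__":
    for name, value in sorted(tweet_features("hello world").items()):
        print(name, value)
```

A feature dictionary of this kind could then be vectorized and passed to any linear classifier.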
KONVENS, 2018
In this paper, we present experiments on POS tagging historical texts that contain spelling variation. The experiments are conducted in a low-resource scenario with a small amount of training data (here: 12,000 tokens). We investigate different ways of dealing with spelling variation in such a situation on different variants of historical German. Firstly, we add character n-grams as features to the tagger to enable it to learn spelling variation. Our tagging experiments show that this improves accuracy when there is enough variation in the data, but leads to a decrease in accuracy if the amount of variation is low. Secondly, we preprocess the data before training and applying the tagger, reducing spelling variation by normalization, rule-based simplification and substitution of spelling variants. All three methods improve tagging accuracy to a comparable degree. Since normalization has the drawback of requiring additional resources, we recommend rule-based simplification and substitution of spelling variants for low-resource settings. Finally, we evaluate the utility of additional unlabeled data for creating word embeddings and of external resources, which we use to further improve tagging accuracy.
ISBN, 2018
This paper deals with a particular form of anaphora in which the anaphors refer to non-nominal antecedents. We investigate two existing datasets, annotated with pronominal and nominal anaphors (shell nouns) respectively, and attempt to determine to what degree the different types of anaphors provide useful hints as to the form and location of their antecedents. To this end, we look at the distribution of the antecedents, their syntactic form, and their semantic content. In particular, as the difficulty of annotating the phenomenon constitutes a major hurdle to the development of larger datasets, we take a close look at the agreement between annotators and relate this to the different types of anaphors.
A solid grounding in linguistics is part of any teacher-training degree in German. „Sprachwissenschaft für das Lehramt“ explains everything that student teachers need to know about phonology, morphology, syntax, semantics, pragmatics, language acquisition, German as a second language (DaZ), orthography and language history. The book thus provides a foundation for didactic decisions and, in turn, for good German teaching at all school levels.
This paper investigates the contribution of the German Vorfeld to local coherence. We report on the annotation of a corpus of parliament debates with a small set of coarse-grained labels, marking the functions of the Vorfeld constituents. The labels encode referential and discourse relations as well as non-relational functions. We achieve interannotator agreement of κ = 0.66. Based on the annotations, we investigate different features and feature correlations that could be of use for automatic text processing. Finally, we perform an experiment, consisting of an insertion task, to assess the individual impact of different types of Vorfeld on local coherence.
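How an agreement value such as the κ reported in this abstract is obtained can be shown with a short sketch. The abstract does not say which kappa variant was used, so the two-annotator Cohen's kappa below is an assumption, and the function name and label values are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels for four Vorfeld constituents.
    annotator_1 = ["referential", "referential", "discourse", "discourse"]
    annotator_2 = ["referential", "discourse", "discourse", "discourse"]
    print(cohen_kappa(annotator_1, annotator_2))
```

In the toy data the annotators agree on three of four items (observed agreement 0.75) while chance agreement is 0.5, which gives a kappa of 0.5.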
... Keim etal.pdf (2.820Mb). When more is less: Non-native perception of level tone contrasts. Chiao_more is less ... Timing of second language singletons and geminates. Kabak_Timing.pdf (253.1Kb). Kabak, Baris; Reckziegel, Tanja; Braun, Bettina (2011), conference publication ...
Zeitschrift für germanistische Linguistik, 2015
Linguistic annotation helps corpus users to retrieve relevant examples in an efficient way. It also supports the identification of latent patterns in the data by encoding generalizations such as parts of speech. Since manual annotation is very time-consuming, many projects use off-the-shelf tools for annotating or at least preprocessing their data. This article discusses pros and cons of such automatic annotation using part-of-speech tagging as an example case. It argues that errors made by annotation tools are systematic in nature and hence predictable to a certain extent. In addition, the article addresses the issue of descriptive adequacy of tagsets. In particular, it discusses how well the Stuttgart-Tübingen Tagset (STTS) describes German parts of speech. Finally, the article briefly addresses normalization, an additional preprocessing step that is sometimes required before automatic annotation tools can be applied.
In this paper, we report ongoing work on HyperGramm, a Linked Open Data set of German grammar terms. HyperGramm is based on a print-oriented, manually created resource containing different types of internal and external linking relations that are either explicitly marked by formatting or only implicitly encoded in the language. The initial aim of the HyperGramm resource was the on-line visualization of the terms. However, because this resource could be used in a variety of other scenarios, both for research and learning purposes, it is desirable for the representation to capture as much information as possible about the internal structure of the original resource. We first motivate the data’s conversion into an intermediate, well-defined XML presentation, which serves as the basis for the RDF modeling. Subsequently, we detail the RDF model and demonstrate how it allows us to encode the internal structure and the linking mechanisms in an explicit and interoperable fashion. In addition, we discuss the possible integration of HyperGramm into the LOD Cloud.
Interfaces of Morphology, 2013
LeiKo is a comparable corpus of German easy-to-read news texts. This freely available resource is systematically compiled and linguistically annotated for linguistic and computational linguistic research. LeiKo consists of 216 news and newspaper texts (approx. 56,600 tokens) and their metadata, structured in four subcorpora according to the websites they were published on. All texts are tokenized, lemmatized, part-of-speech tagged and dependency parsed and can be queried in ANNIS (Krause/Zeldes 2016). A core corpus of 40 texts is manually corrected. Version 0.9 contains only the core corpus with lemma and POS annotations and can be queried here: https://corpora.uni-hamburg.de/hzsk/de/hzsk_access/annis/leiko Version 1.0 comprises all 216 texts and includes not only lemma and POS annotations but also syntactic annotations and metadata. The corpus is provided in the ANNIS format, which can be directly imported into ANNIS Kickstarter. Version 1.1 is ide...
This corpus was created as companion material to the following textbook: Andresen, Melanie and Heike Zinsmeister (2019): Korpuslinguistik (narr Starter). Narr Francke Attempto: Tübingen. This upload comprises the following files: Foodblog-Korpus.zip: a zip archive with 150 articles from German-language food blogs. foodblogs_all.txt: a concatenated text file of all 150 articles (used in Chapter 6.1 of the textbook for the upload to WebLicht). Foodblog-Korpus_Metadaten.csv: tabular metadata on the texts contained in the corpus, in CSV format. Foodblog-Korpus_Metadaten.xls: tabular metadata on the texts contained in the corpus, in Excel format. Textteile.xslx: export of the manual annotation of text parts from CATMA, described in Section 6.2 of the textbook.
This study deals with the question of what strategies Chinese L2 learners of German follow when starting a declarative sentence in German. The investigation is based on the ALeSKo corpus, a linguistically annotated learner corpus of written German. In previous studies, we observed that the L2 texts show a significant overuse of sentences that start with an information-structural function in comparison to comparable L1 texts. In this paper, we pursue an alternative line of explanation that explores whether the observed difference is due to an overuse of chunks in the L2 texts. We perform a chunk classification and also automatically detect all material copied from the title and the task description - a particular type of chunk. Our findings indicate that although L2 learners use chunks to a substantial degree, an overuse with respect to the beginnings of the sentences could not be confirmed.
Zeitschrift für Sprachwissenschaft, 2007
Deutsch als Fremdsprache, 2008
Handbücher zur Sprach- und Kommunikationswissenschaft / Handbooks of Linguistics and Communication Science (HSK) 42/3