Edited research books by Katharina Ehret
http://ewave-atlas.org, 2020
Papers by Katharina Ehret
Linguistics Vanguard, Mar 25, 2024
Against the backdrop of the sociolinguistic-typological complexity debate, which revolves around measuring, comparing and explaining language complexity, this article investigates how Kolmogorov-based information-theoretic complexity relates to linguistic structures. Specifically, the linguistic structure of text which has been compressed with the text compression algorithm gzip is analysed. One implementation of Kolmogorov-based language complexity is the compression technique (Ehret, Katharina. 2021. An information-theoretic view on language complexity and register variation: Compressing naturalistic corpus data. Corpus Linguistics and Linguistic Theory (2). 383-410), which employs gzip to measure language complexity in naturalistic text samples. In order to determine what type of structures compression algorithms like gzip capture, and how these compressed strings relate to linguistically meaningful structures, gzip's lexicon output is retrieved and subjected to an in-depth analysis. As a case study, the compression technique is applied to the English version of Lewis Carroll's Alice's Adventures in Wonderland and its lexicon output is extracted. The results show that gzip-like algorithms sometimes capture linguistically meaningful structures which coincide, for instance, with lexical words or suffixes. However, many compressed sequences are linguistically unintelligible or simply do not coincide with any linguistically meaningful structures. Compression algorithms like gzip thus crucially capture purely formal structural regularities. As a consequence, information-theoretic complexity, in this context, is a linguistically agnostic, purely structural measure of regularity and redundancy in texts.
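The "lexicon output" discussed here consists of the back-references that a gzip-like (LZ77-based) compressor records for repeated substrings. As a rough illustration of the kind of sequences involved (not gzip's actual DEFLATE implementation, which uses hash chains, a 32 KB window and Huffman coding), here is a minimal Python sketch; the function name and parameters are illustrative assumptions:

```python
def lz77_matches(text, window=4096, min_len=3):
    """Collect (position, offset, substring) back-references that a simple
    LZ77-style parser would emit; unmatched characters are treated as
    literals and skipped in the output."""
    matches = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_start = 0, -1
        for j in range(max(0, i - window), i):
            k = 0
            # extend the candidate match, staying within already-seen text
            while i + k < n and j + k < i and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_start = k, j
        if best_len >= min_len:
            matches.append((i, i - best_start, text[i:i + best_len]))
            i += best_len
        else:
            i += 1  # literal character, move on
    return matches

sample = ("alice was beginning to get very tired of sitting by her sister "
          "on the bank, and of having nothing to do: once or twice she had "
          "peeped into the book her sister was reading")
for pos, offset, chunk in lz77_matches(sample):
    print(pos, offset, repr(chunk))
```

Running this on a longer text yields exactly the mix described above: some back-references align with words or suffixes, many cut across word boundaries.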
Cite the source dataset as Kortmann, Bernd & Lunkenheimer, Kerstin (eds.) 2013. The Electronic World Atlas of Varieties of English. Jena: Max Planck Institute for the Science of Human History. (Available online at https://ewave-atlas.org)
Lectometry – a methodology that explores how various language-external dimensions shape language usage in an aggregate perspective – is underused in English-language corpus linguistics. Against this backdrop, the paper utilizes state-of-the-art lectometric analysis techniques to investigate lexical variability in written Standard English, as sampled in the well-known Brown family of corpora. We employ the following five-step procedure: (1) draw on large corpora (the British National Corpus, the American National Corpus, and the Blog Authorship Attribution Corpus) and Semantic Vector Space modeling to obtain an unbiased set of n = 303 lexical variables in a bottom-up and semi-automatic fashion; (2) determine the frequency distribution of lexical variant forms in the Brown corpora; (3) rely on the Profile-based Distance Metric to transform the distributi...
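The Profile-based Distance Metric compares, per concept, the relative frequencies with which competing lexical variants are used in two lects. The following is a hedged sketch of the general idea; the exact weighting and aggregation in the paper may differ, and the concept and counts below are invented for illustration:

```python
def profile_distance(counts_a, counts_b):
    """City-block distance between two relative-frequency profiles of the
    variants of one concept; 0 means identical usage, 1 means completely
    disjoint variant choices."""
    variants = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) or 1
    total_b = sum(counts_b.values()) or 1
    return 0.5 * sum(abs(counts_a.get(v, 0) / total_a -
                         counts_b.get(v, 0) / total_b)
                     for v in variants)

# invented example counts for the concept SOFA in two lects
british = {"sofa": 60, "settee": 30, "couch": 10}
american = {"sofa": 40, "settee": 1, "couch": 59}
print(profile_distance(british, american))  # ~0.49
```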
Linguistics Vanguard
This special issue focuses on measuring language complexity. The contributions address methodological challenges, discuss implications for theoretical research, and use complexity measurements for testing theoretical claims. In this introductory article, we explain what knowledge can be gained from quantifying complexity. We then describe a workshop and a shared task which were our attempt to develop a systematic approach to the challenge of finding appropriate and valid measures, and which inspired this special issue. We summarize the contributions focusing on the findings which can be related to the most prominent debates in linguistic complexity research.
Discourse Studies, 2020
This paper brings together cutting-edge, quantitative corpus methodologies and discourse analysis to explore the relationship between text complexity and subjectivity as descriptive features of opinionated language. We are specifically interested in how text complexity and markers of subjectivity and argumentation interact in opinionated discourse. Our contributions include the marriage of quantitative approaches to text complexity with corpus linguistic methods for the study of subjectivity, in addition to large-scale analyses of evaluative discourse. As our corpus, we use the Simon Fraser Opinion and Comments Corpus (SOCC), which comprises approximately 10,000 opinion articles and the corresponding reader comments from the Canadian online newspaper The Globe and Mail, as well as a parallel corpus of hard news articles also sampled from The Globe and Mail. Methodologically, we combine conditional inference trees with the analysis of random forests, an ensemble learning technique, t...
This repository comprises the data and scripts for conducting a multi-dimensional analysis of online news comments and other web registers, as well as comprehensive statistical material, as described in Ehret, Katharina, and Maite Taboada. (accepted). "Characterising online news comments: a multi-dimensional cruise through online registers". Frontiers in Artificial Intelligence.
This study contributes to the typological-sociolinguistic complexity debate that was triggered by challenges (Kusters 2003; McWhorter 2001) to the assumption that all languages are, on the whole, equally complex (e.g. Hockett 1958). A substantial body of research now suggests that languages and language varieties can and do differ in their complexity (e.g. Koplenig et al. 2017; Siegel et al. 2014; Kortmann and Szmrecsanyi 2012). However, most of this research applies complexity metrics that rely either on subjective or on empirically expensive means of measuring complexity. Against this backdrop, I explore the use and applicability of Kolmogorov complexity as a complexity metric in naturalistic corpora. It can be conveniently approximated with compression algorithms and measures the information content, or complexity, of texts in terms of the predictability of new text passages on the basis of previously seen text passages. Basically, texts which can be compressed more efficiently ar...
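The compression-based approximation rests on the idea that a text whose later passages are predictable from its earlier passages compresses well. A minimal sketch of this idea using Python's standard library follows; gzip is one possible compressor, and the function names and the conditional-size variant are illustrative assumptions rather than the study's exact implementation:

```python
import gzip

def compression_ratio(text: str) -> float:
    """Compressed size over raw size: lower values indicate more redundancy,
    i.e. lower Kolmogorov-style complexity."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

def new_text_cost(seen: str, new: str) -> int:
    """Extra compressed bytes needed for `new` once `seen` has been observed,
    a rough proxy for how predictable the new passage is."""
    return (len(gzip.compress((seen + new).encode("utf-8")))
            - len(gzip.compress(seen.encode("utf-8"))))
```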
The Freiburg Corpus of English Dialects Sampler (FRED-S) spans a subset of FRED texts, covering 1,011,396 words and c. 123 hours of recorded speech. It consists of 121 interviews with 144 dialect speakers from 5 major dialect areas (the Southwest of England, the Southeast of England, the Midlands, the North of England, and the Scottish Lowlands). The interviews were recorded between 1970 and 2000, the majority during the 1970s and 1980s. All FRED-S interviews are available in three formats and can be downloaded here:
• Audio files in mp3 format
• Transcripts in txt format
• Tagged transcripts in txt format
Frontiers in Artificial Intelligence, 2021
News organisations often allow public comments at the bottom of their news stories. These comments constitute a fruitful source of data to investigate linguistic variation online; their characteristics, however, are rather understudied. This paper thus contributes to the description of online news comments and online language in English. In this spirit, we apply multi-dimensional analysis to a large dataset of online news comments and compare them to a corpus of online registers, thus placing online comments in the space of register variation online. We find that online news comments are involved-evaluative and informational at the same time, but mostly argumentative in nature, with such argumentation taking an informal shape. Our analyses lead us to conclude that online registers are a different mode of communication, neither spoken nor written, with individual variation across different types of online registers.
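Multi-dimensional analysis derives its dimensions from a factor analysis of normalized linguistic feature frequencies across texts; each text then receives a score on each dimension. The sketch below is only a schematic Python analogue (Biber-style MDA uses a large feature set, rotated factors and thousands of texts; the toy matrix, feature names and scikit-learn's FactorAnalysis are stand-ins, not the study's pipeline):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# rows = texts, columns = per-1,000-word rates of linguistic features
# (toy numbers for illustration only)
features = ["first_person_pronouns", "nominalizations", "that_deletion"]
X = np.array([
    [42.0,  8.0, 6.0],   # e.g. a reader comment
    [ 5.0, 31.0, 0.5],   # e.g. a news article
    [38.0, 10.0, 5.0],
    [ 7.0, 28.0, 1.0],
])

Xz = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each feature
fa = FactorAnalysis(n_components=2, random_state=0)
dimension_scores = fa.fit_transform(Xz)     # one score per text per dimension
print(np.round(dimension_scores, 2))
```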
Research on language complexity has been abundant and manifold in the past two decades. Within typology, it has to a very large extent been motivated by the question of whether all languages are equally complex, and if not, which language-external factors affect the distribution of complexity across languages. To address this and other questions, a plethora of different metrics and approaches has been put forward to measure the complexity of languages and language varieties. Against this backdrop we address three major gaps in the literature by discussing statistical, theoretical, and methodological problems related to the interpretation of complexity measures. First, we explore core statistical concepts to assess the meaningfulness of measured differences and distributions in complexity based on two case studies. In other words, we assess whether observed measurements are neither random nor negligible. Second, we discuss the common mismatch between measures and their intended meani...
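One generic way to check that a measured complexity difference between two groups of texts is "neither random nor negligible" is a permutation test on the group means. The sketch below is illustrative only; the scores are invented and the specific test is an assumption, not necessarily the procedure used in the paper's case studies:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_p(scores_a, scores_b, n_perm=10_000):
    """Two-sided permutation p-value for the difference in mean complexity."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        hits += diff >= observed
    return hits / n_perm

# invented per-text complexity scores for two registers
formal   = [0.62, 0.64, 0.61, 0.66, 0.63]
informal = [0.58, 0.57, 0.60, 0.56, 0.59]
print(permutation_p(formal, informal))   # small p: difference unlikely to be random
```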
This paper presents an unsupervised information-theoretic measure that is a promising candidate for becoming a universally applicable metric of language complexity. The measure boils down to Kolmogorov complexity and uses compression programs to assess the complexity in text samples via their information content. Generally, better compression rates indicate lower complexity. In this paper, the measure is applied to a typological dataset of 37 languages covering 7 different language families. Specifically, overall, morphological and syntactic complexity are measured. The results often coincide with intuitive complexity judgements, e.g. Afrikaans is overall comparatively simple, Turkish is morphologically complex. Yet, in some cases the results are surprising, e.g. Chinese turns out to be morphologically highly complex. It is concluded that the method needs further adaptation for the application to different writing systems. Despite this caveat, the method is in principle applicable t...
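In this line of work, morphological and syntactic complexity are typically teased apart by distorting the text before compression: deleting a small share of characters disrupts morphological regularities, deleting a small share of word tokens disrupts syntactic regularities, and the effect on compressibility is then compared. The following is a hedged sketch of that distortion step; the 10% rate, helper names and the size comparison are assumptions about the general Juola-style procedure, not the paper's precise formulas:

```python
import gzip
import random

def delete_random_chars(text: str, share: float = 0.10, seed: int = 1) -> str:
    """Morphological distortion: drop a random share of characters."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if rng.random() >= share)

def delete_random_words(text: str, share: float = 0.10, seed: int = 1) -> str:
    """Syntactic distortion: drop a random share of word tokens."""
    rng = random.Random(seed)
    return " ".join(w for w in text.split() if rng.random() >= share)

def compressed_size(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8")))

def distortion_effect(text: str, distort) -> float:
    """Ratio of compressed sizes with and without distortion."""
    return compressed_size(distort(text)) / compressed_size(text)
```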
Register Studies, 2020
This article focuses on the question of whether online news comments are like face-to-face conversation or not. It is a widespread view that online comments are like "dialogue", with comments often being referred to as "conversations". These assumptions, however, lack empirical back-up. In order to answer this question, we systematically explore register-relevant properties of online news comments using multi-dimensional analysis (MDA) techniques. Specifically, we apply MDA to establish what online comments are like by describing their linguistic features and comparing them to traditional registers (e.g. face-to-face conversation, academic writing). Thus, we tap the SFU Opinion and Comments Corpus and the Canadian component of the International Corpus of English. We show that online comments are not like spontaneous conversation but rather closer to opinion articles or exams, and clearly constitute a written register. Furthermore, they should be described as instances of argumentati...
Linguistic Issues in Language Technology, 2014
This chapter demonstrates how compression algorithms can be used to address morphological and syntactic complexity in detail by analysing the contribution of specific linguistic features to English texts. The point of departure is the ongoing complexity debate and quest for complexity metrics. After decades of adhering to the equal complexity axiom, recent research seeks to define and measure linguistic complexity (Dahl 2004; Kortmann and Szmrecsanyi 2012; Miestamo et al. 2008). Against this backdrop, I present a new flavour of the Juola-style compression technique (Juola 1998): targeted manipulation. Essentially, compression algorithms are used to measure linguistic complexity via the relative informativeness in text samples. Thus, I assess the contribution of morphs such as –ing or –ed, and functional constructions such as progressive (be + verb-ing) or perfect (have + verb past participle) to the syntactic and morphological complexity in a mixed-genre corpus of Alice's Adventures...
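Targeted manipulation, as described here, removes a specific feature from the text and checks how its absence affects compressibility relative to the unmanipulated baseline. A minimal sketch of that idea follows; the regular expression for -ing forms, the tiny sample and the simple size comparison are simplified assumptions, and the chapter's actual manipulations are more carefully controlled:

```python
import gzip
import re

def compressed_size(text: str) -> int:
    return len(gzip.compress(text.encode("utf-8")))

def feature_contribution(text: str, pattern: str) -> float:
    """Relative reduction in compressed size once all matches of a feature
    have been removed; such differences are interpreted against baselines."""
    manipulated = re.sub(pattern, "", text)
    base = compressed_size(text)
    return (base - compressed_size(manipulated)) / base

# crude stand-in for stripping -ing morphs from a (tiny) text sample
sample = ("she was considering in her own mind whether the pleasure of making "
          "a daisy-chain would be worth the trouble of getting up and picking "
          "the daisies")
print(feature_contribution(sample, r"(?<=\w)ing\b"))
```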
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), 2018
We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks. We propose a method of estimating robustness of the complexity values obtained using a given measure and a given treebank. The results indicate that measures of syntactic complexity might be on average less robust than those of morphological complexity. We also estimate the validity of complexity measures by comparing the results for very similar languages and checking for unexpected differences. We show that some of the differences that arise can be diminished by using parallel treebanks and, more importantly from the practical point of view, by harmonizing the language-specific solutions in the UD annotation.
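As an illustration of a corpus-based measure derived from a UD treebank, the sketch below computes a simple morphological richness score (distinct FEATS bundles per token) from CoNLL-U input and resamples sentences to gauge how stable the value is. The measure, column handling and bootstrap procedure are assumptions made for illustration, not the specific measures or the robustness method evaluated in the paper:

```python
import random

def read_sentences(conllu_text):
    """Split CoNLL-U text into sentences, each a list of FEATS strings."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 10 and cols[0].isdigit():   # skip multiword/empty nodes
                current.append(cols[5])                  # FEATS column
    if current:
        sentences.append(current)
    return sentences

def feats_per_token(sentences):
    tokens = [f for sent in sentences for f in sent]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def bootstrap_spread(sentences, n=1000, seed=0):
    """Spread of the measure under resampling of sentences (robustness proxy)."""
    rng = random.Random(seed)
    values = [feats_per_token([rng.choice(sentences) for _ in sentences])
              for _ in range(n)]
    values.sort()
    return values[int(0.025 * n)], values[int(0.975 * n)]
```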
Corpus Linguistics and Linguistic Theory, 2018
This article utilises an innovative, information-theoretic metric to assess complexity variation across written and spoken registers of British English. This is novel because previous research on language complexity mainly analysed complexity variation in typological data, single-language case studies or geographical varieties of the same language. The measure boils down to Kolmogorov complexity, which can be conveniently approximated with off-the-shelf compression programs. Essentially, text samples that can be compressed more efficiently count as linguistically simple. The dataset covers a wide range of traditional written and spoken registers (e.g. broadsheet newspapers, courtroom debate or face-to-face conversation), as sampled in the British National Corpus. It turns out that Kolmogorov-based register variation coincides with register formality such that informal registers are overall and morphologically less complex than more formal registers, but more complex in regard to synta...
Second Language Research, 2016
We present a proof-of-concept study that sketches the use of compression algorithms to assess Kolmogorov complexity, which is a text-based, quantitative, holistic, and global measure of structural surface redundancy. Kolmogorov complexity has been used to explore cross-linguistic complexity variation in linguistic typology research, but we are the first to apply it to naturalistic second language acquisition (SLA) data. We specifically investigate the relationship between the complexity of second language (L2) English essays and the amount of instruction the essay writers have received. Analysis shows that increased L2 instructional exposure predicts increased overall complexity and increased morphological complexity, but decreased syntactic complexity (defined here as less rigid word order). While the relationship between L2 instructional exposure and complexity is robust across a number of first language (L1) backgrounds, L1 background does predict overall complexity levels.
Complexity, Isolation, and Variation, 2016
International Journal of Corpus Linguistics, 2016
Lectometry is a corpus-based methodology that explores how multiple language-external dimensions shape language usage in an aggregate perspective. The paper combines this methodology with Semantic Vector Space modeling to investigate lexical variability in written Standard English, as sampled in the original Brown family of corpora (Brown, LOB, Frown and F-LOB). Based on a joint analysis of 303 lexical variables, which are semi-automatically extracted by means of an SVS, we find that lexical variation in the Brown family is systematically related to three lectal dimensions: discourse type (informative versus imaginative), standard variety (British English versus American English), and time period (1960s versus 1990s). It turns out that most lexical variables are sensitive to at least one of these three language-external dimensions, yet not every dimension has dedicated lexical variables: in particular, distinctive lexical variables for the real time dimension fail to emerge.
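Semantic Vector Space modeling retrieves candidate lexical variables by finding words whose distributional vectors lie close together, typically via cosine similarity. A minimal sketch of that retrieval step follows; the toy vectors and the threshold are invented, and the paper's SVS is built from large corpora with additional manual filtering:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def variant_candidates(target, vectors, threshold=0.7):
    """Words whose vectors are similar enough to the target word's vector."""
    t = vectors[target]
    return sorted(
        (w for w, v in vectors.items() if w != target and cosine(t, v) >= threshold),
        key=lambda w: -cosine(t, vectors[w]),
    )

# toy 4-dimensional vectors; a real SVS has hundreds of dimensions
vectors = {
    "sofa":   np.array([0.9, 0.1, 0.3, 0.0]),
    "couch":  np.array([0.8, 0.2, 0.4, 0.1]),
    "settee": np.array([0.7, 0.1, 0.5, 0.0]),
    "banana": np.array([0.0, 0.9, 0.1, 0.8]),
}
print(variant_candidates("sofa", vectors))   # ['couch', 'settee']
```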
Experience Counts: Frequency Effects in Language, 2016
This contribution investigates frequency effects in lexical sociolectometry, and explores, by way of a case study, variation in written English as sampled in the well-known Brown family of corpora. Lexical sociolectometry is a productive research paradigm that is concerned with studying aggregate lexical distances between varieties of a language. Lexical distance quantifies the extent to which different varieties use different labels to describe the same concept. If different labels are used in different varieties, then this will increase the lexical distance between the varieties. We aggregate over many different concepts in order to make generalizable claims about the distance between varieties, independently of a specific concept. Our central question is, "When generalizing across concepts, does concept frequency play a role in the aggregation?" To answer this question, we examine three types of frequency weighting: (i) boosting low-frequency concepts, (ii) boosting high-frequency concepts, and (iii) no frequency weighting at all, and investigate whether they have an effect on the aggregation. We find no such frequency effect, and discuss reasons for this absence in lexical sociolectometry.
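The three weighting schemes come down to how per-concept distances are averaged into one aggregate distance. A hedged sketch of such an aggregation is given below; the specific weight functions (inverse frequency, raw frequency, uniform) are plausible stand-ins, not necessarily the exact weights used in the study, and the numbers are invented:

```python
def aggregate_distance(distances, frequencies, scheme="none"):
    """Weighted mean of per-concept distances between two varieties."""
    if scheme == "boost_low":
        weights = [1.0 / f for f in frequencies]   # low-frequency concepts count more
    elif scheme == "boost_high":
        weights = list(frequencies)                # high-frequency concepts count more
    else:
        weights = [1.0] * len(distances)           # no frequency weighting
    return sum(w * d for w, d in zip(weights, distances)) / sum(weights)

# invented per-concept distances and corpus frequencies
distances   = [0.10, 0.45, 0.30, 0.05]
frequencies = [900,   40,  120, 2000]
for scheme in ("boost_low", "boost_high", "none"):
    print(scheme, round(aggregate_distance(distances, frequencies, scheme), 3))
```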