Papers by Sjur Nørstebø Moshagen
Nordlyd, Aug 30, 2022
In this article, we study the correction of spelling errors, specifically how spelling errors are made and how we can model them computationally in order to fix them. The article describes two different approaches to generating spelling correction suggestions for three Uralic languages: Estonian, North Sámi and South Sámi. The first approach to modelling spelling errors is rule-based: experts write rules that describe the kinds of errors that are made, and these are compiled into a finite-state automaton that models the errors. The second is data-driven: we show a machine learning algorithm a list of errors that humans have made, and it creates a neural network that can model the errors. Both approaches require lists of misspellings and an understanding of their contents; therefore, we also describe in detail the actual errors we have seen. We find that while both approaches create error correction systems, with current resources the expert-built systems are still more reliable.
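As a rough illustration of the rule-based idea described in this abstract, the sketch below applies a few weighted rewrite rules to a misspelling and keeps the results that occur in a word list. It is a plain-Python stand-in, not the finite-state automaton the article describes; the rules, their costs, the toy lexicon and the function name `suggest` are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical error rules: (misspelled substring, intended substring, cost).
# A lower cost means the rule writer considers the error more likely.
RULES: List[Tuple[str, str, float]] = [
    ("a", "á", 1.0),   # missing diacritic on a vowel
    ("s", "š", 1.0),   # missing diacritic on a consonant
    ("n", "nn", 2.0),  # missing doubled consonant
]


def suggest(word: str, lexicon: Set[str],
            rules: List[Tuple[str, str, float]] = RULES) -> List[Tuple[str, float]]:
    """Apply each rule once at every matching position and rank the hits.

    Only candidates found in the lexicon are kept; each keeps its lowest cost.
    """
    best: Dict[str, float] = {}
    for wrong, right, cost in rules:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in lexicon and cost < best.get(candidate, float("inf")):
                best[candidate] = cost
            start = word.find(wrong, start + 1)
    # Cheapest corrections first; ties are broken alphabetically.
    return sorted(best.items(), key=lambda item: (item[1], item[0]))


# Toy lexicon, assumed for illustration only.
TOY_LEXICON = {"sámi", "šaddat"}
print(suggest("sami", TOY_LEXICON))
print(suggest("saddat", TOY_LEXICON))
```

A compiled finite-state error model would compose such rules with the full lexicon and allow several errors per word; the loop here applies a single rule at a single position, which keeps the example short.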
When applying morphological analysis (TWOL) and disambiguation (Constraint Grammar) to a body of text, the linguist gets an opportunity to search for contextually disambiguated words on the basis of their base forms. A search on a specific base form gives a result that also includes all the inflected forms and some common derivations of the word. This way of assembling inflected words can offer a lot of help in text lemmatization and also in various types of frequency listings. This paper presents and discusses some frequency lists, which have been generated from a morphologically analysed and disambiguated annual volume of the newspaper Göteborgs-Posten.
Journal of the Digital Humanities Association of Southern Africa (DHASA), Jan 26, 2023
One avenue for supporting the continued use and revitalization of endangered languages in the current, pervasively computerized world is the creation of computational models of the often rich and complex morphology of these languages. Such computational models can be used as a basis for creating a suite of reader’s and writer’s tools, including e.g. (1) an intelligent electronic dictionary that combines the computational model and a lexical database allowing for linking any inflected form with the appropriate dictionary entry, as well as the generation of word paradigms, (2) an intelligent computer-aided language learning application (ICALL) that allows for the dynamic generation of large numbers of exercises combining the entire core vocabulary (up to several thousand of the most common words) and a substantially smaller set of exercise templates, and (3) a spell-checker that supports adherence to one or more existing orthographical conventions, and thus the production of good-qu...
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, 2014
This paper presents aspects of a computational model of the morphology of Plains Cree based on the technology of finite-state transducers (FST). The paper focuses in particular on the modeling of nominal morphology. Plains Cree is a polysynthetic language whose nominal morphology relies on prefixes, suffixes and circumfixes. The model of Plains Cree morphology is capable of handling these complex affixation patterns and the morphophonological alternations that they engender. Plains Cree is an endangered Algonquian language spoken in numerous communities across Canada. The language has no agreed-upon standard orthography, and exhibits widespread variation. We describe problems encountered and solutions found, while contextualizing the endeavor in the description, documentation and revitalization of First Nations Languages in Canada.
The article presents Vuosttas Digisanit (VD), an electronic dictionary from North Sami to Norwegian. Its novelty lies in the way we have utilized existing resources (a basic dictionary and a morphological analyser/generator) in order to create a reception dictionary for language learners for a morphologically rich language. With only 7.9% of the word forms in Sami running text being identical to the lemma form, an approach along the lines sketched here is a prerequisite for a text-integrated e-dictionary. Being a learner dictionary, VD also gives key paradigms for each lemma. These paradigms are generated when building the dictionary, using our language technology tools. We have also built an infrastructure that can be reused for other languages and dictionaries. Our approach shows how it is possible to build text-integrated electronic dictionaries for morphologically complex languages with limited means. The dictionary is available free of charge at: http://giellatekno.uit.no/words/di...
International Conference on Language Resources and Evaluation, 2018
This paper describes the development of a computational model of the morphology of Northern Haida based on finite state machines (FSMs), with a focus on verbs. Northern Haida is highly endangered, and a member of the isolate Haida macrolanguage, spoken in British Columbia and Alaska. Northern Haida is a highly-inflecting language whose verbal morphology relies largely on suffixes, with a limited number of prefixes. The suffixes trigger morphophonological changes in the stem, participate in blocking, and exhibit variable ordering in certain constructions. The computational model of Northern Haida verb morphology is capable of handling these complex affixation patterns and the morphophonological alternations that they engender. In this paper, we describe the challenges we encountered and the solutions we propose, while contextualizing the endeavour in the description, documentation and revitalization of First Nations Languages in Canada.
The article presents the Giellatekno & Divvun language technology resources, more specifically the effort to utilise open-source tools to improve the build infrastructure, and the solutions to help adapt to best practices for software development. The article especially discusses how the infrastructure has been remade to cope with an increasing number of languages without incurring extra overhead for the maintainers, and at the same time let the linguists concentrate on the linguistic work. Finally, the article discusses how a uniform infrastructure like the one presented can be used to easily compare languages in terms of morphological or computational complexity, coverage or for cross-lingual applications.
This paper describes an annotation system for Sámi language corpora, which consists of structured, running texts. The annotation of the texts is fully automatic, starting from the original documents in different formats. The texts are first extracted from the original documents preserving the original structural markup. The markup is enhanced by a document-specific XSLT script which contains document-specific formatting instructions. The overall maintenance is achieved by system-wide XSLT scripts.
Fifteen years of indigenous language technology development by UiT/Saami Parliament has resulted in spelling and grammar checkers, desktop/mobile keyboards, morphological analysers, MT, speech synthesis, language learning tools and intelligent electronic dictionaries. This was facilitated by an open-source, language-independent infrastructure, targeted at languages with rich and complex grammar, with integration for host operating systems and apps. The current primary challenge is integration with closed platforms where we cannot currently support user needs. Our proposed solution is a “Manifesto for Open Language Technology”, where APIs, localisations and source code are open, while ensuring community intellectual property custodianship, engagement and commitment.
2 Morphological Structure of the Dene Verb The morphological structure of verbs in Dene languages is considered to be about as complex as it can get among the languages of the world. However, the overall structure, with outer (disjunct), inner (conjunct), and stem ‘zones’ of verb (cf. Kari 1989) is not generally, and thus computationally, difficult to model. What is seen as most challenging primarily concerns the extensive morphological fusion in subject-aspect inflection immediately preceding the stem (cf. K. Rice 2005: 404–407), a span which “is at least historically a concatenation of morphemes before the verb stem” (K. Rice 2005: 404). As Keren Rice (2005: 405) delicately puts it, “[t]he morphophonemics of this span of the verb is complex.”
Proceedings of the Workshop on Computational Methods for Endangered Languages, 2019
Communities of lesser-resourced languages like North Sámi benefit from language tools such as spell checkers and grammar checkers to improve literacy. Accurate error feedback is dependent on well-tokenised input, but traditional tokenisation as shallow preprocessing is inadequate to solve the challenges of real-world language usage. We present an alternative where tokenisation remains ambiguous until we have linguistic context information available. This lets us accurately detect sentence boundaries, multiwords and compound errors. We describe a North Sámi grammar checker with such a tokenisation system, and show the results of its evaluation.
Proceedings of the Workshop on NLP for Reading and Writing – Resources, Algorithms and Tools (SLTC 2008). Editors: Rickard Domeij, Sofie Johansson Kokkinakis, Ola Knutsson and Sylvana Sofkova Hashemi. NEALT Proceedings Series, Vol. 3 (2009), 19-21. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/4116 .
Perspectives on Indigenous writing and literacies
Enhancing information accessibility and digital literacy for minorities using language technology: The example of Sami and other national minority languages in Sweden
The article discusses the correction of typos due to erroneous use of the so-called soft sign, one of the most common orthographic symbols in Skolt Sami and the most common source of typographic errors. The discussion is based upon the suggestion mechanism of an existing open-source Skolt Sami speller. The discussion shows that with an improved suggestion mechanism, the speller is able to restore a single soft sign error in over 97% of the cases, and remove a hypercorrect soft sign as first correction in 90% of the cases. Allowing the target form to be within top-5, the correction performance is well above 99%. Improving the suggestion mechanism also had a positive impact on the speller's overall performance, raising the percentage of target forms within top-5 from 74.1% to 84.7%.
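The "first correction" and "within top-5" figures in this abstract are rates of the following general kind. The sketch below is only an assumed way of computing such a rate, not the evaluation code used in the article; the function name `top_n_rate` and the placeholder data are hypothetical.

```python
from typing import Dict, List


def top_n_rate(suggestions: Dict[str, List[str]],
               targets: Dict[str, str], n: int) -> float:
    """Share of misspellings whose target form is among the first n suggestions.

    `suggestions` maps each misspelling to the speller's ranked suggestion list;
    `targets` maps each misspelling to the form the writer intended.
    """
    if not targets:
        return 0.0
    hits = sum(
        1
        for typo, target in targets.items()
        if target in suggestions.get(typo, [])[:n]
    )
    return hits / len(targets)


# Placeholder data: three misspellings with ranked suggestions.
ranked = {
    "typo1": ["target1", "other"],
    "typo2": ["other", "another", "target2"],
    "typo3": [],
}
intended = {"typo1": "target1", "typo2": "target2", "typo3": "target3"}

print(top_n_rate(ranked, intended, 1))  # target ranked first
print(top_n_rate(ranked, intended, 5))  # target anywhere in the top five
```

With n set to 1 the function measures how often the intended form is the first suggestion; with n set to 5 it measures how often the intended form appears anywhere among the first five.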