Natural language use is full of choices among multiple possible alternatives, whether phones, words, or constructions. These choices are influenced by a large number of contextual factors, and they exhibit asymptotic, imperfect tendencies favoring one or more of the alternatives rather than single, categorical, perfect choices. This contrasts with item-by-item learning in simple controlled experiments, which has typically been modelled by the Rescorla-Wagner equations. We regard the former "messy" types of problems as a key area of interest in modeling and understanding language use, and consequently consider the application of the Rescorla-Wagner equations, in the form of a Naive Discriminative Learning classifier, to such complex phenomena to be of considerable utility in linguistic research. This is an updated version. (Previous version: http://dx.doi.org/10.15496/publikation-8620)
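As a rough illustration of the learning rule named in this abstract, the sketch below applies the standard Rescorla-Wagner update incrementally to a stream of learning events. It is a minimal sketch, not the authors' Naive Discriminative Learning implementation: the function name `rescorla_wagner`, the single combined learning rate, and the toy cue "x" with outcomes "A" and "B" are all assumptions made for illustration.

```python
from collections import defaultdict


def rescorla_wagner(events, outcomes, rate=0.01, lam=1.0):
    """Incrementally learn cue-to-outcome association weights.

    events:   iterable of (cues, observed_outcome) pairs, one per learning event
    outcomes: all outcomes that could occur
    rate:     combined learning rate (alpha * beta in the classic notation)
    lam:      maximum association strength an outcome can support
    """
    weights = defaultdict(float)  # (cue, outcome) -> association strength
    for cues, observed in events:
        for outcome in outcomes:
            # Summed support that the cues present in this event give the outcome.
            activation = sum(weights[(cue, outcome)] for cue in cues)
            # Present outcomes pull the sum towards lam, absent ones towards 0.
            target = lam if outcome == observed else 0.0
            delta = rate * (target - activation)
            for cue in cues:
                weights[(cue, outcome)] += delta
    return weights


# Hypothetical toy data: cue "x" is followed by "A" three times out of four.
toy_events = ([({"x"}, "A")] * 3 + [({"x"}, "B")]) * 500
toy_weights = rescorla_wagner(toy_events, ["A", "B"])
print(toy_weights[("x", "A")], toy_weights[("x", "B")])
```

With a single cue, the two weights settle near the relative frequencies of the outcomes (about 0.75 and 0.25 here) rather than at 1 and 0, which is one way to picture the "imperfect tendencies" the abstract describes.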
This paper discusses the development and evaluation of a speech synthesizer for Plains Cree, an Algonquian language of North America. Synthesis is achieved using Simple4All, and evaluation was performed using a modified Cluster Identification task, a Semantically Unpredictable Sentence task, and a basic dichotomized judgment task. The resulting synthesis was not well received; however, observations regarding the process of speech synthesis evaluation in North American Indigenous communities were made: chiefly, that tolerance for variation is often much lower in these communities than for majority languages. The evaluator did not recognize grammatically consistent but semantically nonsensical strings as licit language. As a result, monosyllabic clusters and semantically unpredictable sentences proved not to be the most appropriate evaluation tools. Alternative evaluation methods are discussed.
A persistent challenge in the creation of semantically classified dictionaries and lexical resources is the lengthy and expensive process of manual semantic classification, a hindrance which can make adequate semantic resources unattainable for under-resourced language communities. We explore here an alternative to manual classification using a vector semantic method, which, although not yet at the level of human sophistication, can provide usable first-pass semantic classifications in a fraction of the time.
Terms of use for this article: This article is covered by the Copyright Act (ophavsretsloven), and it may be quoted. However, the following conditions must be met: the quotation must be in accordance with "good practice"; quotation is permitted only "to the extent warranted by the purpose"; and the author of the text must be credited and the source stated, cf. the bibliographic information above. Searchability: The articles in the older volumes of Nordiske studier i leksikografi (1-5) have been scanned and processed with OCR. OCR stands for 'optical character recognition' and can convert an image into text by recognizing characters, which makes the text searchable. However, errors can occur in character recognition, and when searching for names, for example, one should be prepared for the search not to be 100% reliable.
These proceedings contain the papers presented at the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages, held in Honolulu, March 6-7, 2017. The workshop itself was co-located with, and took place after, the 5th International Conference on Language Documentation and Conservation (ICLDC) at the University of Hawai'i at Mānoa. As the name implies, this is the second workshop held on the topic; the previous meeting was co-located with the ACL main conference in Baltimore, Maryland in 2014. The workshop covers a wide range of topics relevant to the study and documentation of endangered languages, ranging from technical papers on working systems and applications to reports on community activities with supporting computational components. The purpose of the workshop is to bring together computational researchers, documentary linguists, and people involved with community efforts of language documentation and revitalization to take part in both formal and informal exchanges on how to integrate rapidly evolving language processing methods and tools into efforts of language description, documentation, and revitalization. The organizers are pleased with the range of papers, many of which highlight the importance of interdisciplinary work and interaction between the various communities that the workshop is aimed at. We received 39 submissions as long papers, short papers, or extended abstracts, of which 23 were selected for this volume (59%). In the proceedings, all papers are either short (≤5 pages) or long (≤9 pages). In addition, the workshop also features presentations from representatives of the National Science Foundation (NSF).
Two panel discussions, on the interaction between computational linguistics and the documentation and revitalization community and on the future planning of ComputEL, underlined the demand for and necessity of a workshop of this nature.
Over the past four decades, two distinct alternatives have emerged to rule-based models of how linguistic categories are stored and represented as cognitive structures, namely the prototype and exemplar theories. Although these models were initially thought to be mutually exclusive, shifts from one mechanism to the other have been observed in category learning experiments, bringing the models closer together. In this paper we implement a technique akin to varying abstraction modelling, which assumes that intermediate abstraction processes underlie category representations and categorization decisions; we do so using familiar statistical techniques such as regression and clustering that track frequency distributions in input. With this model we simulate, on the basis of actual usage of Russian try verbs and Finnish think verbs as observed in corpora, how prototypes for near-synonymous verbs could be formed from concrete exemplars at different levels of abstraction. In so doing, we take a closer look at the cognitive linguistic flirtation with multiple categorization theories, suggesting three improvements anchored in the fact that cognitive linguistics is a usage-based theory of language. Firstly, we show that language provides support for considering single prototype and full exemplar models as opposite ends along a continuum of abstraction. Secondly, we present a methodology that simulates how prototypes can be obtained from exemplars at more than one level of abstraction in a systematic and verifiable way. And thirdly, we illustrate our claims on the basis of work on verbs, which denote intangible events that are neither stable in nor independent of time and which express relational concepts; this implies that verbs are more susceptible to their meanings being influenced by the concepts they relate.
The complex inflectional and derivational morphology of Plains Cree and other Algonquian languages has long been considered from both a synchronic and diachronic perspective (e.g. Bloomfield 1946; Goddard 1974; Oxford 2014). While the composition of some modern Plains Cree stems has been obscured by sound change, they can often still be identified by linguists, and for speakers, many morphemes are available to freely derive new stems. Unlike derivational morphology, the inflectional morphology of Cree is quite regular and lends itself to straightforward description and this has translated to a computational model that can analyze inflected forms of Plains Cree lemmata (e.g. Harrigan et al. forthcoming; Snoek et al. 2014). Though the derivational morphology poses more challenges to model, lists of existing derivational morphemes can be extracted from existing sources and various morphophonological rules have been described (Cook and Muehlbauer 2010; Wolfart 1996; Wolvengrey 2001). However, we can make use of the derivational model to assess how well the rules and morphemes given for Plains Cree apply when tested against lemmas included in available dictionaries. This approach, following Karttunen (2006), allows us to test theoretical descriptions against larger data sets than those used to produce the rules: where the human mind can only make sense of so much data at once, a quantitative approach can take thousands of words into account. In this article, we present the first version of a computational model for Plains Cree derivational morphology, using a weighted finite-state transducer, and discuss its
In this paper, we describe a computational model of Upper Tanana, a highly endangered Dene (Athabaskan) language spoken in eastern interior Alaska (USA) and in the Yukon Territory (Canada). This model not only parses and generates inflected Upper Tanana verb forms, but also uses the language's verb theme category system, a system of lexical-inflectional verb classes, to predict possible derivations and their morphological behavior. This allows us to model a large portion of the Upper Tanana verb lexicon, making it more accessible to learners and scholars alike. Generated derivations will be compared against the narrative corpus of the language as well as to the (much more comprehensive) lexical documentation of closely related languages.
International Conference on Computational Linguistics, Aug 1, 2018
In this article, we discuss which text, speech, and image technologies have been developed, and would be feasible to develop, for the approximately 60 Indigenous languages spoken in Canada. In particular, we concentrate on technologies that may be feasible to develop for most or all of these languages, not just those that may be feasible for the few most-resourced of these. We assess past achievements and consider future horizons for Indigenous language transliteration, text prediction, spell-checking, approximate search, machine translation, speech recognition, speaker diarization, speech synthesis, optical character recognition, and computer-aided language learning.
The cumulative effects hypothesis (CEH) claims that bilingual development would be a challenge for children with specific language impairment (SLI). To date, research on second language (L2) children with SLI has been limited mainly to their early years of L2 exposure; however, examining the long-term outcomes of L2 children with SLI is essential for testing the CEH. Accordingly, the present study examined production and grammaticality judgments of English tense morphology from matched groups of L2 children with SLI and L2 children with typical development (TD) for 3 years, from ages 8 to 10 with 4-6 years of exposure to English. This study found that the longitudinal acquisition profile of the L2 children with SLI and TD was similar to the acquisition profile reported for monolinguals with SLI and TD. Furthermore, L2-SLI children's accuracy with tense morphology was similar to that of their monolingual age peers with SLI at the end of the study, and exceeded that of younger monolingual peers with SLI whose age matched the L2 children's length of exposure to English. These findings are not consistent with the CEH, but instead show that morphological acquisition parallel to monolinguals with SLI is possible for L2 children with SLI. Children with specific language impairment (SLI) are late talkers whose language delay extends into their school years (Leonard, 2014). These children's protracted language development is not the consequence of other identifiable sensory, neurodevelopmental, or acquired disorders, for example, hearing loss, autism spectrum disorder, intellectual disability, or neurological trauma (Leonard, 2014). Studies have found that children with SLI show deficits in verbal memory and processing mechanisms compared to their peers with typical development (TD;
The third Workshop on Quantitative Investigations in Theoretical Linguistics (QITL3), to be held on Monday-Wednesday, 2-4 June, 2008, in Helsinki, Finland, is co-hosted by the Linguistic Association of Finland (SKY) in association with the Department of General Linguistics at the University of Helsinki. This workshop is both a continuation of the two previous QITL events held in 2002 and 2006 in Osnabrück, Germany, and the latest in the sequence of summer symposia arranged annually by SKY. We are grateful to a number of people and organizations for their support and assistance in making this Workshop happen.
In the near-synonym lexical choice task, the best alternative out of a set of near-synonyms is selected to fill a lexical gap in a text. We experiment with an extensive set of over 650 linguistic features to represent the context of a word, and with a range of machine learning approaches to the lexical choice task. We extend previous work by experimenting with unsupervised and semi-supervised methods, and use automatic feature selection to cope with the problems arising from the rich feature set. It is natural to think that linguistic analysis of the word context would yield almost perfect performance in the task, but we show that too many features, even linguistic ones, introduce noise and make the task difficult for unsupervised and semi-supervised methods. We also show that purely syntactic features play the biggest role in performance, but that certain semantic and morphological features are also needed.
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022
The composition of richly-inflected words in morphologically complex languages can be a challenge for language learners developing literacy. Accordingly, Lane and Bird (2020) proposed a finite state approach which maps prefixes in a language to a set of possible completions up to the next morpheme boundary, for the incremental building of complex words. In this work, we develop an approach to morph-based auto-completion based on a finite state morphological analyzer of Plains Cree (nêhiyawêwin), showing the portability of the concept to a much larger, more complete morphological transducer. Additionally, we propose and compare various novel ranking strategies on the morph auto-complete output. The best weighting scheme ranks the target completion in the top 10 results in 64.9% of queries, and in the top 50 in 73.9% of queries.
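The idea of completing a typed prefix up to the next morpheme boundary, and of scoring a ranking by top-k hits, can be sketched as follows. This is only an illustration under stated assumptions: a small hand-written dictionary of segmented English-like words stands in for the finite state transducer, the frequencies are made up, and ranking by summed frequency is an assumed placeholder rather than any of the weighting schemes compared in the paper. The names `LEXICON`, `complete`, and `top_k_rate` are hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon: each word is a tuple of morphs with a made-up frequency.
LEXICON = {
    ("un", "lock", "ed"): 5,
    ("un", "lock", "ing"): 3,
    ("un", "load", "ed"): 2,
    ("re", "load", "ed"): 4,
}


def complete(prefix, lexicon=LEXICON, k=10):
    """Return up to k strings that extend prefix to the next morph boundary."""
    scores = Counter()
    for morphs, freq in lexicon.items():
        word = "".join(morphs)
        if not word.startswith(prefix):
            continue
        # Walk the morph boundaries and stop at the first one past the prefix.
        end = 0
        for morph in morphs:
            end += len(morph)
            if end > len(prefix):
                scores[word[:end]] += freq
                break
    # Assumed ranking: candidates supported by more frequent words come first.
    return [candidate for candidate, _ in scores.most_common(k)]


def top_k_rate(queries, k):
    """Share of (prefix, target) queries whose target is among the top k."""
    hits = sum(target in complete(prefix, k=k) for prefix, target in queries)
    return hits / len(queries)


print(complete("un"))      # candidates ending at the next boundary after "un"
print(complete("unlock"))  # candidates one morph further along
```

A user would repeat the call after picking a candidate, building the word one morph at a time; `top_k_rate` mirrors the kind of top-10 and top-50 figures reported in the abstract, although the numbers it yields on this toy lexicon carry no meaning.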
This paper details a semi-automatic method of word clustering for the Algonquian language Nêhiyawêwin (Plains Cree). Although this method worked well, particularly for nouns, it required some amount of manual postprocessing. The main benefit of this approach over implementing an existing classification ontology is that this method approaches the language from an endogenous point of view, while performing classification more quickly than in a fully manual context. (Footnote 1: There is one attempt at semantically classifying Nêhiyawêwin through automatic means, found in Dacanay et al. (2021). That work makes use of techniques similar to those described in this paper, differing mainly in its mapping of Nêhiyawêwin words onto WordNet classes.)
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)
Modern machine learning techniques have produced many impressive results in language technology, but these techniques generally require an amount of training data that is many orders of magnitude greater than what exists for low-resource languages in general, and endangered languages in particular. However, dictionary definitions in a comparatively much more well-resourced majority language can provide a link between low-resource languages and machine learning models trained on massive amounts of majority-language training data. Promising results have been achieved by leveraging embeddings from such models in the search mechanisms of bilingual dictionaries of Plains Cree (nêhiyawêwin), Arapaho (Hinóno'éitíit), Northern Haida (Xaad Kíl), and Tsuut'ina (Tsúùt'ínà), four Indigenous languages spoken in North America. Not only are the search results in the majority language of the definitions more relevant, but they can be semantically relevant in ways not achievable with classic information retrieval techniques: users can perform successful searches for words that do not occur at all in the dictionary. Moreover, these techniques are directly applicable to any bilingual dictionary providing translations between a high- and low-resource language.
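How a query word that never appears in the dictionary can still retrieve a relevant entry can be pictured with a small sketch. Everything concrete here is an assumption for illustration: the two-dimensional vectors are hand-written rather than taken from a pretrained model, the headwords are placeholders rather than real dictionary entries, and averaging word vectors with cosine ranking is one simple choice, not necessarily the method used in the paper.

```python
import math

# Hypothetical hand-written embeddings; a real system would load vectors
# from a model trained on large amounts of majority-language text.
EMBEDDINGS = {
    "dog": (1.0, 0.0),
    "puppy": (0.9, 0.1),
    "boat": (0.0, 1.0),
    "canoe": (0.1, 0.9),
}

# Placeholder headwords with English definitions (not real dictionary data).
DICTIONARY = {"headword_1": "dog", "headword_2": "boat"}


def embed(text):
    """Average the vectors of the known words in text, or None if none are known."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return None
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query):
    """Rank headwords by similarity between the query and each definition."""
    query_vec = embed(query)
    if query_vec is None:
        return []
    scored = []
    for headword, definition in DICTIONARY.items():
        def_vec = embed(definition)
        if def_vec is not None:
            scored.append((cosine(query_vec, def_vec), headword))
    return [headword for _, headword in sorted(scored, reverse=True)]


# "puppy" occurs in no definition, yet the entry defined as "dog" ranks first.
print(search("puppy"))
```

A plain keyword search for "puppy" would return nothing from this toy dictionary, which is the contrast with classic information retrieval that the abstract draws.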
Papers by Antti Arppe