Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, few such datasets exist for specialized, critical domains such as law, and those that are available often cover only English. We curate and release MULTILEGALPILE, a 689GB corpus in 24 languages from 17 jurisdictions. The MULTILEGALPILE corpus, which includes diverse legal data sources with varying licenses, allows for pretraining NLP models under fair use, with more permissive licenses for the Eurlex Resources and Legal mC4 subsets. We pretrain two RoBERTa models and one Longformer multilingually, as well as 24 monolingual models, one on each of the language-specific subsets, and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new SotA on LEXTREME and our English models on LexGLUE. We release the dataset, the trained models, and all of the code under the most open licenses possible.
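The abstract notes that both the dataset and the pretrained models are publicly released. As a minimal sketch of how one might load a language-specific subset of the corpus and one of the released checkpoints through the Hugging Face Hub, the snippet below can serve as a starting point; the dataset and model identifiers, the configuration name, and the text field are assumptions for illustration, not names confirmed by the paper.

```python
# Minimal sketch: loading a MultiLegalPile subset and a pretrained legal model.
# The repository IDs, configuration name, and "text" field below are assumed
# for illustration; check the authors' release for the actual identifiers.
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Stream one language-specific subset so the full 689GB corpus is not downloaded.
corpus = load_dataset(
    "joelito/Multi_Legal_Pile",  # assumed dataset ID
    "de_legislation",            # assumed configuration name
    split="train",
    streaming=True,
)
print(next(iter(corpus))["text"][:500])  # assumed field name

# Load one of the released multilingual legal checkpoints for further
# pretraining or fine-tuning on downstream tasks.
model_id = "joelito/legal-xlm-roberta-base"  # assumed model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```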
Lately, propelled by the phenomenal advances around the transformer architecture, the legal NLP field has enjoyed spectacular growth. To measure progress, well-curated and challenging benchmarks are crucial. However, most benchmarks are English-only, and in legal NLP specifically there is no multilingual benchmark available yet. Additionally, many benchmarks are saturated, with the best models clearly outperforming the best humans and achieving near-perfect scores. We survey the legal NLP literature and select 11 datasets covering 24 languages, creating LEXTREME. To provide a fair comparison, we propose two aggregate scores, one based on the datasets and one on the languages. The best baseline (XLM-R large) achieves both a dataset aggregate score and a language aggregate score of 61.3. This indicates that LEXTREME is still very challenging and leaves ample room for improvement. To make it easy for researchers and practitioners to use, we release LEXTREME on the Hugging Face Hub together with all the code required to evaluate models and a public Weights and Biases project with all the runs.
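To make the two aggregate scores concrete, the sketch below computes a dataset aggregate and a language aggregate from a table of per-(dataset, language) scores. The example scores are made up, and the use of the harmonic mean is an assumption for illustration; the paper's exact aggregation function may differ.

```python
# Sketch of the two LEXTREME-style aggregate scores: one over datasets, one
# over languages. Harmonic mean and the example scores are illustrative only.
from statistics import harmonic_mean

# Hypothetical per-(dataset, language) scores for one model.
scores = {
    "greek_legal_code": {"el": 0.62},
    "swiss_judgment_prediction": {"de": 0.70, "fr": 0.68, "it": 0.66},
    "multi_eurlex": {"de": 0.55, "fr": 0.57, "en": 0.60},
}

# Dataset aggregate: aggregate each dataset over its languages, then over datasets.
per_dataset = [harmonic_mean(list(langs.values())) for langs in scores.values()]
dataset_aggregate = harmonic_mean(per_dataset)

# Language aggregate: aggregate each language over the datasets covering it.
by_language = {}
for langs in scores.values():
    for lang, score in langs.items():
        by_language.setdefault(lang, []).append(score)
per_language = [harmonic_mean(vals) for vals in by_language.values()]
language_aggregate = harmonic_mean(per_language)

print(f"dataset aggregate:  {dataset_aggregate:.3f}")
print(f"language aggregate: {language_aggregate:.3f}")
```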
Recent strides in Large Language Models (LLMs) have saturated many NLP benchmarks (even professional, domain-specific ones), emphasizing the need for more challenging benchmarks to properly assess LLM capabilities. In this paper, we introduce a novel NLP benchmark that poses challenges to current LLMs across four key dimensions: processing long documents (up to 50K tokens), utilizing domain-specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), and multitasking (comprising legal document-to-document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks). Our benchmark comprises diverse legal NLP datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual, federal legal system. Despite recent advances, efficiently processing long documents for intensive review and analysis tasks remains an open challenge for LLMs. Also, comprehensive, domain-specific benchmarks requiring high expertise to develop are rare, as are multilingual benchmarks. This scarcity underscores our contribution's value, considering that most public models are trained predominantly on English corpora, while other languages remain understudied, particularly for practical domain-specific NLP tasks. Our benchmark allows for testing and advancing state-of-the-art LLMs. As part of our study, we evaluate several pre-trained multilingual language models on our benchmark to establish strong baselines as a point of reference. Despite the large size of our datasets (tens to hundreds of thousands of examples),
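Since documents of up to 50K tokens exceed the context window of most standard encoder models, one common workaround when establishing baselines is to split a document into overlapping windows and pool the per-window predictions. The sketch below illustrates this for a binary text classification task with XLM-R; the model choice, window and stride sizes, and logit-averaging strategy are illustrative assumptions rather than the exact baseline configuration used in the paper, and the classification head would still need to be fine-tuned on task data.

```python
# Sketch: classifying a long legal document with a fixed-context encoder by
# splitting it into overlapping 512-token windows and averaging the logits.
# Model, window size, stride, and pooling are illustrative assumptions; the
# untuned classification head below must be fine-tuned before real use.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model.eval()

def classify_long_document(text: str, window: int = 512, stride: int = 128) -> int:
    # Tokenize into overlapping windows; `stride` is the number of tokens of
    # overlap between consecutive windows.
    enc = tokenizer(
        text,
        max_length=window,
        stride=stride,
        truncation=True,
        padding="max_length",
        return_overflowing_tokens=True,
        return_tensors="pt",
    )
    enc.pop("overflow_to_sample_mapping", None)  # not a model input
    with torch.no_grad():
        logits = model(**enc).logits             # shape: (n_windows, n_labels)
    # Pool predictions over windows by averaging logits, then pick the label.
    return int(logits.mean(dim=0).argmax())
```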
The symposium Language in Contact: Yesterday–Today–Tomorrow took place June 21–23, 2017 and was organized by The Graduate School Language & Literature Munich - Class of Language. Scholars using interdisciplinary approaches were invited to Munich and conveyed both traditional and innovative insights into the vast field of language contact. This included both diachronic (Yesterday) and synchronic contributions (Today) as well as papers discussing the future of contact linguistics (Tomorrow). At the symposium, language contact was defined in a broad sense as the language that emerges when speakers of different languages influence one another's speech; this brought together multiple areas of linguistic study, ranging from language change and language policy to language acquisition and language processing. Key to the conference was connecting what we can learn from past instances of language contact to our understanding of language phenomena in present and future research.
Many theories hold that language change, at least on a local level, is driven by a need for improvement. The present volume explores to what extent this assumption holds true, and whether there is a particular type of language change that we dub language change for the worse, i.e., change with a worsening effect that cannot be explained away as a side-effect of improvement in some other area of the linguistic system. The chapters of the volume, written by leading junior and senior scholars, combine expertise in diachronic and historical linguistics, typology, and formal modelling. They focus on different aspects of grammar (phonology, morphosyntax, semantics) in a variety of language families (Germanic, Romance, Austronesian, Bantu, Jê-Kaingang, Wu Chinese, Greek, Albanian, Altaic, Indo-Aryan, and languages of the Caucasus). The volume contributes to ongoing theoretical debates and discussions between linguists with different theoretical orientations.
Language change for the worse (Studies in Diversity Linguistics), 2024
The relevant literature reports differences in the use of clitic doubling across Albanian dialects. Quantitative corpus studies show that all dialects spoken outside the Republic of Albania use clitic doubling more frequently. The data from this corpus show that this less restrictive use of clitic doubling is not accompanied by increasing transparency in its usage. In contrast to Standard Albanian, where the use of clitic doubling is not optional and can almost without exception be explained by topic and focus marking, the peripheral Albanian dialects outside the Republic of Albania exhibit numerous exceptions to this general tendency. To explain these exceptions, a wide variety of factors must be taken into account; in certain contexts, these factors point to an optional use of clitic doubling. From a descriptive point of view, these exceptions suggest an increasing degree of functional opacity.