International Conference on Language Resources and Evaluation, 2020
This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency and/or noun phrase annotation, the corpus is enriched with the IATE and EuroVoc labels. The file format is CoNLL-U Plus Format, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
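As an illustration of the file format described in the abstract above, the following is a minimal sketch of reading a CoNLL-U Plus file in Python. CoNLL-U Plus declares its column layout in a leading `# global.columns = ...` comment, so the reader takes the column names from that line instead of hard-coding them. The four extra column names (`X:NE`, `X:NP`, `X:IATE`, `X:EUROVOC`), the function name `read_conllup`, and the two-token sample sentence are all hypothetical and invented for this example; the names actually used in the MARCELL corpora may differ.

```python
from typing import Dict, Iterator, List

# Hypothetical sample: ten standard CoNLL-U columns plus four made-up extra ones.
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
          "X:NE X:NP X:IATE X:EUROVOC")
TOKENS = [
    ["1", "Zakon", "zakon", "NOUN", "_", "_", "2", "nsubj", "_", "_", "O", "B-NP", "_", "_"],
    ["2", "velja", "veljati", "VERB", "_", "_", "0", "root", "_", "_", "O", "O", "_", "_"],
]
SAMPLE = "\n".join([HEADER, "# sent_id = demo-1"]
                   + ["\t".join(tok) for tok in TOKENS]) + "\n\n"


def read_conllup(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts."""
    columns: List[str] = []
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # The column layout is declared once, at the top of the file.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (e.g. sent_id) are skipped here
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            values = line.split("\t")
            if len(values) != len(columns):
                raise ValueError(f"expected {len(columns)} columns, got {len(values)}")
            sentence.append(dict(zip(columns, values)))
    if sentence:  # file did not end with a blank line
        yield sentence


sentences = list(read_conllup(SAMPLE))
for token in sentences[0]:
    print(token["FORM"], token["UPOS"], token["X:NP"])
```

Reading the column names from the header keeps the sketch independent of any particular set of extra columns, which is the point of the Plus variant of the format.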
Proceedings of the 19th ACM international conference on Multimedia - MM '11, 2011
This paper gives a short overview of the Opencast Matterhorn system. Built by an open community of individuals and institutions, Matterhorn provides a lecture capture platform for both research and production environments. Matterhorn is comprehensive ...
Executive Summary: In this deliverable, we present a conceptual architecture of the ALERT system. The purpose of the conceptual architecture is to direct attention to an appropriate decomposition of the ALERT system without delving into details. Moreover, the deliverable provides a useful vehicle for communicating the architecture to nontechnical audiences, such as management, marketing, and end-users. In order to create and to justify the conceptual architecture of the ALERT system (section 2), we have applied the following ...
This paper gives a short overview of the Opencast Matterhorn system. Built by an open community of individuals and institutions, Matterhorn provides a lecture capture platform for both research and production environments. Matterhorn is comprehensive and scalable, and includes components for the acquisition, processing, and playback of content. Matterhorn is licensed under the liberal Educational Community License (ECL 2.0), a flexible OSI-approved open source license, and the Opencast community is free for all institutions, corporations, or individuals to join.
Papers by Matjaz Rihtar