CzeSL-GEC is a corpus containing sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech and Czech pupils with Romani background. To create this corpus, the unreleased CzeSL-man corpus (http://utkl.ff.cuni.cz/learncorp/) was used. All sentences in the corpus are word-tokenized.
Essays written by non-native learners of Czech, a part of AKCES/CLAC – Czech Language Acquisition Corpora. CzeSL-SGT stands for Czech as a Second Language with Spelling, Grammar and Tags. The corpus extends the “foreign” (ciz) part of AKCES 3 (CzeSL-plain) by texts collected in 2013. Original forms and automatic corrections are tagged, lemmatized and assigned error labels. Most texts have metadata attributes (30 items) about the author and the text. This release fixes a few minor bugs and a critical issue in Release 1: the native speakers of Ukrainian (s_L1:"uk") were wrongly labelled as speakers of "other European languages" (s_L1_group="IE") instead of speakers of a Slavic language (s_L1_group="S"). The file is now a regular XML document, with all annotation represented as XML attributes.
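The labelling issue described above can be checked mechanically once the annotation is stored in XML attributes. The following is a minimal sketch, not the corpus maintainers' tooling: the attribute names s_L1 and s_L1_group come from the description, while the element names (corpus, doc), the id attribute and the sample records are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names and ids are assumptions; only the
# attribute names s_L1 and s_L1_group are taken from the corpus description.
SAMPLE = """<corpus>
  <doc id="t1" s_L1="uk" s_L1_group="IE"/>
  <doc id="t2" s_L1="uk" s_L1_group="S"/>
  <doc id="t3" s_L1="ru" s_L1_group="S"/>
</corpus>"""


def is_mislabelled(element):
    """True if a Ukrainian speaker is not assigned to the Slavic group."""
    return element.get("s_L1") == "uk" and element.get("s_L1_group") != "S"


def mislabelled_ids(root):
    """Return the ids of all elements that show the Release 1 labelling issue."""
    return [el.get("id") for el in root.iter() if is_mislabelled(el)]


def fix_labels(root):
    """Set s_L1_group to "S" on mislabelled elements; return how many changed."""
    changed = 0
    for el in root.iter():
        if is_mislabelled(el):
            el.set("s_L1_group", "S")
            changed += 1
    return changed


root = ET.fromstring(SAMPLE)
print(mislabelled_ids(root))  # ids still carrying the wrong group
print(fix_labels(root))       # number of corrected elements
```

Because the check only reads attributes, it should carry over to a different element layout with little change, provided the attribute names match.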
AKCES-GEC is a grammar error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in M2 format. Note that, in comparison to the CzeSL-GEC dataset, this dataset contains separated edits together with their type annotations in M2 format and also has twice as many sentences. If you use this dataset, please use the following citation:
@article{naplava2019wnut,
  title={Grammatical Error Correction in Low-Resource Scenarios},
  author={N{\'a}plava, Jakub and Straka, Milan},
  journal={arXiv preprint arXiv:1910.00353},
  year={2019}
}
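For readers unfamiliar with M2, the sketch below shows how such files are commonly read: each sentence block has one "S" line with the tokenized source and zero or more "A" lines of the form "A start end|||type|||correction|||...|||annotator". This is a minimal illustration under that assumption, not the dataset's official loader. The sample sentence and the error-type label HYPOTHETICAL-TYPE are invented for the example and are not taken from AKCES-GEC.

```python
from typing import List, NamedTuple, Tuple


class Edit(NamedTuple):
    start: int        # index of the first source token covered by the edit
    end: int          # index one past the last covered token
    error_type: str   # type label carried by the annotation line
    correction: str   # replacement text (empty for a deletion)
    annotator: int    # annotator id from the last field


def parse_m2(text: str) -> List[Tuple[List[str], List[Edit]]]:
    """Parse M2 text into (source tokens, edits) pairs, one per sentence block."""
    sentences = []
    for block in text.strip().split("\n\n"):
        tokens: List[str] = []
        edits: List[Edit] = []
        for line in block.splitlines():
            if line.startswith("S "):
                tokens = line[2:].split()
            elif line.startswith("A "):
                fields = line[2:].split("|||")
                start, end = map(int, fields[0].split())
                edits.append(
                    Edit(start, end, fields[1], fields[2], int(fields[-1]))
                )
        sentences.append((tokens, edits))
    return sentences


def apply_edits(tokens: List[str], edits: List[Edit], annotator: int = 0) -> List[str]:
    """Apply one annotator's edits to the source tokens."""
    corrected = list(tokens)
    chosen = [e for e in edits if e.annotator == annotator and e.start >= 0]
    # Work from right to left so that earlier token offsets stay valid.
    for e in sorted(chosen, key=lambda e: (e.start, e.end), reverse=True):
        corrected[e.start:e.end] = e.correction.split()
    return corrected


# Hypothetical sample block; the type label is a placeholder.
SAMPLE = (
    "S přišli pět studentů\n"
    "A 0 1|||HYPOTHETICAL-TYPE|||přišlo|||REQUIRED|||-NONE-|||0\n"
)

tokens, edits = parse_m2(SAMPLE)[0]
print(" ".join(apply_edits(tokens, edits)))
```

Edits with a negative start offset, which M2 files conventionally use to mark sentences without corrections, are skipped so that they never alter the token list.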
WSD: word sense disambiguation
XML: Extensible Markup Language

'I'm thinking about you' (DGD_L5_143 ru A1)

(4) S: *přišli pět studentů
       came.*pl.*ma five students.gen
    T: přišlo pět studentů
       came.sg.neut five students.gen
    'five students came'

Other deviations include non-standard word order due to an inappropriate topic-focus articulation (information structuring), or due to a misplaced clitic, such as in (5), where jsem 'am' and se (a reflexive particle) are both 2nd position clitics and should follow the first constituent během studování na univerzitě 'during university studies' in that order.

(5) S: během studování na univerzitě *se seznámil *jsem s Evou
       during studying on university refl met aux.1sg with Eva
    T: během studování na univerzitě jsem se seznámil s Evou
       during studying on university aux.1sg refl met with Eva
    'during my university studies I met Eva' (BLAH_DZ_001 ky B2)

Similarly, reflexive pronouns are often under-used, as in (6), where the possessive moji 'my' should be replaced by the reflexive possessive svoji.

(6) S: miluji love.1sg

Footnote 14: Example (8) is from SKRIPT 2015, a corpus of young native Czech learners. The text ID is followed by the code for Czech (cs) and the age of the author.

Footnote 10: Both reasons were also behind the decision to use a relatively small tagset in the manual tiered annotation of the CzeSL corpus. This CzeSL tagset consists of 26 tags.
A specific language as used by different speakers and in different situations has a number of more or less distant varieties. Extending the notion of non-standard language to varieties that do not fit an explicitly or implicitly assumed norm or pattern, we look for methods and tools that could be applied to such texts. The needs start from the theoretical side: categories usable for the analysis of non-standard language are not readily available. However, it is not easy to find methods and tools required for its detection and diagnostics either. A general discussion of issues related to non-standard language is followed by two case studies. The first study presents a taxonomy of morphosyntactic categories as an attempt to analyse non-standard forms produced by non-native learners of Czech. The second study focusses on the role of a rule-based grammar and lexicon in the process of building and using a parsebank.
The issue of incompatible morphosyntactic tagsets in multilingual corpora could be solved by an abstract hierarchy of concepts, mapped to language-specific tagsets. The hierarchy supports the user and tools by resolving categories that do not match the relevant tagset in queries, by providing links between language-specific tagsets, and by displaying responses using a preferred tagset. The hierarchy, built using the methods of Formal Concept Analysis, can also help to refine morphosyntactic annotation in one language by using word-to-word alignments to parallel texts tagged by a different tagset.
* This work was supported by grant no. MSM0021620823 of the Czech Ministry of Education, Youth and Sports, as a contribution to the parallel corpus project InterCorp.
1 The parallel corpus InterCorp currently offers on-line concordances in 23 languages, 14 of them tagged with different morphosyntactic tagsets. The corpus can be queried at korpus.cz/Park after registration at http://ucnk.ff.cuni.cz/english/dohody.php. For more information about the project see http://korpus.cz/intercorp/.
We present the architecture and the current state of InterCorp, a multilingual parallel corpus centered around Czech, intended primarily for human users and consisting of written texts with a focus on fiction. Following an outline of its recent development and a comparison with some other multilingual parallel corpora, we give an overview of the data collection procedure that covers text selection criteria, data format, conversion, alignment, lemmatization and tagging. Finally, we discuss challenges and prospects of the project.
After a brief account of a parallel corpus project involving many diverse languages and a summary of two previous evaluations of sentential alignment tools, results are presented from tests of three automatic aligners on English-Czech and French-Czech literary and legal texts, clean and noisy. The results confirm that an alignment tool may perform well on one type of text and fail on another, and indicate that near-perfect alignment is possible when tools providing high precision are combined with manual checking, where the proofreader can focus only on those parts of the text that were either not aligned at all or aligned less reliably. Further gains in precision are shown to be feasible when alignments proposed by multiple aligners are intersected.
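The idea of intersecting the output of several aligners can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper: each aligner's output is represented as a set of links, a link pairs a tuple of source sentence indices with a tuple of target sentence indices, and the three sample outputs are invented.

```python
from typing import Iterable, Set, Tuple

# A link pairs source sentence indices with target sentence indices,
# e.g. ((1,), (1, 2)) for a 1:2 alignment. This representation is an assumption.
Link = Tuple[Tuple[int, ...], Tuple[int, ...]]


def intersect_alignments(proposals: Iterable[Set[Link]]) -> Set[Link]:
    """Keep only the links that every aligner proposed."""
    proposals = list(proposals)
    if not proposals:
        return set()
    return set.intersection(*proposals)


def needs_review(proposals: Iterable[Set[Link]]) -> Set[Link]:
    """Links proposed by at least one aligner but not confirmed by all."""
    proposals = list(proposals)
    return set().union(*proposals) - intersect_alignments(proposals)


# Invented outputs of three hypothetical aligners for the same text pair.
aligner_a = {((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))}
aligner_b = {((0,), (0,)), ((1,), (1,)), ((2,), (3,))}
aligner_c = {((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))}

agreed = intersect_alignments([aligner_a, aligner_b, aligner_c])
disputed = needs_review([aligner_a, aligner_b, aligner_c])
print(sorted(agreed))
print(sorted(disputed))
```

The intersection trades recall for precision: links confirmed by all aligners can be accepted with more confidence, while the disputed remainder marks the passages a proofreader would check by hand.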
by the governments of Poland and the Czech Republic within the KONTAKT programme (Czech no. 23/2004, Polish no. 25/2004/CZ).

'I ordered Honza to support Marie'

f. Kazałem Janowi wspierać Marię. (OC; P.)
   ordered-1.SG Jan-DAT support-INF Maria-ACC
   'I ordered Jan to support Maria'

Two of the most robust cross-linguistic tests distinguishing raising and control involve passivisation (e.g., Pollard and Sag 1994 and, for Czech, Skoumalová 2002) and idiom chunks (e.g., Postal 1974 and, for Polish, Zabrocki 1981): i) when the lower verb is in the passive, the meaning of the sentence is the same as in the active voice in the case of raising constructions, but not in the case of control constructions, e.g., (3)-(4), and ii) chunks of sentential idioms can be raised arguments, but they cannot be controllers, e.g., (5)-(6).

(3) a. Mary seems to be supported by John. (SR; E.; ≈(1a))
    b. I expect Mary to be supported by John. (OR; E.; ≈(1d))
(4) a. Mary tries to be supported by John.
The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked levels to cope with a wide range of error types present in the input. Each level corrects different types of errors; links between the levels allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a doubly-annotated sample of approx. 10,000 words with fair inter-annotator agreement results. We also explore the option of applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even substitute manual annotation.
The paper describes a corpus of texts produced by non-native speakers of Czech. We discuss its annotation scheme, consisting of three interlinked tiers, designed to handle a wide range of error types present in the input. Each tier corrects different types of errors; links between the tiers allow capturing errors in word order and complex discontinuous expressions. Errors are not only corrected, but also classified. The annotation scheme is tested on a data set including approx. 175,000 words with fair inter-annotator agreement results. We also explore the possibility of applying automated linguistic annotation tools (taggers, spell checkers and grammar checkers) to the learner text to support or even substitute manual annotation.
Keywords: learner corpus • error annotation • second language acquisition • Czech
The corpus is one of the tasks of the project Innovation of Education in the Field of Czech as a Second Language (project no. CZ.1.07/2.2.00/07.0259), a part of the operational programme Education for Competitiveness, funded by the European Structural Funds (ESF) and the Czech government. The annotation tool was also partially funded by grant no. P406/10/P328 of the Grant Agency of the Czech Republic.
Journal for the theory of language and language cultivation, Mar 1, 2006
In Czech, as in many other languages, second person plural forms are used as a means of formal address. As in French, and unlike in Russian, a finite verb predicated of the pronoun vy in this usage agrees with its subject in plural, as expected, while participles and predicative adjectives are singular. It is argued that this pattern of hybrid agreement, present both within analytical verb forms and syntactic constructions and different from regular plural and singular, justifies the introduction of the morphological category of honorific in Czech. As a result, an account of analytical verb forms cannot be complete without providing an explicit description of the forms of second person formal address. Existing Czech grammar reference books are not quite satisfactory in this respect; furthermore, they implicitly presuppose the application of rules concerning grammatical gender. We offer a solution to the issue of the adequate presentation of analytical verbal morphology paradigms, including formal address, along the lines of a Polish verbal morphology handbook.
We present an account of analytic verb forms in a treebank of Czech texts. According to the Czech linguistic tradition, description of periphrastic constructions is a task for morphology. On the other hand, their components cannot be analyzed separately from syntax. We show how the paradigmatic and syntagmatic views can be represented within a single framework.
Papers by Alexandr Rosen