This multilingual resource contains corpora for 14 languages, gathered on the occasion of the 1.2 edition of the PARSEME Shared Task on semi-supervised Identification of Verbal MWEs (2020). These corpora were meant to serve as additional "raw" corpora, to help discover unseen verbal MWEs. The corpora are provided in CoNLL-U (https://universaldependencies.org/format.html) format. They contain morphosyntactic annotations (parts of speech, lemmas, morphological features, and syntactic dependencies). Depending on the language, the information comes from treebanks (mostly Universal Dependencies v2.x) or from automatic parsers trained on UD v2.x treebanks (e.g., UDPipe). VMWEs include idioms (let the cat out of the bag), light-verb constructions (make a decision), verb-particle constructions (give up), inherently reflexive verbs (help oneself), and multi-verb constructions (make do). For the 1.2 shared task edition, the data covers 14 languages, for which VMWEs were annotated...
Human-targeted metrics provide a compromise between human evaluation of machine translation, where high inter-annotator agreement is difficult to achieve, and fully automatic metrics, such as BLEU or TER, that lack the validity of human assessment. Human-targeted translation edit rate (HTER) is by far the most widely employed human-targeted metric in machine translation, commonly employed, for example, as a gold standard in evaluation of quality estimation. Original experiments justifying the design of HTER, as opposed to other possible formulations, were limited to a small sample of translations and a single language pair, however, and this motivates our re-evaluation of a range of human-targeted metrics on a substantially larger scale. Results show significantly stronger correlation with human judgment for HBLEU over HTER for two of the nine language pairs we include and no significant difference between correlations achieved by HTER and HBLEU for the remaining language pairs. Fin...
Tapadóir (from the Irish tapa ‘fast’ and the nominal suffix -óir) is a statistical machine translation (SMT) project, funded by the Irish government. This work was commissioned to help government translators meet the translation demands which have arisen from the Irish language’s status as an official EU and national language. The development of this system, which translates English into Irish (a morphologically rich, low-resourced minority language), has produced an interesting set of challenges. These challenges have inspired a creative response to the lack of data and NLP tools available for the Irish language and have also resulted in the development of new resources for the Irish linguistic and NLP community. We show that our SMT system out-performs Google Translate™ (a widely used general-domain SMT system) as a result of steps we have taken to tailor translation output to the user’s specific needs.
In this paper, we measure the effectiveness of using language standardisation, lemmatisation, and machine translation to improve full-text search results on dúchas.ie, the web interface to the Irish National Folklore Collection. Our focus is the Schools’ Collection, a scanned manuscript collection which is being transcribed by members of the public via a crowdsourcing initiative. We show that by applying these technologies to the manuscript page transcriptions, we obtain substantial improvements in search engine recall over a test set of actual user queries, with no appreciable drop in precision. Our results motivate the inclusion of this language technology in the search infrastructure of this folklore resource.
This paper presents a successful domain adaptation of a general neural machine translation (NMT) system using a bilingual corpus created with captions for images in Wikimedia Commons for the Spanish-Basque and English-Irish pairs.
TEANGA, the Journal of the Irish Association for Applied Linguistics, 2019
In this paper, we discuss the difficulties of building reliable machine translation systems for the English-Irish (EN-GA) language pair. In the context of limited datasets, we report on assessing the use of backtranslation as a method for creating artificial EN-GA data to increase training data for use in state-of-the-art data-driven translation systems. We compare our results to earlier work on EN-GA machine translation by Dowling et al. (2016, 2017, 2018). Although our own systems do not compare in quality with respect to traditionally reported BLEU metrics, we provide a linguistic analysis suggesting that future work with domain-specific data may prove more successful.
Irish and Scottish Gaelic are similar but distinct languages from the Celtic language family. Both languages are under-resourced in terms of machine translation (MT), with Irish being the better resourced. In this paper, we show how backtranslation can be used to harness the resources of these similar low-resourced languages and build a Scottish-Gaelic to English MT system with little or no high-quality bilingual data.
Because Irish has official status in both Ireland and the EU, there is a need for high-quality English-Irish (EN-GA) machine translation (MT) systems which are suitable for use in a professional translation environment. While we have seen recent research on improving both statistical MT and neural MT for the EN-GA pair, the results of such systems have always been reported using automatic evaluation metrics. This paper provides the first human evaluation study of EN-GA MT using professional translators and in-domain (public administration) data for a more accurate depiction of the translation quality available via MT.
In this paper, we provide a preliminary comparison of statistical machine translation (SMT) and neural machine translation (NMT) for English→Irish in the fixed domain of public administration. We discuss the challenges for SMT and NMT of a less-resourced language such as Irish, and show that while an out-of-the-box NMT system may not fare quite as well as our tailor-made domain-specific SMT system, the future may still be promising for EN→GA NMT.
This paper reports on the continued development of a domain-tailored English→Irish Statistical Machine Translation system currently in use by an in-house translation team of an Irish government department. We describe the new automatic post-editing module that has been developed to enhance the current system and reduce the post-editing required of translators.
Data sparsity is a common problem for machine translation of minority and less-resourced languages. While data collection for standard, grammatical text can be challenging enough, efforts for collection of parallel user-generated content can be even more challenging. In this paper we describe an approach to collecting English↔Irish translations of user-generated content (tweets) that overcomes some of these hurdles. We show how a crowd-sourced data collection campaign, which was tailored to our target audience (the Irish language community), proved successful in gathering data for a niche domain. We also discuss the reliability of crowd-sourcing English↔Irish tweet translations in terms of quality by reporting on a self-rating approach along with qualified reviewer ratings.
This paper presents a successful domain adaptation of a general neural machine translation (NMT) system using a bilingual corpus created with captions for images in Wikimedia Commons for the Spanish-Basque and English-Irish pairs.
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, 2020
Because Irish has official status in both Ireland and the EU, there is a need for high-quality English-Irish (EN-GA) machine translation (MT) systems which are suitable for use in a professional translation environment. While we have seen recent research on improving both statistical MT and neural MT for the EN-GA pair, the results of such systems have always been reported using automatic evaluation metrics. This paper provides the first human evaluation study of EN-GA MT using professional translators and in-domain (public administration) data for a more accurate depiction of the translation quality available via MT.
TEANGA, the Journal of the Irish Association for Applied Linguistics, 2019
In this paper, we discuss the difficulties of building reliable machine translation (MT) systems for the English-Irish (EN-GA) language pair. In the context of limited datasets, we report on assessing the use of backtranslation as a method for creating artificial EN-GA data to increase training data for use in state-of-the-art data-driven translation systems. We compare our results to our earlier work on EN-GA machine translation (Dowling et al. 2016; 2017; 2018). Although our own systems underperform with respect to traditionally reported automatic evaluation metrics, we provide a linguistic analysis suggesting that future work with domain-specific data may prove more successful.
Irish and Scottish Gaelic are similar but distinct languages from the Celtic language family. Both languages are under-resourced in terms of machine translation (MT), with Irish being the better resourced. In this paper, we show how back-translation can be used to harness the resources of these similar low-resourced languages and build a Scottish-Gaelic to English MT system with little or no high-quality bilingual data.
In this paper, we provide a preliminary comparison of statistical machine translation (SMT) and neural machine translation (NMT) for English→Irish in the fixed domain of public administration. We discuss the challenges for SMT and NMT of a less-resourced language such as Irish, and show that while an out-of-the-box NMT system may not fare quite as well as our tailor-made domain-specific SMT system, the future may still be promising for EN→GA NMT.
Data sparsity is a common problem for machine translation of minority and less-resourced languages. While data collection for standard, grammatical text can be challenging enough, efforts for collection of parallel user-generated content can be even more challenging. In this paper we describe an approach to collecting English↔Irish translations of user-generated content (tweets) that overcomes some of these hurdles. We show how a crowd-sourced data collection campaign, which was tailored to our target audience (the Irish language community), proved successful in gathering data for a niche domain. We also discuss the reliability of crowd-sourcing English↔Irish tweet translations in terms of quality by reporting on a self-rating approach along with qualified reviewer ratings.
English to Irish Machine Translation with Automatic Post-Editing. This paper reports on the continued development of a domain-tailored English→Irish Statistical Machine Translation system currently in use by an in-house translation team of an Irish government department. We describe the new automatic post-editing module that has been developed to enhance the current system and reduce the post-editing required of translators.
Tapadóir (from the Irish tapa 'fast' and the nominal suffix -óir) is a statistical machine translation (SMT) project, funded by the Irish government. This work was commissioned to help government translators meet the translation demands which have arisen from the Irish language's status as an official EU and national language. The development of this system, which translates English into Irish (a morphologically rich, low-resourced minority language), has produced an interesting set of challenges. These challenges have inspired a creative response to the lack of data and NLP tools available for the Irish language and have also resulted in the development of new resources for the Irish linguistic and NLP community. We show that our SMT system out-performs Google Translate™ (a widely used general-domain SMT system) as a result of steps we have taken to tailor translation output to the user's specific needs.
Papers by Meghan Dowling