This work addresses the need to aid Machine Translation (MT) development cycles with a complete workflow of MT evaluation methods. Our aim is to assess, compare and improve MT system variants. We report on novel tools and practices implementing various measures, developed to support a principled and informed approach to MT development. Our toolkit for automatic evaluation enables quick and detailed comparison of MT system variants through automatic metrics and n-gram feedback, along with manual evaluation via edit distance, error annotation and task-based feedback.
The tool described in this article has been designed to help MT developers by providing a web-based graphical user interface that allows them to systematically compare and evaluate various MT engines/experiments through comparative analysis based on automatic measures and statistics. The evaluation panel provides graphs, tests of statistical significance and n-gram statistics. We also present a demo server at http://wmt.ufal.cz with WMT14 and WMT15 translations.
CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the number of non-matching sentence pairs. CzEng 1.0 is automatically aligned at the level of sentences as well as words. We provide not only the plain-text representation, but also automatic morphological tags, surface-syntactic as well as deep-syntactic dependency parse trees, and automatic co-reference links in both English and Czech. This paper describes key properties of the released resource, including the distribution of text domains, the corpus data formats, and a toolkit for handling the provided rich annotation. We also summarize the procedure of the rich annotation (including co-reference resolution) and of the automatic filtering. Finally, we provide some suggestions on exploiting such an automatically annotated sentence-parallel corpus.
The Maximum Entropy Principle has been used successfully in various NLP tasks. In this paper we propose a forward translation model consisting of a set of maximum entropy classifiers: a separate classifier is trained for each (sufficiently frequent) source-side lemma. In this way the estimates of translation probabilities can be sensitive to a large number of features.
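A minimal pure-Python sketch of the per-lemma classifier idea (not the paper's implementation; the lemma "bank", its Czech translation labels, and the context features are invented for illustration):

```python
import math
from collections import defaultdict

def train_maxent(examples, targets, lr=0.5, epochs=200):
    """Train a toy maximum entropy (multinomial logistic) classifier.
    examples: list of active-feature sets; targets: parallel list of labels."""
    labels = sorted(set(targets))
    w = defaultdict(float)  # weights indexed by (label, feature)
    for _ in range(epochs):
        for feats, gold in zip(examples, targets):
            # p(label | feats) is proportional to exp(sum of active weights)
            scores = {y: math.exp(sum(w[(y, f)] for f in feats)) for y in labels}
            z = sum(scores.values())
            for y in labels:
                p = scores[y] / z
                for f in feats:
                    w[(y, f)] += lr * ((y == gold) - p)  # gradient step
    return labels, w

def predict(model, feats):
    labels, w = model
    return max(labels, key=lambda y: sum(w[(y, f)] for f in feats))

# One classifier per (sufficiently frequent) source lemma, as in the abstract.
# The training data and features below are invented.
data = {
    "bank": ([{"prev=river"}, {"prev=money"}, {"prev=river"}, {"prev=money"}],
             ["břeh", "banka", "břeh", "banka"]),
}
classifiers = {lemma: train_maxent(x, y) for lemma, (x, y) in data.items()}
print(predict(classifiers["bank"], {"prev=river"}))   # → břeh
```

The point of the design is that each lemma's classifier can condition on arbitrary, possibly overlapping context features, which a plain relative-frequency translation table cannot.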
Meeting of the Association for Computational Linguistics, 2009
We would like to draw attention to Hidden Markov Tree Models (HMTM), which are to our knowledge still unexploited in the field of Computational Linguistics, in spite of the great success of Hidden Markov (Chain) Models. In dependency trees, the independence assumptions made by HMTM correspond to the intuition of linguistic dependency. We therefore suggest using HMTM and a tree-modified Viterbi algorithm for tasks that can be interpreted as labeling the nodes of dependency trees. In particular, we show that the transfer phase in a Machine Translation system based on tectogrammatical dependency trees can be seen as a task suitable for HMTM. When using the HMTM approach for English-Czech translation, we reach a moderate improvement over the baseline. * The work on this project was supported by the grants MSM 0021620838, GAAV ČR 1ET101120503, and MŠMT ČR LC536. We thank Jan Hajič and three anonymous reviewers for many useful comments.
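The node-labeling task the abstract describes can be illustrated with a toy tree-modified Viterbi pass over a dependency tree; the tree, states, and all probabilities below are invented for illustration, not taken from the paper:

```python
import math

# Toy HMTM decoding: find the best hidden labeling of dependency-tree nodes
# given observed words, parent-to-child transitions, and emissions.
tree = {"root": ["child1", "child2"], "child1": [], "child2": []}
obs = {"root": "saw", "child1": "John", "child2": "dog"}
states = ["N", "V"]
emit = {("N", "John"): 0.6, ("V", "John"): 0.1,
        ("N", "dog"): 0.7,  ("V", "dog"): 0.1,
        ("N", "saw"): 0.2,  ("V", "saw"): 0.6}
trans = {("V", "N"): 0.7, ("V", "V"): 0.3,   # P(child state | parent state)
         ("N", "N"): 0.5, ("N", "V"): 0.5}
prior = {"N": 0.5, "V": 0.5}                 # P(root state)

def viterbi_tree(node):
    """For each state s, return (best subtree log-score given `node` is in s,
    the corresponding best label assignment for the whole subtree)."""
    table = {}
    for s in states:
        score = math.log(emit[(s, obs[node])])
        assign = {node: s}
        for child in tree[node]:
            child_tab = viterbi_tree(child)
            best = max(states,
                       key=lambda t: math.log(trans[(s, t)]) + child_tab[t][0])
            score += math.log(trans[(s, best)]) + child_tab[best][0]
            assign.update(child_tab[best][1])
        table[s] = (score, assign)
    return table

tab = viterbi_tree("root")
root_state = max(states, key=lambda s: tab[s][0] + math.log(prior[s]))
print(tab[root_state][1])  # {'root': 'V', 'child1': 'N', 'child2': 'N'}
```

Unlike chain Viterbi, the recursion runs bottom-up over children instead of left-to-right over positions; each node's best score combines its emission with the best transition into every child subtree.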
In the present paper we describe TectoMT, a multi-purpose open-source NLP framework. It allows for fast and efficient development of NLP applications by exploiting a wide range of software modules already integrated in TectoMT, such as tools for sentence segmentation, tokenization, morphological analysis, POS tagging, shallow and deep syntactic parsing, named entity recognition, anaphora resolution, tree-to-tree translation, natural language generation, word-level alignment of parallel corpora, and other tasks. One of the most complex applications of TectoMT is the English-Czech machine translation system with transfer on the deep syntactic (tectogrammatical) layer. Several modules are also available for other languages (German, Russian, Arabic). Where possible, modules are implemented in a language-independent way, so they can be reused in many applications.
Language models (LMs) are essential components of many applications such as speech recognition or machine translation. LMs factorize the probability of a string of words into a product of P(w_i | h_i), where h_i is the context (history) of word w_i. Most LMs use the preceding words as the context. This paper presents two alternative approaches: post-ngram LMs (which use the following words as context) and dependency LMs (which exploit the dependency structure of a sentence and can use, e.g., the governing word as context). Dependency LMs could be useful whenever the topology of a dependency tree is available but its lexical labels are unknown, e.g. in tree-to-tree machine translation. In comparison with a baseline interpolated trigram LM, both approaches achieve significantly lower perplexity for all seven tested languages (Arabic, Catalan, Czech, English, Hungarian, Italian, Turkish).
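The chain factorization and the dependency-LM alternative can be sketched with maximum-likelihood counts over a tiny invented corpus (a toy illustration, not the paper's models, which use longer histories and smoothing):

```python
from collections import Counter

# Chain factorization P(w_1..w_n) = product of P(w_i | h_i), here with a
# bigram history h_i = w_{i-1}, estimated by maximum likelihood.
corpus = [["<s>", "the", "dog", "barks", "</s>"],
          ["<s>", "the", "cat", "sleeps", "</s>"]]
bigrams = Counter((s[i - 1], s[i]) for s in corpus for i in range(1, len(s)))
unigrams = Counter(w for s in corpus for w in s[:-1])

def bigram_prob(sentence):
    p = 1.0
    for i in range(1, len(sentence)):
        p *= bigrams[(sentence[i - 1], sentence[i])] / unigrams[sentence[i - 1]]
    return p

# Dependency-LM variant: condition each word on its governing word instead of
# its predecessor (the tree topology is given, as in tree-to-tree MT).
heads = {"the": "dog", "dog": "barks", "barks": "<root>"}  # child -> governor
dep_counts = Counter([("barks", "dog"), ("dog", "the"), ("<root>", "barks"),
                      ("sleeps", "cat"), ("cat", "the"), ("<root>", "sleeps")])
gov_counts = Counter(g for g, _ in dep_counts.elements())

def dep_prob(words):
    p = 1.0
    for w in words:
        p *= dep_counts[(heads[w], w)] / gov_counts[heads[w]]
    return p

print(bigram_prob(["<s>", "the", "dog", "barks", "</s>"]))  # → 0.5
print(dep_prob(["the", "dog", "barks"]))                    # → 0.5
```

The two models factorize the same sentence over different context structures: a left-to-right chain versus governor-dependent edges of the dependency tree.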
Paratactic syntactic structures are notoriously difficult to represent in dependency formalisms. This has painful consequences, such as a high frequency of parsing errors related to coordination. In other words, coordination remains an open problem in the dependency analysis of natural languages. This paper tries to shed some light on this area by offering a systematizing view of the various formal means developed for encoding coordination structures. We introduce a novel taxonomy of such approaches and apply it to treebanks across a typologically diverse range of 26 languages. In addition, we present empirical observations on the convertibility between selected styles of representation.
The Prague Bulletin of Mathematical Linguistics, 2009
The present paper summarizes our recent results concerning English-Czech Machine Translation implemented in the TectoMT framework. The system uses tectogrammatical trees as the transfer medium. A detailed analysis of errors made by the previous version of the system (considered as the baseline) is presented first. Then several improvements of the system are described that led to better translation quality in terms of BLEU and NIST scores. The biggest performance gain comes from applying Hidden Markov Tree Models in the transfer phase, which is a novel technique in the field of Machine Translation.
We investigate machine translation (MT) of user search queries in the context of cross-lingual information retrieval (IR) in the medical domain. The main focus is on techniques to adapt MT to increase translation quality; however, we also explore MT adaptation to improve the effectiveness of cross-lingual IR. Our MT system is Moses, a state-of-the-art phrase-based statistical machine translation system. The IR system is based on the BM25 retrieval model implemented in the Lucene search engine. The MT techniques employed in this work include in-domain training and tuning, intelligent training data selection, optimization of phrase table configuration, compound splitting, and exploiting synonyms as translation variants. The IR methods include morphological normalization and using multiple translation variants for query expansion. The experiments are performed and thoroughly evaluated on three language pairs: Czech-English, German-English, and French-English. MT quality is evaluated on data sets created within the Khresmoi project, and IR effectiveness is tested on the CLEF eHealth 2013 data sets. The search query translation results achieved in our experiments are outstanding: our systems outperform not only our strong baselines, but also Google Translate and Microsoft Bing Translator in a direct comparison carried out on all the language pairs. The baseline BLEU scores increased from 26.59 to 41.45 for Czech-English, from 23.03 to 40.82 for German-English, and from 32.67 to 40.82 for French-English, a 55% improvement on average. In terms of IR performance on this particular test collection, a significant improvement over the baseline is achieved only for French-English. For Czech-English and German-English, the increased MT quality does not lead to better IR results.
Most of the MT techniques employed in our experiments improve MT of medical search queries. In particular, intelligent training data selection proves very successful for domain adaptation of MT. Some improvements are also obtained from German compound splitting on the source-language side. Translation quality, however, does not appear to correlate with IR performance: better translation does not necessarily yield better retrieval. We discuss in detail the contributions of the individual techniques and state-of-the-art features, and provide future research directions.
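For reference, the BM25 retrieval model mentioned above can be sketched in a few lines. This is one common variant of the Okapi BM25 formula with default parameters, on invented medical-query data; it is not the Lucene implementation the paper used:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with a
    common BM25 variant (idf with +1 inside the log stays non-negative)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)  # length norm
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

# Invented toy documents for a medical query.
docs = [["aspirin", "headache", "dose"],
        ["flu", "symptoms", "fever"],
        ["headache", "causes", "treatment", "headache"]]
print(bm25_scores(["headache"], docs))
```

Here the third document scores highest: it matches the query term twice, and the term-frequency saturation governed by k1 and the length normalization governed by b still leave it ahead of the shorter single-match document.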
One of the most notable recent improvements of the TectoMT English-to-Czech translation is a systematic and theoretically supported revision of formemes: the annotation of morpho-syntactic features of content words in deep dependency syntactic structures based on the Prague tectogrammatics theory. Our modifications aim at reducing data sparsity, increasing consistency across languages, and widening the applicability of this markup. Formemes can be used not only in MT, but also in various other NLP tasks.
Papers by Martin Popel