
Improving SMT by learning translation direction


Cyril Goutte, David Kurokawa, Pierre Isabelle
Interactive Language Technologies group, Institute for Information Technology, National Research Council
SMART workshop, Barcelona, 2009

Motivation

We address two questions:
1. Is there a difference between original and (human-)translated text, and can we detect it reliably?
2. If so, can we use that difference to improve machine translation quality?

Our answers:
1. Yes: on the Canadian Hansard, we reach over 90% detection accuracy.
2. Yes: on French-English, we obtain up to a 0.6 BLEU point increase.

Problem setting

Translations often have a "feel" of the original language: translationese. If translationese is real, it may be possible to detect it!

Earlier studies:
◮ Baroni & Bernardini (2006): detect original vs. translation in a monolingual Italian corpus, with accuracy up to 87%.
◮ van Halteren (2008): detect the source language in a multi-parallel corpus and identify source-language markers.
Both show that various aspects of translationese are detectable.

We experiment on a large bilingual corpus (the Hansard) and investigate how detecting the translation direction may affect machine translation quality.

Index

1. Motivation and setting
2. Data
3. Detecting Translation Direction
4. Exploiting Translation Direction in SMT
5. Discussion

Data: The Hansard corpus

Bilingual (En-Fr) transcripts of the sessions of the Canadian Parliament, covering most of the 35th to 39th Parliaments (1996-2007).
1. Tagged with information on the original language (French or English).
2. High-quality translation: reference material in Canada.
3. Large amount of data: 4.5M sentences, 165M words.

       words (fr)   words (en)   sentences    blocks
  fo      14,648K      13,002K     902,349    40,538
  eo      72,054K      64,899K   3,668,389    42,750
  mx      86,702K      77,901K   4,570,738    83,288

Data: The Hansard corpus (II)

Corpus issues:
◮ Slightly inconsistent tagging, e.g. both sides claiming to be the original: this puts overall tagging reliability into question.
◮ Missing text or alignment, e.g. valid English with no translation: seems to be a retrieval issue.
◮ Imbalance at the word/sentence level: 80% of the text is originally English.
◮ There may be lexical/contextual hints: Quebec MPs tend to speak French, and western Canada MPs are almost all anglophone.

Corpus (pre)processing

◮ Tokenized (NRC in-house tokenizer).
◮ Lowercased.
◮ Sentence-aligned (NRC implementation of Gale & Church, 1991).

We consider two levels of granularity:
◮ Sentence-level: individual sentences;
◮ Block-level: maximal consecutive sequences of sentences with the same original language.
Block-level data is balanced; sentence-level data is imbalanced 4:1 (eo:fo).

POS-tagged using the freely available TreeTagger (Schmid, 1994).
⇒ 4 representations: 1) word, 2) lemma, 3) POS and 4) mixed n-grams.
"Mixed": POS tags for content words, surface forms for grammatical words.

Detecting translation direction

Support Vector Machines trained with T. Joachims' SVM-Perf. We test various conditions:
1. Block-level (83K examples) or sentence-level (1.8M examples, balanced).
2. Features: word, lemma, POS or mixed n-gram frequencies.
3. N-gram length: 1-3 for word/lemma, 1-5 for POS/mixed.
4. Monolingual (English or French) or bilingual text.

Sentence-level: we test fewer feature/n-gram combinations (because of computational cost). All results are obtained from 10-fold cross-validation and reported as F-score (≈ accuracy in this case).

Block-level Performance

(Figure: F-score vs. n-gram size for word, lemma, mixed, POS and tf-idf features, English.)
◮ tf-idf weighting: a small but consistent improvement.
◮ Similar performance on French; +1-2% for bilingual text; same general shape.
◮ Optimal settings: word/lemma bigrams, POS/mixed trigrams.
◮ Word bigrams: F = 90%; mixed trigrams: F = 86%.

Influence of block length

(Figure: accuracy vs. block length in words, equal-size bins, for 1- to 3-grams of word, lemma, POS and mixed features.)
◮ Up to 99% accuracy for large blocks.
◮ Large range in block length (3 to 73,887 words!).
◮ Much better than random even for short blocks.
◮ word > lemma > mixed.

Sentence-level Performance

(Figure: F-score vs. n-gram size, French and English, sentence level.)
◮ 1.8M examples (balanced).
◮ Best F = 77%.
◮ Some missing conditions (computational cost).

Analysis of most important bigrams

Most important bigrams in English (eo = original, fo = translation); "most important" = relatively more frequent.
◮ "A couple of": no equivalent in French.
◮ Canadian Alliance, CPC, NDP: mostly western, mostly anglophone parties.
◮ BQ (Bloc Québécois): French-speaking.
◮ The French translation overuses articles and prepositions (because French does), and "Mr. Speaker"!

eo: couple of; alliance ); a couple; do that; , canadian; the record; forward to; , cpc; cpc ); of us; this country; this particular; many of; canadian alliance; across the; out there; the things; for that
fo: of the; mr .; , the; in the; to the; , i; . the; ) :; speaker ,; . i; : mr; , and; . speaker; bq ); , bq; hon .; that the; on the
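The detection pipeline (word n-gram features feeding a linear classifier) can be sketched in a few lines of Python. This is a toy stand-in, not the paper's setup: a plain perceptron replaces SVM-Perf, and four hand-made sentences echoing the discriminative bigrams above replace the Hansard corpus.

```python
# Sketch: original-vs-translation detection as a linear classifier over
# word n-gram counts. Stand-ins for the paper's components: a plain
# perceptron instead of SVM-Perf, toy sentences instead of the Hansard.
from collections import Counter

def ngrams(tokens, n_max=2):
    """All 1..n_max-grams of a pre-tokenized sentence, with counts."""
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats

def train_perceptron(data, epochs=10):
    """data: list of (tokens, label) with label +1 (eo) or -1 (fo)."""
    w = Counter()
    for _ in range(epochs):
        for tokens, y in data:
            feats = ngrams(tokens)
            score = sum(w[f] * c for f, c in feats.items())
            if y * score <= 0:              # mistake-driven update
                for f, c in feats.items():
                    w[f] += y * c
    return w

def predict(w, tokens):
    score = sum(w[f] * c for f, c in ngrams(tokens).items())
    return "eo" if score > 0 else "fo"

# Toy data echoing the discriminative bigrams found on the Hansard:
train = [
    ("a couple of members of the canadian alliance spoke".split(), +1),
    ("many of us in this country want to do that".split(), +1),
    ("mr . speaker , the hon . member for the bq spoke".split(), -1),
    ("mr . speaker , i thank the hon . member".split(), -1),
]
w = train_perceptron(train)
print(predict(w, "a couple of us want to do that".split()))      # → eo
print(predict(w, "mr . speaker , the hon . member".split()))     # → fo
```

The real experiments use the same idea at scale: sparse n-gram count (or tf-idf) vectors over millions of sentences, with an SVM in place of the perceptron.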
Impact on Statistical Machine Translation

Typical SMT system training:
◮ Gather as many English-French aligned sentences as possible.
◮ Preprocess and split the data.
◮ Estimate parameters in either direction (en→fr and fr→en).
◮ The original translation direction is not considered at all!
⇒ We may end up using French originals and their English translations to train an en→fr system ("reverse" translation?).

We know SMT is very sensitive to genre/topic... Does the difference between original and translated text matter? If so, by how much?

We analyze the impact of translation direction on MT by investigating:
1. Do we get better performance by sending original text to an MT system trained only on original text?
2. Can we detect the translation direction and send each text to the "right" MT system?

(Diagram: English input → classifier → original (eo) en→fr system or translated (fo) en→fr system → French output.)

Impact of Original Language

Systems trained on eo, fo or mx data, tested on the eo or fo part of the test set, or on all of it (mx). BLEU scores:

           mx test set        fo test set        eo test set
  Train   fr→en   en→fr     fr→en   en→fr      fr→en   en→fr
  mx       36.2    37.1      36.1    37.3       36.1    36.9
  fo       31.2    30.8      36.2    36.5       30.5    30.1
  eo       36.6    37.8      33.7    36.0       36.8    38.0

◮ The eo system does (much) better on the eo test set, with 80% of the training data.
◮ The eo system also does better on mx data (the test set is 88% eo, vs. 80% in training).
◮ The fo system does much worse on mx and eo data, but about the same as the mx system on fo data, with only 20% of the training data!
⇒ Idea: detect the source language with the classifier, then use the right MT system ("mixture of experts").
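The "mixture of experts" routing can be sketched as follows. The `detect_origin` and `translate_*` functions are illustrative stubs, not the paper's components: in the experiments the router is the SVM classifier and the experts are full en→fr SMT systems trained on eo and fo data respectively.

```python
# Sketch of mixture-of-experts routing: classify the input's original
# language, then send it to the MT system trained on that kind of data.
# Detector and translators are toy stubs standing in for the SVM and
# the eo-/fo-trained SMT systems.

def detect_origin(english_text):
    """Stand-in for the SVM detector: 'eo' (original English) or
    'fo' (translated from French). Here: a single crude lexical cue."""
    return "fo" if "mr . speaker" in english_text else "eo"

def translate_eo(text):  # stub for the expert trained on eo data
    return f"[en->fr, eo-trained] {text}"

def translate_fo(text):  # stub for the expert trained on fo data
    return f"[en->fr, fo-trained] {text}"

EXPERTS = {"eo": translate_eo, "fo": translate_fo}

def route_and_translate(english_text):
    """Route each input to the expert matching its detected origin."""
    return EXPERTS[detect_origin(english_text)](english_text)

print(route_and_translate("a couple of members rose to speak"))
print(route_and_translate("mr . speaker , i thank the hon . member"))
```

The routing itself is trivial; the gains come entirely from how well the detector matches each input to the expert trained on the same kind of text.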
Impact of Automatic Detection

The top part is essentially identical to the previous table. "ref" routes using the reference source-language information; "SVM" routes using the SVM prediction. BLEU scores on the full test set:

           fr→en    en→fr
  mx       36.86    37.78
  fo       32.00    31.85
  eo       37.20    38.23
  SVM      37.44    38.35
  ref      37.46    38.35

◮ ref: using the reference source-language information gains a consistent ~0.6 BLEU points over mx.
◮ SVM: using the SVM prediction, the gain is nearly identical.
◮ The gain over the eo system is smaller (because the test set is 88% eo data).
⇒ Detecting original vs. translation provides a smallish but consistent improvement in translation performance.
⇒ It is not worth looking for a better classifier (for this task).
Other uses of translation direction detection?

Discussion

How general are these results? Will they generalize to:
1. Detection on other English-French data?
2. Training a classifier on one corpus and applying it to another?
3. Another language pair?
4. Other settings, e.g. sources vs. translations from several different languages?

Mixture of experts: we could use additional input-specific information.
◮ Mother tongue?
◮ Gender?

To Conclude...

Can we tell the difference between an original and a translated document? → Yes.
To what level of accuracy? → Up to 90+% on blocks, 77% on single sentences.
Is translation direction useful for machine translation? → Yes!
Is the classification performance sufficient? → Indistinguishable from using the reference labels...