CobaltF: A Fluent Metric for MT Evaluation
Marina Fomicheva, Núria Bel
IULA, Universitat Pompeu Fabra
[email protected]
Lucia Specia
University of Sheffield, UK
[email protected]
Iria da Cunha
Univ. Nacional de Educación a Distancia
[email protected]
Anton Malinovskiy
Nuroa Internet S. L.
[email protected]
Abstract
The vast majority of Machine Translation (MT) evaluation approaches are based
on the idea that the closer the MT output is to a human reference translation,
the higher its quality. While translation
quality has two important aspects, adequacy and fluency, the existing reference-based metrics are largely focused on the
former. In this work we combine our
metric UPF-Cobalt, originally presented at
the WMT15 Metrics Task, with a number
of features intended to capture translation
fluency. Experiments show that the integration of fluency-oriented features significantly improves the results, rivalling the
best-performing evaluation metrics on the
WMT15 data.
1 Introduction
Automatic evaluation plays an instrumental role
in the development of Machine Translation (MT)
systems. It is aimed at providing fast, inexpensive,
and objective numerical measurements of translation quality. As a cost-effective alternative to
manual evaluation, the main concern of automatic
evaluation metrics is to accurately approximate
human judgments.
The vast majority of evaluation metrics are
based on the idea that the closer the MT output
is to a human reference translation, the higher its
quality. The evaluation task, therefore, is typically
approached by measuring some kind of similarity between the MT (also called candidate translation) and a reference translation. The most
widely used evaluation metrics, such as BLEU
(Papineni et al., 2002), follow a simple strategy
of counting the number of matching words or
word sequences in the candidate and reference
translations. Despite its wide use and practical utility, automatic evaluation based on a straightforward candidate-reference comparison has long been criticized for its low correlation with human judgments at sentence level (Callison-Burch and Osborne, 2006).

The core aspects of translation quality are fidelity to the source text (or adequacy, in MT parlance) and acceptability (also termed fluency) regarding the target language norms and conventions (Toury, 2012). Depending on the purpose and intended use of the MT, manual evaluation can be performed in a number of different ways. However, in any setting both adequacy and fluency shape human perception of the overall translation quality.

By contrast, automatic reference-based metrics are largely focused on MT adequacy, as they do not evaluate the appropriateness of the translation in the context of the target language. Translation fluency is thus assessed only indirectly, through the comparison with the reference. However, the difference from a particular human translation does not imply that the MT output is disfluent (Fomicheva et al., 2015a).

We propose to explicitly model translation fluency in reference-based MT evaluation. To this end, we develop a number of features representing translation fluency and integrate them with our reference-based metric UPF-Cobalt, which was originally presented at WMT15 (Fomicheva et al., 2015b). Along with the features based on the target Language Model (LM) probability of the MT output, which have been widely used in the related fields of speech recognition (Uhrik and Ward, 1997) and quality estimation (Specia et al., 2009), we design a more detailed representation of MT fluency that takes into account the number of disfluent segments observed in the candidate translation. We test our approach with the data available from the WMT15 Metrics Task and obtain very promising results, which rival the best-performing system submissions. We have also submitted the metric to the WMT16 Metrics Task.
2 Related Work
The recent advances in the field of MT evaluation
have been largely directed at improving the informativeness and accuracy of candidate-reference
comparison. Meteor (Denkowski and Lavie, 2014)
allows for stem, synonym and paraphrase matches,
thus addressing the problem of acceptable linguistic variation at lexical level. Other metrics
measure syntactic (Liu and Gildea, 2005), semantic (Lo et al., 2012) or even discourse similarity
(Guzmán et al., 2014) between candidate and reference translations. Further improvements have
been recently achieved by combining these partial measurements using different strategies including machine learning techniques (Comelles et
al., 2012; Giménez and Màrquez, 2010b; Guzmán
et al., 2014; Yu et al., 2015). However, none of the
above approaches explicitly addresses the fluency
of the MT output.
Predicting MT quality with respect to the target
language norms has been investigated in a different evaluation scenario, where human translations are not available as a benchmark. This task, referred
to as confidence or quality estimation, is aimed at
MT systems in use and therefore has no access to
reference translations (Specia et al., 2010).
Quality estimation can be performed at different
levels of granularity. Sentence-level quality estimation (Specia et al., 2009; Blatz et al., 2004) is
addressed as a supervised machine learning task
using a variety of algorithms to induce models
from examples of MT sentences annotated with
quality labels. In the word-level variant of this
task, each word in the MT output is to be judged
as correct or incorrect (Luong et al., 2015; Bach et
al., 2011), or labelled for a specific error type.
Research in the field of quality estimation is focused on the design of features and the selection
of appropriate learning schemes to predict translation quality, using source sentences, MT outputs,
internal MT system information and source and
target language corpora. In particular, features
that measure the probability of the MT output
with respect to a target LM, thus capturing translation fluency, have demonstrated highly competitive performance in a variety of settings (Shah et al., 2013).

Both translation evaluation and quality estimation aim to evaluate MT quality. Surprisingly, there have been very few attempts at joining the insights from these two related tasks. A notable exception is the work by Specia and Giménez (2010), who explore the combination of a large set of quality estimation features extracted from the source sentence and the candidate translation, as well as the source-candidate alignment information, with a set of 52 MT evaluation metrics from the Asiya Toolkit (Giménez and Màrquez, 2010a). They report a significant improvement over the reference-based evaluation systems on the task of predicting human post-editing effort. We follow this line of research by focusing specifically on integrating fluency information into reference-based evaluation.
3 UPF-Cobalt Review
UPF-Cobalt1 is an alignment-based evaluation
metric. Following the strategy introduced by the
well-known Meteor (Denkowski and Lavie, 2014),
UPF-Cobalt’s score is based on the number of
aligned words with different levels of lexical similarity. The most important feature of the metric
is a syntactically informed context penalty aimed
at penalizing the matches of similar words that
play different roles in the candidate and reference
sentences. The metric has achieved highly competitive results on the data from previous WMT
tasks, showing that the context penalty makes it possible to better discriminate between acceptable candidate-reference differences and the differences incurred
by MT errors (Fomicheva et al., 2015b). Below we
briefly review the main components of the metric.
For a detailed description of the metric, the reader is referred to Fomicheva and Bel (2016).
3.1 Alignment
The alignment module of UPF-Cobalt builds on
an existing system – Monolingual Word Aligner
(MWA), which has been shown to significantly
outperform state-of-the-art results for monolingual alignment (Sultan et al., 2014). We increase the coverage of the aligner by comparing distributed word representations as an additional source of lexical similarity information, which makes it possible to detect cases of quasi-synonyms (Fomicheva and Bel, 2016).

1 The metric is freely available for download at https://github.com/amalinovskiy/upf-cobalt.
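As an illustration, such a check can be as simple as a cosine-similarity threshold over pre-trained word vectors. The sketch below is a simplification under our own assumptions; in particular, the 0.6 threshold is chosen only for illustration and is not a value used in the metric itself.

```python
# Hypothetical sketch of an embedding-based similarity check used to extend
# the aligner's coverage to quasi-synonyms. The threshold is an assumption.
import numpy as np

def distributional_match(vec_t: np.ndarray, vec_r: np.ndarray, threshold: float = 0.6) -> bool:
    """Treat a candidate/reference word pair as quasi-synonyms if their
    pre-trained word vectors are sufficiently close in cosine similarity."""
    cos = float(np.dot(vec_t, vec_r) / (np.linalg.norm(vec_t) * np.linalg.norm(vec_r)))
    return cos >= threshold
```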
3.2 Scoring
UPF-Cobalt’s sentence-level score is a weighted
combination of precision and recall over the sum
of the individual scores computed for each pair of
aligned words. The word-level score for a pair of
aligned words (t, r) in the candidate and reference
translations is based on their lexical similarity
(LexSim) and a context penalty which measures
the difference in their syntactic contexts (CP ):
score(t, r) = LexSim(t, r) − CP(t, r)
Lexical similarity is defined based on the type
of lexical match (exact match, stem match, synonyms, etc.)2 (Denkowski and Lavie, 2014). The
crucial component of the metric is the context
penalty, which is applied at word-level to identify the cases where the words are aligned (i.e.
lexically similar) but play different roles in the
candidate and reference translations and therefore
should contribute less to the sentence-level score.
Thus, for each pair of aligned words, the words
that constitute their syntactic contexts are compared. The syntactic context of a word is defined
as its head and dependent nodes in a dependency
graph. The context penalty (CP ) is computed as
follows:
CP(t, r) = ( Σ_i w(C*_i) / Σ_i w(C_i) ) × ln( Σ_i w(C_i) + 1 )
where w refers to the weights that reflect the relative importance of the dependency functions of the context words, C_i refers to the words that belong to the syntactic context of the word r, and C*_i refers to the context words that are not equivalent.3 For two words to be considered equivalent, two conditions must be met: a) they must be aligned, and b) they must be found in the same or an equivalent syntactic relation with the word r.
The context penalty is calculated for both candidate and reference words. The metric computes
an average between reference-side context penalty
and candidate-side context penalty for each word pair. The sentence-level average can be obtained in a straightforward way from the word-level values (we use it as a feature in the decomposed version of the metric below).
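To make this concrete, the following is a minimal Python sketch of the word-level score defined above, under simplified, assumed data structures: the similarity values and dependency-function weights mirror footnotes 2 and 3, and each syntactic context is represented as a list of (dependency function, is-equivalent) pairs prepared by the aligner.

```python
# Minimal sketch of the word-level scoring. The input structures are our own
# simplification; LEX_SIM and DEP_WEIGHTS mirror footnotes 2 and 3.
import math

LEX_SIM = {"exact": 1.0, "stem": 0.9, "synonym": 0.8,
           "paraphrase": 0.6, "distributional": 0.5}
DEP_WEIGHTS = {"argument": 1.0, "modifier": 0.8, "specifier": 0.2}

def context_penalty(context):
    """CP for one side: the weight mass of non-equivalent context words,
    normalized by the total weight mass and scaled by ln(total + 1).
    `context` is a list of (dependency_function, is_equivalent) pairs."""
    total = sum(DEP_WEIGHTS[func] for func, _ in context)
    if total == 0.0:
        return 0.0
    non_equiv = sum(DEP_WEIGHTS[func] for func, equiv in context if not equiv)
    return (non_equiv / total) * math.log(total + 1)

def word_score(match_type, cand_context, ref_context):
    """score(t, r) = LexSim(t, r) - CP(t, r), averaging CP over both sides."""
    cp = 0.5 * (context_penalty(cand_context) + context_penalty(ref_context))
    return LEX_SIM[match_type] - cp
```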
4 Approach

In this paper we learn an evaluation metric that combines a series of adequacy-oriented features extracted from the reference-based metric UPF-Cobalt with various features intended to focus on translation fluency. This section first describes the metric-based features used in our experiments and then the selection and design of our fluency-oriented features.

4.1 Adequacy-oriented Features

UPF-Cobalt incorporates in a single score various distinct MT characteristics (lexical choice, word order, grammar issues such as wrong word forms or wrong choice of function words, etc.). We note that these components can be related, to a certain extent, to the aspects of translation quality discussed in this paper. The syntactic context penalty of UPF-Cobalt is affected by the well-formedness of the MT output, and may reflect, although indirectly, grammaticality and fluency, whereas the proportion of aligned words depends on correct lexical choice.

Using the components of the metric instead of the scores yields a more fine-grained representation of the MT output. We explore this idea in our experiments by designing a decomposed version of UPF-Cobalt. More specifically, we use 48 features, grouped below for space reasons (a brief illustrative sketch follows the list):

• Percentage and number of aligned words in the candidate and reference translations
• Percentage and number of aligned words with different levels of lexical similarity in the candidate and reference translations
• Percentage and number of aligned function and content words in the candidate and reference translations
• Minimum, maximum and average context penalty
• Percentage and number of words with high context penalty4
• Number of words in the candidate and reference translations
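A sketch of how a handful of these components could be computed from the aligner's output is given below; the input structures are our own simplification, and the high-context-penalty threshold is the training-set average mentioned in footnote 4.

```python
# Sketch of a few decomposed, adequacy-oriented features. `pairs` is assumed
# to be a list of (match_type, cp_value) tuples for the aligned candidate
# words, and `cand_len` the candidate sentence length in words.
def adequacy_features(pairs, cand_len, cp_high_threshold):
    cps = [cp for _, cp in pairs]
    high_cp = [cp for cp in cps if cp > cp_high_threshold]
    return {
        "num_aligned": len(pairs),
        "pct_aligned": len(pairs) / cand_len if cand_len else 0.0,
        "cp_min": min(cps) if cps else 0.0,
        "cp_max": max(cps) if cps else 0.0,
        "cp_avg": sum(cps) / len(cps) if cps else 0.0,
        "num_high_cp": len(high_cp),
        "pct_high_cp": len(high_cp) / cand_len if cand_len else 0.0,
        "cand_length": cand_len,
    }
```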
2 Specifically, the values for the different types of lexical similarity are: same word forms - 1.0, lemmatizing or stemming - 0.9, WordNet synsets - 0.8, paraphrase database - 0.6 and distributional similarity - 0.5.
3 The weights w are: argument/complement functions - 1.0, modifier functions - 0.8 and specifier/auxiliary functions - 0.2.
4 These are words with a context penalty value higher than the average computed on the training set used in our experiments.
4.2 Fluency-oriented Features
We suggest that the fluency aspect of translation quality has been overlooked in reference-based MT evaluation. Even though syntactically informed metrics capture structural differences and are, therefore, assumed to account for grammatical errors, we note that the distinction between adequacy and fluency is not limited to grammatical issues and thus exists at all linguistic levels. For instance, at the lexical level, the choice of
a particular word or expression may be similar in
meaning to the one present in the reference (adequacy), but awkward or even erroneous if considered in the context of the norms of the target
language use. Conversely, due to the variability
of linguistic expression, neither lexical nor syntactic differences from a particular human translation
imply ill-formedness of the MT output.
Sentence fluency can be described in terms of
the frequencies of the words with respect to a
target LM. Here, in addition to the LM-based
features that have been shown to perform well
for sentence-level quality estimation (Shah et al.,
2013), we introduce more complex features derived from word-level n-gram statistics. Besides
the word-based representation, we rely on Part-of-Speech (PoS) tags. As suggested by Felice and Specia (2012), morphosyntactic information can be a good indicator of ill-formedness in MT outputs.
First, we select 16 simple sentence-level features from previous work (Felice and Specia,
2012; Specia et al., 2010), summarized below.
• Number of words in the candidate translation
• LM probability and perplexity of the candidate translation
• LM probability of the candidate translation with respect to an LM trained on a corpus of PoS tags of words
• Percentage and number of content/function words
• Percentage and number of verbs, nouns and adjectives

Essentially, these features average LM probabilities of the words to obtain a sentence-level measurement. While being indeed predictive of sentence-level translation fluency, they are not representative of the number and scale of the disfluent fragments contained in the MT sentence. Moreover, if an ill-formed translation contains various word combinations that have very high probability according to the LM, the overall sentence-level LM score may be misleading.

To overcome the above limitations, we use word-level n-gram frequency measurements and design various features to extend them to the sentence level in a more informative way. We rely on LM backoff behaviour, as defined by Raybaud et al. (2011). LM backoff behaviour is a score assigned to a word according to how many times the target LM had to back off in order to assign a probability to the word sequence. The intuition behind this is that an n-gram not found in the LM can indicate a translation error. Specifically, the backoff behaviour value b(w_i) for a word w_i in position i of a sentence is defined as:

b(w_i) = 7, if w_{i-2}, w_{i-1}, w_i exists in the model
         6, if w_{i-2}, w_{i-1} and w_{i-1}, w_i both exist in the model
         5, if only w_{i-1}, w_i exists in the model
         4, if only w_{i-2}, w_{i-1} and w_i exist separately in the model
         3, if w_{i-1} and w_i both exist in the model
         2, if only w_i exists in the model
         1, if w_i is an out-of-vocabulary word
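This definition translates directly into code. The sketch below assumes, purely for illustration, that the target LM is exposed as plain sets of its known unigrams, bigrams and trigrams; in practice these statistics are obtained by querying the LM through the quality estimation toolkit described in Section 5.

```python
# Sketch of the backoff behaviour score b(w_i). The n-gram sets are an
# assumed, simplified interface to the target language model.
def backoff_behaviour(words, i, unigrams, bigrams, trigrams):
    """Return b(w_i) in {1, ..., 7} for the word at position i of a sentence."""
    w = words[i]
    w1 = words[i - 1] if i >= 1 else None
    w2 = words[i - 2] if i >= 2 else None
    if (w2, w1, w) in trigrams:
        return 7
    if (w2, w1) in bigrams and (w1, w) in bigrams:
        return 6
    if (w1, w) in bigrams:
        return 5
    if (w2, w1) in bigrams and w in unigrams:
        return 4
    if w1 in unigrams and w in unigrams:
        return 3
    if w in unigrams:
        return 2
    return 1  # out-of-vocabulary word
```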
We compute this score for each word in the MT
output and then use the mean, median, mode, minimum and maximum of the backoff behaviour values as separate sentence-level features. Also, we
calculate the percentage and number of words with
low backoff behaviour values (< 5) to approximate the number of fluency errors in the MT output.
Furthermore, we introduce a separate feature
that counts the words with a backoff behaviour
value of 1, i.e. the number of out-of-vocabulary
(OOV) words. OOV words are indicative of the
cases when source words are left untranslated in
the MT. Intuitively, this should be a strong indicator of low MT quality.
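A minimal sketch of how the word-level values can be turned into the sentence-level features just described follows; note that statistics.mode returns the first of several equally frequent values in Python 3.8+.

```python
# Sketch of the sentence-level fluency features derived from the word-level
# backoff behaviour values: summary statistics plus low-backoff and OOV counts.
import statistics

def sentence_fluency_features(backoff_values, low_threshold=5):
    n = len(backoff_values)
    low = [b for b in backoff_values if b < low_threshold]
    oov = [b for b in backoff_values if b == 1]
    return {
        "mean": statistics.mean(backoff_values),
        "median": statistics.median(backoff_values),
        "mode": statistics.mode(backoff_values),
        "min": min(backoff_values),
        "max": max(backoff_values),
        "num_low_backoff": len(low),
        "pct_low_backoff": len(low) / n,
        "num_oov": len(oov),
        "pct_oov": len(oov) / n,
    }
```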
Finally, we note that UPF-Cobalt, not unlike
the majority of reference-based metrics, lacks information regarding the MT words that are not
aligned or matched to any reference word. Such
fragments do not necessarily constitute an MT error, but may be due to acceptable linguistic variations. Collecting fluency information specifically
for these fragments may help to distinguish acceptable variation from MT errors. If a candidate word or phrase is absent from the reference
but is fluent in the target language, then the difference is possibly not indicative of an error and
should be penalized less. Based on this observation, we introduce a separate set of features that
compute the word-level measurements discussed
above only for the words that are not aligned to
the reference translation.
This results in 49 additional features, grouped here for space reasons (a brief sketch of the non-aligned variant follows the list):
• Summary statistics of the LM backoff behaviour (word and PoS-tag LM)
• Summary statistics of the LM backoff behaviour for non-aligned words only (word and PoS-tag LM)
• Percentage and number of words with low backoff behaviour value (word and PoS-tag LM)
• Percentage and number of non-aligned words with low backoff behaviour value (word and PoS-tag LM)
• Percentage and number of OOV words
• Percentage and number of non-aligned OOV words
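Restricting the measurements to non-aligned words then amounts to a simple filter over the candidate word positions left unaligned by UPF-Cobalt; the index-set representation of the alignment below is our own assumption, and the result can be passed to the summary-statistics sketch given earlier.

```python
# Sketch: compute the fluency measurements only over candidate words that were
# not aligned to the reference. `aligned_indices` is assumed to be a set of
# candidate word positions aligned by UPF-Cobalt.
def non_aligned_backoff_values(backoff_values, aligned_indices):
    return [b for i, b in enumerate(backoff_values) if i not in aligned_indices]

# Usage: sentence_fluency_features(non_aligned_backoff_values(values, aligned))
```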
5 Experimental Setup
For our experiments, we use the data available
from the WMT14 and WMT15 Metrics Tasks for the into-English translation directions. The datasets
consist of source texts, human reference translations and the outputs from the participating MT
systems for different language pairs. During manual evaluation, for each source sentence the annotators are presented with its human translation and
the outputs of a random sample of five MT systems, and asked to rank the MT outputs from best
to worst (ties are allowed). Pairwise system comparisons are then obtained from this compact annotation. Details on the WMT data for each language pair are given in Table 1.
LP    | WMT14 Rank | Sys | Src   | WMT15 Rank | Sys | Src
Cs-En | 21,130     | 5   | 3,003 | 85,877     | 16  | 2,656
De-En | 25,260     | 13  | 3,003 | 40,535     | 13  | 2,169
Fr-En | 26,090     | 8   | 3,003 | 29,770     | 7   | 1,500
Ru-En | 34,460     | 13  | 3,003 | 44,539     | 13  | 2,818
Hi-En | 20,900     | 9   | 2,507 | -          | -   | -
Fi-En | -          | -   | -     | 31,577     | 14  | 1,370

Table 1: Number of pairwise comparisons (Rank), translation systems (Sys) and source sentences (Src) per language pair for the WMT14 and WMT15 datasets.

In our work we focus on sentence-level metric performance, which is assessed by converting the metrics' scores to ranks and comparing them to the human judgments with the Kendall rank correlation coefficient (τ). We use the WMT14 official Kendall's Tau implementation (Macháček and Bojar, 2014). Following the standard practice at WMT, and to make our work comparable to the official metrics submitted to the task, we exclude ties in human judgments both for training and for testing our system.

Our model is a simple linear interpolation of the features presented in the previous sections. For tuning the weights, we use the learn-to-rank approach (Burges et al., 2005), which has been successfully applied in similar settings in previous work (Guzmán et al., 2014; Stanojevic and Sima'an, 2015). We use a standard implementation of the Logistic Regression algorithm from the Python toolkit scikit-learn5. The model is trained on the WMT14 dataset and tested on the WMT15 dataset.

For the extraction of word-level backoff behaviour values and sentence-level fluency features, we use QuEst++6, an open source tool for quality estimation (Specia et al., 2015). We employ the LM used to build the baseline system for the WMT15 Quality Estimation Task (Bojar et al., 2015).7 This LM was trained on data from the WMT12 translation task (a combination of news and Europarl data) and thus matches the domain of the dataset we use in our experiments. PoS tagging was performed with TreeTagger (Schmid, 1999).

5 http://scikit-learn.org/
6 https://github.com/ghpaetzold/questplusplus
7 http://www.statmt.org/wmt15/quality-estimation-task.html
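One standard way to realize this learn-to-rank setup is a pairwise reduction: each human comparison contributes the feature-vector difference between the preferred and dispreferred translation as a training example for logistic regression, and the learned coefficients then serve as the linear interpolation weights. The sketch below illustrates this general scheme under assumed array inputs; it is not necessarily the exact implementation used in our experiments, and the Kendall's τ function is a simplified version of the official script (metric-score ties are ignored).

```python
# Illustrative sketch of pairwise learn-to-rank tuning with scikit-learn and a
# simplified pairwise Kendall's tau. X_better/X_worse are assumed NumPy arrays
# of feature vectors for the preferred and dispreferred translation of each
# (tie-free) human pairwise judgment.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_weights(X_better, X_worse):
    diffs = X_better - X_worse
    X = np.vstack([diffs, -diffs])              # symmetric training pairs
    y = np.concatenate([np.ones(len(diffs)), np.zeros(len(diffs))])
    model = LogisticRegression().fit(X, y)
    return model.coef_.ravel()                  # linear interpolation weights

def metric_score(weights, features):
    return float(np.dot(weights, features))

def kendall_tau(scores_better, scores_worse):
    """(concordant - discordant) / (concordant + discordant); metric ties
    are left out here, unlike the official WMT14 implementation."""
    concordant = sum(b > w for b, w in zip(scores_better, scores_worse))
    discordant = sum(b < w for b, w in zip(scores_better, scores_worse))
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```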
6 Experimental Results
Table 2 summarizes the results of our experiments.
Group I presents the results achieved by UPF-Cobalt and its decomposed version described in Section 4.1. Contrary to our expectations, the performance is slightly degraded when using the metric's components (UPF-Cobaltcomp). Our intuition
is that this happens due to the sparseness of the
features based on the counts of different types of
lexical matches.
Group II reports the performance of the fluency
features presented in Section 4.2. First of all, we
note that these features on their own (FeaturesF)
achieve a reasonable correlation with human judgments, showing that fluency information is often sufficient to compare the quality of two candidate translations. Secondly, fluency features yield a significant improvement when used together with the metric's score (CobaltFsimple) or with the components of the metric (CobaltFcomp). We further boost the performance by combining the scores of the metrics BLEU, Meteor and UPF-Cobalt with our fluency features (MetricsF).

Group | Metric         | cs-en     | de-en     | fi-en     | fr-en     | ru-en     | Avg τ
I     | UPF-Cobalt     | .457±.011 | .427±.011 | .437±.011 | .386±.011 | .402±.011 | .422±.011
I     | UPF-Cobaltcomp | .442±.011 | .418±.011 | .428±.011 | .387±.011 | .388±.011 | .413±.012
II    | FeaturesF      | .373±.011 | .337±.011 | .359±.011 | .267±.011 | .263±.011 | .320±.011
II    | CobaltFsimple  | .487±.011 | .445±.011 | .455±.011 | .401±.011 | .395±.011 | .437±.012
II    | CobaltFcomp    | .481±.011 | .438±.011 | .464±.011 | .403±.011 | .395±.011 | .436±.011
II    | MetricsF       | .502±.011 | .457±.011 | .450±.011 | .413±.011 | .410±.011 | .447±.011
III   | DPMFcomb       | .495±.011 | .482±.011 | .445±.011 | .395±.011 | .418±.011 | .447±.011
III   | BEER Treepel   | .471±.011 | .447±.011 | .438±.011 | .389±.011 | .403±.011 | .429±.011
III   | RATATOUILLE    | .472±.011 | .441±.011 | .421±.011 | .398±.011 | .393±.011 | .425±.010
IV    | BLEU           | .391±.011 | .360±.011 | .308±.011 | .358±.011 | .329±.011 | .349±.011
IV    | Meteor         | .439±.011 | .422±.011 | .406±.011 | .380±.011 | .386±.011 | .407±.012

Table 2: Sentence-level evaluation results for the WMT15 dataset in terms of Kendall rank correlation coefficient (τ).
The results demonstrate that fluency features provide useful information regarding the overall translation quality, which is not fully captured by the standard candidate-reference comparison. These features are discriminative when the relationship to the reference does not provide enough information to distinguish between the quality of two alternative candidate translations. For example, it may well be the case that both MT outputs are very different from the human reference, but one constitutes a valid alternative translation, while the other is totally unacceptable.

Finally, Groups III and IV contain the results of the best-performing evaluation systems from the WMT15 Metrics Task, as well as the baseline BLEU metric (Papineni et al., 2002) and a strong competitor, Meteor (Denkowski and Lavie, 2014), which we reproduce here for the sake of comparison. DPMFcomb (Yu et al., 2015) and RATATOUILLE (Marie and Apidianaki, 2015) use a learnt combination of the scores from different evaluation metrics, while BEER Treepel (Stanojevic and Sima'an, 2015) combines word matching, word order and syntax-level features. We note that the number and complexity of the metrics used in the above approaches is quite high. For instance, DPMFcomb is based on 72 separate evaluation systems, including the resource-heavy linguistic metrics from the Asiya Toolkit (Giménez and Màrquez, 2010a).
7 Conclusions
The performance of reference-based MT evaluation metrics is limited by the fact that dissimilarities from a particular human translation do not always indicate poor MT quality. In this paper we proposed to address this issue by integrating translation fluency into the evaluation. This aspect determines how well a translated text conforms to the
linguistic regularities of the target language and
constitutes a strong predictor of the overall MT
quality.
In addition to the LM-based features developed
in the field of quality estimation, we designed a
more fine-grained representation of translation fluency, which in combination with our reference-based evaluation metric UPF-Cobalt yields a
highly competitive performance for the prediction
of pairwise preference judgments. The results of
our experiments thus confirm that the integration
of features intended to address translation fluency
improves reference-based MT evaluation.
In the future we plan to investigate the performance of fluency features for the modelling of
other types of manual evaluation, such as absolute
scoring.
Acknowledgments
This work was partially funded by TUNER
(TIN2015-65308-C5-5-R) and MINECO/FEDER,
UE. Marina Fomicheva was supported by funding
from the FI-DGR grant program of the Generalitat de Catalunya. Iria da Cunha was supported
by a Ramón y Cajal contract (RYC-2014-16935).
Lucia Specia was supported by the QT21 project
(H2020 No. 645452).
References
Marina Fomicheva, Núria Bel, Iria da Cunha, and Anton Malinovskiy. 2015b. UPF-Cobalt Submission to
WMT15 Metrics Task. In Proceedings of the Tenth
Workshop on Statistical Machine Translation, pages
373–379.
Nguyen Bach, Fei Huang, and Yaser Al-Onaizan.
2011. Goodness: A Method for Measuring Machine
Translation Confidence. In Proceedings of the 49th
Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 211–219. Association for Computational Linguistics (ACL).
Jesús Giménez and Lluı́s Màrquez. 2010a. Asiya:
An Open Toolkit for Automatic Machine Translation
(Meta-)Evaluation. The Prague Bulletin of Mathematical Linguistics, (94):77–86.
John Blatz, Erin Fitzgerald, George Foster, Simona
Gandrabur, Cyril Goutte, Alex Kulesza, Alberto
Sanchis, and Nicola Ueffing. 2004. Confidence Estimation for Machine Translation. In Proceedings of
the 20th International Conference on Computational
Linguistics, pages 315–321. ACL.
Jesús Giménez and Lluı́s Màrquez. 2010b. Linguistic
Measures for Automatic Machine Translation Evaluation. Machine Translation, 24(3):209–240.
Francisco Guzmán, Shafiq Joty, Lluı́s Màrquez, and
Preslav Nakov. 2014. Using Discourse Structure
Improves Machine Translation Evaluation. In ACL
(1), pages 687–698.
Ondřej Bojar, Rajen Chatterjee, Christian Federmann,
Barry Haddow, Matthias Huck, Chris Hokamp,
Philipp Koehn, Varvara Logacheva, Christof Monz,
Matteo Negri, Matt Post, Carolina Scarton, Lucia
Specia, and Marco Turchi. 2015. Findings of the
2015 Workshop on Statistical Machine Translation.
In Proceedings of the Tenth Workshop on Statistical
Machine Translation, pages 1–46, Lisbon, Portugal,
September. ACL.
Ding Liu and Daniel Gildea. 2005. Syntactic Features
for Evaluation of Machine Translation. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation
and/or Summarization, pages 25–32.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier,
Matt Deeds, Nicole Hamilton, and Greg Hullender.
2005. Learning to Rank Using Gradient Descent. In
Proceedings of the 22nd international conference on
Machine learning, pages 89–96. ACM.
Chi-Kiu Lo, Anand Karthik Tumuluru, and Dekai Wu.
2012. Fully Automatic Semantic MT Evaluation. In
Proceedings of the Seventh Workshop on Statistical
Machine Translation, pages 243–252. ACL.
Chris Callison-Burch and Miles Osborne. 2006. Re-evaluating the Role of BLEU in Machine Translation Research. In Proceedings of the European Association for Computational Linguistics (EACL), pages 249–256. ACL.
Ngoc-Quang Luong, Laurent Besacier, and Benjamin
Lecouteux. 2015. Towards Accurate Predictors
of Word Quality for Machine Translation: Lessons
Learned on French–English and English–Spanish
Systems. Data & Knowledge Engineering, 96:32–
42.
Elisabet Comelles, Jordi Atserias, Victoria Arranz, and
Irene Castellón. 2012. VERTa: Linguistic Features
in MT Evaluation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 3944–3950.
Matouš Macháček and Ondřej Bojar. 2014. Results of
the WMT14 Metrics Shared Task. In Proceedings of
the Ninth Workshop on Statistical Machine Translation, pages 293–301.
Michael Denkowski and Alon Lavie. 2014. Meteor
Universal: Language Specific Translation Evaluation for any Target Language. In Proceedings of the
Ninth Workshop on Statistical Machine Translation,
pages 376–380.
Benjamin Marie and Marianna Apidianaki. 2015.
Alignment-based Sense Selection in METEOR and
the RATATOUILLE Recipe. In Proceedings of the
Tenth Workshop on Statistical Machine Translation,
pages 385–391.
Mariano Felice and Lucia Specia. 2012. Linguistic
Features for Quality Estimation. In Proceedings of
the Seventh Workshop on Statistical Machine Translation, pages 96–103. ACL.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic
Evaluation of Machine Translation. In Proceedings
of the 40th Annual Meeting of the ACL, pages 311–
318. ACL.
Marina Fomicheva and Núria Bel. 2016. Using Contextual Information for Machine Translation Evaluation. In Proceedings of the Tenth International
Conference on Language Resources and Evaluation
(LREC 2016), pages 2755–2761.
Sylvain Raybaud, David Langlois, and Kamel Smaı̈li.
2011. “This Sentence is Wrong.” Detecting Errors in Machine-Translated Sentences. Machine Translation,
25(1):1–34.
Marina Fomicheva, Núria Bel, and Iria da Cunha.
2015a. Neutralizing the Effect of Translation Shifts
on Automatic Machine Translation Evaluation. In
Computational Linguistics and Intelligent Text Processing, pages 596–607.
Helmut Schmid. 1999. Improvements in Part-of-Speech Tagging with an Application to German. In Natural Language Processing Using Very Large Corpora, pages 13–25. Springer.
Kashif Shah, Trevor Cohn, and Lucia Specia. 2013.
An Investigation on the Effectiveness of Features for
Translation Quality Estimation. In Proceedings of
the Machine Translation Summit, volume 14, pages
167–174.
Lucia Specia and Jesús Giménez. 2010. Combining
Confidence Estimation and Reference-based Metrics
for Segment-level MT Evaluation. In The Ninth
Conference of the Association for Machine Translation in the Americas.
Lucia Specia, Marco Turchi, Nicola Cancedda, Marc
Dymetman, and Nello Cristianini. 2009. Estimating
the Sentence-level Quality of Machine Translation
Systems. In 13th Conference of the European Association for Machine Translation, pages 28–37.
Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010.
Machine Translation Evaluation versus Quality Estimation. Machine Translation, 24(1):39–50.
Lucia Specia, Gustavo Paetzold, and Carolina Scarton. 2015. Multi-level Translation Quality Prediction with QuEst++. In 53rd Annual Meeting of the
Association for Computational Linguistics and Seventh International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: System Demonstrations,
pages 115–120.
Miloš Stanojevic and Khalil Sima’an. 2015. BEER
1.1: ILLC UvA Submission to Metrics and Tuning
Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 396–401.
Md Arafat Sultan, Steven Bethard, and Tamara Sumner. 2014. Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual
Evidence. Transactions of the ACL, 2:219–230.
Gideon Toury. 2012. Descriptive Translation Studies and beyond: Revised edition, volume 100. John
Benjamins Publishing.
C. Uhrik and W. Ward. 1997. Confidence Metrics
Based on N-gram Language Model Backoff Behaviors. In Proceedings of Fifth European Conference
on Speech Communication and Technology, pages
2771–2774.
Hui Yu, Qingsong Ma, Xiaofeng Wu, and Qun Liu.
2015. CASICT-DCU Participation in WMT2015
Metrics Task. In Proceedings of the Tenth Workshop
on Statistical Machine Translation, pages 417–421.