MT Post Editing Guidelines
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission from the author, except for the inclusion of brief quotations in a review.
The TAUS Guidelines included in this ebook are intended to be applied in combination with each other.
AUTHORS
Isabella Massardo, Jaap van der Meer, Sharon O’Brien, Fred Hollowood, Nora Aranberri, Katrin Drescher
Post-editing Productivity
Today, there are many methodologies in use, resulting in a lack of cohesive standards, as organizations take various approaches to evaluating performance. Some use final output quality evaluation or post-editor productivity as a standalone metric. Others analyze quality data, such as the degree of “over-editing” or “under-editing” in the post-editor’s work, or evaluate the percentage of MT suggestions used versus discarded in the final output.
An agreed set of best practices will help the industry fairly and efficiently select the
most suitable talent for post-editing work and identify the training opportunities that
will help translators and new players, such as crowdsourcing resources, become highly
skilled and qualified post-editors.
SCOPE
These guidelines focus on the analysis and interpretation of productivity testing results, combined with the evaluation of the final post-edited output and scores obtained via industry-standard automated scoring metrics, in order to evaluate post-editors’ performance. They do not cover the design and execution of post-editing productivity tests, post-editing quality levels or pricing recommendations.
DEFINING GOALS
The suggested goals for the TRANSLATION SERVICE PROVIDER are:
Identify the best performers from the pool of post-editors, who deliver the desired level of output quality with the highest productivity gains; identify the “ideal” post-editor profile for the specific content type and quality requirements (linguist, domain specialist, “casual” translator).
Gather intelligence on the performance of the in-house technology used to enable the
post-editing process, such as a translation management system, a recommended post-
editing environment and MT engines.
Based on post-editor productivity, set realistic TAT (turnaround time) expectations, determine the appropriate pricing structure for specific content types and language pairs, and reflect the above in an SLA (Service Level Agreement).
Select a subset of the content used for the productivity test for which the highest and
the lowest productivity is seen (the “outliers”), and evaluate the quality of the output
using reviewers and automated quality evaluation tools (spellcheckers, Checkmate,
X-bench, style consistency evaluation tools). Make sure the final output meets your
quality expectations for the selected content types.
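As an illustration of selecting such outliers, the short Python sketch below ranks segments from a hypothetical productivity log by words per second and returns the fastest and slowest subsets for quality review; the field names and the cut-off fraction are illustrative assumptions, not prescribed by these guidelines.

# Sketch: select the highest- and lowest-productivity segments ("outliers")
# from a productivity test log. Field names and the cut-off are assumptions.

def select_outliers(segments, fraction=0.10):
    """Return (fastest, slowest) subsets of a segment list.

    Each segment is a dict such as: {"id": 17, "words": 24, "seconds": 61.5}
    """
    ranked = sorted(
        segments,
        key=lambda s: s["words"] / s["seconds"],  # words per second
        reverse=True,
    )
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n], ranked[-n:]

if __name__ == "__main__":
    log = [
        {"id": 1, "words": 30, "seconds": 45.0},
        {"id": 2, "words": 12, "seconds": 90.0},
        {"id": 3, "words": 25, "seconds": 20.0},
        {"id": 4, "words": 18, "seconds": 60.0},
    ]
    fastest, slowest = select_outliers(log, fraction=0.25)
    print("review these segments for quality:", fastest + slowest)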
Make sure there is minimal subjective evaluation. Provide clear evaluation guidelines to the reviewers and make certain the reviewers’ expectations and the post-editors’ instructions are aligned. Refer to the most common known post-editing mistakes in your guidelines.
Obtain edit distance and MT quality data for the content post-edited during the productivity evaluation, using industry-standard methods, e.g. General Text Matcher, Levenshtein, BLEU, TER.
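For teams without a scoring toolkit to hand, the sketch below shows one way to compute a normalized character-level Levenshtein edit distance between a raw MT segment and its post-edited version; it uses only the Python standard library and is offered as a minimal illustration rather than a replacement for the metrics named above.

# Minimal sketch: normalized character-level Levenshtein distance between a
# raw MT segment and its post-edited version. 0.0 means no edits were made;
# values near 1.0 mean the segment was largely rewritten.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalized_edit_distance(mt: str, post_edited: str) -> float:
    longest = max(len(mt), len(post_edited)) or 1
    return levenshtein(mt, post_edited) / longest

print(normalized_edit_distance(
    "The engine produce a translation.",
    "The engine produces a translation."))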
If your productivity evaluation tool captures this data, obtain information on the changes made by post-editors, i.e. edit location, nature of edits; if your tool doesn’t capture such data, a simple file difference tool can be used.
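As a minimal example of such a file difference approach, the following Python sketch uses the standard-library difflib module to list the location and nature (replace, insert, delete) of the edits made to a segment; the sample sentences are invented for illustration.

# Sketch: a simple "file difference" view of the edits a post-editor made,
# using only the standard library. Tags show the nature of each edit
# (replace / delete / insert) and its location in the segment.

import difflib

raw_mt = "The settings menu allow to change the language of interface."
post_edited = "The settings menu allows you to change the interface language."

raw_words = raw_mt.split()
pe_words = post_edited.split()

matcher = difflib.SequenceMatcher(None, raw_words, pe_words)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(f"{tag:7} at word {i1}: {raw_words[i1:i2]} -> {pe_words[j1:j2]}")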

You will be able to determine realistic turnaround time and pricing expectations based
on the productivity and quality data. Always have a clear human translation benchmark
for reference (e.g. 2000 words per day for European languages) unless your tool allows
you to capture the actual in-production data.
For smaller LSPs and freelance translators, tight-turnaround projects, or other limited-bandwidth scenarios, reliance on legacy industry information is recommended.
Assess individual post-editors’ performances using side-by-side data for more than one
post-editor; use the individual post-editor data to identify the most suitable post-editor
profile for the specific content types and quality levels.
Calculate the mode (the number which appears most often in a set of numbers) based on the scores of multiple post-editors to obtain data specific to a certain content/quality/language combination; gather data for specific sentence length ranges and sentence types.
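The sketch below illustrates this kind of aggregation in Python: hypothetical per-segment scores from three post-editors are grouped into sentence-length bands and the mode is reported per band. The band boundaries and the data are assumptions chosen only for the example.

# Sketch: aggregate per-segment productivity scores across several
# post-editors. statistics.multimode gives the most frequent value(s);
# the length bands (1-10, 11-25, 26+ words) are illustrative assumptions.

from collections import defaultdict
from statistics import multimode

# (post_editor, source_word_count, words_per_hour) - hypothetical test data
records = [
    ("pe1", 8, 900), ("pe2", 8, 900), ("pe3", 8, 750),
    ("pe1", 18, 650), ("pe2", 19, 650), ("pe3", 22, 500),
    ("pe1", 31, 400), ("pe2", 35, 400), ("pe3", 40, 400),
]

def length_band(words: int) -> str:
    if words <= 10:
        return "short (1-10 words)"
    if words <= 25:
        return "medium (11-25 words)"
    return "long (26+ words)"

by_band = defaultdict(list)
for _, words, wph in records:
    by_band[length_band(words)].append(wph)

for band, scores in by_band.items():
    print(band, "mode:", multimode(scores))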
Do not use the obtained productivity data in isolation to calculate expected daily throughputs and turnaround times, as the reported values reflect an “ideal” off-production scenario; add the time necessary for terminology and concept research, administrative tasks, breaks and the like.
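A worked example of how such adjustments might be applied is sketched below; the measured speed, overhead share, working hours and the 2000 words/day human translation benchmark are all illustrative assumptions to be replaced with your own figures.

# Sketch: turn a measured post-editing speed into a realistic daily
# throughput and turnaround time. The overhead share, hours per day and
# the human-translation benchmark are assumptions to adjust.

MEASURED_WORDS_PER_HOUR = 800      # from the productivity test (ideal conditions)
OVERHEAD_SHARE = 0.30              # research, admin tasks, breaks and the like
PRODUCTIVE_HOURS_PER_DAY = 7
HT_BENCHMARK_WORDS_PER_DAY = 2000  # human translation reference

effective_speed = MEASURED_WORDS_PER_HOUR * (1 - OVERHEAD_SHARE)
daily_throughput = effective_speed * PRODUCTIVE_HOURS_PER_DAY

project_words = 50_000
turnaround_days = project_words / daily_throughput
gain_over_ht = daily_throughput / HT_BENCHMARK_WORDS_PER_DAY

print(f"Expected daily throughput: {daily_throughput:.0f} words/day")
print(f"Turnaround for {project_words} words: {turnaround_days:.1f} working days")
print(f"Productivity gain vs. HT benchmark: {gain_over_ht:.1f}x")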
Identify the best and worst performing sentence types (length, presence/absence of DoNotTranslate elements, tags, terminology, certain syntactic structures) and gather best practices for post-editors on the optimal handling of such sentences and elements.
Analyze the nature of edits using the data obtained during the productivity tests, and use the top performers’ data to create recommendations for lower performers; use the obtained information to provide feedback on the MT engine.
Each company’s post-editing guidelines are likely to vary depending on a range of parameters. It is not practical to present a set of guidelines that will cover all scenarios. We expect that organizations will use these basic guidelines and will tailor them as required for their own purposes.
Generally, these guidelines assume bilingual post-editing (not monolingual), ideally carried out by a paid translator, but it might in some scenarios be carried out by bilingual domain experts or volunteers. The guidelines are not system- or language-specific.
RECOMMENDATIONS
To reduce the level of post-editing required (regardless of language pair, direction, system type or domain), we recommend the following:
• Tune your system appropriately, i.e. ensure a high level of dictionary and linguistic coding for rule-based machine translation systems, or training with clean, high-quality, domain-specific data for data-driven or hybrid systems.
• Ensure the source text is written well (i.e. correct spelling, punctuation, unambiguous) and, if possible, tuned for translation by MT (i.e. by using specific authoring rules that suit the MT system in question).
• Integrate terminology management across source text authoring, MT and translation
memory (TM) systems.
• Train post-editors in advance.
• Examine the raw MT output quality before negotiating throughput and price and set
reasonable expectations.
• Agree a definition for the final quality of the information to be post-edited, based on
user type and levels of acceptance.
• Pay post-editors to give structured feedback on common MT errors (and, if necessary, provide guidance about how this should be done) so the system can be improved over time.
BASIC PE GUIDELINES
Assuming the recommendations above are implemented, we suggest some basic guidelines for post-editing. The effort involved in post-editing will be determined by two main criteria: the quality of the raw MT output and the quality expected of the final post-edited content.
To reach quality similar to “high-quality human translation and revision” (a.k.a. “publishable quality”), full post-editing is usually recommended. For quality of a lower standard, often referred to as “good enough” or “fit for purpose”, light post-editing is usually recommended. However, light post-editing of really poor MT output may not bring the output up to publishable quality standards.
On the other hand, if the raw MT output is of good quality, then perhaps all that is needed is a light, not a full, post-edit to achieve publishable quality. So, instead of differentiating between guidelines for light and full post-editing, we will differentiate here between two levels of expected quality. Other levels could be defined, but we will stick to two here to keep things simple. The set of guidelines proposed below is conceptualised as a group of guidelines from which individual guidelines can be selected, depending on the needs of the customer and the raw MT quality.
You may also undertake such tests periodically, say annually, to get an indication of improvements (or not) in productivity.
Small-scale, short-term tests, by their nature, gather less data and are less rigorous than
the larger-scale, long-term ones for which we provide a separate set of best practice
guidelines.
The TAUS productivity testing reports show average and individual post-editor productivity, enabling users to determine the influence of outliers on results.
• Human evaluation: For example, using a company’s standard error typology or adequacy/fluency evaluation.
• Edit distance measures: Some research has shown the TER (Translation Edit Rate) and GTM (General Text Matcher) measures to correlate fairly well with human assessments of quality; a small scoring sketch follows this list.
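As an illustration, the sketch below scores raw MT output against its post-edited version with BLEU and TER, assuming the third-party sacrebleu package is installed (pip install sacrebleu); the example sentences are invented, and the post-edited text is used as the reference.

# Sketch: score raw MT output against its post-edited version with BLEU and
# TER, assuming the sacrebleu package is available. The post-edited text
# serves as the reference; lower TER means fewer edits were needed.

from sacrebleu.metrics import BLEU, TER

raw_mt = [
    "The user can to change the password in settings.",
    "Click Save for store the changes.",
]
post_edited = [
    "The user can change the password in the settings.",
    "Click Save to store the changes.",
]

bleu = BLEU().corpus_score(raw_mt, [post_edited])
ter = TER().corpus_score(raw_mt, [post_edited])
print(bleu)  # prints a BLEU score summary
print(ter)   # prints a TER score summary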
The TAUS quality evaluation tools enable you to undertake adequacy/fluency evaluation and error typology review. Again, reports show average and individual post-editor productivity, enabling users to determine the influence of outliers on results.
Involve representatives from your post-editing community in the design and analysis
As in the case of the TAUS Machine Translation Post-Editing Guidelines, we recommend that representatives of the post-editing community be involved in the productivity measurement. Having a stake in such a process generally leads to a higher level of consensus.
The larger-scale, longer-term tests would be used if you are already reasonably committed to MT, or at least to testing it on a large scale, and looking to create a virtuous cycle towards operational excellence, guided by such tests.
• Human evaluation: For example, using a company’s standard error typology. Note that human raters of quality often display low rates of agreement. The more raters, the better (at least three); a simple agreement check is sketched after this list.
• Edit distance measures: Some research has shown the TER (Translation Edit Rate) and GTM (General Text Matcher) measures to correlate fairly well with human assessments of quality.
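The following sketch shows one simple way to quantify rater agreement, using Cohen’s kappa for two raters; the severity labels and ratings are invented, and with three or more raters you would average the pairwise kappas or use Fleiss’ kappa instead.

# Sketch: Cohen's kappa as a quick check of agreement between two human
# raters. The severity labels are hypothetical; with three or more raters,
# average the pairwise kappas or use Fleiss' kappa instead.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(freq_a) | set(freq_b))
    return (observed - expected) / (1 - expected)

rater_1 = ["minor", "major", "ok", "ok", "minor", "ok"]
rater_2 = ["minor", "minor", "ok", "ok", "major", "ok"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")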
CAVEAT EMPTOR
Predictive. A model for pricing MTPE (Machine Translation Post-Editing) helps to predict the cost.
• Your model should help establish pricing up-front. Therefore, a model should either
allow for extrapolation or be able to calculate the cost of a particular volume of text
instantly. Remember, pricing may change each time you evaluate and deploy a new
version of an engine.
Fair. A model for pricing MTPE provides buyers, language service providers and translators with a reliable measurement of the required work.
• All parties involved in the translation process (for example, translators, language service providers and buyers) should be involved in establishing your approach.
• All parties should agree that the pricing model reflects the effort involved.
• Translators should be provided with training on post-editing and realistic expectations must be set. See TAUS Machine Translation Post-editing Guidelines for more detailed information on quality expectations.
• It can be difficult to demonstrate you are always being fair, because circumstances will serve to undermine the assumptions in certain cases. We ask that you share those experiences with us ([email protected]) so that we can create a public knowledge base over time.
Appropriate. A model for pricing MTPE is appropriate provided that:
• You test the model on representative test-data, i.e. the quality of the test-data has the same characteristics as that used in the real setting;
• You use a representative volume of test-data to allow for a comprehensive study; and
• The content-type in the test-data matches that of the real setting.
In a post-editing scenario, spot checks to monitor quality are advised, and feedback from post-editors should be collected, keeping the dialogue open, acknowledging and acting on feedback where possible.
Your method for assessing quality and establishing pricing should be transparent.
You will need to combine a number of approaches to achieve a predictive, fair and appropriate model. This may involve combining automated and human evaluation, and undertaking a productivity assessment. A productivity assessment should always be used.
GTM
GTM (General Text Matcher) measures the similarity between the MT output and the human reference translations by calculating edit distance.
MT Reversed Analysis
This approach aims to correlate MT output quality with fuzzy-match bands. It calculates
the fuzzy-match level of raw MT segments with respect to their post-edited segments.
The approach relies on a well-established pricing model for TM-aided translation. The process runs as follows (a small numerical sketch follows the steps below):
• Post-edit the raw MT output. Apply a fuzzy-match model to the raw MT and post-edited pairs, as is done with TMs. Assuming that a particular engine will behave the same way in similar scenarios (content type, language pair, external resources), establish expected fuzzy-match band proportions and rates for each band.
• To calculate cost savings, you can compare: (1) the hypothetical price for the source
and the final translation (post-edited version of the source) obtained through a
fuzzy-match pricing model, and (2) the cost of post-editing the raw MT output
through a productivity assessment to test the results and refine assumptions.
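To make the reversed analysis concrete, the sketch below computes a fuzzy-match score for each raw MT/post-edited pair, maps it to a match band and applies a per-band rate; the similarity measure, band boundaries and rates are illustrative assumptions, not recommended values.

# Sketch of MT reversed analysis: map raw MT segments to fuzzy-match bands
# against their post-edited versions and price them per band. The bands and
# per-word rates below are illustrative assumptions, not recommendations.

from difflib import SequenceMatcher

# (raw MT, post-edited) pairs - invented examples
pairs = [
    ("Click Save for store the changes.", "Click Save to store the changes."),
    ("The file cannot is opened.", "The file cannot be opened."),
]

# (lower bound in %, per-word rate as a fraction of the full new-word rate)
BANDS = [(95, 0.10), (85, 0.30), (75, 0.50), (0, 1.00)]
FULL_RATE_PER_WORD = 0.12  # hypothetical full rate, in your currency

def fuzzy_match(a: str, b: str) -> float:
    """Character-based similarity in percent (a simple stand-in for a TM score)."""
    return 100 * SequenceMatcher(None, a, b).ratio()

total_mtpe = 0.0
total_full = 0.0
for raw, edited in pairs:
    score = fuzzy_match(raw, edited)
    rate_factor = next(f for lower, f in BANDS if score >= lower)
    words = len(edited.split())
    total_mtpe += words * FULL_RATE_PER_WORD * rate_factor
    total_full += words * FULL_RATE_PER_WORD
    print(f"{score:5.1f}% match -> {rate_factor:.0%} of full rate ({words} words)")

print(f"Estimated MTPE cost: {total_mtpe:.2f} vs. full new-word rate: {total_full:.2f}")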
Determine a threshold for automated scores, above which a minimum acceptable level of improvement in quality for post-editing has been achieved, or undertake human review to determine that a minimum acceptable level of improvement in quality has been achieved. Post-edit a sample of representative content. Undertake reversed analysis of the post-edited content to map to fuzzy-match price band rates. Undertake a small-scale productivity assessment and human review to validate and refine the conclusions. The errors produced by MT are different from those found in fuzzy matches, hence productivity tests and human review are necessary. Combining these approaches each time you have a new engine should ensure your pricing model is predictive, fair and appropriate.
The guidelines were also reviewed and refined by participating organizations at the TAUS Quality Evaluation Summit, 15 March 2013, Dublin.
Working with partners and representatives globally, TAUS supports all translation operators – translation buyers, language service providers, individual translators and government agencies – with a comprehensive suite of online services, software and knowledge that help them to grow and innovate their business.
Through sharing translation data and quality evaluation metrics, promoting innovation and encouraging positive change, TAUS has extended the reach and growth of the translation industry.
FEEDBACK