Training Guide PE Certification

SDL Certification

Post-editing Certification

Training_Guide_PE_Certification
Revision Date: 30/10/2013

Table of contents

1 Introduction
1.1 About this training workbook

2 A brief history of post-editing and MT
2.1 What is MT?
2.2 MT development in the last century
2.3 A short history of MT at SDL

3 Post-editing versus Translation
3.1 Global developments and the localisation industry
3.2 Why post-edit?
3.3 Why translate?

4 MT Technologies
4.1 The challenges of MT
4.2 Rules-based Machine Translation (RBMT)
4.3 Statistical Machine Translation (SMT)
4.4 Hybrid Systems

5 How the MT output is created
5.1 Baselines
5.2 Verticals
5.3 Customisations
5.4 Engine training process

6 From the MT output onwards: the basics of post-editing
6.1 Introduction to post-editing
6.2 Degrees of post-editing
6.3 The quality check process

7 How to get the most out of MT
7.1 What makes an effective post-editor?
7.2 Post-editing quality expectations
7.3 Under-editing
7.4 Over-editing
7.5 Help improve MT for the future

8 Expected Statistical MT behavior
8.1 Common patterns to watch for when post-editing
8.2 How to provide feedback to improve the MT output

9 Using BeGlobal baselines in SDL Trados Studio
9.1 BeGlobal baselines
9.2 How to add SDL BeGlobal Community as a translation provider in SDL Trados Studio

10 Summary
10.1 Conclusion to training workbook

11 Further references
11.1 More information on MT and post-editing

12 Appendix
12.1 Post-editing examples

1 Introduction
1.1 About this training workbook


The scope of this training workbook is to introduce the reader to the techniques and
skills involved in post-editing machine translation (MT) output. It provides practical
examples of best-practice post-editing and recurrent issues such as over-editing and
under-editing. Moreover, it aims to familiarise translators with MT technology in order to
enable their involvement in the entire process from training engines to post-editing
content to publishable quality.
The document covers the following areas:

The history and development of MT

The various MT technologies currently used and the effects they have on the
quality and post-editability of the MT output

The post-editing and quality check processes and their relation to conventional
human translation

A guide to effectively post-editing MT output to understandable and publishable
quality

Common patterns to watch for when post-editing MT output

Using BeGlobal baselines in Studio

Where to find further information on MT and post-editing processes

In addition, the document aims to address some of the common misconceptions about
MT:

MT is taking away my job

MT output is always low quality

MT material is only useful when it can be easily edited

MT does not leave any room for creativity

MT does not fit with my translation style

MT technology is too complicated

Post-editing is less skilled than translation

2 A brief history of post-editing and MT


2.1 What is MT?
Machine Translation (MT) is automated translation that uses software to translate text
from one natural language to another. It is one of the oldest applications of Artificial
Intelligence and both facilitates and accelerates the creation of high quality translations.
Post-editing MT output can increase productivity in comparison with conventional
translation. It allows companies to deliver a high quality translation at greater speed,
and consequently at lower cost, and as such can be considered a new industry trend.
However, it is important to remember that MT does not replace human translators. MT
is a tool rather than an end solution and a stage of human correction will always be
necessary when post-editing to a publishable quality. Nonetheless, it is an effective tool
when understood and used correctly.

Uses of Machine Translation

Fully Automated Useful Translation (FAUT)
MT is generated by baseline engines or customised engines and the output is used
directly, with no human intervention.
This solution is used mostly for content such as emails, support content or instant
messages, where the user wants to have an idea of the content, without the need for
high quality.

Post-editing
MT output from customised MT engines is post-edited by linguists to a quality level
equivalent to conventional translation.
Post-editing MT content is the preferred solution for publishable documents. It is used
as part of a high quality translation process.

2.2 MT development in the last century


Following on from the efforts of war-time cryptography, MT is generally considered to
have started in the 1950s. In 1954, the successful execution of the Georgetown
Experiment - the fully automated translation of approximately sixty Russian sentences
into English - ushered in an era of significant funding for MT research in the USA.
Researchers believed they could produce a fully automated MT system within three to
five years. This endeavour proved more difficult than expected, however, and ten years
later funding was cut when it became clear that the development of MT had not
progressed as far as originally hoped.
Early attempts at MT typically failed because of a lack of coverage. The models
functioned by encoding a limited selection of transformational rules that simply did not
provide for the diversity of natural language translation. Consequently, the first attempts
in the 1970s and 1980s to commercialise MT operated by drastically increasing the
number of encoded transformational rules. This produced Rules-Based Machine
Translation (RBMT), which functioned relatively successfully with targeted human
feedback over a particular domain. However, this led to the further problem of how to
make the abundance of transformational rules needed to encode language pairs cooperate with one another. The answer was a statistical approach to MT.
In the late 1980s, computational power increased and became less expensive and as a
result interest picked up in Statistical Machine Translation (SMT). From the 1990s,
statistical learning approaches came to the fore, led by cutting-edge work from the IBM
research team. SMT systems no longer required the same human effort to encode
transformational rules and update lexicons and terminology lists, but rather exploited
the wealth of existing translations, covering numerous language pairs, to extract rules
based on statistical probability.
Since the 1990s, SMT has been pushed forward through intensive research and
training as well as support from Industry, Defence Advanced Research Projects Agency
(DARPA) and EC FP7. Statistical MT has been deployed in real-world, commercial
contexts by Language Weaver (now part of SDL), Google, Microsoft and IBM, and
there is on-going research into hybrid phrase-based and syntax-based MT. In 2011,
SMT was boosted with Google's announcement that it would charge for access to the
Google Translate API. Shortly afterwards, Microsoft also announced that it would start
charging for use of the Microsoft Translator API. These two events can be viewed as a
key milestone for the Machine Translation Industry and the Localisation Industry as a
whole. The progression to a paid API model for machine translation is a clear sign that
both the use and the quality of MT have matured to a level where enterprises and
developers see sufficient value in MT to invest in it.
After many decades, it appears that the models used in MT are more in line with our
understanding of how human language cognition and processing operates. This does
not mean that MT output is of an equal standard to output produced by the human
brain. However, we now understand more about what MT can contribute to the
Localisation Industry and have an invaluable tool for translation that is becoming ever
more prominent in the field.
MT accuracy is improving every year and many new techniques are being developed
and deployed as the field becomes more and more interdisciplinary, drawing from
computer science, linguistics, probability theory, algorithm design, automata theory and
engineering.

Some facts about MT today

Of the top 50 global companies, 53% publicly acknowledge that they use an MT solution

54% of non-Anglophones use MT when visiting English language websites

75% of people use free MT tools

It is estimated that at least three-quarters of web users take advantage of free
translation tools due to the greater accessibility and integration of MT solutions.

2.3 A short history of MT at SDL


SDL first adopted MT into Translation Services in the year 2000 after acquiring a
Rules-Based Machine Translation (RBMT) engine from Transparent Language, which
became SDL Enterprise Translation Server (ETS). In 2004, the Knowledge-based
Translation System (KbTS) Group was set up to use ETS in a high quality translation
process.
In 2009, Statistical Machine Translation (SMT) was beginning to establish itself firmly in
the localisation industry following rapid development. SDL forged a strategic
partnership with leading SMT developer, Language Weaver, allowing SDL to extend
the languages supported by MT.
In 2010, SDL acquired Language Weaver and are continuing to invest heavily in the
development of SMT technology. SDL rolled out this capability to their Production
Offices which resulted in a huge increase in scalability and allowed the process to grow
rapidly. KbTS was re-branded in 2011 to iMT (intelligent Machine Translation) and the
first post-editing projects were rolled out using SDL Language Weaver SMT.
Today the SDL iMT department consists of an in-house team of language specialists,
MT scientists and project managers, supplemented by trained teams in the Production
Offices plus a large fully-trained freelance post-editing team. The iMT team are
responsible for the maintenance of MT engines and for all MT evaluations and
customisations within SDL Global Solutions. The Project Management team manages
the set-up of projects, plans and schedules the customisations. The linguists are
responsible for evaluating the project data for MT suitability based on the content to be
translated. Once the project is approved for MT, the linguists prepare the data, test the
results and organise training for the linguistic team in the Production Office as well as
the freelancers who will work on the project. This approach of preparation, testing and
training helps guarantee a high quality MT engine and therefore a high quality final
translation.
And as for the future, developments within MT are made through improved models and
algorithms as well as by adding more high quality training data. SDL is constantly
working on improvements to the machine translation technology so that even better MT
engines can be created going forwards. The future for MT at SDL is full of possibilities
and iMT will be on-hand to offer its many years of expertise as the range of MT
solutions increases.

Brief Timeline of MT at SDL

2000  Acquisition of Rules-Based Machine Translation (RBMT) engine from Transparent
      Language: SDL Enterprise Translation Server (ETS)
2004  Knowledge-based Translation System (KbTS) Group set up to use ETS in a high
      quality translation process
2009  Partnership with Language Weaver (LW)
2009  Training from LW on how to customise SMT engines
2010  Rollout of post-editing process to Production Offices
2010  Due diligence and acquisition of Language Weaver
2011  Re-branding of KbTS to iMT
2011  First iMT projects using SMT
2013  Continued development of SMT within SDL

3 Post-editing versus Translation


3.1 Global developments and the localisation industry


An increasing number of companies are entering the international market and are
publishing localised materials in a bid to reach more customers and realise greater
sales opportunities. This is based on the finding that 85% of consumers feel that having
pre-purchase information in their own language is a critical factor in buying services.
IBM estimates that 2.5 quintillion (10^18) bytes of data are created every day and that
90% of corporate data originated in just the last three years. On average, companies
translate this content into 11 languages. At the same time, strong competition and the
need for faster turnaround times mean that there is an immediate need to lower costs
and achieve savings through efficient and streamlined technology processes.

Key trends impacting on global businesses

Business globalisation
Internet use of multiple devices
Explosive growth in digital content
Effective targeting and revenue capture
Growth of translated content
Multimedia and video
Extreme brand management across all channels
Social media and community

Many of the recent trends affecting global business and information management will
have important consequences for the field of translation in the coming years. By the
end of 2014 there will be 2 billion users of computers and most of the growth forecast is
in the upcoming markets. This means that there will be more customers for software
and appliances and consequently a larger need for translations of user interfaces and
manuals.
In addition, by 2014, there will also be 2.5 billion users of the internet, which is 36% of
the world's population, compared with 22% in 2010. Information equivalent to 10 billion
DVDs will be sent over the internet each month. Not everyone will be able to access the
information in the language of origin and consequently there will be a larger demand for
translations in order to make information as widely accessible as possible.
Furthermore, Cloud Computing has also begun to make an impact in the technologies
industry. The use of the cloud is growing, and more and more users will need
translations of the materials and content. The user interfaces will also require
translation as the number of end-users with different language requirements grows.
Thus, the demand for translation of both content and the interface itself is steadily
increasing.
Finally, social networking tools are rapidly increasing in popularity. The content lacks
specific structure and often involves interaction between users in various languages.
Companies are increasingly adopting social networking and professional use will
ultimately mean that more translations are needed and in a shorter time (in fact, often
in real time, as and when content is created). Again, this will result in a greater need for
translation.
In all of the above, the importance of English as a global lingua franca is slowly
decreasing. Between 2000 and 2010, the two languages with the greatest growth on
the internet were Arabic and Mandarin Chinese, both of which grew twentyfold. In
contrast, content in English only tripled. Proportionally, then, English is declining in
importance relatively quickly. It is estimated that by 2020 English will have lost its status
as a lingua franca altogether. However, rather than being replaced with another natural
language, linguistic diversity will be the new status quo and translation will be key to
communication. In summary, then, there will be an increasing demand for more content
at greater speed and in an increasing number of languages.
So the question is, how can MT and post-editing help respond to these trends?

3.2 Why post-edit?
In the last few years, there have been significant developments in MT technology. SDL
has always been up to date with this development, and uses MT mainly to increase
efficiency whilst still delivering quality. This is achieved through integration of the MT
engines with SDL's translation environments (SDL Trados Studio, TMS and
WorldServer), which results in a streamlined process, leading to faster turnarounds
and higher cost-effectiveness.
A growing number of SDL's customers and freelance translators now rely on MT for a
high-quality, integrated translation process. Customised machine translation engines
deliver output of such good quality that post-editing is faster than translating from
scratch. Indeed, MT solutions can reduce production times by as much as 50% in some
cases. As such, many clients consider MT the only viable way to process the enormous
volume of content they need to localise. Moreover, in certain cases, it allows the client
to consider translating content that they would not otherwise have tackled as the cost
would have been prohibitive.
However, post-editing is not only of value to the client but also has many advantages
for the translator. SDL's intelligent Machine Translation will help freelance translators to
remain competitive and save time. We combine our SMT technology with project-specific
Translation Memories to produce translations of post-editable quality that can
help to increase productivity. Post-editing is not inferior to conventional translation but
requires all the usual translation skills, such as domain knowledge, excellent
command of the source and target language and proficiency with CAT tools, plus a
willingness to embrace new technological advances.
The demand for MT solutions is growing quickly and post-editing is rapidly becoming a
basic skill for translators. Learning how to post-edit will give linguists a foothold in an
evolving market and open up new freelance possibilities. We have seen a real swing in
attitudes in the last few years with many clients looking to MT as the default option to
help deliver translation faster and cheaper without sacrificing quality.
In summary, the following client and translator benefits apply:

Client benefits
Lower cost
Faster time to market
Publishable quality
Higher volumes for translation
Ability to handle digital content explosion

Translator benefits
A valuable new skill that opens more opportunities
Competitive edge in an evolving market
Greater speed and efficiency
Higher volumes compensate for lower post-editing rates

3.3 Why translate?
Whilst post-editing can provide a number of benefits for clients and translators alike, not
all projects will be suitable for post-editing. Because MT typically reproduces the
material used to train the engine, previously unseen material can present difficulties.
This is particularly common in text types with highly complex sentence structures or
very specific terminology and texts with a high amount of ambiguity which require
translations to move away from the source.
At SDL, all content is evaluated carefully before a project or part of a project is
considered for MT. Machine Translation technology is improving all the time and
content types that were not suitable two years ago are now handled very productively
using Machine Translation. In some cases, however, conventional translation will still
be the recommended solution for the foreseeable future.

4 MT Technologies
4.1 The challenges of MT
MT shares many of the challenges of human language translation. These include the
ambiguity and polysemy of natural human language as well as the high levels of
linguistic diversity between languages. In particular, where there is variation in the
morphological or syntactic characteristics of a language, it becomes much harder for MT
to match the source and target phrases. Given that no linguistic information is encoded
into the statistical model, this often presents problems.
Some of the main issues and active research problems for MT (as well as conventional
translation) are summarised below:

The challenges of MT

Domain and genre: vocabulary, style (including active vs. passive) and sentence length
will vary accordingly.

Ambiguity: human language is ambiguous on both lexical and syntactic levels
E.g. "bank" can be the financial institution or the edge of a river
E.g. "I saw the man with the telescope" - Is it the man or the speaker who is holding
the telescope?

Variation in morphology and word order
E.g. case and definiteness endings in Hungarian and Swedish
E.g. Verb - Subject - Object order in Arabic and Hebrew

No one-to-one translation: a word that covers many social, cultural and linguistic
meanings in one language may require finer distinctions in another language and vice
versa
E.g. politeness levels in Japanese
E.g. German "Tasse" = English "mug" or English "cup"

Idioms: difficult to translate, like any other form of formulaic language
E.g. French "Avoir les dents longues" = English "To be ambitious" (Lit: "To have long
teeth")

Language-specific characteristics
E.g. Arabic tokenisation, Chinese word segmentation, etc.

4.2 Rules-based Machine Translation (RBMT)


Chronologically speaking, Rules-Based Machine Translation (RBMT) was the first
approach to automated translation. It involves parsing a source sentence, analysing the
structure, converting this to a machine-readable code and then transforming it into the
target.
The core system is based on a set of grammatical rules for each of the languages,
combined with a dictionary. The dictionary contains source words and phrases, their
translations and detailed grammatical information, such as part of speech and
inflection. It provides the modules with the linguistic knowledge they need.
The rules are the linguistic processor of the system, responsible for analysis and
generation. They use linguistic information stored in the dictionary. These rules are
intended to represent the grammatical knowledge of speakers and specify inherent
agreement and relational information.

Example
Determiner and noun need to agree in number and gender
Subject and finite verb need to agree in number
At the translation stage, the MT engine analyses each source sentence and tags the
words and phrases with their part of speech to identify the grammatical components, for
example, the subject, object and verb. The MT system then looks up the translations of
these grammatically tagged words and phrases in the machine dictionary and
combines them using the coded language rules for the target language. This builds the
translated sentence.
A large core dictionary provides the translations for everyday words and phrases. For
translations that use special terminology, an RBMT system can use custom dictionaries
in conjunction with the baseline to improve translation accuracy.
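
The analyse, look up and combine steps described above can be sketched in a few lines of Python. This is a toy illustration only and does not represent any real RBMT engine: the three-word dictionary, the part-of-speech tags and the single reordering rule (adjective after noun in French) are all assumptions invented for the example.

```python
# Toy rules-based translation, English -> French.
# The dictionary entries and the single reordering rule are illustrative
# assumptions, not taken from any real RBMT system.

# Machine dictionary: source word -> (part of speech, translation)
DICTIONARY = {
    "the": ("DET", "le"),
    "blue": ("ADJ", "bleu"),
    "sensor": ("NOUN", "capteur"),
}


def analyse(sentence):
    """Tag each source word with its part of speech via dictionary lookup."""
    tagged = []
    for word in sentence.lower().split():
        pos, _ = DICTIONARY.get(word, ("UNK", word))
        tagged.append((word, pos))
    return tagged


def transfer(tagged):
    """Look up translations, then apply the target-language ordering rule."""
    # Unknown words are passed through untranslated.
    target = [(DICTIONARY.get(word, (pos, word))[1], pos) for word, pos in tagged]
    i = 0
    while i < len(target) - 1:
        # Rule (assumed): an adjective follows the noun it modifies in French.
        if target[i][1] == "ADJ" and target[i + 1][1] == "NOUN":
            target[i], target[i + 1] = target[i + 1], target[i]
            i += 2
        else:
            i += 1
    return " ".join(word for word, _ in target)


def translate(sentence):
    return transfer(analyse(sentence))


print(translate("The blue sensor"))  # le capteur bleu
```

Because every word has exactly one dictionary entry, the output is fully systematic: the same source always gives the same target, and a word missing from the dictionary is simply copied across. This mirrors why project dictionaries matter so much for RBMT quality.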

How to recognise RBMT output

The RBMT output is based on three factors:

Rules for the language pair

General settings that can be customised (such as quotation marks, verb tense,
accents, decimal point)

The project dictionary, where the specific terminology is entered and which is key
to improving the MT quality.

Some common issues can be identified when post-editing rules-based machine
translation. Here we include some examples from English into French, Italian, Spanish,
Portuguese, Dutch, German, Swedish, and Finnish, which are the most common
languages for RBMT.
In order to recognise MT error patterns, post-editors should look out for the following
potential issues when post-editing.

Use of superfluous articles

Superfluous articles are commonly added in most languages; these can also occur
before proper nouns.

EN Source: Free High Speed Internet Access!
IT MT output: l'Accesso gratuito a internet ad alta velocità!
IT Post-edited: Accesso gratuito a internet ad alta velocità!

EN Source: Oil filter unit: Removal - Refitting
FR MT output: Bloc filtre à huile : La dépose - la Repose
FR Post-edited: Bloc filtre à huile : Dépose - Repose

Use of simple prepositions

When a term has not been entered in the Customised Dictionary, simple prepositions
are used and they should be corrected when needed.

EN Source: Reconnect ECT sensor electrical connector.
FR MT output: Reconnecter le connecteur électrique de capteur ECT
FR Post-edited: Reconnecter le connecteur électrique du capteur ECT

Acronyms automatically translated into terms

When a specific acronym has not been entered in the Customised Dictionary it is
automatically and consistently translated into a common term which exists in the Core
Dictionary.

EN Source: MR
IT MT output: Sig.
DE MT output: Herr
FR MT output: M.

Proper nouns translated literally

EN Source: Thanks to Peter Ferry for reporting the VBScript/Jscript Buffer Overrun
Vulnerability.
IT MT output: Grazie al Traghetto di peter per segnalare la Vulnerabilità legata al
sovraccarico del buffer di VBScript JScript.
IT Post-edited: Grazie a Peter Ferry per aver segnalato la vulnerabilità legata al
sovraccarico del buffer di VBScript JScript.

EN Source: He lives in Palm Springs.
FR MT output: Il habite Printemps de Paume.
FR Post-edited: Il habite Palm Springs.

Capitalisation issues
The MT follows the source capitalisation, unless specific terms have been entered in
the Customised Dictionary with the required capitalisation (a problem especially in IT
texts, e.g. UI options).

EN Source: Click Add Custom Phone Tune.
FR MT output: Cliquez sur Ajoutez l'Air Personnalisé de Téléphone.
FR Post-edited: Cliquez sur Ajouter une mélodie de téléphone personnalisée.

EN Source: Select the appropriate option in the Automatic Synchronization section
PT-BR MT output: Selecione a opção apropriada na seção Sincronização Automática
PT-BR Post-edited: Selecione a opção apropriada na seção Sincronização automática

Disambiguation of homographs
You may encounter what we call homograph resolution issues. A homograph is a source
term that can be translated as a noun and as a verb (or an adjective, etc.), for example
NETWORK (a network, to network/networking).
When there is a homograph resolution issue, the entire syntax is misanalysed.
In the following examples the nouns are interpreted as verbs:

EN Source: Check box D6 on the blue label
DE MT output: Kasten D6 auf dem blauen Aufkleber prüfen
DE Post-edited: Kontrollkästchen D6 auf dem blauen Aufkleber

EN Source: The water reservoir does not contain enough water.
PT MT output: O reservatório de água não contém suficiente aguar.
PT Post-edited: O reservatório de água não contém água suficiente.

Compound formation and hyphenation issues


For some languages such as German and Finnish compounding rules may work. If they
do not work, the post-editor must amend accordingly and the term should get encoded.

RBMT Pros and Cons

Pros
A lot of control of rules and terminology
Once the grammar is established, new projects can be created from scratch relatively
quickly
Once set up, projects are easy to maintain
Consistent use of terminology

Cons
The grammar is very time-consuming to develop
Rather literal translations
Not sensitive to context

RBMT allows for excellent terminology control. There is no need for pre-existing TMs
as project dictionaries can be created from scratch and the output is systematic, rightly
or wrongly, meaning that experienced post-editors can post-edit quickly and reliably
with time. However, it can take a number of years to develop a new language pair and
the source must be well-written to generate good output. Moreover, project dictionaries
are time-consuming to create and therefore expensive to maintain and output is often
not very fluent and not sensitive to context, providing a single translation per term.

4.3 Statistical Machine Translation (SMT)


A Statistical Machine Translation (SMT) system learns to translate by analysing large
volumes of previously translated content. The starting point for training an engine is an
aligned corpus of source and translated sentences of hundreds of millions of words.
The training process subdivides each of the source sentences into words and series of
words (n-grams) and analyses the associated translated sentences. In this way the
training process determines for each n-gram in the source the most likely set of
translations. By analysing just the translated content, the training process learns the
order in which the translated words are most likely to occur. The more training data and
the more consistency there is in this data, the more accurate the process becomes.
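
As a rough illustration of the subdivision into n-grams described above, the Python sketch below splits sentences into word sequences and counts how often each source n-gram appears alongside each target n-gram in an aligned corpus. The three-sentence corpus and the function names are invented for the example, and raw co-occurrence counting is a deliberate simplification: a production SMT system uses far more data and more sophisticated alignment statistics.

```python
from collections import Counter, defaultdict


def ngrams(sentence, max_n=2):
    """Return every run of 1..max_n consecutive words in the sentence."""
    words = sentence.split()
    result = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            result.append(" ".join(words[start:start + n]))
    return result


def cooccurrence(corpus, max_n=2):
    """Count, per source n-gram, the target n-grams seen in the same pair."""
    counts = defaultdict(Counter)
    for source, target in corpus:
        target_ngrams = set(ngrams(target, max_n))
        for source_ngram in set(ngrams(source, max_n)):
            counts[source_ngram].update(target_ngrams)
    return counts


# Hypothetical aligned corpus of (source, translation) sentence pairs.
corpus = [
    ("the sensor", "le capteur"),
    ("the filter", "le filtre"),
    ("remove the sensor", "déposer le capteur"),
]

print(ngrams("reconnect the sensor", max_n=3))
# ['reconnect', 'the', 'sensor', 'reconnect the', 'the sensor', 'reconnect the sensor']

counts = cooccurrence(corpus)
print(counts["sensor"]["capteur"], counts["sensor"]["filtre"])  # 2 0
```

Even in this tiny corpus, "sensor" is seen with "capteur" twice and never with "filtre", which is the kind of evidence the training process accumulates. It also shows why volume and consistency matter: "sensor" co-occurs with "le" just as often as with "capteur", and only more data can separate the two.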
In the next stage of the process, the system compiles all of the learned data into the
runtime MT engine. The runtime MT engine subdivides each sentence into smaller
chunks and looks up the possible translations in the compiled database. For a given
source sentence this process results in many possible translated sentences. The MT
engine uses the statistical data on the probability of a translation and the word order to
determine the best candidate for the MT output.
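
The selection of the best candidate can be pictured as a scoring step. In the minimal sketch below, each candidate carries a translation probability and a word-order probability, and the engine keeps the candidate with the highest combined score. The candidate sentences and all probability values are made up for illustration, and simply multiplying two probabilities is an assumption that stands in for the richer weighting a real engine applies.

```python
import math


def score(candidate):
    """Combine translation and word-order probabilities in log space."""
    return math.log(candidate["p_translation"]) + math.log(candidate["p_order"])


# Hypothetical candidates for the source "the blue sensor".
candidates = [
    {"text": "le capteur bleu", "p_translation": 0.20, "p_order": 0.30},
    {"text": "le bleu capteur", "p_translation": 0.20, "p_order": 0.01},
    {"text": "la capteur bleu", "p_translation": 0.05, "p_order": 0.02},
]

best = max(candidates, key=score)
print(best["text"])  # le capteur bleu
```

The second candidate uses the same word translations as the first but loses on word order, which is how statistics learned from the translated content alone can steer the output towards fluent target-language phrasing.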
For general-purpose translations, the system uses a baseline language engine that is
trained with a large corpus of broad-spectrum content (hundreds of millions of words).
To enhance performance for applications that use specific terminology, an SMT system
can be trained with a corpus that contains only or mostly content that is close to the
data that is to be translated. An ideal corpus for this is a large Translation Memory (TM)
that contains the previous translations of a project. The recommended volume of data
required is 1 to 5 million words, although it is possible to work with less than 1 million.
This is known as customisation or training.
The quality of the MT output depends on both the linguistic and technical quality of the
material included. However, compared to RBMT, SMT provides a more fluent translation
with some context-sensitivity and better reflects the style of the training material.
SMT Pros and Cons

Pros

Customisation times are quicker than with RBMT
Output reads more fluently and is stylistically better than the output from a rules-based system
Able to select the correct translation in certain contexts, e.g. "device" in the IT domain
Generally shorter setup times

Cons

Need for large bilingual corpora (millions of words)
Difficult to maintain (retraining requires a large amount of content, which takes time to gather)
Need for processing time: file processing times are higher, with an impact on hardware costs

Compared with RBMT, Statistical Machine Translation can offer a larger number of
languages for post-editing as engines are lower cost and faster to train, as well as
easier to maintain. Moreover, because SMT is trained with real sentences and
phrases the direct output can be more fluent than with RBMT, which is good for raw
output requirements and additionally helps the post-editor. In addition, there is a high
level of research activity surrounding SMT and performance improvement is predicted
for the future. For this reason, SMT is the technology of choice at SDL.
However, it should nonetheless be noted that SMT requires large amounts of memory
space and processing capacity, though this becomes less of a problem with
technological developments. Moreover, the output is dependent on the quality and
volume of data used for the customisation, and therefore the post-editor must be aware
of the range of common trends in order to post-edit accurately. Similarly, it is harder to
implement changes in terminology made by the client than with RBMT and a project
specific engine can only be created if there is sufficient data as a starting point.

Syntax-based SMT pros and cons

Syntax-based translation is based on the idea of translating syntactic units, rather than
single words or strings of words. A syntax-based statistical engine can improve
grammatical accuracy and ensure that verbs are realised in the correct position.

Pros

Better modelling of target language structure
Ensures there is always a verb present
Realises the verb in the correct position
Better handling of function words, such as prepositions
Has a more powerful decoding algorithm

Cons

Early stages of development
Sometimes less accurate terminology as no link to baseline

The following table summarises the key differences between SMT and RBMT:

Attribute / SMT / RBMT

Does not need a large volume of aligned data for training/customisation
Number of languages supported
Setup time for new language
Terminology control
+
Software UI term handling
+
Raw fluency
+
Raw accuracy
+
Level of research activity and performance improvement predicted
4.4

Hybrid Systems
One thing that is being explored in contemporary research into MT technology is the
possibility of creating a hybrid engine, where dictionaries, rules and statistical features
are combined so as to obtain the best of both worlds. This can be done in many
different ways; examples are the use of a dictionary to enforce certain translations in
SMT and the use of statistical techniques to determine the best translation for a
homograph such as "bank" or "get", where the translation is different depending on the
context.
However, current solutions are fairly pragmatic and leave room for further development
in future. In some cases, hybrid systems do not back up to a baseline and this can
exacerbate common MT issues, such as terminology inconsistencies and/or content left
untranslated.


5 How the MT output is created


Statistical MT is now the technology of choice at SDL, so this course will now
concentrate on SMT technology.
SDL takes a three-pronged approach to SMT and uses the following different engine
types, matching the solution to the particular use case:

Baselines
Generic engines containing diverse data

Verticals
Domain-specific engines

Customised engines
Content trained for specific client corpus

5.1

Baselines
The core MT engines developed by SDL are known as baselines. These baseline
systems are bilingual corpora used as general databases for each language pair. They
are based on a large translated corpus of hundreds of millions of words, taken from
reliable sources available in the public domain, such as news, IT documentation,
technical manuals and publicly available government material, and distributed across
various domains, including IT, automotive, news, sports, electronics, etc.
Baselines are under constant development and new releases are launched frequently.


This solution produces good results for clients who require immediate access to MT,
who do not have sufficient volumes of data and/or wish to translate general content
across several domains.
Client-specific customisations and domain-specific verticals normally use baseline
engines as a backup; so if a certain word, phrase, or even grammatical structure is not
present in the training data, the engine may still be able to produce a translation.

Baselines Pros and Cons

Pros

Cons

5.2

Verticals
A vertical is a trained statistical engine that exclusively contains data related to a
specific subject area, or domain, such as IT, Automotive, Electronics etc. When a client
does not have enough translated data to be used for a client-specific training, a vertical
solution can be used instead of a customisation on top of the baseline corpus.
These domain-specific engines therefore provide a point of entry for projects that have
small TMs. They also prove useful in those cases where there is not enough time to
create a project-specific engine before the first jobs start to flow in. Because the vertical
is a ready-to-use solution, it does not have the development effort involved in creating
client-specific engines.
Based on the higher volume of data used in a Vertical when compared to a
customisation, the engine is less likely to take translations from the baseline and
therefore less likely to produce a general instead of a more specific technical
translation. However, as the data for the Vertical comes from different sources within
a domain, inconsistencies in style and terminology are also more likely, and these will
need to be checked during the post-editing and quality-checking stages.

SDL Verticals are available for the following domains in a wide number of languages

Automotive Vertical

Consumer Electronics (CE) Vertical

HiTech (IT Hardware) Vertical

Travel Vertical

These engines are always under development and, whenever there is a considerable
amount of new data and/or new technical features that can enhance the overall
performance of the engine, they are retrained to improve the overall quality of the MT
output.


The vertical retraining process is designed to increase productivity when working with
vertical output. However, if a client prefers a specific translation for a certain term which
was correct in the original vertical, a retraining might mean that this term could be
changed to a more widely used translation. This will need to be corrected during post-editing and we recommend adding terms like this to your QA check.

Verticals Pros and Cons

Pros

Cons

5.3

Customisations
A customisation is a trained statistical engine that only (or mainly) contains client-specific corpora. It involves preparing client-specific TMs in order to get the best MT
output for production. The recommended requirement for a successful customisation is
an aligned corpus of 1 million words of relevant customer data, although this may vary
per project and language pair, and it is possible to create a customisation with lower
volumes of customer data.
Using this type of material guarantees adherence to client-specific terminology and
style.


As the machine translation output is fully based on the bilingual corpus, with no
syntactical or lexical data added, the quality of the output can only be as good as the
quality of the corpus. If the corpus data has inconsistent terminology and/or style, the
resulting MT may also be inconsistent. That is why it is important that the linguist
responsible for the customisation chooses suitable data to be added to the SMT engine
training.

Customisation Pros and Cons

Pros

Cons

5.4

Engine training process


When a project is sent to iMT, all the necessary data is collated, including project
TMs, sample files, project information, etc. The next step in the process is to evaluate
the source text and establish if it is suitable for machine translation. A source evaluation
will also allow the linguist to identify any possible issues with the use of MT on the
project, so that action can be taken during engine creation to try to minimise those
issues. If the data is suitable, then the TMs are prepared for training the engine. SMT
engine training is an iterative process, and involves the following steps:

TM cleaning
Creation of training
Selection of test sentences
Testing

TM cleaning
Data cleaning is a process applied to the training corpus in order to make it compatible
with the platform where the SMT engines are created. This process improves the
quality of the data by removing content which could adversely affect the MT output,
such as tags, entities, misaligned segments, and corruptions. Such content could appear in the
output and cause a drop in productivity. Some parts are also harmonised so that the
MT output is faster to post-edit, as fewer changes will be required.
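To make the cleaning step more concrete, here is a minimal sketch of the kind of filtering described above. It is an illustration only, not the actual cleaning tool: the function names, the tag pattern and the `max_ratio` length threshold are all assumptions chosen for this example.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # simple pattern for inline tags (assumption)

def clean_segment(text):
    """Strip tags, decode entities and normalise whitespace in one segment."""
    text = TAG_RE.sub(" ", text)   # remove inline tags
    text = html.unescape(text)     # decode entities such as &amp;
    return " ".join(text.split())  # collapse runs of whitespace

def clean_tm(pairs, max_ratio=3.0):
    """Clean (source, target) pairs and drop empty or suspicious ones."""
    cleaned = []
    for source, target in pairs:
        source, target = clean_segment(source), clean_segment(target)
        if not source or not target:
            continue  # an empty side cannot be used for training
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_ratio:
            continue  # very different lengths suggest a misaligned pair
        cleaned.append((source, target))
    return cleaned

tm = [
    ("Press <b>Start</b> &amp; wait", "Appuyez sur Démarrer et patientez"),
    ("OK", "Ceci est un segment manifestement mal aligné"),
]
print(clean_tm(tm))
```

A length ratio is only a crude signal of misalignment; in practice the linguist still reviews the data selected for training.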

Creation of training
During a customisation, several trainings with different combinations of data may be
uploaded to the system and then evaluated so the iMT team can select the one that
delivers the best results. A second trial is based on the results of the first one: the
problems found in the output are traced back to the TM data, which is then manipulated
further to try to solve the issues. The training with the best results is then deployed for
production.

Selection of test sentences


For MT testing purposes, the linguist selects a set of sentences which do not appear in
the corpus which will be uploaded to the SMT system. Ideally, the sentences should be
taken from new untranslated project files, as this is the best way to reproduce a real
translation scenario and really test the engine to the max.
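The key requirement, that test sentences must not already appear in the training corpus, can be sketched as follows. The function names and the simple lowercase/whitespace normalisation are assumptions for this example; a real selection would also consider sentence length and content type.

```python
def normalise(sentence):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())

def select_test_sentences(candidates, training_sources, size):
    """Pick up to `size` candidate sentences that are absent from the training data."""
    seen = {normalise(s) for s in training_sources}
    selected = []
    for sentence in candidates:
        key = normalise(sentence)
        if key not in seen:
            selected.append(sentence)
            seen.add(key)  # also avoids duplicates inside the test set
        if len(selected) == size:
            break
    return selected

new_file = ["Stop the engine.", "Remove the key.", "remove  the KEY.", "Check the oil."]
print(select_test_sentences(new_file, ["stop the engine."], size=2))
```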

Testing
One of the biggest challenges within the MT industry at this point in time is to find an
automatic measure that will be able to forecast if a particular MT output will be able to
reach the particular user's goal. Achieving this objective is particularly difficult as there
are no unique solutions in translation. Many translations may be right for one sentence
and even more translations can be wrong. Since an automatic assessment of MT
output quality is generally based on comparing the MT to reference translations, finding
an automatic procedure to determine the MT output quality is a challenging task where
a lot of work is currently being concentrated.
Nowadays, many MT providers choose between human and automatic evaluations (or
a combination of both).
Human evaluation is normally centred on Likert-based scales. With this method,
resources are asked to score aspects of the MT output by following a list of parameters
associated with a numerical scale. For example, score 5 if the output is entirely correct,
score 4 if the output is understandable but has grammatical errors, and so on. This kind of
assessment mainly focuses on understandability, although some vendors have started
looking into Likert-based scales that could help assess the post-editing effort. Human
evaluation can also be used to compare two or more MT engines or systems, and is
based on the evaluator stating their preference between two or more MT outputs
generated for the same source sentences.
Some of the disadvantages inherent with human evaluation are:

Performing this kind of test is relatively expensive and time-consuming, as
several resources are required for assessing each and every engine.

Human evaluations are prone to subjectivity and final assessments may not be
consistent after all.

Resources need to be familiar with the scales and follow them to the letter in
order to obtain valid results.

However, when done well, a human evaluation is still often considered to be more
reliable than automated measures, and has the added advantage of a human translator
being able to provide useful comments on the issues found in the MT output.
The productivity increase, though, is still a difficult factor to predict for all cases, as
productivity may vary per job and also per resource (it varies with post-editing
experience, for instance). Most productivity tests in the industry are based on a
combination of measuring post-editing speed, and post-editing effort, or comparing
post-editing speed with conventional translation speed.
In recent decades, many measures for automated evaluation have been proposed.
Most automated measures assess the quality of the machine translation compared to a
reference translation which is deemed to be high quality. Some of the most widely
used ones are detailed below.

BLEU (Bilingual Evaluation Understudy) score: this algorithm is meant to evaluate the
quality of text which has been machine-translated. The central idea behind BLEU is:
"the closer a machine translation is to a professional human translation, the better it is".
For that, scores are calculated for individual translated segments (generally sentences)
by comparing them with a set of good-quality reference translations. Those scores
are then averaged over the whole corpus to reach an estimate of the translation's
overall quality. Intelligibility and grammatical correctness are not taken into account
explicitly; they are assumed to be reflected in the correct reference translations.
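The comparison at the heart of BLEU is an n-gram precision in which each n-gram of the MT output is credited at most as often as it occurs in a reference. The sketch below shows only that building block for one segment; it is a simplification, as the full metric also combines several n-gram lengths and penalises output that is too short. The function names are chosen for this example.

```python
from collections import Counter

def ngram_counts(words, n):
    """Count the n-grams of length n in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in the references, with clipped counts."""
    cand = ngram_counts(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

reference = ["the cat is on the mat"]
print(modified_precision("the cat sat on the mat", reference, 2))  # 0.6
print(modified_precision("the the the the", reference, 1))         # 0.5
```

The second call shows why counts are clipped: repeating a common word cannot inflate the score beyond the number of times the reference actually uses it.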

NIST: the name of this metric comes from the US National Institute of Standards and
Technology. This measure is based on the BLEU score, but it differs from this algorithm
in several points.
Whilst BLEU simply calculates how many n-grams match both in the reference
translation and in the MT output and gives these n-grams the same weight, NIST also
calculates how informative a particular n-gram is. When a correct n-gram is found, the
algorithm measures if that combination is a common sequence in the corpus material or
if, on the other hand, that fragment is not that common in the data. Depending on the
result, an n-gram will be given more or less weight. To give an example, if the bigram
"on the" is correctly matched, it will receive lower weight than the correct matching of
bigram "interesting calculations", as this is less likely to occur.
NIST also differs from BLEU in how some penalties are calculated. For example, small
variations in translation length do not impact the overall NIST score as much as in
BLEU.
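The weighting idea can be illustrated with a small sketch. It uses one common formulation of the information weight for a bigram, the base-2 logarithm of how often the first word occurs divided by how often the whole bigram occurs in the reference data. The function name and the four reference sentences are invented for this example.

```python
import math
from collections import Counter

def bigram_information(reference_sentences):
    """Information weight per bigram: log2(count(first word) / count(bigram))."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in reference_sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return {bg: math.log2(unigrams[bg[0]] / count) for bg, count in bigrams.items()}

references = [
    "the report is on the table",
    "the cat sat on the mat",
    "interesting calculations are on the left",
    "interesting results and interesting ideas",
]
info = bigram_information(references)
print(info[("on", "the")])                    # 0.0: "on" is always followed by "the"
print(info[("interesting", "calculations")])  # about 1.58: a rarer continuation
```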

METEOR (Metric for Evaluation of Translation with Explicit ORdering): this metric was
designed to address some of the problems found in the more popular BLEU metric, and
also produce good correlation with human judgment at the sentence or segment level
(this differs from the BLEU metric in that BLEU seeks correlation at the corpus level).
For that, several features that had not been part of any other metrics at the time were
introduced. Matches in METEOR are made by following the parameters below, among
others:
Exact words: as with other metrics, a match is made if two words are identical in the
machine translation output and the reference translation.
Stem: words are reduced to their stem form. If two words have the same stem, a match
is also made.
Synonymy: words are matched if they are synonyms of one another. Words are
considered synonymous if they share any synonym sets according to an external
database.
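The three matching stages can be sketched as below. This is only a toy alignment: `crude_stem` and the `SYNONYMS` dictionary are hypothetical stand-ins for the stemmer and the external synonym database a real implementation would use, and the scoring that METEOR applies after matching is not shown.

```python
# Hypothetical stand-in for an external synonym database
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}

def crude_stem(word):
    """Very rough suffix stripping, for illustration only."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def match_words(mt_words, ref_words):
    """Match MT words to reference words: exact first, then stem, then synonym."""
    stages = [
        ("exact", lambda a, b: a == b),
        ("stem", lambda a, b: crude_stem(a) == crude_stem(b)),
        ("synonym", lambda a, b: b in SYNONYMS.get(a, set())),
    ]
    unmatched_ref = list(range(len(ref_words)))
    matches = {}  # MT word index -> (reference word index, stage name)
    for name, same in stages:
        for i, word in enumerate(mt_words):
            if i in matches:
                continue
            for j in unmatched_ref:
                if same(word, ref_words[j]):
                    matches[i] = (j, name)
                    unmatched_ref.remove(j)
                    break
    return matches

print(match_words("the automobile walked".split(), "the car walks".split()))
```

Each later stage only looks at words that earlier stages left unmatched, so an exact match always takes priority over a stem or synonym match.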

TER (Translation Edit Rate): this metric measures the number of edits required to
change a machine translation output into one of the human references.
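A simplified version of this calculation is sketched below: word-level insertions, deletions and substitutions, divided by the length of the reference. Note the assumption: TER as usually defined also counts the movement of a block of words as a single edit, which this sketch leaves out, so it will overstate the score for reordered output. The function names are chosen for this example.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(
                prev[j] + 1,             # delete a word from the MT output
                curr[j - 1] + 1,         # insert a word from the reference
                prev[j - 1] + (h != r),  # substitute, or keep if identical
            ))
        prev = curr
    return prev[-1]

def simple_ter(mt, reference):
    """Edits per reference word (block shifts are not modelled here)."""
    hyp, ref = mt.split(), reference.split()
    return word_edit_distance(hyp, ref) / len(ref)

print(simple_ter("remove the key before working",
                 "remove the ignition key before working"))  # one edit over six words
```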

Levenshtein distance: this metric measures the similarity or the dissimilarity (distance)
between two text strings by calculating the minimum amount of single-character edits
(insertion, deletion, substitution) required to change one word into another. In the field
of machine translation, this can be done by comparing the raw MT output to the human
translation.
Let's look at a couple of examples:

The Levenshtein distance between "sport" and "short" is 1, because 1 edit is required
to convert one word into the other (replace p with h).
The Levenshtein distance between "dog" and "frog" is 2, as it is not possible to convert
the first word into the second with fewer edits (replace d with f and add r).
The distance can never exceed the length of the longer of the two input strings. Even
when two words have nothing in common, the minimum number of edits will not exceed
the number of characters in the longer string.
Example: for "computer" and "alibi", the Levenshtein distance is 8, the length of the
longer word:
replace c with a
replace o with l
replace m with i
replace p with b
replace u with i
delete t
delete e
delete r
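The word examples above can be checked with a short implementation. This is a standard dynamic-programming sketch of the character-level distance; the function name is chosen for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]

print(levenshtein("sport", "short"))     # 1
print(levenshtein("dog", "frog"))        # 2
print(levenshtein("computer", "alibi"))  # 8
```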
As with other automated measures, the results of the Levenshtein distance are not set
in stone. As mentioned before, there can be many correct translations for a single
source; however, the Levenshtein distance will not be able to measure quality on its
own. Results will vary, for example, if clauses are positioned differently in the MT output
and in the human reference translation.
Example:

MT: If I go home after 10pm, I will let you know.
Reference human translation: I will let you know if I go home after 10 pm.
In this case, the MT output is correct and no changes would be necessary during a
post-editing stage. However, the Levenshtein distance will be quite high, as many
changes would be required to turn the first sentence into the second one.
That suggests once more the importance of selecting large test beds to run any of
these automated evaluations on, as that will allow us to get more reliable results.

Automatic measures also have their limitations: the reference translation is not always
available, and those measures do not give an indication of post-editing productivity
expected. Therefore, they are useful for engine training development and comparison,
but not necessarily practical for a production scenario.
In January 2011, TAUS began working with a group of its enterprise members with a
clear objective in mind: to tackle the general problem of evaluating translation quality.
Consequently, the idea of the Dynamic Quality Evaluation Framework (DQF) was
born.
The framework is still in development, and will allow users to profile their content and
receive guidance on best-fit evaluation techniques. A knowledge base documenting
best practices provides detailed practical information on how to carry out seven specific
types of quality evaluation. By establishing best practices, metrics and benchmarks
within a dynamic framework, the project team sought to apply best-fit evaluation
approaches depending on content type and usage, moving away from the dated, static
"one size fits all" approach used by most vendors.


6 Using the MT Output: the basics of post-editing


6.1

Introduction to post-editing
Post-editing is a new phase that replaces conventional translation for MT projects. It is
a change in the process, but the working environment remains the same. The same
applications and the same reference materials used in a conventional translation
project are also used when post-editing. Machine translation is a new component in the
process that gives human translators more leverage along with the use of TMs.
Post-editors work in CAT tools, editing fuzzy matches from the TM and machine-translated segments to a publishable quality.
Post-editing is a skill which translators develop with time. Post-editors will not be fully
productive from day one as they need to learn their trade. Industry research has shown
that experience is the single most important factor in translation productivity and
becomes even more influential in post-editing. Over time, translators can adapt their
working practices to use the MT output to their advantage.


Integrating post-editing into a production environment

On a file for post-editing, the Translation Memory is applied as usual, to create the
100% matches and fuzzy matches. Machine translation is applied to any untranslated
text left after the TM is applied.
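The order of operations described above (TM first, MT for whatever is left) can be sketched as follows. Everything here is an assumption made for illustration: the `pretranslate` function, the 0.75 fuzzy threshold, the use of Python's `difflib` as a stand-in for a CAT tool's own fuzzy-match scoring, and the placeholder MT function.

```python
from difflib import SequenceMatcher

def pretranslate(segments, tm, machine_translate, fuzzy_threshold=0.75):
    """Apply the TM to each segment, then MT to segments without a usable match."""
    results = []
    for source in segments:
        best_src, best_score = None, 0.0
        for tm_source in tm:
            score = SequenceMatcher(None, source, tm_source).ratio()
            if score > best_score:
                best_src, best_score = tm_source, score
        if best_score == 1.0:
            results.append((source, tm[best_src], "100% match"))
        elif best_score >= fuzzy_threshold:
            results.append((source, tm[best_src], "fuzzy match"))
        else:
            results.append((source, machine_translate(source), "MT"))
    return results

tm = {"Stop the engine.": "Arrêtez le moteur.", "Remove the key.": "Retirez la clé."}
fake_mt = lambda text: "[MT] " + text  # placeholder for a real MT engine
for row in pretranslate(
    ["Stop the engine.", "Remove the keys.", "Check the oil level weekly."], tm, fake_mt
):
    print(row)
```

The post-editor then works through the fuzzy matches and the machine-translated segments in the same file.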
The post-editing phase itself involves a number of key stages. Since the post-editor is
attempting to be as efficient and productive as possible, preparation is key. Do not rush
ahead without taking time to consider the source and MT output. Determine the
useable parts and then build around these. Focus on accuracy, without under- or over-editing, and finally check over the grammar and the terminology. Post-editors are
generally advised that if the text scans well, it will flow well.

6.2

Degrees of post-editing
The market makes a distinction between post-editing to publishable quality and post-editing to an understandable level. Post-editing to publishable level is the highest
quality standard. This is in line with the expectations of the majority of SDL's clients.

After post-editing, files undergo a quality check to ensure that the translation is correct
and fluent. The final quality should be comparable to conventional translation.
Post-editing to understandable quality, or light post-editing, is normally required for low
visibility text, or texts that would not otherwise be translated for a client as it would be
too expensive and time-consuming. A client might decide to opt for understandable
quality texts in order to reduce the number of support requests for a product or to
provide an extra service to the user, for example. Typical purposes of understandable
quality texts include offering users a quick answer on how to fix an issue or providing a
translation solution for low visibility content, such as FAQs, blogs, and knowledge
bases.
When post-editing to an understandable level alone, it is less important to correct style
and grammar so long as the meaning of a translation is clear. Most important, however,
is to follow the clear project requirements that should always be provided by the client
in advance.

Examples of light post-editing

LP: IT-EN
SOURCE: Attrezzo di compressione per misurare la sporgenza delle canne dei cilindri (da utilizzare con 380000364 e piastre specifiche)
MT: Tools for compression to measure cylinder liner protrusion ( use with 380000364 and specific plates)
PE: Tool for compression to measure cylinder liner protrusion ( use with 380000364 and specific plates)
COMMENTS: The plural needs to be edited because "attrezzo" is singular in the Italian source, but there is no need to remove the space after the bracket.

LP: IT-EN
SOURCE: Prima di iniziare qualsiasi lavoro in quest'area, spegnere il motore ed estrarre la chiave di accensione.
MT: Always stop the engine and remove the Key before working in this area.
PE: Always stop the engine and remove the Key before working in this area.
COMMENTS: There is no need to change the uppercase to lower case.

LP: FR-EN
SOURCE: Si la valeur souhaitée n'est pas obtenue, répéter les instructions 3 à 5.
MT: If the desired pressure has not been reached, repeat instructions 3 to 5.
PE: If the desired pressure has not been reached, repeat instructions 3 to 5.
COMMENTS: "Required" would be better than "desired", but since this is perfectly understandable there is no need to change it.

LP: EN-DE
SOURCE: To remove the 3D diffuser:
MT: Zum Entfernen des 3D Refraktionstechnik:
PE: Zum Entfernen des 3D Refraktionstechnik:
COMMENTS: The MT has the wrong case: "des" instead of "der". But the MT sentence is perfectly understandable as it is.

LP: EN-FR
SOURCE: The pressure is reduced to pilot pressure.
MT: La pression est réduit à la pression pilote.
PE: La pression est réduit à la pression pilote.
COMMENTS: The gender agreement is wrong: it should be "réduite" instead of "réduit". However, the sentence is understandable as it is and does not need to be corrected.

Publishable quality vs. Understandable level

Publishable Quality

Most frequent form of post-editing
Generally used for higher visibility texts
Comparable to conventional translation
High quality expectations
Follows standard client expectations

Understandable Level

Less frequent form of post-editing
Generally used for lower visibility texts
Focus on meaning, not on style and grammar
Expectations based on specific client requirements
Clear requirements are needed

Post-editing to publishable quality is covered in more detail in the next chapters. When
post-editing to publishable quality, the following rules apply:

Read the source segment first and then the MT output

Determine the usable elements (single words and phrases) and make
them the basis for your translation

Build from the MT output and use every part of the MT output that can
speed up your work

Take care not to over-edit (unnecessary rephrasing) or under-edit (wrong
prepositions, inflections, compounds, etc.) the MT output. The adjustment
of style (such as "may" versus "might") can be optional, but grammatical
correctness in the target is not

Correct any grammatical errors and make sure that the terminology of the
MT output is compliant with glossaries and termbases. This will always
need to be checked as any inconsistencies in the training material will be
reproduced in the output

Run the compulsory checks (spelling, grammar, terminology check)

Finally, after post-editing each segment, reread your translation and make
sure that no details are missing and you have not left any words that are
not needed

6.3

The quality check process


It is recommended that the post-editing process is followed by a quality check, which is
the equivalent of conventional review.

As part of SDL's workflow, the quality check is performed as a separate step by a
reviewer and guarantees that the translation is fully publishable. To achieve this, quality
at source is key: the post-edited file should already be of publishable quality. To
facilitate this, ensure that the post-editor receives clear instructions and has access to
all the most up-to-date reference materials. The required QA checks need to be run and
can be used as an indication of the post-editing quality.
When quality-checking, always bear the MT in mind and understand the initial MT
output. Identify known problems in advance (see section 8) and make sure to include
them in your checks (e.g. wrong prepositions, terminology, known issues with MT). It is
important to learn to distinguish between what needs to be changed and what can
remain untouched. Note that there are some items which always need to be amended
by the post-editor. Examples include date formats, spacing, wrong prepositions or
terminology issues caused by several possible translations of the same word.
When quality-checking machine-translated material, focus on over-editing and under-editing (depending on style and client requirements). Over-editing will lead to lower
productivity and needs to be avoided during both the PE and the QA check phase.
Under-editing may result in quality issues and will impact negatively on the time needed
for quality check.
Before starting a quality check, make sure that all the content has been translated.
Then check that the post-edited text reads well from a user's point of view. The post-edited text must match the source. Be careful to look for mistranslations, words left out
from the translation or additional words which are not in the source text. Check that
there are no typos. Scrolling down the file will enable you to spot spelling mistakes and
inconsistencies. Terminology should be consistent with the master glossary, especially
product names. It is vital that terminology is consistent. Sometimes terminology is not
consistent in the TMs and there are additional lists and guidelines for terminology.
Finally, check that style is overall consistent with the rest of the files and complies with
the style guide from the client.


7 How to get the most out of MT


7.1

What makes an effective post-editor?


In order to post-edit effectively, it is essential to use the machine translation output as
much as possible. Do not ignore the machine translation output and do not translate
segments from scratch. In almost all cases some parts of the automatic translation
output can be used and help to speed up work.
The following guidelines will help you to identify usable parts and achieve the maximum
post-editing productivity. The translator needs to achieve publishable quality at the
post-editing stage without sacrificing translation speed. Once you have learnt to identify
usable parts and to use them, you will find post-editing easier and faster than
translating from scratch. Like any other new skill, however, there is a learning curve
with MT post-editing: the more you practice, the faster and easier it gets.

Post-editing tips

Do not ignore or erase the MT output
Do not make alterations for the sake of variation alone
Maximise the usage of the MT output
Do not replace words with synonyms
Use the appropriate style and terminology
Do not spend time researching terminology unless the MT is clearly wrong
Follow the project/client style guidelines
If the MT meets the project requirements, do not modify it
If formatting is an issue, restore the original source format and paste the useful MT parts instead
An alternative if there are many tags is to delete them, edit the text, then insert the tags again
At the end, re-read the segment and compare it to the source for accuracy

However, the MT is not only useful when it is easy to edit. You can also use the MT as
a source of inspiration when looking for the correct translation and pick out bits of the
sentence to reuse rather than trying to keep as much of the sentence as possible. This
is particularly relevant for longer sentences. Even sentences that are largely incorrect
can be useful so long as deleting the incorrect material is not time-consuming.
Apart from this, it is important to bear in mind that account knowledge matters for
post-editing as well. Whilst this is true of all translation projects (conventional as
well as MT), a solid knowledge of the project requirements with regard to style
guidelines, terminology, TM and client expectations will help you achieve good post-editing productivity.
So what makes a good post-editor?

Excellent linguistic skills
Domain and subject knowledge
Proficiency with CAT tools and automated text-checking
Positive attitude towards MT
Practice!

7.2

Post-editing quality expectations


The quality expectations will vary according to the degree of post-editing and the client
requirements. However, certain general principles apply. The aim is to deliver a high
quality translation faster than a conventional translation. Translation speed is a key
factor when post-editing. Therefore, the machine translation needs to be corrected with
a view to maintaining efficiency.
There should be no difference in quality between a human translation and a post-edited
translation when post-editing to publishable quality. However, there may be a slight
shift in style. Style should be correct and appropriate to the project, but may need to be
less refined in order to allow for a more efficient use of the MT output. Where a client
specifically asks for MT to be used on their project, the client needs to be made aware
of this and expectations need to be managed accordingly.
There will of course be a certain amount of variation but this is a feature of
conventional translation as well. So long as the quality criteria are adhered to, a postedited text will be considered to have met the quality expectations.


Post-editing quality criteria


The translation must be a correct reflection of the source.

Spelling and punctuation must be correct.

The translation must be grammatically and syntactically correct and reflect the conventions of the target language.

The correct terminology must be applied and used consistently (including preferred translations for frequently occurring terms).

Cultural references (date and time formats, units of measurement, number formats, currency, etc.) must be correctly adapted.

The style and register of the target must be appropriate for the document type.

The original formatting must be reproduced.

Project guidelines must be followed.

The translation must read well and be suitable for the end user.

There are two main issues that post-editors often face when attempting to fulfil the
highest possible quality criteria in the shortest amount of time. These are under-editing
and over-editing and will be discussed in more detail in the following sections.

7.3

Under-editing
If a post-editor has under-edited the MT output, they may have missed important errors
that needed to be corrected, and this may reflect badly on the quality of the translation.
Under-editing is generally characterised by the following features:

Under-editing

Errors (spelling, typos)

Mistranslations (target does not match source)

Inconsistent terminology

Inaccuracy

Inconsistency in figures, units of measurement, etc.

Incorrect formatting

Not following project-specific instructions

Below are some examples of under-editing:


LP: EN-ES
Source: On its walls you'll discover the figures of a puma and a snake.
MT: En sus murallas, descubrirá la cifras de un puma y una serpiente.
PE: En sus murallas descubrirá la figuras de un puma y una serpiente.
Reviewer: En sus murallas descubrirá las figuras de un puma y una serpiente.
Comment: The term "cifras" has been correctly post-edited and replaced with "figuras", but the article "la" has not been changed to the plural form.

LP: EN-ES
Source: Inside you can see a sacrificial altar made of a huge stone.
MT: En su interior se puede ver una altar de sacrificios de una enorme piedra.
PE: En su interior se puede ver una altar de sacrificios hecho con una enorme piedra.
Reviewer: En su interior se puede ver un altar de sacrificios hecho con una enorme piedra.
Comment: The preposition "de" has been correctly post-edited, but the article "una" does not correspond to the gender of the noun "altar" ("una" is feminine whilst "altar" is masculine).

LP: EN-FR
Source: How long will the battery last using interactive features (such as games) on my phone?
MT: Combien de temps dure l'autonomie à partir d'interactive fonctions (comme les jeux) sur mon téléphone ?
PE: Combien de temps dure l'autonomie de la batterie lorsque j'utilise les fonctions interactives (comme les jeux) sur mon telephone ?
Reviewer: Quelle est l'autonomie de la batterie lorsque j'utilise les fonctions interactives (comme les jeux) de mon telephone ?
Comment: "Combien de temps dure" should not be combined with the word "autonomie". The literal translation of "How long does XXX last" is not appropriate in this context. The correct version is "Quelle est l'autonomie". The preposition "sur" is not appropriate in this context.

7.4

Over-editing
If a post-editor has over-edited the MT output, they may be taking extra time which may
affect their overall productivity and reduce the benefits of post-editing. Over-editing is
typically characterised by preferential rather than necessary changes.

Over-editing

Do not rewrite the translation unless unavoidable

Do not change correct and understandable translations, even if they could be phrased more naturally or fluently

If the MT output style meets the project requirements, do not change it

Reduce changes to a minimum and focus on actual mistakes

There is always room to allow stylistic changes and creativity with post-editing, and
certainly stylistic features that do not meet with the client style guides should be
amended. The important thing to remember is not to let preferential changes distract
from necessary amendments and not to let these changes have a negative impact on
the overall productivity.
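
As a rough illustration of "reducing changes to a minimum", the sketch below compares a raw MT segment with two post-edited versions and reports how much of the MT each one keeps. It is only a sketch under stated assumptions: the function name `retained_ratio` is hypothetical, the word-level similarity from Python's standard `difflib` module is a stand-in for whatever edit-distance measure a project may actually use, and a low score is a prompt to re-check the edit, not proof of over-editing.

```python
from difflib import SequenceMatcher


def retained_ratio(mt: str, post_edited: str) -> float:
    """Return a 0..1 word-level similarity between raw MT and a post-edit."""
    # Comparing word lists (not characters) makes re-ordered clauses count
    # as changes, which is what over-editing typically looks like.
    return SequenceMatcher(None, mt.split(), post_edited.split()).ratio()


# Segments taken from the DE-EN cooling example in this workbook.
mt = ("The cooling takes place through the solid aluminum case and the "
      "side-mounted cooling fins and comes completely without fans.")
minimal = ("Cooling takes place through the solid aluminium casing and the "
           "side-mounted cooling fins - there is no need whatsoever for fans.")
rewritten = ("The cooling fins fitted on the side of the solid aluminium casing "
             "ensure that the computer is cooled, as it comes completely "
             "without fans.")

print(f"PE without over-editing keeps: {retained_ratio(mt, minimal):.2f}")
print(f"PE with over-editing keeps:    {retained_ratio(mt, rewritten):.2f}")
```

The version without over-editing scores higher because it leaves the MT word order intact and only touches the parts that were actually wrong.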
Below are some examples of over-editing:

Language Pair: DE-EN
Source: Die Kühlung erfolgt durch das massive Aluminium-Gehäuse und die seitlich angebrachten Kühlrippen und kommt gänzlich ohne Lüfter aus.
MT: The cooling takes place through the solid aluminum case and the side-mounted cooling fins and comes completely without fans.
PE with Overediting: The cooling fins fitted on the side of the solid aluminium casing ensure that the computer is cooled, as it comes completely without fans.
PE without Overediting: Cooling takes place through the solid aluminium casing and the side-mounted cooling fins - there is no need whatsoever for fans.
Comment on overedited version: Unnecessary re-ordering and retranslating of segments

Language Pair: DE-EN
Source: Aber nicht nur äußerlich hat dieses Festplattengehäuse einiges zu bieten.
MT: But not only on the outside, this hard drive enclosure has something to offer.
PE with Overediting: This hard drive casing has more than just a great design.
PE without Overediting: But it's not only on the outside where this hard drive casing has something to offer.
Comment on overedited version: Overedited version is stylistically more pleasing, but requires a major rewrite, while version without overediting is equally correct.

Language Pair: DE-EN
Source: Fotos mit 1,3 Megapixeln
MT: Photos with 1.3 megapixels
PE with Overediting: 1.3 megapixel photos
PE without Overediting: Photos with 1.3 megapixels
Comment on overedited version: Unnecessary re-ordering of segments

Language Pair: DE-EN
Source: Zudem stehen verschieden SATA-Typen zur Auswahl, wie z.B. Micro SATA oder Slimline-SATA.
MT: In addition there are different SATA-types are available, such as micro SATA or Slimline SATA.
PE with Overediting: There are various types of SATA available for this, such as micro SATA or slimline SATA.
PE without Overediting: In addition, there are different SATA types available, such as micro SATA or slimline SATA.
Comment on overedited version: Unnecessary re-phrasing and change of syntax.

Language Pair: DE-EN
Source: Mit der 1 Meter langen Tischantenne können Sie Ihren WLAN-Empfang deutlich optimieren.
MT: With the 1 meter long Tischantenne you can significantly optimize your WLAN-reception.
PE with Overediting: You can optimise your WLAN reception significantly using the 1-m table-top antenna.
PE without Overediting: With the 1-m table-top antenna you can significantly optimise your WLAN reception.
Comment on overedited version: Unnecessary re-ordering of segments; more of the MT can be left unchanged if syntax is kept as is

Language Pair: EN-DE
Source: Make sure that the brake pedal is depressed while you perform this procedure.
MT: Sicherstellen, dass das Bremspedal niedergedrückt wird während Sie dieses Verfahren durchführen.
PE with Overediting: Währenddessen muss das Bremspedal weiterhin gedrückt werden!
PE without Overediting: Das Bremspedal muss niedergedrückt sein, während Sie dieses Verfahren durchführen.
Comment on overedited version: Unnecessary re-write; usable parts of the MT were ignored in overedited version

Language Pair: EN-DE
Source: Install the Bluetooth printer on your computer and set it as the default printer.
MT: Installieren Sie die Bluetooth Drucker auf Ihrem Computer, und richten Sie ihn als Standarddrucker.
PE with Overediting: Installieren Sie den Bluetooth-Drucker auf Ihrem Computer, und legen Sie ihn als Standarddrucker fest.
PE without Overediting: Installieren Sie den Bluetooth-Drucker auf Ihrem Computer, und richten Sie ihn als Standarddrucker ein.
Comment on overedited version: Unnecessary use of synonyms; verb "einrichten" was unnecessarily replaced by "festlegen"

Language Pair: EN-DE
Source: Allow the computer to lock automatically after 10 seconds.
MT: Warten Sie, bis der Computer die Sperre automatisch nach 10 Sekunden.
PE with Overediting: Gestatten Sie, dass der Computer nach 10 Sekunden automatisch gesperrt wird.
PE without Overediting: Warten Sie, bis der Computer die Sperre nach 10 Sekunden automatisch aktiviert.
Comment on overedited version: Unnecessary use of synonyms; verb "warten" was unnecessarily replaced by "gestatten" ("warten" conveyed the same meaning in this context)

Language Pair: EN-DE
Source: When the proximity feature is enabled but inactive, the following message displays in the Bluetooth Device Control window for the phone:
MT: Wenn der Nähe Funktion aktiviert, aber nicht aktiv ist, wird die folgende Meldung in der Bluetooth Device Control Fenster für das Telefon:
PE with Overediting: Wenn die Näherungsfunktion eingeschaltet aber inaktiv ist, wird im Fenster "Bluetooth-Gerätesteuerung" für das Telefon die folgende Meldung angezeigt:
PE without Overediting: Wenn die Näherungsfunktion aktiviert aber nicht aktiv ist, wird die folgende Meldung im Fenster "Bluetooth-Gerätesteuerung" für das Telefon angezeigt:
Comment on overedited version: Unnecessary use of synonyms; "eingeschaltet" is synonym to "aktiviert" and "inaktiv" is synonym to "nicht aktiv" in this context

Language Pair: EN-DE
Source: This feature provides a quick way to transfer files without requiring you to browse the file system on the other device.
MT: Diese Funktion bietet eine schnelle Möglichkeit, Dateien, ohne die Datei zu durchsuchen auf der anderen Gerät zu übertragen.
PE with Overediting: Mithilfe dieser Funktion lassen sich Dateien schnell übertragen, ohne das Dateisystem des anderen Geräts durchsuchen zu müssen.
PE without Overediting: Diese Funktion bietet eine Möglichkeit, Dateien schnell ohne Durchsuchen des Dateisystems des anderen Geräts zu übertragen.
Comment on overedited version: Unnecessary re-ordering of segments; more of the MT can be left unchanged if the syntax is kept as is

Language Pair: EN-FR
Source: After disconnecting the high voltage terminals, busbars, etc., insulate the parts with insulating tape.
MT: Après avoir débranché les bornes haute tension, jeux, etc., isoler les pièces avec de la bande adhésive isolante.
PE with Overediting: Après le débranchement des bornes, barres collectrices, etc. haute tension, isoler les pièces avec du ruban isolant.
PE without Overediting: Après avoir débranché les bornes, barres collectrices, etc. haute tension, isoler les pièces avec du ruban isolant.
Comment on overedited version: Unnecessary change of syntax

Language Pair: EN-FR
Source: For further information on the Table View, see the tutorial "Table View Productivity Features"
MT: Pour plus d'informations sur l'affichage en tableau, voir les sections du tutoriel "Fonctions de productivité - Affichage en tableau"
PE with Overediting: Pour obtenir de plus amples renseignements sur l'affichage en tableau, voir le tutoriel "Fonctions de productivité - Affichage en tableau"
PE without Overediting: Pour plus d'informations sur l'affichage en tableau, voir le tutoriel "Fonctions de productivité - Affichage en tableau"
Comment on overedited version: Correct expression in MT; not needing any editing

Language Pair: EN-FR
Source: Alternator is found to be noisy
MT: L'alternateur est bruyant
PE with Overediting: Le client trouve que l'alternateur est bruyant
PE without Overediting: L'alternateur est bruyant
Comment on overedited version: Correct expression in MT; not needing any editing

Language Pair: EN-FR
Source: The oil in these passages is trapped and the blade does not move.
MT: L'huile dans ces passages est piégée et la lame ne bouge pas.
PE with Overediting: La lame ne bouge pas car l'huile de ces conduits est piégée.
PE without Overediting: L'huile dans ces passages est piégée et la lame ne bouge pas.
Comment on overedited version: Unnecessary rephrasing

Language Pair: EN-IT
Source: Be sure that the hydraulic hose is free of abrasion.
MT: Accertarsi che il flessibile idraulico sia privo di abrasioni.
PE with Overediting: Assicurarsi che il flessibile idraulico sia privo di abrasioni.
PE without Overediting: Accertarsi che il flessibile idraulico sia privo di abrasioni.
Comment on overedited version: Unnecessary use of a synonym.

Language Pair: EN-IT
Source: Adjust the angle by raising the rear of the vehicle to ensure water covers the joints.
MT: Regolare l'angolo sollevando la parte posteriore del veicolo per assicurarsi che l'acqua copre i giunti.
PE with Overediting: Sollevando la parte posteriore del veicolo, regolare l'angolo per assicurarsi che l'acqua copra i giunti.
PE without Overediting: Regolare l'angolo sollevando la parte posteriore del veicolo per assicurarsi che l'acqua copra i giunti.
Comment on overedited version: Unnecessary re-ordering of phrases.

Language Pair: EN-IT
Source: The only way to allow the device to validate a self-signed certificate is to install the certificate on the device.
MT: L'unico modo per consentire il dispositivo per convalidare un certificato autofirmato per installare il certificato sul dispositivo.
PE with Overediting: Per permettere al dispositivo di convalidare un certificato autofirmato, l'unico modo è quello di installare il certificato sul dispositivo.
PE without Overediting: L'unico modo per consentire al dispositivo di convalidare un certificato autofirmato è quello di installare il certificato sul dispositivo.
Comment on overedited version: Unnecessary use of synonyms and reordering of phrases.

7.5

Help improve MT for the future


To make it easier to post-edit in the future, make sure that you post-edit and translate in
an MT-friendly way: use simple sentence structure, do not add information or rephrase
the source, and do not complicate the word order in the target unnecessarily. This will
improve the training material with which engines are retrained.
For some language combinations, the word order is considerably different between
source and target and this will always pose problems for MT. However, keeping closer
to the source is generally the best way forward:


For example, the English source...

"An error should display on the dash"

...can be translated into German as...

"Am Armaturenbrett muss jetzt ein Fehler angezeigt werden"

...but the following is equally acceptable...

"Es sollte ein Fehler auf dem Armaturenbrett angezeigt werden"

In this instance, the second translation has the advantage that the word order in the
target is closer to the word order in the source. This can help the MT engine to match
up the words "error" (German: "Fehler") and "dash" (German: "Armaturenbrett") more
easily with their correct translations.
If the verb is usually found at the beginning of the sentence in the source and at the
end of the sentence in the target, adding a lot of additional information in the middle
can also make it harder for the MT to match up source and target segments correctly.
As a rule, the MT engine can handle shorter phrases better than long convoluted
sentences.
A more MT-friendly style is also achieved by keeping translations as unambiguous as
possible. Ambiguous constructions are rather problematic for MT engines, for example
participial forms, omission of conjunctions and relative pronouns, omission of
determiners or use of pronouns and possessives.

Translators are encouraged to avoid using phrasal verbs (verb+particle) wherever
possible and substitute them with unambiguous verbs. Using English as an example,
"went on talking" can be rewritten as "continued to talk". The same applies to
polysemous words or phrases. They are best replaced with monosemous words or
phrases wherever possible. For example, in English "pen such a poem" could be
rewritten as "compose such a poem".
Further examples are as follows: "Replace it" can mean either "Put it back where it was"
or "Substitute it"; "Fix it" can mean either "Attach it to something" or "Repair it". As
mentioned above, structures such as gerunds and participial forms in English also often
cause confusion to the MT engine. Making translations more specific avoids
misunderstandings and produces clearer structures for MT.
To summarise, MT translations should ideally consist of grammatically correct, but
syntactically concise and unambiguous target sentences. Any effort made at the
post-editing (or translation) stage will benefit the MT output in the future, as eventually all
translated and post-edited data will be added to the existing TMs and used in future
customisations. Please note, though, that this is a recommendation and will not
override any style guidelines specified by the client for a particular project.


8 Expected Statistical MT behavior


8.1

Common patterns to watch for when post-editing


Post-editing is not a light review of machine-translated content. MT output is seldom
flawless and post-editors need to notice the issues and evaluate which parts of MT
output can be used for the final translation. Post-editing is most efficient when the
translator knows what kind of issues to expect. MT output issues depend to some
extent on the project and language combinations as well as type of MT technology
used.

Common errors in SMT


Statistical MT output is usually more fluent in style than rules-based output but
post-editors need to look out for the following issues:

Extra words in the target

Syntax and word order errors

Formatting issues

Missing words in the target

Terminology inconsistencies

Incorrect punctuation

Mistranslations/antonyms

Incorrect or missing articles and prepositions

Vocabulary left in source language

Noun compounds in the wrong order/not compounded

Wrong gender, number, agreement, inflection

Incorrect capitalisation

Since statistical processes do not actively involve linguistic information in the translation
process (excluding any rules that are extracted from statistical regularities), it is
possible that the opposite meaning may occur in the MT output (i.e. positive sentences
instead of negative sentences and vice versa). This is relatively easy to post-edit but
should always be checked during the post-editing stage as it is a major quality concern.
Such mistranslations and opposites are perceived as far more serious errors than any
other potential errors related to SMT as there is a high potential for post-editors to miss
them. This is also the case with fuzzy matches and incorrect editing. We know that
these kinds of errors are found across language pairs and projects but they are
reported only sporadically perhaps on account of how easy they are to correct.
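
Because reversed polarity is easy to overlook, a simple automated check can act as a safety net. The sketch below flags segments where only one side contains a negation marker. Everything in it is an assumption made for illustration: the word lists are tiny and cover English and German only, the function names are hypothetical, and the example sentences are invented. It does not replace comparing the target against the source.

```python
import re

# Illustrative negation markers only; a real project would need a much
# fuller, language-specific list.
NEGATIONS = {
    "en": {"not", "no", "never", "cannot", "don't", "doesn't", "can't"},
    "de": {"nicht", "kein", "keine", "keinen", "nie", "niemals"},
}


def count_negations(text: str, lang: str) -> int:
    """Count tokens in `text` that are known negation markers for `lang`."""
    tokens = re.findall(r"[\w']+", text.lower())
    return sum(1 for token in tokens if token in NEGATIONS[lang])


def polarity_mismatch(source: str, target: str,
                      src_lang: str, tgt_lang: str) -> bool:
    """Return True when exactly one side contains a negation marker."""
    source_negated = count_negations(source, src_lang) > 0
    target_negated = count_negations(target, tgt_lang) > 0
    return source_negated != target_negated


# Invented example: the negation was dropped in the target.
print(polarity_mismatch("Do not remove the cover.",
                        "Entfernen Sie die Abdeckung.", "en", "de"))
```

A flagged segment is only a candidate for review; double negatives or negation expressed through a prefix (for example "un-") will not be caught by a word list of this kind.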

Common errors in RBMT


Although RBMT output sticks to the source structure more rigidly than SMT, it offers a
certain amount of control over terminology. However, RBMT technology has its own
issues and common patterns include:

Missing or incorrect prepositions

Proper nouns translated - e.g. Windows

Superfluous articles - e.g. before proper nouns

Unique translation for dictionary terms wrong in context

Capitalization errors

Word order

Homographs interpreted incorrectly - e.g. nouns as verbs

Compound formation and hyphenation issues

Ultimately, the same holds true for both MT technologies: if you know which issues to
look out for, it is much easier to post-edit.

For some examples of typical MT behavior, refer to the Appendix - Post-editing examples.

8.2

How to provide feedback to improve the MT output


For MT systems, as for all tools and services, feedback from users is important in
order to find out whether they work as intended, show particular problems or can generally be
enhanced and improved. At SDL, all feedback received on MT systems is analysed and
either dealt with within the department or passed on to our technical team. In addition to
this, all feedback is logged using a dedicated tool to ensure that nothing is lost or
forgotten, even if it may not be possible to act upon it straight away.

When to provide feedback


Feedback is useful when:

you think the MT quality for a specific job is below expectations

you would like to share your experience of post-editing

you have an idea on how to improve MT as a whole or for a specific project

How to provide feedback


All feedback is easier and quicker to analyse if some simple rules are followed.
Preferred forms of feedback

Provide enough information: language pair, source text, machine translation
output, desired translation, any comments, category of the error (terminology,
grammar, corruption, etc.)

Feedback which provides clear examples from the relevant job, including original
sentence, MT output and final translation. Highlighting the terms concerned in both
source and target and giving a specific explanation helps to understand the issue at a
glance.

Feedback which includes the raw MT as well as the post-edited files (ITDs or
SDLXLIFFs) from the job(s) in question. The file context helps to get an overall view
of the MT quality. MT quality can vary within a file from segment to segment. It is
therefore important to see how less favourable MT segments compare to more
favourable MT segments in a job.

However, the following should be AVOIDED when giving feedback:



Providing backtranslations created from the MT output

Providing compare files, which give no indication of the segment status before
editing (i.e. if a segment originally was a 100% match, fuzzy match or MT
segment)
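
The items listed under "Preferred forms of feedback" can be pictured as one structured record per reported segment. The sketch below is only an illustration of that idea: the class `MTFeedback`, its field names and the `missing_fields` helper are hypothetical and do not describe the dedicated logging tool mentioned earlier. The `segment_origin` field reflects the point that feedback should show whether a segment was originally a 100% match, a fuzzy match or MT.

```python
from dataclasses import dataclass


@dataclass
class MTFeedback:
    """One feedback entry for a single segment (hypothetical structure)."""
    language_pair: str
    source_text: str
    mt_output: str
    desired_translation: str
    error_category: str           # e.g. terminology, grammar, corruption
    comment: str = ""
    segment_origin: str = "MT"    # "MT", "fuzzy" or "100%"

    def missing_fields(self) -> list[str]:
        """Return the names of required fields that were left empty."""
        required = ("language_pair", "source_text", "mt_output",
                    "desired_translation", "error_category")
        return [name for name in required if not getattr(self, name).strip()]


# Entry built from the EN-ES under-editing example in this workbook.
entry = MTFeedback(
    language_pair="EN-ES",
    source_text="On its walls you'll discover the figures of a puma and a snake.",
    mt_output="En sus murallas, descubrirá la cifras de un puma y una serpiente.",
    desired_translation="En sus murallas descubrirá las figuras de un puma y una serpiente.",
    error_category="terminology",
    comment='"figures" translated as "cifras" (numbers) instead of "figuras".',
)
print(entry.missing_fields())
```

Checking for empty fields before submitting is one way to make sure the reviewer receives source, raw MT and desired translation together.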

What feedback to provide


The kind of feedback to provide will first of all depend on the type of machine
translation system which is used for a project: a rules-based or a statistical machine
translation engine.
Rules-based systems allow for specific changes down to word/terminology level which
can be implemented almost instantly. Changes to a statistical machine translation
system are only effective at a higher level and in most cases cannot be applied there
and then.

Feedback on rules-based MT versus feedback on statistical MT


Feedback for rules-based MT engines
The MT output of a rules-based engine is based on the following:

rules for a language pair

general settings (related to quotation marks, verb tense, accents, decimal points)

a project dictionary used for building the engine

The general settings for the project dictionary as well as its contents (i.e. the translation
and grammatical information for terms) can be amended relatively easily with almost
instant effect. It therefore makes sense for post-editors to give prompt feedback on
missing terms, consistently wrong translations or incorrect grammatical features for
terms (such as gender, inflection, hyphenation, capitalization, etc.) so these can be
amended in time for the next post-editing job.
Within SDL, carrying out changes of this kind (i.e. updating and maintaining the project
dictionaries) is the responsibility of the respective language leads for the project. If
there is no feedback from post-editors, the project lead can also fine-tune the
rules-based engine by comparing raw MT and post-edited files. The more terms are entered
(encoded) in the project dictionary, the better the MT output. If no new terms are added,
the MT quality is likely to decrease over time.
However, if terms appear incorrectly in the MT output even though they have been
encoded correctly in the dictionary, there may be a technical issue with applying the MT
to the files.

Feedback for statistical MT engines


The MT output of a statistical engine is based on:

a corpus of bilingual data (mostly TMs)

statistical algorithms

Source and translation units are matched up based on statistical probability. As
opposed to rules-based engines, no grammatical rules are applied and no grammatical
information is stored for individual words or terms. Small changes to individual words or
terms are unlikely to influence the overall statistical probability. Therefore feedback of
this kind cannot be incorporated in the same way and with the same effect as with
rules-based engines.
The SMT output quality directly reflects the quality of the input data (i.e. the
bilingual data/TMs used for building the engine). Improvements to the MT output with
regard to terminology and style come through improving the input data. The more
consistent the style and the terminology are in the TMs, the better the MT output.
Conversely, if the TMs used are inconsistent, the output based on this data will be
inconsistent too.
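
One way to picture TM inconsistency is a source segment that has been stored with more than one target. The sketch below groups translation units by source and reports such cases. It is a minimal illustration under stated assumptions: the function name `inconsistent_sources` is hypothetical, translation units are represented as plain (source, target) pairs rather than read from a real TM format, and the sample data reuses segments from the over-editing examples in this workbook.

```python
from collections import defaultdict


def inconsistent_sources(translation_units):
    """Map each source segment to its targets when more than one exists."""
    targets_by_source = defaultdict(set)
    for source, target in translation_units:
        # Normalise case and surrounding spaces so trivial differences in
        # the source do not hide a genuine inconsistency.
        targets_by_source[source.strip().lower()].add(target.strip())
    return {source: sorted(targets)
            for source, targets in targets_by_source.items()
            if len(targets) > 1}


tm = [
    ("Install the Bluetooth printer on your computer and set it as the default printer.",
     "Installieren Sie den Bluetooth-Drucker auf Ihrem Computer, und richten Sie ihn als Standarddrucker ein."),
    ("Install the Bluetooth printer on your computer and set it as the default printer.",
     "Installieren Sie den Bluetooth-Drucker auf Ihrem Computer, und legen Sie ihn als Standarddrucker fest."),
    ("Allow the computer to lock automatically after 10 seconds.",
     "Warten Sie, bis der Computer die Sperre nach 10 Sekunden automatisch aktiviert."),
]

for source, targets in inconsistent_sources(tm).items():
    print(source)
    for target in targets:
        print("   ", target)
```

Both stored translations are correct on their own, but "einrichten" and "festlegen" compete for the same source verb, which is the kind of variation that makes engine output less consistent.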
When evaluating style and terminology for an SMT engine, it also has to be taken into
account if the engine in question is a customised engine (i.e. specifically created for
one project) or a vertical engine (i.e. created for a domain). Vertical engines combine
data from many different projects within a domain and can therefore be expected to
produce less consistent output than customised engines, which use project-specific
data for a single project/client only.

Once created, SMT engines are static and can only be changed through a so-called
re-training of the engine, which means re-assembling and re-uploading the
data corpus. This in turn only makes sense if enough new or changed bilingual data is
available for the project. Feedback on issues related to the grammar and terminology in
the data corpus will be logged, and is important to get an overall idea of the quality of
the target data used. However, feedback of this kind can only be taken into account for
long-term improvements to the engine through the above-mentioned re-training.
Recurring structural (syntax and word order) problems for a target language in the MT
output, which are noticed across projects, should always be reported. Even if there are
no grammatical rules applied to an SMT engine, a change in the algorithms used for a
language may be possible and may in turn improve the output.
In addition to this, any other unusual issues in the MT output, which do not relate back
directly to the data corpus, should always be reported immediately. Examples of such
issues include corrupt characters, random insertion of words or numbers, or a large
number of untranslated words in the target. These occurrences can be the sign of an
overall technical problem, which needs to be fixed urgently.
The following is an example of a technical issue encountered in an MT project, which
can easily be fixed.
As you turn the thimble in, you will notice that a mark on the main scale (either upper
or lower) will disappear each time the zero point on the thimble scale passes the index
line (see the above diagram).
This was translated as below, with inserted numbers that did not belong to the source:
Lorsque vous plus la le document n°964150, vous permet de synchroniser que un
repère sur le principal souhaitable (soit supérieur ou inférieur) sera mis chaque fois que
du point zéro de l'le document n°964150 souhaitable passe la 591 ligne (voir le schéma
ci-dessus).
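
Inserted numbers of this kind are easy to detect mechanically. The sketch below lists numbers that appear in the target more often than in the source. It is an illustration only: the function names are hypothetical, the regular expression is a simple assumption about what counts as a number, and decimal separators are ignored so that "1,3" and "1.3" are treated as the same value.

```python
import re
from collections import Counter

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def digit_groups(text: str) -> Counter:
    """Count the numbers in `text`, ignoring '.' and ',' separators."""
    return Counter(re.sub(r"[.,]", "", match) for match in NUMBER.findall(text))


def inserted_numbers(source: str, target: str) -> list[str]:
    """Return numbers that occur more often in the target than in the source."""
    extra = digit_groups(target) - digit_groups(source)
    return sorted(extra.elements())


# Shortened excerpts from the example above.
source = "each time the zero point on the thimble scale passes the index line"
target = "du point de l'le document n°964150 souhaitable passe la 591 ligne"

print(inserted_numbers(source, target))
print(inserted_numbers("Fotos mit 1,3 Megapixeln", "Photos with 1.3 megapixels"))
```

A non-empty result for the first pair marks the segment for reporting, while the second pair shows that a legitimately localised decimal separator is not flagged.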


If a post-editing job for a project is similar to what was originally evaluated and the
output has unexpected issues, something might have gone wrong. At the same time,
many of the issues reported for SMT engines are in fact expected SMT behaviour.
Post-editors and project leads should therefore familiarise themselves with the issues
they are likely to encounter when post-editing statistical machine translation output.
These include:

The need to correct format painting and tags

Extra words in target, not present in the source

Words missing in the target (adjectives, nouns, verbs, prefixes)

Words not localised in the target (left as per source)

Mistranslations (antonyms: remove/install, can/cannot, do/do not)

Compounding issues: noun compounds in the wrong order/not compounded

Grammar problems: wrong gender, number, agreement or verb inflections

Syntax and word order issues

Wrong, added or missing articles and prepositions/postpositions

Punctuation issues (spaces added, full stop missing)

Terminology inconsistent/non-compliant with the reference material (not the most
frequent or not appropriate in the context)

Translations taken from the baseline if not included in the corpus

In summary, MT feedback is most useful when given in a structured way, taking into
account the specificities of the different MT systems. Expectations on the
implementation of feedback also have to be adjusted according to the nature of each
system and the type of feedback given.
For rules-based systems implementation can be instant in many cases, whereas for
statistical systems instant implementation is largely limited to bigger, technical issues.
In all cases, post-editors can still influence the quality of MT by making sure project
dictionaries get updated for rules-based MT, and by carefully maintaining TMs and
post-editing in an MT-friendly way for statistical MT in order to build up a clean data
corpus for future use.

[Figure: Feedback from translators (data, common issues, terminology, style)]

9 Using BeGlobal baselines in SDL Trados Studio

9.1

BeGlobal baselines
Through SDL Trados Studio, everyone has access to the BeGlobal baselines. You can
apply BeGlobal Machine Translation to the files you receive from your clients and
post-edit the output to full publishable quality.
The MT output from BeGlobal comes from the baselines, the core MT engines
developed by SDL. These baseline systems are bilingual corpora used as general
databases for each language pair.
They are created with hundreds of millions of words of bilingual data, mined from
many different sources on the Internet with a distribution over various domains (IT,
automotive, news, sports, electronics, etc.).
Because of their general nature, the baselines usually give better results for projects with
mixed content (not domain-specific, or with a mix of different domains) or for content
written in a free style (some types of marketing and sales documentation).
Generally, the baselines might require additional work for projects with very specific and
technical terminology, or with terminology which is highly project-specific.

9.2

How to add SDL BeGlobal Community as a translation provider in SDL Trados Studio
In Studio, MT providers are added in a similar way to TMs, so you can add SDL
BeGlobal Community either when you are creating a new project (e.g. from files
downloaded from your workflow system) or to an existing project (e.g. a Studio package
received from a client).
During the project creation process, or via Project Settings for an existing project, go to
Translation Memory and Automated Translation, select Add and then select SDL
BeGlobal Community from the drop-down menu:

If you have not already registered with SDL BeGlobal Community, you will have
to do so now. Enter your details and click Finish. After you receive the
confirmation mail, click the link in the mail to activate your BeGlobal Community
account.

You may also see a warning message; click Yes:

(If you wish, you can tick the "Don't ask this question again" box so that you don't see
this message again.)

Applying baseline MT segment-by-segment


Once you have added the MT provider to your project, you will immediately be able to
see MT output in the Editor for new segments in the project.
Segments where MT has been applied appear as AT (Automated Translation) in the
Editor.

Applying baseline MT to whole files/projects using the Pre-translate Files batch task
To apply MT to whole files or projects, you can use the Pre-translate Files batch task.
In the settings for this batch task, (accessible via Project Settings or when you run the
batch task itself) choose Apply automated translation for the When no match found
option, and also note that this batch task will re-apply your TMs, so ensure that the
minimum match value is set appropriately (usually 75):

Your TMs will be applied first and then any remaining new segments will have machine
translation applied.
Machine translated words are listed as New in Analysis reports.
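
The order described above (TM first, MT only for what remains) can be sketched as a simple fallback. This is not how SDL Trados Studio is implemented: the function `pretranslate` is hypothetical, the character-level similarity from Python's `difflib` is only a stand-in for Studio's fuzzy-match scoring, and `machine_translate` is a placeholder callable supplied by the caller. The default threshold mirrors the minimum match value of 75 mentioned above, and "AT" is the label used for automated translation in the Editor.

```python
from difflib import SequenceMatcher


def pretranslate(segments, tm, machine_translate, min_match=75):
    """Return (source, target, origin) tuples: TM match first, MT otherwise."""
    results = []
    for segment in segments:
        best_score, best_target = 0, None
        for tm_source, tm_target in tm.items():
            # Stand-in similarity score on a 0..100 scale.
            score = round(100 * SequenceMatcher(None, segment, tm_source).ratio())
            if score > best_score:
                best_score, best_target = score, tm_target
        if best_target is not None and best_score >= min_match:
            origin = "100%" if best_score == 100 else f"fuzzy {best_score}%"
            results.append((segment, best_target, origin))
        else:
            # No TM match above the threshold: fall back to machine translation.
            results.append((segment, machine_translate(segment), "AT"))
    return results


# Toy data; the lambda only marks where a real MT provider would be called.
tm = {"Install the printer.": "Installieren Sie den Drucker."}
segments = ["Install the printer.", "Check the hydraulic hose."]

for row in pretranslate(segments, tm, lambda text: f"<MT output for: {text}>"):
    print(row)
```

Setting the threshold too low would let weak fuzzy matches displace MT output that might be quicker to post-edit, which is why the minimum match value deserves a check before running the batch task.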


10 Summary
10.1

Conclusion to training workbook


This training workbook outlines the basic principles involved in post-editing MT
outputs, from training MT engines to the quality check. Readers are given a brief history
of MT in chapter 2, followed by a discussion of the benefits of post-editing when
addressing emerging global business trends as well as the limitations of MT in chapter
3.
Readers are subsequently introduced to the technology behind RBMT and SMT
systems in chapter 4 and then guided through the engine training options currently
available in chapter 5. This includes information on preparation, testing and training,
which is central to the MT process at SDL and ensures the highest quality output.
From the raw MT output onwards, readers are given a series of examples and tips on
how to post-edit to different degrees of understandability and to a high-quality,
publishable standard in chapters 6 and 7. These chapters explain how post-editing as a skill
can be integrated into the translator's workflow and how best to improve productivity
without losing quality by avoiding under-editing and over-editing. Common pitfalls of
MT technology and how to provide feedback on the issues observed are discussed in
chapter 8.
Finally, the training workbook explains how to access MT through BeGlobal
baselines in SDL Trados Studio and why this can be advantageous to translators, both
as a means of increasing efficiency and as an introduction to the practice of post-editing.
The aim of the training workbook has been to cover the material above, whilst
reassuring aspiring post-editors that MT does not remove the need for human
translation or the creative input of the translator but simply facilitates and accelerates
their task. MT provides a means to an end rather than an end solution in itself.
The main advantage of post-editing is faster throughput thanks to a
productivity increase for translators. This allows both clients and translators to tackle
work that they would not otherwise attempt and to increase the amount of content
available in a range of different languages. Consequently, post-editing is a valuable skill
in light of the digital content explosion we are currently experiencing.
Therefore, in answer to the common worries listed in the introduction:

MT does not take away the need for human translators

MT output quality is improving year on year

Even low quality MT segments can be useful if they are easy to edit

MT leaves room for creative changes without consistently over-editing

MT and post-editing can be easily integrated into the translator's workflow

MT technology is based on simple concepts. The post-editor is best equipped
when they understand which common errors in the output should be corrected.

Post-editing is no less valuable a skill than translation


11 Further references
11.1 More information on MT and post-editing

Websites
SDL: http://www.sdl.com/services/languageservices/intelligent-machine-translation.html
TAUS: http://translationautomation.com/
PROZ: http://www.proz.com/about/overview/education/

Forums
PROZ: http://www.proz.com/forum/machine_translation_mt-844.html
TRANSLATORS CAFÉ: http://www.translatorscafe.com/cafe/MegaBBS/forum-view.asp?forumid=44

Blogs
SDL: http://www.sdl.com/community/blog/
TRANSLATORS CAFÉ: http://www.translatorscafe.com/cafe/MegaBBS/forum-view.asp?forumid=44
GALA: http://www.galaglobal.org/blog/2013/machine-translation-meansnew-business-models-for-lsps/
GALA: http://www.gala-global.org/blog/category/mt/

SDL recordings
http://www.youtube.com/user/SDLonline?feature=watch
http://www.sdl.com/events/ls/webinars/iMTWebinar-Series.html

Webinars
SDL: http://www.sdl.com/events/webinars.html
GALA: http://www.gala-global.org/standardswebinars
TAUS: http://www.translationautomation.com/multimedia/multimedia#webinars

Twitter hashtags
#sdlonline
#Machinetranslation
#MT


Appendix 1: Post-editing examples

Language pair: EN-DA
Source: The Loader has a ROPS decal showing the certification of the ROPS, gross weight, approval, regulation, and model number of the machine.
MT output: Frontlæsserens har et førerværn mærkat viser bekræftelse af syrtbøjlen, bruttovægt, godkendelse, regulering, og modelnummer af maskinen.
Post-edited: Frontlæsserens har et mærkat på styrtbøjlen, der viser certificeringen af styrtbøjlen, bruttovægt, godkendelse, direktiv og modelnummer for maskinen.
Comment: Delete extra word (førerværn); adjust terminology; adjust preposition (af -> for)

Language pair: EN-DA
Source: If you are routing the cables through the mounting surface, mark the location directly in the middle of the three pilot holes (optional).
MT output: Hvis du fører kabler igennem monteringsoverfladen markere placeringen direkte i midten af de tre forboringshuller (valgfrit).
Post-edited: Hvis du fører kabler igennem monteringsoverfladen, skal du markere placeringen direkte i midten af de tre forboringshuller (valgfrit).
Comment: Insert comma after monteringsoverfladen, insert the words skal du.

Language pair: EN-DA
Source: See your Owner's Manual for detailed explanation about the ENGINE START/STOP button.
MT output: Se indeholdt dins for at få nærmere forklaring til START/STOP-knappen.
Post-edited: Se i instruktionsbogen for at få nærmere forklaring om knappen MOTORSTART/STOP.
Comment: Terminology: Replace dins with instruktionsbogen, add MOTOR to the word START.

Language pair: EN-DA
Source: When a jam occurs, a message indicating the jam location and information to clear the jam appears on the printer display.
MT output: Hvis der opstår papirstop, en meddelelse om papirstoppets placering og oplysninger for at afhjælpe papirstoppet vises på printerens display.
Post-edited: Hvis der opstår papirstop, vises en meddelelse om papirstoppets placering og oplysninger til afhjælpning af papirstoppet på printerdisplayet.
Comment: Rephrase the sentence to comply with Danish grammar rules: The verb vises should be placed at the beginning of the sentence. Replace the preposition for with the correct one til.

Language pair: EN-NO
Source: The additional forces can lead to faster cycle times when digging in tightly compacted materials or soil.
MT output: Den ekstra krefter kan føre til raskere syklustider ved graving i tett sammenpakkede materialer eller jord.
Post-edited: Den ekstra kraften kan gi raskere syklustider ved graving i tett sammenpakkede materialer eller jord.
Comment: Adjust grammar (noun from plural to singular), make phrase more idiomatic

Language pair: EN-NO
Source: Autoguidance disengages when the operator is not seated.
MT output: Automatisk styring kobles ut når føreren ikke sitter godt.
Post-edited: Automatisk veiledning kobles ut når føreren ikke sitter.
Comment: Adjust terminology, adjust meaning

Language pair: EN-NO
Source: Open the rear door of the printer and the rear duplex area, and then remove the jammed paper.
MT output: Åpne bakdekselet på skriveren og den bakre tosidigenhet og fjern det fastkjørte papiret.
Post-edited: Åpne bakdekselet på skriveren og tosidigenheten, og fjern fastkjørt papir.
Comment: Adjust meaning

Language pair: EN-NO
Source: Can I log in to Twitter with the Facebook account? This camcorder only supports Twitter login with the Twitter account.
MT output: Kan jeg logge på diskanthøyttaler med Facebook konto? Dette videokameraet støtter bare diskanthøyttaler pålogging med diskanthøyttalerkonto.
Post-edited: Kan jeg logge på Twitter med Facebook-kontoen? Dette videokameraet støtter bare Twitter-pålogging med Twitter-konto.
Comment: Adjust meaning (diskanthøyttaler > Twitter), adjust grammar

Language pair: EN-NO
Source: Why do I fail to upload my recordings online? There is size limitation to online sharing.
MT output: Hvorfor hører jeg ikke klarer å laste opp min opptak online? Det er størrelse begrensning til nettbaserte deling.
Post-edited: Hvorfor klarer jeg ikke å laste opp opptakene mine på Internett? Det finnes størrelsesbegrensninger for nettbasert deling.
Comment: Delete extra word and adjust grammar. Adjust terminology (online > på Internett). Adjust grammar and change preposition.

Language pair: EN-SV
Source: Excess tension can lead to final drive or idler breakage as well as undercarriage frame bending.
MT output: Överflödig spänning kan leda till slutväxel eller ledarhjul brott liksom underredesram böjning.
Post-edited: Överdriven spänning kan leda till att slutväxeln eller ledarhjulet går sönder, liksom till att underredsramen böjs.
Comment: Adjust terminology (Överdriven instead of Överflödig); adjust grammar

Language pair: EN-SV
Source: If you are routing the cables through the mounting surface, mark the location directly in the middle of the three pilot holes (optional).
MT output: Om du vidarebefordrar kablarna genom monteringsytan och märk ut direkt i mitten av de tre rikthålen (valfritt).
Post-edited: Om du drar kablarna genom monteringsytan bör du märka ut platsen precis i mitten av de tre rikthålen (valfritt).
Comment: Adjust terminology (drar instead of vidarebefordrar); delete extra word (och); adjust grammar

Language pair: EN-SV
Source: This section describes how to operate the audio and visual system.
MT output: Det här avsnittet beskriver hur in ljud-och Visual system.
Post-edited: Det här avsnittet beskriver användningen av ljud- och bildsystemet.
Comment: Translate untranslated term (Visual); translate missing word (operate); adjust grammar

Language pair: EN-SV
Source: At the moment fluoroscopy is possible, all functions of the system are operable.
MT output: I det ögonblick då genomlysning är möjlig, alla funktioner i de systemet fungerar.
Post-edited: I det ögonblick då genomlysning är möjlig fungerar alla funktioner i systemet.
Comment: Adjust grammar

Language pair: EN-DE
Source: Adjusting the LED Tilt Angle
MT output: Justieren der LED Neigewinkel
Post-edited: Justieren des LED-Neigewinkels
Comment: Adjust gender / agreement; correct noun compound / hyphenation

Language pair: EN-DE
Source: <1/>Select the protocol that the selected time server supports.
MT output: <1/>Wählen Sie das Protokoll, dass die ausgewählten Zeit-Server unterstützt.
Post-edited: <1/>Wählen Sie das Protokoll aus, das der ausgewählte Zeitserver unterstützt.
Comment: Add verb prefix; adjust grammar; correct noun compound / hyphenation

Language pair: EN-DE
Source: The image file can be stored on a local computer, in the local network, or at an Internet address.
MT output: Die Bilddatei kann werden auf einem lokalen Computer, im lokalen Netzwerk oder unter einer Internet-Adresse.
Post-edited: Die Bilddatei kann auf einem lokalen Computer, im lokalen Netzwerk oder unter einer Internet-Adresse gespeichert werden.
Comment: Add missing verb

Language pair: EN-DE
Source: Replace the 10 O-ring seals on the exposed remote.
MT output: Ersetzen Sie den 10 O-Ring-Dichtungen auf die freiliegenden Zusatzsteuergerät betätigt wird.
Post-edited: Ersetzen Sie die 10 O-Ring-Dichtungen am freiliegenden Zusatzsteuergerätblock.
Comment: Adjust grammar / preposition; adjust terminology; delete extra word (verb) not in source

Language pair: EN-DE
Source: The following basic principles still apply as with conventional translation:
MT output: Die folgenden Grundsätze gelten immer noch als mit herkömmlichen Übersetzung:
Post-edited: Die folgenden Grundsätze gelten ebenso wie für herkömmliche Übersetzungen:
Comment: Adjust word choice; adjust number agreement (Sing -> Plural)

Language pair: EN-ES
Source: Remove the sealing plug on the bottom left corner.
MT output: Retire el tapón hermético conector en la esquina inferior izquierda.
Post-edited: Retire el tapón hermético de la esquina inferior izquierda.
Comment: Delete extra words not in source

Language pair: EN-ES
Source: Now you can adjust the focus, focal length and LED tilt controls (see below).
MT output: Ahora se puede ajustar el enfoque, distancia focal, y el LED inclinación controles (consulte la a continuación).
Post-edited: Ahora se puede ajustar el enfoque, la distancia focal, y los controles de inclinación de los LED (consulte la a continuación).
Comment: Add missing article; adjust gender / agreement for article; rearrange word order

Language pair: EN-ES
Source: Click Connect, then reprogram the 4 files <1>in the order below,</1> following the instructions (switch the TopDock off and back on again):
MT output: Haga clic en Conectar y, a continuación, volver a programar el 4 archivos <1>en el orden siguiente</1> , siguiendo las instrucciones (interruptor la unidad TopDock desactive y vuelva a):
Post-edited: Haga clic en Conectar y, a continuación, vuelva a programar los 4 archivos <1>en el orden siguiente</1> , siguiendo las instrucciones (desactive el interruptor la unidad TopDock y vuelva a activarlo):
Comment: Adjust verb form; adjust number agreement (el -> los); adjust word order; add missing information (switch on = activarlo)

Language pair: EN-ES
Source: NOT to thresh harder, faster or more than you need to.
MT output: No trille, más rápido o más de lo que resulta necesario.
Post-edited: NO trille con más fuerza, más rápido o más de lo que resulte necesario.
Comment: Adjust capitalization; add missing information (harder = con más fuerza); adjust verb form (resulta -> resulte)

Language pair: EN-ES
Source: Overthreshed wheat
MT output: Overthreshed trigo
Post-edited: Trigo excesivamente trillado
Comment: Translate word left as source

Language pair: EN-ES
Source: Check for abnormal noise on the hydraulic components.
MT output: Comprobar si hay anormal en los componentes des sistema hidráulico.
Post-edited: Compruebe si hay ruido anormal en los componentes del sistema hidráulico.
Comment: Add missing word (ruido); adjust verb form (comprobar -> compruebe)

Language pair: EN-FR
Source: On the other hand, if the Exchange server is configured to require autolock, the synchronisation will not succeed if the autolock feature is disabled in the user's device.
MT output: D'un autre côté, si le serveur Exchange est configuré pour exiger verrouillage automatique, la synchronisation ne sera pas réussir si le verrouillage automatique fonction est désactivé de l'utilisateur.
Post-edited: D'un autre côté, si le serveur Exchange est configuré pour exiger le verrouillage automatique, la synchronisation ne sera pas effectuée si la fonction de verrouillage automatique est désactivée sur l'appareil de l'utilisateur.
Comment: Add missing (le); adjust word choice and verb form (réussir); adjust word order (automatique fonction); add missing information (device).

Language pair: EN-FR
Source: No license activation needed in this release.
MT output: Aucune activation de la licence nécessaire dans cette version.
Post-edited: Aucune activation de la licence n'est nécessaire dans cette version.
Comment: Add negation and auxiliary verb

Language pair: EN-FR
Source: It is from these 3 factors you must run the combine.
MT output: Il est de ces 3 facteurs vous devez utiliser la moissonneuse-batteuse.
Post-edited: C'est en fonction de ces 3 facteurs que vous devez utiliser la moissonneuse-batteuse.
Comment: Adjust word order

Language pair: EN-FR
Source: If the release control function is not selected, area <1>E</1> is not taken into account!
MT output: Si la régulation de délestage fonction n'est pas sélectionné, <1>zone E</1> n'est pas pris en compte !
Post-edited: Si la fonction de régulation de délestage n'est pas sélectionnée, zone <1>E</1> n'est pas pris en compte !
Comment: Adjust word order; adjust gender agreement; adjust format paint

Language pair: EN-FR
Source: Right and left fenders are the same.
MT output: Ailes droite et gauche sont bien les mêmes.
Post-edited: Les ailes droites et gauches sont les mêmes.
Comment: Add missing article (les); adjust number agreement (droites, gauches); delete extra word (bien)

Language pair: EN-FR
Source: Using the hardware removed earlier, attach the new center link.
MT output: À l'aide du matériel de fixation déposé plus tôt, fixer la nouvelle liaison centrale.
Post-edited: A l'aide du matériel de fixation déposé plus tôt, fixez la nouvelle articulation centrale.
Comment: Adjust accent in upper case; adjust verb form (fixer -> fixez); adjust terminology (liaison -> articulation)

Language pair: EN-IT
Source: Transport function (front power lift moves to upper limit position) is performed.
MT output: Funzione die trasporto (sollevatore anteriore si sposta nella posizione limite superiore) viene eseguito
Post-edited: Viene eseguita la funzione di trasporto (il sollevatore anteriore si sposta nella posizione limite superiore)
Comment: Adjust word order; adjust gender agreement (eseguito -> eseguita); add missing articles (la, il)

Language pair: EN-IT
Source: Selective Catalytic Reduction (SCR) basic instructions
MT output: Sistema di riduzione catalitica selettiva (RCS) istruzioni di base
Post-edited: Istruzioni di base sulla Riduzione catalitica selettiva (RCS)
Comment: Delete extra word (sistema); adjust word order; adjust lower case

Language pair: EN-IT
Source: A server's Certificate can be of two types:
MT output: Un certificato del server possono essere di due tipi:
Post-edited: Un certificato del server può essere di due tipi:
Comment: Adjust verb form

Language pair: EN-IT
Source: The on-screen keypad can be used for pasting the data in various input fields, for example in the input fields of the web pages.
MT output: La tastiera su schermo può essere utilizzata per incollare i dati in varie campi di immissione dati, ad esempio nei campi di immissione dati di Web pagine.
Post-edited: La tastiera su schermo può essere utilizzata per incollare i dati in vari campi di immissione, ad esempio nei campi di immissione delle pagine Web.
Comment: Adjust gender agreement (varie -> vari); delete extra words (dati); adjust word order

Language pair: EN-NL
Source: After the engine starts, check the instruments to make sure the indications are correct.
MT output: Zodra de motor start, controleert de instrumenten om er zeker van te zijn dat de aanwijzingen die erop staan door zijn juist.
Post-edited: Zodra de motor start, controleert u de instrumenten om er zeker van te zijn dat de aanwijzingen die erop staan, juist zijn.
Comment: Add preposition; delete extra word (door); adjust word order

Language pair: EN-NL
Source: Check that no tools or other items have been left on the machine or in the operator's compartment.
MT output: Controleer of er geen wektuigen of andere items hebben zijn op de machine of in de cabine.
Post-edited: Controleer of er geen gereedschap of andere voorwerpen zijn achtergebleven op de machine of in de cabine.
Comment: Adjust terminology; add missing verb

Language pair: EN-NL
Source: The message REFERENCE CYCLE may appear in the display.
MT output: De melding REF. BEWEGING kunnen op het display verschijnen.
Post-edited: De melding REF. BEWEGING kan op het display verschijnen.
Comment: Adjust verb form

Language pair: EN-NL
Source: The accelerator pedal can be used to switch off the truck depending on which function type is programmed:
MT output: Het rijpedaal kan worden gebruikt de machine uit te schakelen afhankelijk van deze functie type is geprogrammeerd:
Post-edited: Het rijpedaal kan worden gebruikt om de machine uit te schakelen, afhankelijk van welk type functie is geprogrammeerd:
Comment: Add missing preposition; adjust punctuation; adjust word order

Language pair: EN-NL
Source: The plug connectors can become stiff and dirt can enter the hydraulic system.
MT output: De steekkoppelingen kunnen stijf en vuil in het hydraulisch systeem kunnen.
Post-edited: De steekkoppelingen kunnen stijf worden en er kan vuil in het hydraulisch systeem binnendringen.
Comment: Add missing auxiliary verb; add pronoun and missing verb

Language pair: EN-PT-BR
Source: The windows are primarily intended for developers to test the compatibility of their software with the Virtual Terminal application.
MT output: As janelas se destinam principalmente para os criadores para testar a compatibilidade do seu software com o Terminal virtual aplicação.
Post-edited: As janelas se destinam principalmente aos programadores para testar a compatibilidade do seu software com o aplicativo Virtual Terminal.
Comment: Adjust grammar (para os -> aos); adjust terminology (criadores -> programadores); adjust word order and product name

Language pair: EN-PT-BR
Source: This chapter helps you quickly begin using the analyzer.
MT output: Este capítulo ajuda a verificar rapidamente começar a usar o analisador.
Post-edited: Este capítulo o ajuda a começar a usar rapidamente o analisador.
Comment: Add pronoun, delete additional word verificar, change word order.

Language pair: EN-PT-BR
Source: User Tests are tests that you create to test specific functionality of your network.
MT output: Os testes de usuário são testes que criar para testar funcionalidades específicas da rede.
Post-edited: Os testes de usuário são testes que você cria para testar as funcionalidades específicas de sua rede.
Comment: Adjust grammar, add pronoun.

Language pair: EN-PT-BR
Source: Your search could also be as specific as comparing a single object in one database with a single object in another database.
MT output: Sua pesquisa também pode ser tão específico como comparar um único objeto em um banco de dados com um único objeto em outro banco de dados.
Post-edited: Sua pesquisa também pode ser tão específica como comparar um único objeto em um banco de dados com um único objeto em outro banco de dados.
Comment: Correct gender

Language pair: EN-PT-BR
Source: In order to avoid mutual influence, only one parameter should be changed at a time as a remedy.
MT output: Para evitar influência mútua, somente um parâmetro deve ser substituído a um tempo em que uma solução.
Post-edited: Para evitar influência mútua, somente um parâmetro deve ser alterado por vez como solução.
Comment: Change substituído for alterado, although that could be considered a preferential change and not necessary. Correct translation of at a time

Language pair: EN-ZH-CN
Source: Keep the pinion shaft from turning and turn the flywheel three revolutions.
MT output: [Chinese text not preserved]
Post-edited: [Chinese text not preserved]
Comment: Redundant translation ([Chinese text not preserved]) is deleted, and logic of the sentence is improved.

Language pair: EN-ZH-CN
Source: If the gap is correct, continue on with the remainder of the procedure.
MT output: [Chinese text not preserved]
Post-edited: [Chinese text not preserved]
Comment: Translation of phrase "continue on with" is corrected, and translation of term "procedure" is modified.

Language pair: EN-ZH-CN
Source: Use the optional right angled DC power cable.
MT output: [Chinese text not preserved]
Post-edited: [Chinese text not preserved]
Comment: Translation of term "right angled" is corrected, and translation of term "power cable" is modified.

Language pair: EN-ZH-CN
Source: The card holder is located on the right side of the VC70 under the Service door.
MT output: [Chinese text not preserved, except "vc 70"]
Post-edited: [Chinese text not preserved, except "VC70"]
Comment: Translation of model number "VC70" and term "Service door" are modified.

Language pair: EN-ZH-CN
Source: To adjust keyboard tray position, loosen the right and left locking knobs two full turns and rotate the keyboard tray to the desired position.
MT output: [Chinese text not preserved]
Post-edited: [Chinese text not preserved]
Comment: Logic of the sentence and translation of terms are modified.

Language pair: EN-ZH-CN
Source: Click Mail in Navigation Pane on the left, and click New Email.
MT output: [Chinese text not preserved]
Post-edited: [Chinese text not preserved]
Comment: Modifications are made to the single byte comma, and inconsistent treatments of UI.

Language pair: DE-EN
Source: Mit der 1 Meter langen Tischantenne können Sie Ihren WLAN-Empfang deutlich optimieren.
MT output: With the 1 meter long Tischantenne you can significantly optimize your WLAN-reception.
Post-edited: With the 1-metre long table-top antenna you can significantly optimise your WLAN reception.
Comment: Adjust spelling (American English -> British English); translate word left untranslated (Tischantenne)

Language pair: DE-EN
Source: Drücken des Fußschalters aktiviert einen geringen Druck in den Pneumatikzylindern.
MT output: Press the foot switch activates a low pressure in the pneumatic cylinders.
Post-edited: Press the foot switch to activate low pressure in the pneumatic cylinders.
Comment: Adjust verb form; delete extra article

Language pair: DE-EN
Source: Hauptspindel nicht in Betrieb nehmen, wenn die Versorgung mit Kühlmittel nicht gewährleistet ist.
MT output: Do not operate main spindle if the supply with coolant is not ensured.
Post-edited: Do not operate the main spindle unless the coolant supply is ensured.
Comment: Adjust word order

Language pair: FR-EN
Source: Élaborer une introduction à l'aide des éléments clés d'une entrée en matière efficace
MT output: Develop an introduction to using the key elements of an entry in the effective
Post-edited: Prepare an introduction using key elements to provide an effective overview of the subject area
Comment: Adjust verb meaning, adjust verb form, adjust word order, add in necessary text

Language pair: FR-EN
Source: Voici quelques techniques pouvant vous aider à faciliter l'apprentissage selon les styles d'apprentissages
MT output: Here are some techniques that can help you to facilitate learning, based on the styles of programming
Post-edited: Here are some techniques to facilitate learning, based on the different learning styles
Comment: Adjust meaning; delete unnecessary text; adjust word order

Language pair: FR-EN
Source: Passez le matériel en revue, lisez à voix haute, enregistrez-vous
MT output: Skip the hardware review, read aloud, save-you
Post-edited: Review the material, read aloud, record yourself
Comment: Adjust meaning
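
The comments in this appendix describe the changes a post-editor made to each MT output. As a rough aid for studying such pairs, the changed words can be listed automatically. The sketch below is only an assumption-laden illustration: it uses Python's standard difflib module to compare the MT output and the post-edited text word by word, the function name word_edits is hypothetical, and the result says nothing about why a change was made (that is what the Comment field is for).

```python
from difflib import SequenceMatcher


def word_edits(mt_output, post_edited):
    """List word-level changes between MT output and its post-edited version.

    Returns (operation, MT words, post-edited words) tuples, where operation
    is difflib's 'replace', 'delete' or 'insert'. Unchanged words are skipped.
    """
    mt_words = mt_output.split()
    pe_words = post_edited.split()
    matcher = SequenceMatcher(None, mt_words, pe_words, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag,
                          " ".join(mt_words[i1:i2]),
                          " ".join(pe_words[j1:j2])))
    return edits


if __name__ == "__main__":
    # DE-EN example from this appendix.
    mt = "Press the foot switch activates a low pressure in the pneumatic cylinders."
    pe = "Press the foot switch to activate low pressure in the pneumatic cylinders."
    print(word_edits(mt, pe))
    # prints [('replace', 'activates a', 'to activate')]
```

Here the single replaced span corresponds to the comment "Adjust verb form; delete extra article". A short list of edits for a long segment suggests the MT output was easy to post-edit.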


Appendix 2: References
BLEU score available at http://www.cs.columbia.edu/nlp/sgd/bleu.pdf (accessed June 2013)
NIST available at http://www.itl.nist.gov/iad/mig//tests/mt/ and http://www.itl.nist.gov/iad/mig//tests/mt/doc/ngram-study.pdf (accessed June 2013)
METEOR available at http://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-mtj2009.pdf (accessed June 2013)
TER available at http://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf (accessed June 2013)
TAUS Dynamic Quality Framework available at https://tauslabs.com/dynamicquality/about-dqf (accessed June 2013)
Common Sense Advisory available at http://www.commonsenseadvisory.com/ (accessed September 2013)
