Crowdsourcing Parallel Corpus For English-Oromo Neural Machine Translation Using Community Engagement Platform



Preprint · February 2021


All content following this page was uploaded by Sisay Chala on 16 February 2021.



Crowdsourcing Parallel Corpus for English – Oromo Neural Machine
Translation using Community Engagement Platform
Sisay Chala1, Bekele Debisa2, Amante Diriba3,
Silas Getachew4, Chala Getu5, Solomon Shiferaw6
1
[email protected], [email protected], [email protected],
4
[email protected], [email protected], [email protected]

Abstract

Even though Afaan Oromo is the most widely spoken language in the Cushitic family, with more than fifty million speakers in the Horn of Africa and East Africa, it is surprisingly resource-scarce from a technological point of view. The increasing number of useful documents written in English motivates the investigation of machine translation that can make those documents easily accessible in the local language. This paper deals with implementing translation from English to Afaan Oromo and vice versa using Neural Machine Translation. This approach has not been well explored for the language pair due to the limited amount and diversity of the available corpus. However, using a bilingual corpus of just over 40k sentence pairs that we collected, this study showed promising results. About a quarter of this corpus was collected via a Community Engagement Platform (CEP) that was implemented to enrich the parallel corpus through crowdsourced translations.
Keywords: Oromo, neural machine translation, under-resourced languages

1 Introduction

Machine translation (MT) is the process of translating text from one language (i.e., source language) to its corresponding text in another language (i.e., target language). Due to its practical demands, MT has been studied for many decades, leading to various techniques with diverse and complex pros and cons, namely Rule-based Machine Translation (RBMT) and Corpus-based Machine Translation. The former primarily utilizes rules developed by linguists, whereas the latter is based on a corpus of text out of which it generates patterns that help predict the target text for a given source text.

Corpus-based MT is further divided into Statistical Machine Translation (SMT) and Neural Machine Translation (NMT). Associated with it are a number of challenging aspects of MT with respect to its requirements and the complex nature of human languages. The main challenge is the availability of comprehensive rules for RBMT and of good-quality corpora for SMT and NMT. There are also other challenges such as the variety of languages, alphabets and grammatical structures; multiple possibilities of translation; and the natural evolution of languages.

In this study, we focus on developing English-Oromo MT using OpenNMT (Klein, Kim, Deng, Senellart, & Rush, 2017). We selected this approach because, unlike SMT, which requires a pipeline of separate tasks (namely language modeling, translation modeling and decoding), NMT is built with one network. Moreover, unlike RBMT, NMT does not require intensive linguistic preprocessing of the data.

Though NMT emerged only recently, its models outperformed SMT (Bahdanau, Cho, & Bengio, 2014), leading many companies that offer MT services, including Google (Wu, et al., 2016) and Microsoft, to switch to NMT (Le & Schuster, 2016) (Microsoft Translator, 2016). Furthermore, as shown in Figure 1, neural MT consistently outperforms the other approaches (Le & Schuster, 2016) as tested for multiple language pairs.
Figure 1 Translation quality of statistical and neural MT models by Google (Le & Schuster, 2016)

The paper is composed as follows: Section 2 discusses the analysis of related works, providing a brief highlight of similar studies; Section 3 describes tools and methods utilized in the design and implementation of the system; Section 4 presents the prototype English-Oromo neural machine translation system; in Section 5 we discuss implemented measures to deal with the effect of limited linguistic resources; in Sections 6 and 7 we discuss some results, and provide conclusions and possible future directions for research, respectively.

2 Analysis of Related Works

Google Translate, the most popular translation platform, supports over 100 languages at various levels and serves over 500 million people daily. In February 2016, Google announced the addition of 13 new languages to its live translation tool, including Xhosa of South Africa and Ethiopia's Amharic, the second most widely spoken Semitic language after Arabic, which is a breakthrough for our country's translation technology. Afaan Oromoo is not yet translated by Google.

All existing research on English to Afaan Oromoo translation uses either Statistical Machine Translation (SMT) or Rule Based Machine Translation (RBMT). Both techniques have been replaced by different organizations (including Google and Microsoft) with the new approach called neural machine translation. Each of these methods has its own weaknesses in performing translation, and NMT has an advantage over both.

SMT, while translating, looks in millions of documents in order to find patterns to decide on the best translation. NMT translates "whole sentences at a time, rather than just piece by piece. It uses this broader context to help it figure out the most relevant translation, which it then rearranges and adjusts to be more like a human speaking with proper grammar" (Alimi & Amiri, 2018).

A review of articles to get an overview of translation score in relation to data size and quality for other languages is summarized as follows:

• English to Afaan Oromoo SMT (Adugna & Eisele, 2010) utilized a bilingual corpus of 20,000 sentences and a monolingual corpus of 62,300 sentences, and achieved an overall BLEU (Papineni, Roukos, Ward, & Zhu, 2002) score of 17.4%.
• Bidirectional English – Afaan Oromo Using Hybrid Approach by Jabesa Daba (Daba & Assabie, 2014) employed a bilingual corpus of 3,000 sentences and achieved BLEU scores of 37.41% and 52.02% for English to Afaan Oromo and for Afaan Oromoo to English, respectively.
• Bidirectional English-Amharic Machine Translation: An Experiment using Constrained Corpus by Elleni (Teshome, 2013) reported BLEU scores of 82.22% and 90.59% for simple sentences, for English to Amharic and Amharic to English, respectively. Furthermore, for complex sentences, the respective BLEU scores for English to Amharic and Amharic to English were 73.38% and 84.12%.
• English-Afaan Oromo SMT by Yitayew Solomon and Million Meshesha (Meshesha & Solomon, 2018) utilized a bilingual corpus of 6,400 sentences and monolingual corpora of 19,300 sentences (for English) and 12,200 sentences (for Afaan Oromoo), and reported a BLEU score of 27% at the phrase level with a maximum phrase length of 16.
3 Tools and Methods

Even though there are a number of research studies on automatic translation of English to Afaan Oromoo, a fully functional platform where a user can easily translate text is still lacking. This study sets out to implement a neural machine translation system through which we provide web-based translation.

While the main objective of this study is to demonstrate a prototype of an English-Oromo neural machine translation system, the specific goals are threefold: the first is to develop automatic translation of text from English to Oromo using NMT, and the second is to compare the accuracy of NMT models with SMT models developed for the English-Oromo language pair (Adugna & Eisele, 2010). The third and final objective is to develop a community engagement platform (CEP) in order to improve the accuracy of the baseline MT system by collecting more accurate parallel text through crowdsourcing (Nishimura, Sudoh, Neubig, & Nakamura, 2018).

4 Prototype English-Oromo NMT System

4.1 Data Collection and Preprocessing

To perform the experiments, the datasets or corpora were collected from documents from diverse sources such as the Ethiopian criminal code, the Ethiopian constitution, Oromia Regional State Duties and Responsibilities, religious books, books and articles from Oromo media, international conventions, translated documents from the Oromia Regional bureau, and some translated books on subjects such as fiction, history and psychology. We also used monolingual Afaan Oromo and English corpora collected from the web.

We performed data cleaning and preprocessing to make the dataset ready for alignment and experimentation. In other words, data gathered from different sources arrives in a raw format that is not suitable for analysis, so preprocessing is needed to achieve better results from the NMT model. Data preprocessing includes data cleaning (removing unnecessary characters that affect translation quality), data integration (aggregating data obtained from different sources), data reduction (e.g., eliminating part of the data that does not have an equivalent translation) and data transformation. For this purpose, we used the Python programming language.

4.2 Description of dataset

In order to train the NMT model we just need two files: a source language file and a target language file, each with one sentence per line and with words separated by spaces. The standard format used in both statistical and neural translation is the parallel text format, which consists of sentence pairs in plain text files corresponding to source sentences and target translations, aligned line-by-line and with words separated by spaces, following Chansung (2018).

4.3 Implementation of NMT System

Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation, with the potential to overcome many of the weaknesses of conventional phrase-based translation systems. NMT is based on the model of neural networks in the human brain, with information being sent to different 'layers' to be processed before output. It translates faster than statistical methods and is able to produce higher-quality output. NMT is also able to use algorithms to learn linguistic rules on its own from statistical models. The main strength of NMT is that it is able to learn the mapping from input text to the corresponding output text directly, in an end-to-end fashion.

Another major success of NMT: "as Google has seen in its use of NMT, the technology has several advantages, including its application of a singular system that can be trained directly on both source and target text" (Ulatus, 2018). Another significant element of NMT is its capability to automatically fix its parameters throughout its training period. Other benefits include that NMT:

• Translates grammatically complex languages, such as Korean, Japanese, and Arabic, efficiently (Ulatus, 2018).
• Uses algorithms to detect and learn language conventions that come from statistical models, resulting in quicker and better translations (Ulatus, 2018).
• Considers the complete sentence, not just a string of words.
• Learns nuances of languages, such as genders, inflection, and formality.
• Assists in applications, such as multilingual authoring, translation checking, and multilingual video conferencing (Ulatus, 2018).

We use OpenNMT, which is an open-source initiative for neural machine translation and neural sequence modeling. OpenNMT has features such as a simple general-purpose interface requiring only source/target files, and highly configurable models and training procedures. It also includes recent research features to improve system performance, and it engages the community through both academic and industrial contributions and requests.

Figure 2 Neural Machine Translation [Adapted from Park Chansung (Chansung, 2018)]

4.4 Evaluation

Machine translation systems can be evaluated by human or automatic evaluation methods. Even if human evaluation is accurate, it is costly and suffers from limited efficiency as compared to automatic evaluation. Therefore, we used the BLEU score, which is an automatic evaluation technique, to evaluate the performance of the system.

5 Handling Limited Language Resources

We implemented various techniques to deal with the effect of limited linguistic resources for the English-Oromo language pair on the effectiveness of the prototype machine translation. First, we tried to improve the translation quality by aligning the text at the sentence level rather than at the document level as in the baseline. Second, we implemented a Community Engagement Platform (CEP) in order to get more data of better quality from the community, who provided us with data on a voluntary basis.

5.1 Community engagement platform (CEP)

The community engagement platform is composed of two actions the community takes to contribute to the system, i.e., translation and verification, as well as mechanisms for rewarding contributors in the form of gamification, as described in the following subsections:

5.1.1 Translation

Volunteers have contributed to the enrichment of our system by providing translations of the text on our platform (Lingogrid.com, 2020) sentence by sentence. One can provide a single translation or multiple translations, as shown in Figure 3 and Figure 4, respectively, and submit the translation sentence(s). One can also skip translating the sentence and proceed to the next sentence.

Figure 3 Single translation of a given sentence
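
To illustrate how crowdsourced translations of the kind described in Section 5.1.1 could feed the line-aligned source/target files of Section 4.2, the following Python sketch flattens contributor submissions into sentence pairs. It is an illustration under stated assumptions, not the platform's actual code: the `Submission` record, its fields and the helper names are hypothetical. A skipped sentence is modeled as an empty list of translations, and a sentence with several translations yields one pair per translation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Submission:
    """Hypothetical record of one contributor action on one source sentence."""
    source: str
    translations: List[str] = field(default_factory=list)  # empty list = skipped


def to_sentence_pairs(submissions: List[Submission]) -> List[Tuple[str, str]]:
    """Flatten submissions into (source, target) pairs, one per translation."""
    pairs = []
    for sub in submissions:
        src = " ".join(sub.source.split())  # collapse runs of whitespace
        for translation in sub.translations:
            tgt = " ".join(translation.split())
            if src and tgt:  # drop pairs with an empty side
                pairs.append((src, tgt))
    return pairs


def write_parallel_files(pairs, src_path, tgt_path):
    """Write line-aligned files: line i of src_path pairs with line i of tgt_path."""
    with open(src_path, "w", encoding="utf-8") as src_f, \
         open(tgt_path, "w", encoding="utf-8") as tgt_f:
        for src, tgt in pairs:
            src_f.write(src + "\n")
            tgt_f.write(tgt + "\n")
```

Repeating the source line for each alternative translation keeps the two files the same length, which is what the line-by-line alignment requires.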
Figure 4 Multiple translation of a given sentence

5.1.2 Verification

In order to improve the quality of translations, users have also provided their contributions through verification (Lingogrid.com, 2020) of existing translations through an interface developed for this purpose. This enables us to validate translated texts generated automatically from parallel documents as well as those translations provided by volunteers.

Figure 5 Validating a translation with rating

Users' contributions to verification can take two forms: i) rating the given translation as shown in Figure 5 and ii) rating and also providing an alternative translation as shown in Figure 6.

Figure 6 Validating a translation with rating and alternative translation

Both translation and verification are done in batches in order not to overwhelm the contributor, dividing the contribution into manageable episodes of 5 contributions at a time as shown in Figure 7. The choice of 5 is arbitrary, with the objective of gathering more inputs without imposing stress on users.

Figure 7 Contributions in a batch of 5

5.1.3 Gamification

To engage users through gamification (Lingogrid.com, 2020), we have developed a reward system that can motivate individual contributors through badges and a leaderboard, as shown in Figure 8 and Figure 9, respectively.

Figure 8 Gamification with Badges

While badges motivated individual contributors through a sense of achievement, the leaderboard motivated individual contributors through awareness of competitiveness.
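
The batching and reward mechanics of the CEP (Sections 5.1.2 and 5.1.3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the batch size of 5 comes from the text, but the badge names, the thresholds, and the rule of one point per contribution are hypothetical and are not taken from the platform.

```python
from collections import Counter
from typing import Dict, Iterator, List, Sequence, Tuple

BATCH_SIZE = 5  # episode size stated in the text

# Hypothetical badge thresholds (minimum number of contributions), highest first.
BADGES: List[Tuple[int, str]] = [(100, "gold"), (25, "silver"), (5, "bronze")]


def batches(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def badge_for(count: int) -> str:
    """Return the highest badge whose threshold is reached, or '' if none."""
    for threshold, name in BADGES:
        if count >= threshold:
            return name
    return ""


def leaderboard(contributions: List[str], top: int = 10) -> List[Tuple[str, int, str]]:
    """Rank contributors by number of contributions (one point each, an assumption)."""
    counts: Dict[str, int] = Counter(contributions)
    # Sort by descending count, breaking ties alphabetically by contributor id.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(user, n, badge_for(n)) for user, n in ranked[:top]]
```

Here `contributions` is a flat list with one contributor identifier per accepted translation or verification, so a contributor who completes one full batch already reaches the lowest hypothetical threshold.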
Figure 9 Gamification with leaderboard (blurred to ensure privacy of the data providers)

6 Results and Discussion

The main result of this study is the demonstration of the prototype system (Lingogrid.com, 2020), whose interface is shown in Figure 10.

Figure 10 Prototype English-Oromo Translation System (screenshot from (Lingogrid.com, 2020))

Using the CEP, we were able to collect over 10k sentence pairs, the quality of which we improved through the crowdsourced verification.

Translation quality was measured using the BLEU score, which automatically approximates a human translator's judgment of the quality of a given machine translation. In this experiment, with a bilingual corpus of 42k sentence pairs, the achieved BLEU scores for translating from English to Oromo and from Oromo to English are 26% and 22%, respectively.

7 Conclusions and Future Works

We have presented the implementation of a prototype English-Oromo neural machine translation system. In addition, aimed at increasing the parallel corpus for under-resourced languages, we have presented a community engagement platform that was augmented with gamification and that enables collection of translations as well as validation of translated texts.

Future work remains to determine the limits of translation quality achievable with larger amounts of high-quality parallel text. One can also analyze cross-domain translation performance.

In addition, the usability of the CEP system from the users' point of view needs to be studied and quantified in order to identify points of improvement in look-and-feel as well as performance (i.e., response time, transaction rates and throughput). Expanding the MT system as well as the CEP to more language pairs is another major issue for future work.

Another aspect of the future work will be to capitalize on machine translation outputs in supporting public administrations in order to provide inter-translation of content from one language to another. This will be of paramount importance in a multilingual communication setting.

8 References

Alimi, H., & Amiri, M. (2018). Exploring English Translations of Quran, Chapter Al-Falaq with an Explanatory Model of Word Selection via take a look at Google Translate Chosen Words. Pure Life, 5(13), 115-151.
Adugna, S., & Eisele, A. (2010). English–Oromo Machine Translation: An Experiment Using a Statistical Approach. Seventh International Conference on Language Resources and Evaluation (LREC'10).
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Chansung, P. (2018). Seq2Seq model in TensorFlow. Retrieved from https://towardsdatascience.com/seq2seq-model-in-tensorflow-ec0c557e560f. Last Accessed: 15 January 2021
Daba, J., & Assabie, Y. (2014). A Hybrid Approach to the Development of Bidirectional English-Oromiffa Machine Translation. In: Przepiórkowski A., Ogrodniczuk M. (eds) Advances in Natural Language Processing.
Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M. (2017). Opennmt: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810.
Le, Q. V., & Schuster, M. (2016). A Neural Network for Machine Translation, at Production Scale. Retrieved from https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html. Last Accessed: 15 January 2021
Lingogrid.com. (2020). Hiikaa - English to Afaan Oromoo Translation. Retrieved from https://hika.lingogrid.com. Last Accessed: 15 January 2021
Meshesha, M., & Solomon, Y. (2018). English-Afaan Oromo Statistical Machine Translation. International Journal of Computational Linguistics (IJCL).
Microsoft Translator. (2016). Microsoft Translator launching Neural Network based translations for all its speech languages. Retrieved from https://www.microsoft.com/en-us/translator/blog/2016/11/15/microsoft-translator-launching-neural-network-based-translations-for-all-its-speech-languages/. Last Accessed: 15 January 2021
Nishimura, Y., Sudoh, K., Neubig, G., & Nakamura, S. (2018). Multi-source neural machine translation with data augmentation. arXiv preprint arXiv:1810.06826.
Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (pp. 311–318). ACL.
Teshome, E. (2013). Bidirectional English-Amharic Machine Translation: An Experiment using Constrained Corpus. Addis Ababa University.
Ulatus. (2018). Neural Machine Translation: Embarking on the New Frontier. Available at: https://www.ulatus.com/translation-blog/neural-machine-translation-embarking-on-the-new-frontier/
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., . . . Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144.
