Applsci 11 10915 v2
sciences
Article
High-Performance English–Chinese Machine Translation Based
on GPU-Enabled Deep Neural Networks with Domain Corpus
Lanxin Zhao 1, *, Wanrong Gao 2 and Jianbin Fang 2
1 School of International Business, Hunan University of Information Technology, Changsha 410151, China
2 School of Computer, National University of Defense Technology, Changsha 410073, China;
[email protected] (W.G.); [email protected] (J.F.)
* Correspondence: [email protected]
Abstract: The ability to automate machine translation has various applications in international
commerce, medicine, travel, education, and text digitization. Due to the different grammar and
lack of clear word boundaries in Chinese, it is challenging to translate from word-based
languages (e.g., English) to Chinese. This article presents a GPU-enabled deep learning
machine translation system based on a domain-specific corpus. Our system takes an English text
as input and uses an encoder-decoder model with an attention mechanism based on Google’s
Transformer to translate the text to Chinese output. The model was trained using a simple
self-designed entropy loss function and an Adam optimizer on English–Chinese bilingual text sentences
from the News area of the UM-Corpus. The parallel training process of our model can be performed
on common laptops, desktops, and servers with one or more GPUs. At training time, we not only
track loss over training epochs but also measure the quality of our model’s translations with the
BLEU score. We also provide an easy-to-use web interface for users to manage corpora, training
projects, and trained models. The experimental results show that we can achieve a maximum BLEU
score of 29.2. We can further improve this score by tuning other hyperparameters. The GPU-enabled
model training runs over 15x faster than on a multi-core CPU, which gives us a shorter
turn-around time. As a case study, we compare the performance of our model to that of Baidu’s,
which shows that our model can compete with the industry-level translation system. We argue
that our deep-learning-based translation system is particularly suitable for teaching purposes and
small/medium-sized enterprises.
Keywords: neural machine translation; transformer; GPUs; multi-domain corpus
Citation: Zhao, L.; Gao, W.; Fang, J. High-Performance English–Chinese Machine Translation Based on
GPU-Enabled Deep Neural Networks with Domain Corpus. Appl. Sci. 2021, 11, 10915.
https://doi.org/10.3390/app112210915
Academic Editors: Jorge Martin-Gutierrez and João M. F. Rodrigues
method, NMT has a simple architecture and can capture long-range dependencies in a
sentence, which gives it great potential to become a new trend in language translation.
The dominant NMT approach is the “Embed-Encode-Attend-Decode” paradigm.
Recurrent neural network (RNN) [13], convolutional neural network (CNN) [14], and
self-attention/feed-forward network (SA/FFN) [15] architectures are the most commonly
used approaches based on this paradigm. In particular, Google has proposed Transformer,
which relies entirely on self-attention to compute representations of its input and output
without using sequence-aligned RNNs or CNNs [15]. The Transformer model aims to handle
long-range dependencies in sequence-to-sequence tasks. It has outperformed many other
models and is thus the focus of our work.
Recent works have focused on large pretrained models, which are mostly built based
on the Transformer model, but with much larger capacity [15–18]. This approach utilizes
a combination of pretraining and supervised fine-tuning. The capacity of transformer
language models has increased significantly, from 100 million parameters [16], to 1.5 billion
parameters [17], and finally 17 billion parameters [9,18]. Although each increase has
brought significant performance improvements in downstream NLP tasks, training such
models requires large-scale specialized computing hardware such as Google’s TPUs [19].
These computing clusters are typically unaffordable for small/medium-sized enterprises.
Moreover, these models are too complex and too large to interpret easily: we know that
they perform well, but we do not know why. They behave like a “black box” and are
particularly unsuitable for teaching purposes.
The dominant approach to creating machine learning systems is to collect a dataset
of training examples demonstrating correct behavior for a desired task, train a system to
imitate these behaviors, and then test its performance on independent held-out examples.
This widely accepted approach can produce trained models that behave like domain experts.
Many prior works have shown that training on a domain-specific bilingual corpus yields
better prediction accuracy [20–22]. For
example, microblogs are an excellent linguistic resource. Ling et al. have shown that some
microblog users post “self-translated” messages targeting audiences who speak different
languages [21]. Based on this observation, the authors introduced a method for
finding and extracting these naturally occurring Chinese–English parallel segments. Their
evaluation results demonstrated that the automatically extracted parallel data yield
significant translation quality improvements. Tian et al. designed UM-Corpus as
a multi-domain and balanced parallel corpus [23]. It is a corpus of two million aligned
English–Chinese sentences from eight text domains: Education, Laws, Microblog,
News, Science, Spoken, Subtitles, and Thesis. Although using a domain-specific corpus for
model training has yielded promising results, we believe that bilingual corpora built by
domain experts are still lacking. Therefore, a way to leverage domain expertise
to build new corpora is needed.
In this work, we present an easy-to-use deep learning machine translation system,
built from scratch, based on a domain corpus. The deep learning algorithm takes in
English text as input and uses an encoder-decoder model with an attention mechanism
based on Google’s Transformer to translate the text to Chinese output. The model was
trained using a simple self-designed entropy loss function and an Adam optimizer on
paired English and Chinese text sentences from the news area of the UM-Corpus. The
parallel training process of our model can be performed on common laptops, desktops,
and servers with one or more GPUs. During training time, we not only track loss over
training epochs but also measure the quality of our model’s translations using the BLEU
score (see Section 2.1.5).
The experimental results on the UM-Corpus show that our trained model can achieve
a maximum BLEU score of 29.2. We can further improve this score by tuning other hyper-
parameters. We provide a web interface for users to build a domain-specific corpus and
configure training parameters. We also observe that training the model on high-end GPUs
is much faster than on a multi-core CPU, and thus the GPU is a very promising training
platform for NMT. We conclude that our translation system is portable across platforms, which
makes it suitable for teaching purposes and for use in small/medium-sized enterprises. As a
case study, we compare the performance of our model to that of Baidu’s and show that our
model can compete with the production-level translation system.
To summarize, our contributions are as follows:
• We present a transformer-based machine translation system, which is built from
scratch, based on a domain corpus.
• Our translation system is easy to use with a web interface, so that domain experts can
extend the existing corpus and train the model in a fine-tuned manner.
• Our machine learning system is both portable and configurable and can be deployed
on laptops, desktops, or servers with multiple GPUs.
• Our translation system, trained on a domain corpus, can achieve performance
competitive with a production-level translation system.
where $m$ is the number of words in $y$, $y_j$ is the currently generated word, and $y_{<j}$ are the
previously generated words. At inference time, beam search is typically used to find
the translation that maximizes this probability.
it reads the source sentence word by word and compresses the variable-length sequence
into a fixed-length vector. This process is encoding, i.e., the encoder converts words in the
source sentence into word embeddings. These word embeddings are then processed by
neural layers and converted to representations that capture contextual information. These
contextual representations are called the encoder representations. The decoder uses an
attention mechanism, the encoder representations, and previously generated words to
generate the decoder representations, which in turn are used to generate the next target
word. The encoder and decoder can be built from RNN [13], CNN [14], or self-attention and
feed-forward [15] layers.
While NMT has shown great potential in capturing the dependencies inside the
sequence, its performance still drops sharply when the input sentences are too
long. This is due to the limited representational capacity of a fixed-length vector.
The attention mechanism was introduced to address this limitation. It works as an intermediate
component between the encoder and decoder and relates words to each other in a dynamic manner
(Figure 1). The inspiration for applying the attention mechanism to
NMT comes from human behavior in reading and translating text: human beings
often read text repeatedly to mine the word dependencies within a sentence.
$$a_{ji} = \frac{\exp(e_{ji})}{\sum_{k=1}^{n} \exp(e_{jk})}, \qquad (4)$$
where $e_{ji}$ is an alignment score, $a$ is an alignment model that scores the match level of the
inputs around position $i$ and the output at position $j$, $s_{j-1}$ is the decoder hidden state
of the previously generated word, and $h_i$ is the encoder hidden state at position $i$. The
calculated attention weights are then used to weight the $n$ encoder hidden states to obtain a
context vector as
$$c_j = \sum_{i=1}^{n} a_{ji} h_i, \qquad (5)$$
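Equations (4) and (5) can be illustrated with a small numerical sketch. The scores and hidden states below are hypothetical toy values, and the function names are ours; the sketch assumes the weights are normalized over the encoder positions so that they sum to one for each output position.

```python
import math

def attention_weights(scores):
    """Softmax over the alignment scores e_j1..e_jn of one output position j."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_states):
    """Weighted sum of the encoder hidden states, as in Equation (5)."""
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_states))
            for d in range(dim)]

# Hypothetical toy input: three encoder positions, two-dimensional hidden states.
scores_j = [2.0, 0.5, -1.0]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights_j = attention_weights(scores_j)
c_j = context_vector(weights_j, encoder_states)
print(weights_j, c_j)
```

The encoder position with the highest alignment score contributes most to the resulting vector.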
This context vector is fed to the decoder along with the previously generated word and
its hidden state to produce a representation for generating the current word. The decoder
hidden state for the current word, $s_j$, is computed by
$$s_j = g(s_{j-1}, y_{j-1}, c_j), \qquad (6)$$
where $g$ is the decoder activation function, $s_{j-1}$ is the previous decoder hidden state, and
$y_{j-1}$ is the embedding of the previous word. The current decoder hidden state $s_j$, the
previous word embedding, and the context vector are fed to a feed-forward layer $f$ and a
softmax layer to compute a score for generating a target word as output:
$$\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right), \qquad (8)$$
$$BP = \begin{cases} 1, & c > r \\ e^{1 - r/c}, & c \le r \end{cases} \qquad (9)$$
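As a minimal sketch of Equations (8) and (9), the following code combines given n-gram precisions into a BLEU score. It assumes the usual reading of the symbols, with $c$ the candidate length and $r$ the reference length, and uniform weights $w_n = 1/N$; the precision values are hypothetical, and the counting of n-gram matches is not shown.

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if c > r, otherwise exp(1 - r / c), as in Equation (9)."""
    if c > r:
        return 1.0
    return math.exp(1.0 - r / c)

def bleu(precisions, c, r, weights=None):
    """Combine modified n-gram precisions p_n into BLEU, as in Equation (8)."""
    n = len(precisions)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: uniform weights w_n = 1/N
    if min(precisions) <= 0.0:
        return 0.0  # log(0) is undefined; the score is taken as zero
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return brevity_penalty(c, r) * math.exp(log_mean)

# Hypothetical precisions for n = 1..4, candidate length 18, reference length 20.
score = bleu([0.8, 0.6, 0.4, 0.3], c=18, r=20)
print(round(100 * score, 2))
```

Scores such as the 29.2 reported in this work are on the 0 to 100 scale, i.e., the value above multiplied by 100.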
3. Our Methods
This section provides a detailed description of our methods for English–Chinese
translation based on the Transformer model. We introduce our methods in terms of word
segmentation, data preprocessing, model training, and deployment.
Each sentence is transformed into a sequence of integers, each integer being the index
of a token in the dictionary. Only the N most frequent words are taken into account, where
N is set to 32,000 for both the English and the Chinese vocabulary.
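The indexing step can be sketched as follows. The special tokens `<pad>` and `<unk>`, the function names, and the tiny corpus are hypothetical and only illustrate the idea of keeping the N most frequent tokens and mapping all others to an unknown-word index.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens

def build_vocab(tokenized_sentences, max_size):
    """Map the max_size most frequent tokens to integer indices."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {PAD: 0, UNK: 1}
    for token, _ in counts.most_common(max_size):
        vocab[token] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Replace each token by its index; rare tokens fall back to UNK."""
    return [vocab.get(tok, vocab[UNK]) for tok in sentence]

# Toy corpus; the system described here would use max_size=32000.
corpus = [["the", "news", "is", "good"],
          ["the", "news", "is", "late"],
          ["the", "weather"]]
vocab = build_vocab(corpus, max_size=3)
ids = encode(["the", "weather", "is", "good"], vocab)
print(ids)
```

Tokens outside the retained vocabulary ("weather" and "good" in this toy case) share the single unknown-word index.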
of decoder. However, our trained model is much larger than this configuration and is
available upon request.
[Figure residue: Transformer architecture diagram. Recoverable labels: Inputs, Input Embedding, Positional Encoding, Multi-Head Attention, Feed Forward, Output Embedding, Outputs.]
$$lr = d_{model}^{-0.5} \cdot \min\left( \mathit{step\_num}^{-0.5},\ \mathit{step\_num} \cdot \mathit{warmup\_steps}^{-1.5} \right), \qquad (10)$$
where step_num denotes the current step number, and warmup_steps is the number of steps
used to warm up the training process. This corresponds to increasing the learning rate
linearly for the first warmup_steps training steps and decreasing it thereafter proportionally
to the inverse square root of step_num. Here, warmup_steps = 4000. Figure 4 shows
the implementation of the training optimizer. These parameters are selected by trial and
error.
    class TrainOpt:
        """Optim wrapper that implements rate."""

        def __init__(self, model_size, factor, warmup, optimizer):
            self.optimizer = optimizer
            self._step = 0
            self.warmup = warmup
            self.factor = factor
            self.model_size = model_size
            self._rate = 0

        def step(self):
            """Update parameters and rate"""
            self._step += 1
            rate = self.rate()
            for p in self.optimizer.param_groups:
                p['lr'] = rate
            self._rate = rate
            self.optimizer.step()

        def rate(self, step=None):
            """Implement `lrate` above"""
            if step is None:
                step = self._step
            return self.factor * (self.model_size ** (-0.5) *
                                  min(step ** (-0.5), step * self.warmup ** (-1.5)))
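To make the shape of this schedule concrete, the rate formula can be evaluated on its own, without an optimizer. In the sketch below, the function name and the values model_size = 512 and factor = 1.0 are assumptions chosen for illustration; only warmup = 4000 comes from the text.

```python
def warmup_rate(step, model_size=512, factor=1.0, warmup=4000):
    """Learning rate of Equation (10) for a given step (step >= 1)."""
    return factor * model_size ** (-0.5) * min(step ** (-0.5),
                                               step * warmup ** (-1.5))

# The rate grows linearly up to `warmup` steps and then decays as 1/sqrt(step).
for s in (1, 1000, 2000, 4000, 8000, 100000):
    print(s, warmup_rate(s))
```

The two arguments of `min` are equal at step = warmup, which is where the learning rate peaks.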
In this work, we aim to train our NMT model on diverse available computing resources
such as laptop CPUs, desktop CPUs, and servers with one or multiple GPUs. We
mainly use data parallelism to speed up the training process. As shown in
Figure 5, the entire English–Chinese parallel corpus is partitioned into a large number
of batches, and each batch of the training data is distributed to a processor of a GPU or
multi-core CPU. Thus, we have to ensure that there are enough batches to fully
utilize a whole GPU or multiple GPUs. Our implementation is built on
the DataParallel module of the PyTorch framework. Note that we have to use suitable APIs
to create the model and perform data movements between CPUs and GPUs. The batch size
is set to 32.
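The partitioning described above can be sketched in plain Python. This is a simplified stand-in rather than PyTorch code: `make_batches` and `scatter` are hypothetical helpers, and `scatter` only mimics the way PyTorch's DataParallel splits an input batch along its first dimension across the available devices.

```python
def make_batches(pairs, batch_size):
    """Partition the parallel corpus into consecutive batches."""
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]

def scatter(batch, num_devices):
    """Split one batch into near-equal contiguous chunks, one per device."""
    chunk = -(-len(batch) // num_devices)  # ceiling division
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]

# Hypothetical toy corpus of 100 sentence pairs.
pairs = [("en %d" % i, "zh %d" % i) for i in range(100)]
batches = make_batches(pairs, batch_size=32)
shards = scatter(batches[0], num_devices=2)
print([len(b) for b in batches], [len(s) for s in shards])
```

With a batch size of 32 and two devices, each device processes 16 sentence pairs per step, and the last batch of the corpus may be smaller than the others.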
[Figure 5 residue. Recoverable label: "Each batch is distributed to a processor of GPUs or CPUs".]
possible next steps and keeps the k most likely, where k is a user-specified parameter that
controls the number of beams, or parallel searches, through the sequence of probabilities.
The development set source sentences are decoded using combinations of beam size and
length penalty, and the combination that gives the best evaluation metric score is selected.
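A minimal beam search sketch is given below. It is not the decoder used in this work: the toy model, the `</s>` end token, and the function names are hypothetical, and the length penalty is omitted for brevity. The example is constructed so that greedy decoding (k = 1) misses the most probable sequence.

```python
import math

def beam_search(next_log_probs, k, max_len, eos="</s>"):
    """Keep the k highest-scoring partial hypotheses at every step."""
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished: carry over
                continue
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(t and t[-1] == eos for t, _ in beams):
            break
    return beams

def toy_model(prefix):
    """Hypothetical next-token distribution, given the tokens so far."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy = beam_search(toy_model, k=1, max_len=3)[0]
best = beam_search(toy_model, k=2, max_len=3)[0]
print(greedy[0], math.exp(greedy[1]))
print(best[0], math.exp(best[1]))
```

With k = 1 the search commits to "a" and ends with probability 0.30, whereas k = 2 keeps "b" alive and finds the sequence with probability 0.36.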
Dataset Details. We use the news dataset from the UM-Corpus, which is a large English–
Chinese parallel corpus. It provides two million aligned English–Chinese sentences from eight text
domains, covering several topics and text genres, including Education, Laws, Microblog,
News, Science, Spoken, Subtitles, and Thesis [23]. We train our models with the news subset
of 252K sentences consisting of 10,635K Chinese words and 5672K English words. We split
the samples into 176,943 training pairs, 25,278 validation pairs, and 50,556 test pairs.
Note that our web interface also lets users build a new corpus.
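As a quick arithmetic check of the split sizes, the short sketch below sums the three subsets and computes their shares. The roughly 70/10/20 proportion is our reading of the numbers and is not stated explicitly in the text.

```python
splits = {"train": 176_943, "validation": 25_278, "test": 50_556}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(total)
for name, frac in fractions.items():
    print(f"{name}: {frac:.3f}")
```

The total of 252,777 pairs agrees with the 252K sentences of the news subset.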
the network parameters. The number of batches accumulated in each update is called the
accumulation step. Figure 7 shows the training process on these two GPUs, respectively.
Each curve describes the trend of the loss (on the training set and validation set) or the BLEU
value on the validation set. We see that the training process on both GPUs is basically
consistent. As the number of epochs increases, the training loss continues to decrease, whereas
the loss and BLEU values on the validation set stabilize about halfway through training.
Too many training iterations would lead to overfitting, which has a
negative impact on the translation quality. Thus, we run the validation process every
five iterations to avoid overfitting.
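The gradient accumulation idea mentioned above can be checked on a toy problem. The sketch below uses a hypothetical one-parameter linear model with a mean squared error loss and shows that averaging the gradients of several small batches gives the same gradient as one large batch; it is an illustration under these assumptions, not the training code of this work.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size):
    """Average the gradients of equally sized micro-batches before one update."""
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    steps = len(micro_batches)  # the accumulation step
    total = 0.0
    for mb in micro_batches:
        total += grad_mse(w, mb) / steps  # scale so that the sum is a mean
    return total

# 32 hypothetical samples generated from y = 3x + 1.
data = [(float(x), 3.0 * x + 1.0) for x in range(1, 33)]
full = grad_mse(0.5, data)              # one batch of 32
accum = accumulated_grad(0.5, data, 8)  # 4 micro-batches of 8
print(full, accum)
```

Because the two gradients coincide, a GPU with limited memory can use a small batch size together with a larger accumulation step to reproduce the update of a large batch.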
[Plot omitted: train loss and validation loss (left axis, loss) and validation BLEU (right axis, BLEU) versus epochs (0 to 40), one panel per GPU.]
Figure 7. The training process on RTX 2080Ti (a) and Titan RTX (b).
We compare the test performance of the trained models on two GPUs in Table 3.
The batch_size (bs) and the accumulation step (step) used in the gradient accumulation
method for each GPU are indicated in the table. The choice of GPU device has essentially
no impact on the training result. The gradient accumulation method lets us
achieve the same performance with small batches as with large batches.
Table 3. The best Dev BLEU and the Test Loss and BLEU on two GPUs.
[Plot omitted: test BLEU (left axis) and relative time (right axis) versus beam size (1 to 6).]
Figure 8. The test BLEU and relative test time (relative to beam size = 1).
Table 4. The test BLEU and test time based on different beam sizes.
Beam Size   1      2      3      4      5      6
BLEU        26.06  26.71  26.88  26.92  26.94  26.96
Time (s)    1700   2497   2587   3070   4384   5308
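The trade-off in Table 4 can be made explicit by normalizing the test time to beam size 1 and computing the BLEU gain over beam size 1. The sketch below uses only the values from the table; the variable names are ours.

```python
beam_sizes = [1, 2, 3, 4, 5, 6]
bleu = [26.06, 26.71, 26.88, 26.92, 26.94, 26.96]
time_s = [1700, 2497, 2587, 3070, 4384, 5308]

relative_time = [t / time_s[0] for t in time_s]
bleu_gain = [b - bleu[0] for b in bleu]

for k, rt, gain in zip(beam_sizes, relative_time, bleu_gain):
    print(f"beam={k}: relative time {rt:.2f}, BLEU gain {gain:+.2f}")
```

Going from a beam size of 1 to 6 improves BLEU by 0.90 points while the test time grows by a factor of about 3.1, and most of the gain is already obtained at a beam size of 3.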
[Plot omitted: training time (min) versus batch size (4, 8, 16, 32).]
Figure 9. The training time (minutes) per epoch on Titan RTX based on different batch sizes.
Meanwhile, we compare the training speed of the CPU and GPUs in Figure 10. Due to
the limited memory of the RTX 2080Ti and RTX 2060 Super, we use batch
sizes of 8 and 16 on them, respectively, and a batch size of 32 on the Titan
RTX. The training time per epoch is 480 min, 250 min, 170 min, 22 min, 18 min, and 11 min
for the Intel i7-7500U CPU (Laptop), Intel i7-7700K CPU (Desktop), Intel Xeon Platinum
9242 CPU (Xeon server), Titan RTX, RTX 2080Ti (RTX2080), and RTX 2060 Super (RTX2060).
For the Xeon CPU server, even when adopting a batch size of 96, the training time is around
15× as long as that on the Titan RTX GPU. That is, the GPU has a clear advantage in
training neural networks.
[Figure 10 residue. Recoverable axis label: "Training time (min)", ticks 100 to 500.]
Figure 11. The translation cases of using Baidu translation system and our model.
4.4. Discussion
Implementing a Transformer-based translation system from scratch is not
new. However, we believe that our translation system stands out and can be applied in
several scenarios. At present, large pretrained models have achieved promising results and
have been widely accepted. Although each increase in model capacity has brought significant
performance improvements in downstream NLP tasks, training such models requires large-scale
specialized computing hardware such as Google’s TPUs. These computing clusters are typically
unaffordable for small/medium-sized enterprises. Our translation system is portable
across laptop CPUs, desktop CPUs, and server CPUs with one or multiple GPUs. Such
platforms are typically affordable for small/medium-sized enterprises, and our translation
system can be used as a research infrastructure for such companies.
On the other hand, the large pretrained models are too complex and too large to
interpret easily: we know that the models perform well, but we do not
know why. They behave like a “black box” and are particularly unsuitable for
teaching purposes. Instead, our translation system can be used as a teaching demonstration
tool for students majoring in translation. In particular, we have provided a web interface to
manage the corpus, model training, and model prediction to ease the use of our translation
system. For instance, our system provides research professors with a web interface to
contribute their translation expertise and build a new corpus.
5. Conclusions
In this work, we have implemented a deep learning machine translation system based
on a news corpus. The deep learning algorithm takes in English text as input and uses an
encoder-decoder model with an attention mechanism based on Google’s Transformer to
translate the text to Chinese output. The model was trained using a simple self-designed
entropy loss function and an Adam optimizer on paired English and Chinese text sentences
from the news area of the UM-Corpus. We train the model on high-end GPUs with a
parallel approach. During training time, we not only track loss over training epochs but
also measure the quality of our model’s translations using the BLEU score. The experimental
results on the UM-Corpus show that our trained model can achieve a maximum BLEU score
of 29.2. We can further improve this score by tuning other hyperparameters and increasing
the complexity of our model, as well as by training on a larger subset of the data to avoid
biased results. As a case study, we compare the performance of our model to that of Baidu’s
and show that our model can compete with the production-level translation system.
For future work, we plan to train our models with large-scale GPU-based clusters. We
also want to incorporate language features into the model to improve its translation quality.
In addition, we will use more bilingual corpora to improve translation quality.
Author Contributions: Conceptualization, L.Z. and J.F.; methodology, L.Z., J.F., and W.G.; validation,
W.G. and J.F.; writing—original draft preparation, L.Z. and W.G.; writing—review and editing, L.Z.
and J.F. All authors have read and agreed to the published version of the manuscript.
Funding: This work was partially funded by the National Natural Science Foundation of China
under Grant agreement 61972408.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data collected during this study may be obtained by contacting the
corresponding author at [email protected].
Acknowledgments: We thank the anonymous reviewers for their constructive comments and feedback.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; Kingsbury,
B. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE
Signal Process. Mag. 2012, 29, 82–97. [CrossRef]
2. Dahl, G.E.; Yu, D.; Deng, L.; Acero, A. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech
Recognition. IEEE Trans. Speech Audio Process. 2012, 20, 30–42. [CrossRef]
3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf.
Process. Syst. 2012, 25, 1097–1105.
4. Ciresan, D.C.; Meier, U.; Schmidhuber, J. Multi-column deep neural networks for image classification. In Proceedings of the 2012
IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE Computer Society:
Los Alamitos, CA, USA, 2012; pp. 3642–3649.
5. Le, Q.V.; Ranzato, M.; Monga, R.; Devin, M.; Corrado, G.; Chen, K.; Dean, J.; Ng, A.Y. Building high-level features using large
scale unsupervised learning. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh,
Scotland, UK, 26 June–1 July 2012.
6. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998,
86, 2278–2324. [CrossRef]
7. Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. GShard: Scaling Giant Models
with Conditional Computation and Automatic Sharding. In Proceedings of the 9th International Conference on Learning
Representations, ICLR 2021, Virtual, Austria, 3–7 May 2021.
8. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; (Long
and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 4171–4186. [CrossRef]
9. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al.
Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33, Proceedings of the Annual
Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M.,
Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020.
10. Forcada, M.L.; Ginestí-Rosell, M.; Nordfalk, J.; O’Regan, J.; Ortiz-Rojas, S.; Pérez-Ortiz, J.A.; Sánchez-Martínez, F.; Ramírez-
Sánchez, G.; Tyers, F.M. Apertium: A free/open-source platform for rule-based machine translation. Mach. Transl. 2011,
25, 127–144. [CrossRef]
11. Koehn, P.; Och, F.J.; Marcu, D. Statistical Phrase-Based Translation. In Proceedings of the Human Language Technology
Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003, Edmonton, AB,
Canada, 27 May–1 June 2003; Hearst, M.A., Ostendorf, M., Eds.; The Association for Computational Linguistics: Stroudsburg, PA,
USA, 2003.
12. Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder-Decoder
Approaches. In Proceedings of the SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical
Translation, Doha, Qatar, 25 October 2014; Wu, D., Carpuat, M., Carreras, X., Vecchi, E.M., Eds.; Association for Computational
Linguistics: Stroudsburg, PA, USA, 2014; pp. 103–111.
13. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the
3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
14. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. In Proceedings
of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017; Volume 70,
pp. 1243–1252.
15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In
Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach,
CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R.,
Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008.
16. Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training, 2018. Available online:
https://gregraiz.com/wp-content/uploads/2020/07/language_understanding_paper.pdf (accessed on 11 May 2021).
17. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI
Blog 2019, 1, 9.
18. Rosset, C.; Xiong, C.; Phan, M.; Song, X.; Bennett, P.N.; Tiwary, S. Knowledge-Aware Language Model Pretraining. arXiv 2020,
arXiv:2007.00655.
19. Norrie, T.; Patil, N.; Yoon, D.H.; Kurian, G.; Li, S.; Laudon, J.; Young, C.; Jouppi, N.P.; Patterson, D.A. Google’s Training Chips
Revealed: TPUv2 and TPUv3. In Proceedings of the IEEE Hot Chips 32 Symposium, HCS 2020, Palo Alto, CA, USA, 16–18
August 2020; IEEE Computer Society: Los Alamitos, CA, USA, 2020; pp. 1–70. [CrossRef]
20. Ling, W.; Marujo, L.; Dyer, C.; Black, A.W.; Trancoso, I. Crowdsourcing High-Quality Parallel Data Extraction from Twitter.
In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, USA, 26–27 June 2014; pp. 426–436.
[CrossRef]
21. Ling, W.; Marujo, L.; Dyer, C.; Black, A.W.; Trancoso, I. Mining Parallel Corpora from Sina Weibo and Twitter. Comput. Linguist.
2016, 42, 307–343. [CrossRef]
22. Ling, W.; Xiang, G.; Dyer, C.; Black, A.W.; Trancoso, I. Microblogs as Parallel Corpora. In Proceedings of the 51st Annual Meeting
of the Association for Computational Linguistics, ACL 2013, Sofia, Bulgaria, 4–9 August 2013; Long Papers; The Association for
Computer Linguistics: Stroudsburg, PA, USA, 2013; Volume 1, pp. 176–186.
23. Tian, L.; Wong, D.F.; Chao, L.S.; Quaresma, P.; Oliveira, F.; Yi, L. UM-Corpus: A Large English-Chinese Parallel Corpus for
Statistical Machine Translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation,
LREC 2014, Reykjavik, Iceland, 26–31 May 2014; Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J.,
Moreno, A., Odijk, J., Piperidis, S., Eds.; European Language Resources Association (ELRA): Luxemburg, 2014; pp. 1837–1842.
24. Kalchbrenner, N.; Blunsom, P. Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical
Methods in Natural Language Processing, EMNLP 2013, Grand Hyatt Seattle, Seattle, WA, USA, 18–21 October 2013; pp. 1700–1709.
25. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016;
Volume 1, pp. 1715–1725.
26. Schuster, M.; Nakajima, K. Japanese and Korean voice search. In Proceedings of the 2012 IEEE International Conference on
Acoustics, Speech and Signal Processing, ICASSP 2012, Kyoto, Japan, 25–30 March 2012; IEEE Computer Society: Los Alamitos,
CA, USA, 2012; pp. 5149–5152.
27. Kudo, T.; Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural
Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018:
System Demonstrations, Brussels, Belgium, 31 October–4 November 2018; Blanco, E., Lu, W., Eds.; Association for Computational
Linguistics: Stroudsburg, PA, USA, 2018; pp. 66–71.
28. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
29. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s
Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144.
30. Hassan, H.; Aue, A.; Chen, C.; Chowdhary, V.; Clark, J.; Federmann, C.; Huang, X.; Junczys-Dowmunt, M.; Lewis, W.; Li, M.; et al.
Achieving Human Parity on Automatic Chinese to English News Translation. arXiv 2018, arXiv:1803.05567.
31. Lin, X.; Liu, J.; Zhang, J.; Lim, S.J. A Novel Beam Search to Improve Neural Machine Translation for English-Chinese. Comput.
Mater. Contin. 2020, 65, 387–404. [CrossRef]
32. Zhou, L.; Zhang, J.; Kang, X.; Zong, C. Deep Neural Network-based Machine Translation System Combination. ACM Trans.
Asian Low Resour. Lang. Inf. Process. 2020, 19, 65:1–65:19. [CrossRef]
33. Xiong, H.; He, Z.; Hu, X.; Wu, H. Multi-Channel Encoder for Neural Machine Translation. In Proceedings of the Thirty-Second
AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and
the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, LA, USA, 2–7 February
2018; pp. 4962–4969.
34. Wang, Y.; Cheng, S.; Jiang, L.; Yang, J.; Chen, W.; Li, M.; Shi, L.; Wang, Y.; Yang, H. Sogou Neural Machine Translation Systems for
WMT17. In Proceedings of the Second Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, 7–8 September
2017; pp. 410–415.
35. Wu, S.; Wang, X.; Wang, L.; Liu, F.; Xie, J.; Tu, Z.; Shi, S.; Li, M. Tencent Neural Machine Translation Systems for the WMT20
News Translation Task. In Proceedings of the Fifth Conference on Machine Translation, WMT@EMNLP 2020, Online, 19–20
November 2020; pp. 313–319.
36. Yang, J.; Wu, S.; Zhang, D.; Li, Z.; Zhou, M. Improved Neural Machine Translation with Chinese Phonologic Features. In Natural
Language Processing and Chinese Computing, Proceedings of the 7th CCF International Conference, NLPCC 2018, Hohhot, China, 26–30
August 2018; Zhang, M., Ng, V., Zhao, D., Li, S., Zan, H., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg,
Germany, 2018; Volume 11108, pp. 303–315.
37. Kuang, S.; Han, L. Apply Chinese Radicals Into Neural Machine Translation: Deeper Than Character Level. arXiv 2018,
arXiv:1805.01565.