Amharic Abstractive Text Summarization
Hazem M. Abbas
Department of Computer Engineering
Ain-Shams University
[email protected]
ABSTRACT
Text summarization is the task of condensing long text into just a handful of sentences. Many approaches have been proposed for this task; some of the earliest built statistical models (extractive methods [Paice (1990)], [Kupiec et al. (1995)]) capable of selecting important words and copying them to the output. However, these models lack the ability to paraphrase, as they simply select important words without understanding their context or meaning. This motivates Deep Learning based architectures (abstractive methods [Chopra et al. (2016)], [Nallapati et al. (2016)]), which effectively try to understand the meaning of sentences in order to build meaningful summaries. In this work we discuss one of these novel approaches, which combines curriculum learning with Deep Learning; this model is called Scheduled Sampling [Bengio et al. (2015)]. We apply this work to the Amharic language, one of the most widely spoken African languages, in an effort to enrich the African NLP community with top-notch Deep Learning architectures.
We scraped articles from the following Amharic news websites:
1. http://www.goolgule.com/
2. https://www.ethiopianregistrar.com/amharic
3. https://amharic.ethsat.com
4. https://ecadforum.com/Amharic
5. https://www.zehabesha.com/amharic/
6. https://www.ethiopianregistrar.com/amharic/
7. https://www.ethiopianreporter.com/article/
In total, we scraped over 50k articles, and only kept those with long titles (about 19k articles).
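The crawler itself is site-specific; the following is only a minimal sketch of the collect-and-filter step, where the CSS selectors and the title-length cutoff are illustrative assumptions, not the exact values we used:

```python
# A minimal sketch of the scrape-and-filter step, not our exact crawler.
# The page structure (h1 title, <p> body) and the title-length threshold
# are hypothetical; each news site needs its own parsing rules.
import requests
from bs4 import BeautifulSoup

MIN_TITLE_WORDS = 10  # hypothetical cutoff for what counts as a "long" title


def fetch_article(url):
    """Download one article page and return (title, body) text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    return title, body


def keep(title):
    """Keep only articles whose title is long enough to serve as a summary."""
    return len(title.split()) >= MIN_TITLE_WORDS


# usage: pairs = [fetch_article(u) for u in article_urls]
#        dataset = [(title, body) for title, body in pairs if keep(title)]
```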
Word embeddings have proved to be one of the best ways to represent text for deep models. One of the most widely used English word-embedding models is Word2Vec [Mikolov et al. (2013)], which represents each word as a dense vector that deep models can easily consume. However, no such model had been trained for the Amharic language, which is why we trained our own for this task.
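As a minimal sketch of how such an embedding can be trained (the use of gensim, the corpus file name, and the hyper-parameters below are illustrative assumptions, not necessarily our exact setup):

```python
# Illustrative sketch: train Amharic word embeddings with gensim's Word2Vec.
# The corpus file and hyper-parameters are assumptions for the example.
from gensim.models import Word2Vec

# One whitespace-tokenized Amharic sentence per line (assumed file name).
corpus = [line.split() for line in open("amharic_articles.txt", encoding="utf-8")]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # dimensionality of the word vectors (size= in gensim < 4)
    window=5,          # context window size
    min_count=5,       # ignore very rare words
    workers=4,
)
model.save("amharic_word2vec.model")
vector = model.wv[corpus[0][0]]  # dense vector for the first word in the corpus
```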
In this work, we provide both the scraped news dataset and the trained word embedding as open source, to help enrich the African NLP research community (see https://github.com/theamrzaki/ and https://medium.com/@theamrzaki/).
Since our task is a sequence (time-series) problem, RNN models were the first to be applied to it; however, given the long-range dependencies in natural language, LSTM-based architectures were adopted thanks to their memory structure [Hochreiter & Schmidhuber (1997)]. Our task can be seen as a mapping between an input and an output, but since the two differ in length (long input, short output), seq2seq architectures are used [Nallapati et al. (2016)]. To give the model more human-like summarization abilities, [Bahdanau et al. (2014)] suggested adding an attention mechanism on top of the seq2seq architecture, which helps the model attend to the important words in the input.
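To make the attention step concrete, the following is a minimal PyTorch-style sketch of additive (Bahdanau-style) attention; the tensor shapes and variable names are ours, not those of the original implementation:

```python
# A minimal sketch of additive (Bahdanau-style) attention over encoder states.
import torch
import torch.nn as nn


class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.W_enc = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_dec = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, decoder_state, encoder_outputs):
        # decoder_state:   (batch, hidden)
        # encoder_outputs: (batch, src_len, hidden)
        scores = self.v(torch.tanh(
            self.W_enc(encoder_outputs) + self.W_dec(decoder_state).unsqueeze(1)
        )).squeeze(-1)                              # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)     # attention distribution over source words
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights                     # (batch, hidden), (batch, src_len)
```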
The model discussed above has a well-known limitation in handling unknown, out-of-vocabulary words, as it can only be trained on a fixed-size vocabulary. A solution was proposed by [Nallapati et al. (2016)] and [See et al. (2017)]: a pointer-generator model built on top of the seq2seq architecture that learns when to copy words from the input and when to generate new ones.
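The core of this copy mechanism can be summarized in a few lines: a generation probability p_gen mixes the fixed-vocabulary distribution with the attention (copy) distribution over source words. The sketch below is our paraphrase of that idea, not the exact code of [See et al. (2017)]:

```python
# Sketch of the pointer-generator mixing step, paraphrased from See et al. (2017).
# p_gen decides how much probability mass goes to generating from the fixed
# vocabulary versus copying source words via the attention distribution.
import torch


def final_distribution(vocab_dist, attn_dist, p_gen, src_ids, extended_vocab_size):
    # vocab_dist: (batch, vocab)      softmax over the fixed vocabulary
    # attn_dist:  (batch, src_len)    attention weights over source tokens
    # p_gen:      (batch, 1)          sigmoid "generate vs. copy" switch
    # src_ids:    (batch, src_len)    source token ids in the extended vocabulary
    batch = vocab_dist.size(0)
    dist = torch.zeros(batch, extended_vocab_size)
    dist[:, : vocab_dist.size(1)] = p_gen * vocab_dist
    # Scatter-add the copy probabilities onto the source words' (possibly OOV) ids.
    dist.scatter_add_(1, src_ids, (1.0 - p_gen) * attn_dist)
    return dist
```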
3 SCHEDULED SAMPLING
One of the problems the above seq2seq architecture suffers from comes from the way it is trained: during training the model is supplied with both the long input text and the reference short summary, while at test time it is given only the input text, with no reference. This creates an inconsistency between the training and testing phases, since the model has never been trained to rely on its own predictions; this problem is called exposure bias [Ranzato et al. (2016)].
A solution proposed by [Bengio et al. (2015)] addresses this problem by combining curriculum learning with the deep model. Training starts normally, feeding both the long input text and the reference summary, but once the model becomes mature enough, it is gradually exposed to its own mistakes during training, decreasing its dependency on the reference summary and teaching it to rely on itself. In other words, the learning problem is made more difficult as the model matures, hence curriculum learning.
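In code, scheduled sampling changes only how the decoder's next input token is chosen during training. The following sketch uses a simple linear decay schedule for illustration (the original work also proposes exponential and inverse-sigmoid schedules); decoder_step() is a stand-in for one decoding step of any seq2seq model:

```python
# Sketch of scheduled sampling inside a decoder's training loop (Bengio et al., 2015).
import random


def teacher_forcing_ratio(step, total_steps, floor=0.25):
    """Probability of feeding the ground-truth token; decays as training matures."""
    return max(floor, 1.0 - step / float(total_steps))


def decode_with_scheduled_sampling(decoder_step, reference_tokens, step, total_steps):
    eps = teacher_forcing_ratio(step, total_steps)
    prev_token = reference_tokens[0]              # <start> token
    outputs = []
    for t in range(1, len(reference_tokens)):
        predicted = decoder_step(prev_token)      # model's own prediction at step t
        outputs.append(predicted)
        # Coin flip: feed the reference token (teacher forcing) or the model's own output.
        prev_token = reference_tokens[t] if random.random() < eps else predicted
    return outputs
```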
4 EXPERIMENTS
We applied the scheduled sampling model to the Amharic dataset we built. We used Google Colab as our training environment, as it provides a free GPU and up to 24 GB of RAM. Our model is built on top of the library of [Keneshloo et al. (2019)], which we modified to run on Python 3 and to work with the Amharic dataset. We evaluate our experiments using well-known text summarization metrics, ROUGE [Lin (2004)] and BLEU [Papineni et al. (2002)], which measure the n-gram overlap between the reference summary and the generated one; higher scores indicate more overlap and therefore a better output. We ran our evaluation on 100 test sentences; the scores were BLEU = 0.3311, ROUGE-1 F = 20.51, ROUGE-2 F = 8.59, and ROUGE-L F = 14.76.
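For reference, these metrics can be computed with off-the-shelf packages; the snippet below uses the rouge-score and nltk packages, which are our choice for illustration and not necessarily the exact evaluation code behind the numbers above:

```python
# Illustrative computation of ROUGE and BLEU with common packages
# (rouge-score and nltk); placeholder tokens stand in for real summaries.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "token1 token2 token3 token4"   # gold summary (whitespace-tokenized)
generated = "token1 token2 token5"          # model output

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
rouge = scorer.score(reference, generated)  # precision, recall, f-measure per metric

bleu = sentence_bleu(
    [reference.split()], generated.split(),
    smoothing_function=SmoothingFunction().method1,
)
print(rouge["rouge1"].fmeasure, rouge["rouge2"].fmeasure,
      rouge["rougeL"].fmeasure, bleu)
```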
For comparison, running scheduled sampling on a well-known English dataset (CNN/DailyMail [Hermann et al. (2015)]) achieves a ROUGE-1 of 39.53 and a ROUGE-2 of 17.28. This discrepancy from the English counterpart comes from the fact that the English dataset is huge (about 200k articles with long summaries) compared to our scraped Amharic dataset (19k articles with short summaries); collecting an English dataset is much easier than collecting an African one, given the huge amount of available English resources.
5 CONCLUSION
By building a custom word-embedding model for a specific African language, we are able to apply any deep model that works on English to that African language, as our work demonstrates.
In future work, we intend to experiment with other advanced architectures that have recently proven extremely effective on seq2seq problems. One of these is BERT [Devlin et al. (2019)], which stands for Bidirectional Encoder Representations from Transformers; it builds on the Transformer architecture and replaces recurrent cells entirely with (self-)attention. One efficient way to use such a model is to start from an already pre-trained checkpoint, fine-tune it on the English CNN/DailyMail dataset, and then apply cross-lingual transfer to the Amharic dataset; we believe this may yield better summaries in spite of the relatively small Amharic dataset.
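A speculative sketch of this cross-lingual transfer idea, using the HuggingFace transformers library to warm-start a Transformer encoder-decoder from a multilingual BERT checkpoint; the model names and decoding settings are assumptions, and the model would still need fine-tuning on both datasets before producing useful summaries:

```python
# Speculative sketch: warm-start a Transformer encoder-decoder from a
# pretrained multilingual BERT checkpoint, then fine-tune it on the
# (small) Amharic summarization dataset. Not what we ran in this paper.
from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased", "bert-base-multilingual-cased"
)
# Decoding configuration required for generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

article = "..."  # placeholder for an Amharic news article
inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
summary_ids = model.generate(inputs.input_ids, max_length=30, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```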
We hope that this work helps pave the way for applying novel Deep Learning techniques to African languages, and that it serves as a guideline for applying deep models that can be further used in other NLP tasks.
REFERENCES
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pp. 1171–1179, Cambridge, MA, USA, 2015. MIT Press.
Sumit Chopra, Michael Auli, and Alexander M. Rush. Abstractive sentence summarization with
attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.
93–98, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.
18653/v1/N16-1012. URL https://www.aclweb.org/anthology/N16-1012.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June
2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https:
//www.aclweb.org/anthology/N19-1423.
Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa
Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In C. Cortes, N. D.
Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information
Processing Systems 28, pp. 1693–1701. Curran Associates, Inc., 2015. URL http://papers.
nips.cc/paper/5945-teaching-machines-to-read-and-comprehend.pdf.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL https://doi.org/10.1162/neco.1997.9.8.1735.
Yaser Keneshloo, Tian Shi, Naren Ramakrishnan, and Chandan Reddy. Deep reinforcement learning
for sequence-to-sequence models. IEEE Transactions on Neural Networks and Learning Systems,
PP:1–21, 08 2019. doi: 10.1109/TNNLS.2019.2929141.
Julian Kupiec, Jan O. Pedersen, and Francine Chen. A trainable document summarizer. In SIGIR,
1995.
Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization
Branches Out, pp. 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
URL https://www.aclweb.org/anthology/W04-1013.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. Abstractive
text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th
SIGNLL Conference on Computational Natural Language Learning, pp. 280–290, Berlin, Germany,
August 2016. Association for Computational Linguistics. doi: 10.18653/v1/K16-1028. URL
https://www.aclweb.org/anthology/K16-1028.
Chris D Paice. Constructing literature abstracts by computer: techniques and prospects. Information
Processing & Management, 26(1):171–186, 1990.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic
evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics, pp. 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association
for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://www.aclweb.
org/anthology/P02-1040.
Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training
with recurrent neural networks. In 4th International Conference on Learning Representations,
ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL
http://arxiv.org/abs/1511.06732.
Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1099.