
Review
BERT Models for Arabic Text Classification: A
Systematic Review
Ali Saleh Alammary

College of Computing and Informatics, Saudi Electronic University, Jeddah 393453, Saudi Arabia;
[email protected]

Abstract: Bidirectional Encoder Representations from Transformers (BERT) has gained increasing
attention from researchers and practitioners as it has proven to be an invaluable technique in natural
language processing. This is mainly due to its unique features, including its ability to predict words
conditioned on both the left and the right context, and its ability to be pretrained using the plain text
that is abundantly available on the web. As BERT gained more interest, more BERT models
were introduced to support different languages, including Arabic. The current state of knowledge
and practice in applying BERT models to Arabic text classification is limited. In an attempt to begin
remedying this gap, this review synthesizes the different Arabic BERT models that have been applied
to text classification. It investigates the differences between them and compares their performance. It
also examines how effective they are compared to the original English BERT models. It concludes by
offering insight into aspects that need further improvements and future work.

Keywords: BERT; Arabic text classification; language representation; sentiment analysis; natural
language processing

Citation: Alammary, A.S. BERT Models for Arabic Text Classification: A Systematic Review. Appl. Sci. 2022, 12, 5720. https://doi.org/10.3390/app12115720

Academic Editor: Rafael Valencia-Garcia

Received: 10 May 2022; Accepted: 31 May 2022; Published: 4 June 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Text classification is a machine-learning task in which a document is assigned to
one or more predefined categories according to its content. It is a fundamental task in
natural language processing with diverse applications such as sentiment analysis, email
routing, offensive language detection, spam filtering, and language identification [1]. Of
these applications, sentiment analysis has attracted the most attention. The objective of
sentiment analysis is to identify the polarity of text content, which can take the form of
a binary positive/negative classification, or a more granular set of categories, such as a
five-point satisfaction scale [2]. Despite the progress that has been achieved in improving
the performance of text classification, there is much room for improvement, especially for
the Arabic language.

Bidirectional Encoder Representations from Transformers (BERT) is a language representation
model that was introduced in 2018 by Jacob Devlin and his colleagues from
Google [3]. Since its introduction, it has become a ubiquitous baseline in natural language
processing research [4]. Unlike other language representation models that capture the
context unidirectionally, BERT was designed as a bidirectional model that can predict
words conditioned on both the left and right context [5]. BERT was also built as an unsupervised
model that can be trained using the plain text that is abundantly available
on the web in most languages. This combination of features allows BERT to demonstrate
exceptional performance in various natural language processing tasks, including text
classification [6].

There are two main approaches for using BERT: feature extraction and finetuning.
In feature extraction, the architecture of the BERT model is preserved, i.e., the model's
parameters are 'frozen'. Features are extracted from the pretrained BERT model and then
fed into a classifier model to solve a given task. In finetuning, the model's parameters are
finetuned by adding extra layers to the original BERT architecture. These new layers are
used to train the model on the downstream tasks [3].
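To make the two approaches concrete, the sketch below shows how each could be set up with the Hugging Face Transformers library; the checkpoint name ("aubmindlab/bert-base-arabertv2") and the three-class setup are illustrative assumptions, not details taken from the reviewed studies.

# A minimal sketch of the two BERT adoption approaches (illustrative).
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")
inputs = tokenizer("مثال لنص عربي", return_tensors="pt")

# 1) Feature extraction: freeze BERT and use its [CLS] embedding as input
#    features for a separate classifier (e.g., an SVM or logistic regression).
bert = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv2")
for p in bert.parameters():
    p.requires_grad = False  # the model's parameters are 'frozen'
with torch.no_grad():
    features = bert(**inputs).last_hidden_state[:, 0, :]  # [CLS] vector

# 2) Finetuning: add a classification head on top of BERT and train all
#    parameters (pretrained and new) on the labeled downstream task.
classifier = AutoModelForSequenceClassification.from_pretrained(
    "aubmindlab/bert-base-arabertv2", num_labels=3)  # e.g., pos/neg/neutral
logits = classifier(**inputs).logits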
In the original paper of Devlin, Chang, Lee and Toutanova [3], two BERT models were
introduced: BERT-large and BERT-base. They are both in English. They were pretrained
from extremely large corpora extracted from the internet. Therefore, they have heavy
computing requirements and a high memory footprint. As BERT gained more interest
from researchers and practitioners, more BERT models were introduced. The new mod-
els attempt to address some of the shortcomings of the original models, improving the
performance [7] or improving the inference speed [8]. There were also models that were
developed to support languages other than English [9,10].
Several BERT models were pretrained to support the Arabic language. For ex-
ample, Devlin and his team developed a multilingual model that supports more than
100 languages, including Arabic. Antoun, et al. [11] developed an Arabic model that they
called Arabert. The model was pretrained on around 24 gigabytes of text. Similarly,
Abdul-Mageed, et al. [12] used 1B tweets to train an Arabic BERT model that they named
MARBERT. While these models have been used for Arabic text classification, it is not
obvious which model is most suitable for the classification task. It is also not clear if
one of them is more effective than the other and whether the process that has been used
to pretrain each model has affected its performance. This systematic review provides a
detailed examination of the different Arabic BERT models that have been used for text
classification. The aim is to provide guidance for researchers and practitioners by critically
appraising and summarizing existing research. To the best of our knowledge, this is the
first systematic review study on this subject.

2. Research Methodology
A systematic review is a useful method for identifying, aggregating, and synthesizing
existing research that is relevant to a research topic. It provides in-depth analysis to answer
specific research questions with the aim of synthesizing evidence [13]. This systematic
review follows the PRISMA guidelines [14]. What follows is a detailed description of the
steps that have been performed to conduct the study.

2.1. Definition of Research Questions


The objective of this study is to investigate BERT models that have been used for
Arabic text classification. Based on this objective, the following questions were defined:
1. What BERT models have been used for Arabic text classification, and how do they differ?
2. How effective are they in classifying Arabic text?
3. How effective are they compared to the original English BERT models?

2.2. Search Strategy


An electronic search was conducted on six scientific databases: IEEE Xplore, ScienceDirect,
Springer Link, Taylor & Francis Online, ACM digital library, and ProQuest journals.
These databases were chosen considering their coverage and use in the domain of computer
science. Google Scholar was also searched to increase publication coverage and to include
gray literature, which could make important contributions to a systematic review [15]. The
date of the last search was December 2021.
The search string was constructed following Population, Intervention, Comparison,
and Outcomes (PICO), as suggested by Kitchenham and Charters [16].
• Population: BERT, Arabic.
• Intervention: text classification, sentiment analysis.
• Comparison and Outcomes: these two dimensions were omitted, as the research ques-
tions do not warrant a restriction of the results to a particular outcome or comparison.
Table 1 shows the final search strings that were used to search each database.

Table 1. Search string for each database.

Database | Search String
IEEE Xplore | ("BERT" AND "Arabic" AND ("text classification" OR "sentiment analysis"))
ScienceDirect | ('BERT' AND 'Arabic' AND ('text classification' OR 'sentiment analysis'))
Springer Link | "BERT" AND "Arabic" AND ("text classification" OR "sentiment analysis")
Taylor & Francis Online | [All: "bert"] AND [All: "arabic"] AND [[All: "text classification"] OR [All: "sentiment analysis"]]
ACM digital library | [All: "bert"] AND [All: "arabic"] AND [[All: "text classification"] OR [All: "sentiment analysis"]]
ProQuest journals | "BERT" AND "Arabic" AND ("text classification" OR "sentiment analysis")
Google Scholar | "bert" AND "arabic" AND ("text classification" OR "sentiment analysis")

2.3. Selection of Studies


After articles were retrieved from the online databases, they were entered into a
Reference Manager System, i.e., EndNote, and duplicates were removed. Then, the titles
and abstracts of the remaining articles were screened using predefined selection criteria. An
article was deemed suitable for inclusion in this systematic review if it met all the following
inclusion criteria:
1. It uses a BERT model for the Arabic text classification task.
2. It evaluates the performance of the utilized BERT model.
3. The dataset that has been used to evaluate the model is well described.
4. It is written in English or Arabic.
An article was excluded if:
1. The full text of the article is not available online.
2. The article is in the form of a poster, tutorial, abstract, or presentation.
3. It is not in English or Arabic.
4. It does not evaluate the performance of the utilized BERT model.
5. The dataset that was used to evaluate the model is not described.
Where a decision about the inclusion of an article was in doubt, the full text was read
to make a final judgment. The whole selection process was reviewed by two indepen-
dent researchers. The results were compared and discrepancies were discussed to reach
a consensus.

2.4. Quality Assessment


The quality assessment was performed by using a quality appraisal tool adapted from
Zhou, et al. [17]. The tool has criteria for assessing the reporting, rigor, credibility, and
relevance of the included articles (See Table 2). In the assessment of each article, a score of
zero was given if the criterion was not met, one if the criterion was partially met, and two if
the criterion was fully met. The total quality-assessment score of an article was calculated
by summing the individual criterion scores. The article quality was classified as “good” if
the total score was ≥12, “adequate” if the total score was ≥8, and “poor” if the total score
was <8.
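In code, this scoring scheme amounts to a simple sum and threshold check; the sketch below is illustrative (the function name and example scores are not from the review).

# Minimal sketch of the quality-scoring scheme described above:
# eight criterion scores in {0, 1, 2} are summed and thresholded.
def classify_article_quality(criterion_scores):
    total = sum(criterion_scores)
    if total >= 12:
        return "good"
    if total >= 8:
        return "adequate"
    return "poor"

print(classify_article_quality([2, 2, 1, 2, 2, 1, 2, 1]))  # "good" (total = 13)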

Table 2. Quality appraisal tool.

Dimension | Criterion
Reporting | Is there a clear definition of the research questions, aims, and objectives?
Reporting | Is the research context adequately described?
Rigor | Are the research design, method, and measures clearly described?
Rigor | Is the research design appropriate to answer the research question?
Credibility | Is there a clear description of the study findings?
Credibility | Has adequate data been presented to support the study findings?
Relevance | Does the study contribute to research or practice?
Relevance | Is the conclusion supported by the results?

Due to the lack of consensus among scholars in regard to the role of quality assessment
in systematic review studies, no article was excluded from this review based on its quality
score [15,18]. The quality assessment helped us to understand the weaknesses and strengths
of the included articles. As demonstrated by Morgan, et al. [19] and Alammary [20], articles
with poor quality would contribute less to the synthesis. Similar to the selection of studies,
the quality assessment was performed by the author and reviewed by two independent
researchers. The results were discussed until an agreement was reached.

2.5. Data Extraction


A predefined data extraction form was used to extract data from the included articles.
The form included the following variables: title, first author, country, type (e.g., confer-
ence/workshop/journal), publishing year, BERT model/s, evaluation dataset, adoption
approach (i.e., feature extraction or finetuning), models that the BERT model was compared
to, the performance of BERT model/s, and remark about the quality of the article. The
extraction form was designed specifically for this systematic review and was piloted on a
sample of five articles.

2.6. Synthesis of Results


After extracting data from the included studies, the extracted data were synthesized
using a narrative format. Five themes derived from the research questions were used
in the synthesis. Those themes are (1) Arabic BERT models, (2) evaluation
dataset, (3) adoption approach, (4) models that the BERT model was compared to, and
(5) performance of BERT models.

3. Results
As can be seen in Figure 1, a total of 594 articles were found through the search in the
six electronic databases. An additional 1780 articles were retrieved from Google Scholar.
After adding them to EndNote, 591 were excluded for duplication. The screening of the
remaining articles resulted in exclusion of 1686 articles. The vast majority of them did not
use the BERT model for Arabic text classification. During the full-text reading, another
49 articles were excluded. Again, the reason for excluding the majority of these articles is
that they did not use the BERT model for Arabic text classification. Eventually, 48 articles
were deemed suitable for inclusion in this review and were appraised using the quality
appraisal tool.
Figure 1. The systematic review process.

3.1. Included Studies Overview

Figure 2 shows the distribution of articles by publication year. As can be seen, there
has been a very sharp increase in the number of articles that utilized BERT for Arabic text
classification in the last three years. While there were only two articles in 2019, the number
increased to sixteen in 2020 and then to thirty in 2021.

Figure 2. Articles by publication year.

The distribution of articles by publication type is presented in Figure 3. The vast
majority of articles included in this study (23 articles, 48%) are conference proceeding
articles. Thirteen articles (27%) are workshop proceeding articles. There are also ten journal
articles (21%) and two theses (4%).

Figure 3. Articles by publication type.

As one would expect, the majority of articles (31 articles, 65%) came from the Middle
East and North Africa (MENA), where Arabic is the official language. Qatar and Saudi
Arabia accounted for around half of the MENA articles. Europe contributed nine articles
(19%), while North America and Asia-Pacific contributed four articles each (see Figure 4).

Figure 4. Articles by region.

3.2. Quality of the Included Studies

Table 3 shows a summary of the quality assessment of the included studies. The overall
quality of the 48 studies was good. Eighteen studies (38%) fully met the eight criteria,
while twenty-one (44%) partially met these criteria. Only three studies (6%) had major
quality issues, i.e., met less than four criteria. The most unmet criteria were the ones
related to the presented data and conclusion. Nine studies (19%) were found not to present
adequate data to support the study findings. Eight studies (17%) had conclusions that
were not supported by the results. As discussed in the previous section, no study was
excluded from this review on the basis of its quality.

Table 3. Quality assessment results.

Criterion | Fully Met | Partially Met | Not Met
Is there a clear definition of the research questions, aims, and objectives? | 19 | 23 | 6
Is the research context adequately described? | 32 | 13 | 3
Are the research design, method, and measures clearly described? | 27 | 16 | 5
Is the research design appropriate to answer the research question? | 33 | 10 | 5
Is there a clear description of the study findings? | 41 | 7 | 0
Has adequate data been presented to support the study findings? | 18 | 21 | 9
Does the study contribute to research or practice? | 17 | 29 | 2
Is the conclusion supported by the results? | 27 | 13 | 8

3.3. BERT Models


As can be seen in Table 4, nine different BERT models were used in the reviewed
articles. Some articles used one model only, while others used more than one. The most
widely used model was the Multilingual BERT of Devlin, Chang, Lee and Toutanova [3]
which was utilized in 65% of the articles. It was followed by a model called AraBERT
which was designed specifically for the Arabic language and was used by around 58% of
the reviewed articles. The other models were used much less frequently, with MARBERT
and ArabicBERT used by 19% and 11% of the articles, respectively, and the remaining six
models used by less than 10% of the articles each. In the vast majority of the articles, BERT
models were utilized by using the finetuning approach. Only five articles (10%) used BERT
models as a feature-extraction approach.

Table 4. Used BERT models.

Model | Articles | Feature Extraction | Fine-Tuning
Multilingual BERT | 31 articles [21,22,10,23,24,25,26,27,28,29,6,30,31,32,33,34,35,36,37,11,38,39,40,41,42,43,44,45,46,47,48] | 2 | 29
AraBERT | 28 articles [49,50,51,9,52,28,53,54,55,30,31,33,56,34,35,36,57,37,11,58,39,40,59,43,60,46,61,48] | 3 | 26
MARBERT | 9 articles [50,62,27,53,34,43,46,63,64] | 0 | 9
ArabicBERT | 5 articles [34,58,38,40,43] | 1 | 5
ARBERT | 4 articles [53,34,34,43] | 0 | 4
XLM-RoBERTa | 4 articles [52,25,34,48] | 0 | 4
QARiB | 4 articles [53,34,43,40] | 0 | 4
GigaBERT | 3 articles [49,34,43] | 0 | 3
Arabic ALBERT | 1 article [34] | 0 | 1

3.4. Models Evaluation


The nine models were evaluated by comparing their performances to other BERT and
non-BERT models. A total of 66 comparisons were conducted, as sixteen studies had more
than one comparison each. As can be seen in Table 5, the finetuned Multilingual BERT
was compared to 66 models. It outperformed 27 models (41%) and was outperformed by
39 models (59%). In comparison to the other Arabic BERT models, it only outperformed
XLM-RoBERTa and was outperformed by the other eight models. The model scored the
lowest F1 Score of all the other models (0.118). When using Multilingual BERT as a feature
extraction method, the performance was better. The constructed models were able to
outperform nineteen models, including the finetuned Multilingual BERT (83%). It was only
outperformed by four models (17%).
The finetuned AraBERT achieved better performance than Multilingual BERT. Of
the 50 models that it was compared against, it outperformed 42 models (84%) and was
only outperformed by eight models (16%). When compared to other BERT models, it
outperformed all eight other models but was also outperformed by four of these models in
some of the articles. It also scored the highest F1 score (0.99) of all the models.
AraBERT also showed excellent performance when it was used as a feature-extraction
method. It outperformed fifteen of the models that it was compared against and was
only outperformed by the finetuned AraBERT. MARBERT was one of the Arabic BERT
models that achieved excellent performance. It outperformed 21 models (75%) and was
outperformed by seven models (25%).

Table 5. Performance of the BERT models.

Multilingual BERT (Fine-tuned)
Outperformed: XLM-RoBERTa, cseBERT, SVM, Random Forest Classifier, NB, Word2Vec, CNN, LSTM, LR, Majority Baseline Class, Mazajak + Word Embedding, CNN + Word Embeddings, hULMonA, Capsule Network, BiLSTM, AraVec, fastText + SVM, Bi-LSTM, SVM + TF-IDF, Doc2vec + MLP, LSA + SVM, Doc2vec + SVM, TFIDF + SVM, Word2vec + SVM, AraVec + CBOW, TF-IDF, SVM + ValenceList
Outperformed by: ArabicBERT, ArabicBERT + CNN, ARBERT, Arabic ALBERT, GigaBERT, QARiB, XLM-RoBERTa, MARBERT, AraBERT, Sentiment Embeddings, Generic Word Embeddings, Surface Features, CNN, LSTM, LSTM + Word Embedding, CNN + Word Embedding, Multi-Layer Perceptron (MLP), Flair, FastText, GRU, CNN + GRU, Majority Class, SVM, AraELECTRA, AraGPT2, Mazajak + SVM, AraVec + SVM, Voting Classifier, AraVec, CNN-Text, Semi-Supervised FastText, Gated Recurrent Unit (GRU), Word2Vec, fastText, SVM + N-gram + Mazajak, C-LSTM, SVM + N-gram, Multilingual BERT + SVM, ARBERT + SVM
Highest F1 score: 0.973; lowest F1 score: 0.118; average F1 score: 0.678
Multilingual BERT (Feature extraction)
Outperformed: FastText-SkipGram + CNN, FastText-SkipGram + BiLSTM, Unigram + NB SVM LR, Word-ngrams + NB SVM LR, FastText + NB SVM LR, Random Embedding + CNN LSTM GRU, AraVec(CBOW) + CNN LSTM GRU, AraVec + CNN LSTM GRU, AraVec-100-CBOW, AraVec-300-SG, BiLSTM, LSTM, Doc2vec_MLP, LSA_SVM, Doc2vec_SVM, TFIDF_SVM, Word2vec_SVM, Multilingual BERT (Fine-tuned)
Outperformed by: AraBERT + CNN, FastText, Word2Vec, AraBERT (Fine-tuned)
Highest F1 score: 0.95; lowest F1 score: 0.589; average F1 score: 0.798
AraBERT (Fine-tuned)
Outperformed: Multilingual BERT, GigaBERT, XLM-RoBERTa, QARiB, MARBERT, ArabicBERT, ARBERT, Arabic ALBERT, TF-IDF, hULMonA, FastText, AraELECTRA, AraGPT2, BOW + TF-IDF, LSTM, Majority Class, SVM, Mazajak, SVM + Mazajak, BiLSTM, AraGPT, Capsule Network, GRU, CNN, CNN-GRU, AraULMFiT, Mazajak + SVM, AraVec + SVM, fastText + SVM, AraVec + skip-gram (SG), AraVec + CBOW, Doc2vec + MLP, LSA + SVM, Doc2vec + SVM, TFIDF + SVM, Word2vec + SVM, Sentence embedding + LR classifier, TF-IDF + LR classifier, BOW + LR classifier, Multilingual BERT + SVM, ARBERT + SVM
Outperformed by: ArabicBERT, MARBERT, QARiB, ARBERT, AraBERT + Gradient Boosting Trees (GBT), Multi-Task Learning Model + MARBERT, Frenda et al. (2018) model, SVM
Highest F1 score: 0.99; lowest F1 score: 0.35; average F1 score: 0.769
AraBERT (Feature extraction)
Outperformed: Multilingual BERT + CNN, AraBERT, GigaBERT, FastText-SkipGram + CNN, FastText-SkipGram + BiLSTM, TF-IDF, BiLSTM, LSTM, Doc2vec_MLP, LSA_SVM, Doc2vec_SVM, TFIDF_SVM, Word2vec_SVM, Multilingual BERT (Fine-tuned), Multilingual BERT (Feature extraction)
Outperformed by: AraBERT (Fine-tuned)
Highest F1 score: 0.97; lowest F1 score: 0.606; average F1 score: 0.791
MARBERT (Fine-tuned)
Outperformed: AraBERT, ArabicBERT, ARBERT, Arabic ALBERT, GigaBERT, QARiB, Multilingual BERT, XLM-RoBERTa, Bi-LSTM + Mazajak, Gaussian Naive Bayes, Gated Recurrent Unit (GRU), AraELECTRA, BiLSTM, AraGPT, MTL-LSTM, MTL-CNN, CNN-CE, CNN-AraVec, Multi-headed-LSTM-CNNGRU, CNN-LSTM + Dialect Information, Multi-headed-LSTM-CNNGRU + TF-IDF
Outperformed by: AraBERT, QARiB, ARBERT, AraELECTRA, AraGPT2, Weighted ensemble, Multi-Task Learning (MTL)-CNN-LSTM
Highest F1 score: 0.934; lowest F1 score: 0.57; average F1 score: 0.710
ArabicBERT (Fine-tuned)
Outperformed: AraBERT, ARBERT, Arabic ALBERT, GigaBERT, XLM-RoBERTa, Multilingual BERT, QARiB, BiLSTM, AraGPT2, AraVec + skip-gram (SG), AraVec + CBOW, TF-IDF, CNN-Text, Bi-LSTM, SVM + TF-IDF
Outperformed by: QARiB, AraBERT, MARBERT, ArabicBERT + CNN, ARBERT, AraELECTRA
Highest F1 score: 0.884; lowest F1 score: 0.53; average F1 score: 0.721
ArabicBERT (Feature extraction)
Outperformed: ArabicBERT, Multilingual BERT, CNN-Text, Bi-LSTM, SVM + TF-IDF
Outperformed by: —
Highest F1 score: 0.897; lowest F1 score: 0.897; average F1 score: 0.897
ARBERT (Fine-tuned)
Outperformed: MARBERT, Arabic ALBERT, GigaBERT, Multilingual BERT, AraBERT, ArabicBERT, BiLSTM, AraGPT2
Outperformed by: AraBERT, QARiB, MARBERT, ArabicBERT, XLM-RoBERTa, AraGPT2, AraELECTRA
Highest F1 score: 0.891; lowest F1 score: 0.57; average F1 score: 0.721
XLM-RoBERTa (Fine-tuned)
Outperformed: ARBERT, Arabic ALBERT, GigaBERT, Multilingual BERT, FastText, Majority Class, BiLSTM, CNN, AraGPT2
Outperformed by: AraBERT, Multilingual BERT, MARBERT, ArabicBERT, QARiB, AraELECTRA
Highest F1 score: 0.922; lowest F1 score: 0.399; average F1 score: 0.684
QARiB (Fine-tuned)
Outperformed: ARBERT, AraELECTRA, AraBERT, AraGPT2, Multilingual BERT, XLM-RoBERTa, ArabicBERT, BiLSTM, Arabic ALBERT, GigaBERT, MARBERT
Outperformed by: AraBERT, MARBERT, ARBERT, Arabic ALBERT, AraELECTRA, XLM-RoBERTa, ArabicBERT
Highest F1 score: 0.87; lowest F1 score: 0.589; average F1 score: 0.750
GigaBERT (Fine-tuned)
Outperformed: Multilingual BERT, AraGPT2
Outperformed by: AraBERT, AraBERT + Gradient Boosting Trees (GBT), MARBERT, ArabicBERT, ARBERT, Arabic ALBERT, QARiB, XLM-RoBERTa, BiLSTM, AraELECTRA
Highest F1 score: 0.692; lowest F1 score: 0.51; average F1 score: 0.601
Arabic ALBERT (Fine-tuned)
Outperformed: GigaBERT, Multilingual BERT, AraGPT2
Outperformed by: AraBERT, MARBERT, ArabicBERT, ARBERT, QARiB, XLM-RoBERTa, BiLSTM, AraELECTRA
Highest F1 score: 0.691; lowest F1 score: 0.555; average F1 score: 0.623

ArabicBERT also achieved good performance both when finetuned and when used
as a feature-extraction method. Similar to AraBERT and Multilingual BERT, it achieved
its best performance when it was utilized for feature extraction. It outperformed the five
models that it was compared against. The finetuned ArabicBERT outperformed 15 models
(71%); seven of them were Arabic BERT models. It was outperformed by six models (29%),
five of which were Arabic BERT models. The performance of QARiB was also good. It
outperformed 11 out of 18 models (61%). The model was able to outperform all the other
eight Arabic BERT models, but was also outperformed by six of them.
ARBERT and XLM-RoBERTa showed mixed performance. ARBERT outperformed
six BERT and two non-BERT models. It was outperformed by five BERT and two non-
BERT models. XLM-RoBERTa outperformed four BERT and five non-BERT models. It was
outperformed by five BERT and one non-BERT model. The performance of the remaining
two Arabic BERT models was poor. GigaBERT outperformed 2 out of 12 models (17%), and
Arabic ALBERT outperformed 3 out of 11 models (27%).
As can be seen in Figure 5, the vast majority of the articles (90%) used the F1 score to
measure the performance of the models. Accuracy, precision, and recall were used in 60%,
46%, and 44% of the articles, respectively. Other measurements that were used include
Area Under the Receiver Operating Characteristic Curve (AUROC), Precision-Recall Curve
(PRC), and macro-average PR-AUC.
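These measures can all be computed from a model's predictions; the snippet below is a minimal illustration with scikit-learn (the labels are dummy values, not data from the reviewed studies).

# Computing the evaluation metrics most often reported in the reviewed articles.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 1, 0, 2]  # e.g., negative / positive / neutral
y_pred = [0, 1, 1, 1, 0, 2]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")  # macro-averaging over the classes
print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  "
      f"Recall={recall:.3f}  F1={f1:.3f}")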
Figure 5. Performance measures.

3.5. Evaluation Datasets

Table 6 shows the different datasets that were used to evaluate the nine BERT models.
The vast majority of the datasets had a short text corpus. More than 67% contained tweets
extracted from Twitter and around 28% had posts from other social media platforms such
as Facebook. Only two datasets had a long text corpus. One of them had hotel and book
reviews, while the other one had articles classified by subject. The sizes of the datasets
varied; the smallest dataset had 2479 items and the largest had 2.4 million items. The
majority, however, had less than twenty-thousand items each. In 34 articles (71%), the items
in the datasets were classified as either positive, negative, or neutral. The remaining
datasets had items belonging to multiple classes.

Table 6. Datasets that were used to evaluate the models.

Type of Data | Size of Dataset | Number of Articles | Models
Sarcastic and non-sarcastic tweets | 12,548, 10,547, 15,548 | 13 | AraBERT, Multilingual BERT, GigaBERT, XLM-RoBERTa, MARBERT, ArabicBERT, ARBERT, Arabic ALBERT, QARiB
Offensive and hate speech posts from social media | 20,000, 6024, 5846, 7839, 10,000, 9000, 5011, 10,828, 27,800, 13,794, 10,000 | 11 | AraBERT, Multilingual BERT
Tweets written in Standard and dialectal Arabic | 22,000, 5971, 11,112, 2479, 2,400,000, 540,000, 21,000, 10,000, 288,086 | 8 | Multilingual BERT, MARBERT, AraBERT, ArabicBERT
Tweets about customer satisfaction | 20,000 | 1 | AraBERT
Accurate and inaccurate tweets about COVID-19 | 4966, 3032, 8000, 4072 | 4 | AraBERT, QARiB, ARBERT, MARBERT, Multilingual BERT
Misogynistic and non-misogynistic tweets | 6603 | 1 | AraBERT
Spam and ham tweets | 134,222 | 1 | AraBERT
Hateful and normal tweets | 8964, 3480, 5340 | 3 | Multilingual BERT, AraBERT
Adult (sexual) and normal tweets | 50,000 | 1 | Multilingual BERT, AraBERT
Positive and negative tweets | 10,000, 3962, 10,000 | 3 | AraBERT, Multilingual BERT, ArabicBERT
Hotel and book reviews | 156,700 | 1 | AraBERT, Multilingual BERT
Articles classified by subject | 22,429 | 1 | AraBERT, Multilingual BERT

4. Discussion
The publication trend indicates the rapidly growing interest in using BERT models for
Arabic text classification. However, given the relatively small number of articles that have
been found in this review, more research is required. Overall, this review of the 48 articles
helped to answer the three research questions.

4.1. What BERT Models Were Used for Arabic Text Classification, and How Do They Differ?
As was shown in the previous section, nine different BERT models were used for
Arabic text classification. Some of them were designed specifically for the Arabic language,
while others support multiple languages, including Arabic.
1. Multilingual BERT:
The model was developed by Devlin et al., the same team that developed the original
BERT model. It was not designed for Arabic specifically, but rather supports 104 lan-
guages, including Arabic [21]. It was pretrained on Modern Standard Arabic data from
Wikipedia. The size of the Arabic pretraining corpus is less than 1.4 gigabytes and has
only 7292 tokens [65,66]. Regarding its architecture, Multilingual BERT has 12 layers of
transformer blocks with 768 hidden units each. It also contains 12 self-attention heads and
around 110 million trainable parameters [26].
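This BERT-base architecture (12 transformer layers, 768 hidden units, 12 attention heads, roughly 110 million parameters) is shared by most of the models described in this section; the configuration sketch below is illustrative, with an assumed vocabulary size.

# Rough BERT-base configuration shared by most models in this review.
from transformers import BertConfig, BertModel

config = BertConfig(
    num_hidden_layers=12,    # 12 layers of transformer blocks
    hidden_size=768,         # 768 hidden units per layer
    num_attention_heads=12,  # 12 self-attention heads
    vocab_size=30000,        # vocabulary size varies per model; assumed here
)
model = BertModel(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M trainable parameters")  # roughly 110M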
2. AraBERT:
Unlike Multilingual BERT, AraBERT was built specifically for the Arabic language. The
pretraining dataset contains Modern Standard Arabic news extracted from different Arabic
media. Version 1 of the model has 77 million sentences and 2.7 billion tokens corresponding
to around 23 gigabytes of text. This is 17 times the size of the Arabic pretraining dataset that
was used to train the Multilingual BERT. The newest version of the model used 3.5 times
more data for pretraining, i.e., 77 gigabytes of text. Similar to Multilingual BERT, AraBERT
has 12 transformer blocks with 768 hidden units each. It also has 12 self-attention heads
and a total of 110 million trainable parameters [11]. To support dialectical Arabic, the model
was further pretrained on a corpus of 12,000 sentences written in different Arabic dialects.
This customized version of AraBERT was called SalamBERT [59].

3. MARBERT:
This model was also designed specifically for the Arabic language, but unlike AraBERT,
it was pretrained on a huge Twitter corpus that contains text in both Modern Standard Arabic
and various Arabic dialects. The pretraining corpus has 1 billion tweets, which sum up to
almost 128 gigabytes of text. The number of tokens in the corpus is around 15.6 billion,
which is almost double the number of tokens of Version 2 of AraBERT. This makes it the
largest pretraining corpus of the nine models. MARBERT has the same architecture as
Multilingual BERT, but without the next-sentence prediction (NSP). The reason for omitting
NSP, according to the model’s developers, is that tweets are too short. The total number of
trainable parameters in MARBERT is around 160 million [27].
4. ArabicBERT:
The model has several versions that were all trained specifically for Arabic. The base
model has the same architecture as Multilingual BERT. It was pretrained on 8.2 billion
tokens extracted from Wikipedia and other Arabic resources, which makes up around
95 gigabytes of text. The data are in Modern Standard Arabic. The multi-dialect version,
however, was further pretrained on 10 million tweets that were written in different Arabic
dialects [58].
5. ARBERT:
This model was developed by the developers of MARBERT. The two differ in that
ARBERT was pretrained on Modern Standard Arabic text only. The text was extracted
from Wikipedia, news, and books sources. It has a total size of about 61 gigabytes and
6.2 billion tokens. ARBERT has the same architecture as Multilingual BERT, with
12 layers of transformer blocks, 768 hidden units, and 12 self-attention heads, but a total
of around 163 million trainable parameters [12].
6. XLM-RoBERTa:
This model was developed by researchers from Facebook. Similar to Multilingual
BERT, XLM-RoBERTa supports multiple languages (100 languages), including Arabic. It
was pretrained on Wikipedia and Common Crawl data. The Arabic pretraining corpus is
in Modern Standard Arabic and it has 2869 tokens that sum up to around 28 gigabytes of
data. The base version of the model has 12 layers of transformer blocks, 768 hidden units,
12 self-attention heads, and around 270 million trainable parameters. While the model
optimizes the BERT approach, it removes the NSP task and introduces a dynamic masking
technique [67].
7. QARiB:
The model was developed for the Arabic language specifically, by researchers from the
Qatar Computing Research Institute. Similar to MARBERT, QARiB was pretrained on text
in both Modern Standard Arabic and various Arabic dialects. The Modern Standard Arabic
text includes news and movie/TV subtitles, while the dialectical text includes tweets. The
model has the same architecture as Multilingual BERT. The latest version of the model
was pretrained on a corpus of 14 billion tokens, which makes up around 127 gigabytes of
text [40]. This is the second largest pretraining corpus after MARBERT's.
8. GigaBERT:
The model was originally designed as a bilingual BERT for zero-shot transfer learning
from English to Arabic. It was pretrained on both Arabic and English text. The English
pretraining data have 6.1 billion tokens, while the Arabic data have 4.3 billion tokens. The
Arabic data are in Modern Standard Arabic and are mostly comprised of news, Wikipedia
articles, and Common Crawl data. Similar to Multilingual BERT, GigaBERT has 12 layers
of transformer blocks with 768 hidden units each. It also contains 12 self-attention heads
and around 110 million trainable parameters [68].

9. Arabic ALBERT:
The model is the Arabic version of ALBERT, which stands for A Lite BERT. ALBERT
implements two design changes to BERT’s original architecture, i.e., parameter sharing and
factorization of the embedding parameterization. These changes produced a model that has
almost the same performance as the original BERT but has only a fraction of its parameters
and computational cost. Arabic ALBERT has three different versions: the base one has
12 layers of transformers blocks, 768 hidden units each, 12 self-attention heads, and only
12 million trainable parameters. The model was pretrained on a Modern Standard Arabic
corpus that has 4.4 billion tokens. The data are from Wikipedia articles and Common Crawl
data [34,69].
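For readers who want to experiment with these models, all nine are distributed as pretrained checkpoints; the Hugging Face hub identifiers below are assumptions based on commonly published releases (not taken from the reviewed articles) and should be verified before use.

# Assumed Hugging Face hub identifiers for the nine models (illustrative).
from transformers import AutoModel, AutoTokenizer

MODEL_IDS = {
    "Multilingual BERT": "bert-base-multilingual-cased",
    "AraBERT": "aubmindlab/bert-base-arabertv2",
    "MARBERT": "UBC-NLP/MARBERT",
    "ArabicBERT": "asafaya/bert-base-arabic",
    "ARBERT": "UBC-NLP/ARBERT",
    "XLM-RoBERTa": "xlm-roberta-base",
    "QARiB": "qarib/bert-base-qarib",
    "GigaBERT": "lanwuwei/GigaBERT-v4-Arabic-and-English",
    "Arabic ALBERT": "asafaya/albert-base-arabic",
}

name = "AraBERT"
tokenizer = AutoTokenizer.from_pretrained(MODEL_IDS[name])
model = AutoModel.from_pretrained(MODEL_IDS[name])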

4.2. How Effective Are They for Classifying Arabic Text?


In the majority of the 66 comparisons that were performed in the included studies,
Arabic BERT models showed superior performance for classifying Arabic text over the other
machine-learning models. Even in the few comparisons in which the used BERT model did
not perform particularly well, the reason does not seem to be the BERT technique itself, but
rather the specific Arabic BERT model that was used, as in most of these comparisons the
used model was Multilingual BERT. As was discussed earlier, Multilingual BERT was not
designed for Arabic specifically, but rather supports a large number of languages, including
Arabic. The superior performance of Arabic BERT models seems to be related to the unique
features of the BERT technique, i.e., its ability to be pretrained from the unlabeled text that
is enormously available on the web and its ability to predict words conditioned on both
their left and right context [3,5]. This finding is consistent with the many prior studies
that found BERT models outperformed other machine learning models in a variety of
natural language processing tasks including text classification [70,71]. This indicates that
researchers and practitioners should have BERT among their first options for Arabic text
classification tasks.
BERT models that were pretrained for the Arabic language specifically, e.g., AraBERT
and MARBERT, showed much better performance than those that support multiple lan-
guages, i.e., Multilingual BERT and XLM-RoBERTa. An obvious explanation for this is
the size of the pretraining corpora that were used to train these models. While Multi-
lingual BERT and XLM-RoBERTa have 7292 and 2869 tokens in their pretraining corpus,
respectively, AraBERT, for example, was pretrained with a corpus that has 2.7 billion to-
kens. This agrees with the results of studies on other languages. Virtanen, et al. [72] and
Martin, et al. [73], for example, found monolingual BERT models that were designed specif-
ically for Finnish and French outperformed Multilingual BERT on the natural language
processing tasks, including text classification.
Using Arabic BERT models as feature-extraction approaches seems to yield better
results than finetuning these models. In three out of four comparisons between the two
approaches (feature extraction vs. finetuning), the feature extraction gave a better per-
formance. However, this finding should be taken with caution considering the very few
studies that compared the two approaches. It also differs from the finding of the devel-
opers of the original BERT model [3], who compared the two approaches and reported that
their performance is comparable, with feature extraction being 0.3 F1 behind finetuning.
Arabic BERT models did not show significant differences in performance at multi-class
classification compared to binary classification. The highest difference in performance was
scored by Multilingual BERT. It had an average F1 score of 0.566 at multi-class classification
and an average of 0.715 at binary classification. Most of the other models, however, scored
slight differences in performance in the two classification tasks. For example, AraBERT had
an average F1 score of 0.715 at multi-class classification compared to an average of 0.794 at
binary classification. There is also the case of MARBERT, where the average F1 Score of the
multi-class classification (0.724) was slightly higher than the one of binary classification
(0.69). This indicates that the BERT technique is suitable for both Arabic multi-class and
binary classification.

Considering the performance of all nine models, they can be classified into low-
performing models (the ones with an average F1 Score < 0.7) and high-performing models
(the ones with an average F1 Score ≥ 0.7). The low-performing models include Multilin-
gual BERT, XLM-RoBERTa, Arabic ALBERT, and GigaBERT. The first two models were
pretrained so they can support over 100 languages each and therefore, as has been ex-
plained earlier, had small Arabic pretraining corpora, which affected their performance.
The same applies to Arabic ALBERT, which was pretrained on a slightly smaller corpus
compared to the high-performing models. Regarding GigaBERT, it was not designed for
text classification but rather as a bilingual BERT for zero-shot transfer learning from English
to Arabic.
The high-performing models include AraBERT, MARBERT, ArabicBERT, ARBERT,
and QARiB. A common feature that these high-performing models share is that they were
all pretrained on large Arabic corpora. By looking at the number of times each one of them
outperformed the other models, it can be said that MARBERT has the best performance
of the five. It is followed by QARiB and then ARBERT. In fourth place comes AraBERT,
and lastly, ArabicBERT. MARBERT and QARiB are the only models that were pretrained
on both Modern Standard Arabic and dialectical Arabic corpus. This seems to have had
a positive impact on their performance. The other three models, i.e., ARBERT, AraBERT,
and ArabicBERT, were only pretrained on Modern Standard Arabic corpus. Therefore,
when dealing with the testing datasets in which the majority of them contained dialectical
text, MARBERT and QARiB were able to perform slightly better than ARBERT, AraBERT,
and ArabicBERT.
A final point worth mentioning is that a large pretraining corpus does not always
lead to a better performance. Version 1 of AraBERT, which was pretrained on around
23 gigabytes of text, showed better performance than ArabicBERT, which was pretrained
on 95 gigabytes of text. Out of the six comparisons that included both models, AraBERT
was able to outperform ArabicBERT in five comparisons (83%). Therefore, it can be said
that the quality of the pretraining data has a significant impact on the performance of the
BERT model.

4.3. How Effective Are They Compared to the Original English BERT Models?
The performance of the Arabic BERT models, as was found in the reviewed articles, is
no different than the performance of the original English BERT model. The highest perfor-
mance on a binary classification was scored by AraBERT (accuracy = 0.997, F1 Score = 0.981)
on a dataset of 134,222 tweets classified as either spam or ham. The highest performance on
a multi-class classification was also scored by AraBERT (accuracy = 0.99, F1 Score = 0.99)
on a dataset of 22,429 articles classified by topic, e.g., economics, science, and law. Close
results were reported for the English language in a comprehensive review conducted by
Minaee, Kalchbrenner, Cambria, Nikzad, Chenaghlu and Gao [70]. They reported that
the BERT-base model was able to achieve an accuracy score of 0.981 on the Yelp dataset
that contained around 600,000 business reviews classified into two classes, i.e., negative
and positive. For multi-class classification, Minaee et al. reported that the BERT-large
model scored an accuracy score of 0.993 on the DBpedia dataset that contained around
630,000 items, each with a 14-class label.
This excellent performance was somewhat surprising considering the fact that the
Arabic language is normally classified as a low-resource language that does not have a
large amount of resources available on the web [74–76]. It seems that this classification is
not valid anymore. The developers of the Arabic BERT models were able to find billions
of tokens of text to train their models. Wikipedia and news websites provided a plethora
of text for training. The social media platform Twitter was also a major source of data.
According to the latest report from Statista [77], two Arabic countries, i.e., Saudi Arabia
and Egypt, are among the top leading countries based on the number of Twitter users as of
January 2022.

Despite the excellent performance of Arabic BERT models compared to the original
English BERT, two shortcomings of the included studies have to be highlighted. First,
while English BERT was evaluated on a variety of long and short text corpora [70,78], the
vast majority of the datasets that were used to evaluate the performance of the Arabic
BERT models had short text corpus. In fact, only two datasets had a long text corpus.
Therefore, the data that were found in the reviewed articles were not sufficient to judge
the performance of Arabic BERT models when dealing with documents that had large
text. According to Rao, et al. [79], short text has unique features that differ from long text
including the syntactical structure, sparse nature, noise words, and the use of colloquial
terminologies. The second shortcoming of the included studies is related to the type of data
in the evaluation datasets. Unlike English BERT which has been evaluated with datasets
from different knowledge domains [70,71], the vast majority of the datasets that have been
used to evaluate Arabic BERT models had tweets extracted from the Twitter platform.
Therefore, it is not clear whether the current Arabic BERT models are suitable to be applied
in a variety of application domains.

5. Implications for Future Research


The outcomes of this systematic review have opened up new research areas for further
improvements and future work. First, the nine models that were identified in this review
are general-purpose BERT models that were pretrained on data from Wikipedia, Common
Crawl, and social media posts. Therefore, it is most likely that the high performance
that they showed will not be achieved when they are applied in specific domains such
as medicine, finance, law, and industry. Lee, et al. [80], for example, found that applying
the original English BERT model to biomedical corpora did not yield satisfactory results
due to a word-distribution shift from general domain corpora to biomedical corpora. They
developed a new BERT model that they called BioBERT. The model was pretrained on large-
scale biomedical corpora and therefore was able to outperform the original BERT model
in a variety of biomedical natural language processing tasks. It would be interesting to
investigate domain-specific Arabic BERT models, if any, evaluate them on text classification
tasks, and explore the different domains in which more models are needed.
It would be also interesting to investigate the impact of the quality of pretraining text
on the performance of the pretrained model. As explained in the previous section, a large
pretraining corpus does not always lead to a better performance. Previous studies also
found that improving performance by relying on more pretraining data is very expensive
because of the diminishing returns of such an approach [81,82]. Pretraining BERT models
costs substantial time and money. The Arabic language is a very rich language that contains
millions of words [83]. Therefore, it seems that careful selection of the type and amount
of pretraining data would have a substantial impact on the performance of Arabic BERT
models. More studies in this area are needed.
In addition, the study showed that using Arabic BERT models as a feature extraction
approach mostly yields better results than finetuning. However, two issues should be noted
regarding this finding. First, only a few studies in this review compared the two approaches.
Second, the finding does not agree with previous studies in other languages such as English
and Chinese that compared the two approaches [3,84]. Therefore, more studies in this
regard should be conducted to confirm this finding and also to explore when and how each
approach would perform the best.
Furthermore, this review shows the need to develop more labeled datasets for Arabic
natural language processing tasks. There is a relatively small number of Arabic datasets
available for this type of task compared to English datasets. The majority of the available
Arabic datasets, as found in this review, contain tweets only. According to Li, Peng, Li,
Xia, Yang, Sun, Yu and He [71], the availability of labeled datasets for natural language
processing tasks is a main driving force for the rapid advancement in this research field.
Lastly, none of the reviewed studies provided insights into how the morphological,
syntactic, and orthographic features of Arabic versus English would influence the perfor-
mance of BERT models. As these features differ greatly between the two languages [85,86],
more research about their impact on the performance of BERT models is needed.

6. Conclusions
BERT proved to be an invaluable technique in natural language processing. Most of
the research, however, has looked at applying BERT to the resource-rich English language.
Therefore, an analysis of the state-of-the-art application of BERT to Arabic text classification
was conducted. The aim was to: (1) identify BERT models that have been used for Arabic
text classification, (2) compare their performance, and (3) understand how effective they
are compared to the original English BERT models. To the best of the author’s knowledge,
this is the first systematic review on this topic. The review includes 48 articles and yields
several findings. First, it identified nine different models that could be used for classifying
Arabic text. Two of them support many languages, including Arabic; one supports both
Arabic and English; while the remaining six were developed specifically for the Arabic
language. In most of the reviewed studies, the models showed high performance compa-
rable to that of the English BERT models. The highest performing models, in descending
order of performance, were MARBERT, QARiB, ARBERT, AraBERT, and ArabicBERT. A
common feature that these high-performing models share is that they were all pretrained
for Arabic and on a large Arabic corpus. The first two models were pretrained on both
Modern Standard Arabic and a dialectical Arabic corpus, which might have improved their
performance further. Synthesizing the existing research on text classification using Arabic
BERT models, this study also identified new research areas for further improvements and
future work.

Funding: This research received no external funding.


Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Acknowledgments: I would like to thank Ehsan Ahmad and Abdulbasid Banga for their invaluable
help in conducting this systematic review.
Conflicts of Interest: The author declares no conflict of interest.

References
1. Vijayan, V.K.; Bindu, K.; Parameswaran, L. A comprehensive study of text classification algorithms. In Proceedings of the
2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Udupi, India, 13–16
September 2017; pp. 1109–1113.
2. El-Din, D.M.; Hussein, M. A survey on sentiment analysis challenges. J. King Saud Univ.-Eng. Sci. 2018, 30, 330–338. [CrossRef]
3. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pretraining of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
4. Rogers, A.; Kovaleva, O.; Rumshisky, A. A Primer in BERTology: What We Know About How BERT Works. Trans. Assoc. Comput.
Linguist. 2020, 8, 842–866. [CrossRef]
5. Zaib, M.; Sheng, Q.Z.; Emma Zhang, W. A short survey of pretrained language models for conversational AI-a new age in NLP.
In Proceedings of the Australasian Computer Science Week Multiconference, Canberra, Australia, 1–5 February 2016; Association
for Computing Machinery: New York, NY, USA, 2020; pp. 1–4.
6. Alshalan, R.; Al-Khalifa, H. A deep learning approach for automatic hate speech detection in the saudi twittersphere. Appl. Sci.
2020, 10, 8614. [CrossRef]
7. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V.J. Roberta: A robustly
optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692.
8. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T.J. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv
2019, arXiv:1910.01108.
9. Almuqren, L. Twitter Analysis to Predict the Satisfaction of Saudi Telecommunication Companies’ Customers. Ph.D. Thesis,
Durham University, Durham, UK, 2021.
10. Pelicon, A.; Shekhar, R.; Škrlj, B.; Purver, M.; Pollak, S. Investigating cross-lingual training for offensive language detection. PeerJ
Comput. Sci. 2021, 7, e559. [CrossRef]

11. Antoun, W.; Baly, F.; Hajj, H. Arabert: Transformer-based model for arabic language understanding. arXiv 2020, arXiv:2003.00104.
12. Abdul-Mageed, M.; Elmadany, A.; Nagoudi, E.M.B. ARBERT & MARBERT: Deep bidirectional transformers for Arabic. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing, Online, 1–6 August 2021; pp. 7088–7105.
13. James, K.L.; Randall, N.P.; Haddaway, N.R. A methodology for systematic mapping in environmental sciences. Environ. Evid.
2016, 5, 1–13. [CrossRef]
14. Moher, D.; Altman, D.G.; Liberati, A.; Tetzlaff, J. PRISMA statement. Epidemiology 2011, 22, 128. [CrossRef]
15. Paez, A. Gray literature: An important resource in systematic reviews. J. Evid.-Based Med. 2017, 10, 233–240. [CrossRef] [PubMed]
16. Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; EBSE: Durham, UK, 2007.
17. Zhou, Y.; Zhang, H.; Huang, X.; Yang, S.; Babar, M.A.; Tang, H. Quality assessment of systematic reviews in software engineering:
A tertiary study. In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering,
Nanjing, China, 27–29 April 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 1–14.
18. Bondas, T.; Hall, E.O. Challenges in approaching metasynthesis research. Qual. Health Res. 2007, 17, 113–121. [CrossRef] [PubMed]
19. Morgan, J.A.; Olagunju, A.T.; Corrigan, F.; Baune, B.T. Does ceasing exercise induce depressive symptoms? A systematic review of
experimental trials including immunological and neurogenic markers. J. Affect. Disord. 2018, 234, 180–192. [CrossRef] [PubMed]
20. Alammary, A. Blended learning models for introductory programming courses: A systematic review. PLoS ONE 2019,
14, e0221765. [CrossRef] [PubMed]
21. Bilal, S. A Linguistic System for Predicting Sentiment in Arabic Tweets. In Proceedings of the 2021 3rd International Conference
on Natural Language Processing (ICNLP), Beijing, China, 26–28 March 2021; pp. 134–138.
22. Al-Twairesh, N.; Al-Negheimish, H. Surface and deep features ensemble for sentiment analysis of Arabic tweets. IEEE Access
2019, 7, 84122–84131. [CrossRef]
23. Pàmies Massip, M. Multilingual Identification of Offensive Content in Social Media. Available online:
https://www.diva-portal.org/smash/get/diva2:1451543/FULLTEXT01.pdf (accessed on 19 December 2021).
24. Moudjari, L.; Akli-Astouati, K.; Benamara, F. An Algerian corpus and an annotation platform for opinion and emotion analysis.
In Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, 11–16 May 2020;
pp. 1202–1210.
25. Khalifa, M.; Hassan, H.; Fahmy, A. Zero-Resource Multi-Dialectal Arabic Natural Language Understanding. Int. J. Adv. Comput.
Sci. Appl. 2021, 12, 1–15. [CrossRef]
26. Alshehri, A.; Nagoudi, E.M.B.; Abdul-Mageed, M. Understanding and Detecting Dangerous Speech in Social Media; European
Language Resources Association: Paris, France, 2020.
27. Abdul-Mageed, M.; Zhang, C.; Elmadany, A.; Ungar, L. Toward micro-dialect identification in diaglossic and code-switched
environments. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, 8–12
November 2020; pp. 5855–5876.
28. Ameur, M.S.H.; Aliane, H. AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News & Hate Speech Detection Dataset.
Procedia Comput. Sci. 2021, 189, 232–241.
29. Moudjari, L.; Karima, A.-A. An experimental study on sentiment classification of Algerian dialect texts. Procedia Comput. Sci.
2020, 176, 1151–1159. [CrossRef]
30. Alsafari, S.; Sadaoui, S.; Mouhoub, M. Deep learning ensembles for hate speech detection. In Proceedings of the 2020 IEEE 32nd
International Conference on Tools with Artificial Intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020; pp. 526–531.
31. Abdelali, A.; Mubarak, H.; Samih, Y.; Hassan, S.; Darwish, K. QADI: Arabic dialect identification in the wild. In Proceedings of
the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021; pp. 1–10.
32. Alsafari, S.; Sadaoui, S.; Mouhoub, M. Hate and offensive speech detection on Arabic social media. Online Soc. Netw. Media
2020, 19, 100096. [CrossRef]
33. Mubarak, H.; Hassan, S.; Abdelali, A. Adult content detection on Arabic Twitter: Analysis and experiments. In Proceedings of the
Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021; pp. 136–144.
34. Farha, I.A.; Magdy, W. Benchmarking transformer-based language models for Arabic sentiment and sarcasm detection. In
Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021; pp. 21–31.
35. Uyangodage, L.; Ranasinghe, T.; Hettiarachchi, H. Transformers to fight the COVID-19 infodemic. Available online:
https://arxiv.org/pdf/2104.12201.pdf (accessed on 4 December 2021).
36. Obied, Z.; Solyman, A.; Ullah, A.; Fat’hAlalim, A.; Alsayed, A. BERT Multilingual and Capsule Network for Arabic Sentiment
Analysis. In Proceedings of the 2020 International Conference on Computer, Control, Electrical, and Electronics Engineering
(ICCCEEE), Khartoum, Sudan, 26–28 February 2021; pp. 1–6.
37. Mubarak, H.; Rashed, A.; Darwish, K.; Samih, Y.; Abdelali, A. Arabic Offensive Language on Twitter: Analysis and Experiments.
Available online: https://arxiv.org/pdf/2004.02192.pdf (accessed on 17 November 2021).
38. Safaya, A.; Abdullatif, M.; Yuret, D. KUISAIL at SemEval-2020 Task 12: BERT-CNN for offensive speech identification in social media.
In Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona, Spain, 12–13 December 2020; pp. 2054–2059.
39. El-Alami, F.-z.; El Alaoui, S.O.; Nahnahi, N.E. Contextual semantic embeddings based on fine-tuned AraBERT model for Arabic
text multi-class categorization. J. King Saud Univ. Comput. Inf. Sci. 2021. [CrossRef]
40. Abdelali, A.; Hassan, S.; Mubarak, H.; Darwish, K.; Samih, Y. Pre-training BERT on Arabic tweets: Practical considerations. arXiv
2021, arXiv:2102.10684.
41. Mansour, M.; Tohamy, M.; Ezzat, Z.; Torki, M. Arabic dialect identification using BERT fine-tuning. In Proceedings of the Fifth
Arabic Natural Language Processing Workshop, Barcelona, Spain, 12 December 2020; pp. 308–312.
42. Balaji, N.N.A.; Bharathi, B. Semi-supervised fine-grained approach for Arabic dialect detection task. In Proceedings of the Fifth
Arabic Natural Language Processing Workshop, Barcelona, Spain, 12 December 2020; pp. 257–261.
43. Abuzayed, A.; Al-Khalifa, H. Sarcasm and sentiment detection in Arabic tweets using BERT-based models and data augmentation.
In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021; pp. 312–317.
44. Saeed, H.H.; Calders, T.; Kamiran, F. OSACT4 shared tasks: Ensembled stacked classification for offensive and hate speech in
Arabic tweets. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on
Offensive Language Detection, Marseille, France, 12 May 2020; pp. 71–75.
45. Zhang, C.; Abdul-Mageed, M. No army, no navy: BERT semi-supervised learning of Arabic dialects. In Proceedings of the Fourth
Arabic Natural Language Processing Workshop, Florence, Italy, 1 August 2019; pp. 279–284.
46. Naski, M.; Messaoudi, A.; Haddad, H.; BenHajhmida, M.; Fourati, C.; Mabrouk, A.B.E. iCompass at Shared Task on Sarcasm and
Sentiment Detection in Arabic. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19
April 2021; pp. 381–385.
47. Hassan, S.; Samih, Y.; Mubarak, H.; Abdelali, A. ALT at SemEval-2020 Task 12: Arabic and English offensive language identification
in social media. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona, Spain, 12–13 December 2020;
pp. 1891–1897.
48. Faraj, D.; Abdullah, M. SarcasmDet at Sarcasm Detection Task 2021 in Arabic using AraBERT pretrained model. In Proceedings of the
Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021; pp. 345–350.
49. Israeli, A.; Nahum, Y.; Fine, S.; Bar, K. The IDC System for Sentiment Classification and Sarcasm Detection in Arabic. In
Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021; pp. 370–375.
50. Aldjanabi, W.; Dahou, A.; Al-qaness, M.A.; Abd Elaziz, M.; Helmi, A.M.; Damaševičius, R. Arabic Offensive and Hate Speech
Detection Using a Cross-Corpora Multi-Task Learning Model. Informatics 2021, 8, 69. [CrossRef]
51. Elgabry, H.; Attia, S.; Abdel-Rahman, A.; Abdel-Ate, A.; Girgis, S. A contextual word embedding for Arabic sarcasm detection
with random forests. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021;
pp. 340–344.
52. Alam, F.; Shaar, S.; Dalvi, F.; Sajjad, H.; Nikolov, A.; Mubarak, H.; Martino, G.D.S.; Abdelali, A.; Durrani, N.; Darwish, K. Fighting
the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the
society. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican
Republic, 7–11 November 2021.
53. Al-Yahya, M.; Al-Khalifa, H.; Al-Baity, H.; AlSaeed, D.; Essam, A. Arabic Fake News Detection: Comparative Study of Neural
Networks and Transformer-Based Approaches. Complexity 2021, 2021. [CrossRef]
54. Mulki, H.; Ghanem, B. Let-Mi: An Arabic Levantine Twitter dataset for misogynistic language. arXiv 2021, arXiv:2103.10195.
55. Mubarak, H.; Abdelali, A.; Hassan, S.; Darwish, K. Spam detection on Arabic Twitter. In Proceedings of the International
Conference on Social Informatics, Pisa, Italy, 6–9 October 2020; pp. 237–251.
56. Mubarak, H.; Hassan, S. ArCorona: Analyzing Arabic tweets in the early days of coronavirus (COVID-19) pandemic. In
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis, Virtual Conference, Online,
19–20 April 2021.
57. El-Alami, F.-z.; El Alaoui, S.O.; Nahnahi, N.E. A multilingual offensive language detection method based on transfer learning
from transformer fine-tuning model. J. King Saud Univ. Comput. Inf. Sci. 2021. [CrossRef]
58. Al-Twairesh, N. The Evolution of Language Models Applied to Emotion Analysis of Arabic Tweets. Information 2021, 12, 84.
[CrossRef]
59. Husain, F.; Uzuner, O. Leveraging offensive language for sarcasm and sentiment detection in Arabic. In Proceedings of the Sixth
Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021; pp. 364–369.
60. Wadhawan, A. AraBERT and Farasa segmentation based approach for sarcasm and sentiment detection in Arabic tweets. In
Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021.
61. Bashmal, L.; AlZeer, D. ArSarcasm Shared Task: An Ensemble BERT Model for Sarcasm Detection in Arabic Tweets. In Proceedings
of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021; pp. 323–328.
62. Gaanoun, K.; Benelallam, I. Sarcasm and Sentiment Detection in Arabic Language: A Hybrid Approach Combining Embeddings
and Rule-based Features. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April
2021; pp. 351–356.
63. Alharbi, A.I.; Lee, M. Multi-task learning using a combination of contextualised and static word embeddings for Arabic sarcasm
detection and sentiment analysis. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19
April 2021; pp. 318–322.
64. Abdel-Salam, R. WANLP 2021 shared-task: Towards irony and sentiment detection in Arabic tweets using multi-headed
LSTM-CNN-GRU and MARBERT. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kyiv, Ukraine, 19 April 2021;
pp. 306–311.
65. Wu, S.; Dredze, M. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation
Learning for NLP, Online, 9 July 2020; pp. 120–130.
66. Abdaoui, A.; Pradel, C.; Sigel, G. Load What You Need: Smaller Versions of Multilingual BERT. In Proceedings of the SustaiNLP:
Workshop on Simple and Efficient Natural Language Processing, Online, 10 November 2021; pp. 119–123.
67. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V.
Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451.
68. Lan, W.; Chen, Y.; Xu, W.; Ritter, A. An Empirical Study of Pre-trained Transformers for Arabic Information Extraction. In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November
2020; pp. 4727–4734.
69. Safaya, A. Arabic-ALBERT. arXiv 2022, arXiv:2201.07434.
70. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep learning–based text classification: A
comprehensive review. ACM Comput. Surv. 2021, 54, 1–40. [CrossRef]
71. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A Survey on Text Classification: From Traditional to Deep Learning.
ACM Trans. Intell. Syst. Technol. 2021, 37. [CrossRef]
72. Virtanen, A.; Kanerva, J.; Ilo, R.; Luoma, J.; Luotolahti, J.; Salakoski, T.; Ginter, F.; Pyysalo, S. Multilingual is not enough: BERT for
Finnish. arXiv 2019, arXiv:1912.07076.
73. Martin, L.; Muller, B.; Suárez, P.J.O.; Dupont, Y.; Romary, L.; De La Clergerie, É.V.; Seddah, D.; Sagot, B. CamemBERT: A Tasty
French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online,
5–10 July 2020; pp. 7203–7219.
74. Ranasinghe, T.; Zampieri, M. Multilingual offensive language identification for low-resource languages. ACM Trans. Asian
Low-Resour. Lang. Inf. Process. 2021, 21, 1–13. [CrossRef]
75. Jain, M.; Mathew, M.; Jawahar, C. Unconstrained scene text and video text recognition for Arabic script. In Proceedings of the 2017
1st International Workshop on Arabic Script Analysis and Recognition (ASAR), Nancy, France, 3–5 April 2017; pp. 26–30.
76. Himdi, H.; Weir, G.; Assiri, F.; Al-Barhamtoshy, H. Arabic fake news detection based on textual analysis. Arab. J. Sci. Eng. 2022,
1–17. [CrossRef] [PubMed]
77. Statista. Leading Countries Based on Number of Twitter Users as of January 2022. Available online:
https://www.statista.com/statistics/242606/number-of-active-twitter-users-in-selected-countries/ (accessed on 12 January 2022).
78. Moores, B.; Mago, V. A Survey on Automated Sarcasm Detection on Twitter. arXiv 2022, arXiv:2202.02516.
79. Rao, Y.; Xie, H.; Li, J.; Jin, F.; Wang, F.L.; Li, Q. Social emotion classification of short text via topic-level maximum entropy model.
Inf. Manag. 2016, 53, 978–986. [CrossRef]
80. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model
for biomedical text mining. Bioinformatics 2019, 36, 1234–1240. [CrossRef]
81. Schwartz, R.; Dodge, J.; Smith, N.A.; Etzioni, O. Green AI. Commun. ACM 2020, 63, 54–63. [CrossRef]
82. Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of
the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 843–852.
83. Al-Maimani, M.R.; Al Naamany, A.; Bakar, A.Z.A. Arabic information retrieval: Techniques, tools and challenges. In Proceedings
of the 2011 IEEE GCC Conference and Exhibition (GCC), Dubai, United Arab Emirates, 19–22 February 2011; pp. 541–544.
84. Wang, Y.; Sun, Y.; Ma, Z.; Gao, L.; Xu, Y. Named Entity Recognition in Chinese Medical Literature Using Pretraining Models. Sci.
Program. 2020, 2020, 8812754. [CrossRef]
85. Khemakhem, I.T.; Jamoussi, S.; Hamadou, A.B. Integrating morpho-syntactic features in English-Arabic statistical machine
translation. In Proceedings of the Second Workshop on Hybrid Approaches to Translation, Sofia, Bulgaria, 8 August 2013;
pp. 74–81.
86. Akan, M.F.; Karim, M.R.; Chowdhury, A.M.K. An analysis of Arabic-English translation: Problems and prospects. Adv. Lang. Lit.
Stud. 2019, 10, 58–65. [CrossRef]