Root Identification Tool For Arabic Verbs
1 author:
Bakeel Azman
Sana'a University
All content following this page was uploaded by Bakeel Azman on 09 March 2020.
ABSTRACT Numerous Arabic morphology systems have been developed to supply the morphological
analyses of words that other text analyzers require. Root extraction is an essential requirement in those
systems, yet the rooting modules of state-of-the-art morphology systems do not adequately meet that
requirement, especially for verbs, because they stop at the stem of a term rather than its root. Since the stem
of a verb is not its root, it is not feasible to generate or infer the verb’s derivations and
all of its surface forms (patterns), such as tense, number, mood, person, aspect, and other irregular verb
patterns. Therefore, we propose a new model for identifying the verb’s root, implemented in a tool (RootIT), in
order to extract verb roots without the disambiguation needed by the traditional methods applied in current
morphology systems. A major design goal of this system is that it can be used as a standalone tool and
can be readily integrated with other linguistic analyzers. The adopted approach maps a
surface verb against the full set of derived verb forms stored beforehand in a relational database.
The proposed system is tested on a dataset of PATB verbs extracted with the CoreNLP system. The
extracted dataset contains more than (7950) distinct verbs belonging to (1938) different roots.
The results obtained exceed those of the best compared system by (2.74%) in accuracy.
INDEX TERMS Root, verb, pattern, stem, morphology, identifying, and ANLP.
2169-3536 2019 IEEE. Translations and content mining are permitted for academic research only.
45866 Personal use is also permitted, but republication/redistribution requires IEEE permission. VOLUME 7, 2019
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
B. Azman: Root Identification Tool for Arabic Verbs
specific tool or sufficient embedded module able to identify a word’s root.
Our proposed model is specialized in extracting the real verb root in its original (3) or (4)
characters, without being occupied with other morphological features, although it is possible to
include those features because both the root and its surface forms are modeled. The main method of
the RootIT tool depends on a hierarchical tree structure for each verb root, starting with the root
as the tree root, followed by several levels, and ending with leaves that represent the real verb
forms found in real text. All surface forms of the verbs have been derived via the ALECSO (Arab
League Educational, Cultural and Scientific Organization) derivational system Sarf [7]. By this
method, all derived and surface forms of verb roots are represented in a relational database; in
this way, any verb root can be extracted. The tool can be applied to any text and to any verb form,
without the input constraints found in similar systems.
A. ROOT vs STEM
A root is a form that is not further analyzable, either in terms of derivational or inflectional
morphology [8]. Additionally, it is the part of a word-form that remains when all inflectional and
derivational affixes are removed. A root is also the basic part that is always present in a lexeme.
The stem, by contrast, is the root of a word together with any derivational and inflectional affixes
that are added [9]. A stem consists minimally of a root, but may be analyzable into a root plus
derivational morphemes. A stem may require an inflectional operation (often involving a prefix or
suffix) in order to ground it into discourse and make it a fully understandable word. If a stem does
not stand by itself in a meaningful way in the language, it is referred to as a bound morpheme.
Furthermore, there is a term called ‘‘base’’, which is any form to which affixes of any kind can be
added [10]. This means that any root or any stem can be termed a base, but the set of bases is not
exhausted by the union of the set of roots and the set of stems: a derivationally analyzable form to
which derivational affixes are added can only be referred to as a base. That is, ‘‘target’’ can act
as a base for a prefix to give ‘‘will target’’, but in this process it could not be referred to as a
root because it is analyzable in terms of derivational morphology, nor as a stem since it is not the
adding of inflectional affixes which is in question [11].
A root differs partially from a stem in that a stem must have lexical meaning. A root has no lexical
meaning, and the semantic range of the root is vague, if there is any. A stem may contain
derivational affixes, and it becomes of concern only when dealing with inflectional morphology. In
the form ‘‘They will target’’ the stem is ‘‘ ’’, although in the form ‘‘ ’’ the stem is ‘‘ ’’.
Stemming sometimes affects the semantics of a word, whereas a lemma preserves the meaning of a
word [12].
Root, stem and base are all terms used in the literature to designate that part of a word that
remains when all affixes are removed. The distinction is only useful in a highly inflected language
like Arabic. In English, both words are used in the same way, to indicate which one can have an
affix. Since there are very few affixes in English, it doesn’t really matter.
II. RELATED WORK
Current morphology systems seek to obtain the stem of a word rather than the root. They aim to
obtain the stem of a term in its non-affixed form, as does MADAMIRA [13], which adopts an attached
toolkit that produces the stem by decomposing the input word, e.g. into suffixes and prefixes along
with clitics. The extraction rules fail to obtain a sound root for, e.g., ‘‘denounce’’, ‘‘commit’’,
‘‘hope’’ and ‘‘be stained’’. All morphology systems are supposed to distinguish between stem and
root [14]. The potential root of a term should be cast into its original orthography of letters,
just the trilateral or quadrilateral atom.
The list of proposed Arabic stemming models is huge [15]–[22]; thus, it is impossible to review them
all in this section. Nonetheless, we review the main approaches adopted in common models and
highlight the state of the art on the subject. There are two dominant approaches in root extraction
systems: light stemmers and root-based stemmers [23]. Stem-based algorithms remove prefixes and
suffixes from Arabic words, while root-based algorithms reduce stems to roots [24]. Light stemming
refers to the process of stripping off a small set of prefixes and/or suffixes without trying to
deal with infixes or recognize patterns and find roots [14]; in short, a light stemmer cannot go
deep enough to extract the root [25], [26]. The second approach is root-based stemming; as the name
implies, the root is extracted from the word by means of morphological analysis. It attempts to
restore the original root of a word and group words accordingly [23]. The two basic steps of
root-based stemmers are, first, to remove prefixes and suffixes and, second, to extract roots by
analyzing Arabic words according to their morphological components. This is accomplished by
rule-based techniques, table lookup [25], or a mixture of the two.
One of the earliest techniques developed for root-based stemmers is the Khoja stemmer [20]. In this
technique, prefixes and suffixes are removed, then two dictionaries are used: one to match the
remaining letters against Arabic patterns, and a second to confirm the correctness of the root.
Taghva et al. [27] is similar to Khoja et al. but uses no dictionaries; rather, a rule-based
technique is used. Sonbol et al. [28] use a rule-based technique to extract roots by dividing
letters into those that are part of the root and others, which are further divided into sub-groups
that are examined with well-defined rules to extract the final root, with no use of a dictionary.
Moreover, a spline function technique has been utilized for extracting roots. That method is divided
into two phases. The first phase seeks all the possible roots of each term, analyzed out of context
with a morphological analyzer. The second phase constructs a disambiguation approach based on
continuous quadratic splines to choose, among these roots, the one that corresponds to the word’s
context [29]. Momani and Faraj [30]
filters rootless words, and then removes suffixes and prefixes. It removes an excess letter only if
it occurs more than once in a word.
A new stemming model was introduced to design and implement an Arabic light stemmer that is claimed
to identify the word root. It uses predefined mathematical rules and several relations between the
letters of the term. After an appropriate rule is applied to a word, some clitics are removed, and
the resulting word is mapped against a roots dictionary. If no root is matched in the dictionary,
the splitting of the original word is repeated with another rule. The process continues recursively
until a root is found or the last category of rules yields no root [31]. However, the rule set
cannot cope with the multitude of Arabic word morphology, as evidenced by some unsound words
( . . . , etc.). Mohammed [32] recognized the problems of the two approaches, so he proposed an
Arabic stemmer that combines the rules of the root-based stemmer and the light stemmer in order to
overcome those failures. Such problems are removing the affix before matching it with a Tafeala and
not dealing with words of three letters in length, which result from these two stemmers. A model by
[33] attempted to mix the two approaches along with a statistical approach. It was intended only for
generating the possible roots of any given Arabic word. The analyzer is based on automatically
derived rules and statistics. It comprises three modules: one for taking advantage of a list of
Arabic word-root pairs to derive a list of prefixes and suffixes, another for building stem
templates, and a last one for calculating the likelihood that a prefix, a suffix, or a template
would appear. The second accepts Arabic words as input, attempts to construct possible
prefix-suffix-template combinations, and outputs a ranked list of possible roots.
The most relevant method is [34], which attempts to build a lexicon including all verbal forms. The
verbal forms are generated from more than (15000) roots, whose source is undefined. The generation
process is based on root-patterns using finite-state transducer theory. Almost 2.5 million verb
forms are generated and classified within a lexicon. Nevertheless, the verification of the produced
forms remains unclear, so the validity of the forms cannot be judged. That is because the test was
carried out on a small fragment of the Nemlar [35] corpus, which itself contains no more than
(500000) words, including nouns and adjectives, according to the author’s claims. Many systems
related to this research have been studied, but unfortunately we have not come across a model that
includes all the basic and standard roots in a reference dictionary such as the Al-Mukhtar
Al-Sahah¹ dictionary, which contains more than (7400) roots, although the roots in use do not
exceed (3000).
In general, root extraction systems mostly rely on the rule-based approach [36], which in turn has
deficiencies in handling the abundance of Arabic inflections and derivations. Since most rule-based
Arabic stemmers proceed by removing prefixes, infixes, and suffixes, most of these stemmers cannot
recognize the right root of some Arabic terms, e.g. ‘‘they target’’, ‘‘decompose’’, and
‘‘interrogated’’. The majority of failures occur in vowel words with one of the letters. None of
those Arabic words includes all the letters of the corresponding Arabic root, so rule-based Arabic
stemmers are unable to extract their correct roots.
Up to now, no appropriate tool is available for Arabic NLP applications that fully meets the
requirements of word rooting, although many research projects have focused on the problem of Arabic
morphological analysis using different techniques and approaches, some of which we briefly
discussed above.
III. PROPOSED WORK
The RootIT system for verb root identification is a root-based system, and it contains two modules:
a roots database module and a root detection module. The general approach in most morphology systems
is top-down: a sentence word w is captured and mapped against an embedded lexicon; if no entry
matches, the cycle invokes a trimmer of prefixes or suffixes. The process continues recursively,
trimming and mapping, until w matches a lexicon entry, which represents the stem of w. Our approach
is slightly different; it adopts a bottom-up approach along with the top-down one. The method places
the root r at the center of a tree; then the conjugations and surface forms of r are generated and
arranged in the surrounding levels. The leaf level (the real verb) points back to its root at the
center via a tracing path, which yields the root of the verb v. FIGURE 1 shows the method with an
example for the ‘‘aim’’ root.
¹ https://www.almaany.com/ar/dict/ar-ar/?c=%D9%85%D8%AE%D8%AA%D8%A7%D8%B1%20%D8%A7%D9%84%D8%B5%D8%AD%D8%A7%D8%AD
FIGURE 1. Tree-hierarchized structure for the ‘‘aim’’ root.
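The root-centered store and the leaf-to-root lookup of this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the RootIT implementation: the class `RootStore`, the helper `skeleton`, the feature-path labels, and every toy row are hypothetical; Latin placeholder strings stand in for Arabic script, and dropping the letters a/i/u stands in for removing vocalization marks. When a non-vocalized verb matches leaves under several roots, the sketch groups the matches by root and returns the root with the most matched leaves.

```python
from collections import Counter, defaultdict

# Assumption: the letters a/i/u play the role of Arabic short-vowel marks.
VOWELS = set("aiu")


def skeleton(form: str) -> str:
    """Drop the stand-in vowel marks, mimicking a non-vocalized spelling."""
    return "".join(ch for ch in form if ch not in VOWELS)


class RootStore:
    """Hypothetical in-memory stand-in for the relational roots database."""

    def __init__(self) -> None:
        # root -> feature path (one label per tree level) -> leaf surface forms
        self.trees = defaultdict(lambda: defaultdict(list))
        # non-vocalized leaf -> [(root, feature path, vocalized form), ...]
        self.leaf_index = defaultdict(list)

    def add(self, root: str, path: tuple, form: str) -> None:
        """Attach one derived surface form as a leaf under its root."""
        self.trees[root][path].append(form)
        self.leaf_index[skeleton(form)].append((root, path, form))

    def candidates(self, verb: str) -> list:
        """Return every stored leaf whose non-vocalized spelling matches verb."""
        return self.leaf_index.get(skeleton(verb), [])

    def identify_root(self, verb: str):
        """Group matched leaves by root and return the largest cluster's root."""
        clusters = Counter(root for root, _, _ in self.candidates(verb))
        if not clusters:
            return None  # no stored surface form matches this verb
        return clusters.most_common(1)[0][0]


# Toy rows (placeholders only; not taken from the paper or from Sarf).
store = RootStore()
store.add("drs", ("imperfect", "3fs"), "tadrusu")
store.add("drs", ("imperfect", "2ms"), "tadrusu")
store.add("tdr", ("perfect", "3ms"), "tadarasa")  # invented competing root

print(store.identify_root("tdrs"))  # prints: drs
```

Indexing the leaves by their non-vocalized spelling keeps the lookup a single dictionary access, which mirrors querying a table of pre-generated forms rather than stripping affixes by rule.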
Detecting the root is a function of the surface-form verb and its path. Given a verb v from a text
fragment, we can retrieve the matching verbs from the roots database. Since v is mostly a
non-vocalized word, the search query for v retrieves a list of matched entries that have different
roots. The matched verbs are grouped into clusters by root: the cluster head is the root, while the
cluster children are the matched verbs. The root of v is the cluster that has the maximum number of
children. As an example, consider the verb ‘‘it announces’’, which retrieves more than one
root ( ). The sound one is ‘‘announce’’, so the largest share of the retrieved verbs (children of
clusters) belongs to it and fewer to the remaining clusters.
Development of the RootIT tool is straightforward. RootIT identifies the verb root by mapping the
real verb in the text against the surface verb forms stored in the database, as shown in the tool
architecture in FIGURE 2. The surface forms of a verb are structured in a tree hierarchy: the root,
at the tree’s root, is followed by several levels that represent morphological features, ending
with the leaf level, which in turn represents the instances of verb forms.
IV. EXPERIMENT
To assess the accuracy of RootIT, a series of experiments has been conducted. The effectiveness of
five systems (Khoja et al. [20], ArabicStemmer [37], MADAMIRA [13], Buckwalter [21] and our proposed
rooter RootIT) has been evaluated and compared in terms of the F-score measure. The data set used in
our experiments is extracted from the most popular and standard Arabic corpus (PATB) [38]. The
adopted dataset consists of all verbs, with their surface forms, scattered throughout PATB, which
are extracted from the tagged and stored parsed trees in the large Stanford CoreNLP system. The
total number of dataset verbs tagged by the embedded Stanford tool, called MaxentTagger, amounts to
7591 surface verbs, excluding unsoundly tagged verbs. Those verbs belong to (1938) distinct roots,
of which (76) are quadrilateral roots and the rest are trilateral roots.
The evaluation method that we have adopted is the commonly used, standard evaluation measure in the
IE community. The standard measures Precision, Recall, and F-measure are used to evaluate the
performance of our system and to compare it with the results of the four other systems [39]. The
comparative systems cannot find the root for some verbs (a few are listed in TABLE 1). The failure
of those systems might be due to two factors. The first is that such systems are satisfied with the
stem of a verb without going down to its root, and the second is the limited rule patterns used by
the systems relative to the requirements of verb morphology. Although the Khoja system has obtained
good accuracy in detecting verbs’ roots as well as nouns’ roots, it has been unable to recognize
some verb patterns that evade the morphology rules formulated in the system’s algorithm, such as
the verbs . . . etc.
To illustrate that our RootIT is more efficient, we present some results of the Arabic verb root
identification systems in TABLE 2. TABLE 2 shows a RootIT accuracy of 97.34%. In comparison, we
observe that our proposed tool produces better results, as FIGURE 3 illustrates. One of
the main points of evaluation is the comparison of RootIT
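For reference, the Precision, Recall, and F-measure named in this section are conventionally defined as shown here. The symbols TP, FP and FN are our notation, not the paper's: they count verbs whose returned root is correct, verbs whose returned root is wrong, and verbs for which the correct root was not returned. The text does not state how these counts were tallied, so this is only the textbook form.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R}
\]
```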
[34] A. A. Neme, ‘‘A fully inflected Arabic verb resource constructed from a lexicon of lemmas by using finite-state transducers,’’ Revue RIST, vol. 20, no. 2, p. 13, 2013.
[35] M. Yaseen et al., ‘‘Building annotated written and spoken Arabic LRs in NEMLAR project,’’ in Proc. LREC, 2006, pp. 533–538.
[36] H. Khafajeh, N. Yousef, and M. Abdeldeen, ‘‘Arabic root extraction using a hybrid technique,’’ Int. J. Adv. Comput. Res., vol. 8, no. 35, pp. 90–96, 2018.
[37] ArabicStemmer. [Online]. Available: https://www.arabicstemmer.com/
[38] M. Maamouri, A. Bies, T. Buckwalter, and W. Mekki, ‘‘The penn Arabic treebank: Building a large-scale annotated arabic corpus,’’ in Proc. NEMLAR Conf. Arabic Lang. Resour. Tools, 2004, pp. 466–467.
[39] D. M. Powers, ‘‘Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation,’’ Tech. Rep., 2011.
Authors’ photographs and biographies not available at the time of publication.