Towards Building KurdNet, the Kurdish WordNet
Purya Aliabadi
SRBIAU
Sanandaj, Iran
[email protected]
Mohammad Sina Ahmadi
University of Kurdistan
Sanandaj, Iran
[email protected]
Shahin Salavati
University of Kurdistan
Sanandaj, Iran
[email protected]
Kyumars Sheykh Esmaili
Nanyang Technological University
Singapore
[email protected]
Abstract
In this paper we highlight the main challenges in building a lexical database for
Kurdish, a resource-scarce and diverse
language. We also report on our effort in
building the first prototype of KurdNet –
the Kurdish WordNet– along with a preliminary evaluation of its impact on Kurdish information retrieval.
1 Introduction
WordNet (Fellbaum, 1998) has been used in numerous natural language processing tasks such as
word sense disambiguation and information extraction with considerable success. Motivated by
this success, many projects have been undertaken to build similar lexical databases for other
languages. Among the large-scale projects are EuroWordNet (Vossen, 1998) and BalkaNet (Tufis et
al., 2004) for European languages and IndoWordNet (Bhattacharyya, 2010) for Indian languages.
Kurdish belongs to the Indo-European family of languages and is spoken in Kurdistan, a large
geographical region spanning the intersections of Iran, Iraq, Turkey, and Syria. Kurdish is a
less-resourced language for which, among other resources, no wordnet has been built yet.
We have recently launched the Kurdish language processing project (KLPP; see footnote 1), aiming
at providing basic tools and techniques for Kurdish text processing. This paper reports on KLPP’s
first outcomes on building KurdNet, the Kurdish WordNet.
Footnote 1: http://eng.uok.ac.ir/esmaili/research/klpp/en/main.htm
At a high level, our approach is semi-automatic and centered around building a Kurdish alignment
for Base Concepts (Vossen et al., 1998), which is a core subset of major meanings in WordNet. More
specifically, we use a bilingual dictionary and simple set theory operations to translate and align
synsets and use a corpus to extract usage examples. The effectiveness of our prototype database
is evaluated via measuring its impact on a Kurdish information retrieval task. Throughout, we have
made the following contributions:
1. highlight the main challenges in building a wordnet for the Kurdish language (Section 2),
2. identify a list of available resources that can facilitate the process of constructing such a
lexical database for Kurdish (Section 3),
3. build the first prototype of KurdNet, the Kurdish WordNet (Section 4), and
4. conduct a preliminary set of experiments to evaluate the impact of KurdNet on Kurdish
information retrieval (Section 5).
Moreover, a manual effort to translate the glosses
and refine the automatically-generated outputs is
currently underway.
The latest snapshot of KurdNet’s prototype is
freely accessible and can be obtained from (KLPP,
2013). We hope that making this database publicly available will bolster research on Kurdish
text processing in general, and on KurdNet in particular.
2 Challenges
In the following, we highlight the main challenges in Kurdish text processing, with a greater
focus on the aspects that are relevant to building a Kurdish wordnet.
(a) One-to-One Mappings
No.  Latin-based  Arabic-based
1    A    ا
2    B    ب
3    C    ج
4    Ç    چ
5    D    د
6    Ê    ێ
7    F    ف
8    G    گ
9    J    ژ
10   K    ک
11   L    ل
12   M    م
13   N    ن
14   O    ۆ
15   P    پ
16   Q    ق
17   R    ر
18   S    س
19   Ş    ش
20   T    ت
21   Û    وو
22   V    ڤ
23   X    خ
24   Z    ز
(b) One-to-Two Mappings
No.  Latin-based  Arabic-based
25   I    /ئ
26   U/W  و
27   Y/Î  ی
28   E/H  ه
(c) One-to-Zero Mappings
No.  Latin-based  Arabic-based
29   (RR)  ڕ
30   -     ڵ
31   (E)   ع
32   (X)   غ
33   (H)   ح
Figure 1: The Two Standard Kurdish Alphabets (Esmaili and Salavati, 2013)
2.1 Diversity
Diversity –in both dialects and writing systems–
is the primary challenge in Kurdish language
processing (Gautier, 1998; Gautier, 1996; Esmaili, 2012). In fact, Kurdish is considered a
bi-standard language (Gautier, 1998; Hassanpour et al., 2012; see also footnote 2): the Sorani
dialect written in an Arabic-based alphabet and the Kurmanji dialect written
in a Latin-based alphabet. Figure 1 shows both of
the standard Kurdish alphabets and the mappings
between them.
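As a rough illustration of how the mappings in Figure 1 could be used, the sketch below transliterates only the letters that have a one-to-one counterpart (panel (a)). It is a minimal sketch, not part of KLPP: the function name and the pass-through behaviour for unmapped letters are assumptions made here, and the one-to-two and one-to-zero letters of panels (b) and (c) are deliberately left untouched because they need context to resolve.

```python
# Single-letter pairs from panel (a) of Figure 1 (Arabic-based -> Latin-based).
SINGLE = {
    "ا": "A", "ب": "B", "ج": "C", "چ": "Ç", "د": "D", "ێ": "Ê",
    "ف": "F", "گ": "G", "ژ": "J", "ک": "K", "ل": "L", "م": "M",
    "ن": "N", "ۆ": "O", "پ": "P", "ق": "Q", "ر": "R", "س": "S",
    "ش": "Ş", "ت": "T", "ڤ": "V", "خ": "X", "ز": "Z",
}
# The only two-character source unit in panel (a).
DIGRAPHS = {"وو": "Û"}


def transliterate_one_to_one(word: str) -> str:
    """Map letters with a one-to-one counterpart; leave everything else as is."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            # Try the digraph first, since a single "و" is ambiguous (U/W).
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            ch = word[i]
            out.append(SINGLE.get(ch, ch))  # unmapped letters pass through
            i += 1
    return "".join(out)


print(transliterate_one_to_one("باران"))  # BARAN
print(transliterate_one_to_one("دوور"))   # DÛR
```

The output is upper-case only because Figure 1 lists the Latin-based letters in upper case. A usable Sorani-Kurmanji mapping system would also have to handle the ambiguous letters and the lexical and morphological differences discussed in this section.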
The linguistic features distinguishing these
two dialects are phonological, lexical, and morphological. The important morphological differences that concern the construction of KurdNet
are (MacKenzie, 1961; Haig and Matras, 2002):
(i) in contrast to Sorani, Kurmanji has retained
both gender (feminine v. masculine) and case opposition (absolute v. oblique) for nouns and pronouns, and (ii) while in Kurmanji the passive voice is
constructed using the helper verb “hatin”, in Sorani it is created via verb morphology.
In summary, as the examples in (Gautier, 1998)
show, the “same” word, when going from Sorani
to Kurmanji, may at the same time go through several levels of change: writing systems, phonology,
morphology, and sometimes semantics.
Footnote 2: Within KLPP, our focus has been on Sorani and Kurmanji, which are the two most
widely-spoken and closely-related dialects (Haig and Matras, 2002; Walther and Sagot, 2010).
2.2 Complex Morphology
Kurdish has a complex morphology (Samvelian,
2007; Walther, 2011) and one of the main driving factors behind this complexity is the wide use
of inflectional and derivational suffixes (Esmaili et al., 2013a). Moreover, as demonstrated by
the example in Table 1, in Sorani’s writing system
definiteness markers, possessive pronouns, enclitics, and many of the widely-used postpositions
are used as suffixes (Salavati et al., 2013).
One important implication of this morphological complexity is that any corpus-based
assistance or analysis (e.g., frequencies, co-occurrences, sample passages) would require a
lemmatizer/morphological analyzer.
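To make the need for morphological analysis concrete, the following sketch peels the suffixes of the single example word in Table 1 off one by one. It is a toy illustration under stated assumptions: the suffix list contains only the four suffixes shown in Table 1, the function name and the `min_stem` guard are hypothetical, and the procedure is plain suffix stripping rather than the method used by Jedar or Payv.

```python
# The four suffixes that appear in the Table 1 example; a real analyzer
# would need a much larger inventory plus inflectional rules.
SUFFIXES = ("akaan", "ish", "taan", "daa")


def strip_suffixes(word, suffixes=SUFFIXES, min_stem=2):
    """Repeatedly remove a known suffix from the end of `word`.

    Returns the remaining stem and the removed suffixes in surface order.
    """
    by_length = sorted(suffixes, key=len, reverse=True)  # longest match first
    removed = []
    changed = True
    while changed:
        changed = False
        for suffix in by_length:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                removed.insert(0, suffix)  # keep left-to-right order
                changed = True
                break
    return word, removed


stem, removed = strip_suffixes("ktewakaanishtaandaa")
print(stem, removed)  # ktew ['akaan', 'ish', 'taan', 'daa']
```

Without such a step, a corpus lookup for "ktew" would miss this word form entirely, which is why frequency counts or sample passages need lemmatized text.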
2.3 Resource-Scarceness
Although there exist a few resources which can
be leveraged in building a wordnet for Kurdish –
these are listed in Section 3– some of the most
crucial resources are yet to be built for this language. One such resource is a collection of
comprehensive monolingual and bilingual dictionaries. The main problem with the existing electronic dictionaries is that they are relatively small
and have no notion of sense, gender, or part-of-speech labels.
Another necessary resource that is yet to be
built is a mapping system (i.e., a transliteration/translation engine) between the Sorani and
Kurmanji dialects.
3 Available Resources
In this section we give a brief description of the
linguistic resources that our team has built as well
as other useful resources that are available on the
Web.
3.1 KLPP Resources
The main Kurdish text processing resources that
we have previously built are as follows:
− the Pewan corpus (Esmaili and Salavati,
2013): for both Sorani and Kurmanji dialects. Its
basic statistics are shown in Table 2.
word                 =  lemma  +  pl. def. mark.  +  conj.  +  poss. pron.  +  postpos.
ktewakaanishtaandaa  =  ktew   +  akaan           +  ish    +  taan         +  daa
Table 1: An Exemplary Demonstration of Kurdish’s Morphological Complexity (Salavati et al., 2013)
            Articles No.   Words No. (dist.)   Words No. (all)
Sorani      115,340        501,054             18,110,723
Kurmanji    25,572         127,272             4,120,027
Table 2: The Pewan Corpus’ Basic Statistics (Esmaili and Salavati, 2013)
− the Pewan test collection (Esmaili et al., 2013a;
Esmaili et al., 2013b): built upon the Pewan corpus, this collection has a set of 22 queries (in Sorani and Kurmanji) and their corresponding relevance judgments.
− the Payv lemmatizer: it is the result of a major revision of Jedar (Salavati et al., 2013), our
Kurdish stemmer whose outputs are stems and not
lemmas. In order to return lemmas, Payv not only
maintains a list of exceptions (e.g., named entities), but also takes into consideration Kurdish’s
inflectional rules.
3.2 Web Resources
To the best of our knowledge, the following are the other
existing readily-usable resources that can be obtained from the Web:
− Dictio (footnote 3): an English-to-Sorani dictionary with
more than 13,000 headwords. It employs a collaborative mechanism for enrichment.
− Ferheng (footnote 4): a collection of dictionaries for the
Kurmanji dialect with sizes ranging from medium
(around 25,000 entries, for German and Turkish)
to small (around 4,500, for English).
− Wikipedia: it currently has more than 12,000
Sorani (footnote 5) and 20,000 Kurmanji (footnote 6) articles. One useful application of these entries is to build a parallel
collection of named entities across both dialects.
Footnote 3: http://dictio.kurditgroup.org/
Footnote 4: http://ferheng.org/?Daxistin
Footnote 5: http://ckb.wikipedia.org/
Footnote 6: http://ku.wikipedia.org/
4 KurdNet’s First Prototype
In the following, we first define the scope of our
first prototype; then, after justifying our choice of
construction model, we describe KurdNet’s individual elements.
4.1 Scope
In the first prototype of KurdNet we focus only on
the Sorani dialect. This is mainly due to the lack of an
available and reliable Kurmanji-to-English dictionary. Moreover, processing Sorani is in general
more challenging than Kurmanji (Esmaili et al.,
2013a). The Kurmanji version will be built later
and will be closely aligned with its Sorani counterpart. To that end, we have already started building a high-quality transliteration/translation engine
between the two dialects.
4.2 Methodology
There are two well-known models for building
wordnets for a language (Vossen, 1998):
• Expand: in this model, the synsets are built
in correspondence with the WordNet synsets
and the semantic relations are directly imported. It has been used for Italian in MultiWordNet and for Spanish in EuroWordNet.
• Merge: in this model, the synsets and relations are first built independently and then
they are aligned with WordNet’s. It has been
the dominant model in building BalkaNet and
EuroWordNet.
The Expand model seems less complex and
guarantees the highest degree of compatibility
across different wordnets. But it also has potential
drawbacks. The most serious risk is that of forcing
an excessive dependency on the lexical and conceptual structure of one of the languages involved,
as pointed out in (Vossen, 1996).
In our project, we follow the Expand model,
since it can be partly automated and therefore
would be faster. More precisely, we aim at creating a Kurdish translation/alignment for the Base
Concepts (Vossen et al., 1998), which is a set of
5,000 essential concepts (i.e., synsets) that play
a major role in the wordnets. Base Concepts
(BC) is available on the Global WordNet Association (GWA)’s Web page (footnote 7). The Entity-Relationship
(ER) model for the data represented in Base Concepts is shown in Figure 2.
Footnote 7: http://globalwordnet.org/
[Figure 2 diagram not reproduced. Labels recoverable from it: Synset, Literal, Has / Is in, Lexical Relation, Type, ID, POS, Definition, Usage, Sense_no, SUMO, Domain, BCS, with N cardinality marks on the relations.]
Figure 2: Base Concepts’ ER Model
[Figure 3 diagram not reproduced. Labels recoverable from it: E, e1, e2, k1, k2, k3, Kmax, Kmin.]
Figure 3: An Illustration of a Synset in Base Concepts and its Maximal and Minimal Alignment
Variants in KurdNet
4.3 Elements
Since KurdNet follows the Expand model, it
inherits most of Base Concepts’ structural properties, including: synsets and the lexical relations
among them, POS, Domain, BCS, and SUMO.
KurdNet’s language-specific aspects, on the other
hand, have been built using a semi-automatic
approach. Below, we elaborate on the details of
constructing the remaining three elements.
Synset Alignments: for each synset in BC,
its counterpart in KurdNet is defined semi-automatically. We first use Dictio to translate its
literals (words). Having compiled the translation
lists, we combine them in two different ways:
(i) a maximal alignment (abbr. max) which is a
superset of all lists, and (ii) a minimal alignment
(abbr. min) which is a subset of non-empty
lists. Figure 3 shows an illustration of these two
combination variants. In the future, we plan to apply
more advanced techniques, similar to the graph
algorithms described in (Flati and Navigli, 2012).
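The two combination variants can be sketched with ordinary set operations. The reading used below is an assumption: max is taken to be the union of all translation lists, and min the intersection of the non-empty ones, which is one natural interpretation of "superset of all lists" and "subset of non-empty lists". The function name, the dictionary format, and the toy entries (named after the labels in Figure 3) are hypothetical.

```python
def align_synset(literals, dictionary):
    """Return (k_max, k_min) for one English synset.

    literals   -- the English literals of the synset
    dictionary -- maps an English word to a list of Kurdish translations
                  (a stand-in for a Dictio lookup)
    """
    lists = [set(dictionary.get(literal, ())) for literal in literals]
    non_empty = [translations for translations in lists if translations]

    # Assumed reading of "max": everything any literal translates to.
    k_max = set().union(*lists)
    # Assumed reading of "min": only what all translated literals share.
    k_min = set.intersection(*non_empty) if non_empty else set()
    return k_max, k_min


toy_dictionary = {"e1": ["k1", "k2"], "e2": ["k2", "k3"]}
k_max, k_min = align_synset(["e1", "e2"], toy_dictionary)
print(sorted(k_max))  # ['k1', 'k2', 'k3']
print(sorted(k_min))  # ['k2']
```

Under this reading, max favours coverage at the cost of admitting translations that belong to other senses, while min keeps only candidates supported by every translated literal and can leave a synset empty.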
Usage Examples: we have taken a corpus-assisted approach to speed up the process of
providing usage examples for each aligned synset.
To this end, we: (i) extract all of Pewan’s sentences
(820,203), (ii) lemmatize the corpus to extract
all the lemmas (278,873), and (iii) construct a
lemma-to-sentence inverted index. In the current
version of KurdNet, for each synset we build a
pool of sentences by fetching the first 5 sentences
of each of its literals from the inverted list. These
pools will later be assessed by lexicographers to
filter out non-relevant instances. In the future, more
sophisticated approaches can be applied (e.g.,
exploiting contextual information).
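The three steps above can be sketched as follows. This is a minimal illustration, assuming that a lemmatizer is available as a function from a sentence to a list of lemmas; `str.split` stands in for it here, and the function names and the choice to skip a sentence already in the pool are assumptions of this sketch rather than details given in the paper.

```python
from collections import defaultdict


def build_inverted_index(sentences, lemmatize):
    """Map each lemma to the ids of the sentences it occurs in, in corpus order."""
    index = defaultdict(list)
    for sentence_id, sentence in enumerate(sentences):
        # dict.fromkeys removes repeated lemmas while keeping their order.
        for lemma in dict.fromkeys(lemmatize(sentence)):
            index[lemma].append(sentence_id)
    return index


def usage_pool(literals, index, sentences, per_literal=5):
    """Collect the first `per_literal` sentences of every literal of a synset."""
    pool_ids = []
    for literal in literals:
        for sentence_id in index.get(literal, [])[:per_literal]:
            if sentence_id not in pool_ids:  # sketch choice: no repeats
                pool_ids.append(sentence_id)
    return [sentences[sentence_id] for sentence_id in pool_ids]


corpus = ["a b", "b c", "a c"]  # placeholder sentences
index = build_inverted_index(corpus, str.split)
print(index["a"])                             # [0, 2]
print(usage_pool(["a", "b"], index, corpus))  # ['a b', 'a c', 'b c']
```

Because the pool is built purely from lemma matches, it can contain sentences that use a literal in a different sense, which is why the later assessment by lexicographers is needed.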
Definitions: due to the lack of proper translation
tools, this element must be aligned manually.
The manual enrichment and assessment process
is currently underway. We have built a graphical
user interface to facilitate the lexicographers’ task.
Table 3 shows a summary of KurdNet’s statistical properties along with those of Base Concepts.
                 Synset No.   Literal No.   Usage No.
Base Concepts    4,689        11,171        2,645
KurdNet (max)    3,801        17,990        89,950
KurdNet (min)    2,145        6,248         31,240
Table 3: The Main Statistical Properties of Base
Concepts and its Alignment in KurdNet
5 Preliminary Experiments
The most reliable way to evaluate the quality of
a wordnet is to manually examine its content and
structure. This is clearly very costly. In this paper we have adopted an indirect evaluation alternative in which we look at the effectiveness of using KurdNet for rewriting IR queries (i.e. query
expansion).
We measure the impact of query expansion using two separate configurations: (i) Terms, which
uses the raw version of the evaluation components
(queries, corpus, and KurdNet), and (ii) Lemmas,
which uses the lemmatized version of them. Furthermore, as depicted in Figure 4, we have considered two alternatives for expanding each query
term: (i) add all of its Synonyms, and (ii) add
all of the synonyms of its direct Hypernym(s).
Hence –given the min and max variants of KurdNet’s synsets– there can be at least 10 different experimental scenarios.
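The two expansion alternatives and the resulting scenario count can be sketched as below. The data layout (a synset id mapped to its literals, and a synset id mapped to its direct hypernym ids), the function names, and the toy entries are assumptions made for illustration. As in the setup described above, every synset containing the term is used, with no sense disambiguation.

```python
from itertools import product


def expand_query(terms, synsets, hypernyms, mode):
    """Append expansion terms to a query.

    synsets   -- synset id -> set of literals
    hypernyms -- synset id -> list of direct hypernym synset ids
    mode      -- "synonyms" or "hypernyms"
    """
    expanded = list(terms)
    for term in terms:
        for synset_id, literals in synsets.items():
            if term not in literals:
                continue
            if mode == "synonyms":
                candidates = set(literals)
            elif mode == "hypernyms":
                candidates = set()
                for parent_id in hypernyms.get(synset_id, ()):
                    candidates |= set(synsets.get(parent_id, ()))
            else:
                raise ValueError(f"unknown mode: {mode}")
            for literal in sorted(candidates):
                if literal not in expanded:
                    expanded.append(literal)
    return expanded


def list_scenarios():
    """Each configuration alone, plus every expansion/variant combination."""
    scenarios = []
    for config in ("Terms", "Lemmas"):
        scenarios.append(config)
        for expansion, variant in product(("Synonyms", "Hypernyms"), ("min", "max")):
            scenarios.append(f"{config} & {expansion} ({variant})")
    return scenarios


toy_synsets = {"s1": {"w0", "w1", "w2"}, "s2": {"w3", "w4"}}
toy_hypernyms = {"s1": ["s2"]}
print(expand_query(["w0"], toy_synsets, toy_hypernyms, "synonyms"))   # ['w0', 'w1', 'w2']
print(expand_query(["w0"], toy_synsets, toy_hypernyms, "hypernyms"))  # ['w0', 'w3', 'w4']
print(len(list_scenarios()))                                          # 10
```

The count follows from two configurations, each run once without expansion and once for every combination of two expansion types and two alignment variants: 2 x (1 + 2 x 2) = 10.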
In our experiments we have used the Pewan
test collection (see Section 3.1), the MG4J IR engine (MG4J, 2013), and the Mean Average Precision (MAP) evaluation metric.
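For readers unfamiliar with the metric, the sketch below computes MAP in its standard textbook form: the average precision of a ranked list is the mean of the precision values at the ranks where relevant documents appear, divided by the number of relevant documents, and MAP is the mean of that value over all queries. The function names and the toy rankings are illustrative assumptions, and this is not a description of the evaluation tooling used in the experiments.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


toy_runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d4", "d5"], {"d5"}),              # AP = (1/2) / 1
]
print(round(mean_average_precision(toy_runs), 4))  # 0.6667
```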
[Figure 4 diagram not reproduced. Panels: (a) By its Synonyms, (b) By its Hypernyms; nodes are labeled w0 to w6.]
Figure 4: Expansion Alternatives for the Term W0
#    Scenario                      MAP
1    Terms & Hypernyms (min)       0.4265
2    Lemmas                        0.4263
3    Terms                         0.4075
4    Terms & Synonyms (min)        0.3978
5    Terms & Hypernyms (max)       0.3960
6    Terms & Synonyms (max)        0.3841
7    Lemmas & Hypernyms (min)      0.3840
8    Lemmas & Synonyms (min)       0.3587
9    Lemmas & Hypernyms (max)      0.2530
10   Lemmas & Synonyms (max)       0.2215
Table 4: Different KurdNet-based Query Expansion Scenarios and Their Impact on Kurdish IR
The results are summarized in Table 4. The notable patterns are as follows:
• since lemmatization yields additional
matches between query terms and their
inflectional variants in the documents, it
improves the performance (row 2 v. row 3).
Expansion of the same lemmatized queries,
however, degrades the performance (7-10 v.
1,4-6). This degradation can be attributed to
the fact that the projection of KurdNet from
terms to lemmas introduces imprecise entry
merges.
• the min approach to align synsets outperforms its max counterpart overwhelmingly
(1,4,7,8 v. 5,6,9,10), confirming the intuition
that the max approach entails high ambiguity.
• expanding query terms by their own synonyms is less effective than by their hypernyms’ synonyms. This phenomenon might be
explained by the fact that currently for each
query term, we use all of its synonyms and
no sense disambiguation is applied.
Needless to say, a more detailed analysis of the
outputs can provide further insights about the
above results and claims.
6 Conclusions and Future Work
In this paper we briefly highlighted the main
challenges in building a lexical database for the
Kurdish language and presented the first prototype
of KurdNet –the Kurdish WordNet– along with a
preliminary evaluation of its impact on Kurdish
IR.
We would like to note once more that the KurdNet project is a work in progress. Apart from
the manual enrichment and assessment of the described prototype which is currently underway,
there are many avenues to continue this work.
First, we would like to extend our prototype to
include the Kurmanji dialect. This would require
not only using similar resources to those reported
in this paper, but also building a mapping system
between the Sorani and Kurmanji dialects.
Another direction for future work is to prune the
current structure, i.e., handling the lexical idiosyncrasies between Kurdish and English.
References
Pushpak Bhattacharyya. 2010. IndoWordNet. In
Proceedings of the 7th International Conference on
Language Resources and Evaluation (LREC’10).
Kyumars Sheykh Esmaili and Shahin Salavati. 2013.
Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison. In Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics (ACL’13), pages 300–305.
Kyumars Sheykh Esmaili, Shahin Salavati, and Anwitaman Datta. 2013a. Towards Kurdish Information Retrieval. ACM Transactions on Asian Language Information Processing (TALIP), To Appear.
Kyumars Sheykh Esmaili, Shahin Salavati, Somayeh
Yosefi, Donya Eliassi, Purya Aliabadi, Shownem
Hakimi, and Asrin Mohammadi. 2013b. Building a
Test Collection for Sorani Kurdish. In Proceedings
of the 10th IEEE/ACS International Conference on
Computer Systems and Applications (AICCSA ’13).
Kyumars Sheykh Esmaili. 2012. Challenges in Kurdish Text Processing. CoRR, abs/1212.0074.
Christiane Fellbaum. 1998. WordNet: An Electronic
Lexical Database. MIT Press.
Tiziano Flati and Roberto Navigli. 2012. The CQC
Algorithm: Cycling in Graphs to Semantically Enrich and Enhance a Bilingual Dictionary. Journal of
Artificial Intelligence Research, 43(1):135–171.
Gérard Gautier. 1996. A Lexicographic Environment
for Kurdish Language using 4th Dimension. In Proceedings of ICEMCO.
Gérard Gautier. 1998. Building a Kurdish Language
Corpus: An Overview of the Technical Problems.
In Proceedings of ICEMCO.
Geoffrey Haig and Yaron Matras. 2002. Kurdish Linguistics: A Brief Overview. Language Typology and
Universals, 55(1).
Amir Hassanpour, Jaffer Sheyholislami, and Tove
Skutnabb-Kangas. 2012. Introduction. Kurdish:
Linguicide, Resistance and Hope. International
Journal of the Sociology of Language, 217:1–8.
KLPP. 2013. KurdNet’s Download Page. Available
at: https://github.com/klpp/kurdnet.
David N. MacKenzie. 1961. Kurdish Dialect Studies.
Oxford University Press.
MG4J. 2013. Managing Gigabytes for Java. Available
at: http://mg4j.dsi.unimi.it/.
Shahin Salavati, Kyumars Sheykh Esmaili, and Fardin
Akhlaghian. 2013. Stemming for Kurdish Information Retrieval. In The Proceeding (to appear) of
the 9th Asian Information Retrieval Societies Conference (AIRS 2013).
Pollet Samvelian. 2007. A Lexical Account of Sorani Kurdish Prepositions. In Proceedings of International Conference on Head-Driven Phrase Structure Grammar, pages 235–249.
Dan Tufis, Dan Cristea, and Sofia Stamou. 2004.
BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. Romanian Journal of
Information science and technology, 7(1-2):9–43.
Piek Vossen, Laura Bloksma, Horacio Rodriguez, Salvador Climent, Nicoletta Calzolari, Adriana Roventini, Francesca Bertagna, Antonietta Alonge, and
Wim Peters. 1998. The EuroWordNet Base Concepts and Top Ontology. Deliverable D017 D,
34:D036.
Piek Vossen. 1996. Right or Wrong: Combining Lexical Resources in the EuroWordNet Project. In EURALEX, volume 96, pages 715–728.
Piek Vossen. 1998. Introduction to EuroWordNet.
Computers and the Humanities, 32(2-3):73–89.
Géraldine Walther and Benoît Sagot. 2010. Developing a Large-scale Lexicon for a Less-Resourced
Language. In SaLTMiL’s Workshop on Less-resourced Languages (LREC).
Géraldine Walther. 2011. Fitting into Morphological
Structure: Accounting for Sorani Kurdish Endoclitics. In The Proceedings of the Eighth Mediterranean Morphology Meeting.