The Arabic Language, Arabic Linguistics and Arabic Computational Linguistics
qabylatu qurays in Mecca is assumed to have been the most prestigious Arabic dialect
due to the elevated position the tribe held among other Arab tribes and due to its wealth
and strategic location in Mecca. Mecca at that time was a trade and religious center
which housed all the idols worshiped by the Arabs and all Arab tribes would make a
pilgrimage to Mecca to worship their gods. Most significantly, the Prophet Mohammed
was a member of the Quraish tribe and it was their dialect that became the language
of the Quran.
The first codification of the Arabic language was undertaken by early Arab grammarians
in the 8th century who regarded the language of the Quran as the model of
correctness. This was the first time the Arabic language was standardized with an explicit
grammar defining correct usage. The codification included all its linguistic levels
such as its phonology, morphology, syntax and lexicon. So for over 1500 years and to
the present, this grammar of classical Arabic is still taught to all Arabic speakers in their
general education courses. Following the expansion of the Islamic State, the Arabic language
became the official language of administration, government and religious teaching
in the new empire. Thus, Classical Arabic satisfied all the conditions (Mesthrie et al.,
2000) that modern linguistics prescribes for a standardized language. Firstly, it had a
codified grammar and dictionaries that define the norms of the language. Secondly, it
was prestigious because it was the language of the Quran and was also assumed to be
the language of God. Thirdly, it was a model representing the correct and pure form
of Arabic. Fourth, it functioned as the language of government, administration, and
education. Fifth, although it was associated with a powerful and prestigious subgroup
(Quraish) it was also the language that new converts to Islam needed to learn in order to
perform their religious rituals. Thus, the function of Classical Arabic extended beyond
the communicative needs of its native speakers (Mesthrie et al., 2000) as it served as
the standard that all Arabic speakers aspire to master.
2.1.2 The Emergence of the Dialects
The widely accepted Western view is that there were three Arabic language varieties in
the pre-Islamic period
(ʿaṣru l-ǧāhilyti) and early Islamic era. They were:
1. The language of the Quran which Muslims believed to be the language of God,
and as such, was a perfect language and superior to any other language
2. The poetic language which exhibited the Arabs' fondness for oratory. Arabs from
various tribes would compete to create the most flowery poetic language, which
was highly rhythmical and had sixteen elaborate meters.
3. Everyday language was the language before Islam that Arabs used in their daily
interaction. There is no consensus on how proximate this variety was to the lan-
guage of the Quran or to the poetic language.
In contrast, most Arab linguists believe there was only one language. The Arab view
is that the difference between the language of the Quran and the Poetic language was
not that significant. Both are highly elevated forms of Arabic with rhythmical and
rhetorical patterns. With respect to everyday language, people most likely spoke a
language similar to that recorded in the Quran. This is most probable because as a
Muslim, you believed the Quran was a gift from God to the people of the Arabian
peninsula and must be in the language of the people living there. So naturally it must
be in their everyday language as the Quran was not directed toward the educated elite,
but rather, the common people. On the other hand, if you were not a Muslim you might
believe that the Quran was written by the prophet himself. So it must be written in
a language he knew and spoke and that was shared by the people around him. So, in
either case, the Quranic language was not different from the everyday language spoken
at the time of the Prophet.
Shortly after the death of the prophet Mohammad in 632 CE, many Arabs left the
Arabian Peninsula to live in the different territories under the Islamic rule which extended
from the Arabian Peninsula to the Pacific Ocean. Two processes were taking
place simultaneously in the centuries that followed. Versteegh (1997a) calls them Arabi-
cisation and Islamisation. Arabicisation proceeded more rapidly than Islamisation due
to the prevailing tolerance on the part of the Muslims towards Christians and Jews
which did not generate an urgent need to convert to Islam (Versteegh, 1997a, pg. 93). It
was language, rather than religion, that became the binding force in the Islamic Empire.
The indigenous populations of the conquered countries greatly outnumbered the native
Arabs and they spoke different languages. For example the Iraqis spoke Aramaic,
the Syrians Greek and the Egyptians Coptic, but there was a need for linguistic ac-
commodation, mostly on the part of the conquered so that they could communicate
with their conquerors about matters of taxation, administration and trade (Versteegh,
1997a). Further, learning and mastering Arabic would provide employment opportunities
for the local people. The early attempts made by the locals at speaking Arabic were
tainted by the traits of their native languages and there have been numerous reports of
the frequently unsuccessful attempts by indigenous peoples to communicate with their
rulers. The new speakers of Arabic spoke the language and read the Quran with an accent.
Early Arab grammarians' efforts to codify Arabic were motivated in part by the
fear that the purity of the Arabic language would be tainted by the corrupt Arabic
spoken by the local people (laḥnu l-ʿāmati). Arab grammarians felt they needed
to write rules to clearly define the correct ways of speaking Arabic, so their goals
were prescriptive in the sense of controlling both the behavior and the manner in which
Arabic would be used. Once the language was codified, all speakers had to follow these
rules.
Just as there are differing views on the number of early Arabic languages, there are
also different theories about the origin of modern Arabic dialects. Many linguists (Versteegh,
1997a) assume that modern Arabic dialects developed from this first colloquial
Arabic spoken during the early days of the Arab Conquests and other Arabic grammarians
believe that modern dialects developed from Classical Arabic. As an example,
people who did not know how to speak Classical Arabic correctly, tended to drop case
endings, spoke it with an accent and introduced lexical innovation. A third view is that
of Ferguson (1959a) who refutes the claim that these dialects are descendants of Classi-
cal Arabic. He cites fourteen linguistic features (mostly phonological and morphological
features) that all dialects share but that are lacking in Classical Arabic. He proposes
that all Arabic dialects originate from a form of Arabic spoken at the camps of Arabic
military bases positioned in the conquered territories.
Further, there is no agreement on how many varieties actually exist today. For ex-
ample Ferguson (1959b) refers to two varieties: the high variety, or Classical Arabic
and the low variety which is used in the everyday communication of Arabic speakers.
Badawi (1973) cites five varieties of Arabic. But, broadly speaking, there are at least
three varieties of Arabic that co-exist side by side.
1. Classical Arabic is the most prestigious variety because it is the language of the
Quran. It is well defined because it has been codified by early Arab grammarians.
The consensus among traditional Arabic grammarians is that this grammar is
complete as it describes a closed corpus, i.e., the Arabic religious and literary
heritage.
2. The colloquial dialects are also well defined; not because they are fully codified, but
because they are acquired naturally by their native speakers. Each Arabic speaking
country has its own dialect which is used primarily in everyday communications.
3. Modern Standard Arabic (MSA) is the form of Arabic that is used among educated
speakers of Arabic in formal situations. It is not a well defined variety because,
unlike the colloquial dialects, it is not the native language of anyone. And unlike
Classical Arabic, it has not been fully elucidated and described.
2.1.3 Arabic as the Language of Science and Philosophy
There is a near consensus that a highly elevated poetic form of Arabic existed in the
pre-Islamic period but Islam introduced a new dimension to the Arabic language as it
became the language of religion. Soon after the death of the Prophet and the establish-
ment of Islamic rule in the greater Middle East, a third dimension was added as Arabic
became the language of government and administration by the 8th century.
The codification of the Arabic language in the 8th century gave birth to an intellectual
movement in the Arab World. The Arabs made an effort to understand the heritage of
the Roman, Greek and Indian civilizations when seeking answers to the questions their
new intellectual endeavors were addressing. The Arabic language lexicon was greatly enriched
and expanded with the translations of the great Greek and Roman writers. Even
though during this time, the Arabs were the sole keepers of earlier civilizations, their
role was not limited to simply preserving the earlier heritage. They were able to make
important contributions to knowledge after they had mastered contemporary scientific
knowledge. From the 10th to the 13th centuries there was a massive proliferation of
books in the urban centers. Having amassed a huge body of knowledge, they produced
encyclopedias such as those of Al-Khawarizmi (al-ḫawārizmy) and Ibn Rushd (abn rušd),
known in the West as Averroes and for his Aristotelian commentaries (Harvey, 2000).
Biesterfeldt (2000) describes Al-Farabi's encyclopedia ʾiḥṣāʾu l-ʿulwmi
(Enumeration of Sciences) as having the greatest impact on the classification of sciences
in Islam, and beyond, during the Christian and Jewish Middle Ages. The medieval
western world came to know the classical works of the Greeks either directly through
the Arabic translations or indirectly through the medieval Hebrew encyclopedias that,
in turn, were largely influenced by the Arabic encyclopedias (Biesterfeldt, 2000).
Arab scholars did not have a unified view of how to classify knowledge. For example,
Al-Khawarizmi (al-ḫawārizmy), in miftāḥu l-ʿulwmi (Keys to the Sciences),
classified sciences into two main categories: Islamic/Arabic
Sciences and foreign sciences. He listed six branches of knowledge under Islamic/Arabic
sciences including jurisprudence, theology, grammar, scribal art, poetry and prosody,
and history. Under foreign sciences he refers to philosophy, logic, medicine, arithmetic,
geometry, astronomy, music, mechanics and alchemy (Biesterfeldt, 2000). On the other
hand, Al-Farabi, in ʾiḥṣāʾu l-ʿulwmi, provided a general survey of
each of the sciences; his encyclopedia consisted of five topical chapters.
1. The Science of Language
2. The Science of Logic
3. The sciences of music, mathematics: arithmetic, geometry, optics, mathematical
astronomy, the weights, and mechanics.
4. Natural Science and Divine Science (metaphysics)
5. Political Science, Jurisprudence and Theology.
The impressive work produced by the Arab scholars shortly before and after the Abbasid
Caliphate proves that the Arabic language, which had been regarded as primarily a
flowery and poetic language, adapted itself to the emerging needs of its speakers. Thus
it became a vehicle for expressing complex accounts of scientific advances in almost
all fields of knowledge. The scientific advances of the Arabs at the time raised the
functionality of Classical Arabic to its highest level. It became the language of science
and philosophy in addition to being the language of religion, literature and government.
It is worth noting that the Center for Muslim Contribution to Civilization in the
State of Qatar recently launched a project to translate the landmarks in Arabic scientific
thought into English. The center has already published and is in the process of publishing
several English translations of Arabic books written by medieval Arabic scholars.
But after the fall of the Abbasid Dynasty, the pre-eminence of the Arabic language
began to decline. For the first time in the history of Islam, the caliphs were non-Arabs
who were not proficient Arabic speakers. Under Turkish rule, the Arabic language was
no longer the language of government. We will talk below about how the position of the
Arabic language was demoted for at least four centuries.
2.1.4 The Demotion of the Arabic Language
It is important to understand the role of language in government under Islamic rule. The
use of the Arabic language in government began when the Prophet Mohammad migrated
from Mecca to Yathrib (yaṯrib), later known as Medina (al-madyntu), in 622 AD.
Upon his arrival he drafted a set of deeds known as the Constitution of Medina
(CM) (Arjomand, 2009). The CM is probably the very first constitution in either the
Arabian Peninsula or the entire world and the CM clearly established the Prophet as the
head of the Umma (al-amatu). The CM created a political community on the basis
of well-defined common interests, rights and duties. This was a significant departure
from the traditional tribal rule where loyalty to the tribe was paramount. The newly
created political community transcended tribal allegiance, transforming tribal allegiance
to political allegiance. The new political community included Muslims and non-Muslims
(ʾahlu l-kitābi), forming a pact that clearly defined the rights and obligations
of the three factions of the Umma in Medina:
1. The Muslims from Mecca who migrated to Medina (al-muhāǧirwn)
2. The newly converted Muslims in Medina who welcomed the Prophet and his followers
from Mecca (al-ʾanṣāru)
3. The Jews in Medina
Following his death, Muhammad was succeeded by four caliphs known as the Rashiduun
Caliphs (al-ḫ...) ... waly for each province and/or country under his rule and the
governor's role was clearly defined: maintain peace and order within his province, follow
the Islamic jurisprudence and collect taxes from Muslims ... al-mutaṣimu b-al-lah, the
last Abbasid caliph who ruled from Baghdad, was killed in the attack. The fall of the
central Islamic Caliphate in Baghdad and the death of the last Abbasid caliph marked
the beginning of a different era characterized by the emergence of several small shadow
caliphates and resulted in the disintegration of the Islamic empire. In Egypt for example,
the Mamluks were not of Arabic origin, yet they hosted an Abbasid Caliph, and did so
only to legitimize their rule. Hence, the real power was in the hands of the Mamluks while
the Abbasid Caliph exerted little influence and usually capitulated to their demands.
Then more than two hundred years later in 1517, the Turkish Sultan Selim I con-
quered Egypt and captured the nominal Abbasid caliph, al-Mutawakkil III. He was
taken to Istanbul and was forced to surrender the Caliphate to the Turks. With the fall
of the Abbasid Caliphate, a new Islamic dynasty formed, known as the Osman (Ottoman) Caliphate
(al-ḫilāfatu al-ʿuṯmānytu).
During this time, while Classical Arabic remained the language of religion, it lost
its role as the language of government and administration. For the first time in Islamic
history, the Caliph was neither an Arab nor a native speaker of Arabic. However, he
was the religious leader of the empire. As a result during this period, religion became
the only factor that unified the people under his leadership.
Turkish was the predominant language of the Sultans in Constantinople and more
often than not, the Caliph would assign Turkish governors who did not speak Arabic
fluently, to govern the Arabic provinces. This motivated the local Arab speaking population
to learn Turkish in order to obtain administrative positions and to resolve issues
with their rulers. The prior supremacy of the Arabic language was replaced by a mul-
tiplicity of three role-specic languages: Turkish for government, Persian for literature
and Arabic for religion. The relegation of Arabic to solely the realm of religion persisted
for three centuries until well into the nineteenth century which witnessed the rebirth of
Arabic (Versteegh, 1997a).
Despite the setback suffered by the Arabic language during the centuries under Turkish
domination, it survived, surprisingly, and regained its vitality in the nineteenth
century when the Arab countries came into direct contact with the ideas, culture, philosophy
and scientific and technological advances in Western Europe. As a result, this
contact between the Arabs and Europeans in the nineteenth century had a profound
influence on both the Arabic culture and language as discussed in the next subsection.
2.1.5 The Second Standardization of Arabic: The Emergence of Modern
Standard Arabic
As stated above, the Arabic language revived itself in the nineteenth century via Modern
Standard Arabic (MSA) which emerged when Western concepts had to be translated
into Arabic and many Arabists (Versteegh, 1997a) consider the emergence of MSA as a
rebirth of Classical Arabic.
In 1798, Bonaparte led the French expedition across the Mediterranean and occupied
Egypt until 1801. The French wanted to intercept the trade route between Great Britain
and its colonies in the Far East in order to establish a base to destroy the British-
Indian empire (Peretz, 1994). Hence, there was a renewed awareness on the part of
the Western powers as to the strategic importance of the Middle East. For the first
time, since Egypt came under Ottoman rule in 1517, it came into direct contact with
the Western world. The Egyptians were taken by surprise when the French soldiers
used guns which to them, were new weapons and they realized how backward they had
become as compared to the Western World. Bonaparte brought a printer to Egypt able
to print the Arabic script and he printed leaflets directed toward the Egyptians in order
to win their hearts and minds. Whether intentional or not, the French expedition
exposed the Egyptians to the new Western world and its culture. Egypt became even
more influenced by Western culture when in 1805 it came under the rule of Mohammad
Ali Pasha. He realized that the Turkish Caliphate in Istanbul was in decline and wanted
to establish a strong Islamic empire that extended beyond Egypt. To achieve his goal,
he initiated an ambitious program to modernize and industrialize Egypt. He introduced
a new administrative system and reorganized the fiscal and agrarian systems (Holt,
1966). He built an organized system of government with a leader (mudiir) for each
function: finance, navy, war, interior, education and public works, marking the beginning
of the cabinet system in Egypt. Following the advice of his French advisers, he sent
young Egyptian men to France to study medicine, engineering and all modern branches
of science. Upon their return from France, they taught at the new technical schools
or colleges founded by Ali Pasha to teach ship-building, medicine, surveying, etc.
So as to prepare Egyptians for these technical schools, he developed the first secular
state education system in Egypt. He also established a language school to facilitate the
translation of technical materials into Arabic. His openness to the West, especially to
France, brought to Egypt the new ideas and concepts in vogue in Western Europe at
the time.
But the Arabic language had to be used to express ideas and concepts that did not
exist in Classical Arabic. New Arabic words for concepts such as constitution, republic,
representation and national assembly had to be created in addition to thousands of technical
and scientific expressions. This was an important impetus for the development of Modern
Standard Arabic which originated with the Western-educated Arabs who needed
to present political, cultural and technical knowledge in Arabic. The development of
MSA has continued to the present and now functions as the mutually accepted vehicle
of communication among educated Arabs from different regions in the Arabic-speaking
world.
2.1.6 The Stability of Classical Arabic
Classical Arabic has remained intelligible, functional and widely used throughout the
last fifteen centuries and is currently the official language of twenty-two states. Holt
(1995) says:
of the four major literate civilizations: Confucianism, Hinduism, Christianity and Islam,
it is only Islam whose sacred language has retained its original form and still became
a national and official language. Another way of putting it is that, whereas most
sociolinguistic study is concerned with variability, here we are concerned with its op-
posite. Furthermore, other neighboring languages have undergone major structural and
orthographic reform most related to contemporary speech patterns, i.e. Greek, Turkish,
Modern Hebrew and Maltese (Holt, 1995).
So when comparing language stability, the Arabic language has remained relatively
stable in contrast to Old English which became incomprehensible to speakers of Middle
English which in turn became incomprehensible to speakers of Modern English. The
same can be said of Latin which lost its functionality and is now considered to be a
dead language, but gave birth to the Romance languages.
Holt (1995) cites several factors that contribute to the stability of the Arabic lan-
guage. Firstly and from the very beginning, Islam did not separate state and religion
and according to Islamic tenets, the prophet was the religious leader of Muslims as well
as their chief political leader. In fact, the prophet became the leader of the very rst
centralized Arab state. Prior to Islam, the Arabs' loyalty was to the tribe but with the
advent of Islam, their loyalty shifted to religious and political leaders. The fact that the
head of the Arabic state was both a political and religious leader protected the Arabic
language from coerced change that would have been imposed by divided leadership. Sec-
ondly, the early expansion of Islam beyond the Arabian peninsula produced an Arabic
fingerprint or Arabocentrism and an Arabic ideology (Tibi, 1990). Thirdly, literary
and scientific work was conducted in Classical Arabic and the use of Classical Arabic
was an essential element in preserving the link between past and present. Fourthly, the
literate minority (al-ʿulamāʾ) in the Muslim and the Arab world were key to maintaining
social cohesion since as Imams, they had the linguistic knowledge necessary for
interpreting the Quran. They preserved and adhered to the structure of Classical Arabic
in order to facilitate communication and resolve debates among themselves. Fifth, even
though the ... ʿulamāʾ coupled with their ideological commitment to preserve the Quran, helps
explain the remarkable homogeneity of the literary genres of the Muslim world.
2.2 Strategically
The Arabic language is the official language of 22 countries occupying a strategic geographical
location and their combined size is about one and a half times the size of the
USA. These countries extend from the Atlantic Ocean on the west to the Arabian/Persian
Gulf in the east and also include a large part of Africa and a significant portion of
Asia. Further, the Arabic-speaking North African countries have close ties with South-
ern European countries; namely, Spain, Italy, Greece and France due to their proximity
and shared history.
Throughout history the Arab countries played a pivotal role as the crossroad for
east-west trade. To facilitate trade from Western Europe to India, Indonesia and China,
the Egyptians dug a canal that provided for the passage of ships coming from Europe
to the Far East circumventing Africa and the treacherous route around the Cape of
Good Hope. The Suez canal provided a safer, faster and less expensive alternative for
transporting goods from the Far East to Europe.
During the expansion of the Roman and the Byzantine empires preceding Islam, Judaism
and Christianity spread among the Arabs as Semitic religions. For example the
city of Yathrib (yaṯrib) ...
... al-kalaʾ and water. There were frequent battles among these tribes over resources and over
control of caravan lines. There were three important cultural traits that defined the life
of Arab nomads in the Peninsula before Islam.
1. Love of poetry. They excelled in using the most florid, eloquent and rhythmic
forms of the Arabic language and used metaphors, similes, and imagery
extensively in their poetry. Arabic poetry dates to the fifth century CE and was
memorized, recited and handed down from one generation to the next. The seven
most prominent Arab poems in the pre-Islamic period are called muʿalaqāt
and were hung on the walls of the Kaaba in Mecca. Hence, they are named The
Suspended Odes or The Hanging Poems. Each one has a different author and is
considered to be the author's best work.
2. They valued and praised hospitality and took the utmost pride in being hospitable
and generous toward guests, even sacrificing their own sustenance for a hungry
guest. This reverence toward guests is reflected in their poetry.
3. But bitter tribal warfare was commonplace before Islam as exemplified by the continual
bloody conflicts of Aws and Khazraj, Abs and Dhubyan, Bakr and Taghlib.
The Arab nomads valued courage, stamina and horsemanship and demonstrated
supreme courage in battle, taking a keen sense of pride in their skills.
In addition to the nomads, there were other Arabs who lived in settled towns surrounding
the few oases. One of these towns, Mecca, was as early as the 7th century, a renowned
cultural and trade center and was also the birthplace of the prophet Mohammed.
2.3.1 Arabic as an Oral Culture
Arabic discourse has its roots in an oral culture (Johnstone, 1991) that dates to the
pre-Islamic period. As mentioned above, pre-Islamic poetry played an important role
in the social life of the people. Longer poems (odes) were often memorized and recited
orally and the tradition of recitation prevailed even after the founding of Islam. In fact,
the Quran was recited long before it was written and Muslims to this day recite parts
of the Quran at least five times a day in their daily prayers. Poetry such as that of
the Umayyads and Abbasids was publicly recited and even lengthy prose as in The
Arabian Nights was always recited orally. Storytelling was considered a profession and
had a name in the Arabic culture: al-ḥakawāty, the storyteller.
Traditional Islamic education took place at informal schools called the katatiib
(katātyb); a tradition that continues to the present day, relying heavily on repetition and
memorization. At these schools young children memorize the entire Quran at a very
early age without necessarily understanding its meaning. The emphasis on repetition
and memorization in the Arabic culture has led some linguists to perceive it as a lack
of reasoning in Arabic discourse (Johnstone, 1991, 1983). We will elaborate on the
properties of Arabic discourse in section 2.5.1 below.
2.3.2 Arabic Culture and Islam
There is an unbreakable link between the Arabic language and culture, and Islam. For
example, it is difficult to translate concepts specific to Islam without the use of Arabic
terms. Terms such as Jihad (ǧihād), Imam (ʾimām), daawa (daʿwat) and
Quran (qurʾān), to name but a few, can only be defined with these words. Because
Islamic scholars believe that the Quran cannot be translated accurately, Muslims who
are not native speakers of Arabic have to perform their daily prayers in Arabic even
though they do not understand it. The fact that 1.4 billion Muslims listen to and recite
the Quran on a daily basis extends the presence and influence of the Arabic language
beyond its speakers in the Arab world. Moreover, because the Quran was revealed to
the Arabs in their spoken language, many Islamic concepts stem from Arabic culture
and these concepts are now shared by all Muslims. Hence, the Arabic culture is woven
into the local cultures of all Muslims.
2.4 Linguistically
The internal structure of the Arabic language exhibits many interesting properties for
linguistic theories. In section 2.4.1, we will discuss the nonconcatenative nature of Ara-
bic morphology, the communicative function of subject pronouns, and the role that the
structure of Arabic plays in syntactic constituency. In section 2.5.2 we discuss the un-
usual situation posed by the absence of native speakers of CA and MSA and in section
2.5.1 we discuss how Arabic discourse is affected by the oral tradition of the Arabic
culture.
2.4.1 Implications of Arabic structure to Linguistic theories
The Nonconcatenative nature of Arabic Morphology Arabic, like other Semitic
languages, is characterized by a complex and rich morphology. Traditional Arab grammarians
recognized the root as the basic underlying form of Arabic words and described
Arabic morphology (al-ṣarfu l-ʿaraby) in terms of patterns (ʾawzān).
The building blocks of Arabic surface stems are the consonantal root, which represents a
semantic field like [KTB] 'writing', and a vocalism that represents a grammatical form.
More often than not, the elements of the root and the vocalism, when combined,
form Arabic surface words that are discontinuous. For example the surface word
katab 'wrote' can be represented as [KaTaB] where the upper case letters represent
the elements of the root while the vocalism is represented by the lower case letters.
The consonantal root means 'writing' whereas the vocalism [a-a] has the grammatical
meaning of the perfected verb. Unlike those of Romance and many other languages, Arabic
morphemes are not contiguous. The building blocks of Arabic words are usually nested
rather than concatenated, hence the nonconcatenative nature of Arabic morphology
(McCarthy, 1981). The discontinuity of the root and the vocalism suggests that the well
established definition of the morpheme as a meaningful minimal linguistic unit (i.e., one
that does not contain a morpheme boundary within it) is not a comprehensive definition.
This definition works well only for languages whose morphemes are concatenated.
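The nesting of root and vocalism described above can be illustrated with a minimal Python sketch. It is only a toy under stated assumptions: the function name interdigitate and the "CVCVC" template string are hypothetical choices made for this illustration, not notation taken from McCarthy (1981).

```python
def interdigitate(root: str, vocalism: str, template: str = "CVCVC") -> str:
    """Fill a CV template: each C slot takes the next root consonant,
    each V slot takes the next vowel of the vocalism.

    Hypothetical helper for illustration only."""
    if set(template) - {"C", "V"}:
        raise ValueError("template may contain only 'C' and 'V' slots")
    if template.count("C") != len(root) or template.count("V") != len(vocalism):
        raise ValueError("root and vocalism must fit the template exactly")
    consonants, vowels = iter(root), iter(vocalism)
    # Slots are filled left to right, so the two morphemes end up interleaved
    # rather than placed one after the other.
    return "".join(next(consonants) if slot == "C" else next(vowels)
                   for slot in template)


# Upper-case radicals and lower-case vowels mirror the [KaTaB] notation.
surface = interdigitate("KTB", "aa")
print(surface)          # KaTaB
print(surface.lower())  # katab
```

Neither the root nor the vocalism appears as a contiguous substring of the output, which is the sense in which the morphemes are discontinuous.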
McCarthy proposes that Arabic words are formed by an association of the radi-
cals (elements of the consonantal root) with the vocalism forming a prosodic template.
Moreover, further work on Arabic morphology reveals its canonical-invariant property
in Arabic templatic morphology (McCarthy and Prince, 1990). They show that Form
2 which is the causative of Form 1, is generated by reduplication of the second radical
while keeping the canonical form invariant. Similarly, by changing the vocalism, the
voice goes from active to passive while keeping the canonical form invariant. Inspired
by these features of Arabic morphology, McCarthy and Prince developed a theory of
prosodic morphology which defines templatic morphology in terms of the fundamental
units of prosody such as the mora, the syllable, the foot and the phonological word.
McCarthy also points out that the root in CA is subject to certain constraints. For
example, there is a constraint that prohibits the reduplication of the first element in an
Arabic root while allowing it for the second element.
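The two operations mentioned above, doubling the second radical to obtain Form 2 and changing the vocalism to switch voice, can be sketched as follows. This is an illustrative toy, not McCarthy and Prince's formalism: the digit-indexed template strings, the names FORM_TEMPLATES, VOCALISMS and derive, and the passive vocalism u-i (the one usually given for the Classical Arabic perfect) are assumptions introduced here.

```python
# Digits index the radicals of the root (1-based); "V" takes the next vowel.
FORM_TEMPLATES = {
    "I": "1V2V3",
    "II": "1V22V3",  # the second radical is linked to two adjacent slots
}
# Assumed vocalisms for the perfect; only [a-a] is given in the text above.
VOCALISMS = {"active": "aa", "passive": "ui"}


def derive(root: str, form: str, voice: str) -> str:
    """Build a stem from a triliteral root, a form template and a voice."""
    vowels = iter(VOCALISMS[voice])
    stem = []
    for slot in FORM_TEMPLATES[form]:
        if slot == "V":
            stem.append(next(vowels))
        else:
            stem.append(root[int(slot) - 1])
    return "".join(stem)


for form in ("I", "II"):
    for voice in ("active", "passive"):
        print(form, voice, derive("ktb", form, voice))
# I active katab
# I passive kutib
# II active kattab
# II passive kuttib
```

Within each form the template stays the same when the voice changes, and within each voice the vocalism stays the same when the form changes, which is one way to picture the invariance the text describes.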
Communicative Function of Subject Pronouns Arabic, like Italian, Spanish, Ko-
rean and many other languages allows for the deletion of subject pronouns when the
information provided by the subject pronoun is recoverable. This led some linguists
(Perlmutter, 1972) to consider the subject pronoun in such languages as completely
redundant, with no communicative function whatsoever, since it can be deleted with no
loss of meaning. The optionality of the deletion of the subject pronoun in some languages
led Chomsky to conclude in his Principles and Parameters theory (Chomsky, 1986)
that the pro-drop parameter is part of universal grammar and is fixed by the early
linguistic experience of a child. Pro-drop languages may optionally
delete subject pronouns when the condition of the recoverability of deletion is met
(Chomsky, 1965). Other languages, like English, French and German, do not allow their
subject pronouns to drop.
Elaborating on the insights of Eid (1980), Farghaly (1982) argues that subject pronouns
in Egyptian Arabic are not redundant but have real communicative value. He shows that in Arabic
sentences, the presence or absence of the subject pronoun can change the interpreta-
tion of a sentence. Consider the following two sentences:
(1) ḥaḍara ly wanṣaraf hwa
    'Ali came and he left.'
(2) ḥaḍara ly wanṣaraf
    'Ali came and left.'
The sentence in (1) is a compound sentence conjoined by the Arabic coordinator
wa meaning 'and'. Each of the two conjoined sentences consists of a verb and a subject.
The subject in the first conjoined sentence is a lexical NP while in the second it is a subject
pronoun. However, the preferred reading of (1) is that the two subjects are disjoint in
reference. The sentence in (2) is identical to the one in (1) except that the subject
pronoun of the second verb is deleted. The deletion of the subject pronoun conforms to
the recoverability of deletion condition because the features of the subject pronoun are
present in the inflected verb. In contrast to (1), the preferred reading of the sentence
in (2) indicates that the subjects of the two verbs in the sentence are coreferential. Thus,
the presence of the subject pronoun in (1) signals the disjoint reference between Ali
and he while the absence of the subject pronoun in (2) causes obligatory coreference.
Farghaly (1982) proposed different structures for the two sentences above.
Eid (1980) argues convincingly that the subject pronouns in Egyptian Arabic could
facilitate disambiguating ambiguous messages. Consider the following sentence:
(3) tḥadta l-wzyru ma l-ṣḥfy ndama ḥaḍar
    'The minister talked to the journalist when he came.'
The subject pronoun of the verb 'came' is deleted in this sentence. It could refer
either to the journalist or to the minister. However, the preferred reading for (3) is that
the pronoun refers to the journalist, i.e., the nearest antecedent with which it
agrees. In contrast, if the subject of the verb 'came' is explicit as in (4), the preferred
reading would reference the furthest possible antecedent.
(4) tḥadta l-wzyru ma l-ṣḥfy ndama ḥaḍr hwa
    'The minister talked to the journalist when he came.'
It would be interesting to see if the behavior of subject pronouns in Arabic is shared
by all pro-drop languages and not restricted to Arabic. If other pro-drop languages do
not share these properties with Arabic, then we could gain a better understanding of
the ways these languages vary.
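The preference reported for sentences (3) and (4) can be stated as a small heuristic. The sketch below is only an illustration under stated assumptions: the function name `preferred_antecedent` is hypothetical, the candidate list is assumed to contain only antecedents that already agree with the verb, and the rule covers the subordinate-clause pattern of (3) and (4), not the coordinate pattern of (1) and (2).

```python
def preferred_antecedent(candidates, overt_pronoun):
    """Return the preferred antecedent of the subject of the embedded verb.

    candidates: agreeing antecedents in surface order (furthest first).
    overt_pronoun: True if the subject pronoun is explicit, as in (4).
    """
    if not candidates:
        return None
    # Overt pronoun: furthest antecedent; dropped pronoun: nearest antecedent.
    return candidates[0] if overt_pronoun else candidates[-1]


antecedents = ["minister", "journalist"]
reading_3 = preferred_antecedent(antecedents, overt_pronoun=False)
reading_4 = preferred_antecedent(antecedents, overt_pronoun=True)
print(reading_3, reading_4)
```

These are preferred readings rather than categorical rules, so a practical system would more plausibly use this as one weighted feature among several.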
Arabic Features and Syntactic Constituency Agreement in Arabic is very com-
plex and encompasses a rich set of features including number, person, gender, humanness
and definiteness. Moreover, Arabic exhibits anti-agreement, and agreement direction-
ality plays an important role in defining certain syntactic categories. For example, the
first term of the Arabic noun construct, which is equivalent to the English possessive
phrase 'the cousin of the king', is almost always indefinite while the second term could
be either definite or indefinite as shown below:
(5) ḥurasu l-qalati
    guards the citadel
    'The guards of the citadel.'
(6) * al-ḥuras al-qalt
    the guards the citadel
The only difference between the grammatical noun construct in (5) and the ungram-
matical phrase in (6) is that in (6) the first term of the construct is definite while in (5)
it is indefinite. The so-called equational sentences in Arabic consist of a subject and
a predicate. The subject is a noun phrase while the predicate could be an adjectival or
a noun phrase. The subject has to be definite while the predicate must be indefinite.
If the definiteness relationship between the subject and the predicate is reversed, the
resulting constituent is no longer a sentence. It becomes a noun construct. Consider:
(7) al-ragulu qatilu
    the-man killer
    'The man is a killer.'
(8) qatilu l-raguli
    killer the-man
    'The killer of the man.'
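The role of definiteness in examples (5) to (8) can be summarized as a decision over two Boolean features. This is a minimal sketch that assumes definiteness has already been determined for each term; the function name `classify_pair` and the returned labels are illustrative, and the sketch ignores other two-word structures such as noun plus adjective.

```python
def classify_pair(first_definite, second_definite):
    """Classify a two-term nominal sequence by the definiteness of its terms,
    following the generalizations stated for examples (5) to (8)."""
    if first_definite and not second_definite:
        # Definite subject followed by an indefinite predicate, as in (7).
        return "equational sentence"
    if not first_definite:
        # Indefinite first term, as in (5) and (8).
        return "noun construct"
    # Both terms definite, as in the ungrammatical (6).
    return "ungrammatical as a noun construct"


example_5 = classify_pair(first_definite=False, second_definite=True)
example_6 = classify_pair(first_definite=True, second_definite=True)
example_7 = classify_pair(first_definite=True, second_definite=False)
example_8 = classify_pair(first_definite=False, second_definite=True)
print(example_5, example_6, example_7, example_8, sep=" | ")
```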
The following sentence shows that in Arabic quantifiers have to disagree in gender
with the noun they quantify.
(9) astaraytu bḍ azhar
    bought-I some flowers
    'I bought some flowers.'
The quantifier bḍ here is masculine and the noun it quantifies is feminine because
Arabic does not allow the quantifier to agree in gender with the noun it quantifies.
Subject-verb agreement in SVO sentences is different from subject-verb agreement
in VSO sentences (Fehri, 1993). For example, when the subject precedes the verb, the
verb agrees in gender, person and number with the subject. But when the verb precedes
the subject, the verb has to be singular regardless of the number of the subject. The
agreement system of Arabic raises interesting questions about agreement triggers, anti-
agreement and the role of the directionality of agreement.
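The word-order asymmetry just described can be sketched as a function from subject features to verb features. The class `Features`, the function `verb_agreement` and the feature values are assumptions made for illustration. The sketch also assumes that person and gender are still copied from the subject in VSO order, which the text above does not state explicitly.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Features:
    person: int   # 1, 2 or 3
    number: str   # "sg", "du" or "pl"
    gender: str   # "m" or "f"


def verb_agreement(subject, order):
    """Return the agreement features the verb should carry."""
    if order == "SVO":
        # Subject precedes the verb: full agreement.
        return subject
    if order == "VSO":
        # Verb precedes the subject: the verb is singular regardless of number.
        return replace(subject, number="sg")
    raise ValueError(f"unsupported word order: {order}")


plural_subject = Features(person=3, number="pl", gender="m")
print(verb_agreement(plural_subject, "SVO"))
print(verb_agreement(plural_subject, "VSO"))
```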
FIGURE 1 The syntactic structure of sentence (10) (tree nodes in order: S, VP, V qaabala, NP PropN aḥmadu, NP N fatatan, PrepP Prep min, NP PropN mṣr)
2.5 Prepositional Attachment
Prepositional phrases in Arabic may attach to a verb or to a noun phrase. This creates
the problem of determining whether a given prepositional phrase modifies the verb or
the noun. Consider the following two sentences, where the same prepositional
attachment ambiguity arises in the English equivalents.
(10) qabala aḥmadu fatatan min miṣr
     met Ahmed girl from Egypt
     'Ahmed met a girl from Egypt.'
(11) qabala aḥmadu fatatan fiy miṣr
     met Ahmed girl in Egypt
     'Ahmed met a girl in Egypt.'
In sentence (10) the prepositional phrase min miṣr 'from Egypt' modifies
the noun. We can replace fatatan min miṣr with the phrase fatatan miṣriyatan 'an Egyptian girl'. In contrast,
we cannot do the same in sentence (11) because the meaning in (11) does not imply that
the girl that Ahmed met in Egypt was an Egyptian. She could be American, Lebanese
or from any country including, but not limited to, Egypt. The correct analysis would
attach the prepositional phrase in (10) to the noun while the prepositional phrase in
(11) would be attached to the verb phrase. The difference in the syntactic structures of
(10) and (11) can be seen in Figures 1 and 2 respectively.
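The two attachment sites can be made explicit by encoding each analysis as a nested structure and asking which node dominates the prepositional phrase. The bracketing below is an assumption based on the description above (noun attachment for (10), verb-phrase attachment for (11)); the tuple encoding and the function `parent_of` are illustrative only.

```python
def parent_of(tree, target_label, parent=None):
    """Return the label of the node that immediately dominates the first
    node labelled target_label, searching depth-first."""
    label, *children = tree
    if label == target_label:
        return parent
    for child in children:
        if isinstance(child, tuple):
            found = parent_of(child, target_label, label)
            if found is not None:
                return found
    return None


# Sentence (10): the PrepP sits inside the object NP.
noun_attached = (
    "S", ("VP", ("V", "qabala"), ("NP", ("PropN", "aḥmadu")),
          ("NP", ("N", "fatatan"),
           ("PrepP", ("Prep", "min"), ("NP", ("PropN", "miṣr"))))))

# Sentence (11): the PrepP is a sister of the object NP inside the VP.
verb_attached = (
    "S", ("VP", ("V", "qabala"), ("NP", ("PropN", "aḥmadu")),
          ("NP", ("N", "fatatan")),
          ("PrepP", ("Prep", "fiy"), ("NP", ("PropN", "miṣr")))))

print(parent_of(noun_attached, "PrepP"), parent_of(verb_attached, "PrepP"))
```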
2.5.1 Repetition and Redundancy in Arabic Discourse
An understanding of the nature and properties of Arabic discourse would facilitate ma-
chine translation systems when dealing with a written text or transcribed speech since
Arabic discourse and modes of argumentation are very different from Western norms
(Johnstone, 1991, 1983). Johnstone (1991) examined several expository texts in con-
temporary Arabic. She noted that the texts written by highly educated Arabs were
FIGURE 2 The syntactic structure of sentence (11) (tree nodes in order: S, VP, V qaabala, NP PropN aḥmadu, NP N fatatan, PrepP Prep fy, NP PropN mṣr)
characterized by elaborate patterns of lexical, morphological and syntactic repetitions.
She also confirmed from her personal experience how Arabs convince people, namely
by repeating (Johnstone, 1991). She noted that the texts she examined were highly
repetitious. For example, there was frequent use of paraphrase, parallelism and lexical
couplets: usually conjoined two-word phrases that are completely or nearly synony-
mous. Examples of these couplets are:
al-ta ayyydu wa l-musandtu
'aid and assistance'
al-wahmu wa l-hayalu
'illusion and imagination'
al-thrybu wa l-tadmyru
'destruction and demolition'
Johnstone concludes that Arabs, unlike Westerners, persuade others through rep-
etition and not through logical reasoning. She investigated the origin of this style of
persuasion. She cites other studies (Bauman, 1977, Stankiewicz, 1960) that associate
repetition and the focus on form rather than content, with poetry and oral discourse.
However, what is perceived as meaningless repetition by Westerners, is highly re-
garded by Arabic speakers. For them, it is an embodiment of the beauty and elegance
of Classical Arabic. The coherence, rhythm and rhetoric contribute to the effectiveness
of the argument. But Arabs do not assume that the elegance of an expression alone
is proof of its validity, as has been claimed by Bateson (1967). The elaborate
repetitions in contemporary written Arabic discourse can be explained by the oratorical
nature of the Arabic culture as we discussed in section 2.3.1. A comparison was made
between discourse in literate societies and discourse in oral cultures. It was found that
literate cultures rely on the factual accuracy of a message and favor evidence, reason-
ing, and analysis. All information is included in the message and the burden is on the
speaker to provide the proof for his argument. In contrast, oral cultures rely on emo-
tional resonance. They place higher emphasis on symbolism and they are less rational,
adhering to a more intuitive approach. For example, a single anecdote can constitute
evidence for a conclusion. The burden is on the listener to complete and decipher the
coded message.
2.5.2 The Absence of the Native Speaker
There has been considerable debate around the question of whether Classical Ara-
bic is a natural language or not and whether it is worthy of investigation within the
generative framework (Elgibali, 1996). Chomsky made it clear that linguistic theory
is only concerned with characterizing native speakers' intuitions about their language
(Chomsky, 1965). He says, 'Linguistic theory is concerned primarily with an ideal
speaker-listener, in a completely homogeneous speech-community, who knows its lan-
guage perfectly.' This statement eliminates Classical Arabic and MSA from the scope
of generative grammar because it has often been stated that Classical Arabic is not the
native language of any present group of people. Arabic speakers learn Classical Arabic
at school in the same way they learn any foreign language. Because Classical
Arabic is not acquired naturally, speakers of Arabic do not have an inherent intuition
of its grammar. Chomsky (1965) postulated that the goal of linguistic theory is to
describe and define the linguistic knowledge of native speakers. To adhere to this prin-
ciple, linguists rely heavily on the grammatical judgments that they elicit from native
speakers.
In the absence of native speakers, as is the case with Classical Arabic and Modern
Standard Arabic, linguists who study these varieties can neither test nor validate their
theoretical assumptions. As a result, many linguists have shifted their attention to the
colloquial varieties of Arabic where they have access to native speakers' intuitions about
their language.
However, a natural language need not be defined so narrowly, that is, as a language
that has living native speakers. Some endangered languages lose their last native speaker.
This fact does not transform the language from natural to unnatural.
Latin is a natural language although it is not spoken by any present day group of people
nor is it acquired naturally because people learn it as a second language.
Artificial languages such as Esperanto and programming languages are created at a
particular time by identifiable authors but are linked neither to a culture nor to a literary
tradition. In contrast, neither Classical Arabic nor Modern Standard Arabic is asso-
ciated with a definite point of origin nor is either attributed to any particular author.
They arose naturally to fulfill the needs of their speech community and, as importantly,
are tied to a specific culture and an impressive literary and knowledge-seeking tradition.
It is not surprising then, that Classical Arabic has survived through the centuries, while
the artificial language, Esperanto, has not.
3 Arabic Linguistics
The only complete grammar of Arabic to date is that written by early Arab grammar-
ians. This grammar, written more than a thousand years ago, still serves as the main
reference even when used to describe Modern Standard Arabic. Most current work in
Arabic computational linguistics relies at least to some degree on the analyses made
by those pioneering Arab grammarians (see chapter 4 in this volume).
In this section we begin by presenting the assumptions and goals of the Arabic linguis-
tic tradition. Then we discuss their methodology while comparing it with the method-
ology of modern linguistic theory in section 3.2. We present some of their descriptions
of CA in section 3.3 and show that while their description of Arabic was pioneering
and very advanced at the time it was developed, it falls short of the requirements for
computer processing.
3.1 Goals and Assumptions of the Arabic Linguistics School
The Arabs did not seriously study and analyze their language until they converted
to Islam and felt they were privileged and honored to have the Quran revealed in
their language by the Prophet Mohammed. The decision to initiate research in Arabic
grammar was made at the highest level of government by the fourth Caliph Ali Ibn Abi
Talib (ly abn aby ṭalb), who ruled over the Islamic Caliphate from
656 to 661 CE. Versteegh (1997b) cites a very interesting story of how the Caliph was
concerned about the purity of the Arabic language and wanted to set norms for its
usage. The story that was given by Abu l-Aswad Al-Duali, an early Arab grammarian,
is as follows:
I came to the Commander of the Believers (amyru l-mumnyn) Ali Ibn Abi
Talib - may God have mercy on him - and saw in his hand a manuscript. I said to him,
'What is this, Commander of the Believers?' He said: 'I was reflecting on the language
of the Arabs and noted that it had been corrupted by our mixing with these red persons
- i.e. foreigners - and I wanted to make something for them on which they could fall back,
on which they could rely.' Then he handed me the manuscript, and I saw that it said:
'Language is a noun and verb and particle. The noun is what informs about a named
object; the verb is that with which the information is given; and the particle is what
comes for a meaning.' He said to me: 'Follow this direction and add to it what you find'
(Versteegh, 1997b).
3.1.1 Goals of Traditional Arabic Grammarians
The linguistic analysis of Arabic by early Arab grammarians contains clear objectives.
Their objectives were:
1. Proper understanding and interpretation of the Quran and the Prophet's sayings
(al-aḥadyt)
al-haṣaiṣ 'The Special Features' written at the end of 10th
century AD.
Variability of Surface Structure Arab grammarians working on the fixed corpus
of the Quran, the pre-Islamic poetry and the trusted, uncontaminated intuitions of
the Bedouins posited rules deduced from that corpus. They described Arabic verbal
sentences as having the order Verb, Subject, Object. However, they also noted that
other orders existed such as Subject Verb Object or even Object Verb Subject and Verb
Object Subject. To account for this variability, they claimed that any order other than
VSO is derived from an underlying order (aṣl) through preposing (taqdym) and
postposing tah
al-mamnwu mina l-ṣarfi, are assigned one of three cases depending on the function of the noun or adjective
in the sentence. The three case endings for nouns and adjectives are:
1. A high round vowel ḍamat corresponds to Nominative, referred to in Arabic
grammar as al-raf.
2. A low front vowel fatḥat corresponds to Accusative, referred to as al-naṣb.
3. A high front vowel corresponds to Genitive, referred to as al-garr.
In principle the three main word categories: nouns, verbs or particles could be gover-
nors and therefore may assign case. A governor may govern multiple governees while a
governee may have one and only one governor (Versteegh, 1997b). For example a verb
governs its subject(s) and assigns it the nominative case and also governs its direct
object and assigns it the accusative case. Nouns and adjectives could be governors and
governees at the same time. Consider:
(12) sahadtu mudyra l-banki
     saw-I manager the-bank
     'I saw the manager of the bank.'
The verb 'saw' in the above sentence assigns accusative case to 'manager'. Thus 'manager'
is a governee here. It is also a governor since it is the first term of the noun construct
'manager of the bank' and assigns genitive case to the second term of this noun construct
(al-muḍaf alayhi).
Proper understanding of the Quran and Arabic poetry relies to a large extent on
identifying the correct case especially for the arguments of predicates because of the
relatively free word order of Arabic and the frequent preposing and postposing of con-
stituents.
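The governance relations in sentence (12) can be written down as a small table from which case follows. The sketch assumes that governance triples are already known; the relation names, the mapping `CASE_BY_RELATION` and the function `assign_cases` are hypothetical, and only the three relations mentioned above are covered.

```python
CASE_BY_RELATION = {
    "subject": "nominative",                # verb governs its subject
    "object": "accusative",                 # verb governs its direct object
    "construct_second_term": "genitive",    # first term governs the second
}


def assign_cases(governance):
    """Map each governee to a case.

    governance: list of (governor, governee, relation) triples.
    A governee may have one and only one governor, so a second
    triple for the same governee is rejected.
    """
    cases = {}
    for governor, governee, relation in governance:
        if governee in cases:
            raise ValueError(f"{governee} already has a governor")
        cases[governee] = CASE_BY_RELATION[relation]
    return cases


# Sentence (12): mudyra is a governee of the verb and a governor of l-banki.
sentence_12 = [
    ("sahadtu", "mudyra", "object"),
    ("mudyra", "l-banki", "construct_second_term"),
]
cases_12 = assign_cases(sentence_12)
print(cases_12)
```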
Surface and Deep Structure Arab grammarians noticed that in some texts they
could discern meanings that were not explicitly mentioned in the text. For example in
the following sentence:
(13) ḍaraba zaydan
     hit-past Zaydan
     'He hit Zaid.'
They defined it as a verbal sentence because it begins with a verb and is followed
by a direct object that is governed by the verb and is therefore assigned the accusative
case. This sentence does not have an explicit subject, so it is a sentence whose subject
pronoun is dropped. However, the Arab grammarians had no notion of the pro-drop parameter.
Their analysis was that there was a subject, but the subject was hidden (mustatir)
and that it is interpreted as 'he' because of the inflection of the verb. They followed
this reasoning when a constituent appeared at a position different from the expected
grammatical position. They took it to have moved from its position in the deep structure to the
surface structure. By positing an underlying level of analysis, they were able to achieve
more generality for their syntactic rules.
3.3 The Description of Arabic by Traditional Arab Grammarians
In this section we will focus on the classification of the different types of Arabic sentences
and how the correct definition of phrasal constituents, neglected in traditional Arabic
grammar, is crucial in Arabic Computational Linguistics.
3.3.1 Sentence Types
Identifying basic constituents of a sentence is important in machine translation, in-
formation extraction and retrieval, event and relation extraction, question answering
systems, and data mining. We define basic constituents as the phrasal building blocks
of a sentence such as subject, object, verb, adverbials and any other constituents that
help define who did what to whom, when, where and for what reasons. Thus, analyzing a
sentence to describe such constituents is important for computational programs whether
based on linguistic engineering or machine learning technology. Correct classification of
sentence type in a language is an essential step in that direction.
Arabic grammarians divided Arabic sentences into two types.
Verbal Sentences They defined verbal sentences as sentences that begin with a verb
such as:
(14) huzima l-adau
     defeated=passive the-enemies
     'The enemies were defeated.'
(15) alana l-wazyru anahu syḥḍuru l-ḥfla
     announced the-minister that-he will attend the party
     'The minister announced that he will attend the party.'
Sentence (14) is a verbal sentence as it begins with a verb. The sentence is passive
and the grammatical subject is not the agent but is the grammatical object of the
corresponding active sentence. Sentence (15) is interesting because though it is a verbal
sentence beginning with a verb and an overt subject, it has a sentential complement with
an explicit complementizer anna 'that'. The sentence that follows the complementizer
is not a verbal sentence because it does not begin with a verb although it has a verb, a
subject and a direct object.
Nominal Sentences A nominal sentence is a sentence that begins with a noun such
as:
(16) al-ḥrbu mudamirtun
     the-war destructive
     'The war is destructive.'
(17) al-fydralyt qadat mutmran ṣḥfyiyan ams
     the-federation held conference press yesterday
     'The federation held a press conference yesterday.'
Both (16) and (17) are the same type of sentence since both begin with a noun. Yet,
they are very different. (16) is usually analyzed as a subject followed by a predicate.
The predicate could be a noun phrase, adjectival phrase, or a prepositional phrase. It
does not have an overt verb and its time refers to the present time. (17) on the other
hand is described as beginning with a topic, followed by a verbal sentence which in
turn is analyzed as a verb, with a hidden subject followed by a direct object and a
time adverbial. The reader of the sentence has to make the connection between the
hidden subject of the verbal sentence and the topic of the nominal sentence. While
identifying the main constituents of a sentence might be possible within the Classical
Arabic framework because Classical Arabic texts usually have overt case markers, this
is not possible in Modern Standard Arabic texts. MSA texts are characterized by the
absence of the diacritics that mark case. In spoken MSA, speakers tend to pause at
word endings to save themselves embarrassment if they confuse case endings. MSA
computational linguistics applications would benefit greatly from a description based on
more formal and explicit pointers rather than on cases that do not appear in the surface
structure or on hidden subjects that need to be recognized. In the following section we
will show that for computational and linguistic considerations, a different classification
and description of Arabic sentences would be more computationally efficient.
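The point that the traditional classification groups together structurally different sentences can be illustrated with a short sketch. It assumes part-of-speech tagged input with a small illustrative tag set ("N", "V", "ADJ", "ADV"); the function names `traditional_sentence_type` and `has_overt_verb` are hypothetical.

```python
def traditional_sentence_type(tagged):
    """Classify a sentence by the category of its first word only,
    as in the traditional verbal/nominal distinction."""
    first_tag = tagged[0][1]
    if first_tag == "V":
        return "verbal"
    if first_tag == "N":
        return "nominal"
    return "unclassified"


def has_overt_verb(tagged):
    """A surface-based cue that the traditional label does not record."""
    return any(tag == "V" for _, tag in tagged)


sentence_16 = [("al-ḥrbu", "N"), ("mudamirtun", "ADJ")]
sentence_17 = [("al-fydralyt", "N"), ("qadat", "V"), ("mutmran", "N"),
               ("ṣḥfyiyan", "ADJ"), ("ams", "ADV")]

for name, sentence in (("16", sentence_16), ("17", sentence_17)):
    print(name, traditional_sentence_type(sentence), has_overt_verb(sentence))
```

Both sentences receive the label "nominal", while the second function separates the verbless sentence (16) from sentence (17), which contains an overt verb.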
3.3.2 Phrasal Constituents in Traditional Arabic Grammar
Describing and explaining the different endings of an Arabic word had the highest prior-
ity for Arabic grammarians. Reciting the Quran correctly and interpreting its meaning
depend crucially on knowledge of the rules of parsing (qawaidu l-irabi)
which are described in terms of governance as shown above. Identifying phrasal cate-
gories below the sentence level such as noun phrases, adjectival phrases and their internal
structure did not receive as much attention as case ending (al-irab) received.
Recognizing Arabic phrasal categories and their syntactic properties and boundaries
is extremely important and has proved to be a difficult problem in Arabic computational
linguistics. Below is an example of a very common Arabic structure known as
al-iḍaft, sometimes referred to as the noun construct. Consider the following:
(18) maqaru l-wizarti
     headquarter ministry
     'The headquarters of the Ministry.'
(19) gamyltu l-waghi
     pretty-fem the-face
     'with the pretty face.'
(20) amama l-manzili
     front the-house
     'In front of the house.'
Traditional Arabic grammarians noted that in all the above phrases, the second
term is governed by the first and is assigned genitive case, so they grouped them
together as an idaafa. In an Arabic machine translation system, we must be able to
distinguish these three phrases by one of the following analyses: (18) is a noun phrase,
(19) is an adjectival phrase and (20) is a prepositional phrase. Such an analysis is more
relevant and useful in ANLP applications than the traditional analysis, which treats
them as one structure based on case ending, a criterion that is not applicable to texts in MSA.
Arabic computational linguistics would benefit greatly from surface-based grammars of
MSA the same way NLP systems benefited from lexical-based grammar formalisms such
as Lexical Functional Grammar (Bresnan, 2000) and Head-Driven Phrase Structure Grammar
(Pollard and Sag, 1994).
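The three-way distinction proposed for (18) to (20) depends only on the category of the first term. A minimal sketch, assuming the first term has already been tagged and using hypothetical tag names and a hypothetical function `phrase_type`:

```python
PHRASE_BY_HEAD = {
    "N": "noun phrase",             # (18) maqaru l-wizarti
    "ADJ": "adjectival phrase",     # (19) gamyltu l-waghi
    "PREP": "prepositional phrase", # (20) amama l-manzili
}


def phrase_type(first_term_tag):
    """Label a two-term genitive structure by the category of its first term."""
    return PHRASE_BY_HEAD.get(first_term_tag, "unknown")


labels = [phrase_type(tag) for tag in ("N", "ADJ", "PREP")]
print(labels)
```

Unlike the case-based grouping, this label can be computed from unvocalized MSA text, provided the tagger identifies the category of the first term.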
4 Arabic Computational Linguistics
Arabic computational linguistics is part and parcel of mainstream computational lin-
guistics. As such the prevailing paradigms in Arabic computational linguistics are the
same as those found in the field in general. This volume contains reports on work in
both paradigms and on the hybrid approach. Thus we begin this section with a brief
description of the two paradigms. In the following section we describe and compare
the goals of developers of Arabic NLP systems in the Arabic speaking world and the
goals of developers in a non-Arabic environment. In section 4.3 we briefly comment on
computational treatments of Arabic morphology while in section 4.5 we review Arabic
sentence analysis of the Penn Arabic Treebank.
4.1 The Prevailing Paradigms in the Field
Computational linguistics, like any other discipline, exhibits cyclic change with respect
to prevailing paradigms (Oepen et al., 2000). In the late 1970s and 1980s new tech-
nologies emerged to make linguistic engineering more efficient. The new technologies
targeted two areas: processing efficiency and linguistic descriptions. In the area of com-
puter processing more efficient techniques were developed such as chart parsing (Kay,
1973), definite clause grammars (Pereira and Warren, 1980), and unification (Shieber,
1986). At the linguistic level, monostratal formalisms such as head-driven phrase structure
grammar (HPSG) (Pollard and Sag, 1994) proved to be very useful in designing gram-
mars with very wide coverage. However, since the 1990s we have seen a dramatic shift in
the field from linguistic engineering to statistical approaches (Manning and Schuetze,
1999, Jurafsky and Martin, 2000, Baldi and Brunak, 2001, Joachims, 2002, Abe, 2005,
Bishop, 2007, Cherkassky and Mulier, 2007). There were several factors that paved the
way for this change to take place.
1. It is impossible to encode common sense knowledge, knowledge of the world and
the context that native speakers use to communicate this knowledge. In the ab-
sence of such knowledge, NLP systems have to consider a huge number of logically
possible analyses that never arise in the minds of native speakers. Considering all
these logical possibilities slows down processing of input, and systems within the
symbolic framework are usually slow.
2. Grammar engineering is expensive, especially when going beyond the stage of de-
veloping toy grammars. Grammars that cover sizable data suffer greatly
from rule interaction, which slows processing.
3. Many computational linguists were more concerned with strict adherence to lin-
guistic theories than developing systems with practical applications. Instead, they
were obsessed with theoretical correctness.
4. The web became the primary source of data needing to be processed. Blogs, emails,
chats and web pages are full of errors, typos and incomplete sentences, posing a chal-
lenge for parsing.
5. Rule-based machine translation systems suffered from lack of fluency in the target
language and translated texts suffered from ungrammatical phrases and sentences.
The development of successful speech recognition systems using probability theory and
machine learning approaches paved the way for a new empirical approach (Manning
and Schuetze, 1999) while proving its cost effectiveness, validity and robustness. As a
result, statistical approaches to natural language processing have dominated the field
since the 1990s. The new paradigm is characterized by the following features:
1. The use of training and testing data. The closer the testing data to the training
set, the better the results.
2. The use of both supervised and unsupervised learning methods.
3. Based on probability theory it looks at language and cognition as probabilistic
and predicts the next word given the previous one.
4. It assumes that linguistic knowledge is present in linguistic data and that machine
learning techniques can extract this knowledge through cycles of training and
retraining until it learns the language.
5. Development time is short compared to rule-based systems.
6. The statistical models are robust.
7. The use of language modeling which gives the output the fluency that is absent
from symbolic systems.
8. The use of bilingual parallel corpora and translation memory in machine transla-
tion.
9. The use of n-gram models.
10. The development of algorithms that are language independent.
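Features 3 and 9 in the list above can be made concrete with a bigram model, which predicts the next word given the previous one. The sketch below uses a three-sentence toy corpus of simplified transliterations invented for illustration; the function names are assumptions and the estimates are unsmoothed relative frequencies, which a real system would not use as they stand.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def probability(counts, prev, nxt):
    """Relative-frequency estimate of P(nxt | prev), 0.0 if prev is unseen."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


def predict_next(counts, prev):
    """Most frequent continuation of prev, or None if prev is unseen."""
    if not counts[prev]:
        return None
    return counts[prev].most_common(1)[0][0]


corpus = [
    "qabala ahmadu fatatan min misr",
    "qabala ahmadu fatatan fiy misr",
    "hadara ahmadu",
]
model = train_bigrams(corpus)
print(predict_next(model, "qabala"))
print(probability(model, "fatatan", "min"))
print(probability(model, "fatatan", "hadara"))
```

The last call returns zero because the pair never occurs in the toy corpus, a small-scale illustration of what happens when a phenomenon is rare or absent in the training data.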
However, there are problems with statistical approaches to NLP. First, performance
deteriorates rapidly when input varies from the training data. In contrast, the perfor-
mance of symbolic systems is usually consistent. Second, in some cases it is difficult to
predict the kind of data a system will receive. Thus, training such a system is difficult.
Third, the language of emails, chat, and blogs is noisy, affecting the system's perfor-
mance. Fourth, there could be a point where adding more training data would confuse
the system; this is known as the threshold problem. Fifth, when a phenomenon occurs
sparsely there is insufficient data to learn from. This is known as the sparse data
problem. On another level, prominent computational linguists (Zaenen, 2006, Reiter,
2007, Jones, 2007) have recently expressed misgivings regarding the exclusive reliance
on machine learning approaches while neglecting the contributions of symbolic language
processing. Thus, research in Arabic computational linguistics has been influenced by
the same philosophical shift as the paradigms in natural language processing. Bender
(2009) argues that even when developing language-independent NLP systems, general-
izations from the typology of language, which defines the ways languages vary from each
other, are very much needed.
There is a growing realization that incorporating linguistic knowledge in statistical
NLP systems improves their performance. See chapters 7, 9 and 11 in this volume.
4.2 Goals of Arabic Computational Linguistics
Arabic computational linguists developing systems for Arabic speaking users usually
have a different focus than those developing Arabic NLP software enabling non-Arabic
speaking people to understand Arabic texts and speech.
Objectives of Arabic NLP Applications in a Non-Arabic Environment
1. To develop tools that enable English speakers with limited or no knowledge of
Arabic to deduce the gist or the general meaning of Arabic texts.
2. To be able to extract pieces of information of interest such as person names,
addresses, phone numbers, email addresses, etc., from Arabic texts.
3. To produce tools that can comprehend a limited set of spoken Arabic phrases and
short sentences and that are capable of producing short spoken Arabic utterances
on demand.
4. To develop high quality Arabic to English machine translation systems.
Clearly, the emphasis is on developing Arabic language capabilities for non-Arabic
speaking individuals. We will see below that the objectives of Arabic computational
linguistics in the Arabic speaking world are quite different.
Goals of Arabic NLP Applications in the Arabic Speaking World
1. Transfer of knowledge and technology to the Arab World. Most recent publications
in science and technology appear in English and are not
accessible to Arab readers who have no competence in English. Having humans
translate this huge amount of material into Arabic is very costly and time-consuming.
Arabic NLP could reduce the time and cost of translating, summarizing and
retrieving information in Arabic for Arab speakers.
2. Modernize and fertilize the Arabic language. This follows from (1) above. Translating
new concepts and terminology into Arabic involves coinage, arabization
and making use of lexical gaps in the Arabic language. This will have a dramatic
effect on the revitalization of the Arabic language, allowing it to fulfill essential
needs for its speakers.
3. Improve and modernize Arabic linguistics. Arabic NLP needs a more formal and
precise grammar of Arabic than the current traditional grammar. Innovation is
needed while preserving the valuable heritage of traditional Arab grammarians.
4. Make information retrieval, extraction, summarization and translation available
to the Arab user. The hope is to bridge the gap between peoples of the Arab world
and their peers in the advanced countries. By making information available to the
Arabs in their native language, Arabic NLP tools empower the current generation
of educated Arabs. Thus Arabic NLP tools are indispensable in the struggle of
Arabs to keep pace with the rest of the world. This is a matter of national security
for the Arab World (Farghaly, December 2008).
4.3 Computational Arabic Morphology
Arabic morphology has received much attention from engineers and computational linguists
since the late 1970s and early 1980s. Pioneering work on the computational
morphology of Arabic focused on retrieving the consonantal root from fully inflected
words (Hlal, 1979, 1985) through a process of decomposition and repeated matching
against a dictionary of roots. Arabic morphology also poses a challenge from a theoretical
point of view (McCarthy, 1981; Farghaly, 1987, 1994). For example, McCarthy (1981)
proposes a two-tier notation in which the elements of the consonantal root are on one tier
and the vocalism is on a separate tier. Farghaly (1987) says that Arabic surface words
must be analyzed at three levels: the level of the unpronounceable root, the level of the
vocalism and the level of the affixes. Farghaly (1994) suggests that the Arabic lexicon
may consist of underspecified entries to represent the discontinuous nature of Arabic
morphemes. The question of what constitutes an entry in the Arabic lexicon has not yet
been resolved. Most Arabic morphological analyzers, including the most advanced treatment
of Arabic morphology at the Xerox Research Center in France, embrace the view that
the root is the basic entry in the Arabic lexicon (Beesley, 2001). Recently, it has been
shown that a stem-based approach to Arabic morphology would be more relevant for
Arabic computational linguistics applications (Buckwalter, 2002; Farghaly and Senellart,
2003). Positing the stem as the main entry in the Arabic lexicon eliminates the
step of generating stems from roots. Stems are then associated with the appropriate
morphological, syntactic and semantic features that are needed in the syntactic analysis of
Arabic. However, most current morphological systems do not go beyond the strictly
morphological analysis of Arabic words. The morphology-syntax interface is often neglected, as
noted in chapter 5 in this volume. For example, most morphological analyzers of
Arabic do not categorize auxiliary verbs as such. A verb like yagib 'must',
like many others, does not inflect for the first person or for the feminine
singular (*agib, *tagib). Such information is crucial for checking agreement
in order to identify syntactic categories. There are also verbs that serve more than one function.
For example, the auxiliary verb kana serves as a copula in kana aḥmadu fy miṣr 'Ahmed was in Egypt', while
it functions as a progressive aspect marker when it is followed by a verb, as in
[...] ḍamyr mustatir.
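The two-tier view attributed to McCarthy (1981) in the discussion of computational morphology can be illustrated with a toy sketch. It assumes a CV template whose C slots are filled from the consonantal root tier and whose V slots are filled from the vocalism tier. The function name `interdigitate`, the template encoding, and the choice of the root k-t-b are illustrative assumptions, not part of any analyzer cited here.

```python
def interdigitate(root, vocalism, template):
    """Fill a CV template from two tiers (hypothetical helper).

    Each 'C' slot takes the next root consonant and each 'V' slot
    takes the next vowel of the vocalism.
    """
    consonants = iter(root)
    vowels = iter(vocalism)
    out = []
    for slot in template:
        if slot == "C":
            out.append(next(consonants))
        elif slot == "V":
            out.append(next(vowels))
        else:
            raise ValueError(f"unknown template slot: {slot!r}")
    return "".join(out)


# Illustrative root tier; the vocalism tier is kept separate from it.
ROOT_KTB = ("k", "t", "b")

stem_active = interdigitate(ROOT_KTB, ("a", "a"), "CVCVC")
stem_passive = interdigitate(ROOT_KTB, ("u", "i"), "CVCVC")
print(stem_active, stem_passive)
```

A stem-based lexicon, by contrast, would store the two output stems directly as entries and skip this generation step.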
There is another case in which the subject is completely dropped from the sentence, as in
the example given on page 13 of the guidelines, which we present below in (23). This
sentence is subjectless. Since there is no subject with which to trace the index,
one would expect the agreement features on the verb to be copied into the NP-SBJ node
instead of leaving it completely empty, as in (NP-SBJ (-NONE- *)). Arabic allows the subject
to drop only if the verb is inflected, as the inflection on the verb provides information
about the subject. Copying these features from the verb into the subject node could be of
great help in MT systems.
(23) (S (VP ahlda::>axolad+a::eternalize/perpetuate/remain+he/it [verb]
            (NP-SBJ (-NONE- *))
            (PP-CLR ila::<ilaY::to/towards
                    (NP al-nawmi::Al+nawom+i::the+sleep+[def.gen.])))
        .::.::nogloss)
     >axolad+a + <ilaY + Al+nawom+i
     perpetuate + to + the+sleep
     'He retired to bed'
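The feature-copying proposal for dropped subjects can be sketched as follows. This is a minimal illustration under stated assumptions: the `Node` class, the feature names, and the function `fill_dropped_subject` are hypothetical and do not correspond to any ATB tool. The third-person masculine singular values follow the gloss "+he/it" on the verb in (23).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A bare-bones tree node (hypothetical, for illustration only)."""
    label: str
    features: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


AGREEMENT_KEYS = ("person", "gender", "number")


def fill_dropped_subject(verb, subject):
    """Copy the verb's agreement features onto an empty subject node."""
    if subject.children or subject.features:
        # An overt or already specified subject is left untouched.
        return subject
    subject.features = {
        key: verb.features[key] for key in AGREEMENT_KEYS if key in verb.features
    }
    return subject


# The verb of (23), with assumed feature values read off its gloss.
verb = Node("VERB", {"lemma": ">axolad", "person": 3, "gender": "masc", "number": "sg"})
subj = Node("NP-SBJ")  # corresponds to (NP-SBJ (-NONE- *))
fill_dropped_subject(verb, subj)
print(subj.features)
```

With the features in place, a downstream MT component could choose 'he' rather than leaving the subject unresolved.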
4.5.3 The Noun Construct
While English marks genitive phrases such as 'the cousin of the king' and 'the king's
cousin' with the explicit markers 'of' and ''s', Arabic does not use any explicit marker
for this construction. It joins the two nouns together, which corresponds to 'cousin
king'. Arab grammarians called this construction iDaafa (iḍaft), which literally
means 'addition'. They called the first term of the construct muḍaf, while the
second term was called muḍafun ilayhi. This construction exhibits interesting
properties that distinguish it from other noun phrases. For example, the first term of
such a construct is almost always indefinite. It governs the second element and assigns
it genitive case. As in English, the construction is recursive. The whole construct acquires its gender
and number from the first term, while its definiteness is determined by the definiteness
of the last term of the construct.
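The feature rules just stated for the noun construct can be written down as a small function. This is a sketch under assumptions: the dictionary keys and the function name `construct_features` are hypothetical, and the three-term example uses Buckwalter-style forms with gender values that are assumed for illustration.

```python
def construct_features(terms):
    """Compute the features of a whole noun construct (hypothetical helper).

    `terms` lists the nouns from first to last. Gender and number come
    from the first term; definiteness comes from the last term.
    """
    if len(terms) < 2:
        raise ValueError("a noun construct needs at least two terms")
    first, last = terms[0], terms[-1]
    return {
        "gender": first["gender"],
        "number": first["number"],
        "definite": last["definite"],
    }


# A recursive, three-term construct: 'knowledge of the reason of the accident'.
terms = [
    {"form": "maEorif+ap", "gender": "fem", "number": "sg", "definite": False},
    {"form": "sabab", "gender": "masc", "number": "sg", "definite": False},
    {"form": "Al+HAdiv", "gender": "masc", "number": "sg", "definite": True},
]
whole = construct_features(terms)
print(whole)
```

Because only the first and last terms matter, the same function covers constructs of any depth.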
Although the noun construct is analyzed correctly in the ATB trees as a noun phrase,
it is not marked as a possessive noun phrase or as a noun construct. The
noun construct is very common in Arabic, and categorizing it as a possessive NP,
as an iDaafa or as a noun construct is necessary for its correct translation by machine
translation systems.
4.5.4 Prepositions and Auxiliary Verbs
There is a list of true prepositions on page 33 of the guidelines. A true preposition
like fiy 'in' is not on the list, nor is it found in the list of nouns formerly treated as
prepositions on page 35. All Arab grammarians know that the particle, al-adat,
is a rag-bag category. The ATB categorizes auxiliaries and modals such as qad as
particles. This is not very informative. For example, qad occurs only before verbs
and has a specific meaning when it precedes an imperfect verb, as in
qad aḥṣlu ala l-waẓyfati 'I may get the job'.
It has a different grammatical meaning when it is followed by a perfect verb, as in
qad ata 'He has come'. In the latter case, it is best categorized as an aspect marker. Thus,
the morphological analyzer should have two tags for qad: one as a modal meaning
'may'. This is the tag it should have when it comes before an imperfect verb. The other
tag would mark it and its variant [...]
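The two-tag proposal for qad can be sketched as a context-sensitive rule keyed on the aspect of the following verb. The tag names `MODAL` and `ASPECT_MARKER` and the function `tag_qad` are hypothetical labels chosen for this illustration; they are not ATB tags.

```python
def tag_qad(next_verb_aspect):
    """Choose a tag for qad from the aspect of the verb that follows it."""
    if next_verb_aspect == "imperfect":
        return "MODAL"          # 'may', as in 'I may get the job'
    if next_verb_aspect == "perfect":
        return "ASPECT_MARKER"  # as in 'He has come'
    raise ValueError("qad is expected to occur only before a verb")


print(tag_qad("imperfect"), tag_qad("perfect"))
```

A rule of this kind only works if the analyzer already distinguishes perfect from imperfect verbs, which is one reason the morphology-syntax interface matters here.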
wa-::and )
(VP (PRT (NEG_PART -lam::lam::did not ))
    (IV3MS+IV+IVSUFF_MOOD:J ya+tasan~a+[null]::ytsn::he/it+be possible+[jus.] )
    (PP-TMP (PREP EalaY::ala::on/above )
            (NP (DET+NOUN+CASE_DEF_GEN Al+fawor+i::al-fawr::the+immediately+[def.gen.] )))
    (NP-SBJ (NOUN+NSUFF_FEM_SG+CASE_DEF_NOM maEorif+ap+u::marift::knowledge/acquaintance+[fem.sg.]+[def.nom.] )
            (NP (NOUN+CASE_DEF_GEN sabab+i::sabab::reason/cause+[def.gen.] )
                (NP (DET+NOUN+CASE_DEF_GEN Al+HAdiv+i::al-ḥadit::the+accident/mishap+[def.gen.] )))))))
[...] sabab would be the head of the lower NP.
Specifying the head is particularly important when analyzing the noun construct.
The gender and number of the head of the noun construct are percolated to
the maximal projection. Consider the following two sentences:
(24) [...] muḍafun ilayhi.
4.5.6 Prepositional Attachment and Syntactic Boundaries
Attaching prepositions to the right node and identifying where a syntactic constituent
ends are two of the most challenging problems in computational linguistics. One would
expect that the ATB would provide enough training data that could help machine
learning algorithms learn the right features so as to resolve these issues. Consider the
following two sentences:
(26) aṭaytu kitaban lialy
     gave-I a-book to-Ali
     'I gave a book to Ali.'
(27) qaratu kitaban lialy
     read-I a-book to-Ali
     'I read a book by Ali.'
In (26), the phrase 'a book to Ali' is not one constituent, as 'book' and 'Ali' here are two different
entities. The preposition 'to' in (26) modifies the verb 'give': the giving was to Ali. In
contrast, the preposition in (27) modifies the book, and the phrase 'a book by Ali' is
one constituent that could be an answer to the question 'Which book did you read?'
We therefore expect a parser to generate two different parses for (26) and (27), with the
preposition in (26) attached to the verb and the preposition in (27) attached to the
noun phrase.
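The two expected attachments can be made concrete with a small sketch that represents each analysis as a nested tuple and reports which node immediately dominates the prepositional phrase. The tree encoding, the English glosses used as leaves, and the function `pp_attachment_sites` are assumptions made for this illustration.

```python
# Expected analysis of (26): the PP is a sister of the object NP under VP.
VERB_ATTACH = (
    "S",
    ("VP",
     ("V", "gave"),
     ("NP", ("N", "book")),
     ("PP", ("P", "to"), ("NP", ("N", "Ali")))),
)

# Expected analysis of (27): the PP sits inside the object NP.
NOUN_ATTACH = (
    "S",
    ("VP",
     ("V", "read"),
     ("NP", ("N", "book"),
      ("PP", ("P", "by"), ("NP", ("N", "Ali"))))),
)


def pp_attachment_sites(tree, parent=None):
    """Return the labels of the nodes that immediately dominate a PP."""
    label, *children = tree
    sites = []
    if label == "PP" and parent is not None:
        sites.append(parent)
    for child in children:
        if isinstance(child, tuple):  # skip leaf strings
            sites.extend(pp_attachment_sites(child, label))
    return sites


print(pp_attachment_sites(VERB_ATTACH))
print(pp_attachment_sites(NOUN_ATTACH))
```

A check of this kind could be run over parser output to count how often each attachment is produced for a given preposition.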
Passing sentences (26) and (27) to the Stanford Arabic parser which was trained on
the ATB annotated corpus of MSA, we get the parse trees shown in (28) and (29).
(28) Parse of sentence (26)
aṭaytu kitaban lialy
Tagging
aṭaytu/VBD kitaban/NN lialy/JJ
Parse
(ROOT
  (S
    (VP (VBD aṭaytu)
      (NP (NN kitaban) (JJ lialy)))))
The parse tree above shows the preposition modifying the NP and not the verb.
(29) Parse of sentence (27)
qaratu kitaban lialy
Tagging
qaratu/VBD kitaban/NN lialy/JJ
Parse
(ROOT
  (S
    (VP (VBD qaratu)
      (NP (NN kitaban) (JJ lialy)))))
The parse tree of sentence (27) is the correct one because it shows the preposition as
modifying the NN. However, the parser assigns exactly the same structure to (26),
whereas it should have detected the difference in prepositional attachment between
(26) and (27). Since the parser is trained on the annotation provided by the ATB, it
is clear that it did not have enough data to be able to learn to distinguish between
the two possible attachments. This is an area where the ATB could provide corpora
focusing on the problematic areas in Arabic computational linguistics. Figures 4 and 5
below show the difference in structure between (26) and (27).
Another problem that faces Arabic computational linguists is finding where a constituent
ends. It is usually easy to learn where a syntactic constituent begins: it usually
has a head, and the head usually occurs in the early part of a constituent in head-initial
languages. It is difficult, especially in Arabic, to define the point where one NP ends and
the next begins. Consider the following sentence:
[S [VP [V aṭaytu] [NP [PropN kitaban]] [PrepP [Prep li] [NP [PropN aly]]]]]
FIGURE 4 The prepositional phrase as an argument of the verb
[S [VP [V qaratu] [NP [PropN kitaban] [PrepP [Prep li] [NP [PropN aly]]]]]]
FIGURE 5 The prepositional phrase modifying the head noun
(30) qabalat al-mudyratu l-ḍayfa
     met-she the-manager the-guest
     'The manager met the guest.'
A speaker of Arabic would say this sentence consists of a verb followed by a subject
and an object and is a regular VSO Arabic sentence. We passed the sentence to the
Stanford Arabic parser. The following is the parse tree.
(31) qabalat al-mudyratu l-ḍayfa
Tagging
qabalat/VBD al-mudyrat/DTNN al-ḍayf/DTNN
Parse
(ROOT
  (S
    (VP (VBD qabalat)
      (NP
        (NP (DTNN al-mudyratu))
        (NP (DTNN al-ḍayfa))))))
In this parse tree, only two nodes are immediately dominated by the
VP: the VBD and the higher NP. The higher NP node immediately dominates
two NPs, which are the subject and the direct object of the sentence. Assigning the
subject and direct object to one node treats them as a single argument. One would expect
the direct object to be a sister of the verb, on a par with the subject. The
problem here is to find out when two noun phrases form one constituent, and thus can
have a higher NP node dominating them, and when they are two different arguments and
should not share an immediately dominating node. For a parser to learn to
differentiate the two structures, large annotated corpora focusing on the problematic
areas in Arabic computational linguistics would be extremely beneficial.
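The structural point about the daughters of the VP can be checked mechanically. The sketch below reads a bracketed parse and lists the labels immediately dominated by the VP. The helper names are hypothetical, the transliterations are ASCII-only, and `EXPECTED_VSO` is an assumed rendering of the flat analysis argued for here, not actual parser output.

```python
import re


def parse_brackets(text):
    """Parse a bracketed tree into (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected an opening bracket")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def find(tree, label):
    """Return the first node with the given label (depth-first), or None."""
    if tree[0] == label:
        return tree
    for child in tree[1]:
        if isinstance(child, tuple):
            hit = find(child, label)
            if hit is not None:
                return hit
    return None


def daughters(tree, label):
    """Labels of the nodes immediately dominated by the first `label` node."""
    node = find(tree, label)
    return [child[0] for child in node[1] if isinstance(child, tuple)]


PARSER_OUTPUT = "(ROOT (S (VP (VBD qabalat) (NP (NP (DTNN al-mudyratu)) (NP (DTNN al-dayfa))))))"
EXPECTED_VSO = "(ROOT (S (VP (VBD qabalat) (NP (DTNN al-mudyratu)) (NP (DTNN al-dayfa)))))"

print(daughters(parse_brackets(PARSER_OUTPUT), "VP"))
print(daughters(parse_brackets(EXPECTED_VSO), "VP"))
```

In the parser output the VP has two daughters, while the flat analysis gives it three, with subject and object as separate sisters of the verb.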
5 Conclusion
We have looked at the Arabic language from historical, cultural, strategic and linguistic
viewpoints. The stability of the Arabic language, its inseparable ties to Islam, and its
interesting linguistic properties raise many questions for research. Furthermore,
an understanding of its history, culture and linguistic properties is key to productive
cultural dialog with the West. Interest in the Arabic language and culture has surged in the
last few years and, as a result, several state-of-the-art computational tools have been
developed using machine learning and engineering knowledge. Such efforts could benefit
from a fresh look at the Arabic language and its structure. We presented some of the
contributions of traditional Arabic grammarians and pointed out that the grammar they
developed, which works well for Classical Arabic, does not provide sufficient rules for
computational treatments of MSA texts. Finally, we discussed computational Arabic
morphology and pointed to the absence of descriptions of the morphology-syntax interface.
We also looked at the recent Arabic treebank guidelines and pointed out some possible
improvements.
References
Abe, Shigeo. 2005. Support Vector Machines for Pattern Classification. Advances in Pattern
Recognition. London: Springer Verlag.
Antonius, George. 1969. The Arab Awakening: The Story of the Arab National Movement.
Beirut, Lebanon: Librairie du Liban.
Aoun, Joseph and Yen-hui Audrey Li. 2003. Essays on the Representational and Derivational
Nature of Grammar: The Diversity of Wh-Constructions. Boston, Massachusetts: The MIT
Press.
Arjomand, Said Amir. 2009. The constitution of Medina: A sociolegal interpretation of
Muhammad's acts of foundation of the umma. International Journal of Middle East Studies 41(4).
Badawi, Al-Saeed Muhammad. 1973. Mustawayaatu al-arabiyya a-muaasira misr. Cairo,
Egypt: Dar al-maaarif.
Baldi, Pierre and Søren Brunak. 2001. Bioinformatics: The Machine Learning Approach.
Massachusetts: MIT Press, 2nd edn.
Bateson, Mary Catherine. 1967. Arabic Language Handbook. Washington, DC: Center for
Applied Linguistics.
Bauman, Richard. 1977. Verbal Art as Performance. Rowley, Massachusetts: Newbury House.
Beesley, Kenneth. 2001. Finite-state morphological analysis and generation of Arabic at Xerox
research: Status and plans in 2001. In ACL 39th Meeting. Proceedings of the Workshop on
Arabic Language Processing: Status and Prospects, pages 1–8. Toulouse.
Bender, Emily. 2009. Linguistically naïve != language independent: Why NLP needs linguistic
typology. In Proceedings of the EACL 2009 Workshop on the Interaction between Linguistics
and Computational Linguistics, pages 26–32. Athens, Greece.
Benmamoun, Elabbas. 2000. The Feature Structure of Functional Categories: A Comparative
Study of Arabic Dialects. Oxford: Oxford University Press.
Biesterfeldt, Hans Hinrich. 2000. Medieval Arabic encyclopedias of science and philosophy. In
S. Harvey, ed., The Medieval Hebrew Encyclopedias of Science and Philosophy. Dordrecht:
Kluwer Academic Publishers.
Bishop, Christopher M. 2007. Pattern Recognition and Machine Learning. Information Science
and Statistics. London: Springer.
Bresnan, Joan. 2000. Lexical-Functional Syntax. Malden, Massachusetts, USA: Blackwell
Publishers Inc.
Broadwell, George Aaron. 2005. It ain't necessarily S(V)O: Two kinds of VSO languages. In
The 10th International LFG Conference. Bergen, Norway.
Buckwalter, Tim. 2002. Buckwalter Arabic morphological analyzer version 1.0. Linguistic Data
Consortium, Catalog Number LDC 2002 L49.
Cherkassky, Vladimir and Filip M. Mulier. 2007. Learning from Data: Concepts, Theory and
Methods. New Jersey: John Wiley and Sons.
Chomsky, Noam. 1965. Aspects of the Theory of Syntax. Cambridge, Mass.: MIT Press.
Chomsky, Noam. 1981. Lectures on Government and Binding. Dordrecht: Foris.
Chomsky, Noam. 1986. Knowledge of Language: Its Nature, Origin and Use. New York: Praeger
Publishers.
Chomsky, Noam. 1995. The Minimalist Program. Cambridge, Massachusetts: The MIT Press.
CIA. 2008. CIA World Fact Book. Washington, D.C.: Central Intelligence Agency.
Eid, Mushira. 1980. On the function of pronouns in Egyptian Arabic. In Discourse Symposium,
University of Wisconsin. Milwaukee, Wisconsin.
El-Raghi, Abdu. 1979. Fiqh Al-lugha l-Kutub Al-Arabiyya. Beirut, Lebanon: Dar Al-NahDa
al-Arabiyya.
Elgibali, Alaa, ed. 1996. Understanding Arabic. Cairo, Egypt: The American University in
Cairo Press.
Farghaly, Ali. 1981. Topics in the Syntax of Egyptian Arabic. Ph.D. thesis, The University of
Texas at Austin.
Farghaly, Ali. 1982. Subject pronoun deletion rule. In Second National Symposium on English
Language Teaching in Egypt: Discourse Analysis, pages 81–90. Cairo University, Egypt.
Farghaly, Ali. 1987. Three level morphology for Arabic. In Arabic Morphology Workshop.
Linguistic Institute, Stanford University.
Farghaly, Ali. 1994. Discontinuity in the Arabic lexicon: A case from Arabic morphology. In
International Conference on Arabic Linguistics. The American University in Cairo, Cairo,
Egypt.
Farghaly, Ali. December 2008. Arabic NLP: Overview, state of the art, challenges and op-
portunities. In The International Arab Conference on Information Technology, ACIT2008.
Hammamat, Tunisia. Invited Talk.
Farghaly, Ali and Jean Senellart. 2003. Intuitive coding of the Arabic lexicon. In Proceedings
of the IX MT Summit. New Orleans.
Fehri, Abdelkader Fassi. 1993. Issues in the Structure of Arabic Clauses and Words. Dordrecht
& Boston: Kluwer Academic Publishers.
Ferguson, Charles. 1959a. Arabic Koine. Language pages 616–630.
Ferguson, Charles. 1959b. Diglossia. Word pages 325–340.
Harvey, Steven. 2000. Introduction. In S. Harvey, ed., The Medieval Hebrew Encyclopedias of
Science and Philosophy, pages 1–30. Dordrecht: Kluwer Academic Publishers.
Hlal, Yehya. 1979. Méthode d'apprentissage pour l'analyse morphosyntaxique. Ph.D. thesis,
Université Paris-Sud, Centre d'Orsay.
Hlal, Yehya. 1985. Morphological analysis of Arabic speech. In Conference on Computer
Processing of the Arabic Language. Kuwait.
Holt, Mike. 1995. Divided loyalties: Language and ethnic identity in the Arab world. In
Y. Suleiman, ed., Arabic Sociolinguistics: Issues and Perspectives, pages 11–23. Richmond,
Surrey: Curzon Press Limited.
Holt, P. M. 1966. Egypt and the Fertile Crescent: 1516–1922. Ithaca, New York, USA: Cornell
University Press.
Ibn Khaldun, translated by Rosenthal Franz. 1958. The Muqadimma of Ibn Khaldun: An
Introduction to History. New Jersey: Princeton University Press.
Jackendoff, Ray. 1977. X-bar-Syntax: A Study of Phrase Structure. Cambridge, MA: MIT
Press.
Joachims, Thorsten. 2002. Learning to Classify Text Using Support Vector Machines: Methods,
Theory and Algorithms. The International Series in Engineering and Computer Science.
Dordrecht, The Netherlands: Kluwer Academic Publisher.
Johnstone, Barbara. 1983. Presentation as proof: The language of Arabic rhetoric.
Anthropological Linguistics 25:47–60.
Johnstone, Barbara. 1991. Repetition in Arabic Discourse. Amsterdam/Philadelphia: John
Benjamins Publishing Company.
Jones, Karen. 2007. Computational Linguistics: What about the Linguistics? Computational
Linguistics 33(3):437–441.
Jurafsky, Daniel and James H. Martin. 2000. Speech and Natural Language Processing: An
Introduction to Natural Language Processing. New Jersey: Prentice Hall.
Kay, Martin. 1973. The mind system. In R. Randall, ed., Natural Language Processing, pages
155–188. New York, NY: Algorithmic Press.
Maamouri, Mohamed et al. 2009. Penn Arabic Treebank Guidelines, version 4.92. Tech. rep.,
University of Pennsylvania.
Manning, Christopher and Hinrich Schuetze. 1999. Foundations of Statistical Natural Language
Processing. Massachusetts: MIT Press.
McCarthy, John. 1981. A prosodic theory of nonconcatenative morphology. Linguistic Inquiry
12:373–418.
McCarthy, John and Alan Prince. 1990. Prosodic morphology and templatic morphology. In
Perspectives on Arabic Linguistics II: Papers from the Second Annual Symposium on Arabic
Linguistics, pages 1–54. Amsterdam: Benjamins.
Mesthrie, Rajend et al. 2000. Introducing Sociolinguistics. Philadelphia: John Benjamins
Publishing Company.
Oepen, Stephan et al. 2000. Introduction. Natural Language Engineering 6:1–14.
Pereira, Fernando and D. Warren. 1980. Definite clause grammars for language analysis.
Artificial Intelligence 13:231–278.
Peretz, Don. 1994. The Middle East Today. Westport, CT: Praeger Publishers.
Perlmutter, David. 1972. Deep and Surface Structure Constraints in Syntax. New York: Holt,
Rinehart and Winston.
Pollard, Carl and Ivan Sag. 1994. Head-Driven Phrase Structure Grammar. Chicago University
Press and CSLI Publications.
Reiter, Ehud. 2007. The Shrinking Horizon of Computational Linguistics. Computational
Linguistics 33(2):283–287.
Sag, Ivan A. and Thomas Wasow. 1999. Syntactic Theory: A Formal Introduction. Stanford,
California: CSLI Publications, Center for the Study of Language and Information.
Shieber, Stuart. 1986. Introduction to Unification-based Approaches to Grammar. CSLI.
Stankiewicz, Edward. 1960. Linguistics and the study of poetic language. In T. A. Sebeok, ed.,
Style in Language, pages 69–81. Cambridge, Massachusetts: MIT Press.
Suleiman, Yasir, ed. 1994. Nationalism and the Arabic Language: A Historical Overview, pages
3–24. Surrey: Curzon Press Ltd.
Tibi, B. 1990. Islam and the Cultural Accommodation of Social Change. San Francisco: West-
view Press.
Versteegh, Kees. 1997a. The Arabic Language. New York: Columbia University Press.
Versteegh, Kees. 1997b. The Arabic Linguistic Tradition. New York: Routledge.
Zaenen, Annie. 2006. Mark-up Barking Up the Wrong Tree. Computational Linguistics
32(4):557–580.