Seminar Guideline


Chapter 2

Morphological Processing of Semitic Languages

Shuly Wintner

2.1 Introduction

In NLP, morphology is the study of how words are built up from smaller
meaningful units called morphemes [1]. These morphemes are the building
blocks of words, and understanding them helps computers process language
more effectively.

Here's a breakdown of key concepts in NLP morphology:

 Morphemes: The smallest units of meaning in a language. They cannot be broken down further into smaller meaningful parts. There are two main types:
   o Stems: The core meaning-carrying unit of a word. For example, "play" in "playing" is the stem.
   o Affixes: Bound morphemes that attach to stems to modify their meaning or grammatical function. Examples include prefixes (un- in unhappy), suffixes (-ing in playing), and infixes (in some languages).
 Morphological Analysis: The process of breaking down a word into its constituent morphemes. This helps NLP tasks like:
   o Lemmatization: Reducing words to their base form (lemma) for better understanding. For instance, "playing" and "played" would both be reduced to "play".
   o Part-of-Speech Tagging: Identifying the grammatical function of a word (noun, verb, adjective, etc.) based on its morphemes. For example, the "-ed" suffix often indicates past tense for verbs.
 Morphological Generation: The process of building words by combining stems and affixes. This can be useful for tasks like:
   o Machine Translation: Understanding how morphemes are combined in one language to create their equivalent in another.
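The analysis and lemmatization steps above can be illustrated with a minimal suffix-stripping sketch; the tiny lexicon and suffix table here are illustrative toy data, not a real analyzer:

```python
# Minimal morphological analysis sketch: strip known suffixes to
# recover a stem (lemma) and the grammatical information the suffix
# carries. The lexicon and suffix table are illustrative toy data.

LEXICON = {"play", "dog", "walk", "book"}
SUFFIXES = {"ing": "progressive", "ed": "past", "s": "plural-or-3sg"}

def analyze(word):
    """Return (stem, feature) pairs for every segmentation licensed
    by the lexicon; a bare lexicon word analyzes as (word, None)."""
    analyses = []
    if word in LEXICON:
        analyses.append((word, None))
    for suffix, feature in SUFFIXES.items():
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in LEXICON:
                analyses.append((stem, feature))
    return analyses

print(analyze("playing"))  # [('play', 'progressive')]
print(analyze("books"))    # [('book', 'plural-or-3sg')]
```

Note that returning a list of analyses, rather than a single one, already anticipates the morphological ambiguity discussed later in the chapter.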

Overall, morphology plays a crucial role in NLP by helping computers understand the structure and meaning of words, leading to more accurate and sophisticated language processing tasks.

Morphological analysis is the process of breaking down words into their constituent morphemes, which are the smallest units of meaning within a language. Morphemes can be classified into two main types:

Free Morphemes: These are morphemes that can stand alone as words and
carry meaning by themselves. For example, in English, words like "book,"
"run," and "happy" are free morphemes.

Bound Morphemes: These are morphemes that cannot stand alone as words
and must be attached to other morphemes to convey meaning. Bound
morphemes include prefixes (e.g., "un-" in "unhappy"), suffixes (e.g., "-s" in
"books"), and infixes (morphemes inserted within a word, such as the plural
marker "-en-" in "oxen").

The process of morphological analysis involves identifying and categorizing these morphemes within words to understand how they contribute to meaning and grammatical structure. This analysis can reveal insights into word formation, inflectional patterns, and derivational processes within a language.

This chapter addresses morphological processing of Semitic languages. In light of the complex morphology and problematic orthography of many of the Semitic languages, the chapter begins with a recapitulation of the challenges these phenomena pose for computational applications. It then discusses the approaches that were suggested to cope with these challenges in the past. The bulk of the chapter then discusses available solutions for morphological processing, including analysis, generation, and disambiguation, in a variety of Semitic languages. The concluding section discusses future research directions.
Semitic languages are characterized by complex, productive morphology, with
a basic word-formation mechanism, root-and-pattern, that is unique to languages
of this family. Morphological processing of Semitic languages therefore
necessitates technology that can successfully cope with these complexities.
Several linguistic theories, and, consequently, computational linguistic
approaches, are often developed with a narrow set of (mostly European) languages
in mind. The adequacy of such approaches to other families of languages is
sometimes sub-optimal. A related issue is the long tradition of scholarly work on
some Semitic languages, notably Arabic [109] and Amharic [11], which cannot
always be easily consolidated with contemporary approaches.
Inconsistencies between modern, English-centric approaches and traditional
ones are easily observed in matters of lexicography. In order to annotate corpora
or produce tree-banks, an agreed-upon set of part-of-speech (POS) categories is
required. Since early approaches to POS tagging were limited to English,
resources for other languages tend to use “tag sets”, or inventories of categories,
that are minor modifications of the standard English set. Such an adaptation is
problematic for Semitic languages.
These issues are complicated further when morphology is considered. The rich,
non-concatenative morphology of Semitic languages frequently requires
innovative solutions that standard approaches do not always provide.

2.2 Basic Notions

The word ‘word’ is one of the most loaded and ambiguous notions in linguistic
theory [76]. Since most computational applications deal with written texts (as
opposed to spoken language), the most useful notion is that of an orthographic
word. This is a string of characters, from a well-defined alphabet of letters,
delimited by spaces, or other delimiters, such as punctuation. A text typically
consists of sequences of orthographic words, delimited by spaces or punctuation;
orthographic words in a text are often referred to as tokens.
Orthographic words are frequently not atomic: they can be further divided to
smaller units, called morphemes. Morphemes are the smallest meaning-bearing
linguistic elements; they are elementary pairings of form and meaning.
Morphemes can be either free, meaning that they can occur in isolation, as a single
orthographic word; or bound, in which case they must combine with other
morphemes in order to yield a word. For example, the word two consists of a
single (free) morpheme, whereas dogs consists of two morphemes: the free
morpheme dog, combined with the bound morpheme -s. The preceding dash indicates that this form must combine with other morphemes; its
function is, of course, denoting plurality. When a word consists of some free
morpheme, potentially with combined bound morphemes, the free morpheme is
called a stem, or sometimes root.
Bound morphemes are typically affixes. Affixes come in many varieties: prefixes attach before the stem (e.g., re- in reconsider), suffixes attach after the stem (-ing in dreaming), and infixes occur inside a stem.

Morphological processes define the shape of words. They are usually classified into two types. Derivational morphology deals with word formation; such processes can create new words from existing ones, potentially changing the category of the original word. For example, the processes that create faithfulness
from faithful, and faithful from faith, are derivational. Such processes are typically
not highly productive; for example, one cannot derive *loveful from love. In
contrast, inflectional morphology yields inflected forms, variants of some base, or
citation form, of words; these forms are constructed to adhere to some syntactic
constraints, but they do not change the basic meaning of the base form.
Inflectional processes are usually highly productive, applying to most members of
a particular word class. For example, English nouns inflect for number, so most
nouns occur in two forms, the singular (which is considered the citation form) and
the plural, regularly obtained by adding the suffix -s to the base form.
Word formation in Semitic languages is based on a unique mechanism, known
as root-and-pattern. Words in this language family are often created by the
combination of two bound morphemes, a root and a pattern. The root is a
sequence of consonants only, typically three; and the pattern is a sequence of
vowels and consonants with open slots in it. The root combines with the pattern
through a process called interdigitation: each letter of the root (radical) fills a slot
in the pattern.
In addition to the unique root-and-pattern morphology, Semitic languages are
characterized by a productive system of more standard affixation processes. These
include prefixes, suffixes, infixes and circumfixes, which are involved in both
inflectional and derivational processes.
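The interdigitation process just described can be sketched in a few lines: each radical of the root fills the next open slot in the pattern. The root k-t-b and the CV patterns below are illustrative; actual Arabic and Hebrew patterns are more intricate, and the outputs only roughly correspond to real word forms:

```python
def interdigitate(root, pattern, slot="C"):
    """Fill each open slot (marked 'C') in the pattern with the next
    radical of the root, in order."""
    radicals = iter(root)
    return "".join(next(radicals) if ch == slot else ch for ch in pattern)

# The triliteral root k-t-b (broadly associated with 'writing'),
# combined with two illustrative patterns:
print(interdigitate("ktb", "CaCaC"))   # katab
print(interdigitate("ktb", "maCCuC"))  # maktub
```

The same function works for any root length, as long as the number of radicals matches the number of open slots in the pattern.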

2.3 The Challenges of Morphological Processing

Morphological processing is a crucial component of many natural language processing (NLP) applications. Whether the goal is information retrieval, question
answering, text summarization or machine translation, NLP systems must be
aware of word structure. For some languages and for some applications, simply
stipulating a list of surface forms is a viable option; this is not the case for
languages with complex morphology, in particular Semitic languages, both
because of the huge number of potential forms and because of the difficulty of
such an approach to handle out-of-lexicon items (in particular, proper names), which may combine
with prefix or suffix particles.
An alternative solution would be a dedicated morphological analyzer,
implementing the morphological and orthographic rules of the language.
Ideally, a morphological analyzer for any language should reflect the rules
underlying derivational and inflectional processes in that language. Of course,
the more complex the rules, the harder it is to construct such an analyzer. The
main challenge of morphological analysis of Semitic languages stems from the
need to faithfully implement a complex set of interacting rules, some of which are
non-concatenative. Once such a grammar is available, it typically produces more
than one analysis for any given surface form; in other words, Semitic languages
exhibit a high degree of morphological ambiguity, which has to be resolved in a
typical computational application. The level of morphological ambiguity is higher
in many Semitic languages than it is in English, due to the rich morphology and
deficient orthography. This calls for sophisticated methods for disambiguation.
While in English (and other European languages) morphological disambiguation
may amount to POS tagging, Semitic languages require more effort, since
determining the correct POS of a given token is intertwined with the problem of
segmenting the token to morphemes, the set of morphological features (and their
values) is larger, and consequently the number of classes is too large for standard
classification techniques. Several models were
proposed to address these issues.
Contemporary approaches to part-of-speech tagging are all based on machine
learning: a large corpus of text is manually annotated with the correct POS for
each word; then, a classifier is trained on the annotated corpus, resulting in
a system that can predict POS tags for unseen texts with high accuracy. The
state of the art in POS tagging for English is extremely good, with accuracies
that are indistinguishable from human level performance. Various classifiers were
built for this task, implementing a variety of classification techniques, such as
Hidden Markov Models (HMM) [26], Average Perceptron [37], Maximum
Entropy [111, 130, 131, 133, 134], Support Vector Machines (SVM) [58], and
others.
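As a rough illustration of the HMM approach mentioned above, the following sketch runs Viterbi decoding over a toy bigram model; all tags, words, and probabilities are invented for illustration, not estimated from a corpus:

```python
# Toy bigram HMM POS tagger with Viterbi decoding. Probabilities are
# hand-set, illustrative values (a real tagger estimates them from an
# annotated corpus).

TAGS = ["DET", "NOUN", "VERB"]
TRANS = {  # P(tag | previous tag); "<s>" is the start state
    "<s>": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
    "DET": {"DET": 0.01, "NOUN": 0.9, "VERB": 0.09},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {  # P(word | tag)
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "walk": 0.2, "barks": 0.3},
    "VERB": {"barks": 0.6, "walk": 0.4},
}

def viterbi(words):
    """Return the most probable tag sequence for `words`."""
    # chart[i][t] = (best probability of tagging words[:i+1] ending
    # in tag t, backpointer to the previous tag)
    chart = [{t: (TRANS["<s>"].get(t, 0) * EMIT[t].get(words[0], 0), None)
              for t in TAGS}]
    for word in words[1:]:
        col = {}
        for t in TAGS:
            best_prev, best_p = None, 0.0
            for prev in TAGS:
                p = chart[-1][prev][0] * TRANS[prev].get(t, 0) * EMIT[t].get(word, 0)
                if p > best_p:
                    best_prev, best_p = prev, p
            col[t] = (best_p, best_prev)
        chart.append(col)
    # Follow backpointers from the best final state.
    tag = max(TAGS, key=lambda t: chart[-1][t][0])
    tags = [tag]
    for col in reversed(chart[1:]):
        tag = col[tag][1]
        tags.append(tag)
    return list(reversed(tags))

print(viterbi(["the", "dog", "barks"]))  # ['DET', 'NOUN', 'VERB']
```

The difficulties discussed next explain why this simple per-token scheme does not transfer directly to Semitic languages: one token may hide several tags, and the tagset itself is far larger.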
For languages with complex morphology, and Semitic languages in particular,
however, these standard techniques do not perform as well, for several reasons:
1. Due to issues of orthography, a single token in several Semitic languages can
actually be a sequence of more than one lexical item, and hence be associated
with a sequence of tags.
2. The rich morphology implies a much larger tagset, since tags reflect the wealth
of morphological information which words exhibit. The richer tagset
immediately implies problems of data sparseness, since most of the tags occur
only rarely, if at all, in a given corpus.
3. As a result of both orthographic deficiencies and morphological wealth, word
forms in Semitic languages tend to be ambiguous.
4. Word order in Semitic is relatively free, and in any case freer than in English.
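The first difficulty listed above, a single token hiding a sequence of lexical items, can be made concrete with a toy lexicon-driven segmenter. The single-letter "particles" and the stems below are schematic stand-ins (loosely inspired by Hebrew/Arabic conjunction, preposition, and article clitics), not actual data:

```python
# Toy segmenter: enumerate every way of splitting a token into a
# (possibly empty) sequence of prefix particles followed by a stem.
# Particles and stems are schematic, illustrative data.

PARTICLES = {"w", "b", "h"}            # schematic clitic particles
STEMS = {"it", "hit", "bit", "wbhit"}  # schematic stems

def segmentations(token, prefix=()):
    """Yield every particle*-then-stem segmentation of `token`."""
    if token in STEMS:
        yield prefix + (token,)
    for p in PARTICLES:
        if token.startswith(p) and len(token) > len(p):
            yield from segmentations(token[len(p):], prefix + (p,))

for analysis in sorted(segmentations("wbhit")):
    print(analysis)
# ('w', 'b', 'h', 'it')
# ('w', 'b', 'hit')
# ('wbhit',)
```

Even this tiny example yields three competing segmentations for one token; a tagger must jointly choose the segmentation and the tags, which is exactly the coupling described above.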

2.4 Computational Approaches to Morphology



No single method exists that provides an adequate solution for the challenges
involved in morphological processing of Semitic languages. The most common
approach to morphological processing of natural language is finite-state
technology [22, 81, 83, 89, 113]. The adequacy of this technology for Semitic
languages has frequently been challenged, but clearly, with some sophisticated
developments, such as flag diacritics [19], multi-tape automata [88] or registered
automata [36], finite-state technology has been effectively used for describing the
morphological structure of several Semitic languages [8, 16, 17, 68, 85, 88, 138].

2.4.1 Two-Level Morphology

Two-level morphology was “the first general model in the history of computational linguistics for the analysis and generation of morphologically complex languages” [84]. Developed by Koskenniemi [89], this technology
complex languages” [84]. Developed by Koskenniemi [89], this technology
facilitates the specification of rules that relate pairs of surface strings through
systematic rules. Such rules, however, do not specify how one string is to be
derived from another; rather, they specify mutual constraints on those strings.
Furthermore, rules do not apply sequentially. Instead, a set of rules, each of which
constrains a particular string pair correspondence, is applied in parallel, such
that all the constraints must hold simultaneously. In practice, one of the strings
in a pair would be a surface realization, while the other would be an underlying
form.
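The parallel-constraint idea can be sketched as follows: a lexical string and a surface string are aligned symbol by symbol, and an alignment is accepted only if every rule holds at every position simultaneously. The rules below encode a toy version of English e-deletion ("move+ing" → "moving"); real two-level rules are compiled into finite-state transducers, which this sketch deliberately omits:

```python
# Minimal two-level sketch: '0' marks a deleted (epsilon) symbol, and
# a set of rules constrains lexical:surface pairs IN PARALLEL; the
# alignment is valid only if every rule holds at every position.

def e_deletion_rule(pairs, i):
    """Lexical 'e' surfaces as '0' if and only if a morpheme boundary
    '+' follows (toy version of English e-deletion)."""
    lex, srf = pairs[i]
    before_boundary = i + 1 < len(pairs) and pairs[i + 1][0] == "+"
    if lex == "e" and srf == "0":
        return before_boundary
    if lex == "e" and srf == "e":
        return not before_boundary
    return True

def boundary_rule(pairs, i):
    """The morpheme boundary '+' is always realized as '0'."""
    lex, srf = pairs[i]
    return srf == "0" if lex == "+" else True

RULES = [e_deletion_rule, boundary_rule]

def accepts(lexical, surface):
    """True iff every rule holds at every position of the alignment."""
    pairs = list(zip(lexical, surface))
    return len(lexical) == len(surface) and all(
        rule(pairs, i) for rule in RULES for i in range(len(pairs)))

# aligned as  m:m o:o v:v e:0 +:0 i:i n:n g:g
print(accepts("move+ing", "mov00ing"))  # True
print(accepts("move+ing", "move0ing"))  # False: 'e' kept before '+i'
```

Note that neither rule "derives" the surface form from the lexical one; each merely constrains the pair, which is the defining property of the two-level model.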

2.4.2 Registered Automata

Finite-state registered automata [36] were developed with the goal of facilitating
the expression of various non-concatenative morphological phenomena in an
efficient way. The main idea is to augment standard finite-state automata with
(finite) amount of memory, in the form of registers associated with the automaton
transitions. This is done in a restricted way that saves space but does not add
expressivity. The number of registers is finite, usually small, and eliminates the
need to duplicate paths as it enables the automaton to ‘remember’ a finite number
of symbols. Technically, each arc in the automaton is associated (in addition to an
alphabet symbol) with an action on the registers. Cohen-Sygal and Wintner [36]
define two kinds of actions, read and write. The former allows an arc to be
traversed only if a designated register contains a specific symbol. The latter writes
a specific symbol into a designated register when an arc is traversed.
Cohen-Sygal and Wintner [36] show that finite-state registered automata can
efficiently model several non-concatenative morphological phenomena, including
circumfixation, root-and-pattern word formation in Semitic languages, vowel harmony, limited reduplication, etc. The representation is highly efficient: for
example, to account for all the possible combinations of r roots and p patterns, an
ordinary FSA requires O(r × p) arcs whereas a registered automaton requires only O(r + p) arcs. Unfortunately, no implementation of the model exists as part of an available finite-state toolkit.
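Since no toolkit implementation exists, the following toy sketch (assuming the read/write actions described above, with invented state names and data) may help illustrate the idea: a register written on one arc licenses a later arc, so a single shared path can serve several circumfixes:

```python
# Toy finite-state registered automaton, loosely after Cohen-Sygal and
# Wintner: each arc carries a symbol and an optional register action.
# ('write', reg, val) stores val; ('read', reg, val) allows traversal
# only if the register holds val. Illustrated with a schematic
# circumfix: prefix A_ must pair with suffix _a, prefix B_ with _b.

class FSRA:
    def __init__(self, arcs, start, finals):
        self.arcs, self.start, self.finals = arcs, start, finals

    def accepts(self, word, state=None, i=0, regs=None):
        state = self.start if state is None else state
        regs = {} if regs is None else regs
        if i == len(word):
            return state in self.finals
        for (src, sym, dst, action) in self.arcs:
            if src != state or sym != word[i]:
                continue
            new_regs = dict(regs)
            if action is not None:
                kind, reg, val = action
                if kind == "read" and regs.get(reg) != val:
                    continue
                if kind == "write":
                    new_regs[reg] = val
            if self.accepts(word, dst, i + 1, new_regs):
                return True
        return False

arcs = [
    (0, "A", 1, ("write", "circ", "A")),
    (0, "B", 1, ("write", "circ", "B")),
    (1, "x", 2, None),   # the stem path is shared by both circumfixes
    (2, "y", 3, None),
    (3, "a", 4, ("read", "circ", "A")),
    (3, "b", 4, ("read", "circ", "B")),
]
m = FSRA(arcs, 0, {4})
print(m.accepts("Axya"))  # True: matching circumfix A..a
print(m.accepts("Axyb"))  # False: mismatched circumfix A..b
```

Because the register remembers which prefix was taken, the stem path x-y is not duplicated per circumfix; this is the same kind of saving that reduces O(r × p) arcs to O(r + p) for roots and patterns.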

2.4.3 Analysis by Generation

Most of the approaches discussed above allow for a declarative specification of (morphological) grammar rules, from which both analyzers and generators can be
created automatically. A simpler, less generic, yet highly efficient approach to the morphology of Semitic languages has been popular in actual applications. In
this framework, which we call analysis by generation here, the morphological
rules that describe word formation and/or affixation are specified in a way that
supports generation, but not necessarily analysis. Coupled with a lexicon of morphemes (typically, base forms and concatenative affixes), such rules can be applied in one direction to generate all the surface forms of the language. This can be done off-line, and the generated forms can then be stored in a database; analysis, in this paradigm, amounts more or less to simple table lookup.
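The generate-then-look-up scheme can be sketched directly; the lexicon and generation rules below are illustrative English toy data, standing in for the far richer rules of a Semitic language:

```python
# Analysis-by-generation sketch: apply generation-only rules offline
# to a small lexicon, store every surface form in a table, and reduce
# analysis to table lookup. Lexicon and rules are illustrative.

LEXICON = {"walk": "verb", "dog": "noun"}

def generate(stem, pos):
    """Yield (surface, description) pairs; rules run in one direction
    only, from stem to surface form."""
    yield stem, f"{pos}:base"
    if pos == "verb":
        yield stem + "ed", f"{pos}:past"
        yield stem + "ing", f"{pos}:prog"
    if pos == "noun":
        yield stem + "s", f"{pos}:plural"

# Off-line phase: exhaustively generate and index all surface forms.
TABLE = {}
for stem, pos in LEXICON.items():
    for surface, desc in generate(stem, pos):
        TABLE.setdefault(surface, []).append((stem, desc))

# On-line phase: analysis is now a simple lookup.
def analyze(word):
    return TABLE.get(word, [])

print(analyze("walked"))  # [('walk', 'verb:past')]
print(analyze("dogs"))    # [('dog', 'noun:plural')]
```

The trade-off is visible even in this toy: lookup is trivially fast, but out-of-lexicon items get no analysis at all, which is exactly the weakness noted earlier for list-based approaches.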
Some of the very first morphological processors of Semitic languages were
developed in this way. Probably the first example is the Hebrew morphological
system of [29].

2.4.4 Functional Morphology

Functional morphology [51] is a computational framework for defining language resources, in particular lexicons. It is a language-independent tool, based on a word-and-paradigm model, which allows the grammar writer to specify the
inflectional paradigms of words in a specific language in a similar way to printed
paradigm tables. A lexicon in functional morphology consists of a list of words,
each associated with its paradigm name, and an inflection engine that can apply
the inflectional rules of the language to the words of the lexicon.
This framework was used to define morphological grammars for several
languages, including modeling of non-concatenative processes such as vowel
harmony, reduplication, and templatic morphology. In particular, this framework was used to implement a morphological processor of Arabic.
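A minimal sketch of the word-and-paradigm idea, loosely in the spirit of functional morphology (the paradigms and lexicon below are illustrative English toy data, and the actual framework is a typed Haskell library, not Python):

```python
# Word-and-paradigm sketch: a paradigm is a function from a dictionary
# word to its full inflection table; the lexicon merely pairs each
# word with a paradigm name.

def regular_noun(w):
    return {"sg": w, "pl": w + "s"}

def regular_verb(w):
    return {"base": w, "past": w + "ed", "prog": w + "ing"}

PARADIGMS = {"noun1": regular_noun, "verb1": regular_verb}
LEXICON = [("dog", "noun1"), ("walk", "verb1")]

def inflect(word, paradigm):
    """The 'inflection engine': apply the named paradigm to the word."""
    return PARADIGMS[paradigm](word)

for word, paradigm in LEXICON:
    print(word, inflect(word, paradigm))
# dog {'sg': 'dog', 'pl': 'dogs'}
# walk {'base': 'walk', 'past': 'walked', 'prog': 'walking'}
```

The appeal of the model is that each paradigm reads like a printed paradigm table, and adding a word to the lexicon requires naming its paradigm, nothing more.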

2.5 Morphological Analysis and Generation of Semitic Languages

While much effort was put into the development of systems for processing
(Modern Standard) Arabic and Hebrew, for other languages the development of
such tools lags behind.
We use the term analysis in this chapter to refer to the task of producing all the
possible analyses of a given word form, independently of its context. The problem
of producing the correct analysis in the context is called disambiguation here, and
is discussed in detail in Sect. 2.6.

2.5.1 Amharic

Computational work on Amharic began only recently. Fissaha and Haller [49]
describe a preliminary lexicon of verbs, and discuss the difficulties involved in
implementing verbal morphology with XFST. XFST is also the framework of
choice for the development of an Amharic morphological grammar [8]; but
evaluation on a small set of 1,620 words reveals that while the coverage of
the grammar on this corpus is rather high (85–94 %, depending on the part of
speech), its precision is low and many word forms (especially verbs) are
associated with wrong analyses.
Argaw and Asker [11] describe a stemmer for Amharic. Using a large
dictionary, the stemmer first tries to segment surface forms to sequences of
prefixes, stem, and affixes. The candidate stems are then looked up in the
dictionary, and the longest found stem is chosen (ties are resolved by the
frequency of the stem in a corpus). Evaluation on a small corpus of 1,500 words
shows accuracy of close to 77 %.
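The longest-stem strategy just described can be sketched as follows; the affixes and dictionary entries are schematic toy data (not actual Amharic), and the corpus-frequency tie-breaking mentioned above is omitted:

```python
# Sketch of a dictionary-based stemmer: strip one candidate prefix and
# one candidate suffix, look the remaining core up in a dictionary,
# and keep the longest dictionary stem found. Toy, schematic data.

PREFIXES = ["", "ye", "be"]      # "" = no prefix stripped
SUFFIXES = ["", "oc", "u"]       # "" = no suffix stripped
DICTIONARY = {"bet", "betoc", "saw"}

def stem(word):
    """Return the longest dictionary stem obtainable by stripping one
    (possibly empty) prefix and one (possibly empty) suffix."""
    candidates = set()
    for p in PREFIXES:
        for s in SUFFIXES:
            if word.startswith(p) and word.endswith(s):
                core = word[len(p): len(word) - len(s)]
                if core in DICTIONARY:
                    candidates.add(core)
    return max(candidates, key=len) if candidates else None

print(stem("yebetoc"))  # 'betoc': the longest stem wins over 'bet'
```

Here both "bet" (after stripping ye- and -oc) and "betoc" (after stripping only ye-) are in the dictionary, and the longest-match rule selects the latter, mirroring the selection criterion described above.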
The state of the art in Amharic, however, is most probably HornMorpho, a system for morphological processing of Amharic, as well as Tigrinya (another
Ethiopian Semitic language) and Oromo (which is not Semitic). The system is
based on finite-state technology, but the basic transducers are augmented by
feature structures, implementing ideas introduced by Amtrup. Manual evaluation
on 200 Tigrinya verbs and 400 Amharic nouns and verbs shows very accurate
results: in over 96 % of the words, the system produced all and only the correct
analyses.

2.5.2 Other Languages

Morphological resources for other Semitic languages are almost nonexistent. A few notable exceptions include Biblical Hebrew, for which morphological
analyzers are available from several commercial enterprises; Akkadian, for which
some morphological analyzers were developed; Syriac, which inspired the
development of a new model of computational morphology [88]; and dialectal
Arabic.

2.5.3 Related Applications

Also worth mentioning here are a few works that address other morphology-related tasks. These include a shallow morphological analyzer for Arabic [39] that
basically segments word forms to sequences of (at most one) prefix, a stem and (at
most one) suffix; a system for identifying the roots of Hebrew and Arabic
(possibly inflected) words; various programs for vocalization, or restoring
diacritics, in Arabic and in other Semitic languages; determining case endings of Arabic words; and correction of optical character recognition (OCR) errors.


When downstream applications are considered, such as chunking, parsing, or
machine translation, the question of tokenization gains much importance. Morphological analysis determines the lexeme and the inflectional (and, sometimes, also the derivational) morphemes of a surface form; but the way in which a surface
form is broken down to its morphemes for the purpose of further processing can
have a significant impact on the accuracy of such applications. For example, it is
convenient to assume that Arabic and Hebrew prefixes are separate tokens; but
what about suffixes? Should there be a distinction between the plural suffixes and
the pronominal enclitics of nouns? Several works address these questions, usually
in the context of a specific application.

Several works investigate various pre-processing techniques for Arabic, in the context of developing Arabic-to-English statistical machine translation systems
[45, 46, 67, 116]. In the reverse direction, [13] and [2] explore the impact of
morphological segmentation on English-to-Arabic machine translation. The effect
of multiple pre-processing schemes on statistical word alignment for machine
translation is explored by Elming and Habash [47]. And Diab [41] investigates the
effect of differently defined POS tagsets (more or less refined) on the task of base
phrase chunking (shallow parsing).

2.6 Morphological Disambiguation of Semitic Languages

Early attempts at POS tagging and morphological disambiguation of Semitic languages relied on more “traditional” approaches, borrowed directly from the
general (i.e., English) POS tagging literature. The first such work is probably [87],
who defined a set of 131 POS tags, manually annotated a corpus of 50,000 words
and then implemented a tagger that combines statistical and rule-based techniques
that performs both segmentation and tag disambiguation. Similarly, [42] use SVM
to automatically tokenize, POS-tag, and chunk Arabic texts. To this end, they use
a reduced tag set of only 24 tags, with which the reported results are very high.
The set of tags is extended to 75 in [41].
For Hebrew, two HMM-based POS taggers were developed. The tagger of [14]
is trained on an annotated corpus [80]. The most updated version of the tagger,
trained on a treebank of 4,500 sentences, boasts 97.2 % accuracy for segmentation
(detection of underlying morphemes, including a possibly assimilated definite
article), and 90.8 % accuracy for POS tagging [15]. Adler and Elhadad [1] train an
HMM-based POS tagger on a large-scale unannotated corpus of 6 million words,
the reported accuracy being 92.32 % for POS tagging and 88.5 % for full
morphological disambiguation, including finding the correct lexical entry.
As for Amharic, [48] uses conditional random fields for POS tagging. As the annotated corpus used for training is extremely small (1,000 words), it is not surprising that the accuracy is rather low: 84 % for segmentation, and only 74 % for
POS tagging. Two other works use a recently-created 210,000-word annotated
corpus [54] to train Amharic POS taggers with a tag set of size 30. Gambäck
et al. [55] experiment with HMM, SVM and Maximum Entropy; accuracy ranges
between 88 and 95 %, depending on the test corpus. Similarly, [129] investigate
various classification techniques, using the same corpus for the same task. The
best accuracy, achieved with SVM, is over 86 %, but other classification methods,
including conditional random fields and memory-based learning, perform well.

2.7 Future Directions

The discussion above establishes the inherent difficulty of morphological processing of Semitic languages, as one instance of languages with rich and complex
morphology. Having said that, it is clear that with a focused effort, contemporary
computational technology is sufficient for tackling the difficulties. As should be
clear from Sect. 2.5, the two Semitic languages that have benefited from the most attention, namely MSA and Hebrew, boast excellent computational morphological
analyzers and generators. Similarly, Sect. 2.6 shows that morphological
disambiguation of these two languages can be done with high accuracy, nearing
the accuracy of disambiguation with European languages.
However, for the less-studied languages, including Amharic, Maltese and
others, much work is still needed in order to produce tools of similar precision.
Resembling the situation in Arabic and Hebrew, this effort should focus on two fronts: development of formal, computationally-implementable sets of rules that
describe the morphology of the language in question; and collection and
annotation of corpora from which morphological disambiguation modules can be
trained.
As for future technological improvements, we note that “pipeline” approaches,
whereby the input text is fed, in sequence, to a tokenizer, a morphological
analyzer, a morphological disambiguation module and then a parser, have
probably reached a ceiling, and the stage is ripe for more elaborate, unified
approaches. Several works indeed explore such possibilities, focusing in particular
on joint morphological disambiguation and parsing [35, 59, 92, 132]. We defer an
extensive discussion of these (and other) approaches to the next chapter, on parsing.

Acknowledgements I am tremendously grateful to Nizar Habash for his help and advice; it
would have been hard to complete this chapter without them. All errors and misconceptions are,
of course, solely my own.

References

1. Argaw AA, Asker L (2007) An Amharic stemmer: reducing words to their citation forms.
In: Proceedings of the ACL-2007 workshop on computational approaches to Semitic languages, Prague
2. Owens J (1997) The Arabic grammatical tradition. In: Hetzron R (ed) The Semitic
languages. Routledge, London/New York, chap 3, pp 46–58
3. Harley HB (2006) English words: a linguistic introduction. The language library. Wiley-
Blackwell, Malden
4. Brants T (2000) TnT: a statistical part-of-speech tagger. In: Proceedings of the sixth
conference on applied natural language processing, Seattle. Association for Computational
Linguistics, pp 224–231. doi:10.3115/974147.974178, http://www.aclweb.org/anthology/
A00-1031
5. Collins M (2002) Discriminative training methods for hidden markov models: theory and
experiments with perceptron algorithms. In: Proceedings of the ACL-02 conference on
empirical methods in natural language processing, EMNLP ’02, Philadelphia, Vol 10.
Association for Computational Linguistics, pp 1–8. doi:http://dx.doi.org/10.3115/1118693.
1118694
6. Ratnaparkhi A (1996) A maximum entropy model for part-of-speech tagging. In: Brill E,
Church K (eds) Proceedings of the conference on empirical methods in natural language processing, Copenhagen. Association for Computational Linguistics, pp 133–142


7. Giménez J, Màrquez L (2004) SVMTool: a general POS tagger generator based on support
vector machines. In: Proceedings of 4th international conference on language resources and
evaluation (LREC), Lisbon, pp 43–46
8. Beesley KR, Karttunen L (2003) Finite-state morphology: Xerox tools and techniques.
CSLI, Stanford
9. Beesley KR (1998) Arabic morphology using only finite-state operations. In: Rosner M (ed)
Proceedings of the workshop on computational approaches to Semitic languages, COLING-
ACL’98, Montreal, pp 50–57
10. Kiraz GA (2000) Multitiered nonlinear morphology using multitape finite automata: a case
study on Syriac and Arabic. Comput Linguist 26(1):77–105
11. Cohen-Sygal Y, Wintner S (2006) Finite-state registered automata for non-concatenative
morphology. Comput Linguist 32(1):49–82
12. Amsalu S, Gibbon D (2005) A complete finite-state model for Amharic morphographemics.
In: Yli-Jyrä A, Karttunen L, Karhumäki J (eds) FSMNLP. Lecture notes in computer
science, vol 4002. Springer, Berlin/New York, pp 283–284
13. Karttunen L, Beesley KR (2001) A short history of two-level morphology. In: Talk given at
the ESSLLI workshop on finite state methods in natural language processing. http://www.
helsinki.fi/esslli/evening/20years/twol-history.html
14. Koskenniemi K (1983) Two-level morphology: a general computational model for word-form
recognition and production. The Department of General Linguistics, University of Helsinki
15. Choueka Y (1966) Computers and grammar: mechanical analysis of Hebrew verbs. In: Proceedings of the annual conference of the Israeli Association for Information Processing, Rehovot, pp 49–66. (in Hebrew)
16. Forsberg M, Ranta A (2004) Functional morphology. In: Proceedings of the ninth ACM
SIGPLAN international conference on functional programming (ICFP’04), Snowbird.
ACM, New York, pp 213–223
17. Fissaha S, Haller J (2003) Amharic verb lexicon in the context of machine translation.
In: Proceedings of the TALN workshop on natural language processing of minority languages,
Batz-sur-Mer
