Seminar Guideline
Shuly Wintner
2.1 Introduction
In NLP, morphology is the study of how words are built up from smaller
meaningful units called morphemes [1]. These morphemes are the building
blocks of words, and understanding them helps computers process language
more effectively.
Free Morphemes: These are morphemes that can stand alone as words and
carry meaning by themselves. For example, in English, words like "book,"
"run," and "happy" are free morphemes.
Bound Morphemes: These are morphemes that cannot stand alone as words
and must be attached to other morphemes to convey meaning. Bound
morphemes include prefixes (e.g., "un-" in "unhappy") and suffixes (e.g., "-s" in
"books", or the plural marker "-en" in "oxen"). Some languages also use infixes,
morphemes inserted within a word; English has very few of these.
The word ‘word’ is one of the most loaded and ambiguous notions in linguistic
theory [76]. Since most computational applications deal with written texts (as
opposed to spoken language), the most useful notion is that of an orthographic
word. This is a string of characters, from a well-defined alphabet of letters,
delimited by spaces or other delimiters such as punctuation marks. A text typically
consists of a sequence of such orthographic words; the orthographic words in a
text are often referred to as tokens.
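As a minimal illustration, the following sketch tokenizes a text into orthographic words by splitting at spaces and punctuation; the regular expression is a deliberate simplification that suits English-like scripts only.

```python
import re

def tokenize(text):
    """Split text into orthographic words: runs of letters/digits,
    or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("A text consists of orthographic words, delimited by spaces or punctuation."))
# ['A', 'text', 'consists', 'of', 'orthographic', 'words', ',', ...]
```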
Orthographic words are frequently not atomic: they can be further divided into
smaller units, called morphemes. Morphemes are the smallest meaning-bearing
linguistic elements; they are elementary pairings of form and meaning.
Morphemes can be either free, meaning that they can occur in isolation, as a single
orthographic word; or bound, in which case they must combine with other
morphemes in order to yield a word. For example, the word two consists of a
single (free) morpheme, whereas dogs consists of two morphemes: the free
morpheme dog, combined with the bound morpheme -s. The preceding dash on the
latter indicates that it must combine with other morphemes; its function is, of
course, to denote plurality. When a word consists of a free morpheme, potentially
combined with bound morphemes, the free morpheme is called a stem or,
sometimes, a root.
Bound morphemes are typically affixes. Affixes come in many varieties:
prefixes attach before the stem (e.g., re- in reconsider), suffixes attach after the
stem (e.g., -ing in dreaming), and infixes occur inside a stem.
Morphological processes define the shape of words. They are usually classified
into two types. Derivational morphology deals with word formation; such
processes can create new words from existing ones, potentially changing the
category of the original word. For example, the processes that create faithfulness
from faithful, and faithful from faith, are derivational. Such processes are typically
not highly productive; for example, one cannot derive *loveful from love. In
contrast, inflectional morphology yields inflected forms, variants of some base, or
citation form, of words; these forms are constructed to adhere to some syntactic
constraints, but they do not change the basic meaning of the base form.
Inflectional processes are usually highly productive, applying to most members of
a particular word class. For example, English nouns inflect for number, so most
nouns occur in two forms, the singular (which is considered the citation form) and
the plural, regularly obtained by adding the suffix -s to the base form.
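The following sketch illustrates such a productive inflectional process: regular English pluralization by suffixing -s, with common spelling adjustments and a small table of lexical exceptions (all simplifications).

```python
# A sketch of English plural inflection: productive suffixation of -s,
# with common spelling adjustments and a few lexical exceptions.
IRREGULAR = {"ox": "oxen", "child": "children", "mouse": "mice"}

def pluralize(noun):
    """Inflect a singular English noun for plural number."""
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                 # e.g. box -> boxes
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"           # e.g. city -> cities
    return noun + "s"                      # the regular, productive case

for n in ("dog", "box", "city", "ox"):
    print(n, "->", pluralize(n))
```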
Word formation in Semitic languages is based on a unique mechanism, known
as root-and-pattern. Words in this language family are often created by the
combination of two bound morphemes, a root and a pattern. The root is a
sequence of consonants only, typically three; and the pattern is a sequence of
vowels and consonants with open slots in it. The root combines with the pattern
through a process called interdigitation: each letter of the root (radical) fills a slot
in the pattern.
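The following sketch illustrates interdigitation with the textbook Arabic root k-t-b (associated with writing); the patterns and transliterations are schematic simplifications.

```python
def interdigitate(root, pattern):
    """Fill each C slot of the pattern with the next radical of the root."""
    radicals = iter(root)  # assumes as many C slots as radicals
    return "".join(next(radicals) if c == "C" else c for c in pattern)

root = "ktb"  # triconsonantal root associated with writing
for pattern, gloss in [("CaCaCa", "he wrote"),
                       ("CiCaaC", "book"),
                       ("maCCuuC", "written")]:
    print(interdigitate(root, pattern), "-", gloss)
# kataba - he wrote
# kitaab - book
# maktuub - written
```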
In addition to the unique root-and-pattern morphology, Semitic languages are
characterized by a productive system of more standard affixation processes. These
include prefixes, suffixes, infixes and circumfixes, which are involved in both
inflectional and derivational processes.
One straightforward approach, a lexicon that simply lists all word forms, must
still handle out-of-lexicon items (in particular, proper names), which may combine
with prefix or suffix particles.
An alternative solution would be a dedicated morphological analyzer,
implementing the morphological and orthographic rules of the language.
Ideally, a morphological analyzer for any language should reflect the rules
underlying derivational and inflectional processes in that language. Of course,
the more complex the rules, the harder it is to construct such an analyzer. The
main challenge of morphological analysis of Semitic languages stems from the
need to faithfully implement a complex set of interacting rules, some of which are
non-concatenative. Once such a grammar is available, it typically produces more
than one analysis for any given surface form; in other words, Semitic languages
exhibit a high degree of morphological ambiguity, which has to be resolved in a
typical computational application. The level of morphological ambiguity is higher
in many Semitic languages than it is in English, due to the rich morphology and
deficient orthography. This calls for sophisticated methods for disambiguation.
While in English (and other European languages) morphological disambiguation
may amount to POS tagging, Semitic languages require more effort: determining
the correct POS of a given token is intertwined with the problem of segmenting
the token into morphemes; the set of morphological features (and their values) is
larger; and consequently the number of classes is too large for standard
classification techniques. Several models have been proposed to address these
issues.
Contemporary approaches to part-of-speech tagging are all based on machine
learning: a large corpus of text is manually annotated with the correct POS for
each word; then, a classifier is trained on the annotated corpus, resulting in
a system that can predict POS tags for unseen texts with high accuracy. The
state of the art in POS tagging for English is extremely good, with accuracies
that are indistinguishable from human-level performance. Various classifiers have
been built for this task, implementing a variety of classification techniques, such as
Hidden Markov Models (HMM) [26], the averaged perceptron [37], Maximum
Entropy [111, 130, 131, 133, 134], Support Vector Machines (SVM) [58], and
others.
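As a concrete (toy) illustration of the HMM approach, the sketch below performs Viterbi decoding over a three-tag tagset. The transition and emission probabilities are hand-set illustrative assumptions, not corpus-trained estimates; real taggers such as TnT [26] add smoothing and unknown-word handling.

```python
import math

TAGS = ["DET", "NOUN", "VERB"]
START = "<s>"
FLOOR = 1e-9  # floor probability for unseen events (no real smoothing)

# Hand-set toy parameters: P(tag | previous tag) and P(word | tag).
TRANS = {(START, "DET"): 0.6, (START, "NOUN"): 0.3, (START, "VERB"): 0.1,
         ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3,
         ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4}
EMIT = {("the", "DET"): 0.8, ("dog", "NOUN"): 0.4,
        ("dreams", "NOUN"): 0.2, ("dreams", "VERB"): 0.3}

def viterbi(words):
    """Return the most probable tag sequence under the toy bigram HMM."""
    # chart[i][t] = (best log-probability of a path ending in tag t, backpointer)
    chart = [{t: (math.log(TRANS.get((START, t), FLOOR)
                           * EMIT.get((words[0], t), FLOOR)), None)
              for t in TAGS}]
    for w in words[1:]:
        prev = chart[-1]
        chart.append({t: max((prev[p][0]
                              + math.log(TRANS.get((p, t), FLOOR))
                              + math.log(EMIT.get((w, t), FLOOR)), p)
                             for p in TAGS)
                      for t in TAGS})
    tag = max(TAGS, key=lambda t: chart[-1][t][0])  # best final tag
    path = [tag]
    for column in reversed(chart[1:]):              # follow backpointers
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["the", "dog", "dreams"]))  # ['DET', 'NOUN', 'VERB']
```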
For languages with complex morphology, and Semitic languages in particular,
however, these standard techniques do not perform as well, for several reasons:
1. Due to issues of orthography, a single token in several Semitic languages can
actually be a sequence of more than one lexical item, and hence be associated
with a sequence of tags (see the sketch after this list).
2. The rich morphology implies a much larger tagset, since tags reflect the wealth
of morphological information which words exhibit. The richer tagset
immediately implies problems of data sparseness, since most of the tags occur
only rarely, if at all, in a given corpus.
3. As a result of both orthographic deficiencies and morphological wealth, word
forms in Semitic languages tend to be ambiguous.
4. Word order in Semitic languages is relatively free, and in any case freer than in English.
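As a sketch of point 1 (and of the ambiguity noted in point 3), the following toy enumerator lists all readings of a transliterated Hebrew-like token as a sequence of prefix particles followed by a lexicon word. The particle inventory and mini-lexicon are illustrative placeholders; a real analyzer would also model orthographic changes at morpheme boundaries.

```python
# Toy illustration: one token, several particle+word readings.
PARTICLES = {"w": "and", "b": "in", "h": "the", "l": "to", "k": "as"}
LEXICON = {"byt": "house", "bn": "son", "lbn": "white"}

def segmentations(token, prefix=()):
    """Enumerate readings of token as zero or more particles plus a word."""
    readings = [prefix + (token,)] if token in LEXICON else []
    for p in PARTICLES:
        if token.startswith(p) and len(token) > len(p):
            readings += segmentations(token[len(p):], prefix + (p,))
    return readings

for token in ("lbn", "wlbn"):
    for reading in segmentations(token):
        print(token, "->", "+".join(reading))
# lbn  -> lbn    ('white')         wlbn -> w+lbn  ('and white')
# lbn  -> l+bn   ('to (a) son')    wlbn -> w+l+bn ('and to (a) son')
```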
No single method exists that provides an adequate solution for the challenges
involved in morphological processing of Semitic languages. The most common
approach to morphological processing of natural language is finite-state
technology [22, 81, 83, 89, 113]. The adequacy of this technology for Semitic
languages has frequently been challenged, but clearly, with some sophisticated
developments, such as flag diacritics [19], multi-tape automata [88] or registered
automata [36], finite-state technology has been effectively used for describing the
morphological structure of several Semitic languages [8, 16, 17, 68, 85, 88, 138].
Finite-state registered automata [36] were developed with the goal of facilitating
the expression of various non-concatenative morphological phenomena in an
efficient way. The main idea is to augment standard finite-state automata with
a (finite) amount of memory, in the form of registers associated with the automaton
transitions. This is done in a restricted way that saves space but does not add
expressivity. The number of registers is finite and usually small; the registers
eliminate the need to duplicate paths, as they enable the automaton to ‘remember’ a finite number
of symbols. Technically, each arc in the automaton is associated (in addition to an
alphabet symbol) with an action on the registers. Cohen-Sygal and Wintner [36]
define two kinds of actions, read and write. The former allows an arc to be
traversed only if a designated register contains a specific symbol. The latter writes
a specific symbol into a designated register when an arc is traversed.
Cohen-Sygal and Wintner [36] show that finite-state registered automata can
efficiently model several non-concatenative morphological phenomena, including
circumfixation, root-and-pattern word formation in Semitic languages, vowel
harmony, limited reduplication, etc. The representation is highly efficient: for
example, to account for all the possible combinations of r roots and p patterns, an
ordinary FSA requires O(r × p) arcs, whereas a registered automaton requires only
O(r + p) arcs.
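The following is a minimal sketch of the idea, assuming a much-simplified rendering of the formalism: a toy registered automaton that accepts strings of the form root#surface, where the surface form instantiates the single pattern CiCaaC. 'Write' arcs store a radical in a register; 'read' arcs consume a character only if it matches the stored radical. Adding another pattern would add only its own branch of arcs while sharing the root-reading arcs, which is the source of the savings.

```python
CONSONANTS = set("bdgklmnpqrstwz")

# Arcs: (kind, argument, target state).
#   ('lit', c, q)   -- consume the literal character c
#   ('write', i, q) -- consume any consonant and store it in register i
#   ('read', i, q)  -- consume a character only if it equals register i
ARCS = {
    0: [("write", 0, 1)],  # first radical  -> register 0
    1: [("write", 1, 2)],  # second radical -> register 1
    2: [("write", 2, 3)],  # third radical  -> register 2
    3: [("lit", "#", 4)],  # separator between root and surface form
    4: [("read", 0, 5)],   # surface form instantiates the pattern CiCaaC
    5: [("lit", "i", 6)],
    6: [("read", 1, 7)],
    7: [("lit", "a", 8)],
    8: [("lit", "a", 9)],
    9: [("read", 2, 10)],
}
FINAL = {10}

def accepts(s):
    """Simulate the (deterministic) toy registered automaton on s."""
    state, regs = 0, [None, None, None]
    for ch in s:
        for kind, arg, target in ARCS.get(state, []):
            if (kind == "lit" and ch == arg) or (kind == "read" and ch == regs[arg]):
                break
            if kind == "write" and ch in CONSONANTS:
                regs[arg] = ch
                break
        else:
            return False  # no arc matched: reject
        state = target
    return state in FINAL

print(accepts("ktb#kitaab"))  # True  (root k-t-b combined with CiCaaC)
print(accepts("ktb#kitaad"))  # False (final character does not match register 2)
```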
While much effort has been put into the development of systems for processing
(Modern Standard) Arabic and Hebrew, the development of such tools for other
languages lags behind.
We use the term analysis in this chapter to refer to the task of producing all the
possible analyses of a given word form, independently of its context. The problem
of producing the correct analysis in the context is called disambiguation here, and
is discussed in detail in Sect. 2.6.
2.5.1 Amharic
Computational work on Amharic began only recently. Fissaha and Haller [49]
describe a preliminary lexicon of verbs, and discuss the difficulties involved in
implementing verbal morphology with XFST. XFST is also the framework of
choice for the development of an Amharic morphological grammar [8]; but
evaluation on a small set of 1,620 words reveals that while the coverage of
the grammar on this corpus is rather high (85–94 %, depending on the part of
speech), its precision is low and many word forms (especially verbs) are
associated with wrong analyses.
Argaw and Asker [11] describe a stemmer for Amharic. Using a large
dictionary, the stemmer first tries to segment surface forms into sequences of
prefixes, a stem, and suffixes. The candidate stems are then looked up in the
dictionary, and the longest found stem is chosen (ties are resolved by the
frequency of the stem in a corpus). Evaluation on a small corpus of 1,500 words
shows an accuracy of close to 77 %.
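A minimal sketch of this strategy follows, with invented placeholder particles, dictionary entries, and frequencies (not actual Amharic data).

```python
# Placeholder particle lists, dictionary and frequencies (not Amharic data).
PREFIXES = {"", "ye", "be", "le"}
SUFFIXES = {"", "och", "u", "wa"}
DICTIONARY = {"bet", "betoch"}
FREQUENCY = {"bet": 120, "betoch": 3}

def stem(token):
    """Segment token as prefix+stem+suffix; prefer the longest dictionary
    stem, breaking ties by corpus frequency."""
    candidates = []
    for pre in PREFIXES:
        for suf in SUFFIXES:
            if token.startswith(pre) and token.endswith(suf):
                core = token[len(pre):len(token) - len(suf)] if suf else token[len(pre):]
                if core in DICTIONARY:
                    candidates.append(core)
    if not candidates:
        return None
    return max(candidates, key=lambda s: (len(s), FREQUENCY.get(s, 0)))

print(stem("yebetoch"))  # 'betoch': the longest matching dictionary stem wins
```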
The state of the art in Amharic, however, is most probably HornMorpho, a
system for morphological processing of Amharic, as well as of Tigrinya (another
Ethiopian Semitic language) and Oromo (which is not Semitic). The system is
based on finite-state technology, but the basic transducers are augmented by
feature structures, implementing ideas introduced by Amtrup. Manual evaluation
on 200 Tigrinya verbs and 400 Amharic nouns and verbs shows very accurate
results: in over 96 % of the words, the system produced all and only the correct
analyses.
Also worth mentioning here are a few works that address other morphology-
related tasks. These include a shallow morphological analyzer for Arabic [39] that
basically segments word forms into sequences of (at most one) prefix, a stem, and
(at most one) suffix; a system for identifying the roots of Hebrew and Arabic
(possibly inflected) words; various programs for vocalization, or restoring
diacritics, in Arabic and in other Semitic languages; and systems for determining
the case endings of Arabic words.
Acknowledgements I am tremendously grateful to Nizar Habash for his help and advice; it
would have been hard to complete this chapter without them. All errors and misconceptions are,
of course, solely my own.
References
1. Argaw AA, Asker L (2007) An Amharic stemmer: reducing words to their citation forms.
In: Proceedings of the ACL-2007 workshop on computational approaches to Semitic
languages, Prague
2. Owens J (1997) The Arabic grammatical tradition. In: Hetzron R (ed) The Semitic
languages. Routledge, London/New York, chap 3, pp 46–58
3. Harley HB (2006) English words: a linguistic introduction. The language library.
Wiley-Blackwell, Malden
4. Brants T (2000) TnT: a statistical part-of-speech tagger. In: Proceedings of the sixth
conference on applied natural language processing, Seattle. Association for Computational
Linguistics, pp 224–231. doi:10.3115/974147.974178, http://www.aclweb.org/anthology/A00-1031
5. Collins M (2002) Discriminative training methods for hidden Markov models: theory and
experiments with perceptron algorithms. In: Proceedings of the ACL-02 conference on
empirical methods in natural language processing, EMNLP '02, Philadelphia, vol 10.
Association for Computational Linguistics, pp 1–8. doi:10.3115/1118693.1118694
6. Ratnaparkhi A (1996) A maximum entropy model for part-of-speech tagging. In: Brill E,
Church K (eds) Proceedings of the conference on empirical methods in natural language
processing, Philadelphia