Computational Morphology


Author: Harald Trost

Abstract

Computational morphology deals with the processing of words and word
forms, both in their graphemic (written) form and in their phonemic
(spoken) form. It has a wide range of practical applications, and probably
every one of you has already come across some of them. Ever used spelling
correction? Or wondered about some strange hyphenation in a newspaper
article? This is computational morphology at work. Solving such seemingly
simple tasks often poses hard problems for a computer program. This section
provides some insights into why this is so and what techniques are
available to tackle these tasks.

1 Introduction

Natural languages have intricate systems to create words and word forms
from smaller units in a systematic way. The part of linguistics dealing with
these phenomena is morphology. This chapter starts with a quick overview
of this fascinating field. It continues with applications of computational
morphology. The rest is devoted to processing techniques. Computational
morphology has evolved from very modest beginnings using full form
lexica or some ad-hoc concatenation techniques to the much more powerful
tools available today. The chapter concludes with a number of examples of
encoding morphological phenomena from different languages using these
tools.

2 Linguistic fundamentals

What is morphology all about? A simple answer is that morphology deals
with words. In formal languages, words are just arbitrary strings denoting
constants or variables. Nobody would care about a morphology of formal
languages. In natural languages the picture is very different. Every human
language contains several hundred thousand words, and new words are
continuously integrated while others drift out of use. This potentially
infinite set of words is produced from a finite collection of smaller
units. The task of morphology is to find and describe the mechanisms
behind this process.

The basic building blocks in morphology are MORPHEMES. They are
defined as the smallest unit in language to which a meaning may be assigned
or, alternatively, as the minimal unit of grammatical analysis. Morphemes
are abstract entities that express basic features: either semantic concepts
denoting entities or relationships in our world, like door, blue or take
(such morphemes are called roots), or syntactic features like past or
plural.

Their realisation as part of a word is called a MORPH. Often, there is a
one-to-one relation, e.g., the morpheme door is realized as the morph door.
With take, on the other hand, we find the two possibilities take and took.
In such a case we speak of allomorphs. Plural in English is usually
expressed by the morph -s. There are exceptions though: in oxen plural is
expressed through the morph -en, in men by stem vowel alteration. All these
different forms are allomorphs of the plural morpheme.
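
The choice between allomorphs of the plural morpheme can be pictured as a
lookup with a default. The following is only a minimal sketch: the table
IRREGULAR_PLURALS and the function pluralize are hypothetical names
introduced for illustration, and the default rule ignores spelling changes.

```python
# Hypothetical toy data: nouns whose plural allomorph is not the default -s.
IRREGULAR_PLURALS = {
    "ox": "oxen",  # plural realized by the morph -en
    "man": "men",  # plural realized by stem vowel alteration
}


def pluralize(noun: str) -> str:
    """Return a listed irregular plural, or fall back to the default morph -s."""
    return IRREGULAR_PLURALS.get(noun, noun + "s")
```

Listing the exceptions and applying the regular morph elsewhere keeps the
lexicon small, since only the unpredictable forms need to be stored.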

A basic distinction is the one between bound and free morphs. A free morph
may form a word on its own, e.g., the morph door. We call such words
monomorphemic because they consist of a single morph. Bound morphs, on
the other hand, occur only in combination with other forms. All affixes are
bound morphs. For example, the word doors consists of the free
morph door and the bound morph -s. Words may also consist of free morphs
only, e.g., tearoom, or bound morphs only, e.g., aggression.

Every language typically contains some ten thousand morphs. This is an
order of magnitude below the number of words. Strict rules govern the
combination of these morphs into words (cf. 2.4). This way of structuring
the lexicon greatly reduces the cognitive load of remembering so many
words.

2.1 What is a word?

Surprisingly, there is no easy answer to this question. One can easily spot
"words" in a text because they are separated from each other by blanks or
punctuation. However, if you record ordinary speech you will find that
there are no breaks between words. Still, we can isolate units which occur
over and over again in speech, in different combinations. So the notion
of "word" makes sense. But how do we define it?

We may look at "words" from different perspectives. To syntax, "words" are
the units that make up sentences. Words are grouped according to their
function in the sentential structure. Each group gets a tag (usually called
part-of-speech or word category), and grammar deals with these tags only,
omitting the details of specific words.

Morphology, on the other hand, is concerned with the inner structure of
"words". It tries to uncover the rules that govern the formation of words
from smaller units. We notice that words that convey the same meaning look
different depending on their syntactic context. Take, e.g., the
words degrade, degrades, degrading, and degraded. We can think of those
as different forms of the same "word". We call the part that carries the
meaning of those forms a base form. In our example this is the
form degrade. All other forms in this example are produced by adding a
suffix. A wide range of other possibilities will be shown in section 2.3. All
the different forms of a word together are called its paradigm. The part of
morphology governing the production of these forms is called inflection.
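
As a rough illustration, the paradigm of degrade can be generated from its
base form by suffixation. This is a sketch under simplifying assumptions: the
function name paradigm is hypothetical, and the rule covers only regular
verbs whose base ends in a consonant or a single silent -e. It ignores
consonant doubling, verbs in -ee, and all irregular verbs.

```python
def paradigm(base: str) -> list[str]:
    """Build base, 3rd person singular, -ing form and -ed form of a regular verb."""
    # A final silent -e is dropped before the vowel-initial suffixes.
    stem = base[:-1] if base.endswith("e") else base
    return [base, base + "s", stem + "ing", stem + "ed"]


print(paradigm("degrade"))
```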

Base forms in English are at the same time always word forms in their own
right, e.g., the base form degrade is also present tense, active voice, non
3rd person singular. In other languages we find a slightly different situation.
In Italian nouns are marked for gender and number. Different affixes are
used to signal masculine and feminine on the one hand and singular and
plural on the other hand.

             SINGULAR    PLURAL
MASCULINE    pomodoro    pomodori    tomato
FEMININE     cipolla     cipolle     onion

Neither of the two forms of a noun can function as the base form. Instead,
we must assume that the base form is what is left over after removing the
respective suffixes, i.e., pomodor- and cipoll-. Such base forms that cannot
occur as word forms in their pure form are called stems.
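
The Italian pattern can be sketched as a stem combined with a suffix chosen
by gender and number. The table covers only the two noun classes shown above;
the names SUFFIXES and inflect are assumptions made for this illustration,
and other Italian declension classes are not handled.

```python
# Suffixes for the two noun classes in the table above (not a full grammar).
SUFFIXES = {
    ("masculine", "singular"): "o",
    ("masculine", "plural"): "i",
    ("feminine", "singular"): "a",
    ("feminine", "plural"): "e",
}


def inflect(stem: str, gender: str, number: str) -> str:
    """Attach the gender/number suffix to a bare stem such as 'pomodor'."""
    return stem + SUFFIXES[(gender, number)]
```

The bare stem is never returned on its own, which mirrors the observation
that a stem such as pomodor- is not itself a word form.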

Base forms themselves are not necessarily atomic. By comparing degrade to
downgrade, retrograde and upgrade on the one hand and decompose, decrease
and deport on the other hand, we can see that it is composed of the morphs
de- and grade. The morpheme carrying the central meaning is often called
the root of the word. A root may combine with suffixes (cf. 2.2.2.2) or
other roots (cf. 2.2.2.3) to form new base forms.

Finally, we can describe a "word" from a phonological perspective. Important
for morphology is that phonological units define the range for phonological
processes. Often, the phonological word is identical to the morphological
word but sometimes boundaries differ. For example, the
morphophonological process of final devoicing in German (cf. 2.3.2.2)
works on syllable structure. Let's look at two words derived from the
root lieb. The word be+lieb+ig (arbitrary) is realized as /blibik/ because it is
a single phonological word. On the other hand, lieb+lich (lovely) is realized as
/lipliC/. Here, the last consonant of the root is devoiced because the two morphs are
separated by a phonological word boundary.

2.2 Functions of morphology

How much and what sort of information is expressed by morphology differs
widely between languages.

2.2.1 Inflection

Inflection is required in particular syntactic contexts. It does not change
the part-of-speech category but the grammatical function. The different
forms of a word produced by inflection form its paradigm. Inflection is
complete, i.e., with rare exceptions all the forms of its paradigm exist
for a specific word. Regarding inflection, words can be categorized into
three classes:

Particles or non-inflecting words: they occur in just one form. In
English, prepositions, adverbs, conjunctions and articles are particles;
Verbs or words following conjugation;
Nominals or words following declension, i.e., nouns, adjectives, and
pronouns.

Conjugation is mainly concerned with defining tense and aspect and
agreement features like person and number. Take for example
the German verb lesen (to read). German verb forms come in present and
past tense, indicative or subjunctive.

            PRESENT                             PAST
            INDICATIVE        SUBJUNCTIVE       INDICATIVE        SUBJUNCTIVE
            SINGULAR PLURAL   SINGULAR PLURAL   SINGULAR PLURAL   SINGULAR PLURAL
1st PERSON  lese     lesen    lese     lesen    las      lasen    läse     läsen
2nd PERSON  liest    lest     lesest   leset    last     last     läsest   läset
3rd PERSON  liest    lesen    lese     lesen    las      lasen    läse     läsen
Participle  lesend                              gelesen
Imperative  lies     lest
Infinitive  lesen

Declension marks various agreement features like number (singular, plural,
dual, etc.), case (as governed by verbs and prepositions, or to mark various
kinds of semantic relations), gender (masculine, feminine, neuter), and
comparison.

2.2.2 Derivation and Compounding

In contrast to inflection, which produces different forms of the same word,
derivation and compounding are processes that create new words. Thus,
derivation and compounding have nothing to do with morphosyntax. They
are a means to extend our lexicon in an economical and principled way.
Application of a derivational morpheme may be restricted to a certain
subclass. For example, application of the English derivational suffix -ity is
restricted to stems of Latin origin, while the suffix -ness can apply to a
wider range:

rare  - rarity    - ?rareness
red   - *reddity  - redness
grave - gravity   - graveness
weird - *weirdity - weirdness

Derivation can be applied recursively, i.e., words that are already the
product of derivation can undergo the process again. That way a potentially
infinite number of words can be produced. Take, for example, the following
chain of derivations:

hospital -> hospitalize -> hospitalization -> pseudohospitalization

Semantic interpretation of the derived word is often difficult. While a
derivational suffix can usually be given a unique semantic meaning, many of
the derived words may still resist compositional interpretation. This may be
due to lexicalization, i.e., a form is no longer transparent, or to
ambiguity of the underlying base form. For a more detailed discussion see
Trost (1993).

While inflectional and derivational morphology are mediated by the
attachment of a bound morph to a base form, compounding is the joining of
two or more base forms to form a new word. Most common is just setting
two words one after the other, as in state monopoly, bedtime or red wine. In
some cases parts are joined by a linking morpheme (usually the remnant of
case marking) as in bull's eye or German Liebeslied (love-song).

The last part of a compound usually defines its morphosyntactic properties.
Semantic interpretation is even more difficult than with derivation. Almost
any semantic relationship may hold between the components of a
compound:

Wienerschnitzel      cutlet made in Viennese style
Schweineschnitzel    cutlet made of pork
Kinderschnitzel      cutlet made for children

The boundary between derivation and compounding is fuzzy. Historically,
most derivational suffixes developed from words frequently used in
compounding. An obvious example is the -ful suffix as in hopeful, wishful,
thankful.

Phrases and compounds cannot always be distinguished. The English
expression red wine in its written form could be both. In spoken language
the stress pattern differs: red wíne vs. réd wine. In German, phrases are
morphologically marked, while compounds are not: roter Wein vs. Rotwein.
But for verb compounds the situation is similar to English: zu Hause
bleiben vs. zuhausebleiben.

2.3 What constitutes a morph?

Every word form must at the core contain some root form. This root can
(must) then be complemented with additional morphs. How are morphs
realized? Obviously, a morph must somehow be recognizable in the
phonetic or orthographic pattern constituting the word. The most common
type of morph is a continuous sequence of phonemes. All roots and affixes
are of this form. A complex word can then be analyzed as a series of morphs
concatenated together. Agglutinative languages function almost exclusively
this way. But there are surprisingly many other possibilities.

2.3.1 Affixation

An affix is a bound morph that is realised as a sequence of phonemes (or
graphemes). By far the most common types of affixes are prefixes and
suffixes. Many languages have only these two types of affixes. Among them
is English (at least under standard morphological analyses).

A prefix is an affix that is attached in front of a stem. An example is the
English negative marker un- attached to adjectives:

common -> uncommon

A suffix is an affix that is attached after a stem. Take, e.g., the English
plural marker -s:

shoe -> shoes

Across languages suffixation is far more frequent than prefixation. Also,
certain kinds of morphological information are never expressed via prefixes,
e.g., nominal case marking. Many computational systems for morphological
analysis and generation assume a model of morphology based on prefixation
and suffixation only.

A circumfix is the combination of a prefix and a suffix which together
express some feature. Both theoretically and from a computational point of
view a circumfix can be viewed as really two affixes applied one after the
other.

In German, the circumfixes ge--t and ge--n form the past participle of verbs:

sagen     to say    gesagt      said
laufen    to run    gelaufen    run
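
Treating a circumfix as two affixes applied one after the other can be
sketched as follows. The function past_participle and its strong flag are
hypothetical names used only for this illustration; the sketch assumes an
infinitive ending in -en and ignores stem vowel changes in strong verbs as
well as verbs that take no ge-.

```python
def past_participle(infinitive: str, strong: bool) -> str:
    """Form a German past participle by prefixing ge- and then adding a suffix."""
    if not infinitive.endswith("en"):
        raise ValueError("this sketch expects an infinitive ending in -en")
    stem = infinitive[:-2]       # strip the infinitive ending
    prefixed = "ge" + stem       # first affix: the prefix ge-
    suffix = "en" if strong else "t"
    return prefixed + suffix     # second affix: the participle suffix
```

Whether a verb takes the -t or the -en variant is lexical information, which
is why it is passed in here rather than computed.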

An infix is an affix where the placement is defined in terms of some
phonological condition(s). These might result in the infix appearing within
the root to which it is affixed. In Bontoc, a Philippine language, the infix
-um- turns adjectives and nouns into verbs (Fromkin and Rodman 1983). The
infix attaches after the initial consonant:
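
The placement rule can be sketched in a few lines. The function name infix_um
is hypothetical, the sketch assumes that the stem is written with a single
initial consonant letter, and the stem fikas (strong) is a form commonly cited
from Fromkin and Rodman, used here purely for illustration.

```python
def infix_um(stem: str) -> str:
    """Insert the infix -um- directly after the stem's initial consonant."""
    if not stem:
        raise ValueError("empty stem")
    return stem[0] + "um" + stem[1:]


print(infix_um("fikas"))
```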

Reduplication is a border case of affixation. The form of the affix is a
function of the stem to which it is attached, i.e., it copies (some portion of)
the stem. Reduplication may be complete or partial. In the latter case it may
be prefixal, infixal or suffixal. Reduplication can include phonological
alteration on the copy or the original.

In Javanese complete reduplication is used to express the
habitual-repetitive. If the second vowel is not /a/, the first vowel in the
copy is made nonlow (changing /a/ to /o/ and /E/ to /e/) and the second
becomes /a/. When the second vowel is /a/, the copy remains unchanged while
in the original the /a/ is changed to /E/ (Kiparsky 1987):

Partial reduplication is more common. In Yidiny, an Australian language,
prefixal reduplication is used for plural marking. Reduplication involves
copying the minimal word (Nash 1980).

An example of infixal reduplication is the frequentative in Amharic, a
Semitic language spoken in Ethiopia (Rose 2000).

From a computational point of view one property of reduplication is
especially important: since reduplication involves copying, it cannot, at
least in the general case, be described completely with finite-state
methods.

2.3.2 Root-and-template morphology

Semitic languages (at least according to standard analyses) exhibit a very
peculiar type of morphology: A so-called root, consisting of two to four
consonants, conveys the basic semantic meaning. A vowel pattern marks
information about voice and aspect. A derivational template gives the class
of the word (traditionally called binyan).
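
The interplay of root, vowel pattern and template can be sketched as slot
filling. This is an illustrative simplification, not a full analysis: the
function interdigitate and the C/V template notation are assumptions, and the
Arabic root k-t-b with the forms katab (wrote) and kutib (was written) is a
standard textbook example that does not come from this chapter.

```python
def interdigitate(root: str, template: str, vowels: str) -> str:
    """Fill the C slots of a template from the root and the V slots from the vowels."""
    consonants = iter(root)
    vowel_iter = iter(vowels)
    segments = []
    for slot in template:
        if slot == "C":
            segments.append(next(consonants))
        elif slot == "V":
            segments.append(next(vowel_iter))
        else:
            raise ValueError(f"unknown template slot: {slot!r}")
    return "".join(segments)


print(interdigitate("ktb", "CVCVC", "aa"))
print(interdigitate("ktb", "CVCVC", "ui"))
```

The root and the vowel pattern are not contiguous substrings of the result,
which is what sets this type of morphology apart from plain concatenation.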

2.3.3 Modification in phonetic substance

This term subsumes processes which neither introduce new segments nor
remove existing ones. Morphs are not realized as a string of phonemes, but
as a change of phonetic properties or an alteration of the prosodic shape.

Ablaut refers to vowel alternations inherited from Indo-European. It is a
pure example of vowel modification as a morphological process. Examples
are strong verbs in Germanic languages like English (e.g., swim - swam -
swum). In Icelandic this process is still more common and more regular than
in most other Germanic languages. The following example is from Sproat
(1992, p.62):

Umlaut has its origin in a phonological process, whereby root vowels were
assimilated to a high-front suffix vowel. When this suffix vowel was lost
later on, the change in the root vowel became the sole remaining mark of the
morphological feature originally signalled by the suffix.

In German the plural of nouns may be marked by umlaut (sometimes in
combination with a suffix), whereby in the stem vowel the feature back is
changed to front:
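
On the orthographic level, the back-to-front change can be approximated by a
letter mapping. The sketch below makes several assumptions: the names UMLAUT,
umlaut and plural are hypothetical, the rule simply fronts the last back
vowel letter of the stem, and the example nouns (Mutter, Vater, Haus, Buch)
are common German words chosen for illustration. Which nouns take umlaut at
all is lexical information that the sketch does not predict.

```python
# Orthographic counterparts of the back vowels and their fronted variants.
UMLAUT = {"a": "ä", "o": "ö", "u": "ü", "A": "Ä", "O": "Ö", "U": "Ü"}


def umlaut(stem: str) -> str:
    """Front the last back vowel of the stem; the diphthong 'au' becomes 'äu'."""
    for i in range(len(stem) - 1, -1, -1):
        if stem[i] in UMLAUT:
            # In the diphthong 'au' the first letter carries the umlaut.
            if stem[i] == "u" and i > 0 and stem[i - 1] in "aA":
                i -= 1
            return stem[:i] + UMLAUT[stem[i]] + stem[i + 1:]
    return stem  # no back vowel to front


def plural(stem: str, suffix: str = "") -> str:
    """Mark plural by umlaut, optionally combined with a suffix."""
    return umlaut(stem) + suffix
```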

Another possibility to realize a morpheme is to alter the prosodic shape.
Tone modification can be used to signal certain morphological features.
In Ngbaka, spoken in the Democratic Republic of Congo, tense-aspect
contrasts are expressed by four different tonal variants (Nida 1949):

A morpheme may be realised by a stress shift. English noun-verb derivation
sometimes uses a pattern where the stress is shifted from the first to the
second syllable:

NOUN       VERB
éxport     expórt
récord     recórd
cónvict    convíct

2.3.4 Suppletion

Total modification is a process occurring sporadically and idiosyncratically
within inflectional paradigms. It is usually associated with forms that are
used very frequently. Examples in English are went, the past tense of go,
and the forms of to be: am, are, is, was and were.

2.3.5 Zero Morphology

Sometimes a morphological operation has no phonological expression
whatsoever. Examples are found in many languages.

English noun-to-verb derivation is often not explicitly marked:

man      The man smiled.     Man the boats.
house    He buys a house.    They house in a cave.

A possible analysis is to assume a zero morph which attaches to the noun to
form a verb: book+Ø[N>V]. Another possibility is to assume two
independent lexical items disregarding any morphological relationship.

2.4 The structure of words: Morphotactics

Somehow morphs must be put together to form words. A word grammar
determines the way this has to be done. This part of morphology is
called morphotactics. As we have seen, the most usual way is simple
concatenation. Let's have a look at the constraints involved. What are the
conditions governing the ordering of morphemes in pseudohospitalization?

(1) *hospitalationizepseudo, *pseudoizehospitalation

(2) *pseudohospitalationize

In (1) an obvious restriction is violated: pseudo- is a prefix and must appear
ahead of the stem, -ize and -ation are suffixes and must appear after the
stem. The violation in (2) is less obvious. In addition to the pure ordering
requirements there are also rules governing to which types of stems an affix
may attach: -ize attaches to nouns and produces verbs, -ation attaches to
verbs and produces nouns.

One possibility to describe the word formation process is to assume a
functor-argument structure. Affixes are functors that pose restrictions on
their (single) argument. That way a binary tree is constructed. Prefixes
induce right branching and suffixes left branching.

Fig. 1: The internal structure of the word pseudohospitalization

In figure 1 the functor pseudo- takes a nominal argument to form a noun,
-ize a nominal argument to form a verb, and -ation a verbal argument to
form a noun. This description renders two different possible structures
for pseudohospitalization: the one given in figure 1 and a second one
where pseudo- combines first directly with hospital. We may or may not
accept this ambiguity. To avoid the second reading we could state a lexical
constraint that a word with the head pseudo- cannot serve as an argument
anymore.
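
The functor-argument view can be made concrete with a small enumeration of
all structures licensed by the category constraints. This is a sketch only:
the table AFFIXES, the function structures and the bracket notation are
assumptions introduced for illustration, with N and V standing for noun and
verb.

```python
# Each affix is a functor: (category it requires, category it produces).
AFFIXES = {
    "pseudo-": ("N", "N"),
    "-ize": ("N", "V"),
    "-ation": ("V", "N"),
}


def structures(root, root_cat, prefixes, suffixes):
    """Return every (bracketing, category) licensed by the category constraints.

    Both affix lists are given in surface order. Affixes attach from the root
    outwards, so the next prefix to attach is prefixes[-1] and the next suffix
    is suffixes[0].
    """
    results = []

    def attach(tree, cat, pre, suf):
        if not pre and not suf:
            results.append((tree, cat))
            return
        if pre:
            need, out = AFFIXES[pre[-1]]
            if need == cat:  # the prefix accepts the current category
                attach(f"[{pre[-1]} {tree}]", out, pre[:-1], suf)
        if suf:
            need, out = AFFIXES[suf[0]]
            if need == cat:  # the suffix accepts the current category
                attach(f"[{tree} {suf[0]}]", out, pre, suf[1:])

    attach(root, root_cat, list(prefixes), list(suffixes))
    return results


for tree, cat in structures("hospital", "N", ["pseudo-"], ["-ize", "-ation"]):
    print(cat, tree)
```

Under these assumptions the sketch finds exactly the two structures discussed
above for pseudohospitalization, and none for the ill-formed ordering in (2),
because -ation cannot take a nominal argument.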

2.4.1 Constraints on affixes

Affixes attach to specific categories only. This is an example of
a syntactic restriction. Restrictions may also be of a phonological, semantic
or purely lexical nature. A semantic restriction on the English adjectival
prefix un- prevents its attachment to an adjective that already has a negative
meaning:

unhappy      *unsad
unhealthy    *unill
unclean      *undirty

The fact that in English some suffixes may only attach to words of Latin
origin (cf. 2.2.2) is an example of a lexical restriction.

2.4.2 Morphological vs. phonological structure

In some cases there is a mismatch between the phonological and the
morphological structure of a word. One example is comparative formation
with the suffix -er in English. Roughly, there is a phonological rule that
prevents attaching this suffix to words that consist of more than two
syllables:

great        greater
tall         taller
happy        happier
competent    *competenter
elegant      *eleganter

If we want to stick to the above rule, unrulier has to be explained with a
structure where the prefix un- is attached to rulier. But, from a
morphological point of view, the adjective ruly does not exist, only the
negative form unruly. This implies that the suffix -er is attached to unruly.
We end up with an obvious mismatch!

Another potential problem is cliticization. A clitic is a syntactically
separate word that is phonologically realized as an affix. The phenomenon
is quite common across languages.

In English, auxiliaries have contracted forms that function as affixes:

he shall return -> he'll return

In German, prepositions can combine with the definite article:

an dem Tisch -> am Tisch
in das Haus -> ins Haus

In Italian, personal pronouns can be attached to the verb. In this
process the ordering of constituents is also altered:

ce ne facciamo -> facciamocene

2.5 The Influence of Phonology

Morphotactics governs the rules for the combination of morphs into larger
entities. One could assume that this is all a system needs
to know to break down words into their component morphemes. But there is
another aspect that makes things more complicated: Phonological rules may
apply and change the shape of morphs. Dealing with these changes and their
underlying reasons is the area of morphophonology.

2.5.1 Phonology vs. orthography

Most applications of computational morphology deal with text rather than
speech. But written language is rarely a true phonemic description. For
some languages, e.g., Finnish, Spanish or Turkish, orthography is a good
approximation for a phonetic transcription. English, on the other hand, has
very poor correspondence between writing and pronunciation. As a result,
we often have to deal with orthography rather than phonology. A good
example is English plural rules (cf. 2.4.1).
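
The familiar spelling rules for the regular English plural show what working
on orthography looks like in practice. The sketch below is a simplified
assumption rather than a complete rule set: the function name spell_plural is
hypothetical, and irregular nouns as well as cases such as nouns ending in -o
or -f are not covered.

```python
def spell_plural(noun: str) -> str:
    """Apply simplified spelling rules for the regular English plural."""
    # After a sibilant spelling an -e- is inserted: fox -> foxes.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    # Consonant + y becomes -ies: spy -> spies (but boy -> boys).
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"
```

The conditions refer to letters, not sounds, which is exactly the shift from
phonology to orthography described above.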
