Module 2
• Example: The regular expression /.at/ matches any three-character string ending in at, such as cat, bat, rat,
mat, nat, 4at, etc.
RE-Character Classes
• /.....berry/ matches any ten-letter string that ends in berry. For
example strawberry, sugarberry, blackberry, etc.
• To search for either ‘Tanveer’ or ‘Siddiqui’ we have to use the disjunction
operator, written with the pipe symbol (|).
• The pattern /Blackberry|blackberries/ matches either
Blackberry or blackberries.
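A minimal sketch of these patterns using Python's re module (the test strings below are illustrative):

import re

# '.' matches any single character, so /.at/ matches cat, bat, 4at, ...
wildcard = re.compile(r"^.at$")
for word in ["cat", "bat", "4at", "at"]:
    print(word, bool(wildcard.match(word)))   # 'at' does not match

# '|' is the disjunction operator
names = re.compile(r"Tanveer|Siddiqui")
print(bool(names.search("Prof. Tanveer teaches NLP")))  # True

# ten-letter strings ending in 'berry'
berry = re.compile(r"^.....berry$")
print(bool(berry.match("strawberry")))  # True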
Exercise
Check if a string is an email address or not.
RE-Character Classes
An email address consists of a non-empty sequence of characters, followed by the 'at' symbol (@), followed by
another non-empty sequence of characters ending with a pattern like .xx, .xxx, .xxxx, etc. A regular expression for an
email address is
^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$
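A small sketch that applies this expression in Python; the sample addresses are made up for illustration:

import re

# Regular expression for an email address, taken from the slide above
EMAIL_RE = re.compile(r"^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$")

def is_email(text):
    """Return True if the whole string matches the email pattern."""
    return EMAIL_RE.match(text) is not None

for candidate in ["user_01@example.com", "no-at-symbol.com", "a@b.c"]:
    print(candidate, is_email(candidate))   # True, False, False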
Morphological Parsing
- Morphology is a subdiscipline of linguistics.
- It studies word structure and the formation of words from smaller units (morphemes).
- The goal of morphological parsing is to discover the morphemes that build a given
word.
- Morphemes are the smallest meaning-bearing units in a language.
- The morphological parser should be able to tell us that the word ‘eggs’ is the plural form
of the noun stem ‘egg’.
- Example:
• ‘Bread’ consists of a single morpheme
• ‘eggs’ consists of two: the morpheme egg and the morpheme -s.
Morphological Parsing
● There are two broad classes of morphemes:
○ Stems
○ Affixes
● Stem is the main morpheme, i.e. the morpheme that contains the central meaning.
● Affixes modify the meaning given by the stem.
● Affixes are divided into -
○ Prefix - morphemes which appear before the stem
○ Suffix - morphemes which are applied to the end of the stem
○ Infix - morphemes that may appear inside a stem.
○ Circumfix - morphemes that attach to both ends of the stem
Example:
• Telugu: pilli (‘cat’) – pillulu (‘cats’), where -ulu marks the plural
• An FST passes over the input string, consuming the input symbols on the tape it traverses, and converts them into
an output string of symbols.
• Hopcroft and Ullman (1979) define FST as follows -
• A finite state transducer is a 6-tuple (∑1, ∑2, Q, δ, S, F) where
• Q - set of states
• S - initial state
• F ⊆ Q is a set of final states
• ∑1 is the input alphabet
• ∑2 is the output alphabet
• δ is a function mapping Q X (∑1 U {ε}) X (∑2 U {ε}) to subsets of Q (i.e. into the power set of Q).
Finite State Transducer (FST)
• FST is similar to an NFA (Nondeterministic Finite State Automaton)
except that in FST transitions are made on strings rather than on symbols
and in addition, they have outputs.
• Figure shows a simple transducer that accepts two input strings, hot and
cat, and maps them onto cot and bat respectively.
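A minimal sketch of such a transducer in Python; the state names and transition table below are illustrative assumptions chosen to reproduce the two mappings (hot → cot, cat → bat):

# Transitions: (state, input symbol) -> (next state, output symbol)
TRANSITIONS = {
    ("q0", "h"): ("q1", "c"),   # h:c
    ("q0", "c"): ("q2", "b"),   # c:b
    ("q1", "o"): ("q3", "o"),   # o:o
    ("q2", "a"): ("q3", "a"),   # a:a
    ("q3", "t"): ("q4", "t"),   # t:t
}
FINAL_STATES = {"q4"}

def transduce(word):
    """Return the output string if the FST accepts the input, otherwise None."""
    state, output = "q0", []
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return None                          # no transition: reject
        state, out = TRANSITIONS[(state, symbol)]
        output.append(out)
    return "".join(output) if state in FINAL_STATES else None

print(transduce("hot"))   # cot
print(transduce("cat"))   # bat
print(transduce("dog"))   # None (rejected)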
Two Level Morphology Using FST
• As shown in the figure, two steps are involved in conversion from
the surface form of a word to its morphological analysis.
Two Level Morphology Using FST
• First Step:
Split the words into possible components
Ex: birds = bird + s, boxes = box + s (or) boxe + s
• Second Step:
Use a lexicon to look up the categories of the stems and meaning of the affixes.
Ex: bird + s = bird + N + PL
box + s = box + N + PL,
We can now find out that boxe is not a legal stem.
Spouses = spouse + s ✔
Parses = parse + s ✔
Orthographic rules are used to handle spelling variations. One of the rules says -
Add an e after -s, -z, -x, -ch, -sh and before the s suffix.
e.g., dish -> dishes, box -> boxes
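A rough sketch of this e-insertion rule as a simple regex-based rewrite (a stand-in for the orthographic transducer, not an implementation of it):

import re

def pluralize(stem):
    """Attach the plural suffix -s, inserting e after -s, -z, -x, -ch, -sh."""
    if re.search(r"(s|z|x|ch|sh)$", stem):
        return stem + "es"    # box -> boxes, dish -> dishes
    return stem + "s"         # bird -> birds

for stem in ["bird", "box", "dish", "spouse"]:
    print(stem, "->", pluralize(stem))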
Two Level Morphology Using FST
• The two steps are implemented with the help of transducers. Thus, we
need two transducers
1. To map the surface form to the intermediate form
2. To map the intermediate form to the lexical form
• Insertion error refers to the presence of an extra character in a word, e.g. when 'error' is misspelled as 'errorn'.
• Substitution error occurs when a wrong letter is typed in place of the right one, e.g. 'errpr', where 'p' appears in
place of 'o'.
• Reversal (also called transposition) refers to a situation in which the sequence of characters is reversed, e.g. 'aer'
instead of 'are'.
• Optical Character Recognition (OCR) and other automatic reading devices introduce errors of substitution,
deletion and insertion, but not of reversal.
Spelling Error Detection and Correction
OCR errors are usually grouped into five classes:
○ Substitution
○ Insertion
○ Failures
Spelling Error Detection and Correction
• OCR substitution errors are caused by visual similarity between characters, such as c ↔ e, l ↔ I, r ↔ n.
• Multi-substitution errors involve more than one character, e.g. m → rn.
• Failures occur when the OCR algorithm fails to select a letter with sufficient accuracy.
• These errors can be corrected using 'context' or by using linguistic structures.
• Spelling errors belong to one of two distinct categories:
1. Non-word errors
2. Real-word errors
Non-word errors:
- When an error results in a word that does not appear in a lexicon, or is not a valid orthographic word form, it is
termed a non-word error. Ex: 'mouse' typed as 'rnouse'.
Real-word errors:
A real-word error results in an actual word of the language. It occurs due to typographical mistakes or spelling
errors.
E.g. substituting the spelling of a homophone or near-homophone, such as peace for piece, or meat for meet.
Spelling Error Detection and Correction
❖ In isolated-word error detection and correction, each word is checked separately, independent of its context.
❖ There are many problems associated with this strategy:
○ The strategy requires the existence of a lexicon containing all correct words. Such a lexicon would take a
long time to compile and would occupy a lot of space.
○ Some languages are highly productive. It is impossible to list all the correct words of such languages.
○ The strategy fails when a spelling error produces a word that belongs to the lexicon, e.g. when 'theses' is
written in place of 'these'. Such an error is called a real-word error.
○ The larger the lexicon, the more likely it is that an error goes undetected, because the chance of a word
being found is greater in a large lexicon.
➢ The SOUNDEX system (Odell and Russell, 1918) is an example of a technique used in phonetic spelling
correction applications.
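A compact sketch of the classic SOUNDEX key (first letter plus three digits). The coding table below is the standard textbook one, offered as an illustration of the technique rather than as the exact system of the original patent:

# Standard SOUNDEX letter-to-digit coding
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"), "L": "4",
    **dict.fromkeys("MN", "5"), "R": "6",
}

def soundex(word):
    """Return the SOUNDEX key of a word: first letter + three digits."""
    word = word.upper()
    first = word[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if ch in "HW":              # H and W do not break a run of equal codes
            continue
        if code and code != prev:
            digits.append(code)
        prev = code                 # vowels reset prev, so repeats across vowels count
    return (first + "".join(digits) + "000")[:4]

# Words with the same key are candidate corrections for each other
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(soundex("piece"), soundex("peace"))     # P200 P200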
SPELLING ERROR CORRECTION ALGORITHM
3. n-gram based techniques:
➢ n-grams can be used for both non-word and real-word error detection, because certain bigrams and trigrams of
letters never occur in English.
➢ E.g. the trigram qst and the bigram qd. This information can be used to handle non-word errors.
➢ n-gram techniques require a large corpus or dictionary as training data, so that a table of possible n-gram letter
combinations can be compiled (see the sketch after this list).
4. Neural nets:
➢ These have the ability to do associative recall based on incomplete and noisy data.
➢ They can be trained to adapt to specific error patterns.
➢ The drawback is that they are computationally expensive.
5. Rule-based techniques:
➢ In the rule-based technique, a set of rules (heuristics), derived from knowledge of common spelling error patterns,
is used to transform misspelled words into valid words.
➢ E.g. if it is known that many errors occur from the letters 'ue' being typed as 'eu', then we may write a rule
that represents this.
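A minimal sketch of letter-bigram based non-word detection, as mentioned in point 3 above; the tiny word list standing in for a training corpus is an illustrative assumption:

def letter_bigrams(word):
    """Return the set of adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

# A real system would compile this table from a large corpus or dictionary
lexicon = ["question", "squad", "quiet", "adequate"]
known_bigrams = set().union(*(letter_bigrams(w) for w in lexicon))

def looks_like_nonword(word):
    """Flag a word if it contains a letter bigram never seen in training."""
    return bool(letter_bigrams(word) - known_bigrams)

print(looks_like_nonword("qden"))   # True: the bigram 'qd' never occurs
print(looks_like_nonword("quad"))   # False: all its bigrams are attested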
Minimum Edit Distance
• The minimum edit distance between two strings is the minimum number of insertions, deletions and substitutions
required to change one string into the other.
• E.g. the minimum edit distance between tutor and tumour is 2:
• we substitute 'm' for 't' and insert 'u' before 'r'.
• Edit distance between two strings can be represented as a binary function, ed, which maps two strings to their edit
distance. ed is symmetric. For any two strings s and t, ed(s, t) is always equal to ed(t, s).
• Edit distance can be viewed as string alignment problem. By aligning two strings, we can measure the degree to
which they match.
• There can be more than one possible alignment between two strings. The best possible alignment corresponds to
minimum edit distance between the strings.
• The alignment between Tutor and Tumour has a distance of 2.
T U T O _ R
T U M O U R
● A dash in the upper string indicates an insertion. A substitution occurs when the two aligned symbols do not
match.
Minimum Edit Distance
• The Levenshtein distance between two sequences is obtained by assigning a unit cost to each operation.
• Another possible alignment between tutor and tumour is
T U T _ O _ R
T U _ M O U R
which has a cost of 3. The better alignment is the one that corresponds to the minimum edit distance.
• Dynamic programming algorithm is used for finding minimum edit distance between two sequences.
• Dynamic programming refers to a class of algorithms that apply a table-driven approach to solve problems by
combining solutions to sub-problems.
• The dynamic programming algorithm for minimum edit distance is implemented by creating an edit distance matrix.
• This matrix has one row for each symbol in the source string and one column for each symbol in the target string.
• The (i, j)th cell in this matrix represents the distance between the first i characters of the source and the first j
characters of the target string.
Minimum Edit Distance
• The substitution cost will be 0 if the ith character of the source matches the jth character of the target.
• Minimum edit distance algorithms are useful for determining accuracy in speech recognition systems.
Minimum Edit Distance
How to compute the value of the dist(i,j)
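Assuming unit insertion and deletion costs, and a substitution cost of 0 on a match and 1 otherwise (as described above), the standard recurrence is

dist(i, j) = min( dist(i-1, j) + 1,                (deletion)
                  dist(i, j-1) + 1,                (insertion)
                  dist(i-1, j-1) + sub(i, j) )     (substitution)
where sub(i, j) = 0 if source[i] = target[j], else 1,
with base cases dist(i, 0) = i and dist(0, j) = j.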
Minimum Edit Distance
Minimum Edit Distance Algorithm
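A sketch of the dynamic-programming algorithm in Python, using unit costs so that it reproduces the distances quoted on the following slides:

def min_edit_distance(source, target):
    """Levenshtein distance computed with a dynamic-programming table."""
    n, m = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                              # deletions only
    for j in range(1, m + 1):
        dist[0][j] = j                              # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[n][m]

print(min_edit_distance("tutor", "tumour"))         # 2
print(min_edit_distance("execution", "intention"))  # 5
print(min_edit_distance("kitten", "sitting"))       # 3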
Minimum Edit Distance
Minimum Edit Distance between tutor and tumour using Dynamic
Programming Algorithm = 2. VTU - Dec 2018/Jan 2019
Minimum Edit Distance
Minimum Edit Distance between EXECUTION and INTENTION = 5
(Jan/Feb 2021)
Minimum Edit Distance
Minimum Edit Distance between KITTEN and SITTING = 3
PART OF SPEECH TAGGING
• Words are classified into categories called parts of speech. These are also called word classes or lexical
categories. These lexical categories are defined by their syntactic and morphological behaviour.
• The most common lexical categories are nouns and verbs. Other lexical categories include adjectives, adverbs,
prepositions and conjunctions.
• Word classes are categorized as open and closed word classes. Open word classes constantly acquire new
members while closed word classes do not.
• Open word classes are nouns, verbs (except auxiliary verbs), adjectives, adverbs and interjections.
• Closed word classes are prepositions, auxiliary verbs, determiners, conjunctions and particles.
PART OF SPEECH TAGGING
• Part of speech tagging is the process of assigning a part-of-speech
such as a noun, verb, pronoun, preposition, adverb and adjective to
each word in a sentence.
• The input to a tagging algorithm is the sequence of words of a
natural language sentence and a specified tag set (a finite list of
part-of-speech tags).
• The output is a single best part-of-speech tag for each word.
• In tagging, we determine the correct lexical category of a word in its
context.
• The tag assigned by a tagger is the most likely for a particular use of
word in a sentence.
PART OF SPEECH TAGGING
• The collection of tags used by a particular tagger is called a tag
set.
• Most part-of-speech tag sets make use of the same basic
categories, i.e., noun, verb, adjective and preposition.
• However, tag sets differ in how they define categories and how
finely they divide words into categories.
• Eg. both eat and eats might be tagged as a verb in one tag set, but
assigned distinct tags in another tag set.
• Most tag sets capture morpho-syntactic information such as
singular/plural, number, gender, tense, etc.
PART OF SPEECH TAGGING
• Consider the following sentences -
• Zuha eats an apple daily.
• Aman ate an apple yesterday.
• They have eaten all the apples in the basket.
• I like to eat guavas.
• The word eat has a distinct grammatical form in each of these four
sentences.
• Eat is the base form.
• Ate is its past tense.
• Eats requires a third person singular subject.
• Eaten is the past participle form; it is required after have or has
and cannot occur in other grammatical contexts.
PART OF SPEECH TAGGING
• The number of tags used by different taggers varies.
• The Penn Treebank tag set contains 45 tags while C7 uses 164.
• For English, which is not morphologically rich, the C7 tag set is
too big. The tagging process will produce too many mistagged words,
and the result needs to be manually corrected.
• Bigger tag sets like
• TOSCA-ICE with 270 tags
• TESS with 200 tags
are used for the International Corpus of English.
• The larger the tag set, the greater the information captured
about a linguistic context. But the task of tagging becomes more
complicated and requires manual correction.
PART OF SPEECH TAGGING
• A bigger tag set can be used for morphologically rich languages without
introducing too many tagging errors.
• A tag set that uses just one tag to denote all the verbs will assign identical
tags to all the forms of a verb. This coarse-grained distinction may be
appropriate for some tasks.
• A fine-grained tag set captures more information and is useful for tasks
like syntactic pattern detection.
• The Penn Treebank tag set captures finer distinctions by assigning
distinct tags to distinct grammatical forms of a verb as shown in the
figure below.
PART OF SPEECH TAGGING
Note:
Meaning of gerund:
A gerund is a form of a verb that ends in -ing and is used as a noun. It looks
like a verb, but it acts like a noun.
For example, the word swimming is a gerund.
We can use the word swimming in a sentence as a noun to refer to the act
of moving around in water, as in 'Swimming is fun.'
PART OF SPEECH TAGGING
• Part-of-speech tagging is an early stage of text processing in NLP
applications such as speech synthesis, machine translation,
information retrieval and information extraction.
• In information retrieval, part-of-speech tagging can be used for
indexing - for identifying useful tokens like nouns and phrases and
for disambiguating word senses.
• In tagging, a complete parse tree is not built; parts of speech are
assigned to words using contextual information.
• Part-of-speech tagging methods fall under the three general
categories -
1. Rule-based (linguistic)
2. Stochastic (data-driven)
3. Hybrid
PART OF SPEECH TAGGING
Rule based taggers:
• Use hand-coded rules to assign tags to words.
• These rules use a lexicon to obtain a list of candidate tags and
then use rules to discard incorrect tags.
Stochastic taggers:
• Stochastic taggers use data-driven approaches in which frequency-based
information is automatically derived from a corpus and used to tag
words.
• Stochastic taggers disambiguate words based on the probability
that a word occurs with a particular tag.
• The simplest scheme is to assign the most frequent tag to each
word.
• An early example of a stochastic tagger was CLAWS (Constituent
Likelihood Automatic Word-tagging System), which built on the
earlier rule-based TAGGIT system.
• The HMM (Hidden Markov Model) tagger is the standard stochastic tagger.
PART OF SPEECH TAGGING
• Hybrid Tagger:
• Hybrid taggers combine features of both rule-based and stochastic
taggers.
• Like rule-based systems, they use rules to specify tags.
• Like stochastic systems, they use machine learning to induce rules from a
tagged training corpus automatically.
• The transformation-based tagger or Brill tagger is an example of the
hybrid approach.
Rule Based Tagger
Rule-based taggers have a two-stage architecture
1. First stage is a dictionary look-up procedure, which returns a set of
potential tags (parts-of-speech) and appropriate syntactic features
for each word.
2. The second stage uses a set of hand-coded rules to discard
contextually illegitimate tags to get a single part-of-speech for each
word. For example, consider the noun-verb ambiguity in the
following sentence
The show must go on.
• The potential tags for the word show are {VB, NN}.
• The ambiguity is resolved by using a contextual rule of the kind illustrated in the sketch below.
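As an illustrative assumption (not necessarily the rule intended on the slide), a common hand-coded rule of this kind discards the verb reading when the preceding word is a determiner:

# Stage 1: dictionary look-up gives candidate tags per word (illustrative lexicon)
LEXICON = {"the": {"DT"}, "show": {"VB", "NN"}, "must": {"MD"},
           "go": {"VB"}, "on": {"IN"}}

def disambiguate(words):
    """Apply one hand-coded rule: drop VB when the previous word is a determiner."""
    tags = []
    for i, word in enumerate(words):
        candidates = set(LEXICON[word])
        if ("VB" in candidates and len(candidates) > 1
                and i > 0 and "DT" in LEXICON[words[i - 1]]):
            candidates.discard("VB")        # a determiner is not followed by a base verb
        tags.append(sorted(candidates)[0])  # keep the remaining (or first) candidate
    return tags

print(disambiguate(["the", "show", "must", "go", "on"]))
# ['DT', 'NN', 'MD', 'VB', 'IN']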
Rule-based taggers vs. stochastic taggers:
2. Rule-based: skill and effort are required in writing disambiguation rules. Stochastic: manual work is required to
achieve good performance.
3. Rule-based: time is spent in writing rules. Stochastic: time is spent developing restrictions on transitions and
emissions to improve tagger performance.
• While tagging the input data, we only observe the words, the tags (states) are
hidden.
• States of the model are visible in training, not during the tagging task.
• HMM makes use of lexical and bi-gram probabilities estimated over a tagged
training corpus in order to compute the most likely tag sequence for each sentence.
• One way to store the statistical information is to build a probability matrix.
• The probability matrix contains both the probability that an individual word belongs
to a word class as well as the n-gram analysis. Eg. for a bi-gram model the
probability that a word of class X follows a word of class Y.
• This matrix is then used to drive the HMM tagger while tagging an unknown text.
Stochastic Tagger - HMM Tagger
Let W be the sequence of words.
W = w1, w2, …, wn
The task is to find the tag sequence
T = t1, t2, …, tn
Which maximizes P(T|W), i.e.,
T’ = argmaxT P(T|W)
Applying Bayes Rule, P(T|W) can be estimated using the expression:
P(T|W) = P(W|T) * P(T) / P(W)
As the probability of the word sequence, P(W), remains the same for
each tag sequence, it can be dropped.
Hence the expression for the most likely tag sequence becomes
T’ = argmaxT P(W|T) * P(T)
Stochastic Tagger - HMM Tagger
• Using the Markov assumption, the probability of a tag sequence can be
estimated as the product of the probability of its constituent n-grams, i.e.,
• P(T) = P(t1) * P(t2|t1) * P(t3|t1t2) … * P(tn|t1 … tn-1)
• P(W|T) is the probability of seeing a word sequence, given a tag
sequence. For example, it is asking the probability of seeing “The egg is rotten”
given “DT NNP VB JJ”.
• The following two assumptions are made
1. The words are independent of each other
2. The probability of a word is dependent only on its tag.
• Using these assumptions, we get
• P(W|T) = P(w1|t1) * P(w2|t2) * … * P(wi|ti) * … * P(wn|tn)
• i.e., P(W|T) = Π (i = 1 to n) P(wi|ti)
Stochastic Tagger - HMM Tagger
P(W|T) * P(T) = Π (i = 1 to n) P(wi|ti) * P(t1) * P(t2|t1) * P(t3|t1t2) … * P(tn|t1 … tn-1)
Approximating the tag history using only the two previous tags, the transition
probability becomes
P(T) = P(t1) * P(t2|t1) * P(t3|t1t2) *… * P(tn|tn-2tn-1)
P(T|W) ∝ P(W|T) * P(T)
P(T|W) ∝ Π (i = 1 to n) P(wi|ti) * P(t1) * P(t2|t1) * P(t3|t1t2) * … * P(tn|tn-2tn-1)
Stochastic Tagger - HMM Tagger
Example to demonstrate how the probability of a particular
part-of-speech sequence for a given sentence can be computed
Consider the sentence - The bird can fly.
And the tag sequence - DT NNP MD VB
Using the bigram approximation, the probability of the tag sequence
DT NNP MD VB
for the sentence 'The bird can fly'
can be computed as
= P(DT) * P(NNP|DT) * P(MD|NNP) * P(VB|MD)
* P(the|DT) * P(bird|NNP) * P(can|MD) * P(fly|VB)
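A tiny sketch of this computation in Python; all probability values below are made-up placeholders, since no estimated transition and emission tables are given here:

# Hypothetical bigram (transition) and lexical (emission) probabilities
transition = {("<s>", "DT"): 0.5, ("DT", "NNP"): 0.2,
              ("NNP", "MD"): 0.3, ("MD", "VB"): 0.6}
emission = {("the", "DT"): 0.4, ("bird", "NNP"): 0.01,
            ("can", "MD"): 0.3, ("fly", "VB"): 0.05}

words = ["the", "bird", "can", "fly"]
tags = ["DT", "NNP", "MD", "VB"]

prob = 1.0
prev_tag = "<s>"                              # sentence-start marker
for word, tag in zip(words, tags):
    prob *= transition[(prev_tag, tag)]       # P(tag | previous tag)
    prob *= emission[(word, tag)]             # P(word | tag)
    prev_tag = tag

print(prob)   # score of the tag sequence DT NNP MD VB for 'the bird can fly'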
Stochastic Tagger - HMM Tagger
Advantages of Stochastic Tagger:
• Accurate
• Language independent
• Most stochastic taggers have an accuracy of 96-97%. The accuracy is
measured as a percentage of words.
• For a sentence containing 20 words, an accuracy of 96% means the error rate
per sentence is 1 - 0.96^20 ≈ 56%.
• This corresponds to roughly one mistagged word per sentence.
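A quick check of this arithmetic:

word_accuracy = 0.96
sentence_length = 20
print(round(1 - word_accuracy ** sentence_length, 2))   # 0.56: chance of at least one tagging error
print(sentence_length * (1 - word_accuracy))            # 0.8 expected mistagged words per sentence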
Stochastic Tagger - HMM Tagger
Drawback of Stochastic Tagger:
• Requires manually tagged corpus for training.
• However, Kupiec and Cutting et al. have shown that an HMM tagger can be trained
from unannotated text.
• This makes it possible to use the model for languages in which a
manually tagged corpus is not available.
• Still, a tagger trained on a hand-tagged corpus performs better than one
trained on unannotated text.
Hybrid Taggers
• Hybrid taggers combine the features of both rule-based and stochastic
approaches.
• They use rules to assign tags to words.
• Like the stochastic taggers, this is a machine learning technique and the
rules are automatically induced from the data.
• Transformation-based learning (TBL) of tags, also known as
Brill tagging, is an example of the hybrid approach.
• TBL is a machine learning method introduced by E. Brill in 1995.
• Transformation-based error driven learning has been applied to a
number of natural language problems, such as, part-of-speech (POS)
tagging, speech generation, and syntactic parsing.
Hybrid Taggers
Figure shows the TBL process.
TBL is also a supervised learning technique.
Hybrid Taggers
Steps involved in TBL tagging algorithm:
Hybrid Taggers
• Each transformation is a pair: a re-write rule of the form t1 -> t2 and a contextual
condition.
• In order to limit the set of transformations, a small set of templates is constructed.
• Some of the transformation templates and the transformations learned by TBL tagging are
listed below.
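As an illustration of how such a transformation is applied, here is a minimal sketch using the textbook Brill rule 'change NN to VB when the previous tag is TO' (an assumption, not necessarily one of the templates on the slide):

def apply_transformation(tagged, from_tag, to_tag, prev_tag):
    """Rewrite from_tag as to_tag wherever the preceding tag is prev_tag."""
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        if tag == from_tag and result[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result

# Initial tagging with the most frequent tag: 'race' is usually a noun
sentence = [("to", "TO"), ("race", "NN"), ("tomorrow", "NR")]

# Learned transformation: NN -> VB when the previous tag is TO
print(apply_transformation(sentence, "NN", "VB", "TO"))
# [('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NR')]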
Hybrid Taggers