Module 2
• Example: The regular expression /.at/ matches any three-character string ending in at, such as cat, bat, rat,
mat, nat, 4at, etc.
RE-Character Classes
• /.....berry/ matches any ten-letter string that ends in berry. For
example strawberry, sugarberry, blackberry, etc.
• To search for either ‘Tanveer’ or ‘Siddiqui’ we have to use the disjunction
operator, written with the pipe symbol (|).
• The pattern /Blackberry|blackberries/ matches either
Blackberry or blackberries.
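A minimal sketch of these patterns using Python's re module (the test strings below are illustrative):

import re

# '.' matches any single character, so /.at/ matches cat, bat, 4at, ...
wildcard = re.compile(r"^.at$")
for word in ["cat", "bat", "4at", "at"]:
    print(word, bool(wildcard.match(word)))   # 'at' does not match

# '|' is the disjunction operator
names = re.compile(r"Tanveer|Siddiqui")
print(bool(names.search("Prof. Tanveer teaches NLP")))  # True

# ten-letter strings ending in 'berry'
berry = re.compile(r"^.....berry$")
print(bool(berry.match("strawberry")))  # True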
Exercise
Check if a string is an email address or not.
RE-Character Classes
An email address consists of a non-empty sequence of characters, followed by the 'at' symbol (@), followed by
another non-empty sequence of characters ending with a pattern like .xx, .xxx, .xxxx, etc. A regular expression for an
email address is
^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$
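A small sketch that applies this expression in Python; the sample addresses are made up for illustration:

import re

# Regular expression for an email address, taken from the slide above
EMAIL_RE = re.compile(r"^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$")

def is_email(text):
    """Return True if the whole string matches the email pattern."""
    return EMAIL_RE.match(text) is not None

for candidate in ["user_01@example.com", "no-at-symbol.com", "a@b.c"]:
    print(candidate, is_email(candidate))   # True, False, False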
Morphological Parsing
- Morphology is a subdiscipline of linguistics.
- It studies word structure and the formation of words from smaller units (morphemes).
- The goal of morphological parsing is to discover the morphemes that build a given
word.
- Morphemes are the smallest meaning-bearing units in a language.
- The morphological parser should be able to tell us that the word ‘eggs’ is the plural form
of the noun stem ‘egg’.
- Example:
• ‘Bread’ consists of a single morpheme
• ‘eggs’ consists of two: the morpheme egg and the morpheme -s.
Morphological Parsing
● There are two broad classes of morphemes:
○ Stems
○ Affixes
● Stem is the main morpheme, i.e. the morpheme that contains the central meaning.
● Affixes modify the meaning given by the stem.
● Affixes are divided into -
○ Prefix - morphemes which appear before the stem
○ Suffix - morphemes which are applied to the end of the stem
○ Infix - morphemes that may appear inside a stem.
○ Circumfix - morphemes that attach to both ends of the stem
Example:
• Telugu: pilli (‘cat’) – pillulu (‘cats’), where -ulu marks the plural
• An FST passes over the input string, consuming the input symbols on the tape it traverses, and converts them into
an output string of symbols.
• Hopcroft and Ullman (1979) define FST as follows -
• A finite state transducer is a 6-tuple (∑1, ∑2, Q, δ, S, F) where
• Q - set of states
• S - initial state
• F ⊆ Q is a set of final states
• ∑1 is the input alphabet
• ∑2 is the output alphabet
• δ is a function mapping Q X (∑1 U {ε}) X (∑2 U {ε}) to subsets of Q (i.e. into the power set of Q).
Finite State Transducer (FST)
• FST is similar to an NFA (Nondeterministic Finite State Automaton)
except that in FST transitions are made on strings rather than on symbols
and in addition, they have outputs.
• Figure shows a simple transducer that accepts two input strings, hot and
cat, and maps them onto cot and bat respectively.
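A minimal sketch of such a transducer in Python; the state names and transition table below are illustrative assumptions chosen to reproduce the two mappings (hot → cot, cat → bat):

# Transitions: (state, input symbol) -> (next state, output symbol)
TRANSITIONS = {
    ("q0", "h"): ("q1", "c"),   # h:c
    ("q0", "c"): ("q2", "b"),   # c:b
    ("q1", "o"): ("q3", "o"),   # o:o
    ("q2", "a"): ("q3", "a"),   # a:a
    ("q3", "t"): ("q4", "t"),   # t:t
}
FINAL_STATES = {"q4"}

def transduce(word):
    """Return the output string if the FST accepts the input, otherwise None."""
    state, output = "q0", []
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return None                          # no transition: reject
        state, out = TRANSITIONS[(state, symbol)]
        output.append(out)
    return "".join(output) if state in FINAL_STATES else None

print(transduce("hot"))   # cot
print(transduce("cat"))   # bat
print(transduce("dog"))   # None (rejected)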
Two Level Morphology Using FST
• As shown in the figure, two steps are involved in conversion from
the surface form of a word to its morphological analysis.
Two Level Morphology Using FST
• First Step:
Split the words into possible components
Ex: birds = bird + s, boxes = box + s (or) boxe + s
• Second Step:
Use a lexicon to look up the categories of the stems and meaning of the affixes.
Ex: bird + s = bird + N + PL
box + s = box + N + PL,
We can now find out that boxe is not a legal stem.
Spouses = spouse + s ✔
Parses = parse + s ✔
Orthographic rules are used to handle spelling variations. One of the rules says -
Add an e after -s, -z, -x, -ch, -sh and before the s suffix.
e.g., dish -> dishes, box -> boxes
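A rough sketch of this e-insertion rule as a simple regex-based rewrite (a stand-in for the orthographic transducer, not an implementation of it):

import re

def pluralize(stem):
    """Attach the plural suffix -s, inserting e after -s, -z, -x, -ch, -sh."""
    if re.search(r"(s|z|x|ch|sh)$", stem):
        return stem + "es"    # box -> boxes, dish -> dishes
    return stem + "s"         # bird -> birds

for stem in ["bird", "box", "dish", "spouse"]:
    print(stem, "->", pluralize(stem))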
Two Level Morphology Using FST
• The two steps are implemented with the help of transducers. Thus, we
need two transducers
1. To map the surface form to the intermediate form
2. To map the intermediate form to the lexical form
• Insertion error refers to the presence of an extra character in a word, e.g. when 'error' is misspelled as 'errorn'.
• Substitution error occurs when a wrong letter is typed in place of the right one, e.g. 'errpr', where 'p' appears in
place of 'o'.
• Reversal (also called transposition) refers to a situation in which the sequence of characters is reversed, e.g. 'aer'
instead of 'are'.
• Optical Character Recognition (OCR) and other automatic reading devices introduce errors of substitution,
deletion and insertion, but not of reversal.
Spelling Error Detection and Correction
OCR errors are usually grouped into five classes:
○ Substitution
○ Insertion
○ Failures
Spelling Error Detection and Correction
• OCR substitution errors are caused by visual similarity between characters, such as c ↔ e, l ↔ I, r ↔ n.
• Multi-substitution errors involve more than one character, e.g. m → rn.
• Failures occur when the OCR algorithm fails to select a letter with sufficient accuracy.
• These errors can be corrected using 'context' or by using linguistic structures.
• Spelling errors belong to one of two distinct categories:
1. Non-word errors
2. Real-word errors
Non-word errors:
- When an error results in a word that does not appear in a lexicon, or is not a valid orthographic word form, it is
termed a non-word error. Ex: 'mouse' typed as 'rnouse'.
Real-word errors:
A real-word error results in an actual word of the language. It occurs due to typographical mistakes or spelling
errors.
E.g. substituting the spelling of a homophone or near-homophone, such as peace for piece, or meat for meet.
Spelling Error Detection and Correction
❖ In isolated-word error detection and correction, each word is checked separately, independent of its context.
❖ There are many problems associated with this strategy:
○ The strategy requires the existence of a lexicon containing all correct words. Such a lexicon would take a
long time to compile and would occupy a lot of space.
○ Some languages are highly productive. It is impossible to list all the correct words of such languages.
○ The strategy fails when a spelling error produces a word that belongs to the lexicon, e.g. when 'theses' is
written in place of 'these'. Such an error is called a real-word error.
○ The larger the lexicon, the more likely it is that an error goes undetected, because the chance of a word
being found is greater in a large lexicon.
➢ The SOUNDEX system (Odell and Russell, 1918) is an example of a technique used in phonetic spelling
correction applications.
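A compact sketch of the classic SOUNDEX key (first letter plus three digits). The coding table below is the standard textbook one, offered as an illustration of the technique rather than as the exact system of the original patent:

# Standard SOUNDEX letter-to-digit coding
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"), "L": "4",
    **dict.fromkeys("MN", "5"), "R": "6",
}

def soundex(word):
    """Return the SOUNDEX key of a word: first letter + three digits."""
    word = word.upper()
    first = word[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in word[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        if ch in "HW":              # H and W do not break a run of equal codes
            continue
        if code and code != prev:
            digits.append(code)
        prev = code                 # vowels reset prev, so repeats across vowels count
    return (first + "".join(digits) + "000")[:4]

# Words with the same key are candidate corrections for each other
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
print(soundex("piece"), soundex("peace"))     # P200 P200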
SPELLING ERROR CORRECTION ALGORITHM
3. n-gram based techniques:
➢ n-grams can be used for both non-word and real-word error detection, because certain bigrams and trigrams of
letters never occur in English.
➢ E.g. the trigram qst and the bigram qd. This information can be used to handle non-word errors.
➢ n-gram techniques require a large corpus or dictionary as training data, so that a table of possible n-gram letter
combinations can be compiled (see the sketch after this list).
4. Neural nets:
➢ These have the ability to do associative recall based on incomplete and noisy data.
➢ They can be trained to adapt to specific error patterns.
➢ The drawback is that they are computationally expensive.
5. Rule-based techniques:
➢ In the rule-based technique, a set of rules (heuristics), derived from knowledge of common spelling error patterns,
is used to transform misspelled words into valid words.
➢ E.g. if it is known that many errors occur from the letters 'ue' being typed as 'eu', then we may write a rule
that represents this.
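A minimal sketch of letter-bigram based non-word detection, as mentioned in point 3 above; the tiny word list standing in for a training corpus is an illustrative assumption:

def letter_bigrams(word):
    """Return the set of adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

# A real system would compile this table from a large corpus or dictionary
lexicon = ["question", "squad", "quiet", "adequate"]
known_bigrams = set().union(*(letter_bigrams(w) for w in lexicon))

def looks_like_nonword(word):
    """Flag a word if it contains a letter bigram never seen in training."""
    return bool(letter_bigrams(word) - known_bigrams)

print(looks_like_nonword("qden"))   # True: the bigram 'qd' never occurs
print(looks_like_nonword("quad"))   # False: all its bigrams are attested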
Minimum Edit Distance
• The minimum edit distance between two strings is the minimum number of insertions, deletions and substitutions
required to change one string into the other.
• E.g. the minimum edit distance between tutor and tumour is 2:
• we substitute 'm' for 't' and insert 'u' before 'r'.
• Edit distance between two strings can be represented as a binary function, ed, which maps two strings to their edit
distance. ed is symmetric. For any two strings s and t, ed(s, t) is always equal to ed(t, s).
• Edit distance can be viewed as string alignment problem. By aligning two strings, we can measure the degree to
which they match.
• There can be more than one possible alignment between two strings. The best possible alignment corresponds to
minimum edit distance between the strings.
• The alignment between Tutor and Tumour has a distance of 2.
T U T O _ R
T U M O U R
● A dash in the upper string indicates an insertion. A substitution occurs when the two aligned symbols do not
match.
Minimum Edit Distance
• The Levenshtein distance between two sequences is obtained by assigning a unit cost to each operation.
• Another possible alignment between tutor and tumour is
T U T _ O _ R
T U _ M O U R
which has a cost of 3. The better alignment is the one that corresponds to the minimum edit distance.
• Dynamic programming algorithm is used for finding minimum edit distance between two sequences.
• Dynamic programming refers to a class of algorithms that apply a table-driven approach to solve problems by
combining solutions to sub-problems.
• The dynamic programming algorithm for minimum edit distance is implemented by creating an edit distance matrix.
• This matrix has one row for each symbol in the source string and one column for each symbol in the target string.
• The (i, j)th cell in this matrix represents the distance between the first i characters of the source and the first j
characters of the target string.
Minimum Edit Distance
• The substitution cost will be 0 if the ith character of the source matches the jth character of the target.
• Minimum edit distance algorithms are useful for determining accuracy in speech recognition systems.
Minimum Edit Distance
How to compute the value of the dist(i,j)
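Assuming unit insertion and deletion costs, and a substitution cost of 0 on a match and 1 otherwise (as described above), the standard recurrence is

dist(i, j) = min( dist(i-1, j) + 1,                (deletion)
                  dist(i, j-1) + 1,                (insertion)
                  dist(i-1, j-1) + sub(i, j) )     (substitution)
where sub(i, j) = 0 if source[i] = target[j], else 1,
with base cases dist(i, 0) = i and dist(0, j) = j.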
Minimum Edit Distance
Minimum Edit Distance Algorithm
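A sketch of the dynamic-programming algorithm in Python, using unit costs so that it reproduces the distances quoted on the following slides:

def min_edit_distance(source, target):
    """Levenshtein distance computed with a dynamic-programming table."""
    n, m = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                              # deletions only
    for j in range(1, m + 1):
        dist[0][j] = j                              # insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[n][m]

print(min_edit_distance("tutor", "tumour"))         # 2
print(min_edit_distance("execution", "intention"))  # 5
print(min_edit_distance("kitten", "sitting"))       # 3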
Minimum Edit Distance
Minimum Edit Distance between tutor and tumour using Dynamic
Programming Algorithm = 2. VTU - Dec 2018/Jan 2019
Minimum Edit Distance
Minimum Edit Distance between EXECUTION and INTENTION = 5
(Jan/Feb 2021)
Minimum Edit Distance
Minimum Edit Distance between KITTEN and SITTING = 3
PART OF SPEECH TAGGING
• Words are classified into categories called parts of speech. These are also called word classes or lexical
categories. These lexical categories are defined by their syntactic and morphological behaviour.
• The most common lexical categories are nouns and verbs. Other lexical categories include adjectives, adverbs,
prepositions and conjunctions.
• Word classes are categorized as open and closed word classes. Open word classes constantly acquire new
members while closed word classes do not.
• Open word classes are nouns, verbs (except auxiliary verbs), adjectives, adverbs and interjections.
• Closed word classes are prepositions, auxiliary verbs, determiners, conjunctions and particles.
PART OF SPEECH TAGGING
• Part of speech tagging is the process of assigning a part-of-speech
such as a noun, verb, pronoun, preposition, adverb and adjective to
each word in a sentence.
• The input to a tagging algorithm is the sequence of words of a
natural language sentence and a specified tag set (a finite list of
part-of-speech tags).
• The output is a single best part-of-speech tag for each word.
• In tagging, we determine the correct lexical category of a word in its
context.
• The tag assigned by a tagger is the most likely for a particular use of
word in a sentence.
PART OF SPEECH TAGGING
• The collection of tags used by a particular tagger is called a tag
set.
• Most part-of-speech tag sets make use of the same basic
categories, i.e., noun, verb, adjective and preposition.
• However, tag sets differ in how they define categories and how
finely they divide words into categories.
• Eg. both eat and eats might be tagged as a verb in one tag set, but
assigned distinct tags in another tag set.
• Most tag sets capture morpho-syntactic information such as
singular/plural, number, gender, tense, etc.
PART OF SPEECH TAGGING
• Consider the following sentences -
• Zuha eats an apple daily.
• Aman ate an apple yesterday.
• They have eaten all the apples in the basket.
• I like to eat guavas.
• The word eat has a distinct grammatical form in each of these four
sentences.
• Eat is the base form.
• Ate is its past tense.
• Eats requires a third person singular subject.
• Eaten is the past participle form; it is required after have or has
and cannot occur in other grammatical contexts.
PART OF SPEECH TAGGING
• The number of tags used by different taggers varies.
• The Penn Treebank tag set contains 45 tags while C7 uses 164.
• For English, which is not morphologically rich, the C7 tag set is
too big. The tagging process will produce too many mistagged words,
and the result needs to be manually corrected.
• Bigger tag sets like
• TOSCA-ICE with 270 tags
• TESS with 200 tags
are used for the International Corpus of English.
• The larger the tag set, the greater the information captured
about a linguistic context. But the task of tagging becomes more
complicated and requires manual correction.
PART OF SPEECH TAGGING
• A bigger tag set can be used for morphologically rich languages without
introducing too many tagging errors.
• A tag set that uses just one tag to denote all the verbs will assign identical
tags to all the forms of a verb. This coarse-grained distinction may be
appropriate for some tasks.
• A fine-grained tag set captures more information and is useful for tasks
like syntactic pattern detection.
• The Penn Treebank tag set captures finer distinctions by assigning
distinct tags to distinct grammatical forms of a verb as shown in the
figure below.
PART OF SPEECH TAGGING
Note:
Meaning of gerund:
A gerund is a form of a verb that ends in -ing and is used as a noun. It looks
like a verb, but it acts like a noun.
For example, the word swimming is a gerund.
We can use the word swimming in a sentence as a noun to refer to the act
of moving around in water, as in 'Swimming is fun.'
PART OF SPEECH TAGGING
• Part-of-speech tagging is an early stage of text processing in NLP
applications such as speech synthesis, machine translation,
information retrieval and information extraction.
• In information retrieval, part-of-speech tagging can be used for
indexing - for identifying useful tokens like nouns and phrases and
for disambiguating word senses.
• In tagging, a complete parse tree is not built; parts of speech are
assigned to words using contextual information.
• Part-of-speech tagging methods fall under the three general
categories -
1. Rule-based (linguistic)
2. Stochastic (data-driven)
3. Hybrid
PART OF SPEECH TAGGING
Rule based taggers:
• Use hand-coded rules to assign tags to words.
• These rules use a lexicon to obtain a list of candidate tags and
then use rules to discard incorrect tags.
Stochastic taggers:
• Stochastic taggers use data-driven approaches in which frequency-based
information is automatically derived from a corpus and used to tag
words.
• Stochastic taggers disambiguate words based on the probability
that a word occurs with a particular tag.
• The simplest scheme is to assign the most frequent tag to each
word.
• An early example of a stochastic tagger was CLAWS (Constituent
Likelihood Automatic Word-tagging System), which built on the
earlier rule-based TAGGIT system.
• The HMM (Hidden Markov Model) tagger is the standard stochastic tagger.
PART OF SPEECH TAGGING
• Hybrid Tagger:
• Hybrid taggers combine features of both rule-based and stochastic
taggers.
• Like rule-based systems, they use rules to specify tags.
• Like stochastic systems, they use machine learning to induce rules from a
tagged training corpus automatically.
• The transformation-based tagger or Brill tagger is an example of the
hybrid approach.
Rule Based Tagger
Rule-based taggers have a two-stage architecture
1. First stage is a dictionary look-up procedure, which returns a set of
potential tags (parts-of-speech) and appropriate syntactic features
for each word.
2. The second stage uses a set of hand-coded rules to discard
contextually illegitimate tags to get a single part-of-speech for each
word. For example, consider the noun-verb ambiguity in the
following sentence
The show must go on.
• The potential tags for the word show are {VB, NN}.
• The ambiguity is resolved by using a contextual rule of the kind illustrated in the sketch below.
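As an illustrative assumption (not necessarily the rule intended on the slide), a common hand-coded rule of this kind discards the verb reading when the preceding word is a determiner:

# Stage 1: dictionary look-up gives candidate tags per word (illustrative lexicon)
LEXICON = {"the": {"DT"}, "show": {"VB", "NN"}, "must": {"MD"},
           "go": {"VB"}, "on": {"IN"}}

def disambiguate(words):
    """Apply one hand-coded rule: drop VB when the previous word is a determiner."""
    tags = []
    for i, word in enumerate(words):
        candidates = set(LEXICON[word])
        if ("VB" in candidates and len(candidates) > 1
                and i > 0 and "DT" in LEXICON[words[i - 1]]):
            candidates.discard("VB")        # a determiner is not followed by a base verb
        tags.append(sorted(candidates)[0])  # keep the remaining (or first) candidate
    return tags

print(disambiguate(["the", "show", "must", "go", "on"]))
# ['DT', 'NN', 'MD', 'VB', 'IN']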
Rule-based taggers vs. stochastic taggers:
2. Rule-based: skill and effort are required in writing disambiguation rules. Stochastic: manual work is required to
achieve good performance.
3. Rule-based: time is spent in writing rules. Stochastic: time is spent developing restrictions on transitions and
emissions to improve tagger performance.
• While tagging the input data, we only observe the words, the tags (states) are
hidden.
• States of the model are visible in training, not during the tagging task.
• HMM makes use of lexical and bi-gram probabilities estimated over a tagged
training corpus in order to compute the most likely tag sequence for each sentence.
• One way to store the statistical information is to build a probability matrix.
• The probability matrix contains both the probability that an individual word belongs
to a word class as well as the n-gram analysis. Eg. for a bi-gram model the
probability that a word of class X follows a word of class Y.
• This matrix is then used to drive the HMM tagger while tagging an unknown text.
Stochastic Tagger - HMM Tagger
Let W be the sequence of words.
W = w1, w2, …, wn
The task is to find the tag sequence
T = t1, t2, …, tn
Which maximizes P(T|W), i.e.,
T’ = argmaxT P(T|W)
Applying Bayes Rule, P(T|W) can be estimated using the expression:
P(T|W) = P(W|T) * P(T) / P(W)
As the probability of the word sequence, P(W), remains the same for
each tag sequence, it can be dropped.
Hence the expression for the most likely tag sequence becomes
T’ = argmaxT P(W|T) * P(T)
Stochastic Tagger - HMM Tagger
• Using the Markov assumption, the probability of a tag sequence can be
estimated as the product of the probability of its constituent n-grams, i.e.,
• P(T) = P(t1) * P(t2|t1) * P(t3|t1t2) … * P(tn|t1 … tn-1)
• P(W|T) is the probability of seeing a word sequence, given a tag
sequence. For example, it is asking the probability of seeing “The egg is rotten”
given “DT NNP VB JJ”.
• The following two assumptions are made
1. The words are independent of each other
2. The probability of a word is dependent only on its tag.
• Using these assumptions, we get
• P(W|T) = P(w1|t1) * P(w2|t2) * … * P(wi|ti) * … * P(wn|tn)
• i.e., P(W|T) = Π (i = 1 to n) P(wi|ti)
Stochastic Tagger - HMM Tagger
P(W|T) * P(T) = Π (i = 1 to n) P(wi|ti) * P(t1) * P(t2|t1) * P(t3|t1t2) … * P(tn|t1 … tn-1)
Approximating the tag history using only the two previous tags, the transition
probability becomes
P(T) = P(t1) * P(t2|t1) * P(t3|t1t2) *… * P(tn|tn-2tn-1)
P(T|W) ∝ P(W|T) * P(T)
P(T|W) ∝ Π (i = 1 to n) P(wi|ti) * P(t1) * P(t2|t1) * P(t3|t1t2) * … * P(tn|tn-2tn-1)
Stochastic Tagger - HMM Tagger
Example to demonstrate how the probability of a particular
part-of-speech sequence for a given sentence can be computed
Consider the sentence - The bird can fly.
And the tag sequence - DT NNP MD VB
Using the bigram approximation, the probability of the tag sequence
DT NNP MD VB
for the sentence 'The bird can fly'
can be computed as
= P(DT) * P(NNP|DT) * P(MD|NNP) * P(VB|MD)
* P(the|DT) * P(bird|NNP) * P(can|MD) * P(fly|VB)
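A tiny sketch of this computation in Python; all probability values below are made-up placeholders, since no estimated transition and emission tables are given here:

# Hypothetical bigram (transition) and lexical (emission) probabilities
transition = {("<s>", "DT"): 0.5, ("DT", "NNP"): 0.2,
              ("NNP", "MD"): 0.3, ("MD", "VB"): 0.6}
emission = {("the", "DT"): 0.4, ("bird", "NNP"): 0.01,
            ("can", "MD"): 0.3, ("fly", "VB"): 0.05}

words = ["the", "bird", "can", "fly"]
tags = ["DT", "NNP", "MD", "VB"]

prob = 1.0
prev_tag = "<s>"                              # sentence-start marker
for word, tag in zip(words, tags):
    prob *= transition[(prev_tag, tag)]       # P(tag | previous tag)
    prob *= emission[(word, tag)]             # P(word | tag)
    prev_tag = tag

print(prob)   # score of the tag sequence DT NNP MD VB for 'the bird can fly'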
Stochastic Tagger - HMM Tagger
Advantages of Stochastic Tagger:
• Accurate
• Language independent
• Most stochastic taggers have an accuracy of 96-97%. The accuracy is
measured as a percentage of words.
• For a sentence containing 20 words, an accuracy of 96% means the error rate
per sentence is 1 - 0.96^20 ≈ 56%.
• This corresponds to roughly one mistagged word per sentence.
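A quick check of this arithmetic:

word_accuracy = 0.96
sentence_length = 20
print(round(1 - word_accuracy ** sentence_length, 2))   # 0.56: chance of at least one tagging error
print(sentence_length * (1 - word_accuracy))            # 0.8 expected mistagged words per sentence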
Stochastic Tagger - HMM Tagger
Drawback of Stochastic Tagger:
• Requires manually tagged corpus for training.
• However, Kupiec and Cutting et al. have shown that an HMM tagger can be trained
from unannotated text.
• This makes it possible to use the model for languages in which a
manually tagged corpus is not available.
• Still, a tagger trained on a hand-tagged corpus performs better than one
trained on unannotated text.
Hybrid Taggers
• Hybrid taggers combine the features of both rule-based and stochastic
approaches.
• They use rules to assign tags to words.
• Like the stochastic taggers, this is a machine learning technique and the
rules are automatically induced from the data.
• Transformation-based learning (TBL) of tags, also known as
Brill tagging, is an example of the hybrid approach.
• TBL is a machine learning method introduced by E. Brill in 1995.
• Transformation-based error driven learning has been applied to a
number of natural language problems, such as, part-of-speech (POS)
tagging, speech generation, and syntactic parsing.
Hybrid Taggers
Figure shows the TBL process.
TBL is also a supervised learning technique.
Hybrid Taggers
Steps involved in TBL tagging algorithm:
Hybrid Taggers
• Each transformation is a pair: a re-write rule of the form t1 -> t2 and a contextual
condition.
• In order to limit the set of transformations, a small set of templates is constructed.
• Some of the transformation templates and the transformations learned by TBL tagging are
listed below.
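As an illustration of how such a transformation is applied, here is a minimal sketch using the textbook Brill rule 'change NN to VB when the previous tag is TO' (an assumption, not necessarily one of the templates on the slide):

def apply_transformation(tagged, from_tag, to_tag, prev_tag):
    """Rewrite from_tag as to_tag wherever the preceding tag is prev_tag."""
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        if tag == from_tag and result[i - 1][1] == prev_tag:
            result[i] = (word, to_tag)
    return result

# Initial tagging with the most frequent tag: 'race' is usually a noun
sentence = [("to", "TO"), ("race", "NN"), ("tomorrow", "NR")]

# Learned transformation: NN -> VB when the previous tag is TO
print(apply_transformation(sentence, "NN", "VB", "TO"))
# [('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NR')]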
Hybrid Taggers