NLP Unit II
Grammar: A grammar is a set of rules which checks whether a string belongs to a particular language or not.
A program consists of various strings of characters, but not every string is a proper or
meaningful one. So, to identify the valid strings of a language, rules must be specified to
check whether a string is valid or not. These rules together make up the grammar.
Example: In the English language, grammar checks whether a string of words is acceptable
or not, i.e., checks whether nouns, verbs, adverbs, etc. are in the proper sequence.
Context-Free Grammar(CFG):
Context-free grammar (CFG) is a type of formal grammar used in natural language
processing (NLP) and computational linguistics. It provides a way to describe the structure
of a language in terms of a set of rules for generating its sentences.
Example rules, where S is the start symbol:
1. S -> NP VP
2. NP -> ART N
3. NP -> ART ADJ N
4. VP -> V
5. VP -> V NP
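As a quick check of these rules, the sketch below encodes them with the NLTK toolkit and parses a short sentence. The lexical rules (the, a, old, large, dogs, man, cried, saw) are illustrative assumptions added here, since the rules above list no lexicon.

import nltk

# Rules 1-5 from above, plus a small illustrative lexicon (assumed, not given in the notes).
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> ART N
NP -> ART ADJ N
VP -> V
VP -> V NP
ART -> 'the' | 'a'
ADJ -> 'old' | 'large'
N  -> 'dogs' | 'man'
V  -> 'cried' | 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dogs cried".split()):
    print(tree)    # (S (NP (ART the) (N dogs)) (VP (V cried)))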
Derivations:
A derivation is a sequence of rule applications that derive a terminal string w = w1 … wn from S
• For example:
S => NP VP
  => Pro VP
  => I VP
  => I Verb NP
  => I prefer NP
  => I prefer Det Nom
  => I prefer a Nom
  => I prefer a Nom Noun
  => I prefer a Noun Noun
  => I prefer a morning Noun
  => I prefer a morning flight
Parsing:
In the syntax analysis phase, a compiler verifies whether or not the tokens generated by the
lexical analyzer are grouped according to the syntactic rules of the language. This is done by a
parser.
The parser obtains a string of tokens from the lexical analyzer and verifies that the string can
be generated by the grammar of the source language. It detects and reports any syntax errors and produces a
parse tree from which intermediate code can be generated.
Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
Eg- consider a grammar
S -> aS | Sa | a
Now for the string aaa we get 4 distinct parse trees, hence the grammar is ambiguous.
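One way to confirm this is to hand the grammar to a parser and count the trees it returns, as in the sketch below (NLTK is used here only for illustration).

import nltk

ambiguous = nltk.CFG.fromstring("""
S -> 'a' S
S -> S 'a'
S -> 'a'
""")

parser = nltk.ChartParser(ambiguous)
trees = list(parser.parse(['a', 'a', 'a']))
print(len(trees))       # 4 distinct parse trees for "aaa"
for t in trees:
    print(t)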
Basic concepts of parsing:
Two problems for grammar G and string w:
• Recognition: determine if G accepts w
• Parsing: retrieve (all or some) parse trees assigned to w by G
Two basic search strategies:
• Top-down parser: start at the root of the tree
• Bottom-up parser: start at the leaves
2.2 Top-Down Parser
A parsing algorithm can be described as a procedure that searches through various ways
of combining grammatical rules to find a combination that generates a tree that could be the
structure of the input sentence. In other words, the algorithm will say whether a certain sentence
is accepted by the grammar or not. The top-down parsing method is closely related to the search
methods used in many artificial intelligence (AI) applications.
A top-down parser starts with the S symbol and attempts to rewrite it into a sequence of
terminal symbols that matches the classes of the words in the input sentence. The state of the
parse at any given time can be represented as a list of symbols that are the result of operations
applied so far, called the symbol list.
For example, the parser starts in the state (S) and after applying the rule S -> NP VP the symbol
list will be (NP VP). If it then applies the rule NP ->ART N, the symbol list will be (ART N VP),
and so on.
1. S -> NP VP
2. NP -> ART N
3. NP -> ART ADJ N
4. VP -> V
5. VP -> V NP
The parser could continue in this fashion until the state consisted entirely of terminal symbols,
and then it could check the input sentence to see if it matched. The lexical analyzer produces the
list of words from the given sentence. A very small lexicon for use in the examples is:
cried: V
dogs: N, V
the: ART
Positions fall between the words, with 1 being the position before the first word. For example,
here is a sentence with its positions indicated:
1 The 2 dogs 3 cried 4

A state of the parse can then be written as a pair consisting of a symbol list and a word position.
For instance, from the state ((VP) 3), the parser could use rule 4, and the new state would be ((V) 3),
which means it needs to find a V starting at position 3. On the other hand, using rule 5, the new
state would be
((V NP) 3)
A parsing algorithm that is guaranteed to find a parse if there is one must systematically explore
every possible new state. One simple technique for this is called backtracking. Using this
approach, rather than generating a single new state from the state ((VP) 3), you generate all
possible new states. One of these is picked to be the next state and the rest are saved as backup
states. If you ever reach a situation where the current state cannot lead to a solution, you simply
pick a new current state from the list of backup states. Here is the algorithm in a little more
detail.
The algorithm manipulates a list of possible states, called the possibilities list. The first element
of this list is the current state, which consists of a symbol list and a word position in the
sentence, and the remaining elements of the search state are the backup states, each indicating an
alternate symbol-list / word-position pair. For example, the possibilities list
(((N) 2) ((NAME) 1) ((ADJ N) 1))
indicates that the current state consists of the symbol list (N) at position 2, and that there are two
possible backup states: one consisting of the symbol list (NAME) at position 1 and the other
consisting of the symbol list (ADJ N) at position 1.
The algorithm starts with the initial state ((S) 1) and no backup states.
1. Select the current state: take the first state off the possibilities list and call it C. If the
possibilities list is empty, then the algorithm fails (that is, no successful parse is possible).
2. If C consists of an empty symbol list and the word position is at the end of the sentence,
then the algorithm succeeds.
3. Otherwise, generate the next possible states: if the first symbol on the symbol list of C is a
lexical category and the next word in the sentence can be in that category, create a new state by
removing the first symbol and advancing the word position; if the first symbol is a non-terminal,
create a new state for each grammar rule that can rewrite it. One of these becomes the new current
state and the rest are added to the possibilities list as backup states.
Consider an example. Using Grammar 3.4, Figure 3.5 shows a trace of the algorithm on the
sentence The dogs cried. First, the initial S symbol is rewritten using rule 1 to produce a new
current state of ((NP VP) 1) in step 2. The NP is then rewritten in turn, but since there are two
possible rules for NP in the grammar, two possible states are generated: The new current state
involves (ART N VP) at position 1, whereas the backup state involves (ART ADJ N VP) at
position 1. In step 4 a word in category ART is found at position 1 of the sentence, and the new
current state becomes ((N VP) 2). The backup state generated in step 3 remains untouched. The
parse continues in this fashion to step 5, where two different rules can rewrite VP. The first rule
generates the new current state, while the other rule is pushed onto the stack of backup states.
The parse completes successfully in step 7, since the current state is empty and all the words in
the input sentence have been accounted for.
Now consider the sentence The old man cried, where old can be an ADJ or an N and man can be an N or
a V. The parse proceeds as follows (see Figure 3.6). The initial S symbol is rewritten by rule 1 to
produce the new current state of ((NP VP) 1). The NP is rewritten in turn, giving the new state of
((ART N VP) 1) with a backup state of ((ART ADJ N VP) 1). The parse continues, finding the as
an ART to produce the state ((N VP) 2) and then old as an N to obtain the state ((VP) 3). There
are now two ways to rewrite the VP, giving us a current state of ((V) 3) and the backup states of
((V NP) 3) and ((ART ADJ N) 1) from before. The word man can be parsed as a V. giving the
state (04). Unfortunately, while the symbol list is empty, the word position is not at the end of
the sentence, so no new state can be generated and a backup state must be used. In the next cycle,
step 8, ((V NP) 3) is attempted. Again man is taken as a V and the new state ((NP) 4) generated.
None of the rewrites of NP yield a successful parse. Finally, in step 12, the last backup state,
((ART ADJ N VP) 1), is tried and leads to a successful parse.
1. Select the first state from the possibilities list (and remove it from the list).
2. Generate the new states by trying every possible option from the selected state (there may be
none if we are on a bad path).
3. Add the states generated in step 2 to the possibilities list.
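A minimal Python sketch of this backtracking top-down parser is given below. It follows the possibilities-list algorithm exactly as described, using rules 1-5 and the small lexicon above; the function and variable names are my own.

# Grammar rules 1-5 and the small lexicon from the text.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["ART", "N"], ["ART", "ADJ", "N"]],
    "VP": [["V"], ["V", "NP"]],
}
LEXICON = {"the": {"ART"}, "dogs": {"N", "V"}, "cried": {"V"}}

def top_down_parse(words):
    # The possibilities list starts with the single state ((S) 1).
    # Positions are 1-based, as in the text.
    possibilities = [(["S"], 1)]
    while possibilities:
        symbols, pos = possibilities.pop(0)        # step 1: current state C
        if not symbols and pos == len(words) + 1:  # step 2: success
            return True
        if not symbols:
            continue                               # dead end, fall back to a backup state
        first, rest = symbols[0], symbols[1:]
        new_states = []
        if first in GRAMMAR:                       # rewrite a non-terminal
            for rhs in GRAMMAR[first]:
                new_states.append((rhs + rest, pos))
        elif pos <= len(words) and first in LEXICON.get(words[pos - 1], set()):
            new_states.append((rest, pos + 1))     # match a lexical category
        # step 3: new states go to the front; the rest act as backup states
        possibilities = new_states + possibilities
    return False

print(top_down_parse("the dogs cried".split()))    # True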
2.3 Bottom-Up Parser
A bottom-up parser builds derivation by working from input sentence back toward the start
symbol S. The bottom-up parser is also known as shift-reduce parser.
The basic operation in bottom-up parsing is to take a sequence of symbols and match it to the
right-hand side of the rules. You could build a bottom-up parser simply by formulating this
matching process as a search process. The state would simply consist of a symbol list, starting
with the words in the sentence. Successor states could be generated by exploring all possible
ways to:
replace a sequence of symbols that matches the right-hand side of a grammar rule by its
left-hand side
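As a rough sketch of this idea, the code below starts from the words themselves, replaces words by their possible lexical categories, and repeatedly replaces any substring matching a rule's right-hand side by its left-hand side until only S remains. The toy grammar, lexicon, and names are illustrative assumptions.

# Rules written as (left-hand side, right-hand side); lexicon as before.
RULES = [
    ("S",  ["NP", "VP"]),
    ("NP", ["ART", "N"]),
    ("NP", ["ART", "ADJ", "N"]),
    ("VP", ["V"]),
    ("VP", ["V", "NP"]),
]
LEXICON = {"the": ["ART"], "dogs": ["N", "V"], "cried": ["V"]}

def bottom_up_parse(words):
    # A state is simply a list of symbols, starting with the words themselves.
    stack = [list(words)]
    seen = set()
    while stack:
        symbols = stack.pop()
        if symbols == ["S"]:
            return True
        key = tuple(symbols)
        if key in seen:
            continue
        seen.add(key)
        for i, sym in enumerate(symbols):
            # Replace a word by one of its possible lexical categories.
            for cat in LEXICON.get(sym, []):
                stack.append(symbols[:i] + [cat] + symbols[i + 1:])
            # Replace a sequence matching a rule's right-hand side by its left-hand side.
            for lhs, rhs in RULES:
                if symbols[i:i + len(rhs)] == rhs:
                    stack.append(symbols[:i] + [lhs] + symbols[i + len(rhs):])
    return False

print(bottom_up_parse(["the", "dogs", "cried"]))    # True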
Example 1.
Example 2: cats scratch people with claws
2.4 Transition Network Grammars
Starting at the initial state, you can traverse an arc if the current word in the sentence is in the
category on the arc. If the arc is followed, the current word is updated to the next word. A phrase
is a legal NP if there is a path from the node NP to a pop arc (an arc labeled pop) that accounts
for every word in the phrase. This network recognizes the same set of sentences as the following
context-free grammar:
NP -> ART NP1
NP1 -> ADJ NP1
NP1 -> N
Consider parsing the NP a purple cow with this network. Starting at the node NP, you can follow
the arc labelled art, since the current word is an article— namely, a. From node NP1 you can
follow the arc labeled adj using the adjective purple, and finally, again from NP1, you can follow
the arc labeled noun using the noun cow. Since you have reached a pop arc, a purple cow is a
legal NP.
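The NP network itself can be simulated with a short recursive function, as sketched below. The network encoding (arcs art, adj, noun, and pop) and the tiny lexicon are assumptions made for illustration.

# The NP network: from each node, a list of (arc label, category, destination node).
# "pop" arcs are marked with a destination of None.
NP_NETWORK = {
    "NP":  [("art",  "ART", "NP1")],
    "NP1": [("adj",  "ADJ", "NP1"),
            ("noun", "N",   "NP2")],
    "NP2": [("pop",  None,  None)],
}
LEXICON = {"a": "ART", "the": "ART", "purple": "ADJ", "cow": "N", "grass": "N"}

def traverse(node, words, pos):
    """Return the position reached when a pop arc is taken, or None if stuck."""
    for label, category, dest in NP_NETWORK[node]:
        if label == "pop":
            return pos
        if pos < len(words) and LEXICON.get(words[pos]) == category:
            result = traverse(dest, words, pos + 1)
            if result is not None:
                return result
    return None

words = ["a", "purple", "cow"]
# "a purple cow" is a legal NP if the traversal reaches a pop arc
# having accounted for every word in the phrase.
print(traverse("NP", words, 0) == len(words))    # True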
Consider finding a path through the S network for the sentence The purple cow ate the grass.
Starting at node S, to follow the arc labeled NP, you need to traverse the NP network. Starting at
node NP, traverse the network as before for the input the purple cow. Following the pop arc in
the NP network, return to the S network and traverse the arc to node S1. From node S1 you
follow the arc labeled verb using the word ate. Finally, the arc labeled NP can be followed if you
can traverse the NP network again. This time the remaining input consists of the words the grass.
You follow the arc labeled art and then the arc labeled noun in the NP network; then take the pop
arc from node NP2 and then another pop from node S3. Since you have traversed the network
and used all the words in the sentence, The purple cow ate the grass is accepted as a legal
sentence.
2.5 Feature Systems and Augmented Grammars
In natural languages there are often agreement restrictions between words and phrases.
For example, the NP "a men" is not correct English because the article a indicates a single object
while the noun "men" indicates a plural object; the noun phrase does not satisfy the number
agreement restriction of English. There are many other forms of agreement, including subject-
verb agreement, gender agreement for pronouns, restrictions between the head of a phrase and
the form of its complement, and so on. To handle such phenomena conveniently, the grammatical
formalism is extended to allow constituents to have features. For example, we might define a
feature NUMBER that may take a value of either s (for singular) or p (for plural), and we then
might write an augmented CFG rule such as

NP -> ART N    (only when the NUMBER feature of the ART agrees with the NUMBER feature of the N)
This rule says that a legal noun phrase consists of an article followed by a noun, but only when
the number feature of the first word agrees with the number feature of the second. This one rule
is equivalent to two CFG rules that would use different terminal symbols for encoding singular
and plural forms of all noun phrases, such as

NP-SING -> ART-SING N-SING
NP-PLURAL -> ART-PLURAL N-PLURAL
While the two approaches seem similar in ease-of-use in this one example, consider that all rules
in the grammar that use an NP on the right-hand side would now need to be duplicated to include
a rule for NP-SING and a rule for NP-PLURAL, effectively doubling the size of the grammar.
And handling additional features, such as person agreement, would double the size of the
grammar again and again. Using features, the size of the augmented grammar remains the same
as the original one yet accounts for agreement constraints.
A constituent is represented by a feature structure, a set of features and their values. For example,
the word a might be represented by the constituent

ART1: (CAT ART
       ROOT a
       NUMBER s)

This says it is a constituent in the category ART that has as its root the word a and is singular.
Usually an abbreviation is used that gives the CAT value more prominence and provides an
intuitive tie back to simple context-free grammars. In this abbreviated form, constituent ART1
would be written as
(ART ROOT a NUMBER s)
Feature structures can be used to represent larger constituents as well. To do this, feature
structures themselves can occur as values. Special features based on the integers - 1, 2, 3, and so
on - will stand for the first subconstituent, second subconstituent, and so on, as needed. With
this, the representation of the NP constituent for the phrase "a fish" could be
NP1: (NP NUMBER s
      1 (ART ROOT a
             NUMBER s)
      2 (N ROOT fish
           NUMBER s))
Note that this can also be viewed as a representation of a parse tree shown in Figure 4.1, where
the subconstituent features 1 and 2 correspond to the subconstituent links in the tree.
The rules in an augmented grammar are stated in terms of feature structures rather than simple
categories. Variables are allowed as feature values so that a rule can apply to a wide range of
situations. For example, a rule for simple noun phrases would be as follows:

(NP NUMBER ?n) -> (ART NUMBER ?n) (N NUMBER ?n)
This says that an NP constituent can consist of two subconstituents, the first being an ART and
the second being an N, in which the NUMBER feature in all three constituents is identical.
According to this rule, constituent NP1 given previously is a legal constituent. On the other hand,
a constituent such as

(NP 1 (ART ROOT a
           NUMBER s)
    2 (N ROOT men
         NUMBER p))

is not allowed, because the NUMBER feature of the N constituent is not identical
to the other two NUMBER features.
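As a rough illustration, feature structures can be mimicked with ordinary Python dictionaries, and the agreement check in the NP rule becomes a simple comparison. The encoding below is my own and is only a sketch of the idea.

# Feature structures as dictionaries; subconstituents stored under keys 1 and 2.
art = {"CAT": "ART", "ROOT": "a",    "NUMBER": "s"}
n1  = {"CAT": "N",   "ROOT": "fish", "NUMBER": "s"}
n2  = {"CAT": "N",   "ROOT": "men",  "NUMBER": "p"}

def build_np(art_const, n_const):
    """Apply the rule (NP NUMBER ?n) -> (ART NUMBER ?n) (N NUMBER ?n)."""
    if art_const["CAT"] != "ART" or n_const["CAT"] != "N":
        return None
    if art_const["NUMBER"] != n_const["NUMBER"]:
        return None                      # agreement failure, e.g. "a men"
    return {"CAT": "NP", "NUMBER": art_const["NUMBER"], 1: art_const, 2: n_const}

print(build_np(art, n1) is not None)     # True:  "a fish" is a legal NP
print(build_np(art, n2) is not None)     # False: "a men" violates NUMBER agreement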
2.6 Morphological Analysis and the Lexicon
Morphological Analysis: Morphological analysis is the study of lexemes and how they are
created. The discipline is particularly interested in neologisms (words newly created from
existing root words), derivation, and compounding. In morphological analysis each
token is analysed as follows:

played -> play+V+past
katternas -> katt+N+plur+def+gen   (Swedish: "the cats'")
Often non-deterministic (more than one solution):
plays -> play+N+plur
plays -> play+V+3sg
Derivation refers to a way of creating new words by adding affixes to the root of a word - this is
also known as affixation. There are two types of affixes: prefixes and suffixes.
Compounding: Compounding refers to the creation of new words by combining two or more
existing words, for example toothbrush (tooth + brush), blackboard (black + board), and textbook (text + book).
The lexicon must contain information about all the different words that can be used, including all
the relevant feature value restrictions. When a word is ambiguous, it may be described by
multiple entries in the lexicon, one for each different use.
Most English verbs, for example, use the same set of suffixes to indicate different forms: -s is
added for third person singular present tense, -ed for past tense, -ing for the present participle,
and so on.
The idea is to store the base form of the verb in the lexicon and use context-free rules to combine
verbs with suffixes to derive the other entries. Consider the following rule for present tense
verbs:
(V ROOT ?r SUBCAT ?s VFORM pres AGR 3s) -> (V ROOT ?r SUBCAT ?s VFORM base) (+S)

Applied to the base entry for a verb such as want, this rule produces a present tense, third person
singular constituent whose features include SUBCAT {_np_vp:inf _vp:inf}, VFORM pres, and AGR 3s.
Another rule would generate the constituents for the present tense forms not in the third person
singular, which for most verbs are identical to the root form.
The following Grammar 4.5 gives a set of rules for deriving different verb and noun forms using
these features. Given a large set of features, the task of writing lexical entries appears very
difficult. Most frameworks allow some mechanisms that help alleviate these problems.
Another commonly used technique is to allow the lexicon writer to define clusters of features,
and then indicate a cluster with a single symbol rather than listing them all. Figure 4.6 contains a
small lexicon. It contains many of the words to be used in the examples that follow. It contains
three entries for the word "saw "- as a noun, as a regular verb, and as the irregular past tense
form of the verb "see" - as illustrated in the sentences
The saw was broken.
Jack wanted me to saw the board in half.
I saw Jack eat the pizza.
With the lexicon in Figure 4.6 and Grammar 4.5, correct constituents for the following words can
be derived: been, being, cries, cried, crying, dogs, saws (two interpretations), sawed, sawing,
seen, seeing, seeds, wants, wanting, and wanted. For example, the word cries would be
transformed into the sequence cry +s, and then rule 1 would produce the present tense entry from
the base form in the lexicon.
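The step from a surface form such as cries to the sequence cry +s can be sketched with a few string rules, as below. The suffix table and spelling adjustments are simplified assumptions that cover only the examples mentioned here, not a full morphological analyzer.

# Base forms with their lexical category, as they would appear in the lexicon.
BASE_FORMS = {"cry": "V", "want": "V", "saw": "V", "see": "V", "dog": "N", "seed": "N"}

def split_suffix(word):
    """Return (base, suffix) candidates for a surface word, or the word itself."""
    candidates = []
    if word.endswith("ies"):                     # cries -> cry +s
        candidates.append((word[:-3] + "y", "+s"))
    if word.endswith("s"):                       # dogs -> dog +s, wants -> want +s
        candidates.append((word[:-1], "+s"))
    if word.endswith("ing"):                     # sawing -> saw +ing
        candidates.append((word[:-3], "+ing"))
    if word.endswith("ed"):                      # wanted -> want +ed
        candidates.append((word[:-2], "+ed"))
    # Keep only candidates whose base form is actually in the lexicon.
    return [(b, s) for b, s in candidates if b in BASE_FORMS] or [(word, "")]

print(split_suffix("cries"))     # [('cry', '+s')]
print(split_suffix("dogs"))      # [('dog', '+s')]
print(split_suffix("wanted"))    # [('want', '+ed')]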
2.7 Parsing with Features
The parsing algorithms used for context-free grammars can be extended to handle
augmented context-free grammars. This involves generalizing the algorithm for matching rules
to constituents. For instance, the chart-parsing algorithms use an operation for extending
active arcs with a new constituent. A constituent X could extend an arc of the form
C -> C1 ... Ci o X ... Cn
to produce a new arc of the form
C -> C1 ... Ci X o ... Cn
A similar operation can be used for grammars with features, but the parser may have to
instantiate variables in the original arc before it can be extended by X. The key to defining this
matching operation precisely is to remember the definition of grammar rules with features. A
rule such as:
1. (NP AGR ?a) -> o (ART AGR ?a) (N AGR ?a)
says that an NP can be constructed out of an ART and an N if all three agree on the AGR feature.
It does not place any restrictions on any other features that the NP, ART, or N may have. Thus,
when matching constituents against this rule, the only thing that matters is the AGR feature. All
other features in the constituent can be ignored. For instance, consider extending arc 1 with the
constituent
2. (ART ROOT A AGR 3s)
To make arc 1 applicable, the variable ?a must be instantiated to 3s, producing
3. (NP AGR 3s) -> o (ART AGR 3s) (N AGR 3s)
This arc can now be extended because every feature in the rule is in constituent 2:
4. (NP AGR 3s) -> (ART AGR 3s) o (N AGR 3s)
Now, consider extending this arc with the constituent for the word dog:
5. (N ROOT DOG1 AGR 3s)
This can be done because the AGR features agree. This completes the arc
6. (NP AGR 3s) -> (ART AGR 3s) (N AGR 3s)
This means the parser has found a constituent of the form (NP AGR 3s).
This algorithm can be specified more precisely as follows: Given an arc A, where the constituent
following the dot is called NEXT, and a new constituent X, which is being used to extend the
arc,
(a.) Find an instantiation of the variables such that all the features specified in NEXT are
found in X.
(b.) Create a new arc A', which is a copy of A except for the instantiations of the
variables determined in step (a).
(c.) Update A' as usual in a chart parser.
For instance, let A be arc 1, and X be the ART constituent 2. Then NEXT will be (ART AGR
?a). In step a, NEXT is matched against X, and you find that ?a must be instantiated to 3s. In step
b, a new copy of A is made, which is shown as arc 3. In step c, the arc is updated to produce the
new arc shown as arc 4.
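The matching operation in steps (a) to (c) can be sketched in Python as follows. Rule elements and constituents are dictionaries, strings beginning with ? are variables, and the instantiation found in step (a) is a small substitution table. The representation is an assumption made for illustration, not a full chart parser.

def match_next(next_spec, constituent):
    """Step (a): find an instantiation of variables such that every feature
    specified in NEXT is found in the constituent X.  Returns a substitution or None."""
    bindings = {}
    for feature, value in next_spec.items():
        if feature not in constituent:
            return None
        if isinstance(value, str) and value.startswith("?"):    # a variable
            if value in bindings and bindings[value] != constituent[feature]:
                return None
            bindings[value] = constituent[feature]
        elif constituent[feature] != value:
            return None
    return bindings

def instantiate(spec, bindings):
    """Step (b): copy a feature structure, replacing bound variables by their values."""
    return {f: bindings.get(v, v) for f, v in spec.items()}

# Arc 1: (NP AGR ?a) -> o (ART AGR ?a) (N AGR ?a)
arc = {"lhs": {"CAT": "NP", "AGR": "?a"},
       "rhs": [{"CAT": "ART", "AGR": "?a"}, {"CAT": "N", "AGR": "?a"}],
       "dot": 0}

# Constituent 2: (ART ROOT A AGR 3s)
art = {"CAT": "ART", "ROOT": "A", "AGR": "3s"}

bindings = match_next(arc["rhs"][arc["dot"]], art)         # {'?a': '3s'}
new_arc = {"lhs": instantiate(arc["lhs"], bindings),
           "rhs": [instantiate(r, bindings) for r in arc["rhs"]],
           "dot": arc["dot"] + 1}                          # step (c): advance the dot
print(bindings)          # {'?a': '3s'}
print(new_arc["lhs"])    # {'CAT': 'NP', 'AGR': '3s'}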
Figure 4.10 contains the final chart produced from parsing the sentence He
wants to cry using Grammar 4.8 & Grammar 4.7.
2.8 Augmented Transition Networks
The ATN (augmented transition network) is produced by adding new features to a
recursive transition network. Features in an ATN are traditionally called registers. Constituent
structures are created by allowing each network to have a set of registers. Each time a new
network is pushed, a new set of registers is created. As the network is traversed, these registers
are set to values by actions associated with each arc. When the network is popped, the registers
are assembled to form a constituent structure, with the CAT slot being the network name.
Consider Grammar 4.11, which is a simple NP network. The actions are listed in the table below the
network. ATNs use a special mechanism to extract the result of following an arc. When a lexical
arc, such as arc 1, is followed, the constituent built from the word in the input is put into a
special variable named "*". The action
DET := *
then assigns this constituent to the DET register. The second action on this arc,
AGR := AGR*
assigns the AGR register of the network to the value of the AGR register of the new word (the
constituent in "*"). Agreement checks are specified in the tests. A test is an expression that
succeeds if it returns a nonempty value and fails if it returns the empty set or nil.
If a test fails, its arc is not traversed. The test on arc 2 indicates that the arc can be followed only
if the AGR feature of the network has a non-null intersection with the AGR register of the new
word (the noun constituent in "*").
Features on push arcs are treated similarly. The constituent built by traversing the NP network is
returned as the value "*". Thus in Grammar 4.12, the action on the arc from S to S1,
SUBJ := *
would assign the constituent returned by the NP network to the register SUBJ. The test on arc 2
will succeed only if the AGR register of the constituent in the SUBJ register has a non-null
intersection with the AGR register of the new constituent (the verb). This test enforces subject-
verb agreement.
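The register mechanism can be illustrated with a short simulation of the NP network: following an arc runs its actions, which assign values to registers, and later arcs consult those registers in their tests. The AGR values and the encoding below are assumptions in the spirit of Grammar 4.11, not the exact grammar.

# Lexical constituents: each word carries a set of possible AGR values.
LEXICON = {
    "a":    {"CAT": "ART", "ROOT": "a",   "AGR": {"3s"}},
    "dog":  {"CAT": "N",   "ROOT": "dog", "AGR": {"3s"}},
    "dogs": {"CAT": "N",   "ROOT": "dog", "AGR": {"3p"}},
}

def np_network(words):
    """Traverse a simplified NP network, filling the DET, HEAD and AGR registers."""
    registers = {}
    star = LEXICON[words[0]]                  # "*" holds the constituent just consumed
    registers["DET"] = star                   # action on arc 1: DET := *
    registers["AGR"] = star["AGR"]            # action on arc 1: AGR := AGR*
    star = LEXICON[words[1]]
    agreement = registers["AGR"] & star["AGR"]
    if not agreement:                         # test on arc 2: AGR intersection non-empty
        return None                           # test fails, so the arc is not traversed
    registers["HEAD"] = star                  # action on arc 2: HEAD := *
    registers["AGR"] = agreement
    registers["CAT"] = "NP"                   # on the pop arc, the registers form the constituent
    return registers

print(np_network(["a", "dog"]) is not None)   # True:  "a dog" agrees
print(np_network(["a", "dogs"]) is not None)  # False: "a dogs" fails the AGR test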
2.9 Bayes’ Theorem:
Bayes' theorem is also known as Bayes' Rule or Bayes' Law. It is used to determine the
conditional probability of event A when event B has already happened. The general statement
of Bayes' theorem is: "The conditional probability of an event A, given the occurrence of
another event B, is equal to the product of the probability of B given A and the probability of A,
divided by the probability of event B." i.e.
P(A|B) = P(B|A)P(A) / P(B)
where,
P(A) and P(B) are the probabilities of events A and B
P(A|B) is the probability of event A when event B happens
P(B|A) is the probability of event B when A happens
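As a tiny worked example (the numbers below are invented purely for illustration), suppose A is the event that a word is a verb and B the event that the word ends in -s:

# A = "the word is a verb", B = "the word ends in -s"; all numbers are invented.
p_a = 0.10          # P(A): prior probability that a word is a verb
p_b_given_a = 0.60  # P(B|A): probability that a verb ends in -s
p_b = 0.20          # P(B): probability that any word ends in -s

p_a_given_b = p_b_given_a * p_a / p_b   # Bayes' theorem
print(p_a_given_b)                      # about 0.3 = P(verb | word ends in -s)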
2.10 Shannon's Spelling Game
Competent spellers are good at recognizing common spelling patterns. This
enables them to predict how any sound might be spelt because they know
that there are only a limited number of options.
For example, if they hear an "o" sound, as in hope, they will consider -oa- (boat), -oe- (toe),
or -o- followed, one consonant later, by the magic e (cope).
2.11 Entropy and Cross Entropy
Entropy is defined as the randomness, or a measure of the disorder, of the information being
processed in machine learning. In other words, entropy is the machine learning metric that
measures the unpredictability or impurity in the system.
The entropy is the expected value of the surprisal -log2 p(i) across all possible events indexed by i:

H = - Σ_i p(i) log2 p(i)
So, the entropy is the average amount of surprise when something happens.
For a two-class problem, entropy lies between 0 and 1; with more classes it can be greater
than 1, its maximum being log2 of the number of classes.
Entropy in base 2 is also the optimal number of bits it takes to store the information about what
happened, by Claude Shannon's source coding theorem. For example, if a full-length tweet of 280
characters had an entropy of 1 bit per character, then, by the laws of mathematics, no matter what
Twitter does, it will always need at least 280 bits (35 bytes) of storage for that tweet in its database.
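The sketch below computes this quantity for a few small distributions; the distributions themselves are made up for illustration.

import math

def entropy(probs):
    """H = - sum_i p_i * log2(p_i), the expected surprisal."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))               # 1.0 bit  (a fair coin)
print(entropy([1.0]))                    # 0.0 bits (no surprise at all)
print(entropy([0.25, 0.25, 0.25, 0.25])) # 2.0 bits (four equally likely outcomes)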
In the context of our language model, we’ll have to make one tweak. Given that we are interested
in sentences s (sequences of events) of length n, we'll define the entropy rate per word (event) as:

H_n(L) = - (1/n) Σ_s L(s) log2 L(s)
where the sum is over all sentences of length n and L(s) is the probability of the sentence. Finally,
a technical point: we want to define the entropy of the language L (or language model M)
regardless of sentence length n. So finally we define

H(L) = lim_{n -> ∞} H_n(L) = - lim_{n -> ∞} (1/n) Σ_s L(s) log2 L(s)

For a stationary, ergodic language this limit can also be computed from a single long sample
(the Shannon-McMillan-Breiman theorem): H(L) = - lim_{n -> ∞} (1/n) log2 L(s) for one long sequence s.
This tells us that we can just take a large (n is big) text instead of trying to sample from diverse
texts.
Cross-Entropy:
Suppose we mistakenly think that our language model M is correct. Then we observe text
generated by the actual language L without realizing it. The cross-entropy H(L, M) is what we
measure the entropy to be:

H(L, M) = - lim_{n -> ∞} (1/n) Σ_s L(s) log2 M(s)

Writing log2 M(s) as log2 L(s) - log2 (L(s)/M(s)) gives

H(L, M) = - Σ_s L(s) log2 M(s)
        = - Σ_s L(s) log2 L(s) + Σ_s L(s) log2 (L(s)/M(s))
        = H(L) + D_KL(L || M)
In the third line, the first term is just the entropy H(L) (remember the limits and 1/n terms are
implicit). The second term is the Kullback-Leibler divergence (or KL-divergence) D_KL(L || M). By Gibbs'
inequality the KL-divergence is non-negative, and it is 0 only if the models L and M are the same.
The KL-divergence thus behaves somewhat like a distance measure between the two models.
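For two distributions over the same finite set of events, the relationship H(L, M) = H(L) + D_KL(L || M) can be checked numerically, as in the sketch below (the two distributions are invented for illustration).

import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # Entropy we measure if the data come from p but we score them with model q.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

L = [0.7, 0.2, 0.1]     # the "true" language
M = [0.5, 0.3, 0.2]     # our (slightly wrong) model

print(cross_entropy(L, M))                 # greater than entropy(L), since M != L
print(entropy(L) + kl_divergence(L, M))    # equal to the cross-entropy
print(kl_divergence(L, L))                 # 0.0: the KL-divergence is zero only when M = L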