CSCI 5832 Natural Language Processing: Jim Martin


CSCI 5832

Natural Language Processing

Jim Martin
Lecture 9

Today 2/19

• Review HMMs for POS tagging
• Entropy intuition
• Statistical sequence classifiers
  - HMMs
  - MaxEnt
  - MEMMs

Statistical Sequence Classification

• Given an input sequence, assign a label (or tag) to each element of the tape
  - Or... given an input tape, write a tag out to an output tape for each cell on the input tape
• Can be viewed as a classification task if we view
  - The individual cells on the input tape as things to be classified
  - The tags written on the output tape as the class labels
POS Tagging as Sequence Classification

• We are given a sentence (an “observation” or “sequence of observations”)
  - Secretariat is expected to race tomorrow
• What is the best sequence of tags which corresponds to this sequence of observations?
• Probabilistic view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.
Statistical Sequence Classification

• We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn|w1…wn) is highest.
• Hat ^ means “our estimate of the best one”
• argmax_x f(x) means “the x such that f(x) is maximized”
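The equation these bullets describe was an image in the original slide; in standard form it is

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)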

Road to HMMs

• This equation is guaranteed to give us the best tag sequence
• But how to make it operational? How to compute this value?
• Intuition of Bayesian classification:
  - Use Bayes rule to transform it into a set of other probabilities that are easier to compute
Using Bayes Rule
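The slide’s equation did not survive the export; applying Bayes’ rule to the argmax above, and dropping the denominator (which is the same for every tag sequence), gives

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
                = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)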

Likelihood and Prior
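The equations for this slide are also missing from the export; the standard simplifying assumptions (each word depends only on its own tag, each tag only on the previous tag) give

    P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
    \qquad
    P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

so the tagger chooses

    \hat{t}_1^n \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})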

Transition Probabilities

• Tag transition probabilities P(ti|ti-1)
  - Determiners likely to precede adjs and nouns
      That/DT flight/NN
      The/DT yellow/JJ hat/NN
  - So we expect P(NN|DT) and P(JJ|DT) to be high
  - Compute P(NN|DT) by counting in a labeled corpus:
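The counting formula shown on the original slide is the usual maximum-likelihood estimate

    P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}
    \qquad\text{e.g.}\qquad
    P(NN \mid DT) = \frac{C(DT, NN)}{C(DT)}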

Observation Probabilities

• Word likelihood probabilities P(wi|ti)
  - VBZ (3sg Pres verb) likely to be “is”
  - Compute P(is|VBZ) by counting in a labeled corpus:
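Again the missing formula is the maximum-likelihood count ratio

    P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}
    \qquad\text{e.g.}\qquad
    P(is \mid VBZ) = \frac{C(VBZ, is)}{C(VBZ)}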

An Example: the verb “race”

• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• How do we pick the right tag?

Disambiguating “race”

Example

• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO) P(NR|VB) P(race|VB) = .00000027
• P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
• So we (correctly) choose the verb reading.
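A quick check of that arithmetic (plain Python; the probabilities are the ones listed on the slide):

    # Probabilities from the slide.
    p_nn_to, p_vb_to = 0.00047, 0.83         # P(NN|TO), P(VB|TO)
    p_race_nn, p_race_vb = 0.00057, 0.00012  # P(race|NN), P(race|VB)
    p_nr_vb, p_nr_nn = 0.0027, 0.0012        # P(NR|VB), P(NR|NN)

    verb_reading = p_vb_to * p_nr_vb * p_race_vb   # ~2.7e-07
    noun_reading = p_nn_to * p_nr_nn * p_race_nn   # ~3.2e-10

    print("choose:", "VB" if verb_reading > noun_reading else "NN")  # -> VB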
Markov chain for words

Markov chain = “First-order Observable Markov Model”

• A set of states
  - Q = q1, q2 … qN; the state at time t is qt
• Transition probabilities:
  - A set of probabilities A = a01 a02 … an1 … ann
  - Each aij represents the probability of transitioning from state i to state j
  - The set of these is the transition probability matrix A
• Current state only depends on previous state:

    P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})
Hidden Markov Models

• States Q = q1, q2 … qN
• Observations O = o1, o2 … oN
  - Each observation is a symbol drawn from a vocabulary V = {v1, v2, …, vV}
• Transition probabilities
  - Transition probability matrix A = {aij}, where

    a_{ij} = P(q_t = j \mid q_{t-1} = i) \quad 1 \le i, j \le N

• Observation likelihoods
  - Output probability matrix B = {bi(k)}, where

    b_i(k) = P(X_t = o_k \mid q_t = i)

• Special initial probability vector π, where

    \pi_i = P(q_1 = i) \quad 1 \le i \le N
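A minimal sketch of these three tables in Python (the tiny tagset, vocabulary, and numbers are invented for illustration, not taken from the lecture):

    import numpy as np

    # Hypothetical toy tagset and vocabulary, just to make the shapes concrete.
    states = ["DT", "JJ", "NN", "VB"]          # hidden states (POS tags)
    vocab  = ["the", "yellow", "hat", "race"]  # observation symbols

    # pi[i]   = P(q1 = i)                    -- initial state probabilities
    # A[i, j] = P(q_t = j | q_t-1 = i)       -- transition probabilities (rows sum to 1)
    # B[i, k] = P(o_t = vocab[k] | q_t = i)  -- observation likelihoods (rows sum to 1)
    pi = np.array([0.7, 0.1, 0.1, 0.1])
    A  = np.array([[0.05, 0.45, 0.45, 0.05],
                   [0.05, 0.20, 0.70, 0.05],
                   [0.10, 0.05, 0.25, 0.60],
                   [0.40, 0.10, 0.40, 0.10]])
    B  = np.array([[0.90, 0.03, 0.03, 0.04],
                   [0.01, 0.90, 0.05, 0.04],
                   [0.02, 0.03, 0.60, 0.35],
                   [0.05, 0.05, 0.10, 0.80]])

    assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)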
Transitions between the hidden
states of HMM, showing A probs

B observation likelihoods for
POS HMM

The A matrix for the POS HMM

The B matrix for the POS HMM

Viterbi intuition: we are looking for the best ‘path’

[Trellis figure: the sentence “promised to back the bill” with a column of candidate tags per word (states S1–S5); the tags shown include VB, VBD, VBN, NNP, TO, RB, JJ, DT, and NN, and Viterbi searches for the best path through them.]


The Viterbi Algorithm
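The algorithm itself was a figure in the original deck; here is a minimal Python sketch of Viterbi decoding over the toy (pi, A, B) tables sketched above (observations given as indices into the vocabulary):

    import numpy as np

    def viterbi(obs, pi, A, B):
        """Most likely state sequence for an observation index sequence.
        pi[i] = P(q1 = i), A[i, j] = P(j | i), B[i, k] = P(o = k | i)."""
        N, T = A.shape[0], len(obs)
        v = np.zeros((T, N))          # v[t, j]: best path probability ending in state j at time t
        back = np.zeros((T, N), int)  # backpointers

        v[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            for j in range(N):
                scores = v[t - 1] * A[:, j] * B[j, obs[t]]
                back[t, j] = scores.argmax()
                v[t, j] = scores.max()

        # Follow the backpointers from the best final state.
        path = [int(v[T - 1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return list(reversed(path)), float(v[T - 1].max())

With the toy tables above, viterbi([0, 1, 2], pi, A, B) (i.e. “the yellow hat”) returns the index sequence for DT JJ NN together with the probability of that best path.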

Viterbi example

Information Theory

• Who is going to win the World Series next year?
• Well there are 30 teams. Each has a chance, so there’s a 1/30 chance for any team…? No.
  - Rockies? Big surprise, lots of information
  - Yankees? No surprise, not much information

Information Theory

• How much uncertainty is there when you don’t know the outcome of some event (answer to some question)?
• How much information is to be gained by knowing the outcome of some event (answer to some question)?

Aside on logs

• Base doesn’t matter. Unless I say otherwise, I mean base 2.
• Probabilities lie between 0 and 1, so log probabilities are negative and range from 0 (log 1) to –infinity (log 0).
• The minus sign is a pain, so at some point we’ll make it go away by multiplying by -1.

Entropy

• Let’s start with a simple case, the probability of word sequences with a unigram model
• Example
  - S = “One fish two fish red fish blue fish”
  - P(S) = P(One) P(fish) P(two) P(fish) P(red) P(fish) P(blue) P(fish)
  - Log P(S) = Log P(One) + Log P(fish) + … + Log P(fish)

Entropy cont.

• In general that’s

    \log P(S) = \sum_{w \in S} \log P(w)

• But note that
  - the order doesn’t matter
  - words can occur multiple times
  - and they always contribute the same each time
  - so, rearranging:

    \log P(S) = \sum_{w \in V} \mathrm{Count}(w) \cdot \log P(w)

Entropy cont.

• One fish two fish red fish blue fish
• Fish fish fish fish one two red blue

    \log P(S) = 4 \cdot \log P(fish) + 1 \cdot \log P(one) + 1 \cdot \log P(two) + 1 \cdot \log P(red) + 1 \cdot \log P(blue)

Entropy cont.

• Now let’s divide both sides by N, the length of the sequence:

    \frac{1}{N} \log P(S) = \frac{1}{N} \sum_{w \in V} \mathrm{Count}(w) \log P(w)

• That’s basically an average of the log probs

Entropy

• Now assume the sequence is really, really long.
• Moving the N into the summation you get

    \sum_{w \in V} \frac{\mathrm{Count}(w)}{N} \log P(w)

• Rewriting Count(w)/N as P(w), and getting rid of the minus sign by multiplying by -1, gives the entropy

    H(S) = -\sum_{w \in V} P(w) \log P(w)
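A small sketch of that quantity in Python, using the toy “one fish two fish” counts from the earlier slide:

    import math
    from collections import Counter

    text = "one fish two fish red fish blue fish".split()
    counts = Counter(text)
    N = len(text)

    # H = -sum_w P(w) * log2 P(w), with P(w) estimated as Count(w)/N
    H = -sum((c / N) * math.log2(c / N) for c in counts.values())
    print(f"{H:.3f} bits per word")  # -> 2.000 for this toy distribution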
Entropy

• Think about this in terms of uncertainty or surprise.
  - The more likely a sequence is, the lower the entropy. Why?

    H(S) = -\sum_{w \in V} P(w) \log P(w)

Model Evaluation

• Remember, the name of the game is to come up with statistical models that capture something useful in some body of text or speech.
• There are precisely a gazillion ways to do this
  - N-grams of various sizes
  - Smoothing
  - Backoff…
Model Evaluation

• Given a collection of text and a couple of models, how can we tell which model is best?
• Intuition… the model that assigns the highest probability to a set of withheld text
  - Withheld text? Text drawn from the same distribution (corpus), but not used in the creation of the model being evaluated.

Model Evaluation

• The more you’re surprised at some event that actually happens, the worse your model was.
• We want models that minimize your surprise at observed outcomes.
• Given two models and some training data and some withheld test data… which is better?
Three HMM Problems

• Given a model and an observation sequence
  - Compute argmax P(states | observation seq) → Viterbi
  - Compute P(observation seq | model) → Forward
  - Compute P(model | observation seq) → EM (magic)

Viterbi

• Given a model and an observation sequence, what is the most likely state sequence?
  - The state sequence is the set of labels assigned
  - So using Viterbi with an HMM solves the sequence classification task

Forward

• Given an HMM model and an observed sequence, what is the probability of that sequence?
  - P(sequence | Model)
  - Sum of all the paths in the model that could have produced that sequence
  - So...
• How do we change Viterbi to get Forward?
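The answer, sketched here in Python to mirror the hypothetical viterbi function above, is to sum over previous states instead of maximizing:

    import numpy as np

    def forward(obs, pi, A, B):
        """P(observation sequence | model): like Viterbi, but summing over
        all paths instead of keeping only the best one."""
        alpha = pi * B[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]   # sum over previous states, then emit
        return float(alpha.sum())

Two candidate models can then be compared by which one assigns the higher forward probability to the same held-out sequence, which is the point of the next slide.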

Who cares?

• Suppose I have two different HMM models extracted from some training data.
• And suppose I have a good-sized set of held-out data (not used to produce the above models).
• How can I tell which model is the better model?

Learning Models

• Now assume that you just have a single HMM model (pi, A, and B tables)
• How can I produce a second model from that model?
  - Rejigger the numbers... (in such a way that the tables still function correctly)
  - Now how can I tell if I’ve made things better?

EM

• Given an HMM structure and a sequence, we can learn the best parameters for the model without explicit training data.
  - In the case of POS tagging, all you need is unlabelled text.
  - Huh? Magic. We’ll come back to this.

Generative vs. Discriminative Models

• For POS tagging we start with the question P(tags | words), but via Bayes we end up at
  - P(words | tags) P(tags)
  - That’s called a generative model
  - We’re reasoning backwards from the models that could have produced such an output

Disambiguating “race”

Discriminative Models

• What if we went back to the start, to argmax P(tags|words), and didn’t use Bayes?
  - Can we get a handle on this directly?
  - First let’s generalize to P(tags|evidence)
  - Let’s make some independence assumptions and consider the previous state and the current word as the evidence. How does that look as a graphical model?
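Under those assumptions the tagger models the tag sequence directly; this is the standard MEMM-style decomposition (not taken verbatim from the missing slide):

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}, w_i)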

MaxEnt Tagging

MaxEnt Tagging

• This framework allows us to throw in a wide range of “features”, that is, evidence that can help with the tagging.
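A hypothetical sketch of what such features might look like in Python (the particular feature set here is illustrative, not the one from the lecture):

    def features(words, tags, i):
        """Evidence about position i for a MaxEnt/MEMM tagger: anything
        helpful can be thrown in as a named feature."""
        w = words[i]
        return {
            "word=" + w.lower(): 1.0,
            "prev_tag=" + (tags[i - 1] if i > 0 else "<s>"): 1.0,
            "suffix3=" + w[-3:]: 1.0,
            "is_capitalized": float(w[0].isupper()),
            "has_digit": float(any(c.isdigit() for c in w)),
        }

    print(features("Secretariat is expected to race".split(), ["NNP", "VBZ", "VBN", "TO"], 4))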

Statistical Sequence Classification

