CSCI 5832 Natural Language Processing: Jim Martin


CSCI 5832

Natural Language Processing

Jim Martin
Lecture 9

Today 2/19

• Review HMMs for POS tagging
• Entropy intuition
• Statistical sequence classifiers
  - HMMs
  - MaxEnt
  - MEMMs

Statistical Sequence Classification

• Given an input sequence, assign a label (or tag) to each element of the tape
  - Or... given an input tape, write a tag out to an output tape for each cell on the input tape
• Can be viewed as a classification task if we view
  - The individual cells on the input tape as things to be classified
  - The tags written on the output tape as the class labels
POS Tagging as Sequence Classification

• We are given a sentence (an “observation” or “sequence of observations”)
  - Secretariat is expected to race tomorrow
• What is the best sequence of tags which corresponds to this sequence of observations?
• Probabilistic view:
  - Consider all possible sequences of tags
  - Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.
Statistical Sequence Classification

• We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn|w1…wn) is highest.
• Hat ^ means “our estimate of the best one”
• argmax_x f(x) means “the x such that f(x) is maximized”
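The equation these bullets describe was an image in the original slide; in standard form it is

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)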

Road to HMMs

• This equation is guaranteed to give us the best tag sequence
• But how to make it operational? How to compute this value?
• Intuition of Bayesian classification:
  - Use Bayes rule to transform it into a set of other probabilities that are easier to compute
Using Bayes Rule
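The slide’s equation did not survive the export; applying Bayes’ rule to the argmax above, and dropping the denominator (which is the same for every tag sequence), gives

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\, P(t_1^n)}{P(w_1^n)}
                = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\, P(t_1^n)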

Likelihood and Prior
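The equations for this slide are also missing from the export; the standard simplifying assumptions (each word depends only on its own tag, each tag only on the previous tag) give

    P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
    \qquad
    P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

so the tagger chooses

    \hat{t}_1^n \approx \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})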

Transition Probabilities

• Tag transition probabilities P(ti|ti-1)
  - Determiners likely to precede adjs and nouns
      That/DT flight/NN
      The/DT yellow/JJ hat/NN
  - So we expect P(NN|DT) and P(JJ|DT) to be high
  - Compute P(NN|DT) by counting in a labeled corpus:
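The counting formula shown on the original slide is the usual maximum-likelihood estimate

    P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}
    \qquad\text{e.g.}\qquad
    P(NN \mid DT) = \frac{C(DT, NN)}{C(DT)}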

Observation Probabilities

• Word likelihood probabilities P(wi|ti)
  - VBZ (3sg Pres verb) likely to be “is”
  - Compute P(is|VBZ) by counting in a labeled corpus:
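Again the missing formula is the maximum-likelihood count ratio

    P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}
    \qquad\text{e.g.}\qquad
    P(is \mid VBZ) = \frac{C(VBZ, is)}{C(VBZ)}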

An Example: the verb “race”

• Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
• How do we pick the right tag?

Disambiguating “race”

Example

• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO) P(NR|VB) P(race|VB) = .00000027
• P(NN|TO) P(NR|NN) P(race|NN) = .00000000032
• So we (correctly) choose the verb reading.
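A quick check of that arithmetic (plain Python; the probabilities are the ones listed on the slide):

    # Probabilities from the slide.
    p_nn_to, p_vb_to = 0.00047, 0.83         # P(NN|TO), P(VB|TO)
    p_race_nn, p_race_vb = 0.00057, 0.00012  # P(race|NN), P(race|VB)
    p_nr_vb, p_nr_nn = 0.0027, 0.0012        # P(NR|VB), P(NR|NN)

    verb_reading = p_vb_to * p_nr_vb * p_race_vb   # ~2.7e-07
    noun_reading = p_nn_to * p_nr_nn * p_race_nn   # ~3.2e-10

    print("choose:", "VB" if verb_reading > noun_reading else "NN")  # -> VB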
Markov chain for words

Markov chain = “First-order Observable Markov Model”

• A set of states
  - Q = q1, q2 … qN; the state at time t is qt
• Transition probabilities:
  - A set of probabilities A = a01 a02 … an1 … ann
  - Each aij represents the probability of transitioning from state i to state j
  - The set of these is the transition probability matrix A
• Current state only depends on previous state:

    P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})
Hidden Markov Models

• States Q = q1, q2 … qN
• Observations O = o1, o2 … oN
  - Each observation is a symbol drawn from a vocabulary V = {v1, v2, …, vV}
• Transition probabilities
  - Transition probability matrix A = {aij}, where

    a_{ij} = P(q_t = j \mid q_{t-1} = i) \quad 1 \le i, j \le N

• Observation likelihoods
  - Output probability matrix B = {bi(k)}, where

    b_i(k) = P(X_t = o_k \mid q_t = i)

• Special initial probability vector π, where

    \pi_i = P(q_1 = i) \quad 1 \le i \le N
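A minimal sketch of these three tables in Python (the tiny tagset, vocabulary, and numbers are invented for illustration, not taken from the lecture):

    import numpy as np

    # Hypothetical toy tagset and vocabulary, just to make the shapes concrete.
    states = ["DT", "JJ", "NN", "VB"]          # hidden states (POS tags)
    vocab  = ["the", "yellow", "hat", "race"]  # observation symbols

    # pi[i]   = P(q1 = i)                    -- initial state probabilities
    # A[i, j] = P(q_t = j | q_t-1 = i)       -- transition probabilities (rows sum to 1)
    # B[i, k] = P(o_t = vocab[k] | q_t = i)  -- observation likelihoods (rows sum to 1)
    pi = np.array([0.7, 0.1, 0.1, 0.1])
    A  = np.array([[0.05, 0.45, 0.45, 0.05],
                   [0.05, 0.20, 0.70, 0.05],
                   [0.10, 0.05, 0.25, 0.60],
                   [0.40, 0.10, 0.40, 0.10]])
    B  = np.array([[0.90, 0.03, 0.03, 0.04],
                   [0.01, 0.90, 0.05, 0.04],
                   [0.02, 0.03, 0.60, 0.35],
                   [0.05, 0.05, 0.10, 0.80]])

    assert np.allclose(A.sum(axis=1), 1.0) and np.allclose(B.sum(axis=1), 1.0)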
Transitions between the hidden
states of HMM, showing A probs

B observation likelihoods for
POS HMM

The A matrix for the POS HMM

The B matrix for the POS HMM

Viterbi intuition: we are looking for the best ‘path’

[Trellis figure: the sentence “promised to back the bill” with a column of candidate tags per word (states S1–S5); the tags shown include VB, VBD, VBN, NNP, TO, RB, JJ, DT, and NN, and Viterbi searches for the best path through them.]


The Viterbi Algorithm
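The algorithm itself was a figure in the original deck; here is a minimal Python sketch of Viterbi decoding over the toy (pi, A, B) tables sketched above (observations given as indices into the vocabulary):

    import numpy as np

    def viterbi(obs, pi, A, B):
        """Most likely state sequence for an observation index sequence.
        pi[i] = P(q1 = i), A[i, j] = P(j | i), B[i, k] = P(o = k | i)."""
        N, T = A.shape[0], len(obs)
        v = np.zeros((T, N))          # v[t, j]: best path probability ending in state j at time t
        back = np.zeros((T, N), int)  # backpointers

        v[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            for j in range(N):
                scores = v[t - 1] * A[:, j] * B[j, obs[t]]
                back[t, j] = scores.argmax()
                v[t, j] = scores.max()

        # Follow the backpointers from the best final state.
        path = [int(v[T - 1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return list(reversed(path)), float(v[T - 1].max())

With the toy tables above, viterbi([0, 1, 2], pi, A, B) (i.e. “the yellow hat”) returns the index sequence for DT JJ NN together with the probability of that best path.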

Viterbi example

Information Theory

• Who is going to win the World Series next year?
• Well there are 30 teams. Each has a chance, so there’s a 1/30 chance for any team…? No.
  - Rockies? Big surprise, lots of information
  - Yankees? No surprise, not much information

Information Theory

• How much uncertainty is there when you don’t know the outcome of some event (answer to some question)?
• How much information is to be gained by knowing the outcome of some event (answer to some question)?

Aside on logs

• Base doesn’t matter. Unless I say otherwise, I mean base 2.
• Probabilities lie between 0 and 1, so log probabilities are negative and range from 0 (log 1) to –infinity (log 0).
• The minus sign is a pain, so at some point we’ll make it go away by multiplying by -1.

Entropy

• Let’s start with a simple case, the probability of word sequences with a unigram model
• Example
  - S = “One fish two fish red fish blue fish”
  - P(S) = P(One) P(fish) P(two) P(fish) P(red) P(fish) P(blue) P(fish)
  - Log P(S) = Log P(One) + Log P(fish) + … + Log P(fish)

Entropy cont.

• In general that’s

    \log P(S) = \sum_{w \in S} \log P(w)

• But note that
  - the order doesn’t matter
  - words can occur multiple times
  - and they always contribute the same each time
  - so, rearranging:

    \log P(S) = \sum_{w \in V} \mathrm{Count}(w) \cdot \log P(w)

Entropy cont.

• One fish two fish red fish blue fish
• Fish fish fish fish one two red blue

    \log P(S) = 4 \cdot \log P(fish) + 1 \cdot \log P(one) + 1 \cdot \log P(two) + 1 \cdot \log P(red) + 1 \cdot \log P(blue)

Entropy cont.

• Now let’s divide both sides by N, the length of the sequence:

    \frac{1}{N} \log P(S) = \frac{1}{N} \sum_{w \in V} \mathrm{Count}(w) \log P(w)

• That’s basically an average of the log probs

Entropy

• Now assume the sequence is really, really long.
• Moving the N into the summation you get

    \sum_{w \in V} \frac{\mathrm{Count}(w)}{N} \log P(w)

• Rewriting Count(w)/N as P(w), and getting rid of the minus sign by multiplying by -1, gives the entropy

    H(S) = -\sum_{w \in V} P(w) \log P(w)
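A small sketch of that quantity in Python, using the toy “one fish two fish” counts from the earlier slide:

    import math
    from collections import Counter

    text = "one fish two fish red fish blue fish".split()
    counts = Counter(text)
    N = len(text)

    # H = -sum_w P(w) * log2 P(w), with P(w) estimated as Count(w)/N
    H = -sum((c / N) * math.log2(c / N) for c in counts.values())
    print(f"{H:.3f} bits per word")  # -> 2.000 for this toy distribution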
Entropy

• Think about this in terms of uncertainty or surprise.
  - The more likely a sequence is, the lower the entropy. Why?

    H(S) = -\sum_{w \in V} P(w) \log P(w)

Model Evaluation

• Remember, the name of the game is to come up with statistical models that capture something useful in some body of text or speech.
• There are precisely a gazillion ways to do this
  - N-grams of various sizes
  - Smoothing
  - Backoff…
Model Evaluation

• Given a collection of text and a couple of models, how can we tell which model is best?
• Intuition… the model that assigns the highest probability to a set of withheld text
  - Withheld text? Text drawn from the same distribution (corpus), but not used in the creation of the model being evaluated.

Model Evaluation

• The more you’re surprised at some event that actually happens, the worse your model was.
• We want models that minimize your surprise at observed outcomes.
• Given two models and some training data and some withheld test data… which is better?
Three HMM Problems

• Given a model and an observation sequence
  - Compute argmax P(states | observation seq) → Viterbi
  - Compute P(observation seq | model) → Forward
  - Compute P(model | observation seq) → EM (magic)

Viterbi

• Given a model and an observation sequence, what is the most likely state sequence?
  - The state sequence is the set of labels assigned
  - So using Viterbi with an HMM solves the sequence classification task

Forward

• Given an HMM model and an observed sequence, what is the probability of that sequence?
  - P(sequence | Model)
  - Sum of all the paths in the model that could have produced that sequence
  - So...
• How do we change Viterbi to get Forward?
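The answer, sketched here in Python to mirror the hypothetical viterbi function above, is to sum over previous states instead of maximizing:

    import numpy as np

    def forward(obs, pi, A, B):
        """P(observation sequence | model): like Viterbi, but summing over
        all paths instead of keeping only the best one."""
        alpha = pi * B[:, obs[0]]
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]   # sum over previous states, then emit
        return float(alpha.sum())

Two candidate models can then be compared by which one assigns the higher forward probability to the same held-out sequence, which is the point of the next slide.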

Who cares?

• Suppose I have two different HMM models extracted from some training data.
• And suppose I have a good-sized set of held-out data (not used to produce the above models).
• How can I tell which model is the better model?

Learning Models

• Now assume that you just have a single HMM model (pi, A, and B tables)
• How can I produce a second model from that model?
  - Rejigger the numbers... (in such a way that the tables still function correctly)
  - Now how can I tell if I’ve made things better?

EM

• Given an HMM structure and a sequence, we can learn the best parameters for the model without explicit training data.
  - In the case of POS tagging, all you need is unlabelled text.
  - Huh? Magic. We’ll come back to this.

Generative vs. Discriminative Models

• For POS tagging we start with the question P(tags | words), but via Bayes we end up at
  - P(words | tags) P(tags)
  - That’s called a generative model
  - We’re reasoning backwards from the models that could have produced such an output

Disambiguating “race”

Discriminative Models

• What if we went back to the start, to argmax P(tags|words), and didn’t use Bayes?
  - Can we get a handle on this directly?
  - First let’s generalize to P(tags|evidence)
  - Let’s make some independence assumptions and consider the previous state and the current word as the evidence. How does that look as a graphical model?
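Under those assumptions the tagger models the tag sequence directly; this is the standard MEMM-style decomposition (not taken verbatim from the missing slide):

    \hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(t_i \mid t_{i-1}, w_i)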

MaxEnt Tagging

MaxEnt Tagging

• This framework allows us to throw in a wide range of “features”, that is, evidence that can help with the tagging.
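A hypothetical sketch of what such features might look like in Python (the particular feature set here is illustrative, not the one from the lecture):

    def features(words, tags, i):
        """Evidence about position i for a MaxEnt/MEMM tagger: anything
        helpful can be thrown in as a named feature."""
        w = words[i]
        return {
            "word=" + w.lower(): 1.0,
            "prev_tag=" + (tags[i - 1] if i > 0 else "<s>"): 1.0,
            "suffix3=" + w[-3:]: 1.0,
            "is_capitalized": float(w[0].isupper()),
            "has_digit": float(any(c.isdigit() for c in w)),
        }

    print(features("Secretariat is expected to race".split(), ["NNP", "VBZ", "VBN", "TO"], 4))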

Statistical Sequence Classification

