Speech and Language Processing: SLP Chapter 5
Today
Parts of speech (POS)
Tagsets
POS tagging
Rule-based tagging
HMMs and the Viterbi algorithm
5/16/2012
Parts of Speech
8 (ish) traditional parts of speech:
Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
Also called: parts of speech, lexical categories, word classes, morphological classes, lexical tags...
There is lots of debate within linguistics about the number, nature, and universality of these.
We'll completely ignore this debate.
POS examples
N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection (WORD → tag):

the/DET koala/N put/V the/DET keys/N on/P the/DET table/N
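This word-by-word assignment can be sketched as a toy lookup tagger; the mini lexicon below is invented for illustration and sidesteps the ambiguity that makes real tagging hard:

```python
# Toy most-frequent-tag lookup: a (very) simplified picture of POS tagging.
# The lexicon is illustrative, not drawn from any real corpus.
LEXICON = {
    "the": "DET", "koala": "N", "put": "V",
    "keys": "N", "on": "P", "table": "N",
}

def tag(sentence):
    """Attach a tag to each word, using UNK for out-of-lexicon words."""
    return [(w, LEXICON.get(w, "UNK")) for w in sentence.split()]

print(tag("the koala put the keys on the table"))
```

Real taggers must instead choose among multiple possible tags per word, which is the subject of the rest of this lecture.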
Speech and Language Processing - Jurafsky and Martin
Why POS Tagging?
Parsing: need to know if a word is an N or a V before you can parse.
Information extraction: finding names, relations, etc.
Machine Translation
Verbs
In English, verbs have morphological affixes (eat/eats/eaten).
English Particles
Conjunctions
The more commonly used tagset is finer-grained: the Penn Treebank tagset, with 45 tags, e.g. PRP$, WRB, WP$, VBG.
POS Tagging
Words often have more than one POS. Consider back:

The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
These examples from Dekang Lin
Two Methods for POS Tagging
1. Rule-based tagging
2. Stochastic: probabilistic sequence models
HMM (Hidden Markov Model) tagging
MEMMs (Maximum Entropy Markov Models)
Rule-Based Tagging
Start with a dictionary.
Assign all possible tags to words from the dictionary.
Write rules by hand to selectively remove tags, leaving the correct tag for each word.
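The dictionary-then-rules idea can be sketched as follows; the mini dictionary and the single hand-written rule are invented for illustration, not taken from any real rule-based tagger:

```python
# Rule-based tagging sketch: assign all dictionary tags, then prune with rules.
# Dictionary and rule are illustrative only.
DICT = {
    "the": {"DET"},
    "race": {"NN", "VB"},
    "to": {"TO"},
}

def rule_prune(prev_tags, cur_tags):
    """Hand-written constraint: after TO, keep only the verb reading."""
    if "TO" in prev_tags and "VB" in cur_tags:
        return {"VB"}
    return cur_tags

def tag(words):
    prev = set()
    out = []
    for w in words:
        candidates = DICT.get(w, {"UNK"})   # all possible tags from the dictionary
        candidates = rule_prune(prev, candidates)  # rules remove wrong tags
        out.append((w, candidates))
        prev = candidates
    return out

print(tag(["to", "race"]))
```

Real systems of this kind use large dictionaries and on the order of a thousand hand-written constraints.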
And so on, for the ~100,000 words of English with more than one tag.
It is also related to the noisy channel model that's the basis for ASR, OCR, and MT.
What is the best sequence of tags that corresponds to this sequence of observations?
Probabilistic view:
Consider all possible sequences of tags.
Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w_1 ... w_n.
Getting to HMMs
We want, out of all sequences of n tags t_1 ... t_n, the single tag sequence such that P(t_1 ... t_n | w_1 ... w_n) is highest:

t̂_{1..n} = argmax_{t_{1..n}} P(t_{1..n} | w_{1..n})

The hat (^) means "our estimate of the best one"; argmax_x f(x) means "the x such that f(x) is maximized".
Getting to HMMs
This equation is guaranteed to give us the best tag sequence.
But how do we make it operational? How do we compute this value?
Intuition of Bayesian classification:
Use Bayes' rule to transform this equation into a set of other probabilities that are easier to compute.
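Written out, the transformation is the standard HMM-tagging derivation:

```latex
\hat{t}_{1..n}
  = \operatorname*{argmax}_{t_{1..n}} P(t_{1..n} \mid w_{1..n})
  = \operatorname*{argmax}_{t_{1..n}}
      \frac{P(w_{1..n} \mid t_{1..n})\, P(t_{1..n})}{P(w_{1..n})}
  = \operatorname*{argmax}_{t_{1..n}} P(w_{1..n} \mid t_{1..n})\, P(t_{1..n})
  \approx \operatorname*{argmax}_{t_{1..n}}
      \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

The denominator P(w_{1..n}) is dropped because it is the same for every candidate tag sequence; the final approximation uses two simplifying assumptions: each word depends only on its own tag, and each tag depends only on the previous tag.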
Disambiguating race
Example
P(NN|TO) = .00047
P(VB|TO) = .83
P(race|NN) = .00057
P(race|VB) = .00012
P(NR|VB) = .0027
P(NR|NN) = .0012

P(VB|TO) P(NR|VB) P(race|VB) = .00000027
P(NN|TO) P(NR|NN) P(race|NN) = .00000000032

So we (correctly) choose the verb reading.
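Plugging in the probabilities given on the slide confirms the comparison:

```python
# Compare the two readings of "race" using the slide's probabilities.
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)

print(f"VB reading: {p_vb:.2e}")
print(f"NN reading: {p_nn:.2e}")
# the verb reading wins by roughly three orders of magnitude
```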
Definitions
A weighted finite-state automaton adds probabilities to the arcs.
The probabilities on the arcs leaving any state must sum to one.
A Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through.
Markov chains can't represent inherently ambiguous problems.
They are useful for assigning probabilities to unambiguous sequences.
Transition probabilities:
a set of probabilities A = a_01 a_02 ... a_n1 ... a_nn.
Each a_ij represents the probability of transitioning from state i to state j.
The set of these is the transition probability matrix A.
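A small transition matrix with the sum-to-one constraint checked explicitly; the two weather states and the numbers are illustrative, not from the lecture's figures:

```python
# Transition probability matrix A: A[i][j] = P(next state j | current state i).
# Each row must sum to one. States and values are made up for illustration.
states = ["HOT", "COLD"]
A = [
    [0.7, 0.3],   # from HOT:  P(HOT|HOT),  P(COLD|HOT)
    [0.4, 0.6],   # from COLD: P(HOT|COLD), P(COLD|COLD)
]

for i, row in enumerate(A):
    assert abs(sum(row) - 1.0) < 1e-9, f"row {i} does not sum to 1"

print("all rows sum to 1")
```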
So we need an extension! A Hidden Markov Model is an extension of a Markov chain in which the input symbols are not the same as the states. This means we don't know which state we are in.
Transition probabilities
  Transition probability matrix A = {a_ij}:
    a_ij = P(q_t = j | q_{t-1} = i),  1 <= i, j <= N
Observation likelihoods
  Output probability matrix B = {b_i(k)}:
    b_i(k) = P(X_t = o_k | q_t = i)
Special initial probability vector π:
    π_i = P(q_1 = i),  1 <= i <= N
Eisner Task
Given
Ice Cream Observation Sequence: 1,2,3,2,2,2,3
Produce:
Weather Sequence: H,C,H,H,H,C (H = hot, C = cold)
Transition Probabilities
Observation Likelihoods
Decoding
Ok, now we have a complete model that can give us what we need. Recall that we need the most probable tag sequence given the observations.
We could just enumerate all paths given the input and use the model to assign probabilities to each.
Not a good idea: with N states and T observations there are N^T possible paths. Luckily dynamic programming (last seen in Ch. 3 with minimum edit distance) helps us here.
Viterbi Example
Viterbi Summary
Create an array with columns corresponding to inputs and rows corresponding to possible states.
Sweep through the array in one pass, filling the columns left to right using our transition probabilities and observation probabilities.
The dynamic programming key is that we need only store the MAX probability path to each cell (not all paths).
Evaluation
So once you have your POS tagger running, how do you evaluate it?
Overall error rate with respect to a gold-standard test set.
Error rates on particular tags.
Error rates on particular words.
Tag confusions...
Error Analysis
Look at a confusion matrix
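Both accuracy and a confusion matrix fall out of pairing gold tags with predicted tags; the tiny tag sequences below are invented for illustration:

```python
from collections import Counter

# Tag-level evaluation: accuracy plus (gold, predicted) confusion counts.
# The gold and predicted sequences are made up for illustration.
gold = ["DET", "NN", "VB", "DET", "NN", "IN"]
pred = ["DET", "NN", "NN", "DET", "NN", "IN"]

confusion = Counter(zip(gold, pred))
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(f"accuracy = {accuracy:.2f}")   # 5 of 6 tags correct
for (g, p), n in sorted(confusion.items()):
    print(f"gold {g} -> predicted {p}: {n}")
```

Off-diagonal entries such as gold VB predicted NN are exactly the tag confusions worth inspecting during error analysis.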
Evaluation
The result is compared with a manually coded Gold Standard.
Typically accuracy reaches 96-97%.
This may be compared with the result for a baseline tagger (one that uses no context).
Summary
Parts of speech
Tagsets
Part-of-speech tagging
HMM tagging
Markov chains
Hidden Markov Models