PCFG


Tutorial on Probabilistic Context‐Free Grammars
Raphael Hoffmann
590AI, Winter 2009
Outline
• PCFGs: Inference and Learning
• Parsing English
• Discriminative CFGs
• Grammar Induction
Image Search for “pcfg”

[screenshot: Live Search image results for “pcfg”]
Outline
• PCFGs: Inference and Learning
• Parsing English
• Discriminative CFGs
• Grammar Induction
The velocity of the seismic waves rises to …

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
A CFG consists of
• Terminals w^1, w^2, ..., w^V
• Nonterminals N^1, N^2, ..., N^n
• Start symbol N^1
• Rules N^i → ζ_j, where ζ_j is a sequence of terminals and nonterminals

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
A (generative) PCFG consists of
• Terminals w^1, w^2, ..., w^V
• Nonterminals N^1, N^2, ..., N^n
• Start symbol N^1
• Rules N^i → ζ_j, where ζ_j is a sequence of terminals and nonterminals

• Rule probabilities such that ∀i: Σ_j P(N^i → ζ_j) = 1

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
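To make the definition concrete, here is a minimal sketch (in Python, not from the original slides) of a PCFG as a plain data structure, with a check of the normalization constraint ∀i: Σ_j P(N^i → ζ_j) = 1. The dictionary layout and the toy rule probabilities are illustrative choices only.

class PCFG:
    def __init__(self, start, rules):
        # rules: dict mapping each nonterminal to a list of (right-hand side, probability)
        self.start = start
        self.rules = rules

    def check_normalized(self, tol=1e-9):
        # The PCFG constraint: for every nonterminal N^i, sum_j P(N^i -> zeta_j) = 1
        for lhs, expansions in self.rules.items():
            total = sum(p for _, p in expansions)
            assert abs(total - 1.0) < tol, f"rules for {lhs} sum to {total}, not 1"

# Tiny illustrative grammar (rule probabilities chosen only for the example).
g = PCFG(
    start="S",
    rules={
        "S":  [(("NP", "VP"), 1.0)],
        "NP": [(("astronomers",), 0.5), (("stars",), 0.5)],
        "VP": [(("V", "NP"), 1.0)],
        "V":  [(("saw",), 1.0)],
    },
)
g.check_normalized()

Any representation works, as long as the probabilities of each nonterminal's expansions sum to one.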
Notation
• sentence: a sequence of words w_1 w_2 ... w_m
• w_ab: the subsequence w_a ... w_b
• N^i_ab: nonterminal N^i dominates w_a ... w_b
• N^i ⇒* ζ: repeated derivation from N^i gives ζ
Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Probability of a Sentence

P(w_1n) = Σ_t P(w_1n, t)

where t is a parse tree of w_1n

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Example

• Terminals with, saw, astronomers, ears, stars, telescopes
• Nonterminals S, PP, P, NP, VP, V
• Start symbol S

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
astronomers saw stars with ears

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
astronomers saw stars with ears

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Probabilities

P(t1) = 1.0 × 0.1 × 0.7 × 1.0 × 0.4 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0009072

P(t2) = 1.0 × 0.1 × 0.3 × 0.7 × 1.0 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0006804

P(w_15) = P(t1) + P(t2) = 0.0015876

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
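The two parse probabilities are just products of the rule probabilities used in each tree, so the arithmetic on the slide can be reproduced directly (values copied from the slide):

p_t1 = 1.0 * 0.1 * 0.7 * 1.0 * 0.4 * 0.18 * 1.0 * 1.0 * 0.18   # ≈ 0.0009072
p_t2 = 1.0 * 0.1 * 0.3 * 0.7 * 1.0 * 0.18 * 1.0 * 1.0 * 0.18   # ≈ 0.0006804
p_sentence = p_t1 + p_t2                                        # ≈ 0.0015876
print(round(p_t1, 7), round(p_t2, 7), round(p_sentence, 7))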
Assumptions of PCFGs
1. Place invariance (like time invariance in HMMs)
∀k: P(N^j_{k(k+c)} → ζ) is the same
Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Assumptions of PCFGs
1. Place invariance (like time invariance in HMMs)
∀k: P(N^j_{k(k+c)} → ζ) is the same

2. Context free
P(N^j_{kl} → ζ | words outside w_k ... w_l) = P(N^j_{kl} → ζ)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Assumptions of PCFGs
1. Place invariance (like time invariance in HMMs)
∀k: P(N^j_{k(k+c)} → ζ) is the same

2. Context free
P(N^j_{kl} → ζ | words outside w_k ... w_l) = P(N^j_{kl} → ζ)

3. Ancestor free
P(N^j_{kl} → ζ | ancestor nodes of N^j_{kl}) = P(N^j_{kl} → ζ)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Some Features of PCFGs
• Partial solution for grammar ambiguity
• Can be learned from positive data alone
(but grammar induction difficult)
• Robustness
(admit everything with low probability)
• Gives a probabilistic language model
• Predictive power better than that of an HMM

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Some Features of PCFGs
• Not lexicalized (probabilities do not factor in
lexical co‐occurrence)
• PCFG is a worse language model for English
than n‐gram models
• Certain biases: smaller trees more probable
(average WSJ sentence 23 words)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Inconsistent Distributions

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Questions
Let w_1m be a sentence, G a grammar, t a parse tree.

• What is the probability of a sentence?
  P(w_1m | G)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Questions
Let w_1m be a sentence, G a grammar, t a parse tree.

• What is the probability of a sentence?
  P(w_1m | G)
• What is the most likely parse of a sentence?
  argmax_t P(t | w_1m, G)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Questions
Let w_1m be a sentence, G a grammar, t a parse tree.

• What is the probability of a sentence?
  P(w_1m | G)
• What is the most likely parse of a sentence?
  argmax_t P(t | w_1m, G)
• What rule probabilities maximize the probability of the sentences?
  Find G that maximizes P(w_1m | G)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Chomsky Normal Form
• Any CFG can be represented in CNF, where all rules take the form
  N^i → N^j N^k
  N^i → w^j

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
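One part of the conversion to CNF is binarizing rules whose right-hand side has more than two symbols. A small sketch of that step, assuming a fresh-nonterminal naming scheme of my own choosing (unary-rule elimination and terminal handling in long rules are omitted):

# Binarize a long rule N -> X1 X2 ... Xk (k > 2) by introducing fresh nonterminals.
# The original probability stays on the first new rule; introduced rules get 1.0.
def binarize(lhs, rhs, prob):
    new_rules = []
    current_lhs, current_prob = lhs, prob
    while len(rhs) > 2:
        fresh = f"{lhs}|{'_'.join(rhs[1:])}"        # hypothetical naming scheme
        new_rules.append((current_lhs, (rhs[0], fresh), current_prob))
        current_lhs, current_prob, rhs = fresh, 1.0, rhs[1:]
    new_rules.append((current_lhs, tuple(rhs), current_prob))
    return new_rules

print(binarize("VP", ("V", "NP", "PP"), 0.2))
# [('VP', ('V', 'VP|NP_PP'), 0.2), ('VP|NP_PP', ('NP', 'PP'), 1.0)]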
HMMs and PCFGs
• HMMs: distribution over strings of a certain length
  ∀n: Σ_{w_1n} P(w_1n) = 1
• PCFGs: distribution over strings of the language L
  Σ_{w ∈ L} P(w) = 1

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
HMMs and PCFGs
• HMMs: distribution over strings of a certain length
  ∀n: Σ_{w_1n} P(w_1n) = 1
• PCFGs: distribution over strings of the language L
  Σ_{w ∈ L} P(w) = 1
• Consider
  high probability in HMM, low probability in PCFG

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
HMMs and PCFGs
• HMMs: distribution over strings of a certain length
  ∀n: Σ_{w_1n} P(w_1n) = 1
• PCFGs: distribution over strings of the language L
  Σ_{w ∈ L} P(w) = 1
• Consider
  high probability in HMM, low probability in PCFG

• Probabilistic Regular Grammar
  N^i → w^j N^k
  N^i → w^j
Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
HMMs and PCFGs

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Inside and Outside Probabilities
• For HMMs we have

Forwards:  α_i(t) = P(w_1(t−1), X_t = i)
Backwards: β_i(t) = P(w_tT | X_t = i)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Inside and Outside Probabilities
• For HMMs we have

Forwards:  α_i(t) = P(w_1(t−1), X_t = i)
Backwards: β_i(t) = P(w_tT | X_t = i)

• For PCFGs we have
  Outside: α_j(p, q) = P(w_1(p−1), N^j_pq, w_(q+1)m | G)
  Inside:  β_j(p, q) = P(w_pq | N^j_pq, G)
Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Inside and Outside Probabilities

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Probability of a sentence
Outside: α_j(p, q) = P(w_1(p−1), N^j_pq, w_(q+1)m | G)
Inside:  β_j(p, q) = P(w_pq | N^j_pq, G)

P(w_1m | G) = β_1(1, m)

P(w_1m | G) = Σ_j α_j(k, k) P(N^j → w_k)   (for any position k)
Inside Probabilities
β_j(p, q) = P(w_pq | N^j_pq, G)

• Base case
  β_j(k, k) = P(w_kk | N^j_kk, G) = P(N^j → w_k | G)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Inside Probabilities
β_j(p, q) = P(w_pq | N^j_pq, G)

• Base case
  β_j(k, k) = P(w_kk | N^j_kk, G) = P(N^j → w_k | G)

• Induction
  Want to find β_j(p, q) for p < q

  Since we assume Chomsky Normal Form, the first rule must be of the form
  N^j → N^r N^s

  So we can divide the sentence in two at various places and sum the results:

  β_j(p, q) = Σ_{r,s} Σ_{d=p}^{q−1} P(N^j → N^r N^s) β_r(p, d) β_s(d + 1, q)
Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
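A sketch of this inside recursion in Python. It assumes a CNF grammar given as a lexical table (word → list of (nonterminal, probability)) and a list of binary rules (lhs, left, right, probability); indices are 1-based and inclusive, as on the slides.

from collections import defaultdict

def inside(words, lexical, binary):
    # lexical: dict  word -> list of (nonterminal, prob)        for N^j -> w
    # binary:  list of (lhs, left, right, prob)                 for N^j -> N^r N^s
    # Returns beta[(j, p, q)] = P(w_p .. w_q | N^j spans p..q), indices 1-based inclusive.
    m = len(words)
    beta = defaultdict(float)
    # Base case: beta_j(k, k) = P(N^j -> w_k)
    for k, w in enumerate(words, start=1):
        for nt, p in lexical.get(w, []):
            beta[(nt, k, k)] += p
    # Induction: widen the span, summing over split points d and rules N^j -> N^r N^s
    for span in range(2, m + 1):
        for p in range(1, m - span + 2):
            q = p + span - 1
            for lhs, left, right, prob in binary:
                for d in range(p, q):
                    beta[(lhs, p, q)] += prob * beta[(left, p, d)] * beta[(right, d + 1, q)]
    return beta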
CYK Algorithm

astronomers saw stars with ears

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
CYK Algorithm

[chart figure: word-level (diagonal) cells — β_NP = 0.1 (astronomers), β_NP = 0.04 (saw, its β_V cell still marked ?), β_NP = 0.18 (stars), β_P = 1.0 (with), β_NP = 0.18 (ears)]
astronomers saw stars with ears

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
CYK Algorithm

[chart figure: length-2 spans filled — β_VP(2,3) = 0.126 and β_PP(4,5) = 0.18; longer spans still marked ?; word-level cells β_NP = 0.1, β_V = 1.0 / β_NP = 0.04, β_NP = 0.18, β_P = 1.0, β_NP = 0.18]
astronomers saw stars with ears

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
CYK Algorithm

[chart figure: larger spans now filled — β_S(1,5) = 0.0015876, β_VP(2,5) = 0.015876, β_S(1,3) = 0.0126, β_NP(3,5) = 0.01296, on top of β_VP(2,3) = 0.126, β_PP(4,5) = 0.18 and the word-level cells β_NP = 0.1, β_V = 1.0 / β_NP = 0.04, β_NP = 0.18, β_P = 1.0, β_NP = 0.18]
astronomers saw stars with ears

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
CYK Algorithm
Worst case: O(m³r)
  m = length of sentence
  r = number of rules in grammar
  n = number of non-terminals
If we consider all possible CNF rules: O(m³n³)

[chart figure: the completed chart — β_S(1,5) = 0.0015876, β_VP(2,5) = 0.015876, β_S(1,3) = 0.0126, β_NP(3,5) = 0.01296, β_VP(2,3) = 0.126, β_PP(4,5) = 0.18, and the word-level cells β_NP = 0.1, β_V = 1.0 / β_NP = 0.04, β_NP = 0.18, β_P = 1.0, β_NP = 0.18]
astronomers saw stars with ears

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
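Running the inside() sketch above on the running example reproduces the chart. The rule probabilities below are read off the chart values on the slides (e.g. P(VP → V NP) = 0.7, P(NP → NP PP) = 0.4) and are an assumption to that extent.

lexical = {
    "astronomers": [("NP", 0.1)],
    "ears":        [("NP", 0.18)],
    "saw":         [("NP", 0.04), ("V", 1.0)],
    "stars":       [("NP", 0.18)],
    "telescopes":  [("NP", 0.1)],
    "with":        [("P", 1.0)],
}
binary = [
    ("S",  "NP", "VP", 1.0),
    ("PP", "P",  "NP", 1.0),
    ("VP", "V",  "NP", 0.7),
    ("VP", "VP", "PP", 0.3),
    ("NP", "NP", "PP", 0.4),
]
words = "astronomers saw stars with ears".split()
beta = inside(words, lexical, binary)
print(beta[("VP", 2, 3)])   # ≈ 0.126
print(beta[("NP", 3, 5)])   # ≈ 0.01296
print(beta[("VP", 2, 5)])   # ≈ 0.015876
print(beta[("S",  1, 5)])   # ≈ 0.0015876 = P(w_15 | G)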
Outside Probabilities
• Compute top‐down (after inside probabilities)
• Base case
  α_1(1, m) = 1
  α_j(1, m) = 0, for j ≠ 1

• Induction
  α_j(p, q) = Σ_{f,g} Σ_{e=q+1}^{m} α_f(p, e) P(N^f → N^j N^g) β_g(q + 1, e)
            + Σ_{f,g} Σ_{e=1}^{p−1} α_f(e, q) P(N^f → N^g N^j) β_g(e, p − 1)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
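A sketch of this outside recursion, run top-down over progressively smaller spans after inside() has filled β. It reuses the grammar representation and the inside() sketch from above; start is the start symbol.

from collections import defaultdict

def outside(m, start, binary, beta):
    # alpha[(j, p, q)] = P(w_1..w_{p-1}, N^j spans p..q, w_{q+1}..w_m | G)
    alpha = defaultdict(float)
    alpha[(start, 1, m)] = 1.0                  # base case; alpha_j(1, m) = 0 for j != start
    for span in range(m - 1, 0, -1):            # progressively smaller spans
        for p in range(1, m - span + 2):
            q = p + span - 1
            for f, l, r, prob in binary:        # rule N^f -> N^l N^r
                # span p..q is the left child: parent covers p..e, sibling covers q+1..e
                for e in range(q + 1, m + 1):
                    alpha[(l, p, q)] += alpha[(f, p, e)] * prob * beta[(r, q + 1, e)]
                # span p..q is the right child: parent covers e..q, sibling covers e..p-1
                for e in range(1, p):
                    alpha[(r, p, q)] += alpha[(f, e, q)] * prob * beta[(l, e, p - 1)]
    return alpha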
Probability of a node existing
• As with an HMM, we can form a product of the inside and outside probabilities.
  α_j(p, q) β_j(p, q) = P(w_1m, N^j_pq | G)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Probability of a node existing
• As with an HMM, we can form a product of the inside and outside probabilities.
  α_j(p, q) β_j(p, q) = P(w_1m, N^j_pq | G)

• Therefore,
  P(w_1m, N_pq | G) = Σ_j α_j(p, q) β_j(p, q)

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Probability of a node existing
• As with an HMM, we can form a product of the inside and outside probabilities.
  α_j(p, q) β_j(p, q) = P(w_1m, N^j_pq | G)

• Therefore,
  P(w_1m, N_pq | G) = Σ_j α_j(p, q) β_j(p, q)

• Just in the cases of the root node and the preterminals, we know there will be
  some such constituent.
Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Training
• If we have data, count:
  P̂(N^j → ζ) = C(N^j → ζ) / Σ_γ C(N^j → γ)
Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
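A sketch of this counting estimate, assuming trees are given as nested (label, child, child, ...) tuples with terminal strings at the leaves (the tree encoding is my own choice):

from collections import Counter, defaultdict

def count_rules(tree, counts):
    # tree: ("S", ("NP", "astronomers"), ("VP", ...)); leaves are plain strings
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            count_rules(c, counts)

def estimate(treebank):
    counts, totals = Counter(), defaultdict(int)
    for tree in treebank:
        count_rules(tree, counts)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    # P_hat(N^j -> zeta) = C(N^j -> zeta) / sum_gamma C(N^j -> gamma)
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

toy = [("S", ("NP", "astronomers"), ("VP", ("V", "saw"), ("NP", "stars")))]
print(estimate(toy))   # the two NP rules get 0.5 each, all other rules 1.0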
Training
• If we have data, count:
  P̂(N^j → ζ) = C(N^j → ζ) / Σ_γ C(N^j → γ)

• else use EM (the Inside‐Outside algorithm)

  repeat
    compute the α_j's and β_j's
    compute the P̂'s
      P̂(N^j → N^r N^s) = . . .
      P̂(N^j → w^k) = . . .
  end

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
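A skeleton of one Inside-Outside reestimation step for the binary rules, using the inside() and outside() sketches from earlier. The expected-count expression follows the standard derivation; the reestimate for lexical rules N^j → w^k is analogous and omitted here.

from collections import defaultdict

def reestimate_binary(corpus, start, lexical, binary):
    # One EM step for rules N^j -> N^r N^s, using inside() and outside() from above.
    num = defaultdict(float)        # expected uses of each binary rule
    den = defaultdict(float)        # expected expansions of each nonterminal
    lhs_set = {rule[0] for rule in binary}
    for words in corpus:
        m = len(words)
        beta = inside(words, lexical, binary)
        alpha = outside(m, start, binary, beta)
        z = beta[(start, 1, m)]     # P(w_1m | G)
        if z == 0.0:
            continue
        for j, r, s, prob in binary:
            for p in range(1, m + 1):
                for q in range(p + 1, m + 1):
                    for d in range(p, q):
                        num[(j, r, s)] += (alpha[(j, p, q)] * prob *
                                           beta[(r, p, d)] * beta[(s, d + 1, q)]) / z
        for p in range(1, m + 1):
            for q in range(p, m + 1):
                for j in lhs_set:
                    den[j] += alpha[(j, p, q)] * beta[(j, p, q)] / z
    return {(j, r, s): num[(j, r, s)] / den[j] for (j, r, s, _) in binary if den[j] > 0}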
EM Problems
• Slow: O(m³n³) for each sentence and each iteration
• Local maxima (Charniak: 300 trials led to 300
different max.)
• In practice, need >3 times more non‐terminals
than are theoretically needed
• No guarantee that learned non‐terminals
correspond to NP, VP, …

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Bracketing helps
Pereira/Schabes ’92:
• Train on sentences:
37% of predicted brackets correct
• Train on sentences + brackets:
90% of predicted brackets correct

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Grammar Induction
• Rules typically selected by linguist
• Automatic induction is very difficult for
context‐free languages
• It is easy to find some form of structure, but
little resemblance to that of linguistics/NLP

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Outline
• PCFGs: Inference and Learning
• Parsing English
• Discriminative CFGs
• Grammar Induction
Parsing for Disambiguation
The post office will hold out discounts and
service concessions as incentives.

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Parsing for Disambiguation
• There are typically many syntactically
possible parses
• Want to find the most likely parses

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Treebanks
• If grammar induction does not work, why not
count expansions in many parse trees?
• Penn Treebank

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
PCFG weaknesses
• No Context
– (immediate prior context, speaker, …)
• No Lexicalization
– “VP → V NP NP” more likely if the verb is “hand” or “tell”
– fail to capture lexical dependencies (n‐grams do)
• No Structural Context
– How NP expands depends on position

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
PCFG weaknesses
Expansion % as Subj % as Obj
NP −→ PRP 13.7% 2.1%
NP −→ NNP 3.5% 0.9%
NP −→ DT NN 5.6% 4.6%
NP −→ NN 1.4% 2.8%
NP −→ NP SBAR 0.5% 2.6%
NP −→ NP PP 5.6% 14.1%
Expansion % as 1st Obj % as 2nd Obj
NP −→ NNS 7.5% 0.2%
NP −→ PRP 13.4% 0.9%
NP −→ NP PP 12.2% 14.4%
NP −→ DT NN 10.4% 13.3%
NP −→ NNP 4.5% 5.9%
NP −→ NN 3.9% 9.2%
NP −→ JJ NN 1.1% 10.4%
NP −→ NP SBAR 0.3% 5.1%
Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Other Approaches
• Challenge: use lexical and structural context without requiring too many
  parameters (sparse data)
• Other Grammars
– Probabilistic Left‐Corner Grammars
– Phrase Structure Grammars
– Dependency Grammars
– Probabilistic Tree Substitution Grammars
– History‐based Grammars

Slide based on “Foundations of Statistical Natural Language Processing” by Christopher Manning and Hinrich Schütze
Outline
• PCFGs: Inference and Learning
• Parsing English
• Discriminative CFGs
• Grammar Induction
Generative vs Discriminative
• An HMM (or PCFG) is a generative model
  P(y, w)

• Often a discriminative model is sufficient
  P(y | w)

• Easier, because it does not contain P(w)

• Cannot model dependent features in an HMM, so one typically picks only one
  feature: the word’s identity
Generative and Discriminative Models
[diagram: generative directed models and their conditional counterparts —
 Naïve Bayes ↔ Logistic Regression, HMMs (sequence) ↔ Linear‐chain CRFs,
 general directed models (general graphs) ↔ General CRFs]
Slide based on “An introduction to Conditional Random Fields for Relational Learning” by Charles Sutton and Andrew McCallum
Generative and Discriminative Models
[diagram: as before, with a Tree column added on the generative side —
 Naïve Bayes, HMMs (sequence), PCFGs (tree), general directed models (general graphs);
 conditional side: Logistic Regression, Linear‐chain CRFs, General CRFs]

Generative and Discriminative Models
[diagram: the same picture, now with a “?” in the conditional row for trees —
 the discriminative counterpart of PCFGs, between Linear‐chain CRFs and General CRFs]


Discriminative
Context‐Free Grammars
• Terminals w^1, w^2, ..., w^V
• Nonterminals N^1, N^2, ..., N^n
• Start symbol N^1
• Rules N^i → ζ_j, where ζ_j is a sequence of terminals and nonterminals

• Rule scores
  S(N^i → ζ_j, p, q) = Σ_{k=1}^{F} λ_k(N^i → ζ_j) f_k(w_1 w_2 ... w_m, p, q, N^i → ζ_j)
Slide based on “Learning to Extract Information from Semi‐Structured Text using a Discriminative Context Free Grammar” by Paul Viola and Mukund Narasimhan
Features
S(N^i → ζ_j, p, q) = Σ_{k=1}^{F} λ_k(N^i → ζ_j) f_k(w_1 w_2 ... w_m, p, q, N^i → ζ_j)

• Features can depend on all tokens + the span
• Consider the feature “AllOnTheSameLine”
  “Mavis Wood Products” on one line   vs.   “Mavis Wood” / “Products” split across lines

[compare to a linear‐chain CRF feature f_k(s_t, s_{t−1}, w_1 w_2 ... w_m, t)]

Slide based on “Learning to Extract Information from Semi‐Structured Text using a Discriminative Context Free Grammar” by Paul Viola and Mukund Narasimhan
Features
S(N^i → ζ_j, p, q) = Σ_{k=1}^{F} λ_k(N^i → ζ_j) f_k(w_1 w_2 ... w_m, p, q, N^i → ζ_j)

• No independence between features necessary
• Can create features based on words, dictionaries, digits, capitalization, …
• Can still do efficient Viterbi inference in O(m³r)

Slide based on “Learning to Extract Information from Semi‐Structured Text using a Discriminative Context Free Grammar” by Paul Viola and Mukund Narasimhan
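A sketch of how such a rule score can be computed: a per-rule weight vector combined with feature functions that may inspect the whole token sequence and the span. The Token type and the two features shown (same-line and all-digits) are illustrative assumptions, not the features of Viola and Narasimhan's system.

from collections import namedtuple

Token = namedtuple("Token", ["text", "line"])   # token text and the line it appears on

def all_on_same_line(tokens, p, q, rule):
    # 1.0 if every token in the span w_p..w_q sits on the same physical line
    return 1.0 if len({t.line for t in tokens[p - 1:q]}) == 1 else 0.0

def span_is_digits(tokens, p, q, rule):
    return 1.0 if all(t.text.isdigit() for t in tokens[p - 1:q]) else 0.0

FEATURES = [all_on_same_line, span_is_digits]

def rule_score(weights, tokens, p, q, rule):
    # S(rule, p, q) = sum_k lambda_k(rule) * f_k(tokens, p, q, rule)
    return sum(lam * f(tokens, p, q, rule) for lam, f in zip(weights[rule], FEATURES))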
Example

BizContact −→ BizName Address BizPhone


PersonalContact −→ BizName Address HomePhone

Slide based on “Learning to Extract Information from Semi‐Structured Text using a Discriminative Context Free Grammar” by Paul Viola and Mukund Narasimhan
Example

Slide based on “Learning to Extract Information from Semi‐Structured Text using a Discriminative Context Free Grammar” by Paul Viola and Mukund Narasimhan
Training
• Train feature weight vector for each rule
• Have labels, but not parse trees;
efficiently create trees by ignoring leaves

Slide based on “Learning to Extract Information from Semi‐Structured Text using a Discriminative Context Free Grammar” by Paul Viola and Mukund Narasimhan
Collins’ Averaged Perceptron

Slide based on “Learning to Extract Information from Semi‐Structured Text using a Discriminative Context Free Grammar” by Paul Viola and Mukund Narasimhan
Results

                     Linear CRF   Discriminative CFG   Improvement
Word Error Rate        11.57%          6.29%             45.63%
Record Error Rate      54.71%         27.13%             50.41%

Slide based on “Learning to Extract Information from Semi‐Structured Text using a Discriminative Context Free Grammar” by Paul Viola and Mukund Narasimhan
Outline
• PCFGs: Inference and Learning
• Parsing English
• Discriminative CFGs
• Grammar Induction
Gold’s Theorem ‘67
“Any formal language which has hierarchical
structure capable of infinite recursion is
unlearnable from positive evidence alone.”

Slide based on Wikipedia


Empirical Problems
• Even finite search spaces can be too big
• Noise
• Insufficient data
• Many local optima

Slide based on “Unsupervised grammar induction with Minimum Description Length” by Roni Katzir
Common Approach
• Minimize total description length
• Simulated Annealing

Slide based on “Unsupervised grammar induction with Minimum Description Length” by Roni Katzir
random_neighbor(G)
• Insert:

• Delete

• New Rule

• Split

• Substitute

Slide based on “Unsupervised grammar induction with Minimum Description Length” by Roni Katzir
Energy

• Define a binary encoding of the grammar, code(G), and of the data given the grammar, code(D|G)
• Energy = total description length |code(G)| + |code(D|G)|

Slide based on “Unsupervised grammar induction with Minimum Description Length” by Roni Katzir
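A skeleton of the search loop this describes: simulated annealing over grammars with the total description length as the energy. The energy and random_neighbor functions are left abstract; random_neighbor is assumed to apply one of the edit operations listed on the previous slide.

import math, random

def anneal(grammar, data, energy, random_neighbor,
           t_start=10.0, t_end=0.01, cooling=0.999):
    # energy(G, D): total description length |code(G)| + |code(D | G)|
    # random_neighbor(G): applies one random edit (insert, delete, new rule, split, substitute)
    current, current_e = grammar, energy(grammar, data)
    t = t_start
    while t > t_end:
        candidate = random_neighbor(current)
        candidate_e = energy(candidate, data)
        # always accept improvements; accept worse grammars with Boltzmann probability
        if candidate_e <= current_e or random.random() < math.exp((current_e - candidate_e) / t):
            current, current_e = candidate, candidate_e
        t *= cooling
    return current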
Experiment 1
• Word segmentation by 8‐month old infants
• Vocabulary: pabiku, golatu, daropi, tibudo
• Saffran ’96: use speech synthesizer, no word
breaks, 2 minutes = 180 words
• Infants can distinguish words from non‐words
• Now try grammar induction (60 words)

Slide based on “Unsupervised grammar induction with Minimum Description Length” by Roni Katzir
Experiment 1

Slide based on “Unsupervised grammar induction with Minimum Description Length” by Roni Katzir
Experiment 2

Slide based on “Unsupervised grammar induction with Minimum Description Length” by Roni Katzir
Experiment 2

• Accurate segmentation
• Inaccurate structural learning
Slide based on “Unsupervised grammar induction with Minimum Description Length” by Roni Katzir
Prototype‐Driven Grammar Induction
• Semi‐supervised approach
• Give only a few dozen prototypical examples
(for NP e.g. determiner noun, pronouns, …)
• On English Penn Treebank: F1 = 65.1
(52% reduction over naïve PCFG induction)
Aria Haghighi and Dan Klein.
Prototype-Driven Grammar Induction.
ACL 2006
Dan Klein and Chris Manning.
A Generative Constituent-Context Model for Improved Grammar Induction.
ACL 2002
That’s it!
