CS 388: Natural Language Processing:: N-Gram Language Models

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 22

1

CS 388:
Natural Language Processing:
N-Gram Language Models
Raymond J. Mooney
University of Texas at Austin
Language Models
Formal grammars (e.g. regular, context free)
give a hard binary model of the legal
sentences in a language.
For NLP, a probabilistic model of a
language that gives a probability that a
string is a member of a language is more
useful.
To specify a correct probability distribution,
the probability of all sentences in a
language must sum to 1.
Uses of Language Models
Speech recognition
I ate a cherry is a more likely sentence than Eye eight
uh Jerry
OCR & Handwriting recognition
More probable sentences are more likely correct readings.
Machine translation
More likely sentences are probably better translations.
Generation
More likely sentences are probably better NL generations.
Context sensitive spelling correction
Their are problems wit this sentence.
Completion Prediction
A language model also supports predicting
the completion of a sentence.
Please turn off your cell _____
Your program does not ______
Predictive text input systems can guess what
you are typing and give choices on how to
complete it.

N-Gram Models
Estimate probability of each word given prior context.
P(phone | Please turn off your cell)
Number of parameters required grows exponentially with
the number of words of prior context.
An N-gram model uses only N1 words of prior context.
Unigram: P(phone)
Bigram: P(phone | cell)
Trigram: P(phone | your cell)
The Markov assumption is the presumption that the future
behavior of a dynamical system only depends on its recent
history. In particular, in a kth-order Markov model, the
next state only depends on the k most recent states,
therefore an N-gram model is a (N1)-order Markov model.
N-Gram Model Formulas
Word sequences

Chain rule of probability

Bigram approximation

N-gram approximation

n
n
w w w ...
1 1
=
) | ( ) | ( )... | ( ) | ( ) ( ) (
1
1
1
1
1
2
1 3 1 2 1 1

=

[
= =
k
n
k
k
n
n
n
w w P w w P w w P w w P w P w P
) | ( ) (
1
1
1
1

+
=
[
=
k
N k
n
k
k
n
w w P w P
) | ( ) (
1
1
1
=
[
=
k
n
k
k
n
w w P w P
Estimating Probabilities
N-gram conditional probabilities can be estimated
from raw text based on the relative frequency of
word sequences.




To have a consistent probabilistic model, append a
unique start (<s>) and end (</s>) symbol to every
sentence and treat these as additional words.


) (
) (
) | (
1
1
1

=
n
n n
n n
w C
w w C
w w P
) (
) (
) | (
1
1
1
1
1
1

+

+
=
n
N n
n
n
N n
n
N n n
w C
w w C
w w P
Bigram:
N-gram:
Generative Model & MLE
An N-gram model can be seen as a probabilistic
automata for generating sentences.



Relative frequency estimates can be proven to be
maximum likelihood estimates (MLE) since they
maximize the probability that the model M will
generate the training corpus T.

Initialize sentence with N1 <s> symbols
Until </s> is generated do:
Stochastically pick the next word based on the conditional
probability of each word given the previous N 1 words.
)) ( | ( argmax

M T P =
Example from Textbook
P(<s> i want english food </s>)
= P(i | <s>) P(want | i) P(english | want)
P(food | english) P(</s> | food)
= .25 x .33 x .0011 x .5 x .68 = .000031

P(<s> i want chinese food </s>)
= P(i | <s>) P(want | i) P(chinese | want)
P(food | chinese) P(</s> | food)
= .25 x .33 x .0065 x .52 x .68 = .00019

Train and Test Corpora
A language model must be trained on a large
corpus of text to estimate good parameter values.
Model can be evaluated based on its ability to
predict a high probability for a disjoint (held-out)
test corpus (testing on the training corpus would
give an optimistically biased estimate).
Ideally, the training (and test) corpus should be
representative of the actual application data.
May need to adapt a general model to a small
amount of new (in-domain) data by adding highly
weighted small corpus to original training data.

Unknown Words
How to handle words in the test corpus that
did not occur in the training data, i.e. out of
vocabulary (OOV) words?
Train a model that includes an explicit
symbol for an unknown word (<UNK>).
Choose a vocabulary in advance and replace
other words in the training corpus with
<UNK>.
Replace the first occurrence of each word in the
training data with <UNK>.

Evaluation of Language Models
Ideally, evaluate use of model in end application
(extrinsic, in vivo)
Realistic
Expensive
Evaluate on ability to model test corpus
(intrinsic).
Less realistic
Cheaper
Verify at least once that intrinsic evaluation
correlates with an extrinsic one.
Perplexity
Measure of how well a model fits the test data.
Uses the probability that the model assigns to the
test corpus.
Normalizes for the number of words in the test
corpus and takes the inverse.
N
N
w w w P
W PP
) ... (
1
) (
2 1
=
Measures the weighted average branching factor
in predicting the next word (lower is better).
Sample Perplexity Evaluation
Models trained on 38 million words from
the Wall Street Journal (WSJ) using a
19,979 word vocabulary.
Evaluate on a disjoint set of 1.5 million
WSJ words.
Unigram Bigram Trigram
Perplexity 962 170 109
Smoothing
Since there are a combinatorial number of possible
word sequences, many rare (but not impossible)
combinations never occur in training, so MLE
incorrectly assigns zero to many parameters (a.k.a.
sparse data).
If a new combination occurs during testing, it is
given a probability of zero and the entire sequence
gets a probability of zero (i.e. infinite perplexity).
In practice, parameters are smoothed (a.k.a.
regularized) to reassign some probability mass to
unseen events.
Adding probability mass to unseen events requires
removing it from seen ones (discounting) in order to
maintain a joint distribution that sums to 1.
Laplace (Add-One) Smoothing
Hallucinate additional training data in which each
possible N-gram occurs exactly once and adjust
estimates accordingly.



where V is the total number of possible (N1)-grams
(i.e. the vocabulary size for a bigram model).
V w C
w w C
w w P
n
n n
n n
+
+
=

) (
1 ) (
) | (
1
1
1
V w C
w w C
w w P
n
N n
n
n
N n
n
N n n
+
+
=

+

+
) (
1 ) (
) | (
1
1
1
1
1
1
Bigram:
N-gram:
Tends to reassign too much mass to unseen events,
so can be adjusted to add 0<o<1 (normalized by oV
instead of V).
Advanced Smoothing
Many advanced techniques have been
developed to improve smoothing for
language models.
Good-Turing
Interpolation
Backoff
Kneser-Ney
Class-based (cluster) N-grams
Model Combination
As N increases, the power (expressiveness)
of an N-gram model increases, but the
ability to estimate accurate parameters from
sparse data decreases (i.e. the smoothing
problem gets worse).
A general approach is to combine the results
of multiple N-gram models of increasing
complexity (i.e. increasing N).
Interpolation
Linearly combine estimates of N-gram
models of increasing order.
Learn proper values for
i
by training to
(approximately) maximize the likelihood of
an independent development (a.k.a. tuning)
corpus.
) ( ) | ( ) | ( ) | (

3 1 2 1 , 2 1 1 , 2 n n n n n n n n n
w P w w P w w w P w w w P + + =

Interpolated Trigram Model:
1 =

i
i
Where:
Backoff
Only use lower-order model when data for higher-
order model is unavailable (i.e. count is zero).
Recursively back-off to weaker models until data
is available.

>
=

+

+
+

+
+
otherwise ) | ( ) (
1 ) ( if ) | ( *
) | (
1
2
1
1
1
1
1 1
1
n
N n n katz
n
N n
n
N n
n
N n n n
N n n katz
w w P w
w C w w P
w w P
o
Where P* is a discounted probability estimate to reserve
mass for unseen events and os are back-off weights (see
text for details).
A Problem for N-Grams:
Long Distance Dependencies
Many times local context does not provide the
most useful predictive clues, which instead are
provided by long-distance dependencies.
Syntactic dependencies
The man next to the large oak tree near the grocery store on
the corner is tall.
The men next to the large oak tree near the grocery store on
the corner are tall.
Semantic dependencies
The bird next to the large oak tree near the grocery store on
the corner flies rapidly.
The man next to the large oak tree near the grocery store on
the corner talks rapidly.
More complex models of language are needed to
handle such dependencies.
Summary
Language models assign a probability that a
sentence is a legal string in a language.
They are useful as a component of many NLP
systems, such as ASR, OCR, and MT.
Simple N-gram models are easy to train on
unsupervised corpora and can provide useful
estimates of sentence likelihood.
MLE gives inaccurate parameters for models
trained on sparse data.
Smoothing techniques adjust parameter estimates
to account for unseen (but not impossible) events.

You might also like