Natural Language Processing: N-Gram Language Models


Natural Language Processing:
N-Gram Language Models
Language Models
 Formal grammars (e.g. regular, context-free) give a hard “binary” model of the legal sentences in a language.
 For NLP, a probabilistic model of a
language that gives a probability that a
string is a member of a language is more
useful.
 To specify a correct probability distribution, the probabilities of all sentences in the language must sum to 1.
Uses of Language Models
 Speech recognition
 “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”
 OCR & Handwriting recognition
 More probable sentences are more likely to be correct readings.
 Machine translation
 More likely sentences are probably better translations.
 Generation
 More likely sentences are probably better NL generations.
 Context sensitive spelling correction
 “Their are problems wit this sentence.”
Completion Prediction

 A language model also supports predicting the completion of a sentence.
 Please turn off your cell _____
 Your program does not ______
 Predictive text input systems can guess
what you are typing and give choices on
how to complete it.
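
As a rough illustration of completion prediction, the sketch below picks the most frequent followers of the last typed word from bigram counts; the counts here are made up for the example rather than harvested from a real corpus.

```python
from collections import Counter

# Hypothetical bigram counts standing in for counts collected from a corpus.
bigram_counts = Counter({
    ("cell", "phone"): 120, ("cell", "wall"): 15, ("cell", "division"): 9,
    ("not", "compile"): 40, ("not", "work"): 85, ("not", "terminate"): 12,
})

def complete(prev_word, k=3):
    """Return the k words most frequently observed after prev_word."""
    followers = Counter({w: c for (p, w), c in bigram_counts.items() if p == prev_word})
    return [w for w, _ in followers.most_common(k)]

print(complete("cell"))   # ['phone', 'wall', 'division']
```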
N-Gram Models
 Estimate probability of each word given prior context.
 P(phone | Please turn off your cell)
 Number of parameters required grows exponentially
with the number of words of prior context.
 An N-gram model uses only N−1 words of prior context.
 Unigram: P(phone)
 Bigram: P(phone | cell)
 Trigram: P(phone | your cell)
 The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N−1)th-order Markov model.
N-Gram Model Formulas
 Word sequences: $w_1^n = w_1 \ldots w_n$
 Chain rule of probability:
$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$$
 Bigram approximation:
$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$
 N-gram approximation:
$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$$
Estimating Probabilities
 N-gram conditional probabilities can be
estimated from raw text based on the relative
frequency of word sequences.
Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$
N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
 To have a consistent probabilistic model, append
a unique start (<s>) and end (</s>) symbol to
every sentence and treat these as additional
words.
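
A minimal sketch of this relative-frequency estimation with <s>/</s> padding; the function names and the toy two-sentence corpus are my own illustrations.

```python
from collections import Counter

def train_ngram_mle(sentences, n=2):
    """Count N-grams and their (N-1)-word contexts, padding with <s>/</s>."""
    ngram_counts, context_counts = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] * (n - 1) + s.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts

def mle_prob(word, context, ngram_counts, context_counts):
    """P(word | context) = C(context word) / C(context); unseen N-grams get 0."""
    return ngram_counts[tuple(context) + (word,)] / context_counts[tuple(context)]

# Illustrative toy corpus (not the one used later on the slides).
ngrams, contexts = train_ngram_mle(["the cat sat", "the dog sat"], n=2)
print(mle_prob("cat", ("the",), ngrams, contexts))   # 0.5
```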
Word Prediction
 Guess the next word...
 ... I notice three guys standing on the ???
 There are many sources of knowledge
that could be helpful, including world
knowledge.
 But we can do pretty well by simply
looking at the preceding words and
keeping track of simple counts.
Word Prediction
 Formalize this task using N-gram models.
 N-grams are token sequences of length N.
 Given N-gram counts, we can guess likely next words in a sequence.

N-Gram Models
 More formally, we will assess the
conditional probability of candidate words.
 Also, we will assess the probability of an
entire sequence of words.
 Being able to predict the next word (or any linguistic unit) in a sequence is central to many applications.

Applications
 It lies at the core of the following applications:
 Automatic speech recognition
 Handwriting and character recognition
 Spelling correction
 Machine translation
 Augmentative communication
 Word similarity, generation, POS tagging, etc.

Counting
 He stepped out into the hall, was delighted to
encounter a water brother.
 13 tokens, 15 if we include “,” and “.” as separate
tokens.
 Assuming we include the comma and period, how
many bigrams are there?
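
A quick way to check the counts; the split-based regular-expression tokenizer here is just an illustrative choice.

```python
import re

sentence = "He stepped out into the hall, was delighted to encounter a water brother."
tokens = re.findall(r"\w+|[,.]", sentence)   # treat "," and "." as separate tokens

bigrams = list(zip(tokens, tokens[1:]))
print(len(tokens))    # 15 tokens
print(len(bigrams))   # 14 bigrams: one fewer than the tokens (16 with <s>/</s> padding)
```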



Counting
 Not always that simple
 I do uh main- mainly business data processing
 Spoken language poses various challenges.
 Should we count “uh” and other fillers as tokens?
 What about the repetition of “mainly”? Should such do-overs count twice or just once?
 The answers depend on the application.
 If we’re focusing on something like Automatic Speech
Recognition to support indexing for search, then “uh” isn’t
helpful (it’s not likely to occur as a query).
 But filled pauses are very useful in dialog management, so we
might want them there.



Counting: Types and Tokens
 They picnicked by the pool, then lay back
on the grass and looked at the stars.
 18 tokens (again counting punctuation)
 only 16 types
 In going forward, we’ll sometimes count
types and sometimes tokens.
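
The same kind of check for types versus tokens, again with an illustrative tokenizer and lowercasing before counting types.

```python
import re

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."
tokens = [t.lower() for t in re.findall(r"\w+|[,.]", sentence)]

print(len(tokens))        # 18 tokens (counting punctuation)
print(len(set(tokens)))   # 16 types ("the" occurs three times)
```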



Counting: Wordforms
 Should “cats” and “cat” count as the
same?
 How about “geese” and “goose”?
 Lemma: cats & cat and geese & goose have
the same lemmas
 Wordform: fully inflected surface form
 Again, we’ll have occasion to count both
lemmas and wordforms



Counting: Corpora
 Corpus: body of text
 Google
 Crawl of 1,024,908,267,229 English tokens
 13,588,391 wordform types
 That seems like a lot of types... Even large dictionaries of English have only
around 500k types. Why so many here?

 Numbers
 Misspellings
 Names
 Acronyms
 etc.



Language Modeling
 To perform word prediction:
 Assess the conditional probability of a word given the previous words in the sequence:
$$P(w_n \mid w_1, w_2, \ldots, w_{n-1})$$
 We’ll call a statistical model that can
assess this a Language Model.



Language Modeling
 One way to calculate these probabilities is to use the definition of conditional probability and get the counts from a large corpus.
 Counts are used to estimate the probabilities.



Language Modeling
 Unfortunately, for most sequences and for
most text collections we won’t get good
estimates from this method.

 We’ll use the chain rule of probability and an independence assumption.



The Chain Rule

P(its water was so transparent)=


P(its)*
P(water|its)*
P(was|its water)*
P(so|its water was)*
P(transparent|its water was so)



 So, that’s the chain rule
 We still have the problem of estimating
probabilities of word sequences from counts;
we’ll never have enough data.
 The other thing we mentioned (two slides
back) is an independence assumption
 What type of assumption would make sense?



Independence Assumption
 Make the simplifying assumption
 P(lizard|
the,other,day,I,was,walking,along,and,saw,a)
= P(lizard|a)
 Or maybe
 P(lizard|
the,other,day,I,was,walking,along,and,saw,a)
= P(lizard|saw,a)



Independence Assumption
 This kind of independence assumption is called a
Markov assumption after the Russian
mathematician Andrei Markov.



Markov Assumption

For each component in the product, replace it with the approximation (assuming a prefix of N):
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$
Bigram version:
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$$



Estimating Bigram
Probabilities
 The Maximum Likelihood Estimate (MLE)

$$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$



An Example
 <s> I am Sam </s>
 <s> Sam I am </s>
 <s> I do not like green eggs and ham </s>
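
Applying the MLE formula from the previous slide to this corpus (counts read directly off the three sentences) gives, for example:
$$P(\texttt{I} \mid \texttt{<s>}) = \tfrac{2}{3} \quad P(\texttt{Sam} \mid \texttt{<s>}) = \tfrac{1}{3} \quad P(\texttt{am} \mid \texttt{I}) = \tfrac{2}{3} \quad P(\texttt{</s>} \mid \texttt{Sam}) = \tfrac{1}{2} \quad P(\texttt{Sam} \mid \texttt{am}) = \tfrac{1}{2} \quad P(\texttt{do} \mid \texttt{I}) = \tfrac{1}{3}$$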



Maximum Likelihood Estimates
 The maximum likelihood estimate of some parameter of
a model M from a training set T
 Is the estimate that maximizes the likelihood of the training set
T given the model M
 Suppose the word Chinese occurs 400 times in a corpus
of a million words (Brown corpus)
 What is the probability that a random word from some other text from the same distribution will be “Chinese”?
 The MLE estimate is 400/1,000,000 = .0004
 This may be a bad estimate for some other corpus
 (We’ll return to MLEs later – we’ll do smoothing to get
estimates for low-frequency or 0 counts. Even with our
independence assumptions, we still have this problem.)



Berkeley Restaurant Project
Sentences
 can you tell me about any good cantonese restaurants
close by
 mid priced thai food is what i’m looking for
 tell me about chez panisse
 can you give me a listing of the kinds of food that are
available
 i’m looking for a good place to eat breakfast
 when is caffe venezia open during the day



Bigram Counts
 Out of 9222 sentences
 E.g., “I want” occurred 827 times



Bigram Probabilities
 Divide bigram counts by prefix unigram
counts to get probabilities.



Bigram Estimates of Sentence
Probabilities
 P(<s> i want english food </s>) =
P(i|<s>) *
P(want|i) *
P(english|want) *
P(food|english) *
P(</s>|food)
= .000031
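
For reference, the factors above multiply out as follows; the individual bigram probabilities are the values usually quoted from the Berkeley Restaurant Project bigram table (not reproduced here), so treat them as approximate:
$$0.25 \times 0.33 \times 0.0011 \times 0.5 \times 0.68 \approx 0.000031$$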



Kinds of Knowledge
 As crude as they are, N-gram probabilities capture a
range of interesting facts about language.

 World knowledge: P(english|want) = .0011, P(chinese|want) = .0065
 Syntax: P(to|want) = .66, P(eat|to) = .28, P(food|to) = 0, P(want|spend) = 0
 Discourse: P(i|<s>) = .25



Shannon’s Method
 Let’s turn the model around and use it to
generate random sentences that are like
the sentences from which the model was
derived.
 Illustrates:
 Dependence of N-gram models on the specific
training corpus
 Greater N → better model
 Generally attributed to
Claude Shannon.
Shannon’s Method
 Sample a random bigram (<s>, w) according to its probability
 Now sample a random bigram (w, x) according to its probability
 And so on until we randomly choose a (y, </s>)
 Then string the words together
 <s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
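
A minimal sketch of this sampling procedure; the bigram distributions below are hypothetical stand-ins for ones estimated from a real corpus.

```python
import random

# Hypothetical conditional distributions P(next | prev), illustrative only.
bigram_dist = {
    "<s>":     {"I": 0.25, "Tell": 0.75},
    "Tell":    {"me": 1.0},
    "me":      {"about": 1.0},
    "about":   {"food": 1.0},
    "I":       {"want": 1.0},
    "want":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"Chinese": 1.0},
    "Chinese": {"food": 1.0},
    "food":    {"</s>": 1.0},
}

def generate(max_len=20):
    """Repeatedly sample a bigram (prev, w) until </s> is drawn, then join the words."""
    words, prev = [], "<s>"
    for _ in range(max_len):
        candidates = list(bigram_dist[prev])
        weights = list(bigram_dist[prev].values())
        nxt = random.choices(candidates, weights=weights)[0]
        if nxt == "</s>":
            break
        words.append(nxt)
        prev = nxt
    return " ".join(words)

print(generate())   # e.g. "I want to eat Chinese food"
```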



Shakespeare



Shakespeare as a Corpus
 N=884,647 tokens, V=29,066
 Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams...
 So, 99.96% of the possible bigrams were never seen (have zero entries in the table); see the quick check after this list.
 This is the biggest problem in language modeling;
we’ll come back to it.
 Quadrigrams are worse: What's coming out
looks like Shakespeare because it is
Shakespeare
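
As a quick arithmetic check of the 99.96% figure, using V = 29,066 from the first bullet:
$$1 - \frac{300{,}000}{29{,}066^{2}} = 1 - \frac{300{,}000}{844{,}832{,}356} \approx 1 - 0.00036 \approx 0.9996$$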



The Wall Street Journal is Not
Shakespeare



A bad language model
(thanks to Joshua Goodman)
Evaluation
 How do we know if one model is better
than another?
 Shannon’s game gives us an intuition.
 The generated texts from higher-order models sound more like the text the model was obtained from.



Evaluation
 Standard method
 Train parameters of our model on a training set.
 Look at the model’s performance on some new data: a test set, a dataset that is different from our training set but drawn from the same source.
 Then we need an evaluation metric to tell us how
well our model is doing on the test set.
 One such metric is perplexity (to be introduced below)



Perplexity
 Perplexity is the inverse probability of the test set (assigned by the language model), normalized by the number of words:
$$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$
 Expanding with the chain rule:
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$
 For bigrams:
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
 Minimizing perplexity is the same as maximizing probability
 The best language model is one that best
predicts an unseen test set
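
A minimal sketch of computing bigram perplexity, assuming a bigram_prob table like the one estimated earlier (the names here are illustrative); it works in log space to avoid underflow.

```python
import math

def perplexity(test_sentences, bigram_prob):
    """PP = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over all test-set bigrams."""
    total_log_prob, n = 0.0, 0
    for s in test_sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            total_log_prob += math.log(bigram_prob[(prev, w)])
            n += 1
    return math.exp(-total_log_prob / n)
```

Any test bigram the model assigns zero (or no) probability makes the perplexity infinite, which is why the smoothing mentioned earlier matters on real test sets.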
Lower perplexity means a
better model
 Training 38 million words, test 1.5 million words, WSJ. (Jurafsky and Martin report perplexities of 962 for a unigram model, 170 for a bigram model, and 109 for a trigram model on this setup.)



Evaluating N-Gram Models
 Best evaluation for a language model
 Put model A into an application
 For example, a speech recognizer
 Evaluate the performance of the
application with model A
 Put model B into the application and
evaluate
 Compare performance of the application
with the two models
 Extrinsic evaluation
Difficulty of extrinsic (in-vivo)
evaluation of N-gram models
 Extrinsic evaluation
 This is really time-consuming
 Can take days to run an experiment
 So
 As a temporary solution, in order to run experiments
 To evaluate N-grams we often use an intrinsic
evaluation, an approximation called perplexity
 But perplexity is a poor approximation unless the test data looks just like the training data
 So it is generally only useful in pilot experiments (generally not sufficient for publication)
 But it is still helpful to think about.
