Natural Language Processing: N-Gram Language Models


Natural Language Processing:
N-Gram Language Models
Language Models
 Formal grammars (e.g. regular, context-free) give a hard “binary” model of the legal sentences in a language.
 For NLP, a probabilistic model of a
language that gives a probability that a
string is a member of a language is more
useful.
 To specify a correct probability distribution, the probabilities of all sentences in the language must sum to 1.
Uses of Language Models
 Speech recognition
 “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”
 OCR & Handwriting recognition
 More probable sentences are more likely to be correct readings.
 Machine translation
 More likely sentences are probably better translations.
 Generation
 More likely sentences are probably better NL generations.
 Context sensitive spelling correction
 “Their are problems wit this sentence.”
Completion Prediction

 A language model also supports predicting the completion of a sentence.
 Please turn off your cell _____
 Your program does not ______
 Predictive text input systems can guess
what you are typing and give choices on
how to complete it.
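
As a rough illustration of completion prediction, the sketch below picks the most frequent followers of the last typed word from bigram counts; the counts here are made up for the example rather than harvested from a real corpus.

```python
from collections import Counter

# Hypothetical bigram counts standing in for counts collected from a corpus.
bigram_counts = Counter({
    ("cell", "phone"): 120, ("cell", "wall"): 15, ("cell", "division"): 9,
    ("not", "compile"): 40, ("not", "work"): 85, ("not", "terminate"): 12,
})

def complete(prev_word, k=3):
    """Return the k words most frequently observed after prev_word."""
    followers = Counter({w: c for (p, w), c in bigram_counts.items() if p == prev_word})
    return [w for w, _ in followers.most_common(k)]

print(complete("cell"))   # ['phone', 'wall', 'division']
```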
N-Gram Models
 Estimate probability of each word given prior context.
 P(phone | Please turn off your cell)
 Number of parameters required grows exponentially
with the number of words of prior context.
 An N-gram model uses only N−1 words of prior context.
 Unigram: P(phone)
 Bigram: P(phone | cell)
 Trigram: P(phone | your cell)
 The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state depends only on the k most recent states; therefore an N-gram model is an (N−1)th-order Markov model.
N-Gram Model Formulas
 Word sequences: $w_1^n = w_1 \ldots w_n$
 Chain rule of probability:
$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$$
 Bigram approximation:
$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$$
 N-gram approximation:
$$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1})$$
Estimating Probabilities
 N-gram conditional probabilities can be
estimated from raw text based on the relative
frequency of word sequences.
Bigram: $P(w_n \mid w_{n-1}) = \dfrac{C(w_{n-1} w_n)}{C(w_{n-1})}$
N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \dfrac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
 To have a consistent probabilistic model, append
a unique start (<s>) and end (</s>) symbol to
every sentence and treat these as additional
words.
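
A minimal sketch of this relative-frequency estimation with <s>/</s> padding; the function names and the toy two-sentence corpus are my own illustrations.

```python
from collections import Counter

def train_ngram_mle(sentences, n=2):
    """Count N-grams and their (N-1)-word contexts, padding with <s>/</s>."""
    ngram_counts, context_counts = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] * (n - 1) + s.split() + ["</s>"]
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngram_counts[context + (tokens[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts

def mle_prob(word, context, ngram_counts, context_counts):
    """P(word | context) = C(context word) / C(context); unseen N-grams get 0."""
    return ngram_counts[tuple(context) + (word,)] / context_counts[tuple(context)]

# Illustrative toy corpus (not the one used later on the slides).
ngrams, contexts = train_ngram_mle(["the cat sat", "the dog sat"], n=2)
print(mle_prob("cat", ("the",), ngrams, contexts))   # 0.5
```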
Word Prediction
 Guess the next word...
 ... I notice three guys standing on the ???
 There are many sources of knowledge
that could be helpful, including world
knowledge.
 But we can do pretty well by simply
looking at the preceding words and
keeping track of simple counts.
Word Prediction
 Formalize this task using N-gram models.
 N-grams are token sequences of length N.
 Given N-gram counts, we can guess likely next words in a sequence.

N-Gram Models
 More formally, we will assess the
conditional probability of candidate words.
 Also, we will assess the probability of an
entire sequence of words.
 Being able to predict the next word (or any linguistic unit) in a sequence is central to many applications.

Applications
 It lies at the core of the following applications:
 Automatic speech recognition
 Handwriting and character recognition
 Spelling correction
 Machine translation
 Augmentative communication
 Word similarity, generation, POS tagging, etc.

Counting
 He stepped out into the hall, was delighted to
encounter a water brother.
 13 tokens, 15 if we include “,” and “.” as separate
tokens.
 Assuming we include the comma and period, how
many bigrams are there?
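
A quick way to check the counts; the split-based regular-expression tokenizer here is just an illustrative choice.

```python
import re

sentence = "He stepped out into the hall, was delighted to encounter a water brother."
tokens = re.findall(r"\w+|[,.]", sentence)   # treat "," and "." as separate tokens

bigrams = list(zip(tokens, tokens[1:]))
print(len(tokens))    # 15 tokens
print(len(bigrams))   # 14 bigrams: one fewer than the tokens (16 with <s>/</s> padding)
```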



Counting
 Not always that simple
 I do uh main- mainly business data processing
 Spoken language poses various challenges.
 Should we count “uh” and other fillers as tokens?
 What about the repetition of “mainly”? Should such do-overs count twice or just once?
 The answers depend on the application.
 If we’re focusing on something like Automatic Speech
Recognition to support indexing for search, then “uh” isn’t
helpful (it’s not likely to occur as a query).
 But filled pauses are very useful in dialog management, so we
might want them there.



Counting: Types and Tokens
 They picnicked by the pool, then lay back
on the grass and looked at the stars.
 18 tokens (again counting punctuation)
 only 16 types
 In going forward, we’ll sometimes count
types and sometimes tokens.
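
The same kind of check for types versus tokens, again with an illustrative tokenizer and lowercasing before counting types.

```python
import re

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."
tokens = [t.lower() for t in re.findall(r"\w+|[,.]", sentence)]

print(len(tokens))        # 18 tokens (counting punctuation)
print(len(set(tokens)))   # 16 types ("the" occurs three times)
```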



Counting: Wordforms
 Should “cats” and “cat” count as the
same?
 How about “geese” and “goose”?
 Lemma: cats & cat and geese & goose have
the same lemmas
 Wordform: fully inflected surface form
 Again, we’ll have occasion to count both
lemmas and wordforms



Counting: Corpora
 Corpus: body of text
 Google
 Crawl of 1,024,908,267,229 English tokens
 13,588,391 wordform types
 That seems like a lot of types... Even large dictionaries of English have only
around 500k types. Why so many here?

 Numbers
 Misspellings
 Names
 Acronyms
 etc.



Language Modeling
 To perform word prediction:
 Assess the conditional probability of a word given the previous words in the sequence:
$$P(w_n \mid w_1, w_2, \ldots, w_{n-1})$$
 We’ll call a statistical model that can
assess this a Language Model.



Language Modeling
 One way to calculate these probabilities is to use the definition of conditional probability and get the counts from a large corpus.
 Counts are used to estimate the probabilities.



Language Modeling
 Unfortunately, for most sequences and for
most text collections we won’t get good
estimates from this method.

 We’ll use the chain rule of probability and an independence assumption.



The Chain Rule

P(its water was so transparent)=


P(its)*
P(water|its)*
P(was|its water)*
P(so|its water was)*
P(transparent|its water was so)



 So, that’s the chain rule
 We still have the problem of estimating
probabilities of word sequences from counts;
we’ll never have enough data.
 The other thing we mentioned (two slides
back) is an independence assumption
 What type of assumption would make sense?



Independence Assumption
 Make the simplifying assumption
 P(lizard|
the,other,day,I,was,walking,along,and,saw,a)
= P(lizard|a)
 Or maybe
 P(lizard|
the,other,day,I,was,walking,along,and,saw,a)
= P(lizard|saw,a)



Independence Assumption
 This kind of independence assumption is called a
Markov assumption after the Russian
mathematician Andrei Markov.



Markov Assumption

For each component in the product, replace it with the approximation (assuming a prefix of N):
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$$
Bigram version:
$$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$$



Estimating Bigram
Probabilities
 The Maximum Likelihood Estimate (MLE)

$$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$



An Example
 <s> I am Sam </s>
 <s> Sam I am </s>
 <s> I do not like green eggs and ham </s>
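
Applying the MLE formula from the previous slide to this corpus (counts read directly off the three sentences) gives, for example:
$$P(\texttt{I} \mid \texttt{<s>}) = \tfrac{2}{3} \quad P(\texttt{Sam} \mid \texttt{<s>}) = \tfrac{1}{3} \quad P(\texttt{am} \mid \texttt{I}) = \tfrac{2}{3} \quad P(\texttt{</s>} \mid \texttt{Sam}) = \tfrac{1}{2} \quad P(\texttt{Sam} \mid \texttt{am}) = \tfrac{1}{2} \quad P(\texttt{do} \mid \texttt{I}) = \tfrac{1}{3}$$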



Maximum Likelihood Estimates
 The maximum likelihood estimate of some parameter of
a model M from a training set T
 Is the estimate that maximizes the likelihood of the training set
T given the model M
 Suppose the word Chinese occurs 400 times in a corpus
of a million words (Brown corpus)
 What is the probability that a random word from some other text from the same distribution will be “Chinese”?
 The MLE estimate is 400/1,000,000 = .0004
 This may be a bad estimate for some other corpus
 (We’ll return to MLEs later – we’ll do smoothing to get
estimates for low-frequency or 0 counts. Even with our
independence assumptions, we still have this problem.)



Berkeley Restaurant Project
Sentences
 can you tell me about any good cantonese restaurants
close by
 mid priced thai food is what i’m looking for
 tell me about chez panisse
 can you give me a listing of the kinds of food that are
available
 i’m looking for a good place to eat breakfast
 when is caffe venezia open during the day



Bigram Counts
 Out of 9222 sentences
 E.g., “I want” occurred 827 times



Bigram Probabilities
 Divide bigram counts by prefix unigram
counts to get probabilities.



Bigram Estimates of Sentence
Probabilities
 P(<s> i want english food </s>) =
P(i|<s>) *
P(want|i) *
P(english|want) *
P(food|english) *
P(</s>|food)
= .000031
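
For reference, the factors above multiply out as follows; the individual bigram probabilities are the values usually quoted from the Berkeley Restaurant Project bigram table (not reproduced here), so treat them as approximate:
$$0.25 \times 0.33 \times 0.0011 \times 0.5 \times 0.68 \approx 0.000031$$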



Kinds of Knowledge
 As crude as they are, N-gram probabilities capture a
range of interesting facts about language.

 World knowledge: P(english|want) = .0011, P(chinese|want) = .0065
 Syntax: P(to|want) = .66, P(eat|to) = .28, P(food|to) = 0, P(want|spend) = 0
 Discourse: P(i|<s>) = .25



Shannon’s Method
 Let’s turn the model around and use it to
generate random sentences that are like
the sentences from which the model was
derived.
 Illustrates:
 Dependence of N-gram models on the specific
training corpus
 Greater N → better model
 Generally attributed to
Claude Shannon.
Shannon’s Method
 Sample a random bigram (<s>, w) according to its probability
 Now sample a random bigram (w, x) according to its probability
 And so on until we randomly choose a (y, </s>)
 Then string the words together
 <s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
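
A minimal sketch of this sampling procedure; the bigram distributions below are hypothetical stand-ins for ones estimated from a real corpus.

```python
import random

# Hypothetical conditional distributions P(next | prev), illustrative only.
bigram_dist = {
    "<s>":     {"I": 0.25, "Tell": 0.75},
    "Tell":    {"me": 1.0},
    "me":      {"about": 1.0},
    "about":   {"food": 1.0},
    "I":       {"want": 1.0},
    "want":    {"to": 1.0},
    "to":      {"eat": 1.0},
    "eat":     {"Chinese": 1.0},
    "Chinese": {"food": 1.0},
    "food":    {"</s>": 1.0},
}

def generate(max_len=20):
    """Repeatedly sample a bigram (prev, w) until </s> is drawn, then join the words."""
    words, prev = [], "<s>"
    for _ in range(max_len):
        candidates = list(bigram_dist[prev])
        weights = list(bigram_dist[prev].values())
        nxt = random.choices(candidates, weights=weights)[0]
        if nxt == "</s>":
            break
        words.append(nxt)
        prev = nxt
    return " ".join(words)

print(generate())   # e.g. "I want to eat Chinese food"
```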



Shakespeare



Shakespeare as a Corpus
 N=884,647 tokens, V=29,066
 Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams...
 So, 99.96% of the possible bigrams were never seen (have zero entries in the table); see the quick check after this list.
 This is the biggest problem in language modeling;
we’ll come back to it.
 Quadrigrams are worse: What's coming out
looks like Shakespeare because it is
Shakespeare
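
As a quick arithmetic check of the 99.96% figure, using V = 29,066 from the first bullet:
$$1 - \frac{300{,}000}{29{,}066^{2}} = 1 - \frac{300{,}000}{844{,}832{,}356} \approx 1 - 0.00036 \approx 0.9996$$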



The Wall Street Journal is Not
Shakespeare



A bad language model
(thanks to Joshua Goodman)
Evaluation
 How do we know if one model is better
than another?
 Shannon’s game gives us an intuition.
 The generated texts from higher-order models sound more like the text the model was obtained from.



Evaluation
 Standard method
 Train parameters of our model on a training set.
 Look at the model’s performance on some new data: a test set, a dataset that is different from our training set but drawn from the same source.
 Then we need an evaluation metric to tell us how
well our model is doing on the test set.
 One such metric is perplexity (to be introduced below)



Perplexity
 Perplexity is the inverse probability of the test set (assigned by the language model), normalized by the number of words:
$$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$
 Expanding with the chain rule:
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$
 For bigrams:
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
 Minimizing perplexity is the same as maximizing probability
 The best language model is one that best
predicts an unseen test set
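
A minimal sketch of computing bigram perplexity, assuming a bigram_prob table like the one estimated earlier (the names here are illustrative); it works in log space to avoid underflow.

```python
import math

def perplexity(test_sentences, bigram_prob):
    """PP = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over all test-set bigrams."""
    total_log_prob, n = 0.0, 0
    for s in test_sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            total_log_prob += math.log(bigram_prob[(prev, w)])
            n += 1
    return math.exp(-total_log_prob / n)
```

Any test bigram the model assigns zero (or no) probability makes the perplexity infinite, which is why the smoothing mentioned earlier matters on real test sets.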
Lower perplexity means a
better model
 Training 38 million words, test 1.5 million words, WSJ. (Jurafsky and Martin report perplexities of 962 for a unigram model, 170 for a bigram model, and 109 for a trigram model on this setup.)



Evaluating N-Gram Models
 Best evaluation for a language model
 Put model A into an application
 For example, a speech recognizer
 Evaluate the performance of the
application with model A
 Put model B into the application and
evaluate
 Compare performance of the application
with the two models
 Extrinsic evaluation
Difficulty of extrinsic (in-vivo)
evaluation of N-gram models
 Extrinsic evaluation
 This is really time-consuming
 Can take days to run an experiment
 So
 As a temporary solution, in order to run experiments
 To evaluate N-grams we often use an intrinsic
evaluation, an approximation called perplexity
 But perplexity is a poor approximation unless the test data looks just like the training data
 So it is generally only useful in pilot experiments (generally not sufficient for publication)
 But it is still helpful to think about.
