N-Gram Language Models Lecture
Language Modeling
“You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.
More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute the joint probability of words in a sentence:
P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi-1)
Markov Assumption
P(w1 w2 … wn) ≈ ∏i P(wi | wi-k … wi-1)
In other words, we approximate each component in the product:
P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)
Simplest case: Unigram model
P(w1 w2 … wn) ≈ ∏i P(wi)
Some automatically generated sentences from a unigram model
Bigram model: condition on the previous word:
P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)
Some automatically generated sentences:
texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen
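To make the estimation and generation steps concrete, here is a minimal sketch (the toy corpus and the helper names p_bigram and generate are illustrative, not from the lecture) that computes MLE bigram probabilities from counts and samples words from the resulting bigram model:

```python
import random
from collections import Counter, defaultdict

# Toy corpus; <s> and </s> mark sentence start and end.
corpus = [
    "<s> i want chinese food </s>".split(),
    "<s> i want english food </s>".split(),
    "<s> i am hungry </s>".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = defaultdict(Counter)
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[prev][cur] += 1

def p_bigram(cur, prev):
    """MLE estimate P(cur | prev) = c(prev, cur) / c(prev)."""
    return bigram_counts[prev][cur] / unigram_counts[prev]

def generate(max_len=10):
    """Sample word-by-word from the bigram model until </s>."""
    word, out = "<s>", []
    for _ in range(max_len):
        nexts, weights = zip(*bigram_counts[word].items())
        word = random.choices(nexts, weights=weights)[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(p_bigram("want", "i"))  # c(i, want) / c(i) = 2/3
print(generate())             # e.g. "i want chinese food"
```

Trained on a large corpus instead of a toy one, the same sampling loop produces word salads like the example above.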
Result: bigram estimates of sentence probabilities
SRILM
◦ http://www.speech.sri.com/projects/srilm/
KenLM
◦ https://kheafield.com/code/kenlm/
Google N-Gram Release, August 2006
…
Google Book N-grams
http://ngrams.googlelabs.com/
Language Modeling: Estimating N-gram Probabilities
Language Modeling: Evaluation and Perplexity
Evaluation: How good is our model?
Does our language model prefer good sentences to bad ones?
◦ Assign higher probability to “real” or “frequently observed” sentences
◦ than to “ungrammatical” or “rarely observed” sentences
Extrinsic evaluation (plug the model into a task and measure task performance)
◦ Time-consuming; can take days or weeks
So
◦ Sometimes use intrinsic evaluation: perplexity
◦ Bad approximation
◦ unless the test data looks just like the training data
◦ So generally only useful in pilot experiments
◦ But is helpful to think about.
Intuition of Perplexity
The Shannon Game:
◦ How well can we predict the next word?
  I always order pizza with cheese and ____
  The 33rd President of the US was ____
  I saw a ____
◦ Candidate continuations for the first blank, with their probabilities:
  mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100
◦ Unigrams are terrible at this game. (Why?)
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(-1/N)
Chain rule: PP(W) = (∏i 1 / P(wi | w1 … wi-1))^(1/N)
For bigrams: PP(W) = (∏i 1 / P(wi | wi-1))^(1/N)
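As a concrete companion to the formula, here is a minimal sketch (the model and test string are invented for illustration, not from the lecture) that computes bigram perplexity in log space; a model that assigns 1/10 to every prediction comes out at perplexity 10:

```python
import math

def bigram_perplexity(words, p_bigram):
    """PP(W) = P(w1..wN)^(-1/N); sum log probabilities to avoid underflow."""
    log_prob = sum(math.log(p_bigram(cur, prev))
                   for prev, cur in zip(words, words[1:]))
    n = len(words) - 1            # number of predictions made
    return math.exp(-log_prob / n)

# Hypothetical model that assigns probability 1/10 to every next word,
# e.g. for a string of random digits (see the branching-factor slide below).
uniform_digits = lambda cur, prev: 1 / 10
print(bigram_perplexity("0 3 5 2 6 9 4".split(), uniform_digits))  # 10.0
```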
Let's imagine a call-routing phone system gets 120K calls and has to recognize
◦ "Operator" (let's say this occurs 1 in 4 calls)
◦ "Sales" (1in 4)
◦ "Technical Support" (1 in 4)
◦ 30,000 different names (each name occurring 1 time in the 120K calls)
◦ What is the perplexity? Next slide
The Shannon Game intuition for perplexity
Josh Goodman: imagine a call-routing phone system gets 120K calls and has to
recognize
◦ "Operator" (let's say this occurs 1 in 4 calls)
◦ "Sales" (1in 4)
◦ "Technical Support" (1 in 4)
◦ 30,000 different names (each name occurring 1 time in the 120K calls)
We get the perplexity of this sequence of length 120K by first multiplying 120K
probabilities (90K of which are 1/4 and 30K of which are 1/120K), and then taking
the inverse 120,000th root:
Perp = (¼ * ¼ * ¼ * ¼ * ¼ * …. * 1/120K * 1/120K * ….)^(-1/120K)
But this can be arithmetically simplified to just N = 4: the operator (1/4), the sales
(1/4), the tech support (1/4), and the 30,000 names (1/120,000):
Perplexity = (¼ * ¼ * ¼ * 1/120K)^(-1/4) = 52.6
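A quick sketch (not part of the original slides) to verify the arithmetic: the full 120,000-term product and the simplified four-term form give the same number, roughly 52.6.

```python
import math

# Full sequence: 90,000 outcomes with probability 1/4 and 30,000 names with
# probability 1/120,000; perplexity is the inverse 120,000th root. Work in
# log space because the raw product would underflow to zero.
N = 120_000
log_prob = 90_000 * math.log(1 / 4) + 30_000 * math.log(1 / 120_000)
print(math.exp(-log_prob / N))                             # ≈ 52.6

# Simplified form: four equally weighted event types.
print((1 / 4 * 1 / 4 * 1 / 4 * 1 / 120_000) ** (-1 / 4))   # ≈ 52.6
```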
Perplexity as branching factor
Let's suppose a sentence consists of random digits.
What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
PP(W) = ((1/10)^N)^(-1/N) = 10
Lower perplexity = better model
The intuition of smoothing: when we have sparse statistics for P(w | denied the):
3 allegations
2 reports
1 claims
1 request
7 total
(all other words in the chart, e.g. outcome, attack, man, have zero counts)
Steal probability mass to generalize better:
P(w | denied the)
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total
(the 2 “other” is spread over previously unseen words such as outcome, attack, man, …)
Add-one estimation
Also called Laplace smoothing
Pretend we saw each word one more time than we did
Just add one to all the counts!
MLE estimate: P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
Add-1 estimate: P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V), where V is the vocabulary size
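A minimal sketch of both estimators (the toy sentences and helper names are my own, not from the lecture); V is the size of the training vocabulary:

```python
from collections import Counter, defaultdict

def train_counts(sentences):
    """Collect unigram and bigram counts from tokenized sentences."""
    uni, bi = Counter(), defaultdict(Counter)
    for sent in sentences:
        uni.update(sent)
        for prev, cur in zip(sent, sent[1:]):
            bi[prev][cur] += 1
    return uni, bi

def p_mle(cur, prev, uni, bi):
    """MLE: c(prev, cur) / c(prev); zero for unseen bigrams."""
    return bi[prev][cur] / uni[prev]

def p_add1(cur, prev, uni, bi):
    """Add-1 (Laplace): (c(prev, cur) + 1) / (c(prev) + V)."""
    V = len(uni)                      # vocabulary size
    return (bi[prev][cur] + 1) / (uni[prev] + V)

sents = [["i", "want", "chinese", "food"],
         ["i", "want", "english", "food"]]
uni, bi = train_counts(sents)
print(p_mle("chinese", "want", uni, bi))   # 1/2 = 0.5
print(p_add1("chinese", "want", uni, bi))  # (1+1)/(2+5) ≈ 0.286
print(p_add1("food", "i", uni, bi))        # unseen bigram: (0+1)/(2+5) ≈ 0.143
```

Note that the unseen bigram "i food" gets zero probability under MLE but a small non-zero probability under add-1, which is exactly the mass-stealing intuition above.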
Maximum Likelihood Estimates
The maximum likelihood estimate
◦ of some parameter of a model M from a training set T
◦ maximizes the likelihood of the training set T given the model M
Suppose the word “bagel” occurs 400 times in a corpus of a million words
What is the probability that a random word from some other text will be “bagel”?
MLE estimate is 400/1,000,000 = .0004
This may be a bad estimate for some other corpus
◦ But it is the estimate that makes it most likely that “bagel” will occur 400 times in a million-word corpus.
Berkeley Restaurant Corpus: Laplace-smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Compare with raw bigram counts
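The reconstituted counts in those tables can be recovered by multiplying each add-1 probability back by the original context count, i.e. c*(wi-1, wi) = (c(wi-1, wi) + 1) * c(wi-1) / (c(wi-1) + V). A small sketch of that calculation (the sample numbers are illustrative, of roughly restaurant-corpus magnitude, not read off the missing tables):

```python
def reconstituted_count(c_bigram, c_prev, V):
    """c* = (c(prev, cur) + 1) * c(prev) / (c(prev) + V): the count that,
    divided by c(prev), reproduces the add-1 probability."""
    return (c_bigram + 1) * c_prev / (c_prev + V)

# Illustrative numbers: suppose c(want, to) = 608, c(want) = 927, V = 1446.
print(reconstituted_count(608, 927, 1446))   # ≈ 238: far below the raw 608
print(reconstituted_count(0, 927, 1446))     # ≈ 0.39: unseen bigrams gain mass
```

The gap between the raw and reconstituted counts shows how much probability mass add-1 moves onto unseen events, which is why it is a blunt instrument for N-grams.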
Add-1 estimation is a blunt instrument
So add-1 isn’t used for N-grams:
◦ We’ll see better methods
But add-1 is used to smooth other NLP models
◦ For text classification
◦ In domains where the number of zeros isn’t so huge.
Language Modeling: Smoothing: Add-one (Laplace) smoothing