N-Gram Language Models Lecture

Uploaded by

Ridhi Aggarwal
Introduction to N-grams

Language Modeling

“You are uniformly charming!” cried he, with a smile of associating and now
and then I bowed and they perceived a chaise and four to wish for.

Random sentence generated from a Jane Austen trigram model

Probabilistic Language Models

Today’s goal: assign a probability to a sentence

Why?
◦ Machine Translation:
◦ P(high winds tonite) > P(large winds tonite)
◦ Spell Correction
◦ The office is about fifteen minuets from my house
◦ P(about fifteen minutes from) > P(about fifteen minuets from)
◦ Speech Recognition
◦ P(I saw a van) >> P(eyes awe of an)
◦ + Summarization, question-answering, etc., etc.!!
Probabilistic Language Modeling

Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
Better: the grammar. But language model or LM is standard.
How to compute P(W)
How to compute this joint probability:

◦ P(its, water, is, so, transparent, that)

Intuition: let’s rely on the Chain Rule of Probability


Reminder: The Chain Rule

Recall the definition of conditional probabilities:
P(B|A) = P(A,B)/P(A)
Rewriting: P(A,B) = P(A)P(B|A)

More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint probability of words in sentence

$$P(w_1 w_2 \dots w_n) = \prod_i P(w_i \mid w_1 w_2 \dots w_{i-1})$$
P(“its water is so transparent”) =


P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
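
The decomposition above can be sketched in a few lines of Python. This is only an illustration of how the chain rule pairs each word with its full history; the helper names `chain_rule_factors` and `format_factor` are made up for this sketch and do not come from the lecture.

```python
def chain_rule_factors(words):
    """Return one (word, history) pair per chain-rule factor P(word | history)."""
    return [(w, tuple(words[:i])) for i, w in enumerate(words)]


def format_factor(word, history):
    # The first factor has an empty history; later factors condition on
    # every earlier word in the sentence.
    return f"P({word}|{' '.join(history)})" if history else f"P({word})"


sentence = "its water is so transparent".split()
factors = chain_rule_factors(sentence)
print(" × ".join(format_factor(w, h) for w, h in factors))
```

Each factor conditions on a longer history than the one before it, which is why estimating these terms directly from counts quickly becomes impractical.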
How to estimate these probabilities
Could we just count and divide?
No! Too many possible sentences!
We’ll never see enough data for estimating these.
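Markov Assumption

Simplifying assumption (Andrei Markov):
P(the | its water is so transparent that) ≈ P(the | that)
Or maybe
P(the | its water is so transparent that) ≈ P(the | transparent that)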
Markov Assumption

$$P(w_1 w_2 \dots w_n) \approx \prod_i P(w_i \mid w_{i-k} \dots w_{i-1})$$

In other words, we approximate each component in the product:

$$P(w_i \mid w_1 w_2 \dots w_{i-1}) \approx P(w_i \mid w_{i-k} \dots w_{i-1})$$
Simplest case: Unigram model
$$P(w_1 w_2 \dots w_n) \approx \prod_i P(w_i)$$
Some automatically generated sentences from a unigram model

fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the

Bigram model

Condition on the previous word:

$$P(w_i \mid w_1 w_2 \dots w_{i-1}) \approx P(w_i \mid w_{i-1})$$

Some automatically generated sentences from a bigram model

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november


N-gram models

We can extend to trigrams, 4-grams, 5-grams


In general this is an insufficient model of language
◦ because language has long-distance dependencies:
“The computer which I had just put into the machine room
on the fifth floor crashed.”

But we can often get away with N-gram models


Estimating bigram probabilities
The Maximum Likelihood Estimate

$$P(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}$$

An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
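
A minimal sketch of the count-and-divide estimate on these three sentences, assuming the usual bigram MLE (bigram count divided by the count of the preceding word). The names `unigram_counts`, `bigram_counts`, and `bigram_mle` are illustrative choices, not part of the lecture.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    # Adjacent token pairs are the bigrams of the sentence.
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_mle(word, prev):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])


for word, prev in [("I", "<s>"), ("Sam", "<s>"), ("am", "I"),
                   ("</s>", "Sam"), ("Sam", "am"), ("do", "I")]:
    print(f"P({word}|{prev}) = {bigram_mle(word, prev)}")
```

Exact fractions are used here so the estimates can be read off directly, for example P(I|<s>) = 2/3 because two of the three sentences start with "I".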
More examples:
Berkeley Restaurant Project sentences

can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9222 sentences
Raw bigram probabilities
Normalize by unigrams:

Result:
Bigram estimates of sentence probabilities

P(<s> I want english food </s>) =


P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
What kinds of knowledge?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25
Practical Issues

We do everything in log space


◦ Avoid underflow
◦ (also adding is faster than multiplying)
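
The underflow problem is easy to reproduce. The sketch below uses a hypothetical sequence of 400 factors of 0.001 (these numbers are invented for illustration, not taken from any corpus) and relies on the identity log(p1 × p2 × … × pn) = log p1 + log p2 + … + log pn.

```python
import math

# Hypothetical setup: 400 factors of 0.001 each, standing in for a long
# product of small conditional probabilities.
probs = [0.001] * 400

product = 1.0
for p in probs:
    product *= p  # the true value, 1e-1200, is far below the smallest double

# Summing logs keeps the same information in a representable range.
log_prob = sum(math.log(p) for p in probs)

print(product)   # underflows to 0.0
print(log_prob)  # a finite negative number
```

Two models can still be compared by their log probabilities, since the logarithm is monotonic.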
Language Modeling Toolkits

SRILM
◦ http://www.speech.sri.com/projects/srilm/
KenLM
◦ https://kheafield.com/code/kenlm/
Google N-Gram Release, August 2006

Google Book N-grams
http://ngrams.googlelabs.com/
Language Modeling
Estimating N-gram Probabilities

Language Modeling
Evaluation and Perplexity

Evaluation: How good is our model?
Does our language model prefer good sentences to bad ones?
◦ Assign higher probability to “real” or “frequently observed” sentences
◦ Than “ungrammatical” or “rarely observed” sentences?

We train parameters of our model on a training set.
We test the model’s performance on data we haven’t seen.
◦ A test set is an unseen dataset that is different from our training set, totally unused.
◦ An evaluation metric tells us how well our model does on the test set.
Extrinsic evaluation of N-gram models
Best evaluation for comparing models A and B
◦ Put each model in a task
◦ spelling corrector, speech recognizer, MT system
◦ Run the task, get an accuracy for A and for B
◦ How many misspelled words corrected properly
◦ How many words translated correctly
◦ Compare accuracy for A and B
Difficulty of extrinsic (in-vivo) evaluation of N-gram models

Extrinsic evaluation
◦ Time-consuming; can take days or weeks
So
◦ Sometimes use intrinsic evaluation: perplexity
◦ Bad approximation
◦ unless the test data looks just like the training data
◦ So generally only useful in pilot experiments
◦ But is helpful to think about.
Intuition of Perplexity
The Shannon Game:
◦ How well can we predict the next word?
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
◦ Unigrams are terrible at this game. (Why?)

Candidate next words for the pizza sentence, with probabilities:
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100

A better model of a text
◦ is one which assigns a higher probability to the word that actually occurs
Perplexity
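The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:

$$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}$$

Chain rule:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}}$$

For bigrams:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$

Minimizing perplexity is the same as maximizing probability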
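
As a rough sketch, perplexity can be computed from the per-word conditional probabilities a model assigns to a test sequence. The function name `perplexity` and the example probabilities are assumptions made for illustration; the computation is done in log space, which is equivalent to taking the inverse N-th root of the product.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test sequence given each word's conditional probability.

    Computes exp(-(1/N) * sum(log p_i)), which equals (prod p_i) ** (-1/N).
    """
    n = len(word_probs)
    if any(p == 0.0 for p in word_probs):
        return math.inf  # a single zero makes the test-set probability 0
    return math.exp(-sum(math.log(p) for p in word_probs) / n)


# Hypothetical per-word probabilities, for illustration only.
uniform_digits = perplexity([0.1] * 10)      # close to 10
mixed = perplexity([0.25, 0.5, 0.125])       # product is 1/64, cube root of 64 is 4
print(uniform_digits, mixed)
```

A model that assigns higher probabilities to the words that actually occur gets a smaller value, which matches the claim that minimizing perplexity is the same as maximizing probability.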


The Shannon Game intuition for perplexity

From Josh Goodman


Perplexity is weighted equivalent branching factor
How hard is the task of recognizing digits ‘0,1,2,3,4,5,6,7,8,9’?
◦ Perplexity 10
How hard is recognizing (30,000) names at Microsoft?
◦ Perplexity = 30,000

Let's imagine a call-routing phone system gets 120K calls and has to recognize
◦ "Operator" (let's say this occurs 1 in 4 calls)
◦ "Sales" (1 in 4)
◦ "Technical Support" (1 in 4)
◦ 30,000 different names (each name occurring 1 time in the 120K calls)
◦ What is the perplexity?
We get the perplexity of this sequence of length 120K by first multiplying 120K
probabilities (90K of which are 1/4 and 30K of which are 1/120K), and then taking
the inverse 120,000th root:
Perp = (¼ * ¼ * ¼ * ¼ * ¼ * … * 1/120K * 1/120K * …)^(-1/120K)
But this can be arithmetically simplified to just N = 4: the operator (1/4), the sales
(1/4), the tech support (1/4), and the 30,000 names (1/120,000):
Perplexity = (¼ * ¼ * ¼ * 1/120K)^(-1/4) = 52.6
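
The arithmetic can be checked numerically. This sketch evaluates both the full 120K-factor product (in log space, since the raw product would underflow) and the simplified four-factor form; the variable names are illustrative only.

```python
import math

total_calls = 120_000

# 90K calls are "Operator", "Sales", or "Technical Support" (probability 1/4
# each); 30K calls are names, each with probability 1/120K.
log_prob = 90_000 * math.log(1 / 4) + 30_000 * math.log(1 / total_calls)
full = math.exp(-log_prob / total_calls)

# Simplified form: one factor per outcome class, N = 4.
simplified = (1 / 4 * 1 / 4 * 1 / 4 * 1 / total_calls) ** (-1 / 4)

print(round(full, 1), round(simplified, 1))
```

The two forms agree because the 90K : 30K split of the factors is exactly 3 : 1, the same ratio as in the four-factor product.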
Perplexity as branching factor
Let’s suppose a sentence consisting of random digits.
What is the perplexity of this sentence according to a model that assigns P=1/10 to each digit?
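
Worked out from the definition of perplexity as the inverse probability of the sequence normalized by its length N (here N is the number of digits in the sentence), the answer is the branching factor itself:

$$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} = \left(\tfrac{1}{10}\right)^{-1} = 10$$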
Lower perplexity = better model

Training 38 million words, test 1.5 million words, WSJ

N-gram Order   Unigram   Bigram   Trigram
Perplexity     962       170      109
Language Modeling
Generalization and zeros

The Shannon Visualization Method

Choose a random bigram (<s>, w) according to its probability
Now choose a random bigram (w, x) according to its probability
And so on until we choose </s>
Then string the words together

<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>

I want to eat Chinese food
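
The sampling procedure can be sketched as follows. The toy table `bigram_probs` is a hypothetical model invented for this example (its probabilities are not estimated from any corpus), and `generate_sentence` is an illustrative name.

```python
import random

# Hypothetical toy bigram model: bigram_probs[prev][word] = P(word | prev).
bigram_probs = {
    "<s>": {"I": 1.0},
    "I": {"want": 1.0},
    "want": {"to": 1.0},
    "to": {"eat": 1.0},
    "eat": {"Chinese": 0.5, "food": 0.5},
    "Chinese": {"food": 1.0},
    "food": {"</s>": 1.0},
}


def generate_sentence(model, rng):
    """Sample bigrams (prev, word) until </s> is chosen, then join the words."""
    words = []
    prev = "<s>"
    while True:
        candidates = list(model[prev])
        weights = [model[prev][w] for w in candidates]
        # Draw the next word in proportion to P(word | prev).
        word = rng.choices(candidates, weights=weights, k=1)[0]
        if word == "</s>":
            return " ".join(words)
        words.append(word)
        prev = word


sampled = generate_sentence(bigram_probs, random.Random(0))
print(sampled)
```

With this toy table the only possible outputs are "I want to eat Chinese food" and "I want to eat food"; a model trained on real text would produce far more varied sentences.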
Approximating Shakespeare
Shakespeare as corpus

N=884,647 tokens, V=29,066
Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams.
◦ So 99.96% of the possible bigrams were never seen (have zero entries in the table)
Quadrigrams worse: What's coming out looks like Shakespeare because it is Shakespeare
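
The sparsity figures follow directly from the vocabulary size and the number of observed bigram types. A quick check, using only the numbers quoted above (the variable names are illustrative):

```python
vocabulary_size = 29_066
seen_bigram_types = 300_000

# Every ordered pair of vocabulary words is a possible bigram.
possible_bigrams = vocabulary_size ** 2
unseen_fraction = 1 - seen_bigram_types / possible_bigrams

print(f"{possible_bigrams:,} possible bigrams")
print(f"{unseen_fraction:.2%} never seen")
```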
The Wall Street Journal is not Shakespeare (no offense)
Can you guess the training set author of the LM that generated these random 3-gram sentences?

They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and gram Brazil on market conditions

This shall forbid it should be branded, if renown made it empty.

“You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.
The perils of overfitting
N-grams only work well for word prediction if the
test corpus looks like the training corpus
◦ In real life, it often doesn’t
◦ We need to train robust models that generalize!
◦ One kind of generalization: Zeros!
◦ Things that don’t ever occur in the training set
◦ But occur in the test set
Zeros
Training set:                 Test set:
… denied the allegations      … denied the offer
… denied the reports          … denied the loan
… denied the claims
… denied the request

P(“offer” | denied the) = 0


Zero probability bigrams
Bigrams with zero probability
◦ mean that we will assign 0 probability to the test set!
And hence we cannot compute perplexity (can’t divide by 0)!
Language Modeling
Smoothing: Add-one (Laplace) smoothing
The intuition of smoothing (from Dan Klein)
When we have sparse statistics:
P(w | denied the)
3 allegations
2 reports
1 claims
1 request
7 total
[Bar chart of the counts; x-axis words: allegations, reports, claims, request, attack, man, outcome]
Steal probability mass to generalize better
P(w | denied the)
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total
[Bar chart of the smoothed counts; x-axis words: allegations, reports, claims, request, attack, man, outcome]
Add-one estimation
Also called Laplace smoothing
Pretend we saw each word one more time than we did
Just add one to all the counts!

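MLE estimate:

$$P_{MLE}(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}$$

Add-1 estimate:

$$P_{Add\text{-}1}(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i) + 1}{count(w_{i-1}) + V}$$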
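
A small sketch comparing the two estimates. It reuses the "denied the" counts from the smoothing intuition, but simplifies the context to the single previous word "the" so it fits a bigram model; the vocabulary list is an assumption made up for this example, and V is its size.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy data based on the "denied the ..." example: the word that
# follows "the" in training, with counts 3, 2, 1, 1.
training_bigrams = (
    [("the", "allegations")] * 3 + [("the", "reports")] * 2
    + [("the", "claims")] + [("the", "request")]
)
# Assumed vocabulary; "offer" and "loan" never follow "the" in training.
vocabulary = ["the", "allegations", "reports", "claims", "request", "offer", "loan"]

bigram_counts = Counter(training_bigrams)
context_counts = Counter(prev for prev, _ in training_bigrams)
V = len(vocabulary)


def p_mle(word, prev):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    return Fraction(bigram_counts[(prev, word)], context_counts[prev])


def p_add_one(word, prev):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return Fraction(bigram_counts[(prev, word)] + 1, context_counts[prev] + V)


print(p_mle("offer", "the"), p_add_one("offer", "the"))
print(sum(p_add_one(w, "the") for w in vocabulary))
```

The unseen word "offer" moves from probability 0 to 1/14, while "allegations" drops from 3/7 to 2/7: the mass given to unseen words is taken from the seen ones, and the smoothed distribution still sums to 1.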
Maximum Likelihood Estimates
The maximum likelihood estimate
◦ of some parameter of a model M from a training set T
◦ maximizes the likelihood of the training set T given the model M
Suppose the word “bagel” occurs 400 times in a corpus of a million words
What is the probability that a random word from some other text will be “bagel”?
MLE estimate is 400/1,000,000 = .0004
This may be a bad estimate for some other corpus
◦ But it is the estimate that makes it most likely that “bagel” will occur 400 times in a million word corpus.
Berkeley Restaurant Corpus: Laplace smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Compare with raw bigram counts
Add-1 estimation is a blunt instrument
So add-1 isn’t used for N-grams:
◦ We’ll see better methods
But add-1 is used to smooth other NLP models
◦ For text classification
◦ In domains where the number of zeros isn’t so huge.