N-Gram Language Models Lecture

Uploaded by

Ridhi Aggarwal


Introduction to N-grams
Language Modeling

“You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.

Random sentence generated from a Jane Austen trigram model

Probabilistic Language Models

Today’s goal: assign a probability to a sentence

Why?
◦ Machine Translation:
  ◦ P(high winds tonite) > P(large winds tonite)
◦ Spell Correction
  ◦ The office is about fifteen minuets from my house
  ◦ P(about fifteen minutes from) > P(about fifteen minuets from)
◦ Speech Recognition
  ◦ P(I saw a van) >> P(eyes awe of an)
◦ + Summarization, question-answering, etc., etc.!!
Probabilistic Language Modeling

Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
Better: the grammar. But language model or LM is standard.
How to compute P(W)
How to compute this joint probability:

◦ P(its, water, is, so, transparent, that)

Intuition: let’s rely on the Chain Rule of Probability

Reminder: The Chain Rule

Recall the definition of conditional probabilities

P(B|A) = P(A,B)/P(A)
Rewriting: P(A,B) = P(A) P(B|A)

Applied to the words of a sentence:

$P(w_1 w_2 \ldots w_n) = \prod_i P(w_i \mid w_1 w_2 \ldots w_{i-1})$

P(“its water is so transparent”) =

P(its) × P(water|its) × P(is|its water)
× P(so|its water is) × P(transparent|its water is so)
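
The decomposition above can be sketched in a few lines. This is only an illustration: the helper name `chain_rule_factors` is hypothetical, and it builds the factors as strings rather than estimating any probabilities.

```python
def chain_rule_factors(words):
    """Return the chain-rule factors of P(w1 ... wn) as readable strings."""
    factors = []
    for i, word in enumerate(words):
        history = " ".join(words[:i])  # all words before position i
        factors.append(f"P({word}|{history})" if history else f"P({word})")
    return factors


factors = chain_rule_factors("its water is so transparent".split())
print(" × ".join(factors))
```

Each factor conditions on the whole preceding history, which is why the histories grow with the sentence.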
How to estimate these probabilities
Could we just count and divide?

No! Too many possible sentences!

We’ll never see enough data for estimating these
Markov Assumption

Simplifying assumption (Andrei Markov): approximate the full history by just the previous word.

Or maybe by the previous two words.
Markov Assumption

$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})$

In other words, we approximate each component in the product:

$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-k} \ldots w_{i-1})$
Simplest case: Unigram model

$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i)$
Some automatically generated sentences from a unigram model:

fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the

Bigram model

Condition on the previous word:

$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$

Some automatically generated sentences from a bigram model:

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november

N-gram models

We can extend to trigrams, 4-grams, 5-grams

In general this is an insufficient model of language
◦ because language has long-distance dependencies:
“The computer which I had just put into the machine room
on the fifth floor crashed.”

But we can often get away with N-gram models

Estimating bigram probabilities
The Maximum Likelihood Estimate
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
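
A minimal sketch of the maximum likelihood estimate on the three sentences above, assuming the usual count-and-divide form P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}). The function name `bigram_mle` is hypothetical.

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs


def bigram_mle(word, prev):
    """MLE estimate of P(word | prev): count(prev, word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


print(bigram_mle("I", "<s>"))     # 2 of the 3 sentences start with "I"
print(bigram_mle("Sam", "am"))    # "am" occurs twice, once followed by "Sam"
print(bigram_mle("</s>", "Sam"))  # "Sam" occurs twice, once at sentence end
```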
More examples:
Berkeley Restaurant Project sentences

can you tell me about any good cantonese restaurants close by

mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9222 sentences
Raw bigram probabilities
Normalize by unigrams:

Result:
Bigram estimates of sentence probabilities

P(<s> I want english food </s>) =

We do everything in log space

◦ Avoid underflow
◦ (also adding is faster than multiplying)
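
The underflow point can be illustrated with a short sketch. The probabilities below are hypothetical (400 words, each assigned 0.01) and are chosen only so that the plain product falls below what a double-precision float can represent, while the sum of logs stays finite.

```python
import math

# Hypothetical per-word probabilities for a long text (not from the slides).
probs = [0.01] * 400

product = 1.0
for p in probs:
    product *= p  # 1e-800 is too small for a float, so this ends at 0.0

# log(p1 * p2 * ... * pn) = log p1 + log p2 + ... + log pn
log_prob = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_prob)  # a finite negative number, about -1842.07
```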
Language Modeling Toolkits

SRILM
◦ http://www.speech.sri.com/projects/srilm/
KenLM
◦ https://kheafield.com/code/kenlm/
Google N-Gram Release, August 2006

…
Google Book N-grams
http://ngrams.googlelabs.com/
Estimating N-gram Probabilities
Language Modeling

Evaluation and Perplexity
Language Modeling

Evaluation: How good is our model?
Does our language model prefer good sentences to bad ones?
◦ Assign higher probability to “real” or “frequently observed” sentences
◦ Than “ungrammatical” or “rarely observed” sentences?

We train parameters of our model on a training set.

We test the model’s performance on data we haven’t seen.
◦ A test set is an unseen dataset that is different from our training set,
totally unused.
◦ An evaluation metric tells us how well our model does on the test set.
Extrinsic evaluation of N-gram models
Best evaluation for comparing models A and B
◦ Put each model in a task
  ◦ spelling corrector, speech recognizer, MT system
◦ Run the task, get an accuracy for A and for B
  ◦ How many misspelled words corrected properly
  ◦ How many words translated correctly
◦ Compare accuracy for A and B
Difficulty of extrinsic (in-vivo) evaluation of N-gram models

Extrinsic evaluation
◦ Time-consuming; can take days or weeks
So
◦ Sometimes use intrinsic evaluation: perplexity
◦ Bad approximation
  ◦ unless the test data looks just like the training data
  ◦ So generally only useful in pilot experiments
◦ But is helpful to think about.
Intuition of Perplexity
The Shannon Game:
◦ How well can we predict the next word?
  I always order pizza with cheese and ____
  The 33rd President of the US was ____
  I saw a ____
◦ Unigrams are terrible at this game. (Why?)

Candidate next words for the first sentence, with probabilities:
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100

A better model of a text
◦ is one which assigns a higher probability to the word that actually occurs
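Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:

$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$

Chain rule:

$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \ldots w_{i-1})}}$

For bigrams:

$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$

Minimizing perplexity is the same as maximizing probability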
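
A minimal sketch of perplexity as the inverse probability of the test sequence normalized by its length N, computed in log space. The function name `perplexity` and the two lists of per-word probabilities are hypothetical; they stand for what two models might assign to the same four-word test sequence.

```python
import math


def perplexity(word_probs):
    """Perplexity from the model's probability for each test word.

    exp(-(1/N) * sum(log p_i)) equals (p_1 * ... * p_N) ** (-1/N).
    """
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)


# Hypothetical conditional probabilities from two models on the same test words.
model_a = [0.2, 0.1, 0.25, 0.5]
model_b = [0.1, 0.05, 0.2, 0.25]

print(perplexity(model_a))
print(perplexity(model_b))
```

Model A assigns the higher probability to the sequence, so it gets the lower perplexity.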

The Shannon Game intuition for perplexity

From Josh Goodman

Perplexity is the weighted equivalent branching factor
How hard is the task of recognizing digits ‘0,1,2,3,4,5,6,7,8,9’?
◦ Perplexity = 10

How hard is recognizing (30,000) names at Microsoft?

◦ Perplexity = 30,000

Let's imagine a call-routing phone system gets 120K calls and has to recognize
◦ "Operator" (let's say this occurs 1 in 4 calls)
◦ "Sales" (1 in 4)
◦ "Technical Support" (1 in 4)
◦ 30,000 different names (each name occurring 1 time in the 120K calls)
◦ What is the perplexity?
We get the perplexity of this sequence of length 120K by first multiplying 120K probabilities (90K of which are 1/4 and 30K of which are 1/120K), and then taking the inverse 120,000th root:
Perp = (¼ * ¼ * ¼ * ¼ * ¼ * … * 1/120K * 1/120K * …)^(-1/120K)
But this can be arithmetically simplified to just N = 4: the operator (1/4), the sales (1/4), the tech support (1/4), and the 30,000 names (1/120,000):
Perplexity = (¼ * ¼ * ¼ * 1/120K)^(-1/4) = 52.6
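
The arithmetic can be checked with a short sketch. It computes the full 120K-term product in log space (the direct product would underflow) and compares it with the four-term shortcut. The variable names are illustrative only.

```python
import math

total_calls = 120_000
# (probability of each call, number of calls with that probability)
groups = [
    (1 / 4, 90_000),        # "Operator", "Sales", "Technical Support"
    (1 / 120_000, 30_000),  # each of the 30,000 names, said once
]

# Full computation: inverse 120,000th root of the product of all probabilities.
log_prob = sum(count * math.log(p) for p, count in groups)
perp = math.exp(-log_prob / total_calls)

# Shortcut with N = 4.
shortcut = (0.25 * 0.25 * 0.25 * (1 / 120_000)) ** (-1 / 4)

print(round(perp, 1), round(shortcut, 1))  # 52.6 52.6
```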
Perplexity as branching factor
Let’s suppose a sentence consisting of random digits.
What is the perplexity of this sentence according to a model that assigns P=1/10 to each digit?
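
One way to work out the answer, assuming the sentence contains N digits and taking perplexity to be the inverse probability of the sequence normalized by its length:

```latex
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
      = \left( \left( \tfrac{1}{10} \right)^{N} \right)^{-\frac{1}{N}}
      = \left( \tfrac{1}{10} \right)^{-1}
      = 10
```

The result does not depend on N, which matches the reading of perplexity as a branching factor: at every position the model is choosing among 10 equally likely digits.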
Lower perplexity = better model

Training 38 million words, test 1.5 million words, WSJ

N-gram Order   Unigram   Bigram   Trigram
Perplexity     962       170      109
Evaluation and Perplexity
Language Modeling

Generalization and zeros
Language Modeling
The Shannon Visualization Method

Choose a random bigram (<s>, w) according to its probability
Now choose a random bigram (w, x) according to its probability
And so on until we choose </s>
Then string the words together

<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
I want to eat Chinese food
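
The sampling procedure above can be sketched as follows. The bigram table `toy_model` is hypothetical, with made-up probabilities and one extra word ("Thai") added so that there is a real random choice. The function name `generate` is also an assumption.

```python
import random


def generate(bigram_probs, rng):
    """Sample words from a bigram model, starting at <s>, until </s> is drawn."""
    words = []
    prev = "<s>"
    while True:
        candidates = list(bigram_probs[prev])
        weights = [bigram_probs[prev][w] for w in candidates]
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        if nxt == "</s>":
            return " ".join(words)  # string the words together
        words.append(nxt)
        prev = nxt


# Hypothetical toy model: toy_model[prev][word] = P(word | prev).
toy_model = {
    "<s>": {"I": 1.0},
    "I": {"want": 1.0},
    "want": {"to": 1.0},
    "to": {"eat": 1.0},
    "eat": {"Chinese": 0.5, "Thai": 0.5},
    "Chinese": {"food": 1.0},
    "Thai": {"food": 1.0},
    "food": {"</s>": 1.0},
}

sentence = generate(toy_model, random.Random(0))
print(sentence)
```

With this table the output is either "I want to eat Chinese food" or "I want to eat Thai food", depending on the random draw.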
Approximating Shakespeare
Shakespeare as corpus

N=884,647 tokens, V=29,066

Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams.
◦ So 99.96% of the possible bigrams were never seen (have zero entries in the table)
Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
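
The sparsity figures for bigrams follow directly from the vocabulary size and the number of observed bigram types. A quick check, using only the numbers quoted above:

```python
V = 29_066                   # vocabulary size (word types)
seen_bigram_types = 300_000  # distinct bigrams observed in the corpus

possible_bigrams = V ** 2
unseen_fraction = 1 - seen_bigram_types / possible_bigrams

print(f"{possible_bigrams:,}")   # 844,832,356
print(f"{unseen_fraction:.2%}")  # 99.96%
```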
The Wall Street Journal is not Shakespeare (no offense)
Can you guess the training set author of the LM that generated these random 3-gram sentences?

They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and gram Brazil on market conditions

This shall forbid it should be branded, if renown made it empty.

“You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.
The perils of overfitting
N-grams only work well for word prediction if the
test corpus looks like the training corpus
◦ In real life, it often doesn’t
◦ We need to train robust models that generalize!
◦ One kind of generalization: Zeros!
◦ Things that don’t ever occur in the training set
◦ But occur in the test set
Zeros
Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request

Test set:
… denied the offer
… denied the loan

P(“offer” | denied the) = 0

Zero probability bigrams
Bigrams with zero probability
◦ mean that we will assign 0 probability to the test set!
And hence we cannot compute perplexity (can’t divide by 0)!
Generalization and zeros
Language Modeling

Smoothing: Add-one (Laplace) smoothing
Language Modeling
The intuition of smoothing (from Dan Klein)
When we have sparse statistics:
P(w | denied the)
3 allegations
2 reports
1 claims
1 request
7 total

Steal probability mass to generalize better:
P(w | denied the)
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total

[Figure: bar charts of the two distributions over the words allegations, reports, claims, request, attack, man, outcome, …]
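Add-one estimation
Also called Laplace smoothing
Pretend we saw each word one more time than we did
Just add one to all the counts!

MLE estimate:
$P_{MLE}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$

Add-1 estimate:
$P_{Add-1}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$

where V is the number of word types in the vocabulary.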
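
A minimal sketch of the two estimates on the "denied the" counts from the earlier slides (3 allegations, 2 reports, 1 claims, 1 request). It assumes the usual add-one form, which adds 1 to each count and the vocabulary size V to the context count. The nine-word vocabulary is hypothetical and chosen only to keep the numbers readable.

```python
from collections import Counter

# Words seen after the context "denied the" in the training set (7 total).
counts = Counter({"allegations": 3, "reports": 2, "claims": 1, "request": 1})
context_count = sum(counts.values())

# Hypothetical vocabulary, assumed only for this illustration.
vocab = ["allegations", "reports", "claims", "request",
         "offer", "loan", "attack", "man", "outcome"]
V = len(vocab)


def p_mle(word):
    """MLE estimate: count / context count (0 for unseen words)."""
    return counts[word] / context_count


def p_add1(word):
    """Add-one estimate: (count + 1) / (context count + V)."""
    return (counts[word] + 1) / (context_count + V)


print(p_mle("offer"), p_add1("offer"))              # 0.0 0.0625
print(p_mle("allegations"), p_add1("allegations"))  # 3/7 drops to 0.25
```

The unseen word "offer" moves from 0 to 1/16, and that mass comes out of the seen words: "allegations" drops from 3/7 to 4/16.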
Maximum Likelihood Estimates
The maximum likelihood estimate
◦ of some parameter of a model M from a training set T
◦ maximizes the likelihood of the training set T given the model M
Suppose the word “bagel” occurs 400 times in a corpus of a million words
What is the probability that a random word from some other text will be “bagel”?
MLE estimate is 400/1,000,000 = .0004
This may be a bad estimate for some other corpus
◦ But it is the estimate that makes it most likely that “bagel” will occur 400 times in a
million word corpus.
Berkeley Restaurant Corpus: Laplace smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Compare with raw bigram counts
Add-1 estimation is a blunt instrument
So add-1 isn’t used for N-grams:
◦ We’ll see better methods
But add-1 is used to smooth other NLP models
◦ For text classification
◦ In domains where the number of zeros isn’t so huge.
Smoothing: Add-one (Laplace) smoothing
Language Modeling
