
Natural Language Processing
Lecture 3: n-gram language models

10/29/2020

COMS W4705
Yassine Benajiba
Probability of a Sentence

“But it must be recognized that the notion of ‘probability of a


sentence’ is an entirely useless one, under any known
interpretation of this term.”
Noam Chomsky (1969)
Language Modeling

• Task: predict the next word given the context.

• Used in speech recognition, handwritten character


recognition, spelling correction, text entry UI, machine
translation,…
Language Modeling

• Stocks plunged this …

• Let’s meet in Times …

• I took the subway to …


From a NYT story
• Stocks plunged this ....

• Stocks plunged this morning, despite a cut in interest rates by the …

• Stocks plunged this morning, despite a cut in interest


rates by the Federal Reserve, as Wall …

• Stocks plunged this morning, despite a cut in interest


rates by the Federal Reserve, as Wall Street began
Human Word Prediction
• Clearly at least some of us have the ability to predict the
future.

• How does this work?

• Domain knowledge

• Syntactic knowledge (guess correct part of speech)

• Lexical knowledge
Probability of the Next Word
• Idea: We do not need to model domain, syntactic, and
lexical knowledge perfectly.

• Instead, we can rely on the notion of probability of a


sequence (letters, words…).
Applications
• Speech recognition: P(“recognize speech”) > P(“wreck a nice beach”)

• Text generation: P(“three houses”) > P(“three house”)

• Spelling correction P(“my cat eats fish”) > P(“my xat eats fish”)

• Machine Translation P(“the blue house”) > P(“the house blue”)

• Other uses

• OCR

• Summarization

• Document classification

• Essay scoring
Language Models

• This model can also be used to describe the probability of


an entire sentence, not just the last word.

• Use the chain rule:

P(w_1, \dots, w_n) = P(w_n \mid w_1, \dots, w_{n-1}) \, P(w_1, \dots, w_{n-1})
                   = P(w_n \mid w_1, \dots, w_{n-1}) \, P(w_{n-1} \mid w_{n-2}, \dots, w_1) \, P(w_{n-2}, \dots, w_1) = \dots

Markov Assumption
• P(w_n \mid w_1, \dots, w_{n-1}) is difficult to estimate.

• The longer the sequence becomes, the less likely it is that w_1 w_2 w_3 … w_{n-1} appears in the training data.

• Instead, we make the following simple independence


assumption (Markov assumption):

• The probability of seeing w_n depends only on the previous k−1 words:

  P(w_n \mid w_1, \dots, w_{n-1}) \approx P(w_n \mid w_{n-k+1}, \dots, w_{n-1})
bi-gram language model
• Using the Markov assumption and the chain rule:

  P(w_1, \dots, w_n) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})

• More consistent to use only bigrams (with the convention w_0 = START):

  P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
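As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the bigram factorization, assuming a hypothetical dictionary bigram_prob that maps (previous word, word) pairs to conditional probabilities:

```python
# Minimal sketch: score a sentence under a bigram model.
# `bigram_prob` is a hypothetical lookup of P(w_i | w_{i-1}) values,
# e.g. bigram_prob[("I", "want")] = 0.32.

def bigram_sentence_prob(tokens, bigram_prob):
    """Return P(w_1, ..., w_n) under the bigram Markov assumption."""
    prob = 1.0
    prev = "START"                                   # w_0 = START
    for w in tokens + ["END"]:
        prob *= bigram_prob.get((prev, w), 0.0)      # unseen bigram -> 0 (before smoothing)
        prev = w
    return prob
```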


n-grams

• The sequence w_n is a unigram.

• The sequence w_{n-1}, w_n is a bigram.

• The sequence w_{n-2}, w_{n-1}, w_n is a trigram.

• The sequence w_{n-3}, w_{n-2}, w_{n-1}, w_n is a quadrigram (4-gram), …


Variable-Length Language
Models
• We typically don’t know what the length of the sentence is.

• Instead, we use a special marker STOP (written END in the examples below) that indicates the end of a sentence.

• We typically just augment the sentence with START and END markers to provide the appropriate context.
START i want to eat Chinese food END

P(i|START)·P(want|i)·P(to|want)·P(eat|to)·P(Chinese|eat)·P(food|Chinese)·P(END|food)
trigram example

P(i|START, START)·P(want|START,i)·P(to|i,want)·P(eat|want,to)·
P(Chinese|to,eat) · P(food|eat,Chinese)·P(END|Chinese,food)
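A corresponding sketch for the trigram factorization, again with a hypothetical trigram_prob lookup; note the two START symbols of padding:

```python
# Minimal sketch of the trigram factorization above: pad with two START
# symbols and one END symbol, then multiply conditional probabilities.
# `trigram_prob` is a hypothetical dict mapping (w_{i-2}, w_{i-1}, w_i) -> probability.

def trigram_sentence_prob(tokens, trigram_prob):
    padded = ["START", "START"] + tokens + ["END"]
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= trigram_prob.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
    return prob
```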
Bigram example from the Berkeley
Restaurant Project (BeRP)

Eat on 0.16 Eat Thai 0.03

Eat some 0.06 Eat breakfast 0.03

Eat lunch 0.06 Eat in 0.02

Eat dinner 0.05 Eat Chinese 0.02

Eat at 0.04 Eat Mexican 0.02

Eat a 0.04 Eat tomorrow 0.01

Eat Indian 0.04 Eat dessert 0.007

http://www1.icsi.berkeley.edu/Speech/berp.html
Bigram example from the Berkeley
Restaurant Project (BeRP)

START I 0.25 Want some 0.04

START I’d 0.06 Want Thai 0.01

START Tell 0.04 To eat 0.26

START I’m 0.02 To have 0.14

I want 0.32 To spend 0.09

I would 0.29 To be 0.02

I don’t 0.08 British food 0.60


Bigram example from the Berkeley
Restaurant Project (BeRP)
• Assume P(END | food) = 0.2

P(I want to eat British food) =


P(I | START) · P(want | I) · P(to | want) · P(eat | to) ·
P(British | eat) · P(food | British) · P(END | food) =
.25 · .32 · .65 · .26 · .001 · .60 · .2 = .0000016

P(I want to eat Chinese food) =


P(I | START) · P(want | I) · P(to | want) · P(eat | to) ·
P(Chinese | eat) · P(food | Chinese) · P(END | food) =
.25 · .32 · .65 · .26 · .02 · .60 · .2 = .000032
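The two products can be checked directly, for example with Python (math.prod is in the standard library):

```python
# Reproducing the two products above (factors taken from the BeRP tables).
from math import prod

p_british = prod([0.25, 0.32, 0.65, 0.26, 0.001, 0.60, 0.2])  # ~1.6e-06
p_chinese = prod([0.25, 0.32, 0.65, 0.26, 0.02, 0.60, 0.2])   # ~3.2e-05
print(p_british, p_chinese)
```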
log probabilities
• Probabilities become very small quickly: each additional token multiplies in a factor well below 1, so the product shrinks by a few orders of magnitude per token.

• We often work with log probabilities in practice, summing instead of multiplying:

  \log P(w_1, \dots, w_n) \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-1}), \qquad w_0 = \text{START}
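A minimal sketch of the log-probability version of the bigram scorer above (bigram_prob is again a hypothetical lookup table):

```python
# Minimal sketch: sum log probabilities instead of multiplying raw
# probabilities, avoiding numerical underflow on long sentences.
import math

def bigram_sentence_logprob(tokens, bigram_prob):
    logp = 0.0
    prev = "START"                      # w_0 = START
    for w in tokens + ["END"]:
        p = bigram_prob.get((prev, w), 0.0)
        if p == 0.0:
            return float("-inf")        # unseen bigram (before smoothing)
        logp += math.log(p)
        prev = w
    return logp
```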
What do ngrams capture?

• Probabilities seem to capture syntactic facts and


world knowledge.

• eat is often followed by an NP.

• British food is not too popular, but Chinese is.


Estimating n-gram
probabilities
• We can estimate n-gram probabilities using maximum likelihood estimates. For bigrams:

  P(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}

• Or for trigrams:

  P(w_i \mid w_{i-2}, w_{i-1}) = \frac{count(w_{i-2}, w_{i-1}, w_i)}{count(w_{i-2}, w_{i-1})}
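A minimal sketch of the bigram MLE in code, assuming the corpus is already tokenized into lists of words (the function name mle_bigram_probs is ours, not from the lecture):

```python
# Minimal sketch of the bigram MLE: count bigrams and their contexts in
# a tokenized corpus, then divide.
from collections import Counter

def mle_bigram_probs(sentences):
    """`sentences` is a list of token lists, e.g. [["i", "want", "to", "eat"], ...]."""
    bigram_counts, context_counts = Counter(), Counter()
    for sent in sentences:
        padded = ["START"] + sent + ["END"]
        for prev, w in zip(padded, padded[1:]):
            bigram_counts[(prev, w)] += 1
            context_counts[prev] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}
```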
Bigram Counts from BeRP

I Want To Eat Chinese Food lunch

I 8 1087 0 13 0 0 0

Want 3 0 786 0 6 8 6

To 3 0 10 860 3 0 12

Eat 0 0 2 0 19 2 52

Chinese 2 0 0 0 0 120 1

Food 19 0 17 0 0 0 0
Counts to Probabilities
          I     Want   To    Eat   Chinese  Food   lunch
I         8     1087   0     13    0        0      0
Want      3     0      786   0     6        8      6
To        3     0      10    860   3        0      12
Eat       0     0      2     0     19       2      52
Chinese   2     0      0     0     0        120    1
Food      19    0      17    0     0        0      0
Lunch     4     0      0     0     0        1      0

• Unigram counts:

  I 3437   Want 1215   To 3256   Eat 938   Chinese 213   Food 1506   Lunch 459

• Divide each bigram count by the unigram count of its context, e.g.
  P(want | I) = 1087 / 3437 ≈ 0.32.
Corpora
• Large digital collections of text or speech, covering different languages, domains, and modalities. Annotated or unannotated.

• English:

• Brown Corpus

• BNC, ANC

• Wall Street Journal

• AP newswire

• DARPA/NIST text/speech corpora


(CallHome, ATIS, Switchboard, Broadcast News, …)

• MT: Hansards, Europarl


Google Web 1T 5-gram
Corpus
File sizes: approx. 24 GB compressed (gzip'ed) text files

Number of tokens: 1,024,908,267,229

Number of sentences: 95,119,665,584

Number of unigrams: 13,588,391

Number of bigrams: 314,843,401

Number of trigrams: 977,069,902

Number of fourgrams: 1,313,818,354

Number of fivegrams: 1,176,470,663


Google Web 1T 5-gram
Corpus
• 3-gram examples:

ceramics collectables collectibles 55


ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
Google Web 1T 5-gram
Corpus
• 4-gram examples:

serve as the incoming 92


serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
Data sparsity in n-gram
models
• Sparsity is a problem all over NLP: Test data contains
language phenomena not encountered during training.

• For n-gram models there are two issues:

• We may not have seen all tokens.

• We may not have seen all n-grams (even though the individual tokens are known).

• A token has not been encountered in this context before, e.g. P(lunch | I) = 0.0.
Unseen Tokens
• Typical approach to unseen tokens:

• Start with a specific lexicon of known tokens.

• Replace all tokens in the training and testing corpus that


are not in the lexicon with an UNK token.

• Practical approach:

• Lexicon contains all words that appear more than k


times in the training corpus.

• Replace all other tokens with UNK.
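A minimal sketch of this preprocessing step, assuming tokenized training sentences and a frequency threshold k (names are illustrative):

```python
# Minimal sketch of the UNK preprocessing described above: keep tokens
# that occur more than k times in the training corpus, map the rest to "UNK".
from collections import Counter

def build_lexicon(train_sentences, k=1):
    counts = Counter(w for sent in train_sentences for w in sent)
    return {w for w, c in counts.items() if c > k}

def replace_unk(sentences, lexicon):
    return [[w if w in lexicon else "UNK" for w in sent] for sent in sentences]
```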


Unseen Contexts
• Two basic approaches:

• Smoothing / Discounting: Move some probability mass


from seen trigrams to unseen trigrams.

• Back-off: use (n−1)-gram, (n−2)-gram, … probabilities to compute the n-gram probability.

• Other techniques:

• Class-based back-off: use the back-off probability of a specific word class / part of speech.
Zipf’s Law
• Problem: n-grams (and most other linguistic phenomena)
follow a Zipfian distribution.

• A few words occur very frequently.

• Most words occur very rarely. Many are seen only once.

• Zipf’s law: a word’s frequency is approximately inversely proportional to its rank in the frequency-ordered word list.
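Stated as a formula (a common textbook form of the law, with an exponent s close to 1 for natural language):

```latex
f(r) \;\propto\; \frac{1}{r^{s}}, \qquad s \approx 1
```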
Zipf’s Law

[Figure: word frequency plotted against word rank]
Zipf’s Law
[Figure: word frequency vs. rank for Wikipedia corpora, 10M words per language]
https://commons.wikimedia.org/wiki/File:Zipf_30wiki_en_labels.png
Smoothing
• Smoothing flattens spiky distributions.

• before P(w | We denied the)


3 allegations
2 reports
1 claims
1 request
7 total

• after P(w | We denied the)


2.5 allegations
1.5 reports
0.5 claims
0.4 request
2 UNK
7 total
Smoothing is like Robin Hood: Steal from the rich, give to the poor.
Example from Dan Klein.
Additive Smoothing
• Classic approach: Laplacian, a.k.a. additive smoothing. For a unigram model:

  P_{add}(w) = \frac{count(w) + 1}{N + V}

• N is the number of tokens, V is the number of types (i.e., the size of the vocabulary).

• Inaccurate in practice.
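A minimal sketch of the bigram analogue of additive smoothing (the count dictionaries follow the hypothetical format used in the earlier MLE sketch):

```python
# Minimal sketch of additive (Laplace) smoothing for bigrams: add 1 to
# every bigram count and add the vocabulary size V to the denominator.
def laplace_bigram_prob(prev, w, bigram_counts, context_counts, vocab_size):
    return (bigram_counts.get((prev, w), 0) + 1) / (context_counts.get(prev, 0) + vocab_size)
```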
Linear Interpolation
• Use denser distributions of shorter n-grams to “fill in” sparse n-gram distributions:

  P_{interp}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i)

• where \lambda_1 + \lambda_2 + \lambda_3 = 1 and \lambda_j \geq 0.

• Works well in practice (though without much theoretical justification for why).

• Parameters can be estimated on development data (for


example, using Expectation Maximization).
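A minimal sketch of trigram/bigram/unigram interpolation with fixed weights; in practice the lambda values would be tuned on development data (e.g. with EM) rather than hard-coded:

```python
# Minimal sketch of linear interpolation of trigram, bigram, and unigram
# MLE estimates. The fixed weights are purely illustrative; in practice
# they would be estimated on development data (e.g. with EM).
def interpolated_prob(u, v, w, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    l1, l2, l3 = lambdas    # must satisfy l1 + l2 + l3 = 1, all >= 0
    return (l1 * p_tri.get((u, v, w), 0.0)
            + l2 * p_bi.get((v, w), 0.0)
            + l3 * p_uni.get(w, 0.0))
```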
Discounting
• Idea: set aside some probability mass, then fill in the missing mass using back-off.

• Define discounted counts count*(v, w) = count(v, w) − β, where 0 < β < 1.

• Then for all seen bigrams:

  P(w \mid v) = \frac{count^*(v, w)}{count(v)}

• For each context v the missing probability mass is

  \alpha(v) = 1 - \sum_{w : count(v, w) > 0} \frac{count^*(v, w)}{count(v)}

• We can now divide this held-out mass between the unseen words (evenly or using back-off).
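A minimal sketch of the discounting step for a single context v, assuming count dictionaries as in the earlier sketches and an illustrative discount β = 0.5:

```python
# Minimal sketch of discounting for a single context v: subtract beta
# from every seen bigram count, then compute the held-out mass alpha(v).
def discounted_probs_and_missing_mass(v, bigram_counts, context_counts, beta=0.5):
    seen = {w: (c - beta) / context_counts[v]
            for (ctx, w), c in bigram_counts.items() if ctx == v}
    alpha = 1.0 - sum(seen.values())    # probability mass reserved for unseen words
    return seen, alpha
```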
Katz’ Backoff
• Divide the held-out probability mass proportionally to the
unigram probability of the unseen words in context v.
Katz’ Backoff for Trigrams
• For trigrams: recursively compute the back-off probability for unseen bigrams, then distribute the held-out probability mass proportionally to that bigram back-off probability:

  P_{Katz}(w \mid u, v) =
    \begin{cases}
      \dfrac{count^*(u, v, w)}{count(u, v)} & \text{if } count(u, v, w) > 0 \\[6pt]
      \alpha(u, v) \, \dfrac{P_{Katz}(w \mid v)}{\sum_{w' : count(u, v, w') = 0} P_{Katz}(w' \mid v)} & \text{otherwise}
    \end{cases}

• where \alpha(u, v) is the probability mass held out from the seen trigrams with context (u, v).

• Often combined with Good-Turing smoothing.
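A minimal sketch of Katz-style back-off for bigrams, building on the discounting sketch above (seen and alpha are its outputs; unigram_prob and vocab are assumed to be available):

```python
# Minimal sketch of Katz-style back-off for bigrams: seen bigrams keep
# their discounted estimate; unseen bigrams share the held-out mass
# alpha(v) in proportion to the unigram probabilities of the unseen words.
def katz_bigram_prob(v, w, seen, alpha, unigram_prob, vocab):
    if w in seen:
        return seen[w]
    unseen = [u for u in vocab if u not in seen]
    denom = sum(unigram_prob[u] for u in unseen)
    return alpha * unigram_prob[w] / denom
```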


Evaluating n-gram models
• Extrinsic evaluation: Apply the model in an application (for
example language classification). Evaluate the application.

• Intrinsic evaluation: measure how well the model


approximates unseen language data.

• Can compute the probability of each sentence


according to the model. Higher probability -> better
model.

• Typically we compute Perplexity instead.


Perplexity
• Perplexity (per word) measures how well the ngram model predicts the
sample.

• Given a corpus of m sentences s_i, where M is the total number of tokens in the corpus.

• Perplexity is defined as 2^{-l}, where

  l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i)

• Lower perplexity = better model. Intuition:

• Assume we are predicting one word at a time.

• With uniform distribution, all successor words are equally likely.


Perplexity is equal to vocabulary size.

• Perplexity can be thought of as “effective vocabulary size”.
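A minimal sketch of the perplexity computation, assuming base-2 log probabilities for each sentence have already been computed (e.g. with the log-probability scorer sketched earlier):

```python
# Minimal sketch of perplexity: average per-token log2 probability over
# the corpus (M tokens), then take 2 to the power of its negative.
def perplexity(sentence_logprobs_base2, num_tokens_M):
    l = sum(sentence_logprobs_base2) / num_tokens_M
    return 2 ** (-l)
```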
