Lectures LM
Overview
The language modeling problem
Trigram models
Evaluating language models: perplexity
Estimation techniques:
  Linear interpolation
  Discounting methods
A language model assigns a probability to every possible sentence, e.g.:

  p(the STOP) = 10^-12
  p(the fan STOP) = 10^-8
  p(the fan saw Beckham STOP) = 2 × 10^-8
  p(the fan saw saw STOP) = 10^-15
  p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^-9
A Naive Method
We have N training sentences. For any sentence x_1 ... x_n, c(x_1 ... x_n) is the number of times the sentence is seen in our training data. A naive estimate:

  p(x_1 ... x_n) = c(x_1 ... x_n) / N
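To see why this estimate is too naive, here is a minimal sketch in Python (the toy corpus below is invented for illustration): any sentence not seen verbatim in the training data gets probability zero.

from collections import Counter

# Toy training corpus (invented): each sentence is a tuple of tokens ending in STOP.
training_sentences = [
    ("the", "dog", "barks", "STOP"),
    ("the", "dog", "barks", "STOP"),
    ("the", "cat", "sleeps", "STOP"),
]

N = len(training_sentences)              # number of training sentences
counts = Counter(training_sentences)     # c(x_1 ... x_n)

def naive_estimate(sentence):
    """p(x_1 ... x_n) = c(x_1 ... x_n) / N."""
    return counts[tuple(sentence)] / N

print(naive_estimate(["the", "dog", "barks", "STOP"]))   # 2/3: seen twice in training
print(naive_estimate(["the", "cat", "barks", "STOP"]))   # 0.0: any unseen sentence gets zero probability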
Overview
The language modeling problem
Trigram models
Evaluating language models: perplexity
Estimation techniques:
  Linear interpolation
  Discounting methods
Markov Processes
Consider a sequence of random variables X_1, X_2, ..., X_n. Each random variable can take any value in a finite set V. For now we assume the length n is fixed (e.g., n = 100).
Our goal: model

  P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
By the chain rule,

  P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

The first-order Markov assumption: for any i ∈ {2 ... n}, for any x_1 ... x_i,

  P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) = P(X_i = x_i | X_{i-1} = x_{i-1})

Under this assumption,

  P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})
A second-order Markov process instead conditions each variable on the previous two:

  P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
    = P(X_1 = x_1) P(X_2 = x_2 | X_1 = x_1) ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
    = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

where, for convenience, we define x_0 = x_{-1} = *, with * a special "start" symbol, so the final expression also covers i = 1 and i = 2.
For any sentence x_1 ... x_n where x_i ∈ V for i = 1 ... (n-1), and x_n = STOP, the probability of the sentence under the trigram language model is

  p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

where we define x_0 = x_{-1} = *.
An Example
For the sentence "the dog barks STOP" we would have

  p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)
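The same computation in code, as a minimal sketch; the q values below are hypothetical stand-ins for estimated parameters, not numbers from the lecture.

# Hypothetical trigram parameters q(w | u, v), keyed by ((u, v), w).
q = {
    (("*", "*"), "the"): 0.5,
    (("*", "the"), "dog"): 0.2,
    (("the", "dog"), "barks"): 0.1,
    (("dog", "barks"), "STOP"): 0.4,
}

def trigram_sentence_prob(sentence, q):
    """p(x_1 ... x_n) = product over i of q(x_i | x_{i-2}, x_{i-1}), with x_0 = x_{-1} = *."""
    padded = ["*", "*"] + list(sentence)
    prob = 1.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        prob *= q.get(((u, v), w), 0.0)     # unseen trigrams contribute probability 0
    return prob

print(trigram_sentence_prob(["the", "dog", "barks", "STOP"], q))  # 0.5 * 0.2 * 0.1 * 0.4 = 0.004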
Say our vocabulary size is N = |V|; then there are N^3 parameters in the model. E.g., N = 20,000 gives 20,000^3 = 8 × 10^12 parameters.
Overview
The language modeling problem
Trigram models
Evaluating language models: perplexity
Estimation techniques:
  Linear interpolation
  Discounting methods
We measure the quality of a model on some held-out test data: m sentences s_1, s_2, ..., s_m. A natural quantity is the log-probability the model assigns to this data:

  log_2 ∏_{i=1}^{m} p(s_i) = ∑_{i=1}^{m} log_2 p(s_i)

The usual evaluation measure is perplexity:

  Perplexity = 2^{-l}    where    l = (1/M) ∑_{i=1}^{m} log_2 p(s_i)

and M is the total number of words in the test data. Lower perplexity is better.
Some intuition: consider a model that predicts q(w | u, v) = 1/N for all w ∈ V ∪ {STOP}, for all u, v ∈ V ∪ {*}. It is easy to calculate the perplexity in this case:

  Perplexity = 2^{-l}    where    l = log_2 (1/N)

giving Perplexity = N. Perplexity is a measure of the effective "branching factor" of the model.
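A small sketch of the perplexity computation under these definitions; the test-set sizes and vocabulary size below are made up, and the uniform 1/N model is used only as the sanity check described above.

import math

def perplexity(sentence_log2_probs, M):
    """Perplexity = 2^(-l), where l = (1/M) * sum_i log2 p(s_i) and M is the total word count."""
    l = sum(sentence_log2_probs) / M
    return 2.0 ** (-l)

# Sanity check: a model assigning q(w | u, v) = 1/N to every word should give Perplexity = N.
N = 10_000                                    # hypothetical vocabulary size (including STOP)
sentence_lengths = [12, 7, 20]                # made-up test sentence lengths (in words)
M = sum(sentence_lengths)
log2_probs = [n * math.log2(1.0 / N) for n in sentence_lengths]
print(perplexity(log2_probs, M))              # 10000.0 (up to floating point error)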
The perplexities of different model classes can be compared directly:

  A trigram model: p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})
  A bigram model:  p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1})
  A unigram model: p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i)

In practice the richer context pays off: trigram models typically achieve markedly lower perplexity than bigram models, which in turn beat unigram models.
Some History
Shannon conducted experiments on the entropy of English, i.e., how good are people at the perplexity game?
C. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30:50-64, 1951.
Some History
Chomsky (in Syntactic Structures (1957)):
Second, the notion "grammatical" cannot be identified with "meaningful" or "significant" in any semantic sense. Sentences (1) and (2) are equally nonsensical, but any speaker of English will recognize that only the former is grammatical.
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
. . . Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English". It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally "remote" from English. Yet (1), though nonsensical, is grammatical, while (2) is not. . . .
Overview
The language modeling problem
Trigram models
Evaluating language models: perplexity
Estimation techniques:
  Linear interpolation
  Discounting methods
A natural estimate is the maximum-likelihood estimate, e.g.

  q_ML(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

and similarly for bigram and unigram estimates. But say our vocabulary size is N = |V|; then there are N^3 parameters in the model (e.g., N = 20,000 gives 20,000^3 = 8 × 10^12 parameters), so the vast majority of trigrams are never observed in the training data.
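For concreteness, a sketch of the maximum-likelihood estimates computed from counts; the tiny corpus and helper names are invented for illustration.

from collections import Counter

# Invented training data; each sentence ends with STOP and is padded with * below.
sentences = [
    ["the", "dog", "barks", "STOP"],
    ["the", "cat", "sleeps", "STOP"],
    ["the", "dog", "sleeps", "STOP"],
]

tri, bi, uni = Counter(), Counter(), Counter()      # Count(u,v,w), Count(v,w), Count(w)
hist2, hist1, total = Counter(), Counter(), 0       # Count(u,v) and Count(v) as histories

for s in sentences:
    padded = ["*", "*"] + s
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        tri[(u, v, w)] += 1
        hist2[(u, v)] += 1
        bi[(v, w)] += 1
        hist1[v] += 1
        uni[w] += 1
        total += 1

def q_ml_tri(w, u, v):
    """q_ML(w | u, v) = Count(u, v, w) / Count(u, v)."""
    return tri[(u, v, w)] / hist2[(u, v)]

def q_ml_bi(w, v):
    return bi[(v, w)] / hist1[v]

def q_ml_uni(w):
    return uni[w] / total

print(q_ml_tri("dog", "*", "the"))   # 2/3: "dog" follows the history (*, the) in 2 of 3 sentences
print(q_ml_bi("barks", "dog"))       # 1/2
print(q_ml_uni("the"))               # 3/12 = 0.25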
Linear Interpolation
Take our estimate q(w_i | w_{i-2}, w_{i-1}) to be

  q(w_i | w_{i-2}, w_{i-1}) = λ_1 q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 q_ML(w_i | w_{i-1}) + λ_3 q_ML(w_i)

where λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i.
Our estimate correctly defines a distribution: for any bigram (u, v), summing over w ∈ V ∪ {STOP},

  ∑_w q(w | u, v)
    = ∑_w [ λ_1 q_ML(w | u, v) + λ_2 q_ML(w | v) + λ_3 q_ML(w) ]
    = λ_1 ∑_w q_ML(w | u, v) + λ_2 ∑_w q_ML(w | v) + λ_3 ∑_w q_ML(w)
    = λ_1 + λ_2 + λ_3
    = 1

(Each q_ML term is itself a distribution, so each inner sum equals 1; we also have q(w | u, v) ≥ 0 because the λ's and the q_ML estimates are non-negative.)
The λ values are typically estimated on held-out data, chosen to maximize its likelihood such that λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i, and where

  q(w_i | w_{i-2}, w_{i-1}) = λ_1 q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 q_ML(w_i | w_{i-1}) + λ_3 q_ML(w_i)
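A minimal sketch of the interpolated estimate, assuming the three q_ML functions are available (e.g., as in the counting sketch above) and that the λ values have already been chosen on held-out data; the numbers here are placeholders, not values from the lecture.

# Placeholder lambda values; in practice they are estimated on validation data.
LAMBDAS = (0.6, 0.3, 0.1)   # must satisfy l1 + l2 + l3 = 1 and l_i >= 0

def q_interp(w, u, v, q_ml_tri, q_ml_bi, q_ml_uni, lambdas=LAMBDAS):
    """q(w | u, v) = l1*q_ML(w|u,v) + l2*q_ML(w|v) + l3*q_ML(w)."""
    l1, l2, l3 = lambdas
    return l1 * q_ml_tri(w, u, v) + l2 * q_ml_bi(w, v) + l3 * q_ml_uni(w)

# Sanity check with stand-in uniform ML estimates over a four-symbol vocabulary
# (three words plus STOP): the interpolated values still sum to 1.
vocab = ["the", "dog", "barks", "STOP"]

def uniform(*args):
    return 1.0 / len(vocab)

print(sum(q_interp(w, "the", "dog", uniform, uniform, uniform) for w in vocab))  # 1.0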
We can also introduce a dependence of the λ's on the bigram history through a partition function Π(w_{i-2}, w_{i-1}) (for example, Π might bucket histories by how often the bigram occurs in training data; see the sketch below):

  q(w_i | w_{i-2}, w_{i-1}) = λ_1^{Π(w_{i-2}, w_{i-1})} q_ML(w_i | w_{i-2}, w_{i-1})
                            + λ_2^{Π(w_{i-2}, w_{i-1})} q_ML(w_i | w_{i-1})
                            + λ_3^{Π(w_{i-2}, w_{i-1})} q_ML(w_i)

where λ_1^{Π(w_{i-2}, w_{i-1})} + λ_2^{Π(w_{i-2}, w_{i-1})} + λ_3^{Π(w_{i-2}, w_{i-1})} = 1, and each λ is non-negative. A separate set of λ's is estimated for each partition.
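One common choice is to let Π bucket histories by how often the bigram (w_{i-2}, w_{i-1}) was seen in training; a sketch with illustrative thresholds and placeholder λ's (not prescribed by the lecture) follows.

def partition(history_count):
    """Map Count(w_{i-2}, w_{i-1}) to a bucket index; a separate lambda triple is kept per bucket."""
    if history_count == 0:
        return 1
    if history_count <= 2:
        return 2
    if history_count <= 5:
        return 3
    return 4

# Placeholder per-bucket lambdas: rare histories lean on bigram/unigram estimates,
# frequent histories trust the trigram estimate more.
bucket_lambdas = {
    1: (0.0, 0.4, 0.6),
    2: (0.2, 0.5, 0.3),
    3: (0.4, 0.4, 0.2),
    4: (0.7, 0.2, 0.1),
}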
Overview
The language modeling problem
Trigram models
Evaluating language models: perplexity
Estimation techniques:
  Linear interpolation
  Discounting methods
Discounting Methods
Say we've seen the following counts:

  x                 Count(x)    q_ML(w_i | w_{i-1})
  the               48
  the, dog          15          15/48
  the, woman        11          11/48
  the, man          10          10/48
  the, park         5           5/48
  the, job          2           2/48
  the, telescope    1           1/48
  the, manual       1           1/48
  the, afternoon    1           1/48
  the, country      1           1/48
  the, street       1           1/48

The maximum-likelihood estimates are high, particularly for the low count items.
Discounting Methods
Now define discounted counts, Count*(x) = Count(x) - 0.5. New estimates:

  x                 Count(x)    Count*(x)    Count*(x) / Count(the)
  the               48
  the, dog          15          14.5         14.5/48
  the, woman        11          10.5         10.5/48
  the, man          10          9.5          9.5/48
  the, park         5           4.5          4.5/48
  the, job          2           1.5          1.5/48
  the, telescope    1           0.5          0.5/48
  the, manual       1           0.5          0.5/48
  the, afternoon    1           0.5          0.5/48
  the, country      1           0.5          0.5/48
  the, street       1           0.5          0.5/48
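A sketch of this computation, with the counts copied from the example above and the same discount of 0.5; the leftover mass computed at the end is what the back-off estimate below redistributes.

# Counts of bigrams (the, w), copied from the example table.
counts = {"dog": 15, "woman": 11, "man": 10, "park": 5, "job": 2,
          "telescope": 1, "manual": 1, "afternoon": 1, "country": 1, "street": 1}
count_the = 48
discount = 0.5

discounted = {w: c - discount for w, c in counts.items()}           # Count*(the, w)
q_disc = {w: c / count_the for w, c in discounted.items()}          # Count*(the, w) / Count(the)

# The discounting frees up "missing" probability mass for unseen words:
alpha_the = 1.0 - sum(q_disc.values())
print(round(alpha_the, 4))    # 10 seen bigram types * 0.5 / 48 ≈ 0.1042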
Divide the possible next words into two sets:

  A(w_{i-1}) = {w : Count(w_{i-1}, w) > 0}
  B(w_{i-1}) = {w : Count(w_{i-1}, w) = 0}

A back-off bigram estimate (written q_BO here) uses the discounted counts for seen bigrams, and redistributes the left-over mass over unseen words in proportion to their unigram estimates:

  q_BO(w_i | w_{i-1}) =
    Count*(w_{i-1}, w_i) / Count(w_{i-1})                       if w_i ∈ A(w_{i-1})
    α(w_{i-1}) × q_ML(w_i) / ∑_{w ∈ B(w_{i-1})} q_ML(w)         if w_i ∈ B(w_{i-1})

  where α(w_{i-1}) = 1 - ∑_{w ∈ A(w_{i-1})} Count*(w_{i-1}, w) / Count(w_{i-1})

is the "missing" probability mass created by the discounting.
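A minimal sketch of such a back-off estimate, assuming bigram and unigram counts are available; the counts and helper names below are illustrative, not from the lecture.

from collections import Counter

# Illustrative counts (invented).
bigram_counts = Counter({("the", "dog"): 15, ("the", "woman"): 11, ("the", "man"): 10})
unigram_counts = Counter({"the": 48, "dog": 20, "woman": 15, "man": 12, "park": 8})
total_words = sum(unigram_counts.values())
discount = 0.5

def q_ml_unigram(w):
    return unigram_counts[w] / total_words

def q_backoff(w, v):
    """Back-off bigram estimate: discounted counts for seen bigrams, unigram back-off otherwise."""
    count_v = sum(c for (u, _), c in bigram_counts.items() if u == v)   # times v occurs as a history
    seen = {x for (u, x) in bigram_counts if u == v}                    # A(v): words seen after v
    if w in seen:
        return (bigram_counts[(v, w)] - discount) / count_v             # Count*(v, w) / Count(v)
    # alpha(v): probability mass left over after the discounted seen bigrams
    alpha = 1.0 - sum((bigram_counts[(v, x)] - discount) / count_v for x in seen)
    unseen_mass = sum(q_ml_unigram(x) for x in unigram_counts if x not in seen)
    return alpha * q_ml_unigram(w) / unseen_mass                        # spread alpha over B(v)

print(q_backoff("dog", "the"))    # seen bigram: discounted count / Count(the)
print(q_backoff("park", "the"))   # unseen bigram: backs off to the unigram estimate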
Summary
Three steps in deriving the language model probabilities:
1. Expand p(w_1, w_2 ... w_n) using the chain rule.
2. Make Markov independence assumptions:
     p(w_i | w_1, w_2 ... w_{i-2}, w_{i-1}) = p(w_i | w_{i-2}, w_{i-1})
3. Smooth the estimates using lower-order counts.