Lectures LM
Overview
The language modeling problem
Trigram models
Evaluating language models: perplexity
Estimation techniques:
  Linear interpolation
  Discounting methods
A language model assigns a probability to every possible sentence, e.g.:

  p(the STOP) = 10^-12
  p(the fan STOP) = 10^-8
  p(the fan saw Beckham STOP) = 2 × 10^-8
  p(the fan saw saw STOP) = 10^-15
  p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^-9
A Naive Method
We have N training sentences. For any sentence x_1 ... x_n, c(x_1 ... x_n) is the number of times the sentence is seen in our training data. A naive estimate:

  p(x_1 ... x_n) = c(x_1 ... x_n) / N
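To see why this estimate is too naive, here is a minimal sketch in Python (the toy corpus below is invented for illustration): any sentence not seen verbatim in the training data gets probability zero.

from collections import Counter

# Toy training corpus (invented): each sentence is a tuple of tokens ending in STOP.
training_sentences = [
    ("the", "dog", "barks", "STOP"),
    ("the", "dog", "barks", "STOP"),
    ("the", "cat", "sleeps", "STOP"),
]

N = len(training_sentences)              # number of training sentences
counts = Counter(training_sentences)     # c(x_1 ... x_n)

def naive_estimate(sentence):
    """p(x_1 ... x_n) = c(x_1 ... x_n) / N."""
    return counts[tuple(sentence)] / N

print(naive_estimate(["the", "dog", "barks", "STOP"]))   # 2/3: seen twice in training
print(naive_estimate(["the", "cat", "barks", "STOP"]))   # 0.0: any unseen sentence gets zero probability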
Overview
The language modeling problem
Trigram models
Evaluating language models: perplexity
Estimation techniques:
  Linear interpolation
  Discounting methods
Markov Processes
Consider a sequence of random variables X_1, X_2, ..., X_n. Each random variable can take any value in a finite set V. For now we assume the length n is fixed (e.g., n = 100).
Our goal: model

  P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
By the chain rule,

  P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1})

The first-order Markov assumption: for any i ∈ {2 ... n}, for any x_1 ... x_i,

  P(X_i = x_i | X_1 = x_1, ..., X_{i-1} = x_{i-1}) = P(X_i = x_i | X_{i-1} = x_{i-1})

Under this assumption,

  P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})
A second-order Markov process instead conditions each variable on the previous two:

  P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n)
    = P(X_1 = x_1) P(X_2 = x_2 | X_1 = x_1) ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
    = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

where, for convenience, we define x_0 = x_{-1} = *, with * a special "start" symbol, so the final expression also covers i = 1 and i = 2.
For any sentence x_1 ... x_n where x_i ∈ V for i = 1 ... (n-1), and x_n = STOP, the probability of the sentence under the trigram language model is

  p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

where we define x_0 = x_{-1} = *.
An Example
For the sentence "the dog barks STOP" we would have

  p(the dog barks STOP) = q(the | *, *) × q(dog | *, the) × q(barks | the, dog) × q(STOP | dog, barks)
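The same computation in code, as a minimal sketch; the q values below are hypothetical stand-ins for estimated parameters, not numbers from the lecture.

# Hypothetical trigram parameters q(w | u, v), keyed by ((u, v), w).
q = {
    (("*", "*"), "the"): 0.5,
    (("*", "the"), "dog"): 0.2,
    (("the", "dog"), "barks"): 0.1,
    (("dog", "barks"), "STOP"): 0.4,
}

def trigram_sentence_prob(sentence, q):
    """p(x_1 ... x_n) = product over i of q(x_i | x_{i-2}, x_{i-1}), with x_0 = x_{-1} = *."""
    padded = ["*", "*"] + list(sentence)
    prob = 1.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        prob *= q.get(((u, v), w), 0.0)     # unseen trigrams contribute probability 0
    return prob

print(trigram_sentence_prob(["the", "dog", "barks", "STOP"], q))  # 0.5 * 0.2 * 0.1 * 0.4 = 0.004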
Say our vocabulary size is N = |V|; then there are N^3 parameters in the model. E.g., N = 20,000 gives 20,000^3 = 8 × 10^12 parameters.
Overview
The language modeling problem
Trigram models
Evaluating language models: perplexity
Estimation techniques:
  Linear interpolation
  Discounting methods
We measure the quality of a model on some held-out test data: m sentences s_1, s_2, ..., s_m. A natural quantity is the log-probability the model assigns to this data:

  log_2 ∏_{i=1}^{m} p(s_i) = ∑_{i=1}^{m} log_2 p(s_i)

The usual evaluation measure is perplexity:

  Perplexity = 2^{-l}    where    l = (1/M) ∑_{i=1}^{m} log_2 p(s_i)

and M is the total number of words in the test data. Lower perplexity is better.
Some intuition: consider a model that predicts q(w | u, v) = 1/N for all w ∈ V ∪ {STOP}, for all u, v ∈ V ∪ {*}. It is easy to calculate the perplexity in this case:

  Perplexity = 2^{-l}    where    l = log_2 (1/N)

giving Perplexity = N. Perplexity is a measure of the effective "branching factor" of the model.
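A small sketch of the perplexity computation under these definitions; the test-set sizes and vocabulary size below are made up, and the uniform 1/N model is used only as the sanity check described above.

import math

def perplexity(sentence_log2_probs, M):
    """Perplexity = 2^(-l), where l = (1/M) * sum_i log2 p(s_i) and M is the total word count."""
    l = sum(sentence_log2_probs) / M
    return 2.0 ** (-l)

# Sanity check: a model assigning q(w | u, v) = 1/N to every word should give Perplexity = N.
N = 10_000                                    # hypothetical vocabulary size (including STOP)
sentence_lengths = [12, 7, 20]                # made-up test sentence lengths (in words)
M = sum(sentence_lengths)
log2_probs = [n * math.log2(1.0 / N) for n in sentence_lengths]
print(perplexity(log2_probs, M))              # 10000.0 (up to floating point error)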
The perplexities of different model classes can be compared directly:

  A trigram model: p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})
  A bigram model:  p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1})
  A unigram model: p(x_1 ... x_n) = ∏_{i=1}^{n} q(x_i)

In practice the richer context pays off: trigram models typically achieve markedly lower perplexity than bigram models, which in turn beat unigram models.
Some History
Shannon conducted experiments on the entropy of English, i.e., how good are people at the perplexity game?
C. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30:50-64, 1951.
Some History
Chomsky (in Syntactic Structures (1957)):
Second, the notion "grammatical" cannot be identified with "meaningful" or "significant" in any semantic sense. Sentences (1) and (2) are equally nonsensical, but any speaker of English will recognize that only the former is grammatical.
(1) Colorless green ideas sleep furiously.
(2) Furiously sleep ideas green colorless.
. . . Third, the notion "grammatical in English" cannot be identified in any way with the notion "high order of statistical approximation to English". It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally "remote" from English. Yet (1), though nonsensical, is grammatical, while (2) is not. . . .
Overview
The language modeling problem
Trigram models
Evaluating language models: perplexity
Estimation techniques:
  Linear interpolation
  Discounting methods
A natural estimate is the maximum-likelihood estimate, e.g.

  q_ML(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

and similarly for bigram and unigram estimates. But say our vocabulary size is N = |V|; then there are N^3 parameters in the model (e.g., N = 20,000 gives 20,000^3 = 8 × 10^12 parameters), so the vast majority of trigrams are never observed in the training data.
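For concreteness, a sketch of the maximum-likelihood estimates computed from counts; the tiny corpus and helper names are invented for illustration.

from collections import Counter

# Invented training data; each sentence ends with STOP and is padded with * below.
sentences = [
    ["the", "dog", "barks", "STOP"],
    ["the", "cat", "sleeps", "STOP"],
    ["the", "dog", "sleeps", "STOP"],
]

tri, bi, uni = Counter(), Counter(), Counter()      # Count(u,v,w), Count(v,w), Count(w)
hist2, hist1, total = Counter(), Counter(), 0       # Count(u,v) and Count(v) as histories

for s in sentences:
    padded = ["*", "*"] + s
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        tri[(u, v, w)] += 1
        hist2[(u, v)] += 1
        bi[(v, w)] += 1
        hist1[v] += 1
        uni[w] += 1
        total += 1

def q_ml_tri(w, u, v):
    """q_ML(w | u, v) = Count(u, v, w) / Count(u, v)."""
    return tri[(u, v, w)] / hist2[(u, v)]

def q_ml_bi(w, v):
    return bi[(v, w)] / hist1[v]

def q_ml_uni(w):
    return uni[w] / total

print(q_ml_tri("dog", "*", "the"))   # 2/3: "dog" follows the history (*, the) in 2 of 3 sentences
print(q_ml_bi("barks", "dog"))       # 1/2
print(q_ml_uni("the"))               # 3/12 = 0.25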
Linear Interpolation
Take our estimate q(w_i | w_{i-2}, w_{i-1}) to be

  q(w_i | w_{i-2}, w_{i-1}) = λ_1 q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 q_ML(w_i | w_{i-1}) + λ_3 q_ML(w_i)

where λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i.
Our estimate correctly defines a distribution: for any bigram (u, v), summing over w ∈ V ∪ {STOP},

  ∑_w q(w | u, v)
    = ∑_w [ λ_1 q_ML(w | u, v) + λ_2 q_ML(w | v) + λ_3 q_ML(w) ]
    = λ_1 ∑_w q_ML(w | u, v) + λ_2 ∑_w q_ML(w | v) + λ_3 ∑_w q_ML(w)
    = λ_1 + λ_2 + λ_3
    = 1

(Each q_ML term is itself a distribution, so each inner sum equals 1; we also have q(w | u, v) ≥ 0 because the λ's and the q_ML estimates are non-negative.)
The λ values are typically estimated on held-out data, chosen to maximize its likelihood such that λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i, and where

  q(w_i | w_{i-2}, w_{i-1}) = λ_1 q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 q_ML(w_i | w_{i-1}) + λ_3 q_ML(w_i)
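A minimal sketch of the interpolated estimate, assuming the three q_ML functions are available (e.g., as in the counting sketch above) and that the λ values have already been chosen on held-out data; the numbers here are placeholders, not values from the lecture.

# Placeholder lambda values; in practice they are estimated on validation data.
LAMBDAS = (0.6, 0.3, 0.1)   # must satisfy l1 + l2 + l3 = 1 and l_i >= 0

def q_interp(w, u, v, q_ml_tri, q_ml_bi, q_ml_uni, lambdas=LAMBDAS):
    """q(w | u, v) = l1*q_ML(w|u,v) + l2*q_ML(w|v) + l3*q_ML(w)."""
    l1, l2, l3 = lambdas
    return l1 * q_ml_tri(w, u, v) + l2 * q_ml_bi(w, v) + l3 * q_ml_uni(w)

# Sanity check with stand-in uniform ML estimates over a four-symbol vocabulary
# (three words plus STOP): the interpolated values still sum to 1.
vocab = ["the", "dog", "barks", "STOP"]

def uniform(*args):
    return 1.0 / len(vocab)

print(sum(q_interp(w, "the", "dog", uniform, uniform, uniform) for w in vocab))  # 1.0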
We can also introduce a dependence of the λ's on the bigram history through a partition function Π(w_{i-2}, w_{i-1}) (for example, Π might bucket histories by how often the bigram occurs in training data; see the sketch below):

  q(w_i | w_{i-2}, w_{i-1}) = λ_1^{Π(w_{i-2}, w_{i-1})} q_ML(w_i | w_{i-2}, w_{i-1})
                            + λ_2^{Π(w_{i-2}, w_{i-1})} q_ML(w_i | w_{i-1})
                            + λ_3^{Π(w_{i-2}, w_{i-1})} q_ML(w_i)

where λ_1^{Π(w_{i-2}, w_{i-1})} + λ_2^{Π(w_{i-2}, w_{i-1})} + λ_3^{Π(w_{i-2}, w_{i-1})} = 1, and each λ is non-negative. A separate set of λ's is estimated for each partition.
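One common choice is to let Π bucket histories by how often the bigram (w_{i-2}, w_{i-1}) was seen in training; a sketch with illustrative thresholds and placeholder λ's (not prescribed by the lecture) follows.

def partition(history_count):
    """Map Count(w_{i-2}, w_{i-1}) to a bucket index; a separate lambda triple is kept per bucket."""
    if history_count == 0:
        return 1
    if history_count <= 2:
        return 2
    if history_count <= 5:
        return 3
    return 4

# Placeholder per-bucket lambdas: rare histories lean on bigram/unigram estimates,
# frequent histories trust the trigram estimate more.
bucket_lambdas = {
    1: (0.0, 0.4, 0.6),
    2: (0.2, 0.5, 0.3),
    3: (0.4, 0.4, 0.2),
    4: (0.7, 0.2, 0.1),
}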
Overview
The language modeling problem
Trigram models
Evaluating language models: perplexity
Estimation techniques:
  Linear interpolation
  Discounting methods
Discounting Methods
Say we've seen the following counts:

  x                 Count(x)    q_ML(w_i | w_{i-1})
  the               48
  the, dog          15          15/48
  the, woman        11          11/48
  the, man          10          10/48
  the, park         5           5/48
  the, job          2           2/48
  the, telescope    1           1/48
  the, manual       1           1/48
  the, afternoon    1           1/48
  the, country      1           1/48
  the, street       1           1/48

The maximum-likelihood estimates are high, particularly for the low count items.
Discounting Methods
Now define discounted counts, Count*(x) = Count(x) - 0.5. New estimates:

  x                 Count(x)    Count*(x)    Count*(x) / Count(the)
  the               48
  the, dog          15          14.5         14.5/48
  the, woman        11          10.5         10.5/48
  the, man          10          9.5          9.5/48
  the, park         5           4.5          4.5/48
  the, job          2           1.5          1.5/48
  the, telescope    1           0.5          0.5/48
  the, manual       1           0.5          0.5/48
  the, afternoon    1           0.5          0.5/48
  the, country      1           0.5          0.5/48
  the, street       1           0.5          0.5/48
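A sketch of this computation, with the counts copied from the example above and the same discount of 0.5; the leftover mass computed at the end is what the back-off estimate below redistributes.

# Counts of bigrams (the, w), copied from the example table.
counts = {"dog": 15, "woman": 11, "man": 10, "park": 5, "job": 2,
          "telescope": 1, "manual": 1, "afternoon": 1, "country": 1, "street": 1}
count_the = 48
discount = 0.5

discounted = {w: c - discount for w, c in counts.items()}           # Count*(the, w)
q_disc = {w: c / count_the for w, c in discounted.items()}          # Count*(the, w) / Count(the)

# The discounting frees up "missing" probability mass for unseen words:
alpha_the = 1.0 - sum(q_disc.values())
print(round(alpha_the, 4))    # 10 seen bigram types * 0.5 / 48 ≈ 0.1042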
Divide the possible next words into two sets:

  A(w_{i-1}) = {w : Count(w_{i-1}, w) > 0}
  B(w_{i-1}) = {w : Count(w_{i-1}, w) = 0}

A back-off bigram estimate (written q_BO here) uses the discounted counts for seen bigrams, and redistributes the left-over mass over unseen words in proportion to their unigram estimates:

  q_BO(w_i | w_{i-1}) =
    Count*(w_{i-1}, w_i) / Count(w_{i-1})                       if w_i ∈ A(w_{i-1})
    α(w_{i-1}) × q_ML(w_i) / ∑_{w ∈ B(w_{i-1})} q_ML(w)         if w_i ∈ B(w_{i-1})

  where α(w_{i-1}) = 1 - ∑_{w ∈ A(w_{i-1})} Count*(w_{i-1}, w) / Count(w_{i-1})

is the "missing" probability mass created by the discounting.
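A minimal sketch of such a back-off estimate, assuming bigram and unigram counts are available; the counts and helper names below are illustrative, not from the lecture.

from collections import Counter

# Illustrative counts (invented).
bigram_counts = Counter({("the", "dog"): 15, ("the", "woman"): 11, ("the", "man"): 10})
unigram_counts = Counter({"the": 48, "dog": 20, "woman": 15, "man": 12, "park": 8})
total_words = sum(unigram_counts.values())
discount = 0.5

def q_ml_unigram(w):
    return unigram_counts[w] / total_words

def q_backoff(w, v):
    """Back-off bigram estimate: discounted counts for seen bigrams, unigram back-off otherwise."""
    count_v = sum(c for (u, _), c in bigram_counts.items() if u == v)   # times v occurs as a history
    seen = {x for (u, x) in bigram_counts if u == v}                    # A(v): words seen after v
    if w in seen:
        return (bigram_counts[(v, w)] - discount) / count_v             # Count*(v, w) / Count(v)
    # alpha(v): probability mass left over after the discounted seen bigrams
    alpha = 1.0 - sum((bigram_counts[(v, x)] - discount) / count_v for x in seen)
    unseen_mass = sum(q_ml_unigram(x) for x in unigram_counts if x not in seen)
    return alpha * q_ml_unigram(w) / unseen_mass                        # spread alpha over B(v)

print(q_backoff("dog", "the"))    # seen bigram: discounted count / Count(the)
print(q_backoff("park", "the"))   # unseen bigram: backs off to the unigram estimate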
Summary
Three steps in deriving the language model probabilities:
1. Expand p(w_1, w_2 ... w_n) using the chain rule.
2. Make Markov independence assumptions:
     p(w_i | w_1, w_2 ... w_{i-2}, w_{i-1}) = p(w_i | w_{i-2}, w_{i-1})
3. Smooth the estimates using lower-order counts.