Lecture 5: Language Modeling (N-Gram, BOW)
Lecture Objectives:
• Estimate sentence probabilities with n-gram (bigram) language models
• Smooth the estimates to handle unseen n-grams (Laplace / add-1 smoothing)
• Evaluate language models on unseen test data using perplexity
• Understand the Markov assumption behind n-gram models

Example (machine translation): a language model helps pick the most fluent English rendering of a foreign sentence, such as the Chinese sentence and its word-by-word gloss below.
他 向 记者 介绍了 主要 内容
He to reporters introduced main content
• Result: the language model assigns higher probability to a fluent reordering (e.g., "he briefed reporters on the main content") than to the literal word-for-word gloss.
Bigram estimates of sentence probabilities
P(<s> I want english food </s>)
  = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)
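A minimal Python sketch of this chain-of-bigrams computation is shown below. The probability table is only illustrative: P(want|I) = 0.33 is an assumed placeholder, while the other values match the Examples given later in this lecture.

```python
# Minimal sketch: score a sentence as a product of bigram probabilities.
# The table is illustrative; P(want|i) = 0.33 is an assumed placeholder,
# the other values match the Examples section of this lecture.
from functools import reduce

bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,          # assumed placeholder value
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_prob(words, probs):
    """P(<s> w1 ... wn </s>) under a bigram model."""
    tokens = ["<s>"] + [w.lower() for w in words] + ["</s>"]
    pairs = zip(tokens, tokens[1:])
    return reduce(lambda p, pair: p * probs[pair], pairs, 1.0)

print(sentence_prob("I want english food".split(), bigram_prob))
# ≈ 3.1e-05 with the illustrative numbers above
```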
Training Language Models?
Google N-Gram Release
• serve as the inspiration 1390
• serve as the installation 136
• serve as the institute 187
• serve as the institution 279
• serve as the institutional 461
• serve as the instructional 173
• serve as the instructor 286
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
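As a rough sketch of what "training" a language model means here, the snippet below counts unigrams and bigrams in a made-up toy corpus and turns the counts into maximum-likelihood estimates.

```python
# Sketch: "training" a bigram model is just counting pairs in a corpus
# and normalizing. The toy corpus below is made up for illustration.
from collections import Counter

corpus = [
    "i want english food",
    "i want chinese food",
    "i want to eat",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """MLE estimate: c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("want", "i"))   # 3/3 = 1.0 in this toy corpus
```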
Smoothing techniques
• Laplace smoothing
  – Pretend we saw each word one more time than we did.
  – Just add one to all the counts!
• MLE estimate:
  P_MLE(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1})
• Add-1 estimate (V = vocabulary size):
  P_Add-1(w_i | w_{i-1}) = ( c(w_{i-1} w_i) + 1 ) / ( c(w_{i-1}) + V )
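The sketch below puts the two estimators side by side. The counts and vocabulary size are illustrative assumptions (chosen so the MLE value roughly matches P(english|want) = 0.0011 from the Examples that follow), not figures given in these slides.

```python
# Sketch: MLE vs. add-1 (Laplace) smoothed bigram estimates.
# Counts and vocabulary size below are illustrative assumptions.
bigram_count = {("want", "english"): 1}     # c(w_{i-1}, w_i), illustrative
unigram_count = {"want": 927}               # c(w_{i-1}), illustrative
V = 1446                                    # vocabulary size, illustrative

def p_mle(w, prev):
    """MLE: c(prev, w) / c(prev) -- zero for unseen bigrams."""
    return bigram_count.get((prev, w), 0) / unigram_count[prev]

def p_add1(w, prev):
    """Add-1: (c(prev, w) + 1) / (c(prev) + V) -- never zero."""
    return (bigram_count.get((prev, w), 0) + 1) / (unigram_count[prev] + V)

print(p_mle("english", "want"))    # 1/927  ≈ 0.0011
print(p_add1("english", "want"))   # 2/2373 ≈ 0.00084
print(p_mle("house", "want"))      # 0.0 -- unseen bigram
print(p_add1("house", "want"))     # 1/2373 ≈ 0.0004, no longer zero
```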
Examples
• Here are a few other useful probabilities:
  P(i|<s>) = 0.25         P(english|want) = 0.0011
  P(food|english) = 0.5   P(</s>|food) = 0.68
• Add-1 smoothed probabilities: P(i|<s>) = 0.19 and P(</s>|food) = 0.40
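To get a feel for the effect of smoothing on a whole sentence, the sketch below recomputes the earlier chain product with the add-1 values substituted for the two factors listed above; the remaining factors (including the assumed P(want|i) = 0.33) are left unchanged only because no smoothed values for them are given here.

```python
# Sketch: effect of add-1 smoothing on the sentence probability.
# Only P(i|<s>) and P(</s>|food) have smoothed values in the slides;
# the other factors (including the assumed P(want|i) = 0.33) are left
# unchanged purely for illustration.
unsmoothed = [0.25, 0.33, 0.0011, 0.5, 0.68]   # P(i|<s>) ... P(</s>|food)
smoothed   = [0.19, 0.33, 0.0011, 0.5, 0.40]   # add-1 values substituted

def product(factors):
    result = 1.0
    for f in factors:
        result *= f
    return result

print(product(unsmoothed))  # ≈ 3.1e-05
print(product(smoothed))    # ≈ 1.4e-05 -- probability mass moved to unseen bigrams
```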
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
  – It should assign higher probability to "real" or "frequently observed" sentences than to "ungrammatical" or "rarely observed" sentences.
• We train the parameters of our model on a training set.
• We test the model's performance on data we haven't seen.
  – A test set is an unseen dataset, different from our training set and totally unused during training.
  – An evaluation metric tells us how well our model does on the test set.
Perplexity
• The best language model is the one that best predicts an unseen test set.
• Perplexity is the inverse probability of the test set, normalized by the number of words N:
  PP(W) = P(w_1 w_2 … w_N)^(-1/N)
• Chain rule:
  PP(W) = ( ∏_{i=1}^{N} 1 / P(w_i | w_1 … w_{i-1}) )^(1/N)
• For bigrams:
  PP(W) = ( ∏_{i=1}^{N} 1 / P(w_i | w_{i-1}) )^(1/N)
• Lower perplexity means a better model.
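A minimal sketch of computing bigram perplexity on one test sentence is shown below; the bigram probability table is a toy assumption, and the computation is done in log space to avoid numerical underflow on longer test sets.

```python
# Sketch: bigram perplexity of a test sentence, computed in log space
# to avoid underflow. The probability table is a toy assumption
# (P(want|i) = 0.33 is an assumed placeholder).
import math

bigram_prob = {
    ("<s>", "i"): 0.25, ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5, ("food", "</s>"): 0.68,
}

def perplexity(words, probs):
    """PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N)."""
    tokens = ["<s>"] + [w.lower() for w in words] + ["</s>"]
    log_prob = 0.0
    n = 0
    for prev, w in zip(tokens, tokens[1:]):
        log_prob += math.log(probs[(prev, w)])
        n += 1
    return math.exp(-log_prob / n)

print(perplexity("I want english food".split(), bigram_prob))  # ≈ 7.98
```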
Solution: Smoothing Techniques
• Any n-gram that never appears in the training set gets zero probability under MLE, which makes the test-set probability zero and the perplexity infinite; smoothing (e.g., add-1) fixes this by reserving probability mass for unseen n-grams.
Markov Models
• The assumption that the probability of a word depends only on the previous word is called a Markov assumption.
• Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.
  P(w_1 w_2 … w_n) ≈ ∏_i P(w_i | w_{i-k} … w_{i-1})
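As a sketch of how the Markov order k shows up in code, the snippet below generalizes the earlier bigram scorer to a context of the previous k words; k = 1 recovers the bigram model used throughout this lecture, and the probability lookup is assumed rather than taken from these slides.

```python
# Sketch: a k-th order Markov (n-gram) model conditions each word only on
# the previous k words. k = 1 is the bigram model used in this lecture.
def sentence_prob(words, prob, k=1):
    """P(w_1 ... w_n) ≈ prod_i P(w_i | w_{i-k} ... w_{i-1}).

    `prob(context, word)` is an assumed lookup returning P(word | context).
    """
    tokens = ["<s>"] * k + list(words) + ["</s>"]
    p = 1.0
    for i in range(k, len(tokens)):
        context = tuple(tokens[i - k:i])
        p *= prob(context, tokens[i])
    return p

# Usage with a toy bigram (k = 1) lookup; 0.33 is an assumed value.
table = {(("<s>",), "i"): 0.25, (("i",), "want"): 0.33,
         (("want",), "english"): 0.0011, (("english",), "food"): 0.5,
         (("food",), "</s>"): 0.68}
print(sentence_prob("i want english food".split(), lambda c, w: table[(c, w)]))
# ≈ 3.1e-05, matching the earlier bigram computation
```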