13: N-gram Language Models
Mausam
(Based on slides of Michael Collins, Dan Jurafsky, Dan Klein,
Chris Manning, Luke Zettlemoyer)
Outline
• Motivation
• Task Definition
• N-Gram Probability Estimation
• Evaluation
The Language Modeling Problem
Setup: Assume a (finite) vocabulary of words
[Figure: two noisy-channel decoders that motivate language models. In each, a source model generates the hidden sequence, a channel model generates the observation, and the decoder recovers the most likely source given the observation.
• Speech recognition: source P(w) over word sequences w, channel P(a|w) over acoustics a; the decoder finds the best w given the observed a.
• Machine translation: source P(e) over English sentences e, channel P(f|e) over foreign sentences f; the decoder finds the best e given the observed f.]
Probabilistic Language Modeling
• Goal: compute the probability of a sentence or
sequence of words:
P(W) = P(w_1, w_2, w_3, w_4, w_5, …, w_n)
• Simplifying assumption (the Markov assumption, due to Andrei Markov):
P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
$$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})$$
• Special case k = 0 (unigrams): each word is generated independently of its context; a short code sketch of the general factorization appears below.
• Big problem with unigrams: P(the the the the) >> P(I like ice cream)!
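To make the factorization concrete, here is a minimal Python sketch (not from the original slides): it scores a sentence as a product of conditional probabilities under a k-th order Markov assumption. The `cond_prob` table is a toy, hand-picked assumption standing in for whatever estimator actually supplies the conditionals.

```python
# Minimal sketch: sentence probability under a k-th order Markov assumption.
# The conditional probabilities here are made-up toy numbers, not estimates
# from any real corpus.

def sentence_prob(words, cond_prob, k=1, bos="<s>", eos="</s>"):
    """P(w_1 ... w_n) ~= prod_i P(w_i | w_{i-k} ... w_{i-1})."""
    padded = [bos] * k + list(words) + [eos]
    prob = 1.0
    for i in range(k, len(padded)):
        context = tuple(padded[i - k:i])
        prob *= cond_prob.get((context, padded[i]), 0.0)
    return prob

# Toy bigram-style table (k = 1): P(word | previous word)
cond_prob = {
    (("<s>",), "I"): 0.5,
    (("I",), "like"): 0.4,
    (("like",), "ice"): 0.2,
    (("ice",), "cream"): 0.9,
    (("cream",), "</s>"): 0.8,
}

print(sentence_prob(["I", "like", "ice", "cream"], cond_prob, k=1))
```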
Bigram Models
• Conditioned on previous single word
$$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$$
• Generative process: pick <s>, pick a word conditioned on the previous one,
repeat until </s> is picked (sketched in code after the examples below)
• PCFG LM (later):
• [This, quarter, ‘s, surprisingly, independent, attack, paid, off,
the, risk, involving, IRS, leaders, and, transportation, prices, .]
• [It, could, be, announced, sometime, .]
• [Mr., Toseland, believes, the, average, defense, economy, is,
drafted, from, slightly, more, than, 12, stocks, .]
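As a sketch of the generative process described above (pick <s>, then sample each word conditioned on the previous one until </s> is picked), the snippet below is an illustration rather than the authors' code: it estimates bigram probabilities by maximum likelihood from a tiny made-up corpus and then samples a sentence.

```python
import random
from collections import defaultdict

# Toy corpus; each sentence is padded with <s> and </s> markers.
corpus = [
    "<s> I like ice cream </s>",
    "<s> I like pizza </s>",
    "<s> you like ice cream </s>",
]

# Maximum-likelihood bigram estimates: P(w | prev) = count(prev, w) / count(prev)
bigram_counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    tokens = sent.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def generate(max_len=20):
    """Pick <s>, then repeatedly sample the next word conditioned on the
    previous one, stopping when </s> is picked."""
    sentence, prev = [], "<s>"
    for _ in range(max_len):
        words = list(bigram_counts[prev])
        weights = [bigram_counts[prev][w] for w in words]
        word = random.choices(words, weights=weights)[0]
        if word == "</s>":
            break
        sentence.append(word)
        prev = word
    return " ".join(sentence)

print(generate())
```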
Outline
• Motivation
• Task Definition
• N-Gram Probability Estimation
• Evaluation
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
• Assign higher probability to “real” or “frequently observed” sentences
• Than “ungrammatical” or “rarely observed” sentences?
• We train parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
• A test set is an unseen dataset that is different from our training set,
totally unused.
• An evaluation metric tells us how well our model does on the test set.
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
• Put each model in a task
• spelling corrector, speech recognizer, MT system
• Run the task, get an accuracy for A and for B
• How many misspelled words corrected properly
• How many words translated correctly
• Compare accuracy for A and B
Difficulty of extrinsic (in-vivo) evaluation of N-gram models
• Extrinsic evaluation
• Time-consuming; requires building applications, new data
• So
• Sometimes use intrinsic evaluation: perplexity
• Bad approximation
• unless the test data looks just like the training data
• So generally only useful in pilot experiments
• But is helpful to think about.
Intuition of Perplexity
• The Shannon Game: How well can we predict the next word?
• I always order pizza with cheese and ____
• The 33rd President of the US was ____
• I saw a ____
[Example continuations for the pizza sentence: mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100]
• Unigrams are terrible at this game. (Why?)
• A better model of a text
• is one which assigns a higher probability to the word that actually occurs
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
• Perplexity is the inverse probability of the test set, normalized by the number of words:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$$

• Chain rule:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$$

• For bigrams:

$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
• Lower is better!
• Example:
• uniform model: perplexity equals the vocabulary size
• Interpretation: effective vocabulary size (accounting for statistical regularities)
• Typical values for newspaper text:
• Uniform: 20,000; Unigram: 1000s, Bigram: 700-1000, Trigram: 100-200
• Important note:
• It's easy to get bogus perplexities from bogus probabilities that sum to
more than one over their event spaces. Be careful! (See the sketch below.)
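As a minimal sketch (not from the slides) of how the bigram perplexity formula above is computed in practice, the snippet below works with log probabilities for numerical stability. The probability table is a toy assumption, and a real model would need smoothing so that unseen bigrams don't force infinite perplexity.

```python
import math

# Toy bigram probabilities P(w | prev); made-up numbers for illustration.
bigram_prob = {
    ("<s>", "I"): 0.5, ("I", "like"): 0.4, ("like", "ice"): 0.2,
    ("ice", "cream"): 0.9, ("cream", "</s>"): 0.8,
}

def perplexity(sentences):
    """PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1})), where N is the
    total number of predicted tokens (including </s>)."""
    log_prob, n_tokens = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            p = bigram_prob.get((prev, word), 0.0)
            if p == 0.0:
                return float("inf")  # unseen bigram: needs smoothing
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

print(perplexity(["I like ice cream"]))  # ~2.03 on this toy example
```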
Lower perplexity = better model