
N-Gram Language Modeling

Mausam
(Based on slides of Michael Collins, Dan Jurafsky, Dan Klein,
Chris Manning, Luke Zettlemoyer)
Outline
• Motivation
• Task Definition
• N-Gram Probability Estimation
• Evaluation

The Language Modeling Problem
 Setup: Assume a (finite) vocabulary of words

 We can construct an (infinite) set of strings

 Data: given a training set of example sentences


 Problem: estimate a probability distribution
The Noisy-Channel Model
• We want to predict a sentence given acoustics:

  w* = argmax_w P(w | a)

• The noisy channel approach:

  w* = argmax_w P(w | a) = argmax_w P(a | w) P(w)

• Acoustic model: distributions over acoustic waves given a sentence, P(a | w)
• Language model: distributions over sequences of words (sentences), P(w)
Acoustically Scored Hypotheses

the station signs are in deep in english -14732


the stations signs are in deep in english -14735
the station signs are in deep into english -14739
the station 's signs are in deep in english -14740
the station signs are in deep in the english -14741
the station signs are indeed in english -14757
the station 's signs are indeed in english -14760
the station signs are indians in english -14790
the station signs are indian in english -14799
the stations signs are indians in english -14807
the stations signs are indians and english -14815
ASR System Components
• Language model (source): P(w)
• Acoustic model (channel): P(a | w)
• Decoder: from the observed acoustics a, find the best word sequence w:

  w* = argmax_w P(w | a) = argmax_w P(a | w) P(w)
MT System Components
• Language model (source): P(e)
• Translation model (channel): P(f | e)
• Decoder: from an observed foreign sentence f, find the best English sentence e:

  e* = argmax_e P(e | f) = argmax_e P(f | e) P(e)
Probabilistic Language Models: Other Applications
• Why assign a probability to a sentence?
• Machine Translation:
• P(high winds tonite) > P(large winds tonite)
• Speech Recognition
• P(I saw a van) >> P(eyes awe of an)
• Spell Correction
• The office is about fifteen minuets from my house
• P(about fifteen minutes from) > P(about fifteen minuets from)
• + Summarization, question-answering, etc., etc.!!
Outline
• Motivation
• Task Definition
• N-Gram Probability Estimation
• Evaluation

Probabilistic Language Modeling
• Goal: compute the probability of a sentence or
sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)

• Related task: probability of an upcoming word:


P(w5|w1,w2,w3,w4)
• A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
How to compute P(W)
• How to compute this joint probability:
  P(its, water, is, so, transparent, that)

• Intuition: rely on the chain rule of probability:
  P(w1 w2 … wn) = ∏i P(wi | w1 … wi-1)

• For example:
  P("its water is so transparent") =
    P(its) × P(water | its) × P(is | its water)
    × P(so | its water is) × P(transparent | its water is so)
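To make the chain-rule scoring concrete, here is a minimal Python sketch (mine, not from the slides); cond_prob stands for a hypothetical estimator of P(word | history):

def sentence_prob(words, cond_prob):
    # Chain rule: multiply P(w_i | w_1 ... w_{i-1}) over the sentence.
    prob = 1.0
    for i, w in enumerate(words):
        history = tuple(words[:i])
        prob *= cond_prob(w, history)  # hypothetical conditional-probability lookup
    return prob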
How to estimate these probabilities
• Could we just count and divide?

  P(the | its water is so transparent that) =
    Count(its water is so transparent that the) / Count(its water is so transparent that)

• No! Too many possible sentences!
• We’ll never see enough data for estimating these probabilities
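As a rough illustration of this count-and-divide (maximum-likelihood) idea, here is a sketch over a flat list of corpus tokens; the names are illustrative, not from the slides:

def count_and_divide(corpus_tokens, history, word):
    # MLE estimate: Count(history + word) / Count(history).
    n = len(history)
    history = tuple(history)
    full = history + (word,)
    history_count = sum(
        1 for i in range(len(corpus_tokens) - n + 1)
        if tuple(corpus_tokens[i:i + n]) == history
    )
    full_count = sum(
        1 for i in range(len(corpus_tokens) - n)
        if tuple(corpus_tokens[i:i + n + 1]) == full
    )
    return full_count / history_count if history_count else 0.0

For a long history such as "its water is so transparent that", both counts are almost always zero on any realistic corpus, which is exactly the sparsity problem above.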
Markov Assumption

• Simplifying assumption (due to Andrei Markov):
  P(the | its water is so transparent that) ≈ P(the | that)

• Or maybe:
  P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption

  P(w1 w2 … wn) ≈ ∏i P(wi | wi-k … wi-1)

• In other words, we approximate each component in the product:

  P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)
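Under the Markov assumption, the chain-rule sketch above changes only in how much history it keeps; k is the assumed order of the model:

def sentence_prob_markov(words, cond_prob, k=2):
    # Same chain rule, but each history is truncated to the last k words.
    prob = 1.0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - k):i])
        prob *= cond_prob(w, history)  # hypothetical conditional-probability lookup
    return prob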
Simplest Case: Unigram Models
• Simplest case: unigrams
  P(w1 w2 … wn) ≈ ∏i P(wi)
• Generative process: pick a word, pick a word, … until you pick </s>
• Graphical model:
  w1   w2   …   wn-1   </s>
• Examples:
• fifth, an, of, futures, the, an, incorporated, a, a, the,
inflation, most, dollars, quarter, in, is, mass
• thrift, did, eighty, said, hard, 'm, july, bullish
• that, or, limited, the

• Big problem with unigrams: P(the the the the) >> P(I like ice cream)!
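A minimal unigram sketch (illustrative, assuming the training tokens include the </s> symbol used on the slide): estimate P(w) by relative frequency, then pick words independently until </s> comes up:

import random
from collections import Counter

def train_unigram(corpus_tokens):
    # P(w) = Count(w) / total number of tokens.
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def generate_unigram(probs, eos="</s>", max_len=50):
    # Pick a word, pick a word, ... until </s> (or a length cap).
    vocab = list(probs)
    weights = [probs[w] for w in vocab]
    words = []
    while len(words) < max_len:
        w = random.choices(vocab, weights=weights)[0]
        if w == eos:
            break
        words.append(w)
    return words

Because every word is drawn independently, the samples look like the word salad in the examples above.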
Bigram Models
• Conditioned on previous single word

  P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)

• Generative process: pick <s>, pick a word conditioned on the previous one,
  repeat until you pick </s>

• Graphical model: <s> → w1 → w2 → … → wn-1 → </s>


• Examples:
• texaco, rose, one, in, this, issue, is, pursuing, growth, in, a,
boiler, house, said, mr., gurria, mexico, 's, motion, control,
proposal, without, permission, from, five, hundred, fifty, five,
yen
• outside, new, car, parking, lot, of, the, agreement, reached
• this, would, be, a, record, november
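A sketch of maximum-likelihood bigram estimation by relative frequencies of adjacent word pairs, using the <s> and </s> boundary symbols from the generative process above (function names are illustrative):

from collections import Counter, defaultdict

def train_bigram(sentences):
    # sentences: list of token lists; returns nested dicts giving P(w | prev).
    pair_counts = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            pair_counts[prev][w] += 1
    return {
        prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
        for prev, nxt in pair_counts.items()
    }

For example, training on the single sentence ["the", "station", "signs", "are", "indeed", "in", "english"] gives P(signs | station) = 1.0.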
N-Gram Models
• We can extend to trigrams, 4-grams, 5-grams
• N-gram models are (weighted) regular languages
• Many linguistic arguments that language isn’t regular.
• Long-distance effects: “The computer which I had just put into the
machine room on the fifth floor ___.”
• Recursive structure
• We often get away with n-gram models

• PCFG LM (later):
• [This, quarter, ‘s, surprisingly, independent, attack, paid, off,
the, risk, involving, IRS, leaders, and, transportation, prices, .]
• [It, could, be, announced, sometime, .]
• [Mr., Toseland, believes, the, average, defense, economy, is,
drafted, from, slightly, more, than, 12, stocks, .]
Outline
• Motivation
• Task Definition
• N-Gram Probability Estimation
• Evaluation

Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
• Assign higher probability to “real” or “frequently observed” sentences
• Than “ungrammatical” or “rarely observed” sentences?
• We train parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
• A test set is an unseen dataset that is different from our training set,
totally unused.
• An evaluation metric tells us how well our model does on the test set.
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
• Put each model in a task
• spelling corrector, speech recognizer, MT system
• Run the task, get an accuracy for A and for B
• How many misspelled words corrected properly
• How many words translated correctly
• Compare accuracy for A and B
Difficulty of extrinsic (in-vivo) evaluation of N-gram models
• Extrinsic evaluation
• Time-consuming; requires building applications, new data
• So
• Sometimes use intrinsic evaluation: perplexity
• Bad approximation
• unless the test data looks just like the training data
• So generally only useful in pilot experiments
• But is helpful to think about.
Intuition of Perplexity
• The Shannon Game: how well can we predict the next word?
  I always order pizza with cheese and ____
  The 33rd President of the US was ____
  I saw a ____
• For the pizza example, a model might assign:
  mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100
• Unigrams are terrible at this game. (Why?)
• A better model of a text is one which assigns a higher probability to the word
  that actually occurs
Perplexity
• The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)

• Perplexity is the inverse probability of the test set, normalized by the
  number of words:

  PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)

• By the chain rule:
  PP(W) = (∏i 1 / P(wi | w1 … wi-1))^(1/N)

• For bigrams:
  PP(W) = (∏i 1 / P(wi | wi-1))^(1/N)

• Minimizing perplexity is the same as maximizing probability
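A sketch of computing test-set perplexity under a bigram model such as the train_bigram sketch earlier; it accumulates log probabilities to avoid underflow and does not handle unseen bigrams (smoothing is a separate topic):

import math

def bigram_perplexity(sentences, probs):
    # probs: nested dicts giving P(w | prev), e.g. from train_bigram above.
    log_prob, n_words = 0.0, 0
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            log_prob += math.log(probs[prev][w])  # raises KeyError on unseen bigrams
            n_words += 1
    return math.exp(-log_prob / n_words)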


The Shannon Game intuition for perplexity
• From Josh Goodman
• How hard is the task of recognizing the digits ‘0,1,2,3,4,5,6,7,8,9’?
• Perplexity = 10
• How hard is recognizing 30,000 names at Microsoft?
• Perplexity = 30,000
• If a system has to recognize
• Operator (1 in 4)
• Sales (1 in 4)
• Technical Support (1 in 4)
• 30,000 names (1 in 120,000 each)
• Perplexity is 53
• Perplexity is weighted equivalent branching factor
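As a quick check of the 53 figure (my arithmetic, not on the slide), the perplexity here is 2 raised to the entropy of the distribution over outcomes:

import math

# Operator, Sales, Technical Support at 1/4 each; 30,000 names at 1/120,000 each.
probs = [0.25, 0.25, 0.25] + [1 / 120000] * 30000
entropy = -sum(p * math.log2(p) for p in probs)
print(2 ** entropy)  # ≈ 52.6, i.e. roughly 53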
Perplexity as branching factor
• Let’s suppose a sentence consisting of random digits
• What is the perplexity of this sentence according to a model that assigns
  P = 1/10 to each digit?
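Working it out with the definition above: PP(W) = ((1/10)^N)^(-1/N) = (1/10)^(-1) = 10, so the perplexity equals the branching factor of ten equally likely digits.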
Another form of Perplexity

• Perplexity can equivalently be computed from log probabilities:
  PP = 2^(-l),  where  l = (1/M) ∑i log2 P(si)
  (M = total words in the test data, si = the i-th test sentence)
• Lower is better!
• Example:
• uniform model over a vocabulary of N words → perplexity is N
• Interpretation: effective vocabulary size (accounting for statistical regularities)
• Typical values for newspaper text:
• Uniform: 20,000; Unigram: 1000s; Bigram: 700-1000; Trigram: 100-200
• Important note:
• It’s easy to get bogus perplexities by having bogus probabilities that sum to
  more than one over their event spaces. Be careful!
Lower perplexity = better model

• Training 38 million words, test 1.5 million words, WSJ

N-gram Order    Unigram    Bigram    Trigram
Perplexity      962        170       109
