Chapter Four
N-Grams
• Problem of word prediction.
• Example: “I’d like to make a collect …”
– Very likely words:
• “call”,
• “international call”, or
• “phone call”, and NOT
• “the”.
• The idea of word prediction is formalized with probabilistic models called N-grams.
– N-grams predict the next word from the previous N-1 words.
– Statistical models of word sequences are also called language models or LMs.
• Computing probability of the next word will turn out to be closely related to computing
the probability of a sequence of words.
• Example:
– “… all of a sudden I notice three guys standing on the sidewalk …”, vs.
– “… on guys all I of notice sidewalk three a sudden standing the …”
N-grams
• Estimators like N-grams that assign a conditional probability to
possible next words can be used to assign a joint probability to an
entire sentence.
• N-gram models are one of the most important tools in speech and
language processing.
• The Brown corpus is a 1 million word collection of samples from 500 written texts of different genres (newspapers, novels, non-fiction, academic, etc.) assembled at Brown University in 1963-1964.
– Example sentence from Brown corpus:
• He stepped out into the hall, was delighted to encounter a water brother.
• 13 words if we don't count punctuation marks as words – 15 if we do.
• Treatment of “,” and “.” depends on the task.
• Punctuation marks are critical for identifying the boundaries of things (, . ;) and for identifying some aspects of meaning (? ! ”).
• For some tasks (part-of-speech tagging, parsing, or sometimes speech synthesis) punctuation marks are treated as separate words.
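• A minimal Python sketch (added for illustration; the simple regular-expression tokenizer is an assumption, not the course's tool) showing how the word count of the Brown example sentence above changes depending on whether punctuation marks are treated as words:

    import re

    # Example sentence from the Brown corpus (quoted earlier on this slide).
    sentence = "He stepped out into the hall, was delighted to encounter a water brother."

    # Split into word tokens and single punctuation tokens.
    tokens_with_punct = re.findall(r"\w+|[^\w\s]", sentence)
    # Keep only the word tokens, dropping punctuation.
    tokens_without_punct = [t for t in tokens_with_punct if re.match(r"\w", t)]

    print(len(tokens_without_punct))  # 13 words when punctuation is not counted
    print(len(tokens_with_punct))     # 15 tokens when punctuation marks count as words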
Corpora & Counting Words
• Switchboard Corpus – collection of 2430 telephone conversations
averaging 6 minutes each – total of 240 hours of speech with about 3
million words.
– This kind of corpus does not have punctuation.
– Complications arise in defining words and sentences:
– Example:
• I do uh main- mainly business data processing.
Discrete Probability Distributions
• Sample space: $S = \{x_1, x_2, x_3, \ldots, x_N\}$
– For each element x of the set S, x ∊ S, a probability value P(x) is assigned as a function of x, with the following properties:
1. $P(x) \in [0,1], \ \forall x \in S$
2. $\sum_{x \in S} P(x) = 1$
Discrete Probability Distributions
• An event is defined as any subset E of the sample space S; E ⊆ S.
• The probability of the event E is defined as:
$P(E) = \sum_{x \in E} P(x)$
• Probability of the entire space S is 1 as indicated by property 2 in the
previous slide.
• Probability of the empty or null event is 0.
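• A small worked example (added for illustration): for a fair six-sided die,
$S = \{1, 2, 3, 4, 5, 6\}, \qquad P(x) = \tfrac{1}{6} \ \forall x \in S$
and for the event E = “an even number” = {2, 4, 6},
$P(E) = \sum_{x \in E} P(x) = \tfrac{1}{6} + \tfrac{1}{6} + \tfrac{1}{6} = \tfrac{1}{2}$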
Venn Diagram
[Figure: Venn diagram of two events A and B]
Elementary Theorems of Probability
• If A is any event in S, then
– P(A’) = 1-P(A)
where A’ is the complement of A, the set of all outcomes in S that are not in A.
• Proof:
– P(A∪A’) = P(A)+P(A’), since A and A’ are disjoint, and
– P(A∪A’) = P(S) = 1, therefore
– P(A)+P(A’) = 1
Elementary Theorems of Probability
• If A and B are any events in S, then
– P(A∪B) = P(A)+P(B)- P(A∩B),
• Proof:
– P(A∪B) = P(A∩B’) + P(A∩B) + P(A’∩B)
– P(A∪B) = [P(A∩B’) + P(A∩B)] + [P(A’∩B) + P(A∩B)] - P(A∩B)
– P(A∪B) = P(A) + P(B) - P(A∩B)
Venn Diagram
[Figure: Venn diagram of A∪B for two events A and B in the sample space S]
Conditional Probability
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$
$P(A \cap B) = P(A \mid B)\,P(B) \quad \text{if } P(B) \neq 0$
$P(A \cap B) = P(B \mid A)\,P(A) \quad \text{if } P(A) \neq 0$
Independent Events
• If A and B are independent events then :
$P(A \cap B) = P(A \mid B)\,P(B) = P(A)\,P(B)$
$P(A \cap B) = P(B \mid A)\,P(A) = P(B)\,P(A)$
• In probability theory, to say that two events are independent intuitively means that the occurrence of one event makes it neither more nor less probable that the other will occur.
Bayes Rule
• If B1, B2, B3, …, Bn are mutually exclusive events of which one must occur, that is $\sum_{i=1}^{n} P(B_i) = 1$, then
$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j)\,P(B_j)} \qquad \text{for } i = 1, 2, 3, \ldots, n$
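• A short worked example (added for illustration; the numbers are assumptions): suppose B1 and B2 are two equally likely, mutually exclusive classes with P(B1) = P(B2) = 0.5, and an observation A with P(A | B1) = 0.8 and P(A | B2) = 0.2. Then
$P(B_1 \mid A) = \frac{0.8 \times 0.5}{0.8 \times 0.5 + 0.2 \times 0.5} = \frac{0.40}{0.50} = 0.8$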
Simple (Unsmoothed) N-Grams
• Our goal is to compute the probability of a word w given some history h:
P(w|h).
• Example:
– h ⇒ “its water is so transparent that”
– w ⇒ “the”
– P(the | its water is so transparent that)
• How can we compute this probability?
– One way is to estimate it from relative frequency counts.
– From a very large corpus, count the number of times we see “its water is so transparent that” and the number of times it is followed by “the”. That is: out of the times we saw the history h, how many times was it followed by the word w?
$P(\text{the} \mid \text{its water is so transparent that}) = \frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})}$
Chain Rule
• A sequence of n words is written $w_1, w_2, \ldots, w_n$ or $w_1^n$.
• The chain rule of probability decomposes the joint probability of the sequence:
$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
Chain Rule
• The chain rule provides the link between computing
the joint probability of a sequence and computing the
conditional probability of a word given previous
words.
– The equation presented in the previous slide provides a way of computing the joint probability estimate of an entire sequence as a product of conditional probabilities.
– However, we still do not know any way of computing the exact probability of a word given a long sequence of preceding words: $P(w_n \mid w_1^{n-1})$.
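• A minimal Python sketch (added for illustration; the toy conditional probabilities are assumptions, not estimates from any corpus) showing how the chain rule multiplies conditional probabilities to score a word sequence:

    # Toy conditional probabilities P(word | full history); the values are
    # assumptions chosen only to illustrate the chain-rule product.
    cond_prob = {
        ("its",): 0.02,                # P(its)
        ("its", "water"): 0.10,        # P(water | its)
        ("its", "water", "is"): 0.30,  # P(is | its water)
    }

    def sequence_probability(words, cond_prob):
        # Chain rule: P(w_1^n) = prod_k P(w_k | w_1^{k-1})
        p = 1.0
        for k in range(1, len(words) + 1):
            p *= cond_prob[tuple(words[:k])]
        return p

    print(sequence_probability(["its", "water", "is"], cond_prob))  # 0.02 * 0.10 * 0.30 = 0.0006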
N-grams
• Approximation:
– The idea of the N-gram model is to approximate the history by just the last few words instead of computing the probability of a word given its entire history.
• Bigram:
– The bigram model approximates the probability of a word given all the previous words $P(w_n \mid w_1^{n-1})$ by the conditional probability given only the preceding word:
$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$
• The assumption that the conditional probability of a word depends only on the previous word is called a Markov assumption.
• Markov models are the class of probabilistic models that
assume that we can predict the probability of some future
unit without looking too far into the past.
Bi-gram Generalization
• Tri-gram: looks two words into the past
• N-gram: looks N-1 words into the past.
• General equation for N-gram approximation to the conditional
probability of the next word in a sequence is:
$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$
• The simplest and most intuitive way to estimate probabilities is
the method called Maximum Likelihood Estimation or MLE
for short.
Maximum Likelihood Estimation
• MLE is a method that provides a solution to the parameter estimation of a probability distribution function.
– The best estimate of the parameter values is defined to
be the one that maximizes the probability of obtaining
the samples actually observed.
Maximum Likelihood Estimation for N-Grams
• MLE estimate for the parameters of an N-gram model is done
by taking counts from the training data of a corpus, and
normalizing them so they lie between 0 and 1.
Bi-Gram
• To compute a particular bigram probability of a word y given a previous word x, the count C(xy) is computed and normalized by the sum of the counts of all bigrams that share the same first word x:
$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)}$
MLE for N-Grams
• The general MLE estimate for the N-gram conditional probability is:
$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
• For the bigram case (N = 2) this reduces to $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$, since the sum of all bigram counts that start with a given word $w_{n-1}$ equals the unigram count of $w_{n-1}$.
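• A minimal Python sketch (added for illustration; the two-sentence toy corpus is an assumption) of the MLE bigram estimate C(w_{n-1} w_n) / C(w_{n-1}):

    from collections import Counter

    # Tiny toy corpus (an assumption for illustration); sentences are padded
    # with <s> and </s> boundary markers.
    corpus = [
        "<s> i want to eat chinese food </s>",
        "<s> i want to spend </s>",
    ]

    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        unigram_counts.update(words)
        bigram_counts.update(zip(words, words[1:]))

    def mle_bigram(prev_word, word):
        # MLE estimate: C(prev_word word) / C(prev_word)
        return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

    print(mle_bigram("i", "want"))   # 2/2 = 1.0
    print(mle_bigram("to", "eat"))   # 1/2 = 0.5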
Berkeley Restaurant Project Bigram Counts
          i     want  to    eat   chinese  food  lunch  spend
i         5     827   0     9     0        0     0      2
want      2     0     608   1     6        6     5      1
to        2     0     4     686   2        0     6      211
eat       0     0     2     0     16       2     42     0
chinese   1     0     0     0     0        82    1      0
food      15    0     15    0     1        4     0      0
lunch     2     0     0     0     0        1     0      0
spend     1     0     1     0     0        0     0      0
Unigram Counts
[Table: unigram counts for the same eight words, used as the denominators when computing bigram probabilities]
Perplexity
• The perplexity (PP) of a language model on a test set W = w1 w2 … wN is the probability of the test set normalized by the number of words N:
$PP(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \cdots w_N)}}$
Definition of Perplexity
• We can use the chain rule to expand the probability of
W:
$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$
• Consider the task of recognizing the digits in English (zero, one, two, ..., nine), given that each of the 10 digits occurs with equal probability P = 1/10. The perplexity of this language is in fact 10.
• To see that, imagine a string of digits of length N. By the equation presented in the previous slide, the perplexity will be:
$PP(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-\frac{1}{N}} = \left(\tfrac{1}{10}\right)^{-1} = 10$
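• A short Python sketch (added for illustration, assuming the uniform 10-digit model above) that computes perplexity from per-word probabilities in log space and confirms the value of 10:

    import math

    def perplexity(word_probs):
        # PP(W) = (prod_i P(w_i | history))^(-1/N), computed in log space
        # to avoid numerical underflow on long sequences.
        n = len(word_probs)
        log_prob = sum(math.log(p) for p in word_probs)
        return math.exp(-log_prob / n)

    # A test string of N = 100 digits, each assigned probability 1/10
    # by the uniform digit language model.
    digit_probs = [1.0 / 10] * 100
    print(perplexity(digit_probs))  # 10.0 (up to floating-point error)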
Interpretation of Perplexity
• Exercise: Suppose that the number zero is really frequent and occurs
10 times more often than other numbers.
– Show that the perplexity is lower, as expected, since most of the time the next number will be zero.
– The branching factor, however, is still the same for the digit recognition task (i.e. 10).
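• One way to check this numerically (a sketch under the assumption that “zero” has probability 10/19 and each other digit 1/19): the per-word perplexity of a typical test string comes out at roughly 5.7, well below 10, even though there are still 10 possible next digits.

    import math

    # Assumed distribution for the exercise: "zero" is ten times as likely
    # as each of the nine other digits.
    p_zero = 10.0 / 19
    p_other = 1.0 / 19

    # Perplexity of a long, typical test string (each digit appearing in
    # proportion to its probability) is exp(H), where H is the per-word entropy.
    entropy = -(p_zero * math.log(p_zero) + 9 * p_other * math.log(p_other))
    print(math.exp(entropy))  # about 5.7, lower than the branching factor of 10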
Interpretation of Perplexity
• Perplexity is also related to the information-theoretic notion of entropy, as will be shown later in this chapter.
Example of Perplexity Use
• Perplexity is used in the following example to compare three N-gram models.
• Unigram, bigram, and trigram grammars are trained on 38 million words (including start-of-sentence tokens) of the WSJ corpus with a 19,979 word vocabulary.
– Perplexity is computed on a test set of 1.5 million words via the equation presented in the slide “Definition of Perplexity”, and the results are summarized in the table below:
[Table: test-set perplexity of the unigram, bigram, and trigram models]
• For example, we defined above the closed vocabulary task, in which the vocabulary for the
test set is specified in advance. This can greatly reduce the perplexity. As long as this
knowledge is provided equally to each of the models we are comparing, the closed vocabulary
perplexity can still be useful for comparing models, but care must be taken in interpreting the
results. In general, the perplexity of two language models is only comparable if they use the
same vocabulary.
Smoothing
• There is a major problem with the maximum likelihood estimation
process we have seen for training the parameters of an N-gram
model.
– This is the problem of sparse data caused by the fact that our
maximum likelihood estimate was based on a particular set of training
data. For any N-gram that occurred a sufficient number of times, we
might have a good estimate of its probability. But because any corpus
is limited, some perfectly acceptable English word sequences are
bound to be missing from it.
1. This missing data means that the N-gram matrix for any given training
corpus is bound to have a very large number of cases of putative “zero
probability N-grams” that should really have some non-zero
probability.
2. Furthermore, the MLE method also produces poor estimates when the
counts are non-zero but still small.
Smoothing
• We need a method which can help get better estimates for these
zero or low frequency counts.
– Zero counts turn out to cause another huge problem.
• The perplexity metric defined above requires that we compute the
probability of each test sentence.
• But if a test sentence has an N-gram that never appeared in the
training set, the Maximum Likelihood estimate of the probability for
this N-gram, and hence for the whole test sentence, will be zero!
Laplace (Add-One) Smoothing
• The unsmoothed maximum likelihood estimate of the unigram probability of the word wi is its count ci normalized by the total number of word tokens N:
$P(w_i) = \frac{c_i}{N}$
• Laplace smoothing adds one to each count. Considering that there are V words in the vocabulary and each count is incremented, we also need to adjust the denominator to take into account the extra V observations in order to have legitimate probabilities:
$P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}$
Laplace Smoothing
• It is convenient to describe a smoothing algorithm in terms of an adjusted count c*, which can be turned into a probability by normalizing by N just like an MLE count:
$P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V} = \frac{(c_i + 1)\,\frac{N}{N + V}}{N} = \frac{c_i^*}{N}, \qquad c_i^* = (c_i + 1)\,\frac{N}{N + V}$
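• A minimal Python sketch (added for illustration; the toy counts are assumptions) of the add-one unigram estimate and the corresponding adjusted count c*:

    from collections import Counter

    # Toy unigram counts (assumptions); N is the number of tokens,
    # V the number of word types in the vocabulary.
    counts = Counter({"the": 5, "cat": 2, "sat": 1})
    N = sum(counts.values())  # 8 tokens
    V = len(counts)           # 3 types

    def p_laplace(word):
        # Add-one estimate: (c_i + 1) / (N + V)
        return (counts[word] + 1) / (N + V)

    def adjusted_count(word):
        # c* = (c_i + 1) * N / (N + V), so that P_Laplace = c* / N
        return (counts[word] + 1) * N / (N + V)

    print(p_laplace("the"))           # 6/11 ≈ 0.545
    print(adjusted_count("the"))      # ≈ 4.36
    print(adjusted_count("the") / N)  # equals p_laplace("the")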
Discounting
• A related way to view smoothing is as discounting (lowering) some
non-zero counts in order to get the correct probability mass that will be
assigned to the zero counts.
• Thus, instead of referring to the discounted counts c*, we might describe a smoothing algorithm in terms of a relative discount dc, the ratio of the discounted counts to the original counts:
$d_c = \frac{c^*}{c}$
Berkeley Restaurant Project Smoothed Bigram
Counts (V=1446)
          i     want  to    eat   chinese  food  lunch  spend
i         6     828   1     10    1        1     1      3
want      3     1     609   2     7        7     6      2
to        3     1     5     687   3        1     7      212
eat       1     1     3     1     17       3     43     1
chinese   2     1     1     1     1        83    2      1
food      16    1     16    1     2        5     1      1
lunch     3     1     1     1     1        2     1      1
spend     2     1     2     1     1        1     1      1
Smoothed Bigram Probabilities
• Recall that normal bigram probabilities are computed by normalizing each row of counts by the unigram count:
$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$
• For add-one smoothed bigram counts we need to augment the unigram
count by the number of total types in the vocabulary V:
$P^{*}_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}$
$c^{*}(w_{n-1} w_n) = \frac{\left[C(w_{n-1} w_n) + 1\right] \cdot C(w_{n-1})}{C(w_{n-1}) + V}$
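• A short Python sketch (added for illustration; the example counts are assumptions in the style of the tables above, not the actual corpus data) of the add-one bigram probability and the reconstituted count c*:

    from collections import Counter

    V = 1446  # vocabulary size quoted on the smoothed-counts slide

    # Hypothetical counts in the style of the tables above (assumptions,
    # not the actual Berkeley Restaurant Project data).
    bigram_count = Counter({("want", "to"): 608})
    unigram_count = Counter({"want": 927})

    def p_star(prev_word, word):
        # Add-one bigram probability: (C(prev word) + 1) / (C(prev) + V)
        return (bigram_count[(prev_word, word)] + 1) / (unigram_count[prev_word] + V)

    def c_star(prev_word, word):
        # Reconstituted count: (C(prev word) + 1) * C(prev) / (C(prev) + V)
        return ((bigram_count[(prev_word, word)] + 1) * unigram_count[prev_word]
                / (unigram_count[prev_word] + V))

    print(round(p_star("want", "to"), 2))  # ≈ 0.26
    print(round(c_star("want", "to")))     # ≈ 238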
Adjusted Counts Table
[Table: add-one reconstituted counts c* for the same eight words]
Good-Turing Discounting
• Good-Turing discounting relies on Nc, the number of N-gram types that occur c times (the frequency of frequency c):
$N_c = \sum_{x \,:\, \text{count}(x) = c} 1$
– The MLE count for an N-gram in bin Nc is c. The Good-Turing estimate replaces this with a smoothed count c*, computed as a function of Nc+1:
$c^{*} = (c + 1)\,\frac{N_{c+1}}{N_c}$
Good-Turing Discounting
• The equation presented in the previous slide can be used to replace the MLE counts for all the bins N1, N2, and so on. Instead of using this equation directly to re-estimate the smoothed count c* for N0, the following equation defines the probability PGT of the missing mass, i.e. of all things with frequency zero in the training data:
$P^{*}_{GT}(\text{things with frequency zero in training}) = \frac{N_1}{N}$
• Example: suppose we have observed N = 18 fish, with three species seen exactly once (trout, salmon, eel, so N1 = 3) and one species seen exactly twice (N2 = 1). The MLE estimate for trout is $p_{\text{MLE}}(\text{trout}) = \frac{1}{18}$, and the Good-Turing estimates are:
$c^{*}(\text{trout}) = (c + 1)\,\frac{N_2}{N_1} = 2 \times \frac{1}{3} = 0.67$
$p^{*}_{GT}(\text{unseen}) = \frac{N_1}{N} = \frac{3}{18} = 0.17, \qquad p^{*}_{GT}(\text{trout}) = \frac{c^{*}}{N} = \frac{0.67}{18} = 0.037$
• Note that the revised count c* for eel is likewise discounted from c = 1.0 to c* = 0.67, in order to set aside some probability mass for the unseen species: p*GT(unseen) = 3/18 = 0.17 for catfish and bass.
• Since we know that there are 2 unknown species, the probability of the next fish being specifically a catfish is p*GT(catfish) = (1/2) × (3/18) = 0.085.
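• A minimal Python sketch (added for illustration; the exact species composition is an assumption consistent with the example above) of the Good-Turing counts for the fishing scenario:

    from collections import Counter

    # Assumed sample consistent with the example above: 18 fish in total,
    # three species seen once (trout, salmon, eel) and one seen twice.
    observed = Counter({"carp": 10, "perch": 3, "whitefish": 2,
                        "trout": 1, "salmon": 1, "eel": 1})
    N = sum(observed.values())                 # 18
    freq_of_freq = Counter(observed.values())  # N_c: {1: 3, 2: 1, 3: 1, 10: 1}

    def gt_count(c):
        # Good-Turing smoothed count: c* = (c + 1) * N_{c+1} / N_c
        return (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]

    p_unseen = freq_of_freq[1] / N   # N_1 / N = 3/18
    p_trout = gt_count(1) / N        # 0.67 / 18

    print(round(p_unseen, 2))  # 0.17, shared by all unseen species
    print(round(p_trout, 3))   # 0.037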