
N-Grams
• Problem of word prediction.
• Example: “I’d like to make a collect …”
– Very likely words:
• “call”,
• “international call”, or
• “phone call”, and NOT
• “the”.
• The idea of word prediction is formalized with probabilistic models called N-grams.
– N-grams – predict the next word from previous N-1 words.
– Statistical models of word sequences are also called language models or LMs.
• Computing the probability of the next word will turn out to be closely related to computing
the probability of a sequence of words.
• Example:
– “… all of a sudden I notice three guys standing on the sidewalk …”, vs.
– “… on guys all I of notice sidewalk three a sudden standing the …”
N-grams
• Estimators like N-grams that assign a conditional probability to
possible next words can be used to assign a joint probability to an
entire sentence.
• N-gram models are one of the most important tools in speech and
language processing.

• N-grams are essential in any task in which words must be identified from ambiguous and noisy inputs.
N-Gram Application Areas

• Speech Recognition – the input speech sounds are very confusable and many words sound extremely similar.
N-Gram Application Areas
• Handwriting (OCR) Recognition – probabilities of word
sequences help in recognition.
– Woody Allen, in his movie “Take the Money and Run”, tries to rob a bank with a sloppily written hold-up note that the teller incorrectly reads as “I have a gub”.
– Any speech and language processing system could avoid making this mistake by using the knowledge that the sequence “I have a gun” is far more probable than the non-word “I have a gub” or even “I have a gull”.
N-Gram Application Areas
• Statistical Machine Translation – example of choosing among a set of potential rough English translations of a Chinese source sentence:
– he briefed to reporters on the chief contents of the statement
– he briefed reporters on the chief contents of the statement
– he briefed to reporters on the main contents of the statement
– he briefed reporters on the main contents of the statement
N-Gram Application Areas
• An N-gram grammar might tell us that briefed reporters is more likely
than briefed to reporters, and main contents is more likely than chief
contents.
• Spelling Correction – need to find and correct spelling errors like the following that accidentally result in real English words:
– They are leaving in about fifteen minuets to go to her house.
– Helping people who are unable to sue speech or sign language to
communicate.
• Problem – these are real words, so a dictionary search will not help.
– Note: “in about fifteen minuets” is a much less probable sequence than “in about fifteen minutes”.
– A spell-checker can use a probability estimator both to detect these errors and to suggest higher-probability corrections.
N-Gram Application Areas
• Other areas:
– Part-of-speech tagging,
– Natural Language Generation,
– Word Similarity,
– Authorship identification
– Sentiment Extraction
– Predictive Text Input (Cell phones).
Corpora & Counting Words
• Probabilities are based on counting things.
– Must decide what to count.
– Counting things in natural language is based on a corpus (plural corpora) – an on-line collection of text or speech.
– Popular corpora include “Brown” and “Switchboard”.

• The Brown corpus is a 1 million word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.) assembled at Brown University in 1963-1964.
– Example sentence from Brown corpus:
• He stepped out into the hall, was delighted to encounter a water brother.
• 13 words if we don’t count punctuation marks as words – 15 if we do.
• Treatment of “,” and “.” depends on the task.
• Punctuation marks are critical for identifying boundaries (, . ;) of things and
for identifying some aspects of meaning (? ! ”)
• For some tasks (part-of-speech tagging, parsing, or sometimes speech synthesis) punctuation marks are treated as separate words.
Corpora & Counting Words
• Switchboard Corpus – collection of 2430 telephone conversations
averaging 6 minutes each – total of 240 hours of speech with about 3
million words.
– This kind of corpus does not have punctuation.
– Complications with defining words and sentences:

– Example:
• I do uh main- mainly business data processing.

– Two kinds of disfluencies:
• The broken-off word main- is called a fragment.
• Words like uh and um are called fillers or filled pauses.
Corpora & Counting Words
– Counting disfluencies as words depends on the application:
• Automatic Dictation System based on Automatic Speech Recognition will
remove disfluencies.
• Speaker Identification application can use disfluencies to identify a person.
• Parsing and word prediction can use disfluencies – Stolcke and Shriberg
(1996) found that treating uh as a word improves next-word prediction (any
ideas why?) and thus most speech recognition systems treat uh and um as
words.
• Clark and Fox Tree (2002) showed that uh and um have different meanings. Any thoughts on what they are?
– See the link to the paper:
– http://www-psych.stanford.edu/~herb/2000s/Clark.FoxTree.02.pdf
N-Gram
• Are capitalized tokens like “They” and un-capitalized tokens like
“they” the same word?
– In speech recognition they are treated the same.
– In part-of-speech-tagging capitalization is retained as a separate
feature.
– In this chapter models are not case sensitive.
N-Grams
• Inflected forms – cats versus cat. These two words have the same
lemma “cat” but are different wordforms.
– Lemma is a set of lexical forms having the same
• Stem
• Major part-of-speech, and
• Word-sense.
– Wordform is the full
• inflected or
• derived form of the word.
N-Grams
• In this chapter N-grams are based on wordforms.

• N-gram models, and counting words in general, require the kind of tokenization or text normalization that was introduced in the previous chapter:
– Separating out punctuation
– Dealing with abbreviations (m.p.h)
– Normalizing spelling, etc.
N-Grams
 How many words are there in English?
 Must first distinguish
 types – the number of distinct words in a corpus, or the vocabulary size V, from
 tokens – the total number N of running words.
 Example:
 They picnicked by the pool, then lay back on the
grass and looked at the stars.
 16 Tokens
 14 Types
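A minimal Python sketch (assuming a simple regex tokenizer that drops punctuation) reproducing the token and type counts for the example sentence above:

```python
import re

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# Tokenize by pulling out alphabetic word strings, ignoring punctuation marks.
tokens = re.findall(r"[A-Za-z]+", sentence)

print(len(tokens))       # 16 tokens (N)
print(len(set(tokens)))  # 14 types (V): "the" occurs three times
```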
N-Grams
• The Switchboard corpus has:
– ~20,000 wordform types
– ~3 million wordform tokens
• Shakespeare’s complete works have:
– 29,066 wordform types
– 884,647 wordform tokens
• The Brown corpus has:
– 61,805 wordform types
– 37,851 lemma types
– 1 million wordform tokens
• Another very large corpus (Brown 1992a) was found to include:
– 293,181 different wordform types
– 583 million wordform tokens
• The American Heritage third edition dictionary lists 200,000 boldface forms.
• It seems that the larger the corpus, the more word types are found:
– It is suggested that vocabulary size V (the number of types) grows at least as fast as the square root of the number of tokens: V > O(√N)
Discrete Probability Distributions
• Definition:
– The sample space S contains the set of all possible outcomes:

S = {x1, x2, x3, …, xN}

– For each element x of the set S, x ∊ S, a probability value is assigned as a function of x, P(x), with the following properties:
1. P(x) ∊ [0,1], ∀ x ∊ S
2. Σx∊S P(x) = 1
Discrete Probability Distributions
• An event is defined as any subset E of the sample space S; E ⊆ S.
• The probability of the event E is defined as:

P(E) = Σx∊E P(x)

• The probability of the entire space S is 1, as indicated by property 2 in the previous slide.
• The probability of the empty or null event is 0.
• The function P(x) – the mapping of a point in the sample space to a “probability” value – is called a probability mass function (pmf).
Properties of Probability Function
• If A and B are mutually exclusive events in S, then:
– P(A∪B) = P(A)+P(B)
– Mutually exclusive events are those for which A∩B = ∅.
– In general, for n mutually exclusive events:

P(A1∪A2∪A3∪…∪An) = P(A1)+P(A2)+P(A3)+…+P(An)

Venn diagram: two disjoint sets A and B
Elementary Theorems of Probability
• If A is any event in S, then
– P(A’) = 1-P(A)
where A' is the set of all outcomes not in A.
• Proof:
– P(A∪A’) = P(A)+P(A’), considering that
– P(A∪A’) = P(S)= 1
– P(A)+P(A’) = 1
Elementary Theorems of Probability
• If A and B are any events in S, then
– P(A∪B) = P(A)+P(B)- P(A∩B),
• Proof:
– P(A∪B) = P(A∩B')+P(A∩B)+P(A'∩B)
– P(A∪B) = [P(A∩B')+P(A∩B)] + [P(A'∩B)+P(A∩B)] - P(A∩B)
– P(A∪B) = P(A)+P(B) - P(A∩B)

Venn diagram: S, with A∪B partitioned into A∩B', A∩B, and A'∩B

Conditional Probability
• If A and B are any events in S, and P(B)≠0 or P(A)≠0, the conditional
probability of A relative to B is given by:

P A  B  P A  B 
P A | B   
P B  P  A

• If A and B are any events in S, then

P A  B   P A | B PB  if P B   0
P A  B   PB | AP A if P  A  0
Independent Events
• If A and B are independent events, then:

P(A∩B) = P(A|B)P(B) = P(A)P(B)
P(A∩B) = P(B|A)P(A) = P(B)P(A)

• In probability theory, saying that two events are independent intuitively means that the occurrence of one event makes the other neither more nor less probable.
Bayes Rule
• If B1, B2, B3, …, Bn are mutually exclusive events of which one must occur, that is, Σi=1..n P(Bi) = 1, then:

P(Bi | A) = P(A | Bi) P(Bi) / Σj=1..n P(A | Bj) P(Bj),   for i = 1, 2, 3, …, n
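As a quick numeric illustration (all numbers here are made up, not from the slides), a minimal Python sketch of Bayes' rule for three mutually exclusive hypotheses:

```python
# Hypothetical priors P(B_i) and likelihoods P(A | B_i); the B_i are assumed
# mutually exclusive and exhaustive, so the priors sum to 1.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.9, 0.5, 0.1]

# Denominator of Bayes' rule: sum_j P(A | B_j) P(B_j)
evidence = sum(p * l for p, l in zip(priors, likelihoods))

# Posteriors P(B_i | A); they again sum to 1.
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]
print(posteriors)
```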
Simple (Unsmoothed) N-Grams
• Our goal is to compute the probability of a word w given some history h:
P(w|h).
• Example:
– h ⇒ “its water is so transparent that”
– w ⇒ “the”
– P(the | its water is so transparent that)
• How can we compute this probability?
– One way is to estimate it from relative frequency counts.
– From a very large corpus, count the number of times we see “its water is so transparent that” and count the number of times this is followed by “the”. That is: out of the times we saw the history h, how many times was it followed by the word w?

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)
Estimating Probabilities
• Estimating probabilities from counts works fine in many cases, but it turns out that even the web is not big enough to give us good estimates in most cases.
– Language is creative:
1. new sentences are created all the time.
2. It is not possible to count entire sentences.
3. In addition …
Estimating Probabilities
• Also: Joint Probabilities – probability of an entire sequence
of words like “its water is so transparent”:
– Out of all possible sequences of 5 words, how many of them are “its water is so transparent”?
– Must count all occurrences of “its water is so transparent” and divide by the sum of counts of all possible 5-word sequences.

• This seems like a lot of work for a simple computation of estimates.
Estimating Probabilities
• Hence, must figure out cleverer ways of estimating the
probability of
– A word w given some history h, or
– An entire word sequence W.

w1, w2, …, wn   (also written w1ⁿ)
P(w1, w2, w3, …, wn)
Estimating Probabilities
• Introduction of formal notations:
– Random variable – Xi
– Probability Xi taking on the value “the” – P(Xi =“the”) = P(the)
– Sequence of N words: w1, w2, …, wn or w1ⁿ
– Joint probability of each word in a sequence having a particular value:

P(w1, w2, w3, …, wn) = P(X = w1, Y = w2, Z = w3, …)
Chain Rule

• Chain rule of probability:

P(X1, …, Xn) = P(X1) P(X2|X1) P(X3|X1X2) … P(Xn|X1…Xn-1) = ∏k=1..n P(Xk | X1…Xk-1)

• Applying the chain rule to words we get:

P(w1ⁿ) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1) = ∏k=1..n P(wk | w1…wk-1)
Chain Rule
• The chain rule provides the link between computing
the joint probability of a sequence and computing the
conditional probability of a word given previous
words.
– Equation presented in previous slide provides the way of
computing joint probability estimate of an entire sequence
based on multiplication of a number of conditional
probabilities.
– However, we still do not know any way of computing the exact probability of a word given a long sequence of preceding words: P(wn | w1…wn-1)
N-grams
• Approximation:
– Idea of N-gram model is to approximate the history by just the last
few words instead of computing the probability of a word given its
entire history.
• Bigram:
– The bigram model approximates the probability of a word given all the previous words, P(wn | w1…wn-1), by the conditional probability given only the preceding word, P(wn | wn-1).
– Example: Instead of computing the probability

P(the | Walden Pond's water is so transparent that)

it is approximated with the probability

P(the | that)
Bi-gram
• The following approximation is used when the bigram probability is applied:

P(wn | w1…wn-1) ≈ P(wn | wn-1)

• The assumption that the conditional probability of a word depends only on the previous word is called a Markov assumption.
• Markov models are the class of probabilistic models that
assume that we can predict the probability of some future
unit without looking too far into the past.
Bi-gram Generalization
• Tri-gram: looks two words into the past
• N-gram: looks N-1 words into the past.
• General equation for N-gram approximation to the conditional
probability of the next word in a sequence is:


P(wn | w1…wn-1) ≈ P(wn | wn-N+1…wn-1)
• The simplest and most intuitive way to estimate probabilities is
the method called Maximum Likelihood Estimation or MLE
for short.
Maximum Likelihood Estimation
• MLE is a method that provides a solution to the parameter estimation of a probability distribution function.
– The best estimate of the parameter values is defined to
be the one that maximizes the probability of obtaining
the samples actually observed.
Maximum Likelihood Estimation For N-
Gram
• MLE estimate for the parameters of an N-gram model is done
by taking counts from the training data of a corpus, and
normalizing them so they lie between 0 and 1.
Bi-Gram
• Computing a particular bigram probability of a word y given a previous word x, the count C(xy) is computed and normalized by the sum of all bigrams that share the same first word x:

P(wn | wn-1) = C(wn-1 wn) / Σw C(wn-1 w)
MLE for Bi-Gram

• The previous equation can be further simplified by noting that the sum of all bigram counts that start with wn-1 equals the unigram count of wn-1:

C(wn-1) = Σw C(wn-1 w),   so   P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
Example: Dr. Seuss
http://www.seussville.com
• Mini-corpus containing three sentences marked with beginning
sentence marker <s> and ending sentence marker </s>:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
(From Dr. Seuss series: “Green Eggs and Ham” book)

• Some of the bi-gram calculations from this corpus:


– P(I|<s>) = 2/3 = 0.66 P(Sam|<s>) = 1/3=0.33
– P(am|I) = 2/3 = 0.66 P(Sam|am) = ½=0.5
– P(</s>|Sam) = 1/3 = 0.33 P(</s>|am) = 1/3=0.33
– P(do|I) = 1/3 = 0.33
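A minimal Python sketch (assuming whitespace tokenization, with <s> and </s> treated as ordinary tokens) that reproduces these bigram MLE estimates:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("I", "<s>"))    # 2/3 ~ 0.66
print(p_bigram("am", "I"))     # 2/3 ~ 0.66
print(p_bigram("Sam", "am"))   # 1/2 = 0.5
print(p_bigram("do", "I"))     # 1/3 ~ 0.33
```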
MLE for N-Gram
• In the general case, the MLE for an N-gram model is calculated using the following:

P(wn | wn-N+1…wn-1) = C(wn-N+1…wn-1 wn) / C(wn-N+1…wn-1)

• This equation estimates the N-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of a prefix. This ratio is called a relative frequency.
Relative Frequency Computation for
MLE
• Computing relative frequencies is one way to estimate probabilities with the Maximum Likelihood Estimation method.
• Conventional MLE is not always the best way to compute probability estimates (bias toward a training corpus – e.g., Brown).
• MLE can be modified to better address these considerations.
Example 2
• Data used from Berkeley Restaurant Project Corpus consisting of 9332
sentences (available from the WWW?):
– can you tell me about any good cantonese restaurants close by
– mid priced thai food is what I’m looking for
– tell me about chez panisse
– can you give me a listing of the kinds of food that are available
– i’m looking for a good place to eat breakfast
– when is caffe venezia open during the day
Bigram counts for 8 of the words (out of V=1446) in Berkeley
Restaurant Project corpus of 9332 sentences

i want to eat chinese food lunch spend

i 5 827 0 9 0 0 0 2
want 2 0 608 1 6 6 5 1
to 2 0 4 686 2 0 6 211
eat 0 0 2 0 16 2 42 0
chinese 1 0 0 0 0 82 1 0
food 15 0 15 0 1 4 0 0
lunch 2 0 0 0 0 1 0 0
spend 1 0 1 0 0 0 0 0
Unigram Counts

i want to eat chinese food lunch spend


2533 927 214 746 158 1093 341 278
Bigram Probabilities After Normalization
i want to eat chinese food lunch spend

i 0.002 0.33 0 0.0036 0 0 0 0.00079

want 0.0022 0 0.66 0.0011 0.0065 0.0065 0.0054 0.0011

to 0.00083 0 0.0017 0.28 0.00083 0 0.0025 0.087

eat 0 0 0.0027 0 0.021 0.0027 0.056 0

chinese 0.0063 0 0 0 0 0.52 0.0063 0

food 0.014 0 0.014 0 0.00092 0.0037 0 0

lunch 0.0059 0 0 0 0 0.0029 0 0

spend 0.0036 0 0.0036 0 0 0 0 0


Sentence Probabilities
• Some other useful probabilities:
– P(i|<s>)=0.25 P(english|want)=0.0011
– P(food|english)=0.5 P(</s>|food)=0.68

• Clearly now we can compute the probability of a sentence like:
– “I want English food”, or
– “I want Chinese food”
by multiplying appropriate bigram probabilities together as
follows:
Bigram Probability
• P(<s> i want english food </s>) =
P(i|<s>) (0.25)
P(want|i) (0.33)
P(english|want) (0.0011)
P(food|english) (0.5)
P(</s>|food) (0.68)
= 0.25 x 0.33 x 0.0011 x 0.5 x 0.68 = 0.000031
• Exercise: Compute the probability of “I want chinese food”.
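A minimal sketch (the helper function and dictionary are illustrative, not from the slides) that multiplies the bigram probabilities quoted above to score “<s> i want english food </s>”; the exercise sentence can be scored the same way using P(chinese|want) and P(food|chinese) from the probability table:

```python
from functools import reduce

# Bigram probabilities taken from the tables and slides above.
bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_probability(tokens, probs):
    """Bigram model: product of P(w_n | w_{n-1}) over the whole sentence."""
    return reduce(lambda acc, pair: acc * probs[pair], zip(tokens, tokens[1:]), 1.0)

tokens = "<s> i want english food </s>".split()
print(sentence_probability(tokens, bigram_p))   # ~0.000031
```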
Bigram Probability
• Some of the bigram probabilities encode some
facts that we think of as strictly syntactic in
nature:
– What comes after eat is usually
• a noun or
• an adjective, or
– What comes after to is usually
• a verb
Trigram Modeling
• Although we will generally show bigram models in this chapter
for pedagogical purposes, note that when there is sufficient
training data we are more likely to use trigram models, which
condition on the previous two words rather than the previous
word.
– To compute trigram probabilities at the very beginning of a sentence, we can use two pseudo-words for the first trigram (i.e., P(I|<s><s>)).
Training and Test Sets
• N-gram models are obtained from the corpus they are trained on.
• Those models are used on some new data in some task
(e.g. speech recognition).
– New data or task will not be exactly the same as data that
was used for training.
• Formally:
– Data that is used to build the N-gram (or any model) are
called Training Set or Training Corpus
– Data that are used to test the models comprise Test Set or
Test Corpus.
Model Evaluation
• Training-and-testing paradigm can also be used to evaluate different N-
gram architectures:
– Comparing N-grams of different order N, or
– Using the different smoothing algorithms
• Train various models using training corpus
• Evaluate each model on the test corpus.

• How do we measure the performance of each model on the test corpus?
– Perplexity – computing the probability of each sentence in the test set: the model that assigns a higher probability to the test set (hence more accurately predicts the test set) is assumed to be a better model.
• Because the evaluation metric is based on test set probability, it’s important not to let the test sentences into the training set – that is, to avoid training on the test set data.
Other Divisions of Data
• Extra source of data to augment the training set is also needed. This
data is called a held-out set.
– The N-gram model itself is based only on the training set.
– The held-out set is used to set additional (other) parameters of our model.
• Used to set interpolation parameters of the N-gram model.
• Multiple test sets:
– A test set that is used often in measuring performance of the model is typically called the development (test) set.
– Due to its high usage the models may be tuned to it. Thus a new
completely unseen (or seldom used data set) should be used for
final evaluation. This set is called evaluation (test) set.
Picking Train, Development Test and
Evaluation Test Data
• For training we need as much data as possible.
• However, for testing we need sufficient data in order for the
resulting measurements to be statistically significant.
• In practice often the data is divided into 80% training 10%
development and 10% evaluation.
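A minimal sketch of such an 80/10/10 split (the sentence list, ratios, and random seed are illustrative assumptions):

```python
import random

def split_corpus(sentences, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle and split a list of sentences into training, development and evaluation sets."""
    sents = list(sentences)
    random.Random(seed).shuffle(sents)
    n_train = int(len(sents) * train_frac)
    n_dev = int(len(sents) * dev_frac)
    return (sents[:n_train],                      # training set
            sents[n_train:n_train + n_dev],       # development (test) set
            sents[n_train + n_dev:])              # evaluation (test) set

train_set, dev_set, eval_set = split_corpus(["sentence %d" % i for i in range(100)])
print(len(train_set), len(dev_set), len(eval_set))   # 80 10 10
```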
N-gram Sensitivity to the Training Corpus.
1. N-gram modeling, like many statistical models, is
very dependent on the training corpus.
 Often the model encodes very specific facts about a given
training corpus.
2. N-grams do a better and better job of modeling the
training corpus as we increase the value of N.
 This is another aspect of the model being tuned too specifically to the training data at the expense of generality.
Size of N in N-gram Models
• The longer the context on which we train the
model, the more coherent the sentences.
– In the unigram sentences, there is no coherent relation between
words, nor sentence-final punctuation.
– The bigram sentences have some very local word-to-word
coherence (especially if we consider that punctuation counts as a
word).
– The trigram and quadrigram sentences are beginning to look a lot more like the training corpus.
Unknown Words: Open vs. Closed
Vocabulary Tasks
• Sometimes we have a language task in which we know all the words
that can occur, and hence we know the vocabulary size V in advance.
– The closed vocabulary assumption is the assumption that we have
such a lexicon, and that the test set can only contain words from
this lexicon. The closed vocabulary task thus assumes there are no
unknown words.
Unknown Words: Open vs. Closed
Vocabulary Tasks
• As we suggested earlier, the number of unseen words grows
constantly, so we can’t possibly know in advance exactly how many
there are, and we’d like our model to do something reasonable with
them.
• We call these unseen events unknown words, or out of vocabulary (OOV) words. The percentage of OOV words that appear in the test set is called the OOV rate.
– An open vocabulary system is one where we model these
potential unknown words in the test set by adding a pseudo-word
called <UNK>.
Training Probabilities of Unknown Model
• We can train the probabilities of the unknown word
model <UNK> as follows:
1. Choose a vocabulary (word list) which is fixed in advance.
2. Convert in the training set any word that is not in this set
(any OOV word) to the unknown word token <UNK> in a
text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular word in the training set.
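A minimal sketch of step 2 (the vocabulary below is a hypothetical stand-in for the fixed word list chosen in step 1):

```python
vocabulary = {"<s>", "</s>", "i", "want", "to", "eat", "chinese", "food", "lunch", "spend"}

def replace_oov(tokens, vocab, unk="<UNK>"):
    """Return the token list with every out-of-vocabulary word mapped to <UNK>."""
    return [tok if tok in vocab else unk for tok in tokens]

print(replace_oov("<s> i want to eat thai food </s>".split(), vocabulary))
# ['<s>', 'i', 'want', 'to', 'eat', '<UNK>', 'food', '</s>']
```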
Perplexity
• The correct way to evaluate the performance of a language model is to
embed it in an application and measure the total performance of the
application.
– Such end-to-end evaluation, also called in vivo evaluation, is the
only way to know if a particular improvement in a component is
really going to help the task at hand.
– Thus for speech recognition, we can compare the performance of
two language models by running the speech recognizer twice, once
with each language model, and seeing which gives the more
accurate transcription.
Perplexity
• End-to-end evaluation is often very expensive; evaluating a large
speech recognition test set, for example, takes hours or even days.
– Thus we would like a metric that can be used to quickly evaluate
potential improvements in a language model.
– Perplexity is the most common evaluation metric for N-gram
language models. While an improvement in perplexity does not
guarantee an improvement in speech recognition performance (or
any other end-to-end metric), it often correlates with such
improvements.
– Thus it is commonly used as a quick check on an algorithm; an
improvement in perplexity can then be confirmed by an end-to-end
evaluation.
Perplexity
• Given two probabilistic models,
– the better model is the one that has a tighter fit to the test data, or
– predicts the details of the test data better.
• We can measure better prediction by looking at the probability the
model assigns to the test data;
– the better model will assign a higher probability to the test data.
Definition of Perplexity
• The perplexity (sometimes called PP for short) of a
language model on a test set is a function of the
probability that the language model assigns to that test
set.
– For a test set W = w1w2 . . .wN, the perplexity is the
probability of the test set, normalized by the number of
words:

PP W   Pw1w2  wN 
1

N

1
N
Pw1w2  wN 
Definition of Perplexity
• We can use the chain rule to expand the probability of
W:
PP(W) = (∏i=1..N 1 / P(wi | w1 w2 … wi-1))^(1/N)

• For a bigram language model the perplexity of W is computed as:

PP(W) = (∏i=1..N 1 / P(wi | wi-1))^(1/N)
Interpretation of Perplexity
1. Minimizing perplexity is equivalent to maximizing the test set
probability according to the language model.

• What we generally use for the word sequence in the general equation presented in the previous slide is the entire sequence of words in some test set.
– Since this sequence will cross many sentence boundaries, we need
to include
• the begin-and end-sentence markers <s> and </s> in the probability
computation.
• also need to include the end-of-sentence marker </s> (but not the
beginning-of-sentence marker <s>) in the total count of word tokens
N.
Interpretation of Perplexity
• Perplexity can also be interpreted as the weighted average branching
factor of a language.
– The branching factor of a language is the number of possible next
words that can follow any word.

• Consider the task of recognizing the digits in English (zero, one, two,...,
nine), given that each of the 10 digits occur with equal probability P = 1/10 .
The perplexity of this language is in fact 10.
• To see that, imagine a string of digits of length N. By Equation presented in
previous slide, the perplexity will be:

PP W   P w1w2  wN 
1

N

1

 1  N  N
1
1

        10
 10    10 
Interpretation of Perplexity
• Exercise: Suppose that the number zero is really frequent and occurs
10 times more often than other numbers.
– Show that the perplexity is lower, as expected, since most of the time the next number will be zero.
– Branching factor however, is still the same for digit recognition
task (e.g. 10).
Interpretation of Perplexity
• Perplexity is also related to the information theoretic notion of entropy, as will be shown later in this chapter.
Example of Perplexity Use
• Perplexity is used in following example to compare three N-gram
models.
• Unigram, Bigram, and Trigram grammars are trained on 38
million words (including start-of-sentence tokens) using WSJ
corpora with 19,979 word vocabulary.
– Perplexity is computed on a test set of 1.5 million words via the equation presented in the “Definition of Perplexity” slide, and the results are summarized in the table below:

N-gram Order Unigram Bigram Trigram

Perplexity 962 170 109


Example of Perplexity Use
• As we see in previous slide, the more information the N-gram gives us
about the word sequence, the lower the perplexity:
– the perplexity is related inversely to the likelihood of the test sequence
according to the model.

• Note that in computing perplexities the N-gram model P must be constructed without any knowledge of the test set. Any kind of knowledge of the test set can cause the perplexity to be artificially low.

• For example, we defined above the closed vocabulary task, in which the vocabulary for the
test set is specified in advance. This can greatly reduce the perplexity. As long as this
knowledge is provided equally to each of the models we are comparing, the closed vocabulary
perplexity can still be useful for comparing models, but care must be taken in interpreting the
results. In general, the perplexity of two language models is only comparable if they use the
same vocabulary.
Smoothing
• There is a major problem with the maximum likelihood estimation
process we have seen for training the parameters of an N-gram
model.
– This is the problem of sparse data caused by the fact that our
maximum likelihood estimate was based on a particular set of training
data. For any N-gram that occurred a sufficient number of times, we
might have a good estimate of its probability. But because any corpus
is limited, some perfectly acceptable English word sequences are
bound to be missing from it.
1. This missing data means that the N-gram matrix for any given training
corpus is bound to have a very large number of cases of putative “zero
probability N-grams” that should really have some non-zero
probability.
2. Furthermore, the MLE method also produces poor estimates when the
counts are non-zero but still small.
Smoothing
• We need a method which can help get better estimates for these
zero or low frequency counts.
– Zero counts turn out to cause another huge problem.
• The perplexity metric defined above requires that we compute the
probability of each test sentence.
• But if a test sentence has an N-gram that never appeared in the
training set, the Maximum Likelihood estimate of the probability for
this N-gram, and hence for the whole test sentence, will be zero!

• This means that in order to evaluate our language models, we need to modify the MLE method to assign some non-zero probability to any N-gram, even one that was never observed in training.
Smoothing
• The term smoothing is used for such modifications that address the
poor estimates due to variability in small data sets.
– The name comes from the fact that (looking ahead a bit) we will be
shaving a little bit of probability mass from the higher counts, and
piling it instead on the zero counts, making the distribution a little
less discontinuous.
• In the next few sections some smoothing algorithms are introduced.
• The original Berkeley Restaurant example introduced previously
will be used to show how smoothing algorithms modify the bigram
probabilities.
Laplace Smoothing
• One simple way to do smoothing is to take our matrix of bigram
counts, before we normalize them into probabilities, and add one to all
the counts. This algorithm is called Laplace smoothing, or Laplace’s
Law.
• Laplace smoothing does not perform well enough to be used in modern
N-gram models, but we begin with it because it introduces many of the
concepts that we will see in other smoothing algorithms, and also gives
us a useful baseline.
Laplace Smoothing to Unigram
Probabilities
• Recall that the unsmoothed maximum likelihood estimate of the unigram
probability of the word wi is its count ci normalized by the total number of
word tokens N:

P wi  
ci
N
• Laplace smoothing adds one to each count. Considering that there are V words
in the vocabulary, and each one got increased, we also need to adjust the
denominator to take into account the extra V observations in order to have
legitimate probabilities.

PLaplace(wi) = (ci + 1) / (N + V)
Laplace Smoothing
• It is convenient to describe a smoothing algorithm in terms of an adjusted count c* that adjusts the numerator, so the smoothed estimate can again be written as a count divided by N:

PLaplace(wi) = (ci + 1) / (N + V) = [(ci + 1) N / (N + V)] / N = ci* / N,   where   ci* = (ci + 1) N / (N + V)
Discounting
• A related way to view smoothing is as discounting (lowering) some
non-zero counts in order to get the correct probability mass that will be
assigned to the zero counts.
• Thus, instead of referring to the discounted counts c*, we might describe a smoothing algorithm in terms of a relative discount dc, the ratio of the discounted counts to the original counts:

dc = c* / c
Berkeley Restaurant Project Smoothed Bigram
Counts (V=1446)
i want to eat chinese food lunch spend
i 6 828 1 10 1 1 1 3
want 3 1 609 2 7 7 6 2
to 3 1 5 687 3 1 7 212
eat 1 1 3 1 17 3 43 1
chinese 2 1 1 1 1 83 2 1
food 16 1 16 1 2 5 1 1
lunch 3 1 1 1 1 2 1 1
spend 2 1 2 1 1 1 1 1
Smoothed Bigram Probabilities
• Recall that normal bigram probabilities are computed by normalizing each row of counts by the unigram count:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

• For add-one smoothed bigram counts we need to augment the unigram count by the total number of word types in the vocabulary V:

P*Laplace(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)

• The result is the Laplace-smoothed bigram probabilities presented in the table in the next slide.
Bigram Smoothed Probabilities for eight words (out of V=1446) in
Berkeley Restaurant Project corpus of 9332 sentences

i want to eat chinese food lunch spend

i 0.0015 0.21 0.00025 0.0025 0.00025 0.00025 0.00025 0.00075

want 0.0013 0.00042 0.26 0.00084 0.0029 0.0029 0.0025 0.0084

to 0.00078 0.00026 0.0013 0.18 0.00078 0.00026 0.0018 0.055

eat 0.00046 0.00046 0.0014 0.00046 0.0078 0.0014 0.02 0.00046

chinese 0.0012 0.00062 0.00062 0.00062 0.00062 0.052 0.0012 0.00062

food 0.0063 0.00039 0.0063 0.00039 0.00079 0.002 0.00039 0.00039

lunch 0.0017 0.00056 0.00056 0.00056 0.00056 0.0011 0.00056 0.00056

spend 0.0012 0.00058 0.0012 0.00058 0.00058 0.00058 0.00058 0.00058


Adjusted Counts Table
• It is often convenient to reconstruct the count matrix so we can see
how much a smoothing algorithm has changed the original counts.
• These adjusted counts can be computed by Equation presented below
and the table in the next slide shows the reconstructed counts.

c*(wn-1 wn) = (C(wn-1 wn) + 1) C(wn-1) / (C(wn-1) + V)
Adjusted Counts Table
i want to eat chinese food lunch spend

i 3.8 527 0.64 6.4 0.64 0.64 0.64 1.9

want 1.2 0.39 238 0.78 2.7 2.7 2.3 0.78

to 1.9 0.63 3.1 430 1.9 0.63 4.4 133

eat 0.34 0.34 1 0.34 5.8 1 15 0.34

chinese 0.2 0.098 0.098 0.098 0.098 8.2 0.2 0.098

food 6.9 0.43 6.9 0.43 0.86 2.2 0.43 0.43

lunch 0.57 0.19 0.19 0.19 0.19 0.38 0.19 0.19

spend 0.32 0.16 0.32 0.16 0.16 0.16 0.16 0.16


Observation
• Note that add-one smoothing has made a very big change to the
counts. C(want to) changed from 609 to 238!
• We can see this in probability space as well: P(to|want) decreases
from .66 in the unsmoothed case to .26 in the smoothed case.
• Looking at the discount d (the ratio between new and old counts)
shows us how strikingly the counts for each prefix-word have been
reduced;
– the discount for the bigram want to is .39, while the discount for
Chinese food is .10, a factor of 10!
Problems with Add-One (Laplace)
Smoothing
• The sharp change in counts and probabilities occurs because too much
probability mass is moved to all the zeros.
– We could move a bit less mass by adding a fractional count rather
than 1 (add-d smoothing; (Lidstone, 1920; Jeffreys, 1948)), but
– this method requires a method for choosing d dynamically, results
in an inappropriate discount for many counts, and turns out to give
counts with poor variances.
– For these and other reasons (Gale and Church, 1994), we’ll need to use better smoothing methods for N-grams like the ones we will present in the next section.
Good-Turing Discounting
• A number of much better algorithms have been developed that are only
slightly more complex than add-one smoothing: Good-Turing
• The idea behind a number of those algorithms is to use the count of
things you’ve seen once to help estimate the count of things you have
never seen.
• Good described the algorithm in 1953 in which he credits Turing for
the original idea.
• Basic idea in this algorithm is to re-estimate the amount of probability
mass to assign to N-grams with zero counts by looking at the number
of N-grams that occurred only one time.
– A word or N-gram that occurs once is called a singleton.
– Good-Turing algorithm uses the frequency of singletons as a re-
estimate of the frequency of zero-count bigrams.
Good-Turing Discounting
• Algorithm Definition:
– Nc – the number of N-grams that occur c times: frequency of frequency c.
– N0 – the number of bigrams b with count 0.
– N1 – the number of bigram with count 1 (singletons), etc.
– Each Nc is a bin that stores the number of different N-grams that occur in
the training set with frequency c:

Nc = Σx:count(x)=c 1

– The MLE count for Nc is c. The Good-Turing estimate replaces this with a smoothed count c*, as a function of Nc+1:

c* = (c + 1) Nc+1 / Nc
Good-Turing Discounting
• The equation presented in the previous slide can be used to replace the MLE counts for all the bins N1, N2, and so on. Instead of using this equation directly to re-estimate the smoothed count c* for N0, the following equation is used, which defines the probability P*GT of the missing mass:

P*GT(things with frequency zero in training) = N1 / N
• N1 – is the count of items in bin 1 (that were seen once in training), and
N is total number of items we have seen in training.
Good-Turing Discounting
• Example:
– A lake with 8 species of fish (bass, carp, catfish, eel, perch, salmon, trout,
whitefish)
– When fishing we have caught 6 species with the following count:
• 10 carp
• 3 perch, 2 whitefish, 1 trout, 1 salmon, and 1 eel (no catfish and no bass).
– What is the probability that the next fish we catch will be a new species,
i.e., one that had a zero frequency in our training set (catfish or bass)?

• The MLE count c of unseen species (bass or catfish) is 0. From the equation in the previous slide, the probability of a new fish being one of these unseen species is 3/18, since N1 is 3 and N is 18:

P*GT(things with frequency zero in training) = N1 / N = 3/18
Good-Turing Discounting: Example
• Let’s now estimate the probability that the next fish will be another trout. The MLE count for trout is 1, so the MLE estimated probability is 1/18.
• However, the Good-Turing estimate must be lower, since we took 3/18
of our probability mass to use on unseen events!
– Must discount MLE probabilities for observed counts (perch,
whitefish, trout, salmon, and eel)

• The revised count c* and Good-Turing smoothed probabilities P*GT for species with count 0 (like bass or catfish) or count 1 (like trout, salmon, or eel) are as follows:
Good-Turing Discounting: Example

Unseen (bass or catfish):
– count c = 0
– MLE p = 0/18 = 0
– GT p*GT(unseen) = N1/N = 3/18 ≈ 0.17

Trout:
– count c = 1
– MLE p = 1/18
– revised count c*(trout) = (c+1) N2/N1 = 2 × (1/3) ≈ 0.67
– GT p*GT(trout) = c*/N = 0.67/18 ≈ 0.037

• Note that the revised count c* for eel as well is discounted from c = 1.0 to c* = 0.67 in order to account for some probability mass for the unseen species: p*GT(unseen) = 3/18 ≈ 0.17 for catfish and bass.
• Since we know that there are 2 unknown species, the probability of the next fish being specifically a catfish is p*GT(catfish) = (1/2) × (3/18) ≈ (1/2) × 0.17 ≈ 0.085
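A minimal Python sketch of the fish example, recomputing the reserved mass for unseen species and the Good-Turing adjusted count for species seen once:

```python
from collections import Counter

catch = (["carp"] * 10 + ["perch"] * 3 + ["whitefish"] * 2
         + ["trout", "salmon", "eel"])

counts = Counter(catch)              # species -> count c
N = sum(counts.values())             # 18 fish caught in total
N_c = Counter(counts.values())       # frequency of frequency: N_c[c] = number of species with count c

p_unseen = N_c[1] / N                # P*_GT(unseen) = N_1 / N = 3/18 ~ 0.17

# Good-Turing adjusted count for species seen once: c* = (c + 1) * N_{c+1} / N_c
c_star_1 = (1 + 1) * N_c[2] / N_c[1]   # 2 * (1/3) ~ 0.67
p_trout = c_star_1 / N                 # ~0.037

print(p_unseen, c_star_1, p_trout)
```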
