
Lecture 5: Language Modeling

(N-gram, BOW)
Lecture Objectives:

• Students will be able to understand language modeling techniques
• Students will be able to understand the statistical computation of terms and the formation of a matrix of term probabilities
CSC-441: Natural Language Processing
What is NLP?
NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
Probabilistic Language Modeling
Assign a probability to a sentence: P(S) = P(w1, w2, …, wn)
• Goal: compute the probability of a sentence or
sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
• Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
• A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a
language model.
• A better name would be "the grammar", but "language model" or LM is the standard term
Conditional Probability
• A conditional probability is a probability
whose sample space has been limited to only
those outcomes that fulfill a certain condition.
• The conditional probability of event A given
that event B has happened is
P(A|B)=P(A ∩ B)/P(B).
• The order is important: do not assume that
P(A|B) = P(B|A); in general they are
DIFFERENT.
Examples
• Suppose that A and B are events with probabilities:
P(A)=1/3, P(B)=1/4,
P(A ∩ B)=1/10
• Find each of the following:
1. P(A | B) = P(A ∩ B)/P(B) = (1/10)/(1/4) = 4/10 = 2/5
2. P(B | A) = P(A ∩ B)/P(A) = (1/10)/(1/3) = 3/10
3. P(A' | B') = P(A' ∩ B')/P(B') =
P((A ∪ B)')/(1 − P(B)) = (1 − P(A ∪ B))/(1 − P(B)) =
(1 − (P(A) + P(B) − P(A ∩ B)))/(1 − P(B)) =
(1 − (1/3 + 1/4 − 1/10))/(1 − 1/4) = (1 − 29/60)/(3/4) =
(31/60)/(3/4) = 31/45.
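
As a quick check of this arithmetic, here is a minimal Python sketch (not part of the original slides) that reproduces the three results with exact fractions:

from fractions import Fraction

# Given probabilities from the example
P_A = Fraction(1, 3)
P_B = Fraction(1, 4)
P_A_and_B = Fraction(1, 10)

# Conditional probability: P(X | Y) = P(X ∩ Y) / P(Y)
P_A_given_B = P_A_and_B / P_B                    # 2/5
P_B_given_A = P_A_and_B / P_A                    # 3/10

# Complement case: P(A' | B') = (1 - P(A ∪ B)) / (1 - P(B))
P_A_or_B = P_A + P_B - P_A_and_B                 # inclusion-exclusion: 29/60
P_notA_given_notB = (1 - P_A_or_B) / (1 - P_B)   # 31/45

print(P_A_given_B, P_B_given_A, P_notA_given_notB)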
A view of machine translation
• Assigning probabilities to sequences of
words is essential in machine translation.

他 向 记者 介绍了 主要 内容

He to reporters introduced main content

Could be translated as:


•he introduced reporters to the main contents of the statement
•he briefed to reporters the main contents of the statement
•he briefed reporters on the main contents of the statement
Language Models (LMs)

Also known as “N-gram models”


N-gram models
– Definition: p(xn | xn−1, …, xn−N+1)
– Predict the next word given the N−1 previous words
1-gram = unigram
– p(xn)
2-gram = bigram
– p(xn | xn−1)
3-gram = trigram
– p(xn | xn−2, xn−1)
The probability value is invariant with respect to n (the same conditional distribution is used at every position in the sentence)
"N-gram" (without "models") means an N-word sequence
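
To make the "N-word sequence" sense concrete, here is a small illustrative Python sketch (not from the slides) that extracts N-grams from a list of tokens:

def ngrams(tokens, n):
    # Slide a window of length n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I am Sam".split()
print(ngrams(tokens, 1))   # unigrams: [('I',), ('am',), ('Sam',)]
print(ngrams(tokens, 2))   # bigrams:  [('I', 'am'), ('am', 'Sam')]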
Unigram model: P(w1 w2 … wn) ≈ ∏i P(wi)
Bigram model: P(w1 w2 … wn) ≈ ∏i P(wi | wi−1)
Example
<s> I am Sam </s>    <s> Sam I am </s>    <s> I do not like green eggs and ham </s>

Bigram probability estimates (row = previous word, column = next word; the remaining probability mass of am, Sam, and ham goes to </s>, which is not shown as a column):

        I     am    Sam   do    not   like  green  eggs  and   ham
<s>     2/3   0     1/3   0     0     0     0      0     0     0
I       0     2/3   0     1/3   0     0     0      0     0     0
am      0     0     1/2   0     0     0     0      0     0     0
Sam     1/2   0     0     0     0     0     0      0     0     0
do      0     0     0     0     1/1   0     0      0     0     0
not     0     0     0     0     0     1/1   0      0     0     0
like    0     0     0     0     0     0     1/1    0     0     0
green   0     0     0     0     0     0     0      1/1   0     0
eggs    0     0     0     0     0     0     0      0     1/1   0
and     0     0     0     0     0     0     0      0     0     1/1
ham     0     0     0     0     0     0     0      0     0     0
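
The table above can be reproduced with a short Python sketch (illustrative, not from the slides) that counts bigrams in the toy corpus and divides by the count of the previous word:

from collections import Counter

sentences = ["<s> I am Sam </s>",
             "<s> Sam I am </s>",
             "<s> I do not like green eggs and ham </s>"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in sentences:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # Maximum likelihood estimate: count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))   # 2/3
print(bigram_prob("I", "am"))    # 2/3
print(bigram_prob("am", "Sam"))  # 1/2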
Raw bigram counts
• Counted over a corpus of 9,222 sentences
Raw bigram probabilities
• Normalize each bigram count by the unigram count of the previous word:
  P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
• Result: a matrix of bigram probability estimates
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I | <s>)
× P(want | I)
× P(english | want)
× P(food | english)
× P(</s> | food)
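
Using the bigram probabilities quoted on the "Examples" slide further below (P(i|<s>) = 0.25, P(english|want) = 0.0011, P(food|english) = 0.5, P(</s>|food) = 0.68), and assuming a value of about 0.33 for P(want|I), which the slides do not list, a minimal Python sketch of this calculation (with words lowercased) is:

bigram_probs = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,          # assumed value; not given in the slides
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_prob(tokens):
    # Multiply the bigram probabilities along the sentence
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_probs[(prev, word)]
    return prob

print(sentence_prob(["<s>", "i", "want", "english", "food", "</s>"]))
# ≈ 3.1e-05 under these (partly assumed) probabilities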
Training Language Models?
Google N-Gram Release
• serve as the inspiration 1390
• serve as the installation 136
• serve as the institute 187
• serve as the institution 279
• serve as the institutional 461
• serve as the instructional 173
• serve as the instructor 286
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Smoothing techniques
• Laplace (add-1) smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts!

• MLE estimate:
  PMLE(wi | wi−1) = count(wi−1, wi) / count(wi−1)

• Add-1 estimate:
  PAdd-1(wi | wi−1) = (count(wi−1, wi) + 1) / (count(wi−1) + V), where V is the vocabulary size
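
A minimal sketch of the add-1 estimate in Python (illustrative; it assumes Counter-style bigram and unigram counts such as the ones built for the toy corpus above):

def add1_bigram_prob(prev, word, bigram_counts, unigram_counts):
    # V = number of distinct word types seen in training
    V = len(unigram_counts)
    # Add-1 (Laplace) estimate: (count(prev, word) + 1) / (count(prev) + V)
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

# With the toy-corpus counts from the earlier sketch (V = 12, count("I") = 3):
# add1_bigram_prob("I", "Sam", bigram_counts, unigram_counts)
# -> (0 + 1) / (3 + 12) = 1/15, instead of the MLE estimate of 0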
Examples
• Here are a few other useful probabilities:
P(i | <s>) = 0.25        P(english | want) = 0.0011
P(food | english) = 0.5  P(</s> | food) = 0.68
• Add-1 smoothed probabilities: P(i | <s>) = 0.19 and P(</s> | food) = 0.40
Evaluation: How good is our model?
• Does our language model prefer good sentences to
bad ones?
– i.e., it should assign higher probability to "real" or "frequently observed"
sentences than to "ungrammatical" or "rarely observed" sentences
• We train parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
– A test set is an unseen dataset that is different from our training set,
totally unused.
– An evaluation metric tells us how well our model does on the test
set.
Perplexity
The best language model is the one that best predicts an unseen test set.

Perplexity is the inverse probability of the test set, normalized by the number of words:
  PP(W) = P(w1 w2 … wN)^(−1/N)

By the chain rule:
  PP(W) = ( ∏i 1/P(wi | w1 … wi−1) )^(1/N)

For bigrams:
  PP(W) = ( ∏i 1/P(wi | wi−1) )^(1/N)

Minimizing perplexity is the same as maximizing probability.
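
A minimal Python sketch of bigram perplexity (illustrative; it assumes a dictionary of bigram probabilities like the one used in the sentence-probability example above), computed in log space to avoid underflow:

import math

def bigram_perplexity(tokens, bigram_probs):
    # PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N), N = number of predicted words
    log_prob = 0.0
    n = 0
    for prev, word in zip(tokens, tokens[1:]):
        log_prob += math.log(bigram_probs[(prev, word)])
        n += 1
    return math.exp(-log_prob / n)

# With the probabilities assumed earlier, the "i want english food" sentence
# comes out at a perplexity of roughly 8.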
Perplexity
• What is the perplexity of a sentence of random digits
according to a model that assigns P = 1/10 to
each digit? (Shannon game)
• How about a letter?
– Is it 26?
• Does the model fit the data?
– A good model will give a high probability
to a real sentence
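
As a worked answer to the questions above: for a string of N digits, each predicted with probability 1/10, the perplexity is
  PP(W) = ((1/10)^N)^(−1/N) = 10,
and by the same reasoning a model that assigns P = 1/26 to every letter has perplexity 26.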
Out-of-vocabulary estimation
How do we estimate unknown words?
For an unknown word, the MLE estimate is
count(unknown word)/count(words) = 0/(# of words) = 0,
and hence we cannot compute perplexity
(we would have to divide by 0)!

Solution
Smoothing techniques
Markov Models
• The assumption that the probability of a word
depends only on the previous word is called a
Markov assumption.
• Markov models are the class of probabilistic
models that assume we can predict the
probability of some future unit without looking
too far into the past.

P(the | its water is so transparent that) ≈ P(the | that)


Markov Assumption

P(w1 w2 … wn) ≈ ∏i P(wi | wi−k … wi−1)

• In other words, we approximate each component in the product:

P(wi | w1 w2 … wi−1) ≈ P(wi | wi−k … wi−1)
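
As a small illustration (not from the slides), the k-th order Markov approximation could be written in Python as follows, where cond_prob is an assumed function returning P(word | context) that is expected to handle the shorter contexts at the start of the sentence:

def markov_sentence_prob(tokens, cond_prob, k=1):
    # Approximate P(w1 ... wn) by multiplying P(wi | previous k words)
    # instead of conditioning on the full history w1 ... w(i-1).
    prob = 1.0
    for i, word in enumerate(tokens):
        context = tuple(tokens[max(0, i - k):i])
        prob *= cond_prob(word, context)
    return prob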


Summary

• Language models treat "word sequence prediction" as a probabilistic model
• They can be used for information extraction
• N-gram models help to perform machine translation

