
Bayesian Learning Method

Nguyen Van Vinh


UET-Hanoi VNU
Content
• Bayesian Learning (NB) method
• Examples for NB
• Applications (text classification, spam mail)
Introduction
Thomas Bayes (c. 1702 – 17 April 1761) was a British mathematician and Presbyterian minister (Wikipedia).

Today we learn:
• Bayesian classification
– E.g. How to decide if a patient is ill or healthy,
based on
• A probabilistic model of the observed data
• Prior knowledge
Classification problem
• Training data: examples of the form (d,h(d))
– where d are the data objects to classify (inputs)
– and h(d) is the correct class label for d, h(d) ∈ {1, …, K}
• Goal: given dnew, provide h(dnew)
Why Bayesian?
• Provides practical learning algorithms
– E.g. Naïve Bayes
• Prior knowledge and observed data can be
combined
• It is a generative (model based) approach, which
offers a useful conceptual framework
– E.g. sequences could also be classified, based on
a probabilistic model specification
– Any kind of objects can be classified, based on a
probabilistic model specification
Bayes’ Rule
Understanding Bayes' rule (d = data, h = hypothesis/model):

  P(h | d) = P(d | h) P(h) / P(d)

Rearranging: P(h | d) P(d) = P(d | h) P(h), i.e. P(d, h) = P(d, h), the same joint probability on both sides.

Who is who in Bayes' rule:
• P(h): prior belief (probability of hypothesis h before seeing any data)
• P(d | h): likelihood (probability of the data if the hypothesis h is true)
• P(d) = Σ_h P(d | h) P(h): data evidence (marginal probability of the data)
• P(h | d): posterior (probability of hypothesis h after having seen the data d)

Does patient have cancer or not?
• A patient takes a lab test and the result comes back
positive. It is known that the test returns a correct
positive result in only 98% of the cases and a correct
negative result in only 97% of the cases. Furthermore,
only 0.008 of the entire population has this disease.

1. What is the probability that this patient has cancer?


2. What is the probability that he does not have cancer?
3. What is the diagnosis?
  hypothesis_1: 'cancer'
  hypothesis_2: '¬cancer'      } hypothesis space H
  data: '+' (the positive test result)

1. P(cancer | +) = P(+ | cancer) P(cancer) / P(+) = ..........
   P(+ | cancer) = 0.98
   P(cancer) = 0.008
   P(+) = P(+ | cancer) P(cancer) + P(+ | ¬cancer) P(¬cancer) = ...................
   P(+ | ¬cancer) = 0.03
   P(¬cancer) = ..........

2. P(¬cancer | +) = ...........

3. Diagnosis??
Choosing Hypotheses
• Maximum likelihood hypothesis:

  h_ML = argmax_{h ∈ H} P(d | h)

• Generally we want the most probable hypothesis given the training data. This is the maximum a posteriori (MAP) hypothesis:

  h_MAP = argmax_{h ∈ H} P(h | d)

  – Useful observation: it does not depend on the denominator P(d)
Now we compute the diagnosis
– To find the maximum likelihood hypothesis, we evaluate P(d | h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it:
  P(+ | cancer) = ............
  P(+ | ¬cancer) = .............
  ⇒ Diagnosis: h_ML = .................
– To find the maximum a posteriori (MAP) hypothesis, we evaluate P(d | h) P(h) for the data d, which is the positive lab test, and choose the hypothesis (diagnosis) that maximises it. This is the same as choosing the hypothesis that gives the higher posterior probability.
  P(+ | cancer) P(cancer) = ................
  P(+ | ¬cancer) P(¬cancer) = .............
  ⇒ Diagnosis: h_MAP = ......................
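For reference, the blanks can be filled in with the numbers given on the previous slide (P(+ | cancer) = 0.98, P(+ | ¬cancer) = 0.03, P(cancer) = 0.008, so P(¬cancer) = 0.992):

  ML:  P(+ | cancer) = 0.98 > P(+ | ¬cancer) = 0.03  ⇒  h_ML = cancer
  MAP: P(+ | cancer) P(cancer) = 0.98 × 0.008 ≈ 0.0078
       P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 ≈ 0.0298  ⇒  h_MAP = ¬cancer
  Normalising, P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21, so even after a positive test the MAP diagnosis is ¬cancer.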
Bayesian decision theory
• Let x be the value predicted by the agent and x* be
the true value of X.
• The agent has a loss function, which is 0 if x = x*
and 1 otherwise
• Expected loss for predicting x:
  Σ_{x*} L(x, x*) P(x* | e) = Σ_{x* ≠ x} P(x* | e) = 1 − P(x | e)
• What is the estimate of X that minimizes the
expected loss?
– The one that has the greatest posterior probability P(x|e)
– This is called the Maximum a Posteriori (MAP) decision
MAP decision
• Value x* of X that has the highest posterior probability given the evidence E = e:

  x* = argmax_x P(X = x | E = e) = argmax_x P(E = e | X = x) P(X = x) / P(E = e)
     = argmax_x P(E = e | X = x) P(X = x)

  P(x | e) ∝ P(e | x) P(x)
  (posterior ∝ likelihood × prior)

• Maximum likelihood (ML) decision:

  x* = argmax_x P(e | x)
Naïve Bayes Classifier
• What can we do if our data d has several attributes?
• Naïve Bayes assumption: Attributes that describe data instances are
conditionally independent given the classification hypothesis
  P(d | h) = P(a_1, …, a_T | h) = Π_t P(a_t | h)

– it is a simplifying assumption, obviously it may be violated in reality


– in spite of that, it works well in practice
• The Bayesian classifier that uses the Naïve Bayes assumption and
computes the MAP hypothesis is called Naïve Bayes classifier
• One of the most practical learning methods
• Successful applications:
– Medical Diagnosis
– Text classification
Example. ‘Play Tennis’ data
Day    Outlook   Temperature  Humidity  Wind    PlayTennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No
Naïve Bayes solution
Classify any new datum instance x = (a_1, …, a_T) as:

  h_NB = argmax_h P(h) P(x | h) = argmax_h P(h) Π_t P(a_t | h)

• To do this based on training examples, we need to estimate the parameters from the training examples:
  – For each target value (hypothesis) h: P̂(h), an estimate of P(h)
  – For each attribute value a_t of each datum instance: P̂(a_t | h), an estimate of P(a_t | h)

Based on the examples in the table, classify the following datum x:
  x = (Outlook = Sunny, Temp = Cool, Humidity = High, Wind = Strong)
• That means: play tennis or not?

  h_NB = argmax_{h ∈ {yes, no}} P(h) P(x | h) = argmax_{h ∈ {yes, no}} P(h) Π_t P(a_t | h)
       = argmax_{h ∈ {yes, no}} P(h) P(Outlook = sunny | h) P(Temp = cool | h) P(Humidity = high | h) P(Wind = strong | h)

• Working:
  P(PlayTennis = yes) = 9/14 = 0.64
  P(PlayTennis = no) = 5/14 = 0.36
  P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
  P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
  etc.
  P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
  P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206
  ⇒ answer: PlayTennis(x) = no
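This working can be reproduced with a short script. A minimal sketch in plain Python (the table above is re-typed as a list; relative-frequency estimates as on the slide, no smoothing):

```python
from collections import Counter, defaultdict

# 'Play Tennis' training data: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]

class_counts = Counter(row[-1] for row in data)     # 9 Yes, 5 No
attr_counts = defaultdict(Counter)                  # per-class counts of (attribute position, value)
for row in data:
    h = row[-1]
    for t, value in enumerate(row[:-1]):
        attr_counts[h][(t, value)] += 1

def score(x, h):
    """Unnormalised posterior P(h) * prod_t P(a_t | h), using relative-frequency estimates."""
    p = class_counts[h] / len(data)
    for t, value in enumerate(x):
        p *= attr_counts[h][(t, value)] / class_counts[h]
    return p

x = ("Sunny", "Cool", "High", "Strong")
scores = {h: score(x, h) for h in class_counts}
print(scores)                                        # roughly {'No': 0.0206, 'Yes': 0.0053}
print("PlayTennis(x) =", max(scores, key=scores.get))  # -> No
```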
Example: Training Dataset
Class: C1: buys_computer = 'yes', C2: buys_computer = 'no'
Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
Learning to classify text
• Learn from examples which articles are of
interest
• The attributes are the words
• Observe that the Naïve Bayes assumption just means that we have a random sequence model within each class!
• NB classifiers are one of the most effective for
this task
• Resources for those interested:
– Tom Mitchell: Machine Learning (book) Chapter 6.
Results on a benchmark text corpus
Learning and inference pipeline
[Diagram: Learning: training samples → features, together with training labels → training → learned model.
Inference: test sample → features → learned model → prediction.]

• A measurable variable that is (rather,


should be) distinctive of something we
want to model.
• We usually choose features that are useful
to identify something, i.e., to do
classification
– Ex: Cô gái đó rất đẹp trong bữa tiệc hôm đó.
• We often need several features to
adequately model something – but not too
many! 21
Feature vectors
• Values for several features of an observation can be put into a single vector

  # proper nouns   # 1st-person pronouns   # commas
  2                0                       0
  5                0                       0
  0                1                       1

Feature vectors
• Features should be useful in discriminating between categories.

Feature Representation
• Convert a text example to a vector using bag-of-words features
• Very large vector space (size of the vocabulary), sparse features
• Requires indexing the features (mapping them to axes)
• More sophisticated feature mappings are possible (tf-idf), as well as lots of other features: character n-grams, parts of speech, …
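As a small illustration of the bag-of-words mapping, a sketch using only the Python standard library (the tiny vocabulary and the example sentence are invented; a real system would index the whole training vocabulary):

```python
from collections import Counter

# Toy vocabulary; in practice it is built by indexing every word seen in the training corpus.
vocab = ["the", "party", "girl", "beautiful", "was", "very", "that", "at"]
index = {word: i for i, word in enumerate(vocab)}   # word -> axis in the vector space

def bag_of_words(text):
    """Map a text to a |V|-dimensional count vector; out-of-vocabulary words are dropped."""
    vec = [0] * len(vocab)
    for word, count in Counter(text.lower().split()).items():
        if word in index:
            vec[index[word]] = count
    return vec

print(bag_of_words("That girl was very beautiful at that party"))
# -> [0, 1, 1, 1, 1, 1, 2, 1]  (sparse in a realistic, vocabulary-sized space)
```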
Case study:
Text document classification
• MAP decision: assign a document to the class with the highest
posterior P(class | document)

• Example: spam classification


– Classify a message as spam if P(spam | message) > P(¬spam | message)
Case study:
Text document classification
• MAP decision: assign a document to the class with the highest
posterior P(class | document)

• We have P(class | document) ∝ P(document | class) P(class)

• To enable classification, we need to be able to estimate the


likelihoods P(document | class) for all classes and
priors P(class)
Naïve Bayes Representation
• Goal: estimate likelihoods P(document | class)
and priors P(class)
• Likelihood: bag of words representation
– The document is a sequence of words (w1, …, wn)
– The order of the words in the document is not important
– Each word is conditionally independent of the others given
document class
n
P(document | class) = P(w1, ... , wn | class) = Õ P(wi | class)
i=1
Bag of words illustration

US Presidential Speeches Tag Cloud
http://chir.ag/projects/preztags/
Naïve Bayes Representation
• Likelihood under the bag-of-words model:

  P(document | class) = P(w_1, …, w_n | class) = Π_{i=1}^{n} P(w_i | class)

  – Thus, the problem is reduced to estimating the marginal likelihoods of individual words, P(w_i | class)
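In practice the product of many small word probabilities underflows floating point, so implementations usually sum log-probabilities instead. A minimal sketch (the word-likelihood tables are invented for illustration; the priors follow the later parameter-estimation slide):

```python
import math

# Hypothetical per-class word likelihoods P(w | class), e.g. estimated from a training corpus.
word_probs = {
    "spam":     {"free": 0.05,  "money": 0.04,  "meeting": 0.001},
    "not_spam": {"free": 0.005, "money": 0.004, "meeting": 0.03},
}
priors = {"spam": 0.33, "not_spam": 0.67}

def log_posterior(words, cls, unk=1e-6):
    """log P(cls) + sum_i log P(w_i | cls); unseen words get a tiny placeholder probability."""
    score = math.log(priors[cls])
    for w in words:
        score += math.log(word_probs[cls].get(w, unk))
    return score

doc = "free money free".split()
scores = {c: log_posterior(doc, c) for c in priors}
print(max(scores, key=scores.get))   # -> 'spam' for this toy example
```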
Parameter estimation
• Model parameters: feature likelihoods P(word | class) and
priors P(class)
– How do we obtain the values of these parameters?

  [Table: priors P(spam) = 0.33, P(¬spam) = 0.67, together with per-word likelihood columns P(word | spam) and P(word | ¬spam)]
Parameter estimation
• Model parameters: feature likelihoods P(word | class) and
priors P(class)
– How do we obtain the values of these parameters?
– Need training set of labeled samples from both classes

  P(word | class) = (# of occurrences of this word in docs from this class) / (total # of words in docs from this class)

  – This is the maximum likelihood (ML) estimate, or the estimate that maximizes the likelihood of the training data:

    Π_{d=1}^{D} Π_{i=1}^{n_d} P(w_{d,i} | class_d)

    (d: index of training document, i: index of a word within the document)
Parameter estimation
• Parameter estimate:

  P(word | class) = (# of occurrences of this word in docs from this class) / (total # of words in docs from this class)

• Parameter smoothing: dealing with words that were never seen or seen too few times
  – Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did

  P(word | class) = (# of occurrences of this word in docs from this class + 1) / (total # of words in docs from this class + V)

  (V: total number of unique words)
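A minimal sketch of both estimates on a toy labeled corpus (the four documents are invented for illustration); with Laplacian smoothing no word probability is ever zero:

```python
from collections import Counter

# Toy training set: (document text, class label)
train = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
for text, c in train:
    word_counts[c].update(text.split())

vocab = set(w for counts in word_counts.values() for w in counts)
V = len(vocab)                                       # 10 unique words in this toy corpus

def p_word_given_class(word, c, smoothing=True):
    count = word_counts[c][word]
    total = sum(word_counts[c].values())
    if smoothing:                                    # Laplacian smoothing: add 1 per word, V to the total
        return (count + 1) / (total + V)
    return count / total                             # plain maximum-likelihood estimate

print(p_word_given_class("money", "spam"))                      # smoothed: (2 + 1) / (6 + 10)
print(p_word_given_class("meeting", "spam"))                    # smoothed: (0 + 1) / (6 + 10), never zero
print(p_word_given_class("meeting", "spam", smoothing=False))   # ML estimate: 0.0
```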


Summary: Naïve Bayes for Document Classification
• Naïve Bayes model: assign the document to the class with the highest posterior:

  P(class | document) ∝ P(class) Π_{i=1}^{n} P(w_i | class)

• Model parameters:
  – priors: P(class_1), …, P(class_K)
  – likelihood of class 1: P(w_1 | class_1), …, P(w_n | class_1)
  – …
  – likelihood of class K: P(w_1 | class_K), …, P(w_n | class_K)
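If scikit-learn is available, the whole model above fits in a few lines; a sketch under that assumption (the tiny corpus is invented, and MultinomialNB with alpha = 1.0 is the library's Laplace-smoothed Naïve Bayes):

```python
# Assumes scikit-learn is installed: pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs   = ["win cheap money now", "money offer click here",
          "meeting agenda attached", "see you at the meeting tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()                 # bag-of-words counts
X = vectorizer.fit_transform(docs)

clf = MultinomialNB(alpha=1.0)                 # alpha=1.0 corresponds to add-one (Laplace) smoothing
clf.fit(X, labels)

test = vectorizer.transform(["cheap money meeting"])
print(clf.predict(test))                       # predicted class for the new document
```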
Laplace Smoothing
• Laplace's estimate:
  – Pretend you saw every outcome once more than you actually did

Laplace Smoothing
• Laplace's estimate (extended):
  – Pretend you saw every outcome k extra times
  – What's Laplace with k = 0?
  – k is the strength of the prior
• Laplace for conditionals:
  – Smooth each condition independently
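One common way to write Laplace's estimate with pseudocount k (with k = 0 it reduces to the plain relative-frequency estimate; the conditional form smooths each condition independently):

  P_LAP,k(x) = (count(x) + k) / (N + k·|X|)            (N: total number of observations, |X|: number of possible outcomes)
  P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k·|X|)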
Summarization
• Bayes’ rule can be turned into a classifier
• Maximum A Posteriori (MAP) hypothesis estimation
incorporates prior knowledge; Max Likelihood doesn’t
• Naive Bayes Classifier is a simple but effective Bayesian
classifier for vector data (i.e. data with several attributes)
that assumes that attributes are independent given the
class.
• Bayesian classification is a generative approach to
classification
Reference
• Slides of ML Course, University of Birmingham
• Textbook reading (contains details about using Naïve
Bayes for text classification):
Tom Mitchell, Machine Learning (book), Chapter 6.
• Software: NB for classifying text:
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-
bayes.html
https://julianstier.com/posts/2021/01/text-classification-with-naive-
bayes-in-numpy/
• Useful reading for those interested to learn more about
NB classification, beyond the scope of this module:
http://www-2.cs.cmu.edu/~tom/NewChapters.html
