Lecture # 7
Naïve Bayes Classifier
Naïve Bayes Classifier
Thomas Bayes
1702 - 1761
We will start off with a visual intuition, before looking at the math…
Background
There are three broad approaches to building a classifier:
a) Model a classification rule directly
   Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class membership given the input data
   Example: multi-layer perceptron with the cross-entropy cost
c) Build a probabilistic model of the data within each class
   Examples: naive Bayes, model-based classifiers
a) and b) are examples of discriminative classification
c) is an example of generative classification
b) and c) are both examples of probabilistic classification
[Figure: scatter plot of the insect data, Abdomen Length (x-axis) vs. Antenna Length (y-axis), with grasshoppers and katydids as the two classes.]
Remember this example? Let's get lots more data…
With a lot of data, we can build a histogram. Let us just build one for "Antenna Length" for now…
[Figure: histograms of Antenna Length for katydids and grasshoppers, each summarized by a fitted normal distribution.]
We can leave the histograms as they are, or we can summarize them with two normal distributions.
p(cj | d) = probability of class cj, given that we have observed d

[Figure: the two class-conditional distributions queried at antenna lengths 3, 7 and 5 — one case that clearly favors one class, one that clearly favors the other, and one ambiguous case where the two classes are about equally likely.]
Bayes Classifiers
That was a visual intuition for a simple case of the Bayes classifier, also called:
• Idiot Bayes
• Naïve Bayes
• Simple Bayes
We are about to see some of the mathematical formalisms, and more examples,
but keep in mind the basic idea.
Find out the probability of the previously unseen instance belonging to each
class, then simply pick the most probable class.
Self Study
Concepts related to probability
Probability Basics
• Prior, conditional and joint probability
  – Prior probability: P(X)
  – Conditional probability: P(X1 | X2), P(X2 | X1)
  – Joint probability: X = (X1, X2), P(X) = P(X1, X2)
  – Relationship: P(X1, X2) = P(X2 | X1) P(X1) = P(X1 | X2) P(X2)
  – Independence: P(X2 | X1) = P(X2), P(X1 | X2) = P(X1), P(X1, X2) = P(X1) P(X2)
• Bayes' rule: P(C | X) = P(X | C) P(C) / P(X)
  – MAP (maximum a posteriori) rule: assign the class that maximizes the posterior P(C | X)
Applied to the PlayTennis example with test instance x′ = (Sunny, Cool, High, Strong):
P(Yes | x′) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No | x′) ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Since 0.0206 > 0.0053, the MAP rule labels x′ as No.
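These numbers can be reproduced directly. Below is a minimal Python sketch, assuming the standard 14-example PlayTennis data (9 Yes, 5 No), from which these conditional probabilities are usually estimated; that table is not shown on the slide.

```python
# Hedged sketch: conditional probabilities below are the usual estimates
# from the 14-example PlayTennis table (an assumption, not given here).
priors = {"Yes": 9/14, "No": 5/14}
likelihoods = {
    "Yes": {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9},
    "No":  {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5},
}
x_prime = ["Sunny", "Cool", "High", "Strong"]

scores = {}
for c in priors:
    score = priors[c]
    for value in x_prime:
        score *= likelihoods[c][value]        # naive independence assumption
    scores[c] = score

print(scores)                        # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))   # 'No' -- the MAP decision
```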
Officer Drew: is Drew male or female? Apply Bayes' rule using a small database of names:

p(cj | d) = p(d | cj) p(cj) / p(d)

Name      Sex
Drew      Male
Claudia   Female
Drew      Female
Drew      Female
Alberto   Male
Karin     Female
Nina      Female
Sergio    Male
p(male | drew)   = (1/3 × 3/8) / (3/8) = 0.125 / (3/8) ≈ 0.33
p(female | drew) = (2/5 × 5/8) / (3/8) = 0.250 / (3/8) ≈ 0.67

Officer Drew is more likely to be a Female.
(And indeed, Officer Drew IS a female!)
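The same arithmetic can be written in a few lines of Python; a minimal sketch using the eight-row name/sex table above:

```python
# Single-attribute Bayes computation for the Officer Drew example.
data = [("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"),
        ("Drew", "Female"), ("Alberto", "Male"), ("Karin", "Female"),
        ("Nina", "Female"), ("Sergio", "Male")]

def posterior(sex, name):
    in_class = [n for n, s in data if s == sex]
    p_name_given_class = sum(n == name for n in in_class) / len(in_class)  # p(d | cj)
    p_class = len(in_class) / len(data)                                    # p(cj)
    p_name = sum(n == name for n, _ in data) / len(data)                   # p(d)
    return p_name_given_class * p_class / p_name

print(posterior("Male", "Drew"))     # ~0.33
print(posterior("Female", "Drew"))   # ~0.67 -> Drew is more likely female
```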
So far we have only considered Bayes classification when we have one attribute (the "antennae length", or the "name"). But we may have many features.

p(cj | d) = p(d | cj) p(cj) / p(d)

How do we use all the features?

Officer Drew is blue-eyed, over 170 cm tall, and has long hair:

p(officer drew | Female) = 2/5 * 3/5 * ….
p(officer drew | Male) = 2/3 * 2/3 * ….
The naive Bayes classifier is often represented as this type of graph: the class node cj at the top, with an arrow from cj to each feature node. Note the direction of the arrows, which state that each class causes certain features, with a certain probability.

[Figure: naive Bayes graphical model with class node cj and its feature nodes.]
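To make the picture concrete, here is a minimal sketch of multi-feature naive Bayes prediction; the function name and the toy probability tables are illustrative, not taken from the slides:

```python
# Posterior score = prior * product of per-feature likelihoods (the arrows
# in the graph: the class generates each feature independently).
def naive_bayes_predict(x, priors, likelihoods):
    """x: {feature: value}; likelihoods[c][feature][value] = p(value | c)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for feature, value in x.items():
            score *= likelihoods[c][feature].get(value, 0.0)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical two-feature example with made-up probabilities:
priors = {"A": 0.6, "B": 0.4}
likelihoods = {
    "A": {"color": {"red": 0.7, "blue": 0.3}, "shape": {"circle": 0.8, "square": 0.2}},
    "B": {"color": {"red": 0.2, "blue": 0.8}, "shape": {"circle": 0.4, "square": 0.6}},
}
print(naive_bayes_predict({"color": "red", "shape": "circle"}, priors, likelihoods))
```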
Relevant Issues
• Violation of the independence assumption
  – For many real-world tasks, P(X1, …, Xn | C) ≠ P(X1 | C) ⋯ P(Xn | C)
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem
  – If no training example has the attribute value Xj = ajk, then P̂(Xj = ajk | C = ci) = 0
  – In this circumstance, P̂(x1 | ci) ⋯ P̂(ajk | ci) ⋯ P̂(xn | ci) = 0 during test
  – As a remedy, conditional probabilities are estimated with the m-estimate:

      P̂(Xj = ajk | C = ci) = (nc + m·p) / (n + m)

    nc : number of training examples for which Xj = ajk and C = ci
    n  : number of training examples for which C = ci
    p  : prior estimate (usually p = 1/t for t possible values of Xj)
    m  : weight given to the prior (number of "virtual" examples, m ≥ 1)
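A minimal sketch of the m-estimate above; the function name and arguments are mine, only the formula comes from the slide:

```python
def m_estimate(n_c, n, t, m=1.0):
    """P-hat(Xj = ajk | C = ci) = (n_c + m*p) / (n + m), with prior p = 1/t."""
    p = 1.0 / t
    return (n_c + m * p) / (n + m)

# An attribute value never seen with class ci (n_c = 0) among n = 10 examples
# of ci, with t = 3 possible attribute values:
print(m_estimate(0, 10, 3))   # ~0.03 instead of a hard zero
```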
Estimating Probabilities
• Normally, probabilities are estimated based on observed
frequencies in the training data.
• If D contains nk examples in category yk, and nijk of these nk examples have the j-th value xij for feature Xi, then:

      P(Xi = xij | Y = yk) = nijk / nk
• However, estimating such probabilities from small training
sets is error-prone.
Naïve Bayes Example
Probability        positive   negative
P(Y)               0.5        0.5
P(medium | Y)      0.1        0.2
P(red | Y)         0.9        0.3
P(circle | Y)      0.9        0.3

Test instance: <medium, red, circle>
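A quick check of what the table implies for the test instance, as a small Python sketch (unnormalized posteriors, then normalized):

```python
# Score each class for <medium, red, circle> using the table above.
pos = 0.5 * 0.1 * 0.9 * 0.9   # P(positive) * P(medium|pos) * P(red|pos) * P(circle|pos)
neg = 0.5 * 0.2 * 0.3 * 0.3   # P(negative) * P(medium|neg) * P(red|neg) * P(circle|neg)
print(pos, neg)               # ~0.0405 vs ~0.009 -> predict positive
print(pos / (pos + neg))      # ~0.82 posterior for positive after normalization
```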
Laplace Smoothing Example
• Assume training set contains 10 positive examples:
– 4: small
– 0: medium
– 6: large
• Estimate parameters as follows (if m=1, p=1/3)
– P(small | positive) = (4 + 1/3) / (10 + 1) = 0.394
– P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03
– P(large | positive) = (6 + 1/3) / (10 + 1) = 0.576
– P(small or medium or large | positive) = 1.0
Numerical Stability
• It is often the case that machine learning algorithms need to
work with very small numbers
– Imagine computing the probability of 2000 independent coin flips
– MATLAB thinks that (0.5)^2000 = 0
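The same underflow happens in any double-precision environment, not just MATLAB; a quick check in Python:

```python
import math

# 0.5**2000 is about 8.7e-603, far below the smallest positive double (~5e-324),
# so it underflows to exactly 0.0; its logarithm is perfectly representable.
print(0.5 ** 2000)            # 0.0
print(2000 * math.log(0.5))   # -1386.29...
```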
Stochastic Language Models
• Models probability of generating strings (each
word in turn) in the language (commonly all
strings over ∑). E.g., unigram model
Model M (unigram):
  the    0.2
  a      0.1
  guy    0.01
  fruit  0.01
  said   0.03
  likes  0.02
  …

the    guy    likes   the    fruit
0.2    0.01   0.02    0.2    0.01

Multiply: P(s | M) = 0.00000008
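A minimal Python sketch of scoring the string under the unigram model M above:

```python
# Unigram model: multiply the per-word probabilities.
M = {"the": 0.2, "a": 0.1, "guy": 0.01, "fruit": 0.01, "said": 0.03, "likes": 0.02}
s = "the guy likes the fruit".split()

p = 1.0
for word in s:
    p *= M[word]
print(p)   # ~8e-08, i.e. P(s | M) = 0.00000008
```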
Numerical Stability
• Instead of comparing P(Y=5|X1,…,Xn) with P(Y=6|X1,…,Xn),
– Compare their logarithms
Underflow Prevention
• Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing logs of
probabilities rather than multiplying
probabilities.
• Class with highest final un-normalized log
probability score is still the most probable.
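A minimal sketch of the log-space computation, reusing the numbers from the earlier medium/red/circle example (the function name is mine):

```python
import math

def log_score(prior, feature_likelihoods):
    # Sum of log-probabilities instead of a product of probabilities.
    return math.log(prior) + sum(math.log(p) for p in feature_likelihoods)

score_pos = log_score(0.5, [0.1, 0.9, 0.9])   # log(0.0405)
score_neg = log_score(0.5, [0.2, 0.3, 0.3])   # log(0.009)
print(score_pos, score_neg)
print(score_pos > score_neg)   # True -- same decision as before, no underflow risk
```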
Relevant Issues
• Continuous-valued input attributes
  – Infinitely many possible values for an attribute
  – Conditional probability modeled with the normal distribution:

      P̂(Xj | C = ci) = 1 / (sqrt(2π) σji) · exp( −(Xj − μji)² / (2 σji²) )

    μji : mean (average) of the attribute values Xj of examples for which C = ci
    σji : standard deviation of the attribute values Xj of examples for which C = ci
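A minimal sketch of that Gaussian class-conditional; the mean and standard deviation would be estimated per class from training data (the values used below are placeholders):

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P-hat(Xj = x | C = ci) under a normal with class mean mu and std sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# E.g., the likelihood of a height of 6 feet under a hypothetical class whose
# heights have mean 5.85 ft and standard deviation 0.19 ft:
print(gaussian_likelihood(6.0, 5.85, 0.19))
```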
sex      height (feet)   weight (lbs)   foot size (inches)
sample   6               130            8

Since the posterior numerator is greater in the female case, we predict the sample is female.
Advantages/Disadvantages of Naïve Bayes
• Advantages:
– Fast to train (single scan). Fast to classify
– Not sensitive to irrelevant features
– Handles real and discrete data
– Handles streaming data well
• Disadvantages:
– Assumes independence of features
Conclusions
Naïve Bayes is based on the independence assumption
Training is very easy and fast; it just requires considering each attribute in each class separately
Testing is straightforward; it just requires looking up tables or calculating conditional probabilities with normal distributions
It is a popular generative model
Performance is competitive with most state-of-the-art classifiers even when the independence assumption is violated
Many successful applications, e.g., spam mail filtering
Apart from classification, naïve Bayes can do more…
Conclusions
• Naïve Bayes is:
– Really easy to implement and often works well
– Often a good first thing to try
– Commonly used as a “punching bag” for smarter
algorithms
Acknowledgements
Material in these slides has been taken from the following resources:
Introduction to Machine Learning, E. Alpaydin
Statistical Pattern Recognition: A Review – A. K. Jain et al., IEEE Trans. PAMI 22(1), 2000