Foundations of Data Science - Unit 6 - Naive Bayes
Data Science
Unit 6
3/20/2023
Outline
▪ Supervised Learning
▪ Revision of Probability Theory and Bayes
Theorem
▪ Naïve Bayes Classifier
Notation
Conditional Probability
▪ Independent Events – each event is not affected by any other event
▪ Example – tossing a coin
▪ Dependent Events – an event is affected by previous events
▪ Example – marbles in a bag
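The two examples above can be made concrete with a small sketch. The bag contents (2 blue and 3 red marbles) are hypothetical numbers chosen only for illustration; they do not come from the slides.

```python
from fractions import Fraction

# Independent events: two fair coin tosses.
# The outcome of the first toss does not change the second.
p_heads = Fraction(1, 2)
p_two_heads = p_heads * p_heads

# Dependent events: drawing two marbles without replacement.
# Hypothetical bag: 2 blue and 3 red marbles (illustrative counts only).
blue, red = 2, 3
total = blue + red
p_first_blue = Fraction(blue, total)
# After one blue marble is removed, the bag has changed.
p_second_blue_given_first_blue = Fraction(blue - 1, total - 1)
p_both_blue = p_first_blue * p_second_blue_given_first_blue

print(p_two_heads)                     # 1/4
print(p_second_blue_given_first_blue)  # 1/4
print(p_both_blue)                     # 1/10
```

In the marble case the probability of the second draw depends on the first, which is why the conditional probability is used in the product.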
▪ P(A | B, C) = 12/16
▪ P(A, B | ~C) = 5 / 19
▪ P(B | ~A, C) = 4 / 24
Dr. Muhammad Usman 3/20/2023 7
Conditional Independence
▪ Two events A and B are independent if knowing that A has happened does not say anything about B happening.
P(A, B) = P(A) P(B)
P(A | B) = P(A)
▪ Two events A and B are conditionally independent given a third event C precisely if the occurrence or non-occurrence of A and B are independent events in their conditional probability distribution given C.
P(A, B | C) = P(A | C) P(B | C)
P(A | B, C) = P(A | C)
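A small numerical check may help separate the two definitions. The sketch below builds a hypothetical joint distribution (all numbers are assumptions made for illustration) in which A and B are conditionally independent given C by construction, yet are not independent when C is ignored.

```python
import math
from itertools import product

# Hypothetical parameters (illustrative only).
p_c = {True: 0.5, False: 0.5}          # P(C)
p_a_given_c = {True: 0.9, False: 0.1}  # P(A | C), P(A | ~C)
p_b_given_c = {True: 0.8, False: 0.2}  # P(B | C), P(B | ~C)

def bern(p, value):
    """Probability that a binary event with P(true) = p takes the given value."""
    return p if value else 1.0 - p

# Joint P(A, B, C) built as P(C) P(A | C) P(B | C).
joint = {
    (a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
    for a, b, c in product([True, False], repeat=3)
}

def prob(pred):
    """Sum the joint probability over all outcomes where pred holds."""
    return sum(p for (a, b, c), p in joint.items() if pred(a, b, c))

p_a = prob(lambda a, b, c: a)
p_b = prob(lambda a, b, c: b)
p_ab = prob(lambda a, b, c: a and b)

p_c_true = prob(lambda a, b, c: c)
p_ab_given_c = prob(lambda a, b, c: a and b and c) / p_c_true
p_a_given_c_val = prob(lambda a, b, c: a and c) / p_c_true
p_b_given_c_val = prob(lambda a, b, c: b and c) / p_c_true

# Conditional independence given C holds ...
print(math.isclose(p_ab_given_c, p_a_given_c_val * p_b_given_c_val))  # True
# ... but marginal independence does not.
print(math.isclose(p_ab, p_a * p_b))                                  # False
```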
Bayes Theorem
▪ P(A | B) = P(B | A) P(A) / P(B)
           = P(B | A) P(A) / [P(B | A) P(A) + P(B | ~A) P(~A)]
▪ P(A) is the prior probability and P(A | B) is the posterior probability.
▪ Suppose events A1, A2, …, Ak are mutually exclusive and exhaustive; i.e., exactly one of the events must occur. Then for any event B:
P(Ai | B) = P(B | Ai) P(Ai) / Σj P(B | Aj) P(Aj), where the sum runs over j = 1, …, k
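The general form with k mutually exclusive and exhaustive events can be sketched as a short function. The function name `posterior` and the three-hypothesis numbers are hypothetical, introduced only for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(Ai | B) for each hypothesis Ai.

    priors[i]      = P(Ai), assumed to sum to 1
    likelihoods[i] = P(B | Ai)
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | Ai) P(Ai)
    evidence = sum(joint)  # P(B), by the law of total probability
    return [j / evidence for j in joint]

# Hypothetical example with three hypotheses (numbers are illustrative only).
priors = [0.5, 0.3, 0.2]
likelihoods = [0.01, 0.02, 0.05]
post = posterior(priors, likelihoods)
print([round(p, 4) for p in post])
```

The posteriors always sum to 1 because each joint term is divided by their total.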
Example I
▪ According to the American Lung Association, 7% of the population has lung cancer. Of those who have lung cancer, 90% are smokers; of those who do not, 25.3% are smokers.
▪ Determine the probability that a randomly selected smoker has lung cancer.
Example I Solution
▪ Let L = Lung Cancer, S = Smoker
▪ Given that
▪ P(L) = 0.07
▪ P(S | L) = 0.90 P(~S | L) = 0.10
▪ P(S | ~L) = 0.253 P(~S | ~L) = 0.747
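From the given values, the requested probability P(L | S) follows from Bayes' theorem with P(S) expanded over L and ~L. A minimal sketch of the arithmetic, with variable names chosen here for readability:

```python
p_l = 0.07               # P(L)
p_s_given_l = 0.90       # P(S | L)
p_s_given_not_l = 0.253  # P(S | ~L)

# P(S) = P(S | L) P(L) + P(S | ~L) P(~L)
p_s = p_s_given_l * p_l + p_s_given_not_l * (1 - p_l)

# P(L | S) = P(S | L) P(L) / P(S)
p_l_given_s = p_s_given_l * p_l / p_s

print(round(p_s, 5))          # 0.29829
print(round(p_l_given_s, 4))  # 0.2112
```

So roughly 21% of smokers have lung cancer under these figures, about three times the 7% prior.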
Example II
▪ Assume that about 1 in 1000 individuals in a given organization have
committed a security violation.
▪ Assume that the sensitivity of a routine screening polygraph is about
85%. That is, the probability that the polygraph report will indicate a
concern is about 85% if the individual has committed a security
violation.
▪ Assume the specificity of the polygraph is about 80%. That is, if the
individual has not committed a security violation, there is about an 80%
chance that the polygraph report will not indicate a concern.
▪ What is the posterior probability that an individual whose polygraph
report indicates a concern has committed a security violation?
Example II Solution
▪ Let
▪ S = Security Violation Committed,
▪ T = Test Positive
▪ Given that
▪ P(S) = 0.001
▪ P(T | S) = 0.85 P(~T | S) = 0.15
▪ P(T | ~S) = 0.20 P(~T | ~S) = 0.80
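One way to see the answer is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical organization of 1,000,000 individuals, a size chosen only so that every count is a whole number.

```python
from fractions import Fraction

population = 1_000_000                  # hypothetical size (assumption)
violators = population // 1000          # P(S) = 0.001
non_violators = population - violators

true_positives = violators * 85 // 100       # P(T | S) = 0.85
false_positives = non_violators * 20 // 100  # P(T | ~S) = 0.20

# P(S | T) = flagged violators / all flagged individuals
p_s_given_t = Fraction(true_positives, true_positives + false_positives)

print(true_positives, false_positives)  # 850 199800
print(p_s_given_t)                      # 17/4013
print(round(float(p_s_given_t), 4))     # 0.0042
```

Even with a positive report, the posterior is only about 0.4%, because false positives from the large non-violating group far outnumber true positives.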
Bayesian Classifiers
▪ Consider each attribute and class label as random variables
▪ Given a record with attributes (A1, A2,…,An)
▪ Goal is to predict class C
▪ Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An )
▪ Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayesian Classifiers
▪ Approach:
▪ Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes' theorem
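The approach above can be written out as follows. This is the standard textbook formulation, given here as a sketch; the symbol \hat{C} for the predicted class is notation introduced for this block. The last line is the naive Bayes assumption that attributes are conditionally independent given the class.

```latex
\begin{align*}
P(C \mid A_1, \dots, A_n)
  &= \frac{P(A_1, \dots, A_n \mid C)\, P(C)}{P(A_1, \dots, A_n)} \\
\hat{C}
  &= \arg\max_{C} \; P(A_1, \dots, A_n \mid C)\, P(C)
  && \text{(the denominator does not depend on } C\text{)} \\
P(A_1, \dots, A_n \mid C)
  &= \prod_{i=1}^{n} P(A_i \mid C)
  && \text{(naive assumption)}
\end{align*}
```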
How to Estimate Probabilities from Data?
Tid  Refund         Marital Status  Taxable Income  Evade
     (categorical)  (categorical)   (continuous)    (class)
1    Yes            Single          125K            No
2    No             Married         100K            No
3    No             Single          70K             No
4    Yes            Married         120K            No
5    No             Divorced        95K             Yes
6    No             Married         60K             No
7    Yes            Divorced        220K            No
8    No             Single          85K             Yes
9    No             Married         75K             No
10   No             Single          90K             Yes
▪ Class: P(C) = Nc/N
▪ e.g., P(No) = 7/10, P(Yes) = 3/10
▪ For discrete attributes: P(Ai | Ck) = |Aik| / NCk
▪ where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
▪ Examples:
P(Status=Married | No) = 4/7
P(Refund=Yes | Yes) = 0
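The counting estimates P(C) = Nc/N and P(Ai | Ck) = |Aik| / NCk can be reproduced with a short sketch over the ten records. The helper name `conditional` is hypothetical, and Taxable Income is left out because it is continuous and this counting rule applies to discrete attributes.

```python
from collections import Counter
from fractions import Fraction

# (Refund, Marital Status, Evade) for Tid 1..10.
records = [
    ("Yes", "Single", "No"),
    ("No", "Married", "No"),
    ("No", "Single", "No"),
    ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"),
    ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),
    ("No", "Single", "Yes"),
    ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

# Class priors: P(C) = Nc / N
class_counts = Counter(evade for _, _, evade in records)
n = len(records)
priors = {c: Fraction(k, n) for c, k in class_counts.items()}

def conditional(attr_index, value, cls):
    """P(attribute = value | class = cls), estimated by counting."""
    matches = sum(1 for r in records if r[attr_index] == value and r[2] == cls)
    return Fraction(matches, class_counts[cls])

print(priors["No"], priors["Yes"])      # 7/10 3/10
print(conditional(1, "Married", "No"))  # 4/7
print(conditional(0, "Yes", "Yes"))     # 0
```

The zero estimate for P(Refund=Yes | Yes) arises because no record in class Yes has Refund=Yes.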
Naïve Bayes
Classification: Mammals vs. Non-mammals
▪ Train the model (learn the parameters) using the given data set.
▪ Apply the learned model on new cases.
Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals
New case:
Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?
A: attributes, M: mammals, N: non-mammals
P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027
P(A | M) P(M) > P(A | N) P(N) => Mammals
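The training and prediction steps for this data set can be sketched in a few lines. This is a minimal illustration assuming plain frequency counts with no smoothing; the names `cond_counts` and `score` are hypothetical and not from the slides.

```python
from collections import Counter, defaultdict

# Columns: Give Birth, Can Fly, Live in Water, Have Legs, Class
data = [
    ("yes", "no", "no", "yes", "mammals"),            # human
    ("no", "no", "no", "no", "non-mammals"),          # python
    ("no", "no", "yes", "no", "non-mammals"),         # salmon
    ("yes", "no", "yes", "no", "mammals"),            # whale
    ("no", "no", "sometimes", "yes", "non-mammals"),  # frog
    ("no", "no", "no", "yes", "non-mammals"),         # komodo
    ("yes", "yes", "no", "yes", "mammals"),           # bat
    ("no", "yes", "no", "yes", "non-mammals"),        # pigeon
    ("yes", "no", "no", "yes", "mammals"),            # cat
    ("yes", "no", "yes", "no", "non-mammals"),        # leopard shark
    ("no", "no", "sometimes", "yes", "non-mammals"),  # turtle
    ("no", "no", "sometimes", "yes", "non-mammals"),  # penguin
    ("yes", "no", "no", "yes", "mammals"),            # porcupine
    ("no", "no", "yes", "no", "non-mammals"),         # eel
    ("no", "no", "sometimes", "yes", "non-mammals"),  # salamander
    ("no", "no", "no", "yes", "non-mammals"),         # gila monster
    ("no", "no", "no", "yes", "mammals"),             # platypus
    ("no", "yes", "no", "yes", "non-mammals"),        # owl
    ("yes", "no", "yes", "no", "mammals"),            # dolphin
    ("no", "yes", "no", "yes", "non-mammals"),        # eagle
]

# Training: count classes and attribute values within each class.
class_counts = Counter(row[-1] for row in data)
cond_counts = defaultdict(lambda: defaultdict(Counter))  # [class][attr index][value]
for *values, cls in data:
    for i, v in enumerate(values):
        cond_counts[cls][i][v] += 1

def score(record, cls):
    """Unnormalized posterior P(A | cls) P(cls) under the naive assumption."""
    s = class_counts[cls] / len(data)  # prior P(cls)
    for i, v in enumerate(record):
        s *= cond_counts[cls][i][v] / class_counts[cls]  # P(Ai = v | cls)
    return s

# New case: gives birth, cannot fly, lives in water, has no legs.
query = ("yes", "no", "yes", "no")
scores = {cls: score(query, cls) for cls in class_counts}
prediction = max(scores, key=scores.get)
print(prediction)  # mammals
```

Because no smoothing is applied, an attribute value never seen within a class would drive that class's score to zero.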