Summary:
Probability-based Learning
(from the book: Machine Learning for Predictive Data Analytics)
Duyen Do
CONTENTS
Bayes’ Theorem
Fundamentals
Bayes Prediction
Standard Approach: The Naïve Bayes Model
Conditional Independence and Factorization
Smoothing
Extensions and Variations
Continuous Features
Bayesian Network
Summary
Q&A
Probability Basics
 Probability function: P()
Probability mass function (categorical features) and Probability density function (continuous features)
0 ≤ P(f = l) ≤ 1
Σ_{l ∈ levels(f)} P(f = l) = 1.0
 Prior probability or Unconditional probability
P(X)
 Posterior probability or Conditional probability
P(X|Y)
 Joint probability
P(X, Y)
 Joint probability distribution
Example
Table 6.1
Fundamentals
 Conditional probability in terms of joint probability:
P(X|Y) = P(X, Y) / P(Y)
 Product rule:
P(X,Y) = P(X|Y) P(Y) = P(Y|X) P(X)
0 ≤ P(X|Y) ≤ 1
Σ_i P(X_i | Y) = 1.0
 Chain rule:
P(A,B,C,…,Z) = P(Z) P(Y|Z) P(X|Y,Z)…P(A|B,…,X,Y,Z)
 Theorem of Total Probability:
P(X) = Σ_i P(X | Y_i) P(Y_i)
Fundamentals
Fundamentals: Bayes' Theorem
P(X|Y) = P(Y|X) P(X) / P(Y)
A doctor gives a patient both bad news and good news:
- Bad news: a test that is 99% accurate indicates the patient has a serious disease
- Good news: the disease is rare, striking only 1 in 10,000 people
What is the actual probability that the patient has the disease?
Using Bayes' Theorem:
P(d|t) = P(t|d) P(d) / P(t)
P(t) = P(t|d) P(d) + P(t|-d) P(-d) = (0.99 × 0.0001) + (0.01 × 0.9999) = 0.0101
P(d|t) = (0.99 × 0.0001) / 0.0101 = 0.0098
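A minimal sketch of this calculation in Python, using the numbers from the example above:

```python
# Bayes' theorem for the diagnosis example:
# d = has the disease, t = test is positive.
p_d = 0.0001            # prior: 1 in 10,000 people have the disease
p_t_given_d = 0.99      # test accuracy: P(t | d)
p_t_given_not_d = 0.01  # false positive rate: P(t | -d)

# Theorem of Total Probability: P(t) = P(t|d)P(d) + P(t|-d)P(-d)
p_t = p_t_given_d * p_d + p_t_given_not_d * (1 - p_d)

# Bayes' theorem: P(d|t) = P(t|d)P(d) / P(t)
p_d_given_t = p_t_given_d * p_d / p_t
print(f"P(t) = {p_t:.4f}, P(d|t) = {p_d_given_t:.4f}")  # 0.0101 and 0.0098
```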
Bayesian prediction
P(t=l | q[1],…,q[m]) = P(q[1],…,q[m] | t=l) P(t=l) / P(q[1],…,q[m])
with P(t=l): the prior probability of the target feature t taking the level l
P(q[1],…,q[m]): the joint probability of the descriptive features taking the values q[1],…,q[m]
P(q[1],…,q[m] | t=l): the conditional probability of those values given t=l
Example 1:
Table 6.1
What is the probability that a person has MENINGITIS given HEADACHE=true, FEVER=false, VOMITING=true?
=> P(m | h, -f, v)
P(m | h, -f, v) = P(h, -f, v | m) P(m) / P(h, -f, v)
P(m) = |{ID5, ID8, ID10}| / |{ID1, ID2, ID3, ID4, ID5, ID6, ID7, ID8, ID9, ID10}| = 0.3
P(h, -f, v) = |{ID3, ID4, ID6, ID7, ID8, ID10}| / |{ID1, ID2, ID3, ID4, ID5, ID6, ID7, ID8, ID9, ID10}| = 0.6
P(h, -f, v | m) = P(h | m) P(-f | h, m) P(v | -f, h, m)
= |{ID8, ID10}| / |{ID5, ID8, ID10}| × |{ID8, ID10}| / |{ID8, ID10}| × |{ID8, ID10}| / |{ID8, ID10}|
= 2/3 × 2/2 × 2/2 = 0.667
P(m | h, -f, v) = (0.667 × 0.3) / 0.6 = 0.333
P(-m | h, -f, v) = P(h, -f, v | -m) P(-m) / P(h, -f, v)
= P(h | -m) P(-f | h, -m) P(v | -f, h, -m) P(-m) / P(h, -f, v)
= 0.667
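The same posterior, computed in Python from the counts read off Table 6.1 (the ID sets are the ones listed above):

```python
# Posterior P(m | h, -f, v) from the event counts on this slide.
n_total = 10          # rows ID1..ID10 in Table 6.1
n_m = 3               # meningitis rows: {ID5, ID8, ID10}
n_h_nf_v = 6          # rows with h, -f, v: {ID3, ID4, ID6, ID7, ID8, ID10}
n_h_nf_v_and_m = 2    # rows with h, -f, v and m: {ID8, ID10}

prior = n_m / n_total                 # P(m) = 0.3
evidence = n_h_nf_v / n_total         # P(h, -f, v) = 0.6
likelihood = n_h_nf_v_and_m / n_m     # P(h, -f, v | m) = 0.667

posterior = likelihood * prior / evidence
print(f"P(m | h, -f, v)  = {posterior:.3f}")      # 0.333
print(f"P(-m | h, -f, v) = {1 - posterior:.3f}")  # 0.667
```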
Bayesian prediction
Maximum a posteriori (MAP) prediction:
M(q) = arg max_{l ∈ levels(t)} P(t=l | q[1],…,q[m])
= arg max_{l ∈ levels(t)} P(q[1],…,q[m] | t=l) P(t=l) / P(q[1],…,q[m])
Bayesian MAP prediction model (the denominator does not depend on l, so it can be dropped):
M(q) = arg max_{l ∈ levels(t)} P(q[1],…,q[m] | t=l) P(t=l)
Example 2:
Table 6.1
What is the probability that a person has MENINGITIS given HEADACHE=true, FEVER=true, VOMITING=false?
 P(m | h, f, -v) = P(h | m) P(f | h, m) P(-v | f, h, m) P(m) / P(h, f, -v)
= (0.667 × 0 × 0 × 0.3) / 0.1 = 0.0
P(-m | h, f, -v) = 1.0 − 0.0 = 1.0
 Overfits the data: a feature combination never observed with a target level forces its probability to zero.
Conditional independence
X and Y are conditionally independent given Z, for example when X and Y share the same cause Z:
 P(X | Y, Z) = P(X | Z)
P(X, Y | Z) = P(X | Z) × P(Y | Z)
 Chain rule under conditional independence (each descriptive feature independent of the others given the target):
P(q[1],…,q[m] | t=l) = P(q[1] | t=l) × P(q[2] | t=l) × … × P(q[m] | t=l) = Π_i P(q[i] | t=l)
P(t=l | q[1],…,q[m]) = Π_i P(q[i] | t=l) × P(t=l) / P(q[1],…,q[m])
 Factorized joint probability:
P(H, F, V, M) = P(M) × P(H|M) × P(F|M) × P(V|M)
Conditional Independence and Factorization
If X and Y are independent, then:
P(X|Y) = P(X)
P(X, Y) = P(X) P(Y)
Example 2:
Table 6.1
What is the probability that a person has MENINGITIS given HEADACHE=true, FEVER=true, VOMITING=false?
P(m | h, f, -v) = P(h | m) × P(f | m) × P(-v | m) × P(m) / P(h, f, -v)
= P(h | m) × P(f | m) × P(-v | m) × P(m) / Σ_i P(h | M_i) × P(f | M_i) × P(-v | M_i) × P(M_i)
= 0.1948
P(-m | h, f, -v) = 0.8052
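A minimal naive Bayes sketch in Python. P(m) = 0.3 comes from the earlier slide; the per-feature conditionals below are assumed values consistent with Table 6.1 (they reproduce the 0.1948 above). The max at the end is the MAP prediction of the next slide:

```python
# Naive Bayes posterior for the meningitis query q = (h, f, -v).
priors = {"m": 0.3, "-m": 0.7}
likelihoods = {              # P(q[i] | t) for the query values h, f, -v
    "m":  [2/3, 1/3, 1/3],   # P(h|m), P(f|m), P(-v|m)   (assumed from Table 6.1)
    "-m": [5/7, 3/7, 3/7],   # P(h|-m), P(f|-m), P(-v|-m)
}

# Unnormalized score per level: P(t=l) * prod_i P(q[i] | t=l)
scores = {}
for level, prior in priors.items():
    score = prior
    for p in likelihoods[level]:
        score *= p
    scores[level] = score

# Normalize over the target levels (the denominator of Bayes' theorem)
total = sum(scores.values())
posteriors = {level: s / total for level, s in scores.items()}
print(posteriors)                   # {'m': 0.1948..., '-m': 0.8051...}

# MAP prediction: the level with the highest score (normalization not needed)
print(max(scores, key=scores.get))  # -m
```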
Conditional Independence and Factorization
Naïve Bayes Model
MAP (maximum a posteriori) prediction:
M(q) = arg max_{l ∈ levels(t)} (Π_i P(q[i] | t=l)) × P(t=l)
Example
Table 6.2
Extensions and Variations: Smoothing
Example
Table 6.2
CREDIT HISTORY = paid, GUARANTOR/COAPPLICANT = guarantor, ACCOMMODATION = free?
Laplace smoothing:
P(f = l | t) = (count(f = l | t) + k) / (count(f | t) + k × |Domain(f)|)
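A small sketch of this smoothed estimate; the counts and k = 3 are illustrative placeholders, not values from Table 6.2:

```python
# Laplace smoothing:
# P(f = l | t) = (count(f=l|t) + k) / (count(f|t) + k * |Domain(f)|)
def smoothed_conditional(count_level: int, count_total: int,
                         domain_size: int, k: float = 3.0) -> float:
    """Laplace-smoothed estimate of P(f = l | t)."""
    return (count_level + k) / (count_total + k * domain_size)

# A level never seen with this target still gets a non-zero probability.
print(smoothed_conditional(count_level=0, count_total=9, domain_size=3))  # 0.167
print(smoothed_conditional(count_level=6, count_total=9, domain_size=3))  # 0.5
```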
Extensions and Variations: Continuous Features - Binning
Transform a continuous feature into a categorical feature with:
 Equal-width binning
 Equal-frequency binning
Example
Table 6.11
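A sketch of both binning strategies in plain Python; the data values and bin count are made up for illustration:

```python
# Discretize a continuous feature so naive Bayes can treat it as categorical.
values = [3.2, 4.1, 5.0, 5.5, 7.3, 8.8, 9.1, 12.0, 15.5]
n_bins = 3

# Equal-width binning: split the range [min, max] into bins of equal width.
lo, hi = min(values), max(values)
width = (hi - lo) / n_bins
equal_width = [min(int((v - lo) / width), n_bins - 1) for v in values]

# Equal-frequency binning: sort, then put the same number of values in each bin.
order = sorted(range(len(values)), key=lambda i: values[i])
equal_freq = [0] * len(values)
per_bin = len(values) / n_bins
for rank, i in enumerate(order):
    equal_freq[i] = min(int(rank / per_bin), n_bins - 1)

print(equal_width)  # [0, 0, 0, 0, 1, 1, 1, 2, 2]
print(equal_freq)   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```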
A Bayesian network (Bayes network, Bayes(ian) model, or probabilistic directed acyclic graphical model) is a probabilistic graphical model that represents a structural relationship (a set of random variables and their conditional dependencies) via a directed acyclic graph (DAG).
P(A, B) = P(B|A) × P(A)
Ex1:
P(a, -b) = P(-b|a) × P(a) = 0.7 × 0.4 = 0.28
The probability of an event x1,…,xn:
P(x1,…,xn) = Π_i P(xi | Parents(xi))
Bayesian Network
Network: A → B

P(A=T) | P(A=F)
0.4    | 0.6

A | P(B=T | A) | P(B=F | A)
T | 0.3        | 0.7
F | 0.4        | 0.6
Ex2:
P(a, -b, -c, d) = P(-b | a, -c) × P(-c | d) × P(a) × P(d)
= 0.5 × 0.8 × 0.4 × 0.4
= 0.064
Bayesian Network
Network: D → C, A → B, C → B
P(D=T) = 0.4
P(A=T) = 0.4

D | P(C=T | D)
T | 0.2
F | 0.5

A C | P(B=T | A, C)
T T | 0.2
T F | 0.5
F T | 0.4
F F | 0.3
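A sketch that encodes this four-node network with the CPTs above and evaluates the factorized joint:

```python
# Joint probability from the DAG factorization:
# P(A, B, C, D) = P(A) * P(D) * P(C|D) * P(B|A, C)
p_a_true = 0.4
p_d_true = 0.4
p_c_true_given_d = {True: 0.2, False: 0.5}   # P(C=T | D)
p_b_true_given_ac = {(True, True): 0.2, (True, False): 0.5,
                     (False, True): 0.4, (False, False): 0.3}  # P(B=T | A, C)

def bernoulli(p_true: float, value: bool) -> float:
    """P(X = value) for a binary variable with P(X=T) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(a: bool, b: bool, c: bool, d: bool) -> float:
    return (bernoulli(p_a_true, a)
            * bernoulli(p_d_true, d)
            * bernoulli(p_c_true_given_d[d], c)
            * bernoulli(p_b_true_given_ac[(a, c)], b))

# Ex2: P(a, -b, -c, d) = 0.5 * 0.8 * 0.4 * 0.4
print(joint(a=True, b=False, c=False, d=True))  # 0.064
```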
The conditional probability of node xi, given the other nodes of an n-node network, is proportional to:
P(xi | x1,…, xi-1, xi+1,…, xn) ∝ P(xi | Parents(xi)) × Π_j P(xj | Parents(xj)), with j ∈ Children(xi)
(normalize the product over the levels of xi to obtain a probability)
Ex2:
P(c | -a, b, d) ∝ P(c | d) × P(b | -a, c) = 0.2 × 0.4 = 0.08
P(-c | -a, b, d) ∝ P(-c | d) × P(b | -a, -c) = 0.8 × 0.3 = 0.24
=> P(c | -a, b, d) = 0.08 / (0.08 + 0.24) = 0.25
Bayesian Network
(Network and CPTs as on the previous slide.)
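A sketch of the same conditional computation, with only the two CPT entries it needs (C's parent D and child B):

```python
# P(c | -a, b, d): only C's parent (D) and child (B) contribute.
p_c_true_given_d_true = 0.2             # P(C=T | D=T)
p_b_true_given = {(False, True): 0.4,   # P(B=T | A=F, C=T)
                  (False, False): 0.3}  # P(B=T | A=F, C=F)

# Unnormalized scores for C=T and C=F given A=F, B=T, D=T
score_c_true = p_c_true_given_d_true * p_b_true_given[(False, True)]          # 0.08
score_c_false = (1 - p_c_true_given_d_true) * p_b_true_given[(False, False)]  # 0.24

# Normalize over the two levels of C
p_c = score_c_true / (score_c_true + score_c_false)
print(p_c)  # 0.25
```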
Naïve Bayes Classifier
A naïve Bayes classifier is a Bayesian network in which the target is the sole parent of every descriptive feature:
P(t | d[1], …, d[n]) ∝ P(t) × Π_j P(d[j] | t)
(Network: Target → Descriptive feature 1, Descriptive feature 2, …, Descriptive feature n)
P(A, B, C) = P(C | A, B) × P(B | A) × P(A)    or    P(A, B, C) = P(A | C, B) × P(B | C) × P(C)
Building Bayesian Network
The two factorizations correspond to two different network structures:

Network 1: A → B, A → C, B → C
P(A=T) = 0.6
A | P(B=T | A)
T | 0.333
F | 0.5
A B | P(C=T | A, B)
T T | 0.25
T F | 0.125
F T | 0.25
F F | 0.25

Network 2: C → B, C → A, B → A
P(C=T) = 0.2
C | P(B=T | C)
T | 0.5
F | 0.375
C B | P(A=T | B, C)
T T | 0.5
T F | 0.5
F T | 0.5
F F | 0.7
Building Bayesian Network
Hybrid approach:
 1. Given the topology of the network
 2. Induce the CPTs (conditional probability tables) from the data
What is the best topology to give the algorithm as input?
 Causal graphs
Example
Table 6.18
Building Bayesian Network
A potential causal theory:
The more equal a society, the higher the investment that society will make in health and education, and this in turn results in lower corruption.
Network: GC → SY, GC → LE, SY → CPI, LE → CPI
P(GC=L) = 0.5

GC | P(SY=L | GC)
L | 0.2
H | 0.8

GC | P(LE=L | GC)
L | 0.2
H | 0.8

SY LE | P(CPI | SY, LE)
L L | 1.0
L H | 0
H L | 1.0
H H | 1.0
Using a Bayesian Network to Make Predictions
M(q) = arg max_{l ∈ levels(t)} BayesianNetwork(t=l, q)
Making a prediction with missing descriptive feature values:
GC = high, SY = high => ?
Example
Table 6.18
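A sketch of prediction with a missing value: the unobserved LE node is summed out of the factorized joint. The CPT numbers are from the previous slide, but reading its last table as P(CPI=L | SY, LE) and the edge structure GC → SY, GC → LE, (SY, LE) → CPI are assumptions; P(GC) is constant given the evidence and cancels in the normalization:

```python
# Predict CPI given GC=H and SY=H, with LE unobserved:
# P(CPI=l | GC, SY) ∝ sum over LE of P(SY|GC) * P(LE|GC) * P(CPI=l | SY, LE)
p_sy_low_given_gc = {"L": 0.2, "H": 0.8}   # P(SY=L | GC)
p_le_low_given_gc = {"L": 0.2, "H": 0.8}   # P(LE=L | GC)
p_cpi_low_given = {("L", "L"): 1.0, ("L", "H"): 0.0,   # assumed P(CPI=L | SY, LE)
                   ("H", "L"): 1.0, ("H", "H"): 1.0}

def prob(p_low: float, level: str) -> float:
    """P(X = level) for a binary L/H variable with P(X=L) = p_low."""
    return p_low if level == "L" else 1.0 - p_low

gc, sy = "H", "H"                          # observed evidence
scores = {}
for cpi in ("L", "H"):
    total = 0.0
    for le in ("L", "H"):                  # marginalize the missing LE
        total += (prob(p_sy_low_given_gc[gc], sy)
                  * prob(p_le_low_given_gc[gc], le)
                  * prob(p_cpi_low_given[(sy, le)], cpi))
    scores[cpi] = total

norm = sum(scores.values())
print({k: v / norm for k, v in scores.items()})  # posterior over CPI levels
print(max(scores, key=scores.get))               # MAP prediction under these assumptions
```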