3 Decision Trees
Goal: predict whether Customer C will buy product P.

[Figure: three example trees]
• Categorical split: root asks “C = female?”; yes → YES, no → NO.
• Numeric split: root asks “$$$ C spent before”; > 50 → YES, <= 50 → NO.
• Combined tree: root “C = female?”; one branch splits on the categorical feature “C bought P before” (yes/no), the other on the numeric feature “$$$ C spent before” (> 50 / <= 50), each ending in a YES or NO leaf.
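Read as code, a decision tree is just nested if/else tests. Here is a minimal sketch of the combined tree above; the feature names `is_female`, `bought_before`, and `spent_before` are hypothetical, invented for illustration:

```python
def will_buy(is_female: bool, bought_before: bool, spent_before: float) -> bool:
    # Root: categorical test "C = female?"
    if is_female:
        # Categorical split: "C bought P before?" -> YES / NO leaves
        return bought_before
    # Numeric split: "$$$ C spent before" -> YES if > 50, else NO
    return spent_before > 50

print(will_buy(is_female=False, bought_before=False, spent_before=80))  # True (YES leaf)
```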
Tree anatomy: the top node is the root; nodes below it that split further are internal nodes; the terminal nodes are the leaves.
Decision Tree
features = attributes = columns = fields
• Regression Tree: Y = numeric (predicts a real number)
• Classification Tree: Y = categorical (predicts a class)
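If scikit-learn is available, the two tree types map onto `DecisionTreeClassifier` and `DecisionTreeRegressor`; a minimal sketch with made-up customer data:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy features per customer: [is_female, $ spent before] (made up for illustration)
X = [[1, 80], [0, 20], [1, 10], [0, 60]]

# Classification tree: Y is categorical ("will C buy P?")
clf = DecisionTreeClassifier().fit(X, ["yes", "no", "no", "yes"])
print(clf.predict([[1, 70]]))   # -> a class label

# Regression tree: Y is numeric (e.g., how much C will spend)
reg = DecisionTreeRegressor().fit(X, [55.0, 12.0, 8.0, 40.0])
print(reg.predict([[1, 70]]))   # -> a real number
```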
“Root” Node: which feature should sit at the top of the tree? Candidates: F (C = female), BBefore (C bought P before), $Spent ($$$ C spent before).
Gini impurity of a pure node (p = 1):
G = 1 − p² − (1 − p)²
  = 1 − 1² − 0²
  = 0
Equivalently, expanding the squares gives G = 2(p − p²) = 2(1 − 1²) = 0.
G = .5 is the maximal impurity, reached at a 50/50 split:
p = 0.5      # prob of getting KitKat
1 − p = 0.5  # prob of getting Reese’s
G = 1 − p² − (1 − p)²
  = 1 − (.5)² − (.5)²
  = 1 − .25 − .25
  = .5
G = 2(p − p²) = 2(.5 − .25) = .5
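Both algebraic forms can be checked numerically; a minimal sketch:

```python
def gini(p: float) -> float:
    """Gini impurity of a two-class node; p = probability of one class."""
    return 1 - p**2 - (1 - p)**2

def gini_expanded(p: float) -> float:
    """The algebraically identical form 2(p - p^2)."""
    return 2 * (p - p**2)

for p in (1.0, 0.5):
    print(p, gini(p), gini_expanded(p))  # 1.0 -> 0 (pure); 0.5 -> .5 (maximal)
```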
Entropy (D)
• Measures uncertainty/chaos
• D = −sum(p * log₂(p)), summed over the classes
• Entropy = 1 if the set is 50/50 (Gini = .5)
• Entropy = 0 if the set is pure (Gini = 0)
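A minimal sketch of two-class entropy, matching the bullet values above:

```python
import math

def entropy(p: float) -> float:
    """Two-class entropy D = -sum(p * log2(p)), with 0*log2(0) taken as 0."""
    total = 0.0
    for q in (p, 1 - p):
        if q > 0:
            total -= q * math.log2(q)
    return total

print(entropy(0.5))  # 1.0 -- 50/50 set (Gini would be .5)
print(entropy(1.0))  # 0.0 -- pure set  (Gini would be 0)
```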
Gini Impurity of a Node
= 1 − (prob of getting Yes)² − (prob of getting No)²
= 1 − (Y / (Y + N))² − (N / (Y + N))²

Splitting on F sends 4Y / 1N down the yes branch and 4Y / 3N down the no branch.
Yes branch: G = 1 − (4/5)² − (1/5)² = 1 − (.8)² − (.2)² = .32

[Figure: the other candidate splits. BBefore: yes → 3Y / 0N (YES), no → 1Y / 2N (NO). $Spent: > 50 → 3Y / 0N (YES), <= 50 → 1Y / 3N (NO). A 3Y / 0N leaf is a pure node.]
No branch: G = 1 − (4/7)² − (3/7)² = 1 − (.57)² − (.43)² = .49
So the F split scores Gini = .32 on its yes branch and Gini = .49 on its no branch.
Gender split counts: F → 2 No / 4 Yes; M → 3 No / 4 Yes.
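The per-branch values above, and a single score for the whole Gender split, can be checked with a few lines. A sketch; weighting each branch's Gini by its share of samples is the standard CART scoring, stated here as an assumption since the slides only list the counts:

```python
def gini_from_counts(yes: int, no: int) -> float:
    """Gini impurity of a branch holding `yes` Y's and `no` N's."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def weighted_gini(branches):
    """Average the branch impurities, weighting each by its share of samples."""
    total = sum(y + n for y, n in branches)
    return sum((y + n) / total * gini_from_counts(y, n) for y, n in branches)

print(round(gini_from_counts(4, 1), 2))           # 0.32 -- the yes branch above
print(round(gini_from_counts(4, 3), 2))           # 0.49 -- the no branch above
print(round(weighted_gini([(4, 2), (4, 3)]), 3))  # ~0.469 -- Gender: F 4Y/2N, M 4Y/3N
```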
Numeric feature: $$$ C spent before. Where should the cutoff go: > or <= 40, 50, 60, 70, 100?
Step 1: Sort $Spent from smallest to largest.
Step 2: Calculate the means of adjacent values.
Step 3: For each adjacent mean, calculate the Gini impurity of the two sides (< mean and > mean) and combine them, weighted by size.
Step 4: Pick the threshold with the smallest Gini impurity (a sketch of this search follows below).
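A sketch of Steps 1–4 on hypothetical data; the `spent` and `bought` values are invented for illustration:

```python
def gini_from_counts(yes: int, no: int) -> float:
    total = yes + no
    if total == 0:
        return 0.0
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def best_threshold(spent, bought):
    """Steps 1-4: sort the values, take adjacent means, score each
    candidate split by size-weighted Gini, keep the smallest."""
    values = sorted(set(spent))                  # Step 1
    best = (float("inf"), None)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2                        # Step 2: adjacent mean
        left = [b for s, b in zip(spent, bought) if s <= t]
        right = [b for s, b in zip(spent, bought) if s > t]
        n = len(spent)                           # Step 3: weighted Gini of the split
        g = (len(left) / n) * gini_from_counts(sum(left), len(left) - sum(left)) \
            + (len(right) / n) * gini_from_counts(sum(right), len(right) - sum(right))
        best = min(best, (g, t))                 # Step 4: smallest impurity wins
    return best

# Hypothetical data: $Spent per customer and whether they bought (1 = yes)
spent = [10, 25, 40, 55, 70, 90]
bought = [0, 0, 0, 1, 1, 1]
print(best_threshold(spent, bought))  # -> (0.0, 47.5): a perfect split
```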