3 Decision Trees

The document discusses Decision Trees, specifically Classification and Regression Trees (CART), highlighting their structure, benefits, and the process of building them. It explains how to measure impurity using Gini Impurity and Entropy, and outlines the steps for selecting features and splitting nodes to improve prediction accuracy. Additionally, it provides insights into using various libraries and commands for implementing tree models in data analysis.


Decision Tree

Classification and Regression Trees


CART
Problem
• Decision boundaries are non-linear
• Need a model that's easy to use without math
Predict if Customer C will buy product P (categorical feature)

Split on "Is C female?"
  yes → predict YES
  no  → predict NO
Predict if Customer C will buy product P (numeric feature)

Split on "$$$ C spent before"
  > 50  → predict YES
  <= 50 → predict NO
Predict if Customer C will buy product P (combined tree)

Root (categorical): "Is C female?"
  yes → internal node (categorical): "C bought P before?"
          yes → YES    no → NO
  no  → internal node (numeric): "$$$ C spent before"
          > 50 → YES   <= 50 → NO

Node types: the root, internal nodes, and terminal nodes (leaves)
Decision Tree
features = attributes = columns = fields

• Shows how individual features affect the chance of an outcome

• Divide and conquer:

1. Split the data into subsets

2. Check whether each subset is pure (all yes, or all no)

→ Yes: stop
→ No: repeat on that subset
Benefits of Tree models
• Easy to understand
• Read rules off the tree
• Looks just like a BPMN flowchart
• Concise description of what makes an item positive
• No black box
• Rules are interpretable by human users
Tree Types
• Classification Tree
  • Y = categorical (predicts a discrete class)

• Regression Tree
  • Y = numeric (predicts a real number)
Classification tree()

"Root" node: "Is C female?"
  yes → internal node: "C bought P before?"
          yes → "leaf" node: YES    no → "leaf" node: NO
  no  → internal node: "$$$ C spent before"
          > 50 → "leaf" node: YES   <= 50 → "leaf" node: NO
Dataset: 3 X features, 1 Y response

Gender  BBefore  $Spent  WillBuy
F       Y        50      NO
F       Y        60      NO
F       Y        70      YES
F       N        50      YES
F       N        40      YES
F       N        30      YES
M       N        55      YES
M       N        60      YES
M       N        70      YES
M       N        40      YES
M       Y        10      NO
M       Y        30      NO
M       Y        20      NO

Tree fitted to this data:
Root: "Is Gender = F?"
  yes (4Y / 2N) → split on BBefore:
      yes  → 1Y / 2N, Ŷ = NO     no    → 3Y / 0N, Ŷ = YES
  no  (4Y / 3N) → split on $Spent:
      > 50 → 3Y / 0N, Ŷ = YES    <= 50 → 1Y / 3N, Ŷ = NO
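The same data set can be entered in R for the later examples (a minimal sketch; the data-frame name buy and the column name Spent are illustrative, not from the slides):

# Example data set from the slides; buy and Spent are illustrative names
buy <- data.frame(
  Gender  = c("F","F","F","F","F","F","M","M","M","M","M","M","M"),
  BBefore = c("Y","Y","Y","N","N","N","N","N","N","N","Y","Y","Y"),
  Spent   = c(50, 60, 70, 50, 40, 30, 55, 60, 70, 40, 10, 30, 20),
  WillBuy = c("NO","NO","YES","YES","YES","YES","YES","YES","YES","YES","NO","NO","NO"),
  stringsAsFactors = TRUE
)
table(buy$Gender, buy$WillBuy)   # F: 2 NO / 4 YES,  M: 3 NO / 4 YES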
Q: Which feature should be the root, which feature goes next, etc.?

Other candidate trees over the same data:
  Root = BBefore:  yes  → split on "Is C female?"    no    → split on $Spent
  Root = $Spent:   > 50 → split on BBefore           <= 50 → split on "Is C female?"
Each leaf then predicts YES or NO.

Picking a root
• How do you pick the best root feature?
• You can pick any feature to start
• Different features result in different partitions of the data
• We want a feature whose splits are heavily biased towards YES or NO
• Even splits (very impure) are NOT very useful
• Pure splits are the BEST!
Counting outcomes for each candidate split (from the dataset above):

Gender:   F    → 2 No / 4 Yes      M     → 3 No / 4 Yes
BBefore:  Y    → 5 No / 1 Yes      N     → 0 No / 7 Yes   ← pure node
$Spent:   > 50 → 1 No / 4 Yes      <= 50 → 4 No / 4 Yes

All other nodes are impure.
Measuring purity
• We want to measure “purity” of the split
• More certain about YES/NO after the split
• Pure set (4 YES / 0 NO) ➔ completely certain 100%
• Impure (3 YES / 3 NO) ➔ Completely uncertain 50%
• We cannot simply use P("yes" | set): the measure must be symmetric,
  so (4 YES / 0 NO) is exactly as pure as (0 YES / 4 NO)
Measure Impurity
Two Options:
• Gini Impurity – without log
• Entropy - with log
Gini Impurity G
• Measures income inequality in economics
• Comparable to the binomial variance n·p·(1−p) = n·(p − p²)
• Developed by Italian statistician Corrado Gini
• Often used to measure diversity in OB research
• .5 = maximal inequality
• 0 = maximal purity

With P(Yes) = p and P(No) = 1 − p:

G = 1 − (prob of getting Yes)² − (prob of getting No)²
  = 1 − p² − (1 − p)²
  = 1 − p² − (1 − 2p + p²)
  = 2(p − p²)                     (textbook form of G)


0 = maximal purity

p = 1       # prob of getting KitKat
1 − p = 0   # prob of getting Reese's

G = 1 − p² − (1 − p)² = 1 − 1² − 0² = 0
G = 2(p − p²)         = 2(1 − 1²)  = 0

.5 = maximal impurity

p = 0.5       # prob of getting KitKat
1 − p = 0.5   # prob of getting Reese's

G = 1 − p² − (1 − p)² = 1 − .25 − .25 = .5
G = 2(p − p²)         = 2(.5 − .25)   = .5
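The same arithmetic as a small R helper (the function name gini_node is illustrative):

# Gini impurity of a node, from its YES/NO counts
gini_node <- function(n_yes, n_no) {
  p <- n_yes / (n_yes + n_no)     # prob of getting YES in this node
  1 - p^2 - (1 - p)^2             # equivalently 2 * (p - p^2)
}

gini_node(4, 0)   # pure node  -> 0
gini_node(3, 3)   # 50/50 node -> 0.5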
Entropy D
• Measures uncertainty/chaos
• D = −Σ p·log₂(p), summed over the classes in the set
• Entropy = 1 if set is 50/50 (Gini = .5)
• Entropy = 0 if set is pure (Gini = 0)
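A matching R sketch for entropy (the helper name entropy_node is illustrative):

# Entropy of a node, from its YES/NO counts
entropy_node <- function(n_yes, n_no) {
  p <- c(n_yes, n_no) / (n_yes + n_no)
  p <- p[p > 0]                   # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy_node(3, 3)   # 50/50 node -> 1
entropy_node(4, 0)   # pure node  -> 0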
Gini Impurity of a Node
= 1 − (prob of getting Yes)² − (prob of getting No)²

Gender root, yes branch ("Is C female?" = yes): 4Y / 2N

G = 1 − (Y/(Y+N))² − (N/(Y+N))²
  = 1 − (4/6)² − (2/6)²
  = 1 − (.67)² − (.33)² ≈ .44

Gender root, no branch ("Is C female?" = no): 4Y / 3N

G = 1 − (4/7)² − (3/7)²
  = 1 − (.57)² − (.43)² ≈ .49

(The 3Y / 0N leaves further down the tree are pure nodes, G = 0.)
Gini Impurity of a Root
= weighted average of the child-node Gini impurities

Root: Gender ("Is C female?")
  yes branch: 4Y / 2N, G ≈ .44      no branch: 4Y / 3N, G ≈ .49

G_root = N_yes/(N_yes + N_no) · G_yes + N_no/(N_yes + N_no) · G_no
       = (6/13)(.44) + (7/13)(.49)
       = .21 + .26 ≈ .47
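The same weighted average in R, reusing the illustrative buy data frame and gini_node() helper defined earlier:

# Weighted Gini impurity of the Gender split
yes_branch <- subset(buy, Gender == "F")    # 4 YES / 2 NO
no_branch  <- subset(buy, Gender == "M")    # 4 YES / 3 NO

g_yes <- gini_node(sum(yes_branch$WillBuy == "YES"), sum(yes_branch$WillBuy == "NO"))
g_no  <- gini_node(sum(no_branch$WillBuy == "YES"),  sum(no_branch$WillBuy == "NO"))

w_yes <- nrow(yes_branch) / nrow(buy)
w_no  <- nrow(no_branch)  / nrow(buy)

w_yes * g_yes + w_no * g_no                 # ~ .47 for the Gender root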
What is the Gini impurity of the other two candidate roots?

  Root = BBefore:  yes  → split on "Is C female?"    no    → split on $Spent
  Root = $Spent:   > 50 → split on BBefore           <= 50 → split on "Is C female?"

Root: BBefore   (Y → 5 No / 1 Yes,   N → 0 No / 7 Yes)

  yes branch: 1Y / 5N, G = 1 − (1/6)² − (5/6)² ≈ .28
  no  branch: 7Y / 0N, G = 0   (pure node)

G_root = (6/13)(.28) + (7/13)(0) ≈ .13
Root: $Spent   (> 50 → 1 No / 4 Yes,   <= 50 → 4 No / 4 Yes)

  > 50  branch: 4Y / 1N, G = 1 − (4/5)² − (1/5)² = .32
  <= 50 branch: 4Y / 4N, G = .5

G_root = (5/13)(.32) + (8/13)(.5) ≈ .43
The feature with the lowest Gini impurity score wins the root-node spot:

  Gender:   Gini ≈ .47
  BBefore:  Gini ≈ .13   ← winner! BBefore becomes the root
  $Spent:   Gini ≈ .43

(The BBefore = N branch is already a pure node.)

Information Gain = Gini(parent) − Gini(weighted over children)
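A sketch that scores all three candidate roots, and the information gain of the winner, reusing the illustrative buy data and gini_node() helper:

# Weighted Gini of a split defined by a grouping vector (illustrative helper)
split_gini <- function(groups, y) {
  g <- tapply(y, groups, function(yy) gini_node(sum(yy == "YES"), sum(yy == "NO")))
  w <- table(groups) / length(y)
  sum(w * g)
}

split_gini(buy$Gender,     buy$WillBuy)   # ~ .47
split_gini(buy$BBefore,    buy$WillBuy)   # ~ .13  -> lowest, wins the root
split_gini(buy$Spent > 50, buy$WillBuy)   # ~ .43

# Information gain of the winning split
parent <- gini_node(sum(buy$WillBuy == "YES"), sum(buy$WillBuy == "NO"))
parent - split_gini(buy$BBefore, buy$WillBuy)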
Steps to build a tree
1. Calculate all possible Gini impurity scores
2. If the node itself has the lowest score → make it a leaf node
3. If impurity can be reduced by splitting → split on the feature that
   maximizes information gain (the improvement in purity)

Parents want their children to be more “pure” than them!


How to best split numeric data

For a numeric feature such as "$$$ C spent before", which cut-off should we
test: > 40? > 50? > 60? > 70? > 100? (and the matching <= thresholds)

Step 1: Sort $Spent from smallest to largest
Step 2: Calculate the means of adjacent values (the candidate thresholds)
Step 3: For each adjacent mean, calculate the weighted Gini impurity of the
        split "<= mean" vs "> mean"
Step 4: The threshold with the smallest Gini impurity becomes the splitting
        criterion for $Spent
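The same four steps in R for the Spent column, reusing the illustrative buy data and the split_gini() helper above:

# Candidate thresholds = means of adjacent sorted values
spent_sorted <- sort(unique(buy$Spent))
thresholds   <- (head(spent_sorted, -1) + tail(spent_sorted, -1)) / 2

# Weighted Gini impurity of "<= threshold" vs "> threshold" for each candidate
scores <- sapply(thresholds, function(t) split_gini(buy$Spent <= t, buy$WillBuy))
data.frame(threshold = thresholds, gini = round(scores, 3))

thresholds[which.min(scores)]    # threshold with the smallest weighted Gini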


Other differences
• tree() can use a feature multiple times in the same model
• lm() predicts an individual Ŷ for each observation
• tree() predicts one Ŷ for each region (leaf)
• tree() works best when decision boundaries are non-overlapping
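A minimal sketch fitting tree() to the example data (assumes the tree package is installed; buy is the illustrative data frame built earlier, and on such a tiny data set the defaults in tree.control() will keep the tree very shallow):

library(tree)

fit <- tree(WillBuy ~ Gender + BBefore + Spent, data = buy)
summary(fit)                                   # splits chosen, misclassification rate

plot(fit)
text(fit, pretty = 0)                          # draw the tree with split labels

predict(fit, newdata = buy, type = "class")    # one predicted class per region/row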
Method: Classification tree   (library: tree)
  tree(Y ~ X, data)
  predict(model, testset, type = "class")     # type = "class" returns the actual class prediction
  cv.tree(model, FUN = prune.misclass)        # FUN = prune.misclass uses the classification error
                                              # rate, rather than deviance, as the CV criterion
  Plot: plot(model); text(model, pretty = 0)

Method: Regression tree   (library: tree)
  tree(formula = Y ~ X, data, subset = train)

Method: Pruning   (library: tree)
  prune.misclass(model, best = 9)             # prune the model to a 9-node tree
  predict(pruned, testset, type = "class")    # make predictions using the pruned tree

Method: Random forest / bagging   (library: randomForest)
  randomForest(Y ~ X, data, subset, mtry = 13, importance = TRUE, ntree = 25)
  # randomForest() by default uses mtry = p/3 for regression and mtry = sqrt(p) for classification
  # mtry = 13: all 13 (m) of the 13 (p) Xs are considered at each split = bagging
  # ntree = number of trees grown
  importance(model)                           # see the importance of each variable
  varImpPlot(model)                           # plot the importance measures
  Plot: plot(model)

Method: Boosting   (library: gbm)
  gbm(Y ~ X, data, distribution = "gaussian", n.trees = 5000,
      interaction.depth = 4, shrinkage = .2, verbose = FALSE)
  # distribution = "gaussian" for regression; "bernoulli" for classification
  predict(model, newdata, n.trees = 5000)
  Plot: plot(model)
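An end-to-end sketch of the classification-tree workflow from the table, using R's built-in iris data purely for illustration (the slide data set is too small for cross-validation):

library(tree)

set.seed(1)
train  <- sample(nrow(iris), 100)                  # training rows
fit    <- tree(Species ~ ., data = iris, subset = train)

cv     <- cv.tree(fit, FUN = prune.misclass)       # CV on the classification error rate
best   <- cv$size[which.min(cv$dev)]               # tree size with the lowest CV error
pruned <- prune.misclass(fit, best = best)

plot(pruned)
text(pruned, pretty = 0)

pred <- predict(pruned, newdata = iris[-train, ], type = "class")
table(pred, iris$Species[-train])                  # confusion matrix on the held-out rows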
