Decision Tree Slides
Outline
• Recap: Last week: Data and Data Exploration
• Classification Concept Review
• Classification Technique: Decision Tree
Topic 3: Data Exploration Techniques
• Summary statistics
  – Location, Spread, Frequency, Mode, Quartiles, Percentiles, etc.
• Visualization
  – Histogram, Box plots, Scatter plots, Star plots, etc.
Question?
• Dimensionality Reduction
• Purpose: avoid the Curse of Dimensionality
• Filter out redundant and irrelevant features
• Feature Extraction and Feature Selection (e.g., PCA, SVD; see the sketch below)
• Correlation
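For a concrete illustration of feature extraction, here is a minimal sketch using scikit-learn's PCA (scikit-learn and NumPy are assumed to be available; the 5-feature data matrix is synthetic and not from the slides):

```python
# Hedged sketch: PCA-based feature extraction with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))            # 100 records, 5 original features
X[:, 1] = 2 * X[:, 0] + 0.01 * X[:, 1]   # make feature 1 nearly redundant with feature 0

pca = PCA(n_components=2)                # keep the 2 directions with the most variance
X_reduced = pca.fit_transform(X)         # project each record onto those directions

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # fraction of variance captured by each component
```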
Curse of Dimensionality
• Exponential growth of the dimension space causes high sparsity in the space for the same dataset!
• This sparsity is problematic for any method that requires statistical significance
[Figure: the same data points spread over spaces of increasing dimension; panel (d): 4D – 256 regions]
• For the same size of dataset:
  – Low-dimensional space: each region contains some data points
  – Higher-dimensional space: some regions contain no data points
  – Very high-dimensional space: most regions contain no data points
• With no data points in most regions, the problem becomes hard to solve!
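A small sketch of this effect (plain Python with NumPy; the 1,000 records and 4 cells per axis are illustrative choices, not figures from the slides): with a fixed dataset, the fraction of occupied grid cells collapses as dimensions are added.

```python
# Hedged sketch: with a fixed number of records, the fraction of grid
# cells that contain at least one point shrinks as dimensions are added.
import numpy as np

rng = np.random.default_rng(0)
n_records, cells_per_axis = 1000, 4

for d in (1, 2, 3, 4, 8):
    points = rng.random((n_records, d))                        # uniform points in the unit hypercube
    cell_ids = np.floor(points * cells_per_axis).astype(int)   # grid cell index of each point per axis
    occupied = len({tuple(row) for row in cell_ids})           # distinct cells that contain data
    total = cells_per_axis ** d                                 # total cell count grows exponentially with d
    print(f"{d}D: {occupied}/{total} cells occupied ({occupied / total:.1%})")
```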
Dimension Reduction
[Figure: variance (%) along original variable/dimension A — 85% vs. 15% (Case 1 and Case 2)]
Outline
• Recap: Last week: Data and Data Exploration
• Classification Concept Review
• Classification Technique: Decision Tree
Classification: Definition
• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a
class as accurately as possible.
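A minimal sketch of this setup (scikit-learn assumed; the toy Refund / Married / Taxable Income records are hypothetical): fit a model that predicts the class attribute from the other attributes, then apply it to a previously unseen record.

```python
# Hedged sketch: learn the class attribute as a function of the other attributes.
from sklearn.tree import DecisionTreeClassifier

# training records: [Refund (1=Yes, 0=No), Married (1=Yes, 0=No), Taxable Income in K]
X_train = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 0, 95]]
y_train = ["No", "No", "No", "No", "Yes"]      # class attribute for each record

model = DecisionTreeClassifier().fit(X_train, y_train)   # model: attributes -> class

# previously unseen record: Refund=No, not married, Taxable Income = 90K
print(model.predict([[0, 0, 90]]))
```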
Classification Examples
Outline
• Recap: Last week: Data and Data Exploration
• Classification Concept Review
• Classification Technique: Decision Tree
Source: http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
[Figure: applying the decision tree to a test record — MarSt: Single, Divorced → TaxInc (< 80K → NO, > 80K → YES); Married → NO]
Growing a Tree
1. Features to choose
4. Pruning
– Tree growing is iterative: the splitting step repeats until the stopping condition is met.
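A minimal sketch of these choices with scikit-learn (assumed library; parameter values are illustrative): the impurity criterion drives which features and splits are chosen, while depth, leaf-size, and pruning parameters control how far the tree grows.

```python
# Hedged sketch: growing and pruning a decision tree with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)     # convenient built-in dataset, not the slides' example

tree = DecisionTreeClassifier(
    criterion="gini",        # 1. feature/split choice is driven by an impurity measure ("entropy" also works)
    max_depth=3,             # stop splitting beyond this depth
    min_samples_leaf=5,      # do not create leaves with fewer than 5 records
    ccp_alpha=0.01,          # 4. cost-complexity (post-)pruning strength
)
tree.fit(X, y)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```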
Tree Induction
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
Specifying the attribute test condition:
• Multi-way split on the nominal attribute CarType: Family / Sports / Luxury
• Binary splits on CarType: {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}
• What about this split on the ordinal attribute Size: {Small, Large} vs. {Medium}?
• Binary split on the continuous attribute Taxable Income: > 80K? → Yes / No
• Multi-way split on Taxable Income: discretized into ranges (< 10K, …, > 80K)
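A plain-Python sketch of candidate test conditions (the CarType values and the 80K threshold come from the slides; the list of income values is hypothetical):

```python
# Hedged sketch: enumerate candidate attribute test conditions.

# Candidate binary splits of the nominal attribute CarType = {Family, Sports, Luxury}
values = {"Family", "Sports", "Luxury"}
for left in [{"Family"}, {"Sports"}, {"Luxury"}]:      # 2^(k-1) - 1 = 3 two-way groupings
    print(left, "vs.", values - left)

# Candidate threshold tests for the continuous attribute Taxable Income (values in K, hypothetical)
incomes = sorted([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])
thresholds = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]   # midpoints between consecutive values
print(thresholds)        # a test such as "Taxable Income > 80K?" picks one of these
```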
Tree Induction
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
C0: 5, C1: 5 — Non-homogeneous, high degree of impurity
C0: 9, C1: 1 ✓ — Homogeneous, low degree of impurity
Measures of Node Impurity
• Gini index
• Entropy
• Misclassification error

[Figure: two candidate splits — "Sunny?" (Yes / No) and "Girlfriend busy?" (Yes / No)]
Before splitting: Fish 10, Don't Fish 10
Gini = 1 – [p(Fish)² + p(Don't Fish)²]
     = 1 – [(10/20)² + (10/20)²]
     = 1 – (0.25 + 0.25) = 0.5

Split by "Sunny?"
• Yes branch: Fish 4, Don't Fish 5 → P(Fish) = 4/9, P(Don't Fish) = 5/9
  Gini = 1 – [(4/9)² + (5/9)²] = 0.494

Split by "Girlfriend busy?"
• Yes branch: Fish 7, Don't Fish 0 → P(Fish) = 7/7, P(Don't Fish) = 0/7
  Gini = 1 – [(7/7)² + (0/7)²] = 0.0
• No branch: Fish 3, Don't Fish 10
GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)   (n_i = records at child i, n = records at the parent node)
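A plain-Python sketch of these Gini calculations on the fishing example (class counts taken from the slides; the helper function names gini and gini_split are my own):

```python
# Hedged sketch: Gini index and weighted Gini of a split, using the slides' counts.
def gini(counts):
    """Gini index of a node given its class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: GINI_split = sum_i (n_i / n) * GINI(i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(gini([10, 10]))                  # before splitting: 0.5
print(gini([4, 5]))                    # "Sunny? = Yes" branch: ~0.494
print(gini([7, 0]))                    # "Girlfriend busy? = Yes" branch: 0.0
print(gini_split([[7, 0], [3, 10]]))   # weighted Gini of the "Girlfriend busy?" split
```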
Entropy
Before splitting: Fish 10, Don't Fish 10
Entropy = – [p(Fish) log2(p(Fish)) + p(Don't Fish) log2(p(Don't Fish))]
        = – [(10/20) log2(10/20) + (10/20) log2(10/20)]
        = – [–0.5 + –0.5] = 1

Split by "Sunny?"
• Yes branch: Fish 4, Don't Fish 5 → P(Fish) = 4/9, P(Don't Fish) = 5/9
  Entropy = – [(4/9) log2(4/9) + (5/9) log2(5/9)] = 0.991

Split by "Girlfriend busy?"
• Yes branch: Fish 7, Don't Fish 0 → P(Fish) = 7/7, P(Don't Fish) = 0/7
  Entropy = – [(7/7) log2(7/7) + (0/7) log2(0/7)] = 0   (taking 0 · log2(0) = 0)
• No branch: Fish 3, Don't Fish 10
Information Gain
*Note: "Fish" / "Don't Fish" is the class label.
Before splitting: Fish 10, Don't Fish 10 → Entropy(p) = 1

GAIN_split = Entropy(p) – Σ_{i=1..k} (n_i / n) · Entropy(i)

Split by "Sunny?":           GAIN = 1 – (9/20 × 0.991 + 11/20 × 0.994) = 0.007
Split by "Girlfriend busy?": GAIN = 1 – (7/20 × 0 + 13/20 × 0.779) = 0.494 ✓
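The same fishing example as a plain-Python sketch of entropy and information gain (helper names are my own; the 6/5 class counts for the Sunny "No" branch are inferred from the 11-record branch with entropy 0.994 shown above):

```python
# Hedged sketch: entropy and information gain for the fishing example.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)   # treat 0 * log2(0) as 0

def information_gain(parent, children):
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

print(entropy([10, 10]))                               # before splitting: 1.0
print(information_gain([10, 10], [[4, 5], [6, 5]]))    # split by "Sunny?": ~0.007
print(information_gain([10, 10], [[7, 0], [3, 10]]))   # split by "Girlfriend busy?": ~0.494
```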
Splitting by "Day":
Day:                D1   D2   D3   D4   D5   D6   D7   D8
Fish / Don't Fish:  1/0  0/1  0/1  1/0  1/0  0/1  1/0  0/1
Example: SplitINFO

SplitINFO = – Σ_{i=1..k} (n_i / n) · log2(n_i / n)

Splitting by "Day" (20 records, one per day):
Day:                D1   D2   D3   D4   D5   D6   …   D20
Fish / Don't Fish:  1/0  0/1  0/1  1/0  1/0  0/1  1/0 or 0/1  0/1

SplitINFO(Day) = – Σ_{i=1..20} (1/20) log2(1/20) = 4.3219280949 ≈ 4.322

Splitting by "Girlfriend busy?" (Yes: 7/0, No: 3/10):
SplitINFO(Girlfriend busy) = – [(7/20) log2(7/20) + (13/20) log2(13/20)] = 0.934
Example: GainRATIO

Before splitting: Fish 10, Don't Fish 10
Entropy = – [p(Fish) log2(p(Fish)) + p(Don't Fish) log2(p(Don't Fish))] = – [–0.5 + –0.5] = 1

Splitting by "Day" (20 records, one per day):
Day:                D1   D2   D3   D4   D5   D6   …   D20
Fish / Don't Fish:  1/0  0/1  0/1  1/0  1/0  0/1  1/0 or 0/1  0/1

GAIN_split(Day) = 1                      GAIN_split(Girlfriend busy) = 0.494
GainRATIO(Day) = 1 / 4.322 = 0.231       GainRATIO(Girlfriend busy) = 0.494 / 0.934 = 0.529

Although "Day" has the highest information gain, its gain ratio is lower, so the split by "Girlfriend busy?" is preferred.
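A plain-Python sketch of SplitINFO and gain ratio for the same example (helper names are my own; the 20 "Day" children each hold a single record, as in the slides):

```python
# Hedged sketch: SplitINFO and gain ratio penalise splits with many small partitions.
from math import log2

def split_info(children):
    n = sum(sum(c) for c in children)
    return -sum(sum(c) / n * log2(sum(c) / n) for c in children)

def gain_ratio(gain, children):
    return gain / split_info(children)

day_children = [[1, 0]] * 10 + [[0, 1]] * 10     # 20 "Day" children, one record each
print(split_info(day_children))                  # ~4.322
print(gain_ratio(1.0, day_children))             # ~0.231

gf_children = [[7, 0], [3, 10]]                  # "Girlfriend busy?": Yes and No branches
print(split_info(gf_children))                   # ~0.934
print(gain_ratio(0.494, gf_children))            # ~0.529
```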
Misclassification Error
Before splitting: Fish 10, Don't Fish 10
Error = 1 – max(p(Fish), p(Don't Fish))
      = 1 – max(10/20, 10/20)
      = 1 – 0.5 = 0.5

Split by "Sunny?"
• Yes branch: Fish 4, Don't Fish 5 → P(Fish) = 4/9, P(Don't Fish) = 5/9
  Error = 1 – max(4/9, 5/9) = 0.444

Split by "Girlfriend busy?"
• Yes branch: Fish 7, Don't Fish 0 → P(Fish) = 7/7, P(Don't Fish) = 0/7
  Error = 1 – max(7/7, 0/7) = 0
• No branch: Fish 3, Don't Fish 10
Error_split = Σ_{i=1..k} (n_i / n) · Error(i)
Splitting Error
*Note: "Fish" / "Don't Fish" is the class label.
Before splitting: Fish 10, Don't Fish 10 → Error = 0.5

Error_split(Sunny?) = 9/20 × 0.444 + 11/20 × 0.454 = 0.45
Error_split(Girlfriend busy?) = 7/20 × 0 + 13/20 × 0.231 = 0.15 ✓
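A plain-Python sketch of the misclassification error and the weighted splitting error (counts from the slides; the 6/5 split of the Sunny "No" branch is inferred from its error of 0.454):

```python
# Hedged sketch: misclassification error and weighted error of a split (slides' counts).
def error(counts):
    """Misclassification error of a node: 1 - max_i p(i)."""
    return 1 - max(counts) / sum(counts)

def error_split(children):
    """Weighted error of a split: Error_split = sum_i (n_i / n) * Error(i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * error(c) for c in children)

print(error([10, 10]))                    # before splitting: 0.5
print(error_split([[4, 5], [6, 5]]))      # split by "Sunny?": 0.45
print(error_split([[7, 0], [3, 10]]))     # split by "Girlfriend busy?": 0.15
```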
Split on attribute A?: Parent node (C1: 7, C2: 3) → Yes branch: Node N1, No branch: Node N2

        N1   N2
C1      3    4
C2      0    3

Misclassification error:
• Parent: Error = 0.3
• Error(N1) = 1 – max(3/3, 0/3) = 0
• Error(N2) = 1 – max(4/7, 3/7) = 0.428
• Error(Children) = 3/10 × 0 + 7/10 × 0.428 = 0.3 → no improvement (0.3 → 0.3)

Gini:
• Parent: Gini = 0.42
• Gini(N1) = 1 – (3/3)² – (0/3)² = 0
• Gini(N2) = 1 – (4/7)² – (3/7)² = 0.489
• Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342 → Gini improves (0.42 → 0.342)!

The choice of impurity measure matters: the error rate sees no gain from this split, while Gini does.
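A plain-Python sketch of this comparison (the N1/N2 counts come from the slides; the helper functions are my own) showing that the two measures can disagree about the same split:

```python
# Hedged sketch: parent node (C1: 7, C2: 3) split into N1 (3, 0) and N2 (4, 3).
def error(counts):
    return 1 - max(counts) / sum(counts)

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted(measure, children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)

parent, children = [7, 3], [[3, 0], [4, 3]]            # parent node split into N1 and N2
print(error(parent), weighted(error, children))        # 0.3 -> 0.3: error sees no improvement
print(gini(parent), weighted(gini, children))          # 0.42 -> ~0.342: Gini improves
```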
Tree Induction
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
• Early termination
[Figure: decision tree — Refund? Yes → NO; No → Marital Status: {Single, Divorced} → Taxable Income (< 80K → NO, > 80K → YES); {Married} → NO]

Classification Rules
• (Refund = Yes) ==> No
• (Refund = No, Marital Status = {Single, Divorced}, Taxable Income < 80K) ==> No

Rules generated from decision trees are both mutually exclusive and exhaustive.
The rule set contains as much information as the tree.
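A hedged sketch of rule extraction (scikit-learn assumed; the encoded Refund / Married / Taxable Income records are hypothetical): each root-to-leaf path of the fitted tree corresponds to one rule, printed here with export_text.

```python
# Hedged sketch: extract rule-like paths from a fitted tree with scikit-learn's export_text.
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical encoded records: [Refund (1=Yes), Married (1=Yes), Taxable Income in K]
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120],
     [0, 0, 95], [0, 1, 60], [1, 0, 220], [0, 0, 85]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# each printed root-to-leaf path reads as one mutually exclusive rule
print(export_text(tree, feature_names=["Refund", "Married", "TaxableIncome"]))
```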
Classification Tree (Example)
[Figure: tree with root "x2 > 5?" — NO branch → "x2 > 1?", YES branch → "x1 < 4?"; leaves labelled RED / BLUE, drawn as axis-parallel regions in the (x1, x2) plane]
[Figure: tree with root "X > 2?" — NO branch → "X > 1?", YES branch → "X > 3?"; leaf values 2.00 / 1.50 / 0.50 along x]
Questions?
Thank You