Important For Data Mining
(Figure: a model is learned from a labeled Training Set (Tid, Attrib1, Attrib2, Attrib3, Class) and then applied to a Test Set whose class labels are unknown, e.g. Tid 11: No, Small, 55K, ? and Tid 15: No, Large, 67K, ?)
Examples of Classification Task
Splitting Attributes

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A model that fits this data:
MarSt?
├─ Married → NO
└─ Single, Divorced → Refund?
   ├─ Yes → NO
   └─ No → TaxInc?
      ├─ < 80K → NO
      └─ > 80K → YES

There could be more than one tree that fits the same data! (Another valid tree splits on Refund first; see the sketch below.)
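As a quick illustration of induction on this table, here is a minimal sketch assuming scikit-learn and pandas are available; the one-hot encoding and the random seed are my choices, not from the slides:

```python
# A minimal sketch (assumes scikit-learn): fit a decision tree to the
# 10-record training set above. Encodings are illustrative.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Refund":  ["Yes","No","No","Yes","No","No","Yes","No","No","No"],
    "Marital": ["Single","Married","Single","Married","Divorced",
                "Married","Divorced","Single","Married","Single"],
    "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in K
    "Cheat":   ["No","No","No","No","Yes","No","No","Yes","No","Yes"],
})

X = pd.get_dummies(data[["Refund", "Marital"]]).assign(Income=data["Income"])
y = data["Cheat"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```

Changing the split criterion or the tie-breaking seed can produce a different tree that fits the same data equally well.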
Decision Tree Classification Task
(Figure: the Training Set is fed to an induction algorithm to learn a Decision Tree model, which is then applied to the Test Set with unknown class labels.)
Apply Model to Test Data
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree:
Refund?
├─ Yes → NO
└─ No → MarSt?
   ├─ Married → NO
   └─ Single, Divorced → TaxInc?
      ├─ < 80K → NO
      └─ > 80K → YES
Refund = No takes the right branch to MarSt; Marital Status = Married reaches the NO leaf.
Assign Cheat to “No”
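The traversal above can be written directly as nested conditionals; this is an illustrative hand-coding of the slide's tree, not an induced model:

```python
# Hand-coded version of the tree above (mirrors the slide's traversal).
def classify(refund: str, marital: str, income: float) -> str:
    if refund == "Yes":
        return "No"                         # Refund = Yes -> NO leaf
    if marital == "Married":
        return "No"                         # MarSt = Married -> NO leaf
    return "No" if income < 80 else "Yes"   # TaxInc test (income in K)

print(classify("No", "Married", 80))        # -> "No", as on the slide
```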
Decision Tree Induction
● Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt’s Algorithm
● Let Dt be the set of training records that reach a node t.
● General procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt (e.g. “Cheat” or “Don’t Cheat”).
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (see the sketch below).
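A minimal sketch of this recursion; function and variable names are mine, and the attribute-selection step is a placeholder, since choosing the best test is exactly what the criteria later in this deck address:

```python
# Minimal sketch of Hunt's algorithm (illustrative, not a production
# implementation).
from collections import Counter

def hunts(records, attributes):
    """records: list of (attr_dict, label); returns a nested-dict tree."""
    labels = [y for _, y in records]
    if len(set(labels)) == 1:                 # pure node -> leaf
        return labels[0]
    if not attributes:                        # no tests left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    attr = attributes[0]                      # placeholder selection rule
    tree = {attr: {}}
    for value in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == value]
        tree[attr][value] = hunts(subset, attributes[1:])
    return tree
```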
Tree Induction
● Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion.
● Issues
– Determine how to split the records
◆ How to specify the attribute test condition?
◆ How to determine the best split?
Splitting Based on Ordinal Attributes
● What about this split? Size → {Small, Large} vs. {Medium}
– This grouping violates the order among Small, Medium, and Large.
Splitting Based on Continuous Attributes
● Binary split: Taxable Income > 80K? (Yes / No)
● Multi-way split: Taxable Income? discretized into ranges, e.g. < 10K, …, > 80K
How to Determine the Best Split
● Greedy approach:
– Nodes with homogeneous class distribution
are preferred
– Example: a node with class counts C0: 5, C1: 5 is non-homogeneous (high degree of impurity); a node with C0: 9, C1: 1 is homogeneous (low degree of impurity)
Measures of Node Impurity
● Gini Index
● Entropy
● Misclassification error
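All three measures can be computed from a node's class counts; a small sketch (the helper name is mine) that reproduces the homogeneity comparison above:

```python
# Impurity measures for a node's class-count vector (illustrative helper).
from math import log2

def impurities(counts):
    n = sum(counts)
    p = [c / n for c in counts]
    gini = 1 - sum(pi ** 2 for pi in p)
    entropy = -sum(pi * log2(pi) for pi in p if pi > 0)
    error = 1 - max(p)
    return gini, entropy, error

print(impurities([5, 5]))  # non-homogeneous: (0.5, 1.0, 0.5)
print(impurities([9, 1]))  # homogeneous:     (0.18, 0.469, 0.1)
```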
How to Find the Best Split
Before splitting, the node has class counts C0: N00 and C1: N01 with impurity M0.
Candidate test A? produces two children (Yes / No) with impurities M1 and M2, combined in proportion to child size as M12; candidate test B? likewise yields M3, M4, and the combined M34.
Gain = M0 – M12 vs. M0 – M34: pick the test with the larger gain.
Measure of Impurity: GINI
GINI(t) = 1 - \sum_{j} [p(j \mid t)]^2

where p(j | t) is the relative frequency of class j at node t.
For a continuous attribute, sort the records by value and sweep the candidate split positions, updating the class counts incrementally. For Taxable Income in the training set, the "No" counts (≤ split / > split) and the resulting Gini values are:

No:    0/7   1/6   2/5   3/4   3/4   3/4   3/4   4/3   5/2   6/1   7/0
Gini:  0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The best split (Gini = 0.300) falls between 95K and 100K.
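A sketch of this sweep over the ten Taxable Income values from the training table (helper names are mine); it reproduces the interior Gini values above:

```python
# Scan candidate split positions on a sorted continuous attribute
# (values/labels are the Taxable Income column of the training set).
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]   # sorted, in K
cheat   = ["No","No","No","Yes","Yes","Yes","No","No","No","No"]
n = len(incomes)

for i in range(1, n):                    # candidate split between i-1 and i
    split = (incomes[i - 1] + incomes[i]) / 2
    g = i / n * gini(cheat[:i]) + (n - i) / n * gini(cheat[i:])
    print(f"split at {split:>6}: Gini = {g:.3f}")
# minimum Gini = 0.300 between 95K and 100K; the two trivial end
# positions (no split) both give the parent Gini, 0.420
```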
Alternative Splitting Criteria based on INFO
Entropy(t) = -\sum_{j} p(j \mid t) \log_2 p(j \mid t)

● Information Gain:

GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)

where parent node p is split into k partitions, and n_i is the number of records in partition i.
● Gain Ratio:

GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}, \qquad SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}

The split information penalizes splits that produce a large number of small partitions.
● Classification error:

Error(t) = 1 - \max_{i} P(i \mid t)
Example: splitting on A?
Parent: C1 = 7, C2 = 3, Gini = 0.42
The split A? (Yes / No) produces Node N1 (C1: 3, C2: 0) and Node N2 (C1: 4, C2: 3).
Gini(N1) = 1 – (3/3)² – (0/3)² = 0
Gini(N2) = 1 – (4/7)² – (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini improves!
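A quick check of this computation, reusing an illustrative Gini helper:

```python
# Verify the Gini improvement for split A? (counts from the example above).
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

parent = gini([7, 3])                               # 0.42
children = 3/10 * gini([3, 0]) + 7/10 * gini([4, 3])
print(round(parent, 3), round(children, 3))         # 0.42 0.343 (≈ 0.342 above)
```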
Tom Mitchell’s example
Day Outlook Temperature Humidity Wind Play Tennis?
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
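As a worked example of information gain on this data (helper names are mine; the standard result is that Outlook has the highest gain and is therefore the root test):

```python
# Information gain of each attribute on the Play Tennis data above
# (rows transcribed from the table).
from math import log2

rows = [
    ("Sunny","Hot","High","Weak","No"),     ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),  ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]
attrs = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    return -sum(p * log2(p) for c in set(labels)
                for p in [labels.count(c) / len(labels)])

labels = [r[-1] for r in rows]
for i, a in enumerate(attrs):
    rem = sum(
        len(sub) / len(rows) * entropy([r[-1] for r in sub])
        for v in {r[i] for r in rows}
        for sub in [[r for r in rows if r[i] == v]]
    )
    print(a, round(entropy(labels) - rem, 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
```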
Decision Tree Based Classification
● Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
Example: C4.5
Practical Issues of Classification
● Underfitting and Overfitting
● Missing Values
● Costs of Classification
Underfitting and Overfitting (Example)
Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1
Triangular points: sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1
Underfitting and Overfitting
(Figure: training vs. test error as the number of nodes grows — test error eventually rises: overfitting)
Underfitting: when the model is too simple, both training and test errors are large
Overfitting due to Noise
(Figure: a decision boundary distorted by noisy training points)
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region:
– An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
How to Address Overfitting
● Post-pruning
– Grow the decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If generalization error improves after trimming, replace the sub-tree with a leaf node
– The class label of the leaf node is determined from the majority class of instances in the sub-tree
– Can use MDL (Minimum Description Length) for post-pruning; a code sketch follows below
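The slides' pessimistic-error/MDL pruning is not built into scikit-learn, but cost-complexity post-pruning follows the same grow-then-trim idea; a minimal sketch (the dataset and parameters are illustrative, and in practice a separate validation set would estimate generalization error):

```python
# Post-pruning via scikit-learn's cost-complexity pruning (a different
# criterion than pessimistic error, but the same grow-then-trim idea).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # grow fully
path = full.cost_complexity_pruning_path(X_tr, y_tr)            # candidate alphas

best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_te, y_te),      # held-out proxy for generalization
)
print(full.tree_.node_count, "->", best.tree_.node_count, "nodes")
```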
Example of Post-Pruning
Training error (before splitting) = 10/30
(Figure: a tree whose root tests A1, with internal tests A2, A3, A4)
Case 1: leaves with counts C0: 11, C1: 3 and C0: 2, C1: 4
Case 2: leaves with counts C0: 14, C1: 3 and C0: 2, C1: 2
– Pessimistic error?
Don’t prune case 1, prune case 2
Handling Missing Attribute Values
Example: one record’s Refund value is missing, so the split on Refund covers only 9 of the 10 records (the child weights 0.3 and 0.6 sum to 0.9):
Entropy(Children) = 0.3 × (0) + 0.6 × (0.9183) = 0.551
Gain = 0.9 × (0.8813 – 0.551) = 0.2973
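A quick check of these numbers, transcribing the example's counts (10 records with 3 of class Yes; Refund=Yes covers 3 records with no Yes, Refund=No covers 6 records with 2 Yes; one Refund value missing):

```python
# Verify the missing-value information gain above.
from math import log2

def entropy(*counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

parent = entropy(3, 7)                                  # 0.8813
children = 0.3 * entropy(0, 3) + 0.6 * entropy(2, 4)    # 0.551
print(round(0.9 * (parent - children), 4))              # 0.2973
```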
Distribute Instances
Other Issues
● Data Fragmentation
● Search Strategy
● Expressiveness
● Tree Replication
Data Fragmentation
● The number of records gets smaller as you traverse down the tree, so decisions at the leaves may rest on statistically insignificant samples
Search Strategy
● Other strategies?
– Bottom-up
– Bi-directional
Expressiveness
(Figure: a 2-D data set on the unit square partitioned by axis-parallel tests such as x < 0.43? and y < 0.33?; each leaf region contains points of only one class, e.g. 4:0 or 0:4)
● The border line between two neighboring regions of different classes is known as the decision boundary
● The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Oblique Decision Trees
x + y < 1
Class = +     Class = –
(Figure: this oblique boundary requires a test on multiple attributes at once; an axis-parallel tree can only approximate it with many splits)
Tree Replication
(Figure: the same subtree, containing tests Q and S, is replicated in different branches of the tree)
Metrics for Performance Evaluation
● Confusion Matrix:

                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS   Class=Yes  a (TP)      b (FN)
               Class=No   c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS   Class=Yes  a (TP)      b (FN)
               Class=No   c (FP)      d (TN)

Accuracy = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}
Limitation of Accuracy
● Accuracy can be misleading for imbalanced classes: always predicting the majority class yields high accuracy while detecting none of the minority class
● A cost-sensitive alternative weights each cell of the confusion matrix:

Weighted\ Accuracy = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}
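A small sketch computing both metrics from confusion-matrix counts; the imbalanced counts are illustrative, not from the slides:

```python
# Accuracy vs. weighted accuracy on an imbalanced confusion matrix
# (illustrative counts: 9990 negatives all predicted negative,
#  10 positives all missed).
def accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)

def weighted_accuracy(a, b, c, d, w1, w2, w3, w4):
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

a, b, c, d = 0, 10, 0, 9990          # TP, FN, FP, TN
print(accuracy(a, b, c, d))          # 0.999 -- looks great, detects nothing
print(weighted_accuracy(a, b, c, d, w1=100, w2=100, w3=1, w4=1))
# ~0.909 -- weighting positives 100x exposes the failure
```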
Model Evaluation
At threshold t:
TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
ROC Curve
(TP,FP):
● (0,0): declare everything
to be negative class
● (1,1): declare everything
to be positive class
● (1,0): ideal
● Diagonal line:
– Random guessing
– Below diagonal line:
◆ prediction is opposite of the true class
Using ROC for Model Comparison
● No model consistently outperforms the other
● M1 is better for
small FPR
● M2 is better for
large FPR
How to Construct an ROC Curve
Apply the model to the test records, sort them by decreasing output score, and count TP, FP, TN, FN at each threshold:

TP    5    4    4    3    3    3    3    2    2    1    0
FP    5    5    4    4    3    2    1    1    0    0    0
TN    0    0    1    1    2    3    4    4    5    5    5
FN    0    1    1    2    2    2    2    3    3    4    5
TPR   1    0.8  0.8  0.6  0.6  0.6  0.6  0.4  0.4  0.2  0
FPR   1    1    0.8  0.8  0.6  0.4  0.2  0.2  0    0    0

ROC Curve: plot TPR (y-axis) against FPR (x-axis) at each threshold, where FPR = FP / (FP + TN).
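A sketch of this construction from classifier scores; the scores and labels below are illustrative stand-ins, since the slide's score column did not survive extraction:

```python
# Build ROC points by sweeping a threshold over sorted scores
# (scores/labels are made-up stand-ins for the slide's P(+|A) column;
#  records tied on score would properly share a single threshold).
def roc_points(scores, labels):             # labels: 1 = positive, 0 = negative
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    P, N = sum(labels), len(labels) - sum(labels)
    tp = fp = 0
    points = [(0.0, 0.0)]                   # threshold above every score
    for i in order:                         # lower the threshold record by record
        tp += labels[i]
        fp += 1 - labels[i]
        points.append((fp / N, tp / P))     # (FPR, TPR)
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]
print(roc_points(scores, labels))
```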