UNIT 2 - Decision Tree - Issues
Unit 2 – Issues in Decision Tree Learning
By Dr. G. Sunitha
Professor & BoS Chairperson
Department of CSE
Decision Tree - Overfitting
• Definition: Given a hypothesis space H, a hypothesis h1 ∈ H is said to overfit the training data if there exists some alternative hypothesis h2 ∈ H, such that
  – h1 has smaller error than h2 over the training examples,
  – but h2 has a smaller error than h1 over the entire distribution of instances.
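This gap between training error and true error can be observed empirically. Below is a minimal sketch, assuming scikit-learn and a synthetic dataset (both illustrative choices, not from the slides), in which a fully grown tree plays the role of h1 and a depth-limited tree the role of h2:

```python
# Minimal sketch: a fully grown tree (h1) vs. a depth-limited tree (h2).
# Library and dataset choices are illustrative assumptions only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise so a full-depth tree is prone to overfitting.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

h1 = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)               # grown in full
h2 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)  # restricted

# h1 typically has the lower TRAINING error ...
print("h1 train/test accuracy:", h1.score(X_train, y_train), h1.score(X_test, y_test))
# ... but h2 typically has the lower error on unseen data (the test set here
# stands in for the entire distribution of instances).
print("h2 train/test accuracy:", h2.score(X_train, y_train), h2.score(X_test, y_test))
```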
Decision Tree – Overfitting . . .
• Pruning: Pruning is the process of removing unwanted branches from the tree.
• All approaches to avoid overfitting in decision tree learning fall into two types, as sketched below:
  – Stopping the growth of the tree early, before it reaches the point where it perfectly classifies the training data (pre-pruning, by setting hyperparameters).
  – Allowing the tree to grow in full, and then pruning the tree (post-pruning).
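A minimal sketch of how the two types might be expressed, assuming scikit-learn (the slides do not prescribe a library; the hyperparameter names below are scikit-learn's, and its built-in post-pruning is cost-complexity pruning, a different technique from the reduced error pruning described later):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growth early by setting hyperparameters before training.
pre_pruned = DecisionTreeClassifier(
    max_depth=4,           # cap the depth of the tree
    min_samples_split=20,  # do not split nodes with fewer than 20 examples
    min_samples_leaf=5,    # every leaf must keep at least 5 examples
)

# Post-pruning: grow the tree in full, then prune it back.
# scikit-learn implements this as cost-complexity pruning via ccp_alpha.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)  # larger alpha -> more pruning
```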
K-fold Cross Validation
• The available data is partitioned into k equal-sized folds. Each fold in turn serves as the validation set while the remaining k − 1 folds are used for training, and the k validation scores are averaged.
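A minimal sketch of this procedure, assuming NumPy arrays and scikit-learn (illustrative choices; the helper kfold_accuracy is hypothetical):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def kfold_accuracy(X, y, k=5):
    """Average validation accuracy of a decision tree over k folds.
    X and y are NumPy arrays."""
    scores = []
    # Each of the k folds serves once as the validation set.
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```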
Reduced Error Pruning
• Pruning a decision node Di consists of removing the subtree rooted at Di and making Di a leaf node.
• A node Di is pruned only if the resulting pruned tree performs no worse than the original tree.
• Pruning decisions are based on the performance of the model on a validation dataset.
• Pruning is an iterative process that always chooses the node whose removal improves the tree's performance the most (see the sketch below).
• Pruning continues until any further pruning would degrade the performance of the tree.
• Not suitable when the available dataset is small.
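Below is a hedged sketch of this procedure on a toy tree representation. The nested-dict node format and every helper name (classify, accuracy, internal_nodes, reduced_error_prune) are illustrative assumptions, not the slides' notation:

```python
# Internal node: {"attr": name, "branches": {value: subtree}, "majority": label}
# Leaf: a bare class label. val_data is a list of (example_dict, label) pairs.

def classify(node, x):
    # A node flagged "pruned" behaves as a leaf predicting its majority class.
    while isinstance(node, dict) and not node.get("pruned"):
        node = node["branches"].get(x[node["attr"]], node["majority"])
    return node["majority"] if isinstance(node, dict) else node

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    if isinstance(node, dict):
        yield node
        for child in node["branches"].values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, val_data):
    while True:
        base = accuracy(tree, val_data)
        best_gain, best_node = -1.0, None
        for node in internal_nodes(tree):
            if node.get("pruned"):
                continue
            node["pruned"] = True                 # trial: turn Di into a leaf
            gain = accuracy(tree, val_data) - base
            node["pruned"] = False                # undo the trial prune
            if gain > best_gain:
                best_gain, best_node = gain, node
        if best_node is None or best_gain < 0:    # further pruning would degrade
            return tree
        best_node["pruned"] = True                # prune the best node for real
```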
Rule Post Pruning
Rule post pruning is used in C4.5.
It involves the following steps:
1) Infer the decision tree from the training data.
2) Convert the tree to rules – one rule per branch.
   – Each path corresponds to a rule.
   – Each node along a path corresponds to a precondition.
   – Each leaf classification is the postcondition.
   – Ex: If age = “senior” ∧ credit-rating = “excellent”, then buys_computer = “yes”
3) Prune each rule by removing any preconditions whose removal results in improved estimated accuracy.
4) Sort the pruned rules by their estimated accuracy and consider them in this sequence when classifying unseen instances. (A sketch of step 2 follows.)
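A minimal sketch of step 2, reusing the toy nested-dict node format from the pruning sketch above (tree_to_rules is a hypothetical helper):

```python
def tree_to_rules(node, preconditions=()):
    """Yield one (preconditions, postcondition) rule per root-to-leaf path."""
    if not isinstance(node, dict):                  # leaf reached: emit the rule
        yield list(preconditions), node
        return
    for value, subtree in node["branches"].items():
        # Each attribute test along the path becomes one more precondition.
        yield from tree_to_rules(subtree, preconditions + ((node["attr"], value),))

# Ex: a path age=senior -> credit-rating=excellent -> "yes" becomes the rule
#   IF age = senior AND credit-rating = excellent THEN buys_computer = yes
```

Step 3 would then drop preconditions from each rule independently whenever doing so improves the rule's estimated accuracy.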
Rule Post Pruning . . .
Why Convert The Decision Tree To Rules Before Pruning?
• Each distinct path through the tree becomes a separate rule, so the pruning decision for an attribute test can be made differently for each path, rather than only choosing between removing a node's subtree entirely or keeping it.
• Converting to rules removes the distinction between attribute tests that occur near the root and those that occur near the leaves.
• Rules are often easier for people to understand.
Handling continuous-valued attributes
• How does ID3 handle continuous attributes? The basic ID3 algorithm is restricted to discrete-valued attributes; a continuous attribute is handled by dynamically defining a boolean test of the form A < c against a chosen threshold c.
• Example: sorted values of the continuous attribute Income (K): 12, 18, 20, 25, 34, 47, 52, 55
Handling continuous-valued attributes . . .

Income (K)   Buys_Computer
12           No
18           Yes
20           Yes
25           Yes
34           Yes
47           No
52           No
55           No

• Candidate thresholds lie at the midpoints between adjacent sorted values whose class labels differ: (12 + 18)/2 = 15 and (34 + 47)/2 = 40.5. The candidate yielding the higher information gain is selected, as in the sketch below.
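A minimal sketch of threshold selection by information gain over the table above (entropy and best_threshold are hypothetical helper names):

```python
import math

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def best_threshold(values, labels):
    """Evaluate a candidate threshold at each midpoint between adjacent
    sorted values whose class labels differ; return (gain, threshold)."""
    pairs = sorted(zip(values, labels))
    base = entropy([y for _, y in pairs])
    best = (-1.0, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][1] != pairs[i][1]:            # class label changes here
            t = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint, e.g. 15 or 40.5
            left = [y for v, y in pairs if v <= t]
            right = [y for v, y in pairs if v > t]
            weighted = (len(left) * entropy(left)
                        + len(right) * entropy(right)) / len(pairs)
            best = max(best, (base - weighted, t))
    return best

income = [12, 18, 20, 25, 34, 47, 52, 55]
buys = ["No", "Yes", "Yes", "Yes", "Yes", "No", "No", "No"]
print(best_threshold(income, buys))  # -> (about 0.55, 40.5): test Income < 40.5
```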
Alternative Attribute Selection Measures
• There is a natural bias in the Information Gain measure: it favors attributes with many unique values over those with few unique values.
• Ex: an attribute such as Date, which is nearly unique for every record, obtains a very high information gain yet yields a tree that generalizes poorly to unseen instances.
Alternative Attribute Selection Measures . . .
• Alternative Measures:
  – Gain Ratio – penalizes attributes with many unique values by incorporating “split information”. Split information is sensitive to how broadly and uniformly the attribute splits the data:

$\mathrm{SplitInformation}_m(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$

where $S_1, \dots, S_c$ are the subsets of $S$ produced by partitioning on the $c$ values of attribute $A$.
Alternative Attribute Selection Measures . . .
• Alternative Measures:
  – Gain Ratio . . .

$\mathrm{Gain}_m(S, A) = \frac{|S_A| - |M_A|}{|S_A|} \, \mathrm{Gain}(S, A)$

$\mathrm{GainRatio}_m(S, A) = \frac{\mathrm{Gain}_m(S, A)}{\mathrm{SplitInformation}_m(S, A)}$

where $|S_A|$ is the total number of examples and $|M_A|$ is the number of examples with a missing value for attribute $A$.
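A minimal sketch of the split-information penalty in Python (helper names are illustrative):

```python
import math

def split_information(subset_sizes):
    """-sum(|Si|/|S| * log2(|Si|/|S|)) over the subsets produced by A."""
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s)

def gain_ratio(gain, subset_sizes):
    return gain / split_information(subset_sizes)

# The penalty in action: a 14-way split (one branch per record) versus a
# binary split of the same 14 records.
print(split_information([1] * 14))  # = log2(14), about 3.81 -> large denominator
print(split_information([9, 5]))    # about 0.94 -> small denominator
```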
Handling Missing Attribute Values in Training Data
• Ignore the tuple:
  – usually done when the class label is missing.
  – not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious, and often infeasible for large datasets.
• Fill in the missing value automatically with (sketched below):
  – a global constant
    • e.g., “unknown”, 0, −1, ∞, etc.
    • affects the quality of learning
  – the attribute mean
  – the attribute mean for all samples belonging to the same class
  – the attribute mode
  – the attribute median
• C4.5 is the successor of ID3 and inherently supports handling attributes with missing values.
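A minimal sketch of the automatic fill strategies, assuming pandas (the toy DataFrame and its column names are illustrative assumptions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income":  [12, 18, np.nan, 25, 34, np.nan, 52, 55],
    "outlook": ["sunny", None, "rain", "overcast", None, "rain", "sunny", "rain"],
    "buys_computer": ["No", "Yes", "Yes", "Yes", "Yes", "No", "No", "No"],
})

df["outlook"] = df["outlook"].fillna("unknown")           # global constant
df["income"] = df["income"].fillna(df["income"].mean())   # attribute mean
# Mode and median analogues: df["outlook"].mode()[0], df["income"].median()

# Alternative: the attribute mean over samples belonging to the same class.
# df["income"] = df.groupby("buys_computer")["income"].transform(
#     lambda s: s.fillna(s.mean()))
```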
Handling Missing Attribute Values in Training Data . . .
[Example dataset: a training set D of 14 records in which the Outlook value is missing for 6 records.]
Handling Missing Attribute Values in Training Data . . .
• Evaluating the attribute Outlook:

$\mathrm{Gain}_m(D, \mathrm{outlook}) = \frac{|S_{\mathrm{outlook}}| - |M_{\mathrm{outlook}}|}{|S_{\mathrm{outlook}}|} \, \mathrm{Gain}(D, \mathrm{outlook})$
Handling Missing Attribute Values in Training Data . . .
• Entropy of D over the 8 records with a known Outlook value (5 of one class, 3 of the other):

$\mathrm{Entropy}(D) = -\frac{5}{8}\log_2\frac{5}{8} - \frac{3}{8}\log_2\frac{3}{8} = 0.95$
Handling Missing Attribute Values in Training Data . . .
[Diagram: the root node Outlook branches into Sunny, Rain, and Overcast, plus a branch for the records with missing values; the entropy of each subset is computed below.]

$\mathrm{Entropy}(D_{\mathrm{sunny}}) = -\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1} = 0$

$\mathrm{Entropy}(D_{\mathrm{rain}}) = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1$

$\mathrm{Entropy}(D_{\mathrm{overcast}}) = -\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3} = 0$

$\mathrm{Gain}(D, \mathrm{outlook}) = 0.95 - \left(\frac{1}{8}\times 0 + \frac{4}{8}\times 1 + \frac{3}{8}\times 0\right) = 0.45$

$\mathrm{Gain}_m(D, \mathrm{outlook}) = \frac{14 - 6}{14} \times \mathrm{Gain}(D, \mathrm{outlook}) = \frac{8}{14}\times 0.45 = 0.257$
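The computation can be checked numerically; a minimal sketch (the entropy helper is hypothetical):

```python
import math

def entropy(counts):
    """Entropy from class counts, e.g. [5, 3]."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# 8 records have a known Outlook value (5 of one class, 3 of the other).
base = entropy([5, 3])                                                # about 0.954
weighted = (1/8)*entropy([0, 1]) + (4/8)*entropy([2, 2]) + (3/8)*entropy([3, 0])
gain = base - weighted                                                # about 0.454
gain_m = (14 - 6) / 14 * gain
print(round(gain_m, 3))  # about 0.26 unrounded; rounding Entropy to 0.95 and
                         # Gain to 0.45, as on the slide, gives 0.257
```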
Handling Missing Attribute Values in Training Data . . .
• Split information treats the records with missing values as one more branch of Outlook (subset sizes 1, 4, 3, and 6 out of 14):

$\mathrm{SplitInformation}_m(D, \mathrm{outlook}) = -\frac{1}{14}\log_2\frac{1}{14} - \frac{4}{14}\log_2\frac{4}{14} - \frac{3}{14}\log_2\frac{3}{14} - \frac{6}{14}\log_2\frac{6}{14} = 1.788$

$\mathrm{GainRatio}_m(D, \mathrm{outlook}) = \frac{0.257}{1.788} = 0.144$
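Continuing the numerical check above (reusing its entropy helper and gain_m value):

```python
# SplitInformation over branch sizes 1, 4, 3, 6 (missing values = 4th branch):
split_info = entropy([1, 4, 3, 6])                          # about 1.788
print(round(split_info, 3), round(gain_m / split_info, 3))  # 1.788 0.145
```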
Handling Attributes with Different Costs
• In some learning tasks, the importance (or priority) of attributes may vary.
• The priority of the attributes is represented in the form of costs.
• In such cases, the decision tree should use low-cost attributes wherever possible; high-cost attributes shall be used only when needed to produce reliable classifications.
• ID3 can be modified to consider attribute costs when constructing the decision tree.
• Other attribute selection measures (sketched below):
  – Ex1: $\frac{\mathrm{Gain}^2(D, A)}{\mathrm{Cost}(A)}$
  – Ex2: $\frac{2^{\mathrm{Gain}(D, A)} - 1}{(\mathrm{Cost}(A) + 1)^w}$, where $w \in [0, 1]$ represents the relative importance of cost versus Gain.
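A minimal sketch of the two example measures (function names are illustrative):

```python
def measure_ex1(gain, cost):
    """Ex1: Gain^2(D, A) / Cost(A)."""
    return gain ** 2 / cost

def measure_ex2(gain, cost, w=0.5):
    """Ex2: (2^Gain(D, A) - 1) / (Cost(A) + 1)^w, with w in [0, 1]."""
    return (2 ** gain - 1) / (cost + 1) ** w

# A cheap, moderately informative attribute can outrank a costlier one:
print(measure_ex2(gain=0.4, cost=1))    # about 0.23
print(measure_ex2(gain=0.6, cost=100))  # about 0.05
```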