Fundamentals of Machine Learning for Predictive Data Analytics
5 Model Ensembles
Boosting
Bagging
6 Summary
$$
\begin{aligned}
\mathrm{IG}(\mathrm{STREAM}, D) &= 0.3060\\
\mathrm{IG}(\mathrm{SLOPE}, D) &= 0.5774\\
\mathrm{IG}(\mathrm{ELEVATION}, D) &= 0.8774
\end{aligned}
$$
$$
\begin{aligned}
H(\mathrm{STREAM}, D) &= -\sum_{l \in \{\text{'true'},\,\text{'false'}\}} P(\mathrm{STREAM}=l) \times \log_2\big(P(\mathrm{STREAM}=l)\big)\\
&= -\left(\tfrac{4}{7} \times \log_2\!\left(\tfrac{4}{7}\right) + \tfrac{3}{7} \times \log_2\!\left(\tfrac{3}{7}\right)\right)\\
&= 0.9852 \text{ bits}
\end{aligned}
$$

$$
\begin{aligned}
H(\mathrm{SLOPE}, D) &= -\sum_{l \in \{\text{'flat'},\,\text{'moderate'},\,\text{'steep'}\}} P(\mathrm{SLOPE}=l) \times \log_2\big(P(\mathrm{SLOPE}=l)\big)\\
&= -\left(\tfrac{1}{7} \times \log_2\!\left(\tfrac{1}{7}\right) + \tfrac{1}{7} \times \log_2\!\left(\tfrac{1}{7}\right) + \tfrac{5}{7} \times \log_2\!\left(\tfrac{5}{7}\right)\right)\\
&= 1.1488 \text{ bits}
\end{aligned}
$$

$$
\begin{aligned}
H(\mathrm{ELEVATION}, D) &= -\sum_{l \in \{\text{'low'},\,\text{'medium'},\,\text{'high'},\,\text{'highest'}\}} P(\mathrm{ELEVATION}=l) \times \log_2\big(P(\mathrm{ELEVATION}=l)\big)\\
&= -\left(\tfrac{1}{7} \times \log_2\!\left(\tfrac{1}{7}\right) + \tfrac{2}{7} \times \log_2\!\left(\tfrac{2}{7}\right) + \tfrac{3}{7} \times \log_2\!\left(\tfrac{3}{7}\right) + \tfrac{1}{7} \times \log_2\!\left(\tfrac{1}{7}\right)\right)\\
&= 1.8424 \text{ bits}
\end{aligned}
$$
$$
\begin{aligned}
\mathrm{GR}(\mathrm{STREAM}, D) &= \frac{0.3060}{0.9852} = 0.3106\\
\mathrm{GR}(\mathrm{SLOPE}, D) &= \frac{0.5774}{1.1488} = 0.5026\\
\mathrm{GR}(\mathrm{ELEVATION}, D) &= \frac{0.8774}{1.8424} = 0.4762
\end{aligned}
$$
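The entropies and gain ratios above can be reproduced from value counts alone. A minimal sketch (not the book's code); the counts follow from the fractions in the calculations: STREAM 4 true / 3 false, SLOPE 1 flat / 1 moderate / 5 steep, ELEVATION 1 low / 2 medium / 3 high / 1 highest.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of value counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

feature_counts = {
    "STREAM":    [4, 3],          # true, false
    "SLOPE":     [1, 1, 5],       # flat, moderate, steep
    "ELEVATION": [1, 2, 3, 1],    # low, medium, high, highest
}

# Information gain values quoted on the earlier slide.
info_gain = {"STREAM": 0.3060, "SLOPE": 0.5774, "ELEVATION": 0.8774}

for f, counts in feature_counts.items():
    h = entropy(counts)           # H(f, D)
    gr = info_gain[f] / h         # GR(f, D) = IG(f, D) / H(f, D)
    print(f"H({f}, D) = {h:.4f} bits, GR({f}, D) = {gr:.4f}")
# H(STREAM, D) = 0.9852 bits, GR(STREAM, D) = 0.3106
# H(SLOPE, D) = 1.1488 bits, GR(SLOPE, D) = 0.5026
# H(ELEVATION, D) = 1.8424 bits, GR(ELEVATION, D) = 0.4762
```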
[Figure: the decision tree rooted at the Slope feature that results from using gain ratio for feature selection.]
$$
\begin{aligned}
\mathrm{Gini}(\mathrm{VEGETATION}, D) &= 1 - \sum_{l \in \{\text{'chaparral'},\,\text{'riparian'},\,\text{'conifer'}\}} P(\mathrm{VEGETATION}=l)^2\\
&= 1 - \left(\left(\tfrac{3}{7}\right)^2 + \left(\tfrac{2}{7}\right)^2 + \left(\tfrac{2}{7}\right)^2\right)\\
&= 0.6531
\end{aligned}
$$
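The same calculation as a short sketch; the class counts (3 chaparral, 2 riparian, 2 conifer) follow from the fractions in the sum above.

```python
def gini(counts):
    """Gini index of a list of class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(round(gini([3, 2, 2]), 4))  # 0.6531, as on the slide
```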
Elevation
├─ <4,175: Stream
│   ├─ true: Elevation
│   │   ├─ <2,250: riparian
│   │   └─ ≥2,250: chaparral
│   └─ false: chaparral
└─ ≥4,175: conifer

Figure: The decision tree that would be generated for the vegetation classification dataset listed in Table 3 [17] using information gain.
[Figure: four panels showing a model fitted to a set of points: (a) the target, (b) underfitting, (c) a 'Goldilocks' fit, and (d) overfitting.]
Season
├─ D1: ID 1 (Work Day: false, Rentals: 800); ID 2 (false, 826); ID 3 (true, 900)
├─ D2: ID 4 (false, 2,100); ID 5 (true, 4,740); ID 6 (true, 4,900)
├─ D3: ID 7 (false, 3,000); ID 8 (true, 5,800); ID 9 (true, 6,200)
└─ D4: ID 10 (false, 2,910); ID 11 (false, 2,880); ID 12 (true, 2,820)

Figure: The decision tree resulting from splitting the data in Table 5 [25] using the feature Season.
Figure: The final decision tree induced from the dataset in Table 5 [25], rooted at Season. To illustrate how the tree generates predictions, this tree lists the instances that ended up at each leaf node and the prediction (PRED.) made by each leaf node.
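In a regression tree such as this one, the prediction made by a leaf is the mean of the target values of the training instances that reach it. A minimal sketch (the partition names D1–D4 and the Rentals values are taken from the Season split above; everything else is illustrative):

```python
# Leaf predictions for the Season split: the mean Rentals of each partition.
partitions = {
    "D1": [800, 826, 900],
    "D2": [2100, 4740, 4900],
    "D3": [3000, 5800, 6200],
    "D4": [2910, 2880, 2820],
}

for name, rentals in partitions.items():
    prediction = sum(rentals) / len(rentals)
    print(f"{name}: prediction = {prediction:,.2f} rentals")
# D1: 842.00, D2: 3,913.33, D3: 5,000.00, D4: 2,870.00
```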
Core-Temp [icu]
├─ low: Gender [icu]
└─ high: Stable-Temp [gen]

Figure: The decision tree for the post-operative patient routing task.
Figure: The iterations of reduced error pruning for the decision tree in Figure 7 [34] using the validation set in Table 7 [33]. The subtree that is being considered for pruning in each iteration is highlighted in black. The prediction returned by each non-leaf node is listed in square brackets. The error rate for each node is given in round brackets.
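Reduced error pruning works bottom-up over a validation set: a subtree is replaced by a leaf (predicting the subtree's majority target level) whenever doing so does not increase the validation error. A minimal sketch, assuming a simple dict-based tree representation rather than the book's implementation:

```python
# A leaf is {"predict": label}; an internal node is
# {"feature": name, "predict": majority_label, "children": {value: subtree}}.
# Validation instances are dicts of feature values plus a "target" key.

def route(node, instances):
    """Partition the instances among an internal node's branches."""
    buckets = {}
    for x in instances:
        buckets.setdefault(x[node["feature"]], []).append(x)
    return buckets

def errors(node, instances):
    """Number of the given validation instances misclassified by the subtree."""
    if "children" not in node:
        return sum(node["predict"] != x["target"] for x in instances)
    buckets = route(node, instances)
    return sum(errors(child, buckets.get(value, []))
               for value, child in node["children"].items())

def reduced_error_prune(node, instances):
    """Bottom-up: replace a subtree with a leaf predicting the subtree's
    majority class whenever that does not increase the error on the
    validation instances that reach the subtree."""
    if "children" not in node:
        return node
    buckets = route(node, instances)
    for value, child in node["children"].items():
        node["children"][value] = reduced_error_prune(child, buckets.get(value, []))
    as_leaf = {"predict": node["predict"]}
    if errors(as_leaf, instances) <= errors(node, instances):
        return as_leaf
    return node
```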
Advantages of pruning:
Smaller trees are easier to interpret
Increased generalization accuracy when there is noise in
the training data (noise dampening).
Model Ensembles
Boosting
Weighted Dataset
Each instance has an associated weight w_i ≥ 0, initially set to 1/n, where n is the number of instances in the dataset.
After each model is added to the ensemble, it is tested on the training data: the weights of the instances the model gets correct are decreased, and the weights of the instances the model gets incorrect are increased.
These weights are used as a distribution over which the dataset is sampled to create a replicated training set, where the replication of an instance is proportional to its weight (see the sketch below).
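A minimal sketch of this weighting scheme (hypothetical helper functions, not the book's code). The constant up/down-weighting factor is an assumption for illustration; boosting algorithms such as AdaBoost derive the factor from the model's error rate.

```python
import random

def init_weights(n):
    """Every instance starts with weight 1/n."""
    return [1.0 / n] * n

def update_weights(weights, predictions, targets, factor=2.0):
    """Increase the weights of misclassified instances, decrease the weights
    of correctly classified ones, then renormalise so they sum to 1."""
    new = [w * factor if p != t else w / factor
           for w, p, t in zip(weights, predictions, targets)]
    total = sum(new)
    return [w / total for w in new]

def replicated_training_set(dataset, weights, rng=random):
    """Sample (with replacement) a training set of the original size, where an
    instance's chance of being replicated is proportional to its weight."""
    return rng.choices(dataset, weights=weights, k=len(dataset))
```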
Bagging
[Figure: bagging and subspace sampling. Samples of the dataset (ID, F1, F2, F3, Target), each restricted to a subset of the descriptive features (e.g. {F3, F2, F1}, {F1, F3}), are used to train the models that make up the MODEL ENSEMBLE.]
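A minimal sketch of bagging with subspace sampling, assuming hypothetical `train_model` and `classify` callbacks (the slides do not prescribe a particular implementation):

```python
import random

def bootstrap_sample(dataset, rng=random):
    """Sample n instances with replacement from an n-instance dataset."""
    return [rng.choice(dataset) for _ in dataset]

def subspace_sample(features, k, rng=random):
    """Choose a random subset of k descriptive features, e.g. from {F1, F2, F3}."""
    return rng.sample(features, k)

def build_ensemble(dataset, features, train_model, size=3, k=2, rng=random):
    """Train `size` models, each on its own bootstrap sample restricted to its
    own random feature subset; return (model, feature_subset) pairs."""
    ensemble = []
    for _ in range(size):
        sample = bootstrap_sample(dataset, rng)
        subset = subspace_sample(features, k, rng)
        ensemble.append((train_model(sample, subset), subset))
    return ensemble

def ensemble_predict(ensemble, instance, classify):
    """Aggregate the members' predictions by majority vote."""
    votes = [classify(model, instance, subset) for model, subset in ensemble]
    return max(set(votes), key=votes.count)
```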
Summary