

Fundamentals of Machine Learning for Predictive Data Analytics
Chapter 4: Information-based Learning
Sections 4.4, 4.5

John Kelleher and Brian Mac Namee and Aoife D’Arcy




1 Alternative Feature Selection Metrics

2 Handling Continuous Descriptive Features

3 Predicting Continuous Targets

4 Noisy Data, Overfitting and Tree Pruning

5 Model Ensembles
Boosting
Bagging

6 Summary

Alternative Feature Selection Metrics

Entropy-based information gain prefers features with many values.
One way of addressing this issue is to use the information gain ratio, which is computed by dividing the information gain of a feature by the amount of information used to determine the value of the feature:
$$GR(d, D) = \frac{IG(d, D)}{-\sum_{l \in levels(d)} P(d = l) \times \log_2\big(P(d = l)\big)} \qquad (1)$$

IG(STREAM, D) = 0.3060
IG(SLOPE, D) = 0.5774
IG(ELEVATION, D) = 0.8774

$$H(STREAM, D) = -\sum_{l \in \{\text{'true'}, \text{'false'}\}} P(STREAM = l) \times \log_2\big(P(STREAM = l)\big) = -\Big(\tfrac{4}{7}\log_2\tfrac{4}{7} + \tfrac{3}{7}\log_2\tfrac{3}{7}\Big) = 0.9852 \text{ bits}$$

$$H(SLOPE, D) = -\sum_{l \in \{\text{'flat'}, \text{'moderate'}, \text{'steep'}\}} P(SLOPE = l) \times \log_2\big(P(SLOPE = l)\big) = -\Big(\tfrac{1}{7}\log_2\tfrac{1}{7} + \tfrac{1}{7}\log_2\tfrac{1}{7} + \tfrac{5}{7}\log_2\tfrac{5}{7}\Big) = 1.1488 \text{ bits}$$

$$H(ELEVATION, D) = -\sum_{l \in \{\text{'low'}, \text{'medium'}, \text{'high'}, \text{'highest'}\}} P(ELEVATION = l) \times \log_2\big(P(ELEVATION = l)\big) = -\Big(\tfrac{1}{7}\log_2\tfrac{1}{7} + \tfrac{2}{7}\log_2\tfrac{2}{7} + \tfrac{3}{7}\log_2\tfrac{3}{7} + \tfrac{1}{7}\log_2\tfrac{1}{7}\Big) = 1.8424 \text{ bits}$$

$$GR(STREAM, D) = \frac{0.3060}{0.9852} = 0.3106$$

$$GR(SLOPE, D) = \frac{0.5774}{1.1488} = 0.5026$$

$$GR(ELEVATION, D) = \frac{0.8774}{1.8424} = 0.4762$$
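
To make the computation concrete, here is a short Python sketch (illustrative, not from the slides) that reproduces these information gain and gain ratio values; the per-instance levels of the categorical ELEVATION feature are taken from the partition table later in this section.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# Vegetation dataset d1..d7: (STREAM, SLOPE, ELEVATION, VEGETATION)
data = [
    ("false", "steep",    "high",    "chapparal"),   # d1
    ("true",  "moderate", "low",     "riparian"),    # d2
    ("true",  "steep",    "medium",  "riparian"),    # d3
    ("false", "steep",    "medium",  "chapparal"),   # d4
    ("false", "flat",     "high",    "conifer"),     # d5
    ("true",  "steep",    "highest", "conifer"),     # d6
    ("true",  "steep",    "high",    "chapparal"),   # d7
]
target = [row[3] for row in data]

for name, idx in [("STREAM", 0), ("SLOPE", 1), ("ELEVATION", 2)]:
    values = [row[idx] for row in data]
    # remainder: weighted entropy of the target within each partition
    rem = sum((values.count(v) / len(data))
              * entropy([t for val, t in zip(values, target) if val == v])
              for v in set(values))
    ig = entropy(target) - rem    # information gain
    gr = ig / entropy(values)     # gain ratio: divide by the feature's own entropy
    print(f"{name}: IG = {ig:.4f}, GR = {gr:.4f}")
# STREAM:    IG = 0.3060, GR = 0.3105  (slides show 0.3106, computed from rounded inputs)
# SLOPE:     IG = 0.5774, GR = 0.5026
# ELEVATION: IG = 0.8774, GR = 0.4762
```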

Figure: The vegetation classification decision tree generated using information gain ratio.
  Slope = 'flat' → conifer
  Slope = 'moderate' → riparian
  Slope = 'steep' → Elevation:
    Elevation = 'low' → chaparral
    Elevation = 'medium' → Stream: 'true' → riparian; 'false' → chaparral
    Elevation = 'high' → chaparral
    Elevation = 'highest' → conifer


Another commonly used measure of impurity is the Gini index:

$$Gini(t, D) = 1 - \sum_{l \in levels(t)} P(t = l)^2 \qquad (2)$$

The Gini index can be thought of as calculating how often you would misclassify an instance in the dataset if you classified it based on the distribution of classifications in the dataset.
Information gain can be calculated using the Gini index by replacing the entropy measure with the Gini index.

$$Gini(VEGETATION, D) = 1 - \sum_{l \in \{\text{'chapparal'}, \text{'riparian'}, \text{'conifer'}\}} P(VEGETATION = l)^2 = 1 - \Big(\big(\tfrac{3}{7}\big)^2 + \big(\tfrac{2}{7}\big)^2 + \big(\tfrac{2}{7}\big)^2\Big) = 0.6531$$
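
The same style of sketch can be used for the Gini index; the per-instance feature values below match the partition table that follows, and the printed values reproduce the Gini-based gains in that table (illustrative code, not from the slides).

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Per-instance feature values for d1..d7, matching the partition table below
stream    = ["false", "true", "true", "false", "false", "true", "true"]
slope     = ["steep", "moderate", "steep", "steep", "flat", "steep", "steep"]
elevation = ["high", "low", "medium", "medium", "high", "highest", "high"]
target    = ["chapparal", "riparian", "riparian",
             "chapparal", "conifer", "conifer", "chapparal"]

def gini_gain(feature, target):
    """Information gain computed with the Gini index in place of entropy."""
    n = len(target)
    rem = sum((feature.count(v) / n)
              * gini([t for f, t in zip(feature, target) if f == v])
              for v in set(feature))
    return gini(target) - rem

print(round(gini(target), 4))   # 0.6531
for name, col in [("STREAM", stream), ("SLOPE", slope), ("ELEVATION", elevation)]:
    print(name, round(gini_gain(col, target), 4))
# STREAM 0.1054, SLOPE 0.2531, ELEVATION 0.3197 (0.3198 in the table, from rounded inputs)
```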

Table: Partition sets (Part.), Gini index, remainder (Rem.), and information gain (Info. Gain) by feature.

Split by Feature | Level      | Part. | Instances          | Gini Index | Rem.   | Info. Gain
STREAM           | 'true'     | D1    | d2, d3, d6, d7     | 0.625      | 0.5476 | 0.1054
                 | 'false'    | D2    | d1, d4, d5         | 0.4444     |        |
SLOPE            | 'flat'     | D3    | d5                 | 0          | 0.4    | 0.2531
                 | 'moderate' | D4    | d2                 | 0          |        |
                 | 'steep'    | D5    | d1, d3, d4, d6, d7 | 0.56       |        |
ELEVATION        | 'low'      | D6    | d2                 | 0          | 0.3333 | 0.3198
                 | 'medium'   | D7    | d3, d4             | 0.5        |        |
                 | 'high'     | D8    | d1, d5, d7         | 0.4444     |        |
                 | 'highest'  | D9    | d6                 | 0          |        |

Handling Continuous Descriptive Features

The easiest way to handle continuous-valued descriptive features is to turn them into Boolean features by defining a threshold and using this threshold to partition the instances based on their value of the continuous descriptive feature.
How do we set the threshold?

1 The instances in the dataset are sorted according to the continuous feature values.
2 The adjacent instances in the ordering that have different classifications are then selected as possible threshold points.
3 The optimal threshold is found by computing the information gain for each of these classification transition boundaries and selecting the boundary with the highest information gain as the threshold (see the sketch below).
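
A minimal Python sketch of this three-step procedure for the continuous ELEVATION feature used in the following slides. It assumes the candidate thresholds are the midpoints between adjacent class-transition instances, which is what the candidate values ≥750, ≥1,350, ≥2,250 and ≥4,175 shown later correspond to.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# (ELEVATION, VEGETATION) pairs from the vegetation dataset with continuous elevation
rows = [(3900, "chapparal"), (300, "riparian"), (1500, "riparian"),
        (1200, "chapparal"), (4450, "conifer"), (5000, "conifer"),
        (3000, "chapparal")]
rows.sort()                     # 1. sort by the continuous feature
target = [t for _, t in rows]

# 2. candidate thresholds: midpoints between adjacent instances whose class changes
thresholds = [(rows[i][0] + rows[i + 1][0]) / 2
              for i in range(len(rows) - 1)
              if rows[i][1] != rows[i + 1][1]]

# 3. pick the threshold with the highest information gain
def info_gain(threshold):
    below = [t for e, t in rows if e < threshold]
    above = [t for e, t in rows if e >= threshold]
    rem = (len(below) / len(rows)) * entropy(below) \
        + (len(above) / len(rows)) * entropy(above)
    return entropy(target) - rem

best = max(thresholds, key=info_gain)
print(thresholds)                        # [750.0, 1350.0, 2250.0, 4175.0]
print(best, round(info_gain(best), 4))   # 4175.0 0.8631
```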

Once a threshold has been set the dynamically created


new boolean feature can compete with the other
categorical features for selection as the splitting feature at
that node.
This process can be repeated at each node as the tree
grows.

Table: Dataset for predicting the vegetation in an area with a continuous ELEVATION feature (measured in feet).

ID  STREAM  SLOPE     ELEVATION  VEGETATION
1   false   steep     3,900      chapparal
2   true    moderate  300        riparian
3   true    steep     1,500      riparian
4   false   steep     1,200      chapparal
5   false   flat      4,450      conifer
6   true    steep     5,000      conifer
7   true    steep     3,000      chapparal

Table: Dataset for predicting the vegetation in an area sorted by the continuous ELEVATION feature.

ID  STREAM  SLOPE     ELEVATION  VEGETATION
2   true    moderate  300        riparian
4   false   steep     1,200      chapparal
3   true    steep     1,500      riparian
7   true    steep     3,000      chapparal
1   false   steep     3,900      chapparal
5   false   flat      4,450      conifer
6   true    steep     5,000      conifer

Table: Partition sets (Part.), entropy, remainder (Rem.), and information gain (Info. Gain) for the candidate ELEVATION thresholds: ≥750, ≥1,350, ≥2,250 and ≥4,175.

Split by Threshold | Part. | Instances              | Entropy | Rem.   | Info. Gain
≥750               | D1    | d2                     | 0.0     | 1.2507 | 0.3060
                   | D2    | d4, d3, d7, d1, d5, d6 | 1.4591  |        |
≥1,350             | D3    | d2, d4                 | 1.0     | 1.3728 | 0.1839
                   | D4    | d3, d7, d1, d5, d6     | 1.5219  |        |
≥2,250             | D5    | d2, d4, d3             | 0.9183  | 0.9650 | 0.5917
                   | D6    | d7, d1, d5, d6         | 1.0     |        |
≥4,175             | D7    | d2, d4, d3, d7, d1     | 0.9710  | 0.6935 | 0.8631
                   | D8    | d5, d6                 | 0.0     |        |

Figure: The vegetation classification decision tree after the dataset has been split using ELEVATION ≥ 4,175. The root tests Elevation; the < 4,175 branch contains partition D7 (d2, d4, d3, d7, d1) and the ≥ 4,175 branch contains partition D8 (d5, d6), whose two instances are both conifer.

Figure: The decision tree that would be generated for the vegetation classification dataset with the continuous ELEVATION feature using information gain.
  Elevation ≥ 4,175 → conifer
  Elevation < 4,175 → Stream:
    Stream = 'true' → Elevation: < 2,250 → riparian; ≥ 2,250 → chaparral
    Stream = 'false' → chaparral

Predicting Continuous Targets



Regression trees are constructed so as to reduce the variance in the set of training examples at each of the leaf nodes in the tree.
We can do this by adapting the ID3 algorithm to use a measure of variance rather than a measure of classification impurity (entropy) when selecting the best attribute.

The impurity (variance) at a node can be calculated using the following equation:

$$var(t, D) = \frac{\sum_{i=1}^{n} \big(t_i - \bar{t}\big)^2}{n - 1} \qquad (3)$$

We select the feature to split on at a node by selecting the feature that minimizes the weighted variance across the resulting partitions:

$$\mathbf{d}[best] = \underset{d \in \mathbf{d}}{\operatorname{argmin}} \sum_{l \in levels(d)} \frac{|D_{d=l}|}{|D|} \times var(t, D_{d=l}) \qquad (4)$$

Figure: (a) A set of instances on a continuous number line; (b), (c), and (d) depict some of the potential groupings that could be applied to these instances: (b) underfitting, (c) a "Goldilocks" grouping, and (d) overfitting.

Table: A dataset listing the number of bike rentals per day.

ID  SEASON  WORK DAY  RENTALS      ID  SEASON  WORK DAY  RENTALS
1   winter  false     800          7   summer  false     3,000
2   winter  false     826          8   summer  true      5,800
3   winter  true      900          9   summer  true      6,200
4   spring  false     2,100        10  autumn  false     2,910
5   spring  true      4,740        11  autumn  false     2,880
6   spring  true      4,900        12  autumn  true      2,820

Table: The partitioning of the bike-rentals dataset above based on the SEASON and WORK DAY features and the computation of the weighted variance for each partitioning.

Split by Feature | Level    | Part. | Instances                | |D_d=l|/|D| | var(t, D)     | Weighted Variance
SEASON           | 'winter' | D1    | d1, d2, d3               | 0.25        | 2,692         | 1,379,331 1/3
                 | 'spring' | D2    | d4, d5, d6               | 0.25        | 2,472,533 1/3 |
                 | 'summer' | D3    | d7, d8, d9               | 0.25        | 3,040,000     |
                 | 'autumn' | D4    | d10, d11, d12            | 0.25        | 2,100         |
WORK DAY         | 'true'   | D5    | d3, d5, d6, d8, d9, d12  | 0.50        | 4,026,346 2/3 | 2,551,813 1/3
                 | 'false'  | D6    | d1, d2, d4, d7, d10, d11 | 0.50        | 1,077,280     |
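
The following short sketch (illustrative, not from the slides) reproduces the weighted variances in this table using Python's statistics.variance, which computes the same sample variance as Equation (3).

```python
from statistics import variance

# Bike-rentals target values from the table above, keyed by feature level
rentals_by_season = {
    "winter": [800, 826, 900],
    "spring": [2100, 4740, 4900],
    "summer": [3000, 5800, 6200],
    "autumn": [2910, 2880, 2820],
}
rentals_by_workday = {
    "true":  [900, 4740, 4900, 5800, 6200, 2820],
    "false": [800, 826, 2100, 3000, 2910, 2880],
}

def weighted_variance(partitions):
    """Weighted sum of the sample variances of the target within each partition."""
    n = sum(len(vals) for vals in partitions.values())
    return sum((len(vals) / n) * variance(vals) for vals in partitions.values())

print(round(weighted_variance(rentals_by_season)))   # 1379331 -> SEASON gives the lower value
print(round(weighted_variance(rentals_by_workday)))  # 2551813
```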

Figure: The decision tree resulting from splitting the bike-rentals data using the feature SEASON. Each branch ('winter', 'spring', 'summer', 'autumn') leads to one of the partitions D1 through D4 listed in the table above.

Season

winter autumn

Work Day spring summer Work Day

true false true false

ID Rentals Pred. ID Rentals Pred.


ID Rentals Pred. ID Rentals Pred.
1 800 Work Day Work Day 10 2,910
3 900 900 813 12 2,820 2,820 2,895
2 826 11 2,880

true false true false

ID Rentals Pred. ID Rentals Pred.


ID Rentals Pred. ID Rentals Pred.
5 4,740 8 5,800
4,820 4 2,100 2,100 6,000 7 3,000 3,000
6 4,900 9 6,200

Figure: The final decision tree induced from the dataset in Table 5
[25]
. To illustrate how the tree generates predictions, this tree lists the
instances that ended up at each leaf node and the prediction (PRED.)
made by each leaf node.

Noisy Data, Overfitting and Tree Pruning

In the case of a decision tree, over-fitting involves splitting


the data on an irrelevant feature.

The likelihood of over-fitting occurring increases as a tree gets


deeper because the resulting classifications are based on
smaller and smaller subsets as the dataset is partitioned after
each feature test in the path.

Pre-pruning: stop the recursive partitioning early.


Pre-pruning is also known as forward pruning.
Common Pre-pruning Approaches
1 early stopping
2 χ² pruning (see the sketch below)
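
χ² pruning tests whether the distribution of target levels across a candidate split's partitions could plausibly have arisen by chance; if so, the split is not made. A minimal sketch of this idea using SciPy's chi2_contingency (the table of counts is illustrative, not from the slides):

```python
from scipy.stats import chi2_contingency  # assumes SciPy is available

# Hypothetical class counts at a candidate split:
# rows = child partitions, columns = target levels.
observed = [
    [8, 2],   # partition for feature level A: 8 positive, 2 negative
    [7, 3],   # partition for feature level B: 7 positive, 3 negative
]

chi2, p_value, dof, expected = chi2_contingency(observed)
# If the observed distribution is not significantly different from what we would
# expect by chance, the split is likely fitting noise, so we stop growing here.
if p_value > 0.05:
    print("stop: split not statistically significant")
else:
    print("keep growing: split is significant")
```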

Post-pruning: allow the algorithm to grow the tree as


much as it likes and then prune the tree of the branches
that cause over-fitting.

Common Post-pruning Approach
Using the validation set, evaluate the prediction accuracy achieved by both the fully grown tree and the pruned copy of the tree. If the pruned copy of the tree performs no worse than the fully grown tree, the node is a candidate for pruning.
Figure: Misclassification rate (0.1 to 0.5) on the training set and on the validation set plotted against training iteration (0 to 200).

Table: An example validation set for the post-operative patient routing task.

ID  CORE-TEMP  STABLE-TEMP  GENDER  DECISION
1   high       true         male    gen
2   low        true         female  icu
3   high       false        female  icu
4   high       false        male    icu
5   low        false        female  icu
6   low        true         male    icu

Figure: The decision tree for the post-operative patient routing task.
  Core-Temp [icu]:
    'low' → Gender [icu]: 'male' → icu; 'female' → gen
    'high' → Stable-Temp [gen]: 'true' → gen; 'false' → icu

Figure: The iterations of reduced error pruning for the decision tree in the previous figure using the validation set in the table above. The subtree being considered for pruning in each iteration is highlighted. The prediction returned by each non-leaf node is listed in square brackets, and the number of validation-set errors for each node is given in round brackets. In iteration (a) the Gender subtree (2 errors across its leaves) is replaced by an [icu] leaf (0 errors); in iteration (b) the Stable-Temp subtree (0 errors) is kept because replacing it with a [gen] leaf would give 2 errors; in iteration (c) the root is kept because replacing it with an [icu] leaf would give 1 error while the pruned subtree gives 0.
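
Putting the pieces together, here is a minimal sketch of reduced error pruning on a hand-coded copy of the routing tree above, evaluated on the validation set above. The dictionary layout and field names are hypothetical choices made for this sketch.

```python
# Nodes are dicts holding the feature tested, the majority prediction, and children;
# leaves are plain class labels.
tree = {
    "feature": "core_temp", "majority": "icu",
    "children": {
        "low":  {"feature": "gender", "majority": "icu",
                 "children": {"male": "icu", "female": "gen"}},
        "high": {"feature": "stable_temp", "majority": "gen",
                 "children": {"true": "gen", "false": "icu"}},
    },
}

# Validation set from the table above
validation = [
    {"core_temp": "high", "stable_temp": "true",  "gender": "male",   "decision": "gen"},
    {"core_temp": "low",  "stable_temp": "true",  "gender": "female", "decision": "icu"},
    {"core_temp": "high", "stable_temp": "false", "gender": "female", "decision": "icu"},
    {"core_temp": "high", "stable_temp": "false", "gender": "male",   "decision": "icu"},
    {"core_temp": "low",  "stable_temp": "false", "gender": "female", "decision": "icu"},
    {"core_temp": "low",  "stable_temp": "true",  "gender": "male",   "decision": "icu"},
]

def predict(node, instance):
    while isinstance(node, dict):
        node = node["children"][instance[node["feature"]]]
    return node

def errors(node, instances):
    return sum(predict(node, q) != q["decision"] for q in instances)

def prune(node, instances):
    """Bottom-up: replace a subtree with its majority-class leaf if the pruned
    version performs no worse on the validation instances reaching it."""
    if not isinstance(node, dict):
        return node
    for level, child in node["children"].items():
        reaching = [q for q in instances if q[node["feature"]] == level]
        node["children"][level] = prune(child, reaching)
    if errors(node["majority"], instances) <= errors(node, instances):
        return node["majority"]   # pruning does not hurt: keep the leaf
    return node

print(prune(tree, validation))
# The Gender subtree collapses to an 'icu' leaf; the Stable-Temp subtree and root survive.
```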

Advantages of pruning:
Smaller trees are easier to interpret
Increased generalization accuracy when there is noise in
the training data (noise dampening).

Model Ensembles

Rather than creating a single model, ensemble methods generate a set of models and then make predictions by aggregating the outputs of these models.
A prediction model that is composed of a set of models is called a model ensemble.
In order for this approach to work, the models in the ensemble must be different from each other.

There are two standard approaches to creating ensembles:
1 boosting
2 bagging

Boosting

Boosting works by iteratively creating models and adding them to the ensemble.
The iteration stops when a predefined number of models have been added.
When we use boosting, each new model added to the ensemble is biased to pay more attention to instances that previous models misclassified.
This is done by incrementally adapting the dataset used to train the models. To do this we use a weighted dataset.

Boosting

Weighted Dataset
Each instance has an associated weight w_i ≥ 0, initially set to 1/n, where n is the number of instances in the dataset.
After each model is added to the ensemble, it is tested on the training data; the weights of the instances the model gets correct are decreased and the weights of the instances the model gets incorrect are increased.
These weights are used as a distribution over which the dataset is sampled to create a replicated training set, where the replication of an instance is proportional to its weight.

Boosting

During each training iteration the algorithm:
1 Induces a model and calculates the total error, ε, by summing the weights of the training instances for which the predictions made by the model are incorrect.
2 Increases the weights for the instances misclassified using:

$$\mathbf{w}[i] \leftarrow \mathbf{w}[i] \times \frac{1}{2 \times \varepsilon} \qquad (5)$$

3 Decreases the weights for the instances correctly classified:

$$\mathbf{w}[i] \leftarrow \mathbf{w}[i] \times \frac{1}{2 \times (1 - \varepsilon)} \qquad (6)$$

4 Calculates a confidence factor, α, for the model such that α increases as ε decreases:

$$\alpha = \frac{1}{2} \times \log_e\left(\frac{1 - \varepsilon}{\varepsilon}\right) \qquad (7)$$
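
A small sketch of one iteration's weight update and confidence factor, implementing Equations (5) to (7) directly; the four-instance example is illustrative only.

```python
import math

def boosting_update(weights, correct):
    """One boosting iteration: weights sum to 1; correct[i] is True when the
    current model classified instance i correctly."""
    epsilon = sum(w for w, ok in zip(weights, correct) if not ok)  # total error
    new_weights = [
        w / (2 * (1 - epsilon)) if ok else w / (2 * epsilon)       # Eqs. (5) and (6)
        for w, ok in zip(weights, correct)
    ]
    alpha = 0.5 * math.log((1 - epsilon) / epsilon)                # Eq. (7)
    return new_weights, alpha

# Four instances with uniform initial weights; the model misclassifies instance 3
weights = [0.25, 0.25, 0.25, 0.25]
new_w, alpha = boosting_update(weights, [True, True, False, True])
print([round(w, 4) for w in new_w])   # [0.1667, 0.1667, 0.5, 0.1667]: the miss gains weight
print(round(sum(new_w), 6))           # 1.0: the update keeps the weights normalised
print(round(alpha, 4))                # 0.5493
```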

Boosting

Once the set of models has been created, the ensemble makes predictions using a weighted aggregate of the predictions made by the individual models.
The weights used in this aggregation are simply the confidence factors associated with each model.
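
A sketch of this weighted aggregation for a classification ensemble; the toy models and confidence factors below are illustrative only.

```python
from collections import defaultdict

def ensemble_predict(models, alphas, query):
    """Weighted majority vote: each model's prediction counts with weight alpha."""
    votes = defaultdict(float)
    for model, alpha in zip(models, alphas):
        votes[model(query)] += alpha
    return max(votes, key=votes.get)

# Three toy 'models' that each return a class label, with their confidence factors
models = [lambda q: "icu", lambda q: "gen", lambda q: "icu"]
alphas = [0.9, 0.4, 0.2]
print(ensemble_predict(models, alphas, query=None))   # 'icu' (0.9 + 0.2 > 0.4)
```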

Bagging

When we use bagging (or bootstrap aggregating), each model in the ensemble is trained on a random sample of the dataset known as a bootstrap sample.
Each random sample is the same size as the dataset, and sampling with replacement is used.
Consequently, every bootstrap sample will be missing some of the instances from the dataset, so each bootstrap sample will be different; this means that models trained on different bootstrap samples will also be different.
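
A sketch of drawing a bootstrap sample (illustrative only):

```python
import random

def bootstrap_sample(dataset):
    """A bootstrap sample: same size as the dataset, drawn with replacement."""
    return [random.choice(dataset) for _ in range(len(dataset))]

instance_ids = list(range(1, 13))           # e.g. the 12 bike-rentals instance IDs
sample = bootstrap_sample(instance_ids)
print(sorted(sample))                       # some IDs appear more than once ...
print(set(instance_ids) - set(sample))      # ... and some are missing entirely
```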

Bagging

When bagging is used with decision trees, each bootstrap sample only uses a randomly selected subset of the descriptive features in the dataset. This is known as subspace sampling.
The combination of bagging, subspace sampling, and decision trees is known as a random forest model.
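
For reference, scikit-learn's RandomForestClassifier combines these ideas; note that it performs the subspace sampling at each split rather than once per bootstrap sample, which is a common variant. This is a sketch only, with X and y as placeholders for a feature matrix and target vector.

```python
from sklearn.ensemble import RandomForestClassifier  # assumes scikit-learn is installed

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, each grown on its own bootstrap sample
    bootstrap=True,        # bagging: sample the training set with replacement
    max_features="sqrt",   # subspace sampling: random feature subset per split
)
# forest.fit(X, y)
# forest.predict(X_new)
```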

Bagging

Figure: The process of creating a model ensemble using bagging and subspace sampling. The original dataset (descriptive features F1, F2, F3 and a target) is sampled with replacement to create several bootstrap samples, each restricted to a random subset of the features (e.g. {F1, F3} or {F2, F3}). A machine learning algorithm then induces one tree from each bootstrap sample, and together these trees form the model ensemble.

Bagging

Which approach should we use? Bagging is simpler to implement and parallelize than boosting and so may be better with respect to ease of use and training time.
Empirical results indicate:
boosted decision tree ensembles were the best performing model of those tested for datasets containing up to 4,000 descriptive features.
random forest ensembles (based on bagging) performed better for datasets containing more than 4,000 features.

Summary

The decision tree model makes predictions based on sequences of tests on the descriptive feature values of a query.
The ID3 algorithm is a standard algorithm for inducing decision trees from a dataset.

Decision Trees: Advantages
interpretable.
handle both categorical and continuous descriptive features.
able to model the interactions between descriptive features (diminished if pre-pruning is employed).
relatively robust to the curse of dimensionality.
relatively robust to noise in the dataset if pruning is used.

Decision Trees: Potential Disadvantages
trees become large when dealing with continuous features.
decision trees are very expressive and sensitive to the dataset; as a result, they can overfit the data if there are a lot of features (curse of dimensionality).
eager learner (concept drift).

