Week 11: Ensemble Learning

Topics: ensemble learning, model selection, statistical validation
Ensemble Learning
Definition
Types of ensembles
The ensemble learning process

[Diagram of the ensemble learning process; one of its phases is labelled as optional.]
Methods to generate homogeneous ensembles

[Diagram: the training examples are fed to an induction algorithm to produce a model; manipulating the training examples yields classifiers A, B and C, whose predictions are then combined.]
Modeling process manipulation
where algorithm, algorithm’ and algorithm’’ are variations of the same induction algorithm
How to combine models (the integration phase)

Combination rules include:
Product
Maximum
Minimum
Median
For classification:
The base classifiers should be as accurate as possible and have errors that are as diverse as possible, so that the true class remains the majority-vote winner (see Brown, G. & Kuncheva, L., “Good” and “Bad” Diversity in Majority Vote Ensembles, Multiple Classifier Systems, Springer, 2010, LNCS 5997, 124-133).
It is not possible to obtain the optimal ensemble of classifiers based only on knowledge of the base learners.
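As an illustration of these combination rules, here is a minimal sketch (not from the slides; the class-probability values are made up) showing how the per-class scores of three base classifiers can be fused:

import numpy as np

# Hypothetical class-probability estimates for one test example:
# one row per base classifier, one column per class.
P = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1],
              [0.2, 0.7, 0.1]])

rules = {
    "product": P.prod(axis=0),
    "maximum": P.max(axis=0),
    "minimum": P.min(axis=0),
    "median":  np.median(P, axis=0),
}
for name, scores in rules.items():
    # The ensemble predicts the class with the highest combined score.
    print(f"{name:>7}: combined scores {scores}, predicted class {scores.argmax()}")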
Characteristics of the base models

For regression:
It is possible to express the error of the ensemble as a function of the errors of the base learners.
Assuming the average as the combination method:
The average error of the base learners (bias) should be as small as possible, i.e., the base learners should be as accurate (on average) as possible.
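The slide does not show the expression itself; one standard way to write it (the ambiguity decomposition of Krogh & Vedelsby, given here as a likely reading of the claim, with f_1, ..., f_M the base learners, \bar{f} their average, and y the target) is:

\bigl(\bar{f}(x) - y\bigr)^2
  = \underbrace{\frac{1}{M}\sum_{i=1}^{M}\bigl(f_i(x) - y\bigr)^2}_{\text{average error of the base learners}}
  - \underbrace{\frac{1}{M}\sum_{i=1}^{M}\bigl(f_i(x) - \bar{f}(x)\bigr)^2}_{\text{ambiguity (diversity)}}

so the ensemble error is never larger than the average individual error, and it decreases as the base learners become more diverse.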
Popular ensemble methods

Bagging:
averaging the predictions over a collection of unstable predictors generated from bootstrap samples (both classification and regression)

Boosting:
weighted vote over a collection of classifiers trained sequentially on training sets that give priority to the instances wrongly classified so far (classification)

Random Forest:
averaging the predictions over a collection of trees split using a randomly selected subset of features (both classification and regression)

Heterogeneous ensembles:
combining a set of heterogeneous predictors (both classification and regression)
Bagging: Bootstrap AGGregatING

Analogy: diagnosis based on the majority vote of multiple doctors.

Training:
Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample).
A classifier model Mi is learned from each training set Di.

Classification (classifying an unknown sample X):
Each classifier Mi returns its class prediction.
The bagged classifier M* counts the votes and assigns to X the class with the most votes.

Prediction: bagging can also be applied to the prediction of continuous values by taking the average of the individual predictions for a given test tuple.
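The training and classification steps above can be sketched in a few lines of Python (an illustrative sketch only, using scikit-learn decision trees as the base learner and a synthetic dataset):

import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, n_models=25, random_state=0):
    """Learn one classifier per bootstrap sample of the training set."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)             # sample d tuples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Each model votes; the class with the most votes wins."""
    votes = np.array([m.predict(X) for m in models])  # shape (n_models, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])

X, y = make_classification(n_samples=300, random_state=0)
models = bagging_train(X, y)
print("training accuracy:", np.mean(bagging_predict(models, X) == y))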
Bagging (Breiman 1996)
Accuracy:
Often significantly better than a single classifier derived from D.
For noisy data: not considerably worse, and more robust.
Proven to improve prediction accuracy.

Requirement: needs unstable classifier types.
Unstable means that a small change to the training data may lead to major changes in the resulting decisions.

Stability in training:
Training: construct classifier f from D.
Stability: small changes on D result in small changes on f.
Decision trees are a typical example of an unstable classifier.
[Figure: bagging illustration, http://en.wikibooks.org/wiki/File:DTE_Bagging.png]
Boosting

Analogy: consult several doctors and combine their diagnoses through a weighted vote, with each weight based on the accuracy of the doctor's previous diagnoses.

Models are created incrementally, selecting the training examples according to some distribution.

How does boosting work?
Weights are assigned to each training example.
A series of k classifiers is iteratively learned.
After a classifier Mi is learned, the weights are updated so that the subsequent classifier, Mi+1, pays more attention to the training examples that were misclassified by Mi.
The final classifier M* combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy.
Boosting: Construct Weak Classifiers

Using different data distributions:
Start with uniform weighting.
During each step of learning:
Increase the weights of the examples that are not correctly learned by the weak learner.
Decrease the weights of the examples that are correctly learned by the weak learner.

Idea: focus on the difficult examples that were not correctly classified in the previous steps.
Boosting: Combine Weak Classifiers

Weighted voting:
Construct a strong classifier by a weighted vote of the weak classifiers.

Idea:
A better weak classifier gets a larger weight.
Weak classifiers are added iteratively.
The accuracy of the combined classifier is increased through the minimization of a cost function.
Boosting

Differences from bagging:
Models are built sequentially on modified versions of the data.
The predictions of the models are combined through a weighted sum/vote.
AdaBoost: a popular boosting algorithm
(Freund and Schapire, 1996)
AdaBoost comments

This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier.
Hence, the training data of consecutive classifiers are geared towards increasingly hard-to-classify instances.
Unlike bagging, AdaBoost uses a rather undemocratic voting scheme, called weighted majority voting. The idea is an intuitive one: classifiers that have shown good performance during training are rewarded with higher voting weights than the others.
The diagram should be interpreted with the understanding that the algorithm is sequential: classifier C_K is created before classifier C_(K+1), which in turn requires that β_K and the current distribution D_K be available.
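A compact sketch of the AdaBoost mechanism described above (an illustrative reconstruction, not the slides' exact pseudocode; it assumes binary labels in {-1, +1}, decision stumps from scikit-learn as weak learners, and a synthetic dataset):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, n_rounds=20):
    """Train AdaBoost with decision stumps; y must be in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                     # start with a uniform distribution
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        err = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)   # weighted error
        alpha = 0.5 * np.log((1 - err) / err)   # voting weight of this classifier
        # Increase weights of misclassified examples, decrease the correct ones.
        D *= np.exp(-alpha * y * pred)
        D /= D.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Weighted majority vote of the weak classifiers."""
    agg = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(agg)

X, y = make_classification(n_samples=300, random_state=0)
y = 2 * y - 1                                   # map {0, 1} to {-1, +1}
stumps, alphas = adaboost_train(X, y)
print("training accuracy:", np.mean(adaboost_predict(stumps, alphas, X) == y))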
Random Forest (Breiman 2001)

Random Forest: a variation of the bagging algorithm, built from individual decision trees.
Diversity is guaranteed by randomly selecting, at each split, a subset of the original features during tree generation.
During classification, each tree votes and the most popular class is returned.
During regression, the result is the average of the predictions of all generated trees.
Random Forest (Breiman 2001)

Two methods to construct a Random Forest:
Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at that node; the CART methodology is used to grow the trees to maximum size.
Forest-RC (random linear combinations): creates new attributes (or features) that are linear combinations of the existing attributes (this reduces the correlation between individual classifiers).

Comparable in accuracy to AdaBoost, but more robust to errors and outliers.
Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting.
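For reference, the Forest-RI idea maps onto the max_features parameter of scikit-learn's RandomForestClassifier, which sets the number of randomly selected candidate attributes at each split (a usage sketch on a synthetic dataset; scikit-learn is not referenced in the slides):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# max_features plays the role of F: "sqrt" means roughly sqrt(20) candidate
# attributes are considered at each split of each tree.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X, y)
print("training accuracy:", rf.score(X, y))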
Ensemble learning via negative correlation learning

Negative correlation learning can only be used with ensemble regression algorithms that minimize/maximize a given objective function (e.g., neural networks, support vector regression).
The idea is that each model should be trained so as to minimize the error function of the ensemble, i.e., a penalty term involving the averaged error of the models already trained is added to its error function.
This approach produces models that are negatively correlated with the averaged error of the previously generated models.
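A common way to write the penalized error of model i in negative correlation learning (following the cited Liu & Yao, 1999, formulation; the slides do not spell it out) is:

E_i = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{2}\bigl(F_i(n) - d(n)\bigr)^2
    + \lambda\,\frac{1}{N}\sum_{n=1}^{N} p_i(n),
\qquad
p_i(n) = \bigl(F_i(n) - \bar{F}(n)\bigr)\sum_{j \ne i}\bigl(F_j(n) - \bar{F}(n)\bigr)

where d(n) is the target, \bar{F}(n) is the ensemble (average) output, and \lambda \ge 0 controls the strength of the correlation penalty.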
Model selection
Hastie, T.; Tibshirani, R. & Friedman, J. H., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2001, p. 313.
Statistical validation
Introductory References
Witten, I. H. & Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999
Top References
Wolpert, D. H., Stacked generalization, Neural Networks, 1992, 5, 241-259
Breiman, L., Bagging predictors, Machine Learning, 1996, 24, 123-140
Freund, Y. & Schapire, R., Experiments with a new boosting algorithm, International Conference on Machine Learning, 1996, 148-156
Breiman, L., Random forests, Machine Learning, 2001, 45, 5-32
Liu, Y. & Yao, X., Ensemble learning via negative correlation, Neural Networks, 1999, 12, 1399-1404
Rodríguez, J. J.; Kuncheva, L. I. & Alonso, C. J., Rotation forest: a new classifier ensemble method, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28, 1619-1630