ML Lab 10 - Ensemble Learning
Ensemble Learning
Machine Learning
BITS F464
I Semester 2023-24
Ensemble learning is a supervised learning technique in which the basic idea is to
generate multiple models on a training dataset and then combine (e.g., average) their
outputs to produce a strong model that outperforms each of the individual base classifiers.
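For intuition, here is a minimal toy sketch of majority-vote combination in R. The three prediction vectors are made-up illustrations, not part of the lab's Carseats workflow:
#toy example: majority vote across three base classifiers
p1 <- c("Yes", "No", "Yes")
p2 <- c("Yes", "Yes", "No")
p3 <- c("No", "Yes", "Yes")
votes <- data.frame(p1, p2, p3)
#majority class for each observation (row)
apply(votes, 1, function(v) names(which.max(table(v))))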
Introduction to Data
Reading in the Carseats data set. This is a simulated data set containing sales of
child car seats at 400 different stores. Sales can be predicted by 10 other variables.
Our outcome of interest will be a binary version of Sales: Unit sales (in thousands) at each
location.
(Note again that there is no id variable. This is convenient for some tasks.)
Descriptives
#load required packages
library(ISLR)     # Carseats data
library(psych)    # describe()
library(ggplot2)

#sample descriptives
describe(Carseats)

#histogram of outcome, with the Sales = 8 cut-point marked in red
ggplot(data=Carseats, aes(x=Sales)) +
  geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") +
  geom_vline(xintercept = 8, color="red", size=2) +
  labs(x = "Sales")
For convenience of didactic illustration we create a new variable HighSales that is binary, “No”
if Sales <= 8, and “Yes” otherwise.
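A minimal sketch of this recoding, together with the train/test split that the evaluation code later in this lab assumes (the object names Carseats.train and Carseats.test are assumptions; only Carseats.test appears explicitly below):
library(caret)

#binary outcome: "No" if Sales <= 8, "Yes" otherwise
Carseats$HighSales <- factor(ifelse(Carseats$Sales <= 8, "No", "Yes"),
                             levels = c("No", "Yes"))

#hold-out split used by the test-set ROC code below
set.seed(1234)
in.train <- createDataPartition(Carseats$HighSales, p = 0.75, list = FALSE)
Carseats.train <- Carseats[in.train, ]
Carseats.test <- Carseats[-in.train, ]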
We will use these to evaluate a variety of classification algorithms, including single
decision trees, bagged trees, random forests, and cforests (conditional inference forests).
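The fitted object train.tree plotted below is not defined in this excerpt; a plausible sketch using caret with method = "rpart" (the tuning settings and seed are assumptions):
#fit a single classification tree; Sales is excluded because HighSales
#is derived directly from it
set.seed(1234)
train.tree <- train(HighSales ~ . - Sales,
                    data = Carseats.train,
                    method = "rpart",
                    trControl = trainControl(method = "cv", number = 10))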
# plot tree (text() adds the split labels)
plot(train.tree$finalModel,
main="Classification Tree for Carseat High Sales")
text(train.tree$finalModel)
To evaluate the accuracy of the tree we can look at the confusion matrix for the Training data.
Accuracy of 0.71
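A sketch of that computation using caret's confusionMatrix(); the same pattern applies to the bagged and random forest models later in the lab:
#confusion matrix on the training data
tree.pred <- predict(train.tree, newdata = Carseats.train)
confusionMatrix(tree.pred, Carseats.train$HighSales)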
When evaluating classification models, a few other functions may be useful. For example,
caret's confusionMatrix() reports the associated measures of sensitivity and specificity,
and the pROC package provides convenience functions for computing and plotting ROC curves.
We can also look at the ROC curve by extracting the probabilities of "Yes".
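A sketch for the single tree, assuming test-set class probabilities are obtained with predict(..., type = "prob"); the resulting rocCurve.tree is reused in the overlay plot at the end of this lab:
library(pROC)

#class probabilities on the test set
tree.probs <- predict(train.tree, newdata = Carseats.test, type = "prob")
head(tree.probs)

#ROC curve for the single tree
rocCurve.tree <- roc(Carseats.test$HighSales, tree.probs[,"Yes"])
plot(rocCurve.tree, col=c(4))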
#Using treebag
Not yet sure how to parse model details from the output in order to look at the collection of trees.
Look at the collection of final trees
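The bagged-tree fit itself is not shown in this excerpt; a minimal sketch using caret's method = "treebag" (the object name train.bagg is an assumption, carried through the probability code below):
#fit bagged trees (bootstrap-aggregated CARTs via ipred)
set.seed(1234)
train.bagg <- train(HighSales ~ . - Sales,
                    data = Carseats.train,
                    method = "treebag",
                    trControl = trainControl(method = "cv", number = 10))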
To evaluate the accuracy of the Bagged Trees we can look at the confusion matrix for the
Training data.
Accuracy of 0.76
We can also look at the ROC curve by extracting the probabilities of "Yes".
#test-set class probabilities from the bagged model (train.bagg from the
#sketch above)
bagg.probs <- predict(train.bagg, newdata = Carseats.test, type = "prob")
head(bagg.probs)
#Calculate ROC curve
rocCurve.bagg <- roc(Carseats.test$HighSales, bagg.probs[,"Yes"])
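The overlay plot below also references rocCurve.rf, which is not defined in this excerpt; a sketch, assuming a random forest fit in the same way as the models above:
#fit a random forest and compute its test-set ROC curve
set.seed(1234)
train.rf <- train(HighSales ~ . - Sales,
                  data = Carseats.train,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 10))
rf.probs <- predict(train.rf, newdata = Carseats.test, type = "prob")
rocCurve.rf <- roc(Carseats.test$HighSales, rf.probs[,"Yes"])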
#plot the ROC curve
plot(rocCurve.bagg,col=c(6))
plot(rocCurve.tree,col=c(4)) # color blue is tree
plot(rocCurve.bagg,add=TRUE,col=c(6)) # color magenta is bagg
plot(rocCurve.rf,add=TRUE,col=c(1)) # color black is rf
A good demonstration of the decision tree, random forest, boosting, and bagging algorithms can be
found here:
https://machinelearningmastery.com/machine-learning-ensembles-with-r/
https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/
Exercise
1. Apply a boosting model to the same dataset and compare it with the bagging ensemble. Also,
try the AdaBoost function and analyze the results.
2. How do you determine the number of base classifiers to be used in an ensemble?
3. How do you find the error of an ensemble when the error rates of the base classifiers
differ?