ML Lab 10 - Ensemble Learning
Ensemble Learning
Machine Learning
BITS F464
I Semester 2023-24
Ensemble learning is a supervised learning technique in which the basic idea is to
generate multiple models on a training dataset and then combine (e.g., average) their
outputs to produce a strong model that outperforms each of the individual base classifiers.
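For intuition, here is a minimal toy sketch of majority-vote combination in R. The three prediction vectors are made-up illustrations, not part of the lab's Carseats workflow:
#toy example: majority vote across three base classifiers
p1 <- c("Yes", "No", "Yes")
p2 <- c("Yes", "Yes", "No")
p3 <- c("No", "Yes", "Yes")
votes <- data.frame(p1, p2, p3)
#majority class for each observation (row)
apply(votes, 1, function(v) names(which.max(table(v))))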
Introduction to Data
Reading in the Carseats data set. This is a simulated data set containing sales of
child car seats at 400 different stores. Sales can be predicted by 10 other variables.
Our outcome of interest will be a binary version of Sales: Unit sales (in thousands) at each
location.
(Note again that there is no id variable. This is convenient for some tasks.)
Descriptives
#load required packages
library(ISLR)     # Carseats data
library(psych)    # describe()
library(ggplot2)

#sample descriptives
describe(Carseats)

#histogram of outcome, with the Sales = 8 cut-point marked in red
ggplot(data=Carseats, aes(x=Sales)) +
  geom_histogram(binwidth=1, boundary=.5, fill="white", color="black") +
  geom_vline(xintercept = 8, color="red", size=2) +
  labs(x = "Sales")
For convenience of didactic illustration we create a new variable HighSales that is binary, “No”
if Sales <= 8, and “Yes” otherwise.
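A minimal sketch of this recoding, together with the train/test split that the evaluation code later in this lab assumes (the object names Carseats.train and Carseats.test are assumptions; only Carseats.test appears explicitly below):
library(caret)

#binary outcome: "No" if Sales <= 8, "Yes" otherwise
Carseats$HighSales <- factor(ifelse(Carseats$Sales <= 8, "No", "Yes"),
                             levels = c("No", "Yes"))

#hold-out split used by the test-set ROC code below
set.seed(1234)
in.train <- createDataPartition(Carseats$HighSales, p = 0.75, list = FALSE)
Carseats.train <- Carseats[in.train, ]
Carseats.test <- Carseats[-in.train, ]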
We will use these to evaluate a variety of classification algorithms, including single
decision trees, bagged trees, random forests, and cforests (conditional inference forests).
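The fitted object train.tree plotted below is not defined in this excerpt; a plausible sketch using caret with method = "rpart" (the tuning settings and seed are assumptions):
#fit a single classification tree; Sales is excluded because HighSales
#is derived directly from it
set.seed(1234)
train.tree <- train(HighSales ~ . - Sales,
                    data = Carseats.train,
                    method = "rpart",
                    trControl = trainControl(method = "cv", number = 10))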
# plot tree (text() adds the split labels)
plot(train.tree$finalModel,
main="Classification Tree for Carseat High Sales")
text(train.tree$finalModel)
To evaluate the accuracy of the tree we can look at the confusion matrix for the Training data.
Accuracy of 0.71
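A sketch of that computation using caret's confusionMatrix(); the same pattern applies to the bagged and random forest models later in the lab:
#confusion matrix on the training data
tree.pred <- predict(train.tree, newdata = Carseats.train)
confusionMatrix(tree.pred, Carseats.train$HighSales)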
When evaluating classification models, a few other functions may be useful. For example,
caret's confusionMatrix() reports the associated measures of sensitivity and specificity,
and the pROC package provides convenience functions for computing and plotting ROC curves.
We can also look at the ROC curve by extracting the probabilities of "Yes".
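A sketch for the single tree, assuming test-set class probabilities are obtained with predict(..., type = "prob"); the resulting rocCurve.tree is reused in the overlay plot at the end of this lab:
library(pROC)

#class probabilities on the test set
tree.probs <- predict(train.tree, newdata = Carseats.test, type = "prob")
head(tree.probs)

#ROC curve for the single tree
rocCurve.tree <- roc(Carseats.test$HighSales, tree.probs[,"Yes"])
plot(rocCurve.tree, col=c(4))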
#Using treebag
Not yet sure how to parse model details from the output in order to look at the collection of trees.
Look at the collection of final trees
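The bagged-tree fit itself is not shown in this excerpt; a minimal sketch using caret's method = "treebag" (the object name train.bagg is an assumption, carried through the probability code below):
#fit bagged trees (bootstrap-aggregated CARTs via ipred)
set.seed(1234)
train.bagg <- train(HighSales ~ . - Sales,
                    data = Carseats.train,
                    method = "treebag",
                    trControl = trainControl(method = "cv", number = 10))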
To evaluate the accuracy of the Bagged Trees we can look at the confusion matrix for the
Training data.
Accuracy of 0.76
We can also look at the ROC curve by extracting the probabilities of "Yes".
#test-set class probabilities from the bagged model (train.bagg from the
#sketch above)
bagg.probs <- predict(train.bagg, newdata = Carseats.test, type = "prob")
head(bagg.probs)
#Calculate ROC curve
rocCurve.bagg <- roc(Carseats.test$HighSales, bagg.probs[,"Yes"])
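The overlay plot below also references rocCurve.rf, which is not defined in this excerpt; a sketch, assuming a random forest fit in the same way as the models above:
#fit a random forest and compute its test-set ROC curve
set.seed(1234)
train.rf <- train(HighSales ~ . - Sales,
                  data = Carseats.train,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 10))
rf.probs <- predict(train.rf, newdata = Carseats.test, type = "prob")
rocCurve.rf <- roc(Carseats.test$HighSales, rf.probs[,"Yes"])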
#plot the ROC curve
plot(rocCurve.bagg,col=c(6))
plot(rocCurve.tree,col=c(4)) # color blue is tree
plot(rocCurve.bagg,add=TRUE,col=c(6)) # color magenta is bagg
plot(rocCurve.rf,add=TRUE,col=c(1)) # color black is rf
A good demonstration of the decision tree, random forest, boosting, and bagging algorithms can be
found here:
https://machinelearningmastery.com/machine-learning-ensembles-with-r/
https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/
Exercise
1. Apply a boosting model to the same dataset and compare it with the bagging ensemble. Also,
try the AdaBoost function and analyze the results.
2. How do you determine the number of base classifiers to be used in an ensemble?
3. How do you find the error of an ensemble when the error rates of the base classifiers
differ?