
Advanced Multivariate Technique

CARET - ALGORITHMS

Submitted By

N.S. Aswin 80303180002
Avni Jain 80303180080
Sri Navya 80303180061
Padma S. 80303180033
Sudha Tirunelai 80303180156
80303180196

Submitted To

Dr. ABHILASH PONNAM

ANALYTICS DEPT
NMIMS HYDERABAD
INTRODUCTION

R is a programming language developed by Ross Ihaka and Robert Gentleman in 1993. R possesses
an extensive catalogue of statistical and graphical methods. It includes machine learning algorithms,
linear regression, time series analysis, and statistical inference, to name a few. Most R libraries are
written in R, but for heavy computational tasks, C, C++, and Fortran code is preferred.
R is not only trusted by academia; many large companies also use the R programming language,
including Uber, Google, Airbnb, and Facebook.
Data analysis with R is done in a series of steps: programming, transforming, discovering,
modelling, and communicating the results.

 Program: R is a clear and accessible programming tool.
 Transform: R is made up of a collection of libraries designed specifically for data science.
 Discover: Investigate the data, refine your hypotheses, and analyse them.
 Model: R provides a wide array of tools to capture the right model for your data.
 Communicate: Integrate code, graphs, and outputs into a report with R Markdown, or build
Shiny apps to share with the world.

R PACKAGE

The primary uses of R are, and will always be, statistics, visualization, and machine learning.
Packages are part of R programming, and they collect sets of R functions into a single unit. A
package can also contain compiled code and sample data. All of these are stored in a directory
called the "library" in the R environment. To load a package that already exists and is installed
on your system, call the library function.
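For instance, a minimal sketch of loading an installed package (here caret, the package used throughout this project):

# Load an already-installed package into the current R session
library(caret)

# If the package is missing, install it first:
# install.packages("caret")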

CARET

The caret package (short for Classification And REgression Training) contains functions to
streamline the model training process for complex regression and classification problems. The
package utilizes a number of R packages but tries not to load them all at package start-up (by
removing formal package dependencies, the package start-up time can be greatly decreased). caret
has several functions that attempt to streamline the model building and evaluation process, as well
as feature selection and other techniques.
One of the primary tools in the package is the train function, which can be used to

o Evaluate, using resampling, the effect of model tuning parameters on performance.
o Choose the "optimal" model across these parameters.
o Estimate model performance from a training set.
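To illustrate the train interface, here is a minimal hedged sketch; the data set (the built-in mtcars), the model method, and the resampling scheme are placeholders, not taken from the assignment code:

library(caret)

# 5-fold cross-validation as the resampling scheme
ctrl <- trainControl(method = "cv", number = 5)

# Fit and evaluate a simple model; "lm" is just a placeholder method
set.seed(123)
fit <- train(mpg ~ ., data = mtcars, method = "lm", trControl = ctrl)
fit   # prints resampled performance (RMSE, R-squared, MAE)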

PROJECT

Machine Learning Algorithms within the CARET Framework


Data Set Used: ISLR: Wage Dataset

The Wage data set contains data for a group of 3,000 male workers in the Mid-Atlantic
region, with 11 variables.

Overview of the Dataset
Variable Description
year Year that wage information was recorded
age Age of the worker
maritl Marital status
race Race
education Education level
region Region of the country
jobclass Type of job
health Health level of the worker
health_ins Whether the worker has health insurance
logwage Log of the worker's wage
wage Worker's raw wage

Since we are not going to work with the log of the worker's wage, we remove it.
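A sketch of this preparation step (column names as defined in the ISLR package):

library(ISLR)

data(Wage)
wage_data <- subset(Wage, select = -logwage)   # drop the log wage column
str(wage_data)                                 # 3,000 obs. of the 10 remaining variables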

ALGORITHMS USED IN THIS PROJECT:


 Parallel Random Forest
 Stochastic Gradient Boosting
 Penalized Linear Regression

Parallel Random Forest

Introduction

Random forest is a versatile machine learning method capable of performing both regression and
classification tasks. It also handles dimensionality reduction, missing values, outlier values, and
other essential steps of data exploration, and does a fairly good job.

Random forest is like a bootstrapping algorithm built on the decision tree (CART) model. Say we
have 1,000 observations in the complete population with 10 variables. Random forest builds
multiple CART models with different samples and different initial variables. For instance, it will
take a random sample of 100 observations and 5 randomly chosen initial variables to build a CART
model. It will repeat the process (say) 10 times and then make a final prediction for each
observation. The final prediction is a function of the individual predictions; it can simply be
the mean of those predictions.

Advantages
 This algorithm can solve both types of problems, i.e. classification and regression, and gives
a decent estimate on both fronts.
 One of its benefits is the power to handle large data sets with high dimensionality. It can
handle thousands of input variables and identify the most significant ones, so it is
considered one of the dimensionality reduction methods. Further, the model outputs
variable importance, which can be a very handy feature (on some random data sets).

In PARF, only the training/learning phase of the random forest is parallelized. This implementation
is cluster based and uses the MPI (Message Passing Interface) library.

Working

This function works very simply: we pass it a vector of mtry values, and it fits a random
forest using each of those values and returns the combined result. We can also pass additional
parameters such as ntree through to the randomForest function.

This function provides two improvements:

 We can use any parallel backend when a random forest is taking too long to fit.
 The argument .inorder = FALSE in the foreach function provides a small performance
improvement, as it lets R combine the random forests as they finish, rather than forcing R
to combine them in the order they started.

For Example
Let's say we want a random forest with 5,000 trees. The default value of ntree is 500, so we use
rep(4, 10) as the argument for the function, which fits ten 500-tree forests with mtry = 4. Maybe
we are not sure of the optimal mtry value and want to combine two ensembles of 2,500 trees each.
Then we use the argument c(rep(3, 5), rep(4, 5)). This gives us 2,500 trees with mtry = 3 and 2,500
trees with mtry = 4, which helps us predict more accurately without out-of-memory errors.
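A minimal sketch of such a function, assuming the doParallel backend and the randomForest package (rf_parallel is a hypothetical name; the assignment's own wrapper is not shown):

library(foreach)
library(doParallel)
library(randomForest)

cl <- makeCluster(2)   # parallel backend with 2 workers (worker count is an assumption)
registerDoParallel(cl)

# Fit one 500-tree forest per mtry value, combining results as they finish
rf_parallel <- function(mtry_values, formula, data, ntree = 500) {
  foreach(m = mtry_values, .combine = randomForest::combine,
          .inorder = FALSE, .packages = "randomForest") %dopar% {
    randomForest(formula, data = data, mtry = m, ntree = ntree)
  }
}

# 2,500 trees with mtry = 3 and 2,500 with mtry = 4, as in the example above
fit <- rf_parallel(c(rep(3, 5), rep(4, 5)), wage ~ ., data = wage_data)

stopCluster(cl)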

Implementation
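A hedged reconstruction of a caret call consistent with the results below, assuming caret's built-in "parRF" (parallel random forest) method; the tuning grid, fold count, and seed are assumptions:

library(caret)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

set.seed(123)   # assumed seed, for reproducibility
rf_fit <- train(
  wage ~ ., data = wage_data,
  method    = "parRF",                  # parallel random forest
  tuneGrid  = data.frame(mtry = 2:8),   # assumed grid; covers the reported optimum of 6
  trControl = trainControl(method = "cv", number = 5)
)

stopCluster(cl)
rf_fit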

Results from the code

The RMSE is lowest at mtry = 6, where its value is 33.90319.

Plotting the RMSE against mtry:
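In caret, this tuning profile can be drawn directly from the fitted object (assuming the rf_fit object sketched above):

plot(rf_fit)   # cross-validated RMSE against the mtry tuning grid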

Recommendations

 The model predicts with a mean squared error of 861.9919.
 Parallel random forest, along with tuning parameters, can be used for complete, well-formed
numerical data. This method reduces both the RMSE and the MSE on the data set.
 Parallel random forest increases the predictive power of the algorithm and also helps
control overfitting, for both classification and regression.

Stochastic Gradient Boosting

Introduction
The accuracy of a predictive model can be boosted in two ways: either by embracing feature
engineering or by applying boosting algorithms straight away. There are multiple boosting
algorithms, such as Gradient Boosting, XGBoost, AdaBoost, and Gentle Boost.

Bagging: An approach where you take random samples of data, build learning algorithms, and
take simple means to find bagging probabilities.
Boosting: Similar to bagging; however, the selection of samples is made more intelligently. We
subsequently give more and more weight to hard-to-classify observations.

Boosting is a famous ensemble learning technique. Unlike bagging, it is not primarily concerned
with reducing the variance of learners; in bagging, the aim is to reduce the high variance of
learners by averaging lots of models fitted on bootstrapped data samples generated with
replacement from the training data, so as to avoid overfitting.
Bagging consists of taking multiple subsets of the training data set, building multiple
independent decision tree models, and then averaging the models, which yields a very
performant predictive model compared to a single classical CART model.

Advantages
 Often provides predictive accuracy that is hard to beat.
 Lots of flexibility: it can optimize different loss functions and provides several
hyperparameter tuning options that make the fit very flexible.
 No data pre-processing required: it often works great with categorical and numerical values
as-is.
 Handles missing data: imputation is not required.

Working
Trees are built one at a time, and each new tree helps to correct the errors made by the previously
trained trees. With each tree added, the model becomes more expressive. There are typically three
parameters: the number of trees, the depth of the trees, and the learning rate; the trees built are
generally shallow. GBDT training generally takes longer because the trees are built sequentially.
However, benchmark results have shown that GBDTs are often better learners than random forests.

GBTs are an ensemble of shallow, weak, successive trees, with each tree learning from and
improving on the previous one. When combined, these many weak successive trees produce a
powerful "committee" that is often hard to beat with other algorithms.

Implementation
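A hedged reconstruction of the caret call, assuming the "gbm" method and a tuning grid consistent with the best values reported below; the grid, fold count, and seed are assumptions:

library(caret)

gbm_grid <- expand.grid(
  n.trees           = c(50, 150, 250),
  interaction.depth = c(5, 7, 9),
  shrinkage         = 0.1,
  n.minobsinnode    = 10
)

set.seed(123)
gbm_fit <- train(
  wage ~ ., data = wage_data,
  method    = "gbm",
  tuneGrid  = gbm_grid,
  trControl = trainControl(method = "cv", number = 5),
  verbose   = FALSE
)

gbm_fit$bestTune   # reported best: n.trees = 250, interaction.depth = 9,
                   # shrinkage = 0.1, n.minobsinnode = 10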

Results from the code

At an interaction depth of 9, shrinkage of 0.1, n.minobsinnode of 10, and 250 trees, we get the
smallest RMSE.

Plotting the RMSE values against tree depth:

Age has the highest influence in the Wage data set.
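In caret, variable influence can be read from the fitted boosting model (assuming the gbm_fit object sketched above):

varImp(gbm_fit)   # relative influence of each predictor; age ranks highest here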

Recommendations

 The method predicted with a mean squared error of 855.4873.
 This method can be used in numerical optimization problems where the objective is to
minimize the loss of the model by adding weak learners.
 This method can be used to reduce the correlation between the trees.
 This method can be built on small trees because it is data driven, fast, and efficient.

Penalized Linear Regression

Introduction

The standard linear model (the ordinary least squares method) performs poorly in situations
where you have a large multivariate data set containing more variables than samples. A better
alternative is penalized regression, which allows you to create a linear regression model that is
penalized for having too many variables, by adding a constraint to the equation (James et al.
2014; P. Bruce and Bruce 2017). This is also known as shrinkage or regularization. The
consequence of imposing this penalty is to reduce (i.e. shrink) the coefficient values towards
zero, which allows the less contributing variables to have coefficients close to or equal to zero.
Note that shrinkage requires the selection of a tuning parameter (lambda) that determines the
amount of shrinkage.
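For reference, the penalized objective can be written as follows (elastic-net form with separate L1 and L2 penalties, matching the lambda1/lambda2 tuning reported below; the notation is ours, not from the original):

\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - x_i^{\top} \beta \right)^2 + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert + \lambda_2 \sum_{j=1}^{p} \beta_j^2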

Working

As described above, penalized regression adds a constraint to the least squares equation, and the
effect of this penalty is to shrink the coefficient values towards zero, so that the less contributing
variables end up with coefficients close to or equal to zero. In the implementation below, two
tuning parameters control the shrinkage: lambda1, the weight of an L1 (lasso) penalty, which can
set coefficients exactly to zero, and lambda2, the weight of an L2 (ridge) penalty, which shrinks
them smoothly.

Implementation
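A hedged reconstruction of the caret call, assuming the "penalized" method (which tunes lambda1 and lambda2, as in the results below); the grid, fold count, and seed are assumptions:

library(caret)

pen_grid <- expand.grid(lambda1 = 2:8, lambda2 = 2:8)   # assumed tuning grid

set.seed(123)
pen_fit <- train(
  wage ~ ., data = wage_data,
  method    = "penalized",
  tuneGrid  = pen_grid,
  trControl = trainControl(method = "cv", number = 5)
)

pen_fit$bestTune   # reported best: lambda1 = 6, lambda2 = 4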

Results from the code

Plotting the RMSE across lambda values:

The RMSE is lowest at lambda1 = 6 and lambda2 = 4, with a value of 30.50465.

Recommendations
 A penalized regression method yields a sequence of models, each associated with
specific values for one or more tuning parameters.
 If the amount of shrinkage is large enough, these methods can also perform variable
selection by shrinking some coefficients to zero.

For any queries, contact:
Aswin @ +91 97100 596 337 / +91 8939 772 500 / [email protected]
