AMTA Assignment AMTA B (Aswin Avni Navya)
CARET - ALGORITHMS
Submitted By
N.S.Aswin 80303180002
Avni Jain 80303180080
Sri Navya 80303180061
Submitted To
R is a programming language developed by Ross Ihaka and Robert Gentleman in 1993. R possesses
an extensive catalog of statistical and graphical methods, including machine learning algorithms,
linear regression, time series and statistical inference, to name a few. Most R libraries are
written in R, but for heavy computational tasks, C, C++ and Fortran code is preferred.
R is not only trusted by academics; many large companies also use the R programming language,
including Uber, Google, Airbnb, Facebook and so on.
Data analysis with R is done in a series of steps: programming, transforming, discovering,
modelling and communicating the results.
R PACKAGE
The primary uses of R are, and will always be, statistics, visualization, and machine learning.
Packages are part of R programming and are useful for collecting sets of R functions into a
single unit. A package can also contain compiled code and sample data. All of these are stored in a
directory called the "library" in the R environment. To load a package that is already installed
on your system, you call the library function.
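For example, a minimal session that installs and loads the caret package used in this report could look like this (install.packages needs an internet connection and only has to be run once):

install.packages("caret")   # download and install the package into the library (run once)
library(caret)              # load the installed package into the current session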
CARET
The caret package (Classification and REgression Training) contains functions to streamline the
model training process for complex regression and classification problems. The package utilizes a
number of R packages but tries not to load them all at package start-up (by removing formal
package dependencies, the package start up time can be greatly decreased). Caret has several
functions that attempt to streamline the model building and evaluation process, as well as feature
selection and other techniques.
One of the primary tools in the package is the train function, which can be used to evaluate, using resampling, the effect of model tuning parameters on performance, to choose the "optimal" model across these parameters, and to estimate model performance from a training set.
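As a minimal illustration of the shape of a train call (using the built-in mtcars data and a plain linear model, not anything from this project):

library(caret)
set.seed(1)
fit <- train(mpg ~ .,                     # outcome ~ predictors
             data      = mtcars,          # built-in example data set
             method    = "lm",            # any caret-supported model type
             trControl = trainControl(method = "cv", number = 5))  # 5-fold cross-validation
fit                                       # resampled performance summary (RMSE etc.)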
PROJECT
The Wage data set contains data for a group of 3,000 male workers in the Mid-Atlantic
region, with 11 variables.
Overview of Dataset
Variable     Description
Year         Year that the wage information was recorded
Age          Age of the worker
Maritl       Indicating marital status
Race         Indicating race
Education    Indicating education level
Region       Region of the country
Jobclass     Indicating type of job
Health       Indicating health level of the worker
Health ins   Indicating whether the worker has health insurance
Log wage     Log of the worker's wage
Wage         Worker's raw wage
Since we are not going to work with the log of the worker's wage, we remove it from the data set.
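The report does not show the data-loading step; assuming the Wage data set from the ISLR package, the preparation described above might look like this:

library(ISLR)                              # assumed source of the Wage data set
data(Wage)
dim(Wage)                                  # 3000 observations, 11 variables
Wage <- subset(Wage, select = -logwage)    # drop the log of the wage, as described above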
Random Forest
Introduction
Random Forest is a versatile machine learning method capable of performing both regression and
classification tasks. It also undertakes dimensionality reduction, treats missing values and
outlier values, and handles other essential steps of data exploration, and it does a fairly good job.
Random forest is like a bootstrapping algorithm applied to decision tree (CART) models. Say we have
1,000 observations in the complete population with 10 variables. Random forest builds
multiple CART models with different samples and different initial variables. For instance, it will
take a random sample of 100 observations and 5 randomly chosen initial variables to build a CART
model. It will repeat the process (say) 10 times and then make a final prediction on each
observation. The final prediction is a function of the individual predictions; it can simply be
the mean of the individual predictions.
Advantages
This algorithm can solve both types of problems, i.e. classification and regression, and gives
a decent estimate on both fronts.
One of its benefits is the power of handling large data sets with high dimensionality. It can
handle thousands of input variables and identify the most significant ones, so it is
considered one of the dimensionality reduction methods. Further, the model outputs
variable importance, which can be a very handy feature (on some random data sets).
In PARF, only the training / learning phase of the random forest is parallelized. This implementation
is cluster based and uses the MPI (Message Passing Interface) library.
Working
This function works very simply: we pass it a vector of mtry values, and it fits a random
forest using each of those values and returns the combined result. We can also pass any additional
parameters, such as ntree, through to the randomForest function.
For Example
Let's say we want a random forest with 5,000 trees. The default value for ntree is 500, so we use
rep(4,10) as the argument for the function, which grows ten forests of 500 trees, all with mtry = 4.
Maybe we are not sure of the optimal mtry value and want to combine two ensembles of 2,500 trees
each. Then we use the argument c(rep(3,5),rep(4,5)). This gives us 2,500 trees with mtry = 3 and
2,500 with mtry = 4, which helps us predict more accurately without running into out-of-memory errors.
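The wrapper function itself is not reproduced in this report; a minimal sketch of the idea, assuming the randomForest and foreach packages (the name rf_combine is made up for illustration), could look like this:

library(randomForest)
library(foreach)

# Hypothetical helper: grow one forest per mtry value and merge them into one.
# mtry_vec : vector of mtry values, one forest is grown per element
# ...      : extra arguments (e.g. ntree) passed on to randomForest()
rf_combine <- function(formula, data, mtry_vec, ...) {
  foreach(m = mtry_vec, .combine = randomForest::combine) %do% {
    randomForest(formula, data = data, mtry = m, ...)
  }
}

# 5,000 trees in total: ten forests of the default 500 trees, all with mtry = 4
# fit_a <- rf_combine(wage ~ ., data = Wage, mtry_vec = rep(4, 10))

# 2,500 trees with mtry = 3 plus 2,500 trees with mtry = 4
# fit_b <- rf_combine(wage ~ ., data = Wage, mtry_vec = c(rep(3, 5), rep(4, 5)))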
Implementations
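The original implementation was included as screenshots; a minimal caret sketch of the kind of fit described (cross-validated tuning of mtry for a random forest on the Wage data, with an assumed grid of candidate values) might look like this:

library(caret)
set.seed(123)
rf_fit <- train(wage ~ ., data = Wage,
                method    = "rf",                                    # randomForest under the hood
                trControl = trainControl(method = "cv", number = 5), # 5-fold cross-validation
                tuneGrid  = expand.grid(mtry = 2:8))                 # candidate mtry values
rf_fit          # cross-validated RMSE for each mtry
plot(rf_fit)    # RMSE as a function of mtry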
The RMSE value is lowest at mtry = 6, with a value of 33.90319.
Recommendations
Stochastic Gradient Boosting
Introduction
The accuracy of a predictive model can be boosted in two ways: either by embracing feature
engineering or by applying boosting algorithms straight away. There are multiple boosting
algorithms, such as Gradient Boosting, XGBoost, AdaBoost, GentleBoost, etc.
Bagging: An approach where you take random samples of data, build learning algorithms and
take simple means to find bagging probabilities.
Boosting: Boosting is similar; however, the selection of samples is made more intelligently. We
subsequently give more and more weight to hard-to-classify observations.
Boosting is a well-known ensemble learning technique in which we are not primarily concerned with
reducing the variance of learners, unlike in bagging, where our aim is to reduce the high variance
of learners by averaging many models fitted on bootstrapped data samples generated with replacement
from the training data, so as to avoid overfitting.
Bagging consists of taking multiple subsets of the training data set, building multiple
independent decision tree models, and then averaging the models, which allows us to create a very
performant predictive model compared to the classical CART model.
Advantages
Often provides predictive accuracy that cannot easily be beaten.
Lots of flexibility: it can optimize different loss functions and provides several hyperparameter
tuning options that make the fit very flexible.
No data pre-processing required: it often works great with categorical and numerical values
as is.
Handles missing data: imputation is not required.
Working
Trees are built one at a time, where each new tree helps to correct the errors made by the previously
trained trees. With each tree added, the model becomes more expressive. There are typically three
parameters: the number of trees, the depth of the trees and the learning rate, and the trees built
are generally shallow. GBDT training generally takes longer because the trees are built sequentially.
However, benchmark results have shown that GBDTs can be better learners than random forests.
Gradient boosted trees are an ensemble of shallow, weak, successive trees, with each tree learning
from and improving on the previous one. When combined, these many weak successive trees produce a
powerful "committee" that is often hard to beat with other algorithms.
Implementation
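Again, the original implementation was a screenshot; a minimal caret sketch of a stochastic gradient boosting fit on the Wage data (the grid values below are illustrative assumptions) might look like this:

library(caret)
set.seed(123)
gbm_grid <- expand.grid(n.trees           = c(100, 500, 1000),   # number of boosting iterations
                        interaction.depth = 1:3,                  # tree depth
                        shrinkage         = 0.1,                  # learning rate
                        n.minobsinnode    = 10)                   # minimum observations per node
gbm_fit <- train(wage ~ ., data = Wage,
                 method    = "gbm",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid  = gbm_grid,
                 verbose   = FALSE)        # passed on to gbm() to silence its output
plot(gbm_fit)   # RMSE by number of trees and tree depth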
Plot of the RMSE values based on tree depth.
Recommendations
Penalized Regression
Introduction
The standard linear model (ordinary least squares) performs poorly in situations where you have a
large multivariate data set containing more variables than samples. A better alternative is
penalized regression, which creates a linear regression model that is penalized for having too many
variables in the model by adding a constraint to the equation (James et al. 2014; P. Bruce and
Bruce 2017). This is also known as shrinkage or regularization. The consequence of imposing this
penalty is to reduce (i.e. shrink) the coefficient values towards zero. This allows the less
contributing variables to have a coefficient close to or equal to zero. Note that the shrinkage
requires the selection of a tuning parameter (lambda) that determines the amount of shrinkage.
Working
Penalized regression creates a linear regression model that is penalized for having too many
variables by adding a constraint to the equation; this is also known as shrinkage or regularization.
Imposing this penalty shrinks the coefficient values towards zero, so the less contributing
variables end up with coefficients close to or equal to zero. In the penalized fit tuned below there
are two such tuning parameters: lambda1, controlling the L1 (lasso-type) penalty, and lambda2,
controlling the L2 (ridge-type) penalty.
Implementation
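The original implementation was again a screenshot; assuming caret's "penalized" method (which wraps the penalized package and exposes the lambda1 and lambda2 tuning parameters referred to below, with an illustrative grid), a minimal sketch might look like this:

library(caret)
set.seed(123)
pen_fit <- train(wage ~ ., data = Wage,
                 method    = "penalized",                           # requires the penalized package
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid  = expand.grid(lambda1 = 2:8,             # L1 (lasso) penalty
                                         lambda2 = 2:8),            # L2 (ridge) penalty
                 trace     = FALSE)                                 # silence the fitting output
plot(pen_fit)   # RMSE across the lambda1 / lambda2 grid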
Plot of the RMSE across the lambda values.
The RMSE is lowest at lambda1 = 6 and lambda2 = 4, with a value of 30.50465.
Recommendations
A penalized regression method yields a sequence of models, each associated with
specific values for one or more tuning parameters.
If the amount of shrinkage is large enough, these methods can also perform variable
selection by shrinking some coefficients to zero.