AMTA Assignment AMTA B (Aswin Avni Navya)
CARET - ALGORITHMS
Submitted By
N.S.Aswin 80303180002
Avni Jain 80303180080
Sri Navya 80303180061
Submitted To
R is a programming language developed by Ross Ihaka and Robert Gentleman in 1993. R possesses
an extensive catalog of statistical and graphical methods, including machine learning algorithms,
linear regression, time series and statistical inference, to name a few. Most R libraries are
written in R, but for heavy computational tasks, C, C++ and Fortran code is preferred.
R is not only trusted by academics; many large companies also use the R programming language,
including Uber, Google, Airbnb, Facebook and so on.
Data analysis with R is done in a series of steps: programming, transforming, discovering,
modelling and communicating the results.
R PACKAGE
The primary uses of R are, and will always be, statistics, visualization, and machine learning.
Packages are part of R programming and are useful for collecting sets of R functions into a
single unit. A package can also contain compiled code and sample data. All of these are stored in a
directory called the "library" in the R environment. To load a package that is already installed
on your system, you call the library function.
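For example, a minimal session that installs and loads the caret package used in this report could look like this (install.packages needs an internet connection and only has to be run once):

install.packages("caret")   # download and install the package into the library (run once)
library(caret)              # load the installed package into the current session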
CARET
The caret package (Classification and REgression Training) contains functions to streamline the
model training process for complex regression and classification problems. The package utilizes a
number of R packages but tries not to load them all at package start-up (by removing formal
package dependencies, the package start up time can be greatly decreased). Caret has several
functions that attempt to streamline the model building and evaluation process, as well as feature
selection and other techniques.
One of the primary tools in the package is the train function, which can be used to evaluate, using resampling, the effect of model tuning parameters on performance, to choose the "optimal" model across these parameters, and to estimate model performance from a training set.
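As a minimal illustration of the shape of a train call (using the built-in mtcars data and a plain linear model, not anything from this project):

library(caret)
set.seed(1)
fit <- train(mpg ~ .,                     # outcome ~ predictors
             data      = mtcars,          # built-in example data set
             method    = "lm",            # any caret-supported model type
             trControl = trainControl(method = "cv", number = 5))  # 5-fold cross-validation
fit                                       # resampled performance summary (RMSE etc.)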
PROJECT
The Wage data set contains data for a group of 3,000 male workers in the Mid-Atlantic
region, with 11 variables.
Overview of Dataset
Variable     Description
Year         Year that the wage information was recorded
Age          Age of the worker
Maritl       Indicating marital status
Race         Indicating race
Education    Indicating education level
Region       Region of the country
Jobclass     Indicating type of job
Health       Indicating health level of the worker
Health ins   Indicating whether the worker has health insurance
Log wage     Log of the worker's wage
Wage         Worker's raw wage
Since we are not going to work with the log of the worker's wage, we remove it from the data set.
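The report does not show the data-loading step; assuming the Wage data set from the ISLR package, the preparation described above might look like this:

library(ISLR)                              # assumed source of the Wage data set
data(Wage)
dim(Wage)                                  # 3000 observations, 11 variables
Wage <- subset(Wage, select = -logwage)    # drop the log of the wage, as described above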
Random Forest
Introduction
Random Forest is a versatile machine learning method capable of performing both regression and
classification tasks. It also undertakes dimensionality reduction, treats missing values and
outlier values, and handles other essential steps of data exploration, and it does a fairly good job.
Random forest is like a bootstrapping algorithm applied to decision tree (CART) models. Say we have
1,000 observations in the complete population with 10 variables. Random forest builds
multiple CART models with different samples and different initial variables. For instance, it will
take a random sample of 100 observations and 5 randomly chosen initial variables to build a CART
model. It will repeat the process (say) 10 times and then make a final prediction on each
observation. The final prediction is a function of the individual predictions; it can simply be
the mean of the individual predictions.
Advantages
This algorithm can solve both types of problems, i.e. classification and regression, and gives
a decent estimate on both fronts.
One of its benefits is the power of handling large data sets with high dimensionality. It can
handle thousands of input variables and identify the most significant ones, so it is
considered one of the dimensionality reduction methods. Further, the model outputs
variable importance, which can be a very handy feature (on some random data sets).
In PARF, only the training / learning phase of the random forest is parallelized. This implementation
is cluster based and uses the MPI (Message Passing Interface) library.
Working
This function works very simply: we pass it a vector of mtry values, and it fits a random
forest using each of those values and returns the combined result. We can also pass any additional
parameters, such as ntree, through to the randomForest function.
For Example
Let's say we want a random forest with 5,000 trees. The default value for ntree is 500, so we use
rep(4,10) as the argument for the function, which grows ten forests of 500 trees, all with mtry = 4.
Maybe we are not sure of the optimal mtry value and want to combine two ensembles of 2,500 trees
each. Then we use the argument c(rep(3,5),rep(4,5)). This gives us 2,500 trees with mtry = 3 and
2,500 with mtry = 4, which helps us predict more accurately without running into out-of-memory errors.
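The wrapper function itself is not reproduced in this report; a minimal sketch of the idea, assuming the randomForest and foreach packages (the name rf_combine is made up for illustration), could look like this:

library(randomForest)
library(foreach)

# Hypothetical helper: grow one forest per mtry value and merge them into one.
# mtry_vec : vector of mtry values, one forest is grown per element
# ...      : extra arguments (e.g. ntree) passed on to randomForest()
rf_combine <- function(formula, data, mtry_vec, ...) {
  foreach(m = mtry_vec, .combine = randomForest::combine) %do% {
    randomForest(formula, data = data, mtry = m, ...)
  }
}

# 5,000 trees in total: ten forests of the default 500 trees, all with mtry = 4
# fit_a <- rf_combine(wage ~ ., data = Wage, mtry_vec = rep(4, 10))

# 2,500 trees with mtry = 3 plus 2,500 trees with mtry = 4
# fit_b <- rf_combine(wage ~ ., data = Wage, mtry_vec = c(rep(3, 5), rep(4, 5)))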
Implementations
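The original implementation was included as screenshots; a minimal caret sketch of the kind of fit described (cross-validated tuning of mtry for a random forest on the Wage data, with an assumed grid of candidate values) might look like this:

library(caret)
set.seed(123)
rf_fit <- train(wage ~ ., data = Wage,
                method    = "rf",                                    # randomForest under the hood
                trControl = trainControl(method = "cv", number = 5), # 5-fold cross-validation
                tuneGrid  = expand.grid(mtry = 2:8))                 # candidate mtry values
rf_fit          # cross-validated RMSE for each mtry
plot(rf_fit)    # RMSE as a function of mtry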
The RMSE value is lowest at mtry = 6, with a value of 33.90319.
Recommendations
Stochastic Gradient Boosting
Introduction
The accuracy of a predictive model can be boosted in two ways: either by embracing feature
engineering or by applying boosting algorithms straight away. There are multiple boosting
algorithms, such as Gradient Boosting, XGBoost, AdaBoost, GentleBoost, etc.
Bagging: An approach where you take random samples of data, build learning algorithms and
take simple means to find bagging probabilities.
Boosting: Boosting is similar; however, the selection of samples is made more intelligently. We
subsequently give more and more weight to hard-to-classify observations.
Boosting is a well-known ensemble learning technique in which we are not primarily concerned with
reducing the variance of learners, unlike in bagging, where our aim is to reduce the high variance
of learners by averaging many models fitted on bootstrapped data samples generated with replacement
from the training data, so as to avoid overfitting.
Bagging consists of taking multiple subsets of the training data set, building multiple
independent decision tree models, and then averaging the models, which allows us to create a very
performant predictive model compared to the classical CART model.
Advantages
Often provides predictive accuracy that cannot easily be beaten.
Lots of flexibility: it can optimize different loss functions and provides several hyperparameter
tuning options that make the fit very flexible.
No data pre-processing required: it often works great with categorical and numerical values
as is.
Handles missing data: imputation is not required.
Working
Trees are built one at a time, where each new tree helps to correct the errors made by the previously
trained trees. With each tree added, the model becomes more expressive. There are typically three
parameters: the number of trees, the depth of the trees and the learning rate, and the trees built
are generally shallow. GBDT training generally takes longer because the trees are built sequentially.
However, benchmark results have shown that GBDTs can be better learners than random forests.
Gradient boosted trees are an ensemble of shallow, weak, successive trees, with each tree learning
from and improving on the previous one. When combined, these many weak successive trees produce a
powerful "committee" that is often hard to beat with other algorithms.
Implementation
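Again, the original implementation was a screenshot; a minimal caret sketch of a stochastic gradient boosting fit on the Wage data (the grid values below are illustrative assumptions) might look like this:

library(caret)
set.seed(123)
gbm_grid <- expand.grid(n.trees           = c(100, 500, 1000),   # number of boosting iterations
                        interaction.depth = 1:3,                  # tree depth
                        shrinkage         = 0.1,                  # learning rate
                        n.minobsinnode    = 10)                   # minimum observations per node
gbm_fit <- train(wage ~ ., data = Wage,
                 method    = "gbm",
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid  = gbm_grid,
                 verbose   = FALSE)        # passed on to gbm() to silence its output
plot(gbm_fit)   # RMSE by number of trees and tree depth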
Plot of the RMSE values based on tree depth.
Recommendations
Penalized Regression
Introduction
The standard linear model (ordinary least squares) performs poorly in situations where you have a
large multivariate data set containing more variables than samples. A better alternative is
penalized regression, which creates a linear regression model that is penalized for having too many
variables in the model by adding a constraint to the equation (James et al. 2014; P. Bruce and
Bruce 2017). This is also known as shrinkage or regularization. The consequence of imposing this
penalty is to reduce (i.e. shrink) the coefficient values towards zero. This allows the less
contributing variables to have a coefficient close to or equal to zero. Note that the shrinkage
requires the selection of a tuning parameter (lambda) that determines the amount of shrinkage.
Working
Penalized regression creates a linear regression model that is penalized for having too many
variables by adding a constraint to the equation; this is also known as shrinkage or regularization.
Imposing this penalty shrinks the coefficient values towards zero, so the less contributing
variables end up with coefficients close to or equal to zero. In the penalized fit tuned below there
are two such tuning parameters: lambda1, controlling the L1 (lasso-type) penalty, and lambda2,
controlling the L2 (ridge-type) penalty.
Implementation
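The original implementation was again a screenshot; assuming caret's "penalized" method (which wraps the penalized package and exposes the lambda1 and lambda2 tuning parameters referred to below, with an illustrative grid), a minimal sketch might look like this:

library(caret)
set.seed(123)
pen_fit <- train(wage ~ ., data = Wage,
                 method    = "penalized",                           # requires the penalized package
                 trControl = trainControl(method = "cv", number = 5),
                 tuneGrid  = expand.grid(lambda1 = 2:8,             # L1 (lasso) penalty
                                         lambda2 = 2:8),            # L2 (ridge) penalty
                 trace     = FALSE)                                 # silence the fitting output
plot(pen_fit)   # RMSE across the lambda1 / lambda2 grid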
Plot of the RMSE across the lambda values.
The RMSE is lowest at lambda1 = 6 and lambda2 = 4, with a value of 30.50465.
Recommendations
A penalized regression method yields a sequence of models, each associated with
specific values for one or more tuning parameters.
If the amount of shrinkage is large enough, these methods can also perform variable
selection by shrinking some coefficients to zero.