5-R
Chang Liu
R for Data Science Lecture 5
Contents
Regularized Regression
    Ridge penalty
    Lasso penalty
    Elastic Net Regression
Decision Tree
Unique Value
Length
Gsub
Sampling
    Stratified Sampling
    Cluster Sampling
    Systematic Sampling
Regularized Regression
Linear models (LMs) provide a simple, yet effective, approach to predictive modeling. Moreover,
when certain assumptions required by LMs are met (e.g., constant variance), the estimated
coefficients are unbiased and, of all linear unbiased estimates, have the lowest variance. However,
in today’s world, data sets being analyzed typically contain a large number of features. As the
number of features grows, certain assumptions typically break down and these models tend to
overfit the training data, causing our out-of-sample error to increase. Regularization methods
provide a means to constrain or regularize the estimated coefficients, which can reduce the
variance and decrease out-of-sample error.
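To make this concrete, both penalties covered below add a term to the usual least-squares objective; only the form of that term differs. The standard formulations (stated here for reference, not reproduced from these notes) are:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$

The tuning parameter λ controls how strongly large coefficients are penalized: the ridge penalty shrinks coefficients toward zero, while the lasso penalty can set some of them exactly to zero.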
Ridge penalty
library(tidyverse)
Next, we’ll use the glmnet() function to fit the ridge regression model and specify alpha=0.
Note that setting alpha equal to 1 is equivalent to using Lasso Regression and setting alpha to
some value between 0 and 1 is equivalent to using an elastic net.
library(glmnet)
##
## Attaching package: 'Matrix'
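The call that actually fits the ridge model is not shown in the extracted notes. A minimal sketch, assuming the response is hp and the predictors are mpg, wt, drat, and qsec (inferred from the coefficient output in the lasso section below):

#response and predictor matrix (variable choice is an assumption based on later output)
y <- mtcars$hp
x <- data.matrix(mtcars[, c("mpg", "wt", "drat", "qsec")])
#alpha = 0 selects the ridge penalty
ridge_model <- glmnet(x, y, alpha = 0)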
Next, we’ll identify the lambda value that produces the lowest test mean squared error (MSE)
by using k-fold cross-validation.
Fortunately, glmnet has the function cv.glmnet() that automatically performs k-fold cross vali-
dation using k = 10 folds.
## [1] 13.27979
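The cross-validation call itself is also missing from the notes; a sketch of the step described above, using the x and y from the previous sketch (because the original random seed is unknown, the selected lambda will not match the value shown exactly):

#10-fold cross-validation over a grid of lambda values
cv_model <- cv.glmnet(x, y, alpha = 0)
#lambda that minimizes the cross-validated MSE
best_lambda <- cv_model$lambda.min
best_lambda
#draw the cross-validation curve shown below
plot(cv_model)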
[Figure: cross-validated mean-squared error against Log(λ) for the ridge model; the MSE axis runs from about 1000 to 5000, and the top axis shows 4 nonzero coefficients at every value of λ.]
Lasso penalty
Next, we will use the glmnet() function to fit the lasso regression model and specify alpha=1.
Note that setting alpha equal to 0 is equivalent to using ridge regression and setting alpha to
some value between 0 and 1 is equivalent to using an elastic net.
To determine what value to use for lambda, we’ll perform k-fold cross-validation and identify
the lambda value that produces the lowest test mean squared error (MSE).
Note that the function cv.glmnet() automatically performs k-fold cross validation using k = 10
folds.
library(glmnet)
## [1] 2.928367
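As in the ridge section, the fitting and cross-validation calls are not shown. A sketch using the same x and y as before (the selected lambda will again differ from the value above because the original seed is unknown):

#alpha = 1 selects the lasso penalty; cross-validate to choose lambda
cv_model <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_model$lambda.min
best_lambda
#cross-validation curve shown below
plot(cv_model)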
[Figure: cross-validated mean-squared error against Log(λ) for the lasso model; the MSE axis runs from about 1000 to 5000, and the top axis shows the number of nonzero coefficients shrinking from 4 to 0 as λ increases.]
We can also use the final lasso regression model to make predictions on new observations.
## s0
## (Intercept) 483.169999
## mpg -2.981768
## wt 21.029736
## drat .
## qsec -19.286215
## 1
## [1,] 107.3869
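The coefficient table and the prediction above would come from code along these lines (a sketch; the new observation's values are made up for illustration):

#refit the lasso at the selected lambda and inspect the coefficients
best_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)
coef(best_model)
#predict for a new car: mpg = 24, wt = 2.5, drat = 3.5, qsec = 18.5 (hypothetical values)
new_obs <- matrix(c(24, 2.5, 3.5, 18.5), nrow = 1)
predict(best_model, s = best_lambda, newx = new_obs)

Note that the drat coefficient appears as "." in the output above, meaning the lasso shrank it exactly to zero.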
Elastic Net Regression
#install.packages("dplyr")
#install.packages("glmnet")
#install.packages("ggplot2")
#install.packages("caret")
library(glmnet)
library(caret)
##
## Attaching package: 'caret'
# X and Y datasets
Y <- mtcars %>%
select(disp) %>%
scale(center = TRUE, scale = FALSE) %>%
as.matrix()
X<- mtcars %>%
select(-disp) %>%
as.matrix()
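The code that actually trains elastic_model is missing from the extracted notes (the intervening pages were blank). A minimal sketch of tuning an elastic net with caret over both the mixing parameter alpha and lambda; the seed, resampling settings, and tuneLength here are assumptions, not the original values:

set.seed(123)  #assumed seed
#combine response and predictors into one data frame for caret
train_data <- data.frame(disp = as.numeric(Y), X)
#repeated cross-validation with a random search over alpha and lambda
control <- trainControl(method = "repeatedcv", number = 5, repeats = 5, search = "random")
elastic_model <- train(disp ~ .,
                       data = train_data,
                       method = "glmnet",
                       preProcess = c("center", "scale"),
                       tuneLength = 25,
                       trControl = control)
elastic_model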
# Model Prediction
y_hat_pre <- predict(elastic_model, X)
y_hat_pre
# Plot
plot(elastic_model, main = "Elastic Net Regression")
[Figure: "Elastic Net Regression" tuning plot from plot(elastic_model); the y-axis ticks run from 36 to 42 and the x-axis is the Mixing Percentage.]
In the KNN algorithm, K specifies the number of neighbors, and the algorithm works as follows:
1. Compute the distance between the new observation and every point in the training data.
2. Select the K training points closest to the new observation.
3. Assign the new observation the most common class among those K neighbors.
For the Nearest Neighbor classifier, the distance between two points is expressed in the form of
Euclidean distance.
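As a quick illustration (not part of the original notes), the Euclidean distance between two numeric vectors can be computed directly in R:

#Euclidean distance between two points p and q
p <- c(5.1, 3.5, 1.4, 0.2)
q <- c(6.3, 3.3, 6.0, 2.5)
sqrt(sum((p - q)^2))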
##Generate a random sample of row indices covering 90% of the rows in the dataset.
ran <- sample(1:nrow(iris), 0.9 * nrow(iris))
##Min-max normalization helper (its definition is not shown in the source notes but is required below).
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) }
##Run normalization on the first 4 columns of the dataset because they are the predictors.
iris_norm <- as.data.frame(lapply(iris[, c(1, 2, 3, 4)], nor))
summary(iris_norm)
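The code between the normalization step and the accuracy calculation below appears to be missing from the extracted notes. Based on the description that follows, the missing steps would look roughly like this sketch (the choice of k = 13 is an assumption):

library(class)  ##provides knn()
##split the normalized predictors into training and test sets
iris_train <- iris_norm[ran, ]
iris_test <- iris_norm[-ran, ]
##keep the species labels (column 5) separately for training and testing
iris_target_category <- iris[ran, 5]
iris_test_category <- iris[-ran, 5]
##classify each test flower by its k nearest training neighbors
pr <- knn(iris_train, iris_test, cl = iris_target_category, k = 13)
##confusion matrix of predicted vs. actual species
tab <- table(pr, iris_test_category)
##accuracy = correct predictions divided by total predictions, as a percentage
accuracy <- function(x) { sum(diag(x)) / sum(rowSums(x)) * 100 }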
##This function divides the correct predictions by the total number of predictions to tell us how accurate the model is.
accuracy(tab)
## [1] 100
Using the iris dataset that ships with R, I have run the k-nearest neighbor algorithm, and it
gave a perfectly accurate result on this particular test split (accuracy will vary with the random
split and the choice of k). First, I normalized the data to convert Petal.Length, Sepal.Length,
Petal.Width and Sepal.Width into a standardized 0-to-1 form so that all predictors sit on the
same scale. Because the main objective is to predict whether a flower is virginica, versicolor,
or setosa, I excluded column 5 (the species) from the predictors and stored it in another variable
called iris_target_category. Then I separated the normalized values into a training and a testing
dataset. Imagine it this way: the values from the training dataset are first drawn on a graph, and
after we run the knn function with all the necessary arguments, we introduce the testing dataset's
values into the graph and calculate the Euclidean distance to each of the already stored points.
Although we know which species each flower in the testing dataset belongs to, we still predict
the values and store them in a variable called "pr" so that we can compare the predicted values
with the original labels of the testing dataset. This tells us the accuracy of our model, and if we
receive 50 new observations in the future and are asked to predict their category, we can do that
with this model.
Decision Tree
Decision trees are versatile Machine Learning algorithms that can perform both classification
and regression tasks. They are very powerful algorithms, capable of fitting complex datasets.
Decision trees are also fundamental components of random forests, which are among the most
potent Machine Learning algorithms available today.
library(tidyverse)
library(rpart)
library(partykit)
[Figure: partykit plot of the first decision tree. The root node splits on social_support; one branch is a terminal node labeled Good (n = 576, err = 10.6%), and the other splits again on social_support into terminal nodes labeled Good (n = 315, err = 42.2%) and Poor (n = 230, err = 29.6%).]
[Figure: partykit plot of the second decision tree; its root node splits on logGDP.]
##
## predicted_tree1 Good Poor
## Good 183 47
## Poor 11 39
##
## predicted_tree2 Good Poor
## Good 188 18
## Poor 6 68
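The code that fits the two trees and builds the confusion matrices above is not shown in the extracted notes. A sketch of the general rpart/partykit workflow, assuming a training set happiness_train and a test set happiness_test with a two-level outcome status and the predictors visible in the plots (all of these names are assumptions):

#fit a classification tree on the training data
tree1 <- rpart(status ~ social_support + logGDP, data = happiness_train, method = "class")
#convert to a party object and plot it (this produces plots like the ones above)
plot(as.party(tree1))
#predict on the test set and cross-tabulate predictions against the observed outcome
predicted_tree1 <- predict(tree1, newdata = happiness_test, type = "class")
table(predicted_tree1, happiness_test$status)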
Bagging is a powerful method to improve the performance of simple models and reduce overfitting
of more complex models. The principle is easy to understand: instead of fitting the model on one
sample of the population, several models are fitted on different samples (drawn with replacement)
from the population. These models are then aggregated by using their average, a weighted average,
or a voting system (mainly for classification).
Though bagging reduces the explanatory ability of your model, it makes it much more robust
and able to get the “big picture” from your data.
Building bagged trees is easy. Let’s say you want 100 models that you will average; for each of
the hundred iterations you will:
• Draw a sample of your training data with replacement.
• Fit a tree on that sample and save the model.
Once you have trained all your models, to get a prediction from your bagged model on new data,
you will need to:
• Get the estimate from each of the individual trees you saved.
• Average the estimates.
library(rpart)
require(ggplot2)
library(data.table)
##
## Attaching package: 'data.table'
set.seed(456)
##Reading data
bagging_data=data.table(airquality)
ggplot(bagging_data,aes(Wind,Ozone))+geom_point()+ggtitle("Ozone vs wind speed")
[Figure: scatter plot of Ozone against Wind, titled “Ozone vs wind speed”.]
data_test=na.omit(bagging_data[,.(Ozone,Wind)])
##Training data
train_index=sample.int(nrow(data_test),size=round(nrow(data_test)*0.8),replace = F)
data_test[train_index,train:=TRUE][-train_index,train:=FALSE]
data_test
## Ozone Wind train
## 3: 12 12.6 TRUE
## 4: 18 11.5 TRUE
## 5: 28 14.9 TRUE
## ---
## 112: 14 16.6 TRUE
## 113: 30 6.9 TRUE
## 114: 14 14.3 TRUE
## 115: 18 8.0 TRUE
## 116: 20 11.5 TRUE
bagged_models = c(bagged_models, list(rpart(Ozone ~ Wind, data_test[new_sample],
    control = rpart.control(minsplit = 6))))  ##the start of this bagging loop is missing from the source; minsplit = 6 is assumed from the truncated line
}
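For completeness, here is a sketch of what the full bagging loop and the prediction-averaging step described earlier might look like; the number of models, the control settings, and the prediction grid are assumptions:

##fit 100 trees, each on a bootstrap sample of the training indices
n_model = 100
bagged_models = list()
for (i in 1:n_model) {
  new_sample = sample(train_index, size = length(train_index), replace = TRUE)
  bagged_models = c(bagged_models, list(rpart(Ozone ~ Wind, data_test[new_sample],
      control = rpart.control(minsplit = 6))))
}
##to predict on new data: get an estimate from each saved tree, then average them
new_data = data.table(Wind = seq(1, 20, by = 0.5))
tree_predictions = sapply(bagged_models, predict, newdata = new_data)
new_data[, bagged_ozone := rowMeans(tree_predictions)]
head(new_data)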
Unique Value
unique(df$team)
unique(df$points)
## [1] 90 99 85
## [1] 85 90 99
## [1] 99 90 85
##
## 85 90 99
## 2 3 1
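The data frame df used above is never defined in the extracted notes, and the calls that produce the sorted and tabulated output are not shown. A hypothetical definition and the likely calls, chosen to be consistent with the output above (the team names and exact rows are assumptions):

#hypothetical data frame consistent with the output above
df <- data.frame(team = c('A', 'A', 'B', 'C', 'D', 'D'),
                 points = c(90, 99, 85, 90, 90, 85))
unique(df$team)                              #unique team names
unique(df$points)                            #unique values in order of appearance
sort(unique(df$points))                      #unique values in ascending order
sort(unique(df$points), decreasing = TRUE)   #unique values in descending order
table(df$points)                             #how many times each value occurs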
Length
#create vector
my_vector <- c(2, 7, 6, 6, 9, 10, 14, 13, 4, 20, NA)
## [1] 11
## [1] 10
#create list
my_list <- list(A=1:5, B=c('hey', 'hi'), C=c(3, 5, 7))
## [1] 3
## [1] 5
## [1] 6
#define string
my_string <- "hey there"
## [1] 1
#define string
my_string <- "hey there"
## [1] 9
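The calls that produce the output in this section are not shown in the extracted notes. A sketch of the most likely ones (the call behind the output value 6 is unclear from the source, so it is not reproduced):

#length of a vector counts every element, including NA
length(my_vector)
#length after dropping the missing value
length(my_vector[!is.na(my_vector)])
#length of a list counts its top-level elements
length(my_list)
#length of a single list element
length(my_list$A)
#length() on a string counts elements of the character vector, not characters
length(my_string)
#nchar() counts the characters in the string, including the space
nchar(my_string)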
Gsub
The gsub() function in R can be used to replace all occurrences of certain text within a string.
gsub(pattern, replacement, x)
#define vector
x <- c('Mavs', 'Mavs', 'Spurs', 'Nets', 'Spurs', 'Mavs')
#define vector
x <- c('A', 'A', 'B', 'C', 'D', 'D')
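The gsub() calls themselves are missing from the extracted notes; a sketch of typical usage on the two vectors above (the replacement values are made up for illustration):

#replace every occurrence of 'Mavs' with 'Mavericks'
x <- c('Mavs', 'Mavs', 'Spurs', 'Nets', 'Spurs', 'Mavs')
gsub('Mavs', 'Mavericks', x)

#replace several values at once by nesting gsub() calls
x <- c('A', 'A', 'B', 'C', 'D', 'D')
gsub('A', 'X', gsub('B', 'Y', x))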
Sampling
Stratified Sampling
Researchers often take samples from a population and use the data from the sample to draw
conclusions about the population as a whole.
One commonly used sampling method is stratified random sampling, in which a population is
split into groups and a certain number of members from each group are randomly selected to
be included in the sample.
## grade gpa
## 1 Freshman 83.12064
## 2 Freshman 85.55093
## 3 Freshman 82.49311
## 4 Freshman 89.78584
## 5 Freshman 85.98852
## 6 Freshman 82.53859
library(dplyr)
##
## Freshman Junior Senior Sophomore
## 10 10 10 10
library(dplyr)
##
## Freshman Junior Senior Sophomore
## 15 15 15 15
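Neither the data frame df nor the sampling calls appear in the extracted notes. A sketch consistent with the output above, assuming 100 students per grade (the data-generating step is an assumption):

library(dplyr)

#hypothetical student data: 100 students in each grade
df <- data.frame(grade = rep(c('Freshman', 'Sophomore', 'Junior', 'Senior'), each = 100),
                 gpa = rnorm(400, mean = 85, sd = 3))
head(df)

#stratified sample: 10 students selected at random from each grade
strat1 <- df %>% group_by(grade) %>% sample_n(10)
table(strat1$grade)

#stratified sample: 15% of the students from each grade (15 per grade here)
strat2 <- df %>% group_by(grade) %>% sample_frac(0.15)
table(strat2$grade)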
Cluster Sampling
One commonly used sampling method is cluster sampling, in which a population is split into
clusters and all members of some clusters are chosen to be included in the sample.
set.seed(1)
## tour experience
## 1 1 6.373546
## 2 1 7.183643
## 3 1 6.164371
## 4 1 8.595281
## 5 1 7.329508
## 6 1 6.179532
#define sample as all members who belong to one of the 4 tour groups
cluster_sample <- df[df$tour %in% clusters, ]
##
## 1 2 3 7
## 20 20 20 20
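The code that builds df and chooses the clusters is missing from the extracted notes; only the subsetting step survives. A sketch consistent with the output above, assuming 10 tour groups of 20 customers each (the data-generating step is an assumption):

set.seed(1)

#hypothetical data: 10 tour groups of 20 customers each, with an experience rating
df <- data.frame(tour = rep(1:10, each = 20),
                 experience = rnorm(200, mean = 7, sd = 1))
head(df)

#randomly choose 4 entire tour groups (clusters) to include in the sample
clusters <- sample(unique(df$tour), size = 4, replace = FALSE)

#the sample consists of every member of the selected tours
cluster_sample <- df[df$tour %in% clusters, ]
table(cluster_sample$tour)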
Systematic Sampling
One commonly used sampling method is systematic sampling, which is implemented with a
simple two-step process:
1. Place each member of the population in some order.
2. Choose a random starting point and select every nth member to be in the sample.
## last_name gpa
## 1 YLGRG 74.66755
## 2 DCVUK 80.74210
## 3 GZXSE 80.89685
## 4 ARZOG 80.31026
## 5 BNVRR 77.83073
## 6 WMWJM 80.10269
## last_name gpa
## 5 BNVRR 77.83073
## 10 SBTFE 81.51290
## 15 VPJTO 80.63059
## 20 OVQCE 83.80557
## 25 NLJRM 83.35642
## 30 YMCZO 82.31994
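The code that builds df and draws the systematic sample is not shown. The rows above are every 5th member starting at row 5, so a sketch consistent with that output (the data-generating step and the sampling interval are assumptions):

#hypothetical student data: a random five-letter last name and a GPA
set.seed(1)
df <- data.frame(last_name = replicate(100, paste(sample(LETTERS, 5, replace = TRUE), collapse = '')),
                 gpa = rnorm(100, mean = 80, sd = 3))
head(df)

#systematic sample: pick a random starting point, then take every 5th member
start <- sample(1:5, 1)
sys_sample <- df[seq(start, nrow(df), by = 5), ]
head(sys_sample)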