15 Types of Regression You Should Know
(This article was first published on ListenData, and kindly contributed to R-bloggers)
Regression techniques are among the most popular statistical techniques used for predictive modeling and data mining tasks. On average, analytics professionals know only 2-3 types of regression which are commonly used in the real world: linear and logistic regression. But the fact is that there are more than 10 types of regression algorithms designed for various types of analysis, each with its own significance. Every analyst must know which form of regression to use depending on the type of data and its distribution.
Table of Contents
Linear Regression
Polynomial Regression
Logistic Regression
Quantile Regression
Ridge Regression
Lasso Regression
ElasticNet Regression
Principal Component Regression
Partial Least Square Regression
Support Vector Regression
Ordinal Regression
Poisson Regression
Negative Binomial Regression
Cox Regression
Terminologies related to Regression Analysis
1. Outliers
When an observation has an extreme value as compared to the other observations in the data, i.e. it does not belong to the population, such an observation is called an outlier. In simple words, it is an extreme value. An outlier is a problem because it often hampers the results we get.
2. Multicollinearity
When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many regression techniques assume that multicollinearity is not present in the dataset, because it causes problems in ranking variables based on their importance, and makes it difficult to select the most important independent variable (factor).
3. Heteroscedasticity
When the variability of the dependent variable is not equal across values of an independent variable, it is called heteroscedasticity. Example – as one's income increases, the
variability of food consumption will increase. A poorer person will spend a rather
constant amount by always eating inexpensive food; a wealthier person may occasionally
buy inexpensive food and at other times eat expensive meals. Those with higher incomes
display a greater variability of food consumption.
4. Underfitting and Overfitting
When our algorithm works so poorly that it is unable to fit even the training set well, it is said to underfit the data. This is also known as the problem of high bias. Conversely, when a model fits the training set very closely but fails to generalize to the test set, it is said to overfit the data (the problem of high variance).
In the following diagram we can see that fitting a linear regression (straight line in fig 1) would underfit the data, i.e. it would lead to large errors even on the training set. A polynomial fit in fig 2 is balanced, i.e. such a fit can work well on both the training and test sets, while in fig 3 the fit leads to low errors on the training set but will not work well on the test set.
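As a rough sketch of these three situations with toy data (all names and numbers below are made up), one can compare polynomial fits of increasing degree:
set.seed(1)
x = seq(0, 1, length.out = 20)
y = sin(2 * pi * x) + rnorm(20, sd = 0.2)  # smooth signal plus noise
fit_under = lm(y ~ x)            # fig 1: straight line, underfits (high bias)
fit_ok    = lm(y ~ poly(x, 3))   # fig 2: low-degree polynomial, balanced
fit_over  = lm(y ~ poly(x, 15))  # fig 3: high-degree polynomial, overfits (high variance)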
Types of Regression
Every regression technique has some assumptions attached to it which we need to meet before running the analysis. These techniques differ in terms of the type of dependent and independent variables and their distributions.
1. Linear Regression
It is the simplest form of regression: a technique in which the dependent variable is continuous in nature and the relationship between the dependent variable and the independent variables is assumed to be linear. In the plot below we can observe a roughly linear relationship between the mileage and displacement of cars: the green points are the actual observations, while the black fitted line is the line of regression.
When you have only 1 independent variable and 1 dependent variable, it is called simple linear regression.
When you have more than 1 independent variable and 1 dependent variable, it is called multiple linear regression.
Estimating the parameters
To estimate the regression coefficients βi's we use the principle of least squares, which is to minimize the sum of squares due to the error terms, i.e. minimize Σᵢ (yᵢ − β₀ − β₁x₁ᵢ − … − βₖxₖᵢ)².
On solving this minimization problem mathematically we obtain the regression coefficients; in matrix form, β̂ = (XᵀX)⁻¹Xᵀy.
Suppose the fitted model explains a student's marks by the no. of hours studied and the no. of classes attended, with intercept 5 and regression coefficients 2 and 0.5. These can be interpreted as:
1. If the no. of hours studied and the no. of classes attended are 0, then the student will obtain 5 marks.
2. Keeping the no. of classes attended constant, if the student studies for one more hour, he will score 2 more marks in the examination.
3. Similarly, keeping the no. of hours studied constant, if the student attends one more class, he will attain 0.5 more marks.
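As a minimal sketch with made-up numbers (the data below are constructed so that the fitted intercept and slopes come out to exactly 5, 2 and 0.5), such a model could be fit as:
# hypothetical data following score = 5 + 2*hours + 0.5*classes exactly
marks = data.frame(hours = c(1, 2, 3, 4, 5, 6),
                   classes = c(2, 4, 5, 8, 10, 12),
                   score = c(8, 11, 13.5, 17, 20, 23))
model0 = lm(score ~ hours + classes, data = marks)
coef(model0)  # (Intercept) = 5, hours = 2, classes = 0.5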
Linear Regression in R
We consider the swiss data set for carrying out linear regression in R. We use the lm() function from the base package and try to estimate Fertility with the help of the other variables.
library(datasets)
model = lm(Fertility ~ ., data = swiss)
lm_coeff = model$coefficients
lm_coeff
summary(model)
> lm_coeff
     (Intercept)      Agriculture      Examination        Education         Catholic Infant.Mortality
      66.9151817       -0.1721140       -0.2580082       -0.8709401        0.1041153        1.0770481
> summary(model)
(full summary output omitted; Multiple R-squared: 0.7067, Adjusted R-squared: 0.671)
Hence we can see that about 70% of the variation in the Fertility rate can be explained via linear regression.
2. Polynomial Regression
It is a technique to fit a nonlinear relationship by taking polynomial functions of the independent variable; for example, a quadratic model has the form y = β₀ + β₁x + β₂x².
In the figure given below, you can see that the red curve fits the data better than the green curve. Hence, in situations where the relationship between the dependent and independent variables seems to be non-linear, we can deploy polynomial regression models.
data = read.csv("poly.csv")
x = data$Area
y = data$Price
In order to compare the results of linear and polynomial regression, first we fit a simple linear regression:
model1 = lm(y ~ x)
> model1$fit
1 2 3 4 5 6 7 8 9 10
169.0995 178.9081 188.7167 218.1424 223.0467 266.6949 291.7068 296.6111 316.2282 335.8454
> model1$coeff
(Intercept) x
120.05663769 0.09808581
new_x = cbind(x, x^2)
new_x
model2 = lm(y ~ new_x)
model2$fit
model2$coeff
The fitted values and regression coefficients of the polynomial regression are:
> model2$fit
1 2 3 4 5 6 7 8 9 10
122.5388 153.9997 182.6550 251.7872 260.8543 310.6514 314.1467 312.6928 299.8631 275.8110
> model2$coeff
Using the ggplot2 package, we create a plot to compare the curves from both linear and polynomial regression.
library(ggplot2)
ggplot(data = data) + geom_point(aes(x = Area, y = Price)) +
  geom_line(aes(x = Area, y = model1$fit), color = "red") +
  geom_line(aes(x = Area, y = model2$fit), color = "blue") +
  theme(panel.background = element_blank())
3. Logistic Regression
In logistic regression, the dependent variable is binary in nature. Independent variables
can be continuous or binary.
In linear regression, the range of 'y' is the whole real line, but here it can take only 2 values. So 'y' is either 0 or 1, while X'β is continuous; thus we can't use usual linear regression in such a situation.
Secondly, the error terms are not normally distributed: y follows a binomial distribution and hence is not normal. Instead, we model the probability of the event through the logistic function, P(y = 1) = exp(X'β) / (1 + exp(X'β)).
Examples
HR Analytics: IT firms recruit a large number of people, but one of the problems they encounter is that after accepting the job offer, many candidates do not join. This results in cost over-runs because the firm has to repeat the entire process again. Now, when you get an application, can you actually predict whether that applicant is likely to join the organization (Binary Outcome – Join / Not Join)?
Thus we choose a cut-off probability, say 'p', and if P(Yi = 1) > p then we say that Yi belongs to class 1, otherwise class 0.
Interpreting the logistic regression coefficients (Concept of Odds Ratio)
If we take the exponential of a coefficient, we get the odds ratio for the ith explanatory variable. If the odds ratio is equal to two, the odds of the event are 2 times greater than the odds of the non-event. For example, suppose the dependent variable is customer attrition (whether the customer will close his relationship with the company), the independent variable is citizenship status (National / Expat), and the odds ratio for expats is 3: then the odds of an expat attriting are 3 times greater than the odds of a national attriting.
Logistic Regression in R:
In this case, we are trying to estimate whether a person will have cancer depending on whether he smokes or not.
We fit the logistic regression with the glm() function, setting family = "binomial".
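The fitting step itself is not shown in the excerpt; a sketch, assuming a data frame named data with binary columns smoking and cancer:
model = glm(cancer ~ smoking, data = data, family = "binomial")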
#Predicted Probabilities
model$fitted.values
Predicting whether the person will have cancer or not when we choose the cut-off probability to be 0.5:
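A sketch of that classification step (the column name prediction is taken from the output below):
data$prediction = model$fitted.values > 0.5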
> data$prediction
[1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
[16] FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
4. Quantile Regression
Quantile regression is an extension of linear regression, and we generally use it when outliers, high skewness or heteroscedasticity exist in the data.
In linear regression, we predict the mean of the dependent variable for given independent variables. Since the mean does not describe the whole distribution, modeling the mean is not a full description of the relationship between the dependent and independent variables. So we can use quantile regression, which predicts a chosen quantile (or percentile) of the dependent variable for given independent variables.
Basic Idea of Quantile Regression:
In quantile regression, we try to estimate the quantile of the dependent variable given the values of the X's. Note that the dependent variable should be continuous.
This looks similar to the linear regression model, but here the objective function we minimize for the qth quantile is
Σ_{i: yᵢ ≥ xᵢ'β} q·|yᵢ − xᵢ'β| + Σ_{i: yᵢ < xᵢ'β} (1 − q)·|yᵢ − xᵢ'β|.
If q = 0.5, i.e. if we are interested in the median, it becomes median regression (or least absolute deviation regression); substituting q = 0.5 into the above equation, the objective function reduces to
(1/2) Σᵢ |yᵢ − xᵢ'β|.
It is to be kept in mind that the coefficients we get from quantile regression for a particular quantile should differ significantly from those we obtain from linear regression; if they do not, our usage of quantile regression isn't justifiable. This can be checked by observing the confidence intervals of the regression coefficients estimated from both regressions.
Quantile Regression in R
install.packages("quantreg")
library(quantreg)
Using the rq function, we try to estimate the 25th quantile of Fertility in the swiss data. For this we set tau = 0.25.
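The call itself is not shown in the excerpt; a sketch (the object name is assumed):
rq_model = rq(Fertility ~ ., data = swiss, tau = 0.25)
summary(rq_model)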
We can check whether our quantile regression results differ from the OLS results using plots.
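A sketch of how the object plotted below can be built (the grid of quantiles is an assumption; plotting the summary of a multi-tau rq fit draws the comparison described next):
quantplot = summary(rq(Fertility ~ ., data = swiss, tau = seq(0.05, 0.95, by = 0.05)))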
plot(quantplot)
The various quantiles are depicted along the X axis. The red central line denotes the OLS coefficient estimate, and the dashed red lines are the confidence interval around it. The black dotted line shows the quantile regression estimates, and the gray area is their confidence interval across the quantiles. We can see that for all the variables, both regression estimates coincide for most quantiles; hence our use of quantile regression is not justifiable for such quantiles. In other words, we want the red and the gray regions to overlap as little as possible to justify our use of quantile regression.
5. Ridge Regression
It’s important to understand the concept of regularization before jumping to ridge
regression.
1. Regularization
Regularization helps to solve the overfitting problem, i.e. a model performing well on training data but poorly on validation (test) data. Regularization solves this problem by adding a penalty term to the objective function, controlling the model complexity with that penalty term.
Regularization is generally useful in the following situations:
1. Large number of variables
2. Low ratio of the number of observations to the number of variables
3. High multicollinearity
In the linear regression objective function, we try to minimize the sum of squares of the errors. In ridge regression (also known as shrinkage regression), we add a constraint on the sum of squares of the regression coefficients. Thus the objective function in ridge regression is
min Σᵢ (yᵢ − xᵢ'β)² + λ Σⱼ βⱼ².
Here λ is the regularization parameter, a non-negative number. We do not assume normality of the error terms here.
Very Important Note:
We do not regularize the intercept term. The constraint is only on the sum of squares of the regression coefficients of the X's.
If we choose lambda = 0, we get back the usual OLS estimates. If lambda is chosen to be very large, it will lead to underfitting. Thus it is highly important to determine a desirable value of lambda. To tackle this issue, we plot the parameter estimates against different values of lambda and select the minimum value of λ after which the parameters tend to stabilize.
Considering the swiss data set, we create two different datasets, one containing the dependent variable and the other containing the independent variables.
X = swiss[,-1]
y = swiss[,1]
library(glmnet)
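As a sketch of the trace plot described above (plot.glmnet with xvar = "lambda" shows the coefficient estimates across the lambda grid):
plot(glmnet(as.matrix(X), y, alpha = 0), xvar = "lambda", label = TRUE)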
Using the cv.glmnet() function we can do cross validation. Setting alpha = 0 means we are carrying out ridge regression; lambda is a sequence of lambda values which will be used for cross validation.
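A sketch of that call (the seed and lambda grid are assumptions, mirroring the elastic net example later in the post):
set.seed(123)
model = cv.glmnet(as.matrix(X), y, alpha = 0, lambda = 10^seq(4, -1, -0.1))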
We take the best lambda by using lambda.min, and hence get the regression coefficients via the predict function.
best_lambda = model$lambda.min
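The extraction step is not shown in the excerpt; a sketch that produces the table below (the object name ridge_coeff is assumed):
ridge_coeff = predict(model, s = best_lambda, type = "coefficients")
ridge_coeff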
(Intercept) 64.92994664
Agriculture -0.13619967
Examination -0.31024840
Education -0.75679979
Catholic 0.08978917
Infant.Mortality 1.09527837
6. Lasso Regression
Lasso stands for Least Absolute Shrinkage and Selection Operator. It makes use of the L1 regularization technique in the objective function. Thus the objective function in lasso regression becomes
min Σᵢ (yᵢ − xᵢ'β)² + λ Σⱼ |βⱼ|.
For the estimates we don't have a specific closed-form mathematical formula, but we can obtain them using statistical software.
Note that lasso regression also needs standardization.
Using cv.glmnet in the glmnet package, we do cross validation. For lasso regression we set alpha = 1. By default standardize = TRUE, hence we do not need to standardize the variables separately.
We consider the best value of lambda by filtering out lambda.min from the model, and hence get the coefficients using the predict function.
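A sketch of those steps, mirroring the ridge example but with alpha = 1 (the seed, lambda grid and object names are assumptions):
set.seed(123)
model = cv.glmnet(as.matrix(X), y, alpha = 1, lambda = 10^seq(4, -1, -0.1))
best_lambda = model$lambda.min
lasso_coeff = predict(model, s = best_lambda, type = "coefficients")
lasso_coeff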
(Intercept) 65.46374579
Agriculture -0.14994107
Examination -0.24310141
Education -0.83632674
Catholic 0.09913931
Infant.Mortality 1.07238898
Both ridge regression and lasso regression are designed to deal with multicollinearity. Ridge regression is computationally more efficient than lasso regression, but either of them can perform better on a given problem. So the best approach is to select the regression model which fits the test set data well.
7. ElasticNet Regression
Elastic net regression is preferred over both ridge and lasso regression when one is dealing with highly correlated independent variables. It is a combination of both L1 and L2 regularization; in glmnet's parameterization, the penalty is λ[(1 − α)/2 · Σⱼ βⱼ² + α · Σⱼ |βⱼ|].
Setting a value of alpha strictly between 0 and 1, we can carry out elastic net regression.
set.seed(123)
model = cv.glmnet(as.matrix(X),y,alpha = 0.5,lambda = 10^seq(4,-1,-0.1))
#Taking the best lambda
best_lambda = model$lambda.min
en_coeff = predict(model, s = best_lambda, type = "coefficients")
en_coeff
(Intercept) 65.9826227
Agriculture -0.1570948
Examination -0.2581747
Education -0.8400929
Catholic 0.0998702
Infant.Mortality 1.0775714
8. Principal Component Regression (PCR)
PCR is a regression technique which is widely used when the independent variables are highly correlated or very numerous. It is mainly used for:
1. Dimensionality reduction
2. Removal of multicollinearity
Principal components analysis is a statistical method to extract new features when the original features are highly correlated. We create new features with the help of the original features such that the new features are uncorrelated.
Drawbacks:
It is to be noted that PCR is not a feature selection technique; each principal component is a linear combination of all the original features, which makes the fitted model harder to interpret.
We use the longley data set available in R, which is known for high multicollinearity. We exclude the Year column.
data1 = subset(longley, select = -Year)
View(data1)
This is how some of the observations in our dataset will look:
install.packages("pls")
library(pls)
In PCR we are trying to estimate the number of Employed people; scale = T denotes that we are standardizing the variables; validation = "CV" denotes the use of cross-validation.
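The pcr() call itself is not shown in the excerpt; a sketch consistent with the description above and the predict() call further below:
pcr_model = pcr(Employed ~ ., data = data1, scale = TRUE, validation = "CV")
summary(pcr_model)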
Data: X dimension: 16 5
Y dimension: 16 1
VALIDATION: RMSEP
Here RMSEP denotes the root mean squared error of prediction, while in 'TRAINING: % variance explained' the cumulative % of variance explained by the principal components is depicted. We can see that with 3 PCs, more than 99% of the variation can be explained.
We can also create a plot depicting the mean squared error for various numbers of PCs.
validationplot(pcr_model, val.type = "MSEP")
By writing val.type = "R2" we can plot the R-squared for various numbers of PCs.
validationplot(pcr_model, val.type = "R2")
If we want to fit PCR with 3 principal components and hence get the predicted values, we can write:
pred = predict(pcr_model,data1,ncomp = 3)
9. Partial Least Squares (PLS) Regression
Both techniques create new independent variables, called components, which are linear combinations of the original predictor variables, but PCR creates components to explain the observed variability in the predictor variables without considering the response variable at all, while PLS takes the dependent variable into account and therefore often leads to models that are able to fit the dependent variable with fewer components.
PLS Regression in R
library(plsdepot)
data(vehicles)
pls.model = plsreg1(vehicles[, c(1:12,14:16)], vehicles[, 13], comps = 3)
# R-Square
pls.model$R2
10. Support Vector Regression
Support vector regression can solve both linear and non-linear models. SVM uses non-linear kernel functions (such as polynomial kernels) to find the optimal solution for non-linear models.
The main idea of SVR is to minimize error by individualizing the hyperplane which maximizes the margin.
library(e1071)
# generic sketch: Y and X stand for the response and predictor columns of 'data'
svr.model <- svm(Y ~ X, data)
pred <- predict(svr.model, data)
plot(data$X, data$Y)                        # scatter plot of the raw data
points(data$X, pred, col = "red", pch = 4)  # overlay the fitted values
11. Ordinal Regression
Ordinal regression is used to predict ranked values. In simple words, this type of regression is suitable when the dependent variable is ordinal in nature. Examples of ordinal variables: survey responses (on a 1 to 6 scale), patient reaction to a drug dose (none, mild, severe).
Why can't we use linear regression when dealing with an ordinal target variable?
In linear regression, changes in the level of the dependent variable are assumed to be equivalent throughout the range of the variable. For example, the difference in weight between a person who is 100 kg and a person who is 120 kg is 20 kg, which has the same meaning as the difference in weight between a person who is 150 kg and a person who is 170 kg. These relationships do not necessarily hold for ordinal variables.
library(ordinal)
# the wine data ships with the ordinal package; rating is an ordered factor
o.model <- clm(rating ~ ., data = wine)
summary(o.model)
12. Poisson Regression
Poisson regression is used when the dependent variable is count data. In the code below, we use the dataset named warpbreaks, which shows the number of breaks in yarn during weaving. In this case, the model includes terms for wool type, tension and the interaction between the two.
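A sketch of that model (a Poisson GLM with the wool-by-tension interaction; the object name is assumed):
pos.model = glm(breaks ~ wool * tension, data = warpbreaks, family = poisson)
summary(pos.model)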
13. Negative Binomial Regression
When the variance of count data is greater than the mean count, it is a case of overdispersion; the opposite is a case of under-dispersion. Negative binomial regression is commonly used when count data are overdispersed.
library(MASS)
# quine: days absent from school for Australian schoolchildren
nb.model <- glm.nb(Days ~ Sex/(Age + Eth*Lrn), data = quine)
summary(nb.model)
14. Cox Regression
Cox regression (survival analysis) is suitable for time-to-event data. Logistic regression, by contrast, uses a binary dependent variable but ignores the timing of events.
As well as estimating the time it takes to reach a certain event, survival analysis can also be used to compare the time-to-event for multiple groups.
library(survival)
# Lung Cancer Data
# status: 2=death
lung$SurvObj <- with(lung, Surv(time, status == 2))
cox.reg <- coxph(SurvObj ~ age + sex + ph.karno + wt.loss, data = lung)
cox.reg
2. If you are working on count data, you should try Poisson, quasi-Poisson and negative binomial regression.
3. To avoid overfitting, we can use the cross-validation method to evaluate models used for prediction. We can also use ridge, lasso and elastic net regularization techniques to correct the overfitting issue.
4. Try support vector regression when you have a non-linear model.
About Author:
Deepanshu founded ListenData with a simple objective – Make analytics easy to
understand and follow. He has over 7 years of experience in data science and predictive
modeling. During his tenure, he has worked with global clients in various domains.
To leave a comment for the author, please follow the link and comment on their blog: ListenData.