Linear Regression Assumptions and Diagnostics in R: Essentials
After performing a regression analysis, you should always check if the model works well for the data at hand.
A first step of this regression diagnostic is to inspect the significance of the regression beta coefficients, as well as the R2, which tells us how well the linear regression model fits the data. This has been described in Chapters @ref(linear-regression) and @ref(cross-validation).
In this chapter, you will learn additional steps to evaluate how well the model fits the data.
For example, the linear regression model makes the assumption that the relationship between the predictors (x) and the outcome variable is linear. This might not be true; the relationship could be polynomial or logarithmic.
Additionally, the data might contain some influential observations, such as outliers (or extreme values), that can affect the results of the regression.
Therefore, you should closely diagnose the regression model that you built in order to detect potential problems and to check whether the assumptions made by the linear regression model are met or not.
To do so, we generally examine the distribution of residual errors, which can tell you more about your data.
library(tidyverse)
library(broom)
theme_set(theme_classic())
Example of data
We’ll use the data set marketing [datarium package], introduced in Chapter @ref(regression-analysis).
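We start by building a simple linear regression model of sales as a function of the youtube advertising budget. A minimal sketch, assuming the marketing data set that ships with the datarium package:
# Load the data
data("marketing", package = "datarium")
# Build a simple linear regression model of sales as a function of youtube
model <- lm(sales ~ youtube, data = marketing)
model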
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Coefficients:
## (Intercept) youtube
## 8.4391 0.0475
Our regression equation is: y = 8.44 + 0.048*x, that is, sales = 8.44 + 0.048*youtube.
Before describing regression assumptions and regression diagnostics, we start by explaining two key concepts in regression analysis: fitted values and residual errors. These are important for understanding the diagnostic plots presented hereafter.
In our example, for a given youtube advertising budget, the fitted (predicted) sales value would be: sales = 8.44 + 0.048*youtube.
From the scatter plot below, it can be seen that not all the data points fall exactly on the estimated regression line. This means that, for a given youtube advertising budget, the observed (or measured) sales values can be different from the predicted sales values. The differences are called the residual errors, represented by vertical red lines.
In R, you can easily augment your data to add fitted values and residuals by using the function augment() [broom package]. Let's call the output model.diag.metrics because it contains several metrics useful for regression diagnostics. We'll describe them later.
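A minimal sketch of this step, applied to the model built above:
# Add fitted values, residuals and other diagnostic metrics to the data
model.diag.metrics <- augment(model)
head(model.diag.metrics)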
The following R code plots the residual errors (in red) between the observed values and the fitted regression line. Each vertical red segment represents the residual error between an observed sales value and the corresponding predicted (i.e. fitted) value.
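The original plotting code is not reproduced in the extracted text; one possible version, using ggplot2 and the augmented data (the aesthetic choices are assumptions; .fitted is the fitted-value column produced by augment()):
ggplot(model.diag.metrics, aes(youtube, sales)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  # Draw a red segment from each observed value down to its fitted value
  geom_segment(aes(xend = youtube, yend = .fitted), color = "red")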
Regression assumptions
Linear regression makes several assumptions about the data, such as:
1. Linearity of the data. The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
2. Normality of residuals. The residual errors are assumed to be normally distributed.
3. Homogeneity of residuals variance. The residuals are assumed to have a constant variance (homoscedasticity).
4. Independence of residual error terms.
You should check whether or not these assumptions hold true. Potential problems include:
1. Non-linearity of the outcome-predictor relationships.
2. Heteroscedasticity: non-constant variance of the residual errors.
3. Presence of influential values in the data, which can be outliers (extreme values in the outcome variable) or high leverage points (extreme values in the predictors).
All these assumptions and potential problems can be checked by producing some diagnostic plots visualizing
the residual errors.
Diagnostic plots
Regression diagnostic plots can be created using the R base function plot() or the autoplot() function [ggfortify package], which creates ggplot2-based graphics.
library(ggfortify)
autoplot(model)
1. Residuals vs Fitted. Used to check the linear relationship assumption. A horizontal line, without distinct patterns, is an indication of a linear relationship, which is good.
2. Normal Q-Q. Used to examine whether the residuals are normally distributed. It's good if the residual points follow the straight dashed line.
3. Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
4. Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. This plot will be described further in the next sections.
The four plots show the top 3 most extreme data points, labeled with their row numbers in the data set. They might be potentially problematic. You might want to take a close look at them individually to check if there is anything special about the subject or if they could simply be data entry errors. We'll discuss this in the following sections.
The metrics used to create the above plots are available in the model.diag.metrics data, described in the previous section.
In the following sections, we'll describe in detail how to use these graphs and metrics to check the regression assumptions and to diagnose potential problems in the model.
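Linearity of the data
The linearity assumption can be checked by inspecting the Residuals vs Fitted plot, the first of the diagnostic plots shown above:
plot(model, 1)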
Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at
zero. The presence of a pattern may indicate a problem with some aspect of the linear model.
In our example, there is no pattern in the residual plot. This suggests that we can assume a linear relationship between the predictors and the outcome variable.
Note that, if the residual plot indicates a non-linear relationship in the data, then a simple approach is
to use non-linear transformations of the predictors, such as log(x), sqrt(x) and x^2, in the regression
model.
Homogeneity of variance
This assumption can be checked by examining the scale-location plot, also known as the spread-location plot.
plot(model, 3)
This plot shows if residuals are spread equally along the ranges of predictors. It’s good if you see a horizontal
line with equally spread points. In our example, this is not the case.
It can be seen that the variability (variance) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (or heteroscedasticity).
A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of
the outcome variable (y).
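For instance, a sketch of this remedy (the model name model.log is hypothetical):
# Refit the model with a log-transformed outcome to stabilize the variance
model.log <- lm(log(sales) ~ youtube, data = marketing)
plot(model.log, 3)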
Normality of residuals
The normality assumption can be checked by inspecting the normal Q-Q plot of the residuals; the residual points should approximately follow the straight dashed reference line.
plot(model, 2)
In our example, all the points fall approximately along this reference line, so we can assume normality.
Outliers and high leverage points
Outliers:
An outlier is a point that has an extreme outcome variable value. The presence of outliers may affect the interpretation of the model, because outliers increase the RSE (residual standard error).
Outliers can be identified by examining the standardized residual (or studentized residual), which is the residual
divided by its estimated standard error. Standardized residuals can be interpreted as the number of standard
errors away from the regression line.
Observations whose standardized residuals are greater than 3 in absolute value are possible outliers (James
et al. 2014).
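A possible way to flag such observations from the augmented data (a sketch; the column .std.resid is produced by augment()):
# Flag observations with |standardized residual| > 3
model.diag.metrics %>%
  filter(abs(.std.resid) > 3)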
A data point has high leverage if it has extreme predictor (x) values. This can be detected by examining the leverage statistic, or hat-value. A value of this statistic above 2(p + 1)/n indicates an observation with high leverage (P. Bruce and Bruce 2017), where p is the number of predictors and n is the number of observations.
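Similarly, high leverage points could be flagged using the hat values returned by augment() (a sketch; here p = 1 predictor and n = 200 observations):
# Flag observations with a leverage statistic above 2(p + 1)/n
p <- 1; n <- nrow(marketing)
model.diag.metrics %>%
  filter(.hat > 2 * (p + 1) / n)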
Outliers and high leverage points can be identified by inspecting the Residuals vs Leverage plot:
plot(model, 5)
The plot above highlights the top 3 most extreme points (#26, #36 and #179), with standardized residuals below -2. However, there are no outliers that exceed 3 standard deviations, which is good.
Additionally, there are no high leverage points in the data. That is, all data points have a leverage statistic below 2(p + 1)/n = 4/200 = 0.02.
Influential values
An influential value is a value whose inclusion or exclusion can alter the results of the regression analysis. Such a value is associated with a large residual.
Not all outliers (or extreme data points) are influential in linear regression analysis.
Statisticians have developed a metric called Cook's distance to determine the influence of a value. This metric defines influence as a combination of leverage and residual size.
A rule of thumb is that an observation has high influence if its Cook's distance exceeds 4/(n - p - 1) (P. Bruce and Bruce 2017), where n is the number of observations and p the number of predictor variables.
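A sketch of how this rule of thumb could be applied to the augmented data (the .cooksd column is produced by augment()):
# Flag observations with a Cook's distance above 4/(n - p - 1)
n <- nrow(marketing); p <- 1
model.diag.metrics %>%
  filter(.cooksd > 4 / (n - p - 1))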
The Residuals vs Leverage plot can help us to find influential observations if any. On this plot, outlying values
are generally located at the upper right corner or at the lower right corner. Those spots are the places where
data points can be influential against a regression line.
The following plots illustrate the Cook’s distance and the leverage of our model:
# Cook's distance
plot(model, 4)
# Residuals vs Leverage
plot(model, 5)
By default, the top 3 most extreme values are labelled on the Cook's distance plot. If you want to label the top 5 extreme values, specify the option id.n as follows:
plot(model, 4, id.n = 5)
If you want to look at the top 3 observations with the highest Cook's distance, in case you want to assess them further, type this R code:
model.diag.metrics %>%
  top_n(3, wt = .cooksd)
When data points have high Cook's distance scores and are to the upper or lower right of the leverage plot, they have leverage, meaning that they are influential to the regression results. The regression results will be altered if we exclude those cases.
In our example, the data don't present any influential points. Cook's distance lines (a red dashed line) are not shown on the Residuals vs Leverage plot because all points are well inside the Cook's distance lines.
Let's now show another example, where the data contain two extreme values with potential influence on the regression results:
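The code that builds model2 is not shown in the extracted text; a hypothetical construction that appends two extreme observations (rows #201 and #202) to the marketing data might look like this (the values 500, 600, 80 and 100 are illustrative assumptions, not the ones used in the original example):
# Append two extreme (x, y) points to the data and refit the model
df2 <- data.frame(
  x = c(marketing$youtube, 500, 600),
  y = c(marketing$sales, 80, 100)
)
model2 <- lm(y ~ x, data = df2)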
# Cook's distance
plot(model2, 4)
# Residuals vs Leverage
plot(model2, 5)
On the Residuals vs Leverage plot, look for data points outside of the dashed Cook's distance lines. When points fall outside of the Cook's distance lines, they have high Cook's distance scores, and the values are influential to the regression results. The regression results will be altered if we exclude those cases.
In this second example, two data points are far beyond the Cook's distance lines, while the other residuals appear clustered on the left. The plot identifies the influential observations as #201 and #202. If you exclude these points from the analysis, the slope coefficient changes from 0.06 to 0.04 and R2 from 0.5 to 0.6. Pretty big impact!
Discussion
This chapter describes linear regression assumptions and shows how to diagnose potential problems in the model.
The diagnostics are essentially performed by visualizing the residuals. Having patterns in the residuals is not a stop signal, but it suggests that your current regression model might not be the best way to understand your data. Potential problems might be:
Non-linear relationships between the outcome and the predictor variables. When facing this problem, one solution is to include non-linear terms in the model, such as polynomial terms or a log transformation. See Chapter @ref(polynomial-and-spline-regression).
Existence of important variables that you left out of your model. Other variables you didn't include (e.g., age or gender) may play an important role in your model and data. See Chapter @ref(confounding-variables).
Presence of outliers. If you believe that an outlier is due to an error in data collection or entry, then one solution is to simply remove the observation concerned.
References
Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical
Learning: With Applications in R. Springer Publishing Company, Incorporated.