Linear Regression Assumptions and Diagnostics in R: Essentials
After performing a regression analysis, you should always check if the model works well for the data at hand.
A first step of this regression diagnostic is to inspect the significance of the regression beta coefficients, as well as the R2, which tells us how well the linear regression model fits the data. This has been described in Chapters @ref(linear-regression) and @ref(cross-validation).
In this chapter, you will learn additional steps to evaluate how well the model fits the data.
For example, the linear regression model makes the assumption that the relationship between the predictors (x) and the outcome variable is linear. This might not be true; the relationship could be polynomial or logarithmic.
Additionally, the data might contain some influential observations, such as outliers (or extreme values), that can affect the results of the regression.
Therefore, you should closely diagnose the regression model that you built in order to detect potential problems and to check whether the assumptions made by the linear regression model are met or not.
To do so, we generally examine the distribution of residual errors, which can tell you more about your data.
library(tidyverse)
library(broom)
theme_set(theme_classic())
Example of data
We’ll use the data set marketing [datarium package], introduced in Chapter @ref(regression-analysis).
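We start by building a simple linear regression model of sales as a function of the youtube advertising budget. A minimal sketch, assuming the marketing data set that ships with the datarium package:
# Load the data
data("marketing", package = "datarium")
# Build a simple linear regression model of sales as a function of youtube
model <- lm(sales ~ youtube, data = marketing)
model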
##
## Call:
## lm(formula = sales ~ youtube, data = marketing)
##
## Coefficients:
## (Intercept) youtube
## 8.4391 0.0475
Our regression equation is: y = 8.44 + 0.048*x, that is, sales = 8.44 + 0.048*youtube.
Before describing regression assumptions and regression diagnostics, we start by explaining two key concepts in regression analysis: fitted values and residual errors. These are important for understanding the diagnostic plots presented hereafter.
In our example, for a given youtube advertising budget, the fitted (predicted) sales value would be: sales = 8.44 + 0.048*youtube.
From the scatter plot below, it can be seen that not all the data points fall exactly on the estimated regression line. This means that, for a given youtube advertising budget, the observed (or measured) sales values can be different from the predicted sales values. The differences are called the residual errors, represented by vertical red lines.
In R, you can easily augment your data to add fitted values and residuals by using the function augment() [broom package]. Let's call the output model.diag.metrics because it contains several metrics useful for regression diagnostics. We'll describe them later.
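A minimal sketch of this step, applied to the model built above:
# Add fitted values, residuals and other diagnostic metrics to the data
model.diag.metrics <- augment(model)
head(model.diag.metrics)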
The following R code plots the residual errors (in red) between the observed values and the fitted regression line. Each vertical red segment represents the residual error between an observed sales value and the corresponding predicted (i.e. fitted) value.
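The original plotting code is not reproduced in the extracted text; one possible version, using ggplot2 and the augmented data (the aesthetic choices are assumptions; .fitted is the fitted-value column produced by augment()):
ggplot(model.diag.metrics, aes(youtube, sales)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  # Draw a red segment from each observed value down to its fitted value
  geom_segment(aes(xend = youtube, yend = .fitted), color = "red")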
Regression assumptions
Linear regression makes several assumptions about the data, such as:
1. Linearity of the data. The relationship between the predictor (x) and the outcome (y) is assumed to be linear.
2. Normality of residuals. The residual errors are assumed to be normally distributed.
3. Homogeneity of residuals variance. The residuals are assumed to have a constant variance (homoscedasticity).
4. Independence of residual error terms.
You should check whether or not these assumptions hold true. Potential problems include:
1. Non-linearity of the outcome-predictor relationships.
2. Heteroscedasticity: non-constant variance of the residual errors.
3. Presence of influential values in the data, which can be outliers (extreme values in the outcome variable) or high leverage points (extreme values in the predictors).
All these assumptions and potential problems can be checked by producing some diagnostic plots visualizing
the residual errors.
Diagnostic plots
Regression diagnostic plots can be created using the R base function plot() or the autoplot() function [ggfortify package], which creates ggplot2-based graphics.
library(ggfortify)
autoplot(model)
1. Residuals vs Fitted. Used to check the linear relationship assumption. A horizontal line, without distinct patterns, is an indication of a linear relationship, which is good.
2. Normal Q-Q. Used to examine whether the residuals are normally distributed. It's good if the residual points follow the straight dashed line.
3. Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
4. Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis. This plot will be described further in the next sections.
The four plots show the top 3 most extreme data points, labeled with their row numbers in the data set. They might be potentially problematic. You might want to take a close look at them individually to check if there is anything special about the subject or if they could simply be data entry errors. We'll discuss this in the following sections.
The metrics used to create the above plots are available in the model.diag.metrics data, described in the previous section.
In the following sections, we'll describe in detail how to use these graphs and metrics to check the regression assumptions and to diagnose potential problems in the model.
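Linearity of the data
The linearity assumption can be checked by inspecting the Residuals vs Fitted plot, the first of the diagnostic plots shown above:
plot(model, 1)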
Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at
zero. The presence of a pattern may indicate a problem with some aspect of the linear model.
In our example, there is no pattern in the residual plot. This suggests that we can assume a linear relationship between the predictors and the outcome variable.
Note that, if the residual plot indicates a non-linear relationship in the data, then a simple approach is
to use non-linear transformations of the predictors, such as log(x), sqrt(x) and x^2, in the regression
model.
Homogeneity of variance
This assumption can be checked by examining the scale-location plot, also known as the spread-location plot.
plot(model, 3)
This plot shows if residuals are spread equally along the ranges of predictors. It’s good if you see a horizontal
line with equally spread points. In our example, this is not the case.
It can be seen that the variability (variance) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (or heteroscedasticity).
A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of
the outcome variable (y).
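For instance, a sketch of this remedy (the model name model.log is hypothetical):
# Refit the model with a log-transformed outcome to stabilize the variance
model.log <- lm(log(sales) ~ youtube, data = marketing)
plot(model.log, 3)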
Normality of residuals
The normality assumption can be checked by inspecting the normal Q-Q plot of the residuals; the residual points should approximately follow the straight dashed reference line.
plot(model, 2)
In our example, all the points fall approximately along this reference line, so we can assume normality.
Outliers and high leverage points
Outliers:
An outlier is a point that has an extreme outcome variable value. The presence of outliers may affect the interpretation of the model, because outliers increase the RSE (residual standard error).
Outliers can be identified by examining the standardized residual (or studentized residual), which is the residual
divided by its estimated standard error. Standardized residuals can be interpreted as the number of standard
errors away from the regression line.
Observations whose standardized residuals are greater than 3 in absolute value are possible outliers (James
et al. 2014).
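A possible way to flag such observations from the augmented data (a sketch; the column .std.resid is produced by augment()):
# Flag observations with |standardized residual| > 3
model.diag.metrics %>%
  filter(abs(.std.resid) > 3)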
A data point has high leverage if it has extreme predictor (x) values. This can be detected by examining the leverage statistic, or hat-value. A value of this statistic above 2(p + 1)/n indicates an observation with high leverage (P. Bruce and Bruce 2017), where p is the number of predictors and n is the number of observations.
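Similarly, high leverage points could be flagged using the hat values returned by augment() (a sketch; here p = 1 predictor and n = 200 observations):
# Flag observations with a leverage statistic above 2(p + 1)/n
p <- 1; n <- nrow(marketing)
model.diag.metrics %>%
  filter(.hat > 2 * (p + 1) / n)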
Outliers and high leverage points can be identified by inspecting the Residuals vs Leverage plot:
plot(model, 5)
The plot above highlights the top 3 most extreme points (#26, #36 and #179), with standardized residuals below -2. However, there are no outliers that exceed 3 standard deviations, which is good.
Additionally, there are no high leverage points in the data. That is, all data points have a leverage statistic below 2(p + 1)/n = 4/200 = 0.02.
Influential values
An influential value is a value whose inclusion or exclusion can alter the results of the regression analysis. Such a value is associated with a large residual.
Not all outliers (or extreme data points) are influential in linear regression analysis.
Statisticians have developed a metric called Cook's distance to determine the influence of a value. This metric defines influence as a combination of leverage and residual size.
A rule of thumb is that an observation has high influence if its Cook's distance exceeds 4/(n - p - 1) (P. Bruce and Bruce 2017), where n is the number of observations and p the number of predictor variables.
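A sketch of how this rule of thumb could be applied to the augmented data (the .cooksd column is produced by augment()):
# Flag observations with a Cook's distance above 4/(n - p - 1)
n <- nrow(marketing); p <- 1
model.diag.metrics %>%
  filter(.cooksd > 4 / (n - p - 1))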
The Residuals vs Leverage plot can help us to find influential observations if any. On this plot, outlying values
are generally located at the upper right corner or at the lower right corner. Those spots are the places where
data points can be influential against a regression line.
The following plots illustrate the Cook’s distance and the leverage of our model:
# Cook's distance
plot(model, 4)
# Residuals vs Leverage
plot(model, 5)
By default, the top 3 most extreme values are labelled on the Cook's distance plot. If you want to label the top 5 extreme values, specify the option id.n as follows:
plot(model, 4, id.n = 5)
If you want to look at the top 3 observations with the highest Cook's distance, in case you want to assess them further, type this R code:
model.diag.metrics %>%
  top_n(3, wt = .cooksd)
When data points have high Cook's distance scores and are to the upper or lower right of the leverage plot, they have leverage, meaning that they are influential to the regression results. The regression results will be altered if we exclude those cases.
In our example, the data don't present any influential points. Cook's distance lines (a red dashed line) are not shown on the Residuals vs Leverage plot because all points are well inside the Cook's distance lines.
Let's now show another example, where the data contain two extreme values with potential influence on the regression results:
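The code that builds model2 is not shown in the extracted text; a hypothetical construction that appends two extreme observations (rows #201 and #202) to the marketing data might look like this (the values 500, 600, 80 and 100 are illustrative assumptions, not the ones used in the original example):
# Append two extreme (x, y) points to the data and refit the model
df2 <- data.frame(
  x = c(marketing$youtube, 500, 600),
  y = c(marketing$sales, 80, 100)
)
model2 <- lm(y ~ x, data = df2)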
# Cook's distance
plot(model2, 4)
# Residuals vs Leverage
plot(model2, 5)
On the Residuals vs Leverage plot, look for data points outside of the dashed Cook's distance lines. When points fall outside of the Cook's distance lines, they have high Cook's distance scores, and the values are influential to the regression results. The regression results will be altered if we exclude those cases.
In this second example, two data points are far beyond the Cook's distance lines, while the other residuals appear clustered on the left. The plot identifies the influential observations as #201 and #202. If you exclude these points from the analysis, the slope coefficient changes from 0.06 to 0.04 and R2 from 0.5 to 0.6. Pretty big impact!
Discussion
This chapter describes linear regression assumptions and shows how to diagnose potential problems in the model.
The diagnostics are essentially performed by visualizing the residuals. Having patterns in the residuals is not a stop signal, but it suggests that your current regression model might not be the best way to understand your data. Potential problems might be:
Non-linear relationships between the outcome and the predictor variables. When facing this problem, one solution is to include non-linear terms in the model, such as polynomial terms or a log transformation. See Chapter @ref(polynomial-and-spline-regression).
Existence of important variables that you left out of your model. Other variables you didn't include (e.g., age or gender) may play an important role in your model and data. See Chapter @ref(confounding-variables).
Presence of outliers. If you believe that an outlier is due to an error in data collection or entry, then one solution is to simply remove the observation concerned.
References
Bruce, Peter, and Andrew Bruce. 2017. Practical Statistics for Data Scientists. O’Reilly Media.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2014. An Introduction to Statistical
Learning: With Applications in R. Springer Publishing Company, Incorporated.