R Lab 3

The document discusses performing simple linear regression (SLR) on temperature data. It covers loading packages, reading in the data, creating a scatter plot of the variables, fitting a linear regression model, and using diagnostics to detect influential observations. Regression diagnostics including Cook's D, hat values, and dfbetas are calculated and plots created to aid in identifying outliers.


EXST 7014, Lab 3: Simple Linear Regression: Regression Diagnostics and Assumptions Test

OBJECTIVES

Simple linear regression (SLR) is a common analysis procedure, used to describe the significant
relationship a researcher presumes to exist between two variables: the dependent (or response)
variable, and the independent (or explanatory) variable. This lab will familiarize you with how to
perform SLR using the lm command in R.

You might notice that a single observation that is substantially different from all other observations
can make a large difference in the results of your regression analysis. If a single observation (or
small group of observations) substantially changes your results, you would want to know about
this and investigate further. In this lab exercise, we will use appropriate regression diagnostics to
detect influential observations.

In this lab exercise, you will get familiar with and understand the following:
1) Use appropriate regression diagnostics to detect influential observations
2) Evaluate the assumptions of SLR using a normality test

Part I. Housekeeping Statements

Before we dive into the main part of the code, it is good practice to create a preamble in which we
load all the packages R needs to execute the tasks that follow. Since we will be graphing
scatterplots, we need the library “ggplot2”. If you already have it installed, great; if not, you can
install it using the “Packages” tab on the bottom-right panel. Click Install, and type the name of the
package you want in the “Install Packages” window that pops up. The default setting installs
packages from the CRAN repository, where most “mainstream” packages can be found.

To activate the package, use the following command:

library(ggplot2)

Since we will be analyzing some measures of influence, we will also use the library olsrr. Again,
install it if it is missing and activate it with the command:

library(olsrr)

Dataset
The data is from your textbook, chapter 7, problem 6, and you can obtain it through the link:
http://www.stat.lsu.edu/exstweb/statlab/datasets/fwdata97/FW07P06.txt.
The latitude (LAT) and the mean monthly range (RANGE), which is the difference between mean
monthly maximum and minimum temperatures, are given for a selected set of US cities. The
following program performs a SLR using RANGE as the dependent variable and LAT as the
independent variable.

Download and save the data in the same folder as the R-script you are working on. From the tab
“Session” on R-Studio select “Set Working Directory” to “Source File location”. This instructs R
to look for the datafiles in the same directory as the R-script. Then use the command:

fw07p06=read.table("data_lab1.txt", header = FALSE, sep = "", dec = ".")

The dataset is then named fw07p06, and it is read from the file data_lab1.txt using the
command read.table. The argument header=FALSE instructs R to treat the first line of the file
as just another data point. If it is set to header=TRUE, R instead reads the first line as
the names of the variables corresponding to each column. Our dataset does not have column
names, so we set header to FALSE and will create the column names ourselves.
The argument sep="" tells R to use whitespace as the separator between columns/variables. It can be
changed to "," for comma-separated values (CSV), among others. Finally, dec="."
forces R to use a dot as the decimal separator; this too can be changed to a comma
and more. After you run the line above, you should have a dataset named fw07p06 in
the Environment pane in the upper-right part of RStudio.
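To see how these arguments fit together, here is a minimal, self-contained sketch (the file and the two city rows are made up for illustration) that writes comma-separated data to a temporary file and reads it back with sep=",":

```r
# Hypothetical illustration: the same read.table arguments, but for a
# comma-separated file. The two rows below are invented example values.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Mobile,AL,30.4,18.0", "Juneau,AK,58.3,16.1"), tmp)
dat <- read.table(tmp, header = FALSE, sep = ",", dec = ".")
colnames(dat) <- c("CITY", "STATE", "LAT", "RANGE")
dat
```

Note that only sep changed relative to the line above; the header and dec arguments behave exactly the same way.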

To set the variable names we use the command colnames as follows:

colnames(fw07p06)=c("CITY","STATE","LAT","RANGE")

This forces R to rename (or just name) the 4 columns of the dataset using the labels inside the c()
argument. Don’t forget that c() is used to create an ordered list of elements, i.e., a vector.

You can view the dataset by either clicking on it, or by using the command

View(fw07p06)

Creating a Scatter Plot


When performing a regression analysis, it is always advisable to look at scatter plots of the data
in order to get an idea of the type of relationship that exists between the response variable and
the explanatory variables. The following commands will create a scatterplot between the two
variables.

plot1 = ggplot(fw07p06, aes(x = LAT, y = RANGE)) +
  geom_point() +
  theme_classic()
plot1
ggsave("Scatter1.pdf", plot1)

Let’s explore what each line does. The first one utilizes the command ggplot that is used to create
good plots using R. Inside the argument of ggplot, we define the dataset which we will use to
create the plot to be the dataframe fw07p06 we created earlier. Then we need to define the two
variables in the aes (short for aesthetics) argument. The x axis for us will be the LAT variable
(independent) and the y axis will be the RANGE variable (dependent).

The next line instructs R to draw the pairs of values as geometric points (geom_point). There are
many features of geom_point left to the user, such as the shape, size, and color of the points.
Check the following link for more information:

http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2)/
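As a hedged sketch of such customization (the shape, size, and color values below are arbitrary choices, and a tiny made-up data frame stands in for fw07p06):

```r
library(ggplot2)

# Toy stand-in for fw07p06 (made-up latitude/range values, illustration only)
demo <- data.frame(LAT = c(30.4, 44.1, 58.3), RANGE = c(18, 30, 16))

# Hypothetical customization: filled triangles, larger size, a fixed color
plot2 <- ggplot(demo, aes(x = LAT, y = RANGE)) +
  geom_point(shape = 17, size = 3, color = "steelblue") +
  theme_classic()
plot2
```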

The theme_classic() layer applies one of ggplot2’s built-in themes (a clean background with axis
lines and no gridlines). Again, the possibilities for customization are endless. Examples of the
different themes can be found here:
https://ggplot2.tidyverse.org/reference/ggtheme.html

As you might have noticed in the first line, we named our plot “plot1”. In order to see the
plot, we need to call it by its name in the script, hence the line plot1. Finally, the command
ggsave("Scatter1.pdf", plot1) saves the plot we created as a PDF file named Scatter1 in the same
folder as the script we are working on. This is very handy if one wants to use the plot in another
document. You can also save it as a JPG using the command ggsave("Scatter1.jpg", plot1).
Note that the “+” at the end of each line inside the ggplot call is needed in order for R to
consider all three lines as one statement.

The statement above will create a scatter plot of RANGE vs. LAT. The graph is not fancy, but is
sufficient for getting an idea of how RANGE and LAT are related.
To create more professional graphics, you can explore the R cookbook and ggplot2 in depth here:

http://www.cookbook-r.com/

Part II. Fitting Simple Linear Regression


Model statement with influence diagnostics
Output quantities (predicted values of Y, residuals, rstudent, dffits, etc.)

Based on the scatter plot produced above, we assume that an appropriate regression model
relating RANGE and LAT is the linear model given by:

y = β0 + β1x + ε

where Y is RANGE, X is LAT, and ε is a random error term that is normally distributed
with mean 0 and unknown variance σ². β0 is the Y-intercept, and β1 is the slope coefficient.

R has many different procedures to compute that linear regression, but the simplest one is the lm
function. The command is simply

model1 <- lm(RANGE ~ LAT, data=fw07p06)



With this we are creating the object model1 by invoking the linear regression model (lm)
between the variables RANGE and LAT (the first is the dependent and the second is the
independent). The argument data=fw07p06 tells R which dataset to use in order to find those
variables.

The output of lm is very detailed and provides a lot of information. The following command:

summary(model1)

gives us the table:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.4793 5.5481 -1.168 0.249
LAT 0.7515 0.1438 5.228 4.79e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The coefficient of LAT (corresponding to β1) is 0.7515, and the constant term β0 is -6.4793.
To find the R-squared value, we can look at the next part of the summary output:
Residual standard error: 5.498 on 43 degrees of freedom
Multiple R-squared: 0.3886, Adjusted R-squared: 0.3744
F-statistic: 27.33 on 1 and 43 DF, p-value: 4.786e-06

The coefficient of determination is labeled Multiple R-squared in the output of summary(model1). Also,
if one works with more complicated models, the adjusted R-squared is preferable: it is similar
to R-squared but penalizes the use of many explanatory variables. More on that later.
Notice that summary is a very useful generic R command that we will use for other procedures as
well.
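The summary is also an ordinary R object, so individual numbers can be extracted programmatically. A minimal sketch, using a tiny made-up data frame in place of fw07p06:

```r
# Toy data standing in for fw07p06 (values are invented for illustration)
demo <- data.frame(LAT   = c(30.4, 33.6, 41.0, 44.1, 58.3),
                   RANGE = c(18, 22, 25, 30, 16))
m <- lm(RANGE ~ LAT, data = demo)
s <- summary(m)
s$r.squared       # Multiple R-squared
s$adj.r.squared   # Adjusted R-squared
coef(m)           # intercept (beta0) and slope (beta1) estimates
```

Extracting values this way is handy when you need an estimate inside later calculations rather than just printed on screen.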

To obtain various diagnostic statistics, we ask R to apply them to the model we created earlier
using the influence.measures command:

influence.measures(model1)

This will list the hat values (Hat Diag H), the dffits, and the dfbetas. It also includes Cook’s D
and cov.r. These statistics are commonly used to detect possible outliers and influential observations.
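A small sketch of what influence.measures returns (again with a made-up data frame in place of fw07p06; the last row is deliberately extreme so that something gets flagged):

```r
# Toy data standing in for fw07p06; the last observation is intentionally unusual
demo <- data.frame(LAT   = c(30, 32, 35, 40, 45, 70),
                   RANGE = c(18, 20, 22, 26, 30, 5))
m <- lm(RANGE ~ LAT, data = demo)
infl <- influence.measures(m)
colnames(infl$infmat)  # dfbetas, dffit, cov.r, cook.d, hat: one row per observation
summary(infl)          # prints only the observations flagged as potentially influential
```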

The package olsrr has a complete analysis for them including great graphics, so we will be using
that. For more information on it you can visit:

https://cran.r-project.org/web/packages/olsrr/vignettes/influence_measures.html

a) To create Cook’s D bar plot and chart, we use the following two simple lines, with our
model (model1) as input:

CkDbar = ols_plot_cooksd_bar(model1)     # Bar plot
CkDchart = ols_plot_cooksd_chart(model1) # Chart

Besides the plot that pops up automatically, one can view the created lists CkDbar and CkDchart
with the command View(CkDbar), or through the global environment, to further analyze the results
of the Cook’s D analysis if needed.
For example, the command

CkDbar$outliers

outputs the table:

# A tibble: 3 x 2
observation cooks_distance
<int> <dbl>
1 4 0.166
2 33 0.109
3 43 0.328

which contains the outliers flagged by this method.
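The same values are available in base R through cooks.distance(), and olsrr's bar plot uses the common 4/n cutoff. A sketch with a made-up data frame (the last row is deliberately extreme):

```r
# Toy data standing in for fw07p06; the last observation is intentionally unusual
demo <- data.frame(LAT   = c(30, 32, 35, 40, 45, 70),
                   RANGE = c(18, 20, 22, 26, 30, 5))
m <- lm(RANGE ~ LAT, data = demo)
d <- cooks.distance(m)             # one Cook's D value per observation
flagged <- which(d > 4 / length(d))  # common 4/n rule-of-thumb threshold
flagged
```

Computing the distances directly like this lets you apply whatever threshold your course or textbook prefers.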

b) To create the DFBETAs Panel we again ask ols to apply dfbetas analysis on our model as
follows:

Dfb=ols_plot_dfbetas(model1) #dfbetas analysis


Similarly, you can further explore the Dfb list of outputs.

c) To compute the Dffits one should use:

Dff=ols_plot_dffits(model1) #dffits analysis


Similarly, you can further explore the Dff list of outputs.

d) For the studentized Residual Plot the command is

StRes=ols_plot_resid_stud(model1) #Studentized Residual analysis


Similarly, you can further explore the StRes list of outputs.

e) For the Standardized Residual Chart you can use

ResSta=ols_plot_resid_stand(model1)
Similarly, you can further explore the ResSta list of outputs.

In order to compute 95% confidence intervals for betas, we need to use the following code:

Bconfi=confint(model1,level =0.95)
Bconfi

confint is a general-purpose function in R, and the argument level=0.95 sets the confidence
level. The default, if nothing is provided, is 95%.

To find prediction intervals for individual values at the fitted points, the command is:

Fitconfi=predict(model1, interval="prediction",level=0.95) #For Fitted values


Fitconfi

Again, the level can be adjusted as desired. Also, Fitconfi is a matrix whose values can be used
for further analysis.

To find a confidence interval for the mean values the command is:

Meanconfi=predict(model1, interval = 'confidence',level=0.95) # For mean values


Meanconfi

Again, the level can be adjusted as desired. Also, Meanconfi is a matrix whose values can be used
for further analysis.
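The matrices returned by predict() always have the columns fit, lwr, and upr, so individual bounds can be pulled out by name. A sketch with a made-up data frame in place of fw07p06:

```r
# Toy data standing in for fw07p06 (values are invented for illustration)
demo <- data.frame(LAT   = c(30, 33, 36, 40, 44),
                   RANGE = c(18, 21, 24, 27, 30))
m <- lm(RANGE ~ LAT, data = demo)
ci <- predict(m, interval = "confidence", level = 0.95)
colnames(ci)            # "fit" "lwr" "upr"
ci[1, "fit"]            # fitted value for the first observation
ci[1, c("lwr", "upr")]  # its 95% confidence bounds
```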
Finally, if one wants to find the prediction interval, or the confidence interval for the mean response,
at a new data point with latitude 10, the following commands will do the trick:

predict(model1, data.frame(LAT=10), interval="confidence") # confidence interval for the mean

predict(model1, data.frame(LAT=10), interval="prediction") # prediction interval for an individual value

Part III. Evaluating Assumptions by Residual Analysis

ggplot2 offers an easy way to create the residual plot for linear regression using the following lines
of code:

ggplot(lm(RANGE ~ LAT, data=fw07p06)) +
  geom_point(aes(x=.fitted, y=.resid)) +
  geom_hline(yintercept=0) +
  labs(x="Fitted Values", y="Residuals", title="Residual Plot") +
  theme_classic()

As we can see, this command plots the fitted values against the residuals using the geom_point
layer. The remaining lines add a horizontal line at zero and create labels for the x-axis, the
y-axis, and the overall title. The theme_classic() layer gives us the classic white
background.

Finally, to test the residuals for normality, we will use the Shapiro-Wilk test, a popular
statistic for evaluating whether data are normally distributed. It should be noted that the null
hypothesis of the Shapiro-Wilk test is that the data are normally distributed. If the p-value of the
test is less than the significance level of 0.05, the null hypothesis is rejected, and we conclude
that the data are not normally distributed. Otherwise the null hypothesis cannot be rejected, and
we have no evidence against normality.

The residuals are saved in the computation of the model under model1$residuals so the following
command will compute the corresponding p value for us:

shapiro.test(model1$residuals)
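The test result is itself an object whose statistic and p-value can be extracted, and it is often paired with a Q-Q plot as a visual check. A sketch using simulated residuals in place of model1$residuals:

```r
# Stand-in residuals, drawn from a normal distribution for this sketch
set.seed(1)
res <- rnorm(40)
sw <- shapiro.test(res)
sw$statistic   # the W statistic
sw$p.value     # compare against 0.05
# Complementary visual check: points near the line are consistent with normality
qqnorm(res); qqline(res)
```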

LAB ASSIGNMENT

Your assignment is to perform necessary analysis using R and answer the following questions:

1. Is the normality assumption violated? State the name and the value of the statistic that you
used to reach your conclusion.
2. Does there appear to be any possible influential observations? State the names and values of
the statistics that you used to reach your conclusion.
3. What is the confidence interval for the mean Range for all cities with latitude 32.3?
4. What is the confidence interval for Range for a randomly selected city with latitude 32.3?

*Remember to attach your R script for the lab report.
