
Mindanao State University

General Santos City

December 31, 2020

Carlito O. Daarol
Instructor
Descriptive Statistics and Statistical Inference

Book Reference: Applied Statistics with R!

Note: This is your reference for question number 2 in the final exam

Simple Linear Regression


After reading this topic you will be able to:
• Understand the concept of a model.
• Estimate and visualize a regression model using R.
• Interpret regression coefficients and statistics.
• Use a regression model to make predictions.
• Understand that beyond Simple Linear Regression there are more advanced methods …

Modeling
Let’s consider a simple example of how the speed of a car affects its stopping distance, that
is, how far it travels before it comes to a stop. To examine this relationship, we will use the
cars dataset, which is a default R dataset.
plot(dist ~ speed, data = cars,
xlab = "Speed (in Miles Per Hour)",
ylab = "Stopping Distance (in Feet)",
main = "Stopping Distance vs Speed",
pch = 20,
cex = 2,
col = "grey")
In the cars example, we are interested in using the predictor variable speed to predict and
explain the response variable dist.
Broadly speaking, we would like to model the relationship between 𝑋 and 𝑌 using the form
𝑌 = 𝑓(𝑋) + 𝜖.
The function 𝑓 describes the functional relationship between the two variables, and the 𝜖
term is used to account for error. This indicates that if we plug in a given value of 𝑋 as
input, our output is a value of 𝑌, within a certain range of error. You could think of this in a
number of ways:
• Response = Prediction + Error
• Response = Signal + Noise
• Response = Model + Unexplained
• Response = Deterministic + Random
• Response = Explainable + Unexplainable
What sort of function should we use for 𝑓(𝑋) for the cars data?
We could try to model the data with a well chosen line that will summarize the relationship
between stopping distance and speed quite well. As speed increases, the distance required
to come to a stop increases. There is still some variation about this line, but it seems to
capture the overall trend.
With this in mind, we would like to restrict our choice of 𝑓(𝑋) to linear functions of 𝑋. We
will write our model using 𝛽1 for the slope, and 𝛽0 for the intercept,
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖.

Simple Linear Regression Model


We now define what we will call the simple linear regression model,
𝑌𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜖𝑖
where
𝜖𝑖 ∼ 𝑁(0, 𝜎²).
That is, the 𝜖𝑖 are independent and identically distributed (iid) normal random variables
with mean 0 and variance 𝜎². This model has three parameters to be estimated: 𝛽0, 𝛽1, and
𝜎², which are fixed but unknown constants.
We have slightly modified our notation here. We are now using 𝑌𝑖 and 𝑥𝑖 , since we will be
fitting this model to a set of 𝑛 data points, for 𝑖 = 1,2, … 𝑛.
Recall that we use capital 𝑌 to indicate a random variable, and lower case 𝑦 to denote a
potential value of the random variable. Since we will have 𝑛 observations, we have 𝑛
random variables 𝑌𝑖 and their possible values 𝑦𝑖 .
In the simple linear regression model, the 𝑥𝑖 are assumed to be fixed, known constants, and
are thus notated with a lower case variable. The response 𝑌𝑖 remains a random variable
because of the random behavior of the error variable, 𝜖𝑖 . That is, each response 𝑌𝑖 is tied to
an observable 𝑥𝑖 and a random, unobservable, 𝜖𝑖 .
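
To make the roles of these quantities concrete, here is a minimal simulation sketch (not part of the cars example; the parameter values below are made up purely for illustration) that generates data from an SLR model. The 𝑥 values are fixed, while the errors, and therefore the responses, are random.

# A minimal sketch: simulate data from Y_i = beta_0 + beta_1 * x_i + eps_i,
# with eps_i ~ N(0, sigma^2). The parameter values are hypothetical.
set.seed(42)
beta_0 <- 5   # hypothetical intercept
beta_1 <- 2   # hypothetical slope
sigma  <- 3   # hypothetical error standard deviation

x_sim   <- 1:20                                       # fixed, known x values
eps_sim <- rnorm(length(x_sim), mean = 0, sd = sigma) # iid normal errors
y_sim   <- beta_0 + beta_1 * x_sim + eps_sim          # each response is random through eps_sim

plot(x_sim, y_sim, pch = 20,
     main = "Data simulated from an SLR model")
abline(beta_0, beta_1, col = "darkorange")            # the true (not estimated) line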

Assumptions of Linear Regression Model


• Linearity. The relationship between 𝑌 and 𝑥 is linear, of the form 𝛽0 + 𝛽1 𝑥.
• Independent. The errors 𝜖 are independent.
• Normal. The errors 𝜖 are normally distributed. That is, the “error” around the line
follows a normal distribution.
• Equal Variance. At each value of 𝑥, the variance of 𝑌 is the same, 𝜎².
We are also assuming that the values of 𝑥 are fixed, that is, not random. We do not make a
distributional assumption about the predictor variable.
As a side note, we will often refer to simple linear regression as SLR. Some explanation of
the name SLR:
• Simple refers to the fact that we are using a single predictor variable. Later we will
use multiple predictor variables.
• Linear tells us that our model for 𝑌 is a linear combination of the predictors 𝑋. (In this
case just the one.) Right now, this always results in a model that is a line, but later we
will see how this is not always the case.
• Regression simply means that we are attempting to measure the relationship
between a response variable and (one or more) predictor variables. In the case of SLR,
both the response and the predictor are numeric variables.

Formula for the regression coefficients


𝛽̂1 = [ 𝑛 ∑ 𝑥𝑖𝑦𝑖 − (∑ 𝑥𝑖)(∑ 𝑦𝑖) ] / [ 𝑛 ∑ 𝑥𝑖² − (∑ 𝑥𝑖)² ]

𝛽̂0 = 𝑦̄ − 𝛽̂1 𝑥̄

where each sum runs over 𝑖 = 1, 2, …, 𝑛, and 𝑥̄ and 𝑦̄ denote the sample means of 𝑥 and 𝑦.
To keep some notation consistent with above mathematics, we will store the response
variable as y and the predictor variable as x.
x = cars$speed
y = cars$dist

We then calculate the summation terms as indicated in the formula above.


(n= nrow(cars))

## [1] 50

(Sumxy = sum(x*y))

## [1] 38482

(Sumx = sum(x))
## [1] 770

(Sumy = sum(y))

## [1] 2149

(Sumxsquare = sum(x*x))

## [1] 13228

c(Sumxy, Sumx, Sumy, Sumxsquare)

## [1] 38482 770 2149 13228

Then finally, calculate 𝛽̂0 and 𝛽̂1 by inserting the summation terms into the formulas.
beta_1_hat = (n*Sumxy - Sumx*Sumy)/(n*Sumxsquare -Sumx^2)
beta_0_hat = mean(y) - beta_1_hat * mean(x)
c(beta_0_hat, beta_1_hat)

## [1] -17.579095 3.932409
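
As a quick sanity check (not shown in the original text), the slope formula above is algebraically equivalent to the sample covariance of x and y divided by the sample variance of x, so R's built-in cov() and var() functions should reproduce the same estimates:

# Equivalent computation: the (n - 1) factors in cov() and var() cancel,
# so this matches beta_1_hat from the summation formula above.
cov(x, y) / var(x)
mean(y) - (cov(x, y) / var(x)) * mean(x)  # reproduces beta_0_hat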

What do these values tell us about our dataset?


The slope parameter 𝛽1 tells us that for an increase in speed of one mile per hour, the mean
stopping distance increases by 𝛽1. It is important to specify that we are talking about the
mean. Recall that 𝛽0 + 𝛽1𝑥 is the mean of 𝑌, in this case stopping distance, for a particular
value of 𝑥. (In this case speed.) So 𝛽1 tells us how the mean of 𝑌 is affected by a change in 𝑥.

Similarly, the estimate 𝛽̂1 = 3.93 tells us that for an increase in speed of one mile per hour,
the estimated mean stopping distance increases by 3.93 feet. Here we should be sure to
specify we are discussing an estimated quantity. Recall that 𝑦̂ is the estimated mean of 𝑌, so
𝛽̂1 tells us how the estimated mean of 𝑌 is affected by changing 𝑥.
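
As a small numerical check of this interpretation (the speeds 10 and 11 mph below are chosen arbitrarily), estimated means one mile per hour apart differ by exactly 𝛽̂1:

# The difference between estimated mean stopping distances one mph apart equals beta_1_hat
(beta_0_hat + beta_1_hat * 11) - (beta_0_hat + beta_1_hat * 10)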
The intercept parameter 𝛽0 tells us the mean stopping distance for a car traveling zero
miles per hour. (Not moving.) The estimate 𝛽̂0 = −17.58 tells us that the estimated mean
stopping distance for a car traveling zero miles per hour is −17.58 feet. So when you apply
the brakes to a car that is not moving, it moves backwards? This doesn’t seem right.
(Extrapolation, which we will see later, is the issue here.)

Making Predictions
We can now write the fitted or estimated line,

𝑦̂ = 𝛽̂0 + 𝛽̂1 𝑥.
In this case,
𝑦̂ = −17.58 + 3.93𝑥.
There is an issue we saw when interpreting 𝛽̂0 = −17.58. This is equivalent to making a
prediction at 𝑥 = 0. We should not be confident in the estimated linear relationship outside
of the range of data we have observed.
We can now use this line to make predictions. First, let’s see the possible 𝑥 values in the
cars dataset. Since some 𝑥 values may appear more than once, we use the unique() function
to return each distinct value only once.
unique(cars$speed)

## [1] 4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25

Let’s make a prediction for the stopping distance of a car traveling at 8 miles per hour.
𝑦̂ = −17.58 + 3.93 × 8
beta_0_hat + beta_1_hat * 8

## [1] 13.88018

This tells us that the estimated mean stopping distance of a car traveling at 8 miles per
hour is 13.88 feet.
To get predictions for the entire dataset, we simply feed all of the speed values to our
regression equation.
speed = cars$speed
actual_distance <- cars$dist

stop_distance = round(beta_0_hat + beta_1_hat * speed,2)


Prediction_error <- actual_distance-stop_distance

# combine variables into one table using the rbind command, then transpose the table
output <- t(rbind(speed, actual_distance, stop_distance, Prediction_error))
colnames(output) <- c("Speed", "Stopping Distance(ft)", "Predicted Distance", "Prediction Error")
library(psych)
headTail(as.data.frame(output)) #display first 4 and last 4 rows only

## Speed Stopping.Distance.ft. Predicted.Distance Prediction.Error
## 1 4 2 -1.85 3.85
## 2 4 10 -1.85 11.85
## 3 7 4 9.95 -5.95
## 4 7 22 9.95 12.05
## ... ... ... ... ...
## 47 24 92 76.8 15.2
## 48 24 93 76.8 16.2
## 49 24 120 76.8 43.2
## 50 25 85 80.73 4.27
As you can see, the last column represents the prediction error. The predictions are not
especially accurate yet, but there are ways to improve the regression model.

Residuals
If we think of our model as “Response = Prediction + Error,” we can then write it as
𝑦 = 𝑦̂ + 𝑒.
We then define a residual to be the observed value minus the predicted value.
𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖
The set of residuals or prediction errors is indicated in the last column of the output table.
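
As a short sketch, the residuals can also be computed directly from the fitted line. Apart from the rounding applied to stop_distance earlier, they match the prediction errors in the table, and (a property of least squares) they sum to approximately zero:

# Residuals: observed stopping distance minus the value on the fitted line
y_hat <- beta_0_hat + beta_1_hat * x
e_hat <- y - y_hat

head(e_hat)  # first few residuals
sum(e_hat)   # numerically zero for a least squares fit with an intercept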

The lm Function
So far we have done regression by deriving the least squares estimates, then writing simple
R commands to perform the necessary calculations. Since this is such a common task, this is
functionality that is built directly into R via the lm() command.
The lm() command is used to fit linear models which actually account for a broader class
of models than simple linear regression, but we will use SLR as our first demonstration of
lm(). The lm() function will be one of our most commonly used tools, so you may want to
take a look at the documentation by using ?lm. You’ll notice there is a lot of information
there, but we will start with just the very basics. This is documentation you will want to
return to often.
We’ll continue using the cars data, and essentially use the lm() function to check the work
we had previously done.
library(moderndive)
library(magrittr)   # provides the %>% pipe operator used below
stop_dist_model = lm(dist ~ speed, data = cars)
get_regression_table(stop_dist_model) %>%
  knitr::kable()

term estimate std_error statistic p_value lower_ci upper_ci
intercept -17.579 6.758 -2.601 0.012 -31.168 -3.990
speed 3.932 0.416 9.464 0.000 3.097 4.768

This line of code fits our very first linear model. The syntax should look somewhat familiar.
We use the dist ~ speed syntax to tell R we would like to model the response variable
dist as a linear function of the predictor variable speed. In general, you should think of the
syntax as response ~ predictor. The data = cars argument then tells R that the dist
and speed variables are from the dataset cars. We then store the result in the variable
stop_dist_model. The resulting regression table is displayed in a nicely formatted
output.
The call we input into R is lm(formula = dist ~ speed, data = cars). The table then
shows the estimated coefficients of the model. We can check that these are what we had
calculated previously. (Minus some rounding that R is doing when displaying the results.
They are stored with full precision.)
c(beta_0_hat, beta_1_hat)

## [1] -17.579095 3.932409

Next, it would be nice to add the fitted line to the scatterplot. To do so we will use the
abline() function.
plot(dist ~ speed, data = cars,
xlab = "Speed (in Miles Per Hour)",
ylab = "Stopping Distance (in Feet)",
main = "Stopping Distance vs Speed",
pch = 20,
cex = 2,
col = "grey")
abline(stop_dist_model, lwd = 3, col = "darkorange")

The abline() function is used to add lines of the form 𝑎 + 𝑏𝑥 to a plot. (Hence abline.)
When we give it stop_dist_model as an argument, it automatically extracts the estimated
regression coefficients (𝛽̂0 and 𝛽̂1) and uses them as the intercept and slope of the line. Here
we also use lwd to modify the width of the line, as well as col to modify the color of the line.
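
If you prefer, the same line can be drawn by passing the intercept and slope explicitly to abline()'s a and b arguments; this is just an alternative to passing the fitted model object:

# Equivalent: supply the intercept (a) and slope (b) directly
abline(a = coef(stop_dist_model)[1], b = coef(stop_dist_model)[2],
       lwd = 3, col = "darkorange")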
The “thing” that is returned by the lm() function is actually an object of class lm which is a
list. The exact details of this are unimportant unless you are seriously interested in the
inner-workings of R, but know that we can determine the names of the elements of the list
using the names() command.
names(stop_dist_model)

## [1] "coefficients" "residuals" "effects" "rank"


## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"

We can then use this information to, for example, access the first 7 residuals using the $
operator.
stop_dist_model$residuals[1:7]

## 1 2 3 4 5 6 7
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584 -3.744993

Another way to access stored information in stop_dist_model is through the coef(), resid(),
and fitted() functions. These return the coefficients, residuals, and fitted values,
respectively; below we display only the first 7 residuals and fitted values.
coef(stop_dist_model)

## (Intercept) speed
## -17.579095 3.932409

resid(stop_dist_model)[1:7]

## 1 2 3 4 5 6 7
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584 -3.744993

fitted(stop_dist_model)[1:7]

## 1 2 3 4 5 6 7
## -1.849460 -1.849460 9.947766 9.947766 13.880175 17.812584 21.744993
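
Another commonly used accessor is the predict() function. The short sketch below repeats the earlier prediction at 8 miles per hour; note that the newdata data frame must contain a column named speed, matching the predictor used in the model:

# Prediction from the fitted model at speed = 8 mph
predict(stop_dist_model, newdata = data.frame(speed = 8))

# Called without newdata, predict() returns the fitted values for the original data
predict(stop_dist_model)[1:7]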

Simple Linear Regression and its relationship to other issues


1. like … being compared to more advanced regression techniques
2. like … having a girlfriend

Legend:
Blue line - generated using Simple Linear Regression Analysis
Green band or curve – generated by RandomForest Regression Analysis
Red band or curve – generated by Support Vector Regression Analysis

Three methods of regression analysis are presented for comparison.


1. The classical Simple Linear Regression – based on the assumption of normality. It is
the oldest of the three methods; the idea behind SLR was formulated centuries ago. Its
performance is good only if the actual relationship is close to that of a straight line.
2. The advanced regression methods (RandomForest and Support Vector Regression) – do
not rely on the normality assumption, work well for both straight-line and non-linear
relationships, need fast computers to run their respective algorithms, and fall under
non-parametric analysis. (A minimal fitting sketch follows this list.)
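
The following minimal fitting sketch (not from the original text) shows how the three methods could be applied to the cars data, assuming the randomForest and e1071 packages are installed; all tuning settings are left at their defaults.

# Fit all three models to the cars data for comparison.
# Assumes install.packages("randomForest") and install.packages("e1071") have been run.
library(randomForest)
library(e1071)

slr_fit <- lm(dist ~ speed, data = cars)            # classical simple linear regression
rf_fit  <- randomForest(dist ~ speed, data = cars)  # RandomForest regression
svr_fit <- svm(dist ~ speed, data = cars)           # Support Vector regression

# Compare the three predictions at a speed of 8 mph
new_speed <- data.frame(speed = 8)
c(SLR          = unname(predict(slr_fit, new_speed)),
  RandomForest = unname(predict(rf_fit,  new_speed)),
  SVR          = unname(predict(svr_fit, new_speed)))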
Case 1: The variables X and Y are strongly correlated with r = 0.95. The plot shows a linear
trend.
For purposes of numeric evaluation, we will use the RMSE statistic, or Root Mean Square
Error, which is defined as

𝑟𝑚𝑠𝑒 = √( ∑ (𝑦𝑖 − 𝑦̂𝑖)² / 𝑛 )

where the sum runs over 𝑖 = 1, 2, …, 𝑛.
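
As a small sketch using the stop_dist_model fitted earlier in this document (separate from the simulated cases summarized below), the RMSE of the cars stopping-distance model can be computed directly from its residuals:

# RMSE of the SLR fit to the cars data:
# square the residuals, average them, then take the square root.
sqrt(mean(resid(stop_dist_model)^2))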

SLR rmse = 0.007 RandomForest rmse = 0.005 Support Vector rmse = 0.007
SLR rmse = 13.81 RandomForest rmse = 2.93 Support Vector rmse = 4.77

Predictive power of SLR begins to deteriorate if the relationship of the variables is not
linear.

SLR rmse = 0.90 RandomForest rmse = 0.30 Support Vector rmse = 0.28
