Mindanao State University General Santos City: Simple Linear Regression
Carlito O. Daarol
Instructor
Descriptive Statistics and Statistical Inference
Note: This is your reference for question number 2 in the final exam.
Modeling
Let’s consider a simple example of how the speed of a car affects its stopping distance, that
is, how far it travels before it comes to a stop. To examine this relationship, we will use the
cars dataset, which is a default R dataset.
plot(dist ~ speed, data = cars,
xlab = "Speed (in Miles Per Hour)",
ylab = "Stopping Distance (in Feet)",
main = "Stopping Distance vs Speed",
pch = 20,
cex = 2,
col = "grey")
In the cars example, we are interested in using the predictor variable speed to predict and
explain the response variable dist.
Broadly speaking, we would like to model the relationship between 𝑋 and 𝑌 using the form
𝑌 = 𝑓(𝑋) + 𝜖.
The function 𝑓 describes the functional relationship between the two variables, and the 𝜖
term is used to account for error. This indicates that if we plug in a given value of 𝑋 as
input, our output is a value of 𝑌, within a certain range of error. You could think of this in a
number of ways:
• Response = Prediction + Error
• Response = Signal + Noise
• Response = Model + Unexplained
• Response = Deterministic + Random
• Response = Explainable + Unexplainable
What sort of function should we use for 𝑓(𝑋) for the cars data?
We could try to model the data with a well-chosen line that will summarize the relationship
between stopping distance and speed quite well. As speed increases, the distance required
to come to a stop increases. There is still some variation about this line, but it seems to
capture the overall trend.
With this in mind, we would like to restrict our choice of 𝑓(𝑋) to linear functions of 𝑋. We
will write our model using 𝛽1 for the slope, and 𝛽0 for the intercept,
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖.
The least squares estimates are computed from sums of the data:
𝛽̂1 = (𝑛 Σ𝑥𝑖𝑦𝑖 − Σ𝑥𝑖 Σ𝑦𝑖) / (𝑛 Σ𝑥𝑖² − (Σ𝑥𝑖)²), 𝛽̂0 = 𝑦̄ − 𝛽̂1 𝑥̄.
To compute them by hand, first store the predictor and response, and count the observations.
x = cars$speed
y = cars$dist
(n = length(x))
## [1] 50
(Sumxy = sum(x*y))
## [1] 38482
(Sumx = sum(x))
## [1] 770
(Sumy = sum(y))
## [1] 2149
(Sumxsquare = sum(x*x))
## [1] 13228
Then finally calculate 𝛽̂0 and 𝛽̂1 by substituting these sums into the formulas above.
beta_1_hat = (n*Sumxy - Sumx*Sumy)/(n*Sumxsquare -Sumx^2)
beta_0_hat = mean(y) - beta_1_hat * mean(x)
c(beta_0_hat, beta_1_hat)
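These are the least squares estimates; the values match the lm() output shown later.
## [1] -17.579095   3.932409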
The estimate 𝛽̂1 = 3.93 tells us that for an increase in speed of one mile per hour,
the estimated mean stopping distance increases by 3.93 feet. Here we should be sure to
specify we are discussing an estimated quantity. Recall that 𝑦̂ is the estimated mean of 𝑌, so
𝛽̂1 tells us how the estimated mean of 𝑌 is affected by changing 𝑥.
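As a quick arithmetic check of this interpretation, predictions one mile per hour apart differ by exactly the slope: (−17.58 + 3.93 × 11) − (−17.58 + 3.93 × 10) = 3.93 feet.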
The intercept parameter 𝛽0 tells us the mean stopping distance for a car traveling zero
miles per hour. (Not moving.) The estimate 𝛽̂0 = −17.58 tells us that the estimated mean
stopping distance for a car traveling zero miles per hour is −17.58 feet. So when you apply
the brakes to a car that is not moving, it moves backwards? This doesn’t seem right.
(Extrapolation, which we will see later, is the issue here.)
Making Predictions
We can now write the fitted or estimated line,
𝑦̂ = 𝛽̂0 + 𝛽̂1 𝑥.
In this case,
𝑦̂ = −17.58 + 3.93𝑥.
There is an issue we saw when interpreting 𝛽̂0 = −17.58. This is equivalent to making a
prediction at 𝑥 = 0. We should not be confident in the estimated linear relationship outside
of the range of data we have observed.
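As a quick check, the observed speeds span only 4 to 25 miles per hour, so any prediction outside this interval is extrapolation:
range(cars$speed)
## [1]  4 25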
We can now use this line to make predictions. First, let’s see the possible 𝑥 values in the
cars dataset. Since some 𝑥 values may appear more than once, we use the unique() function
to return each distinct value only once.
unique(cars$speed)
## [1] 4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25
Let’s make a prediction for the stopping distance of a car traveling at 8 miles per hour.
𝑦̂ = −17.58 + 3.93 × 8
beta_0_hat + beta_1_hat * 8
## [1] 13.88018
This tells us that the estimated mean stopping distance of a car traveling at 8 miles per
hour is 13.88 feet.
To get predictions for the entire dataset, we simply feed all of the observed speeds to our
regression model.
speed = cars$speed
actual_distance <- cars$dist
# compute predicted stopping distances from the fitted line
stop_distance <- beta_0_hat + beta_1_hat * speed
# prediction error: observed minus predicted distance
Prediction_error <- actual_distance - stop_distance
# combine variables into one table using rbind, then transpose the table
output <- t(rbind(speed, actual_distance, stop_distance, Prediction_error))
colnames(output) <- c("Speed", "Stopping Distance (ft)", "Predicted Distance", "Prediction Error")
library(psych)
headTail(as.data.frame(output)) # display first 4 and last 4 rows only
Residuals
If we think of our model as “Response = Prediction + Error,” we can then write it as
𝑦 = 𝑦̂ + 𝑒.
We then define a residual to be the observed value minus the predicted value.
𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖
The set of residuals or prediction errors is indicated in the last column of the output table.
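As a sanity check, a standard property of least squares with an intercept is that the residuals sum to zero (up to floating point error). A minimal check using the variables computed above:
# least squares residuals sum to (numerically) zero
sum(Prediction_error)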
The lm Function
So far we have done regression by deriving the least squares estimates, then writing simple
R commands to perform the necessary calculations. Since this is such a common task, this is
functionality that is built directly into R via the lm() command.
The lm() command is used to fit linear models which actually account for a broader class
of models than simple linear regression, but we will use SLR as our first demonstration of
lm(). The lm() function will be one of our most commonly used tools, so you may want to
take a look at the documentation by using ?lm. You’ll notice there is a lot of information
there, but we will start with just the very basics. This is documentation you will want to
return to often.
We’ll continue using the cars data, and essentially use the lm() function to check the work
we had previously done.
library(moderndive)
stop_dist_model = lm(dist ~ speed, data = cars)
get_regression_table(stop_dist_model) %>%
knitr::kable()
This line of code fits our very first linear model. The syntax should look somewhat familiar.
We use the dist ~ speed syntax to tell R we would like to model the response variable
dist as a linear function of the predictor variable speed. In general, you should think of the
syntax as response ~ predictor. The data = cars argument then tells R that the dist
and speed variables are from the dataset cars. We then store this result in a variable
stop_dist_model. The resulting regression table is printed in nicely formatted output.
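The description below refers to the default printout obtained by typing the model name at the console:
stop_dist_model
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Coefficients:
## (Intercept)        speed
##     -17.579        3.932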
We see that it first tells us the formula we input into R, that is lm(formula = dist ~
speed, data = cars). We also see the coefficients of the model. We can check that these
are what we had calculated previously. (Minus some rounding that R is doing when
displaying the results. They are stored with full precision.)
c(beta_0_hat, beta_1_hat)
Next, it would be nice to add the fitted line to the scatterplot. To do so we will use the
abline() function.
plot(dist ~ speed, data = cars,
xlab = "Speed (in Miles Per Hour)",
ylab = "Stopping Distance (in Feet)",
main = "Stopping Distance vs Speed",
pch = 20,
cex = 2,
col = "grey")
abline(stop_dist_model, lwd = 3, col = "darkorange")
The abline() function is used to add lines of the form 𝑦 = 𝑎 + 𝑏𝑥 to a plot. (Hence abline.)
When we give it stop_dist_model as an argument, it automatically extracts the regression
coefficient estimates (𝛽̂0 and 𝛽̂1) and uses them as the slope and intercept of the line. Here
we also use lwd to modify the width of the line, as well as col to modify the color of the line.
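Since abline() also accepts an intercept a and slope b directly, we could draw the same line from our hand-computed estimates:
# same line, drawn from the hand-computed intercept and slope
abline(a = beta_0_hat, b = beta_1_hat, lwd = 3, col = "darkorange")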
The “thing” that is returned by the lm() function is actually an object of class lm which is a
list. The exact details of this are unimportant unless you are seriously interested in the
inner-workings of R, but know that we can determine the names of the elements of the list
using the names() command.
names(stop_dist_model)
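For an lm object this returns the names of its stored components:
##  [1] "coefficients"  "residuals"     "effects"       "rank"
##  [5] "fitted.values" "assign"        "qr"            "df.residual"
##  [9] "xlevels"       "call"          "terms"         "model"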
We can then use this information to, for example, access the first 7 residuals using the $
operator.
stop_dist_model$residuals[1:7]
## 1 2 3 4 5 6 7
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584 -3.744993
Another way to access stored information in stop_dist_model is with the coef(), resid(),
and fitted() functions. These return the coefficients, residuals, and fitted values,
respectively. (Again, we display only the first 7 residuals and fitted values.)
coef(stop_dist_model)
## (Intercept) speed
## -17.579095 3.932409
resid(stop_dist_model)[1:7]
## 1 2 3 4 5 6 7
## 3.849460 11.849460 -5.947766 12.052234 2.119825 -7.812584 -3.744993
fitted(stop_dist_model)[1:7]
## 1 2 3 4 5 6 7
## -1.849460 -1.849460 9.947766 9.947766 13.880175 17.812584 21.744993
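The fitted object can also make predictions directly via R's predict() function; for example, the 8 miles per hour prediction from earlier:
predict(stop_dist_model, newdata = data.frame(speed = 8))
##        1
## 13.88018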
[Figures omitted: fitted curves from three regression methods overlaid on scatterplots of the data.]
Legend:
Blue line – generated by simple linear regression analysis
Green band or curve – generated by random forest regression analysis
Red band or curve – generated by support vector regression analysis
First dataset: SLR RMSE = 0.007; random forest RMSE = 0.005; support vector RMSE = 0.007
Second dataset: SLR RMSE = 13.81; random forest RMSE = 2.93; support vector RMSE = 4.77
The predictive power of SLR begins to deteriorate when the relationship between the
variables is not linear.
Third dataset: SLR RMSE = 0.90; random forest RMSE = 0.30; support vector RMSE = 0.28