Regression and correlation notes
10.0 Introduction
In the previous chapters we mainly worked with univariate data (where observations are made on one
variable only), although we did consider bivariate data when we studied correlation and in the chapter on jointly
distributed random variables (discrete case). We return to the bivariate case in this chapter because we are often interested in
how two or more variables are related to one another, e.g. an educational psychologist may be interested in how
vocabulary size is related to age, or a farmer may be interested in how crop yield is related to rainfall.
10.1 Correlation
A scatterplot visually depicts how two continuous variables 𝑋 and 𝑌 are related to each other through a plot
of the observations (𝑥𝑖 , 𝑦𝑖 ) for these variables.
Example 10.1: Length of time (in hours) that nine MS1 students slept just before they sat for their first test
(H), and the marks they scored in the test (in %) (M)
Hours (H):   2   3   4   5   6   7   8   9   10
Marks % (M): 45  45  58  50  57  68  70  62  75
Figure 10.1: Scatterplot of hours slept (H) versus marks obtained (M)
Is there a relationship between hours slept and marks obtained? What is the direction of the relationship?
Can we describe this by some measure to say how strongly related the variables are?
Yes. There is a positive linear relationship between X (hours slept) and Y (marks). The correlation coefficient (𝑟) describes the
strength of the linear relationship between the independent and dependent variables.
The figure above, sourced from https://communitymedicine4asses.com/2013/12/27/correlation/, shows that
the value of a car and its mileage are negatively related (i.e. the more mileage the
car has done, the cheaper it will be), that there is no relationship between the colour of a car and the quality
of the car (hence the zero correlation), and that there is a positive linear relationship between car insurance cost
and the number of accidents (with more accidents the insurance cost is expected to increase).
The correlation coefficient is calculated as

r = SPxy / √(SSx · SSy)

where (all sums run from i = 1 to n)

SPxy = Σ xi yi − (Σ xi)(Σ yi)/n

SSx = Σ xi² − (Σ xi)²/n

SSy = Σ yi² − (Σ yi)²/n
The correlation coefficient (𝑟) measures the strength of the linear relationship between the independent
and dependent variables. Its values lie in the range −1 to +1.
Correlation            Interpretation
𝑟 equal to +1          Perfect positive linear relationship
𝑟 far away from −1     Weak negative linear relationship
The correlation coefficient of 0.895 indicates that a strong positive linear relationship exists between hours
slept and marks obtained.
Using the equation for correlation gives us:
Hours (h)   Marks % (m)   h²     m²      hm
2           45            4      2025    90
3           45            9      2025    135
4           58            16     3364    232
5           50            25     2500    250
6           57            36     3249    342
7           68            49     4624    476
8           70            64     4900    560
9           62            81     3844    558
10          75            100    5625    750
Totals:
54          530           384    32156   3393
r = SPHM / √(SSH · SSM) = 213 / √((60)(944.8888889)) = 0.895

where

SPHM = Σ hi mi − (Σ hi)(Σ mi)/n = 3393 − (54)(530)/9 = 213

SSH = Σ hi² − (Σ hi)²/n = 384 − (54)²/9 = 60

SSM = Σ mi² − (Σ mi)²/n = 32156 − (530)²/9 = 944.8888889
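As a quick check, the following R sketch (the vector names h and m are chosen here for illustration) recomputes SPHM, SSH, SSM and r from the data and compares the result with R's built-in cor() function:

h <- c(2,3,4,5,6,7,8,9,10)          # hours slept
m <- c(45,45,58,50,57,68,70,62,75)  # marks (%)
n <- length(h)
SP_hm <- sum(h*m) - sum(h)*sum(m)/n # 3393 - (54)(530)/9 = 213
SS_h  <- sum(h^2) - sum(h)^2/n      # 384 - 54^2/9 = 60
SS_m  <- sum(m^2) - sum(m)^2/n      # 32156 - 530^2/9 = 944.889
SP_hm / sqrt(SS_h * SS_m)           # approximately 0.895
cor(h, m, method = "pearson")       # same value from the built-in function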
If the correlation coefficient indicates that there is a linear relationship between the independent variable (𝑋)
and dependent variable (𝑌), then we could fit a linear model to the data by regressing the dependent variable
on the independent variable. The dependent variable is also referred to as the Response variable and the
independent variable is also known as the Explanatory or Predictor variable. A few examples include:
examining the relationship between an increase in temperature (𝑋, Explanatory) and the yield (𝑌, Response)
from a chemical reaction run at that temperature; the mark of a student (𝑌) and the number of
hours spent studying (𝑋); or the crop yield (𝑌) based on the amount of rainfall received (𝑋).
10.2 The Simple Linear regression model
If there is a linear relationship in the data, then there are population parameters 𝛽0, 𝛽1, and 𝜎 2 such that for
any fixed value of 𝑥 of the random variable 𝑋, the Response or dependent variable (𝑌) is related to this fixed
𝑥 by a linear model given by: 𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀.
The linear model 𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀 is referred to as a probabilistic model. A probabilistic model accounts for
random deviation or random error denoted by ε.
• The random deviation or error ε is a random variable and it is assumed to be normally distributed with
mean 0 and variance 𝜎 2 , that is 𝜀~𝑁(0, 𝜎 2 ) with this mean and variance being the same regardless of
the value of 𝑥 that the model takes on.
• Since the 𝑛 (𝑥, 𝑦) pairs (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) are regarded as being generated independently
of each other, each of the 𝑌’s given by 𝑌𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖 are independent of each other as a result
of the 𝜀𝑖′ s being independent of each other.
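To make the probabilistic model concrete, the following R sketch simulates data from Y = β0 + β1x + ε; the parameter values β0 = 2, β1 = 0.5 and σ = 1 are arbitrary choices used only for this illustration:

set.seed(1)                       # for a reproducible simulation
beta0 <- 2; beta1 <- 0.5          # illustrative regression coefficients (not from the notes)
sigma <- 1                        # illustrative error standard deviation
x <- 1:20                         # fixed x values chosen by the experimenter
eps <- rnorm(length(x), mean = 0, sd = sigma)  # random errors, eps ~ N(0, sigma^2)
y <- beta0 + beta1*x + eps        # responses generated by the probabilistic model
plot(x, y)                        # scatterplot shows a linear trend plus random scatter
abline(beta0, beta1)              # the true (population) regression line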
The following figure from Devore and Berk (Figure 12.4) illustrates the simple linear regression model.
In the above Figure, 𝑌 and 𝜀 are random variables (so they will have some distribution with mean and
variance), 𝑥 is a fixed value of the random variable 𝑋, and 𝛽0 and 𝛽1 are constants and are referred to as the
regression coefficients.
• Y is the dependent random variable; its value is determined by the fixed x together with the random error (ε)
• The variable X is fixed at a value xi. We can choose the values we give to it; hence it is called the
independent variable.
• 𝛽0 is the unknown intercept coefficient (𝑦-intercept) - the value of 𝑦 when 𝑥 = 0
• 𝛽1 is the slope coefficient of the population or true regression line 𝛽0 + 𝛽1 𝑥 . It is interpreted as the true
average increase (or decrease) in 𝑌 associated with every one-unit increase (or decrease) in 𝑥
• 𝜀 is the error term or the part of 𝑌, that the regression model is unable to explain.
The population parameters 𝛽0 , 𝛽1 and 𝜎 2 are generally unknown and will need to be estimated. The
parameters are estimated by drawing a random sample consisting of 𝑛 (𝑥𝑖 , 𝑦𝑖 ) pairs i.e:
(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) from the population of interest and then utilising this sample data to obtain the
sample estimates 𝑏0 , 𝑏1 and 𝑠 2 which are point estimates for the unknown population parameters 𝛽0 , 𝛽1 and
𝜎 2 . The estimated linear regression line 𝑦̂ = 𝑏0 + 𝑏1 𝑥, is obtained by drawing a straight line through the
sample data such that it is “closest” to as many of the sample data points as possible. The “closeness” of the
data is obtained by calculating the deviations between the observed 𝑦 values and the estimated 𝑦̂ values.
10.2.1 Finding the best fitting line or least squares regression line
If the scatterplot of the bivariate data consisting of 𝑛 (𝑥, 𝑦) pairs i.e: (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) appears to
have a linear pattern we can then relate the set of observations (𝑥𝑖 , 𝑦𝑖 ) of the two variables, X and Y
respectively by an equation for a straight line viz: 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖 . This is done through a process called
Simple Linear regression where 𝑋 represents the independent variable and 𝑌 represents the dependent
variable. The best model will be the straight line that is closest to as many of the observed (𝑥𝑖 , 𝑦𝑖 )
measurements as possible.
From the previous figure, one can observe how much a point (𝑥𝑖 , 𝑦𝑖 ) deviates or how far it is from the fitted
line. The predicted values 𝑦̂1, 𝑦̂2,…, 𝑦̂𝑛 are obtained by substituting the observed 𝑥 values 𝑥1 , 𝑥2 , … , 𝑥𝑛 into the
equation for the regression line. The model can be estimated by minimising the sum of the squared deviations
or the sum of the squared errors between the observed 𝑦𝑖 values and the predicted 𝑦̂𝑖 values. The deviations
are also referred to as residuals.
Definition: Residuals: The vertical deviations between the actual or observed 𝑦’s and the predicted 𝑦 values
given by the line (i.e. 𝑦̂) are called residuals, denoted ei =𝑦𝑖 − 𝑦̂𝑖
We want these residuals to be as small as possible to get the best line i.e. we want to minimise the residuals.
Since the residuals may be positive and negative numbers, they will cancel each other out if we simply sum
them, hence we work with the squared residuals to avoid this. Thus, we are finding the line 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖
which minimises the sum of the squared residuals (Σ ei²).
In the estimated least squares regression line 𝑦̂ = 𝑏0 + 𝑏1 𝑥 obtained from the observed sample data, 𝑏0
represents the intercept (the 𝑦 value when 𝑥 is 0) and 𝑏1 represents the slope of the line (which represents
the change in the dependent variable for every 1-unit change in the independent variable).
Minimising the sum of the squared residuals (Σ ei²) is undertaken by finding the partial derivatives of

Σ ei² = Σ (yi − ŷi)² = Σ (yi − b0 − b1 xi)²

with respect to b0 and b1, thereafter setting both to zero and solving.
Using partial derivatives and differentiating with respect to b0 gives:

∂/∂b0 Σ (yi − b0 − b1 xi)² = −2 Σ (yi − b0 − b1 xi)

Equating to zero:

−2 Σ (yi − b0 − b1 xi) = 0
⇒ Σ (yi − b0 − b1 xi) = 0
⇒ Σ yi = n b0 + b1 Σ xi

Similarly, differentiating with respect to b1 gives:

∂/∂b1 Σ (yi − b0 − b1 xi)² = −2 Σ (yi − b0 − b1 xi) xi

Equating to zero:

−2 Σ (yi − b0 − b1 xi) xi = 0
⇒ Σ (yi − b0 − b1 xi) xi = 0
⇒ Σ xi yi = b0 Σ xi + b1 Σ xi²

Solving these two normal equations simultaneously gives:

b1 = [Σ xi yi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n] = SPxy / SSx
and substituting b1 into the first normal equation and making b0 the subject gives:

b0 = (Σ yi)/n − b1 (Σ xi)/n = ȳ − b1 x̄
Applying second-order differentiation confirms that this is a
minimum and not a maximum or point of inflection.
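As a check on these formulas, the following R sketch computes b1 and b0 for the hours/marks data of Example 10.1 (object names chosen here for illustration) and compares them with the coefficients returned by lm():

h <- c(2,3,4,5,6,7,8,9,10)
m <- c(45,45,58,50,57,68,70,62,75)
n <- length(h)
SP_hm <- sum(h*m) - sum(h)*sum(m)/n  # 213
SS_h  <- sum(h^2) - sum(h)^2/n       # 60
b1 <- SP_hm / SS_h                   # 3.55
b0 <- mean(m) - b1*mean(h)           # 37.589
c(b0, b1)
coef(lm(m ~ h))                      # least squares estimates from lm() agree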
Now remember that for the line 𝑦̂ = 𝑏0 + 𝑏1 𝑥 , its intercept 𝑏0 and slope 𝑏1 are obtained for a particular data
set. The line can be used to predict the value of 𝑦̂ for a given 𝑥 value that is within the range of 𝑥 values
observed in this particular data set. One cannot predict the 𝑦̂ value for a given 𝑥 value that is outside the range
of 𝑥 values as this would amount to extrapolation and the prediction would not be reliable. The slope indicates
the average or expected change in the dependent variable for every 1-unit change in the independent variable.
Also, apart from having the property that the sum of the squared deviations is minimised, the line 𝑦̂ = 𝑏0 + 𝑏1 𝑥
also has the property that the mean residual is zero:

ē = (1/n) Σ ei = (1/n) Σ (yi − ŷi) = (1/n) Σ (yi − (b0 + b1 xi)) = ȳ − b0 − b1 x̄ = 0,

since b0 = ȳ − b1 x̄.
Now remember that the line 𝑦̂ = 𝑏0 + 𝑏1 𝑥, obtained from 𝑛 (𝑥, 𝑦) pairs in a particular sample data set is an
estimate for the population where 𝑏0 and 𝑏1 are estimates of 𝛽0 and 𝛽1 respectively. If we change the data set
to a new or different set of 𝑛 (𝑥, 𝑦) pairs we will obtain a different estimated line 𝑦̂ = 𝑏0 + 𝑏1 𝑥, with
different values for 𝑏0 and 𝑏1 .
Recall that it is assumed that 𝜀𝑖 ~𝑁(0, 𝜎 2 ).
Now each 𝑌𝑖 ~𝑁(𝛽0 + 𝛽1 𝑥𝑖 , 𝜎 2 ) since 𝐸(𝑌|𝑥𝑖 ) = 𝛽0 + 𝛽1 𝑥𝑖 , as illustrated in Devore and Berk
(figure panel (a): distribution of 𝜀 about the true regression line).
We have already estimated the slope and intercept parameters, thus we still need to estimate a third parameter
𝜎 2 . We will estimate this parameter in section 10.4.2. The variance parameter 𝜎 2 represents the amount of
variability inherent in the regression model. If σ² is close to zero then almost all the (x, y) pairs will be
close to the population regression line. If σ² is far from 0 then most of the (x, y) pairs in the
scatterplot will be spread out, far away from the population regression line. Since the equation of the true
population regression line is unknown, the estimate is based on the extent to which the sample data deviates
from the estimated line. We require an estimate of 𝜎 2 for the confidence intervals that we will calculate
shortly. Since 𝑏0 and 𝑏1 are the best estimates (you will study in 2nd year that they are both unbiased
estimators) of 𝛽0 and 𝛽1 respectively, we can safely say that the distribution of the errors or residuals
ei =𝑦𝑖 − 𝑦̂𝑖 from the sample should estimate the distribution of 𝜀 in the population.
Previously we saw that the sample variance 𝑠 2 is the best estimator of the population parameter 𝜎 2 . In much
the same way, we can use the sample variance of the residuals ei to estimate 𝜎 2 , the population variance of
𝜀. The sample variance of the residuals is

s² = SSE/(n − 2) = Σ (yi − ŷi)² / (n − 2).

It has 𝑛 − 2 degrees of freedom because we estimated two parameters 𝛽0 and 𝛽1 for the regression line and we are using the estimates in the equation
above to find the sample variance 𝑠 2 . Now, s² = SSE/(n − 2) is also called the mean squared error (MSE). This
can be calculated by the computationally easier formula:

MSE = SSE/(n − 2) = (SSy − b1 SPxy)/(n − 2).
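For the hours/marks data of Example 10.1, a short R sketch (object names chosen for illustration) confirms that the computational formula for SSE agrees with the residual-based definition of s²:

h <- c(2,3,4,5,6,7,8,9,10)
m <- c(45,45,58,50,57,68,70,62,75)
n <- length(h)
fit <- lm(m ~ h)
SSE_resid <- sum(resid(fit)^2)        # definition: sum of squared residuals
SS_m  <- sum(m^2) - sum(m)^2/n
SP_hm <- sum(h*m) - sum(h)*sum(m)/n
b1 <- unname(coef(fit)[2])
SSE_short <- SS_m - b1*SP_hm          # computational formula SSy - b1*SPxy
c(SSE_resid, SSE_short)               # both about 188.74
SSE_resid/(n - 2)                     # s^2 = MSE, about 26.96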
Returning to Example 10.2 (tree mass versus atmospheric concentration of CO2), the strength of the linear relationship is calculated by finding the Pearson correlation coefficient between the
two variables Mass and CO2. Using R with the following commands gives:
> cor(Mass,CO2, method="pearson") #Calc Pearson correlation
[1] 0.9392405
The correlation coefficient 𝑟 = 0.9392405 is close to +1 and hence we say that the two variables are strongly,
positively linearly related. Homework exercise: Verify the answer for the correlation coefficient by using
the equations!
Tree mass is the Dependent or Response variable and atmospheric concentration of CO2 is the Independent
or Explanatory variable. By regressing Mass on CO2, we can obtain a linear model relating the two.
We can find the least squares regression line using:

b1 = [Σ xi yi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n] = SPxy / SSx

b1 = SPxy / SSx = [15441.4 − (4908)(22.7)/8] / [3190248 − (4908)²/8] = 1514.95 / 179190 = 0.00845

and b0 = ȳ − b1 x̄ = 22.7/8 − 0.00845(4908/8) = −2.349
Hence the estimated linear regression line is 𝑦̂ = −2.349 + 0.00845𝑥
We can superimpose the least squares line on the scatterplot with the following code in R:
> abline(linear.fit) #plots linear regression line on data
Figure 10.6: Plot of Mass versus CO2 with superimposed least squares line
The least squares regression coefficients are obtained with the following R code: lm(Dependent variable ~
Independent variable)
> linear.fit=lm(Mass~CO2) #creates the linear model
Interpretation of the slope of the least squares line: The slope is positive and shows that tree mass increases
by about 0.00845 units on average for every one-unit increase in CO2 concentration.
To calculate the fitted values (ŷ) we use the following R code:
> fitted(linear.fit)
1 2 3 4 5 6 7 8
1.100114 1.100114 2.334461 2.334461 3.399720 3.399720 4.515705 4.515705
One can also find the Residuals = Observed values − Fitted values (e = y − ŷ) using the following code:
> resid(linear.fit)
1 2 3 4 5 6
-0.0001138456 0.1998861544 -0.7344611865 0.1655388135 -0.3997198504 0.9002801496
7 8
-0.3157051175 0.1842948825
The least squares line can be used for two purposes:
For a fixed 𝑥 value (say 𝑥 ∗ ), 𝑦̂ = 𝑏0 + 𝑏1 𝑥 gives either
(i) The estimate of the expected value of 𝑌 when 𝑥 = 𝑥 ∗ (as seen in Figure 10.4) or
(ii) a point prediction of the 𝑌 value when 𝑥 = 𝑥 ∗, i.e. we predict tree mass for a given atmospheric
concentration of CO2.
Note that we must caution against extrapolation, hence we can only predict the dependent or Response
variable, in this case 𝑌 (Mass) value for some fixed value of 𝑥 where the value of 𝑥 must be between
(min(𝑥) ; max (𝑥)) as the least squares line was generated using this specific range of 𝑥 values.
A further summary for the model can be obtained in R using the following R code:
> summary(linear.fit) # gives the summary of the model
Call:
lm(formula = Mass ~ CO2)
Residuals:
Min 1Q Median 3Q Max
-0.73446 -0.33671 0.08271 0.18819 0.90028
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.349295 0.796567 -2.949 0.025637 *
CO2 0.008454 0.001261 6.702 0.000536 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Once we have decided that the straight line is appropriate, we need to assess the accuracy of the predictions that are based on the least
squares line. Two numerical measures that give us an idea of how well the model fits the sample data are the
coefficient of determination and the standard deviation about the least squares line.
Figure 10.7: Variation in the Dependent variable (extracted from Devore &Berk Figure 12.12)
In the figure labelled (a) above, all the points lie on the least squares line, so all the variation in the 𝑦’s is
attributed to the 𝑥’s and hence there is no unexplained variation. Hence 𝑆𝑆𝐸, which measures the variation
that is unexplained by the model, will be 0.
In Figure (b) the points will not all lie on the line but most will be very close to the line so there is a small
amount of unexplained variation and hence SSE will be small.
Figure (c) will have a much larger variation than (b) and hence SSE will be much larger.
To understand how 𝑟 2 (the proportion of the variability in the dependent variable that is explained by the model) is calculated, we need to consider the
variability in the 𝑦 values. This can be done in two ways: looking at the total variation that is unexplained by
the model (SSE) or looking at the total variation in the observed 𝑦’𝑠, as shown in the Figure below.
Figure 10.8: Extracted from Devore and Berk (Figure 12.13) to illustrate variation in the model
Understanding the variation in the model is explained in further detail below by considering the equations
for SSE and SST.
a) We can look at how far the 𝑦’𝑠 are from their respective 𝑦̂ values (Figure 10.8 – LHS picture) – this
is referred to as the Error Sum of Squares or Sum of Squares of the Errors (SSE) where
SSE = Σ ei² = Σ (yi − ŷi)² = Σ (yi − (b0 + b1 xi))²
It is interpreted as the amount of variability in the 𝑦’𝑠 that is unexplained by the model.
OR
b) We can look at the total amount of variation in the observed 𝑦’𝑠 (Figure 10.8 – RHS picture) i.e.:
how spread out the observed 𝑦’𝑠 are from the mean 𝑦 value (𝑦̅). This is called the Total sum of
squares (denoted SST) where SST = ∑𝑛𝑖=1(𝑦𝑖 − 𝑦̅)2 = 𝑆𝑆𝑦
Note the computationally easier equation: SSy = Σ yi² − (Σ yi)²/n
So, SSE is the squared deviation about the least squares line and SST is the squared deviation about
the horizontal line for 𝑦̅
SST = Σ (yi − ȳ)²
    = Σ [(yi − ŷi) + (ŷi − ȳ)]²
    = Σ [(yi − ŷi)² + 2(yi − ŷi)(ŷi − ȳ) + (ŷi − ȳ)²]
    = Σ (yi − ŷi)² + Σ (ŷi − ȳ)²,

as the middle term evaluates to 0: Σ (yi − ŷi)(ŷi − ȳ) = (b0 − ȳ) Σ ei + b1 Σ ei xi = 0, since both Σ ei = 0 and Σ ei xi = 0 by the normal equations.
Thus SST = SSE + SSR, where SSR = Σ (ŷi − ȳ)² is defined as the regression sum of squares.

So, the ratio SSE/SST is the proportion of total variation that is unexplained by the model and hence

1 − SSE/SST = r²

is the proportion of variation that is explained by the model. This is referred to as the Coefficient
of Determination.
The coefficient of Determination can be obtained using the following R code or it can be read off from the
previous output given above (Multiple R-squared):
> summary(linear.fit)$r.squared # gives the coefficient of determination
[1] 0.8821727
To calculate the standard deviation 𝑠 we need to calculate the unexplained sum of squares 𝑆𝑆𝐸. Recall that

SSE = Σ ei² = Σ (yi − ŷi)²

The unbiased estimator 𝑠 2 of the variance 𝜎 2 is SSE/(n − 2). Using the above equation for calculation can be
tedious, hence we can use the computationally easier equation: SSE = SSy − b1 SPxy
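Assuming the linear.fit object for Mass regressed on CO2 created earlier is still available, SSE, s and r² can be extracted directly from the fitted model in R:

> sum(resid(linear.fit)^2)          # SSE, the unexplained sum of squares
> summary(linear.fit)$sigma^2       # s^2 = SSE/(n-2), the MSE (about 0.2851)
> summary(linear.fit)$r.squared     # coefficient of determination, 0.8821727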
10.4.3 Check if the assumptions of the model are met
Recall our model: 𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀 had the following assumptions made about the model:
The error terms:
i) are Normally distributed with mean 0 and variance 𝜎 2 hence, 𝜀𝑖 ~𝑁(0, 𝜎 2 )
ii) have the same or equal variance for the different 𝑥 values. This is referred to as homoscedasticity
iii) are independent of one another
These assumptions may or may not be true and hence will need to be checked. If they are true then the
observed residuals 𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖 should also behave in the same way as 𝜀𝑖 and should be approximately
normally distributed with constant variances for the different 𝑥 values.
The assumptions can be checked by examining the following residual plots: (reference:
https://data.library.virginia.edu/diagnostic-plots/)
Plot 1: Residuals vs Fitted
The plots show that there is no clear pattern in Case 1; however, in Case 2 we can see the pattern of a parabola
and hence there is a non-linear relationship in the data, so the assumptions for the linear model are not
satisfied. This non-linear relationship was not captured by the model and is left in the residuals.
Plot 2: Normal Q-Q
Normal Q-Q plot indicates if the residuals are normally distributed. If the residuals follow the straight dashed
line in the plot, then they are normally distributed.
Case 1 shows that the residuals are Normally distributed as they follow the straight dashed line quite closely,
although the observation numbered 38 looks far away from the other points. Case 2 deviates quite
distinctly from the straight dashed line and hence the normality assumption is violated. Although observation
#38 falls away from the rest of the data in Case 1, at this stage we would not be too concerned and say that
Case 1 satisfies the normality assumption.
Let’s look at the next plot (Scale-Location) while keeping in mind that observation #38 might be a potential
problem. Observation #38 is quite far away from the other points in the plot so it seems to be breaking away
from the pattern of the other values and is termed to be an outlier. An outlier can have an influence on the
slope of the model.
Plot 3: Scale-Location
The Scale-location plot also known as the Spread-Location plot, shows if the residuals are spread equally
along the 𝑥 values. If you see a horizontal line with equally (randomly) spread out points then the assumption
of equal variance or homoscedasticity is satisfied.
In these plots the residuals appear randomly and equally spread in Case 1, whereas in Case 2
the residuals begin to fan out or spread wider along the 𝑥-axis. Because the residuals spread out wider
and wider, the red smooth line is not horizontal and has a steep incline.
Plot 4: Residuals vs Leverage
The Residuals versus Leverage plot helps us to identify whether any of the outliers are influential. Not all outliers are
influential and hence would not necessarily affect the regression line, for example if they follow the
trend of the majority of cases in the data; however, those that could influence the regression line need closer
examination. Points with extreme values of X are said to have high leverage, and high leverage points
have the ability to move the line more. Unlike in the other plots, we do not examine the pattern in the data but
we look for values that lie outside the dashed lines, which mark a measure referred to as the Cook's distance
score. Attention needs to be paid to extreme values that occur at the upper right corner or at the lower right
corner of the plot, as these points have high leverage and can be influential against a regression line. Points
that fall outside the dashed Cook's distance lines are regarded as being influential to the
model; if these points are excluded, the regression results would change. Note that the numbers within the
plot refer to the items or cases in the dataset that need to be examined, e.g. in the plot below 32 refers to the 32nd
case in the dataset. See video at https://www.youtube.com/watch?v=xc_X9GFVuVU
The plots above show the contour lines for the Cook’s distance. Cook’s distance is another measure of the
importance of each observation to the regression. Generally, a Cook’s distance of more than 1 would indicate
an influential value or possible outlier and possibly a poor model. In the above plots there seem to be no
influential points outside the Cook’s distance lines. In Case 2 however, observation #49 appears to be far
outside the Cook’s distance lines and is thus an influential point. Excluding it from the data could influence
the regression model.
If similar cases occur across the four plots as outliers, then closer attention needs to be paid to the points to
see if there is anything special for that particular case or are there errors in the capturing of the data etc. To
use R’s regression diagnostic plots, we set up the regression model as an object and create a plotting
environment of two rows and two columns. Then we use the plot() command, treating the model as an
argument. Applying this to the data in Example 12.5 (Devore) gives the following:
> par(mfrow = c(2,2)) # splits the screen in 4 parts
> plot(linear.fit) #plots the diagnostic plots
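If numerical influence measures are preferred to reading them off the plot, Cook's distances can also be listed directly with R's cooks.distance() function (using the linear.fit object from above):

> cooks.distance(linear.fit)              # Cook's distance for each observation
> which(cooks.distance(linear.fit) > 1)   # flags any observations with D greater than 1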
If there is a non-linear relationship between the dependent variable and independent variable, then the pattern
indicating the non-linear relationship could be seen in this plot. If the residuals seem to be equally spread
around a horizontal line without any clear pattern, this would indicate that we do not have a non-linear
relationship.
In using the R diagnostic plots and influence statistics to diagnose how well our model is fitting the data, we
see that the first plot (Residuals versus Fitted values) is a simple scatterplot between residuals and
predicted values. It should look more or less random. Here, the residuals seem to be equally spaced about the
horizontal line.
The second plot (normal Q-Q) shows the points closely plotted along the dashed line hence the errors are
normally distributed.
The third plot (Scale-Location), like the first, should look random. No patterns should appear in this plot.
Ours has an upside-down V-shape.
The last plot (Residuals vs Leverage, with Cook’s distance D) tells us which points have the greatest influence on the regression (leverage
points). In our plot, the values of D are below 0.5, so points 3 and 6, for example, which appear to be outliers,
are not regarded as influential.
“So, what does having patterns in residuals mean to your research? It’s not just a go-or-stop sign. It tells
you about your model and data. Your current model might not be the best way to understand your data if
there’s so much good stuff left in the data.” https://data.library.virginia.edu/diagnostic-plots/
Just as a sample statistic such as the sample mean varies from sample to sample, in much the same way the value of 𝑏1 will vary from sample to sample. Hence 𝑏1 will have a distribution with
some mean and variance. It can be shown (in 2nd year) that:
E(b1) = β1   and   Var(b1) = σ² / Σ (xi − x̄)² = σ² / SSx
Thus b1 is an unbiased estimator of β1 with standard deviation σ/√SSx, hence

z = (b1 − β1)/σb1 = (b1 − β1)/(σ/√SSx)

However, since σ² is unknown but b1 is normally distributed, the appropriate test statistic is

t(n−2) = (b1 − β1)/(s/√SSx)
We can derive the confidence interval by starting with the following probability statement:

P( −t(n−2, α/2) < (b1 − β1)/(s/√SSx) < t(n−2, α/2) ) = 1 − α

Thus, we can construct a 100(1 − α)% C.I. for β1, the slope parameter of the true regression line for the
population, as

β1 ∈ b1 ± t(n−2, α/2) · s/√SSx
The 95% confidence interval for the slope can be found using the following R code:
> confint(linear.fit, level=0.95)
This should give the following output for the confidence interval for the intercept and slope although we
require the interval for the slope only.
> hours<-c(2,3,4,5,6,7,8,9,10)
> mark<-c(45,45,58,50,57,68,70,62,75)
> linear.fit=lm(mark~hours)
> coef(linear.fit)
(Intercept) hours
37.58889 3.55000
> confint(linear.fit, level=0.95)
2.5 % 97.5 %
(Intercept) 27.234784 47.942994
hours 1.964858 5.135142
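The interval for the slope can also be reproduced from the formula b1 ± t(n−2, α/2)·s/√SSx; the following R sketch (object names chosen for illustration) matches the confint() output above:

h <- c(2,3,4,5,6,7,8,9,10)
m <- c(45,45,58,50,57,68,70,62,75)
n <- length(h)
SS_h  <- sum(h^2) - sum(h)^2/n        # 60
SP_hm <- sum(h*m) - sum(h)*sum(m)/n   # 213
SS_m  <- sum(m^2) - sum(m)^2/n        # 944.889
b1 <- SP_hm/SS_h                      # 3.55
s  <- sqrt((SS_m - b1*SP_hm)/(n - 2)) # estimate of sigma
b1 + c(-1, 1) * qt(0.975, n - 2) * s/sqrt(SS_h)  # (1.9649, 5.1351), as given by confint()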
Homework exercise: Verify the calculation for Example 10.2 by using the equations and interpret the
interval. Answer should be:
2.5 % 97.5 %
(Intercept) −4.298424521 −0.40016580
CO2 0.005367883 0.01154098
One of the assumptions made is that 𝑌 will be normally distributed for each value of 𝑋. We can say that the
conditional distribution of 𝑌 given 𝑋 = 𝑥0 is Normal with mean 𝛽0 + 𝛽1 𝑥0 and variance 𝜎 2 .
From the figure above we see that the value of 𝑦̂0 = 𝑏0 + 𝑏1 𝑥0 can be regarded as:
i) a point estimate of the mean of the population of values of 𝑌 when 𝑋 = 𝑥0 i.e. 𝐸(𝑌|𝑥0 ) or
ii) a point estimate of a single value from the population of values of 𝑌 when 𝑋 = 𝑥0
Notice that the point estimates of 𝐸(𝑌|𝑥0 ) and 𝑌|𝑥0 are both 𝑦̂0 = 𝑏0 + 𝑏1 𝑥0 . Thus, interval estimates of both
these quantities are centered at 𝑦̂0
The interval estimate of 𝐸(𝑌|𝑥0 ) is called a confidence interval for 𝐸(𝑌|𝑥0 )
The interval estimate of 𝑌|𝑥0 is called a prediction interval for 𝑌|𝑥0
Thus, one could use a linear regression model to predict tree mass (𝑌) based on the CO2 level (𝑋):
i) If we require the interval estimate for the mass of a particular tree exposed to an atmospheric
concentration of CO2 of 600 microliters per liter (parts per million), we will need a prediction interval for
𝑌|𝑥 = 600 since we are estimating the value of 𝑌 from the distribution of 𝑌|𝑥 = 600
ii) If we want the interval estimate of the average mass of all trees exposed to an atmospheric concentration
of CO2 of 600 microliters per liter (parts per million), we will need to find the confidence interval for
𝐸(𝑌|𝑥 = 600)
Note: The 100(1 − α)% C.I. interval estimate for 𝐸(𝑌|𝑥0 ) is shorter than the 100(1 − α)% interval
estimate for 𝑌|𝑥0 .
Why? Because there is more uncertainty in predicting one value of the random variable than in estimating
its average value: the variance of individual values is greater than the variance of the mean.
The variance of 𝑦̂𝑝 involves 𝜎 2 , the variance about the regression line. However, since 𝜎 2 is unknown, we use 𝑠 2 as the estimator
of 𝜎 2 . The resulting interval estimates are based on a t distribution with 𝑛 − 2 degrees of freedom; the confidence interval for 𝐸(𝑌|𝑥𝑝 ) is of the form

ŷp ± t(n−2, α/2) · s · √( 1/n + (xp − x̄)²/SSx )
Confidence interval for expected tree mass (Example 10.2)
Find a 95% C.I. for the expected tree mass of all trees exposed to an atmospheric concentration of CO2 of
600 microliters per liter (parts per million):
Firstly, we find the point estimate, then we find the 95% C.I. by using
E[Y|xp = 600] = ŷp = b0 + b1 xp = −2.349295162 + 0.008454434(600) = 2.723365238

E[Y|xp = 600] ∈ ( ŷp ± t(6, 0.025) · s · √( 1/n + (xp − x̄)²/SSx ) )

E[Y|xp = 600] ∈ ( 2.723365238 ± 2.4469 · √0.285117575 · √( 1/8 + (600 − 613.5)²/179190 ) ) = (2.2596 ; 3.1872)
This can be done in R using the following structure for the R code:
newpred <- c(value for the prediction)
predict(model, data.frame(pred = newpred), level = 0.95, interval = "confidence")
where pred is the name of the independent variable used when fitting the model, newpred is the object containing the new
values for which predictions are desired, and level is the desired confidence level.
The following code gives the 95% confidence interval for the expected tree mass of all trees exposed to an
atmospheric concentration of CO2 of 600 microliters per liter (parts per million)
> newCO2=(600)
> predict(linear.fit,data.frame(CO2 = newCO2), level = 0.95, interval = "confidence")
fit lwr upr
1 2.723365 2.25955 3.18718
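The same interval can be reproduced by hand in R from the quantities quoted in the text (b0, b1, s², x̄, SSx and n for the tree data); the sketch below simply assumes those values:

b0 <- -2.349295162; b1 <- 0.008454434  # least squares coefficients from the text
s  <- sqrt(0.285117575)                # estimate of sigma (s^2 = MSE quoted in the text)
n  <- 8; xbar <- 613.5; SSx <- 179190  # sample size, mean CO2 and SSx from the text
xp <- 600
yhat <- b0 + b1*xp                     # point estimate, about 2.7234
me <- qt(0.975, n - 2) * s * sqrt(1/n + (xp - xbar)^2/SSx)  # margin of error
yhat + c(-1, 1)*me                     # about (2.2596, 3.1872), matching predict()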
Predicting a particular value of 𝒀 for a given value of 𝑥 = 𝑥𝑝
We would like to predict the dependent variable of an individual. What should be evident is that trying to
predict an individual score will have greater error than with the expected value and it can be shown that
σ²pred = σ² [ 1 + 1/n + (xp − x̄)²/SSx ]

Hence the 100(1 − α)% prediction interval for Y|xp is of the form

ŷp ± t(n−2, α/2) · s · √( 1 + 1/n + (xp − x̄)²/SSx )
Example 10.2 continued: Find a 95% prediction interval for the mass of a tree exposed to an atmospheric concentration
of CO2 of 600 microliters per liter (parts per million).

ŷp = b0 + b1 xp = −2.349295162 + 0.008454434(600) = 2.723365238

Y|xp ∈ ( ŷp ± t(6, 0.025) · s · √( 1 + 1/n + (xp − x̄)²/SSx ) )

Y|xp ∈ ( 2.723365238 ± 2.4469 · √0.285117575 · √( 1 + 1/8 + (600 − 613.5)²/179190 ) ) = (1.3369 ; 4.1098)
This can be done in R using the following structure for the R code:
newpred <- c(value for the prediction)
predict(model, data.frame(pred = newpred), level = 0.95, interval = "predict")
where pred is the name of the independent variable used when fitting the model, newpred is the object containing the
new values for which predictions are desired, and level is the desired prediction level.
The following code gives the 95% prediction interval for the mass of a single tree exposed to an
atmospheric concentration of CO2 of 600 microliters per liter (parts per million)
> predict(linear.fit,data.frame(CO2 = newCO2), level = 0.95, interval = "predict")
fit lwr upr
1 2.723365 1.33692 4.109811
Comparing the confidence interval and the prediction interval one can see that as expected, the prediction
interval is much wider than the confidence interval.
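The difference in width can also be checked numerically (using the linear.fit and newCO2 objects from above):

> ci <- predict(linear.fit, data.frame(CO2 = newCO2), interval = "confidence")
> pri <- predict(linear.fit, data.frame(CO2 = newCO2), interval = "prediction")
> ci[, "upr"] - ci[, "lwr"]    # width of the confidence interval, about 0.93
> pri[, "upr"] - pri[, "lwr"]  # width of the prediction interval, about 2.77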