
Chapter 10: Correlation and Simple Linear Regression

10.0 Introduction
In the previous chapters we have mainly worked with univariate data (where observations are made for one
variable only); however, we did look at bivariate data when we studied correlation and in the chapter on joint
random variables (discrete case). We return to the bivariate case in this chapter as we are often interested in
how two or more variables are related to one another, e.g. an educational psychologist may be interested in how
vocabulary size is related to age, or a farmer may be interested in how crop yield is related to rainfall, etc.

Overview of this chapter and learning objectives


We will look at:
1. the relationships between two numerical variables, 𝑋 and 𝑌 in terms of form, strength, and direction of
the relationship,
2. calculating measures like the correlation coefficient and understanding the difference between a
statistical relationship (correlation) and a causal relationship,
3. finding a linear model that can be used to best describe the relationship between two numerical variables
𝑋 and 𝑌 when the pairs (𝑥𝑖 , 𝑦𝑖 ) of measurements are linearly distributed,
4. using a linear model to predict the dependent variable, and understanding why it is statistically unreliable to
make predictions outside the range of the values of the independent variable,
5. the use of 𝑟 2 and the standard deviation about the line to assess the usefulness of the least squares line,
6. the use of the residual plot in assessing if the line is the appropriate way to describe the relationship
between the variables.

10.1 Correlation
A scatterplot visually depicts how two continuous variables 𝑋 and 𝑌 are related to each other through a plot
of the observations (𝑥𝑖 , 𝑦𝑖 ) for these variables.

Example 10.1: Length of time (in hours) that nine MS1 students slept just before they sat for their first test
(H), and the marks they scored in the test (in %) (M)
Hours 2 3 4 5 6 7 8 9 10
Marks % 45 45 58 50 57 68 70 62 75

The following code in R generates the scatterplot of hours versus marks


> hours<-c(2,3,4,5,6,7,8,9,10)
> mark<-c(45,45,58,50,57,68,70,62,75)
> plot(hours,mark)

Figure 10.1: Scatterplot of hours slept versus marks obtained

Is there a relationship between hours slept and marks obtained? What is the direction of the relationship?
Can we describe this by some measure to say how strongly related the variables are?
Yes. There is a positive (increasing) linear relationship between X and Y. The correlation coefficient (𝑟) describes the
strength of the linear relationship between the independent and dependent variables.

Correlation can be positive, negative or zero:


Positive: when the values of one continuous variable increase, the values of the other continuous variable
also increase, e.g. for the relationship between age and shoe size, as age increases (within some range of
age values), shoe size also increases.
Negative: when the values of one continuous variable increase, the values of the other continuous variable
decrease, e.g. the relationship between Body Mass Index (BMI) and running speed: as BMI increases,
running speed decreases.
Zero: when the values of one continuous variable increase, there is no influence on the values of the other
continuous variable, either positive or negative. Hence there is no linear relationship between the two
continuous variables; however, there can be some other nonlinear relationship between the variables.

The above figure, sourced from https://communitymedicine4asses.com/2013/12/27/correlation/, shows that
the value of a car and its mileage are negatively related (i.e. the more mileage the car has done, the cheaper
it will be), that there is no relationship between the colour of a car and the quality of the car (hence the zero
correlation), and that there is a positive linear relationship between car insurance cost and the number of
accidents (with more accidents the insurance cost is expected to increase).

Sample correlation coefficient


Correlation coefficient (𝑟): is a measure of how strongly linearly related or associated 𝑥 and 𝑦 are from the
sample that is observed.

$$
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]\left[\sum_{i=1}^{n}(y_i - \bar{y})^2\right]}} = \frac{SP_{xy}}{\sqrt{SS_x\,SS_y}}
$$

where

$$
SP_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
$$

$$
SS_x = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}
$$

$$
SS_y = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}
$$

The correlation coefficient (𝑟) measures the strength of the linear relationship between the independent
and dependent variables. Its values lie in the range −1 to +1.

Correlation               Interpretation
𝑟 equal to +1             Perfect positive linear relationship
𝑟 close to +1             Strong positive linear relationship
𝑟 far away from +1        Weak positive linear relationship
𝑟 = 0                     No linear relationship
                          (this does not mean that the variables are unrelated,
                          only that they are not linearly related; there could be
                          some relationship other than a linear one)
𝑟 far away from –1        Weak negative linear relationship
𝑟 close to –1             Strong negative linear relationship
𝑟 equal to –1             Perfect negative linear relationship

Note: Correlation does not mean causation


A correlation of 𝑟 = 1, for example, describes the extent of the association between the two variables, i.e. that
larger values of the one variable are associated with larger values of the other variable, and not that large
values of the one variable cause large values of the other variable. For example, heater sales may be strongly
negatively correlated with the crime rate during the winter months (heater sales increase in winter and crime
decreases in winter). So, although heater sales and crime rate are associated, the one variable does not cause
the other to behave in a certain way. However, this association exists as they are both responses to the cold
weather in winter.

Using R in example 10.1 to find the correlation coefficient gives us:


> cor(hours,mark)
[1] 0.8945685

The correlation coefficient of 0.895 indicates that a strong positive linear relationship exists between hours
slept and mark obtained.

Using the equation for correlation gives us:
Hours (ℎ)   Marks % (𝑚)     ℎ²       𝑚²      ℎ𝑚
    2            45           4      2025      90
    3            45           9      2025     135
    4            58          16      3364     232
    5            50          25      2500     250
    6            57          36      3249     342
    7            68          49      4624     476
    8            70          64      4900     560
    9            62          81      3844     558
   10            75         100      5625     750
Totals: 54      530         384     32156    3393

$$
r = \frac{SP_{HM}}{\sqrt{SS_H\,SS_M}} = \frac{213}{\sqrt{(60)(944.8889)}} = 0.895
$$

where

$$
SP_{HM} = \sum_{i=1}^{n} h_i m_i - \frac{\left(\sum_{i=1}^{n} h_i\right)\left(\sum_{i=1}^{n} m_i\right)}{n} = 3393 - \frac{54 \times 530}{9} = 213
$$

$$
SS_H = \sum_{i=1}^{n} h_i^2 - \frac{\left(\sum_{i=1}^{n} h_i\right)^2}{n} = 384 - \frac{(54)^2}{9} = 60
$$

$$
SS_M = \sum_{i=1}^{n} m_i^2 - \frac{\left(\sum_{i=1}^{n} m_i\right)^2}{n} = 32156 - \frac{(530)^2}{9} = 944.8889
$$
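As a quick numerical check of the hand calculation (a small sketch, assuming the hours and mark vectors from the R code above are still in the workspace), the intermediate quantities can be computed directly in R:

> SP_hm <- sum(hours*mark) - sum(hours)*sum(mark)/length(hours)  # SP_HM = 213
> SS_h <- sum(hours^2) - sum(hours)^2/length(hours)              # SS_H = 60
> SS_m <- sum(mark^2) - sum(mark)^2/length(mark)                 # SS_M = 944.8889
> SP_hm/sqrt(SS_h*SS_m)      # should agree with cor(hours, mark) = 0.8945685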

If the correlation coefficient indicates that there is a linear relationship between the independent variable (𝑋)
and dependent variable (𝑌), then we could fit a linear model to the data by regressing the dependent variable
on the independent variable. The dependent variable is also referred to as the Response variable and the
independent variable is also known as the Explanatory or Predictor variable. A few examples of these include
examining the relationship between an increase in temperature (𝑋 or Explanatory) and the yield (𝑌 or
Response) from a chemical reaction at a particular temperature or the mark of students (𝑌) and the number of
hours spent studying (𝑋) or even the crop yield (𝑌) based on the amount of rainfall received (𝑋).

10.2 The Simple Linear regression model
If there is a linear relationship in the data, then there are population parameters 𝛽0, 𝛽1 and 𝜎² such that for
any fixed value 𝑥 of the random variable 𝑋, the Response or dependent variable (𝑌) is related to this fixed
𝑥 by a linear model given by: 𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀.

The linear model 𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀 is referred to as a probabilistic model. A probabilistic model accounts for
random deviation or random error denoted by ε.
• The random deviation or error ε is a random variable and it is assumed to be normally distributed with
mean 0 and variance 𝜎 2 , that is 𝜀~𝑁(0, 𝜎 2 ) with this mean and variance being the same regardless of
the value of 𝑥 that the model takes on.
• Since the 𝑛 (𝑥, 𝑦) pairs (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) are regarded as being generated independently
of each other, each of the 𝑌’s given by 𝑌𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖 is independent of the others as a result
of the 𝜀𝑖's being independent of each other.
The following figure from Devore and Berk (Figure 12.4) illustrates the simple linear regression model.

Figure 10.2: Illustration of simple linear regression line

In the above Figure, 𝑌 and 𝜀 are random variables (so they will have some distribution with mean and
variance), 𝑥 is a fixed value of the random variable 𝑋, and 𝛽0 and 𝛽1 are constants and are referred to as the
regression coefficients.
• 𝑌 is the dependent random variable; its value is determined by 𝑋 and the random error (𝜀).
• The variable 𝑋 is fixed, equal to 𝑥𝑖. We can choose the values we give to it; hence it is called the
independent variable.
• 𝛽0 is the unknown intercept coefficient (𝑦-intercept): the value of 𝑦 when 𝑥 = 0.
• 𝛽1 is the slope coefficient of the population or true regression line 𝛽0 + 𝛽1𝑥. It is interpreted as the true
average increase (or decrease) in 𝑌 associated with every one-unit increase (or decrease) in 𝑥.
• 𝜀 is the error term, the part of 𝑌 that the regression model is unable to explain.
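To make the probabilistic model concrete, the following is a minimal simulation sketch in R; the parameter values 𝛽0 = 2, 𝛽1 = 0.5 and 𝜎 = 1 are made up purely for illustration and are not taken from any example in these notes:

> set.seed(1)                                 # for reproducibility (illustrative values only)
> x <- 1:20                                   # fixed x values chosen by the experimenter
> eps <- rnorm(length(x), mean=0, sd=1)       # random errors, epsilon ~ N(0, sigma^2), sigma = 1
> y <- 2 + 0.5*x + eps                        # responses scatter about the true line beta0 + beta1*x
> plot(x, y)                                  # simulated observations
> abline(a=2, b=0.5)                          # the true (population) regression line

Repeating the simulation with a new set of errors gives a different scatter of points about the same true line, which is the sense in which 𝑌 is random while 𝑥 is fixed.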

The population parameters 𝛽0 , 𝛽1 and 𝜎 2 are generally unknown and will need to be estimated. The
parameters are estimated by drawing a random sample consisting of 𝑛 (𝑥𝑖 , 𝑦𝑖 ) pairs i.e:
(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) from the population of interest and then utilising this sample data to obtain the
sample estimates 𝑏0 , 𝑏1 and 𝑠 2 which are point estimates for the unknown population parameters 𝛽0 , 𝛽1 and
𝜎 2 . The estimated linear regression line 𝑦̂ = 𝑏0 + 𝑏1 𝑥, is obtained by drawing a straight line through the
sample data such that it is “closest” to as many of the sample data points as possible. The “closeness” of the
data is obtained by calculating the deviations between the observed 𝑦 values and the estimated 𝑦̂ values.

10.2.1 Finding the best fitting line or least squares regression line
If the scatterplot of the bivariate data consisting of 𝑛 (𝑥, 𝑦) pairs i.e: (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) appears to
have a linear pattern we can then relate the set of observations (𝑥𝑖 , 𝑦𝑖 ) of the two variables, X and Y
respectively by an equation for a straight line viz: 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖 . This is done through a process called
Simple Linear regression where 𝑋 represents the independent variable and 𝑌 represents the dependent
variable. The best model will be the straight line that is closest to as many of the observed (𝑥𝑖 , 𝑦𝑖 )
measurements.

From the previous figure, one can observe how much a point (𝑥𝑖 , 𝑦𝑖 ) deviates or how far it is from the fitted
line. The predicted values 𝑦̂1, 𝑦̂2,…, 𝑦̂𝑛 are obtained by substituting the observed 𝑥 values 𝑥1 , 𝑥2 , … , 𝑥𝑛 into the
equation for the regression line. The model can be estimated by minimising the sum of the squared deviations
or the sum of the squared errors between the observed 𝑦𝑖 values and the predicted 𝑦̂𝑖 values. The deviations
are also referred to as residuals.

Definition: Residuals: The vertical deviations between the actual or observed 𝑦’s and the predicted 𝑦 values
given by the line (i.e. 𝑦̂) are called residuals, denoted ei =𝑦𝑖 − 𝑦̂𝑖

Definition: Error sum of squares (or Residual sum of squares)


$$
SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\big(y_i - (b_0 + b_1 x_i)\big)^2 = \text{sum of squared errors}
$$

We want these residuals to be as small as possible to get the best line i.e. we want to minimise the residuals.
Since the residuals may be positive and negative numbers, they will cancel each other out if we simply sum
them, hence we work with the squared residuals to avoid this. Thus, we are finding the line 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖
which minimises the sum of the squared residuals(∑ 𝑒𝑖2 ).

In the estimated least squares regression line 𝑦̂ = 𝑏0 + 𝑏1 𝑥 obtained from the observed sample data, 𝑏0
represents the intercept (the 𝑦 value when 𝑥 is 0) and 𝑏1 represents the slope of the line (which represents
the change in the dependent variable for every 1-unit change in the independent variable).
Minimising the sum of the squared residuals $\sum_{i=1}^{n} e_i^2$ is undertaken by finding the partial derivatives of
$$
\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2
$$
with respect to $b_0$ and $b_1$, thereafter setting both to zero and solving.

Using partial derivatives and differentiating with regard to $b_0$ gives:
$$
\frac{\partial}{\partial b_0}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2 = -2\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)
$$

Equating to zero:
$$
-2\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i) = 0
\;\Rightarrow\; \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i) = 0
\;\Rightarrow\; \sum_{i=1}^{n} y_i = n b_0 + b_1\sum_{i=1}^{n} x_i
$$

Using partial derivatives and differentiating with regard to $b_1$ gives:
$$
\frac{\partial}{\partial b_1}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2 = -2\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)x_i
$$

Equating to zero:
$$
-2\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)x_i = 0
\;\Rightarrow\; \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)x_i = 0
\;\Rightarrow\; \sum_{i=1}^{n} x_i y_i = b_0\sum_{i=1}^{n} x_i + b_1\sum_{i=1}^{n} x_i^2
$$

Thus,
$$
b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{SP_{xy}}{SS_x}
$$
and substituting $b_1$ into the first normal equation and making $b_0$ the subject gives:
$$
b_0 = \frac{\sum_{i=1}^{n} y_i}{n} - b_1\,\frac{\sum_{i=1}^{n} x_i}{n} = \bar{y} - b_1\bar{x}
$$

Applying second-order differentiation (checking the second partial derivatives) confirms that this is a
minimum and not a maximum or point of inflection.
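The formulas for 𝑏1 and 𝑏0 can be checked against R's built-in lm() function; a small sketch for the data of Example 10.1 (assuming the hours and mark vectors are still defined):

> SP_xy <- sum(hours*mark) - sum(hours)*sum(mark)/length(hours)
> SS_x <- sum(hours^2) - sum(hours)^2/length(hours)
> b1 <- SP_xy/SS_x                 # slope from the formula, 3.55
> b0 <- mean(mark) - b1*mean(hours) # intercept from the formula, 37.58889
> c(b0, b1)
> coef(lm(mark ~ hours))           # the same estimates from lm()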

Now remember that for the line 𝑦̂ = 𝑏0 + 𝑏1 𝑥, its intercept 𝑏0 and slope 𝑏1 are obtained for a particular data
set. The line can be used to predict the value of 𝑦̂ for a given 𝑥 value that is within the range of 𝑥 values
observed in this particular data set. One cannot predict the 𝑦̂ value for a given 𝑥 value that is outside the range
of 𝑥 values as this would amount to extrapolation and the prediction would not be reliable. The slope indicates
the average or expected change in the dependent variable for every 1-unit change in the independent variable.

Also, apart from the line $\hat{y} = b_0 + b_1 x$ having the property that the sum of the squared deviations is
minimised, it also has the property that:
$$
\bar{e} = \frac{1}{n}\sum e_i = \frac{1}{n}\sum(y_i - \hat{y}_i) = \frac{1}{n}\sum\big(y_i - (b_0 + b_1 x_i)\big) = \bar{y} - b_0 - b_1\bar{x} = 0
$$
since $b_0 = \bar{y} - b_1\bar{x}$.
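This zero-mean property of the residuals is easy to verify numerically; a tiny sketch for Example 10.1 (the same model is fitted later in these notes under the name linear.fit):

> fit <- lm(mark ~ hours)   # least squares fit for Example 10.1
> mean(resid(fit))          # essentially 0, up to rounding error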
Now remember that the line 𝑦̂ = 𝑏0 + 𝑏1 𝑥, obtained from 𝑛 (𝑥, 𝑦) pairs in a particular sample data set is an
estimate for the population where 𝑏0 and 𝑏1 are estimates of 𝛽0 and 𝛽1 respectively. If we change the data set
to a new or different set of 𝑛 (𝑥, 𝑦) pairs we will obtain a different estimated line 𝑦̂ = 𝑏0 + 𝑏1 𝑥, with
different values for 𝑏0 and 𝑏1 .

10.3 Estimating the parameters of the model:


The probabilistic model allows one to make probability statements about the parameters in the model and
about predictions made from the model, and it enables us to construct confidence intervals and test hypotheses
for the model (2nd year). Since the 𝑛 (𝑥, 𝑦) pairs (𝑥1, 𝑦1), (𝑥2, 𝑦2), …, (𝑥𝑛, 𝑦𝑛) are regarded as being
generated independently of each other, each 𝑌𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖 is independent of the others, as the
𝜀𝑖's are independent of each other. Recall that 𝑌 and 𝜀 are random variables (so they will have some
distribution or “shape” with a mean and variance) whereas 𝑥 is a fixed value.

Recall that it is assumed that the 𝜀𝑖 ~𝑁(0, 𝜎 2 ).
Now each 𝑌𝑖 ~ 𝑁(𝛽0 + 𝛽1 𝑥𝑖, 𝜎²) since 𝐸(𝑌|𝑥𝑖) = 𝛽0 + 𝛽1 𝑥𝑖, as illustrated in Devore and Berk:

a) Distribution of 𝜀

Figure 10.3: Distribution of 𝜀

b) Distribution of 𝑌 for different values of 𝑥

Figure 10.4: Distribution of 𝑌 for different values of 𝑥


For the model 𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀 we saw that the best estimate for 𝛽0 is 𝑏0 and the best estimate for 𝛽1 is 𝑏1 .
The random deviation 𝜀 is assumed to be Normally distributed with mean 0 and variance 𝜎 2
hence, 𝜀𝑖 ~𝑁(0, 𝜎2 ).

We have already estimated the slope and intercept parameters, thus we still need to estimate a third parameter
𝜎 2 . We will estimate this parameter in section 10.4.2. The variance parameter 𝜎 2 represents the amount of
variability inherent in the regression model. If σ2 is close to zero then almost all the (x, y) pairs are said to be
close to the population regression line. If the σ2 is far from 0 then it means that most of the (x, y) pairs, in the
scatterplot, are spread out and far away from the population regression line. Since the equation of the true
population regression line is unknown, the estimate is based on the extent to which the sample data deviates
from the estimated line. We require an estimate of 𝜎 2 for the confidence intervals that we will calculate
shortly. Since 𝑏0 and 𝑏1 are the best estimates (you will study in 2nd year that they are both unbiased
estimators) of 𝛽0 and 𝛽1 respectively, we can safely say that the distribution of the errors or residuals
ei =𝑦𝑖 − 𝑦̂𝑖 from the sample should estimate the distribution of 𝜀 in the population.

Previously we saw that the sample variance $s^2$ is the best estimator of the population parameter $\sigma^2$. In much
the same way, we can use the sample variance of the residuals $e_i$ to estimate $\sigma^2$, the population variance of
$\varepsilon$. The sample variance of the residuals is
$$
s^2 = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}
$$
It has $n-2$ degrees of freedom because we estimated two parameters, $\beta_0$ and $\beta_1$, for the regression line and
we are using these estimates in the equation above to find the sample variance $s^2$. Now, $s^2 = \frac{SSE}{n-2}$ is also
called the mean squared error (MSE). This can be calculated by the computationally easier formula:
$$
MSE = \frac{SSE}{n-2} = \frac{SS_y - b_1 SP_{xy}}{n-2}
$$

Example 10.2: - based on Example 12.5 (Devore and Berk)


“Global warming is a major issue, and CO2 emissions are an important part of the discussion. What is the
effect of increased CO2 levels on the environment? In particular, what is the effect of these higher levels on
the growth of plants and trees? The article “Effects of Atmospheric CO2 Enrichment on Biomass
Accumulation and Distribution in Eldarica Pine Trees” (J. Exp. Bot., 1994: 345–349) describes the results of
growing pine trees with increasing levels of CO2 in the air. There were two trees at each of four levels of CO2
concentration, and the mass of each tree was measured after 11 months of the experiment. Here are the
observations with 𝑥 = atmospheric concentration of CO2 in microliters per liter (parts per million) and 𝑦 =
mass in kilograms. In the table below, columns were added for 𝑥², 𝑥𝑦, 𝑦², the fitted value (𝑦̂𝑖) and the
residual (𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖) for ease of use when we find the least squares line through calculations. The mass
measurements were read from a graph in the article.”


Observation     𝑥      𝑦       𝑥²        𝑥𝑦       𝑦²     Fitted value 𝑦̂𝑖   Residual 𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖
     1         408    1.1    166464     448.8     1.21      1.100114          -0.000114
     2         408    1.3    166464     530.4     1.69      1.100114           0.199886
     3         554    1.6    306916     886.4     2.56      2.334461          -0.734461
     4         554    2.5    306916    1385.0     6.25      2.334461           0.165539
     5         680    3.0    462400    2040.0     9.00      3.399720          -0.399720
     6         680    4.3    462400    2924.0    18.49      3.399720           0.900280
     7         812    4.2    659344    3410.4    17.64      4.515705          -0.315705
     8         812    4.7    659344    3816.4    22.09      4.515705           0.184295
  Totals      4908   22.7   3190248   15441.4    78.93
Table 10.2: Atmospheric CO2 concentration and tree mass
R code to visualise the data:
> Mass=c(1.1, 1.3, 1.6, 2.5, 3.0, 4.3, 4.2, 4.7) #define the vector of values for variable Mass
> CO2=c(408,408,554,554,680,680,812,812) #define the vector of values for variable CO2
> plot(x=CO2, y=Mass) #scatterplot of Mass versus CO2
The scatterplot shows that there is a positive linear relationship between the variables.

Figure 10.5 Plot of tree mass versus CO2 concentration

The strength of the linear relationship is calculated by finding the Pearson correlation coefficient between the
two variables Mass and CO2. Using R with the following commands gives:
> cor(Mass,CO2, method="pearson") #Calc Pearson correlation
[1] 0.9392405

The correlation coefficient 𝑟 = 0.9392405 is close to +1 and hence we say that the two variables are strongly,
positively linearly related. Homework exercise: Verify the answer for the correlation coefficient by using
the equations!

Tree mass is the Dependent or Response variable and atmospheric concentration of CO2 is the Independent
variable or Explanatory variable. By regressing Mass on CO2, we can get the linear model
We can find the least squares regression line using:
$$
b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} = \frac{SP_{xy}}{SS_x}
$$

$$
b_1 = \frac{SP_{xy}}{SS_x} = \frac{15441.4 - \frac{(4908)(22.7)}{8}}{3190248 - \frac{(4908)^2}{8}} = \frac{1514.95}{179190} = 0.00845
$$

and
$$
b_0 = \bar{y} - b_1\bar{x} = \frac{22.7}{8} - 0.00845\left(\frac{4908}{8}\right) = -2.349
$$

Hence the estimated linear regression line is 𝑦̂ = −2.349 + 0.00845𝑥

We can superimpose the least squares line on the scatterplot (after fitting the model linear.fit, as shown below) with the following code in R:
> abline(linear.fit) #plots linear regression line on data

Figure 10.6: Plot of Mass versus CO2 with superimposed least squares line

The least squares regression coefficients are obtained with the following R code: lm(Dependent variable ~
Independent variable)
> linear.fit=lm(Mass~CO2) #creates the linear model

The coefficients can also be obtained with the following R code:


> coef(linear.fit)
(Intercept) CO2
-2.349295162 0.008454434

Hence the least squares regression line is given by 𝑦̂ = −2.349 + 0.00845𝑥, agreeing with the hand calculation above.

Interpretation of the slope of the least squares line: The slope is positive and shows that tree mass increases
by about 0.00845 kg on average for every one-unit (1 part per million) increase in CO2 concentration.
To calculate the fitted values (𝑦̂) we use the following R code:
> fitted(linear.fit)
1 2 3 4 5 6 7 8
1.100114 1.100114 2.334461 2.334461 3.399720 3.399720 4.515705 4.515705

One can also find the Residuals = Observed values – Fitted value (𝑒 = 𝑦 − 𝑦̂) using the following code:
> resid(linear.fit)
1 2 3 4 5 6
-0.0001138456 0.1998861544 -0.7344611865 0.1655388135 -0.3997198504 0.9002801496
7 8
-0.3157051175 0.1842948825
The least squares line can be used for two purposes: for a fixed 𝑥 value (say 𝑥*), 𝑦̂ = 𝑏0 + 𝑏1𝑥* gives either
(i) the estimate of the expected value of 𝑌 when 𝑥 = 𝑥* (as seen in Figure 10.4), or
(ii) a point prediction of the 𝑌 value when 𝑥 = 𝑥*, i.e. we predict tree mass for a given atmospheric
concentration of CO2.

Note that we must caution against extrapolation, hence we can only predict the dependent or Response
variable, in this case 𝑌 (Mass) value for some fixed value of 𝑥 where the value of 𝑥 must be between
(min(𝑥) ; max (𝑥)) as the least squares line was generated using this specific range of 𝑥 values.
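For a point prediction within the observed range of 𝑥, the fitted model can be used directly with predict(); a short sketch for CO2 = 600 ppm (a value between 408 and 812, so not extrapolation):

> predict(linear.fit, newdata=data.frame(CO2=600))  # yhat = -2.349 + 0.00845*600, about 2.723
> # predicting at, say, CO2 = 2000 would be extrapolation and is not statistically reliable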
A further summary for the model can be obtained in R using the following R code:
> summary(linear.fit) # gives the summary of the model
Call:
lm(formula = Mass ~ CO2)
Residuals:
Min 1Q Median 3Q Max
-0.73446 -0.33671 0.08271 0.18819 0.90028

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.349295 0.796567 -2.949 0.025637 *
CO2 0.008454 0.001261 6.702 0.000536 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.534 on 6 degrees of freedom


Multiple R-squared: 0.8822, Adjusted R-squared: 0.8625

10.4 Assessing the Model


Thus far we used the scatterplot to show us visually if the linear regression line is appropriate to describe the
relationship between the variables. We then calculated the Pearson correlation coefficient and if the value
indicated that there was a strong linear relationship between the Response (Dependent) and Explanatory
(Independent) variables, we then fitted the linear regression line from the sample data and could predict the
value of the dependent variable based on the value of the independent variable. However, once we decide that
the straight line is appropriate, we need to assess the accuracy of the predictions that are based on the least
squares line. Two numerical measures that give us an idea of how well the model fits the sample data are the
coefficient of determination and the standard deviation about the least squares line.

10.4.1. The Coefficient of Determination


The coefficient of determination describes the proportion of variability in the dependent variable that can be
attributed or explained by the linear relationship between the independent and dependent variables.

Figure 10.7: Variation in the Dependent variable (extracted from Devore &Berk Figure 12.12)

In the figure labelled (a) above, all the points will lie on the least squares line, so all the variation in the 𝑦’s is
attributed to the 𝑥’s and hence there is no unexplained variation. Hence 𝑆𝑆𝐸, which measures the variation
that is unexplained by the model, will be 0.

In Figure (b) the points will not all lie on the line but most will be very close to the line so there is a small
amount of unexplained variation and hence SSE will be small.

Figure (c) will have a much larger variation than (b) and hence SSE will be much larger.

To understand how 𝑟² (the proportion of variability in the dependent variable explained by the model) is
calculated, we need to consider the variability in the 𝑦 values. This can be done in two ways: looking at the
variation that is unexplained by the model (SSE) or looking at the total variation in the observed 𝑦’𝑠, as shown in the Figure below.

Figure 10.8: Extracted from Devore and Berk (Figure 12.13) to illustrate variation in the model

Understanding the variation in the model is explained in further detail below by considering the equations
for SSE and SST.
a) We can look at how far the 𝑦’𝑠 are from their respective 𝑦̂ values (Figure 10.8, left-hand picture). This
is referred to as the Error Sum of Squares or Sum of Squares of the Errors (SSE), where
$$
SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\big(y_i - (b_0 + b_1 x_i)\big)^2
$$
It is interpreted as the amount of variability in the 𝑦’𝑠 that is unexplained by the model.
OR
b) We can look at the total amount of variation in the observed 𝑦’𝑠 (Figure 10.8, right-hand picture), i.e.
how spread out the observed 𝑦’𝑠 are from the mean 𝑦 value (𝑦̅). This is called the Total sum of
squares (denoted SST), where $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2 = SS_y$.
Note the computationally easier equation $SS_y = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$.

So, SSE is the squared deviation about the least squares line and SST is the squared deviation about
the horizontal line 𝑦 = 𝑦̅.
$$
SST = \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n}\big[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\big]^2
$$
$$
= \sum_{i=1}^{n}\big[(y_i - \hat{y}_i)^2 + 2(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + (\hat{y}_i - \bar{y})^2\big]
$$
$$
= \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2
$$
as the middle term evaluates to 0: the normal equations give $\sum_{i=1}^{n} e_i = 0$ and $\sum_{i=1}^{n} e_i x_i = 0$, so
$\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \sum_{i=1}^{n} e_i (b_0 + b_1 x_i) - \bar{y}\sum_{i=1}^{n} e_i = 0$.
Thus SST = SSE + SSR, where $SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ is defined as the regression sum of squares.

So, the ratio $\frac{SSE}{SST}$ is the proportion of total variation that is unexplained by the model and hence
$$
r^2 = 1 - \frac{SSE}{SST}
$$
is the proportion of variation that is explained by the model. This is referred to as the Coefficient
of Determination.

Definition: Coefficient of Determination (𝑟² × 100%): the proportion of variation in the dependent
variable that can be attributed or explained by the linear relationship between the independent and dependent
variables.

Calculate and interpret 𝒓𝟐 for Example 10.2


𝑟² = 88.22%. Hence 88.22% of the variability in the mass of the tree is accounted for by the linear
relationship between CO2 and mass of the tree. The remaining (100 − 88.22)% = 11.78% of the variation
in the tree mass comes from other factors different from CO2, e.g. amount of rainfall, temperature,
etc.

The coefficient of Determination can be obtained using the following R code or it can be read off from the
previous output given above (Multiple R-squared):
> summary(linear.fit)$r.squared # gives the coefficient of determination
[1] 0.8821727
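The decomposition SST = SSE + SSR and the definition 𝑟² = 1 − SSE/SST can also be verified directly from the fitted model; a small sketch for Example 10.2 (using the Mass vector and linear.fit defined earlier):

> SSE <- sum(resid(linear.fit)^2)                     # unexplained variation
> SST <- sum((Mass - mean(Mass))^2)                   # total variation in the observed y's
> SSR <- sum((fitted(linear.fit) - mean(Mass))^2)     # explained (regression) variation
> c(SSE + SSR, SST)                                   # the two totals agree
> 1 - SSE/SST                                         # 0.8821727, matching Multiple R-squared above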

10.4.2 Estimating the variance 𝜎 2


Recall that the goodness of fit of the model can be assessed by either looking at coefficient of determination
or the variance (or standard deviation) about the least squares line.

To calculate the standard deviation $s$, we need to calculate the unexplained sum of squares SSE. Recall that
$$
SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2
$$
The unbiased estimator $s^2$ of the variance $\sigma^2$ is $\frac{SSE}{n-2}$. Using the above equation for calculation can be
tedious, hence we can use the computationally easier equation $SSE = SS_y - b_1 SP_{xy}$.
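A short sketch computing 𝑠² = SSE/(𝑛 − 2) for Example 10.2 and comparing it with the residual standard error of 0.534 on 6 degrees of freedom reported by summary(linear.fit):

> SSE <- sum(resid(linear.fit)^2)        # error sum of squares
> s2 <- SSE/(length(Mass) - 2)           # mean squared error, about 0.2851
> sqrt(s2)                               # about 0.534
> summary(linear.fit)$sigma              # the residual standard error reported by R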

10.4.3 Check if the assumptions of the model are met
Recall our model: 𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀 had the following assumptions made about the model:
The error terms:
i) are Normally distributed with mean 0 and variance 𝜎 2 hence, 𝜀𝑖 ~𝑁(0, 𝜎 2 )
ii) have the same or equal variance for the different 𝑥 values. This is referred to as homoscedasticity
iii) are independent of one another

These assumptions may or may not be true and hence will need to be checked. If they are true then the
observed residuals 𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖 should also behave in the same way as 𝜀𝑖 and should be approximately
normally distributed with constant variances for the different 𝑥 values.

The assumptions can be checked by examining the following residual plots: (reference:
https://data.library.virginia.edu/diagnostic-plots/)

Plot 1: Residuals versus the fitted or predicted values 𝑦̂𝑖


If there is a non-linear relationship between the dependent variable and independent variable, then the pattern
indicating the non-linear relationship could be seen in this plot. If the residuals seem to be equally spread
around a horizontal line without any clear pattern, this would indicate that we do not have a non-linear
relationship.
Consider the following residual plots. One describes a ‘good’ model (since it meets the regression assumptions
well) and the other describes a ‘bad’ model as it does not meet the regression assumptions.

The plots show that there is no clear pattern in Case 1, however in Case 2, we can see the pattern of a parabola
and hence there is a non-linear relationship in the data, thus the assumptions for the linear model are not
satisfied. This non-linear relationship was not explained by the model and was left out in the residuals.
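The residuals-versus-fitted plot can also be drawn by hand for Example 10.2 (a sketch; the same plot is produced automatically by plot(linear.fit) later in this section):

> plot(fitted(linear.fit), resid(linear.fit), xlab="Fitted values", ylab="Residuals")
> abline(h=0, lty=2)   # residuals should scatter randomly about this line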
Plot 2: Normal Q-Q
Normal Q-Q plot indicates if the residuals are normally distributed. If the residuals follow the straight dashed
line in the plot, then they are normally distributed.

Case 1 shows that the residuals are Normally distributed as they follow the straight dashed line quite closely,
although the observation numbered 38 looks like it is far away from the other points. Case 2 deviates quite
distinctly from the straight dashed line and hence the normality assumption is violated. Although observation
#38 falls away from the rest of the data in Case 1, at this stage we would not be too concerned and we would say that
Case 1 satisfies the normality assumption.
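A normal Q-Q plot of the residuals for Example 10.2 can be drawn directly as a quick sketch:

> qqnorm(resid(linear.fit))        # points close to the reference line support normality
> qqline(resid(linear.fit), lty=2) # adds the reference line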

Let’s look at the next plot (Scale-Location) while keeping in mind that observation #38 might be a potential
problem. Observation #38 is quite far away from the other points in the plot, so it seems to be breaking away
from the pattern of the other values and is termed an outlier. An outlier can have an influence on the
slope of the model.

Plot 3: Scale-Location
The Scale-location plot also known as the Spread-Location plot, shows if the residuals are spread equally
along the 𝑥 values. If you see a horizontal line with equally (randomly) spread out points then the assumption
of equal variance or homoscedasticity is satisfied.

In these plots the residuals appear randomly and equally spread in Case 1 whereas, in Case 2, the residuals
begin to fan out or spread wider along the 𝑥-axis. Also, since the residuals spread out wider and
wider, the red smooth line is not horizontal and has a steep incline.

Plot 4: Residuals vs Leverage


Leverage for observation $i$ can be calculated by the following equation:
$$
\frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_x}
$$

The Residuals versus Leverage plot helps us to identify whether any of the outliers are influential. Not all outliers are
influential: if an outlier follows the trend of the majority of the cases in the data, it will not necessarily affect the
regression line, but points that could influence the regression line need closer examination. Points with extreme
values of X are said to have high leverage, and high leverage points have the ability to move the line more. Unlike
in the other plots, we do not examine the pattern in the data; instead we look for values that lie outside the dashed
lines that indicate a measure referred to as the Cook’s distance score. Attention needs to be paid to extreme values
that occur at the upper right corner or at the lower right corner of the plot, as these points have high leverage and
can be influential against a regression line. Points that lie outside of the dashed lines indicating the Cook’s distance
score are regarded as being influential to the model; if these points are excluded, the regression results would
change. Note that the numbers within the plot refer to the items or cases in the dataset that need to be examined,
e.g. in the plot below 32 refers to the 32nd case in the dataset. See video at https://www.youtube.com/watch?v=xc_X9GFVuVU

The plots above show the contour lines for the Cook’s distance. Cook’s distance is another measure of the
importance of each observation to the regression. Generally, a Cook’s distance of more than 1 would indicate
an influential value or possible outlier and possibly a poor model. In Case 1 there seem to be no
influential points outside the Cook’s distance lines. In Case 2, however, observation #49 appears to be far
outside the Cook’s distance lines and is thus an influential point. Excluding it from the data could influence
the regression model.

If similar cases occur across the four plots as outliers, then closer attention needs to be paid to those points to
see whether there is anything special about that particular case, or whether there are errors in the capturing of the data, etc. To
use R’s regression diagnostic plots, we set up the regression model as an object and create a plotting
environment of two rows and two columns. Then we use the plot() command, treating the model as an
argument. Applying this to the data in Example 10.2 (based on Example 12.5 in Devore) gives the following:
> par(mfrow = c(2,2)) # splits the screen in 4 parts
> plot(linear.fit) #plots the diagnostic plots


In using the R diagnostic plots and influence statistics to diagnose how well our model is fitting the data, we
see that the first plot (Residuals vs. Fitted values) is a simple scatterplot of the residuals against the
predicted values. It should look more or less random. Here, the residuals seem to be equally spread about the
horizontal line.

The second plot (normal Q-Q) shows the points closely plotted along the dashed line hence the errors are
normally distributed.

The third plot (Scale-Location), like the first, should look random. No patterns should appear in this plot.
Ours has an upside-down V-shape.

The last plot (Residuals vs Leverage, with Cook’s distance D contours) tells us which points have the greatest influence on the regression (leverage
points). In our plot, the values of D are below 0.5, so points 3 and 6, for example, which appear to be outliers,
are not regarded as influential.
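The leverage and Cook’s distance values behind this plot can also be listed numerically; a short sketch for Example 10.2:

> hatvalues(linear.fit)       # leverage of each of the 8 observations
> cooks.distance(linear.fit)  # Cook's distance of each observation (all below 0.5 here, consistent with the plot)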

“So, what does having patterns in residuals mean to your research? It’s not just a go-or-stop sign. It tells
you about your model and data. Your current model might not be the best way to understand your data if
there’s so much good stuff left in the data.” https://data.library.virginia.edu/diagnostic-plots/

10.5 Point and Interval Estimation of the slope parameter


In the past we had two methods of estimation: that of a point estimate and that of an interval estimate. Both
referred to an estimate of a parameter. The same procedure exists in regression, because the errors are i.i.d.
𝜀𝑖 ~ 𝑁(0, 𝜎²).

10.5.1 Confidence interval for the Regression Coefficient 𝛽1


Thus far we have seen that 𝑏0 is the point estimate for 𝛽0 and 𝑏1 is the point estimate for 𝛽1. We saw
previously, under the discussion of the sampling distribution of the mean and proportion, that the value of any
quantity calculated from the sample data, i.e. the value of any statistic, will vary from sample to sample. In
much the same way, the value of 𝑏1 will vary from sample to sample. Hence 𝑏1 will have a distribution with
some mean and variance. It can be shown that (2nd year):
$$
E(b_1) = \beta_1 \qquad \text{and} \qquad Var(b_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sigma^2}{SS_x}
$$

Thus $b_1$ is an unbiased estimator of $\beta_1$ with known standard deviation (or variance), hence
$$
z = \frac{b_1 - \beta_1}{\sigma_{b_1}} = \frac{b_1 - \beta_1}{\sigma/\sqrt{SS_x}}
$$

However, since $\sigma^2$ is unknown but $b_1$ is normally distributed, the appropriate test statistic is
$$
t_{n-2} = \frac{b_1 - \beta_1}{s/\sqrt{SS_x}}
$$
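The standard error of 𝑏1, $s/\sqrt{SS_x}$, and the corresponding 𝑡 value for testing 𝛽1 = 0 can be computed from these formulas and compared with the Coefficients table of summary(linear.fit) for Example 10.2 (Std. Error 0.001261, t value 6.702); a small sketch:

> s <- summary(linear.fit)$sigma          # estimate of sigma, about 0.534
> SS_x <- sum((CO2 - mean(CO2))^2)        # SS_x = 179190
> se_b1 <- s/sqrt(SS_x)                   # about 0.001261
> coef(linear.fit)["CO2"]/se_b1           # t statistic for beta1 = 0, about 6.702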

We can derive the confidence interval by starting with the following probability statement:
$$
P\!\left(-t_{n-2,\alpha/2} < \frac{b_1 - \beta_1}{s/\sqrt{SS_x}} < t_{n-2,\alpha/2}\right) = 1 - \alpha
$$

Thus, we can construct a $100(1-\alpha)\%$ C.I. for $\beta_1$, the slope parameter of the true regression line for the
population, as
$$
\beta_1 \in b_1 \pm t_{n-2,\alpha/2}\,\frac{s}{\sqrt{SS_x}}
$$

Calculate and interpret the 95% C.I. for $\beta_1$ in Example 10.1:

$\sum x = 54$, $\sum x^2 = 384$, $\sum y = 530$, $\sum y^2 = 32156$, $\sum xy = 3393$, $n = 9$
$b_1 = 3.55$, $t_{n-2,\alpha/2} = t_{7,0.025} = 2.3646$, $SS_x = 60$, $SS_y = 944.89$, $SP_{xy} = 213$
$SSE = SS_y - b_1 SP_{xy} = 944.89 - 3.55(213) = 188.74$, $s^2 = \frac{SSE}{n-2} = \frac{188.74}{7} = 26.96285714$

$$
\beta_1 \in b_1 \pm t_{n-2,\alpha/2}\,\frac{s}{\sqrt{SS_x}} = 3.55 \pm 2.3646\,\frac{\sqrt{26.96285714}}{\sqrt{60}}
$$
$$
\beta_1 \in (1.96487;\ 5.13513)
$$

Interpretation: we are 95% confident that the true average increase in test mark for each additional hour slept lies between about 1.96 and 5.14 percentage points.

The 95% confidence interval for the slope can be found using the following R code:
> confint(linear.fit, level=0.95)

This should give the following output for the confidence interval for the intercept and slope although we
require the interval for the slope only.
> hours<-c(2,3,4,5,6,7,8,9,10)
> mark<-c(45,45,58,50,57,68,70,62,75)
> linear.fit=lm(mark~hours)
> coef(linear.fit)
(Intercept) hours
37.58889 3.55000
> confint(linear.fit, level=0.95)
2.5 % 97.5 %
(Intercept) 27.234784 47.942994
hours 1.964858 5.135142

Homework exercise: Verify the calculation for Example 10.2 by using the equations and interpret the
interval. Answer should be:
2.5 % 97.5 %
(Intercept) −4.298424521 −0.40016580
CO2 0.005367883 0.01154098

10.6 Point and Interval Estimation of y


Recall that for the probabilistic model used in linear regression,
𝑌𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖,
the line 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖 is the estimated regression line.

One of the assumptions made is that 𝑌 will be normally distributed for each value of 𝑋. We can say that the
conditional distribution of 𝑌 given 𝑋 = 𝑥0 is Normal with mean 𝛽0 + 𝛽1 𝑥0 and variance 𝜎²,
because 𝐸[𝜀𝑖] = 0. Thus 𝑌|𝑥0 ~ 𝑁(𝛽0 + 𝛽1 𝑥0, 𝜎²).

From the figure above we see that the value of 𝑦̂0 = 𝑏0 + 𝑏1 𝑥0 can be regarded as:
i) a point estimate of the mean of the population of values of 𝑌 when 𝑋 = 𝑥0 i.e. 𝐸(𝑌|𝑥0 ) or
ii) a point estimate of a single value from the population of values of 𝑌 when 𝑋 = 𝑥0
Notice that the point estimates of 𝐸(𝑌|𝑥0 ) and 𝑌|𝑥0 are both 𝑦̂0 = 𝑏0 + 𝑏1 𝑥0 . Thus, interval estimates of both
these quantities are centered at 𝑦̂0
The interval estimate of 𝐸(𝑌|𝑥0 ) is called a confidence interval for 𝐸(𝑌|𝑥0 )
The interval estimate of 𝑌|𝑥0 is called a prediction interval for 𝑌|𝑥0
Thus, one could use a linear regression model to predict tree mass (𝑌) based on the CO2 level (𝑋):
i) If we require the interval estimate for the mass of a particular tree exposed to an atmospheric
concentration of CO2 of 600 microliters per liter (parts per million), we will need a prediction interval for
𝑌|𝑥 = 600 since we are estimating the value of 𝑌 from the distribution of 𝑌|𝑥 = 600

ii) If we want the interval estimate of the average mass of all trees exposed to an atmospheric concentration
of CO2 of 600 microliters per liter (parts per million), we will need to find the confidence interval for
𝐸(𝑌|𝑥 = 600)

Note: The 100(1 − 𝛼)% C.I. (interval estimate) for 𝐸(𝑌|𝑥0) is shorter than the 100(1 − 𝛼)% interval
estimate for 𝑌|𝑥0.
Why? Because the uncertainty in estimating the mean of the random variable is less than the uncertainty in
predicting a particular value of the random variable: there is more uncertainty in estimating one value of the
random variable than in estimating the average value, since the variance of the individuals is greater than the variance of the means.

Let us now estimate the expected value of 𝑦 when 𝑥 = 𝑥𝑝.

We can show that 𝑦̂𝑝 = 𝑏0 + 𝑏1 𝑥𝑝 is the unbiased estimator of 𝐸[𝑌|𝑥𝑝].

Furthermore, the greater the distance of 𝑥𝑝 from 𝑥̅, the greater is the error of estimating the expected value
of 𝑦. It can be shown that
$$
\sigma_{\hat{y}_p}^2 = \sigma^2\left\{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_x}\right\}
$$
where $\sigma^2$ is the variance about the line of regression. However, since $\sigma^2$ is unknown, we use $s^2$ as the estimator
of $\sigma^2$.

We can thus create confidence intervals for $E[Y|x_p]$.

Those confidence intervals have $n - 2$ degrees of freedom and are of the form
$$
\hat{y}_p \pm t_{n-2,\alpha/2}\, s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_x}}
$$

Confidence interval for expected tree mass (Example 10.2)
Find a 95% C.I. for the expected tree mass of all trees exposed to an atmospheric concentration of CO2 of
600 microliters per liter (parts per million):

Firstly, we find the point estimate, then we find the 95% C.I. by using
$$
\hat{y}_p = b_0 + b_1 x_p = -2.349295162 + 0.008454434(600) = 2.723365238
$$

$$
E[Y|x_p = 600] \in \left(\hat{y}_p \pm t_{6,0.025}\, s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_x}}\right)
= \left(2.723365238 \pm 2.4469\sqrt{0.285117575}\sqrt{\frac{1}{8} + \frac{(600 - 613.5)^2}{179190}}\right)
$$

$$
E[Y|x_p = 600] \in (2.25955214,\ 3.18717814)
$$

This can be done in R using the following structure for the R code:
newpred = (value for the prediction)
predict(model, data.frame(pred = newpred), level = 0.95, interval = "confidence")
where
pred is the name of the original independent variable, newpred is the object containing the new
values for which predictions are desired, and level is the desired confidence level.

The following code gives the 95% confidence interval for the expected tree mass of all trees exposed to an
atmospheric concentration of CO2 of 600 microliters per liter (parts per million)
> newCO2=(600)
> predict(linear.fit,data.frame(CO2 = newCO2), level = 0.95, interval = "confidence")
fit lwr upr
1 2.723365 2.25955 3.18718

Predicting a particular value of 𝒀 for a given value of 𝑥 = 𝑥𝑝
We would like to predict the dependent variable for an individual. What should be evident is that trying to
predict an individual value will have greater error than estimating the expected value, and it can be shown that
$$
\sigma_{pred}^2 = \sigma^2\left[1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_x}\right]
$$

Thus the appropriate prediction interval for 𝑦 is
$$
\hat{y}_p \pm t_{n-2,\alpha/2}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_x}}
$$

Example 10.2 continued: find a 95% prediction interval for a tree exposed to an atmospheric concentration
of CO2 of 600 microliters per liter (parts per million).
$$
\hat{y}_p = b_0 + b_1 x_p = -2.349295162 + 0.008454434(600) = 2.723365238
$$

$$
Y|x_p \in \left(\hat{y}_p \pm t_{6,0.025}\, s\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_x}}\right)
= \left(2.723365238 \pm 2.4469\sqrt{0.285117575}\sqrt{1 + \frac{1}{8} + \frac{(600 - 613.5)^2}{179190}}\right)
$$

$$
Y|x_p \in (1.336926342,\ 4.109803944)
$$

This can be done in R using the following structure for the R code:
newpred = (value for the prediction)
predict(model, data.frame(pred = newpred), level = 0.95, interval = "predict")
where
pred is the name of the original independent variable, newpred is the object containing the new
values for which predictions are desired, and level is the desired level.
The following code gives the 95% prediction interval for the mass of a particular tree exposed to an
atmospheric concentration of CO2 of 600 microliters per liter (parts per million):
> predict(linear.fit,data.frame(CO2 = newCO2), level = 0.95, interval = "predict")
fit lwr upr
1 2.723365 1.33692 4.109811

Comparing the confidence interval and the prediction interval one can see that as expected, the prediction
interval is much wider than the confidence interval.
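To see the difference in width across the whole observed range of CO2 values, both intervals can be plotted as bands around the fitted line; a sketch using the Mass and CO2 vectors and the fitted model linear.fit:

> grid <- data.frame(CO2 = seq(min(CO2), max(CO2), length.out=50))            # x values within the observed range
> conf_int <- predict(linear.fit, grid, interval="confidence", level=0.95)    # fit, lwr, upr for E(Y|x)
> pred_int <- predict(linear.fit, grid, interval="prediction", level=0.95)    # fit, lwr, upr for Y|x
> plot(CO2, Mass)
> matlines(grid$CO2, conf_int, lty=c(1,2,2), col=1)      # fitted line and confidence band
> matlines(grid$CO2, pred_int[,2:3], lty=3, col=1)       # wider prediction band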
