Regression and correlation notes
10.0 Introduction
In the previous chapters we mainly worked with univariate data (where observations are made on one
variable only), although we did consider bivariate data when we studied correlation and in the chapter on jointly
distributed random variables (discrete case). We return to the bivariate case in this chapter because we are often interested in
how two or more variables are related to one another, e.g. an educational psychologist may be interested in how
vocabulary size is related to age, or a farmer may be interested in how crop yield is related to rainfall.
10.1 Correlation
A scatterplot visually depicts how two continuous variables 𝑋 and 𝑌 are related to each other through a plot
of the observations (𝑥𝑖 , 𝑦𝑖 ) for these variables.
Example 10.1: Length of time (in hours) that nine MS1 students slept just before they sat for their first test
(H), and the marks they scored in the test (in %) (M)
Hours (H):   2   3   4   5   6   7   8   9   10
Marks % (M): 45  45  58  50  57  68  70  62  75
Figure 10.1: Scatterplot of hours slept (H) versus marks obtained (M)
Is there a relationship between hours slept and marks obtained? What is the direction of the relationship?
Can we describe this by some measure to say how strongly related the variables are?
Yes. There is a positive linear relationship between X (hours slept) and Y (marks). The correlation coefficient (𝑟) describes the
strength of the linear relationship between the independent and dependent variables.
The figure above, sourced from https://communitymedicine4asses.com/2013/12/27/correlation/, shows that
the value of a car and its mileage are negatively related (i.e. the more mileage the
car has done, the cheaper it will be), that there is no relationship between the colour of a car and the quality
of the car (hence the zero correlation), and that there is a positive linear relationship between car insurance cost
and the number of accidents (with more accidents the insurance cost is expected to increase).
The correlation coefficient is calculated as

r = SPxy / √(SSx · SSy)

where (all sums run from i = 1 to n)

SPxy = Σ xi yi − (Σ xi)(Σ yi)/n

SSx = Σ xi² − (Σ xi)²/n

SSy = Σ yi² − (Σ yi)²/n
The correlation coefficient (𝑟) measures the strength of the linear relationship between the independent
and dependent variables. Its values lie in the range −1 to +1.
Correlation            Interpretation
𝑟 equal to +1          Perfect positive linear relationship
𝑟 far away from −1     Weak negative linear relationship
The correlation coefficient of 0.895 indicates that a strong positive linear relationship exists between hours
slept and marks obtained.
Using the equation for correlation gives us:
Hours (h)   Marks % (m)   h²     m²      hm
2           45            4      2025    90
3           45            9      2025    135
4           58            16     3364    232
5           50            25     2500    250
6           57            36     3249    342
7           68            49     4624    476
8           70            64     4900    560
9           62            81     3844    558
10          75            100    5625    750
Totals:
54          530           384    32156   3393
r = SPHM / √(SSH · SSM) = 213 / √((60)(944.8888889)) = 0.895

where

SPHM = Σ hi mi − (Σ hi)(Σ mi)/n = 3393 − (54)(530)/9 = 213

SSH = Σ hi² − (Σ hi)²/n = 384 − (54)²/9 = 60

SSM = Σ mi² − (Σ mi)²/n = 32156 − (530)²/9 = 944.8888889
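As a quick check, the following R sketch (the vector names h and m are chosen here for illustration) recomputes SPHM, SSH, SSM and r from the data and compares the result with R's built-in cor() function:

h <- c(2,3,4,5,6,7,8,9,10)          # hours slept
m <- c(45,45,58,50,57,68,70,62,75)  # marks (%)
n <- length(h)
SP_hm <- sum(h*m) - sum(h)*sum(m)/n # 3393 - (54)(530)/9 = 213
SS_h  <- sum(h^2) - sum(h)^2/n      # 384 - 54^2/9 = 60
SS_m  <- sum(m^2) - sum(m)^2/n      # 32156 - 530^2/9 = 944.889
SP_hm / sqrt(SS_h * SS_m)           # approximately 0.895
cor(h, m, method = "pearson")       # same value from the built-in function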
If the correlation coefficient indicates that there is a linear relationship between the independent variable (𝑋)
and dependent variable (𝑌), then we could fit a linear model to the data by regressing the dependent variable
on the independent variable. The dependent variable is also referred to as the Response variable and the
independent variable is also known as the Explanatory or Predictor variable. A few examples include:
examining the relationship between an increase in temperature (𝑋, Explanatory) and the yield (𝑌, Response)
from a chemical reaction run at that temperature; the mark of a student (𝑌) and the number of
hours spent studying (𝑋); or the crop yield (𝑌) based on the amount of rainfall received (𝑋).
10.2 The Simple Linear regression model
If there is a linear relationship in the data, then there are population parameters 𝛽0, 𝛽1, and 𝜎 2 such that for
any fixed value of 𝑥 of the random variable 𝑋, the Response or dependent variable (𝑌) is related to this fixed
𝑥 by a linear model given by: 𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀.
The linear model 𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀 is referred to as a probabilistic model. A probabilistic model accounts for
random deviation or random error denoted by ε.
• The random deviation or error ε is a random variable and it is assumed to be normally distributed with
mean 0 and variance 𝜎 2 , that is 𝜀~𝑁(0, 𝜎 2 ) with this mean and variance being the same regardless of
the value of 𝑥 that the model takes on.
• Since the 𝑛 (𝑥, 𝑦) pairs (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) are regarded as being generated independently
of each other, each of the 𝑌’s given by 𝑌𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜀𝑖 are independent of each other as a result
of the 𝜀𝑖′ s being independent of each other.
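To make the probabilistic model concrete, the following R sketch simulates data from Y = β0 + β1x + ε; the parameter values β0 = 2, β1 = 0.5 and σ = 1 are arbitrary choices used only for this illustration:

set.seed(1)                       # for a reproducible simulation
beta0 <- 2; beta1 <- 0.5          # illustrative regression coefficients (not from the notes)
sigma <- 1                        # illustrative error standard deviation
x <- 1:20                         # fixed x values chosen by the experimenter
eps <- rnorm(length(x), mean = 0, sd = sigma)  # random errors, eps ~ N(0, sigma^2)
y <- beta0 + beta1*x + eps        # responses generated by the probabilistic model
plot(x, y)                        # scatterplot shows a linear trend plus random scatter
abline(beta0, beta1)              # the true (population) regression line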
The following figure from Devore and Berk (Figure 12.4) illustrates the simple linear regression model.
In the above Figure, 𝑌 and 𝜀 are random variables (so they will have some distribution with mean and
variance), 𝑥 is a fixed value of the random variable 𝑋, and 𝛽0 and 𝛽1 are constants and are referred to as the
regression coefficients.
• Y is the dependent random variable; its value is determined by the fixed x together with the random error (ε)
• The variable X is fixed at a value xi. We can choose the values we give to it; hence it is called the
independent variable.
• 𝛽0 is the unknown intercept coefficient (𝑦-intercept) - the value of 𝑦 when 𝑥 = 0
• 𝛽1 is the slope coefficient of the population or true regression line 𝛽0 + 𝛽1 𝑥 . It is interpreted as the true
average increase (or decrease) in 𝑌 associated with every one-unit increase (or decrease) in 𝑥
• 𝜀 is the error term or the part of 𝑌, that the regression model is unable to explain.
The population parameters 𝛽0 , 𝛽1 and 𝜎 2 are generally unknown and will need to be estimated. The
parameters are estimated by drawing a random sample consisting of 𝑛 (𝑥𝑖 , 𝑦𝑖 ) pairs i.e:
(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) from the population of interest and then utilising this sample data to obtain the
sample estimates 𝑏0 , 𝑏1 and 𝑠 2 which are point estimates for the unknown population parameters 𝛽0 , 𝛽1 and
𝜎 2 . The estimated linear regression line 𝑦̂ = 𝑏0 + 𝑏1 𝑥, is obtained by drawing a straight line through the
sample data such that it is “closest” to as many of the sample data points as possible. The “closeness” of the
data is obtained by calculating the deviations between the observed 𝑦 values and the estimated 𝑦̂ values.
10.2.1 Finding the best fitting line or least squares regression line
If the scatterplot of the bivariate data consisting of 𝑛 (𝑥, 𝑦) pairs i.e: (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) appears to
have a linear pattern we can then relate the set of observations (𝑥𝑖 , 𝑦𝑖 ) of the two variables, X and Y
respectively by an equation for a straight line viz: 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖 . This is done through a process called
Simple Linear regression where 𝑋 represents the independent variable and 𝑌 represents the dependent
variable. The best model will be the straight line that is closest to as many of the observed (𝑥𝑖 , 𝑦𝑖 )
measurements as possible.
From the previous figure, one can observe how much a point (𝑥𝑖 , 𝑦𝑖 ) deviates or how far it is from the fitted
line. The predicted values 𝑦̂1, 𝑦̂2,…, 𝑦̂𝑛 are obtained by substituting the observed 𝑥 values 𝑥1 , 𝑥2 , … , 𝑥𝑛 into the
equation for the regression line. The model can be estimated by minimising the sum of the squared deviations
or the sum of the squared errors between the observed 𝑦𝑖 values and the predicted 𝑦̂𝑖 values. The deviations
are also referred to as residuals.
Definition: Residuals: The vertical deviations between the actual or observed 𝑦’s and the predicted 𝑦 values
given by the line (i.e. 𝑦̂) are called residuals, denoted ei =𝑦𝑖 − 𝑦̂𝑖
We want these residuals to be as small as possible to get the best line i.e. we want to minimise the residuals.
Since the residuals may be positive and negative numbers, they will cancel each other out if we simply sum
them, hence we work with the squared residuals to avoid this. Thus, we are finding the line 𝑦̂𝑖 = 𝑏0 + 𝑏1 𝑥𝑖
which minimises the sum of the squared residuals (Σ ei²).
In the estimated least squares regression line 𝑦̂ = 𝑏0 + 𝑏1 𝑥 obtained from the observed sample data, 𝑏0
represents the intercept (the 𝑦 value when 𝑥 is 0) and 𝑏1 represents the slope of the line (which represents
the change in the dependent variable for every 1-unit change in the independent variable).
Minimising the sum of the squared residuals (Σ ei²) is undertaken by finding the partial derivatives of

Σ ei² = Σ (yi − ŷi)² = Σ (yi − b0 − b1 xi)²

with respect to b0 and b1, thereafter setting both to zero and solving.
Using partial derivatives and differentiating with respect to b0 gives:

∂/∂b0 Σ (yi − b0 − b1 xi)² = −2 Σ (yi − b0 − b1 xi)

Equating to zero:

−2 Σ (yi − b0 − b1 xi) = 0
⇒ Σ (yi − b0 − b1 xi) = 0
⇒ Σ yi = n b0 + b1 Σ xi

Similarly, differentiating with respect to b1 gives:

∂/∂b1 Σ (yi − b0 − b1 xi)² = −2 Σ (yi − b0 − b1 xi) xi

Equating to zero:

−2 Σ (yi − b0 − b1 xi) xi = 0
⇒ Σ (yi − b0 − b1 xi) xi = 0
⇒ Σ xi yi = b0 Σ xi + b1 Σ xi²

Solving these two normal equations simultaneously gives:

b1 = [Σ xi yi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n] = SPxy / SSx
and substituting b1 into the first normal equation and making b0 the subject gives:

b0 = (Σ yi)/n − b1 (Σ xi)/n = ȳ − b1 x̄
Applying second-order differentiation confirms that this is a
minimum and not a maximum or point of inflection.
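As a check on these formulas, the following R sketch computes b1 and b0 for the hours/marks data of Example 10.1 (object names chosen here for illustration) and compares them with the coefficients returned by lm():

h <- c(2,3,4,5,6,7,8,9,10)
m <- c(45,45,58,50,57,68,70,62,75)
n <- length(h)
SP_hm <- sum(h*m) - sum(h)*sum(m)/n  # 213
SS_h  <- sum(h^2) - sum(h)^2/n       # 60
b1 <- SP_hm / SS_h                   # 3.55
b0 <- mean(m) - b1*mean(h)           # 37.589
c(b0, b1)
coef(lm(m ~ h))                      # least squares estimates from lm() agree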
Now remember that for the line 𝑦̂ = 𝑏0 + 𝑏1 𝑥 , its intercept 𝑏0 and slope 𝑏1 are obtained for a particular data
set. The line can be used to predict the value of 𝑦̂ for a given 𝑥 value that is within the range of 𝑥 values
observed in this particular data set. One cannot predict the 𝑦̂ value for a given 𝑥 value that is outside the range
of 𝑥 values as this would amount to extrapolation and the prediction would not be reliable. The slope indicates
the average or expected change in the dependent variable for every 1-unit change in the independent variable.
Also, apart from having the property that the sum of the squared deviations is minimised, the line 𝑦̂ = 𝑏0 + 𝑏1 𝑥
also has the property that the mean residual is zero:

ē = (1/n) Σ ei = (1/n) Σ (yi − ŷi) = (1/n) Σ (yi − (b0 + b1 xi)) = ȳ − b0 − b1 x̄ = 0,

since b0 = ȳ − b1 x̄.
Now remember that the line 𝑦̂ = 𝑏0 + 𝑏1 𝑥, obtained from 𝑛 (𝑥, 𝑦) pairs in a particular sample data set is an
estimate for the population where 𝑏0 and 𝑏1 are estimates of 𝛽0 and 𝛽1 respectively. If we change the data set
to a new or different set of 𝑛 (𝑥, 𝑦) pairs we will obtain a different estimated line 𝑦̂ = 𝑏0 + 𝑏1 𝑥, with
different values for 𝑏0 and 𝑏1 .
Recall that it is assumed that 𝜀𝑖 ~𝑁(0, 𝜎 2 ).
Now each 𝑌𝑖 ~𝑁(𝛽0 + 𝛽1 𝑥𝑖 , 𝜎 2 ) since 𝐸(𝑌|𝑥𝑖 ) = 𝛽0 + 𝛽1 𝑥𝑖 , as illustrated in Devore and Berk
(figure panel (a): distribution of 𝜀 about the true regression line).
We have already estimated the slope and intercept parameters, thus we still need to estimate a third parameter
𝜎 2 . We will estimate this parameter in section 10.4.2. The variance parameter 𝜎 2 represents the amount of
variability inherent in the regression model. If σ² is close to zero then almost all the (x, y) pairs will be
close to the population regression line. If σ² is far from 0 then most of the (x, y) pairs in the
scatterplot will be spread out, far away from the population regression line. Since the equation of the true
population regression line is unknown, the estimate is based on the extent to which the sample data deviates
from the estimated line. We require an estimate of 𝜎 2 for the confidence intervals that we will calculate
shortly. Since 𝑏0 and 𝑏1 are the best estimates (you will study in 2nd year that they are both unbiased
estimators) of 𝛽0 and 𝛽1 respectively, we can safely say that the distribution of the errors or residuals
ei =𝑦𝑖 − 𝑦̂𝑖 from the sample should estimate the distribution of 𝜀 in the population.
Previously we saw that the sample variance 𝑠 2 is the best estimator of the population parameter 𝜎 2 . In much
the same way, we can use the sample variance of the residuals ei to estimate 𝜎 2 , the population variance of
𝜀. The sample variance of the residuals is

s² = SSE/(n − 2) = Σ (yi − ŷi)² / (n − 2).

It has 𝑛 − 2 degrees of freedom because we estimated two parameters 𝛽0 and 𝛽1 for the regression line and we are using the estimates in the equation
above to find the sample variance 𝑠 2 . Now, s² = SSE/(n − 2) is also called the mean squared error (MSE). This
can be calculated by the computationally easier formula:

MSE = SSE/(n − 2) = (SSy − b1 SPxy)/(n − 2).
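For the hours/marks data of Example 10.1, a short R sketch (object names chosen for illustration) confirms that the computational formula for SSE agrees with the residual-based definition of s²:

h <- c(2,3,4,5,6,7,8,9,10)
m <- c(45,45,58,50,57,68,70,62,75)
n <- length(h)
fit <- lm(m ~ h)
SSE_resid <- sum(resid(fit)^2)        # definition: sum of squared residuals
SS_m  <- sum(m^2) - sum(m)^2/n
SP_hm <- sum(h*m) - sum(h)*sum(m)/n
b1 <- unname(coef(fit)[2])
SSE_short <- SS_m - b1*SP_hm          # computational formula SSy - b1*SPxy
c(SSE_resid, SSE_short)               # both about 188.74
SSE_resid/(n - 2)                     # s^2 = MSE, about 26.96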
Returning to Example 10.2 (tree mass versus atmospheric concentration of CO2), the strength of the linear relationship is calculated by finding the Pearson correlation coefficient between the
two variables Mass and CO2. Using R with the following commands gives:
> cor(Mass,CO2, method="pearson") #Calc Pearson correlation
[1] 0.9392405
The correlation coefficient 𝑟 = 0.9392405 is close to +1 and hence we say that the two variables are strongly,
positively linearly related. Homework exercise: Verify the answer for the correlation coefficient by using
the equations!
Tree mass is the Dependent or Response variable and atmospheric concentration of CO2 is the Independent
or Explanatory variable. By regressing Mass on CO2, we can obtain a linear model relating the two.
We can find the least squares regression line using:

b1 = [Σ xi yi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n] = SPxy / SSx

b1 = SPxy / SSx = [15441.4 − (4908)(22.7)/8] / [3190248 − (4908)²/8] = 1514.95 / 179190 = 0.00845

and b0 = ȳ − b1 x̄ = 22.7/8 − 0.00845(4908/8) = −2.349
Hence the estimated linear regression line is 𝑦̂ = −2.349 + 0.00845𝑥
We can superimpose the least squares line on the scatterplot with the following code in R:
> abline(linear.fit) #plots linear regression line on data
Figure 10.6: Plot of Mass versus CO2 with superimposed least squares line
The least squares regression coefficients are obtained with the following R code: lm(Dependent variable ~
Independent variable)
> linear.fit=lm(Mass~CO2) #creates the linear model
Interpretation of the slope of the least squares line: The slope is positive and shows that tree mass increases
by about 0.00845 units on average for every one-unit increase in CO2 concentration.
To calculate the fitted values (ŷ) we use the following R code:
> fitted(linear.fit)
1 2 3 4 5 6 7 8
1.100114 1.100114 2.334461 2.334461 3.399720 3.399720 4.515705 4.515705
One can also find the Residuals = Observed values − Fitted values (e = y − ŷ) using the following code:
> resid(linear.fit)
1 2 3 4 5 6
-0.0001138456 0.1998861544 -0.7344611865 0.1655388135 -0.3997198504 0.9002801496
7 8
-0.3157051175 0.1842948825
The least squares line can be used for two purposes:
For a fixed 𝑥 value (say 𝑥 ∗ ), 𝑦̂ = 𝑏0 + 𝑏1 𝑥 gives either
(i) The estimate of the expected value of 𝑌 when 𝑥 = 𝑥 ∗ (as seen in Figure 10.4) or
(ii) a point prediction of the 𝑌 value when 𝑥 = 𝑥 ∗, i.e. we predict tree mass for a given atmospheric
concentration of CO2.
Note that we must caution against extrapolation, hence we can only predict the dependent or Response
variable, in this case 𝑌 (Mass) value for some fixed value of 𝑥 where the value of 𝑥 must be between
(min(𝑥) ; max (𝑥)) as the least squares line was generated using this specific range of 𝑥 values.
A further summary for the model can be obtained in R using the following R code:
> summary(linear.fit) # gives the summary of the model
Call:
lm(formula = Mass ~ CO2)
Residuals:
Min 1Q Median 3Q Max
-0.73446 -0.33671 0.08271 0.18819 0.90028
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.349295 0.796567 -2.949 0.025637 *
CO2 0.008454 0.001261 6.702 0.000536 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Once we have decided that the straight line is appropriate, we need to assess the accuracy of the predictions that are based on the least
squares line. Two numerical measures that give us an idea of how well the model fits the sample data are the
coefficient of determination and the standard deviation about the least squares line.
Figure 10.7: Variation in the Dependent variable (extracted from Devore &Berk Figure 12.12)
In the figure labelled (a) above, all the points lie on the least squares line, so all the variation in the 𝑦’s is
attributed to the 𝑥’s and hence there is no unexplained variation. Hence 𝑆𝑆𝐸, which measures the variation
that is unexplained by the model, will be 0.
In Figure (b) the points will not all lie on the line but most will be very close to the line so there is a small
amount of unexplained variation and hence SSE will be small.
Figure (c) will have a much larger variation than (b) and hence SSE will be much larger.
To understand how 𝑟 2 (the proportion of the variability in the dependent variable that is explained by the model) is calculated, we need to consider the
variability in the 𝑦 values. This can be done in two ways: looking at the total variation that is unexplained by
the model (SSE) or looking at the total variation in the observed 𝑦’𝑠, as shown in the Figure below.
Figure 10.8: Extracted from Devore and Berk (Figure 12.13) to illustrate variation in the model
Understanding the variation in the model is explained in further detail below by considering the equations
for SSE and SST.
a) We can look at how far the 𝑦’𝑠 are from their respective 𝑦̂ values (Figure 10.8 – LHS picture) – this
is referred to as the Error Sum of Squares or Sum of Squares of the Errors (SSE) where
SSE = Σ ei² = Σ (yi − ŷi)² = Σ (yi − (b0 + b1 xi))²
It is interpreted as the amount of variability in the 𝑦’𝑠 that is unexplained by the model.
OR
b) We can look at the total amount of variation in the observed 𝑦’𝑠 (Figure 10.8 – RHS picture) i.e.:
how spread out the observed 𝑦’𝑠 are from the mean 𝑦 value (𝑦̅). This is called the Total sum of
squares (denoted SST) where SST = ∑𝑛𝑖=1(𝑦𝑖 − 𝑦̅)2 = 𝑆𝑆𝑦
Note the computationally easier equation: SSy = Σ yi² − (Σ yi)²/n
So, SSE is the squared deviation about the least squares line and SST is the squared deviation about
the horizontal line for 𝑦̅
SST = Σ (yi − ȳ)²
    = Σ [(yi − ŷi) + (ŷi − ȳ)]²
    = Σ [(yi − ŷi)² + 2(yi − ŷi)(ŷi − ȳ) + (ŷi − ȳ)²]
    = Σ (yi − ŷi)² + Σ (ŷi − ȳ)²,

as the middle term evaluates to 0: Σ (yi − ŷi)(ŷi − ȳ) = (b0 − ȳ) Σ ei + b1 Σ ei xi = 0, since both Σ ei = 0 and Σ ei xi = 0 by the normal equations.
Thus SST = SSE + SSR, where SSR = Σ (ŷi − ȳ)² is defined as the regression sum of squares.

So, the ratio SSE/SST is the proportion of total variation that is unexplained by the model and hence

1 − SSE/SST = r²

is the proportion of variation that is explained by the model. This is referred to as the Coefficient
of Determination.
The coefficient of Determination can be obtained using the following R code or it can be read off from the
previous output given above (Multiple R-squared):
> summary(linear.fit)$r.squared # gives the coefficient of determination
[1] 0.8821727
To calculate the standard deviation 𝑠 we need to calculate the unexplained sum of squares 𝑆𝑆𝐸. Recall that

SSE = Σ ei² = Σ (yi − ŷi)²

The unbiased estimator 𝑠 2 of the variance 𝜎 2 is SSE/(n − 2). Using the above equation for calculation can be
tedious, hence we can use the computationally easier equation: SSE = SSy − b1 SPxy
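Assuming the linear.fit object for Mass regressed on CO2 created earlier is still available, SSE, s and r² can be extracted directly from the fitted model in R:

> sum(resid(linear.fit)^2)          # SSE, the unexplained sum of squares
> summary(linear.fit)$sigma^2       # s^2 = SSE/(n-2), the MSE (about 0.2851)
> summary(linear.fit)$r.squared     # coefficient of determination, 0.8821727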
10.4.3 Check if the assumptions of the model are met
Recall our model: 𝑌 = 𝛽0 + 𝛽1 𝑥 + 𝜀 had the following assumptions made about the model:
The error terms:
i) are Normally distributed with mean 0 and variance 𝜎 2 hence, 𝜀𝑖 ~𝑁(0, 𝜎 2 )
ii) have the same or equal variance for the different 𝑥 values. This is referred to as homoscedasticity
iii) are independent of one another
These assumptions may or may not be true and hence will need to be checked. If they are true then the
observed residuals 𝑒𝑖 = 𝑦𝑖 − 𝑦̂𝑖 should also behave in the same way as 𝜀𝑖 and should be approximately
normally distributed with constant variances for the different 𝑥 values.
The assumptions can be checked by examining the following residual plots: (reference:
https://data.library.virginia.edu/diagnostic-plots/)
Plot 1: Residuals vs Fitted
The plots show that there is no clear pattern in Case 1; however, in Case 2 we can see the pattern of a parabola
and hence there is a non-linear relationship in the data, so the assumptions for the linear model are not
satisfied. This non-linear relationship was not captured by the model and is left in the residuals.
Plot 2: Normal Q-Q
Normal Q-Q plot indicates if the residuals are normally distributed. If the residuals follow the straight dashed
line in the plot, then they are normally distributed.
Case 1 shows that the residuals are Normally distributed as they follow the straight dashed line quite closely,
although the observation numbered 38 looks far away from the other points. Case 2 deviates quite
distinctly from the straight dashed line and hence the normality assumption is violated. Although observation
#38 falls away from the rest of the data in Case 1, at this stage we would not be too concerned and say that
Case 1 satisfies the normality assumption.
Let’s look at the next plot (Scale-Location) while keeping in mind that observation #38 might be a potential
problem. Observation #38 is quite far away from the other points in the plot so it seems to be breaking away
from the pattern of the other values and is termed to be an outlier. An outlier can have an influence on the
slope of the model.
Plot 3: Scale-Location
The Scale-location plot also known as the Spread-Location plot, shows if the residuals are spread equally
along the 𝑥 values. If you see a horizontal line with equally (randomly) spread out points then the assumption
of equal variance or homoscedasticity is satisfied.
In these plots the residuals appear randomly and equally spread in Case 1, whereas in Case 2
the residuals begin to fan out or spread wider along the 𝑥-axis. Because the residuals spread out wider
and wider, the red smooth line is not horizontal and has a steep incline.
Plot 4: Residuals vs Leverage
The Residuals versus Leverage plot helps us to identify whether any of the outliers are influential. Not all outliers are
influential and hence would not necessarily affect the regression line, for example if they follow the
trend of the majority of cases in the data; however, those that could influence the regression line need closer
examination. Points with extreme values of X are said to have high leverage, and high leverage points
have the ability to move the line more. Unlike in the other plots, we do not examine the pattern in the data but
we look for values that lie outside the dashed lines, which mark a measure referred to as the Cook's distance
score. Attention needs to be paid to extreme values that occur at the upper right corner or at the lower right
corner of the plot, as these points have high leverage and can be influential against a regression line. Points
that fall outside the dashed Cook's distance lines are regarded as being influential to the
model; if these points are excluded, the regression results would change. Note that the numbers within the
plot refer to the items or cases in the dataset that need to be examined, e.g. in the plot below 32 refers to the 32nd
case in the dataset. See video at https://www.youtube.com/watch?v=xc_X9GFVuVU
The plots above show the contour lines for the Cook’s distance. Cook’s distance is another measure of the
importance of each observation to the regression. Generally, a Cook’s distance of more than 1 would indicate
an influential value or possible outlier and possibly a poor model. In the above plots there seem to be no
influential points outside the Cook’s distance lines. In Case 2 however, observation #49 appears to be far
outside the Cook’s distance lines and is thus an influential point. Excluding it from the data could influence
the regression model.
If similar cases occur across the four plots as outliers, then closer attention needs to be paid to the points to
see if there is anything special for that particular case or are there errors in the capturing of the data etc. To
use R’s regression diagnostic plots, we set up the regression model as an object and create a plotting
environment of two rows and two columns. Then we use the plot() command, treating the model as an
argument. Applying this to the data in Example 12.5 (Devore) gives the following:
> par(mfrow = c(2,2)) # splits the screen in 4 parts
> plot(linear.fit) #plots the diagnostic plots
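If numerical influence measures are preferred to reading them off the plot, Cook's distances can also be listed directly with R's cooks.distance() function (using the linear.fit object from above):

> cooks.distance(linear.fit)              # Cook's distance for each observation
> which(cooks.distance(linear.fit) > 1)   # flags any observations with D greater than 1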
If there is a non-linear relationship between the dependent variable and independent variable, then the pattern
indicating the non-linear relationship could be seen in this plot. If the residuals seem to be equally spread
around a horizontal line without any clear pattern, this would indicate that we do not have a non-linear
relationship.
In using the R diagnostic plots and influence statistics to diagnose how well our model is fitting the data, we
see that the first plot (Residuals versus Fitted values) is a simple scatterplot between residuals and
predicted values. It should look more or less random. Here, the residuals seem to be equally spaced about the
horizontal line.
The second plot (normal Q-Q) shows the points closely plotted along the dashed line hence the errors are
normally distributed.
The third plot (Scale-Location), like the first, should look random. No patterns should appear in this plot.
Ours has an upside-down V-shape.
The last plot (Residuals vs Leverage, with Cook’s distance D) tells us which points have the greatest influence on the regression (leverage
points). In our plot, the values of D are below 0.5, so points 3 and 6, for example, which appear to be outliers,
are not regarded as influential.
“So, what does having patterns in residuals mean to your research? It’s not just a go-or-stop sign. It tells
you about your model and data. Your current model might not be the best way to understand your data if
there’s so much good stuff left in the data.” https://data.library.virginia.edu/diagnostic-plots/
Just as a sample statistic such as the sample mean varies from sample to sample, in much the same way the value of 𝑏1 will vary from sample to sample. Hence 𝑏1 will have a distribution with
some mean and variance. It can be shown (in 2nd year) that:
E(b1) = β1   and   Var(b1) = σ² / Σ (xi − x̄)² = σ² / SSx
Thus b1 is an unbiased estimator of β1 with standard deviation σ/√SSx, hence

z = (b1 − β1)/σb1 = (b1 − β1)/(σ/√SSx)

However, since σ² is unknown but b1 is normally distributed, the appropriate test statistic is

t(n−2) = (b1 − β1)/(s/√SSx)
We can derive the confidence interval by starting with the following probability statement:

P( −t(n−2, α/2) < (b1 − β1)/(s/√SSx) < t(n−2, α/2) ) = 1 − α

Thus, we can construct a 100(1 − α)% C.I. for β1, the slope parameter of the true regression line for the
population, as

β1 ∈ b1 ± t(n−2, α/2) · s/√SSx
The 95% confidence interval for the slope can be found using the following R code:
> confint(linear.fit, level=0.95)
This should give the following output for the confidence interval for the intercept and slope although we
require the interval for the slope only.
> hours<-c(2,3,4,5,6,7,8,9,10)
> mark<-c(45,45,58,50,57,68,70,62,75)
> linear.fit=lm(mark~hours)
> coef(linear.fit)
(Intercept) hours
37.58889 3.55000
> confint(linear.fit, level=0.95)
2.5 % 97.5 %
(Intercept) 27.234784 47.942994
hours 1.964858 5.135142
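The interval for the slope can also be reproduced from the formula b1 ± t(n−2, α/2)·s/√SSx; the following R sketch (object names chosen for illustration) matches the confint() output above:

h <- c(2,3,4,5,6,7,8,9,10)
m <- c(45,45,58,50,57,68,70,62,75)
n <- length(h)
SS_h  <- sum(h^2) - sum(h)^2/n        # 60
SP_hm <- sum(h*m) - sum(h)*sum(m)/n   # 213
SS_m  <- sum(m^2) - sum(m)^2/n        # 944.889
b1 <- SP_hm/SS_h                      # 3.55
s  <- sqrt((SS_m - b1*SP_hm)/(n - 2)) # estimate of sigma
b1 + c(-1, 1) * qt(0.975, n - 2) * s/sqrt(SS_h)  # (1.9649, 5.1351), as given by confint()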
Homework exercise: Verify the calculation for Example 10.2 by using the equations and interpret the
interval. Answer should be:
2.5 % 97.5 %
(Intercept) −4.298424521 −0.40016580
CO2 0.005367883 0.01154098
One of the assumptions made is that 𝑌 will be normally distributed for each value of 𝑋. We can say that the
conditional distribution of 𝑌 given 𝑋 = 𝑥0 is Normal with mean 𝛽0 + 𝛽1 𝑥0 and variance 𝜎 2 .
From the figure above we see that the value of 𝑦̂0 = 𝑏0 + 𝑏1 𝑥0 can be regarded as:
i) a point estimate of the mean of the population of values of 𝑌 when 𝑋 = 𝑥0 i.e. 𝐸(𝑌|𝑥0 ) or
ii) a point estimate of a single value from the population of values of 𝑌 when 𝑋 = 𝑥0
Notice that the point estimates of 𝐸(𝑌|𝑥0 ) and 𝑌|𝑥0 are both 𝑦̂0 = 𝑏0 + 𝑏1 𝑥0 . Thus, interval estimates of both
these quantities are centered at 𝑦̂0
The interval estimate of 𝐸(𝑌|𝑥0 ) is called a confidence interval for 𝐸(𝑌|𝑥0 )
The interval estimate of 𝑌|𝑥0 is called a prediction interval for 𝑌|𝑥0
Thus, one could use a linear regression model to predict tree mass (𝑌) based on the CO2 level (𝑋):
i) If we require the interval estimate for the mass of a particular tree exposed to an atmospheric
concentration of CO2 of 600 microliters per liter (parts per million), we will need a prediction interval for
𝑌|𝑥 = 600 since we are estimating the value of 𝑌 from the distribution of 𝑌|𝑥 = 600
ii) If we want the interval estimate of the average mass of all trees exposed to an atmospheric concentration
of CO2 of 600 microliters per liter (parts per million), we will need to find the confidence interval for
𝐸(𝑌|𝑥 = 600)
Note: The 100(1 − α)% C.I. interval estimate for 𝐸(𝑌|𝑥0 ) is shorter than the 100(1 − α)% interval
estimate for 𝑌|𝑥0 .
Why? Because there is more uncertainty in predicting one value of the random variable than in estimating
its average value: the variance of individual values is greater than the variance of the mean.
The variance of 𝑦̂𝑝 involves 𝜎 2 , the variance about the regression line. However, since 𝜎 2 is unknown, we use 𝑠 2 as the estimator
of 𝜎 2 . The resulting interval estimates are based on a t distribution with 𝑛 − 2 degrees of freedom; the confidence interval for 𝐸(𝑌|𝑥𝑝 ) is of the form

ŷp ± t(n−2, α/2) · s · √( 1/n + (xp − x̄)²/SSx )
Confidence interval for expected tree mass (Example 10.2)
Find a 95% C.I. for the expected tree mass of all trees exposed to an atmospheric concentration of CO2 of
600 microliters per liter (parts per million):
Firstly, we find the point estimate, then we find the 95% C.I. by using
E[Y|xp = 600] = ŷp = b0 + b1 xp = −2.349295162 + 0.008454434(600) = 2.723365238

E[Y|xp = 600] ∈ ( ŷp ± t(6, 0.025) · s · √( 1/n + (xp − x̄)²/SSx ) )

E[Y|xp = 600] ∈ ( 2.723365238 ± 2.4469 · √0.285117575 · √( 1/8 + (600 − 613.5)²/179190 ) ) = (2.2596 ; 3.1872)
This can be done in R using the following structure for the R code:
newpred <- c(value for the prediction)
predict(model, data.frame(pred = newpred), level = 0.95, interval = "confidence")
where pred is the name of the independent variable used when fitting the model, newpred is the object containing the new
values for which predictions are desired, and level is the desired confidence level.
The following code gives the 95% confidence interval for the expected tree mass of all trees exposed to an
atmospheric concentration of CO2 of 600 microliters per liter (parts per million)
> newCO2=(600)
> predict(linear.fit,data.frame(CO2 = newCO2), level = 0.95, interval = "confidence")
fit lwr upr
1 2.723365 2.25955 3.18718
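The same interval can be reproduced by hand in R from the quantities quoted in the text (b0, b1, s², x̄, SSx and n for the tree data); the sketch below simply assumes those values:

b0 <- -2.349295162; b1 <- 0.008454434  # least squares coefficients from the text
s  <- sqrt(0.285117575)                # estimate of sigma (s^2 = MSE quoted in the text)
n  <- 8; xbar <- 613.5; SSx <- 179190  # sample size, mean CO2 and SSx from the text
xp <- 600
yhat <- b0 + b1*xp                     # point estimate, about 2.7234
me <- qt(0.975, n - 2) * s * sqrt(1/n + (xp - xbar)^2/SSx)  # margin of error
yhat + c(-1, 1)*me                     # about (2.2596, 3.1872), matching predict()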
Predicting a particular value of 𝒀 for a given value of 𝑥 = 𝑥𝑝
We would like to predict the dependent variable of an individual. What should be evident is that trying to
predict an individual score will have greater error than with the expected value and it can be shown that
σ²pred = σ² [ 1 + 1/n + (xp − x̄)²/SSx ]

Hence the 100(1 − α)% prediction interval for Y|xp is of the form

ŷp ± t(n−2, α/2) · s · √( 1 + 1/n + (xp − x̄)²/SSx )
Example 10.2 continued: Find a 95% prediction interval for the mass of a tree exposed to an atmospheric concentration
of CO2 of 600 microliters per liter (parts per million).

ŷp = b0 + b1 xp = −2.349295162 + 0.008454434(600) = 2.723365238

Y|xp ∈ ( ŷp ± t(6, 0.025) · s · √( 1 + 1/n + (xp − x̄)²/SSx ) )

Y|xp ∈ ( 2.723365238 ± 2.4469 · √0.285117575 · √( 1 + 1/8 + (600 − 613.5)²/179190 ) ) = (1.3369 ; 4.1098)
This can be done in R using the following structure for the R code:
newpred <- c(value for the prediction)
predict(model, data.frame(pred = newpred), level = 0.95, interval = "predict")
where pred is the name of the independent variable used when fitting the model, newpred is the object containing the
new values for which predictions are desired, and level is the desired prediction level.
The following code gives the 95% prediction interval for the mass of a single tree exposed to an
atmospheric concentration of CO2 of 600 microliters per liter (parts per million)
> predict(linear.fit,data.frame(CO2 = newCO2), level = 0.95, interval = "predict")
fit lwr upr
1 2.723365 1.33692 4.109811
Comparing the confidence interval and the prediction interval one can see that as expected, the prediction
interval is much wider than the confidence interval.
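The difference in width can also be checked numerically (using the linear.fit and newCO2 objects from above):

> ci <- predict(linear.fit, data.frame(CO2 = newCO2), interval = "confidence")
> pri <- predict(linear.fit, data.frame(CO2 = newCO2), interval = "prediction")
> ci[, "upr"] - ci[, "lwr"]    # width of the confidence interval, about 0.93
> pri[, "upr"] - pri[, "lwr"]  # width of the prediction interval, about 2.77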