
Correlation and Linear Model

Modelling Data | Linear Model

© University of Sydney MATH1062


28 August 2024

 Module2 Modelling Data

Normal Curve
What is the Normal Curve? And what does it have to do with the sample mean?

Linear Model
How can we describe the relationship between two variables? When is a linear
model appropriate?

2/72

 Scatter plot & correlation

Correlation
· Bivariate data & scatter plot
· Correlation coefficient
· Properties and warnings
Linear model
· Regression Line
· Prediction
· Residuals and properties
· Coefficient of determination
· Diagnostics of model fit
Summary

3/72
Scatter plots
History
· Sir Francis Galton (England, 1822–1911) studied the degree to which children
resemble their parents (and wrote travel books on “wild countries”!)
· Galton’s work was continued by his student Karl Pearson (England, 1857–1936).
Pearson measured the heights of 1,078 fathers and their sons at maturity.

5/72
Pearson’s plot of heights (scatter plot)

· Plotting the pairs of heights creates a cloud of points.

· Generally, taller fathers tend to have taller sons.

6/72
Code for plotting Pearson’s data
# install.packages('UsingR')
suppressMessages(library(UsingR)) # UsingR provides a collection of data sets
data(father.son) # This is Pearson's data.
data = father.son
x = data$fheight # fathers' heights
y = data$sheight # sons' heights
# scatter plot
plot(x, y, xlim = c(58, 80), ylim = c(58, 80), xaxt = "n", yaxt = "n", xaxs = "i",
     yaxs = "i", main = "Pearson's data", xlab = "Father height (inches)",
     ylab = "Son height (inches)")
# Draw custom axis ticks every 2 inches
axp = seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)

7/72

 Statistical Thinking
Why do we care whether there is an association between two variables (here: the heights
of fathers and sons)?

· The association is interesting on its own.

· The association between two variables can be used for prediction, i.e., use the outcome
of one variable to predict the outcome of the other.

· How can we quantify a possible association?

8/72
Correlation coefficient
Bivariate data


 Bivariate data
Bivariate data involves a pair of variables. We are interested in the relationship
between the two variables. Can one variable be used to predict the other?

· Formally, we have (𝑥𝑖 , 𝑦𝑖 ) for 𝑖 = 1, 2, … , 𝑛.


· 𝑋 and 𝑌 can have the same role
· 𝑋 and 𝑌 may have different roles: for example, 𝑋 can be an independent
variable (or explanatory variable, predictor or regressor) which we use to explain
or predict 𝑌 , the dependent variable (or response variable).

10/72
How can we summarise bivariate data?
Bivariate data (or a scatter plot) can be summarised by the following five numerical
summaries:

· sample mean and sample SD of 𝑋 (𝑥¯ , SD𝑥 )


· sample mean and sample SD of 𝑌 (𝑦¯, SD𝑦 )
· correlation coefficient (𝑟).

11/72
Association between the two variables

· All clouds have the same centre and horizontal and vertical spread.
· However they have different spread around a line (linear association). How do we
measure this?

12/72
The correlation coefficient

 The (Pearson) correlation coefficient (r)
· A numerical summary measure of how the points are spread around a line.
· It indicates both the sign and strength of the linear association.
· It is defined as the mean of the product of the variables in standard units.

Recall that

$$\text{standard unit} = \frac{\text{data point} - \text{mean}}{SD}.$$
Using the sample SD, we divide by n − 1 in the average:

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(x_i-\bar{x})}{SD_{sample}(X)}\cdot\frac{(y_i-\bar{y})}{SD_{sample}(Y)} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}},$$

which simplifies to

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}.$$

13/72
The same correlation coefficient r can be obtained using the population SD as well
(dividing by n in the average):

$$r = \frac{1}{n}\sum_{i=1}^{n}\frac{(x_i-\bar{x})}{SD_{pop}(X)}\cdot\frac{(y_i-\bar{y})}{SD_{pop}(Y)} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2}},$$

which also simplifies to

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}.$$

Quick calculation in R using cor().

cor(x, y)
## [1] 0.5013383
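
The same value can be obtained directly from the definition (a minimal sketch, assuming the
father/son height vectors x and y created earlier; the 1/(n − 1) factors cancel in the
simplified form):

r_manual = sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
r_manual # should match cor(x, y), about 0.5013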

14/72
Why does 𝑟 measure association?
Here, for illustration, we round data to 1 decimal place to make calculations simpler.

x (father's   y (son's    standard units   standard units   product                        quadrant
height)       height)     (x - 67.7)/2.7   (y - 68.7)/2.8   [(x-67.7)/2.7][(y-68.7)/2.8]
65.0          59.8        -0.96            -3.16             3.04                          lower left
63.3          63.2        -1.62            -1.94             3.14                          lower left
65.0          63.3        -1.00            -1.90             1.89                          lower left
70.3          67.0         0.95            -0.59            -0.57                          lower right
...
                                                             mean = +0.5 (over all pairs)

15/72
We divide the scatter plot into 4 quadrants, at the point of averages (centre).

· In the upper-right and lower-left quadrants, products of standard units are positive (+).
· In the upper-left and lower-right quadrants, products of standard units are negative (−).

· A majority of points in the upper-right (+) and lower-left (+) quadrants gives a positive r.
· A majority of points in the upper-left (−) and lower-right (−) quadrants gives a negative r
(see the sketch below).
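
A minimal computational sketch of this idea, assuming the father/son height vectors x and y
from the earlier code:

zx = (x - mean(x)) / sd(x) # standard units for fathers' heights
zy = (y - mean(y)) / sd(y) # standard units for sons' heights
table(sign(zx * zy) > 0)   # more positive than negative products pulls r upwards
sum(zx * zy) / (length(x) - 1) # the mean of the products, which equals cor(x, y)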

16/72
More examples

17/72
Properties and warnings
Interpretations of r values
· The correlation coefficient 𝑟 always takes values between -1 and 1 (inclusive).
- This can be shown using the definition of 𝑟 and the Cauchy-Schwarz inequality
(only for your information).
· If r is positive: the cloud slopes up.
· If r is negative: the cloud slopes down.
· r = 0 indicates no linear association between the two variables.
· As r gets closer to ±1, the points cluster more tightly around the line.

19/72
Invariant properties
Shift and scale invariant
The correlation coefficient is shift and scale invariant. Why? Shifting and scaling (by a
positive constant) do not change the standard units.

cor(x, y)
## [1] 0.5013383
cor(0.2 * x + 3, 3 * y - 1)
## [1] 0.5013383

Symmetry (commutative)
The correlation coefficient is not affected by interchanging the variables.

cor(x, y)
## [1] 0.5013383
cor(y, x)
## [1] 0.5013383

20/72
Warning 1:
Wrong interpretations of correlation coefficient

Mistake:
r = 0.8 means that 80% of the points are tightly clustered around the line.

Mistake:
r = 0.8 means that the points are twice as tightly clustered as for r = 0.4.
Note: r = 0.8 suggests a stronger linear association between the variables than r = 0.4,
BUT it does not mean the points are twice as tightly clustered.

21/72
Warning 2:
Outliers can overly influence the correlation coefficient
Suppose there was an extra unusual reading of (100,50).

f1 = c(data$fheight, 100) # Add an extra (outlying) point to the data
s1 = c(data$sheight, 50)
cor(data$fheight, data$sheight)
## [1] 0.5013383
cor(f1, s1)
## [1] 0.3956794

22/72
par(mfrow = c(1, 2))
plot(data$fheight, data$sheight)
plot(f1, s1)
points(100, 50, col = "lightgreen", pch = 19, cex = 2)

23/72
Warning 3:
Nonlinear association can’t be detected by the correlation coefficient
What interpretation mistake could be made in the following data set?

x = c(1:20)
y = x^2
cor(x, y)
## [1] 0.9713482

Based on the correlation coefficient, the points should cluster very tightly around the line
sloping up.

24/72
But look at the scatter plots.
plot(x, y)

This data should be modelled by a quadratic curve, not a line.

We should always use the correlation coefficient together with the scatter plot.

25/72
Warning 4:
The same correlation coefficient can arise from very different data
The following 4 data sets (Anscombe's Quartet) have the same x̄, SDx, ȳ, SDy, and
also the same value of r.

## x_mean: 9 9 9 9
## x_sd: 3.316625 3.316625 3.316625 3.316625
## y_mean: 7.500909 7.500909 7.5 7.500909
## y_sd: 2.031568 2.031657 2.030424 2.030579
## r: 0.8164205 0.8162365 0.8162867 0.8165214

26/72
But look at the scatter plots.
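
The quartet is available in R as the built-in anscombe data set (assumed here to be the
source of the values printed above); a minimal sketch to reproduce the summaries and draw
the four scatter plots:

data(anscombe) # built-in data set with columns x1..x4 and y1..y4
summaries = sapply(1:4, function(i) {
  xi = anscombe[[paste0("x", i)]]
  yi = anscombe[[paste0("y", i)]]
  c(x_mean = mean(xi), x_sd = sd(xi), y_mean = mean(yi), y_sd = sd(yi), r = cor(xi, yi))
})
round(summaries, 4)
par(mfrow = c(2, 2))
for (i in 1:4) plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
  xlab = paste0("x", i), ylab = paste0("y", i))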

27/72
Warning 5 (not for assessment)
Ecological correlations tend to inflate the correlation coefficient
· An ecological correlation is a correlation between two variables that are group
means.
· For example, suppose we recorded the heights of fathers and sons in many communities,
and then calculated the average heights for each community.
· Correlations at the group level (ecological correlations) can be much higher than
those at the individual level.
· See Freedman et al., Statistics, pp. 148–149.

28/72
Example

· The 1st plot has all 3 sets of data combined: correlation = 0.51 (not very strong).
· The 2nd plot has the averages of the 3 data sets: correlation = 0.94 (very strong).

29/72
Regression line
Pearson’s plot of heights

· How can we summarise the data with a line?


· How do we find the optimal line?

31/72
1st option: SD line (not so good)
· The SD line might look like a good candidate as it connects the point of averages
(𝑥¯ , 𝑦¯) to (𝑥¯ + SD𝑥 , 𝑦¯ + SD𝑦 ) (for this data with positive correlation).

32/72
· Note how it underestimates (LHS) and overestimates (RHS) at the extremes.

· Recall that 𝑋, 𝑌 can have the same mean and SD but very different correlation
coefficient.
· The above model does not use the correlation coefficient, so it is insensitive to the
amount of clustering around the line.

33/72
· These data have the same means and the same SDs, and hence the same SD line.

· But they have different correlation coefficients. How do we take this into account?

· How do we quantify the quality of the fitted line, so we can define the optimal line?

34/72
Best option: regression line
· To describe the scatter plot, we need to use all five summaries: 𝑥¯ , 𝑦¯, 𝑆𝐷𝑥 , 𝑆𝐷𝑦
and 𝑟.
· The regression line connects (𝑥¯ , 𝑦¯) to (𝑥¯ + SD𝑥 , 𝑦¯ + 𝑟SD𝑦 )
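
A minimal sketch comparing the two candidate lines on Pearson's data (assuming the height
vectors x and y from the earlier code; the colours are arbitrary):

plot(x, y, col = adjustcolor("black", alpha.f = 0.35),
  xlab = "Father height (inches)", ylab = "Son height (inches)")
b_sd = sd(y) / sd(x) # slope of the SD line
abline(a = mean(y) - b_sd * mean(x), b = b_sd, col = "blue", lty = 2) # SD line
b_reg = cor(x, y) * sd(y) / sd(x) # slope of the regression line
abline(a = mean(y) - b_reg * mean(x), b = b_reg, col = "red") # regression line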

35/72
· Note the improvement at the extremes.

36/72
Summary of regression line
Feature         Regression Line y ~ x (y = a + bx)
Connects        (x̄, ȳ) to (x̄ + SDx, ȳ + r SDy)
Slope (b)       r SDy / SDx
Intercept (a)   ȳ − b x̄

Optimality: We can derive the regression line using calculus, by minimising the sum of
squares of the residuals.

37/72
In R
lm(y ~ x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 33.8866 0.5141
model = lm(y ~ x)
model$coeff
## (Intercept) x
## 33.886604 0.514093

So for 𝑥 = father height and 𝑦 = son height, the regression line is

𝑦 = 33.886604 + 0.514093𝑥

38/72
plot(x, y, xlab = "Father height (inches)", ylab = "Son height (inches)", col = adjustcolor("black",
alpha.f = 0.35))
abline(lm(y ~ x), col = "red")

39/72
Prediction
Baseline prediction
· Suppose a (newborn) son's father is 75 inches tall; how can we predict the son's adult height?
· If we don't use the information in the independent variable x at all, a basic prediction
of y is simply the average of all the y values in the data.
· So for any father's height, we would predict the son's height to be 68.68 inches.

mean(y)
## [1] 68.68407

41/72
The Regression line
· A better prediction is based on the regression line: y = intercept + slope × x.
· For the height data: y = 33.886604 + 0.514093x.
· So for a father's height of 75 inches, we predict the son's height to be
33.886604 + 0.514093 × 75 ≈ 72.44 inches.

42/72
Can we also use Y to predict X?
We can predict 𝑌 from 𝑋 or 𝑋 from 𝑌 , depending on what fits the context.

43/72
Beware!
· Can we simply rearrange the equation? (y = a + bx) ⟹ (x = −a/b + (1/b)y)
· The answer is NO, unless r = ±1 (data clustered exactly along the line).
· We need to refit the model.

Feature      Regression Line y ~ x (y = a + bx)    Regression Line x ~ y (x = ã + b̃y)
Connects     (x̄, ȳ) to (x̄ + SDx, ȳ + r SDy)        (ȳ, x̄) to (ȳ + SDy, x̄ + r SDx)
Slope        b = r SDy / SDx                        b̃ = r SDx / SDy
Intercept    a = ȳ − b x̄                            ã = x̄ − b̃ ȳ

44/72
lm(y ~ x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 33.8866 0.5141

lm(x ~ y)
##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## 34.1075 0.4889

45/72
Residuals and properties
Residuals
We can now make predictions using the regression line. But we have some prediction
error.


 Residual (prediction error)
· A residual is the vertical distance of a point above or below the regression line.

· A residual represents the error between the actual value and the prediction.

47/72
When the father’s height is 71.82, the actual value of the son’s height is 66.42 with
predicted value 70.81, so the residual is -4.39.

48/72
Formally, given the actual value $y_i$ and the prediction $\hat{y}_i$, the residual is

$$e_i(a, b) = y_i - \hat{y}_i = y_i - (\underbrace{a}_{\text{intercept}} + \underbrace{b}_{\text{slope}}\,x_i).$$

l = lm(y ~ x)
y[39] - l$fitted.values[39]
## 39
## -4.390582
# Or directly
l$residuals[39]
## 39
## -4.390582

The regression line is the best (optimal) linear model: it provides the best fit to the data
in the sense that the sum of the squared residuals $\sum_{i=1}^{n} e_i(a, b)^2$ is as small
as it can be.
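
A small numerical illustration (a sketch, assuming the height vectors x and y from earlier):
the fitted line has a smaller sum of squared residuals than, for instance, the SD line.

sse = function(a, b) sum((y - a - b * x)^2) # sum of squared residuals for the line y = a + b x
fit = lm(y ~ x)
sse(coef(fit)[1], coef(fit)[2]) # regression line: the smallest achievable value
b_sd = sd(y) / sd(x)
sse(mean(y) - b_sd * mean(x), b_sd) # SD line: larger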

49/72
Optimality of regression line
· We first consider a general line 𝑦 = 𝛼 + 𝛽𝑥 with intercept 𝛼 and slope 𝛽.
· Given the data set {(x_i, y_i)}, i = 1, ..., n, and a pair (α, β) defining a line,
the residual is

$$e_i(\alpha, \beta) = y_i - (\alpha + \beta x_i),$$

so that the sum of squared residuals becomes

$$f(\alpha, \beta) = \sum_{i=1}^{n} e_i(\alpha, \beta)^2 = \sum_{i=1}^{n}(y_i - \alpha - \beta x_i)^2.$$

50/72
· Our goal is to find the intercept $a$ and the slope $b$ that minimise $f(\alpha, \beta)$:

$$f(a, b) \le f(\alpha, \beta) \quad \text{for all } \alpha, \beta.$$


· The following derivation of optimality is not for examination

51/72
How do we find such a minimiser $(a, b)$? It needs to be a stationary point of the function $f$,
so that

$$\frac{\partial f}{\partial \alpha}(a, b) = \sum_{i=1}^{n} 2(y_i - a - b x_i)(-1) = 0
\quad \text{and} \quad
\frac{\partial f}{\partial \beta}(a, b) = \sum_{i=1}^{n} 2(y_i - a - b x_i)(-x_i) = 0.$$

We use the first equation to find the intercept: $\frac{\partial f}{\partial \alpha}(a, b) = 0$ is equivalent to

$$\sum_{i=1}^{n}(y_i - a - b x_i) = 0 \iff \sum_{i=1}^{n} y_i = \sum_{i=1}^{n}(a + b x_i) = na + b\sum_{i=1}^{n} x_i.$$

Dividing both sides by $n$ gives

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = a + b\,\frac{1}{n}\sum_{i=1}^{n} x_i = a + b\bar{x},$$

which leads to $a = \bar{y} - b\bar{x}$.

52/72
We can find the slope by substituting $a = \bar{y} - b\bar{x}$ into the second equation.
Then $\frac{\partial f}{\partial \beta}(a, b) = 0$ becomes

$$\sum_{i=1}^{n}\big[y_i - (\bar{y} - b\bar{x}) - b x_i\big]x_i = 0.$$

After rearrangement,

$$\sum_{i=1}^{n}(y_i - \bar{y})x_i = b\sum_{i=1}^{n}(x_i - \bar{x})x_i.$$

Because the sum of deviations is zero (Topic 3, Week 3), we have $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$
and $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$, and hence

$$\sum_{i=1}^{n}(y_i - \bar{y})\bar{x} = 0 \quad \text{and} \quad \sum_{i=1}^{n}(x_i - \bar{x})\bar{x} = 0,$$

as $\bar{x}$ is a constant for all $i$. Subtracting these from each side gives

$$LHS = \sum_{i=1}^{n}(y_i - \bar{y})x_i - \sum_{i=1}^{n}(y_i - \bar{y})\bar{x} = \sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})$$

$$RHS = b\sum_{i=1}^{n}(x_i - \bar{x})x_i - b\sum_{i=1}^{n}(x_i - \bar{x})\bar{x} = b\sum_{i=1}^{n}(x_i - \bar{x})^2.$$

53/72
By solving the second equation $\frac{\partial f}{\partial \beta}(a, b) = 0$, the slope is

$$b = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

Recall that

· $SD_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

· $SD_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}$

· $r = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

This gives exactly $b = r\,\frac{SD_y}{SD_x}$, as claimed in the definition of the regression
line. So the regression line is indeed the best among all lines (linear functions) in the
sense of the sum of squared residuals.
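
A quick numerical check of these closed-form expressions against lm() (a sketch, assuming
the height vectors x and y from earlier):

b = cor(x, y) * sd(y) / sd(x) # slope from the closed-form expression
a = mean(y) - b * mean(x)     # intercept from the closed-form expression
c(intercept = a, slope = b)
coef(lm(y ~ x))               # should agree: about 33.8866 and 0.5141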

54/72
Average of residuals is zero
Given the regression line $y = a + bx$, where $a = \bar{y} - b\bar{x}$, the sum of residuals

$$\sum_{i=1}^{n} e_i(a, b) = \sum_{i=1}^{n}(y_i - a - b x_i) = \sum_{i=1}^{n}\big(y_i - (\bar{y} - b\bar{x}) - b x_i\big)$$

can be expressed as

$$\sum_{i=1}^{n}(y_i - \bar{y}) - b\sum_{i=1}^{n}(x_i - \bar{x}) = 0.$$

Thus, the mean (average) of the residuals is zero.
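
A one-line numerical check (a sketch, again assuming the height vectors x and y):

mean(lm(y ~ x)$residuals) # effectively zero, up to floating-point error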

55/72
Summary of residual
Feature         Regression Line y ~ x (y = a + bx)
Connects        (x̄, ȳ) to (x̄ + SDx, ȳ + r SDy)
Slope (b)       r SDy / SDx
Intercept (a)   ȳ − b x̄
Residual        e_i = y_i − a − b x_i

· y = a + bx is the best line: it minimises the sum of squared residuals $\sum_{i=1}^{n} e_i^2$.
· The average residual of the regression line is zero: $\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = 0$.

56/72
Coefficient of determination
How much of the variability in y can be explained by the linear model?

(Plots: baseline prediction / deviations in y; regression line / residuals)

58/72
More comments
(A). Data for which all variation is explained. All the points fall exactly on a straight line.
In this case, all (100%) of the sample variation in 𝑦 can be attributed to the fact that 𝑥
and 𝑦 are linearly related in combination with variation in 𝑥.

(B). Data for which most variation is explained. The points do not fall exactly on a line,
but compared to overall 𝑦 variability, the residuals from the regression line are small. It is
reasonable to conclude in this case that much of the observed 𝑦 variation can be
attributed to the linear regression model.

(C). Data for which little variation is explained. There is substantial variation in the points
about the regression line relative to overall y variation, so the linear regression model
fails to explain variation in 𝑦 by relating 𝑦 to 𝑥.

59/72
Explaining variations
· The sum of squared residuals (or SSE, for sum of squared errors)

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - a - b x_i)^2$$

measures the variation in y left unexplained by the regression line.

· In (A) SSE = 0, and there is no unexplained variation, whereas unexplained variation


is small for (B), and large for (C).

· A quantitative measure of the total amount of variation in the observed y values is
given by the total sum of squares (the sum of squared deviations about the sample mean)

$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2,$$

which measures the variation in y left unexplained by the baseline prediction.

60/72
· SST ≥ SSE.

· Why? The regression line is optimal for the sum of squared errors, so SSE (from the
regression line) cannot be worse than SST (from the baseline prediction).

61/72
Coefficient of determination

 The coefficient of determination
The ratio SSE/SST is the proportion of the total variation that cannot be explained by the
simple linear regression model, and the coefficient of determination is

$$1 - \frac{SSE}{SST} = r^2,$$

which is the squared correlation coefficient (a number between 0 and 1), giving
the proportion of observed y variation explained by the model.

· The higher the value of 𝑟2 , the more successful is the simple linear regression model
in explaining y variation.
· Note that if SSE = 0 as in case (A), then 𝑟2 = 1.
· This can be verified using 𝑎, 𝑏, SDs, and 𝑟 (see next lab)

62/72
Example
cor(x, y)^2 # quick way
## [1] 0.2513401
lm.fit <- lm(y ~ x)
SSE = sum(lm.fit$residuals^2)
SST = sum((y - mean(y))^2)
r2 = 1 - SSE/SST
r2
## [1] 0.2513401

The coefficient of determination for Pearson's height data is about 0.25: roughly 25% of the
variation in sons' heights can be explained by the regression line.

63/72
Diagnostics
Residual Plot


 Residual plot
· A residual plot graphs the residuals vs 𝑥.
· If the linear fit is appropriate for the data, it should show no pattern (random
points around 0).
· By checking the patterns of the residuals, the residual plot is a diagnostic plot to
check the appropriateness of a linear model.

65/72
plot(x, l$residuals, ylab = "residuals", col = adjustcolor("black", alpha.f = 0.35))
abline(h = 0)


 Does this residual plot look random?

66/72
Homoscedasticity and Heteroscedasticity


 Vertical strips
In linear models and regression analysis generally, we need to check the
homogeneity of the spread of the response variable (or the residuals). We can
divide the scatter plot or the residual plot into vertical strips.

· If the vertical strips on the scatter plot show equal spread in the 𝑦 direction, then
the data is homoscedastic.
- The regression line could be used for predictions.

· If the vertical strips don’t show equal spread in the 𝑦 direction, then the data is
heteroscedastic.
- The regression line should not be used for predictions.

67/72

 Is the Pearson’s height data homoscedastic?

68/72
Common mistake 1: extrapolating
If we make a prediction from an 𝑥 value that is not within the range of the data, then that
prediction can be completely unreliable.
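
For instance, a sketch of such an extrapolation with Pearson's height data (the observed
fathers' heights lie within the 58-80 inch range shown on the earlier scatter plot, so
100 inches is far outside the data):

predict(lm(y ~ x), newdata = data.frame(x = 100)) # about 85 inches: not to be trusted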

69/72
Common mistake 2: not checking the scatter plot
· We can have a high correlation coefficient and then fit a regression line, but the data
may not even be linear!
· So always check the scatter plot first!

Note: even though the correlation coefficient is high (r ≈ 0.99), a quadratic model is
more appropriate than a linear model.

70/72
Common mistake 3: not checking the residual plot
· You should also check the residual plot
· This detects any pattern that has not been captured by fitting a linear model.
· If the linear model is appropriate, the residual plot should be a random scatter of
points.

71/72
Summary
Correlation
· Correlation coefficient
· Properties and warnings
Linear model
· Regression Line
· Prediction
· Residuals and properties
· Coefficient of determination
· Diagnostics of model fit
Some R Functions
lm, plot, abline

72/72
