Topic4-Linear-Models
Normal Curve
What is the Normal Curve? And what does it have to do with the sample mean?
Linear Model
How can we describe the relationship between two variables? When is a linear
model appropriate?
2/72
Scatter plot & correlation
Correlation
· Bivariate data & scatter plot
· Correlation coefficient
· Properties and warnings
Linear model
· Regression Line
· Prediction
· Residuals and properties
· Coefficient of determination
· Diagnostics of model fit
Summary
3/72
Scatter plots
History
· Sir Francis Galton (England, 1822–1911) studied the degree to which children
resemble their parents (and wrote travel books on “wild countries”!)
· Galton’s work was continued by his student Karl Pearson (England, 1857–1936).
Pearson measured the heights of 1,078 fathers and their sons at maturity.
5/72
Pearson’s plot of heights (scatter plot)
6/72
Code for plotting Pearson’s data
# install.packages('UsingR')
suppressMessages(library(UsingR)) # Loads a collection of datasets, including Pearson's data
data(father.son) # This is Pearson's data.
data = father.son
x = data$fheight # fathers' heights
y = data$sheight # sons' heights
# scatter plot
plot(x, y, xlim = c(58, 80), ylim = c(58, 80), xaxt = "n", yaxt = "n", xaxs = "i",
yaxs = "i", main = "Pearson's data", xlab = "Father height (inches)", ylab = "Son height (inches)")
# Adjust the gap between label and plot
axp = seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)
7/72
Statistical Thinking
Why do we care whether there is an association between two variables (here: heights
of fathers and sons)?
· Association between two variables can be used for prediction, i.e., using the outcome
of one variable to predict the outcome of the other.
8/72
Correlation coefficient
Bivariate data
Bivariate data
Bivariate data involves a pair of variables. We are interested in the relationship
between the two variables. Can one variable be used to predict the other?
10/72
How can we summarise bivariate data?
Bivariate data (or a scatter plot) can be summarised by the following five numerical
summaries: 𝑥¯, 𝑦¯, SD𝑥, SD𝑦 and the correlation coefficient 𝑟.
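For Pearson's height data, a minimal sketch of these five summaries in R (using the x and y vectors defined earlier):
c(x_bar = mean(x), y_bar = mean(y), SD_x = sd(x), SD_y = sd(y), r = cor(x, y))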
11/72
Association between the two variables
· All clouds have the same centre and horizontal and vertical spread.
· However they have different spread around a line (linear association). How do we
measure this?
12/72
The correlation coefficient
The (Pearson) correlation coefficient (r)
· A numerical summary measuring how points are spread around the line.
· It indicates both the sign and strength of the linear association.
· It is defined as the mean of the product of the variables in standard units.
$$ r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$
13/72
The same correlation coefficient 𝑟 can be obtained using the population SD as well
(dividing by 𝑛 in the average).
cor(x, y)
## [1] 0.5013383
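A minimal check of the definition, computing 𝑟 as the average of products of standard units (sd() uses the sample SD, which divides by 𝑛 − 1):
zx = (x - mean(x)) / sd(x)       # fathers' heights in standard units
zy = (y - mean(y)) / sd(y)       # sons' heights in standard units
sum(zx * zy) / (length(x) - 1)   # matches cor(x, y)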
14/72
Why does 𝑟 measure association?
Here, for illustration, we round data to 1 decimal place to make calculations simpler.
[Table: the rounded data in standard units and the products of the standard units; the mean of the products is +0.5, which is the correlation 𝑟.]
15/72
We divide the scatter plot into 4 quadrants, at the point of averages (centre).
· In the upper right and lower left quadrants, products of standard units are (+)
· In the upper left and lower right quadrants, products of standard units are (-)
· A majority of points in the upper right and lower left quadrants (products +) leads to a
positive 𝑟.
· A majority of points in the upper left and lower right quadrants (products −) leads to a
negative 𝑟.
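As an illustration of this quadrant argument, a short sketch using the height data: for Pearson's data the positive products of standard units outnumber the negative ones, which is consistent with the positive 𝑟.
zx = (x - mean(x)) / sd(x)   # fathers' heights in standard units
zy = (y - mean(y)) / sd(y)   # sons' heights in standard units
table(sign(zx * zy))         # counts of negative / positive products of standard units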
16/72
More examples
17/72
Properties and warnings
Interpretations of r values
· The correlation coefficient 𝑟 always takes values between -1 and 1 (inclusive).
- This can be shown using the definition of 𝑟 and the Cauchy-Schwarz inequality
(only for your information).
· If 𝑟 is positive: the cloud slopes up.
· If 𝑟 is negative: the cloud slopes down.
· 𝑟 = 0 implies no linear dependency between two variables.
· As 𝑟 gets closer to ±1 : the points cluster more tightly around the line.
19/72
Invariant properties
Shift and scale invariant
The correlation coefficient is shift and scale invariant (for scaling by positive constants).
Why? Shifting and scaling do not change the standard units.
cor(x, y)
## [1] 0.5013383
cor(0.2 * x + 3, 3 * y - 1)
## [1] 0.5013383
Symmetry (commutative)
The correlation coefficient is not affected by interchanging the variables.
cor(x, y)
## [1] 0.5013383
cor(y, x)
## [1] 0.5013383
20/72
Warning 1:
Wrong interpretations of correlation coefficient
Mistake:
𝑟 = 0.8 means that 80% of the points lie tightly clustered around the line.
Mistake:
𝑟 = 0.8 means that the points are twice as tightly clustered as for 𝑟 = 0.4.
Note 1: 𝑟 = 0.8 suggests a stronger linear association between the variables than
𝑟 = 0.4, BUT it does not mean the points are twice as tightly clustered.
21/72
Warning 2:
Outliers can overly influence the correlation coefficient
Suppose there was an extra unusual reading of (100,50).
22/72
# Append the unusual reading (100, 50) to Pearson's data
f1 = c(data$fheight, 100)  # fathers' heights with the extra reading
s1 = c(data$sheight, 50)   # sons' heights with the extra reading
par(mfrow = c(1, 2))
plot(data$fheight, data$sheight)
plot(f1, s1)
points(100, 50, col = "lightgreen", pch = 19, cex = 2)
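One way to see the effect numerically (a sketch comparing the correlation with and without the extra point):
cor(data$fheight, data$sheight)   # original correlation, about 0.50
cor(f1, s1)                       # correlation with the unusual reading included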
23/72
Warning 3:
Nonlinear association can’t be detected by the correlation coefficient
What interpretation mistake could be made in the following data set?
x = c(1:20)
y = x^2
cor(x, y)
## [1] 0.9713482
Based on the correlation coefficient alone, we would expect the points to cluster very tightly
around an upward-sloping line.
24/72
But look at the scatter plots.
plot(x, y)
We should always use the correlation coefficient together with a scatter plot.
25/72
Warning 4:
The same correlation coefficient can arise from very different data
The following 4 data sets (Anscombe's Quartet) have the same 𝑥¯, SD𝑥, 𝑦¯, SD𝑦, and
also the same value of 𝑟.
## x_mean: 9 9 9 9
## x_sd: 3.316625 3.316625 3.316625 3.316625
## y_mean: 7.500909 7.500909 7.5 7.500909
## y_sd: 2.031568 2.031657 2.030424 2.030579
## r: 0.8164205 0.8162365 0.8162867 0.8165214
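The printed summaries are consistent with R's built-in anscombe data frame; a minimal sketch (assuming the slide used that data set) to reproduce the correlations:
data(anscombe)
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))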
26/72
But look at the scatter plots.
27/72
Warning 5 (not for assessment)
Ecological correlations tend to inflate the correlation coefficient
· An ecological correlation is the correlation between two variables that are group
means.
· For example, suppose we recorded the heights of fathers and sons in many communities
and then calculated the average heights for each community; the correlation between
the community averages is an ecological correlation.
· Correlations at the group level (ecological correlations) can be much higher than
those at the individual level.
· See Freedman et al., Statistics, pp. 148–149.
28/72
Example
· The 1st plot has all 3 sets of data combined: correlation = 0.51 (not very strong).
· The 2nd plot has the averages of the 3 data sets: correlation = 0.94 (very strong).
29/72
Regression line
Pearson’s plot of heights
31/72
1st option: SD line (not so good)
· The SD line might look like a good candidate as it connects the point of averages
(𝑥¯ , 𝑦¯) to (𝑥¯ + SD𝑥 , 𝑦¯ + SD𝑦 ) (for this data with positive correlation).
32/72
· Note how it underestimates (LHS) and overestimates (RHS) at the extremes.
· Recall that 𝑋, 𝑌 can have the same mean and SD but very different correlation
coefficient.
· The above model does not use the correlation coefficient, so it is insensitive to the
amount of clustering around the line.
33/72
· These data have the same means and the same SDs, and hence the same SD line.
· But they have different correlation coefficients; how do we take this into account?
· How do we quantify the quality of a fitted line, so that we can define the optimal line?
34/72
Best option: regression line
· To describe the scatter plot, we need to use all five summaries: 𝑥¯ , 𝑦¯, 𝑆𝐷𝑥 , 𝑆𝐷𝑦
and 𝑟.
· The regression line connects (𝑥¯ , 𝑦¯) to (𝑥¯ + SD𝑥 , 𝑦¯ + 𝑟SD𝑦 )
35/72
· Note the improvement at the extremes.
36/72
Summary of regression line
Feature       | Regression line 𝑦 ∼ 𝑥 (𝑦 = 𝑎 + 𝑏𝑥)
Connects      | (𝑥¯, 𝑦¯) to (𝑥¯ + SD𝑥, 𝑦¯ + 𝑟 SD𝑦)
Slope (𝑏)     | 𝑏 = 𝑟 SD𝑦 / SD𝑥
Intercept (𝑎) | 𝑎 = 𝑦¯ − 𝑏𝑥¯
Optimality: We can derive the regression line using calculus, by minimising the sum of
squares of the residuals.
37/72
In R
lm(y ~ x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 33.8866 0.5141
model = lm(y ~ x)
model$coeff
## (Intercept) x
## 33.886604 0.514093
𝑦 = 33.886604 + 0.514093𝑥
38/72
plot(x, y, xlab = "Father height (inches)", ylab = "Son height (inches)", col = adjustcolor("black",
alpha.f = 0.35))
abline(lm(y ~ x), col = "red")
39/72
Prediction
Baseline prediction
· If a father is 75 inches tall, how can we predict his (newborn) son's adult height?
· If we do not use the information in the independent variable 𝑥 at all, a basic prediction
of 𝑦 is simply the average of all the 𝑦 values in the data.
· So for any father's height, we would predict the son's height to be 68.68 inches.
mean(y)
## [1] 68.68407
41/72
The Regression line
· A better prediction is based on the regression line 𝑦 = intercept + slope × 𝑥.
· For the height data: 𝑦 = 33.886604 + 0.514093𝑥.
· So for a father's height of 75 inches, we predict the son's height to be 72.44 inches.
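A minimal check in R (using the same x and y vectors; predict() returns the fitted value at a new 𝑥):
model = lm(y ~ x)
predict(model, newdata = data.frame(x = 75))   # about 72.44
# equivalently: 33.886604 + 0.514093 * 75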
42/72
Can we also use Y to predict X?
We can predict 𝑌 from 𝑋 or 𝑋 from 𝑌 , depending on what fits the context.
43/72
Beware!
· Can we just rearrange the equation? (𝑦 = 𝑎 + 𝑏𝑥) ⟹ (𝑥 = −𝑎/𝑏 + (1/𝑏)𝑦)
· The answer is NO unless 𝑟 = ±1 (data clustered along the line).
· We need to refit the model.
44/72
lm(y ~ x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 33.8866 0.5141
lm(x ~ y)
##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## 34.1075 0.4889
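A quick numerical check that the two fitted slopes are not reciprocals of each other (they would be only if 𝑟 = ±1):
1 / coef(lm(y ~ x))[2]   # about 1.95
coef(lm(x ~ y))[2]       # about 0.49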
45/72
Residuals and properties
Residuals
We can now make predictions using the regression line. But we have some prediction
error.
Residual (prediction error)
· A residual is the vertical distance of a point above or below the regression line.
· A residual represents the error between the actual value and the prediction.
47/72
When the father’s height is 71.82, the actual value of the son’s height is 66.42 with
predicted value 70.81, so the residual is -4.39.
48/72
Formally, given the actual value 𝑦𝑖 and the prediction 𝑦̂𝑖, a residual is
$$ e_i(a, b) = y_i - \hat{y}_i = y_i - (\underbrace{a}_{\text{intercept}} + \underbrace{b}_{\text{slope}}\, x_i). $$
l = lm(y ~ x)
y[39] - l$fitted.values[39]
## 39
## -4.390582
# Or directly
l$residuals[39]
## 39
## -4.390582
The regression line is the best (optimal) linear model: it provides the best fit to the data
in the sense that the sum of the squared residuals $\sum_{i=1}^{n} e_i(a, b)^2$ is as small as it can be.
49/72
Optimality of regression line
· We first consider a general line 𝑦 = 𝛼 + 𝛽𝑥 with intercept 𝛼 and slope 𝛽.
· Given the data set {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, … , 𝑛, and a pair of values (𝛼, 𝛽) defining a line,
the residual is
𝑒𝑖 (𝛼, 𝛽) = 𝑦𝑖 − (𝛼 + 𝛽𝑥𝑖 ).
so that the sum of squared residuals becomes
$$ f(\alpha, \beta) = \sum_{i=1}^{n} e_i(\alpha, \beta)^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2. $$
50/72
· Our goal is to find the intercept 𝑎 and the slope 𝑏 that minimise 𝑓(𝛼, 𝛽) :
51/72
How do we find such a minimiser (𝑎, 𝑏)? It needs to be a stationary point of the function 𝑓,
such that
$$ \frac{\partial f}{\partial \alpha}(a, b) = \sum_{i=1}^{n} 2(y_i - a - b x_i)(-1) = 0 $$
and
$$ \frac{\partial f}{\partial \beta}(a, b) = \sum_{i=1}^{n} 2(y_i - a - b x_i)(-x_i) = 0. $$
We use the first equation to find the intercept: $\frac{\partial f}{\partial \alpha}(a, b) = 0$ is equivalent to
$$ \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = a + b\,\frac{1}{n}\sum_{i=1}^{n} x_i = a + b\bar{x}, $$
which leads to $a = \bar{y} - b\bar{x}$.
52/72
We can find the slope by substituting $a = \bar{y} - b\bar{x}$ into the second equation. This way,
$\frac{\partial f}{\partial \beta}(a, b) = 0$ becomes
$$ \sum_{i=1}^{n} \big[y_i - (\bar{y} - b\bar{x}) - b x_i\big]\, x_i = 0. $$
After rearrangement,
$$ \sum_{i=1}^{n} (y_i - \bar{y})\, x_i = b \sum_{i=1}^{n} (x_i - \bar{x})\, x_i. $$
Because the sum of deviations is zero (Topic 3 in Week 3), we have $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$
and $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, and hence
$$ \mathrm{LHS} = \sum_{i=1}^{n} (y_i - \bar{y})\, x_i - \sum_{i=1}^{n} (y_i - \bar{y})\, \bar{x} = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}), $$
and similarly $\mathrm{RHS} = b \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x}) = b \sum_{i=1}^{n} (x_i - \bar{x})^2$.
53/72
By solving the second equation $\frac{\partial f}{\partial \beta}(a, b) = 0$, the slope is
$$ b = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r\,\frac{SD_y}{SD_x}, $$
where
· $SD_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
· $SD_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}$
This gives exactly $b = r\,\frac{SD_y}{SD_x}$, as claimed in the definition of the regression line. So
we know the regression line is indeed the best among all lines (linear functions) in
the sense of the sum of squared residuals.
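A quick numerical check of these formulas on the height data (a sketch; sd() in R uses the 𝑛 − 1 divisor):
b = cor(x, y) * sd(y) / sd(x)
a = mean(y) - b * mean(x)
c(intercept = a, slope = b)   # matches coef(lm(y ~ x))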
54/72
Average of residuals is zero
Given the regression line 𝑦 = 𝑎 + 𝑏𝑥, where $a = \bar{y} - b\bar{x}$, the sum of residuals
$$ \sum_{i=1}^{n} e_i(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i) = \sum_{i=1}^{n} \big(y_i - (\bar{y} - b\bar{x}) - b x_i\big) $$
can be expressed as
$$ \sum_{i=1}^{n} (y_i - \bar{y}) - b \sum_{i=1}^{n} (x_i - \bar{x}) = 0, $$
because both sums of deviations are zero. Hence the average residual is zero.
55/72
Summary of residual
Feature       | Regression line 𝑦 ∼ 𝑥 (𝑦 = 𝑎 + 𝑏𝑥)
Connects      | (𝑥¯, 𝑦¯) to (𝑥¯ + SD𝑥, 𝑦¯ + 𝑟 SD𝑦)
Slope (𝑏)     | 𝑏 = 𝑟 SD𝑦 / SD𝑥
Intercept (𝑎) | 𝑎 = 𝑦¯ − 𝑏𝑥¯
Residual      | 𝑒𝑖 = 𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖
· 𝑦 = 𝑎 + 𝑏𝑥 is the best line: it minimises the sum of squared residuals $\sum_{i=1}^{n} e_i^2$.
· The average residual of the regression line is zero: $\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = 0$.
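Both properties are easy to verify numerically on the height data (a minimal sketch):
fit = lm(y ~ x)
mean(fit$residuals)    # numerically zero (up to floating-point error)
sum(fit$residuals^2)   # the minimised sum of squared residuals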
56/72
Coefficient of determination
How much variability of data 𝑦 can be explained by the linear model?
58/72
More comments
(A). Data for which all variation is explained. All the points fall exactly on a straight line.
In this case, all (100%) of the sample variation in 𝑦 can be attributed to the fact that 𝑥
and 𝑦 are linearly related in combination with variation in 𝑥.
(B). Data for which most variation is explained. The points do not fall exactly on a line,
but compared to overall 𝑦 variability, the residuals from the regression line are small. It is
reasonable to conclude in this case that much of the observed 𝑦 variation can be
attributed to the linear regression model.
(C). Data for which little variation is explained. There is substantial variation in the points
about the regression line relative to overall y variation, so the linear regression model
fails to explain variation in 𝑦 by relating 𝑦 to 𝑥.
59/72
Explaining variations
· The sum of squared residuals (or SSE, for sum of squared errors):
$$ \mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 $$
60/72
· SST ≥ SSE, where $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares, i.e. the squared
prediction error of the baseline prediction 𝑦¯.
· Why? The regression line is optimal for the sum of squared errors, so SSE (regression line)
cannot be worse than SST (baseline).
61/72
Coefficient of determination
The coefficient of determination
The ratio SSE/SST is the proportion of the total variation that cannot be explained by the
simple linear regression model, and the coefficient of determination is
$$ 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = r^2, $$
which is the squared correlation coefficient (a number between 0 and 1) giving
the proportion of the observed 𝑦 variation explained by the model.
· The higher the value of 𝑟2 , the more successful is the simple linear regression model
in explaining y variation.
· Note that if SSE = 0 as in case (A), then 𝑟2 = 1.
· This can be verified using 𝑎, 𝑏, SDs, and 𝑟 (see next lab)
62/72
Example
cor(x, y)^2 # quick way
## [1] 0.2513401
lm.fit <- lm(y ~ x)
SSE = sum(lm.fit$residuals^2)
SST = sum((y - mean(y))^2)
r2 = 1 - SSE/SST
r2
## [1] 0.2513401
The coefficient of determination for Pearson's height data is 0.25: about 25% of the
variation in sons' heights can be explained by the regression line.
63/72
Diagnostics
Residual Plot
Residual plot
· A residual plot graphs the residuals vs 𝑥.
· If the linear fit is appropriate for the data, it should show no pattern (random
points around 0).
· By checking the patterns of the residuals, the residual plot is a diagnostic plot to
check the appropriateness of a linear model.
65/72
plot(x, l$residuals, ylab = "residuals", col = adjustcolor("black", alpha.f = 0.35))
abline(h = 0)
Does this residual plot look random?
66/72
Homoscedasticity and Heteroscedasticity
Vertical strips
In linear models and regression analysis generally, we need to check the
homogeneity of the spread of the response variable (or the residuals). We can
divide the scatter plot or the residual plot into vertical strips.
· If the vertical strips on the scatter plot show equal spread in the 𝑦 direction, then
the data is homoscedastic.
- The regression line could be used for predictions.
· If the vertical strips don’t show equal spread in the 𝑦 direction, then the data is
heteroscedastic.
- The regression line should not be used for predictions.
67/72
Is Pearson's height data homoscedastic?
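One rough check is to compare the spread of the residuals within vertical strips of 𝑥 (a sketch; the strip boundaries are an arbitrary choice):
fit = lm(y ~ x)
strips = cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.25)), include.lowest = TRUE)
tapply(fit$residuals, strips, sd)   # roughly equal SDs suggest homoscedasticity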
68/72
Common mistake 1: extrapolating
If we make a prediction from an 𝑥 value that is not within the range of the data, then that
prediction can be completely unreliable.
69/72
Common mistake 2: not checking the scatter plot
· We can have a high correlation coefficient and then fit a regression line, but the data
may not even be linear!
· So always check the scatter plot first!
Note: Even though the correlation coefficient is high 𝑟 ≈ 0.99, a quadratic model is
more appropriate than a linear model.
70/72
Common mistake 3: not checking the residual plot
· You should also check the residual plot
· This detects any pattern that has not been captured by fitting a linear model.
· If the linear model is appropriate, the residual plot should be a random scatter of
points.
71/72
Summary
Correlation
· Correlation coefficient
· Properties and warnings
Linear model
· Regression Line
· Prediction
· Residuals and properties
· Coefficient of determination
· Diagnostics of model fit
Some R Functions
lm, plot, abline
72/72