Topic4-Linear-Models
Normal Curve
What is the Normal Curve? And what does it have to do with the sample mean?
Linear Model
How can we describe the relationship between two variables? When is a linear
model appropriate?
2/72
Scatter plot & correlation
Correlation
· Bivariate data & scatter plot
· Correlation coefficient
· Properties and warnings
Linear model
· Regression Line
· Prediction
· Residuals and properties
· Coefficient of determination
· Diagnostics of model fit
Summary
3/72
Scatter plots
History
· Sir Francis Galton (England, 1822–1911) studied the degree to which children
resemble their parents (and wrote travel books on “wild countries”!)
· Galton’s work was continued by his student Karl Pearson (England, 1857–1936).
Pearson measured the heights of 1,078 fathers and their sons at maturity.
5/72
Pearson’s plot of heights (scatter plot)
6/72
Code for plotting Pearson’s data
# install.packages('UsingR')
suppressMessages(library(UsingR)) # Loads a collection of datasets, including Pearson's data
data(father.son) # This is Pearson's data.
data = father.son
x = data$fheight # fathers' heights
y = data$sheight # sons' heights
# scatter plot
plot(x, y, xlim = c(58, 80), ylim = c(58, 80), xaxt = "n", yaxt = "n", xaxs = "i",
yaxs = "i", main = "Pearson's data", xlab = "Father height (inches)", ylab = "Son height (inches)")
# Adjust the gap between label and plot
axp = seq(58, 80, by = 2)
axis(1, at = axp, labels = axp)
axis(2, at = axp, labels = axp)
7/72
Statistical Thinking
Why do we care whether there is an association between two variables (here: heights
of fathers and sons)?
· Association between two variables can be used for prediction, i.e., using the outcome
of one variable to predict the outcome of the other.
8/72
Correlation coefficient
Bivariate data
Bivariate data
Bivariate data involves a pair of variables. We are interested in the relationship
between the two variables. Can one variable be used to predict the other?
10/72
How can we summarise bivariate data?
Bivariate data (or a scatter plot) can be summarised by the following five numerical
summaries: 𝑥¯, 𝑦¯, SD𝑥, SD𝑦 and the correlation coefficient 𝑟.
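For Pearson's height data, a minimal sketch of these five summaries in R (using the x and y vectors defined earlier):
c(x_bar = mean(x), y_bar = mean(y), SD_x = sd(x), SD_y = sd(y), r = cor(x, y))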
11/72
Association between the two variables
· All clouds have the same centre and horizontal and vertical spread.
· However they have different spread around a line (linear association). How do we
measure this?
12/72
The correlation coefficient
The (Pearson) correlation coefficient (r)
· A numerical summary measuring how points are spread around the line.
· It indicates both the sign and strength of the linear association.
· It is defined as the mean of the product of the variables in standard units.
$$ r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}} $$
13/72
The same correlation coefficient 𝑟 can be obtained using the population SD as well
(dividing by 𝑛 in the average).
cor(x, y)
## [1] 0.5013383
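A minimal check of the definition, computing 𝑟 as the average of products of standard units (sd() uses the sample SD, which divides by 𝑛 − 1):
zx = (x - mean(x)) / sd(x)       # fathers' heights in standard units
zy = (y - mean(y)) / sd(y)       # sons' heights in standard units
sum(zx * zy) / (length(x) - 1)   # matches cor(x, y)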
14/72
Why does 𝑟 measure association?
Here, for illustration, we round data to 1 decimal place to make calculations simpler.
[Table: the rounded data in standard units and the products of the standard units; the mean of the products is +0.5, which is the correlation 𝑟.]
15/72
We divide the scatter plot into 4 quadrants, at the point of averages (centre).
· In the upper right and lower left quadrants, products of standard units are (+)
· In the upper left and lower right quadrants, products of standard units are (-)
· A majority of points in the upper right and lower left quadrants (products +) leads to a
positive 𝑟.
· A majority of points in the upper left and lower right quadrants (products −) leads to a
negative 𝑟.
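As an illustration of this quadrant argument, a short sketch using the height data: for Pearson's data the positive products of standard units outnumber the negative ones, which is consistent with the positive 𝑟.
zx = (x - mean(x)) / sd(x)   # fathers' heights in standard units
zy = (y - mean(y)) / sd(y)   # sons' heights in standard units
table(sign(zx * zy))         # counts of negative / positive products of standard units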
16/72
More examples
17/72
Properties and warnings
Interpretations of r values
· The correlation coefficient 𝑟 always takes values between -1 and 1 (inclusive).
- This can be shown using the definition of 𝑟 and the Cauchy-Schwarz inequality
(only for your information).
· If 𝑟 is positive: the cloud slopes up.
· If 𝑟 is negative: the cloud slopes down.
· 𝑟 = 0 implies no linear dependency between two variables.
· As 𝑟 gets closer to ±1 : the points cluster more tightly around the line.
19/72
Invariant properties
Shift and scale invariant
The correlation coefficient is shift and scale invariant (for scaling by positive constants).
Why? Shifting and scaling do not change the standard units.
cor(x, y)
## [1] 0.5013383
cor(0.2 * x + 3, 3 * y - 1)
## [1] 0.5013383
Symmetry (commutative)
The correlation coefficient is not affected by interchanging the variables.
cor(x, y)
## [1] 0.5013383
cor(y, x)
## [1] 0.5013383
20/72
Warning 1:
Wrong interpretations of correlation coefficient
Mistake:
𝑟 = 0.8 means that 80% of the points lie tightly clustered around the line.
Mistake:
𝑟 = 0.8 means that the points are twice as tightly clustered as for 𝑟 = 0.4.
Note 1: 𝑟 = 0.8 suggests a stronger linear association between the variables than
𝑟 = 0.4, BUT it does not mean the points are twice as tightly clustered.
21/72
Warning 2:
Outliers can overly influence the correlation coefficient
Suppose there was an extra unusual reading of (100,50).
22/72
# Append the unusual reading (100, 50) to Pearson's data
f1 = c(data$fheight, 100)  # fathers' heights with the extra reading
s1 = c(data$sheight, 50)   # sons' heights with the extra reading
par(mfrow = c(1, 2))
plot(data$fheight, data$sheight)
plot(f1, s1)
points(100, 50, col = "lightgreen", pch = 19, cex = 2)
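One way to see the effect numerically (a sketch comparing the correlation with and without the extra point):
cor(data$fheight, data$sheight)   # original correlation, about 0.50
cor(f1, s1)                       # correlation with the unusual reading included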
23/72
Warning 3:
Nonlinear association can’t be detected by the correlation coefficient
What interpretation mistake could be made in the following data set?
x = c(1:20)
y = x^2
cor(x, y)
## [1] 0.9713482
Based on the correlation coefficient alone, we would expect the points to cluster very tightly
around an upward-sloping line.
24/72
But look at the scatter plots.
plot(x, y)
We should always use the correlation coefficient together with a scatter plot.
25/72
Warning 4:
The same correlation coefficient can arise from very different data
The following 4 data sets (Anscombe's Quartet) have the same 𝑥¯, SD𝑥, 𝑦¯, SD𝑦, and
also the same value of 𝑟.
## x_mean: 9 9 9 9
## x_sd: 3.316625 3.316625 3.316625 3.316625
## y_mean: 7.500909 7.500909 7.5 7.500909
## y_sd: 2.031568 2.031657 2.030424 2.030579
## r: 0.8164205 0.8162365 0.8162867 0.8165214
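The printed summaries are consistent with R's built-in anscombe data frame; a minimal sketch (assuming the slide used that data set) to reproduce the correlations:
data(anscombe)
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))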
26/72
But look at the scatter plots.
27/72
Warning 5 (not for assessment)
Ecological correlations tend to inflate the correlation coefficient
· An ecological correlation is the correlation between two variables that are group
means.
· For example, suppose we recorded the heights of fathers and sons in many communities
and then calculated the average heights for each community; the correlation between
the community averages is an ecological correlation.
· Correlations at the group level (ecological correlations) can be much higher than
those at the individual level.
· See Freedman et al., Statistics, pp. 148–149.
28/72
Example
· The 1st plot has all 3 sets of data combined: correlation = 0.51 (not very strong).
· The 2nd plot has the averages of the 3 data sets: correlation = 0.94 (very strong).
29/72
Regression line
Pearson’s plot of heights
31/72
1st option: SD line (not so good)
· The SD line might look like a good candidate as it connects the point of averages
(𝑥¯ , 𝑦¯) to (𝑥¯ + SD𝑥 , 𝑦¯ + SD𝑦 ) (for this data with positive correlation).
32/72
· Note how it underestimates (LHS) and overestimates (RHS) at the extremes.
· Recall that 𝑋, 𝑌 can have the same mean and SD but very different correlation
coefficient.
· The above model does not use the correlation coefficient, so it is insensitive to the
amount of clustering around the line.
33/72
· These data have the same means and the same SDs, and hence the same SD line.
· But they have different correlation coefficients; how do we take this into account?
· How do we quantify the quality of a fitted line, so that we can define the optimal line?
34/72
Best option: regression line
· To describe the scatter plot, we need to use all five summaries: 𝑥¯ , 𝑦¯, 𝑆𝐷𝑥 , 𝑆𝐷𝑦
and 𝑟.
· The regression line connects (𝑥¯ , 𝑦¯) to (𝑥¯ + SD𝑥 , 𝑦¯ + 𝑟SD𝑦 )
35/72
· Note the improvement at the extremes.
36/72
Summary of regression line
Feature       | Regression line 𝑦 ∼ 𝑥 (𝑦 = 𝑎 + 𝑏𝑥)
Connects      | (𝑥¯, 𝑦¯) to (𝑥¯ + SD𝑥, 𝑦¯ + 𝑟 SD𝑦)
Slope (𝑏)     | 𝑏 = 𝑟 SD𝑦 / SD𝑥
Intercept (𝑎) | 𝑎 = 𝑦¯ − 𝑏𝑥¯
Optimality: We can derive the regression line using calculus, by minimising the sum of
squares of the residuals.
37/72
In R
lm(y ~ x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 33.8866 0.5141
model = lm(y ~ x)
model$coeff
## (Intercept) x
## 33.886604 0.514093
𝑦 = 33.886604 + 0.514093𝑥
38/72
plot(x, y, xlab = "Father height (inches)", ylab = "Son height (inches)", col = adjustcolor("black",
alpha.f = 0.35))
abline(lm(y ~ x), col = "red")
39/72
Prediction
Baseline prediction
· If a father is 75 inches tall, how can we predict his (newborn) son's adult height?
· If we do not use the information in the independent variable 𝑥 at all, a basic prediction
of 𝑦 is simply the average of all the 𝑦 values in the data.
· So for any father's height, we would predict the son's height to be 68.68 inches.
mean(y)
## [1] 68.68407
41/72
The Regression line
· A better prediction is based on the regression line 𝑦 = intercept + slope × 𝑥.
· For the height data: 𝑦 = 33.886604 + 0.514093𝑥.
· So for a father's height of 75 inches, we predict the son's height to be 72.44 inches.
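A minimal check in R (using the same x and y vectors; predict() returns the fitted value at a new 𝑥):
model = lm(y ~ x)
predict(model, newdata = data.frame(x = 75))   # about 72.44
# equivalently: 33.886604 + 0.514093 * 75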
42/72
Can we also use Y to predict X?
We can predict 𝑌 from 𝑋 or 𝑋 from 𝑌 , depending on what fits the context.
43/72
Beware!
· Can we just rearrange the equation? (𝑦 = 𝑎 + 𝑏𝑥) ⟹ (𝑥 = −𝑎/𝑏 + (1/𝑏)𝑦)
· The answer is NO unless 𝑟 = ±1 (data clustered along the line).
· We need to refit the model.
44/72
lm(y ~ x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 33.8866 0.5141
lm(x ~ y)
##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## 34.1075 0.4889
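A quick numerical check that the two fitted slopes are not reciprocals of each other (they would be only if 𝑟 = ±1):
1 / coef(lm(y ~ x))[2]   # about 1.95
coef(lm(x ~ y))[2]       # about 0.49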
45/72
Residuals and properties
Residuals
We can now make predictions using the regression line. But we have some prediction
error.
Residual (prediction error)
· A residual is the vertical distance of a point above or below the regression line.
· A residual represents the error between the actual value and the prediction.
47/72
When the father’s height is 71.82, the actual value of the son’s height is 66.42 with
predicted value 70.81, so the residual is -4.39.
48/72
Formally, given the actual value 𝑦𝑖 and the prediction 𝑦̂𝑖, a residual is
$$ e_i(a, b) = y_i - \hat{y}_i = y_i - (\underbrace{a}_{\text{intercept}} + \underbrace{b}_{\text{slope}}\, x_i). $$
l = lm(y ~ x)
y[39] - l$fitted.values[39]
## 39
## -4.390582
# Or directly
l$residuals[39]
## 39
## -4.390582
The regression line is the best (optimal) linear model: it provides the best fit to the data
in the sense that the sum of the squared residuals $\sum_{i=1}^{n} e_i(a, b)^2$ is as small as it can be.
49/72
Optimality of regression line
· We first consider a general line 𝑦 = 𝛼 + 𝛽𝑥 with intercept 𝛼 and slope 𝛽.
· Given the data set {(𝑥𝑖, 𝑦𝑖)}, 𝑖 = 1, … , 𝑛, and a pair of values (𝛼, 𝛽) defining a line,
the residual is
𝑒𝑖 (𝛼, 𝛽) = 𝑦𝑖 − (𝛼 + 𝛽𝑥𝑖 ).
so that the sum of squared residuals becomes
$$ f(\alpha, \beta) = \sum_{i=1}^{n} e_i(\alpha, \beta)^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2. $$
50/72
· Our goal is to find the intercept 𝑎 and the slope 𝑏 that minimise 𝑓(𝛼, 𝛽) :
51/72
How do we find such a minimiser (𝑎, 𝑏)? It needs to be a stationary point of the function 𝑓,
such that
$$ \frac{\partial f}{\partial \alpha}(a, b) = \sum_{i=1}^{n} 2(y_i - a - b x_i)(-1) = 0 $$
and
$$ \frac{\partial f}{\partial \beta}(a, b) = \sum_{i=1}^{n} 2(y_i - a - b x_i)(-x_i) = 0. $$
We use the first equation to find the intercept: $\frac{\partial f}{\partial \alpha}(a, b) = 0$ is equivalent to
$$ \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = a + b\,\frac{1}{n}\sum_{i=1}^{n} x_i = a + b\bar{x}, $$
which leads to $a = \bar{y} - b\bar{x}$.
52/72
We can find the slope by substituting $a = \bar{y} - b\bar{x}$ into the second equation. This way,
$\frac{\partial f}{\partial \beta}(a, b) = 0$ becomes
$$ \sum_{i=1}^{n} \big[y_i - (\bar{y} - b\bar{x}) - b x_i\big]\, x_i = 0. $$
After rearrangement,
$$ \sum_{i=1}^{n} (y_i - \bar{y})\, x_i = b \sum_{i=1}^{n} (x_i - \bar{x})\, x_i. $$
Because the sum of deviations is zero (Topic 3 in Week 3), we have $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$
and $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, and hence
$$ \mathrm{LHS} = \sum_{i=1}^{n} (y_i - \bar{y})\, x_i - \sum_{i=1}^{n} (y_i - \bar{y})\, \bar{x} = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}), $$
and similarly $\mathrm{RHS} = b \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x}) = b \sum_{i=1}^{n} (x_i - \bar{x})^2$.
53/72
By solving the second equation $\frac{\partial f}{\partial \beta}(a, b) = 0$, the slope is
$$ b = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = r\,\frac{SD_y}{SD_x}, $$
where
· $SD_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
· $SD_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2}$
This gives exactly $b = r\,\frac{SD_y}{SD_x}$, as claimed in the definition of the regression line. So
we know the regression line is indeed the best among all lines (linear functions) in
the sense of the sum of squared residuals.
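A quick numerical check of these formulas on the height data (a sketch; sd() in R uses the 𝑛 − 1 divisor):
b = cor(x, y) * sd(y) / sd(x)
a = mean(y) - b * mean(x)
c(intercept = a, slope = b)   # matches coef(lm(y ~ x))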
54/72
Average of residuals is zero
Given the regression line 𝑦 = 𝑎 + 𝑏𝑥, where $a = \bar{y} - b\bar{x}$, the sum of residuals
$$ \sum_{i=1}^{n} e_i(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i) = \sum_{i=1}^{n} \big(y_i - (\bar{y} - b\bar{x}) - b x_i\big) $$
can be expressed as
$$ \sum_{i=1}^{n} (y_i - \bar{y}) - b \sum_{i=1}^{n} (x_i - \bar{x}) = 0, $$
because both sums of deviations are zero. Hence the average residual is zero.
55/72
Summary of residual
Feature       | Regression line 𝑦 ∼ 𝑥 (𝑦 = 𝑎 + 𝑏𝑥)
Connects      | (𝑥¯, 𝑦¯) to (𝑥¯ + SD𝑥, 𝑦¯ + 𝑟 SD𝑦)
Slope (𝑏)     | 𝑏 = 𝑟 SD𝑦 / SD𝑥
Intercept (𝑎) | 𝑎 = 𝑦¯ − 𝑏𝑥¯
Residual      | 𝑒𝑖 = 𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖
· 𝑦 = 𝑎 + 𝑏𝑥 is the best line: it minimises the sum of squared residuals $\sum_{i=1}^{n} e_i^2$.
· The average residual of the regression line is zero: $\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i = 0$.
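Both properties are easy to verify numerically on the height data (a minimal sketch):
fit = lm(y ~ x)
mean(fit$residuals)    # numerically zero (up to floating-point error)
sum(fit$residuals^2)   # the minimised sum of squared residuals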
56/72
Coefficient of determination
How much variability of data 𝑦 can be explained by the linear model?
58/72
More comments
(A). Data for which all variation is explained. All the points fall exactly on a straight line.
In this case, all (100%) of the sample variation in 𝑦 can be attributed to the fact that 𝑥
and 𝑦 are linearly related in combination with variation in 𝑥.
(B). Data for which most variation is explained. The points do not fall exactly on a line,
but compared to overall 𝑦 variability, the residuals from the regression line are small. It is
reasonable to conclude in this case that much of the observed 𝑦 variation can be
attributed to the linear regression model.
(C). Data for which little variation is explained. There is substantial variation in the points
about the regression line relative to overall y variation, so the linear regression model
fails to explain variation in 𝑦 by relating 𝑦 to 𝑥.
59/72
Explaining variations
· The sum of squared residuals (or SSE, for sum of squared errors):
$$ \mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2 $$
60/72
· SST ≥ SSE, where $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares, i.e. the squared
prediction error of the baseline prediction 𝑦¯.
· Why? The regression line is optimal for the sum of squared errors, so SSE (regression line)
cannot be worse than SST (baseline).
61/72
Coefficient of determination
The coefficient of determination
The ratio SSE/SST is the proportion of the total variation that cannot be explained by the
simple linear regression model, and the coefficient of determination is
$$ 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = r^2, $$
which is the squared correlation coefficient (a number between 0 and 1) giving
the proportion of the observed 𝑦 variation explained by the model.
· The higher the value of 𝑟2 , the more successful is the simple linear regression model
in explaining y variation.
· Note that if SSE = 0 as in case (A), then 𝑟2 = 1.
· This can be verified using 𝑎, 𝑏, SDs, and 𝑟 (see next lab)
62/72
Example
cor(x, y)^2 # quick way
## [1] 0.2513401
lm.fit <- lm(y ~ x)
SSE = sum(lm.fit$residuals^2)
SST = sum((y - mean(y))^2)
r2 = 1 - SSE/SST
r2
## [1] 0.2513401
The coefficient of determination for Pearson's height data is 0.25: about 25% of the
variation in sons' heights can be explained by the regression line.
63/72
Diagnostics
Residual Plot
Residual plot
· A residual plot graphs the residuals vs 𝑥.
· If the linear fit is appropriate for the data, it should show no pattern (random
points around 0).
· By checking the patterns of the residuals, the residual plot is a diagnostic plot to
check the appropriateness of a linear model.
65/72
plot(x, l$residuals, ylab = "residuals", col = adjustcolor("black", alpha.f = 0.35))
abline(h = 0)
Does this residual plot look random?
66/72
Homoscedasticity and Heteroscedasticity
Vertical strips
In linear models and regression analysis generally, we need to check the
homogeneity of the spread of the response variable (or the residuals). We can
divide the scatter plot or the residual plot into vertical strips.
· If the vertical strips on the scatter plot show equal spread in the 𝑦 direction, then
the data is homoscedastic.
- The regression line could be used for predictions.
· If the vertical strips don’t show equal spread in the 𝑦 direction, then the data is
heteroscedastic.
- The regression line should not be used for predictions.
67/72
Is Pearson's height data homoscedastic?
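One rough check is to compare the spread of the residuals within vertical strips of 𝑥 (a sketch; the strip boundaries are an arbitrary choice):
fit = lm(y ~ x)
strips = cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.25)), include.lowest = TRUE)
tapply(fit$residuals, strips, sd)   # roughly equal SDs suggest homoscedasticity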
68/72
Common mistake 1: extrapolating
If we make a prediction from an 𝑥 value that is not within the range of the data, then that
prediction can be completely unreliable.
69/72
Common mistake 2: not checking the scatter plot
· We can have a high correlation coefficient and then fit a regression line, but the data
may not even be linear!
· So always check the scatter plot first!
Note: Even though the correlation coefficient is high 𝑟 ≈ 0.99, a quadratic model is
more appropriate than a linear model.
70/72
Common mistake 3: not checking the residual plot
· You should also check the residual plot
· This detects any pattern that has not been captured by fitting a linear model.
· If the linear model is appropriate, the residual plot should be a random scatter of
points.
71/72
Summary
Correlation
· Correlation coefficient
· Properties and warnings
Linear model
· Regression Line
· Prediction
· Residuals and properties
· Coefficient of determination
· Diagnostics of model fit
Some R Functions
lm, plot, abline
72/72