Simple Linear Regression
Regression as a statistical model

Pete Brennan
[email protected]
School of Physiology, Pharmacology & Neuroscience

outcome = model + error

"In general, a model is a representation of a person or a system which provides some information about it." (Wikipedia)
Concepts and Skills (PHPH30005/30007/M0011)
C & S statistics week 1

Week 1      Lecture/Workshop                  Online SPSS consolidation tutorials
Monday      L1 Statistical models             Intro to statistics
Tuesday     L2 Multiple linear regression     Simple linear regression (A)
The mean as a statistical model

The sum of squares (SS) is used as a measure of the variability in the data. The mean is the statistical model that minimizes the sum of squares of the residuals (SSR).

yi = (Σ yi)/n + errori
(data = model (the mean ȳ) + residual)

[Figure: data points scattered around the mean ȳ, with residuals shown as vertical distances from each point to the mean.]
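As a minimal sketch (not part of the lecture), the Python snippet below uses made-up values to show that the SSR is smallest when the constant model is the mean:

```python
import numpy as np

# hypothetical example data (not from the lecture)
y = np.array([4.1, 5.3, 3.8, 6.0, 5.2, 4.7])

def ssr(model_value, data):
    """Sum of squared residuals for a constant model."""
    residuals = data - model_value
    return np.sum(residuals ** 2)

mean_y = y.mean()
print("SSR at the mean:", ssr(mean_y, y))
print("SSR at mean + 0.5:", ssr(mean_y + 0.5, y))  # always larger than the SSR at the mean
```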
Simple linear regression

Models the linear relationship between a continuous dependent variable y and a single independent (predictor) variable x.

The slope coefficient (b1) gives the amount by which y changes for each unit change in x.

The best model minimizes SSR and so explains most of the variability in the data.

yi = b0 + b1xi + errori
(data = model + residual)

[Figure: scatter plot of dependent variable y against independent variable x with the fitted regression line; b0 is the intercept and b1 the slope, and residuals are shown as vertical distances from the line.]
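A short Python sketch of fitting such a model, using hypothetical dose/response values (the course tutorials use SPSS; this is only an illustration):

```python
import numpy as np
from scipy import stats

# hypothetical data: drug dose (x) and response (y); not from the lecture
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 6.8, 7.2])

fit = stats.linregress(x, y)
print("intercept b0:", fit.intercept)   # value of y when x = 0
print("slope b1:", fit.slope)           # change in y per unit change in x
print("R squared:", fit.rvalue ** 2)
print("p value for H0: b1 = 0:", fit.pvalue)
```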
Sources of variability

outcome = model + error
SST = SSM + SSR

SST: total variability of the data
SSM: variability explained by the model
SSR: residual unexplained variability (error)
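The partition of the sums of squares can be checked numerically; a sketch using the same hypothetical data as above:

```python
import numpy as np

# hypothetical data (not from the lecture)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 6.8, 7.2])

b1, b0 = np.polyfit(x, y, 1)           # least-squares slope and intercept
y_hat = b0 + b1 * x                    # model predictions

sst = np.sum((y - y.mean()) ** 2)      # total variability of the data
ssm = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the model
ssr = np.sum((y - y_hat) ** 2)         # residual (unexplained) variability

print(sst, ssm + ssr)                  # SST = SSM + SSR (up to rounding)
```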
The coefficient of determination R²

For the null hypothesis H0: slope b1 = 0 (the model reduces to the mean of y).

SS of differences of the data from the null hypothesis (blue dotted lines) gives SST.
SS of differences of the model from the null hypothesis (green dotted lines) gives SSM.

R² = SSM/SST, the proportion of the total variability explained by the model.

The significance of R² (the fit of the model) is tested by calculating an F ratio and its associated p value:
F = MSmodel/MSresidual (error)
The higher the F ratio, the lower the p value and the more significant the fit of the model to the data.

[Figure: dependent variable y plotted against independent variable x, showing the fitted model and the null hypothesis (horizontal line at the mean of y), with dotted lines marking the differences used to form SST and SSM.]
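Continuing the sketch above, R², the F ratio and its p value can be computed directly from the sums of squares (hypothetical data again):

```python
import numpy as np
from scipy import stats

# hypothetical data (not from the lecture)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 6.8, 7.2])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
ssm = np.sum((y_hat - y.mean()) ** 2)
ssr = np.sum((y - y_hat) ** 2)

r_squared = ssm / sst                        # proportion of variability explained
df_model, df_resid = 1, len(y) - 2           # one predictor; n - 2 residual degrees of freedom
f_ratio = (ssm / df_model) / (ssr / df_resid)
p_value = stats.f.sf(f_ratio, df_model, df_resid)  # upper tail of the F distribution
print(r_squared, f_ratio, p_value)
```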
Assumptions of simple linear regression

[Figure: two example plots of response against drug dose, one showing an approximately linear relationship (✓) and one showing a relationship that violates the assumptions (✗).]
Assumptions of simple linear regression (continued)

Normality of residuals - the scatter of y values from the fitted line is random and approximately normally distributed.

Homoscedasticity - the variance of the residuals is reasonably similar across all x values.

[Figure: dependent variable y plotted against drug dose with the fitted line; the sd of the residual distribution is equal along the line = homoscedasticity.]
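A sketch of how these two assumptions might be checked in Python, using the Shapiro-Wilk test for normality of the residuals and a residuals-versus-x plot for homoscedasticity (other checks are possible; the data are made up):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# hypothetical data (not from the lecture)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 6.8, 7.2])
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# normality of residuals: Shapiro-Wilk test (p > 0.05 gives no evidence of non-normality)
print(stats.shapiro(residuals))

# homoscedasticity: residuals plotted against x should show a similar spread at all x values
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("drug dose (x)")
plt.ylabel("residual")
plt.show()
```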
Simple linear regression – how robust is the model?

Outliers can be identified from the x,y plot, or from a standardised residual plot (red residual > blue residual; red residual squared >> blue residual squared, so a single outlier can dominate the SSR).

Distance and leverage

[Figure: three regression panels.
A: F(1, 17) = 1.23, p = 0.283, R² = 0.068
B: F(1, 17) = 5.54, p = 0.031, R² = 0.246
C: F(1, 16) = 8.36, p = 0.011, R² = 0.343]

Both the size of the residual (distance) and its leverage (the distance from the mean of the predictor) determine how strongly a data point affects the model fit and the regression coefficients.

The robustness of the model can be checked by reanalysing the data with the data point removed (a robustness test).
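A minimal sketch of such a robustness test, assuming made-up data that include one high-leverage point (x = 15 is an invented value, not from the lecture):

```python
import numpy as np
from scipy import stats

# hypothetical data with one suspect point at high leverage (last entry)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 15], dtype=float)
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 6.8, 7.2, 4.0])

full = stats.linregress(x, y)
reduced = stats.linregress(x[:-1], y[:-1])   # refit with the suspect point removed

print("full fit:    b1 = %.3f, p = %.3f" % (full.slope, full.pvalue))
print("reduced fit: b1 = %.3f, p = %.3f" % (reduced.slope, reduced.pvalue))
# a large change in b1 or p indicates the point strongly influences the model
```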
Model diagnostics

Is the regression model stable, or is it biased by a few cases?

Diagnostic statistic       Value for red point in A    Value for red point in B
Standardised residual      -2.366                      -2.509
Cook's distance            0.643                       0.185
Leverage value             0.197                       0.0001
Standardised DFFit         -1.25                       -0.756
Standardised DFBeta        -1.11                       -0.035

Influential points may not have high standardised residuals!

[Figure: regression plot with data points labelled 1-8, R² = 0.649.]
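As a sketch, equivalent diagnostic statistics can be obtained in Python with statsmodels (hypothetical data, so the values will not match the table above):

```python
import numpy as np
import statsmodels.api as sm

# hypothetical data (not from the lecture)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 15], dtype=float)
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 6.8, 7.2, 4.0])

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

print(influence.resid_studentized_internal)  # standardised residuals
print(influence.cooks_distance[0])           # Cook's distance for each point
print(influence.hat_matrix_diag)             # leverage values
print(influence.dffits[0])                   # DFFit for each point
print(influence.dfbetas)                     # DFBeta for each coefficient
```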
What to do about it?

Outlier:
Check for typing/measurement error and correct.
Check the underlying distribution of standardised residuals.
If there is no obvious source of error and the distribution of residuals looks okay, then some disciplines would recommend removing outliers on the basis of pre-specified criteria, such as a standardised residual beyond ±3.
But the occasional outlier is expected!