
Simple Linear Regression

The document outlines the concepts and skills related to simple linear regression, including its statistical model, assumptions, and diagnostics. It emphasizes the importance of understanding variability, model fit, and the influence of data points on regression analysis. Additionally, it provides a structured learning plan for students, detailing lectures, workshops, and assessments over a two-week period.


Simple linear regression as a statistical model

Pete Brennan
[email protected]
School of Physiology, Pharmacology & Neuroscience
Concepts and Skills (PHPH30005/30007/M0011)

outcome = model + error

"In general, a model is a representation of a person or a system which provides some information about it." (Wikipedia)
C & S statistics week 1

Week 1    | Lecture/Workshop                                               | Online consolidation           | SPSS tutorials
Monday    | L1 Statistical models                                          |                                | Intro to statistics
Tuesday   | L2 Multiple linear regression; L3 ANOVA and planned contrasts  | Simple linear regression (A)   | SPSS basics workshop
Wednesday |                                                                | Regression diagnostics (A)     |
Thursday  | C1 – questions and answers                                     | Multiple linear regression (A) |
Friday    | SPSS workshop 1 – multiple linear regression                   | Planned contrasts (A)          |
C & S statistics week 2

Week 2    | Lecture/Workshop                            | Online consolidation           | SPSS tutorials
Monday    | L4 Factorial ANOVA; L5 Nonlinear regression |                                |
Tuesday   | L6 Categorical data analysis                | Factorial ANOVA (A)            | Robust analysis
Wednesday |                                             | Repeated measures ANOVA (A)    |
Thursday  | C2 – questions and answers                  | Nonlinear regression (A)       |
Friday    | SPSS workshop 2 – factorial ANOVA           | Categorical data analysis (A)  |
C & S statistics assessment

Week 4, Friday: C3 – questions and answers (lecture/exam consolidation)
Week 5, Thursday: Formative MCQ statistics exam
January exam week 1, date to be confirmed: Summative MCQ statistics exam

(The Advanced Concepts and Skills exam for Study in Industry students is different from the BSc MCQ exam:
- Advanced statistics formative exam submitted in week 5
- Advanced statistics timed assessment in January exam week 1)
Learning objectives
- Explain how variability can be accounted for by the addition of random error to a statistical model.
- Explain how simple linear regression fits a straight-line model to data by minimising the sum of squares of the residuals (SSR).
- Explain how the total variability in data can be split into variability explained by the model and residual variability.
- Calculate R² from SST and SSM and interpret it.
- Describe the assumptions underlying simple linear regression.
- Explain the difference between the distance and the leverage of a data point.
- Explain how diagnostic statistics can be used to assess the influence of an individual data point on a model.
- Explain what approaches you can take if you discover influential data points in your model.
Regression

Regression is not the same as correlation!
- Correlation tests whether there is an association between two variables.
- Regression tests how well the data fit a theoretical model.

Regression is used to:
- Quantify the effect of a predictor variable on a dependent variable
- Predict values of a dependent variable based on the values of one or more predictor variables
- Adjust for effects of confounding variables
What is a statistical model?

A mathematical relationship that represents the most important features and relationships in data.

If the world were perfectly predictable:

outcome = model

But the world is inherently variable, so usually:

outcome = model + error

The error (not the outcome variable!) is assumed to be normally distributed.
The mean as a statistical model

[Figure: data points plotted about the sample mean ȳ, with residuals shown as vertical distances from each point to the mean line.]

The mean is a model of the underlying population mean. Residuals are the differences between each data point and the sample mean. The sum of squares (SS) is used as a measure of the variability in the data. The mean is the statistical model that minimises the sum of squares of the residuals (SSR).

y_i = (Σy_i)/n + error_i
       (model)   (residual variability)
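The claim that the mean is the single-value model minimising the sum of squared residuals can be checked numerically; a minimal sketch with made-up data (not from the lecture):

```python
import numpy as np

# Hypothetical sample (not from the lecture)
data = np.array([4.2, 5.1, 3.8, 6.0, 4.9])

def sum_of_squares(model_value):
    """Sum of squared residuals when a single value models every data point."""
    return np.sum((data - model_value) ** 2)

mean = data.mean()
# The SS at the mean is lower than at any other candidate model value
for candidate in [mean - 1.0, mean - 0.1, mean + 0.1, mean + 1.0]:
    assert sum_of_squares(mean) < sum_of_squares(candidate)
print(f"mean = {mean:.2f}, SS at mean = {sum_of_squares(mean):.3f}")
```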
Simple linear regression

[Figure: scatter of dependent variable y against independent variable x, with the fitted line of intercept b0 and slope b1; residuals shown as vertical distances from each point to the line.]

Simple linear regression models the linear relationship between a continuous dependent variable y and a single independent (predictor) variable x. The slope coefficient (b1) gives the amount by which y changes for each unit change in x. The best model minimises SSR and so explains most of the variability in the data.

y_i = b0 + b1·x_i + error_i
       (model)      (residual variability)
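The intercept and slope that minimise SSR have closed-form least-squares solutions; a minimal numpy sketch, with hypothetical dose–response data (not from the lecture):

```python
import numpy as np

# Hypothetical dose–response data (not from the lecture)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # independent (predictor) variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # dependent variable

# Closed-form least-squares estimates minimising SSR:
#   b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,   b0 = ȳ − b1·x̄
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

fitted = b0 + b1 * x
residuals = y - fitted
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")  # → b0 = 0.140, b1 = 1.960
```

With an intercept in the model, the residuals always sum to zero, which is a quick sanity check on the fit.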
Sources of variability

outcome = model + error

SST = SSM + SSR

- SST: total variability of the data
- SSM: variability explained by the model
- SSR: residual, unexplained variability (error)
The coefficient of determination R²

[Figure: scatter of dependent variable y against independent variable x, showing the fitted model line and the horizontal null-hypothesis line. The SS of differences of the data from the null hypothesis (blue dotted lines) gives SST; the SS of differences of the model from the null hypothesis (green dotted lines) gives SSM.]

For the null hypothesis H0, the model slope b1 = 0. R² is the proportion of the total variability in y that can be explained by its relationship with x:

R² = SSM/SST = (variability explained by model)/(total variability of data)
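The partition SST = SSM + SSR, and R² computed from it, can be verified directly once a line has been fitted; a minimal numpy sketch with hypothetical data (not from the lecture):

```python
import numpy as np

# Hypothetical data and least-squares fit (not from the lecture)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x

SST = np.sum((y - y.mean()) ** 2)       # total variability (data vs null model ȳ)
SSM = np.sum((fitted - y.mean()) ** 2)  # variability explained by the model
SSR = np.sum((y - fitted) ** 2)         # residual (unexplained) variability
r_squared = SSM / SST

assert np.isclose(SST, SSM + SSR)  # the partition of variability holds
print(f"R² = {r_squared:.3f}")
```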
Testing model fit and slope

[Figure: scatter of dependent variable y against independent variable x, showing the fitted model line and the horizontal null-hypothesis line.]

Testing model fit
The significance of R² (the fit of the model) is tested by calculating an F ratio and an associated p value:

F = MS_model / MS_residual (error)

The higher the F ratio, the lower the p value and the more significant the fit of the model to the data.

Testing model slope (the effect of the predictor)
Is the model slope significantly different from the null-hypothesis slope of 0?

t = (b_model − b_null)/SE(b_model) = b_model/SE(b_model)

The p value is obtained from the t distribution with N − 2 degrees of freedom.
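Both test statistics can be computed directly from the sums of squares, and with a single predictor t² equals F; a minimal numpy sketch with hypothetical data (not from the lecture):

```python
import numpy as np

# Hypothetical data and least-squares fit (not from the lecture)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = len(x)
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x

SSM = np.sum((fitted - y.mean()) ** 2)
SSR = np.sum((y - fitted) ** 2)

# F ratio: MS_model / MS_residual, with 1 and n − 2 degrees of freedom
MS_model = SSM / 1
MS_residual = SSR / (n - 2)
F = MS_model / MS_residual

# t statistic for the slope: b1 / SE(b1), with n − 2 degrees of freedom
se_b1 = np.sqrt(MS_residual / Sxx)
t = b1 / se_b1

assert np.isclose(t ** 2, F)  # with one predictor, t² = F
print(f"F = {F:.1f}, t = {t:.2f}")
```

The p values themselves would come from the F and t distributions with the stated degrees of freedom (e.g. via scipy.stats), which this sketch leaves out.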
Assumptions of simple linear regression
1. x and y are asymmetrical: x (independent) predicts y (dependent).
2. The independent variable (x) is measured or controlled without significant error.
3. The independent variable (x) is not entangled with the dependent variable (y).
4. There is a linear relationship between x and y.
5. The data points are independent.

[Figure: two drug-dose scatter plots. Independent data points are okay (√); related data points are not (X) – pseudoreplication.]
Assumptions of simple linear regression (continued)
- Normality of residuals: the scatter of y values about the fitted line is random and approximately normally distributed.
- Homoscedasticity: the variance of the residuals is reasonably similar across all x values.

[Figure: drug-dose scatter with fitted line; the sd of the residual distribution is equal along the line = homoscedasticity.]
Simple linear regression – how robust is the model?

Outliers can be identified from the x,y plot or from a standardised residual plot.

[Figure: scatter plot with an outlying (red) point and a typical (blue) point; red residual > blue residual, and red residual squared >> blue residual squared.]
Distance and leverage

[Figure: three scatter plots. A: F(1, 17) = 1.23, p = 0.283, R² = 0.068, b1 = 0.120. B: F(1, 17) = 5.54, p = 0.031, R² = 0.246, b1 = 0.275. C: F(1, 16) = 8.36, p = 0.011, R² = 0.343, b1 = 0.271.]

Both the size of the residual (distance) and its leverage (the distance from the mean of the predictor) affect the influence of a data point on the model fit and the regression coefficients. The robustness of the model can be checked by reanalysing the data with the data point removed (a robustness test).
Model diagnostics

Is the regression model stable, or is it biased by a few cases?
- A standardised residual > ±3 is certainly worth checking as an outlier.
- If a lot more than 5% of standardised residuals are > ±2, it may indicate that the model is a poor fit.
- A standardised DFFit > ±1 means a substantial influence on the fit of the model (F ratio and R²).
- A standardised DFBeta > ±1 means a substantial influence on the size of the respective slope coefficient for the model (b_slope).
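For simple linear regression, leverage, standardised residuals, and Cook's distance can be computed by hand from the standard textbook formulas; a minimal numpy sketch with hypothetical data (not from the lecture), including one point far from the x mean:

```python
import numpy as np

# Hypothetical data with one high-leverage point at x = 10 (not from the lecture)
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 19.5])
n, p = len(x), 2  # p = number of model parameters (intercept + slope)

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)
s2 = np.sum(residuals ** 2) / (n - p)  # residual mean square

# Leverage: how far x_i lies from the mean of the predictor
h = 1 / n + (x - x.mean()) ** 2 / Sxx

# Internally standardised residuals (distance)
std_resid = residuals / np.sqrt(s2 * (1 - h))

# Cook's distance combines distance and leverage into one influence measure
cooks_d = std_resid ** 2 / p * h / (1 - h)

assert np.isclose(h.sum(), p)  # leverages always sum to the number of parameters
print(np.round(h, 3))
```

In practice a package would report these directly (e.g. SPSS's casewise diagnostics, or statsmodels' OLSInfluence in Python); the formulas here are only meant to show what those numbers measure.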
Effect on diagnostic statistics

[Figure: scatter plots A, B, and C from the previous slide, each with the red point highlighted. A: F(1, 17) = 1.23, p = 0.283, R² = 0.068, b1 = 0.120. B: F(1, 17) = 5.54, p = 0.031, R² = 0.246, b1 = 0.275. C: F(1, 16) = 8.36, p = 0.011, R² = 0.343, b1 = 0.271.]

Diagnostic statistic  | Value for red point in A | Value for red point in B
Standardised residual | -2.366                   | -2.509
Cook's distance       | 0.643                    | 0.185
Leverage value        | 0.197                    | 0.0001
Standardised DFFit    | -1.25                    | -0.756
Standardised DFBeta   | -1.11                    | -0.035

Influential points may not have high standardised residuals!

[Figure: scatter of eight numbered data points with R² = 0.649, illustrating an influential point without a large standardised residual.]
What to do about it?

Outlier:
- Check for typing/measurement error and correct it.
- Check the underlying distribution of standardised residuals.
- If there is no obvious source of error and the distribution of residuals looks okay, some disciplines would recommend removing outliers on the basis of pre-specified criteria, such as standardised residual > ±3. But the occasional outlier is expected!

Influential data points:
- Run the analysis with and without the data point(s) and report how this affects the model parameters and your conclusions (robustness test).
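The robustness test can be sketched as a small refit-and-compare step; hypothetical data, and the helper name `fit_line` is my own:

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = b0 + b1·x; returns (b0, b1)."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b1 * x.mean(), b1

# Hypothetical data (not from the lecture); suppose the last point is suspect
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b0_all, b1_all = fit_line(x, y)
b0_wo, b1_wo = fit_line(x[:-1], y[:-1])  # refit without the suspect point

# Report both fits and how much the coefficients change
print(f"with point:    b0 = {b0_all:.3f}, b1 = {b1_all:.3f}")
print(f"without point: b0 = {b0_wo:.3f}, b1 = {b1_wo:.3f}")
```

If the coefficients and conclusions barely change, the model is robust to that point; if they change substantially, report both analyses rather than silently dropping the point.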
