L6 - Biostatistics - Linear Regression and Correlation
Institute of Health, Public Health Faculty, Department of Epidemiology, Biostatistics Unit
Analysis of Continuous Outcome Data
Correlation and Linear Regression Analysis
Session Objectives
At the end of this session, you will be able to:
✓ Use methods of association in measurement
data – correlation analysis
✓ Use scatter plots to see relationship between
variables
✓ Describe relationship of variables using simple
and multiple variable regression analysis of
measurement data
✓ Apply methods of regression analysis for
measurement data in two or more variables
✓ Identify model assumptions – parameter
estimation, hypothesis testing and prediction
Correlation Analysis
• Correlation is the method of analysis to use
when studying the possible association
between two continuous variables
Correlation Analysis
• It is important to note that a correlation
between variables shows that they are
associated but does not necessarily imply a
‘cause and effect’ relationship
Scatter Plots and Correlation
• Correlation analysis is used to measure
strength of the association (linear
relationship) between two variables
• Scatter plot is used to show the
relationship between two variables
• Only concerned with strength of the
relationship and its direction
• We consider the two variables equally; as
a result no causal effect is implied
Scatter Plots and Correlation
• As OR and RR are used to quantify risk
between two dichotomous variables,
correlation is used to quantify the degree
to which two random continuous variables
are related, provided the relationship is
linear.
Fig.1: Systolic Blood Pressure against Age
Correlation coefficient
• In the underlying population from which
the sample of points (xi, yi) is selected, the
correlation coefficient between X and Y is
denoted by ρ (rho) and is given by

ρ = Average[(X − μX)(Y − μY)] / (σX σY)

• It can be thought of as the average of the product of the
standard normal deviates of X and Y
Estimation of Correlation coefficient
If we have two quantitative variables X and Y,
the correlation between them, denoted by r(X, Y),
is given by:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]

or, in computational form,

r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] · [ΣY² − (ΣY)²/n]}
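As a minimal sketch (not part of the original slides), the computational formula translates directly into Python:

```python
# Illustrative implementation of the computational formula for r.
import math

def pearson_r(x, y):
    """Pearson correlation coefficient from sums and sums of squares."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)              # sum of X^2
    syy = sum(v * v for v in y)              # sum of Y^2
    sxy = sum(a * b for a, b in zip(x, y))   # sum of XY
    num = sxy - sx * sy / n
    den = math.sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))
    return num / den
```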
Hypothesis testing on ρ
• Population Correlation Coefficient: ρ
• Sample Correlation Coefficient: r
• Under the null hypothesis that there is no
association in the population (ρ = 0), it can be
shown that the quantity

t = r √(n − 2) / √(1 − r²)

has a t distribution with n − 2 degrees of freedom
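A hedged Python sketch of this test (assuming scipy is available):

```python
# t test for H0: rho = 0, using the statistic above with n - 2 df.
import math
from scipy import stats

def corr_t_test(r, n):
    """Return the t statistic and two-sided p-value for H0: rho = 0."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)   # upper-tail probability, doubled
    return t, p
```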
Interpretation of correlation
• One way of looking at the correlation, which helps to
temper over-enthusiasm, is to calculate 100r²
(the coefficient of determination), which is the
percentage of variability in the data that is
'explained' by the association
Exercise: The following data show the respective weights of a
sample of 12 fathers and their oldest sons. Compute the
correlation coefficient between the two weight measurements.
Wt of father (X)   Wt of son (Y)   X²     Y²     XY
65                 68              4225   4624   4420
63                 66              3969   4356   4158
67                 68              4489   4624   4556
64                 65              4096   4225   4160
68                 69              4624   4761   4692
62                 66              3844   4356   4092
70                 68              4900   4624   4760
66                 65              4356   4225   4290
68                 71              4624   5041   4828
67                 67              4489   4489   4489
69                 68              4761   4624   4692
71                 70              5041   4900   4970
Scatter Plot
[Scatter plot of fathers' weights (X) against sons' weights (Y); both range from about 62 to 72.]
Calculating r
The correlation coefficient for the data on fathers'
and sons' weights will be:
Basic values from the data (n = 12):

ΣX = 800, ΣX² = 53,418, ΣY = 811, ΣY² = 54,849, ΣXY = 54,107

r = [54,107 − (800 × 811)/12] / √{[53,418 − 800²/12] · [54,849 − 811²/12]} ≈ 0.70
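As a quick cross-check (a sketch, not from the slides), numpy reproduces the hand calculation:

```python
import numpy as np

fathers = np.array([65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71])
sons = np.array([68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70])
r = np.corrcoef(fathers, sons)[0, 1]   # Pearson correlation matrix entry
print(round(r, 2))                     # ~0.70, matching the sums above
```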
Inference on Correlation Coefficient
[Scatter plots illustrating the sign of the correlation: r > 0 goes with slope b > 0, r = 0 with b = 0, and r < 0 with b < 0.]
Pearson’s r Correlation
• As a rule of thumb, the following guidelines on
strength of relationship are often useful (though
many experts would somewhat disagree on the
choice of boundaries).
Correlation value Interpretation
0.70 or higher Very strong relationship
0.40 to 0.69 Strong relationship
0.30 to 0.39 Moderate relationship
0.20 to 0.29 Weak relationship
0.01 to 0.19 No or negligible relationship
Limitations of the correlation coefficient:
What is a Model?
1. Representation of Some Phenomenon
• Non-Math/Stats Model
What is a Math/Stats Model?
1. Often Describe Relationship between
variables
2. Types
- Deterministic Models (no randomness)
- Probabilistic Models (deterministic component plus random error)
Deterministic Models
1. Hypothesize Exact Relationships
2. Suitable When Prediction Error is Negligible
3. Example: Body mass index (BMI) is a measure of
body fat based on height and weight:
BMI = weight (kg) / height (m)²
Probabilistic Models
1. Hypothesize 2 Components
• Deterministic
• Random Error
2. Example: Systolic blood pressure of newborns
is 6 Times the Age in days + Random Error
• SBP = 6 × age(days) + ε
• Random Error May Be Due to Factors
Other Than age in days (e.g. Birthweight)
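A small simulation sketch of this probabilistic model; the error standard deviation of 2 is an arbitrary assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
age_days = np.arange(1, 11)                  # ages 1..10 days
eps = rng.normal(0, 2, size=age_days.size)   # random error (SD = 2 assumed)
sbp = 6 * age_days + eps                     # SBP = 6 x age(days) + error
```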
Types of Probabilistic Models
Regression Models
Types of Regression Models
• 1 explanatory variable → simple regression models
• 2+ explanatory variables → multiple regression models
• Both types may be linear or non-linear
Simple Linear Regression
• Data on one variable are frequently obtained
together with data on a related variable:
Examples
• weight and height
• house rent and income
• yield and fertilizer
Simple Linear Regression
• To form the equation, collect paired observations
on the independent variable and the variable
whose value is to be estimated.
• Let the observations be denoted by (X₁, Y₁),
(X₂, Y₂), (X₃, Y₃), ..., (Xₙ, Yₙ).
• However, before trying to quantify this
relationship, plot the data and get an idea
of their nature.
• Plot these points on the XY plane to
obtain the scatter diagram.
Simple Linear Regression
[Scatter plot: relationship between heights of fathers (x-axis, 62–74 inches) and heights of their oldest sons (y-axis, 62–73 inches).]
Simple Linear Regression
❖Simple regression uses the relationship
between the two variables to obtain
information about one variable by knowing
the values of the other
Simple Linear Regression
• The case of simple linear regression
considers a single regressor or predictor x
and a dependent or response variable Y.
• The expected value of Y at each level of x is
E(Y|x) = β₀ + β₁x
• We assume that each observation, Y, can be
described by the model
Y = β₀ + β₁x + ε
Linear Equations
Y = mX + b

where m = slope, the change in Y per unit change in X,
and b = the Y-intercept.
Linear Regression Model
Yᵢ = β₀ + β₁Xᵢ + εᵢ

• Yᵢ: dependent (response) variable (e.g., CD4+ cell count)
• Xᵢ: independent (explanatory) variable (e.g., years since seroconversion)
Population & Sample
Regression Models
• In the population, the true relationship
Yᵢ = β₀ + β₁Xᵢ + εᵢ
is unknown.
• A random sample is drawn from the population, and the
sample regression model
Yᵢ = β̂₀ + β̂₁Xᵢ + ε̂ᵢ
is used to estimate the unknown population relationship.
Population Linear Regression
Model

Yᵢ = β₀ + β₁Xᵢ + εᵢ (observed value)
E(Y) = β₀ + β₁Xᵢ (population regression line)
εᵢ = random error

[Diagram: observed values scatter about the population regression line.]
Sample Linear Regression
Model

Ŷᵢ = β̂₀ + β̂₁Xᵢ (fitted regression line)
ε̂ᵢ = residual

[Diagram: sampled observations (and an unsampled observation) scatter about the fitted line.]
Simple Linear Regression
♦ The scatter diagram helps to choose the curve that
best fits the data. The simplest type of curve is a
straight line whose equation is given by
Ŷᵢ = a + bXᵢ
Simple Linear Regression
♦ Regression is a method of estimating the
numerical relationship between variables
► The value of 'b' shows the slope of the regression line and
gives us a measure of the change in Y for a unit change in X:

b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = [n ΣXY − ΣX ΣY] / [n ΣX² − (ΣX)²]

► The intercept is then a = Ȳ − b X̄.
SLR-example 1
Heights of 10 fathers (X) together with their oldest sons (Y)
are given below (in inches). Find the regression of Y on X.
Using b = [n ΣXY − ΣX ΣY] / [n ΣX² − (ΣX)²], the slope works
out to b = 0.77, and with ΣY = 679 and ΣX = 676:

a = Ȳ − b X̄ = 679/10 − 0.77 × (676/10) = 67.9 − 52.05 = 15.85

so the fitted regression line is Ŷ = 15.85 + 0.77X.
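A minimal sketch of these least-squares formulas in Python (any paired data can be passed in):

```python
def ols_line(x, y):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(p * q for p, q in zip(x, y))
    sxx = sum(v * v for v in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx**2)
    a = sy / n - b * sx / n   # a = ybar - b*xbar
    return a, b
```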
Standard errors of regression
coefficients
• The standard errors of the regression coefficients are given by:

se(a) = S √(1/n + X̄² / Σ(x − x̄)²)   and   se(b) = S / √(Σ(x − x̄)²)

where

S² = [Σ(y − ȳ)² − b² Σ(x − x̄)²] / (n − 2)

S is the standard deviation of the points about the
line. It has (n − 2) degrees of freedom.
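These formulas translate directly into a short sketch (x and y are the paired samples, b the fitted slope):

```python
import math

def regression_se(x, y, b):
    """Return S, se(a) and se(b) from the formulas above (n - 2 df)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    syy = sum((v - ybar) ** 2 for v in y)
    s = math.sqrt((syy - b**2 * sxx) / (n - 2))   # SD of points about the line
    se_a = s * math.sqrt(1 / n + xbar**2 / sxx)
    se_b = s / math.sqrt(sxx)
    return s, se_a, se_b
```

A (1 − α)100% confidence interval for the slope is then b ± t₁₋α/₂,ₙ₋₂ × se(b), as in the example that follows.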
Example: (1 − α)100% CI for a
regression coefficient
• Consider the age and SBP data and the
fitted regression model
• SBP = 112.12 + 0.456(Age)
• S = 15.48, se(a) = 2.67, se(b) = 0.064
• A 95% confidence interval for the slope is:
estimated slope ± t₁₋α/₂ × (SE of slope)
0.456 ± 1.96 × 0.064 = (0.331, 0.581)
• The 95% CI does not include 0, so there is
sufficient evidence that age affects SBP
Significance test for β
H₀: β = 0
H₁: β ≠ 0
If the null hypothesis is true, then the
statistic

t = (observed slope − 0) / (SE of observed slope)

has a t distribution with n − 2 degrees of freedom.
ANOVA Table for linear regression

Source       df      SS    MS                 F
Regression   1       SSR   MSR = SSR/1        MSR/MSE
Residual     n − 2   SSE   MSE = SSE/(n − 2)
Total        n − 1   SST
Deviation
Simple linear regression
Explained, unexplained (error), and total variation

[Diagrams illustrating the model assumptions:]
• Assumption 1: linear relationship between Y and x
• Assumption 2: Y normally distributed at each value of x
• Assumption 3: same variance at each value of x
Diagnostic Tests for the Regression
Assumptions
• Linearity: regression curve fitting; no level shifts
• Independence of observations: runs test
• Normality of the residuals: Shapiro-Wilk or
Kolmogorov-Smirnov test
• Homogeneity of variance of the residuals: White's
general specification test
• No autocorrelation of residuals: Durbin-Watson test or
ACF of residuals
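A hedged sketch of running some of these tests with scipy/statsmodels; resid is the vector of residuals from a fitted model and exog its design matrix (constant column included):

```python
from scipy.stats import shapiro
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_white

def run_diagnostics(resid, exog):
    """Normality, autocorrelation and heteroscedasticity checks."""
    _, p_norm = shapiro(resid)                  # Shapiro-Wilk normality test
    dw = durbin_watson(resid)                   # near 2 => no autocorrelation
    _, p_white, _, _ = het_white(resid, exog)   # White's heteroscedasticity test
    return p_norm, dw, p_white
```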
Diagnostic Tests for the Regression
Assumptions
• Plot residuals and look for outliers and high-leverage points
– Lists of standardized residuals
– Lists of studentized residuals
– Cook's distance or leverage statistics
Testing Assumptions:
Assumption 1: linear relationship
[Scatter plot of blood pressure (80–200) against a predictor, used to check that the relationship is linear.]
Testing Assumptions:
Assumption 2: Normality
Residuals need to be normally distributed.

[Scatter plot of blood pressure against weight (40–140) with fitted line; R² (linear) = 0.166.]
Testing Assumptions:
Assumption 2: Normality
[Histogram of regression standardized residuals (N = 127) and normal P-P plot of observed vs. expected cumulative probabilities.]
Testing Assumptions:
Assumption 3: Spread of y values constant over
range of x values
Plot residuals against x values.

[Plot of unstandardized residuals against x; the spread should be roughly constant across the range of x.]
Multiple Linear Regression
Multivariate Analysis
▪ Multivariate analysis refers to the analysis of data that
takes into account a number of explanatory variables
and one outcome variable simultaneously
▪ The purpose of multiple regression is to analyze the
relationship between metric or dichotomous
independent variables and a metric dependent
variable
▪ If there is a relationship, using the information in the
independent variables will improve our accuracy in
predicting values for the dependent variable
▪ It allows for the efficient estimation of measures of
association while controlling for a number of
confounding factors
Multivariate Analysis
▪ A large number of multivariate models have been
developed for specialized purposes, each with a
particular set of assumptions underlying its
applicability
▪ The choice of the appropriate model is based on the
underlying design of the study, the nature of the
variables, as well as assumptions regarding the
inter-relationship between the exposures and
outcomes under investigation
Multiple linear regression
▪ Multiple linear regression (we often refer to
this method as multiple regression) is an
extension of the most fundamental model
describing the linear relationship between
two variables
▪ Multiple regression is a statistical technique
that is used to measure and describe the
function relating two (or more) predictor
(independent) variables to a single response
(dependent) variable
Regression equation for a linear relationship

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

where β₀ is the intercept and βᵢ is the regression coefficient of predictor Xᵢ.
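A short sketch of fitting such a model with statsmodels; the data below are simulated purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
age = rng.uniform(20, 60, 100)       # hypothetical ages (years)
weight = rng.normal(70, 12, 100)     # hypothetical weights (kg)
sbp = 90 + 0.5 * age + 0.3 * weight + rng.normal(0, 10, 100)

X = sm.add_constant(np.column_stack([weight, age]))  # design matrix + intercept
fit = sm.OLS(sbp, X).fit()
print(fit.summary())   # coefficients, t tests, R-squared, ANOVA quantities
```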
Analysis of residuals
▪ The residual standard deviation is a measure of the
average difference between the observed y values
and those predicted or fitted by the model
▪ The residuals are given by yobs - yfit , where yobs is
the observed value of the dependent variable
▪ We cannot plot the original multi-dimensional
data, but we can examine plots of the residuals to
see if the model is reasonable
▪ Specifically, we have to check whether the
assumptions of Normal distribution and uniform
variance are met
Assumptions of multiple regression
a) The relationship between the dependent variable
and each continuous explanatory variable is linear
Summary: Assumptions of MLR
1. Linear functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification
of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of
residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
Predicted and Residual Scores
▪ The regression line expresses the best
prediction of the dependent variable (Y),
given the independent variables (X).
▪ However, nature is rarely perfectly
predictable, and usually there is substantial
variation of the observed points around the
fitted regression line
▪ The deviation of a particular point from the
regression line (its predicted value) is called
the residual value
Residual Variance and R-square
▪ The smaller the variability of the residual values
around the regression line relative to the overall
variability, the better is our prediction
▪ For example, if there is no relationship between the X and Y
variables, then the ratio of the residual variability of the Y
variable to the original variance is equal to 1.0
▪ If X and Y are perfectly related then there is no residual
variance and the ratio of residual variability to the total
variance would be 0.0
▪ In most cases, this ratio would fall somewhere
between these extremes, that is, between 0.0 and
1.0.
▪ One minus this ratio is referred to as R-square (R2)
or the coefficient of determination
Residual Variance and R-square
▪ The R-square value is an indicator of how well the
model fits the data
▪ An R-square close to 1.0 indicates that we have
accounted for almost all of the variability with the
variables specified in the model
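R² is straightforward to compute from observed and fitted values; a minimal sketch:

```python
def r_squared(y_obs, y_fit):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    ybar = sum(y_obs) / len(y_obs)
    ss_res = sum((o - f) ** 2 for o, f in zip(y_obs, y_fit))
    ss_tot = sum((o - ybar) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot
```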
Variance of adjusted regression coefficients

Var(β̂ᵢ) = S²₍y|x₎ / [(n − 1) S²₍xᵢ₎ (1 − R²ᵢ)]

▪ where S²₍y|x₎ is the residual variance of the outcome
and S²₍xᵢ₎ is the variance of xᵢ; R²ᵢ is the R²
from a multiple linear regression model in which xᵢ
is regressed on all other predictors
Variance of adjusted regression coefficients
▪ The term 1/(1 − R²ᵢ) is known as the
variance inflation factor (VIF)
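A sketch of computing VIFs with statsmodels (X is the design matrix, constant column included):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vifs(X):
    """VIF = 1 / (1 - R_i^2) for each column i of the design matrix X."""
    return [variance_inflation_factor(X, i) for i in range(X.shape[1])]
```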
Interpreting the Correlation Coefficient R
▪ Customarily, the degree to which two or more
predictors (independent or X variables) are related
to the dependent (Y) variable is expressed in the
correlation coefficient R, which is the square root of R-
square.
▪ In multiple regression, R assumes values between 0
and 1, because no meaning can be given to the
direction of correlation in the multivariate case.
▪ The larger R is, the more closely correlated the
predictor variables are with the outcome variable.
▪ When R=1, the variables are perfectly correlated in
the sense that the outcome variable is a linear
combination of the others.
Multiple linear regression of Systolic Blood
Pressure on Weight AND Age: SPSS output

ANOVA
Source       SS           df    Mean Square   F        Sig.
Regression   29467.088    2     14733.544     71.960   .000
Residual     99710.886    487   204.745
Total        129177.974   489
SPSS output: Coefficients table
• For each coefficient, t = βₖ / se(βₖ), with its p-value.
• R² = 22.8%: 22.8% of the variation in systolic BP is explained by
differences in weight and age.

[Scatter plot of systolic BP (mm Hg, 100–200) against age (years, 20–60), marked by sex (male/female).]
SPSS output: Systolic BP regressed on SEX and AGE

Variables Entered/Removed
Model 1 — variables entered: SEX, AGE (years); method: Enter.
All requested variables entered. Dependent variable: SYSTOLIC BP (mmHg).

ANOVA
Source       SS          df    Mean Square   F        Sig.
Regression   19728.668   2     9864.334      43.892   .000
Residual     109449.3    487   224.742
Total        129178.0    489
Predictors: (Constant), SEX, AGE (years). Dependent variable: SYSTOLIC BP (mmHg).
Systolic BP vs. Age in males and females

[Scatter plot of systolic BP (mm Hg) against age (years, 20–60) by sex; separate regression lines show predicted values, and the vertical distance between the male and female lines is β₂, the coefficient of sex.]
Find the ‘best’ model
(1) Forward Selection
STEP 1 Find variable which has strongest association with
dependent variable and enter it into model
(i.e. largest t-statistic and smallest P value)
STEP 2 Find next variable out of those remaining which
explains largest amount of remaining variability
(2) Backwards Regression
STEP 1 Start with model which includes all
explanatory variables
STEP 2 Remove variable with smallest contribution
to model
(largest p value – say greater than 10% level)
STEP 3 Fit new model. Remove next variable with
smallest contribution
REPEAT UNTIL: all variables in the model are
statistically significant, then STOP
NOTE: Once a variable has been removed from the
model it cannot be re-entered (a sketch of the
procedure appears below)
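A hedged sketch of this procedure with statsmodels; y is the outcome and X a pandas DataFrame of candidate predictors, with the 10% threshold mirroring the rule above:

```python
import statsmodels.api as sm

def backward_eliminate(y, X, threshold=0.10):
    """Repeatedly drop the predictor with the largest p-value."""
    X = sm.add_constant(X)
    while True:
        fit = sm.OLS(y, X).fit()
        pvals = fit.pvalues.drop("const")    # never drop the intercept
        if pvals.empty or pvals.max() <= threshold:
            return fit                       # all remaining terms significant
        X = X.drop(columns=pvals.idxmax())   # removed variables never re-enter
```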
Exercise
Multicollinearity
▪ This is a common problem in many correlation analyses.
Imagine that you have two predictors (X variables) of a
person's height:
(1) height in metres and (2) height in centimetres.