L6 - Biostatistics - Linear Regression and Correlation


Jimma University,

Institute of Health,
Public Health Faculty,
Department of
Epidemiology,
Biostatistics Unit
Analysis of Continuous Outcome Data
Correlation and Linear Regression Analysis
Session Objectives
At the end of this session, you will be able to:
✓ Use methods of association in measurement
data – correlation analysis
✓ Use scatter plots to see relationship between
variables
✓ Describe relationship of variables using simple
and multiple variable regression analysis of
measurement data
✓ Apply methods of regression analysis for
measurement data in two or more variables
✓ Identify model assumptions – parameter
estimation, hypothesis testing and prediction
Correlation Analysis
• Correlation is the method of analysis to use
when studying the possible association
between two continuous variables

• The standard method (Pearson Correlation)


leads to a quantity r that can take on any
value from -1 to +1

• The correlation coefficient r measures the


degree of 'straight-line' association
between the values of two variables
Correlation Analysis
• The correlation between two variables is
positive if
– higher values of one variable are associated
with higher values of the other and
• negative if
– one variable tends to be lower as the other gets
higher
• A correlation of around zero indicates that
there is no linear relation between the
values of the two variables

5
Correlation Analysis
• It is important to note that a correlation
between variables shows that they are
associated but does not necessarily imply a
‘cause and effect’ relationship

• In essence r is a measure of the scatter of


the points around an underlying linear trend:
– The greater the spread of the points the lower
the correlation

6
Scatter Plots and Correlation
• Correlation analysis is used to measure
strength of the association (linear
relationship) between two variables
• Scatter plot is used to show the
relationship between two variables
• Only concerned with strength of the
relationship and its direction
• We consider the two variables equally; as
a result no causal effect is implied
7
Scatter Plots and Correlation
• As OR and RR are used to quantify risk
between two dichotomous variables,
correlation is used to quantify the degree
to which two random continuous variables
are related, provided the relationship is
linear.

8
Fig. 1: Scatter plot of Systolic Blood Pressure against Age
Correlation coefficient
• In the underlying population from which
the sample of points (xi, yi) is selected, the
correlation coefficient between X and Y is
denoted by ρ (rho) and is given by
• ρ = E[(X − μX)(Y − μY)] / (σX σY)
• It can be thought of as the average of the product of the
standard normal deviates of X and Y

13
Estimation of Correlation coefficient 

• After Karl Pearson (1857 – 1936)


• The estimator of the population correlation coefficient is
given by

• r = Σ(xi − x̄)(yi − ȳ) / [Σ(xi − x̄)² Σ(yi − ȳ)²]½

• r is called the product moment correlation coefficient or
Pearson’s correlation coefficient: the covariability of X and Y
divided by the variability of X and of Y separately
• −1 ≤ r ≤ 1
• r = 0 when there is no linear relationship at all
• Pearson’s correlation coefficient is a measure of the
degree of straight line relationship
14
Correlation coefficient

• Correlation coefficient may be zero


even if there is a perfect non-linear
relationship
E.g. y = x²

15
If we have two quantitative variables X and Y,
the correlation between them denoted by r(X, Y)
is given by:

  r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²] = Σxy / √(Σx² Σy²)

    = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] [ΣY² − (ΣY)²/n]}
where xi and yi are the values of X and Y for the ith


individual

The equation is clearly symmetric as it does not


matter which variable is X and which is Y
16
Calculating r of two quantitative data

• Let X be age and Y be SBP


• From the summary of the data, we have the following
(lower-case x and y denote deviations from the means):
Mean of X = 40.41, Mean of Y = 130.54,
Σx² = 58,956.549, Σy² = 129,177.971,
Σxy = 26,871.501, Sx = 10.980, Sy = 16.253,
Sxy = 54.958 and n = 490
• r = Σxy / √(Σx² Σy²) = 26,871.501 / √(58,956.549 × 129,177.971) = 0.308

17
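The value of r can be reproduced from the summary quantities above; a minimal check in Python (the numbers are taken from the slide):

    import math

    Sxx = 58956.549    # sum of squared deviations of age, Σ(x - x̄)²
    Syy = 129177.971   # sum of squared deviations of SBP, Σ(y - ȳ)²
    Sxy = 26871.501    # sum of cross-products, Σ(x - x̄)(y - ȳ)

    r = Sxy / math.sqrt(Sxx * Syy)
    print(round(r, 3))   # 0.308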
Hypothesis testing on ρ
• Population Correlation Coefficient: ρ
• Sample Correlation Coefficient: r
• Under the null hypothesis that there is no
association in the population (ρ = 0) it can be
shown that the quantity

  t = r √[(n − 2)/(1 − r²)]

has a t distribution with n − 2 degrees of freedom

• For the age and SBP data, t = 6.99, df = 488, p < 0.001
Interpretation of correlation
• Correlation coefficients lie within the range -1
to +1, with the mid-point of zero indicating no
linear association between the two variables

• A very small correlation does not necessarily


indicate that two variables are not associated,
however

• To be sure of this we should study a plot of


data, because it is possible that the two
variables display a non-linear relationship (for
example cyclical or curved)
Interpretation of correlation
• In such cases r will underestimate the
association, as it is a measure of linear
association alone. Consider transforming the
data to obtain a linear relation before
calculating r

• Very small r values may be statistically


significant in moderately large samples, but
whether they are clinically relevant must be
considered on the merits of each case

20
Interpretation of correlation
• One way of looking at the correlation that helps to
temper over-enthusiasm is to calculate 100r²
(the coefficient of determination), which is the
percentage of variability in the data that is
'explained' by the association

• So a correlation of 0.7 implies that just about half


(49%) of the variability may be put down to the
observed association, and so on

21
Exercise: The following data shows the respective weight of a
sample of 12 fathers and their oldest son. Compute the
correlation coefficient between the two weight measurements
Wt of father (X)   Wt of son (Y)   X²     Y²     XY
65                 68              4225   4624   4420
63                 66              3969   4356   4158
67                 68              4489   4624   4556
64                 65              4096   4225   4160
68                 69              4624   4761   4692
62                 66              3844   4356   4092
70                 68              4900   4624   4760
66                 65              4356   4225   4290
68                 71              4624   5041   4828
67                 67              4489   4489   4489
69                 68              4761   4624   4692
71                 70              5041   4900   4970
Scatter Plot
[Scatter plot of sons' weight (Y) against fathers' weight (X)]

23
Calculating r
The correlation coefficient for the data on fathers’
and sons’ will be:
Basic values from the data:
  ΣX = 800, ΣX² = 53,418, ΣY = 811, ΣY² = 54,849, ΣXY = 54,107

  Σ(x − x̄)(y − ȳ) = ΣXY − (ΣX)(ΣY)/n = 54,107 − (800 × 811)/12 = 40.33
  Σ(x − x̄)² = ΣX² − (ΣX)²/n = 53,418 − (800)²/12 = 84.67
  Σ(y − ȳ)² = ΣY² − (ΣY)²/n = 54,849 − (811)²/12 = 38.92

Calculating r:
  r = 40.33 / √(84.67 × 38.92) = 0.703
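For illustration (not part of the original slides), the same value can be verified directly from the twelve weight pairs in the exercise table; a minimal sketch using numpy:

    import numpy as np

    x = np.array([65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71])  # fathers
    y = np.array([68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70])  # oldest sons

    r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
    print(round(r, 3))            # 0.703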
Significance test
• We need to check that the correlation is
unlikely to have arisen due to sample
variation
• Testing whether the calculated Pearson's
correlation coefficient is significant or not uses the statistic

  t = r / SE(r), where SE(r) = [(1 − r²)/(n − 2)]½, with df = n − 2

• A confidence interval can be calculated as r ± t(α/2) SE(r)
Significance test
For the fathers’ and sons’ weight data:
Ho: ρ = 0
HA: ρ ≠ 0
Test statistic:

  t = r √[(n − 2)/(1 − r²)] = 0.703 × √[(12 − 2)/(1 − 0.703²)] = 3.12

p < 0.01, i.e., the correlation coefficient is
significantly different from 0
26
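The test statistic and its two-sided p-value can be reproduced with scipy; a short sketch (the p-value of roughly 0.01 is consistent with the slide's conclusion):

    import math
    from scipy import stats

    r, n = 0.703, 12
    t = r * math.sqrt((n - 2) / (1 - r**2))   # t ≈ 3.12
    p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-sided p ≈ 0.01
    print(round(t, 2), round(p, 3))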
Interpretation of correlation
• Correlation coefficients lie within the range -1 to
+1, with the mid-point of zero indicating no linear
association between the two variables

• A very small correlation does not necessarily


indicate that two variables are not associated,
however
• To be sure of this we should study a plot of the
data, because it is possible that the two variables
display a non-linear relationship (for example
cyclical or curved).
• In such cases r will underestimate the association,
as it is a measure of linear association alone 27
Interpretation of correlation
• Very small r values may be statistically significant
in moderately large samples
• One way of looking at the correlation that helps to temper
over-enthusiasm is to calculate 100r², the coefficient of
determination (a measure of goodness of fit), which is the
percentage of variability in the data that is 'explained' by
the linear association

28
Inference on Correlation Coefficient

[Scatter plots of Y against X illustrating: r = 0 with b = 0 (no linear trend), r < 0 with b < 0 (negative trend), and r > 0 with b > 0 (positive trend)]
29
Pearson’s r Correlation
• As a rule of thumb, the following guidelines on
strength of relationship are often useful (though
many experts would somewhat disagree on the
choice of boundaries).
Correlation value   Interpretation
0.70 or higher      Very strong relationship
0.40 to 0.69        Strong relationship
0.30 to 0.39        Moderate relationship
0.20 to 0.29        Weak relationship
0.01 to 0.19        No or negligible relationship
30
Limitations of the correlation coefficient:

1. It quantifies only the strength of the linear


relationship between two variables
2. It is very sensitive to outlying values, and
thus can sometimes be misleading
3. It cannot be extrapolated beyond the
observed ranges of the variables
4. A high correlation does not imply a cause-
and-effect relationship

31
What is a Model?

1. Representation of Some Phenomenon

Non-Math/Stats Model

32
What is a Math/Stats Model?
1. Often Describe Relationship between
variables

2. Types
- Deterministic Models (no randomness)

- Probabilistic Models (with randomness)

33
Deterministic Models
1. Hypothesize Exact Relationships
2. Suitable When Prediction Error is Negligible
3. Example: Body mass index (BMI) is a measure of
body fat based on weight and height

– Metric formula: BMI = Weight (kg) / [Height (m)]²

– Non-metric formula: BMI = Weight (pounds) × 703 / [Height (inches)]²

34
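A deterministic model has no error term: the output is fully determined by the inputs. A minimal Python sketch of the metric BMI formula (the function name and example values are illustrative):

    def bmi_metric(weight_kg, height_m):
        """Deterministic model: BMI from weight (kg) and height (m)."""
        return weight_kg / height_m ** 2

    print(round(bmi_metric(70, 1.75), 1))   # 22.9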
Probabilistic Models
1. Hypothesize 2 Components
• Deterministic
• Random Error
2. Example: Systolic blood pressure of newborns
is 6 Times the Age in days + Random Error
• SBP = 6 × age(days) + ε
• The random error ε may be due to factors
other than age in days (e.g. birthweight)

35
Types of Probabilistic Models

Probabilistic models include:
– Regression models
– Correlation models
– Other models

36
Regression Models

• Relationship between one dependent


variable and explanatory variable(s)
• Use equation to set up relationship
• Numerical Dependent (Response) Variable
• 1 or More Numerical or Categorical Independent
(Explanatory) Variables
• Used Mainly for Prediction & Estimation

37
Types of Regression Models
Regression models are classified by the number of explanatory variables and the form of the relationship:
– 1 explanatory variable: simple regression (linear or non-linear)
– 2+ explanatory variables: multiple regression (linear or non-linear)

38
Simple Linear Regression
• Data on one variable are frequently obtained together
with data on another, related variable:
Examples
• weight and height
• House rent and income
• Yield and fertilizer

• It is usually desirable to express their relationship


by finding an appropriate mathematical equation

39
Simple Linear Regression
• To form the equation, collect the data on
the independent variable together with the
corresponding values of the other variable,
and display them in pairs.
• Let the observations be denoted by (X1
,Y1), (X2 ,Y2), (X3 ,Y3) . . . (Xn ,Yn).
• However, before trying to quantify this
relationship, plot the data and get an idea
of their nature.
• Plot these points on the XY plane and
obtain the scatter diagram.
40
Simple Linear Regression
[Scatter plot: heights of fathers (inches) against heights of their oldest sons (inches)]

41
Simple Linear Regression
❖Simple regression uses the relationship
between the two variables to obtain
information about one variable by knowing
the values of the other

❖The equation showing this type of


relationship is called simple linear
regression equation

42
Simple Linear Regression
• The case of simple linear regression
considers a single regressor or predictor x
and a dependent or response variable Y.
• The expected value of Y at each level of x is given by
  E(Y|X) = α + βX
• We assume that each observation, Y, can be
described by the model
  Y = α + βX + ε
Linear Equations
Y = mX + b, where m = slope = (change in Y)/(change in X) and b = Y-intercept

44
Linear Regression Model

Relationship between variables is a linear function:

  Yi = β0 + β1Xi + εi

where β0 is the population Y-intercept, β1 is the population slope and
εi is the random error; Yi is the dependent (response) variable
(e.g., CD4+ count) and Xi is the independent (explanatory) variable
(e.g., years since seroconversion)
Population & Sample
Regression Models

• In the population the relationship is unknown:
  Yi = β0 + β1Xi + εi
• A random sample is drawn from the population and used to estimate it:
  Ŷi = β̂0 + β̂1Xi (with estimated errors ε̂i)
Population Linear Regression Model

  Yi = β0 + β1Xi + εi, where εi is the random error about the
  population line E(Y) = β0 + β1Xi; the observed values scatter
  around this line

Sample Linear Regression Model

  Ŷi = β̂0 + β̂1Xi, with a residual ε̂i for each observed value;
  unsampled observations are not used in fitting the line
Simple Linear Regression
♦ The scatter diagram helps to choose the curve that
best fits the data. The simplest type of curve is a
straight line whose equation is given by
  Ŷ = a + bxi

♦ This equation is a point estimate of Y = α + βX + ε
  b = the sample regression coefficient of Y on X
  β = the population regression coefficient of Y on X

♦ 'Y on X' means Y is the dependent variable and X is
the independent one
Simple Linear Regression
• a is the estimated average value of y
when the value of x is zero (provided that
0 is inside the data range considered)

• Otherwise, it shows the portion of the


variability of the dependent variable left
unexplained by the independent variables

53
Simple Linear Regression
♦ Regression is a method of estimating the
numerical relationship between variables

♦ For example, we would like to know what is the


mean or expected weight for factory workers of a
given height, and what increase in weight is
associated with a unit increase in height.

♦ The purpose of a regression equation is to use


one variable to predict another

How is the regression equation determined?


The method of least square
• The model Y =  + X + ε refers to the
population from which the sample was
drawn
• The regression line Ŷ = a + bx is an
estimate of the population regression line
that was found using ordinary least
squares (OLS)
• The difference between the given score Y
and the predicted score Ŷ is known as the
error of estimation
55
The method of least square
► The regression line, or the line which best fits the
given pairs of scores, is the line for which the sum of
the squares of these errors of estimation (Σei²) is
minimized
► That is, of all the curves, the curve with minimum Σei²
is the least squares regression line which best fits the given
data
► The least squares regression line for the set of
observations
(X1, Y1), (X2, Y2), (X3, Y3) . . . (Xn, Yn) has the
equation
  Ŷ = a + bxi
The method of least square
► The values ‘a’ and ‘b’ in the equation are constants, i.e.,
their values are fixed.

► The constant ‘a’ indicates the value of y when x=0. It is


also called the y intercept.

► The value of ‘b’ shows the slope of the regression line and
gives us a measure of the change in y for a unit change in x.

► This slope (b) is frequently termed as the regression


coefficient of Y on X.

► If we know the values of ‘a’ and ‘b’, we can easily compute
the value of Ŷ for any given value of X.
The method of least square
► The constants ‘a’ and ‘b’ are determined by
solving the normal equations simultaneously:

  ΣY = an + bΣX
  ΣXY = aΣX + bΣX²

which give

  a = Ȳ − bX̄
  b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = [nΣXY − ΣXΣY] / [nΣX² − (ΣX)²]
58
SLR-example 1
Heights of 10 fathers (X) together with their oldest sons (Y)
are given below (in inches). Find the regression of Y on X.

Father (X)   Oldest son (Y)   Product (XY)   X²
63           65               4095           3969
64           67               4288           4096
70           69               4830           4900
72           70               5040           5184
65           64               4160           4225
67           68               4556           4489
68           71               4828           4624
66           63               4158           4356
70           70               4900           4900
71           72               5112           5041
Total 676    679              45967          45784


SLR-example 1
a = Ȳ − bX̄

b = [nΣXY − ΣXΣY] / [nΣX² − (ΣX)²] = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²

b = [10(45967) − (676 × 679)] / [10(45784) − (676)²]
  = (459670 − 459004) / (457840 − 456976) = 666 / 864 = 0.77

a = 679/10 − 0.77(676/10) = 67.9 − 52.05 = 15.85

Therefore, Ŷ = 15.85 + 0.77X

The regression coefficient of Y on X (i.e., 0.77) tells us the change in Y due to a unit change in X.
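For illustration (not part of the original slides), the least-squares line can be reproduced with numpy; note that the intercept comes out as about 15.79 when b is not rounded to 0.77 first:

    import numpy as np

    x = np.array([63, 64, 70, 72, 65, 67, 68, 66, 70, 71])   # fathers' heights
    y = np.array([65, 67, 69, 70, 64, 68, 71, 63, 70, 72])   # oldest sons' heights

    b, a = np.polyfit(x, y, 1)        # slope and intercept of the least-squares line
    print(round(b, 2), round(a, 2))   # b ≈ 0.77, a ≈ 15.79
    print(round(a + b * 70, 2))       # predicted son's height for a 70-inch father ≈ 69.75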
SLR-example 1
Estimate the height of the oldest son for a father’s
height of 70 inches.

Ŷ = 15.85 + 0.77 (70) = 69.75 inches

NB: 1) n is the number of pairs of X and Y scores


which are used in determining the regression line.
In the above example, n=10.

2) Be careful to distinguish between (ΣX)² and ΣX².


Standard error of regression
coefficients
• The calculated values for a and b are
sample estimates of the values of the
intercept and slope from the regression
line describing the linear association
between x and y in the whole population
• Therefore , they are subject to sampling
variation and their precision is measured
by their standard errors

62
Standard errors of regression
coefficients
• The SEs of the regression coefficients are given by:

  se(a) = S √[1/n + X̄² / Σ(x − x̄)²]   and   se(b) = S / √[Σ(x − x̄)²]

where

  S² = [Σ(y − ȳ)² − b² Σ(x − x̄)²] / (n − 2)

S is the standard deviation of the points about the
line. It has (n − 2) degrees of freedom.
63
Example: (1 − α)100% CI for
regression coefficient
• Consider the age and SBP data and the
fitted regression model
• SBP = 112.12 + 0.456(Age)
• S = 15.48, se(a) = 2.67, se(b) = 0.064
• A 95% confidence interval for the slope is:
  estimated slope ± t(1−α/2) × SE(slope)
  0.456 ± 1.96 × 0.064 = (0.331, 0.581)
• The 95% CI does not include 0 => there is
sufficient evidence that age affects SBP
64
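The confidence limits follow directly from the slope, its standard error and the t quantile; a quick check (using the exact t quantile for 488 df gives essentially the same interval as the slide's 1.96):

    from scipy import stats

    b, se_b, n = 0.456, 0.064, 490
    t_crit = stats.t.ppf(0.975, df=n - 2)   # ≈ 1.965
    print(round(b - t_crit * se_b, 3), round(b + t_crit * se_b, 3))   # ≈ (0.330, 0.582)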
Significance test for β
H0: β = 0
H1: β ≠ 0
If the null hypothesis is true then the
statistic:

  t = (observed slope − 0) / SE of observed slope

will follow a t-distribution with (n − 2)
df
65
Significance test for β – Example
• For the age and SBP data,

• b = 0.456 and se(b) = 0.064, then

• t = 7.15 with (n − 2) = 488 df

  => p < 0.001
➢ Decision: Reject H0

66
ANOVA Table for linear regression
[ANOVA table: regression, residual and total sums of squares with their df, mean squares, F statistic and p-value]

Deviation
[Diagram: deviation of each observation from the mean split into an explained (regression) part and an unexplained (residual) part]
Simple linear regression
Explained, unexplained (error), total variations

♣ If all the points on the scatter diagram fall on the


regression line we could say that the entire
variance of Y is due to variations in X.
Explained variation = Σ(Ŷ − Ȳ)²

♣ The measure of the scatter of points away from the


regression line gives an idea of the variance in Y
that is not explained with the help of the regression
equation.
69
Unexplained variation = Σ(Y - Ŷ)²
Simple linear regression
►The variation of the Y’s about their mean can also be
computed. The quantity Σ(Y- Y)² is called the total
variation.
Explained variation + unexplained variation =Total
variation

► The ratio of the explained variation to the total


variation measures how well the linear regression line
fits the given pairs of scores.
► It is called the coefficient of determination,
and is denoted by r²:

  r² = explained variation / total variation
ANOVA Table for linear regression

• P < α: reject H0; the model does explain
the variation observed in the data
Simple linear regression
► The explained variation is never negative
and is never larger than the total variation

► Therefore, r² is always between 0 and 1.


If the explained variation equals 0, r² = 0

► If r² is known, then r = ±√r². The sign of r
is the same as the sign of b from the
regression equation
72
Simple linear regression model
• The relationship y = α + βx is not expected
to hold exactly for every individual, but the
average value of y for a given value of x is
  E(Y|x) = α + βx
• An error term ε, which represents the
variance of the dependent variable among
all individuals with a given x, is introduced
into the model
• The full linear-regression model then takes
the form y = α + βx + ε
Simple linear regression model
• ε is the residual, normally distributed with
mean 0 and variance σ²
• One interpretation of the regression line is
that for a subject with independent value x,
the corresponding dependent value y will be
normally distributed with mean α + βx and
variance σ²
• If σ² were 0, then every point would fall
exactly on the regression line, whereas the
larger σ² is, the more scatter occurs about
the regression line
Assumptions
The assumptions made when using this method are:

♣ The relationship between the outcome and the


explanatory variable is linear or at least
approximately linear;

♣ At each value of the explanatory variable the


outcomes follow a normal distribution;

♣ The variance of the outcome is constant for all
values of the explanatory variable
Assumptions of Linear Regression
1. Linear relationship between outcome (y)
and explanatory variable x
2. Outcome variable (y) should be Normally
distributed for each value of explanatory variable
(x)
3. Standard deviation of y should be approximately
the same for each value of x
4. Fixed independent observations
e.g. Only one point per person
5. No outlier distortion
76
Assumptions of linear regression

Assumption 1: Linear relationship between y and x
Assumption 2: Y normally distributed at each value of x
Assumption 3: Same variance at each value of x

[Illustrative scatter plots for each assumption]
77
Diagnostic Tests for the Regression
Assumptions
• Linearity tests: Regression curve fitting: No level
shifts
• Independence of observations: Runs test
• Normality of the residuals: Shapiro-Wilks or
Kolmogorov-Smirnov Test
• Homogeneity of variance of the residuals: White’s
General Specification test
• No autocorrelation of residuals: Durbin Watson or
ACF of residuals

78
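As an illustration of how some of these diagnostics might be run in Python (the data below are simulated and the variable names are assumptions, not part of the slides):

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats
    from statsmodels.stats.stattools import durbin_watson
    from statsmodels.stats.diagnostic import het_white

    rng = np.random.default_rng(0)
    x = rng.normal(60, 10, size=100)                  # simulated explanatory variable
    y = 100 + 0.5 * x + rng.normal(0, 5, size=100)    # simulated outcome

    X = sm.add_constant(x)                 # add intercept column
    resid = sm.OLS(y, X).fit().resid       # residuals from the fitted line

    print(stats.shapiro(resid))            # normality of residuals
    print(durbin_watson(resid))            # values near 2 suggest no autocorrelation
    print(het_white(resid, X))             # White's test for non-constant variance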
Diagnostic Tests for the Regression
Assumptions
• Plot residuals and look for high leverage of
residuals
– Lists of Standardized residuals
– Lists of Studentized residuals
– Cook’s distance or leverage statistics

79
Testing Assumptions:
Assumption 1: linear relationship

Plot y against x to check for linearity


[Scatter plot of blood pressure (mmHg) against weight]

80
Testing Assumptions:
Assumption 2: Normality

Y normally distributed at each value of x; the residuals need to be normally distributed

[Scatter plot of blood pressure against weight with fitted line; R² (linear) = 0.166]
Testing Assumptions:
Assumption 2: Normality

Histogram of residuals and normal probability plot

[Histogram of the regression standardized residuals (Mean = 0.00, Std. Dev = 1.00, N = 127) and normal P-P plot of the standardized residuals; dependent variable: Systolic BP]

82
Testing Assumptions:
Assumption 3: Spread of y values constant over
range of x values

Plot residuals against x values

[Plot of unstandardized residuals against weight]

83
Multiple Linear Regression
Multivariate Analysis
▪ Multivariate analysis refers to the analysis of data that
takes into account a number of explanatory variables
and one outcome variable simultaneously
▪ The purpose of multiple regression is to analyze the
relationship between metric or dichotomous
independent variables and a metric dependent
variable
▪ If there is a relationship, using the information in the
independent variables will improve our accuracy in
predicting values for the dependent variable
▪ It allows for the efficient estimation of measures of
association while controlling for a number of
confounding factors
Multivariate Analysis
▪ A large number of multivariate models have been
developed for specialized purposes, each with a
particular set of assumptions underlying its
applicability
▪ The choice of the appropriate model is based on the
underlying design of the study, the nature of the
variables, as well as assumptions regarding the
inter-relationship between the exposures and
outcomes under investigation

86
Multiple linear regression
▪ Multiple linear regression (we often refer to
this method as multiple regression) is an
extension of the most fundamental model
describing the linear relationship between
two variables
▪ Multiple regression is a statistical technique
that is used to measure and describe the
function relating two (or more) predictor
(independent) variables to a single response
(dependent) variable
87
Regression equation for a linear relationship

▪ All types of multivariate analyses involve the


construction of a mathematical model to
describe the association between independent
and dependent variables
▪ Individual observations of the outcome yi are
modeled as varying by an error term εi about an
average determined by their predictor values xi:

  Yi = α + β1X1i + β2X2i + β3X3i + … + βpXpi + εi,   εi ~ i.i.d. N(0, σ²ε)

88
Regression equation for a linear relationship

▪ In the multiple regression model

  E[Y|x] = α + β1x1 + β2x2 + β3x3 + … + βpxp
  Yfit = a + b1X1 + b2X2 + . . . + bpXp

where x1, x2, …, xp represent p predictors and β1, β2,
…, βp are the corresponding regression coefficients,
each giving the change in E[Y|x] for an increase of one
unit in predictor xi, holding all other predictors constant
(when the xi are quantitative variables).

89
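A minimal sketch of fitting such a model in Python; the data are simulated here to mimic the weight/age/SBP example used later in the slides, so the coefficients are illustrative only:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 490
    weight = rng.normal(65, 12, n)                                # kg (simulated)
    age = rng.normal(40, 11, n)                                   # years (simulated)
    sbp = 86 + 0.39 * weight + 0.38 * age + rng.normal(0, 14, n)  # simulated outcome

    X = sm.add_constant(np.column_stack([weight, age]))
    fit = sm.OLS(sbp, X).fit()
    print(fit.params)                        # intercept, b_weight, b_age
    print(fit.rsquared, fit.rsquared_adj)    # R² and adjusted R²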
Analysis of residuals
▪ The residual standard deviation is a measure of the
average difference between the observed y values
and those predicted or fitted by the model
▪ The residuals are given by yobs - yfit , where yobs is
the observed value of the dependent variable
▪ We cannot plot the original multi-dimensional
data, but we can examine plots of the residuals to
see if the model is reasonable
▪ Specifically, we have to check whether the
assumptions of Normal distribution and uniform
variance are met

90
Assumptions of multiple regression
a) The relationship between the dependent variable
and each continuous explanatory variable is linear

✓ We can examine this assumption for any variable by
plotting the residuals (the difference between observed values of the
dependent variable and those predicted by the regression equation)
against the independent variable (i.e., by using bivariate scatter plots)

✓ Any curvature in the pattern will indicate that
a non-linear relationship is more appropriate –
transformation of the explanatory variable may be
considered
91
Assumptions of multiple regression
b) We can produce a Normal plot of the residuals, to
check the overall fit and verify that the residuals
have an approximately normal distribution.
The normal plot may identify outliers for further investigation

c) We can plot the residuals against the fitted values


No pattern should be discernible. In particular, the
variability of the residuals should be constant
across the range of the fitted values

d) The observations (explanatory variables) should be


independent

92
Summary, Assumptions of MLR
1. Linear functional form
2. Fixed independent variables
3. Independent observations
4.Representative sample and proper specification
of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of
residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
Predicted and Residual Scores
▪ The regression line expresses the best
prediction of the dependent variable (Y),
given the independent variables (X).
▪ However, nature is rarely perfectly
predictable, and usually there is substantial
variation of the observed points around the
fitted regression line
▪ The deviation of a particular point from the
regression line (its predicted value) is called
the residual value
94
Residual Variance and R-square
▪ The smaller the variability of the residual values
around the regression line relative to the overall
variability, the better is our prediction
▪ For example, if there is no relationship between the X and Y
variables, then the ratio of the residual variability of the Y
variable to the original variance is equal to 1.0
▪ If X and Y are perfectly related then there is no residual
variance and the ratio of residual variability to the total
variance would be 0.0
▪ In most cases, this ratio would fall somewhere
between these extremes, that is, between 0.0 and
1.0.
▪ One minus this ratio is referred to as R-square (R2)
or the coefficient of determination
95
Residual Variance and R-square

▪ This value is immediately interpretable in the following


manner. If we have an R-square of 0.4 then we know
that the variability of the Y values around the
regression line is 1- 0.4 times the original variance
▪ In other words, we have explained 40% of the original
variability, and are left with 60% residual variability
▪ Ideally, we would like to explain most if not all of the
original variability

96
Residual Variance and R-square
▪ The R-square value is an indicator of how well the
model fits the data
▪ An R-square close to 1.0 indicates that we have
accounted for almost all of the variability with the
variables specified in the model

97
Variance of adjusted regression coefficients

▪ Including multiple predictors does affect the


variance of bi due to multiple correlation of xi with
other predictors in the model
S y2| x
Var (bi ) 
(n  1) S xi2 (1  ri 2 )

2
S
▪ Where y|x is the residual variance of the outcome
2
and S xi is the variance of xi; ri is equivalent to r  R 2
from a multiple linear regression model in which xi
is regressed on all other predictors
98
Variance of adjusted regression coefficients
▪ The term 1/(1 − ri²) is known as the
variance inflation factor (VIF), since Var(bi)
is increased to the extent that xi is
correlated with other predictors in the
model
▪ (1 − ri²) is called the tolerance
99
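For illustration (simulated predictors, not from the slides), the VIF of each predictor can be computed by regressing it on the others, for example with statsmodels:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(2)
    x1 = rng.normal(size=100)
    x2 = 0.8 * x1 + rng.normal(scale=0.6, size=100)   # deliberately correlated with x1
    X = sm.add_constant(np.column_stack([x1, x2]))

    for j in (1, 2):                                  # skip the constant in column 0
        print(j, variance_inflation_factor(X, j))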
Interpreting the Correlation Coefficient R
▪ Customarily, the degree to which two or more
predictors (independent or X variables) are related
to the dependent (Y) variable is expressed in the
correlation coefficient R, which is the square root of R-
square.
▪ In multiple regression, R assumes values between 0
and 1. This is true due to the fact that no meaning can
be given to the direction of correlation in the
multivariate case.
▪ The larger R is, the more closely correlated the
predictor variables are with the outcome variable.
▪ When R=1, the variables are perfectly correlated in
the sense that the outcome variable is a linear
combination of the others.
100
Multiple linear regression of Systolic Blood
Pressure on Weight AND Age SPSS output
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .478   .228       .225                14.30892
a. Predictors: (Constant), AGE (years), WEIGHT (Kg)

ANOVA
Model        SS           df    Mean Square   F        Sig.
Regression   29467.088    2     14733.544     71.960   .000
Residual     99710.886    487   204.745
Total        129177.974   489
a. Predictors: (Constant), WEIGHT (Kg), AGE (years)
b. Dependent Variable: SYSTOLIC BP (mmHg)

101
SPSS output: t = βk / se(βk) and p-value

Coefficients
Model         B        Std. Error   Beta   t        Sig.
(Constant)    86.295   3.744               23.047   .000
WEIGHT (Kg)   .388     .042         .369   9.170    .000
AGE (years)   .375     .060         .253   6.285    .000
a. Dependent Variable: SYSTOLIC BP (mmHg)

Multiple Linear Regression Equation is:
SBP = 86.295 + 0.388(WEIGHT) + 0.375(AGE)

             Unadjusted (simple LR)   Adjusted (multiple LR)
Weight (β)   0.428                    0.388
Age (β)      0.456                    0.375
102
Coefficient of determination (R2)
Measures how much of total variation in outcome is
explained by ALL regression variables
Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .478   .228       .225                14.30892
a. Predictors: (Constant), AGE (years), WEIGHT (Kg)

R² = 22.8%
22.8% of variation in Systolic BP is explained by differences in
weight and age

With many variables in the model, R² tends to be an overestimate;
Adjusted R² is a more conservative estimate => adjusted R² = 22.5%
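Adjusted R² follows from R², the sample size n and the number of predictors p; a quick check of the slide's figures:

    n, p, r2 = 490, 2, 0.228
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    print(round(adj_r2, 3))   # 0.225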
Systolic BP vs. Age in males and females

[Scatter plot of systolic BP (mmHg) against age (years), with separate markers for males and females]
104
Systolic BP vs. Age in males and females – SPSS output

Variables Entered/Removed
Model 1: SEX, AGE (years) entered (Method: Enter); Dependent Variable: SYSTOLIC BP (mmHg)

Model Summary
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .391   .153       .149                14.99139
a. Predictors: (Constant), SEX, AGE (years)

ANOVA
Model        Sum of Squares   df    Mean Square   F        Sig.
Regression   19728.668        2     9864.334      43.892   .000
Residual     109449.3         487   224.742
Total        129178.0         489
a. Predictors: (Constant), SEX, AGE (years)
b. Dependent Variable: SYSTOLIC BP (mmHg)
105
Systolic BP vs. Age in males and females

Coefficients
Model         B         Std. Error   Beta   t        Sig.   95% CI for B (Lower, Upper)
(Constant)    108.364   2.666               40.647   .000   (103.125, 113.602)
AGE (years)   .458      .062         .310   7.421    .000   (.337, .580)
SEX           7.830     1.357        .241   5.768    .000   (5.163, 10.497)
a. Dependent Variable: SYSTOLIC BP (mmHg)

Systolic BP = 108.364 + 0.458(age) + 7.830(sex)

AGE: P < 0.01
SEX: P < 0.01

106
Systolic BP vs. Age in males & females
Regression lines show predicted values

[Scatter plot of systolic BP (mmHg) against age (years) with separate fitted regression lines for males and females; the vertical gap between the two parallel lines corresponds to β2, the coefficient of SEX]
107
Choice of the Number of Variables
▪ With the multiple regression technique it is tempting to
"plug in" as many predictor variables as you can think of,
and usually at least a few of them will come out significant
▪ This is because one is capitalizing on chance when
simply including as many variables as one can think of
as predictors of some other variable of interest. This
problem is compounded when, in addition, the
number of observations is relatively low
▪ Intuitively, it is clear that one can hardly draw
conclusions from an analysis of 100 questionnaire
items based on 10 respondents
▪ Most authors recommend that one should have at
least 10 to 20 times as many observations (cases,
respondents) as one has variables, otherwise the
estimates of the regression line will probably be
unstable 108
Choice of the Number of Variables
• Sometimes we know in advance which variables
we wish to include in a multiple regression
model. Here it is straightforward to fit a regression
model containing all of those variables

• Variables that are not significant can be omitted and


the analysis redone but sometimes it is desirable to
keep a variable in a model

• In large samples the omission of non-significant


variables will have little effect on the other
regression coefficients
• Usually it makes sense to omit variables that do
not contribute much to the model (P > .05)
Choice of the Number of Variables
♣ The statistical significance of each variable in the
multiple regression model is obtained simply by
calculating the ratio of the regression coefficient to its
standard error

♣ The b/se(b) is related to the t distribution with n-k-1


degrees of freedom, where n is the sample size and k
is the number of variables in the model

110
Find the ‘best’ model

The automated approach!

Stepwise regression models

Stepwise regression is a technique for choosing predictor


variables from a large set

(1) Forward selection (forward stepwise regression)

(2) Backward elimination (backward stepwise regression)

(3) Stepwise regression

111
(1) Forward Selection
STEP 1 Find variable which has strongest association with
dependent variable and enter it into model
(i.e. largest t-statistic and smallest P value)
STEP 2 Find next variable out of those remaining which
explains largest amount of remaining variability

STEP 3 Repeat step 2 until addition of next most important


variable is not statistically significant (e.g. at 5% level)
STOP

112
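A p-value-based forward selection could be sketched as follows (illustrative only; the threshold, the pandas DataFrame `df` and the variable names are assumptions, and automated selection has the caveats discussed later):

    import statsmodels.api as sm

    def forward_select(df, outcome, candidates, alpha=0.05):
        """Greedy forward selection: at each step add the candidate predictor
        with the smallest p-value, stopping when none is significant.
        `df` is assumed to be a pandas DataFrame."""
        selected = []
        while True:
            best_p, best_var = None, None
            for var in (c for c in candidates if c not in selected):
                X = sm.add_constant(df[selected + [var]])
                p = sm.OLS(df[outcome], X).fit().pvalues[var]
                if best_p is None or p < best_p:
                    best_p, best_var = p, var
            if best_var is None or best_p >= alpha:
                break
            selected.append(best_var)
        return selected

    # e.g. forward_select(df, "sbp", ["age", "weight", "sex"])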
(2) Backwards Regression
STEP 1 Start with model which includes all
explanatory variables
STEP 2 Remove variable with smallest contribution
to model
(largest p value – say greater than 10% level)
STEP 3 Fit new model. Remove next variable with
smallest contribution
REPEAT All variables in model are statistically
UNTIL: significant
STOP
NOTE: Once a variable has been removed from
model it cannot be re-entered
113
(2) Backwards Regression

♣ As its name indicates, with the backward stepwise method


we approach the problem from the other direction.
♣ The argument given is that we have collected data on these
variables because we believe them to be potentially
important explanatory variables
♣ Therefore, we should fit the full model, including all of these
variables, and then remove unimportant variables one at a
time until all those remaining in the model contribute
significantly
♣ We use the same criterion, say P<.05, to determine
significance. At each step we remove the variable with the
smallest contribution to the model (or the largest P-value) as
long as that P-value is greater than the chosen level 114
(3) Stepwise Regression
Forward selection : a variable added early on can become
unimportant after other variables are added
STEP 1 First step of forward selection performed
STEP 2 Checks significance of included variables
STEP 3 If any variables not significant procedure
changes to backwards elimination
- drops variables
STEP 4 Continues including significant variables and
dropping ones no longer significant
UNTIL: All unused variables are not significant and all
used variables are significant
STOP 115
Comments on Automatic Procedures
•No automatic procedure GUARANTEES that 'best' model
will be chosen

• Forward selection, backward elimination and stepwise


selection don't necessarily end up with the same model

•Arbitrariness in choice of P value for entering or leaving


the model

• No selection procedure can substitute for the insight


of the researcher

NB missing data causes problems with automatic


procedures

116
Exercise

What do you do when you have a lot of


independent variables (say, 30 or more)?
(Hint: start with the classical bivariate analysis and use a
lax criterion, such as α = 0.2)

117
Multicollinearity
▪ This is a common problem in many correlation analyses.
Imagine that you have two predictors (X variables) of a
person's height:
(1) height in metres and (2) height in cms.

▪ Obviously, our two predictors are completely redundant; height


is one and the same variable, regardless of whether it is
measured in metres or cms

▪ Trying to decide which one of the two measures is a better


predictor of weight would be rather silly; however, this is
exactly what one would try to do if one were to perform a
multiple regression analysis with weight as the dependent (Y)
variable and the two measures of height as the independent
118
(X) variables
Multicollinearity

▪ When there are very many variables involved, it is


often not immediately apparent that this problem
exists, and it may only manifest itself after several
variables have already been entered into the
regression equation
▪ Nevertheless, when this problem occurs it means
that at least one of the predictor variables is
completely redundant with other predictors.
▪ There are many statistical indicators of this type of
redundancy
119
❖ What are the advantages of
multivariate analyses such as the
use of "multiple linear regression"?

Analyze the data given under the


file name, “Exercise_multiple_LR”
Thank You!
