L6 - Biostatistics - Linear Regression and Correlation
Institute of Health, Public Health Faculty, Department of Epidemiology, Biostatistics Unit
Analysis of Continuous Outcome Data
Correlation and Linear Regression Analysis
Session Objectives
At the end of this session, you will be able to:
✓ Use methods of association in measurement
data – correlation analysis
✓ Use scatter plots to see relationship between
variables
✓ Describe relationship of variables using simple
and multiple variable regression analysis of
measurement data
✓ Apply methods of regression analysis for
measurement data in two or more variables
✓ Identify model assumptions – parameter
estimation, hypothesis testing and prediction
Correlation Analysis
• Correlation is the method of analysis to use
when studying the possible association
between two continuous variables
Correlation Analysis
• It is important to note that a correlation
between variables shows that they are
associated but does not necessarily imply a
‘cause and effect’ relationship
Scatter Plots and Correlation
• Correlation analysis is used to measure
strength of the association (linear
relationship) between two variables
• Scatter plot is used to show the
relationship between two variables
• Only concerned with strength of the
relationship and its direction
• We consider the two variables equally; as
a result no causal effect is implied
Scatter Plots and Correlation
• As OR and RR are used to quantify risk
between two dichotomous variables,
correlation is used to quantify the degree
to which two random continuous variables
are related, provided the relationship is
linear.
Fig.1: Systolic Blood Pressure against Age
Correlation coefficient
• In the underlying population from which
the sample of points (xi, yi) is selected, the
correlation coefficient between X and Y is
denoted by ρ (rho) and is given by

ρ = Average[(X − μX)(Y − μY)] / (σX σY)

• It can be thought of as the average of the product of the
standard normal deviates of X and Y
Estimation of Correlation coefficient
If we have two quantitative variables X and Y,
the correlation between them, denoted by r(X, Y),
is given by:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²]

or, in computational form,

r = [ΣXY − (ΣX)(ΣY)/n] / √{[ΣX² − (ΣX)²/n] · [ΣY² − (ΣY)²/n]}
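As a minimal sketch (not part of the original slides), the computational formula translates directly into Python:

```python
# Illustrative implementation of the computational formula for r.
import math

def pearson_r(x, y):
    """Pearson correlation coefficient from sums and sums of squares."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)              # sum of X^2
    syy = sum(v * v for v in y)              # sum of Y^2
    sxy = sum(a * b for a, b in zip(x, y))   # sum of XY
    num = sxy - sx * sy / n
    den = math.sqrt((sxx - sx**2 / n) * (syy - sy**2 / n))
    return num / den
```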
Hypothesis testing on ρ
• Population Correlation Coefficient: ρ
• Sample Correlation Coefficient: r
• Under the null hypothesis that there is no
association in the population (ρ = 0), it can be
shown that the quantity

t = r √(n − 2) / √(1 − r²)

has a t distribution with n − 2 degrees of freedom
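A hedged Python sketch of this test (assuming scipy is available):

```python
# t test for H0: rho = 0, using the statistic above with n - 2 df.
import math
from scipy import stats

def corr_t_test(r, n):
    """Return the t statistic and two-sided p-value for H0: rho = 0."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)   # upper-tail probability, doubled
    return t, p
```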
Interpretation of correlation
• One way of looking at the correlation, which helps to
temper over-enthusiasm, is to calculate 100r²
(the coefficient of determination), which is the
percentage of variability in the data that is
'explained' by the association
Exercise: The following data show the respective weights of a
sample of 12 fathers and their oldest sons. Compute the
correlation coefficient between the two weight measurements.
Wt of father (X)   Wt of son (Y)   X²     Y²     XY
65                 68              4225   4624   4420
63                 66              3969   4356   4158
67                 68              4489   4624   4556
64                 65              4096   4225   4160
68                 69              4624   4761   4692
62                 66              3844   4356   4092
70                 68              4900   4624   4760
66                 65              4356   4225   4290
68                 71              4624   5041   4828
67                 67              4489   4489   4489
69                 68              4761   4624   4692
71                 70              5041   4900   4970
Scatter Plot
[Scatter plot of fathers' weights (X) against sons' weights (Y); both range from about 62 to 72.]
Calculating r
The correlation coefficient for the data on fathers'
and sons' weights will be:
Basic values from the data (n = 12):

ΣX = 800, ΣX² = 53,418, ΣY = 811, ΣY² = 54,849, ΣXY = 54,107

r = [54,107 − (800 × 811)/12] / √{[53,418 − 800²/12] · [54,849 − 811²/12]} ≈ 0.70
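As a quick cross-check (a sketch, not from the slides), numpy reproduces the hand calculation:

```python
import numpy as np

fathers = np.array([65, 63, 67, 64, 68, 62, 70, 66, 68, 67, 69, 71])
sons = np.array([68, 66, 68, 65, 69, 66, 68, 65, 71, 67, 68, 70])
r = np.corrcoef(fathers, sons)[0, 1]   # Pearson correlation matrix entry
print(round(r, 2))                     # ~0.70, matching the sums above
```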
Inference on Correlation Coefficient
[Scatter plots illustrating the sign of the correlation: r > 0 goes with slope b > 0, r = 0 with b = 0, and r < 0 with b < 0.]
Pearson’s r Correlation
• As a rule of thumb, the following guidelines on
strength of relationship are often useful (though
many experts would somewhat disagree on the
choice of boundaries).
Correlation value Interpretation
0.70 or higher Very strong relationship
0.40 to 0.69 Strong relationship
0.30 to 0.39 Moderate relationship
0.20 to 0.29 Weak relationship
0.01 to 0.19 No or negligible relationship
Limitations of the correlation coefficient:
What is a Model?
1. Representation of Some Phenomenon
• Non-Math/Stats Model
What is a Math/Stats Model?
1. Often Describe Relationship between
variables
2. Types
- Deterministic Models (no randomness)
- Probabilistic Models (deterministic component plus random error)
Deterministic Models
1. Hypothesize Exact Relationships
2. Suitable When Prediction Error is Negligible
3. Example: Body mass index (BMI) is a measure of
body fat based on height and weight:
BMI = weight (kg) / height (m)²
Probabilistic Models
1. Hypothesize 2 Components
• Deterministic
• Random Error
2. Example: Systolic blood pressure of newborns
is 6 Times the Age in days + Random Error
• SBP = 6 × age(days) + ε
• Random Error May Be Due to Factors
Other Than age in days (e.g. Birthweight)
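A small simulation sketch of this probabilistic model; the error standard deviation of 2 is an arbitrary assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
age_days = np.arange(1, 11)                  # ages 1..10 days
eps = rng.normal(0, 2, size=age_days.size)   # random error (SD = 2 assumed)
sbp = 6 * age_days + eps                     # SBP = 6 x age(days) + error
```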
Types of Probabilistic Models
Regression Models
Types of Regression Models
• 1 explanatory variable → simple regression models
• 2+ explanatory variables → multiple regression models
• Both types may be linear or non-linear
Simple Linear Regression
• Data on one variable are frequently obtained
together with data on a related variable:
Examples
• weight and height
• house rent and income
• yield and fertilizer
Simple Linear Regression
• To form the equation, collect paired observations
on the independent variable and the variable
whose value is to be estimated.
• Let the observations be denoted by (X₁, Y₁),
(X₂, Y₂), (X₃, Y₃), ..., (Xₙ, Yₙ).
• However, before trying to quantify this
relationship, plot the data and get an idea
of their nature.
• Plot these points on the XY plane to
obtain the scatter diagram.
Simple Linear Regression
[Scatter plot: relationship between heights of fathers (x-axis, 62–74 inches) and heights of their oldest sons (y-axis, 62–73 inches).]
Simple Linear Regression
❖Simple regression uses the relationship
between the two variables to obtain
information about one variable by knowing
the values of the other
Simple Linear Regression
• The case of simple linear regression
considers a single regressor or predictor x
and a dependent or response variable Y.
• The expected value of Y at each level of x is
E(Y|x) = β₀ + β₁x
• We assume that each observation, Y, can be
described by the model
Y = β₀ + β₁x + ε
Linear Equations
Y = mX + b

where m = slope, the change in Y per unit change in X,
and b = the Y-intercept.
Linear Regression Model
Yᵢ = β₀ + β₁Xᵢ + εᵢ

• Yᵢ: dependent (response) variable (e.g., CD4+ cell count)
• Xᵢ: independent (explanatory) variable (e.g., years since seroconversion)
Population & Sample
Regression Models
• In the population, the true relationship
Yᵢ = β₀ + β₁Xᵢ + εᵢ
is unknown.
• A random sample is drawn from the population, and the
sample regression model
Yᵢ = β̂₀ + β̂₁Xᵢ + ε̂ᵢ
is used to estimate the unknown population relationship.
Population Linear Regression
Model

Yᵢ = β₀ + β₁Xᵢ + εᵢ (observed value)
E(Y) = β₀ + β₁Xᵢ (population regression line)
εᵢ = random error

[Diagram: observed values scatter about the population regression line.]
Sample Linear Regression
Model

Ŷᵢ = β̂₀ + β̂₁Xᵢ (fitted regression line)
ε̂ᵢ = residual

[Diagram: sampled observations (and an unsampled observation) scatter about the fitted line.]
Simple Linear Regression
♦ The scatter diagram helps to choose the curve that
best fits the data. The simplest type of curve is a
straight line whose equation is given by
Ŷᵢ = a + bXᵢ
Simple Linear Regression
♦ Regression is a method of estimating the
numerical relationship between variables
► The value of 'b' shows the slope of the regression line and
gives us a measure of the change in Y for a unit change in X:

b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = [n ΣXY − ΣX ΣY] / [n ΣX² − (ΣX)²]

► The intercept is then a = Ȳ − b X̄.
SLR-example 1
Heights of 10 fathers (X) together with their oldest sons (Y)
are given below (in inches). Find the regression of Y on X.
Using b = [n ΣXY − ΣX ΣY] / [n ΣX² − (ΣX)²], the slope works
out to b = 0.77, and with ΣY = 679 and ΣX = 676:

a = Ȳ − b X̄ = 679/10 − 0.77 × (676/10) = 67.9 − 52.05 = 15.85

so the fitted regression line is Ŷ = 15.85 + 0.77X.
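A minimal sketch of these least-squares formulas in Python (any paired data can be passed in):

```python
def ols_line(x, y):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(p * q for p, q in zip(x, y))
    sxx = sum(v * v for v in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx**2)
    a = sy / n - b * sx / n   # a = ybar - b*xbar
    return a, b
```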
Standard errors of regression
coefficients
• The standard errors of the regression coefficients are given by:

se(a) = S √(1/n + X̄² / Σ(x − x̄)²)   and   se(b) = S / √(Σ(x − x̄)²)

where

S² = [Σ(y − ȳ)² − b² Σ(x − x̄)²] / (n − 2)

S is the standard deviation of the points about the
line. It has (n − 2) degrees of freedom.
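These formulas translate directly into a short sketch (x and y are the paired samples, b the fitted slope):

```python
import math

def regression_se(x, y, b):
    """Return S, se(a) and se(b) from the formulas above (n - 2 df)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    syy = sum((v - ybar) ** 2 for v in y)
    s = math.sqrt((syy - b**2 * sxx) / (n - 2))   # SD of points about the line
    se_a = s * math.sqrt(1 / n + xbar**2 / sxx)
    se_b = s / math.sqrt(sxx)
    return s, se_a, se_b
```

A (1 − α)100% confidence interval for the slope is then b ± t₁₋α/₂,ₙ₋₂ × se(b), as in the example that follows.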
Example: (1 − α)100% CI for a
regression coefficient
• Consider the age and SBP data and the
fitted regression model
• SBP = 112.12 + 0.456(Age)
• S = 15.48, se(a) = 2.67, se(b) = 0.064
• A 95% confidence interval for the slope is:
estimated slope ± t₁₋α/₂ × (SE of slope)
0.456 ± 1.96 × 0.064 = (0.331, 0.581)
• The 95% CI does not include 0, so there is
sufficient evidence that age affects SBP
Significance test for β
H₀: β = 0
H₁: β ≠ 0
If the null hypothesis is true, then the
statistic

t = (observed slope − 0) / (SE of observed slope)

has a t distribution with n − 2 degrees of freedom.
ANOVA Table for linear regression

Source       df      SS    MS                 F
Regression   1       SSR   MSR = SSR/1        MSR/MSE
Residual     n − 2   SSE   MSE = SSE/(n − 2)
Total        n − 1   SST
Deviation
Simple linear regression
Explained, unexplained (error), and total variation

[Diagrams illustrating the model assumptions:]
• Assumption 1: linear relationship between Y and x
• Assumption 2: Y normally distributed at each value of x
• Assumption 3: same variance at each value of x
Diagnostic Tests for the Regression
Assumptions
• Linearity: regression curve fitting; no level shifts
• Independence of observations: runs test
• Normality of the residuals: Shapiro-Wilk or
Kolmogorov-Smirnov test
• Homogeneity of variance of the residuals: White's
general specification test
• No autocorrelation of residuals: Durbin-Watson test or
ACF of residuals
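A hedged sketch of running some of these tests with scipy/statsmodels; resid is the vector of residuals from a fitted model and exog its design matrix (constant column included):

```python
from scipy.stats import shapiro
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_white

def run_diagnostics(resid, exog):
    """Normality, autocorrelation and heteroscedasticity checks."""
    _, p_norm = shapiro(resid)                  # Shapiro-Wilk normality test
    dw = durbin_watson(resid)                   # near 2 => no autocorrelation
    _, p_white, _, _ = het_white(resid, exog)   # White's heteroscedasticity test
    return p_norm, dw, p_white
```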
Diagnostic Tests for the Regression
Assumptions
• Plot residuals and look for outliers and high-leverage points
– Lists of standardized residuals
– Lists of studentized residuals
– Cook's distance or leverage statistics
Testing Assumptions:
Assumption 1: linear relationship
[Scatter plot of blood pressure (80–200) against a predictor, used to check that the relationship is linear.]
Testing Assumptions:
Assumption 2: Normality
Residuals need to be normally distributed.

[Scatter plot of blood pressure against weight (40–140) with fitted line; R² (linear) = 0.166.]
Testing Assumptions:
Assumption 2: Normality
[Histogram of regression standardized residuals (N = 127) and normal P-P plot of observed vs. expected cumulative probabilities.]
Testing Assumptions:
Assumption 3: Spread of y values constant over
range of x values
Plot residuals against x values.

[Plot of unstandardized residuals against x; the spread should be roughly constant across the range of x.]
Multiple Linear Regression
Multivariate Analysis
▪ Multivariate analysis refers to the analysis of data that
takes into account a number of explanatory variables
and one outcome variable simultaneously
▪ The purpose of multiple regression is to analyze the
relationship between metric or dichotomous
independent variables and a metric dependent
variable
▪ If there is a relationship, using the information in the
independent variables will improve our accuracy in
predicting values for the dependent variable
▪ It allows for the efficient estimation of measures of
association while controlling for a number of
confounding factors
Multivariate Analysis
▪ A large number of multivariate models have been
developed for specialized purposes, each with a
particular set of assumptions underlying its
applicability
▪ The choice of the appropriate model is based on the
underlying design of the study, the nature of the
variables, as well as assumptions regarding the
inter-relationship between the exposures and
outcomes under investigation
Multiple linear regression
▪ Multiple linear regression (we often refer to
this method as multiple regression) is an
extension of the most fundamental model
describing the linear relationship between
two variables
▪ Multiple regression is a statistical technique
that is used to measure and describe the
function relating two (or more) predictor
(independent) variables to a single response
(dependent) variable
Regression equation for a linear relationship

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

where β₀ is the intercept and βᵢ is the regression coefficient of predictor Xᵢ.
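A short sketch of fitting such a model with statsmodels; the data below are simulated purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
age = rng.uniform(20, 60, 100)       # hypothetical ages (years)
weight = rng.normal(70, 12, 100)     # hypothetical weights (kg)
sbp = 90 + 0.5 * age + 0.3 * weight + rng.normal(0, 10, 100)

X = sm.add_constant(np.column_stack([weight, age]))  # design matrix + intercept
fit = sm.OLS(sbp, X).fit()
print(fit.summary())   # coefficients, t tests, R-squared, ANOVA quantities
```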
Analysis of residuals
▪ The residual standard deviation is a measure of the
average difference between the observed y values
and those predicted or fitted by the model
▪ The residuals are given by yobs - yfit , where yobs is
the observed value of the dependent variable
▪ We cannot plot the original multi-dimensional
data, but we can examine plots of the residuals to
see if the model is reasonable
▪ Specifically, we have to check whether the
assumptions of Normal distribution and uniform
variance are met
Assumptions of multiple regression
a) The relationship between the dependent variable
and each continuous explanatory variable is linear
Summary: Assumptions of MLR
1. Linear functional form
2. Fixed independent variables
3. Independent observations
4. Representative sample and proper specification
of the model (no omitted variables)
5. Normality of the residuals or errors
6. Equality of variance of the errors (homogeneity of
residual variance)
7. No multicollinearity
8. No autocorrelation of the errors
9. No outlier distortion
Predicted and Residual Scores
▪ The regression line expresses the best
prediction of the dependent variable (Y),
given the independent variables (X).
▪ However, nature is rarely perfectly
predictable, and usually there is substantial
variation of the observed points around the
fitted regression line
▪ The deviation of a particular point from the
regression line (its predicted value) is called
the residual value
Residual Variance and R-square
▪ The smaller the variability of the residual values
around the regression line relative to the overall
variability, the better is our prediction
▪ For example, if there is no relationship between the X and Y
variables, then the ratio of the residual variability of the Y
variable to the original variance is equal to 1.0
▪ If X and Y are perfectly related then there is no residual
variance and the ratio of residual variability to the total
variance would be 0.0
▪ In most cases, this ratio would fall somewhere
between these extremes, that is, between 0.0 and
1.0.
▪ One minus this ratio is referred to as R-square (R2)
or the coefficient of determination
Residual Variance and R-square
▪ The R-square value is an indicator of how well the
model fits the data
▪ An R-square close to 1.0 indicates that we have
accounted for almost all of the variability with the
variables specified in the model
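R² is straightforward to compute from observed and fitted values; a minimal sketch:

```python
def r_squared(y_obs, y_fit):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    ybar = sum(y_obs) / len(y_obs)
    ss_res = sum((o - f) ** 2 for o, f in zip(y_obs, y_fit))
    ss_tot = sum((o - ybar) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot
```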
Variance of adjusted regression coefficients

Var(β̂ᵢ) = S²₍y|x₎ / [(n − 1) S²₍xᵢ₎ (1 − R²ᵢ)]

▪ where S²₍y|x₎ is the residual variance of the outcome
and S²₍xᵢ₎ is the variance of xᵢ; R²ᵢ is the R²
from a multiple linear regression model in which xᵢ
is regressed on all other predictors
Variance of adjusted regression coefficients
▪ The term 1/(1 − R²ᵢ) is known as the
variance inflation factor (VIF)
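A sketch of computing VIFs with statsmodels (X is the design matrix, constant column included):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vifs(X):
    """VIF = 1 / (1 - R_i^2) for each column i of the design matrix X."""
    return [variance_inflation_factor(X, i) for i in range(X.shape[1])]
```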
Interpreting the Correlation Coefficient R
▪ Customarily, the degree to which two or more
predictors (independent or X variables) are related
to the dependent (Y) variable is expressed in the
correlation coefficient R, which is the square root of R-
square.
▪ In multiple regression, R assumes values between 0
and 1, because no meaning can be given to the
direction of correlation in the multivariate case.
▪ The larger R is, the more closely correlated the
predictor variables are with the outcome variable.
▪ When R=1, the variables are perfectly correlated in
the sense that the outcome variable is a linear
combination of the others.
Multiple linear regression of Systolic Blood
Pressure on Weight AND Age: SPSS output

ANOVA
Source       SS           df    Mean Square   F        Sig.
Regression   29467.088    2     14733.544     71.960   .000
Residual     99710.886    487   204.745
Total        129177.974   489
SPSS output: Coefficients table
• For each coefficient, t = βₖ / se(βₖ), with its p-value.
• R² = 22.8%: 22.8% of the variation in systolic BP is explained by
differences in weight and age.

[Scatter plot of systolic BP (mm Hg, 100–200) against age (years, 20–60), marked by sex (male/female).]
SPSS output: Systolic BP regressed on SEX and AGE

Variables Entered/Removed
Model 1 — variables entered: SEX, AGE (years); method: Enter.
All requested variables entered. Dependent variable: SYSTOLIC BP (mmHg).

ANOVA
Source       SS          df    Mean Square   F        Sig.
Regression   19728.668   2     9864.334      43.892   .000
Residual     109449.3    487   224.742
Total        129178.0    489
Predictors: (Constant), SEX, AGE (years). Dependent variable: SYSTOLIC BP (mmHg).
Systolic BP vs. Age in males and females

[Scatter plot of systolic BP (mm Hg) against age (years, 20–60) by sex; separate regression lines show predicted values, and the vertical distance between the male and female lines is β₂, the coefficient of sex.]
Find the ‘best’ model
(1) Forward Selection
STEP 1 Find variable which has strongest association with
dependent variable and enter it into model
(i.e. largest t-statistic and smallest P value)
STEP 2 Find next variable out of those remaining which
explains largest amount of remaining variability
(2) Backwards Regression
STEP 1 Start with model which includes all
explanatory variables
STEP 2 Remove variable with smallest contribution
to model
(largest p value – say greater than 10% level)
STEP 3 Fit new model. Remove next variable with
smallest contribution
REPEAT UNTIL: all variables in the model are
statistically significant, then STOP
NOTE: Once a variable has been removed from the
model it cannot be re-entered (a sketch of the
procedure appears below)
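A hedged sketch of this procedure with statsmodels; y is the outcome and X a pandas DataFrame of candidate predictors, with the 10% threshold mirroring the rule above:

```python
import statsmodels.api as sm

def backward_eliminate(y, X, threshold=0.10):
    """Repeatedly drop the predictor with the largest p-value."""
    X = sm.add_constant(X)
    while True:
        fit = sm.OLS(y, X).fit()
        pvals = fit.pvalues.drop("const")    # never drop the intercept
        if pvals.empty or pvals.max() <= threshold:
            return fit                       # all remaining terms significant
        X = X.drop(columns=pvals.idxmax())   # removed variables never re-enter
```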
Exercise
Multicollinearity
▪ This is a common problem in many correlation analyses.
Imagine that you have two predictors (X variables) of a
person's height:
(1) height in metres and (2) height in centimetres.