
Linear regression and correlation

Teresa Kisi (MPH in Epidemiology and Biostatistics, Assist. Prof.)


[email protected]
Course content

Topics (Facilitator: Teresa Kisi)

1. Regression and Correlation
2. One-way Analysis of Variance (ANOVA)
3. Analysis of categorical variables
4. Survival analysis
5. Non-parametric Methods

Teresa Kisi (MPH in Epidemiology and Biostatistics, Assist. Prof.)


4/7/2017 2
[email protected]
Course description

The purpose of the course is to emphasize the
analysis of categorical data, regression and
correlation, analysis of variance, non-parametric
tests, and survival analysis.

Learning Objectives:

 At the end of the course we will be able to:


– Describe the relationship between two or more
variables
– Understand the family of various regression
models
– Know about categorical data analysis
– Estimate the probability that an individual survives
for a given length of time.
– Explain the context and meaning of statistical
significance for distribution-free data

Evaluation

Evaluation criteria Percent


Assignments 40%
Final exam 60%

NB: Grading will be as per the grading scale of the university registrar

Introduction:
Example 1:
 The average age of staff at Arsi University is assumed
to be 30 years, with a variance of 20. To check this
assumption, a graduating MPH student wants to test
whether the claimed average age is true. He took a
random sample of 10 staff members and found an
average (mean) age of 27 years. Test whether the
average age of the staff is 30 years.

Example 1: Cont’d…

Solution: Test procedure:


1. Null hypothesis: Ho: µ = 30
2. Alternative hypothesis: HA: µ ≠ 30
How to run (Stata; the sample SD is √20 ≈ 4.47):
ttesti 10 27 4.47 30, level(99)
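The same one-sample t test can be reproduced from the summary statistics alone. A minimal Python sketch (scipy assumed available; the slides use Stata's ttesti):

```python
from math import sqrt

from scipy import stats

# Summary statistics from Example 1: n = 10, mean = 27, sd = sqrt(20) ~ 4.47
n, xbar, sd, mu0 = 10, 27, 4.47, 30

t = (xbar - mu0) / (sd / sqrt(n))        # one-sample t statistic
p = 2 * stats.t.sf(abs(t), df=n - 1)     # two-sided p-value, df = n - 1

print(round(t, 3), round(p, 4))          # t ~ -2.122, p > 0.05
```

Since p exceeds 0.05 (and certainly 0.01), Ho is not rejected at these conventional levels.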

Example 2:
 The National Institute of Mental Health published an
article stating that in any one-year period,
approximately 9.5 percent of American adults suffer
from depression or a depressive illness. Suppose that
in a survey of 100 people in a certain town, seven of
them suffered from depression or a depressive
illness. Conduct a hypothesis test to determine if the
true proportion of the people in that town suffering
from depression or depressive illness is different
from the percent in the general adult American
population.
Example 2: cont’d…

Solution: the hypotheses can be stated as:


–Ho: P = 0.095
–HA: P ≠ 0.095
How to run (Stata):
prtesti 100 0.07 0.095, level(99)
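The one-sample z test for a proportion behind prtesti can be computed directly. A Python sketch (scipy assumed; the same normal approximation):

```python
from math import sqrt

from scipy import stats

# Example 2: n = 100, observed proportion 7/100, hypothesized p0 = 0.095
n, phat, p0 = 100, 0.07, 0.095

z = (phat - p0) / sqrt(p0 * (1 - p0) / n)   # z statistic under Ho
p = 2 * stats.norm.sf(abs(z))               # two-sided p-value

print(round(z, 3), round(p, 3))             # z ~ -0.853, p ~ 0.394
```

With p ≈ 0.39, there is no evidence that the town's proportion differs from 9.5%.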

Example 3

If a random sample of 50 non-smokers has a mean
lifetime of 76 years with a standard deviation of 8
years, and a random sample of 65 smokers has a mean
lifetime of 68 years with a standard deviation of 9
years,
Test the hypothesis that there is no difference
between the mean lifetimes of non-smokers and
smokers at the 0.01 level of significance.

Example 3 cont’d…

Hypotheses:
HO: µ1= µ2 or Ho: µ1 - µ2= 0
HA: µ1 ≠ µ2 or HA: µ1 - µ2 ≠ 0
How to run (Stata):
ttesti 50 76 8 65 68 9, level(99)
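scipy can run the same pooled two-sample t test from summary statistics. A sketch (the slides use Stata's ttesti):

```python
from scipy import stats

# Example 3: non-smokers n=50, mean=76, sd=8; smokers n=65, mean=68, sd=9
res = stats.ttest_ind_from_stats(mean1=76, std1=8, nobs1=50,
                                 mean2=68, std2=9, nobs2=65,
                                 equal_var=True)   # pooled-variance test

print(round(res.statistic, 2), res.pvalue)         # t ~ 4.96, p < 0.001
```

Since p is far below 0.01, Ho is rejected: the mean lifetimes differ.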

Example 4:

A health officer is trying to study the malaria situation of


Ethiopia. From the records of seasonal blood survey (SBS)
results, he came to understand that the proportion of
people having malaria in Ethiopia was 3.8% in 1978 (Eth.
Cal). The size of the sample considered was 15000. He
also realized that during the year that followed (1979),
blood samples were taken from 10,000 randomly selected
persons. The result of the 1979 seasonal blood survey
showed that 200 persons were positive for malaria.

Example 4: Cont’d…

 Help the health officer in testing the hypothesis that


the malaria situation of 1979 did not show any
significant difference from that of 1978 (take the level
of significance α = 0.01).

Example 4: Cont’d…

 Hypotheses:
Ho: π1978 = π1979 or π1978-π1979 =0
HA: π1978 ≠ π1979 or π1978-π1979 ≠0
 How to run:
prtesti 15000 .038 10000 .02
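The pooled two-sample z test for proportions behind prtesti can be written out by hand. A Python sketch (scipy assumed):

```python
from math import sqrt

from scipy import stats

# Example 4: 1978 SBS n1=15000, p1=0.038; 1979 SBS n2=10000, p2=200/10000
n1, p1 = 15000, 0.038
n2, p2 = 10000, 0.020

pbar = (n1 * p1 + n2 * p2) / (n1 + n2)            # pooled proportion
se = sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n2))  # SE under Ho
z = (p1 - p2) / se
p = 2 * stats.norm.sf(abs(z))

print(round(z, 2), p)   # z ~ 8.07, p far below 0.01
```

The 1979 prevalence is significantly lower than that of 1978.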

Simple linear regression and correlation

 Data are frequently given in pairs where one variable


is dependent on the other.
E.g.
1. Weight and height
2. House rent and income
3. Yield and fertilizer
4. Systolic blood pressure (SBP) and sodium
chloride consumption (gram)
The linear regression model assumes that there is a linear, or
"straight line," relationship between the dependent variable and
each predictor.
linear regression and correlation cont’d..

 Linear regression is used to model the value of a


dependent scale variable based on its linear
relationship to one or more predictors.
 It is usually desirable to express their relationship by
finding an appropriate mathematical equation.
 To form the equation, collect the data on these two
variables (dependent and independent).

linear regression and correlation cont’d..
A) Simple linear regression
 The scatter diagram helps to choose the curve that
best fits the data. The simplest type of curve is a
straight line whose equation is given by:
Ŷ = α + b₀Xᵢ
This equation is a point estimate of:
Y = α + βXᵢ
– b₀ = the sample regression coefficient of Y on X.
– β = the population regression coefficient of Y on X.
 Y on X means Y is the dependent variable and X is
the independent one.
linear regression and correlation cont’d..

 The model is linear because increasing the value of
the predictor X by 1 unit increases the value of the
dependent variable by b₀ units. Note that α is the
intercept, the model-predicted value of the dependent
variable when the value of every predictor is equal to 0.

linear regression and correlation cont’d...
 Regression is a method of estimating the numerical
relationship between variables.
– For example, we would like to know what is the
mean or expected weight for factory workers of a
given height, and what increase in weight is
associated with a unit increase in height.
 The purpose of a regression equation is to use one
variable to predict another.

 How is the regression equation determined?

linear regression and correlation cont’d...
If we want to investigate the nature of this
relationship, we need to do three things:
– Make sure that the relationship is linear.
– Find a way to determine the equation linking them,
i.e. get the values of α and b₀
α = constant
b₀ = regression coefficient
– See if the relationship is statistically significant, i.e.
if it is present in the population.

linear regression and correlation cont’d...
 Is the relationship linear?
– One way of investigating the linearity of the
relationship is to examine the scatter plot, such as
that in Figure 1.
– The points in the scatter plot seem to cluster
along a straight line (shown dotted). This suggests
a linear relationship between BMI and HIP. So far,
so good
– We can write the equation of this straight line as:
BMI = α + b₀*HIP

linear regression and correlation cont’d...

Figure 1: A scatter plot of body mass index against hip circumference, for a sample of 412
women in a diet and health cohort study. The scatter of values appears to be distributed
around a straight line; that is, the relationship between these two variables appears to
be broadly linear. (An annotation in the figure marks the residual, or error term, e, for
one subject.)
linear regression and correlation cont’d...

Figure 2: scatter plot indicating the relationship between the heights of oldest sons
and their fathers' heights.
linear regression and correlation cont’d...

BMI = α + b₀*HIP

 This equation is known as the simple regression equation. (Why?) The
variable on the left-hand side, BMI, is known as the outcome, response
or dependent variable.

 The dependent variable must be metric (scale). The equation gives us the
mean value of BMI for any specified HIP measurement. In other words,
it would tell us (if we knew α and b₀) what the mean body mass
index would be for all those women with some particular HIP
measurement.
linear regression and correlation cont’d...

BMI = α + b₀*HIP

 The variable on the right-hand side of the equation, HIP, is known
as the predictor, explanatory or independent variable, or the covariate.

 The independent variable can be of any type: nominal, ordinal or
metric. This is the variable that's doing the 'causing'. It is changes in
hip circumference that cause body mass index to change in response,
but not the other way round.
linear regression and correlation cont’d...
 If the independent variable is categorical, it needs to
be recoded to binary (dummy) variables or other
types of contrast variables.
 Basically we have four ways of recoding a categorical
variable for linear regression:
–Dummy coding (the common and mostly
used),
–Effects coding,
–Orthogonal coding, and
–Criterion coding (also known as criterion
scaling).

Recoding of categorical variables to binary
(dummy) variables
 Dummy coding: is used when a researcher wants to
compare other groups of the predictor variable with
one specific group of the predictor variable.
 Often, this specific group is called the reference
group or category.
 It is important to note that dummy coding can be
used with two or more categories.

Recoding cont’d…

 Dummy coding in regression is analogous to simple


independent t-testing or one-way Analysis of
Variance (ANOVA) procedures in that dummy coding
allows the researcher to explore mean differences
by comparing the groups of the categorical variable.

 In order to do this with regression, we must separate


our predictor variable groups in a manner that allows
them to be entered into the regression.

Recoding cont’d…

 For the demonstration of dummy coding, a fictional


data set was created consisting of one continuous
outcome variable (DV_Score), one categorical
predictor variable (IV_Group), and 15 cases (see
Table 1).
 The predictor variable contains three groups;
experimental/treatment 1 (value label 1),
experimental/treatment 2 (value label 2), and
control (value label 3).

Table 1: Initial data
Case  DV_score  IV_group
1      1        1
2      3        1
3      5        1
4      7        1
5      9        1
6      8        2
7     10        2
8     12        2
9     14        2
10    16        2
11    22        3
12    24        3
13    26        3
14    28        3
15    30        3
Table 2: Dummy coding example data
Case  DV_score  IV_group  Dummy 1  Dummy 2
1      1        1         1        0
2      3        1         1        0
3      5        1         1        0
4      7        1         1        0
5      9        1         1        0
6      8        2         0        1
7     10        2         0        1
8     12        2         0        1
9     14        2         0        1
10    16        2         0        1
11    22        3         0        0
12    24        3         0        0
13    26        3         0        0
14    28        3         0        0
15    30        3         0        0
Table 2:Dummy coding cont’d…

 To accomplish this, we would create two new


‘dummy’ variables in our data set, labeled dummy 1
and dummy 2 (see Table 2).

 To represent membership in a group on each of the


dummy variables, each case would be coded as 1 if it
is a member and all other cases coded as 0.

Table 2:Dummy coding cont’d…

 When creating dummy variables, it is only necessary


to create k – 1 dummy variables where k indicates
the number of categories of the predictor variable.
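The k − 1 rule is easy to apply in code. A pandas sketch of the dummy variables in Table 2 (pandas assumed; the column names dummy1/dummy2 mirror the table):

```python
import pandas as pd

# IV_group as in Table 1: values 1 and 2 are treatments, 3 is the control
df = pd.DataFrame({"IV_group": [1] * 5 + [2] * 5 + [3] * 5})

# k = 3 categories -> k - 1 = 2 dummies; the reference category
# (control, 3) is identified by zeros on both dummies
df["dummy1"] = (df["IV_group"] == 1).astype(int)
df["dummy2"] = (df["IV_group"] == 2).astype(int)

print(df.drop_duplicates().to_string(index=False))
```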

Table 2:Dummy coding cont’d…

Choosing a reference category:


 The control group represents a lack of treatment
and therefore is easily identifiable as the reference
category.
 The reference category should have some clear
distinction. However, much research is done without
a control group. In those instances, identification of
the reference category is generally arbitrary, but
Garson (2006) suggested some guidelines for
choosing the reference category:

Table 2:Dummy coding cont’d…

 First, using categories such as miscellaneous or


other is not recommended because of the lack of
specificity in those types of categorizations (Garson).
 Second, the reference category should not be a
category with few cases, for obvious reasons related
to sample size and error (Garson).
 Third, some researchers choose to use a middle
category because they believe it represents the best
choice for comparison, rather than comparisons
against the extremes.

Table 2:Dummy coding cont’d…
 In the analysis, the predictor variable would not be
entered into the regression and instead the dummy
variables would take its place.

The results indicate a significant model, F(2, 12) =
57.17, p < 0.001 (Table 3).

Table 3: ANOVA(c)
Model           Sum of Squares   df   Mean Square    F        Sig.
1  Regression       653.333       1      653.333    13.923    .003(a)
   Residual         610.000      13       46.923
   Total           1263.333      14
2  Regression      1143.333       2      571.667    57.167    .000(b)
   Residual         120.000      12       10.000
   Total           1263.333      14
a. Predictors: (Constant), dummy 1
b. Predictors: (Constant), dummy 1, dummy 2
c. Dependent Variable: DV_score
Table 2:Dummy coding cont’d…
 Table 4 provides R, R², and adjusted R²; the regression
model was able to account for about 91% of the variance.

Table 4: Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .719(a)  .517       .480                6.850
2       .951(b)  .905       .889                3.162
a. Predictors: (Constant), dummy 1
b. Predictors: (Constant), dummy 1, dummy 2
Table 2:Dummy coding cont’d…
 Table 5 provides the unstandardized regression
coefficients (B), intercept (constant), and standardized
regression coefficients (β), which we can use for the
development of the model.

Table 5: Coefficients(a)
                Unstandardized         Standardized
Model           B        Std. Error    Beta       t        Sig.   95% CI for B (Lower, Upper)
1  (Constant)    19.000   2.166                    8.771   .000   (14.320, 23.680)
   dummy 1      -14.000   3.752        -.719      -3.731   .003   (-22.106, -5.894)
2  (Constant)    26.000   1.414                   18.385   .000   (22.919, 29.081)
   dummy 1      -21.000   2.000       -1.079     -10.500   .000   (-25.358, -16.642)
   dummy 2      -14.000   2.000        -.719      -7.000   .000   (-18.358, -9.642)
a. Dependent Variable: DV_score
Table 2:Dummy coding cont’d…

 Now, because dummy variables were used to


compare experimental 1 (M = 5.00, SD = 3.16) and
experimental 2 (M = 12.00, SD = 3.16) to the control
(M = 26.00, SD = 3.16), the intercept term is equal to
the mean of the reference category (i.e. the control
group).
Descriptives
DV_group       Statistic         Value   Std. Error
treatment 1    Mean               5.00   1.414
               Std. Deviation     3.162
treatment 2    Mean              12.00   1.414
               Std. Deviation     3.162
control        Mean              26.00   1.414
               Std. Deviation     3.162
Table 2:Dummy coding cont’d…

 Each regression coefficient represents the amount of


deviation of the group identified in the dummy
variable from the mean of the reference category .
 So, some simple mathematics allows us to see that
the regression coefficient for dummy 1 (representing
experimental 1) is 5 – 26 = -21. Also, the regression
coefficient for dummy 2 (representing experimental
2) is 12 – 26 = -14.
 All of this results in the regression equation:
Ŷ= 26.00 + (-21 * dummy 1) + (-14 * dummy 2)
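Fitting OLS to the Table 2 data reproduces these coefficients exactly. A numpy sketch:

```python
import numpy as np

# DV_score and the two dummies from Table 2
y = np.array([1, 3, 5, 7, 9, 8, 10, 12, 14, 16, 22, 24, 26, 28, 30], float)
d1 = np.array([1] * 5 + [0] * 10, float)           # dummy 1 (experimental 1)
d2 = np.array([0] * 5 + [1] * 5 + [0] * 5, float)  # dummy 2 (experimental 2)

X = np.column_stack([np.ones_like(y), d1, d2])     # intercept + dummies
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(coef, 2))   # intercept 26, coefficients -21 and -14
```

The intercept is the control-group mean (26) and each dummy coefficient is that group's deviation from it.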
Assignment

Explain the following coding procedures for
categorical variables to be used in linear regression
(when and how to use them):

–Effects coding,
–Orthogonal coding, and
–Criterion coding (also known as criterion
scaling).

linear regression and correlation cont’d..

 The Method of least square


– The difference between the given score Y and the
predicted score Ŷ is known as the error of
estimation. The regression line, or the line which
best fits the given pairs of scores, is the line for
which the sum of the squares of these errors of
estimation (Σeᵢ²) is minimized. That is, of all possible
lines, the one with minimum Σeᵢ² is the least-squares
regression line that best fits the given data.

linear regression and correlation cont’d..

 Estimating α and bo– the method of ordinary least


squares (OLS)
– The second problem is to find a method of getting
the values of the sample coefficients α and bo,
which will give us a line that fits the scatter of
points better than any other line, and which will
then enable us to write down the equation linking
the variables.

linear regression and correlation cont’d...
 The most popular method used for this calculation is
called ordinary least squares, or OLS. This gives us the
values of α and bo, and the straight line that best fits
the sample data.
 Roughly speaking, ‘best’ means the line that is, on
average, closer to all of the points than any other line.
Look at Figure 1.

 e has been shown just for one of the points. If all of


these residuals are squared and then added together,
to give the term ∑e2, then the ‘best’ straight line is the
one for which the sum, ∑e2, is smallest. Hence the
name ordinary ‘least squares’.
linear regression and correlation cont’d...

 The least square regression line for the set of


observations (X1 ,Y1), (X2 ,Y2), (X3 ,Y3) . . . (Xn ,Yn) has
the equation:
Ŷ = α + b₀Xᵢ.
 The values ‘α’ and ‘bo’ in the equation are constants,
i.e., their values are fixed. The constant ‘α’ indicates
the value of y when x = 0. It is also called the y
intercept. The value of ‘bo’ shows the slope of the
regression line and gives us a measure of the change
in y for a unit change in x.

linear regression and correlation cont’d..
 This slope (bo) is frequently termed as the regression
coefficient of Y on X. If we know the values of ‘α’
and ‘bo’, we can easily compute the value of Ŷ for any
given value of X.
 The constants ‘α’ and ‘bo’ are determined by solving
simultaneously the equations (normal equations):
ΣY = nα + b₀ΣX
ΣXY = αΣX + b₀ΣX²

α = Ȳ − b₀X̄
linear regression and correlation cont’d...

b₀ = [nΣXY − (ΣX)(ΣY)] / [nΣX² − (ΣX)²] = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²

Or simply:

b₀ = r × (s_y / s_x)

r → linear correlation coefficient
s_y → standard deviation of the outcome variable
s_x → standard deviation of the X variable
linear regression and correlation cont’d...
Example 1: Heights of 10 fathers(X) together with their
oldest sons (Y) are given below (in inches). Find the
regression of Y on X.

Father (X)   Oldest son (Y)   Product (XY)   X²
63           65               4095           3969
64           67               4288           4096
70           69               4830           4900
72           70               5040           5184
65           64               4160           4225
67           68               4556           4489
68           71               4828           4624
66           63               4158           4356
70           70               4900           4900
71           72               5112           5041
Total: 676   679              45967          45784
linear regression and correlation cont’d...

b₀ = [nΣXY − (ΣX)(ΣY)] / [nΣX² − (ΣX)²] = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²

α = Ȳ − b₀X̄

b₀ = [10(45967) − (676)(679)] / [10(45784) − (676)²]
   = (459670 − 459004) / (457840 − 456976)

b₀ = 666 / 864 = 0.77
linear regression and correlation cont’d...
α = 679/10 − 0.77 × (676/10) = 67.9 − 52.05 = 15.85

Therefore, Ŷ = 15.85 + 0.77 X or


Height of oldest son = 15.85 + 0.77*height of
father
The regression coefficient of Y on X (i.e., 0.77) tells us the
change in Y due to a unit change in X.

e.g. Estimate the height of the oldest son for a father’s


height of 70 inches.
Height of oldest son (Ŷ) = 15.85 + 0.77 (70) = 69.75 inches.
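The whole calculation can be checked in a few lines of Python (numpy assumed). Note that the slides round b₀ to 0.77 before computing α, giving 15.85; at full precision α ≈ 15.79, but the prediction for a 70-inch father is the same 69.75 inches:

```python
import numpy as np

X = np.array([63, 64, 70, 72, 65, 67, 68, 66, 70, 71], float)  # fathers
Y = np.array([65, 67, 69, 70, 64, 68, 71, 63, 70, 72], float)  # oldest sons

b = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
a = Y.mean() - b * X.mean()                  # alpha = Ybar - b * Xbar

print(round(b, 4), round(a, 2))              # b ~ 0.7708, a ~ 15.79
print(round(a + b * 70, 2))                  # predicted height: 69.75
```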
linear regression and correlation cont’d...
Explained, unexplained (error), total variations
 If all the points on the scatter diagram fall on the
regression line we could say that the entire variance
of Y is due to variations in X.
– Explained variation = Σ(Ŷ − Ȳ)²

linear regression and correlation cont’d...
 The measure of the scatter of points away from the
regression line gives an idea of the variance in Y that
is not explained with the help of the regression
equation.
– Unexplained variation = Σ(Y - Ŷ)²
 The variation of the Y’s about their mean can also be
computed.
– Total variation = Σ(Y − Ȳ)²

linear regression and correlation cont’d..
 Total variation = Explained variation + unexplained
variation
 The ratio of the explained variation to the total
variation measures how well the linear regression
line fits the given pairs of scores. It is called the
coefficient of determination, and is denoted by r².

r² = explained variation / total variation
linear regression and correlation cont’d..
 The explained variation is never negative and is never
larger than the total variation. Therefore, r² is always
between 0 and 1. If the explained variation equals 0,
r² = 0.
 If r² is known, then r = ±√r²

– The sign of r is the same as the sign of bo from the


regression equation.
 Since r² is between 0 and 1, r is between -1 and +1.
– Thus, r is known as Karl Pearson’s Coefficient of
Linear correlation
linear regression and correlation cont’d..
 Linear Correlation (Karl Pearson’s Coefficient of
Linear correlation) (r):-
– measures the degree of linear correlation
between two variables (e.g. X and Y).
– This correlation coefficient is given in pure
number, independent of the units in which the
variables are expressed.
– It also tells us the direction of the slope of a
regression line (positive or negative).

linear regression and correlation cont’d..

 Population correlation coefficient: ρ
 Sample correlation coefficient: r
 r is positive if higher values of one variable are
associated with higher values of the other variable
and negative if one variable tends to be lower as the
other gets higher
 Correlation of around zero indicates that there is no
linear relationship between the values of the two
variables.

linear regression and correlation cont’d..

In essence r is a measure of the scatter of the


points around an underlying linear trend: the
greater the spread of the points the lower
the correlation

linear regression and correlation cont’d..

This line shows a perfect linear relationship between two variables. It is a perfect
positive correlation (r = 1).
linear regression and correlation cont’d..

A perfect linear relationship, but a negative correlation (r = -1).
linear regression and correlation cont’d..

A weak positive correlation (r might be around 0.40)


linear regression and correlation cont’d...

No linear association between variables (r ~ 0)


linear regression and correlation cont’d...

Strength of relationship
– Correlations from 0 to 0.25 (or 0 to –0.25) indicate
little or no relationship;
– those from 0.25 to 0.50 (or –0.25 to –0.50)
indicate a fair degree of relationship;
– those from 0.50 to 0.75 (or –0.50 to –0.75) a
moderate to good relationship; and
– those greater than 0.75 (or –0.75 to –1.00)
indicate a very good to excellent relationship.

linear regression and correlation cont’d...

 The absolute value of the correlation coefficient


indicates the strength, with larger absolute values
indicating stronger relationships.

linear regression and correlation cont’d...

Significance Test for Pearson Correlation


– H0: ρ = 0
– HA : ρ ≠ 0

tcal = r√(n − 2) / √(1 − r²)

with n − 2 degrees of freedom.
linear regression and correlation cont’d..
 Its formula is:
r = [nΣXY − (ΣX)(ΣY)] / [√(nΣX² − (ΣX)²) × √(nΣY² − (ΣY)²)]

 Properties
– -1 ≤ r ≤1
– r is a pure number without unit
– If r is close to 1 ⇒ a strong positive relationship
– If r is close to -1 ⇒ a strong negative relationship
– If r = 0 → no linear correlation
linear regression and correlation cont’d..

 Determine the value of ‘r’ for the scores in the


above example 1.
r = 0.7776 ≈ 0.78
For Pearson’s correlation coefficient to be
appropriately used, both variables must be metric
continuous and also approximately Normally
distributed.
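Both r and its significance test can be verified for the father/son data. A Python sketch (scipy assumed):

```python
import numpy as np
from scipy import stats

X = np.array([63, 64, 70, 72, 65, 67, 68, 66, 70, 71], float)  # fathers
Y = np.array([65, 67, 69, 70, 64, 68, 71, 63, 70, 72], float)  # oldest sons

r, p = stats.pearsonr(X, Y)                        # r and two-sided p-value
t = r * np.sqrt(len(X) - 2) / np.sqrt(1 - r ** 2)  # same test by hand

print(round(r, 4), round(t, 2), round(p, 4))       # r ~ 0.7776, t ~ 3.5
```

With p below 0.01, the correlation is statistically significant.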

linear regression and correlation cont’d..

Assumptions in correlation
– The assumptions needed to make inferences
about the correlation coefficient are that the
sample was randomly selected and that the two
variables, X and Y, vary together in a joint
distribution that is normally distributed (called
the bivariate normal distribution).

Confidence Interval for Regression Coefficients

CI = b₀ ± t_crit × SE_b

where:
t_crit = t(α, df = n − k), k → number of variables

SE_b = √[RSS / (n − 2)] / √[Σᵢ(xᵢ − x̄)²]

tcal = b₀ / SE_b
Multiple linear regression
– Multivariate analysis refers to the analysis of data
that takes into account a number of explanatory
variables and one outcome variable
simultaneously.
– It allows for the efficient estimation of measures
of association while controlling for a number of
confounding factors.
– All types of multivariate analyses involve the
construction of a mathematical model to
describe the association between independent
and dependent variables.

Multiple linear regression cont’d…
 Multiple linear regression (we often refer to this
method as multiple regression) is an extension of
the most fundamental model describing the linear
relationship between two variables.

 Multiple regression is a statistical technique that is


used to measure and describe the function relating
two (or more) predictor (independent) variables to
a single response (dependent) variable.

Multiple linear regression cont’d…
 Regression equation for a linear relationship:
A linear relationship of n predictor variables,
denoted as:
X1, X2, . . ., Xn
to a single response variable, denoted (Y) is described
by the linear equation involving several variables.
The general linear equation (model) is:
Y = α + b1X1 + b2X2 + . . . + bnXn

Multiple linear regression cont’d…

• Where:
– The regression coefficients (b1 . . . bn) represent
the independent contributions of each explanatory
variable to the prediction of the dependent
variable.

– X1 . . . Xn represent the individual’s particular set of


values for the independent variables.
– n shows the number of independent predictor
variables.
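The general model can be fitted the same way as the simple case. A minimal numpy sketch, using hypothetical illustrative data (not from the slides) constructed so that Y = 2 + 3X1 + 0.5X2 exactly; OLS then recovers the coefficients:

```python
import numpy as np

# Hypothetical illustration: Y is built to satisfy Y = 2 + 3*X1 + 0.5*X2
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([10.0, 8.0, 12.0, 9.0, 11.0, 7.0])
Y = 2 + 3 * X1 + 0.5 * X2

X = np.column_stack([np.ones_like(Y), X1, X2])  # intercept, X1, X2
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.round(coef, 2))   # alpha = 2, b1 = 3, b2 = 0.5
```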

Multiple linear regression cont’d…
 Assumptions
1. First of all, as it is evident in the name multiple
linear regression, it is assumed that the relationship
between the dependent variable and each
continuous explanatory variable is linear. We can
examine this assumption for any variable, by
plotting (i.e., by using bivariate scatter plots) the
residuals (the difference between observed values
of the dependent variable and those predicted by
the regression equation) against the predicted
value.

Multiple linear regression cont’d…

 Any curvature in the pattern indicates that a non-linear relationship is more appropriate; if so, a transformation of the explanatory variable or an analogous non-parametric method may be considered.

Multiple linear regression cont’d…

2. Normality: It is assumed in multiple regression that the residuals (ε) follow a normal distribution with mean 0:
ε ∼ N(0, σ²)
3. Variability: The variation of the actual values of the response variable around the regression line remains the same regardless of the value of the explanatory variable (the constant variance assumption, also known as the homoscedasticity assumption).
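These residual-based checks can be sketched computationally. The snippet below (made-up data, a simple stand-in for the residual plots the course produces in SPSS) fits a line, computes residuals, and compares the residual spread in the lower and upper halves of the fitted values as a crude homoscedasticity check:

```python
# Sketch of residual diagnostics on simulated constant-variance data.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 3 + 2 * x + rng.normal(0, 1.5, n)   # noise sd does not depend on x

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
resid = y - fitted

# OLS residuals always average to (numerically) zero with an intercept.
print(abs(resid.mean()) < 1e-9)  # True

# Crude check: residual spread in the lower vs upper half of the fitted
# values should be similar when the constant-variance assumption holds.
lo = resid[fitted < np.median(fitted)].std()
hi = resid[fitted >= np.median(fitted)].std()
print(0.5 < lo / hi < 2.0)
```

In practice one would plot `resid` against `fitted` (e.g. with matplotlib) and look for curvature or a fanning pattern, as described above.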

Multiple linear regression cont’d…

 The residual plot in the following figure shows the residuals versus the estimated (predicted) values of the response variable.

Multiple linear regression cont’d…

[Residual plot] In this case, the constant variance assumption is not violated: the residuals are scattered randomly around the horizontal dashed line at zero, without any detectable pattern.
Multiple linear regression cont’d…

[Residual plot] In this case, the constant variance assumption is violated: the variability of the residuals around the horizontal line changes from one region to another. Residuals become more dispersed around the horizontal line as we move from small to large predicted/fitted values.

Multiple linear regression cont’d…

4. Independence: Another important assumption is that the observations are independent, which is a reasonable assumption if we use simple random sampling to select individuals that are not related to each other and if we do not obtain multiple observations from the same individual.

Multiple linear regression cont’d…
 Predicted and Residual Scores
– The regression line expresses the best prediction
of the dependent variable (Y), given the
independent variables (X).
– However, nature is rarely (if ever) perfectly
predictable, and usually there is substantial
variation of the observed points around the fitted
regression line.
– The deviation of a particular point from the
regression line (its predicted value) is called the
residual value.

Multiple linear regression cont’d…
 Residual Variance and R-square
– The smaller the variability of the residual values
around the regression line relative to the overall
variability, the better is our prediction.
– For example, if there is no relationship between
the X and Y variables, then the ratio of the
residual variability of the Y variable to the
original variance is equal to 1.0.

Multiple linear regression cont’d…

– If X and Y are perfectly related, then there is no residual variance and the ratio would be 0.
– In most cases, the ratio would fall somewhere
between these extremes, that is, between 0 and
1.
– One minus this ratio is referred to as R-square or
the coefficient of determination.

Multiple linear regression cont’d…

– This value is immediately interpretable in the following manner: if we have an R-square of 0.6, then we know that the variability of the Y values around the regression line is 1 − 0.6 = 0.4 times the original variance.
– In other words, we have explained 60% of the
original variability, and are left with 40% residual
variability.
– Ideally, we would like to explain most if not all of the
original variability.

Multiple linear regression cont’d…

– The R-square value is an indicator of how well the model fits the data.
– An R-square close to 1.0 indicates that we have
accounted for almost all of the variability with the
variables specified in the model.

Multiple linear regression cont’d…
N.B. A) The sources of variation in regressions are:
i) Due to regression
ii) Residual (about regression)
B) The sum of squares due to regression (SSR)
over the total sum of squares (TSS) is the
proportion of the variability accounted for by the
regression model.
Therefore, the percentage variability accounted for
or explained by the regression is 100 times this
proportion.
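The decomposition sketched above (TSS = SSR + SSE, with R² = SSR/TSS) can be verified numerically. A small illustration with invented data, not the course dataset:

```python
# Verify R-squared = SSR / TSS on a toy simple-regression example.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])  # roughly y = 2x

X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef

tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((fitted - y.mean()) ** 2)  # sum of squares due to regression
sse = np.sum((y - fitted) ** 2)         # residual sum of squares

r_squared = ssr / tss
print(round(r_squared, 4))
# For least squares with an intercept, TSS = SSR + SSE holds exactly:
print(np.isclose(tss, ssr + sse))  # True
```

Multiplying `r_squared` by 100 gives the percentage of variability explained by the regression, as stated above.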

Multiple linear regression cont’d…
 Interpreting the multiple Correlation Coefficient (R)
– Customarily, the degree to which two or more
predictors (independent or X variables) are
related to the dependent (Y) variable is expressed
in the multiple correlation coefficient R, which is
the square root of R-square.
– The multiple correlation coefficient R assumes values between 0 and 1 only; no sign (direction) can be attached to the correlation in the multivariate case. (Why?)

Multiple linear regression cont’d…

– The larger R is, the more closely correlated the predictor variables are with the outcome variable.
– When R=1, the variables are perfectly correlated
in the sense that the outcome variable is a linear
combination of the others.
– When the outcome variable is not linearly
related to any of the predictor variables, R will be
very small, but not zero.

Goodness of Fit

Here, we focus on R-squared, which measures how well the regression model fits the observed data.
R2 is a measure of goodness of fit; that is, how well
our model represents the observed data and explains
the variation in the response variable.
The value of R2 is between 0 and 1, and the better
the model fits the data, the higher its R2 is.

Multiple linear regression cont’d…
 Multicollinearity
– This is a common problem in many multivariate
correlation analyses.
– Imagine that you have two predictors (X variables)
of a person's height:
1. weight in pounds and
2. weight in ounces.
Trying to decide which one of the two measures is a
better predictor of height would be rather silly.

Multiple linear regression cont’d…

Collinearity (or multicollinearity) is the undesirable situation where the correlations among the independent variables are strong.

Causes of multicollinearity

 Including a variable that is computed from other variables in the equation (e.g. family income = husband's income + wife's income, and the regression includes all 3 income measures)
 Including the same or almost the same variable twice
(height in feet and height in inches; or, more
commonly, two different operationalizations of the
same identical concept)

Consequences of multicollinearity

The greater the multicollinearity, the greater the standard errors. When high multicollinearity is present, confidence intervals for coefficients tend to be very wide and t-statistics tend to be very small.
Coefficients will have to be larger in order to be
statistically significant, i.e. it will be harder to reject
the null when multicollinearity is present.

Consequences of multicollinearity cont’d …

When two IVs are highly and positively correlated, their slope coefficient estimators will tend to be highly and negatively correlated. When, for example, b1 is greater than β1, b2 will tend to be less than β2. In other words, if you overestimate the effect of one parameter, you will tend to underestimate the effect of the other. Hence, coefficient estimates tend to be very shaky from one sample to the next.

Multiple linear regression cont’d…

Detecting high multicollinearity:
 Eigenvalues provide an indication of how many distinct dimensions there are among the independent variables.
 When several eigenvalues are close to zero, the
variables are highly intercorrelated and small
changes in the data values may lead to large changes
in the estimates of the coefficients (see the following
table)
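The eigenvalue idea can be illustrated with a toy example. Note an assumption: SPSS computes its diagnostics from the scaled cross-product matrix, while this sketch uses the correlation matrix of the predictors, so the numbers differ from SPSS output, but the diagnosis (a near-zero eigenvalue implies a huge condition index) is the same:

```python
# Toy collinearity diagnostic: two nearly identical predictors.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # almost a copy of x1

# Simplified diagnostics on the predictors' correlation matrix
# (SPSS uses the scaled cross-product matrix instead).
R = np.corrcoef(np.column_stack([x1, x2]), rowvar=False)
eigvals = np.linalg.eigvalsh(R)           # one eigenvalue is near zero
condition_index = np.sqrt(eigvals.max() / eigvals.min())
print(condition_index > 30)  # True: serious collinearity
```

Here one eigenvalue is close to 2 and the other close to 0, so the condition index far exceeds the "serious problem" cutoff of 30 quoted above.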
Multiple linear regression cont’d…
A condition index greater than 15 indicates a possible problem and an index greater than 30 suggests a serious problem with collinearity.

[SPSS Collinearity Diagnostics table; dependent variable: birth weight of the child (kgs)(X1). For each of four nested models, the table lists the eigenvalues, condition indices, and variance proportions for the constant, height of mother (cms)(X2), monthly family income (Birr)(X5), period of gestation (days)(X6), and age of mother (years)(X3). In the full model the two smallest dimensions have eigenvalues near zero, with condition indices of 132.9 and 260.2, far above 30, signalling serious collinearity.]
Multiple linear regression cont’d…

– When there are very many variables involved, it is often not immediately apparent that this problem exists, and it may only manifest itself after several variables have already been entered into the regression equation.
– Nevertheless, when this problem occurs it means
that at least one of the predictor variables is
(practically) completely redundant with other
predictors.

• How can we control multicollinearity?

Multiple linear regression cont’d…

 The Partial Correlations:
– The Partial Correlations procedure computes partial correlation coefficients that describe the linear relationship between two variables while controlling for the effects of one or more additional variables.
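A partial correlation can be computed by regressing each of the two variables on the control variable and correlating the residuals. The sketch below uses hypothetical simulated data in the spirit of the funding example that follows (a common driver z induces a spurious x-y correlation):

```python
# Partial correlation via residuals: corr(x, y | z).
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z,
    computed by correlating the residuals of x-on-z and y-on-z."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(2)
z = rng.normal(size=500)              # common driver (e.g. visit rate)
x = 2 * z + rng.normal(size=500)      # e.g. funding
y = 3 * z + rng.normal(size=500)      # e.g. disease rate

print(abs(np.corrcoef(x, y)[0, 1]) > 0.5)   # marginally correlated: True
print(abs(partial_corr(x, y, z)) < 0.15)    # near zero given z: True
```

The zero-order correlation is large, yet the partial correlation controlling for z is near zero, exactly the pattern the SPSS output below displays.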

Multiple linear regression cont’d…
 Example:
 A popular radio talk show host has just received the
latest government study on public health care funding
and has uncovered a startling fact: As health care
funding increases, disease rates also increase! Cities that
spend more actually seem to be worse off than cities
that spend less!
 The data in the government report yield a high, positive
correlation between health care funding and disease
rates -- which seems to indicate that people would be
much healthier if the government simply stopped putting
money into health care programs.
Multiple linear regression cont’d…

 But is this really true? It certainly isn't likely that there's a causal relationship between health care funding and disease rates. Assuming the numbers are correct, are there other factors that might create the appearance of a relationship where none actually exists? (Health funding Data)
– To obtain partial correlations, from the menus choose:
Analyze > Correlate > Partial
Multiple linear regression cont’d…

Correlations (cells contain zero-order Pearson correlations)

Control variable: none (df = 48)
– Health care funding (amount per 100) × Reported diseases (rate per 10,000): r = .737, p < .001
– Health care funding (amount per 100) × Visits to health care providers (rate per 10,000): r = .964, p < .001
– Reported diseases (rate per 10,000) × Visits to health care providers (rate per 10,000): r = .762, p < .001

Control variable: visits to health care providers (df = 47)
– Health care funding (amount per 100) × Reported diseases (rate per 10,000): r = .013, p = .928
Multiple linear regression cont’d…

 In this example, the Partial Correlations table shows both the zero-order correlations (correlations without any control variables) of all three variables and the partial correlation of the first two variables controlling for the effects of the third variable.

Multiple linear regression cont’d…

The zero-order correlation between health care funding and disease rates is, indeed, both fairly high (0.737) and statistically significant (p < 0.001).
Multiple linear regression cont’d…
The partial correlation controlling for the rate of visits to health care providers, however, is negligible (0.013) and not statistically significant (p = 0.928).
Multiple linear regression cont’d…

 One interpretation of this finding is that the observed positive "relationship" between health care funding and disease rates is due to underlying relationships between each of those variables and the rate of visits to health care providers:
 Disease rates only appear to increase as health care funding increases because more people have access to health care providers when funding increases, and doctors and hospitals consequently report more occurrences of diseases since more sick people come to see them.

Multiple linear regression cont’d…

Going back to the zero-order correlations, you can see that both health care funding rates and reported disease rates are highly positively correlated with the control variable, rate of visits to health care providers.
Multiple linear regression cont’d…

Removing the effects of this variable reduces the correlation between the other two variables to almost zero. It's even possible that controlling for the effects of some other relevant variables might actually reveal an underlying negative relationship between health care funding and disease rates.
Multiple linear regression cont’d…

 The Partial Correlations procedure is only appropriate for scale variables.
 If you have categorical (nominal or ordinal) data, use
the Crosstabs procedure. Layer variables in
Crosstabs are similar to control variables in Partial
Correlations.

Multiple linear regression cont’d…
 Linear Regression Variable Selection Methods
Method selection allows you to specify how
independent variables are entered into the analysis.
Using different methods, you can construct a variety
of regression models from the same set of variables.
– Enter (Regression): A procedure for variable
selection in which all variables in a block are
entered in a single step.

Linear Regression Variable Selection Methods
cont’d…
– Stepwise: At each step, the independent variable
not in the equation which has the smallest
probability of F is entered, if that probability is
sufficiently small. Variables already in the
regression equation are removed if their
probability of F becomes sufficiently large. The
method terminates when no more variables are
eligible for inclusion or removal.
– Remove: A procedure for variable selection in
which all variables in a block are removed in a
single step.

Linear Regression Variable Selection Methods cont’d…

– Backward Elimination: A variable selection procedure in which all variables are entered into the equation and then sequentially removed. The variable with the smallest partial correlation with the dependent variable is considered first for removal. If it meets the criterion for elimination, it is removed. After the first variable is removed, the variable remaining in the equation with the smallest partial correlation is considered next. The procedure stops when there are no variables in the equation that satisfy the removal criteria.
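The backward-elimination loop can be sketched in a few lines. This toy version swaps the probability-of-F criterion SPSS uses for a simpler one, the drop in R-squared when a predictor is removed, purely for illustration; the data and threshold are invented:

```python
# Toy backward elimination using drop in R-squared as the removal rule
# (a simplified stand-in for the p-of-F criterion described above).
import numpy as np

def backward_eliminate(X, y, names, threshold=0.01):
    """Repeatedly drop the predictor whose removal costs the least
    R-squared, while that cost stays below `threshold`."""
    def r2(cols):
        M = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
        fitted = M @ np.linalg.lstsq(M, y, rcond=None)[0]
        return 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        full = r2(kept)
        # R-squared loss from dropping each remaining predictor in turn
        drops = [(full - r2([c for c in kept if c != cand]), cand)
                 for cand in kept]
        loss, worst = min(drops)
        if loss >= threshold:   # every remaining predictor matters: stop
            break
        kept.remove(worst)
    return [names[c] for c in kept]

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)          # truly predictive
x2 = rng.normal(size=n)          # pure noise
y = 4 * x1 + rng.normal(size=n)
X = np.column_stack([x1, x2])
print(backward_eliminate(X, y, ["x1", "x2"]))  # ['x1']
```

The noise predictor is eliminated because dropping it barely changes R-squared, while the informative predictor survives.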

Linear Regression Variable Selection Methods
cont’d…
– Forward Selection: A stepwise variable selection
procedure in which variables are sequentially
entered into the model. The first variable considered
for entry into the equation is the one with the largest
positive or negative correlation with the dependent
variable. This variable is entered into the equation
only if it satisfies the criterion for entry. If the first
variable is entered, the independent variable not in
the equation that has the largest partial correlation is
considered next. The procedure stops when there are
no variables that meet the entry criterion.

Multiple linear regression cont’d…

 Example on multiple regression
– The data for multiple regression were taken from a survey of women attending an antenatal clinic.
– The objectives of the study were to identify the factors responsible for low birth weight and to predict women 'at risk' of having a low birth weight baby.

Multiple linear regression cont’d…

Notations:
BW = Birth weight (kgs) of the child =X1
HEIGHT = Height of mother (cms) = X2
AGEMOTH = Age of mother (years) = X3
AGEFATH = Age of father (years) = X4
FAMINC = Monthly family income (Birr) = X5
GESTAT = Period of gestation (days) = X6
Multiple linear regression cont’d…
 Answer the following questions based on the above
data
1. Check the association of each predictor with the
dependent variable.
2. Fit the full regression model
3. Fit the condensed regression model
4. What do you understand from your answers in parts 1, 2 and 3?
5. Check the assumptions required and explain.

Multiple linear regression cont’d…
6. What is the proportion of variability accounted for by the regression?
7. Compute the multiple correlation coefficient.
8. Predict the birth weight of a baby born alive from a woman aged 30 years and with the following additional characteristics:
– height of mother = 170 cm
– age of father = 40 years
– monthly family income = 600 Birr
– period of gestation = 275 days
9. Estimate the birth weight of a baby born alive from a woman with the same characteristics as in "8" but with a mother's age of 49 years.
