Chapter 8 Logistic Regression (Compatibility Mode)

University of Gondar

College of Medicine and Health Science

Department of Epidemiology and Biostatistics

Logistic Regression

Lemma Derseh (BSc. MPH)


Logistic regression
• In linear regression, we fit a model with a continuous dependent variable and independent variable(s) of any measurement scale (categorical or numeric).

• What can we do if the dependent variable is dichotomous? (The outcome can also have more than two categories, which calls for multinomial or ordinal logistic regression.)
Logistic regression cont…
The above question covers problems such as the following:
The relationship between coronary heart disease (CHD; a binary outcome variable, i.e. +ve or -ve) and age (a continuous variable).
Note: CHD = 0 means -ve for CHD, and CHD = 1 means +ve.

Age   CHD      Age   CHD      Age   CHD
22    0        40    0        54    0
23    0        41    1        55    1
24    0        46    0        58    1
27    0        47    0        60    1
28    0        48    0        60    0
30    0        49    1        62    1
30    0        49    0        65    1
32    0        50    1        67    1
33    0        51    0        71    1
35    1        51    1        77    1
38    0        52    0        81    1
Logistic regression cont…
• One possible statistical method is to use a t-test or Z-test to compare the mean ages of the two groups, or to use ANOVA (even though the outcome has only two levels).
• In this example, such a test gives a statistically significant age difference between the two groups (CHD +ve versus -ve) (p < 0.0001).
• However, all these tests tell us only that the age difference between the two groups is significant, not the magnitude of the effect of age on CHD.
• Therefore, what if our research goal is to know the probability of being +ve for CHD (i.e. to predict the outcome status of each individual)? Or
• What happens when you have several covariates that you believe contribute to CHD?
Shall we use linear regression?
Logistic regression cont…
First, draw a scatter plot of CHD status versus age.

[Figure: scatter plot of probability of CHD (vertical axis) against age (horizontal axis, up to about 80 years)]

Second, add the candidate linear regression line of probability on age.

Problem! If we try to fit an ordinary linear regression, we will predict probabilities greater than 1 or less than 0, which is impossible.
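
The problem can be checked numerically. The following is a minimal sketch in plain Python that fits an ordinary least-squares line to the 33 (age, CHD) pairs from the data table and evaluates it at the youngest and oldest ages; the variable names are illustrative, not part of the original slides.

```python
# CHD data from the table above: 33 individuals, in the order listed.
ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
chd = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

n = len(ages)
mean_age = sum(ages) / n
mean_chd = sum(chd) / n

# Ordinary least-squares slope and intercept for CHD regressed on age.
sxy = sum((x - mean_age) * (y - mean_chd) for x, y in zip(ages, chd))
sxx = sum((x - mean_age) ** 2 for x in ages)
slope = sxy / sxx
intercept = mean_chd - slope * mean_age

# "Probabilities" predicted by the straight line at the two ends of the age range.
pred_youngest = intercept + slope * min(ages)
pred_oldest = intercept + slope * max(ages)
print(f"predicted P(CHD) at age {min(ages)}: {pred_youngest:.3f}")
print(f"predicted P(CHD) at age {max(ages)}: {pred_oldest:.3f}")
```

With these data the line falls below 0 at the youngest age and rises above 1 at the oldest, which is exactly the difficulty described above.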
Logistic regression cont…
So what shall we do?

Rather than working with individual ages and their binary outcomes,
let us group the ages so that we can compute the proportion
(probability) of successes (1s) in each age group.

In doing so, we obtain intermediate proportions between 0 and 1.


Age group    # in group    # diseased    Probability
20 - 29      5             0             0.00
30 - 39      6             1             0.17
40 - 49      7             2             0.29
50 - 59      7             4             0.57
60 - 69      5             4             0.80
70 - 79      2             2             1.00
80 - 89      1             1             1.00
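
The grouping step can be reproduced directly from the raw data. This is a small sketch in plain Python, assuming ten-year age bands as in the table; the dictionary names are illustrative.

```python
from collections import Counter

# CHD data from the earlier table: 33 individuals, in the order listed.
ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
chd = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

in_group = Counter()   # number of people per ten-year age band
diseased = Counter()   # number of CHD-positive people per band
for age, status in zip(ages, chd):
    band_start = age // 10 * 10   # e.g. 47 -> 40, meaning the 40 - 49 band
    in_group[band_start] += 1
    diseased[band_start] += status

# Proportion diseased in each band, rounded to two decimals as in the table.
proportions = {band: round(diseased[band] / in_group[band], 2)
               for band in sorted(in_group)}
for band, proportion in proportions.items():
    print(f"{band} - {band + 9}: n={in_group[band]}, "
          f"diseased={diseased[band]}, proportion={proportion:.2f}")
```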
Logistic regression cont…
The probabilities in the above table are the same as the
proportions of individuals with CHD in each age category.

[Figure: probability of CHD ("sign of coronary disease", from 0 = No to 1 = Yes) on the vertical axis against age group on the horizontal axis, with an S-shaped curve drawn in red]

A scatter plot of the proportions across the age groups
gives the S-shaped curve above (shown in red).
Logistic regression cont…
• Again, such an S-shaped (sigmoidal) curve is difficult to describe with a linear equation, for two reasons:
  • First, even though it looks linear at the center of the curve, the extremes do not follow a linear trend.
  • Second, the errors are neither normally distributed nor of constant variance across the entire range of the data.

Question! So what do we do with this S-shaped curve?

• Answer:
  • First: find a function that fits (can be linked to) this S-shaped graph.
  • Second: find another function that transforms the S-shaped graph into a linear function.
(I) Finding a function that best fits the S-shaped graph of probability

[Figure: S-shaped curve of P (vertical axis, 0.0 to 1.0) against x (horizontal axis, 20 to 100)]

$$P = P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

Here P = P(success given that x occurred) = P(a person is +ve for CHD given that his or her age is x).

We call the above mathematical expression a logistic function.

It always gives an S-shaped curve within the range of 0 to 1 for any x.
That is why we link it with p (probability), which has the
same S-shape in the same range of 0 to 1.
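
A minimal Python sketch of the logistic function, assuming the standard form P = e^(α + βx) / (1 + e^(α + βx)). The values of `alpha` and `beta` below are illustrative assumptions, not estimates from the data.

```python
import math

def logistic(x, alpha, beta):
    """Return P = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    # 1 / (1 + e^(-z)) is algebraically the same expression as e^z / (1 + e^z).
    return 1.0 / (1.0 + math.exp(-z))

alpha, beta = -6.0, 0.12   # illustrative values only
curve = [(x, logistic(x, alpha, beta)) for x in range(20, 101, 10)]
for x, p in curve:
    print(f"x = {x:3d}  P = {p:.3f}")
```

Whatever values are chosen for `alpha` and `beta`, the output stays strictly between 0 and 1, which is what makes the function a suitable match for a probability.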
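(II) Transforming to a linear function using the logit function

This is the logistic function:
$$P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

The odds of an event:
$$\frac{P}{1 - P} = e^{\alpha + \beta x}$$

The logit of P:
$$\ln\left(\frac{P}{1 - P}\right) = \alpha + \beta x$$

α = log odds of the event in the unexposed
β = log odds ratio associated with being exposed
e^β = odds ratio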
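
The transformation can be verified numerically. The sketch below assumes the logistic form P = e^(α + βx) / (1 + e^(α + βx)) and uses illustrative values for `alpha` and `beta`; it checks that the logit of P recovers the linear part α + βx and that a one-unit increase in x multiplies the odds by e^β.

```python
import math

def logistic(z):
    """Probability corresponding to a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def logit(p):
    """Log odds of an event with probability p."""
    return math.log(odds(p))

alpha, beta = -6.0, 0.12   # illustrative values only
x = 40
p_x = logistic(alpha + beta * x)
p_next = logistic(alpha + beta * (x + 1))

linear_part = logit(p_x)                 # should equal alpha + beta * x
odds_ratio = odds(p_next) / odds(p_x)    # should equal e^beta
print(f"logit(P) = {linear_part:.4f}, alpha + beta*x = {alpha + beta * x:.4f}")
print(f"odds ratio per unit of x = {odds_ratio:.4f}, e^beta = {math.exp(beta):.4f}")
```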
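The linking and transformation process

[Flow diagram. Start: the outcome (Yes/No) against a single predictor. Grouping the predictor gives the proportions Pi, which the link function connects to the logistic function; the logit transform function then turns this into a linear function. End.]
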

Characteristics of the logistic function
The S-shaped curve of the logistic function has the following
characteristics:

Function: $p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$

If β is the slope of the linear function after the logit transformation, then:

• The S-shaped curve has a slope equal to p(1 - p)β, where p is the probability at X = x.

• As we move to the two extremes of x or p, the slope approaches 0.

• The (x, p) coordinate at which the slope reaches its peak is (-α/β, 0.5). The value of x at this point is called the median effective level, denoted by EL50.
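
As a worked illustration of these characteristics, the sketch below plugs in the intercept (-6.708) and slope (0.132) that the SPSS output for the CHD data reports later in these slides. It assumes the logistic form p = e^(α + βx) / (1 + e^(α + βx)); the function names are illustrative.

```python
import math

alpha, beta = -6.708, 0.132   # constant and age coefficient from the SPSS output for the CHD data

def prob(x):
    """Probability of CHD at age x under the fitted logistic curve."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def slope(x):
    """Slope of the S-shaped curve at x: p(1 - p) * beta."""
    p = prob(x)
    return p * (1.0 - p) * beta

el50 = -alpha / beta   # age at which the predicted probability is 0.5
print(f"EL50 = {el50:.1f} years, p at EL50 = {prob(el50):.2f}")
print(f"slope at EL50 = {slope(el50):.4f} (this equals beta / 4)")
print(f"slope at age 25 = {slope(25):.4f}, slope at age 80 = {slope(80):.4f}")
```

Under these estimates the curve is steepest at roughly 51 years of age and flattens towards both ends of the age range.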
Logistic regression cont…
Example on the data given above:
Fitting a logistic regression is computationally intensive, so it is done with statistical software.

After entering the above data in SPSS and running a
binary logistic regression, the following result was obtained.

Variable    β        S.E.    Wald    Sig.    Exp(β)   95.0% C.I. for Exp(β)
                                                       Lower    Upper
age         0.132    0.046   8.053   0.005   1.141    1.042    1.249
Constant    -6.708   2.354   8.121   0.004   0.001

For a unit (one-year) increase in a person's age, the odds of being
positive for CHD increase by a factor of 1.141.

The 95% CI for this estimate (i.e. the odds ratio) is (1.04, 1.25).
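
For readers without SPSS, the fit can be approximated in a few lines. The following is a rough sketch, not SPSS's own procedure: it runs Newton-Raphson maximum likelihood in plain Python on the 33 (age, CHD) pairs from the data table and then forms a Wald-type 95% confidence interval (estimate ± 1.96 × S.E., exponentiated). The function `fit_logistic` is a hypothetical helper written for this illustration.

```python
import math

ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
chd = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

def fit_logistic(x_values, y_values, iterations=25):
    """Newton-Raphson maximum likelihood for logit(p) = a + b*x (one predictor)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Score (gradient) and information matrix entries at the current (a, b).
        g_a = g_b = i_aa = i_ab = i_bb = 0.0
        for x, y in zip(x_values, y_values):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            i_aa += w
            i_ab += w * x
            i_bb += w * x * x
        det = i_aa * i_bb - i_ab ** 2
        # Newton step: solve the 2x2 system (information matrix) * step = score.
        a += (i_bb * g_a - i_ab * g_b) / det
        b += (i_aa * g_b - i_ab * g_a) / det
    # Standard errors from the inverse information matrix of the last iteration.
    return a, b, math.sqrt(i_bb / det), math.sqrt(i_aa / det)

a_hat, b_hat, se_a, se_b = fit_logistic(ages, chd)
odds_ratio = math.exp(b_hat)
ci_lower = math.exp(b_hat - 1.96 * se_b)
ci_upper = math.exp(b_hat + 1.96 * se_b)
print(f"constant = {a_hat:.3f} (S.E. {se_a:.3f}), age = {b_hat:.3f} (S.E. {se_b:.3f})")
print(f"Exp(age) = {odds_ratio:.3f}, 95% CI ({ci_lower:.3f}, {ci_upper:.3f})")
```

The estimates should land close to the SPSS figures in the table; small differences in the last decimals can arise from rounding and from different convergence criteria.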
Logistic Regression cont…
Example on a categorical variable: residence vs. patient satisfaction with the service

               Patient satisfaction
Residence      Unsatisfied    Satisfied    Total
Rural          98             17           115
Urban          205            154          359
Total          303            171          474

• Odds for rural: $\frac{p}{1-p} = \frac{0.852}{0.148} = 5.76$; ln odds: $\ln(5.76) = 1.75$

• Odds for urban: $\frac{p}{1-p} = \frac{0.571}{0.429} = 1.33$; ln odds: $\ln(1.33) = 0.285$

• Odds ratio = $\frac{5.76}{1.33} = 4.33$; Δ ln odds = ln OR = β = 1.75 - 0.285 = 1.47

• The OR is the same by the two methods: OR = $e^{1.47}$ = 4.33
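
The same figures can be obtained from the four cell counts alone. A minimal sketch in plain Python (the dictionary layout is an illustrative choice):

```python
import math

# (unsatisfied, satisfied) counts by residence, from the table above.
counts = {"Rural": (98, 17), "Urban": (205, 154)}

odds = {}
log_odds = {}
for residence, (unsatisfied, satisfied) in counts.items():
    odds[residence] = unsatisfied / satisfied      # same as p / (1 - p)
    log_odds[residence] = math.log(odds[residence])

odds_ratio = odds["Rural"] / odds["Urban"]          # method 1: ratio of the odds
beta = log_odds["Rural"] - log_odds["Urban"]        # method 2: difference in log odds
print(f"odds: rural {odds['Rural']:.2f}, urban {odds['Urban']:.2f}")
print(f"OR = {odds_ratio:.2f}, beta = ln OR = {beta:.3f}, e^beta = {math.exp(beta):.2f}")
```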


Interpreting the Logistic Regression Model
• The model for this example is: $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1$

• For urban ($x_1 = 0$) we have: $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1(0) = \beta_0$
  (We always code the unexposed category as 0.)

• Thus the estimate of the intercept, $\beta_0$, is the log odds for urban (unexposed):
  $\ln\left(\frac{p_0}{1-p_0}\right) = 0.285$
Interpreting the Logistic Regression Model cont…
• The estimate of the slope is the difference between the log odds for rural (exposed) and the log odds for urban (unexposed):

$$\beta_1 = \ln\left(\frac{p_1}{1-p_1}\right) - \ln\left(\frac{p_0}{1-p_0}\right) = 1.75 - 0.285 = 1.465$$

• The fitted model is: log(odds) = 0.285 + 1.465X


Meaning of the Odds Ratio
• The odds ratio is: $\frac{\text{Odds}_{\text{rural}}}{\text{Odds}_{\text{urban}}} = \frac{e^{0.285 + 1.465}}{e^{0.285}} = e^{1.465} = 4.33$
Or, odds ratio = $\exp(\beta_1) = \exp(1.465) = e^{1.465} = 4.33$
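
As a quick check of the fitted model log(odds) = 0.285 + 1.465X, the sketch below converts the log odds back to probabilities and compares them with the observed proportions of unsatisfied patients (205/359 for urban, 98/115 for rural). The coding X = 0 for urban and X = 1 for rural follows the slides; the function name is illustrative.

```python
import math

b0, b1 = 0.285, 1.465   # intercept and slope of the fitted model

def predicted_probability(x):
    """P(unsatisfied) for x = 0 (urban) or x = 1 (rural)."""
    log_odds = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-log_odds))

p_urban = predicted_probability(0)
p_rural = predicted_probability(1)
odds_ratio = (p_rural / (1 - p_rural)) / (p_urban / (1 - p_urban))

print(f"urban: predicted {p_urban:.3f}, observed {205 / 359:.3f}")
print(f"rural: predicted {p_rural:.3f}, observed {98 / 115:.3f}")
print(f"odds ratio from the predictions = {odds_ratio:.2f}")
```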

SPSS output
Variable     B       S.E.    Wald    P-value   Exp(B)   95.0% C.I. for Exp(B)
                                                         Lower    Upper
Residence    1.47    0.284   26.72   <0.001    4.33     2.48     7.55
Constant     0.286   0.107   7.196   0.007     1.33

• Interpretation: the odds of being unsatisfied with the service received are 4.33 times as high for rural patients as for urban patients.
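
For a single two-category predictor, the standard error, Wald statistic, and confidence interval for residence can be approximated by hand from the 2×2 table. The sketch below uses the usual log-odds-ratio standard error, sqrt(1/a + 1/b + 1/c + 1/d), and a Wald-type interval; this is an assumption about how the figures arise, not a description of SPSS internals.

```python
import math

# Cell counts from the residence by satisfaction table.
rural_unsat, rural_sat = 98, 17
urban_unsat, urban_sat = 205, 154

log_or = math.log((rural_unsat * urban_sat) / (rural_sat * urban_unsat))
se = math.sqrt(1 / rural_unsat + 1 / rural_sat + 1 / urban_unsat + 1 / urban_sat)

wald = (log_or / se) ** 2                   # Wald chi-square statistic
ci_lower = math.exp(log_or - 1.96 * se)     # 95% CI on the odds-ratio scale
ci_upper = math.exp(log_or + 1.96 * se)

print(f"B = {log_or:.3f}, S.E. = {se:.3f}, Wald = {wald:.2f}")
print(f"Exp(B) = {math.exp(log_or):.2f}, 95% CI ({ci_lower:.2f}, {ci_upper:.2f})")
```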
Multiple logistic regression
• This model includes more than one independent variable.

• The independent variables can be dichotomous, ordinal, nominal, continuous, etc.

$$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$

• Interpretation of $\beta_i$:

  • It is the increase in the log odds for a one-unit increase in $x_i$, with all the other $x_i$s held constant.

  • It measures the association between $x_i$ and the log odds, adjusted for all the other $x_i$s.
Multiple logistic regression cont..
Example 1: Assume a second variable, 'sex', is added to the existing CHD data, as indicated in the SPSS data view in the exercise.

Variable    B        S.E.    Wald    Sig.    Exp(B)    95.0% C.I. for Exp(B)
                                                        Lower    Upper
age         0.114    0.053   4.733   0.030   1.121     1.011    1.243
sex(1)      2.952    1.276   5.356   0.021   19.153    1.571    233.459
Constant    -7.787   2.689   7.367   0.007   0.000

Interpretation: Adjusting for age, the odds of developing CHD for females are 19.15 times those for males. (Males are taken as the reference.)

Note that the 95% CI is very wide because the sample used in the analysis is small. (There should be at least 10 'yes' and 10 'no' outcomes, preferably 20, for each category of each variable.)
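
To see what "adjusted" means in practice, the sketch below evaluates the fitted model logit(p) = -7.787 + 0.114·age + 2.952·sex at a few ages. It assumes sex(1) = 1 codes females and 0 codes males, as the interpretation above implies; the function names are illustrative.

```python
import math

intercept, b_age, b_sex = -7.787, 0.114, 2.952   # coefficients from the SPSS table

def predicted_probability(age, female):
    """P(CHD) for a given age; female is 1 for females and 0 for males (assumed coding)."""
    log_odds = intercept + b_age * age + b_sex * female
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(p):
    return p / (1.0 - p)

sex_odds_ratios = []
for age in (40, 50, 60):
    p_male = predicted_probability(age, 0)
    p_female = predicted_probability(age, 1)
    ratio = odds(p_female) / odds(p_male)
    sex_odds_ratios.append(ratio)
    print(f"age {age}: P(male) = {p_male:.3f}, P(female) = {p_female:.3f}, OR = {ratio:.2f}")
```

The predicted probabilities change with age, but the female-to-male odds ratio is the same at every age and equals exp(2.952), which is what holding the other variables constant amounts to.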
Multiple Logistic Regression output

                       Unsatisfied
Characteristics        Yes   No    Crude OR (95% CI)     Adjusted OR (95% CI)    P-value
Cost of treatment                                                                0.522
  Very cheap           59    69    1.0                   1.0
  Cheap                70    42    1.95 (1.16, 3.27)     1.36 (0.66, 2.81)       0.400
  Moderate             30    8     4.39 (1.87, 10.30)    2.12 (0.64, 7.05)       0.220
  Expensive            97    41    2.77 (1.67, 4.58)     1.54 (0.73, 3.26)       0.255
  Highly expensive     47    11    5.00 (2.38, 10.50)    2.35 (0.74, 7.53)       0.150
Residence
  Urban                205   154   1.0                   1.0
  Rural                98    17    4.33 (2.48, 7.55)     2.71 (1.19, 6.16)       0.017*
Extra job                                                                        <0.001**
  Government worker    19    13    1.0                   1.0
  Need part-time       210   58    2.48 (1.16, 5.31)     3.18 (1.09, 9.06)       0.034
  Part-timer           60    96    0.43 (0.20, 0.93)     1.04 (0.35, 3.15)       0.94
  Has his own firm     14    4     2.40 (0.64, 8.93)     6.26 (3.21, 32.27)      0.028
Diagnosis type
  Complex              108   154   1.0                   1.0
  Simple               195   17    16.36 (9.41, 28.44)   13.55 (6.96, 26.37)     <0.001**
Total                  303   171

α = 0.05; * significant; ** highly significant. Overall p-values (underlined in the original) are the ones shown on the variable-name rows.
Take Notice of:
• How the variables (characteristics) and their corresponding categories are laid out.
• How the frequencies are laid out in relation to how the categories of the dependent variable are defined in the SPSS variable view (because SPSS always interprets ORs in terms of the larger code given in the 'Values' column; here, unsatisfied).
• Overall p-values are important for variables with more than two categories, especially if the variable has both significant and non-significant categories.
• The overall p-values are written on the same line as the variable name, and the specific p-values on the lines of the respective categories.
• Specific p-values may look redundant alongside the confidence intervals; however, they tell us the level (degree) of significance (e.g. strong, marginal, or weak association).
