
Advanced Biostatistics

Course Topic 2: Simple Binary Logistic Regression

1
Simple Binary Logistic Regression
• Simple logistic regression is a statistical method used to
model the relationship between a binary dependent
variable and one independent variable.
• The independent variable used to predict or
explain the outcome variable can be either
continuous or categorical.

2
Cont…
• Unlike standard linear regression models, the logistic
regression model does not require the assumptions of:
 normality of the response variable distribution, and
 equal/homogeneous variance of the error term at
every level of the independent variable.
• Logistic regression is widely used in the health
sciences to predict the likelihood of an event
occurring, such as whether or not a patient has a certain
disease, based on predictor variables (e.g. age,
sex, smoking status).

3
Cont…
Goals of logistic regression:
• To assess whether the probability of obtaining a particular
value of the outcome/dependent variable is
associated with the independent/predictor variable,
and/or
• To predict the probability that the outcome/dependent
variable takes a particular value, given the value of a
continuous independent/predictor variable or a
particular category/level of a categorical
independent/predictor variable.

4
Assumptions of Logistic regression
1. The dependent variable should be binary.
 An observation with the desired outcome takes the
value 1 and one without the desired outcome takes the
value 0.
2. The log odds of the dependent variable taking the desired
value (the event occurring) is linearly related to
continuous independent variables.
3. The error terms need to be independent.
 Observations should be independent of each other.

5
Cont…

4. The model requires a large sample size.
5. The continuous predictors should not have
influential outliers.

6
Logistic Regression Model
• The response variable ($Y_i$) is binary and assumes
only two values that, for convenience, can be
coded as 0 and 1:

$Y_i = \begin{cases} 1, & \text{if the } i\text{th subject has the desired attribute} \\ 0, & \text{if the } i\text{th subject does not have the desired attribute} \end{cases}$

• $Y_i$ is a random variable assuming values 1 or 0 with
probabilities $\pi_i$ and $1 - \pi_i$, respectively.
• We can define a model in which the probability of an
individual having the attribute, $\pi_i$, depends on one
potential explanatory variable (simple logistic regression).
7
Cont…
• Given the potential predictor, the probability of an individual
having the desired event/attribute is given by:

$\pi_i = P(Y_i = 1 \mid X_i) = \dfrac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}$

Where:
 $\pi_i$ is the probability that $Y_i = 1$ for a given $X_i$,
 $\beta_0$ and $\beta_1$ are parameters to be estimated,

8
Cont…
• $\beta_0$ is a constant/intercept (corresponds to the
baseline group), and
• $\beta_1$ is the regression coefficient (slope), which measures the rate of
change in $\pi_i$ for a given change in a continuous predictor
variable or for a given category/level of a categorical
predictor.
• $e$ is the base of the natural logarithm.
• This equation is used to determine the predicted
probability of the occurrence of an event given the
values of the predictor variable.

9
Cont…

• We need to transform $\pi_i$ so that the logistic
regression model closely resembles a familiar
linear regression model:
 First, we calculate the odds that an event occurs
(i.e. $Y_i = 1$), which is the probability that the event
occurs relative to its converse (that it does not occur),
i.e. the probability that $Y_i = 1$ relative to the
probability that $Y_i = 0$:

$\text{odds} = \dfrac{P(Y_i = 1 \mid X_i)}{1 - P(Y_i = 1 \mid X_i)} = \dfrac{\pi_i}{1 - \pi_i} = e^{\beta_0 + \beta_1 X_i}$

10
Cont…
• Second, we take the natural log of the odds that $Y_i = 1$:

$\ln\left(\dfrac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_i + \varepsilon_i$

• This is the natural log of the odds of $Y_i = 1$ versus $Y_i = 0$,
i.e. the log of the odds that an individual has the
attribute/event relative to not having it.

11
Cont…
• $\pi_i$ is the probability that the $i$th observation (i = 1, …, n)
takes the value 1 ($Y_i = 1$).
• $1 - \pi_i$ is the probability that the $i$th observation (i = 1, …, n)
takes the value 0 ($Y_i = 0$).
• $\beta_0$ and $\beta_1$ are the two unknown parameters to be
estimated, including the intercept term.
• $X_i$ is the explanatory variable.
• $\varepsilon_i$ is the random error term, which is binomially
distributed.

12
Estimation method
• Since the error terms have a binomial distribution,
ordinary least squares (OLS) estimation is not
appropriate.
• Thus, we use the maximum likelihood (ML) method
to estimate the model parameters instead.

13
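To make the estimation step concrete, the following is a minimal Python sketch (assuming the numpy and statsmodels packages are available) of fitting a simple binary logistic regression by maximum likelihood; the data are reconstructed from the cure/treatment example introduced in the slides that follow.

```python
# A minimal sketch of ML estimation for simple binary logistic regression,
# assuming numpy and statsmodels are installed.
import numpy as np
import statsmodels.api as sm

# Reconstruct the cure/treatment example data used later in this topic:
# treatment A (x=1): 40 cured, 10 not; treatment B (x=0): 30 cured, 20 not.
y = np.array([1] * 40 + [0] * 10 + [1] * 30 + [0] * 20)  # 1 = cure, 0 = non-cure
x = np.array([1] * 50 + [0] * 50)                        # 1 = treatment A, 0 = B

X = sm.add_constant(x)            # adds the intercept column
model = sm.Logit(y, X).fit()      # maximum likelihood estimation
print(model.params)               # expected: intercept ~0.405, slope ~0.981
print(model.bse)                  # standard errors ~0.289 and ~0.456
```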
Confidence Intervals and Significance Tests for Logistic
Regression Parameters
• A $100(1-\alpha)\%$ confidence interval for the log-odds
coefficients ($\beta_0$, $\beta_1$) is given by:

$\hat\beta_0 \pm Z_{\alpha/2} \times SE(\hat\beta_0)$ and $\hat\beta_1 \pm Z_{\alpha/2} \times SE(\hat\beta_1)$

• A $100(1-\alpha)\%$ confidence interval for the odds ratio
is obtained by transforming the confidence interval
for the intercept and the slope as:
• For the constant:

$e^{\hat\beta_0 \pm Z_{\alpha/2} \times SE(\hat\beta_0)} = \left( e^{\hat\beta_0 - Z_{\alpha/2} \times SE(\hat\beta_0)},\; e^{\hat\beta_0 + Z_{\alpha/2} \times SE(\hat\beta_0)} \right)$

• For the slope:

$e^{\hat\beta_1 \pm Z_{\alpha/2} \times SE(\hat\beta_1)} = \left( e^{\hat\beta_1 - Z_{\alpha/2} \times SE(\hat\beta_1)},\; e^{\hat\beta_1 + Z_{\alpha/2} \times SE(\hat\beta_1)} \right)$

14
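A quick numeric check of these formulas (the slope estimate and standard error are those reported in the SPSS output later in this topic; scipy is assumed for the normal quantile):

```python
import numpy as np
from scipy.stats import norm

b1, se1 = 0.981, 0.456            # slope and its SE from the example output below
z = norm.ppf(0.975)               # ~1.96 for a 95% CI

ci_log_odds = (b1 - z * se1, b1 + z * se1)
ci_or = tuple(np.exp(ci_log_odds))  # transform to the odds-ratio scale
print(ci_or)                        # roughly (1.09, 6.52)
```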
Example
• The objective of this study is to determine whether the cure/non-cure
outcome is associated with the treatment
alternative (treatment A or treatment B). In other
words, we are testing whether the probability of cure
under treatment A is equal to, or different from,
the probability of cure under treatment B.
• To conduct the study, suppose that we have
performed an experiment on a random sample of
100 patients, randomly divided into two groups of 50
patients, each of which is given one treatment (A or B).

15
Cont…
• The results obtained in the experiment are
shown in the following table:
Treatment alternative * Cure Status Crosstabulation (Count)

                            Cure Status
                          Cure   Non-cure   Total
Treatment    Treatment A    40         10      50
alternative  Treatment B    30         20      50
Total                       70         30     100

• OR = 2.67

16
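The odds ratio reported above can be verified directly from the table; a minimal sketch:

```python
# Odds ratio from the 2x2 table: (cure odds under A) / (cure odds under B)
odds_A = 40 / 10        # odds of cure with treatment A
odds_B = 30 / 20        # odds of cure with treatment B
print(odds_A / odds_B)  # 2.666..., i.e. OR = 2.67
```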
Cont…
• Logistic regression output

Dependent Variable Encoding
Original Value    Internal Value
Non-cure                0
Cure                    1

• Categorical independent variable coding:

Categorical Variables Codings
                                      Frequency   Parameter coding (1)
Treatment alternative   Treatment B        50             .000
                        Treatment A        50            1.000

• Treatment B is set as the reference category.


17
Cont…

• Null model (constant only)

Variables in the Equation
                    B     S.E.    Wald   df   Sig.   Exp(B)
Step 0  Constant  .847   .218   15.076    1   .000    2.333

• P(cure) = 70/100 = 0.7, P(non-cure) = 30/100 = 0.3
• Odds(cure) = P(cure)/P(non-cure) = 0.7/0.3 = 2.333
• B = ln(odds of cure) = ln(2.333) = 0.847
• OR = EXP(B) = EXP(0.847) = 2.333
• OR > 1 (OR = 2.333) shows that cure is more likely
than non-cure.
18
cont…
• Simple binary logistic regression model
• Does the inclusion of one independent variable (treatment
alternative) improve the model over the null model?

Omnibus Tests of Model Coefficients
                Chi-square   df   Sig.
Step 1  Step         4.831    1   .028
        Block        4.831    1   .028
        Model        4.831    1   .028

• The statistic measures the amount by which -2LL (the deviance,
analogous to the residual sum of squares in linear regression)
is reduced using the model with one explanatory variable compared to
the null model.
19
Cont…
• The Omnibus Tests of Model Coefficients table is used to check
whether the model with one explanatory variable is an
improvement over the null model.
• It uses chi-square tests to see if there is a significant
difference between the log-likelihoods (-2LLs) of the null
model and the model with one independent variable.
• If the model with one independent variable has a
significantly reduced -2LL compared to the null model,
this suggests that the model with the independent
variable explains more of the variance in the
response variable (status of cure) than the null model.

20
Cont…
• Under Model Summary of the SPSS output,
the -2 Log Likelihood statistic for the model with one
independent variable is 117.341.

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      117.341 a                 .047                   .067
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

• Although SPSS does not give us this statistic for the null
model (the model that had only the intercept), from the
Omnibus Tests of Model Coefficients output, we know
that the -2LL of the null model is reduced by 4.831.
21
Cont…
• Since the -2LL of the model with one independent variable is
117.341, the -2LL of the null model equals 122.172
(117.341 + 4.831).
• Adding the treatment alternative as a predictor variable
reduced the -2 Log Likelihood statistic by 4.831 (122.172 -
117.341).
• The reduction is significant (Chi-square = 4.831, df = 1; p-
value = 0.028).
• df = number of parameters estimated in the model with
one independent variable (constant,
treatment alternative) minus number of parameters
estimated in the null model (constant only).
Thus, df = 2 - 1 = 1.
22
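The p-value of this likelihood-ratio test can be reproduced from the chi-square distribution; a small sketch assuming scipy:

```python
from scipy.stats import chi2

lr_stat = 122.172 - 117.341       # reduction in -2LL (the omnibus chi-square)
p_value = chi2.sf(lr_stat, df=1)  # upper-tail probability, df = 2 - 1 = 1
print(lr_stat, p_value)           # ~4.831 and ~0.028
```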
Cont…
• Logistic regression model with the independent variable output:
• The output in the Variables in the Equation table provides
the regression coefficients (B), the Wald statistic (to test the
statistical significance) and
• the Odds Ratio (Exp(B)) for each variable category.

Variables in the Equation
                                                                       95% C.I. for EXP(B)
                                   B     S.E.   Wald   df   Sig.   Exp(B)   Lower   Upper
Step 1a  Treatment alternative(1)  .981  .456   4.618   1   .032    2.667   1.090   6.524
         Constant                  .405  .289   1.973   1   .160    1.500
a. Variable(s) entered on step 1: Treatment alternative.

• Treatment alternative(1) = treatment A
• Treatment B is set as the reference group

23
Cont…
• Based on the model estimate:
• The probability that a patient is cured of the disease, given
the treatment alternative taken, is given by:

$\hat P(Y = 1 \mid X) = \hat\pi = \dfrac{e^{0.405 + 0.981X}}{1 + e^{0.405 + 0.981X}}$

• The odds of cure from the disease for a patient, given a
particular treatment alternative taken, are given by:

$\dfrac{\hat\pi}{1 - \hat\pi} = e^{0.405 + 0.981X}$

• The log odds of cure from the disease for a patient, given a
particular treatment alternative taken, are given by:

$\ln\left(\dfrac{\hat\pi}{1 - \hat\pi}\right) = \hat\beta_0 + \hat\beta_1 X = 0.405 + 0.981X$

24
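Plugging the two treatment codes into the fitted equation gives the predicted probabilities; a minimal sketch (these reproduce the observed cure proportions, 40/50 and 30/50):

```python
import numpy as np

b0, b1 = 0.405, 0.981
for x, label in [(1, "treatment A"), (0, "treatment B")]:
    log_odds = b0 + b1 * x
    p_cure = np.exp(log_odds) / (1 + np.exp(log_odds))  # inverse-logit
    print(label, round(p_cure, 3))   # ~0.800 for A, ~0.600 for B
```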
Cont…
Interpretation of the logistic regression outputs
• Looking first at the results for Treatment alternative(1),
there is a significant overall effect of treatment alternative
on the likelihood of cure (B = 0.981, SE = 0.456, Wald = 4.618,
df = 1, p = 0.032).
• The B coefficient for Treatment alternative(1) is significant
and positive, indicating that treatment A is associated
with increased odds of cure from the disease.
• The Exp(B) column (the Odds Ratio) tells us that the odds
of cure from the disease for a patient who took treatment A
were 2.67 times the odds of cure for a patient who took
treatment B (our reference category) (COR = 2.67; 95% CI:
[1.09, 6.52]).

25
Exercise
1. Find the probability that a patient was cured from the disease
given that he/she took treatment A.
2. Find the probability that a patient was not cured from the disease
given that he/she took treatment A.
3. Find the odds of cure for a patient who took treatment A
4. Find the probability that a patient was cured from the disease
given that he/she took treatment B.
5. Find the probability that a patient was not cured from the disease
given that he/she took treatment B.
6. Find the odds of cure for a patient who took treatment B
7. Calculate the ratio of the odds of cure for a patient who took
treatment A to the odds of cure for a patient who took treatment
B
8. Find the 95% CI for the odds ratio for patients who took treatment A
compared to those who took treatment B.

26
Data for the example

Definitions of variables:
Dependent variable:
 Status of cure: 1 = cure, 0 = non-cure
Independent variable:
 Treatment alternative: 1 = Treatment A, 0 = Treatment B

The 100 records (subject, status of cure, treatment alternative) fall into four blocks:
Subjects 1-40:    cure = 1, treatment = 1 (treatment A, cured)
Subjects 41-70:   cure = 1, treatment = 0 (treatment B, cured)
Subjects 71-80:   cure = 0, treatment = 1 (treatment A, not cured)
Subjects 81-100:  cure = 0, treatment = 0 (treatment B, not cured)

29
Advanced Biostatistics

Topic 3: Multiple Binary Logistic Regression

30
Multiple Binary Logistic Regression
• The multiple binary logistic regression model is a type of
regression model that is appropriate when a
response variable taking only two possible values,
representing the presence or absence of an attribute of
interest (examples: infected/not infected, effective/not
effective), is associated with more than one
independent variable.
• The independent variables used to predict the outcome
variable can be dichotomous, categorical with
more than two levels, or continuous, or combinations of
categorical and continuous variables.

31
Example

• Clinical and epidemiological studies of binary outcome
variables typically focus on the potential effects of
multiple predictors.
• For example, investigators believe that total serum
cholesterol, diastolic and systolic blood pressure,
smoking, age, BMI, and behavior pattern are potential
predictors of coronary heart disease (CHD).
• Thus, to study the relationship between CHD and these
potential predictors, a multiple binary logistic regression
model is appropriate.

32
Assumptions of Logistic regression
• Unlike standard linear regression models, the logistic
regression model does not require the assumptions of:
 normality of the response variable distribution, and
 equal/homogeneous variance of the error term at
every level of the independent variables.
• However, there are important assumptions that must be
met to use multiple binary logistic regression:

33
Cont…
1. The dependent variable should be binary.
 An observation with the desired outcome takes the
value 1 and one without the desired outcome takes the
value 0.
2. The log odds of the dependent variable taking the desired
value (the event occurring) is linearly related to
continuous independent variables.
3. The model should be specified correctly.
 Only potential predictor variables should be included.
4. The error terms need to be independent.
 Observations should be independent of each other.
34
Cont…

5. There should be no strong multicollinearity among the
independent variables.
6. The model requires a large sample size.
7. The continuous predictors should not have
influential outliers.

35
Multiple Binary Logistic Regression Model
• The response variable ($Y_i$) is binary and assumes
only two values that, for convenience, can be
coded as 0 or 1:

$Y_i = \begin{cases} 1, & \text{if the } i\text{th subject has the desired attribute} \\ 0, & \text{if the } i\text{th subject does not have the desired attribute} \end{cases}$

• $Y_i$ is a random variable assuming values 1 or 0 with
probabilities $\pi_i$ and $1 - \pi_i$, respectively.
• We can define a model in which the probability of an
individual having the attribute/event, $\pi_i$, depends on k
potential explanatory variables.
36
Cont…
• Given the potential predictors, the probability of an individual
having the desired attribute is given by:

$\pi_i = P(Y = 1 \mid X_1 = x_1, \ldots, X_k = x_k) = \dfrac{e^{\beta_0 + \beta_1 X_{i1} + \cdots + \beta_k X_{ik}}}{1 + e^{\beta_0 + \beta_1 X_{i1} + \cdots + \beta_k X_{ik}}} \quad \ldots (1)$

Where:
 $\pi_i$ is the probability that $Y_i = 1$ for given $X_j = x_j$; $j = 1, 2, 3, \ldots, k$
 $\beta_0, \beta_1, \ldots, \beta_k$ are parameters to be estimated, and

37
Cont…
• $\beta_0$ is a constant/intercept (corresponds to the
baseline group), and
• $\beta_1, \beta_2, \ldots, \beta_k$ are regression coefficients (slopes), which
measure the rate of change in $\pi_i$ for a given change in a
continuous predictor variable or for a given
category/level of a categorical predictor.
• $e$ is the base of the natural logarithm.
• This equation is used to determine the predicted
probability of the occurrence of an event given the
values of the predictor variables.

38
Cont…

• We need to transform $\pi_i$ so that the logistic
regression model closely resembles a familiar
linear regression model:
 First, we calculate the odds that an event occurs
(i.e. $Y_i = 1$), which is the probability that the event
occurs relative to its converse (that it does not occur),
i.e. the probability that $Y_i = 1$ relative to the
probability that $Y_i = 0$:

$\text{odds} = \dfrac{\pi_i}{1 - \pi_i} = e^{\beta_0 + \beta_1 X_{i1} + \cdots + \beta_k X_{ik}} \quad \ldots (2)$

39
Cont…
• Second, we take the natural log of the odds that $Y_i = 1$:

$\text{logit}(\pi_i) = \ln\left(\dfrac{\pi_i}{1 - \pi_i}\right) = \ln\left(\dfrac{P(Y_i = 1)}{P(Y_i = 0)}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i \quad \ldots (3)$

• Equation (3) is the natural log of the odds of
$Y_i = 1$ versus $Y_i = 0$, i.e. the log of the odds that an
individual has the attribute/event relative to not
having it.

40
Cont…
• $\pi_i$ is the probability that the $i$th observation (i = 1, …, n)
takes the value 1 ($Y_i = 1$).
• $1 - \pi_i$ is the probability that the $i$th observation (i = 1, …, n)
takes the value 0 ($Y_i = 0$).
• $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are (k+1) unknown parameters to be
estimated, including the intercept term.
• $x_{i1}, \ldots, x_{ik}$ are the k explanatory variables.
• $\varepsilon_i$ is the random error term, which is binomially
distributed.

41
Estimation method
• Since the error terms have a binomial distribution,
ordinary least squares (OLS) estimation is not
appropriate.
• Thus, the model's parameters are estimated by the
maximum likelihood (ML) method.

42
Confidence Intervals and Significance Tests for Logistic
Regression Parameters
• A $100(1-\alpha)\%$ confidence interval for the constant ($\beta_0$)
and the slopes ($\beta_j$; $j = 1, 2, 3, \ldots, k$) is given by:

i. $\hat\beta_0 \pm Z_{\alpha/2}\, SE(\hat\beta_0)$   ii. $\hat\beta_j \pm Z_{\alpha/2}\, SE(\hat\beta_j)$

• A confidence interval for the odds ratio of the constant
($e^{\beta_0}$) and the slopes ($e^{\beta_j}$), obtained by transforming
the confidence intervals for the constant and slopes, is
given by:

i. $e^{\hat\beta_0 \pm Z_{\alpha/2} SE(\hat\beta_0)} = \left( e^{\hat\beta_0 - Z_{\alpha/2} SE(\hat\beta_0)},\; e^{\hat\beta_0 + Z_{\alpha/2} SE(\hat\beta_0)} \right)$

ii. $e^{\hat\beta_j \pm Z_{\alpha/2} SE(\hat\beta_j)} = \left( e^{\hat\beta_j - Z_{\alpha/2} SE(\hat\beta_j)},\; e^{\hat\beta_j + Z_{\alpha/2} SE(\hat\beta_j)} \right)$

43
Model-Building Strategies for Logistic
Regression
i. Backward elimination
• Starts with a model that contains all of the available
explanatory variables.
• Each variable is examined, and the variable whose
removal would cause the smallest change in the overall
model fit (the largest p-value) is removed.
• This continues until all variables remaining in the model
are significant.

44
Cont…
ii. Forward selection
• Looks at each explanatory variable individually and
selects the single explanatory variable that best fits the data
on its own as the first variable included in the
model.
• Given the first variable, the other variables are examined to
see if they would add significantly to the overall fit of the
model.
• Among the remaining variables, the one that adds the
most is included.
• Examining the remaining variables in light of those already in
the model, and adding those that contribute significantly to the
overall fit, is repeated until none of the remaining
variables would add significantly or there are no variables
remaining.
45
Cont…
iii. Stepwise selection
• A stepwise selection procedure is a combination of
forward selection and backward elimination.
• Start with a forward selection procedure, but after
every step check whether variables added early
on have been made redundant (and so can be
dropped) by variables or combinations of variables
added later.
• Similarly, one can start with a backward elimination
procedure, but after every step check whether a
variable that has been dropped should be added
back into the model.
46
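As an illustration of how backward elimination might be automated, here is a minimal Python sketch; the helper name backward_eliminate, the 0.05 threshold, and the use of Wald p-values (rather than likelihood-ratio tests at each step) are illustrative assumptions, not a prescribed procedure.

```python
import numpy as np
import statsmodels.api as sm

def backward_eliminate(y, X, alpha=0.05):
    """Drop the worst (largest Wald p-value) predictor until all are significant.
    X is a 2-D numpy array of predictors; an intercept is added and always kept."""
    cols = list(range(X.shape[1]))
    while cols:
        fit = sm.Logit(y, sm.add_constant(X[:, cols])).fit(disp=0)
        pvals = fit.pvalues[1:]        # skip the intercept's p-value
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:      # all remaining predictors significant
            return cols, fit
        del cols[worst]                # remove the least significant predictor
    # fall back to the intercept-only model if everything is dropped
    return cols, sm.Logit(y, np.ones_like(y, dtype=float)).fit(disp=0)
```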
Hypothesis testing
i. Overall (global) test

$H_0: \beta_1 = \cdots = \beta_k = 0 \;\Rightarrow\; \text{logit}(\pi_i) = \beta_0$

$H_1:$ at least one of the $\beta_j \neq 0$; $j = 1, 2, \ldots, k$

Test statistic
• The test is based on the statistic $G^2$ under the null
hypothesis that the beta coefficients for the explanatory
variables in the model are equal to 0.
• The test statistic is given by:

$G^2 = -2 \ln\left(\dfrac{L_0}{L_F}\right)$

• where $L_0$ is the likelihood of the model with only an intercept
and $L_F$ is the likelihood of the full model (the model with all
explanatory variables).
47
Cont…
• A significant p-value (or $G^2 > \chi^2_{(k,\alpha)}$) provides evidence that at
least one of the regression coefficients for the explanatory
variables is non-zero (but it does not tell us which ones!)
• where k is the number of explanatory variables in the
logistic regression equation.

ii. Individual (partial) test
• The individual test examines whether a regression
coefficient in a logistic regression model is
significantly different from 0.

48
Cont…
• We need to test the hypothesis:

$H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$; $j = 1, 2, \ldots, k$

• If $\beta_j = 0$, the potential predictor variable ($X_j$; $j = 1, 2, 3, \ldots, k$)
included in the model has no effect on the odds of the
response variable taking the event of interest (Y = 1).
• The two commonly used individual test methods are:
i. the Wald test and
ii. the log-likelihood (likelihood-ratio) test
49
Cont…

i. Wald test
• Under the null hypothesis of zero slopes, and based on
asymptotic theory, the Wald statistic follows a chi-square
distribution with 1 degree of freedom.
• The Wald statistic is computed as:

$W^2 = \left( \dfrac{\hat\beta_j}{s.e.(\hat\beta_j)} \right)^2$

• Where $\hat\beta_j$ represents the estimated coefficient of $\beta_j$
and $s.e.(\hat\beta_j)$ is its standard error.
• We reject $H_0$ if $W^2 > \chi^2_{(1,\alpha)}$
50
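For instance, the Wald statistic reported in the earlier treatment example can be reproduced from B and its standard error; a small sketch assuming scipy:

```python
from scipy.stats import chi2

b, se = 0.981, 0.456                 # slope and SE from the treatment example
wald = (b / se) ** 2                 # Wald chi-square statistic
p = chi2.sf(wald, df=1)
print(round(wald, 3), round(p, 3))   # ~4.63 and ~0.031 (SPSS: 4.618, 0.032,
                                     # the small gap is rounding in b and se)
```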
Cont…
ii. Log-likelihood (likelihood-ratio) test
• To test $H_0: \beta_1 = 0$, we compare the fit (the log-likelihood)
of the full model:

$\ln\left(\dfrac{P(Y_i = 1)}{P(Y_i = 0)}\right) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$

to the fit of the reduced model:

$\ln\left(\dfrac{P(Y_i = 1)}{P(Y_i = 0)}\right) = \beta_0 + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$

51
Cont…
• To test $H_0$, we calculate the test statistic:

$G^2 = -2\ln\left(\dfrac{L_R}{L_F}\right) = -2(\ln L_R - \ln L_F)$

• where $L_R$ is the likelihood of the reduced model and $L_F$ is the
likelihood of the full model.
• If $H_0: \beta_1 = 0$ is true, the sampling distribution of $G^2$
is very close to a $\chi^2$ distribution with 1 degree
of freedom (df equal to the number of extra
parameters (regression coefficients) included in the
full model but not in the reduced model).
52
Cont…
• If the test is significant ($G^2 > \chi^2_{(1,\alpha)}$), the inclusion
of $X_1$ as a predictor variable makes the full model
a better fit to our data than the reduced model,
and therefore $H_0$ is rejected.
• We can do a similar model-comparison test for
the other predictor variables, $X_j$; $j = 2, 3, 4, \ldots, k$

53
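This full-versus-reduced comparison can be carried out directly from the fitted log-likelihoods; a minimal sketch, assuming statsmodels is available and that the helper name lr_test is illustrative:

```python
import statsmodels.api as sm
from scipy.stats import chi2

# Assumed setup: y is the binary outcome, X_full contains all k predictors and
# X_reduced drops the single column being tested (both without intercepts).
def lr_test(y, X_full, X_reduced):
    ll_full = sm.Logit(y, sm.add_constant(X_full)).fit(disp=0).llf
    ll_red = sm.Logit(y, sm.add_constant(X_reduced)).fit(disp=0).llf
    g2 = -2 * (ll_red - ll_full)      # G^2 = -2(ln L_R - ln L_F)
    return g2, chi2.sf(g2, df=1)      # df = 1 extra parameter in the full model
```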
Assessing the Fit of the Model
Goodness of fit
• Goodness of fit is an important diagnostic tool
for checking whether the model is adequate,
by determining how similar the observed Y-values
are to the expected or predicted Y-values.
• Among the well-known statistics for assessing the
goodness of fit of a logistic regression model, the Hosmer
and Lemeshow statistic is common.

54
Cont…
i. Hosmer-Lemeshow goodness-of-fit test
• The probability of the outcome event is estimated for
each subject using the estimated regression
coefficients, given the explanatory variables.
• These estimated subject probabilities are then classified
into g groups (usually 10 categories, defined by deciles).
• The 10% of subjects with the lowest estimated
probabilities form the first category, the next lowest
10% form the second category, and so on until the last
category is made up of those individuals with the
highest estimated probabilities.
55
Cont…
• The hypothesis is stated as:

$H_0: E[Y] = \dfrac{e^{b_0 + b_1 X_1 + \cdots + b_k X_k}}{1 + e^{b_0 + b_1 X_1 + \cdots + b_k X_k}}$

$H_1: E[Y] \neq \dfrac{e^{b_0 + b_1 X_1 + \cdots + b_k X_k}}{1 + e^{b_0 + b_1 X_1 + \cdots + b_k X_k}}$

Or
$H_0$: The model is a good fit to the data
$H_1$: The model is not a good fit to the data
• The null hypothesis says that the model is
"correct" (i.e. the only reason the observed
frequencies differ from the expected ones is
random variation).
56
Cont…
The Hosmer and Lemeshow test statistic ($X^2$)
• Obtained by computing the Pearson chi-square statistic
based on the observed and expected values from a g-by-2 table:

$X^2 = \sum_{j=1}^{g} \sum_{k=0}^{1} \dfrac{(O_{jk} - E_{jk})^2}{E_{jk}} \sim \chi^2_{g-2}$

• If the statistic is unusually large ($X^2 > \chi^2_{g-2}$ or
$P\text{-value} = P(\chi^2_{g-2} > X^2) < \alpha$),
then the differences between the observed and
expected values are greater than we would expect by
chance.
• This suggests that the model is not adequate (lack of
fit).

57
Cont…
• If the test statistic is small ($X^2 < \chi^2_{g-2}$ or
$P\text{-value} = P(\chi^2_{g-2} > X^2) > \alpha$), do not reject $H_0$.
• Non-significance of the Hosmer and Lemeshow test
indicates that the model's predictions do not differ
significantly from the observed outcome values.
• Thus, a non-significant test (p-value greater than 0.05)
suggests that the logistic regression model is an
adequate model for the data.

58
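Below is a minimal Python sketch of the Hosmer-Lemeshow computation (the helper name hosmer_lemeshow is illustrative; numpy and scipy are assumed). Note that packages differ in how tied probabilities are split across deciles, so results can differ slightly from SPSS.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    """y: 0/1 outcomes; p_hat: fitted probabilities; g: number of groups."""
    order = np.argsort(p_hat)
    groups = np.array_split(order, g)    # ~equal-sized decile groups
    x2 = 0.0
    for idx in groups:
        obs1 = y[idx].sum()              # observed events in the group
        exp1 = p_hat[idx].sum()          # expected events
        obs0 = len(idx) - obs1           # observed non-events
        exp0 = len(idx) - exp1           # expected non-events
        x2 += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return x2, chi2.sf(x2, df=g - 2)     # df = g - 2
```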
Cont…
ii. Analogues of the coefficient of determination
• These give an idea of the percentage of variation in the
response variable that is 'explained' by the model.
• Some of the commonly used pseudo coefficients of
determination (pseudo-R²s) for logistic regression are:
A. The log-likelihood ratio R² (aka McFadden's R²)

$R_L^2 = 1 - \dfrac{LL_F}{LL_0}$

Where $LL_0$ is the log-likelihood for the model with only the
intercept and
$LL_F$ is the log-likelihood for the model with all predictors.
59
Cont…
B. Cox and Snell's R²

$\text{Cox-Snell } R^2 = 1 - \left(\dfrac{L_0}{L_F}\right)^{2/n} = 1 - e^{-\frac{LL_0 - LL_F}{n}}$

Where:
$L_0$ is the likelihood of the intercept-only model
$L_F$ is the likelihood of the specified model
$LL_0$ is the -2 log-likelihood of the intercept-only/null model
$LL_F$ is the -2 log-likelihood of the specified model
n = number of observations/sample size


60
Cont…
C. Nagelkerke's R²
• The Nagelkerke R² adjusts the Cox-Snell R² so that the
range of possible values extends to 1:

$\text{Nagelkerke } R^2 = \dfrac{1 - \left(\dfrac{L_0}{L_F}\right)^{2/n}}{1 - L_0^{2/n}} = \dfrac{\text{Cox-Snell } R^2}{1 - e^{-\frac{LL_0}{n}}}$

 Pseudo coefficients of determination usually tend to be
smaller than the linear-regression R-square, and values of
0.2 to 0.4 are considered satisfactory.

61
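These quantities are easy to compute from fitted log-likelihoods; a minimal sketch using statsmodels conventions (llf and llnull are ordinary log-likelihoods, so the -2LL quantities above correspond to -2*llf and -2*llnull; the helper name pseudo_r2 is illustrative):

```python
import numpy as np

def pseudo_r2(llf, llnull, n):
    """llf, llnull: log-likelihoods of the fitted and intercept-only models."""
    mcfadden = 1 - llf / llnull
    cox_snell = 1 - np.exp(2 * (llnull - llf) / n)
    nagelkerke = cox_snell / (1 - np.exp(2 * llnull / n))
    return mcfadden, cox_snell, nagelkerke

# e.g. for the CHD example below: -2LL_F = 30.43, -2LL_0 = 44.987, n = 33
print(pseudo_r2(-30.43 / 2, -44.987 / 2, 33))  # Cox-Snell ~0.357, Nagelkerke ~0.479
```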
Cont…
iii. Information criteria
• In addition to the deviance ($G^2$) statistic and pseudo-$R^2$,
information criteria are also used to assess the goodness
of fit of different models.
A. The Akaike Information Criterion (AIC)
• The Akaike Information Criterion adjusts ('penalizes') the
residual deviance (model fit) for the number of
predictors, thus favoring parsimonious models:

$AIC = -2LL_F + 2p$

62
Cont…
B. Bayesian Information Criterion (BIC)

$BIC = -2LL_F + p\,\ln(n)$

Where:
$LL_F$ = log-likelihood of the full/fitted model
p is the number of parameters in the model
n is the sample size
• Smaller values of the information
criteria indicate a better-fitting model, and
• if many models have similarly low AICs and BICs, we
choose the one with the fewest model terms
(parameters).
63
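A small sketch of these formulas (statsmodels exposes them directly as result.aic and result.bic for a fitted Logit model, so this hand computation is just for comparison; the helper name aic_bic is illustrative):

```python
import numpy as np

def aic_bic(llf, p, n):
    """llf: log-likelihood of the fitted model; p: #parameters; n: sample size."""
    aic = -2 * llf + 2 * p
    bic = -2 * llf + p * np.log(n)
    return aic, bic
```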
Interpretation of logistic regression parameters
Directionality and magnitude of the relationship
• A positive relationship means an increase in the
independent variable is associated with an increase in
the predicted probability of the response variable taking
the event of interest, and vice versa.
• A negative relationship means an increase in the
independent variable is associated with a decrease in the
predicted probability of the response variable taking the
event of interest, and vice versa.
• The direction of the relationship is reflected differently
by the original coefficients (B) and the exponentiated logistic
regression coefficients (Exp(B)).
64
Cont…
A. Original coefficient (B)
• The sign of the original coefficient indicates the direction of the
relationship.
• A positive sign is associated with an increased probability
that the response variable will assume the event of
interest.
• A negative sign is associated with a decreased probability
that the response variable will assume the event of
interest.
• Therefore, for a unit change in a continuous variable,
or for a specified level of a categorical variable, the log odds
of the response variable taking the event of interest
increases or decreases by B.
65
Cont…
B. Exponentiated coefficients (Exp(B))
• Exponentiated coefficients are interpreted differently,
since they are exponentiated values of the original
coefficients and cannot be negative.
• An exponentiated coefficient (OR) above 1 represents a
positive relationship, and
• OR values less than 1 represent negative relationships.
• Accordingly, for a unit change in a continuous
variable, or for a specified level of a categorical variable, the
odds of the response variable taking the event of interest
are multiplied by Exp(B).

66
Cont…
• An exponentiated coefficient (OR) with the value 1 shows that a
unit change in the continuous variable, or a specified
level of the categorical variable, has no effect on the
odds of the response variable taking the event of interest.
• In this case, a 95% confidence interval of the odds ratio
includes the value 1, confirming that the predictor
variable is not statistically significant.
• The exponentiated coefficients can be expressed in
terms of the percentage change in the expected odds of
the dependent variable taking the desired event for a one-
unit change in the independent variable, holding the
other independent variables
constant/fixed/adjusted/controlled.
67
Cont…

• This is achieved by subtracting 1 from the
exponentiated coefficient (OR) and multiplying the
result by 100%:
 100(OR - 1)% = 100(exp(B) - 1)%
• For example, an OR of 2.67 corresponds to a
100(2.67 - 1)% = 167% increase in the expected odds.

Odds Ratio
• The odds ratio is the ratio of two odds and is a comparative
measure between two levels of a categorical variable or
for a unit change in a continuous variable.

68
Cont…
• The odds are given by:

$\dfrac{\pi_i}{1 - \pi_i} = e^{\beta_0 + \beta_1 X_{i1} + \cdots + \beta_k X_{ik}} = e^{\beta_0}\,(e^{\beta_1})^{X_{i1}} \cdots (e^{\beta_k})^{X_{ik}}$

• Thus, $e^{\beta_j}$ is the multiplicative effect on the odds
of the response variable assuming the event of
interest when $X_j$ increases by 1 unit, holding the
other X's constant.
Probability
• The logistic regression function can be
interpreted as the probability of the outcome
variable taking the value 1 (Y = 1).
69
Cont…

• The probability that the response variable will assume
the desired event for given values of the predictor
variables is predicted by:

$P(Y_i = 1 \mid X_j) = \hat\pi_i = \dfrac{e^{\hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_k X_{ik}}}{1 + e^{\hat\beta_0 + \hat\beta_1 X_{i1} + \cdots + \hat\beta_k X_{ik}}}$

70
Example
• The aim is to study the relationship between coronary
heart disease (CHD) and risk factors (age and sex).
• We want to answer the research question:
 Is there an association between coronary heart disease
and age and sex?
• Absence of coronary heart disease is coded as 0 and
presence of coronary heart disease is coded as 1.
• Table 1.2 on the next slide presents the coronary
heart disease (CHD) status of 33 individuals with
their respective age and sex.

71
Cont…
• Table 1.2. Status of coronary heart disease for 33
individuals with their respective age and sex (male = 1,
female = 0)

CHD  Age  Sex   CHD  Age  Sex   CHD  Age  Sex
  0   22    1     0   40    1     0   54    0
  0   23    0     1   41    1     1   55    1
  0   24    1     0   46    0     1   58    1
  0   27    0     0   47    0     1   60    1
  0   28    1     0   48    0     0   60    0
  0   30    0     1   49    0     1   62    1
  0   30    1     0   49    1     1   65    1
  0   32    0     1   50    0     1   67    1
  0   33    0     0   51    0     1   71    1
  1   35    1     1   51    1     1   77    0
  0   38    0     0   52    0     1   81    1

72
Cont…
To facilitate modelling, we categorize age into two
categories (<= 50 years and > 50 years):
Variable descriptions:
• Response variable:
 CHD: 1 = presence / 0 = absence
• Independent variables:
 Age category:
 agecateg: 1 = <= 50 years (reference category), 2 = > 50 years
 sex: 0 = female (reference category), 1 = male
73
Cont…
• The data with age categorized into two groups:

CHD  Age  Age cat.  Sex   CHD  Age  Age cat.  Sex   CHD  Age  Age cat.  Sex
  0   22      1      1      0   40      1      1      0   54      2      0
  0   23      1      0      1   41      1      1      1   55      2      1
  0   24      1      1      0   46      1      0      1   58      2      1
  0   27      1      0      0   47      1      0      1   60      2      1
  0   28      1      1      0   48      1      0      0   60      2      0
  0   30      1      0      1   49      1      0      1   62      2      1
  0   30      1      1      0   49      1      1      1   65      2      1
  0   32      1      0      1   50      1      0      1   67      2      1
  0   33      1      0      0   51      2      0      1   71      2      1
  1   35      1      1      1   51      2      1      1   77      2      0
  0   38      1      0      0   52      2      0      1   81      2      1
74
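As a cross-check on the SPSS output that follows, here is a minimal statsmodels sketch fitting this model; this sketch assumes that coding age > 50 as 1 (age <= 50 as 0) and male sex as 1 reproduces SPSS's reference coding.

```python
import numpy as np
import statsmodels.api as sm

# (CHD, age, sex) triples transcribed from Table 1.2
data = [(0,22,1),(0,23,0),(0,24,1),(0,27,0),(0,28,1),(0,30,0),(0,30,1),
        (0,32,0),(0,33,0),(1,35,1),(0,38,0),(0,40,1),(1,41,1),(0,46,0),
        (0,47,0),(0,48,0),(1,49,0),(0,49,1),(1,50,0),(0,51,0),(1,51,1),
        (0,52,0),(0,54,0),(1,55,1),(1,58,1),(1,60,1),(0,60,0),(1,62,1),
        (1,65,1),(1,67,1),(1,71,1),(1,77,0),(1,81,1)]
arr = np.array(data)
y = arr[:, 0]
agecat = (arr[:, 1] > 50).astype(int)   # 1 if age > 50, else 0 (reference)
sex = arr[:, 2]                         # 1 = male, 0 = female (reference)

X = sm.add_constant(np.column_stack([agecat, sex]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)                # expected roughly: -2.536, 2.288, 2.126
print(np.exp(fit.params[1:]))    # odds ratios ~9.86 and ~8.38
```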
SPSS output of logistic regression analysis:
• Table 1.3. Classification table

Classification Table a,b
                                               Predicted
                                     Sign of coronary heart disease   Percentage
Observed                             absence of CHD  presence of CHD     Correct
Step 0  Sign of coronary   absence of CHD    19              0            100.0
        heart disease      presence of CHD   14              0               .0
        Overall Percentage                                                 57.6
a. Constant is included in the model.
b. The cut value is .500

• Since more than 50% of the people in the sample did not
develop CHD, the best prediction for each case (if we have no
additional information) is that they did not develop CHD.
• We would be correct for 57.6% of the cases, because
19/33 (57.6%) actually did not develop CHD.
75
Cont…
Null model (constant only)

Variables in the Equation
                     B     S.E.   Wald   df   Sig.   Exp(B)
Step 0  Constant  -.305   .352    .752    1   .386    .737

• P(CHD present) = 14/33 = 0.424,
• P(CHD absent) = 19/33 = 0.576, then
• Odds of CHD = 0.424/0.576 = 0.737 (odds of developing
CHD).
• B = ln(odds of CHD) = ln(0.737) = -0.305
• EXP(B) = EXP(-0.305) = 0.737
76
Log-Likelihood ratio test
• The statistic measures the amount by which -2LL (the deviance,
analogous to the residual sum of squares in linear regression)
is reduced using the full model compared to the null model.
• The Omnibus Tests of Model Coefficients table is used to check
whether the full model (with all explanatory variables
included) is an improvement over the null model.
• It uses chi-square tests to see if there is a significant
difference between the log-likelihoods (-2LLs) of the null
model and the full model.
• If the full model has a significantly reduced -2LL
compared to the null model, this suggests that the full
model explains more of the variance in the response
variable (CHD) than the null model.
77
Cont…
• Here, the chi-square is highly significant (chi-
square = 14.557, df = 2, p = 0.001).
• So our full model (the model with both independent
variables: age and sex) is significantly better than the null
model (the model without independent variables).

Omnibus Tests of Model Coefficients
                Chi-square   df   Sig.
Step 1  Step        14.557    2   .001
        Block       14.557    2   .001
        Model       14.557    2   .001

78
Cont…
Under Model Summary of the SPSS output,
• the -2 Log Likelihood statistic for the full model is 30.43.
• Although SPSS does not give us this statistic for the null
model (the model that had only the intercept), from the
Omnibus Tests of Model Coefficients output, we know
that the -2LL of the null model is reduced by 14.557.
• Since the -2LL of the full model is 30.43, the -2LL of the null
model equals 44.987 (30.43 + 14.557).
• Adding age and sex reduced the -2 Log Likelihood
statistic by 14.557 (44.987 - 30.43).

79
Cont…
• The reduction is significant (Chi-square = 14.557,
df = 2; p-value = 0.001).
• df = number of parameters estimated in the full
model (constant, age, sex) minus number of
parameters estimated in the reduced model
(constant only). Thus, df = 3 - 1 = 2.

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      30.430 a                  .357                   .479
a. Estimation terminated at iteration number 5 because
parameter estimates changed by less than .001.

80
Cont…
• The pseudo-R² values tell us approximately how much of the
variation in the outcome is explained by the model.
• The Nagelkerke R² (0.479) suggests that the model
explains roughly 47.9% of the variation in the outcome
variable.
• The Hosmer & Lemeshow goodness-of-fit test is
non-significant (Chi-square = 5.172, df = 2, p-value = 0.075),
suggesting that the model is a good fit to the data.

Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      5.172         2   .075

81
Cont…
• Another useful output is the Classification Table for the full model.
• The model that includes the explanatory variables (age and sex)
correctly classifies the outcome variable for 84.8% of the cases,
compared to 57.6% correct classification of the outcome variable
by the null model.
• The full model shows a marked improvement over the null model.

Classification Table a
                                               Predicted
                                     Sign of coronary heart disease   Percentage
Observed                             absence of CHD  presence of CHD     Correct
Step 1  Sign of coronary   absence of CHD    19              0            100.0
        heart disease      presence of CHD    5              9             64.3
        Overall Percentage                                                 84.8
a. The cut value is .500
82
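To see how this table is produced from the fitted probabilities, here is a minimal sketch continuing the statsmodels fit above (the .500 cut value matches the table footnote):

```python
import numpy as np

p_hat = fit.predict()                  # fitted probabilities from the model above
pred = (p_hat >= 0.5).astype(int)      # classify using the .500 cut value

accuracy = (pred == y).mean()          # overall percentage correct, ~0.848
sens = pred[y == 1].mean()             # fraction of CHD cases detected, ~0.643
spec = 1 - pred[y == 0].mean()         # fraction of non-cases detected, ~1.0
print(accuracy, sens, spec)
```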
Cont…
• The output in the Variables in the Equation table provides
the regression coefficients (B), the Wald statistic (to test the
statistical significance) and
• the Odds Ratio (Exp(B)) for each variable category.
• Logistic regression coefficients:

Variables in the Equation
                                                              95% C.I. for EXP(B)
                          B      S.E.   Wald   df   Sig.   Exp(B)   Lower    Upper
Step 1a   agecateg(1)   2.288    .938   5.956   1   .015    9.858   1.569   61.932
          sex(1)        2.126    .953   4.975   1   .026    8.383   1.294   54.304
          Constant     -2.536    .919   7.605   1   .006     .079
a. Variable(s) entered on step 1: agecateg, sex.

83
Cont…
• Looking first at the results for agecateg(1), there is a
significant overall effect of age category (B = 2.288, SE = 0.938,
Wald = 5.956, df = 1, p = 0.015).
• The B coefficient for agecateg(1) is significant and positive,
indicating that the higher age category is associated with
increased odds of developing CHD.
• The Exp(B) column (the Odds Ratio) tells us that the
odds of developing CHD for an individual aged over
50 years were 9.86 times the odds for an
individual aged 50 years or below (our
reference category), controlling for sex (AOR = 9.86; 95%
CI: [1.57, 61.93]).
84
Cont…
• The effect of sex is also significant and positive
(B = 2.126, SE = 0.953, Wald = 4.975, df = 1,
p = 0.026), indicating that men are more likely to
develop CHD than women.
• The OR estimate shows that the odds of
developing CHD for a male were 8.38 times the
odds for a female, adjusting for age (AOR = 8.38; 95%
CI: [1.29, 54.30]).

85
Summary
Logistic regression equation:
Log(odds of CHD) = -2.536 + 2.288·age(>50 years) + 2.126·sex(male)
• Both independent variables (age and sex) were significant predictors of
CHD (age: B = 2.29, SE = 0.94, Wald = 5.956, df = 1, p-value = 0.015; sex:
B = 2.13, SE = 0.95, Wald = 4.975, df = 1, p-value = 0.026).
• The odds of developing coronary heart disease for an individual in the
age category above 50 years are 9.9 times the odds for an
individual in the age category of 50 years or below, controlling for sex
(AOR = 9.86; 95% CI: [1.57, 61.93]).
• The odds of developing coronary heart disease for a male were 8.4 times
the odds for a female, controlling for age (AOR = 8.38; 95% CI: [1.29, 54.30]).
• In conclusion,
 age above 50 years and male sex were statistically significant risk
factors for CHD.

86
