Logistic Regression (2022)
SHAHRUL AIMAN BIN SOELAR
Clinical Research Centre, Hospital Sultanah Bahiyah
CHOOSING THE Correct Statistical Test
Dependent (Outcome) Variable: Categorical Data, Two (2) categories
Number of Independent (Predictor) Variables:
• One (Independent): Simple Logistic Regression
▪ Numerical (age)
▪ Categorical (Malay, Chinese and Indian)
• ≥ Two (Independent): Multiple Logistic Regression, with Variable Selection
▪ Categorical or Numerical or mix (gender and age)
Logistic Regression
[Diagram: the factors X (Factor 1 to Factor 5) are used to predict the outcome Y]
Logistic Regression
H0: Y does not depend on any of the Xi’s
Ha: Y depends on at least one of the Xi’s
[Diagram: two panels, one per hypothesis, each showing the factors X (Factor 1 to Factor 5) predicting the outcome Y]
Logistic Regression
              Cancer
              Yes   No
Exposure Yes   a     b
         No    c     d
• Odds Ratio = $\dfrac{a/b}{c/d}$
• Risk Ratio = $\dfrac{a/(a+b)}{c/(c+d)}$
• Odds Ratio
✓ Cohort Study
✓ Case Control Study
✓ Cross-sectional Study
✓ Odds (times)
✓ Diagnostic test
• Risk Ratio
✓ Cohort Study
✓ Randomised Controlled Trials
✓ Probability (%)
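As a minimal sketch, both ratios can be computed directly from the four cells of a 2×2 table. The counts below are made up purely for illustration, and the helper names `odds_ratio` and `risk_ratio` are hypothetical.

```r
# Hypothetical 2x2 counts: rows = exposure (Yes/No), columns = cancer (Yes/No)
tab <- matrix(c(30, 70,    # a, b
                10, 90),   # c, d
              nrow = 2, byrow = TRUE,
              dimnames = list(Exposure = c("Yes", "No"),
                              Cancer   = c("Yes", "No")))

# Odds ratio: (a/b) / (c/d)
odds_ratio <- function(tab) {
  (tab[1, 1] / tab[1, 2]) / (tab[2, 1] / tab[2, 2])
}

# Risk ratio: [a/(a+b)] / [c/(c+d)]
risk_ratio <- function(tab) {
  (tab[1, 1] / sum(tab[1, ])) / (tab[2, 1] / sum(tab[2, ]))
}

odds_ratio(tab)
risk_ratio(tab)
```

The two measures differ whenever the outcome is not rare, which is one reason to report which one is being used.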
Logistic Regression
[Figure: odds ratio scale from 0 to ∞ with Odds Ratio = 1 in the middle; values below 1 indicate a low risk factor, values above 1 a high risk factor]
• To interpret an Odds Ratio (OR), compare its value to 1:
▪ If OR<1, group A has lower odds of having the event than group B (reference category).
➢ (a negative or protective association between factor and outcome)
▪ If OR=1, groups A and B have the same odds of having the event.
➢ (no association between factor and outcome)
▪ If OR>1, group A has higher odds of having the event than group B (reference category).
➢ (a positive association between factor and outcome)
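The rule above can be sketched as a small helper. It goes slightly beyond the slide by using the confidence interval rather than the point estimate alone: when the interval contains 1, the association is not statistically significant. The function name `interpret_or` and the example values are hypothetical.

```r
# Hypothetical helper: classify an odds ratio using its confidence interval
interpret_or <- function(or, lower, upper) {
  stopifnot(lower <= or, or <= upper)
  if (lower > 1) {
    "positive association"               # whole interval above 1
  } else if (upper < 1) {
    "negative (protective) association"  # whole interval below 1
  } else {
    "no significant association"         # interval contains 1
  }
}

interpret_or(2.0, 1.2, 3.3)
interpret_or(0.5, 0.3, 0.8)
interpret_or(1.3, 0.8, 2.1)
```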
Logistic Regression
• 4 Assumptions
✓ There must be at least two cases for each category of the dependent variable
✓ Overall model fitness (MULTIPLE logistic regression ONLY)
a) STEP 1: Checking multicollinearity (Variance Inflation Factor)
b) STEP 2: Checking outliers (Cook’s Distance)
c) STEP 3: Checking model fit (Hosmer-Lemeshow goodness-of-fit test)
• OBJECTIVE: To identify risk factors (age, sex, diabetes mellitus (DM), hypertension (HPT) and exercise)
associated with hypercholesterolemia (HC)
• Import the file Hypercholesterol(Logistic).xlsx
Simple Logistic Regression: one numerical independent variable (age)
01 Data Exploration
## Combine specific results into one table
desc99 <- cbind("Variable"=c("Age(year)"),
"Less.Mean"=c(desc01$No$mean),
"Less.SD"=c(desc01$No$sd),"n"=c(""),"(%)"=c(""),
"More.Mean"=c(desc01$Yes$mean),
"More.SD"=c(desc01$Yes$sd),"n"=c(""),"(%)"=c(""))
desc99
Variable Less.Mean Less.SD n (%) More.Mean More.SD n (%)
[1,] "Age(year)" "38.3125" "4.83354934811876" "" "" "42.5892857142857" "4.68567172137779" "" ""
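The object `desc01` used above holds the mean and SD of age for each outcome group. One way to build an object with the same `desc01$No$mean` shape in base R is sketched below. The data are simulated and the column names `age` and `hyperchol` are assumptions, so the numbers will not match the table above.

```r
# Simulated data for illustration only (not the workshop dataset)
set.seed(123)
dat <- data.frame(
  age       = round(rnorm(100, mean = 40, sd = 5)),
  hyperchol = factor(sample(c("No", "Yes"), 100, replace = TRUE),
                     levels = c("No", "Yes"))
)

# Descriptive statistics of age within each outcome group
desc01 <- lapply(split(dat$age, dat$hyperchol),
                 function(x) list(n = length(x), mean = mean(x), sd = sd(x)))

desc01$No$mean
desc01$Yes$sd

# Rounded "mean (SD)" strings are usually easier to read in a table
sapply(desc01, function(d) sprintf("%.2f (%.2f)", d$mean, d$sd))
```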
02 Test statistic using R
Coefficients:
(Intercept) age
-8.5878 0.1888
CI <- as.data.frame(exp(confint(model01)))
CI
2.5 % 97.5 %
(Intercept) 6.831754e-06 0.003609845
age 1.124103e+00 1.307493754
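The objects `model01`, `OR`, `CI` and `PV` used in these steps could be produced along the following lines. This is a sketch on simulated data, with assumed column names `age` and `hyperchol`, so its estimates will differ from the output shown above.

```r
# Simulated data for illustration only (not the workshop dataset)
set.seed(123)
n   <- 200
age <- rnorm(n, mean = 40, sd = 5)
hyperchol <- factor(ifelse(runif(n) < plogis(-8.5 + 0.19 * age), "Yes", "No"),
                    levels = c("No", "Yes"))
dat <- data.frame(age, hyperchol)

# Simple logistic regression: one numerical independent variable
model01 <- glm(hyperchol ~ age, family = binomial, data = dat)

# Odds ratios, 95% confidence intervals and p-values
OR <- exp(coef(model01))
CI <- as.data.frame(exp(suppressMessages(confint(model01))))
PV <- summary(model01)$coefficients[, "Pr(>|z|)"]

OR
CI
PV
```

Exponentiating the coefficients and their confidence limits moves them from the log-odds scale to the odds ratio scale.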
## Combine specific results into one table
model99 <- cbind("OR"=c(OR[-1]),
"CI"=c(paste0("(",format(round(CI$`2.5 %`[-1],2),nsmall = 2),",",
format(round(CI$`97.5 %`[-1],2),nsmall = 2),")")),
"pvalue"=c(PV[-1]))
model99
OR CI pvalue
age "1.20780845923358" "(1.12,1.31)" "8.67616705140556e-07"
Odds Ratio
• The result shows that age (p-value <0.001) was significantly associated with
hypercholesterolemia (HC).
• Each one-year increase in age multiplies the odds of having HC by 1.21
(95% CI 1.12 to 1.31).
• For example, 31-year-old people have 1.21 times the odds of having HC
compared to 30-year-old people.
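The link between the fitted coefficients and this interpretation can be checked by hand. The sketch below uses the intercept (-8.5878) and age coefficient (0.1888) reported above; the ages 30 and 31 and the helper `odds` are only illustrative.

```r
# Coefficients reported in the output above (log-odds scale)
b0    <- -8.5878
b_age <-  0.1888

# Odds ratio for a one-year and a ten-year increase in age
or_1yr  <- exp(b_age)
or_10yr <- exp(10 * b_age)

# Predicted probabilities of HC at ages 30 and 31
p30 <- plogis(b0 + b_age * 30)
p31 <- plogis(b0 + b_age * 31)

# Convert probability to odds; the ratio of the two odds equals or_1yr
odds <- function(p) p / (1 - p)
odds(p31) / odds(p30)

round(or_1yr, 2)
```

The odds ratio is constant across ages, but the change in predicted probability is not, which is why the result is stated in terms of odds.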
04 Data Presentation
Simple Logistic Regression: one categorical independent variable (DM)
01 Data Exploration
## Row Percentages
prop01 <- desc01/rowSums(desc01)*100
prop01
No Yes
No DM 82.02247 17.97753
Controlled DM 70.88608 29.11392
Uncontrolled DM 46.87500 53.12500
## Column Percentages
prop01 <- t(t(desc01)/colSums(desc01)*100)
prop01
No Yes
No DM 50.69444 28.57143
Controlled DM 38.88889 41.07143
Uncontrolled DM 10.41667 30.35714
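The same row and column percentages can be obtained with `prop.table()`. In this sketch the cell counts are back-calculated from the percentages shown above, so treat them as an assumption rather than as the workshop data.

```r
# Counts back-calculated from the percentages above (assumption)
desc01 <- matrix(c(73, 16,
                   56, 23,
                   15, 17),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("No DM", "Controlled DM", "Uncontrolled DM"),
                                 c("No", "Yes")))

# margin = 1 gives row percentages, margin = 2 gives column percentages
row_pct <- prop.table(desc01, margin = 1) * 100
col_pct <- prop.table(desc01, margin = 2) * 100

round(row_pct, 2)
round(col_pct, 2)
```

Row percentages answer "what share of each DM group has HC", while column percentages describe the DM mix within each outcome group.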
## Combine specific results into one table
desc99 <- cbind("Variable"=c(dimnames(desc01)[[1]]),
"Less.Mean"=c(""),"Less.SD"=c(""),
"n"=c(desc01[,1]),"(%)"=c(prop01[,1]),
"More.Mean"=c(""),"More.SD"=c(""),
"n"=c(desc01[,2]),"(%)"=c(prop01[,2]))
02 Test statistic using R
Coefficients:
(Intercept) dmControlled DM dmUncontrolled DM
-1.518 0.628 1.643
CI <- as.data.frame(exp(confint(model01)))
CI
2.5 % 97.5 %
(Intercept) 0.1231741 0.3662076
dmControlled DM 0.9115646 3.9310545
dmUncontrolled DM 2.1662191 12.7155635
## Combine specific results into one table
model99 <- cbind("OR"=c(OR[-1]),
"CI"=c(paste0("(",format(round(CI$`2.5 %`[-1],2),nsmall = 2),",",
format(round(CI$`97.5 %`[-1],2),nsmall = 2),")")),
"pvalue"=c(PV[-1]))
model99
OR CI pvalue
dmControlled DM "1.87388392857446" "(0.91, 3.93)" "0.0903761836323606"
dmUncontrolled DM "5.17083333333328" "(2.17,12.72)" "0.000253674565871809"
Odds Ratio
• The controlled DM group has 1.87 times the odds of having HC compared to the
non-DM group (95% CI 0.91 to 3.93; p=0.090, not statistically significant).
• The uncontrolled DM group has 5.17 times the odds of having HC compared to the
non-DM group (95% CI 2.17 to 12.72; p<0.001).
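With a single categorical predictor, the odds ratios from the model equal the odds ratios computed by hand from the cross-table. The sketch below fits the model on grouped counts; the counts are back-calculated from the percentages reported earlier and are therefore an assumption, and the column names `yes` and `no` are hypothetical.

```r
# Counts back-calculated from the reported percentages (assumption)
counts <- data.frame(
  dm  = factor(c("No DM", "Controlled DM", "Uncontrolled DM"),
               levels = c("No DM", "Controlled DM", "Uncontrolled DM")),
  yes = c(16, 23, 17),   # with hypercholesterolemia
  no  = c(73, 56, 15)    # without hypercholesterolemia
)

# Grouped-data logistic regression; "No DM" is the reference category
model01 <- glm(cbind(yes, no) ~ dm, family = binomial, data = counts)
OR <- exp(coef(model01))
round(OR[-1], 2)

# The same odds ratios by hand: odds in each group / odds in the reference group
group_odds <- counts$yes / counts$no
group_odds[-1] / group_odds[1]
```

The first level of the factor is the reference category, so changing the order of `levels` changes which group the odds ratios are compared against.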
04 Data Presentation
Multiple Logistic Regression: two or more independent variables, with variable selection
An independent variable with a p-value less than 0.05 in the simple logistic regression contributes
significantly to the prediction of the dependent variable.
>> Thus, the variable can be included in the Multiple Logistic Regression.
01 Data Exploration
Assumption
✓ There must be at least two cases for each category of the dependent variable
02 Test statistic using R
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2956  -0.6510  -0.3542   0.5696   2.4039
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -10.77358    1.98224  -5.435 5.48e-08 ***
age                 0.24480    0.04727   5.179 2.23e-07 ***
dmControlled DM     0.95302    0.44096   2.161  0.03068 *
dmUncontrolled DM   1.58343    0.53820   2.942  0.00326 **
hptYes             -0.03546    0.38861  -0.091  0.92729
exerciseYes        -1.96524    0.43911  -4.476 7.62e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
hpt does not contribute significantly to the model because its p-value (0.927) is higher than 0.05.
We therefore decided to remove hpt from the model.
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2865  -0.6462  -0.3551   0.5684   2.4120
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -10.77819    1.98047  -5.442 5.26e-08 ***
age                 0.24455    0.04715   5.187 2.14e-07 ***
dmControlled DM     0.95204    0.44090   2.159  0.03082 *
dmUncontrolled DM   1.57796    0.53469   2.951  0.00317 **
exerciseYes        -1.96726    0.43868  -4.485 7.31e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
All remaining factors contribute significantly to the model because their p-values are less than 0.05.
This is therefore the final model using the ENTER METHOD.
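The enter-method steps above can be sketched as follows: fit the full model, inspect the p-values, then refit without the non-significant variable. The data are simulated (with no true hpt effect), and the object name `model0A` is an assumption; only `model0B` appears in the original code.

```r
# Simulated data for illustration only (not the workshop dataset)
set.seed(2022)
n <- 200
dm_levels <- c("No DM", "Controlled DM", "Uncontrolled DM")
dat <- data.frame(
  age      = rnorm(n, mean = 40, sd = 5),
  dm       = factor(sample(dm_levels, n, replace = TRUE), levels = dm_levels),
  hpt      = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  exercise = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
lp <- with(dat, -10 + 0.24 * age + 0.9 * (dm == "Controlled DM") +
             1.6 * (dm == "Uncontrolled DM") - 2 * (exercise == "Yes"))
dat$hyperchol <- factor(ifelse(runif(n) < plogis(lp), "Yes", "No"),
                        levels = c("No", "Yes"))

# Full model with every candidate variable entered at once
model0A <- glm(hyperchol ~ age + dm + hpt + exercise,
               family = binomial, data = dat)
round(summary(model0A)$coefficients[, "Pr(>|z|)"], 4)

# Refit without hpt and compare the two nested models
model0B <- update(model0A, . ~ . - hpt)
anova(model0B, model0A, test = "Chisq")
```

The likelihood-ratio test from `anova()` is a useful cross-check on the Wald p-value before dropping a variable.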
Null model (hyperchol ~ 1)
Coefficients:
(Intercept)
    -0.9445
Full model (hyperchol ~ age + dm + hpt + exercise)
Coefficients:
      (Intercept)                age    dmControlled DM  dmUncontrolled DM             hptYes        exerciseYes
        -10.77358            0.24480            0.95302            1.58343           -0.03546           -1.96524
Initial Model (forward selection)
• hyperchol ~ 1 with an AIC of 239.18
• A smaller AIC indicates a better model, so a variable is added when it lowers the AIC
• Thus, age is included in the next step, giving a new AIC of 211.08
Model 1
• hyperchol ~ age with an AIC of 211.08
• exercise is included in the next step, giving a new AIC of 183.67
Model 2
• hyperchol ~ age + exercise with an AIC of 183.67
• dm is included in the next step, giving a new AIC of 177.40
Model 3
• hyperchol ~ age + exercise + dm with an AIC of 177.40
• hpt cannot be included because the new AIC would be larger than
the current AIC
Initial Model (backward elimination)
• hyperchol ~ age + dm + hpt + exercise with an AIC of 177.40
• A smaller AIC indicates a better model, so a variable is removed when that does not raise the AIC
• Thus, hpt is excluded in the next step, giving a new AIC of 177.40
Model 1
• hyperchol ~ age + exercise + dm with an AIC of 177.40
• dm, exercise and age cannot be excluded because the new
AIC would be larger than the current AIC
Forward selection, initial model:
hyperchol ~ 1
Backward elimination, initial model:
hyperchol ~ age + dm + hpt + exercise
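Both AIC-based procedures can be run with `step()` from base R. The following is a minimal sketch on simulated data, so the selected variables and AIC values will not match those listed above; the object names `null_model`, `full_model`, `fwd` and `bwd` are assumptions.

```r
# Simulated data for illustration only (not the workshop dataset)
set.seed(2022)
n <- 200
dm_levels <- c("No DM", "Controlled DM", "Uncontrolled DM")
dat <- data.frame(
  age      = rnorm(n, mean = 40, sd = 5),
  dm       = factor(sample(dm_levels, n, replace = TRUE), levels = dm_levels),
  hpt      = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  exercise = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
lp <- with(dat, -10 + 0.24 * age + 0.9 * (dm == "Controlled DM") +
             1.6 * (dm == "Uncontrolled DM") - 2 * (exercise == "Yes"))
dat$hyperchol <- factor(ifelse(runif(n) < plogis(lp), "Yes", "No"),
                        levels = c("No", "Yes"))

null_model <- glm(hyperchol ~ 1, family = binomial, data = dat)
full_model <- glm(hyperchol ~ age + dm + hpt + exercise,
                  family = binomial, data = dat)

# Forward selection: start from the null model and add variables while AIC falls
fwd <- step(null_model,
            scope = list(lower = ~ 1, upper = ~ age + dm + hpt + exercise),
            direction = "forward", trace = 0)

# Backward elimination: start from the full model and drop variables while AIC falls
bwd <- step(full_model, direction = "backward", trace = 0)

formula(fwd); AIC(fwd)
formula(bwd); AIC(bwd)
```

Setting `trace = 1` prints each step with its AIC, which is the form of output summarised in the Model 1, Model 2 and Model 3 notes.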
## Combine specific results into one table
model99 <- cbind("OR"=c(OR[-1]),
"CI"=c(paste0("(",format(round(CI$`2.5 %`[-1],2),nsmall = 2),",",
format(round(CI$`97.5 %`[-1],2),nsmall = 2),")")),
"pvalue"=c(PV[-1]))
model99
OR CI pvalue
age "1.27704166910185" "(1.17, 1.41)" "2.13878673212258e-07"
dmControlled DM "2.59099159159642" "(1.11, 6.29)" "0.0308245651537937"
dmUncontrolled DM "4.84507920704878" "(1.73,14.25)" "0.00316560843115176"
exerciseNo "7.15104896668564" "(3.15,17.80)" "7.30812725789492e-06"
## Combine specific results into one table
outFULL <- rbind(cbind("Age in years",t(model99[1,])), #Age in years
cbind("Diabetes Mellitus","","",""), #Diabetes Mellitus
cbind(model0B$xlevels$dm,
rbind(cbind("1.00","(ref.)",""),model99[2:3,])),
cbind("Exercise","","",""), #Exercise
cbind(model0B$xlevels$exercise,
rbind(cbind("1.00","(ref.)",""),model99[4,])))
rownames(outFULL) <- NULL
outFULL
OR CI pvalue
[1,] "Age in years" "1.27704166910185" "(1.17, 1.41)" "2.13878673212258e-07"
[2,] "Diabetes Mellitus" "" "" ""
[3,] "No DM" "1.00" "(ref.)" ""
[4,] "Controlled DM" "2.59099159159642" "(1.11, 6.29)" "0.0308245651537937"
[5,] "Uncontrolled DM" "4.84507920704878" "(1.73,14.25)" "0.00316560843115176"
[6,] "Exercise" "" "" ""
[7,] "Yes" "1.00" "(ref.)" ""
[8,] "No" "7.15104896668564" "(3.15,17.80)" "7.30812725789492e-06"
03 Checking assumptions using R
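The three model-fitness checks (multicollinearity, outliers, goodness of fit) can be sketched in base R as below. This is an illustration on simulated data, not the workshop's own code. The VIF is computed column by column from the design matrix (packages such as car report a generalized VIF for multi-level factors instead), the 4/n cut-off for Cook's distance is a common rule of thumb, and `hosmer_lemeshow` is a hypothetical hand-written helper.

```r
# Simulated data for illustration only (not the workshop dataset)
set.seed(2022)
n <- 200
dm_levels <- c("No DM", "Controlled DM", "Uncontrolled DM")
dat <- data.frame(
  age      = rnorm(n, mean = 40, sd = 5),
  dm       = factor(sample(dm_levels, n, replace = TRUE), levels = dm_levels),
  exercise = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
lp <- with(dat, -10 + 0.24 * age + 0.9 * (dm == "Controlled DM") +
             1.6 * (dm == "Uncontrolled DM") - 2 * (exercise == "Yes"))
dat$hyperchol <- factor(ifelse(runif(n) < plogis(lp), "Yes", "No"),
                        levels = c("No", "Yes"))
model0B <- glm(hyperchol ~ age + dm + exercise, family = binomial, data = dat)

# Assumption: at least two cases in each category of the dependent variable
table(dat$hyperchol)

# STEP 1: column-wise VIF = 1 / (1 - R^2) of each predictor column on the others
X <- model.matrix(model0B)[, -1, drop = FALSE]
vif <- sapply(colnames(X), function(v) {
  r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared
  1 / (1 - r2)
})
round(vif, 2)

# STEP 2: Cook's distance; list observations above the 4/n rule of thumb
cd <- cooks.distance(model0B)
which(cd > 4 / n)

# STEP 3: Hosmer-Lemeshow test on g groups of the fitted probabilities
hosmer_lemeshow <- function(y, p, g = 10) {
  breaks <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  grp    <- cut(p, breaks = breaks, include.lowest = TRUE)
  obs    <- tapply(y, grp, sum)      # observed events per group
  expd   <- tapply(p, grp, sum)      # expected events per group
  size   <- tapply(y, grp, length)   # group sizes
  chisq  <- sum((obs - expd)^2 / (expd * (1 - expd / size)))
  df     <- nlevels(grp) - 2
  list(statistic = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}
hl <- hosmer_lemeshow(as.numeric(dat$hyperchol == "Yes"), fitted(model0B))
hl$p.value
```

A Hosmer-Lemeshow p-value above 0.05 gives no evidence of poor fit; it does not by itself prove that the model fits well.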
outFULL
OR CI pvalue
[1,] "Age in years" "1.27704166910185" "(1.17, 1.41)" "2.13878673212258e-07"
[2,] "Diabetes Mellitus" "" "" ""
[3,] "No DM" "1.00" "(ref.)" ""
[4,] "Controlled DM" "2.59099159159642" "(1.11, 6.29)" "0.0308245651537937"
[5,] "Uncontrolled DM" "4.84507920704878" "(1.73,14.25)" "0.00316560843115176"
[6,] "Exercise" "" "" ""
[7,] "Yes" "1.00" "(ref.)" ""
[8,] "No" "7.15104896668564" "(3.15,17.80)" "7.30812725789492e-06"
Odds Ratio
➢ AGE
• Each one-year increase in age multiplies the odds of having HC by 1.28 (95% CI 1.17 to 1.41).
➢ DIABETES MELLITUS
• The controlled DM group has 2.59 times the odds of having HC compared to the non-DM group (95% CI 1.11 to 6.29).
• The uncontrolled DM group has 4.85 times the odds of having HC compared to the non-DM group (95% CI 1.73 to 14.25).
➢ EXERCISE
• The non-exercise group has 7.15 times the odds of having HC compared to the exercise group (95% CI 3.15 to 17.80).
05 Data Presentation
“A logistic regression was performed to study the effects of age, diabetes mellitus and exercise on the likelihood that patients
have hypercholesterolemia. Results indicated that age (p<0.001), controlled diabetes mellitus (p=0.031), uncontrolled
diabetes mellitus (p=0.003) and non-exercise (p<0.001) are statistically significant factors for hypercholesterolemia. Each
one-year increase in age multiplies the odds of having hypercholesterolemia by 1.28. The controlled and
uncontrolled diabetes mellitus groups have 2.59 and 4.85 times the odds of having hypercholesterolemia, respectively, compared to
the non-diabetes mellitus group. The non-exercise group has 7.15 times the odds of having hypercholesterolemia
compared to the exercise group.”
05 Data Presentation
“A logistic regression was performed to study the effects of age, diabetes mellitus and exercise on the likelihood that patients
have hypercholesterolemia. Results indicated that controlled diabetes mellitus (p=0.031), uncontrolled diabetes mellitus
(p=0.003) and non-exercise (p<0.001) are statistically significant factors for hypercholesterolemia after adjusting for age
(p<0.001). The controlled and uncontrolled diabetes mellitus groups have 2.59 and 4.85 times the odds of having
hypercholesterolemia, respectively, compared to the non-diabetes mellitus group. The non-exercise group has 7.15 times
the odds of having hypercholesterolemia compared to the exercise group.”
05 Data Presentation
[Figure: crude odds ratios with lower and upper limits for 32 IDs, sorted by odds ratio; vertical axis from 0.0 to 4.0; outcome labelled Dead vs Alive; annotation "p-value > 0.05"]
ID 33 4 16 30 22 29 13 3 23 12 17 7 19 10 32 21 15 2 9 27 31 14 18 5 11 24 8 25 26 20 6 28
Crude Odds Ratio 0.47 0.58 0.59 0.63 0.63 0.71 0.72 0.75 0.83 0.84 0.85 0.88 0.90 0.92 0.93 0.95 0.97 1.00 1.11 1.12 1.12 1.16 1.25 1.31 1.33 1.34 1.36 1.46 1.50 1.73 1.79 2.51
Lower Limit 0.30 0.36 0.41 0.41 0.44 0.50 0.47 0.54 0.60 0.54 0.60 0.66 0.64 0.64 0.65 0.58 0.68 (ref.) 0.79 0.74 0.79 0.72 0.79 0.99 0.81 0.99 1.01 1.06 1.10 1.16 1.33 1.66
Upper Limit 0.72 0.92 0.86 0.95 0.90 1.02 1.09 1.04 1.15 1.29 1.19 1.17 1.26 1.32 1.32 1.54 1.39 (ref.) 1.54 1.71 1.60 1.88 1.96 1.74 2.19 1.81 1.84 2.02 2.03 2.59 2.40 3.80
THANK YOU