
Logistic Regression (2022)

This document discusses binary logistic regression. It begins by outlining when to use logistic regression over other statistical tests based on the number and type of predictor and outcome variables. It then provides more detail on logistic regression, including how it can be used to study associations between risk factors and dichotomous outcomes. The document discusses odds ratios, how they are interpreted and calculated. It also outlines the four main assumptions of logistic regression and provides an example objective to identify risk factors associated with hypercholesterolemia using logistic regression.


BINARY LOGISTIC REGRESSION
SHAHRUL AIMAN BIN SOELAR
Clinical Research Centre, Hospital Sultanah Bahiyah
[email protected]
CHOOSING THE CORRECT STATISTICAL TEST

Dependent (Outcome) Variable: categorical data with two (2) categories.

Number of Independent (Predictor) Variables:
• One predictor → Simple Logistic Regression
  – Numerical (e.g. age)
  – Categorical (e.g. Malay, Chinese and Indian)
• Two or more predictors → Multiple Logistic Regression (with variable selection)
  – Categorical, numerical, or a mix (e.g. gender and age)
LOGISTIC REGRESSION

• To study the association between risk factors (numerical or categorical data) and two outcome categories (categorical data).

The model: the logistic function links the predictors X (Factor 1 … Factor 5) to the predicted outcome Y.

[Diagram: Factors 1–5 (X) → predict → Outcome (Y)]
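The "logistic function" named above can be sketched in a few lines of R; the coefficients here are made-up placeholders for illustration, not values from the course data.

```r
# The logistic (sigmoid) function maps any linear predictor z onto (0, 1),
# which is what lets the model output a probability for the outcome Y.
logistic <- function(z) 1 / (1 + exp(-z))

# Placeholder coefficients, for illustration only (not from the course data)
b0 <- -2
b1 <- 0.5
x  <- c(0, 2, 4, 8)

# Predicted probability P(Y = 1 | X = x) for each x
round(logistic(b0 + b1 * x), 3)  # 0.119 0.269 0.500 0.881
```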
LOGISTIC REGRESSION

• To study the association between risk factors (numerical or categorical data) and two outcome categories (categorical data).

Hypotheses:
• H0: Y does not depend on any of the Xi's.
• Ha: Y depends on at least one of the Xi's.

[Diagram: under H0 the factors X1–X5 do not predict the outcome Y; under Ha at least one of them does.]
LOGISTIC REGRESSION

2×2 table (rows: exposure; columns: cancer status):

              Cancer   No cancer
Exposed          a         b
Not exposed      c         d

• Odds Ratio = (a/b) / (c/d)
• Risk Ratio = (a/(a+b)) / (c/(c+d))

• Odds Ratio:
  ✓ Cohort Study
  ✓ Case Control Study
  ✓ Cross-sectional Study
  ✓ Diagnostic test
  ✓ Expressed as odds (times)
• Risk Ratio:
  ✓ Cohort Study
  ✓ Randomised Controlled Trials
  ✓ Expressed as probability (%)
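The two formulas above can be sketched in R; the counts plugged in below are the male/female drug-use numbers used later in these slides, so the results can be checked by hand.

```r
# 2x2 table laid out as:            event  no event
#                        exposed      a       b
#                        unexposed    c       d
odds_ratio <- function(a, b, c, d) (a / b) / (c / d)
risk_ratio <- function(a, b, c, d) (a / (a + b)) / (c / (c + d))

# Male vs female drug use: a = 120, b = 102, c = 85, d = 106
round(odds_ratio(120, 102, 85, 106), 2)  # 1.47
round(risk_ratio(120, 102, 85, 106), 2)  # 1.21
```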
LOGISTIC REGRESSION

          Drug users   Non-users   Total
Male         120          102       222
Female        85          106       191
Total        205          208       413

• Odds: the probability of belonging to one group (or of an event occurring) divided by the probability of not belonging to that group (or of the event not occurring).
  ▪ The odds of a male using drugs: 120/102 = 1.18
  ▪ The odds of a female using drugs: 85/106 = 0.80
  ▪ A male is 1.18 times as likely to use drugs as not to use them.
  ▪ A female is 0.80 times as likely to use drugs as not to use them (i.e. less likely to use than not).
LOGISTIC REGRESSION

(Using the same table: males 120 users / 102 non-users; females 85 / 106.)

• Odds Ratio: an important estimate in logistic regression, used to answer our research question.
  ▪ For this table, the research question is whether there is a gender difference in drug use, i.e. whether the probability of drug use is the same for males and females.
  ▪ The odds ratio is the ratio of the odds for each group: always the odds for the response group (males) divided by the odds for the referent group (females).
  ▪ Odds ratio = 1.18/0.80 ≈ 1.47
  ▪ Males in this example were about 1.47 times as likely as females to use drugs.
LOGISTIC REGRESSION

Interpreting the Odds Ratio (compare the value to 1):

0 ———————— 1 ———————— ∞
Low risk factor        High risk factor

▪ If OR < 1: group A has lower odds of the event than group B (the reference category).
  ➢ (a negative or protective association between factor and outcome)
▪ If OR = 1: groups A and B have the same odds of the event.
  ➢ (no association between factor and outcome)
▪ If OR > 1: group A has higher odds of the event than group B (the reference category).
  ➢ (a positive association between factor and outcome)
LOGISTIC REGRESSION

• 4 Assumptions
  ✓ There must be at least two cases for each category of the dependent variable.
  ✓ Overall model fitness – MULTIPLE ONLY:
    a) STEP 1: Checking multicollinearity (Variance Inflation Factor)
    b) STEP 2: Checking outliers (Cook's Distance)
    c) STEP 3: Checking model fit (Hosmer–Lemeshow goodness-of-fit test)

• OBJECTIVE: To identify risk factors (age, sex, DM, HPT and exercise) associated with hypercholesterolemia.
  – Import the file Hypercholesterol(Logistic).xlsx
SIMPLE LOGISTIC REGRESSION: ONE NUMERICAL PREDICTOR (age)
01 Data Exploration

## Compute summary statistics by groups

library(psych)  # provides describeBy()
desc01 <- describeBy(hptR$age, hptR$hyperchol, IQR=TRUE)
desc01
Descriptive statistics by group
group: No
vars n mean sd median trimmed mad min max range skew kurtosis se IQR
X1 1 144 38.31 4.83 38 38.26 4.45 25 52 27 0.09 0.09 0.4 6
--------------------------------------------------------------------------------------------------------
group: Yes
vars n mean sd median trimmed mad min max range skew kurtosis se IQR
X1 1 56 42.59 4.69 43 42.85 4.45 30 52 22 -0.51 -0.2 0.63 6.25

Caution!! The acceptable range for normality is skewnessᵃ lying between -1 and 1 and kurtosisᵇ lying between -3 and 3.

REFERENCES:
ᵃ Bulmer, M. G. (1979). Principles of Statistics. NY: Dover Books on Mathematics.
ᵇ Balanda, K. P. and MacGillivray, H. L. (1988). "Kurtosis: A Critical Review". The American Statistician, 42(2), pp. 111–119.
01 Data Exploration

COMBINE
## Combine specific results into one table
desc99 <- cbind("Variable"=c("Age(year)"),
"Less.Mean"=c(desc01$No$mean),
"Less.SD"=c(desc01$No$sd),"n"=c(""),"(%)"=c(""),
"More.Mean"=c(desc01$Yes$mean),
"More.SD"=c(desc01$Yes$sd),"n"=c(""),"(%)"=c(""))
desc99
Variable Less.Mean Less.SD n (%) More.Mean More.SD n (%)
[1,] "Age(year)" "38.3125" "4.83354934811876" "" "" "42.5892857142857" "4.68567172137779" "" ""
02 Test statistic using R

## Simple Logistic Regression (NUMERICAL DATA)


model01 <- glm(hyperchol ~ age, family=binomial, data=hptR)
model01
Call: glm(formula = hyperchol ~ age, family = binomial, data = hptR)

Coefficients:
(Intercept) age
-8.5878 0.1888

Degrees of Freedom: 199 Total (i.e. Null); 198 Residual


Null Deviance: 237.2
Residual Deviance: 207.1 AIC: 211.1

## extract the coefficients from the model and exponentiate


OR <- exp(coef(model01))
OR
(Intercept) age
0.0001863672 1.2078084592

CI <- as.data.frame(exp(confint(model01)))
CI
2.5 % 97.5 %
(Intercept) 6.831754e-06 0.003609845
age 1.124103e+00 1.307493754
02 Test statistic using R

## 2-tailed Wald z tests to test significance of coefficients


PV <- summary(model01)$coefficients[,4]
PV
(Intercept) age
6.947367e-08 8.676167e-07

COMBINE
## Combine specific results into one table
model99 <- cbind("OR"=c(OR[-1]),
"CI"=c(paste0("(",format(round(CI$`2.5 %`[-1],2),nsmall = 2),",",
format(round(CI$`97.5 %`[-1],2),nsmall = 2),")")),
"pvalue"=c(PV[-1]))
model99
OR CI pvalue
age "1.20780845923358" "(1.12,1.31)" "8.67616705140556e-07"

outN1 <- cbind(desc99,model99)


outN1
03 Interpretation

Odds Ratio
• The result shows that age (p-value <0.001) was statistically significantly associated with hypercholesterolemia.
• A one-year increase in age is associated with 1.21 times the odds (95% CI 1.12 to 1.31) of having HC.
• For example, a 31-year-old has 1.21 times the odds of having HC compared to a 30-year-old.
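The 1.21 above can be reproduced from the coefficients printed by model01 (intercept -8.5878, age 0.1888): the odds ratio for a one-unit increase is exp(coefficient), and base R's plogis() gives the fitted probability at any age. A quick sketch:

```r
# Coefficients as printed by model01 earlier in these slides
b0    <- -8.5878
b_age <-  0.1888

# Odds ratio per one-year increase in age
round(exp(b_age), 2)       # 1.21

# For a k-year increase the OR is exp(k * b), not k * exp(b):
round(exp(5 * b_age), 2)   # ~2.57 for a 5-year increase

# Fitted probability of HC at ages 30 and 31 via the logistic function
round(plogis(b0 + b_age * c(30, 31)), 3)
```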
04 Data Presentation
SIMPLE LOGISTIC REGRESSION: ONE CATEGORICAL PREDICTOR (diabetes mellitus)
01 Data Exploration

## Compute summary statistics by groups


desc01 <- table(hptR$dm,hptR$hyperchol)
desc01
No Yes
No DM 73 16
Controlled DM 56 23
Uncontrolled DM 15 17

## Row Percentages
prop01 <- desc01/rowSums(desc01)*100
prop01
No Yes
No DM 82.02247 17.97753
Controlled DM 70.88608 29.11392
Uncontrolled DM 46.87500 53.12500

## Column Percentages
prop01 <- t(t(desc01)/colSums(desc01)*100)
prop01
No Yes
No DM 50.69444 28.57143
Controlled DM 38.88889 41.07143
Uncontrolled DM 10.41667 30.35714
01 Data Exploration

COMBINE
## Combine specific results into one table
desc99 <- cbind("Variable"=c(dimnames(desc01)[[1]]),
"Less.Mean"=c(""),"Less.SD"=c(""),
"n"=c(desc01[,1]),"(%)"=c(prop01[,1]),
"More.Mean"=c(""),"More.SD"=c(""),
"n"=c(desc01[,2]),"(%)"=c(prop01[,2]))
01 Data Exploration

The assumption was not met


02 Test statistic using R

## Simple Logistic Regression (CATEGORICAL DATA)


model01 <- glm(hyperchol ~ dm, family=binomial, data=hptR)
model01
Call: glm(formula = hyperchol ~ dm, family = binomial, data = hptR)

Coefficients:
(Intercept) dmControlled DM dmUncontrolled DM
-1.518 0.628 1.643

Degrees of Freedom: 199 Total (i.e. Null); 197 Residual


Null Deviance: 237.2
Residual Deviance: 223.4 AIC: 229.4

## extract the coefficients from the model and exponentiate


OR <- exp(coef(model01))
OR
(Intercept) dmControlled DM dmUncontrolled DM
0.2191781 1.8738839 5.1708333

CI <- as.data.frame(exp(confint(model01)))
CI
2.5 % 97.5 %
(Intercept) 0.1231741 0.3662076
dmControlled DM 0.9115646 3.9310545
dmUncontrolled DM 2.1662191 12.7155635
02 Test statistic using R

## 2-tailed Wald z tests to test significance of coefficients


PV <- summary(model01)$coefficients[,4]
PV
(Intercept) dmControlled DM dmUncontrolled DM
3.825688e-08 9.037618e-02 2.536746e-04

COMBINE
## Combine specific results into one table
model99 <- cbind("OR"=c(OR[-1]),
"CI"=c(paste0("(",format(round(CI$`2.5 %`[-1],2),nsmall = 2),",",
format(round(CI$`97.5 %`[-1],2),nsmall = 2),")")),
"pvalue"=c(PV[-1]))
model99
OR CI pvalue
dmControlled DM "1.87388392857446" "(0.91, 3.93)" "0.0903761836323606"
dmUncontrolled DM "5.17083333333328" "(2.17,12.72)" "0.000253674565871809"

outC1 <- rbind(cbind("Diabetes Mellitus","","","","","","","","","","",""),


cbind(desc99,
rbind(cbind("1.00","(ref.)",""),model99)))
outC1
03 Interpretation

Odds Ratio
• The controlled DM group has 1.87 times the odds of having HC compared to the non-DM group (although this crude OR was not statistically significant, p = 0.090).
• The uncontrolled DM group has 5.17 times the odds of having HC compared to the non-DM group.
04 Data Presentation
MULTIPLE LOGISTIC REGRESSION: TWO OR MORE PREDICTORS

Enter all the variables into the model.

When running a Multiple Logistic Regression, there are three versions of this method:

(1) ENTER | (2) FORWARD | (3) BACKWARD

An independent variable with a p-value less than 0.05 contributes significantly to the prediction of the dependent variable.
>> Thus, that variable can be included in the Multiple Logistic Regression.
01 Data Exploration

DOES NOT CONTRIBUTE SIGNIFICANTLY TO THE MODEL

Assumption
✓ There must be at least two cases for each category of the dependent variable.
02 Test statistic using R

## Multiple Logistic Regression (ENTER METHOD)


model0E <- glm(hyperchol ~ age + dm + hpt + exercise, family=binomial, data=hptR)
summary(model0E)
Call:
glm(formula = hyperchol ~ age + dm + hpt + exercise, family = binomial, data = hptR)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.2956 -0.6510 -0.3542 0.5696 2.4039

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -10.77358    1.98224  -5.435 5.48e-08 ***
age                 0.24480    0.04727   5.179 2.23e-07 ***
dmControlled DM     0.95302    0.44096   2.161  0.03068 *
dmUncontrolled DM   1.58343    0.53820   2.942  0.00326 **
hptYes             -0.03546    0.38861  -0.091  0.92729
exerciseYes        -1.96524    0.43911  -4.476 7.62e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 237.18 on 199 degrees of freedom
Residual deviance: 167.39 on 194 degrees of freedom
AIC: 179.39
Number of Fisher Scoring iterations: 5

NOTE: hpt does not contribute significantly to the model because its p-value (0.927) is higher than 0.05. Thus, we decided to remove hpt from the model.
02 Test statistic using R

## Multiple Logistic Regression (ENTER METHOD)


model0E <- glm(hyperchol ~ age + dm + exercise, family=binomial, data=hptR)
summary(model0E)
Call:
glm(formula = hyperchol ~ age + dm + exercise, family = binomial, data = hptR)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.2865 -0.6462 -0.3551 0.5684 2.4120

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)       -10.77819    1.98047  -5.442 5.26e-08 ***
age                 0.24455    0.04715   5.187 2.14e-07 ***
dmControlled DM     0.95204    0.44090   2.159  0.03082 *
dmUncontrolled DM   1.57796    0.53469   2.951  0.00317 **
exerciseYes        -1.96726    0.43868  -4.485 7.31e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 237.18 on 199 degrees of freedom
Residual deviance: 167.40 on 195 degrees of freedom
AIC: 177.4
Number of Fisher Scoring iterations: 5

NOTE: All factors contribute significantly to the model because every p-value is less than 0.05. Thus, this is the final model using the ENTER METHOD.
02 Test statistic using R

## Multiple Logistic Regression (INITIAL MODEL)


## only intercept
model01 <- glm(hyperchol ~ 1, family=binomial, data=hptR)
model01
Call: glm(formula = hyperchol ~ 1, family = binomial, data = hptR)

Coefficients:
(Intercept)
-0.9445

Degrees of Freedom: 199 Total (i.e. Null); 199 Residual


Null Deviance: 237.2
Residual Deviance: 237.2 AIC: 239.2

## Multiple Logistic Regression (FULL MODEL)


model02 <- glm(hyperchol ~ age + dm + hpt + exercise, family=binomial, data=hptR)
model02
Call: glm(formula = hyperchol ~ age + dm + hpt + exercise, family = binomial, data = hptR)

Coefficients:
(Intercept) age dmControlled DM dmUncontrolled DM hptYes exerciseYes
-10.77358 0.24480 0.95302 1.58343 -0.03546 -1.96524

Degrees of Freedom: 199 Total (i.e. Null); 194 Residual


Null Deviance: 237.2
Residual Deviance: 167.4    AIC: 179.4
02 Test statistic using R

## Multiple Logistic Regression (FORWARD METHOD)

library(MASS)  # provides stepAIC()
model0F <- stepAIC(model01, direction = "forward", scope = formula(model02))

Initial Model
• hyperchol ~ 1, with an AIC of 239.18.
• A smaller AIC means a better model, so the model can still be improved.
• Thus, age will be included in the next step, giving a new AIC of 211.08.

Model 1
• hyperchol ~ age, with an AIC of 211.08.
• exercise will be included in the next step, giving a new AIC of 183.67.

Model 2
• hyperchol ~ age + exercise, with an AIC of 183.67.
• dm will be included in the next step, giving a new AIC of 177.40.

Model 3
• hyperchol ~ age + exercise + dm, with an AIC of 177.40.
• hpt can't be included because the new AIC would be bigger than the current AIC.
02 Test statistic using R

## Multiple Logistic Regression (BACKWARD METHOD)


model0B <- stepAIC(model02, direction = "backward")

Initial Model
• hyperchol ~ age + dm + hpt + exercise, with an AIC of 179.39.
• A smaller AIC means a better model, so the model can still be improved.
• Thus, hpt will be excluded in the next step, giving a new AIC of 177.40.

Model 1
• hyperchol ~ age + exercise + dm, with an AIC of 177.40.
• dm, exercise and age can't be excluded because the new AIC would be bigger than the current AIC.
02 Test statistic using R

## Multiple Logistic Regression (FORWARD METHOD)


model0F$anova
Stepwise Model Path
Analysis of Deviance Table

Initial Model:
hyperchol ~ 1

Final Model:
hyperchol ~ age + exercise + dm

(the final model using the FORWARD METHOD)

Step Df Deviance Resid. Df Resid. Dev AIC


1 199 237.1813 239.1813
2 + age 1 30.10322 198 207.0781 211.0781
3 + exercise 1 29.40760 197 177.6705 183.6705
4 + dm 2 10.26952 195 167.4010 177.4010

“+” means included in the model


02 Test statistic using R

## Multiple Logistic Regression (BACKWARD METHOD)


model0B$anova
Stepwise Model Path
Analysis of Deviance Table

Initial Model:
hyperchol ~ age + dm + hpt + exercise

Final Model:
hyperchol ~ age + dm + exercise

(the final model using the BACKWARD METHOD)

Step Df Deviance Resid. Df Resid. Dev AIC


1 194 167.3927 179.3927
2 - hpt 1 0.008329567 195 167.4010 177.4010

“-” means excluded from the model


02 Test statistic using R

Let's take the MLR from the BACKWARD METHOD as the final model.
## extract the coefficients from the model and exponentiate
OR <- exp(coef(model0B))
OR
(Intercept) age dmControlled DM dmUncontrolled DM exerciseYes
0.0000208493 1.2770416691 2.5909915916 4.8450792070 0.1398396242

Change the reference group of exercise from the non-exercise group ("No") to the exercise group ("Yes"):

## Multiple Logistic Regression (BACKWARD METHOD)


hptR$exercise <- relevel(hptR$exercise, ref = "Yes")
model0B <- stepAIC(model02, direction = "backward")

## extract the coefficients from the model and exponentiate


OR <- exp(coef(model0B))
OR
(Intercept) age dmControlled DM dmUncontrolled DM exerciseNo
2.915558e-06 1.277042e+00 2.590992e+00 4.845079e+00 7.151049e+00
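A quick sanity check (not shown in the original slides): switching the reference level simply inverts the odds ratio, so the exerciseYes OR printed before releveling (0.1398) and the exerciseNo OR printed after (7.151) should be reciprocals.

```r
# ORs for exercise printed before and after relevel()
or_before <- 0.1398396242  # exerciseYes, reference = "No"
or_after  <- 7.151049      # exerciseNo,  reference = "Yes"

# Inverting one recovers the other (up to rounding)
round(1 / or_before, 3)    # 7.151
```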

02 Test statistic using R
CI <- as.data.frame(exp(confint(model0B)))
CI
2.5 % 97.5 %
(Intercept) 3.204919e-08 1.462969e-04
age 1.170306e+00 1.409334e+00
dmControlled DM 1.106978e+00 6.294374e+00
dmUncontrolled DM 1.730003e+00 1.425207e+01
exerciseNo 3.148796e+00 1.779667e+01

## 2-tailed Wald z tests to test significance of coefficients


PV <- summary(model0B)$coefficients[,4]
PV
(Intercept) age dmControlled DM dmUncontrolled DM exerciseNo
2.516469e-09 2.138787e-07 3.082457e-02 3.165608e-03 7.308127e-06
02 Test statistic using R

COMBINE
## Combine specific results into one table
model99 <- cbind("OR"=c(OR[-1]),
"CI"=c(paste0("(",format(round(CI$`2.5 %`[-1],2),nsmall = 2),",",
format(round(CI$`97.5 %`[-1],2),nsmall = 2),")")),
"pvalue"=c(PV[-1]))
model99
OR CI pvalue
age "1.27704166910185" "(1.17, 1.41)" "2.13878673212258e-07"
dmControlled DM "2.59099159159642" "(1.11, 6.29)" "0.0308245651537937"
dmUncontrolled DM "4.84507920704878" "(1.73,14.25)" "0.00316560843115176"
exerciseNo "7.15104896668564" "(3.15,17.80)" "7.30812725789492e-06"
02 Test statistic using R

COMBINE
## Combine specific results into one table
outFULL <- rbind(cbind("Age in years",t(model99[1,])), #Age in years
cbind("Diabetes Mellitus","","",""), #Diabetes Mellitus
cbind(model0B$xlevels$dm,
rbind(cbind("1.00","(ref.)",""),model99[2:3,])),
cbind("Exercise","","",""), #Exercise
cbind(model0B$xlevels$exercise,
rbind(cbind("1.00","(ref.)",""),model99[4,])))
rownames(outFULL) <- NULL
outFULL
OR CI pvalue
[1,] "Age in years" "1.27704166910185" "(1.17, 1.41)" "2.13878673212258e-07"
[2,] "Diabetes Mellitus" "" "" ""
[3,] "No DM" "1.00" "(ref.)" ""
[4,] "Controlled DM" "2.59099159159642" "(1.11, 6.29)" "0.0308245651537937"
[5,] "Uncontrolled DM" "4.84507920704878" "(1.73,14.25)" "0.00316560843115176"
[6,] "Exercise" "" "" ""
[7,] "Yes" "1.00" "(ref.)" ""
[8,] "No" "7.15104896668564" "(3.15,17.80)" "7.30812725789492e-06"
03 Checking assumptions using R

## STEP 1: Checking multicollinearity (Variance Inflation Factor)
library(car)
round(vif(model0B),2)
         GVIF Df GVIF^(1/(2*Df))
age      1.18  1            1.09
dm       1.05  2            1.01
exercise 1.13  1            1.06

NOTE 1: None of the VIFs is more than 10. Thus, there is no multicollinearity problem.

## STEP 2: Checking outliers (Cook's Distance)
describe(cooks.distance(model0B))
   vars   n mean   sd median trimmed mad min  max range skew kurtosis se
X1    1 200 0.01 0.01      0       0   0   0 0.08  0.08 3.58    15.85  0

NOTE 2: The max is 0.08, which is not more than 1.0. Thus, there is no influential outlier.

## STEP 3: Checking model fit (Hosmer-Lemeshow goodness-of-fit test)
library(ResourceSelection)
hoslem.test(model0B$y, fitted(model0B), g=10)

Hosmer and Lemeshow goodness of fit (GOF) test
data: model0B$y, fitted(model0B)
X-squared = 5.4143, df = 8, p-value = 0.7125

NOTE 3: The p-value is 0.713, which is more than 0.05, meaning that the model fits well (the dataset fits the logistic model well).

Overall, this final model met all assumptions for multiple logistic regression.
04 Interpretation

outFULL
OR CI pvalue
[1,] "Age in years" "1.27704166910185" "(1.17, 1.41)" "2.13878673212258e-07"
[2,] "Diabetes Mellitus" "" "" ""
[3,] "No DM" "1.00" "(ref.)" ""
[4,] "Controlled DM" "2.59099159159642" "(1.11, 6.29)" "0.0308245651537937"
[5,] "Uncontrolled DM" "4.84507920704878" "(1.73,14.25)" "0.00316560843115176"
[6,] "Exercise" "" "" ""
[7,] "Yes" "1.00" "(ref.)" ""
[8,] "No" "7.15104896668564" "(3.15,17.80)" "7.30812725789492e-06"

Odds Ratio
➢ AGE
• A one-year increase in age is associated with 1.28 times the odds of having HC.
➢ DIABETES MELLITUS
• The controlled DM group has 2.59 times the odds of having HC compared to the non-DM group.
• The uncontrolled DM group has 4.85 times the odds of having HC compared to the non-DM group.
➢ EXERCISE
• The non-exercise group has 7.15 times the odds of having HC compared to the exercise group.
05 Data Presentation

“A logistic regression was performed to study the effects of age, diabetes mellitus and exercise on the likelihood that patients have hypercholesterolemia. Results indicated that age (p<0.001), controlled diabetes mellitus (p=0.031), uncontrolled diabetes mellitus (p=0.003) and non-exercise (p<0.001) are statistically significant factors for hypercholesterolemia. A one-year increase in age is associated with 1.28 times the odds of having hypercholesterolemia. The controlled and uncontrolled diabetes mellitus groups have 2.59 and 4.85 times the odds, respectively, of having hypercholesterolemia compared to the non-diabetes mellitus group. The non-exercise group has 7.15 times the odds of having hypercholesterolemia compared to the exercise group.”
05 Data Presentation

“A logistic regression was performed to study the effects of age, diabetes mellitus and exercise on the likelihood that patients have hypercholesterolemia. Results indicated that controlled diabetes mellitus (p=0.031), uncontrolled diabetes mellitus (p=0.003) and non-exercise (p<0.001) are statistically significant factors for hypercholesterolemia after adjusting for age (p<0.001). The controlled and uncontrolled diabetes mellitus groups have 2.59 and 4.85 times the odds, respectively, of having hypercholesterolemia compared to the non-diabetes mellitus group. The non-exercise group has 7.15 times the odds of having hypercholesterolemia compared to the exercise group.”
05 Data Presentation

EXAMPLE: Site comparison of mortality (2009)

[Forest plot: crude odds ratios of death, with lower and upper confidence limits, for 32 sites ordered from 0.47 to 2.51. Sites whose confidence interval crosses the reference line at OR = 1 have p-value > 0.05.]
THANK YOU
SHAHRUL AIMAN BIN SOELAR
Clinical Research Centre, Hospital Sultanah Bahiyah
[email protected]
