MTH3901 Mini Project Report 2021

PREDICTION OF HEART DISEASE USING MULTIPLE

LOGISTIC REGRESSION

CHIN HUI YIN (197511)

RACHEL YEOH LI WEN (197176)

LOGAN RAMANATHAM (193497)

NURUL HABRAH NABILAH BT RIDZUWAN (197666)

AN-SYUHADAH BINTI MOHAMAD AFINDI (199190)

BACHELOR OF SCIENCE IN STATISTICS WITH HONOURS

DEPARTMENT OF MATHEMATICS

FACULTY OF SCIENCE

UNIVERSITI PUTRA MALAYSIA

2021
Abstract

PREDICTION OF HEART DISEASE USING MULTIPLE

LOGISTIC REGRESSION

By

CHIN HUI YIN

RACHEL YEOH LI WEN

LOGAN RAMANATHAM

NURUL HABRAH NABILAH BT RIDZUWAN

AN-SYUHADAH BINTI MOHAMAD AFINDI

Lecturer: Prof. Madya Dr. Jayanthi A/P Arasan

Department: Department of Mathematics

Faculty: Faculty of Science

The World Health Organization (WHO), in all its efforts to combat heart disease, has found that the instruments available for predicting heart disease have not been fruitful, owing to their high cost and the inaccuracy of the predicted probabilities. In view of this, there is good reason to pursue further research into the prevention of heart disease. In this research, multiple logistic regression was used to forecast the event of heart disease. Logistic regression is one of the most common statistical approaches for studies involving risk assessment of complex diseases. Secondary data obtained from the research by Rossouw et al. (1983) in the South African Medical Journal were used to develop a model for predicting the event of heart disease, and the factors of heart disease were examined to determine their respective relationships to heart disease. Forward stepwise regression was applied to obtain the best-fitting model for predicting the event of heart disease. A set of diagnostic tests was also carried out in this research to ensure that the assumptions of logistic regression are not violated. From this research, it can be concluded that age, family history of heart disease, cumulative tobacco, type-A characteristic and low-density lipoprotein are the essential factors that cause heart disease. Furthermore, the predictive model achieved 69.57% accuracy and an Area Under the ROC Curve (AUC) value of 0.6743, indicating that the created model is satisfactory in predicting heart disease events.

Contents

Abstract
List of Tables
List of Figures
List of Abbreviations

1 INTRODUCTION
1.1 Background Study
1.2 Problem Statement
1.3 Research Objective
1.4 Significance of Study

2 LITERATURE REVIEW
2.1 Logistic Regression

3 METHODOLOGY
3.1 Introduction
3.2 Multiple logistic regression
3.3 Data splitting
3.4 Forward Stepwise Regression
3.5 Diagnostic Test
3.5.1 Test on linearity in the logit
3.5.2 Test on independence of errors
3.5.3 Assessing multicollinearity
3.5.4 Detecting outliers
3.6 Model Validation and Performance Analysis
3.6.1 Hosmer-Lemeshow goodness-of-fit
3.6.2 Confusion Matrix
3.6.3 Receiver operating characteristic (ROC) curve

4 RESULTS AND DISCUSSION
4.1 Data Visualization
4.2 Multiple Logistic Regression
4.3 Forward Stepwise Regression
4.4 Diagnostic Tests
4.4.1 Linearity of continuous independent variables and log odds
4.4.2 Independence of errors
4.4.3 Multicollinearity among independent variables
4.4.4 Influential outliers
4.5 Model Validation and Performance Analysis
4.5.1 Hosmer-Lemeshow Goodness-of-fit-test
4.5.2 Confusion Matrix
4.5.3 Receiver Operating Characteristic (ROC) curve

5 CONCLUSION AND FUTURE WORK
5.1 Conclusion
5.2 Future Work

References
Appendix A
Appendix B
List of Tables

1 Layout of 2x2 Confusion Matrix.
2 List of rates computed from a confusion matrix.
3 Description of the Dataset
4 Summary of the multiple logistic regression for the dataset
5 Box-Tidwell Test
6 Variance Inflation Factor (VIF)
List of Figures

1 Distribution of the family history of heart disease and CHD
2 Box plots for factors sbp, tobacco, ldl and adiposity
3 Box plots for factors typea, obesity, alcohol and age
4 Step 1 and 2 of the Forward Stepwise Regression
5 Step 3 and 4 of the Forward Stepwise Regression
6 Step 5 and 6 of the Forward Stepwise Regression
7 Summary of Forward Stepwise Regression
8 Residual plot vs Order of observation
9 Collinearity Matrix
10 Pearson residuals, studentized residuals, hat diagonals, deviance residuals, delta chi-square, delta deviance, and delta beta statistics
11 Outlying cases and their impact on influence statistics
12 Probability against scaled change in Pearson chi-square
13 Probability against scaled change in deviance
14 Probability against scaled change in coefficients
15 Hosmer-Lemeshow Goodness-of-fit-test
16 Confusion Matrix
17 Receiver Operating Characteristic (ROC) curve
LIST OF ABBREVIATIONS

AIC Akaike Information Criterion

AUC Area under the ROC curve

CHD Coronary Heart Disease

FN False Negative

FP False Positive

TN True Negative

TP True Positive

ROC Receiver Operating Characteristic

VIF Variance Inflation Factor

CHAPTER 1

INTRODUCTION

1 INTRODUCTION

1.1 Background Study

Cardiovascular disease-associated deaths have increased rapidly worldwide over the past years, proving to be one of the principal causes of death among humans. In 2016, the World Health Organisation (WHO) reported that approximately 31% (17.9 million) of all deaths globally were due to cardiovascular diseases, with 85% (15.2 million) of these deaths directly attributable to heart attacks and strokes (World Health Organisation, 2017). The Department of Statistics Malaysia (2019) illuminates the severity of the issue in Malaysia: deaths caused by ischaemic heart disease in 2018 were recorded at 15.9% and 15% in urban and rural areas of Malaysia respectively.

There are various techniques in neural networks and data mining that have been applied to detect the severity of heart disease among humans (Lakshmanarao et al., 2019). Thanks to advances in technology and data collection, heart disease can now be predicted efficiently using statistical methods. Common statistical methods such as logistic regression, Naïve Bayes, decision trees, Random Forest, K-means and Support Vector Machines have been applied by many researchers in predicting the event of heart disease. In this research, the multiple logistic regression technique was chosen to forecast the event of heart disease and thus identify the factors causing heart disease.

1.2 Problem Statement

Prevention against any disease, let alone one as severe as heart disease, is always better than a cure. Therefore, it is essential to have early detection of heart disease. However, early detection of heart disease is not easy. Thus, the problems addressed in this study are:

1. How to predict the event of heart failure?

2. How to identify which factors significantly affect the event of heart disease?

3. How to find the best fit of the predictive model?

1.3 Research Objective

There are three main objectives to be achieved in this study, namely:

1. To predict the event of heart failure by using multiple logistic regression.

2. To study which factors significantly affect heart disease by using multiple logistic regression.

3. To find the best fit of the model by using diagnostic tests and forward stepwise regression.
1.4 Significance of Study

This study is significant as it can help people to receive an early warning before an unexpected death occurs. There are many factors that contribute to heart disease. Risk factors associated with heart disease, including age, blood pressure, total cholesterol, diabetes, hypertension, family history of heart disease, obesity, lack of physical exercise and fasting blood sugar, have been statistically analysed (Babu et al., 2017). Thus, examining and identifying the primary factors that cause the event of heart disease is of great importance. In order to have early detection of heart disease, a suitable model should be created to find the relationship of each factor to heart disease. In this research, multiple logistic regression is chosen to predict and find the major factors that cause the event of heart disease.
CHAPTER 2

LITERATURE REVIEW

2 LITERATURE REVIEW

2.1 Logistic Regression

Several studies have applied logistic regression to predict heart disease. Bhatti et al. (2006) investigated the factors that greatly increase the risk of ischemic heart disease by using logistic regression analysis; overall, 98.63% of the 585 cases were correctly classified, which means the model fits the data well. Nishadi (2019) discussed how early detection and treatment of cardiovascular disease are urgently needed because of limited healthcare expenditure. In that study, logistic regression was used to identify the most significant predictors of heart disease by evaluating the 10-year risk of Coronary Heart Disease (CHD) with 14 independent variables, and the accuracy of the model was about 87%. Moreover, Saw et al. (2020) discussed how to improve the accuracy of heart disease prediction using logistic regression. In their study, they found that men are more susceptible to heart disease, and that age, the number of cigarettes smoked and systolic blood pressure increase the risk of heart disease. The predictive model in their study achieved 87% accuracy. Prasad et al. (2019) explored the prediction of heart disease with various statistical approaches, namely logistic regression, the K-nearest neighbours algorithm, Naïve Bayes and the Decision Tree, and then compared the models based on their accuracy. They found that logistic regression achieved the highest accuracy among those models, which was 86.89%. Mothukuri et al. (2020) predicted coronary disease using different models such as the Support Vector Machine and Random Forest; the precision of the logistic regression model in that research was the highest, at about 63.93%. Lakshmanarao et al. (2019) predicted heart disease using three different sampling techniques, namely random oversampling, Synthetic Minority Oversampling and the Adaptive Synthetic sampling approach, before applying the statistical model; after applying the three sampling techniques, logistic regression achieved 67.5%, 68.8% and 65.7% accuracy respectively. Khemphila and Boonjing (2010) compared the performances of logistic regression, Decision Trees and Artificial Neural Networks (ANN) for classifying heart disease patients. Logistic regression achieved 81.2% sensitivity, 73.1% specificity and 77.7% accuracy; however, it was not the best technique among the three, as the Artificial Neural Network performed best with the highest accuracy of 80.2%. Similarly, Abhay et al. (2020) predicted heart disease using three different techniques, namely the Decision Tree (using both the Gini and entropy criteria), Naïve Bayes and logistic regression; among these three methods, logistic regression gave the highest accuracy of 93%. Enriko (2019) compared 10 data mining classification algorithms for predicting heart disease, logistic regression being one of the techniques analysed. Among the 10 techniques, logistic regression ranked only 6th, achieving 62.4% accuracy, while the highest was the Random Forest technique with 78.0% accuracy.
CHAPTER 3

METHODOLOGY

3 METHODOLOGY

3.1 Introduction

This chapter introduces the methods used in this research. First and foremost, multiple logistic regression was proposed to create the predictive model. Then, the dataset was split into a training set and a validation set. The training set is used to develop a predictive model by using forward stepwise regression. Next, diagnostic tests of the assumptions of the logistic regression model were implemented. Lastly, model validation and performance analysis are carried out by fitting the validation dataset. Each step is discussed in more detail in the following subsections.
3.2 Multiple logistic regression

In this subsection, multiple logistic regression is discussed in more detail. In real-world data, especially social data, the outcome variable is often categorical rather than continuous. In such cases, logistic regression is one of the statistical methods used to describe the relationship between one or more independent variables and a binary outcome variable (Sarkar et al., 2011). A binary outcome variable means that the probability scale has only two possible values (Sarkar et al., 2011). However, this creates a major problem for the linear probability model, since probabilities are bounded by 0 and 1. In order to overcome this problem, transforming the probabilities to odds removes the upper bound, and taking the natural logarithm of the odds removes the lower bound (Sarkar et al., 2011). Hence, setting the result equal to a linear function of the explanatory variables yields a logit or binary response model (Allison, 1999).

Now suppose that, in a multiple logistic regression case, a collection of k explanatory variables is denoted by the vector X′ = (X1, X2, . . . , Xk). Let the conditional probability of the outcome be denoted by P(Y = 1|X) = π(X). Hence, the corresponding model with more than one explanatory variable can be written as:

Yi = πi(X) + εi ;  i = 1, 2, . . . , n        (1)

where:

πi(X) = exp(Zi) / (1 + exp(Zi))        (2)

with:

Zi = β0 + β1X1i + β2X2i + · · · + βkXki

Y = n × 1 vector of responses having yi = 0 or yi = 1
X = n × (k + 1) design matrix of explanatory variables
β = (k + 1) × 1 vector of parameters
ε = n × 1 vector of unobserved random errors
πi = probability for the ith covariate pattern, satisfying the important requirement 0 ≤ πi ≤ 1

Thus, the log-odds of having Y = 1 for given X is modelled as a linear function of the explanatory variables:

ln(πi / (1 − πi)) = β0 + β1X1i + β2X2i + · · · + βkXki        (3)

and the logistic function is shown below:

πi = exp(Xβ) / (1 + exp(Xβ))        (4)

Furthermore, Maximum Likelihood Estimation (MLE) is used to estimate the parameters of a logistic regression. Specifically, consider a sample of size n whose observations are (y1, y2, . . . , yn) with corresponding random variables (Y1, Y2, . . . , Yn). Since each Yi is a Bernoulli random variable, the probability mass function of Yi is

fi(Yi) = πi^Yi (1 − πi)^(1−Yi) ;  Yi = 0 or 1 and i = 1, 2, . . . , n        (5)

The Yi's are assumed to be independent, thus the log-likelihood function L(β) is defined as:

L(β) = Σ(i=1 to n) Yi(Xi′β) − Σ(i=1 to n) ln[1 + exp(Xi′β)]        (6)

Finally, the fitted logistic response function and fitted values can be expressed as shown below:

π̂ = exp(X′b) / (1 + exp(X′b)) = [1 + exp(−X′b)]^(−1)        (7)

π̂i = exp(Xi′b) / (1 + exp(Xi′b)) = [1 + exp(−Xi′b)]^(−1)        (8)

where:

X′b = b0 + b1X1 + · · · + bkXk
Xi′b = b0 + b1Xi1 + · · · + bkXik
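As an illustration, the model in equations (4), (7) and (8) can be fitted in R with the built-in glm() function, which computes the maximum likelihood estimates of β. The following is a minimal sketch, assuming the dataset has been read in as data (as in Appendix A); the object name full_model is illustrative.

# Fit the multiple logistic regression of chd on all nine factors (sketch)
data <- read.csv("C:/Users/USER/Desktop/heart.datanew.csv")   # path as used in Appendix A
data$famhist <- factor(data$famhist)                          # categorical independent variable
full_model <- glm(chd ~ sbp + tobacco + ldl + adiposity + famhist +
                    typea + obesity + alcohol + age,
                  data = data, family = binomial(link = "logit"))
summary(full_model)       # estimates b, standard errors and Wald z statistics
head(fitted(full_model))  # fitted probabilities, as in equation (8)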

3.3 Data splitting

Before building the forecasting model using stepwise regression, the dataset was split into training and validation sets. Data splitting is defined as the act of partitioning the available data into two portions for cross-validatory purposes (Picard and Berk, 1990): one portion of the data is used to develop a predictive model and the other to evaluate the model (Picard and Berk, 1990). For this study, the whole dataset was split into a training set and a validation set by using the rsample package in RStudio. Stratified random sampling on the response variable, chd, was carried out to generate a training set from 70% of the full dataset and a validation set from the remaining 30%.
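A minimal sketch of this split with the rsample package is given below; the seed value and the object names (split_obj, train_set, valid_set) are illustrative assumptions rather than values taken from the report.

library(rsample)
set.seed(2021)                                              # for reproducibility (value assumed)
split_obj <- initial_split(data, prop = 0.7, strata = chd)  # stratified 70/30 split
train_set <- training(split_obj)                            # 70% of the data, used for model building
valid_set <- testing(split_obj)                             # remaining 30%, kept for validation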
3.4 Forward Stepwise Regression

In this section, forward stepwise regression is discussed in more detail. Forward stepwise selection is a method of fitting a regression model by determining the predictor variables that are significant to the response (Kutner et al., 2005). In the forward stepwise method, the Akaike Information Criterion (AIC) is used as the criterion for model selection. The Akaike Information Criterion (AIC) is given by:

AICp = −2 ln L(b) + 2p        (9)

where:

p: the number of parameters.
ln L(b): the maximised log-likelihood.

AIC measures goodness of fit but also includes a penalty that is an increasing function of the number of estimated parameters. This discourages over-fitting, and the preferred model among a set of candidate models is the one with the minimum AIC value (Sarkar et al., 2010). Hence, the model with the minimum AIC value was chosen.

For this research, the training set of the data was used to build the reduced model by using forward stepwise regression. The significant independent variables were selected and the model with the minimum AIC value was chosen.
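A minimal sketch of forward stepwise selection by AIC using the stats::step() function is shown below; it assumes the training set created in Section 3.3 is named train_set, and the other object names are illustrative.

# Forward stepwise regression: start from the intercept-only model and add terms by AIC
null_model <- glm(chd ~ 1, data = train_set, family = binomial)
full_scope <- ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity + alcohol + age
step_model <- step(null_model, scope = full_scope, direction = "forward")
summary(step_model)   # reduced model with the minimum AIC
AIC(step_model)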

3.5 Diagnostic Test

Before inferences based on the model are undertaken, diagnostic tests are essential to examine the aptness of the model (Kutner et al., 2005). Diagnostic tests were implemented in this research to ensure that no model assumptions are violated in the study. The assumptions of logistic regression are not the same as the assumptions of linear and ordinary-least-squares-based models (Kutner et al., 2005). There are 4 major assumptions in logistic regression, namely linearity in the logit for any continuous independent variables, independence of errors, absence of multicollinearity among independent variables and lack of strongly influential outliers (Stoltzfus, 2011). Each assumption is discussed in more detail in the following subsections.

3.5.1 Test on linearity in the logit

The first assumption that needs to be tested in logistic regression is linearity in the logit for any continuous independent variables. Checking linearity in logistic regression is quite different from checking linearity in simple linear regression. In logistic regression, Hosmer and Lemeshow (2000) recommend using the Box-Tidwell approach to check the linearity in the logit of any continuous independent variables. With this method, the interaction between each continuous independent variable and its natural logarithm is added to the logistic regression model; if at least one interaction is significant, then the assumption is violated (Tabachnick and Fidell, 2013). A Bonferroni correction was applied when determining significance for this test. If there is a problem with linearity in the log odds, a transformation of the variables should be considered to solve it.
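A minimal sketch of the Box-Tidwell check in R is given below for the continuous predictors of the reduced model; because cumulative tobacco contains zero values, a +1 offset is used inside the logarithm here, which is an assumption of this sketch rather than part of the report's SPSS analysis.

# Box-Tidwell approach: add each continuous predictor times its natural logarithm
bt_model <- glm(chd ~ age + tobacco + typea + ldl + famhist +
                  I(age * log(age)) +
                  I(tobacco * log(tobacco + 1)) +   # +1 keeps the logarithm defined when tobacco = 0
                  I(typea * log(typea)) +
                  I(ldl * log(ldl)),
                data = train_set, family = binomial)
summary(bt_model)   # non-significant interaction terms support linearity in the logit
0.05 / 10           # Bonferroni-adjusted significance level, as used in Section 4.4.1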

3.5.2 Test on independence of errors

The second assumption of logistic regression is independence of errors. In logistic regression, the responses of different cases are assumed to be independent of each other (Tabachnick and Fidell, 2013), meaning that each response comes from a different, unrelated case. Logistic regression will produce an overdispersion (excess variability) effect if the errors are not independent, so it is necessary to test this assumption. To test the independence of errors, the residuals are plotted against the order of observation. The purpose of plotting residuals against the order of observation, or any type of sequence, is to see whether there is any correlation between error terms that are near each other in the sequence (Kutner et al., 2005). If the residual plot shows no trend or pattern and fluctuates around the baseline of 0, the error terms can be assumed to be independent. If the plot does not exhibit a random pattern, the assumption is violated and further action should be taken.
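A minimal sketch of this residual plot in base R is shown below; step_model denotes the fitted reduced model and is an illustrative name.

res <- residuals(step_model, type = "deviance")   # deviance residuals of the fitted model
plot(seq_along(res), res, pch = 20,
     xlab = "Order of observation", ylab = "Residual")
abline(h = 0, lty = 2)   # baseline 0; independence is plausible if no pattern appears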

3.5.3 Assessing multicollinearity

The third assumption is the absence of multicollinearity among the independent variables. This is quite similar to testing for multicollinearity in multiple linear regression. Multicollinearity is a phenomenon in which two or more regressors are correlated, causing the standard errors of the coefficients to increase (Midi et al., 2010). It can be concluded that there is a serious multicollinearity problem if a simple correlation coefficient between two regressors is greater than 0.8 or 0.9 (Midi et al., 2010). If the data suffer from a multicollinearity problem, the redundant variables may be deleted from the model to solve it (Tabachnick and Fidell, 2013). However, examining the correlation between two regressors is not sufficient; it should be further verified by using the Variance Inflation Factor (VIF). Mathematically, multicollinearity can mainly be detected with the help of the tolerance and its reciprocal, the variance inflation factor (VIF). Belsley et al. (2005) defined the tolerance of any specific explanatory variable as

Tolerance = 1 − R²        (10)

where:

R²: coefficient of determination for the regression of that explanatory variable on all remaining independent variables.

Furthermore, Belsley et al. (2005) defined the variance inflation factor (VIF) as the reciprocal of the tolerance:

VIF = 1 / (1 − R²)        (11)

Basically, there is no formal cutoff value of the VIF for determining the presence of multicollinearity. However, Allison (2001) states that VIF values exceeding 10 often indicate the presence of multicollinearity problems. Therefore, the independent variables are considered highly correlated if a VIF value is greater than 10. Further action, such as removing variables from the model, should then be considered as one of the methods to solve the multicollinearity problem.
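A minimal sketch of the VIF computation is given below using car::vif(); the car package is an assumption of this sketch (the HH package loaded in Appendix A also provides a vif() function).

library(car)
vif_model <- glm(chd ~ sbp + tobacco + ldl + adiposity + famhist + typea +
                   obesity + alcohol + age, data = train_set, family = binomial)
vif(vif_model)   # values above 10 would indicate a multicollinearity problem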

3.5.4 Detecting outliers

The fourth assumption of logistic regression is the absence of strongly influential outliers. In logistic regression, outliers are defined as observations whose values deviate from the expected range and produce extremely large residuals, leading to sample peculiarity (Sarkar et al., 2011). Identifying outliers is important because outliers can lead to incorrect inferences (Sarkar et al., 2011). In this research, the change in Pearson chi-square, the change in deviance and the change in the parameter estimates, computed from basic building blocks, are the diagnostics used to identify influential outliers. The formulas for the change in Pearson chi-square and the change in deviance are shown below.

∆χ²i = r²pi / (1 − hii)        (12)

where:

rpi: Pearson residual for the ith case
hii: leverage value

∆Di = d²ri / (1 − hii)        (13)

where:

dri: deviance residual for the ith case
hii: leverage value

Sarkar et al. (2011) state that the deviance residuals and studentized Pearson residuals follow the chi-square distribution with a single degree of freedom, under the normality assumption with a sufficiently large sample. Therefore, the upper ninety-fifth percentile of the chi-square distribution, which is approximately 4, can be taken as the cut-off point for detecting outliers (Sarkar et al., 2011). In short, if the change in Pearson chi-square, ∆χ²i, or the change in deviance, ∆Di, is greater than 4, the point can be considered an outlier.

Besides that, the change in the value of the estimated coefficients is analogous to the measure proposed by Cook (1977) for linear regression. The formula for the change in the value of the estimated coefficients is shown below.

∆β̂i = r²pi hii / (1 − hii)²        (14)

where:

rpi: Pearson residual for the ith case
hii: leverage value

If the change in the value of the estimated coefficients, ∆β̂i, is greater than 1 for an individual case, that case has an effect on the estimated coefficients (Cook, 1977). In short, if ∆β̂i is greater than 1, the case is called an influential observation.
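A minimal sketch computing these diagnostics from the basic building blocks in R is shown below; step_model is the illustrative name of the fitted reduced model.

h   <- hatvalues(step_model)                     # leverage values h_ii
r_p <- residuals(step_model, type = "pearson")   # Pearson residuals
r_d <- residuals(step_model, type = "deviance")  # deviance residuals
delta_chisq <- r_p^2 / (1 - h)                   # change in Pearson chi-square, equation (12)
delta_dev   <- r_d^2 / (1 - h)                   # change in deviance, equation (13)
delta_beta  <- r_p^2 * h / (1 - h)^2             # change in coefficients, equation (14)
which(delta_chisq > 4 | delta_dev > 4)           # potential outliers (cut-off 4)
which(delta_beta > 1)                            # influential observations (cut-off 1)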
3.6 Model Validation and Performance Analysis

After building the predictive model using the training data set, model validation analysis was carried out. This is the last step of the model building. Model validation analysis is important to ensure that the results of the logistic regression analysis on the sample can be extended to the corresponding population (Rana et al., 2010); in other words, that the model has a good fit. The most commonly suggested technique for obtaining a good internal validation of model performance is data splitting (Rana et al., 2010). The data had already been split into a training set and a validation set before fitting the model. Now, the validation set of the data was used to compute summary measures of fit such as the Hosmer-Lemeshow goodness-of-fit statistic. Furthermore, a confusion matrix was constructed to assess the performance of the model. Lastly, the Receiver Operating Characteristic (ROC) curve was plotted to obtain the value of the Area under the ROC curve (AUC).

3.6.1 Hosmer-Lemeshow goodness-of-fit

It is essential to test the fitness of the model before it is accepted for use. The Hosmer-Lemeshow goodness-of-fit test is one of the most popular goodness-of-fit tests in logistic regression. Hosmer and Lemeshow proposed grouping based on the values of the estimated probabilities (Hosmer and Lemeshow, 1980). Let nj denote the approximately n/g subjects in the jth decile, where g is the number of groups; say g = 10, then nj ≈ n/10. Let Oj = Σ yj be the number of positive responses among the covariate patterns falling in the jth decile. The estimate of the expected value of Oj, under the assumption that the fitted model is correct, is Ej = Σ mj π̂j. Then the Hosmer-Lemeshow test statistic is:

Cv = Σ(j=1 to g) (Oj − Ej)² / (nj π̄j (1 − π̄j))        (15)

where:

π̄j = Σ mj π̂j / nj

Hosmer and Lemeshow demonstrated the behaviour of this statistic under the null hypothesis that the logistic regression model is a good fit (Hosmer and Lemeshow, 1980). Hence, under the hypothesis that the model is a good fit, and provided each Ej is sufficiently large for each term in Cv to be distributed as χ², it follows that Cv is distributed as χ² with g − 2 degrees of freedom, where g is the number of groups.
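A minimal sketch of the test with the ResourceSelection package (loaded in Appendix A) is shown below; it assumes chd is coded 0/1 and that the validation set and fitted model are named valid_set and step_model as in the earlier sketches.

library(ResourceSelection)
pred_valid <- predict(step_model, newdata = valid_set, type = "response")
hoslem.test(valid_set$chd, pred_valid, g = 10)   # a large p-value supports an adequate fit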
3.6.2 Confusion Matrix

In order to assess the performance of the predictive model, a confusion matrix needs to be constructed. A confusion matrix is a contingency table that displays the number of instances assigned to each class; this information is then used to calculate the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN), among other quantities (Thabtah, 2017). In this study, a 2x2 confusion matrix was constructed because the dependent variable involves only two classes.

Table 1: Layout of 2x2 Confusion Matrix.

                              Predicted Class
                              a (has CHD)    b (no CHD)
Actual Class   a (has CHD)    TP             FN
               b (no CHD)     FP             TN

From Table 1, there are 4 segments, namely TP, FN, FP and TN. True Positive (TP) is the number of patients who actually have CHD and are correctly predicted to have CHD. False Negative (FN) is the number of patients who have CHD but are wrongly predicted not to have CHD. False Positive (FP) is the number of patients who do not have CHD but are wrongly predicted to have CHD. Lastly, True Negative (TN) is the number of patients who do not have CHD and are correctly predicted not to have CHD.

Thus, the confusion matrix can be used to find the accuracy, misclassification rate, sensitivity and specificity. Accuracy is how often the classifier is correct over the overall result, and the misclassification rate is the error rate for the overall result. Sensitivity is the true positive rate, while specificity is the true negative rate.

Table 2: List of rates computed from a confusion matrix.

Terms                        Formula
Accuracy of the model        (TP + TN) / (TP + TN + FP + FN)
Misclassification rate       (FP + FN) / (TP + TN + FP + FN)
Sensitivity of the model     TP / (TP + FN)
Specificity of the model     TN / (TN + FP)
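A minimal sketch of building the confusion matrix and the rates in Table 2 on the validation set is given below; the 0.5 probability cut-off and the object names pred_valid and valid_set are assumptions of the sketch.

pred_class <- ifelse(pred_valid >= 0.5, 1, 0)               # classify with a 0.5 cut-off
cm <- table(Actual = valid_set$chd, Predicted = pred_class)
TP <- cm["1", "1"]; TN <- cm["0", "0"]; FP <- cm["0", "1"]; FN <- cm["1", "0"]
(TP + TN) / sum(cm)   # accuracy
(FP + FN) / sum(cm)   # misclassification rate
TP / (TP + FN)        # sensitivity
TN / (TN + FP)        # specificity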

3.6.3 Receiver operating characteristic (ROC) curve

Furthermore, the Receiver Operating Characteristic (ROC) curve is an effective method for assessing the performance of the model. The ROC curve is a graph that visualises the performance of a classification model at all classification thresholds; it is plotted as sensitivity against (1 − specificity) (Kumar and Indrayan, 2011). The Area under the ROC curve (AUC) is then obtained to measure the performance of the model. The highest possible AUC value is 1, which means the model is perfect at differentiating diseased from non-diseased subjects; however, it is practically impossible to obtain AUC = 1 with real-life data. The minimum acceptable value of the AUC is 0.5 (Kumar and Indrayan, 2011), which means that overall the test has only a 50-50 chance of correctly discriminating between diseased and non-diseased subjects; a model that predicts correctly only 50% of the time is of little practical use.
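A minimal sketch of the ROC curve and AUC with the pROC package (loaded in Appendix A) is shown below; the object names follow the earlier sketches.

library(pROC)
roc_obj <- roc(valid_set$chd, pred_valid)   # observed classes and predicted probabilities
plot(roc_obj)                               # sensitivity against specificity
auc(roc_obj)                                # area under the ROC curve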
CHAPTER 4

RESULTS AND DISCUSSION

4 RESULTS AND DISCUSSION

In this research, secondary data were obtained from the study by Rossouw et al. (1983) published in the South African Medical Journal. There are a total of 462 samples in this data set, collected from a male population in a high-risk heart disease region of the Western Cape in South Africa. The data consist of 9 independent variables and a binary dependent variable. The dependent variable, coronary heart disease (chd), is coded as 1 if heart disease was determined to have been present and 0 otherwise. Of the independent variables, 1 is categorical and 8 are continuous. The table below shows the description of the data.

Table 3: Description of the Dataset

Variable    Unit    Detail
chd         -       Coronary Heart Disease
sbp         mmHg    Systolic Blood Pressure
tobacco     kg      Cumulative Tobacco
ldl         mg/dL   Low Density Lipoprotein Cholesterol
adiposity   BAI     Body Adiposity Index
famhist     -       Family history of heart disease (categorical)
typea       score   Type-A behaviour
obesity     BMI     Obesity Problem
alcohol     -       Current alcohol consumption
age         year    Age at onset

In this research, the results were obtained separately from the IBM SPSS software and the RStudio software. The results are divided into 5 sections, namely data visualisation, multiple logistic regression, forward stepwise regression, diagnostic tests, and model validation and performance analysis. Each of the results is discussed in more detail in the following sections.

4.1 Data Visualization

Data visualization was carried out to gain a better understanding of the data. In this study, a bar chart was created for the factor family history of heart disease, and box plots were created for every continuous independent variable. Data visualization was used to develop a general idea of how the independent variables affect the response variable (CHD). However, data visualization only provides a general idea about the prediction of heart disease cases and does not provide sufficient evidence for drawing conclusions directly. The figure below shows a bar chart representing the distribution of the factor family history of heart disease and CHD.

Figure 1: Distribution of the family history of heart disease and CHD

The bar chart shows that the total number of people without a family history of heart disease is 270, with 23.7% of them having heart disease. Furthermore, the total number of people with a family history of heart disease is 192, with 50% of them having heart disease. It may therefore be expected that the probability of getting heart disease increases when a patient has a family history of heart disease. However, drawing a conclusion directly from the chart is inappropriate; further statistical analysis should be carried out to verify this prediction.

Figure 2: Box plots for factors sbp, tobacco, ldl and adiposity

Figure 2 shows the box plots for the factors systolic blood pressure (sbp), cumulative tobacco (tobacco), Low Density Lipoprotein Cholesterol (ldl) and adiposity. Based on these plots, higher blood pressure, cumulative tobacco, ldl level and body adiposity index appear to be associated with heart disease. The median values are represented by the horizontal lines of the box plots, while the mean values of the parameter on the Y-axis are represented by the red dots.

Figure 3: Boxplots for factors typea, obesity, alcohol and age

Figure 3 shows the box plots for the factors type-A characteristic (typea), obesity as represented by BMI (obesity), current alcohol consumption (alcohol) and age at onset (age). Based on these plots, higher type-A score, BMI value, alcohol consumption and age appear to be associated with heart disease.

4.2 Multiple Logistic Regression

Table 4: Summary of the multiple logistic regression for the dataset

Variables    Estimate   Std. Error   z value   Pr(>|z|)
Intercept    -6.1507    1.3083       -4.701    0.000003
sbp           0.0065    0.0057        1.135    0.256374
tobacco       0.0794    0.0266        2.984    0.002847
ldl           0.1739    0.0597        2.915    0.003555
adiposity     0.0186    0.0293        0.635    0.525700
famhist       0.9254    0.2279        4.061    0.00005
typea         0.0396    0.0123        3.214    0.001310
obesity      -0.0629    0.0442       -1.422    0.155095
alcohol       0.0001    0.0045        0.027    0.978350
age           0.0452    0.0121        3.728    0.000193

From Table 4, independent variables with p-values less than α = 0.05 are considered significant and are retained in the model. As shown in Table 4, tobacco, ldl, famhist, typea and age were found to be significant. Although the summary of the multiple logistic regression already suggests a model, the model can be improved by using forward stepwise regression.

4.3 Forward Stepwise Regression

Figure 4: Step 1 and 2 of the Forward Stepwise Regression

Figure 4 shows steps 1 and 2 of the Forward Stepwise Regression. The factor with the minimum AIC value is chosen for inclusion in the model. According to the results, the factor with the minimum AIC value in step 1 is age, with a value of 355.91. Therefore, age was the first factor to be included in the model. Next, the minimum AIC value in step 2 is 345.30, for the factor famhist. Hence, famhist was added to the model in which the factor age was already included.
Figure 5: Step 3 and 4 of the Forward Stepwise Regression

Proceeding to steps 3 and 4, the factors tobacco and typea had the minimum AIC values of 338.27 and 335.27 respectively. Consequently, both of these factors were added to the model in which age and famhist were already included.
Figure 6: Step 5 and 6 of the Forward Stepwise Regression

In step 5, the factor ldl, with an AIC value of 334.45, was selected for inclusion in the model, given that age, famhist, tobacco and typea were already in the model. Lastly, the forward stepwise regression stopped, as no additional factors could be added to the model: all of them gave higher AIC values than the current model.
Figure 7: Summary of Forward Stepwise Regression

Figure 7 shows the final reduced model from the Forward Stepwise Regression. Based on this summary, age, famhist, tobacco, typea and ldl were found to significantly affect the event of heart disease. As a result, the predictive model for the event of heart disease was developed. Besides, the AIC value decreased from 419.79 to 334.45, indicating an improvement over the initial predictive model. The predicted model is shown below:

log(π̂ / (1 − π̂)) = −6.52744 + 0.06435·age + 0.98935·famhist + 0.08795·tobacco + 0.03135·typea + 0.10832·ldl
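As an illustration of how the fitted equation is used, the sketch below converts the linear predictor to a predicted probability for one hypothetical set of covariate values (the values are purely illustrative and not taken from the data).

z <- -6.52744 + 0.06435 * 55 +    # age = 55 years (hypothetical)
     0.98935 * 1 +                # famhist = 1 (family history present)
     0.08795 * 5 +                # cumulative tobacco = 5 kg
     0.03135 * 50 +               # type-A score = 50
     0.10832 * 4                  # ldl = 4
1 / (1 + exp(-z))                 # predicted probability of CHD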
4.4 Diagnostic Tests

4.4.1 Linearity of continuous independent variables and log odds

Table 5: Box-Tidwell Test

Interaction β S.E. Wald df Sig.


ln(age)*age -0.017 0.072 0.054 1 0.816
ln(tobacco)*tobacco 0.040 0.054 0.539 1 0.463
ln(typea)*typea 0.060 0.073 0.674 1 0.412
ln(ldl)*ldl -0.014 0.204 0.005 1 0.944

Table 5 shows the results of the Box-Tidwell test. A Bonferroni correction was applied to determine the significance of the interaction terms; the corrected level is obtained by dividing the significance level by the number of terms involved. Hence, the corrected significance level is α/10 = 0.005, as there are 10 terms in total, namely 5 independent variables, 4 interactions of the continuous independent variables with their natural logarithms, and 1 intercept. The significance of the interaction between each continuous independent variable and its natural logarithm was then checked. The results show that the p-values of these interactions are all greater than 0.005, indicating that none of the interaction terms is significant. Thus, there is no violation of the linearity between the continuous independent variables and the log odds.

4.4.2 Independence of errors

Figure 8: Residual plot vs Order of observation

Figure 8 shows a plot of residuals against the order of observation. The

error terms were assumed to be independent as there were no specific trends

or patterns shown. Moreover, the error terms fluctuate around baseline 0 as

shown in the figure above.

4.4.3 Multicollinearity among independent variables

Figure 9: Collinearity Matrix

Figure 9 shows the collinearity matrix between each pair of independent variables. As shown in Figure 9, the Pearson correlation coefficients for obesity with adiposity and for age with adiposity were 0.72 and 0.63 respectively, which is considered high correlation. However, looking only at the collinearity matrix is not sufficient to conclude that the data have a multicollinearity problem. Hence, the Variance Inflation Factor (VIF) should be examined further.

Table 6: Variance Inflation Factor (VIF)

Variables VIF
sbp 1.237388
tobacco 1.307798
ldl 1.258307
adiposity 4.065627
famhist 1.108462
typea 1.044693
obesity 2.597437
alcohol 1.095382
age 2.239563

Table 6 shows the Variance Inflation Factors (VIF) of the 9 independent variables, calculated to identify any multicollinearity problem in the data. Based on the table, there is no multicollinearity problem among the 9 independent variables, as all the VIF values are less than 10.

4.4.4 Influential outliers

Figure 10: Pearson residuals, studentized residuals, hat diagonals, deviance residuals, delta chi-square, delta deviance, and delta beta statistics

The figure shows the derived influence statistics for the 324 observations in the training data set. According to Figure 10, the changes in chi-square, deviance and beta are the main quantities used to identify influential outliers. Therefore, the derived influence statistics were used to create diagnostic plots of probability against the scaled change in Pearson chi-square, probability against the scaled change in deviance, and probability against the scaled change in coefficients. Each diagnostic plot is shown on the following pages.
Figure 11: Outlying cases and their impact on influence statistics

Figure 11 shows the 13 outlying cases and their impact on the influence statistics. As seen from the figure, the changes in chi-square and deviance for these cases were greater than 4, but no value exceeding 1 was found for the change in beta. This indicates the presence of 13 potential outliers, none of which is an influential outlier. Therefore, removing the outlying cases is not necessary, as they do not cause substantial changes in the model fit or the estimated parameters.
Figure 12: Probability against scaled change in Pearson chi-square

Figure 12 shows the probability against the scaled change in Pearson chi-square. Based on the figure, the points falling in the top left of the plot are the cases that are poorly fitted. Therefore, there are several potential outliers in this plot.
Figure 13: Probability against scaled change in deviance

Figure 13 illustrates the change in deviance. As presented in the figure, the points falling in the top left of the plot are the cases that are poorly fitted. Hence, a few potential outliers were found in this plot.
Figure 14: Probability against scaled change in coefficients

Figure 14 shows the probability against the scaled change in coefficients. According to Figure 14, none of the points exceeds 1. Therefore, no influential outliers were found in the data, and removing the outlying cases is not necessary.
4.5 Model Validation and Performance Analysis

4.5.1 Hosmer-Lemeshow Goodness-of-fit-test

Figure 15: Hosmer-Lemeshow Goodness-of-fit-test

After all the assumptions were satisfied, the predictive model was subjected to a Hosmer-Lemeshow goodness-of-fit test to ensure that it fits the data well.

H0: The predictive model is a good fit model
H1: The predictive model is not a good fit model

Based on Figure 15, the p-value for this hypothesis test is 0.9617, which is much larger than 0.05, so H0 is not rejected. It can therefore be concluded that the predictive model is a good fit model.
4.5.2 Confusion Matrix

Figure 16: Confusion Matrix

Figure 16 shows the results of the confusion matrix, from which the accuracy, sensitivity and specificity of the predictive model were obtained. From the results, the predictive model was found to be 69.57% accurate, with a sensitivity of 60.42% and a specificity of 74.44%. This means that the model correctly identified 60.42% of the people with heart disease and 74.44% of the people without heart disease.
4.5.3 Receiver Operating Characteristic (ROC) curve
Figure 17: Receiver Operating Characteristic (ROC) curve

Figure 17 shows the Receiver Operating Characteristic (ROC) curve. The

Receiver Operating Characteristic (ROC) curve was plotted at all classifica-

tion thresholds to determine the area under the curve (AUC). Based on the

results, the AUC obtained was 0.6743 which is satisfactory.

CHAPTER 5

CONCLUSION AND FUTURE WORK

5 CONCLUSION AND FUTURE WORK

5.1 Conclusion

Throughout this research, the usefulness of a collection of diagnostic tests was illustrated in checking the assumptions of the logistic regression model. The results showed that the dataset did not violate any of the assumptions, so no remedial measures were necessary. Besides, the Hosmer-Lemeshow goodness-of-fit test was carried out to test the fitness of the predictive model, and its result shows that the predictive model fits the data well. Furthermore, the confusion matrix and the ROC curve were used to assess the performance of the predictive model. Based on the results, the predictive model achieved 60.42% sensitivity (people with heart disease correctly identified) and 74.44% specificity (people without heart disease correctly identified). Moreover, the model achieved 69.57% accuracy with an AUC value of 0.6743, indicating that the proposed model is satisfactory. Lastly, we found that age, family history of heart disease, cumulative tobacco, type-A characteristic and low-density lipoprotein are the essential factors that cause heart disease.

5.2 Future Work

In this research, logistic regression was the only method applied to predict heart failure events. Many other popular statistical methods, such as the Decision Tree, the Naïve Bayes algorithm and the Random Forest method, also perform well in forecasting such events. Therefore, different statistical methods could be applied together in one study and their performances compared, with the model achieving the highest accuracy chosen as the final model. Furthermore, sampling techniques such as random oversampling and Synthetic Minority Oversampling can help the statistical methods to perform better. Thus, such sampling techniques can be applied to the data set before applying the selected statistical method.
References

Abhay, Kishore, Ajay, Kumar, Karan, Singh, Maninder, Punia, Yogita, and Hambir (2020). A computational model for prediction of heart disease based on logistic regression with GridSearchCV.

Allison, P. D. (1999). Comparing logit and probit coefficients across groups. Sociological Methods & Research, 28(2):186-208.

Allison, P. D. (2001). Logistic Regression Using the SAS: Theory and Application. Cary, NC: SAS Institute Inc, 1st edition.

Babu, S., Vivek, E., Famina, K., Fida, K., Aswathi, P., Shanid, M., and Hena, M. (2017). Heart disease diagnosis using data mining technique. In 2017 International Conference of Electronics, Communication and Aerospace Technology (ICECA), volume 1, pages 750-753. IEEE.

Belsley, D. A., Kuh, E., and Welsch, R. E. (2005). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, volume 571. John Wiley & Sons.

Bhatti, I. P., Lohano, H. D., Pirzado, Z. A., and Jafri, I. A. (2006). A logistic regression analysis of the ischemic heart disease risk. Journal of Applied Sciences, 6(4):785-788.

Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1):15-18.

Department of Statistics Malaysia (2019). Statistics of cause of death, Malaysia, 2019.

Enriko, I. K. A. (2019). Comparative study of heart disease diagnosis using top ten data mining classification algorithms. In Proceedings of the 5th International Conference on Frontiers of Educational Technologies, pages 159-164.

Hosmer, D. W. and Lemeshow, S. (1980). Goodness of fit tests for the multiple logistic regression model. Communications in Statistics - Theory and Methods, 9(10):1043-1069.

Hosmer, D. W. and Lemeshow, S. (2000). Applied Logistic Regression. Wiley, New York, 2nd edition.

Khemphila, A. and Boonjing, V. (2010). Comparing performances of logistic regression, decision trees, and neural networks for classifying heart disease patients. In 2010 International Conference on Computer Information Systems and Industrial Management Applications (CISIM), pages 193-198. IEEE.

Kumar, R. and Indrayan, A. (2011). Receiver operating characteristic (ROC) curve for medical researchers. Indian Pediatrics, 48(4):277-287.

Kutner, M. H., Nachtsheim, C. J., Neter, J., Li, W., et al. (2005). Applied Linear Statistical Models, volume 5. McGraw-Hill Irwin, Boston.

Lakshmanarao, A., Swathi, Y., and Sundareswar, P. S. S. (2019). Machine learning techniques for heart disease prediction. Forest, 95(99):97.

Midi, H., Sarkar, S. K., and Rana, S. (2010). Collinearity diagnostics of binary logistic regression model. Journal of Interdisciplinary Mathematics, 13(3):253-267.

Mothukuri, R., Satvik, M. S., Balaji, K. S., and Manikanta, D. (2020). Effective system for prediction of heart disease by applying logistic regression. International Journal of Scientific & Technology Research, 9:432-437.

Nishadi, A. T. (2019). Predicting heart diseases in logistic regression of machine learning algorithms by Python JupyterLab. International Journal of Advanced Research and Publications, 3(8).

Picard, R. R. and Berk, K. N. (1990). Data splitting. The American Statistician, 44(2):140-147.

Prasad, R., Anjali, P., Adil, S., and Deepa, N. (2019). Heart disease prediction using logistic regression algorithm using machine learning. International Journal of Engineering and Advanced Technology, 8:659-662.

Rana, S., Midi, H., and Sarkar, S. (2010). Validation and performance analysis of binary logistic regression model. In Proceedings of the WSEAS International Conference on Environmental, Medicine and Health Sciences, pages 23-25.

Rossouw, J., Du Plessis, J., Benadé, A., Jordaan, P., Kotze, J., Jooste, P., and Ferreira, J. (1983). Coronary risk factor screening in three rural communities. The CORIS baseline study. South African Medical Journal = Suid-Afrikaanse Tydskrif vir Geneeskunde, 64(12):430-436.

Sarkar, S., Midi, H., and Rana, S. (2010). Model selection in logistic regression and performance of its predictive ability. Australian Journal of Basic and Applied Sciences, 4(12):5813-5822.

Sarkar, S. K., Midi, H., and Rana, S. (2011). Detection of outliers and influential observations in binary logistic regression: An empirical study. Journal of Applied Sciences, 11(1):26-35.

Saw, M., Saxena, T., Kaithwas, S., Yadav, R., and Lal, N. (2020). Estimation of prediction for getting heart disease using logistic regression model of machine learning. In 2020 International Conference on Computer Communication and Informatics (ICCCI), pages 1-6. IEEE.

Stoltzfus, J. C. (2011). Logistic regression: a brief primer. Academic Emergency Medicine, 18(10):1099-1104.

Tabachnick, B. G. and Fidell, L. S. (2013). Using Multivariate Statistics. Pearson, Boston, MA, 6th edition.

Thabtah, F. (2017). Autism spectrum disorder screening: Machine learning adaptation and DSM-5 fulfillment. In Proceedings of the 1st International Conference on Medical and Health Informatics 2017, ICMHI '17, pages 1-6, New York, NY, USA. Association for Computing Machinery.

World Health Organisation (2017). Cardiovascular diseases (CVDs). https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).
APPENDIX A

CODING

A. R PROGRAMMING

#1. Load packages
library(psych)
library(dplyr)
library(ggplot2)
library(patchwork)
library(ggthemes)
library(rsample)
library(corrplot)
library(HH)
library(LogisticDx)
library(ResourceSelection)
library(pROC)

#2. IMPORT AND READ THE DATA
data <- read.csv("C:/Users/USER/Desktop/heart.datanew.csv")
head(data)
summary(data)

#3. DESCRIPTIVE STATISTICS
library(psych)
psych::describe(data)

#4. DATA VISUALISATION
#A. Categorical independent variables
#Factor 1: famhist
data$chd <- factor(data$chd)
data$famhist <- factor(data$famhist)
p1 <- ggplot(data, aes(x = famhist, fill = chd)) +
  geom_bar(stat = "count", position = "stack") +
  scale_x_discrete(labels = c("0 (No)", "1 (Yes)")) +
  scale_fill_manual(values = c("lightblue", "darkolivegreen2"),
                    name = "CHD", labels = c("0 (No)", "1 (Yes)")) +
  labs(x = "famhist") +
  theme_minimal(base_size = 12) +
  geom_label(stat = "count", aes(label = ..count..),       # count labels inside the bars
             position = position_stack(vjust = 0.5), size = 5, show.legend = TRUE)
p1 + plot_annotation(title = "Distribution of the Family History and CHD")

#B. Continuous independent variables
heart_data_tibble <- data %>% as_tibble()

#Factor 2: sbp (box plot by CHD status; red point marks the mean)
sbp_relation <- heart_data_tibble %>%
  mutate(chd = ifelse(chd == 1, "Yes", "No")) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = chd, y = sbp, fill = chd),
               varwidth = TRUE, show.legend = FALSE, color = "black") +
  stat_summary(mapping = aes(x = chd, y = sbp),
               fun = "mean", geom = "point", color = "red", na.rm = TRUE, size = 2) +
  scale_fill_manual(values = c("lightblue", "darkolivegreen2")) +
  theme_solarized_2(light = TRUE, base_size = 10) +
  theme(axis.text.x = element_text(colour = "black"),
        axis.title = element_text(colour = "black"),
        axis.text.y = element_text(colour = "black")) +
  labs(x = "chd", y = "sbp")

#Factor 3: tobacco
tobacco_relation <- heart_data_tibble %>%
  mutate(chd = ifelse(chd == 1, "Yes", "No")) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = chd, y = tobacco, fill = chd),
               varwidth = TRUE, show.legend = FALSE, color = "black") +
  stat_summary(mapping = aes(x = chd, y = tobacco),
               fun = "mean", geom = "point", color = "red", na.rm = TRUE, size = 2) +
  scale_fill_manual(values = c("lightblue", "darkolivegreen2")) +
  theme_solarized_2(light = TRUE, base_size = 10) +
  theme(axis.text.x = element_text(colour = "black"),
        axis.title = element_text(colour = "black"),
        axis.text.y = element_text(colour = "black")) +
  labs(x = "chd", y = "tobacco")

#Factor 4: ldl
ldl_relation <- heart_data_tibble %>%
  mutate(chd = ifelse(chd == 1, "Yes", "No")) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = chd, y = ldl, fill = chd),
               varwidth = TRUE, show.legend = FALSE, color = "black") +
  stat_summary(mapping = aes(x = chd, y = ldl),
               fun = "mean", geom = "point", color = "red", na.rm = TRUE, size = 2) +
  scale_fill_manual(values = c("lightblue", "darkolivegreen2")) +
  theme_solarized_2(light = TRUE, base_size = 10) +
  theme(axis.text.x = element_text(colour = "black"),
        axis.title = element_text(colour = "black"),
        axis.text.y = element_text(colour = "black")) +
  labs(x = "chd", y = "ldl")

#Factor 5: adiposity
adiposity_relation <- heart_data_tibble %>%
  mutate(chd = ifelse(chd == 1, "Yes", "No")) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = chd, y = adiposity, fill = chd),
               varwidth = TRUE, show.legend = FALSE, color = "black") +
  stat_summary(mapping = aes(x = chd, y = adiposity),
               fun = "mean", geom = "point", color = "red", na.rm = TRUE, size = 2) +
  scale_fill_manual(values = c("lightblue", "darkolivegreen2")) +
  theme_solarized_2(light = TRUE, base_size = 10) +
  theme(axis.text.x = element_text(colour = "black"),
        axis.title = element_text(colour = "black"),
        axis.text.y = element_text(colour = "black")) +
  labs(x = "chd", y = "Adiposity")

#Combined plot (sbp, tobacco, ldl and adiposity)
p_combined_plots_1 <- (sbp_relation + tobacco_relation) /
  (ldl_relation + adiposity_relation) +
  plot_annotation(title = "Heart Disease - Prediction",
                  subtitle = "Sbp, Tobacco, Ldl, Adiposity",
                  theme = theme_solarized(light = FALSE, base_size = 12) +
                    theme(plot.title = element_text(color = "white", face = "bold", size = 12),
                          plot.subtitle = element_text(color = "white", face = "bold.italic", size = 10)))
p_combined_plots_1

#Factor 6: typea
typea_relation <- heart_data_tibble %>%
  mutate(chd = ifelse(chd == 1, "Yes", "No")) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = chd, y = typea, fill = chd),
               varwidth = TRUE, show.legend = FALSE, color = "black") +
  stat_summary(mapping = aes(x = chd, y = typea),
               fun = "mean", geom = "point", color = "red", na.rm = TRUE, size = 2) +
  scale_fill_manual(values = c("lightblue", "darkolivegreen2")) +
  theme_solarized_2(light = TRUE, base_size = 10) +
  theme(axis.text.x = element_text(colour = "black"),
        axis.title = element_text(colour = "black"),
        axis.text.y = element_text(colour = "black")) +
  labs(x = "chd", y = "typea")

# Factor 7: obesity
obesity_relation <- heart_data_tibble %>%
  mutate(chd = ifelse(chd == 1, "Yes", "No")) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = chd, y = obesity, fill = chd),
               varwidth = TRUE, show.legend = FALSE, color = "black") +
  stat_summary(mapping = aes(x = chd, y = obesity),
               fun = "mean", geom = "point", color = "red",
               na.rm = TRUE, size = 2) +
  scale_fill_manual(values = c("lightblue", "darkolivegreen2")) +
  theme_solarized_2(light = TRUE, base_size = 10) +
  theme(
    axis.text.x = element_text(colour = "black"),
    axis.title  = element_text(colour = "black"),
    axis.text.y = element_text(colour = "black")
  ) +
  labs(x = "chd", y = "obesity")

# Factor 8: alcohol
alcohol_relation <- heart_data_tibble %>%
  mutate(chd = ifelse(chd == 1, "Yes", "No")) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = chd, y = alcohol, fill = chd),
               varwidth = TRUE, show.legend = FALSE, color = "black") +
  stat_summary(mapping = aes(x = chd, y = alcohol),
               fun = "mean", geom = "point", color = "red",
               na.rm = TRUE, size = 2) +
  scale_fill_manual(values = c("lightblue", "darkolivegreen2")) +
  theme_solarized_2(light = TRUE, base_size = 10) +
  theme(
    axis.text.x = element_text(colour = "black"),
    axis.title  = element_text(colour = "black"),
    axis.text.y = element_text(colour = "black")
  ) +
  labs(x = "chd", y = "alcohol")

# Factor 9: age
age_relation <- heart_data_tibble %>%
  mutate(chd = ifelse(chd == 1, "Yes", "No")) %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = chd, y = age, fill = chd),
               varwidth = TRUE, show.legend = FALSE, color = "black") +
  stat_summary(mapping = aes(x = chd, y = age),
               fun = "mean", geom = "point", color = "red",
               na.rm = TRUE, size = 2) +
  scale_fill_manual(values = c("lightblue", "darkolivegreen2")) +
  theme_solarized_2(light = TRUE, base_size = 10) +
  theme(
    axis.text.x = element_text(colour = "black"),
    axis.title  = element_text(colour = "black"),
    axis.text.y = element_text(colour = "black")
  ) +
  labs(x = "chd", y = "age")

# Combined plot (typea, obesity, alcohol and age)
p_combined_plots_2 <- (typea_relation + obesity_relation) /
  (alcohol_relation + age_relation) +
  plot_annotation(title = "Heart Disease - Prediction",
                  subtitle = "Typea, Obesity, Alcohol and Age",
                  theme = theme_solarized(light = FALSE, base_size = 12) +
                    theme(plot.title = element_text(color = "white",
                                                    face = "bold", size = 12),
                          plot.subtitle = element_text(color = "white",
                                                       face = "bold.italic", size = 10)))

p_combined_plots_2
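
# The nine factor plots above repeat the same layers. A small helper such as the
# hypothetical chd_boxplot() below (an added sketch, assuming the same
# heart_data_tibble and the ggplot2/ggthemes packages already loaded) could
# generate any of them in one call.
chd_boxplot <- function(data, var) {
  data %>%
    mutate(chd = ifelse(chd == 1, "Yes", "No")) %>%
    ggplot(aes(x = chd, y = {{ var }}, fill = chd)) +
    geom_boxplot(varwidth = TRUE, show.legend = FALSE, color = "black") +
    stat_summary(fun = "mean", geom = "point", color = "red",
                 na.rm = TRUE, size = 2) +
    scale_fill_manual(values = c("lightblue", "darkolivegreen2")) +
    theme_solarized_2(light = TRUE, base_size = 10) +
    labs(x = "chd")
}

# Example: chd_boxplot(heart_data_tibble, tobacco)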

# 5. Full model
full_model <- glm(chd ~ ., data = data, family = "binomial")
summary(full_model)
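
# Added illustration: the glm coefficients are on the log-odds scale, so
# exponentiating them gives odds ratios (Wald confidence limits via
# confint.default; this assumes the same full_model object as above).
exp(cbind(OR = coef(full_model), confint.default(full_model)))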

# 6. Data splitting
set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = chd)
training_set <- training(data_split)
validate_set <- testing(data_split)
head(training_set)
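
# Quick check (added sketch): stratifying on chd should keep the proportion of
# heart-disease cases similar in the two splits.
prop.table(table(training_set$chd))
prop.table(table(validate_set$chd))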

# 7. Stepwise regression
null_model <- glm(chd ~ 1, data = training_set, family = "binomial")
full_model <- glm(chd ~ ., data = training_set, family = "binomial")
step_model <- step(null_model,
                   scope = list(lower = null_model,
                                upper = full_model),
                   direction = "forward")
summary(step_model)

# Reduced model (same as the stepwise result)
reduced <- glm(chd ~ age + famhist + tobacco + typea + ldl,
               family = binomial, data = training_set)
summary(reduced)
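
# Added comparison: a likelihood-ratio test and AIC contrast between the reduced
# and full training-set models; both are fitted on training_set, so anova() with
# test = "Chisq" applies directly.
anova(reduced, full_model, test = "Chisq")
AIC(reduced, full_model)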

# 8. Assumption checking

# A. Test independence of errors (reduced model)
res <- reduced$residuals
plot(res, type = "l", col = "blue",
     main = "Residual versus Order of Observation",
     ylab = "Residual", xlab = "Order of Observation")
abline(h = 0, lty = 2, col = "red")
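
# Note (added): reduced$residuals returns working residuals; deviance residuals,
# which are more usual for logistic-regression diagnostics, can be plotted the
# same way if preferred.
res_dev <- residuals(reduced, type = "deviance")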

# B.1 Test multicollinearity (full model)
data <- read.csv("C:/Users/USER/Desktop/heart.datanew.csv")
data1 <- subset(data, select = -chd)
data.cor <- cor(data1)
corrplot(data.cor, addCoef.col = TRUE)

# B.2 Calculate the variance inflation factors
vif(full_model)
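
# Added rule-of-thumb check: VIF values above roughly 5 (some texts use 10) are
# commonly taken to signal problematic multicollinearity; this lists any
# predictors that exceed that cut-off.
vif(full_model)[vif(full_model) > 5]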

# C. Outlier detection (reduced model)

# C.1 Numerical
step_model <- step(null_model,
                   scope = list(lower = null_model,
                                upper = full_model),
                   direction = "forward")
summary(step_model)

library(LogisticDx)
outlier <- dx(step_model)
outlier

# Tabulate the observations with dChisq or dDev > 4
outlier %>% filter(dChisq > 4 | dDev > 4)

# Visualise dChisq, dDev and dBhat
plot(step_model)

# 9. Model validation

# A. Hosmer-Lemeshow goodness-of-fit test
fitness <- glm(chd ~ ., data = validate_set, family = "binomial")
hl <- hoslem.test(fitness$y, fitted(fitness), g = 10)
hl
cbind(hl$expected, hl$observed)
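
# Added alternative: the same test applied to the stepwise model's predictions on
# the validation set assesses calibration of the already-fitted model rather than
# refitting on the validation data.
hoslem.test(validate_set$chd,
            predict(step_model, newdata = validate_set, type = "response"),
            g = 10)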

# B. Probability and confusion matrix
validate <- as.factor(validate_set$chd)
pred <- as.factor(predict(step_model,
                          newdata = validate_set, type = "response") >= 0.5) %>%
  fct_recode("0" = "FALSE", "1" = "TRUE")
confusionMatrix(pred, validate, positive = "1")
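
# Added illustration: overall accuracy at the conventional 0.5 cut-off can also
# be read off directly from the predicted and observed classes.
mean(pred == validate)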

# C. ROC (area under the curve)
validate_set$prob <- predict(step_model, validate_set, type = "response")
validate_set$chd_pred <- ifelse(validate_set$prob >= 0.5, 1, 0)
plot(roc(validate_set$chd, validate_set$chd_pred), col = "red")
auc(roc(validate_set$chd, validate_set$chd_pred))
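
# Note (added sketch): the ROC curve is more informative when built from the
# predicted probabilities rather than the 0/1 classifications; with pROC this is
plot(roc(validate_set$chd, validate_set$prob), col = "red")
auc(roc(validate_set$chd, validate_set$prob))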

APPENDIX B

BOX-TIDWELL TEST
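
A minimal sketch of how the Box-Tidwell linearity check can be run in R is given below (an added illustration assuming the same training_set; it adds an x*log(x) term for each continuous predictor in the reduced model and tests its significance, with tobacco shifted by 1 to avoid log(0)).

bt_model <- glm(chd ~ age + tobacco + ldl + typea + famhist +
                  age:log(age) + tobacco:log(tobacco + 1) +
                  ldl:log(ldl) + typea:log(typea),
                family = binomial, data = training_set)
summary(bt_model)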
