MTH3901 Mini Project Report 2021
LOGISTIC REGRESSION
DEPARTMENT OF MATHEMATICS
FACULTY OF SCIENCE
2021
Abstract
LOGISTIC REGRESSION
By
LOGAN RAMANATHAM
The World Health Organization (WHO), in all its effort in combating heart
diseases, has found that the available instruments that can be used to predict
heart diseases have not been fruitful due to reasons relating to high cost and
inaccuracy, which motivates researchers to further pursue research in the
prevention of heart diseases. In this research, multiple logistic regression
was used to forecast the event of heart disease. Secondary data was obtained
from the research done by Rossouw et al. (1983) in the South African Medical
Journal to develop a model for predicting the event of heart disease. Factors
of heart disease were used to determine their respective significance, and
forward stepwise regression was applied to find the best fit of the model for
predicting the event of heart failure. Besides, diagnostic tests were carried
out in this research to ensure the assumptions of logistic regression are not
violated. From this research, it can be concluded that age, family history of
heart disease, cumulative tobacco, type-A characteristic and low-density
lipoprotein are the essential factors that cause the event of heart disease,
and the Area Under the ROC Curve (AUC) value obtained is 0.6743, indicating
fair discriminative ability of the model.
Contents
Abstract
List of Tables
List of Figures
List of Abbreviations
1 INTRODUCTION
2 LITERATURE REVIEW
3 METHODOLOGY
3.1 Introduction
3.2 Multiple logistic regression
3.3 Data Splitting
3.4 Forward Stepwise Regression
3.5 Diagnostic Tests
3.6 Model Validation and Performance Analysis
4 RESULTS AND DISCUSSION
4.1 Data Visualization
4.2 Multiple Logistic Regression
4.3 Forward Stepwise Regression
4.4 Diagnostic Tests
4.4.1 Linearity in the log odds
4.4.2 Independence of errors
4.4.3 Multicollinearity among independent variables
4.4.4 Influential outliers
4.5 Model Validation and Performance Analysis
5 CONCLUSION AND RECOMMENDATIONS
5.1 Conclusion
5.2 Recommendations
References
Appendix A
Appendix B
List of Tables
3 Description of the Dataset
5 Box-Tidwell Test
6 Variance Inflation Factor (VIF)
List of Figures
1 Distribution of the family history of heart disease and CHD
2 Box plots for factors sbp, tobacco, ldl and adiposity
3 Boxplots for factors typea, obesity, alcohol and age
5 Step 3 and 4 of the Forward Stepwise Regression
6 Step 5 and 6 of the Forward Stepwise Regression
7 Summary of Forward Stepwise Regression
9 Collinearity Matrix
11 Outlying cases and their impact on influence statistics
12 Probability against scaled change in Pearson chi-square
13 Probability against scaled change in deviance
14 Probability against scaled change in coefficients
15 Hosmer-Lemeshow Goodness-of-fit-test
16 Confusion Matrix
17 Receiver Operating Characteristic (ROC) curve
LIST OF ABBREVIATIONS
AIC Akaike Information Criterion
ANN Artificial Neural Networks
AUC Area Under the ROC Curve
CHD Coronary Heart Disease
FN False Negative
FP False Positive
ROC Receiver Operating Characteristic
TN True Negative
TP True Positive
VIF Variance Inflation Factor
WHO World Health Organization
CHAPTER 1
INTRODUCTION
Heart disease cases have been increasing over the past years, proving to be
one of the principal sources of death worldwide. The World Health
Organization (WHO) reported that approximately 31% (17.9 million) of all
deaths globally were due to cardiovascular diseases. Deaths caused by
ischaemic heart diseases in 2018 were recorded at 15.9%.
There are various techniques in neural networks and data mining that
are applied to detect the severity of heart disease among humans
(Lakshmanarao et al., 2019). Due to advanced technology and data collection,
heart disease data is now more readily available for analysis. In this
research, multiple logistic regression techniques were chosen to forecast the
event of heart disease. Prevention against any diseases, let alone something
as severe as heart disease, begins with understanding its causes. This
research is therefore guided by the following research questions:

1. How to predict the event of heart failure?

2. How to identify which factors significantly affect the event of heart
disease?

The objectives of this study are:

1. To predict the event of heart failure by using multiple logistic
regression.

2. To identify the factors that significantly affect the event of heart
disease by using multiple logistic regression.

3. To find the best fit of the model by using diagnostic tests and forward
stepwise regression.
1.4 Significance of Study

This study is significant as it can help humans to have an early warning
before unexpected passing occurs. There are many factors that contribute to
heart disease. Risk factors associated with heart disease such as age, blood
pressure, family history of heart disease, obesity, lack of physical
exercise, and fasting blood sugar have been statistically analysed (Babu et
al., 2017). Thus, examining and identifying the primary factors that will
cause the event of heart disease is of great importance. A statistical model
should be created to find the relationship of each factor to heart disease
and thereby find the major factors that will cause the event of heart
disease.
CHAPTER 2
LITERATURE REVIEW
There are a few studies that applied logistic regression to predict heart
disease. Bhatti et al. (2006) investigated the factors that contributed
greatly to the risk of ischaemic heart disease using a logistic regression
analysis; 98.63% of the 585 cases were correctly classified, which means the
model is able to fit the data well. Nishadi (2019) discussed how early
detection of heart disease can be achieved, identifying the most significant
predictors of heart diseases by evaluating the risk of 10-year coronary heart
disease; the accuracy of the model is about 87%. Moreover, Saw et al. (2020)
discussed how to improve the accuracy in predicting heart disease by using
logistic regression. In their study, they found out that men are more
susceptible to heart disease than women and that factors such as age and
systolic blood pressure increase the risk of heart disease. The predictive
model in this study achieved 87% accuracy.

Prasad et al. (2019) explored models such as Logistic Regression and Decision
Tree and then compared each model based on their accuracy. They found out
that Logistic Regression achieved the highest accuracy among those models
when compared with different models such as Support Vector Machine and Random
Forest. Another study applied a sampling approach before fitting the
statistical model; after applying three sampling techniques, the Logistic
Regression achieved 67.5%, 68.8% and 65.7% accuracy. In the work of Khemphila
and Boonjing (2010), they compared Logistic Regression, Decision Trees, and
Artificial Neural Networks (ANN) for classifying heart disease patients.
However, Logistic Regression is not the best technique among these 3
techniques; Artificial Neural Networks (ANN) is the best technique because it
has the highest accuracy, which is 80.2%. Similarly, Abhay et al. (2020)
predicted heart disease by using three different techniques, of which the
best gives the highest accuracy of 93%. Enriko (2019) compared 10 data mining
algorithms for heart disease prediction.
CHAPTER 3
METHODOLOGY
3.1 Introduction

This chapter introduces the methods that were used in this research. First
and foremost, multiple logistic regression was used to develop the predictive
model. Then, the dataset was split into training sets and validation sets,
and forward stepwise regression was applied to obtain the best fitted model.
Diagnostic tests were conducted to check the model assumptions, and model
validation and performance analysis are carried out by fitting the validation
dataset. Each method is discussed in the following sections.
3.2 Multiple logistic regression

In real-world data, especially social data, the outcome variable is often
binary (Sarkar et al., 2011). The binary outcome variable means that the
probability scale has only two possible values (Sarkar et al., 2011).
However, this will create a major problem with the linear probability model
since the probabilities are bounded between 0 and 1. Transforming the
probabilities to odds can help to remove the upper bound, and the natural
logarithm of odds will remove the lower bound (Sarkar et al., 2011). Hence,
the logit transformation is used in logistic regression. The corresponding
model with more than one explanatory variable can be written as:
$$Y_i = \pi_i(X) + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{1}$$

where:

$$\pi_i(X) = \frac{\exp(Z_i)}{1 + \exp(Z_i)} \tag{2}$$

with:

$\beta$ = $(k + 1) \times 1$ vector of parameters, and $0 \leq \pi_i \leq 1$.

The logit of the response probability is

$$Z_i = \ln\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} \tag{3}$$
And the logistic function is shown as below:

$$\pi_i = \frac{\exp(X\beta)}{1 + \exp(X\beta)} \tag{4}$$

Since each $Y_i$ is a Bernoulli random variable, the probability
mass function of $Y_i$ is

$$f_i(Y_i) = \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i}, \quad Y_i = 0, 1 \tag{5}$$

so that the log-likelihood function is

$$\ln L(\beta) = \sum_{i=1}^{n} Y_i (X_i'\beta) - \sum_{i=1}^{n} \ln\left[1 + \exp(X_i'\beta)\right] \tag{6}$$
Finally, the fitted logistic response function and fitted values can be
expressed as shown below:

$$\hat{\pi} = \frac{\exp(X'b)}{1 + \exp(X'b)} = \left[1 + \exp(-X'b)\right]^{-1} \tag{7}$$

$$\hat{\pi}_i = \frac{\exp(X_i'b)}{1 + \exp(X_i'b)} = \left[1 + \exp(-X_i'b)\right]^{-1} \tag{8}$$

where:

$X'b = b_0 + b_1 X_1 + \cdots + b_k X_k$
3.3 Data Splitting

Before building the forecasting model by using stepwise regression, the
dataset was split into training and validation sets. Data splitting is
defined as the act of partitioning the available data into two portions
(Picard and Berk, 1990). One portion of the data is used to develop a
predictive model and another to evaluate the model (Picard and Berk, 1990).
For this study, the whole dataset was split into a training set and a
validation set, with the training set formed from 70% of the full dataset and
the validation set from 30% of the full dataset.
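As a minimal sketch of this step (assuming the data frame is named data; the
rsample package is the one loaded in Appendix A), the 70/30 split can be
written as:

# 70/30 training-validation split; set.seed(123) matches Appendix A
library(rsample)
set.seed(123)
data_split <- initial_split(data, prop = 0.7)   # 70% to training
training_set   <- training(data_split)
validation_set <- testing(data_split)           # remaining 30%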
3.4 Forward Stepwise Regression

Forward stepwise regression is an automatic search procedure that identifies
the predictor variables that are significant to the response (Kutner et al.,
2005). At each step, candidate models are compared using the Akaike
Information Criterion (AIC), which is defined as below:

$$AIC = -2\ln(L) + 2p \tag{9}$$

where:

$L$ is the maximized value of the likelihood function and $p$ is the number
of estimated parameters in the model.

AIC measures goodness of fit but also includes a penalty that is an
increasing function of the number of estimated parameters. This penalty
discourages over-fitting, and the preferred model among the set of candidate
models is the one with the minimum AIC value (Sarkar et al., 2010). Hence the
model with the minimum AIC value is selected at each step. For this research,
the training set of the data was used to build the reduced model; significant
variables were selected and the model with the minimum AIC value was chosen.
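A hedged sketch of this procedure with the base R step() function, mirroring
the forward search in Appendix A (the object names are assumptions):

# Forward stepwise selection by AIC on the training set
null_model <- glm(chd ~ 1, data = training_set, family = binomial)
full_model <- glm(chd ~ ., data = training_set, family = binomial)
reduced <- step(null_model,
                scope = list(lower = null_model, upper = full_model),
                direction = "forward")  # each step adds the factor with lowest AIC
summary(reduced)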
3.5 Diagnostic Tests

Before inferences based on that model are undertaken, diagnostic tests are
essential to examine the aptness of the model (Kutner et al., 2005).
Diagnostic tests were implemented in this research to avoid any data
violation in the study. In this research, it is noted that the assumptions of
logistic regression are not the same as those of linear regression based
models (Kutner et al., 2005). There are 4 major assumptions in logistic
regression, which are linearity in the logit for any continuous independent
variables, independence of errors, no multicollinearity among independent
variables and lack of strongly influential outliers (Stoltzfus, 2011). Each
assumption is discussed below.
The process of checking linearity in the logistic regression is quite
different compared to the process in linear regression; the Box-Tidwell test
was used in this research. By using this method, the interaction between each
continuous independent variable and its natural logarithm is added into the
logistic regression model. If at least one of these interaction terms is
significant, the linearity assumption is violated; a Bonferroni correction is
applied to the significance level for the test. If there is a problem in
linearity in the log odds, transformation of the continuous variables should
be considered.
The second assumption requires the errors to be independent of each other
(Tabachnick and Fidell, 2013). This means each response comes from a
different and unrelated case. The logistic regression will produce an
overdispersion (great variability) effect if errors are not independent. To
examine this assumption, a plot of residuals against the order of observation
is plotted. The purpose of plotting residuals against the order of
observation is to reveal any correlation between error terms that are near
each other in the sequence (Kutner et al., 2005). If the residual plot does
not show any trend or pattern and fluctuates around the baseline 0, the error
terms are assumed to be independent. If the plot does not exhibit a random
pattern, the assumption is violated and further action should be taken.
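As a minimal sketch (assuming reduced is the fitted model from the stepwise
step), the plot can be produced as follows:

# Residuals in order of observation; a patternless band around 0
# supports the independence assumption
res <- residuals(reduced, type = "pearson")
plot(res, type = "b",
     main = "Residual versus Order of Observation",
     xlab = "Order of Observation", ylab = "Residual")
abline(h = 0, lty = 2)   # baseline 0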
If the independent variables are correlated, the standard errors of the
coefficients will increase (Midi et al., 2010). Multicollinearity can be
suspected when the correlation between a pair of independent variables is as
high as 0.8 or 0.9 (Midi et al., 2010). If the data suffer a multicollinearity
problem, then the redundant variables may be deleted from the model to solve
the problem. Multicollinearity can also be detected through the Variance
Inflation Factor (VIF). Mathematically, multicollinearity can mainly be
detected with the help of tolerance and its reciprocal, called the variance
inflation factor (VIF). Belsley et al. (2005) defined tolerance of any
specific explanatory variable as

$$Tolerance = 1 - R^2 \tag{10}$$

where $R^2$ is the coefficient of determination obtained by regressing that
explanatory variable on all the remaining explanatory variables. Furthermore,
Belsley et al. (2005) defined the variance inflation factor (VIF) as

$$VIF = \frac{1}{1 - R^2} \tag{11}$$

As a rule of thumb, the independent variables are highly correlated if the
VIF value is greater than 10. Further action such as removing variables from
the model should be considered as one of the methods to solve the
multicollinearity problems.
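A one-line check, as a hedged sketch, uses the vif() function from the HH
package loaded in Appendix A (car::vif gives comparable output); full_model
is the assumed full glm fit:

library(HH)
vif(full_model)   # VIF per predictor; a value above 10 flags multicollinearity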
Outliers are observations whose values deviate from the expected range and
will produce extremely large residuals, thus leading to sample peculiarity
(Sarkar et al., 2011). Identification of outliers through the change in
Pearson chi-square, the change in deviance and the change in parameter
estimates computed from basic building blocks is considered. The change in
Pearson chi-square is given by

$$\Delta\chi_i^2 = \frac{r_{P_i}^2}{1 - h_{ii}} \tag{12}$$

where $r_{P_i}$ is the Pearson residual of the $i$th case and $h_{ii}$ is its
leverage. Similarly, the change in deviance is

$$\Delta D_i = \frac{d_i^2}{1 - h_{ii}} \tag{13}$$

where $d_i$ is the deviance residual of the $i$th case. Sarkar et al. (2011)
state in their article that the deviance residuals and studentized Pearson
residuals will follow the chi-square distribution with a single degree of
freedom, so these statistics can be used to detect the outliers (Sarkar et
al., 2011). In short, if the change in Pearson chi-square, $\Delta\chi_i^2$,
and the change in deviance, $\Delta D_i$, are greater than 4, the points are
flagged as potential outliers.
Besides that, the change in the value of the estimated coefficients is
analogous to the measure proposed by Cook (1977) for linear regression. The
formula is shown below:

$$\Delta\hat{\beta}_i = \frac{r_{P_i}^2 h_{ii}}{(1 - h_{ii})^2} \tag{14}$$

where $r_{P_i}$ and $h_{ii}$ are as defined above. If the change in the value
of the estimated coefficients, $\Delta\hat{\beta}_i$, is greater than 1, the
corresponding cases are considered influential observations.
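These three statistics can be computed with the dx() helper of the LogisticDx
package loaded in Appendix A; a hedged sketch (the column names follow that
package):

library(LogisticDx)
d <- dx(reduced)                  # per-pattern diagnostics for the fitted glm
d[d$dChisq > 4 | d$dDev > 4, ]    # potential outliers: change greater than 4
d[d$dBhat > 1, ]                  # influential cases: delta beta-hat above 1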
3.6 Model Validation and Performance Analysis
After building the predictive model by using the training data set, model
validation analysis was carried out. This is the last step of the model
building. It is important to have model validation analysis to ensure the
results of the logistic regression model can be generalized to the underlying
population (Rana et al., 2010). In other words, we can say that the model has
a good fit. The most suggested technique to obtain a good internal validation
is data splitting (Rana et al., 2010); the data were already split into a
training set and a validation set before fitting the model. Now, the
validation set of the data was used to compute the summary statistics of
model performance. A confusion matrix was constructed to find the performance
of the model. Lastly, the Receiver Operating Characteristic (ROC) curve was
plotted.
3.6.1 Hosmer-Lemeshow Goodness-of-Fit Test

The Hosmer-Lemeshow test groups the observations into deciles based on the
values of the estimated probabilities (Hosmer and Lemeshow, 1980). Let $n_j$
denote the approximately $n/g$ subjects in the $j$th decile, where $g$ is the
number of groups; if $g = 10$, each group contains approximately $n/10$
subjects. Let $O_j = \sum y_j$ be the number of positive responses among the
covariate patterns falling in the $j$th decile. The estimate of the expected
value of $O_j$ under the assumption that the fitted model is correct is
$E_j = \sum m_j \hat{\pi}_j$. Then the Hosmer-Lemeshow test statistic is

$$\hat{C} = \sum_{j=1}^{g} \frac{(O_j - E_j)^2}{n_j \bar{\pi}_j (1 - \bar{\pi}_j)} \tag{15}$$

where

$$\bar{\pi}_j = \frac{\sum m_j \hat{\pi}_j}{n_j}$$

Hosmer and Lemeshow (1980) demonstrated that, under the null hypothesis that
the fitted logistic regression model is a good fit and provided each $E_j$ is
sufficiently large in the $j$th decile, the statistic $\hat{C}$ approximately
follows a chi-square distribution with $g - 2$ degrees of freedom.
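This is the test run in Appendix A with the ResourceSelection package;
fitness is assumed there to be the final fitted model:

library(ResourceSelection)
hl <- hoslem.test(fitness$y, fitted(fitness), g = 10)  # g = 10 deciles
hl                                   # a large p-value supports a good fit
cbind(hl$expected, hl$observed)      # E_j and O_j per decile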
3.6.2 Confusion Matrix

A confusion matrix displays the number of instances assigned to each class;
this information is then used to calculate the True Positives (TP), True
Negatives (TN), False Positives (FP) and False Negatives (FN), among others
(Thabtah, 2017). In this study, a 2 x 2 confusion matrix was constructed
because the dependent variable is binary, as shown in Table 1.

Table 1: Confusion Matrix

                             Predicted Class
                       a (has CHD)    b (no CHD)
Actual   a (has CHD)       TP             FN
Class    b (no CHD)        FP             TN

From Table 1, there are 4 segments, which are TP, FN, FP and TN. True
Positive (TP) is the number of patients that actually have CHD and are
correctly predicted to have CHD. False Negative (FN) is the number of
patients who have CHD but are wrongly predicted to not have CHD. False
Positive (FP) is the number of patients who do not have CHD but are wrongly
predicted to have CHD. Lastly, True Negative (TN) is the number of patients
that do not have CHD and are correctly predicted to not have CHD. Thus, the
confusion matrix can be used to find the accuracy, misclassification rate,
sensitivity and specificity. Accuracy means how often the classifier is
correct for the overall result. Misclassification rate means the error rate
for the overall result. Sensitivity is the true positive rate while
specificity is the true negative rate.
Terms                          Formula
Accuracy of the model          (TP + TN) / (TP + TN + FP + FN)
Misclassification Rate         (FP + FN) / (TP + TN + FP + FN)
Sensitivity of the model       TP / (TP + FN)
Specificity of the model       TN / (TN + FP)
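A hedged sketch of these computations with caret's confusionMatrix(), as in
Appendix A (the 0.5 cut-off and object names are assumptions):

library(caret)
prob <- predict(reduced, newdata = validation_set, type = "response")
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(pred, factor(validation_set$chd, levels = c(0, 1)),
                positive = "1")   # prints accuracy, sensitivity, specificity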
3.6.3 Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic (ROC) curve is another suggested method
for assessing the performance of the model. A ROC curve is a graph of
sensitivity against 1 - specificity across all possible classification
thresholds (Kumar and Indrayan, 2011). Then, the Area under the ROC curve
(AUC) summarises how well the model discriminates between classes on real
life data. Therefore, the minimum value of AUC that can be accepted is 0.5
(Kumar and Indrayan, 2011). This means that overall there is a 50-50 chance
that the test will correctly discriminate the diseased and non-diseased
subjects, which is really unhelpful since the model would only have a 50%
chance to predict correctly.
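A minimal sketch with the pROC package loaded in Appendix A (prob, as above,
is assumed to hold the predicted probabilities on the validation set):

library(pROC)
roc_obj <- roc(validation_set$chd, prob)  # sensitivity vs 1 - specificity
plot(roc_obj, main = "ROC curve")
auc(roc_obj)                              # area under the curve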
CHAPTER 4
RESULTS AND DISCUSSION

In this research, secondary data was obtained from the research done by
Rossouw et al. (1983) in the South African Medical Journal. There are a total
of 462 samples in this data set. The samples were collected among the male
population in a high-risk heart disease region of the Western Cape in South
Africa. The description of the dataset is shown in Table 3.
Table 3: Description of the Dataset
In this research, the results were obtained separately from IBM SPSS
software and RStudio software. The results are divided into 5 sections: data
visualization, multiple logistic regression, forward stepwise regression,
diagnostic tests, and model validation and performance analysis. Each of the
results will be discussed in more detail in the following sections.
4.1 Data Visualization

Data visualization has been carried out to have a better understanding and
visualization of this data. In this study, a bar chart was created for the
factor family history of heart disease and box plots for every continuous
independent variable. Data visualization was used to develop a general idea
of how each factor relates to heart disease. However, data visualization just
provides a general idea about the prediction of heart disease cases and does
not provide sufficient evidence to make conclusions directly. The figure
below shows a bar chart representing the distribution of the factor family
history of heart disease against CHD.
Figure 1: Distribution of the family history of heart disease and CHD
The bar chart shows that the total number of people without family history of
heart disease is 270, with 23.7% of them having heart disease, while the
total number of people with family history of heart disease is 192, with 50%
of them having heart disease. It may suggest that the event of heart disease
is more likely when the patients have family history of heart disease.
However, directly making conclusions from the chart alone is not advisable,
as further statistical analysis is needed.
Figure 2: Box plots for factors sbp, tobacco, ldl and adiposity
Figure 2 shows the box plots for the factors systolic blood pressure (sbp),
cumulative tobacco (tobacco), low-density lipoprotein cholesterol (ldl) and
adiposity. Based on the results, higher sbp, consumption of tobacco, level of
ldl and body adiposity index are good indicators of heart disease. The median
values are represented by the horizontal lines of the box plots, while the
mean values of the parameter on the Y-axis are represented by the red dots.
Figure 3: Boxplots for factors typea, obesity, alcohol and age
Figure 3 shows the box plots for the factors type-A characteristic (typea),
obesity (obesity), current alcohol consumption (alcohol) and age at onset
(age). Based on the results, higher typea characteristic, BMI value,
consumption of alcohol and age are good indicators of heart disease.
4.2 Multiple Logistic Regression

From Table 4, p-values less than α = 0.05 are considered significant. As
shown in Table 4, tobacco, ldl, famhist, typea and age were found to be
significant. Although the summary of the multiple logistic regression has
given the significant factors, the best fitted model was further obtained by
using forward stepwise regression.
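A hedged sketch of the model summarised in Table 4 (the exact glm call is not
preserved in Appendix A; the formula and data object are assumptions):

# Full multiple logistic regression of CHD on all nine factors
full_model <- glm(chd ~ sbp + tobacco + ldl + adiposity + famhist +
                    typea + obesity + alcohol + age,
                  data = training_set, family = binomial)
summary(full_model)   # factors with p < 0.05 are taken as significant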
4.3 Forward Stepwise Regression

In each step of the forward stepwise regression, the factor with the minimum
AIC value is chosen to be included into the model. According to the results,
the factor with the minimum AIC value in step 1 is age, with the value
355.91. Therefore, age was the first factor to be included in the model.
Next, the minimum AIC value in step 2 is 345.30 for the factor famhist.
Hence, famhist was integrated into the model where the factor age had already
been included.
Figure 5: Step 3 and 4 of the Forward Stepwise Regression
Proceeding to steps 3 and 4, the factors tobacco and typea had the minimum
AIC values, so both of these factors were chosen to be included in the model
where age and famhist had already been included.
Figure 6: Step 5 and 6 of the Forward Stepwise Regression
In step 5, the factor ldl with the AIC value of 334.45 was selected to be
included into the model, given that age, famhist, tobacco, and typea were
already in the model. In step 6, no additional factors could be added into
the model due to higher AIC values, so the procedure stopped.
Figure 7: Summary of Forward Stepwise Regression
Figure 7 shows the final reduced model from Forward Stepwise Regression. The
factors age, famhist, tobacco, typea and ldl were found to significantly
affect the event of heart disease. As a result, the predictive model for the
event of heart disease was developed. Besides, the AIC value decreased from
419.79 to 334.45, which indicates a better fitted model. The fitted model is

$$\log\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = -6.52744 + 0.06435\,\text{age} + 0.98935\,\text{famhist} + 0.08795\,\text{tobacco} + \cdots$$
4.4 Diagnostic Tests

4.4.1 Linearity in the log odds

The Box-Tidwell test was carried out with a Bonferroni-corrected significance
level, which was obtained by dividing the significance level by the number of
terms involved. Hence, the corrected significance level is α∗/10 = 0.05/10 =
0.005, as there are a total of 10 terms in the model, and the p-values of the
interaction terms were checked. The result in Table 5 shows that the p-values
of the interactions of the continuous variables with their natural logarithms
are greater than 0.005, which indicates that all the continuous variables are
linear in the logit.
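Appendix B holds the corresponding output; as a hedged sketch, the continuous
factors of the reduced model are interacted with their own logarithms (the +1
guard for zero-valued tobacco records is an assumption of this sketch):

# Box-Tidwell check: a significant interaction (p < 0.005 after the
# Bonferroni correction) would signal non-linearity in the logit
bt_model <- glm(chd ~ age + tobacco + typea + ldl + famhist +
                  age:log(age) + tobacco:log(tobacco + 1) +  # +1 guards zeros
                  typea:log(typea) + ldl:log(ldl),
                data = training_set, family = binomial)
summary(bt_model)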
4.4.2 Independence of errors

Figure 8 shows the plot of residuals against the order of observation. The
residuals fluctuate randomly around the baseline 0 without showing any trend
or pattern; hence the error terms are assumed to be independent.
4.4.3 Multicollinearity among independent variables

Figure 9 shows the collinearity matrix between each pair of independent
variables. The correlations between obesity with ldl and age with adiposity
were 0.72 and 0.63 respectively. However, it is not sufficient to rely solely
on the collinearity matrix and conclude that the data have a multicollinearity
problem; hence the Variance Inflation Factor was further analysed.
Table 6: Variance Inflation Factor (VIF)
Variables VIF
sbp 1.237388
tobacco 1.307798
ldl 1.258307
adiposity 4.065627
famhist 1.108462
typea 1.044693
obesity 2.597437
alcohol 1.095382
age 2.239563
Table 6 shows the VIF values of the independent variables. The Variance
Inflation Factors (VIF) were calculated to identify the multicollinearity
problem in the data. Based on the table, it was found that there is no VIF
value greater than 10; hence the data do not suffer from a serious
multicollinearity problem.
4.4.4 Influential outliers

Figure 10 shows the derived influence statistics for the 324 observations in
the training set. The change in Pearson chi-square, change in deviance and
change in beta were the main concern to identify the influential outliers.
Therefore, the derived influence statistics were used to create useful
diagnostic plots.
Figure 11: Outlying cases and their impact on influence statistics
Figure 11 shows the 13 outlying cases and their impact on influence
statistics. As seen from the figure, the change in chi-square and deviance
were greater than the cut-off value of 4 for these cases, while none of them
exceeded the cut-off value of 1 for the change of beta. This indicated the
presence of 13 potential outliers, with none of them being influential.
Figure 12: Probability against scaled change in Pearson chi-square
Figure 12 shows the plot of estimated probability against the scaled change
in Pearson chi-square. Based on the figure, the points falling in the top
left of the plot are the cases that are poorly fit. Therefore, there are
several potential outliers in this plot.
Figure 13: Probability against scaled change in deviance
Figure 13 shows the plot of estimated probability against the scaled change
in deviance. As seen in the figure, the points falling in the top left of the
plot are the cases that are poorly fit. Hence, there were a few potential
outliers found in this plot.
Figure 14: Probability against scaled change in coefficients
According to Figure 14, none of the points were greater than 1. Therefore,
there were no influential outliers found in the data, and removing the
outlying cases was not necessary.
4.5 Model Validation and Performance Analysis
After all the assumptions were satisfied, the predictive model underwent
model validation and performance analysis.

4.5.1 Hosmer-Lemeshow Goodness-of-fit Test

Based on Figure 15, the p-value for the hypothesis testing was significantly
high (greater than α = 0.05). Hence, the null hypothesis was not rejected,
and the predictive model is a good fit for the data.
4.5.2 Confusion Matrix
Figure 16 shows the results of the confusion matrix. The accuracy,
sensitivity and specificity of the predictive model were obtained through the
confusion matrix. From the results, the predictive model was found to be
69.57% accurate, with a sensitivity and specificity of 60.42% and 74.44%
respectively. This means that the model correctly identified 60.42% of the
people with heart disease and 74.44% of the people without heart disease.
4.5.3 Receiver Operating Characteristic (ROC) curve
Figure 17: Receiver Operating Characteristic (ROC) curve (sensitivity on the
vertical axis)
The ROC curve was plotted across different classification thresholds to
determine the area under the curve (AUC). Based on the curve, the AUC value
obtained is 0.6743, which indicates a fair discriminative ability of the
predictive model.
CHAPTER 5
CONCLUSION AND RECOMMENDATIONS

5.1 Conclusion
From the results, it showed that the dataset did not violate any assumptions
of logistic regression. The Hosmer-Lemeshow goodness-of-fit test was carried
out to test the fitness of the predictive model, and the results showed that
the predictive model is a good fit. Furthermore, the confusion matrix and
ROC curve were used to find the performance of the predictive model. Based
on the results, the predictive model achieved 60.42% sensitivity (the
proportion of people with heart disease that were correctly identified) and
74.44% specificity (the proportion of people without heart disease that were
correctly identified). Moreover, the
model achieved 69.57% accuracy with an AUC value of 0.6743, indicating a fair
overall discriminative ability of the predictive model.
5.2 Recommendations

In this research, logistic regression was the only method applied to predict
heart failure events. Many popular statistical methods perform well in
forecasting such events, for example the Decision Tree, the Naïve Bayes
algorithm, and Random Forest. Hence, it is recommended to apply these methods
together in one study and compare their performances; the model with the
highest accuracy will be chosen as the final model. Furthermore, sampling
techniques can be applied to the data set before applying the selected
statistical method.
References

Abhay, Kishore, Ajay, Kumar, Karan, Singh, Maninder, Punia, Yogita, and …

Allison, P. D. (2001). Logistic Regression Using the SAS System: Theory and
Application. SAS Institute.

Babu, S., Vivek, E., Famina, K., Fida, K., Aswathi, P., Shanid, M., and …

Bhatti, I. P., Lohano, H. D., Pirzado, Z. A., and Jafri, I. A. (2006). A
logistic regression analysis of the ischemic heart disease risk. Journal of
Applied Sciences, 6(4):785–788.

Cook, R. D. (1977). Detection of influential observation in linear
regression. Technometrics, 19(1):15–18.

… Malaysia, 2019.

… 159–164.

Khemphila, A. and Boonjing, V. (2010). Comparing performances of logistic
regression, decision trees, and neural networks for classifying heart
disease patients. In 2010 International Conference on Computer Information
Systems and Industrial Management Applications (CISIM). IEEE.

Kutner, M. H., Nachtsheim, C. J., Neter, J., and Li, W. (2005). Applied
Linear Statistical Models. McGraw-Hill/Irwin, 5th edition.

Midi, H., Sarkar, S. K., and Rana, S. (2010). Collinearity diagnostics of
binary logistic regression model. Journal of Interdisciplinary Mathematics,
13(3):253–267.

Mothukuri, R., Satvik, M. S., Balaji, K. S., and Manikanta, D. (2020).
Effective … machine learning algorithms by Python JupyterLab. International
Journal …

Picard, R. R. and Berk, K. N. (1990). Data splitting. The American
Statistician, 44(2):140–147.

Prasad, R., Anjali, P., Adil, S., and Deepa, N. (2019). Heart disease
prediction …

Rana, S., Midi, H., and Sarkar, S. (2010). Validation and performance
analysis … pages 23–25.

Rossouw, J., Du Plessis, J., Benadé, A., Jordaan, P., Kotze, J., Jooste, P.,
and Ferreira, J. (1983). Coronary risk factor screening in three rural
communities: the CORIS baseline study. South African Medical Journal,
64(12):430–436.

Sarkar, S., Midi, H., and Rana, S. (2010). Model selection in logistic
regression …

Sarkar, S. K., Midi, H., and Rana, S. (2011). Detection of outliers and
influential observations in binary logistic regression …

Saw, M., Saxena, T., Kaithwas, S., Yadav, R., and Lal, N. (2020). Estimation
of prediction for getting heart disease using logistic regression model of
machine learning …

Thabtah, F. (2017). … In Proceedings of the International Conference on
Medical and Health Informatics 2017, ICMHI '17, pages 1–6.

World Health Organization. Cardiovascular diseases (CVDs).
https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).
APPENDIX A
CODING
A. R PROGRAMMING
#1. Load packages
library(psych)
library(dplyr)
library(ggplot2)
library(patchwork)
library(ggthemes)
library(rsample)
library(corrplot)
library(HH)
library(LogisticDx)
library(ResourceSelection)
library(pROC)

#2. Load data (the original import line was lost; a CSV read is assumed)
# data <- read.csv("SAheart.csv")   # hypothetical file name
head(data)
summary(data)
#3. DESCRIPTIVE STATISTICS
psych::describe(data)

#4. DATA VISUALISATION
#A. Categorical independent variables
#Factor 1: famhist
# (the opening ggplot()/geom_bar() lines were lost; this call is assumed)
p1 <- ggplot(data, aes(x = factor(famhist), fill = factor(chd))) +
  geom_bar(position = "dodge") +
  scale_x_discrete(labels = c("0 (No)", "1 (Yes)")) +
  scale_fill_manual(values = c("lightblue", "darkolivegreen2"),
                    name = "CHD",
                    labels = c("0 (No)", "1 (Yes)")) +
  labs(x = "famhist") +
  theme_minimal(base_size = 12)
p1
#B. Continuous independent variables
# Each continuous factor is drawn as a boxplot against chd with the group
# mean as a red point. The aes() lines were garbled in extraction, so the
# repeated per-factor pattern is reconstructed once as a helper (assumed).
box_relation <- function(df, var, ylab) {
  ggplot(df) +
    geom_boxplot(aes(x = factor(chd), y = .data[[var]], fill = factor(chd))) +
    stat_summary(aes(x = factor(chd), y = .data[[var]]),
                 fun = "mean", geom = "point",
                 color = "red", na.rm = TRUE, size = 2) +
    scale_fill_manual(values = c("lightblue", "darkolivegreen2")) +
    theme_solarized_2(light = TRUE, base_size = 10) +
    theme(axis.title  = element_text(colour = "black"),
          axis.text.x = element_text(colour = "black"),
          axis.text.y = element_text(colour = "black")) +
    labs(x = "chd", y = ylab)
}

#Factor 2: sbp
sbp_relation <- box_relation(data, "sbp", "sbp")
#Factor 3: tobacco
tobacco_relation <- box_relation(data, "tobacco", "tobacco")
#Factor 4: ldl
ldl_relation <- box_relation(data, "ldl", "ldl")
#Factor 5: adiposity
adiposity_relation <- box_relation(data, "adiposity", "Adiposity")

p_combined_plots1 <- (sbp_relation + tobacco_relation) /
  (ldl_relation + adiposity_relation) +
  plot_annotation(
    title = "Heart Disease - Prediction",
    theme = theme(plot.title    = element_text(color = "white",
                                               face = "bold", size = 12),
                  plot.subtitle = element_text(color = "white",
                                               face = "bold.italic", size = 10)))
p_combined_plots1
#Factor 6: typea
typea_relation <- box_relation(data, "typea", "typea")
#Factor 7: obesity
obesity_relation <- box_relation(data, "obesity", "obesity")
#Factor 8: alcohol
alcohol_relation <- box_relation(data, "alcohol", "alcohol")
#Factor 9: age
age_relation <- box_relation(data, "age", "age")

p_combined_plots2 <- (typea_relation + obesity_relation) /
  (alcohol_relation + age_relation) +
  plot_annotation(
    title = "Heart Disease - Prediction",
    theme = theme(plot.title    = element_text(color = "white",
                                               face = "bold", size = 12),
                  plot.subtitle = element_text(color = "white",
                                               face = "bold.italic", size = 10)))
p_combined_plots2
#5. Full model (the glm() call was lost in extraction; reconstruction assumed)
full_model <- glm(chd ~ ., data = data, family = binomial)
summary(full_model)

#6. DATA SPLITTING (70/30 split with rsample assumed)
set.seed(123)
data_split <- initial_split(data, prop = 0.7)
training_set   <- training(data_split)
validation_set <- testing(data_split)
head(training_set)

#7. STEPWISE REGRESSION (forward, starting from the null model; assumed)
null_model <- glm(chd ~ 1, data = training_set, family = binomial)
reduced <- step(null_model,
                scope = list(lower = null_model, upper = full_model),
                direction = "forward")
#reduced model (same as stepwise)
summary(reduced)
#8. Assumptions
#A. Test independence of error (Reduced)
res <- reduced$residuals
plot(res, type = "b",   # plot() call partially lost; reconstruction assumed
     main = "Residual versus Order of Observation",
     xlab = "Order of Observation", ylab = "Residual")
abline(h = 0, lty = 2)

#B.1. Test multicollinearity (Full)
vif(full_model)

#C. Outlier detection (Reduced)
#C.1 Numerical (the dx() call is assumed; the original lines were garbled)
outlier <- dx(reduced)   # LogisticDx diagnostics: dChisq, dDev, dBhat
outlier
#9. Model validation
#A. Hosmer-Lemeshow test; `fitness` is assumed to be the final model
# refitted on the validation set (its definition was lost in extraction)
fitness <- glm(formula(reduced), data = validation_set, family = binomial)
hl <- hoslem.test(fitness$y, fitted(fitness), g = 10)
hl
cbind(hl$expected, hl$observed)

#B. Confusion matrix (prediction lines partially lost; caret/forcats assumed)
prob <- predict(reduced, newdata = validation_set, type = "response")
pred <- forcats::fct_recode(factor(prob > 0.5), "0" = "FALSE", "1" = "TRUE")
validate <- factor(validation_set$chd)
caret::confusionMatrix(pred, validate, positive = "1")

#C. ROC (Area under the curve)
roc_curve <- roc(validation_set$chd, prob)
plot(roc_curve)
auc(roc_curve)
APPENDIX B
BOX-TIDWELL TEST