A Decision Support System for Diabetes Prediction Using Machine Learning and Deep Learning Techniques
Amani Yahyaoui
Department of Software Engineering
Istanbul Sabahattin Zaim University, Istanbul, Turkey
[email protected]

Akhtar Jamil
Department of Computer Engineering
Istanbul Sabahattin Zaim University, Istanbul, Turkey
https://orcid.org/0000-0002-2592-1039

Jawad Rasheed
Department of Computer Engineering
Istanbul Sabahattin Zaim University, Istanbul, Turkey
[email protected]

Mirsat Yesiltepe
Dept. of Mathematical Engineering
Yildiz Technical University, Istanbul, Turkey
[email protected]

Abstract— With the continuing increase in the number of deadly diseases that threaten human health and life, medical Decision Support Systems (DSS) continue to prove their effectiveness in supporting physicians and other healthcare professionals in clinical decision making. Among these diseases, diabetes remains one of the leading causes of death in the world. It is characterized by an increase in blood sugar levels, which can have severe effects on other organs. According to the International Diabetes Federation (IDF), 382 million people are living with diabetes, and by 2035 this figure is projected to reach 592 million. In this paper, we propose a DSS for diabetes prediction based on Machine Learning (ML) techniques. We compare conventional machine learning with deep learning approaches. For conventional machine learning, we consider two commonly used classifiers: the Support Vector Machine (SVM) and the Random Forest (RF). For Deep Learning (DL), we employ a fully Convolutional Neural Network (CNN) to detect diabetic patients. The proposed system is evaluated on the publicly available Pima Indians Diabetes database, which consists of 768 samples, each with 8 features; 500 samples are labeled as non-diabetic and 268 as diabetic. The overall accuracy obtained using DL, SVM and RF was 76.81%, 65.38% and 83.67%, respectively. The experimental results show that RF was more effective for diabetes prediction than the deep learning and SVM methods.

Keywords— Decision Support Systems, diabetes, machine learning, deep learning, Support Vector Machine, Random Forest, Convolutional Neural Network.

I. INTRODUCTION
Diabetes mellitus, or diabetes, is an incurable chronic disease caused by a lack or absence of the hormone insulin [1]. Insulin is an essential hormone produced by the pancreas that allows cells to absorb glucose (blood sugar) from food in order to provide them with the necessary energy [2]. A high level of sugar in the blood is known medically as hyperglycemia. It can occur for two main reasons: (1) the body cannot make the insulin required by the cells, or (2) the body cannot respond to insulin properly. The body needs insulin so that glucose in the blood can enter the cells, where it is used for energy. If the body fails to use glucose to produce energy, glucose builds up in the blood, resulting in hyperglycemia. This can cause serious health problems such as diabetic ketoacidosis, nonketotic hyperosmolar state, cardiovascular disease and stroke.

According to the World Health Organization, diabetes is one of the leading causes of death worldwide, and about 422 million people have diabetes. It caused the deaths of 1.6 million people in 2016 [3].

There are two main types of diabetes, type 1 and type 2. Type 1 diabetes accounts for 5 to 10% of all diabetes cases. It appears most often during childhood or adolescence and is characterized by partial functioning of the pancreas. At the beginning, type 1 diabetes does not produce any symptoms, as the pancreas remains partially functional. The disease only becomes apparent when 80-90% of the pancreatic insulin-producing cells have already been destroyed [4].

Type 2 diabetes accounts for 90% of all diabetes cases. It is characterized by chronic hyperglycemia and the body's inability to regulate blood sugar levels, which leads to an excessively high glucose (sugar) level in the blood. This disease usually occurs in older adults and more often affects obese or overweight people [5].

Doctors and current research confirm that if the disease is discovered at an early stage, the chances of recovery are greater. With the continuous advancement of technology, machine learning and deep learning techniques have become very useful for early prediction and disease analysis. Among these techniques, the Support Vector Machine (SVM), the Random Forest (RF) and the Convolutional Neural Network (CNN) are used in this research to predict diabetes.

Recently, several studies have focused on predicting diabetes using machine learning and deep learning techniques. For instance, the authors of [6] proposed a deep learning-based method for diabetes data classification using a Deep Neural Network (DNN). The system was tested on the Pima Indians Diabetes data set and achieved a classification accuracy of 86.26%, which shows the effectiveness of the DNN in helping doctors to predict the disease.

In [7], the authors presented a theoretical study based on three machine learning classification methods: the SVM, Logistic Regression (LR) and the Artificial Neural Network (ANN).

978-1-7281-3992-0/19/$31.00 ©2019 IEEE


In [8], the authors proposed a hybrid system for diabetes prediction that uses the Boltzmann method from deep learning to predict whether the patient is diabetic, and the decision tree method from machine learning to classify whether a diabetic patient has type 1 or type 2 diabetes. The deep learning method reached an accuracy of 80% and the machine learning method 94%. This research suggests that deep learning and machine learning can be combined into a hybrid system that helps doctors not only to predict diabetes but also to classify its type.

Another recent study [3] proposed a hybrid diabetes detection system that combines two common deep learning techniques, the Long Short-Term Memory (LSTM) method and the Convolutional Neural Network (CNN) method. The results confirm the effectiveness of deep learning, reaching 95.7%.

In addition, the authors of [9] presented a deep learning approach to identify diabetes using the Recurrent Deep Neural Network (RDNN) method. This approach was evaluated on the well-known public Pima Indians Diabetes data set and reached an accuracy of 81%.

Moreover, the authors of [10] proposed a hybrid system composed of three common machine learning algorithms, the Decision Tree (DT), the Neural Network (NN) and the Random Forest (RF), to predict diabetes. The data were taken from Luzhou hospital, China, and comprise 68994 healthy and diabetic patients. Principal Component Analysis (PCA) was also used to reduce the dimensionality of the data set. The accuracy reached 80%, which can be considered a good result.

In this paper, three different learning-based methods are compared for the prediction of diabetes on the Pima Indians Diabetes data set. Our main objective is to analyze the efficiency of conventional machine learning and deep learning approaches for diabetes prediction. We used SVM and RF as conventional machine learning approaches and a CNN as the deep learning method. The literature shows that both SVM and RF have proven effective for many classification problems. Similarly, CNNs have recently been widely used for many classification and recognition tasks. Therefore, we selected these algorithms and evaluated their performance for diabetes detection in order to develop a decision support system. Extensive experiments were performed on our data set. For each method, the same number of training and testing samples was selected. All analysis and visualization were carried out in Python 3.6 within the Anaconda 5 environment.

II. MATERIALS
The proposed method was evaluated on the PIMA Indians diabetes dataset, which is available online and can be downloaded from the UCI machine learning repository [11]. The dataset consists of 768 instances with 8 features, in CSV format. Of these, 500 instances belong to the non-diabetic class and the remaining 268 are diabetic patients. The features are: number of pregnancies, plasma glucose concentration, diastolic blood pressure (mm Hg), skin thickness (mm), serum insulin (mu U/ml), body mass index, diabetes pedigree function, and age (years).

In our study, we used all the available features in all experiments. The data set was divided into training (60%) and testing (40%) parts. Furthermore, 10% of the training data was held out as a validation set to evaluate the performance of the model. The samples were randomly selected, and the experiments were repeated ten times to make sure there is no bias in the system. The final accuracy was calculated as the average accuracy over these experiments.

III. METHODOLOGY
A. Support Vector Machine
SVMs have proven to be very effective for various data classification tasks. An SVM tries to find the optimal separating hyperplane between classes by finding the set of points that lie on the edge of the class descriptors. The distance between the classes is referred to as the margin, and the SVM algorithm finds the hyperplane that maximizes it. The larger the margin, the better the classification accuracy that can be obtained. The data points lying on the border are known as support vectors (Fig. 1), hence the name support vector machine. The rest of the training samples are discarded. SVMs can achieve good performance even on small training sets, as only a few training samples are effectively used.

Primarily, SVMs are designed for linearly separable binary classification. Several variations have been proposed to adapt them to multi-class classification problems. They can also be applied to nonlinear cases by using kernel techniques [12]. A kernel maps the data from the nonlinear input space to a space where it is believed that the data can be separated linearly. The most commonly used kernel functions include the radial basis function (RBF), polynomial and sigmoid kernels. For our analysis, the RBF kernel is used due to its popularity. The RBF kernel is calculated using the following formula:

\[ K(x, x') = \exp\left(-\gamma \, \lVert x - x' \rVert^{2}\right) \]

where x and x' are two feature vectors and γ (gamma) is the tunable parameter of the RBF kernel.

Fig. 1. Support vectors in SVM
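As a concrete illustration of the RBF kernel introduced in the Support Vector Machine subsection, the following minimal sketch evaluates the standard form K(x, x') = exp(-γ‖x − x'‖²) in plain Python. The function name `rbf_kernel` and the two sample feature vectors are assumptions made for illustration only and are not taken from the data set; the value γ = 1.0E-4 is the one reported for the SVM experiments in this paper.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two hypothetical 8-feature samples (values invented for illustration).
sample_a = [2, 120, 70, 30, 80, 28.0, 0.5, 35]
sample_b = [4, 150, 80, 25, 100, 32.0, 0.3, 45]

GAMMA = 1.0e-4  # gamma value reported for the SVM experiments

similarity = rbf_kernel(sample_a, sample_b, GAMMA)
print(f"K(a, b) = {similarity:.4f}")
```

Identical vectors give K = 1, and the value decays toward 0 as the vectors move apart. A small γ such as 1.0E-4 makes this decay slow, so samples that are far apart in feature space still receive a noticeable similarity.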
B. Random Forest
The Random Forest algorithm is a supervised classification algorithm widely used in different classification tasks. It was proposed by Leo Breiman and Adèle Cutler in 2001 [13]. The algorithm is derived from the decision tree classifier and, as the name suggests, is based on a set of trees where each tree depends on a set of random variables [14]. The main idea of this algorithm is illustrated in Fig. 2.

The random forest algorithm is based on a set of decision trees. These different trees are characterized by the same number of nodes, but different data. The decisions of these trees are combined to give a final answer that represents an average response of all the trees.

Fig. 2. Random forest algorithm [14]

C. Convolutional Neural Network (CNN)
The CNN is one of the most commonly used DL algorithms. It is a specific type of artificial neural network that uses several layers of perceptrons connected in sequence [15]. CNNs perform a series of operations on the input and transform it to produce the desired output. The output of the previous layers is taken as the input to the next block.

CNNs basically consist of three main types of layers: the convolutional layer, the pooling layer and the fully connected layer (Fig. 3). The convolutional layer forms the core of the network and is characterized by local connections and shared weights. Its objective is to learn feature representations of the input data. The input feature maps are first convolved with a kernel, and the results are then passed through a nonlinear activation function. The pooling layer can be considered a fuzzy filter: it reduces the dimensionality of the features and increases their robustness. Finally, the fully connected layer takes input from the previous layers and sends the signals to each neuron in it. The classification is then performed by the output layer, which usually consists of a softmax classifier. For more details about the CNN, refer to [1],[11].

Like other neural networks, CNNs are based on a set of neurons, each with weights and a bias. Each neuron receives inputs and generates an output through an activation function. The convolutional neural network is thus a subclass of neural networks that has at least one convolutional layer. The main purpose of the CNN is to reduce, by applying convolution, the network complexity of conventional neural networks.

Fig. 3. Architecture of the proposed CNN

IV. EXPERIMENTAL RESULTS
For all experiments, we selected 60% (460 samples) of the data for training and validation, while 40% (154 samples) was used for testing. The performance of the proposed method was evaluated in terms of overall accuracy (OA), Kappa Coefficient (KC), precision (P), recall (R) and F-measure (F). To avoid any bias in the models, we repeated the experiment ten times and report the average accuracy over all runs. Furthermore, as the original data consists of only 8 features, no feature selection or feature importance analysis was performed. We assumed that all features are equally important.

Since the classifiers have some tunable parameters, the first step was to fine-tune those parameters. The following sections describe the experiments and the quantitative results obtained for each classifier.

A. Support Vector Machine
The SVM model requires tuning of a few parameters. We selected the radial basis function (RBF) kernel, as it has proved to be effective for classification. We empirically determined the values for gamma (1.0E-4) and C (10.0). The SVM classification produced an overall accuracy of 73.94%, precision: 62.56%, recall: 45.82%, F-measure: 51.93% and KC. The classifier produced a lower recall and there was a higher
error as it classified non-diabetic patients into the diabetic class. Table 1 summarizes the classification results obtained for the SVM classifier.

TABLE 1. CLASSIFICATION ACCURACY FOR SVM CLASSIFIER

                 Non-Diabetic   Diabetic   Class Precision
Non-Diabetic     98             28         77.78%
Diabetic         15             24         61.54%
Class Recall     86.73%         46.15%

B. Classification with RF
The two important parameters that affect the accuracy of the RF classifier are the number of decision trees and the maximum depth. These two parameters were determined empirically and set to 20 and 7, respectively. Fig. 4 shows the grid search ranges for each parameter and the corresponding accuracies. The overall accuracy obtained for this classifier was 79.26%, precision: 84.36%, recall: 62.74%, F-measure: 70.93% and kappa: 0.556. A relatively high number of diabetic patients (35 out of 99) were misclassified as non-diabetic. However, the class precision for non-diabetic patients was relatively high (88.43%). Table 2 shows the results obtained for the RF classifier.

TABLE 2. CLASSIFICATION ACCURACY FOR RF CLASSIFIER

                 Non-Diabetic   Diabetic   Class Precision
Non-Diabetic     107            14         88.43%
Diabetic         35             64         64.65%
Class Recall     75.35%         82.05%

Fig. 4. Visualization of grid search for RF parameters

V. CONCLUSION
This study performed a comparative analysis of machine learning and deep learning-based algorithms for the prediction of diabetes. The results showed that RF was more effective for the classification of diabetes in all rounds of experiments, producing an overall accuracy of 83.67%. The prediction accuracy of SVM reached 65.38%, while the DL method produced 76.81% on our dataset. In the future, we would like to improve the feature extraction step by applying an automatic deep feature extraction approach and to obtain a better-fitting model to improve the prediction accuracy.

REFERENCES
[1] Z. Punthakee, R. Goldenberg, and P. Katz, “Definition, Classification and Diagnosis of Diabetes, Prediabetes and Metabolic Syndrome,” Can. J. Diabetes, vol. 42, pp. S10–S15, 2018.
[2] M. N. Piero, “Diabetes mellitus – a devastating metabolic disorder,” Asian J. Biomed. Pharm. Sci., vol. 4, no. 40, pp. 1–7, 2015.
[3] G. Swapna, R. Vinayakumar, and K. P. Soman, “Diabetes detection using deep learning algorithms,” ICT Express, vol. 4, no. 4, pp. 243–246, 2018.
[4] L. Lucaccioni and L. Iughetti, “Issues in Diagnosis and Treatment of Type 1 Diabetes Mellitus in Childhood,” J. Diabetes Mellit., vol. 06, no. 02, pp. 175–183, 2016.
[5] “Type 2 Diabetes: a Review of Current Trends -,” Int. J. Curr. Res. Rev., vol. 7, no. 18, pp. 61–66, 2015.
[6] K. Kannadasan, D. R. Edla, and V. Kuppili, “Type 2 diabetes data classification using stacked autoencoders in deep neural networks,” Clin. Epidemiol. Glob. Heal., no. December, pp. 2–7, 2018.
[7] T. N. Joshi and P. P. M. Chawan, “Diabetes Prediction Using Machine Learning Techniques,” Ijera, vol. 8, no. 1, pp. 9–13, 2018.
[8] M. T. P. Kamble, “Diabetes Detection using Deep Learning Approach,” vol. 2, no. 12, pp. 342–349, 2016.
[9] S. Ramesh, R. D. Caytiles, and N. C. S. N. Iyengar, “A Deep Learning Approach to Identify Diabetes,” vol. 145, no. Ngcit, pp. 44–49, 2017.
[10] Q. Zou, K. Qu, Y. Luo, D. Yin, Y. Ju, and H. Tang, “Predicting Diabetes Mellitus With Machine Learning Techniques,” Front. Genet., vol. 9, no. November, pp. 1–10, 2018.
[11] U. M. L. Repository, “https://archive.ics.uci.edu/ml/index.php.”
[12] N. Cristianini, J. Shawe-Taylor, and others, An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.
[13] A. Cutler, D. R. Cutler, and J. R. Stevens, “Random Forests,” no. February 2014, 2011.
[14] B. Yang, X. Di, and T. Han, “Random forests classifier for machine fault diagnosis,” no. April, 2014.
[15] S. Albawi and T. A. Mohammed, “Understanding of a Convolutional Neural Network,” no. April, 2018.
[16] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” Dec. 2014.
