Literature survey paper on Comparative Analysis of Diabetics Prediction Systems using Machine Learning Algorithms

A Literature survey paper on "Comparative Analysis of Diabetics Prediction Systems using Machine Learning Algorithms."

Uploaded by

Ibtesam Hussain

Comparative Analysis of Diabetics Prediction Systems using Machine Learning Algorithms


Dr. Sharada K A, Nikhil K S, Mohammed Ibtesam Hussain, Mukesh Rokaya, Sadiq Khan S, Praveen Kumar
Computer Science and Engineering, HKBK College Of Engineering, Bangalore, India

Abstract— Machine learning algorithms are used in domains such as health care, banking, and education because they can extract useful information from databases and make predictions, for example of disease. Diabetes is a prevalent chronic condition characterized by elevated blood sugar levels which, if not managed properly, can lead to severe complications such as cardiovascular disease, neuropathy, and retinopathy. Early and accurate prediction of diabetes can significantly improve patient outcomes by enabling timely intervention and treatment. This study conducts a comparative analysis of various machine learning algorithms to determine the most effective model for predicting diabetes.

Keywords— Diabetes, Support Vector Machine, Machine Learning, Decision Trees, K-NN, deep learning.

I. INTRODUCTION
Diabetes mellitus, a chronic metabolic disorder characterized by elevated levels of blood glucose, has become a global health crisis, affecting millions of individuals worldwide. The early detection and accurate prediction of diabetes are vital for effective management and prevention of the associated complications, such as cardiovascular diseases, neuropathy, and nephropathy. Traditional diagnostic methods often rely on clinical assessments and laboratory tests, which, while effective, can be time-consuming and may miss early signs in at-risk populations.

In recent years, rapid advancements in machine learning have provided powerful tools for the healthcare sector, enabling the development of predictive models that can analyze vast amounts of data with high accuracy. These models leverage patient data, such as medical history, genetic information, and lifestyle factors, to predict the likelihood of developing diabetes. By doing so, they offer the potential for early intervention and personalized treatment plans, ultimately improving patient outcomes and reducing healthcare costs.

This study aims to conduct a comprehensive comparative analysis of various machine learning algorithms to identify the most effective model for predicting diabetes. The algorithms evaluated in this research include Logistic Regression, Decision Trees, Random Forests, Support Vector Machines, and Neural Networks. Each algorithm is assessed on a range of performance metrics, including accuracy, precision, recall, F1 score, and AUC-ROC, using a publicly available diabetes dataset.

The primary objective of this research is to provide healthcare professionals with insights into the strengths and limitations of different machine learning models, guiding them in selecting the optimal algorithm for clinical applications. By comparing the performance of these models, this study seeks to enhance the predictive accuracy of diabetes diagnosis and contribute to the growing body of knowledge in the field of medical data science.

Fig. 1. Flowchart for ML based diabetes prediction

II. RECENT RELATED WORKS
This section reviews recent related work on diabetes prediction models that use machine learning algorithms.
K. VijiyaKumar et al. in [1] applied ML models for
prediction of diabetes. The Random Forest ML model was
implemented and its performance was tested for diabetes.
Chatrati et al. [2] devised a system that measures blood pressure and glucose and supports their analysis through a graphical user interface.

Sujatha et al. [3] focused on predicting the risk of diabetes irrespective of gender. They used an ensemble approach, which combines several algorithms. After investigating ensemble approaches, they suggested an ensemble of Naïve Bayes, SVM, and Optimized Parametric Multilayer Perceptron as the combination that provides comparatively higher accuracy.

Priyanka Sonar and JayaMalini [4] proposed a model to predict and diagnose diabetes with high precision. The proposed model can act as a decision support system. It is based on ANN, Decision Tree, Naive Bayes, and SVM. The Decision Tree algorithm achieved a precision of 85%, Naive Bayes 77%, and SVM 77.3%.

Sowah et al. [5] developed a management system based on machine learning algorithms that considers various factors that cause diabetes. The system performed food classification using a TensorFlow neural network model and food recommendation using the KNN algorithm. The authors used cognitive sciences to build a chatbot for discussing diabetes-related queries. User activity tracking was also implemented using Cordova plugins.

Ramesh et al. [6] proposed a model for diabetes risk prediction and management using a remote monitoring framework with automated facilities. Smart wearable devices and smartphones serve as the personal health devices in this framework. They used SVM, a frequently used supervised learning algorithm, for prediction [6]. The dataset used is the Pima Indian Diabetes Database.

III. DATASET DESCRIPTION
The PIMA Indian Diabetes Dataset (PIDD) plays a major role in diabetes research. Most authors have developed diabetes prediction models using PIDD because of its simplicity and uniqueness [7]. PIDD has eight independent features and one dependent feature, as shown in Fig. 2. Outcome is the dependent feature; the others are independent features. In total, 768 rows of data are present in the PIDD.

IV. MODEL BUILDING AND SIMULATION
The machine learning model is built using Python 3 and simulated in a Jupyter notebook. The process includes three phases: data exploration, data cleaning, and model selection and evaluation.

A. Data Exploration
The data exploration phase is essential for understanding the data and is helpful for preprocessing. It provides the basic understanding needed for feature engineering, and it can reduce the work required in the data cleaning phase. First, the necessary Python libraries, such as numpy and pandas, and the PIMA dataset are imported into the Jupyter notebook, and then data exploration begins. The shape and size of the dataset are analyzed in this process, and visualization libraries such as matplotlib and seaborn can be used to visualize the dataset.

B. Data Cleaning
Data cleaning is considered a crucial phase in machine learning model building because it can make or break the model [8]. A common saying in machine learning, “Better data beats fancier algorithms”, means that better data will provide better model results. In this paper, only missing or null data points and unexpected outliers were considered in the data cleaning phase. TABLE I shows that there are no missing values in the PIMA dataset.

TABLE I. MISSING DATA OBSERVATION
S.NO   FEATURES                     MISSING VALUES
1      Pregnancies                  0
2      Glucose                      0
3      Blood Pressure               0
4      Skin Thickness               0
5      Insulin                      0
6      BMI                          0
7      Diabetes Pedigree Function   0
8      Age                          0
9      Outcome                      0

C. Model Selection and Evaluation
Model selection is a critical step in machine learning, as it determines the algorithm and configuration that best suit the dataset. The process ensures that the model aligns with the dataset's characteristics, achieving optimal accuracy and performance. Evaluation during this phase helps in making informed decisions about the suitability of the model for the given problem.
Every dataset has unique properties, such as size, feature types, and distribution. Different machine learning (ML) algorithms have distinct strengths and weaknesses.
Algorithm selection is the core part of machine learning. It is the phase responsible for accuracy and other performance measures, and a model is selected or rejected based on the results of this phase.
Each dataset may differ in unique ways, so not every model fits every dataset. The model has to be selected to suit the dataset. In this paper, two model evaluation techniques were used: Train/Test Split and K-Fold Cross Validation.
The Train/Test Split method splits the dataset into two portions, as shown in Fig. 2: a training set and a testing set.
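The missing-value check and the two evaluation techniques named above (Train/Test Split and K-Fold Cross Validation) can be illustrated with a minimal sketch. It uses only the Python standard library so that it runs anywhere; it is not the authors' notebook code. The toy rows, the helper names (`count_missing`, `split_train_test`, `k_fold_indices`, `accuracy`), the 25% test fraction, k = 5, and the majority-class baseline are all assumptions made for illustration. In a notebook one would more likely rely on pandas and on scikit-learn utilities such as `train_test_split` and `KFold`.

```python
import random

def count_missing(rows, columns):
    """Count None entries per column (the kind of check summarised in TABLE I)."""
    return {c: sum(1 for r in rows if r.get(c) is None) for c in columns}

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it once into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is tested exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def majority_label(rows):
    """Placeholder model: always predict the most common Outcome in the training rows."""
    ones = sum(r["Outcome"] for r in rows)
    return 1 if ones * 2 >= len(rows) else 0

# Hypothetical toy rows standing in for PIDD records (not real patient data).
rng = random.Random(42)
COLUMNS = ["Glucose", "BMI", "Outcome"]
rows = [
    {"Glucose": rng.randint(70, 200),
     "BMI": round(rng.uniform(18, 45), 1),
     "Outcome": rng.randint(0, 1)}
    for _ in range(100)
]

print("missing values per column:", count_missing(rows, COLUMNS))

# Single train/test split: one estimate that depends on where the cut falls.
train, test = split_train_test(rows, test_fraction=0.25)
label = majority_label(train)
split_score = accuracy([r["Outcome"] for r in test], [label] * len(test))
print("train/test split accuracy:", split_score)

# K-fold: average the score over k different held-out folds.
fold_scores = []
for train_idx, test_idx in k_fold_indices(len(rows), k=5):
    label = majority_label([rows[i] for i in train_idx])
    y_true = [rows[i]["Outcome"] for i in test_idx]
    fold_scores.append(accuracy(y_true, [label] * len(y_true)))
print("5-fold mean accuracy:", sum(fold_scores) / len(fold_scores))
```

The single split yields one score that depends on the particular cut, whereas the k-fold loop averages over k held-out folds. Replacing the placeholder baseline with any of the classifiers compared in this paper leaves the splitting logic unchanged.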
Fig. 2. Train/Test Split

The Train/Test Split method has both merits and demerits. It is useful because it is simple and fast while still being efficient. At the same time, it is not optimal for smaller datasets, as each split uses only a fraction of the data for training and testing, which can lead to unreliable performance estimates.

The table below shows the accuracy of Train/Test Split.

S.NO   ALGORITHMS            ACCURACY
1      k-NN                  0.729282
2      SVM                   0.740331
3      Logistic Regression   0.779006
4      Decision Tree         0.723757
5      G-Naïve Bayes         0.734807
6      Random Forest         0.762431
7      Gradient Boost        0.773481

K-fold cross-validation is a robust technique used to evaluate the performance of machine learning models. It helps ensure that the model's performance is not dependent on a particular train-test split. By systematically varying the training and test sets, it gives a better indication of how the model will perform on unseen data, ultimately leading to better generalization.

TABLE III. ACCURACIES IN K-FOLD CROSS VALIDATION
S.NO   ALGORITHMS            ACCURACY
1      k-NN                  0.714136
2      SVM                   0.755651
3      Logistic Regression   0.772165
4      Decision Tree         0.686701
5      G-Naïve Bayes         0.754205
6      Random Forest         0.764022
7      Gradient Boost        0.759798

V. RESULTS DISCUSSION
This section presents the result analysis of the machine learning algorithms in the research works discussed in Section II. Most of the work in the literature describes ML algorithms such as the SVM classification algorithm, the Random Forest classification algorithm, and Decision Tree. The performance metrics used in the research papers to evaluate the efficiency of the models include accuracy, sensitivity, specificity, recall, and precision.
The accuracy obtained by the various ML algorithms depends entirely on the number of correct predictions on the test data:
Accuracy = (TPos + TNeg) / (TPos + TNeg + FPos + FNeg)    (1)
where TPos, TNeg, FPos, and FNeg are the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Fig. 3. Accuracy scores of selected ML model

VI. CONCLUSION
This paper reviewed machine learning and deep learning techniques for the analysis and prediction of diabetes. The well-known machine learning algorithms discussed in this study are Logistic Regression, Gaussian NB, KNN, SVM, Decision Tree, Random Forest, and Gradient Boosting classification algorithms. From the simulation results, Random Forest provided the best accuracy, 98%, among the models discussed in this paper. This paper also states that feature engineering, feature selection, outlier removal, and other data cleaning and data mining techniques provide better accuracy than simple prediction using simple machine learning algorithms.
This study explored machine learning and deep learning techniques for diabetes analysis and prediction, focusing on commonly used algorithms such as Logistic Regression, Gaussian Naïve Bayes (NB), K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Decision Trees, Random Forest, and Gradient Boosting classifiers. Based on simulation results, the Random Forest algorithm demonstrated superior performance with an accuracy of 98%, outperforming the other models discussed in the paper. These results highlight the significance of model selection and algorithm optimization in achieving high prediction accuracy.
In the medical domain, accuracy is a very important measure for diagnosing diseases and disorders effectively. This paper will help researchers working in the medical domain, especially in diabetes diagnosis, by providing a path to improve the accuracy scores and other evaluation metric scores of their models.
REFERENCES
[1] American Diabetes Association, “DIABETES CARE”, Volume 30, Supplement 1, January 2007, DOI: 10.2337/dc07-S042.

[2] S. P. Chatrati, G. Hossain, A. Goyal, “Smart home health monitoring system for predicting type 2 diabetes and hypertension”, Journal of King Saud University – Computer and Information Sciences, Volume 143, ISSN 106424, Page No: 1-9, 2020.

[3] K. Sujatha, K.V. Krishna Kishore, B. Srinivasa Rao, Rajkumar Rajasekaran, “Diabetes Disease Prediction Based on Symptoms Using Machine Learning Algorithms”, Annals of R.S.C.B., ISSN: 1583-6258, Vol. 25, Issue 6, 2021, Pages 3805-3817, 08 May 2021.

[4] Priyanka Sonar, Prof. K. JayaMalini, “Diabetes prediction using different machine learning approaches”, Third International Conference on Computing Methodologies and Communication (ICCMC 2019), IEEE Xplore Part Number: CFP19K25-ART; ISBN: 978-1-5386-7808-4.

[5] Robert A. Sowah, Adelaide A. Bampoe-Addo, Stephen K. Armoo, Firibu K. Saalia, Francis Gatsi, and Baffour Sarkodie-Mensah, “Design and Development of Diabetes Management System Using Machine Learning”, Hindawi-International Journal of Telemedicine and Applications, Volume 2020, Article ID 8870141, 17 pages.

[6] J. Ramesh, R. Aburukba, A. Sagahyroon, “A remote healthcare monitoring framework for diabetes prediction using machine learning”, Healthc. Technol. Lett. 8, Page No: 45–57, 2021.

[7] Amatul Zehra, Tuty Asmawaty, M.A M. Aznan, “A comparative study on the pre-processing and mining of Pima Indian Diabetes Dataset”, International Joint Conference on Neural Networks (IJCNN), pp. 2159 – 65 (2020).

[8] Sathyaseelan K, S. Sarathambekai, “Machine Learning based Prediction Model for Health Care Sector - A Survey”, 2021 Innovations in Power and Advanced Computing Technologies (i-PACT), 2021.
