International Journal of Advances in Engineering and Management (IJAEM)

Volume 4, Issue 12 Dec. 2022, pp: 885-890 www.ijaem.net ISSN: 2395-5252

Prediction of Diabetes using R


Omkar Kalange, Tejaswini Katale, Atharv Kale, Rushikesh Kahat, Juwairia Sayyed
Department of Computer Engineering, Vishwakarma Institute of Technology

---------------------------------------------------------------------------------------------------------------------------------------
Submitted: 18-12-2022 Accepted: 31-12-2022
---------------------------------------------------------------------------------------------------------------------------------------

ABSTRACT: Diabetes is a chronic disease caused by persistently high blood sugar levels in the human body. It is classified into “Type 1” and “Type 2” based on the level of glucose in the body, along with gestational diabetes (diabetes while pregnant). Currently, diabetes is diagnosed using the A1C test, the fasting blood sugar test, the glucose tolerance test and the random blood sugar test. However, if detected early, diabetes can be avoided. Machine learning and deep learning techniques for detecting diabetes come into play to address this issue. This research paper experiments with and analyzes three machine learning algorithms, Random Forest (RF), Decision Tree and K-Nearest Neighbor (KNN), together with upsampling, feature selection and performance metrics (precision and recall). The data was procured from the Iraqi society, from the laboratory of Medical City Hospital (the Specialized Center for Endocrinology and Diabetes, Al-Kindy Teaching Hospital). The dataset consists of 11 risk factors. However, upsampling, feature selection and a correlation matrix helped to rule out some irrelevant factors.
Keywords: Machine Learning, Diabetes prediction, Regression analysis, KNN, Random Forest, Decision Tree, Upsampling, Feature Selection, Precision, Recall.

I. INTRODUCTION
Diabetes is a disease that is threatening lives around the world today. The most common types of diabetes are Type 1, Type 2 and gestational diabetes. Some of the risk factors include age, high blood pressure, weight and family history. The symptoms may include hunger, fatigue, excessive thirst, blurred vision and numbness [1]. In India's adult population, there are probably 72.96 million cases of diabetes. The prevalence in urban areas ranged from 10.9% to 14.2% [9]. In rural India, the prevalence was 3.0-7.8% in the population aged 20 years and above, with a much higher prevalence among individuals over the age of 50 (INDIAB Study) [9]. More than 200 million people are affected, and the annual prevalence of diabetes in the world is increasing by about seven percent [16]. K-Nearest Neighbor is a simple supervised algorithm that is used for both classification and regression. The Decision Tree algorithm builds a training model that is used to predict outcomes. Random Forest is widely used for classification and regression analysis. Hence, this paper implements the three prediction techniques mentioned above, taking into consideration only the significant factors from the dataset. For better results, upsampling, feature selection and data cleaning have been implemented.

II. DATASET DESCRIPTION
The diabetes data was collected from the Iraqi society, from the laboratory of Medical City Hospital (the Specialized Center for Endocrinology and Diabetes, Al-Kindy Teaching Hospital). Ten risk factors are included in the dataset, and the patient's gender is also taken into consideration. These characteristics are displayed in Table 1. The dataset consists of a total of 1000 observations of 11 attributes, plus the Class label. Counting the label, it contains 2 integer, 2 character and 8 numeric attributes.

Table 1
Diabetes Dataset Risk Factors
FEATURE NUMBER   ATTRIBUTE NAME   ATTRIBUTE TYPE
1                Gender           Character
2                Age              Integer
3                Urea             Numeric
4                Cr               Integer
5                HbA1c            Numeric
6                Chol             Numeric
7                TG               Numeric
8                HDL              Numeric
9                LDL              Numeric
10               VLDL             Numeric
11               BMI              Numeric
12               Class            Character

III. METHODOLOGY
The model proposed in this paper is divided into three main stages. The first stage is data processing, which includes data cleaning, type conversions and dividing the data into training and testing sets. The second stage involves upsampling, feature selection with a wrapper method, and implementation of the machine learning models. The algorithms implemented are KNN, Decision Tree and Random Forest. The third and final stage is to compute the accuracy, precision and recall values.

DATA PROCESSING:
To achieve this goal, some preprocessing is done on the diabetes dataset [17]. It includes data cleaning, which means removing duplicate values, and converting categorical attributes to integer values so that mathematical operations can be performed on them.

1. Data Cleaning: NA values, duplicate records and outliers were removed from the dataset for better accuracy.
2. Type Conversions: This refers to changing the data type of a variable so that numeric operations can be carried out on it. In the dataset, Gender (M, F) was typecast to (0, 1). The outcomes (N, P, Y), where N means the patient does not have diabetes, P means there is a possibility of diabetes and Y means the patient has diabetes, were typecast to (1, 2, 3).
3. Training and Testing Data: Training data is the portion of the dataset used to train the machine learning model, whereas testing data is the portion used to evaluate it. The dataset is divided into two parts using a split function with a ratio of 0.7, that is, 70% for training and the remaining 30% for testing.

To prevent the results from being inclined towards the majority class, the following methods are used, which result in an equalization procedure.
1. Upsampling: Upsampling balances the classes by resampling examples of the minority classes until they are as frequent as the majority class. Without it, a model trained on an imbalanced dataset would be dominated by the majority class. KNN, for example, would predict the majority class more effectively than the minority classes, resulting in a high sensitivity rate and a low specificity rate. To address this, the Up.sampling() method is implemented.
2. Feature Selection: Feature selection is the procedure of removing non-significant input variables when developing a predictive model, in order to improve the performance of the model. Using the Boruta function from the Boruta package, a total of 4 unimportant features were found: Gender, Cr, HDL and Urea.
3. Wrapper Method: The Boruta package used for feature selection is a wrapper algorithm. It helps to understand the mechanisms related to the variable of interest, rather than just building a black-box predictive model with good prediction accuracy.

ALGORITHMS:
This research paper implements the following supervised learning algorithms:
1. K-Nearest Neighbor: The K-Nearest Neighbor (KNN) method can be used to solve both regression and classification problems, although it is most commonly employed for classification problems in business. Its main benefits are the ease with which it can be interpreted and the small amount of time it takes to compute [2]. The selection of the value of K is very important. Note that K is frequently chosen to be odd in order to avoid ties [6]. KNN uses a distance measure to determine the distance from the point of interest to the points of the training data set [17].

2. Decision Tree: Decision trees are a type of supervised machine learning in which the data is continuously split according to a certain parameter [17]. A tree consists of nodes and branches: the test on each attribute is represented at the nodes, the outcome of the test is represented at the branches, and the class labels are represented at the leaf nodes.

3. Random Forest: As the name suggests, a random forest consists of many decision trees. It uses ensemble learning, a technique that combines multiple classifiers to provide solutions to complex problems. Random forests are ensemble learning methods for classification and regression that work by building a large number of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees in the forest [18].

TECHNIQUES TO EVALUATE MODEL’S EFFECTIVENESS
1. Precision: Precision is one of the measures used to determine the effectiveness of the model's performance. It is the fraction of the positive predictions made by the model that are correct: Precision = TP / (TP + FP).
TP (True Positive): Number of correctly predicted values of the positive class.
FP (False Positive): Number of values incorrectly predicted as the positive class.

2. Recall: Like precision, recall is also used to determine a model's performance. It is the fraction of the actual positive samples that the model detects: Recall = TP / (TP + FN), where FN (False Negative) is the number of positive samples that the model misses. The higher the recall, the more positive samples are detected. It ranges from 0 to 1.
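
The KNN description and the two evaluation measures above can be sketched together in a few lines. This is a minimal illustration in Python rather than the R code used for the paper, which is not shown. The Euclidean distance, the majority vote, the helper names `knn_predict` and `per_class_precision_recall`, and the toy (HbA1c, BMI) values are all assumptions made for the example and are not taken from the dataset.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, point, k=3):
    """Label `point` by majority vote among its k nearest training points.

    Euclidean distance is an assumption; the paper does not name its metric.
    """
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def per_class_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)}, treating each label in turn as positive."""
    pairs = Counter(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (precision, recall)
    return scores


# Made-up (HbA1c, BMI) points, for illustration only
train_X = [(5.0, 22.0), (5.2, 23.0), (6.1, 27.0), (6.3, 28.0), (8.5, 31.0), (9.0, 33.0)]
train_y = ["N", "N", "P", "P", "Y", "Y"]
test_X = [(5.1, 22.5), (6.2, 27.5), (8.8, 32.0), (7.4, 29.5)]
y_true = ["N", "P", "Y", "Y"]

y_pred = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
scores = per_class_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Because the outcome has three classes (N, P, Y), precision and recall are computed once per class, which matches the per-class layout of the result tables.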

IV. RESULTS
Results are inferred on the basis of 3 cases.

C.1) Without Feature Selection and Upsampling
Algorithm       Accuracy    Precision                       Recall
Decision Tree   0.9782609   N:1       P:0.6923  Y:0.9913    N:0.9411  P:1       Y:0.9828
KNN             0.9094203   N:0.7096  P:0.5     Y:0.9610    N:0.6875  P:0.5384  Y:0.9610
Random Forest   0.8949275   N:0.9473  P:0.2580  Y:0.9778    N:0.5625  P:0.6153  Y:0.9567

Fig 1.1- Accuracy without feature selection and upsampling
Fig 1.2- Precision without feature selection and upsampling
Fig 1.3- Recall without feature selection and upsampling
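
The two cases that follow both rely on upsampling. As a rough illustration of that step, the sketch below balances a toy dataset by drawing extra minority-class rows at random, with replacement, until every class is as large as the largest one. It is written in Python and is not the Up.sampling() routine used for the paper. The function name `upsample`, the fixed `seed` and the (HbA1c, BMI) values are hypothetical.

```python
import random
from collections import Counter


def upsample(rows, labels, seed=0):
    """Randomly oversample every class up to the size of the largest class."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())

    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Keep every original row, then add random copies to reach the target size
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Made-up (HbA1c, BMI) rows with an imbalanced class column, for illustration only
rows = [(4.9, 22.0), (5.1, 24.0), (6.0, 27.0), (8.2, 30.0),
        (9.1, 31.0), (7.8, 29.0), (8.8, 33.0), (10.2, 35.0)]
labels = ["N", "N", "P", "Y", "Y", "Y", "Y", "Y"]

bal_rows, bal_labels = upsample(rows, labels)
counts = Counter(bal_labels)
print(counts)
```

After the call, each of the three classes has the same number of rows, so a classifier trained on the balanced data is no longer dominated by the most frequent class.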

C.2) With first Upsampling and then Feature Selection
Algorithm       Accuracy    Precision                       Recall
Decision Tree   0.9784483   N:1       P:1       Y:0.9353    N:0.9707  P:0.9666  Y:1
KNN             0.9698276   N:0.9583  P:0.9547  Y:1         N:0.9913  P:1       Y:0.9181
Random Forest   0.9827586   N:0.9957  P:0.9547  Y:1         N:1       P:1       Y:0.9482

Fig 2.1- Accuracy with first Feature Selection and then Upsampling
Fig 2.2- Precision with first Feature Selection and then Upsampling
Fig 2.3- Recall with first Upsampling and then Feature Selection

C.3) With first Feature Selection and then Upsampling
Algorithm       Accuracy    Precision                       Recall
Decision Tree   0.9760766   N:1       P:1       Y:0.9285    N:0.9720  P:0.9585  Y:1
KNN             0.9744817   N:0.9669  P:0.9585  Y:1         N:0.9808  P:1       Y:0.9428
Random Forest   0.9920255   N:0.9905  P:0.9857  Y:1         N:1       P:1       Y:0.9761

Fig 3.1- Accuracy with first Feature Selection and then Upsampling
Fig 3.2- Precision with first Feature Selection and then Upsampling
Fig 3.3- Recall with first Feature Selection and then Upsampling

The highest overall accuracy (0.9920, Random Forest) is observed in the third model, where feature selection is applied first and then upsampling is implemented. KNN also reaches its highest accuracy there, while the Decision Tree accuracy stays close to 0.98 in all three cases. In terms of the other performance metrics, precision and recall increase drastically in the second model, where upsampling is applied first and then feature selection, with respect to the first model without upsampling or feature selection. In the third model, where feature selection is implemented first and then upsampling, a significant increase in all three performance metrics is observed for KNN and Random Forest.
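
The comparison across the three cases can be checked directly against the reported accuracies. The short Python sketch below copies the accuracy values from the three result tables and picks, for each algorithm, the case with the highest value. The dictionary layout and the names `reported_accuracy`, `best_case` and `overall` are only illustrative.

```python
# Accuracy values copied from the result tables for cases C.1, C.2 and C.3
reported_accuracy = {
    "Decision Tree": {"C.1": 0.9782609, "C.2": 0.9784483, "C.3": 0.9760766},
    "KNN": {"C.1": 0.9094203, "C.2": 0.9698276, "C.3": 0.9744817},
    "Random Forest": {"C.1": 0.8949275, "C.2": 0.9827586, "C.3": 0.9920255},
}

# Case with the highest accuracy for each algorithm
best_case = {algo: max(cases, key=cases.get)
             for algo, cases in reported_accuracy.items()}

# Algorithm and case with the highest accuracy overall
overall = max(
    ((algo, case) for algo, cases in reported_accuracy.items() for case in cases),
    key=lambda pair: reported_accuracy[pair[0]][pair[1]],
)

for algo, case in best_case.items():
    print(f"{algo}: best accuracy {reported_accuracy[algo][case]:.4f} in case {case}")
print("overall best:", overall)
```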

V. CONCLUSION:
Diabetes is one of the most common medical problems in today's world, and if it is not diagnosed in the early phase it can lead to many other health problems. The algorithms and model effectiveness techniques used above can serve as a starting point for future research.

REFERENCES:
[1]. Rashid, Ahlam (2020), “Diabetes Dataset”, Mendeley Data, V1, doi: 10.17632/wj9rwkp9c2.1
[2]. “Prediction of Type 2 Diabetes using Machine Learning Classification Methods”, Procedia Computer Science, Volume 167, 2020.
[3]. “Comparison of Different Machine Learning Models for diabetes detection”, 2020 IEEE International Conference on Advances and Developments in Electrical and Electronics Engineering (ICADEE 2020).
[4]. “Ensemble Learning on Diabetes Data Set and Early Diabetes Prediction”, 2019 International Conference on Computing, Power and Communication Technologies (GUCON), Galgotias University, Greater Noida, UP, India.
[5]. “Prediction of Diabetes using Classification Algorithms”, International Conference on Computational Intelligence and Data Science (ICCIDS 2018).
[6]. “A Comparative Analysis on the Evaluation of Classification Algorithms in the Prediction of Diabetes”, International Journal of Electrical and Computer Engineering (IJECE), Vol. 8, No. 5, October 2018, pp. 3966-3975.
[7]. “Prediction of Diabetes Using Bayesian Network”, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4), 2014, 5174-5178.
[8]. “Machine Learning Tools for Long-Term Type 2 Diabetes Risk Prediction”.
[9]. “Prediction of the Onset of Diabetes Using Artificial Neural Network and Pima Indians Diabetes Dataset”.
[10]. “Predicting Diabetes in Healthy Population through Machine Learning”.
[11]. “Machine Learning-Based Application for Predicting Risk of Type 2 Diabetes Mellitus (T2DM) in Saudi Arabia: A Retrospective Cross-Sectional Study”.
[12]. “A model for early prediction of diabetes”, Received 26 January 2019, Revised 2 July 2019, Accepted 4 July 2019, Available online 9 July 2019.
[13]. “Research on Diabetes Prediction Method Based on Machine Learning”, AINIT 2020.
[14]. “Prediction and diagnosis of future diabetes risk: a machine learning approach”, Springer Nature Switzerland AG 2019.
[15]. “Diabetes Prediction Using Machine Learning Classification Algorithms”, European Journal of Science and Technology, Special Issue 24, pp. 53-59, April 2021, Copyright © 2021 EJOSAT.
[16]. “Diabetes Prediction Using Artificial Neural Network”, International Journal of Advanced Science and Technology.
[17]. “Diabetes Prediction Using Machine Learning”, International Journal of Scientific & Engineering Research, Volume 12, Issue 3, March 2021.
[18]. “Ensemble Learning on Diabetes Data Set and Early Diabetes Prediction”, 2019 International Conference on Computing, Power and Communication Technologies (GUCON).

DOI: 10.35629/5252-0412885890 Impact Factor value 7.429 | ISO 9001: 2008 Certified Journal
