
Analyze the use of machine learning models in the Pima diabetes data set for early stage detection


Mr. Harsh Tita†, Ms. Rashi Sharma†, Mr. Ankit Nayak†, Ms. Anisha Sancheti†, Mr. Saubhik Bandyopadhyay†,

Dr. Pushan Kumar Dutta†



School of Engineering and Technology, Amity University Kolkata, India.

Abstract

Diabetes is a serious metabolic disorder and many people suffer from it. The main causes of this disease are obesity, age, lifestyle, malnutrition, blood pressure, etc. People with diabetes are at high risk for diseases of the heart, kidneys, eyes and other organs. Therefore, early diagnosis of diabetes is important to prevent these diseases. Machine learning and big data analytics play an important role in the healthcare industry: machine learning techniques are used to predict the disease and to improve performance. This paper focuses on ML classification techniques applied to the PIDD (Pima Indian Diabetes Dataset), sourced from the UCI ML repository, to predict the presence of diabetes in patients as accurately as possible using Python. We propose a diabetes prediction model for better classification of diabetes using factors such as BMI, glucose, age, etc. Five ML techniques (KNN, XGBoost, Logistic Regression, Gradient Boosting Classifier and Random Forest Classifier) were used in the experiment to detect diabetes at an early stage, and the performance of these algorithms is validated using measures such as error rate, accuracy, precision, recall and F-measure. XGBoost provided the best result among all the ML algorithms used, with a maximum accuracy of 82%.

1. Introduction

Diabetes, or Diabetes Mellitus (DM), refers to a group of conditions characterized by a high level of blood glucose, which is caused by abnormal insulin secretion and/or action [1]. Its symptoms include frequent urination, increased thirst, blurred vision and feeling tired [2]. Too much sugar in the blood can cause serious damage to, and dysfunction of, various tissues, including the eyes, heart, kidneys, blood vessels and nerves, and sometimes life-threatening health problems [3]. There are three types of chronic diabetic conditions [4]:

Diabetes Mellitus Type-1 – This is an immune system disease in which the insulin-producing cells in the pancreas are destroyed. Without insulin to allow glucose to enter the cells, glucose builds up in the bloodstream. It is usually diagnosed in children and young adults. Because patients require insulin, it is also known as insulin-dependent diabetes.

Diabetes Mellitus Type-2 – In this type, the body's cells become resistant to insulin and the pancreas cannot make enough insulin to overcome this resistance. Therefore, glucose levels rise in the bloodstream. It usually occurs in middle-aged and older people and is referred to as adult-onset diabetes.

Gestational Diabetes – This is the third principal form and is observed during pregnancy; it develops in some women during their pregnancy [5]. Hormones produced during pregnancy make the body's cells more resistant to insulin, causing glucose to build up in the bloodstream.

Prediabetes is a condition in which blood glucose levels are higher than normal and which carries a higher risk of developing diabetes. In practice, an individual with a glucose concentration of 100 to 125 mg/dL is considered pre-diabetic [6]. With changing living standards, diabetes is increasingly common in people's daily life. Therefore, quick and accurate diagnosis and analysis of diabetes is very important. This analysis aims to work out the risk of a person developing diabetes. In this study, Logistic Regression, XGBoost, K-Nearest Neighbors, Random Forest and Gradient Boosting Classifier are used and evaluated on the PIMA dataset to predict diabetes. All algorithms are compared on numerous measures to achieve reasonable accuracy [7].

2. Literature Review

The analysis of related work gives results on various healthcare datasets, where analysis and predictions were carried out using various methods and techniques. Various prediction models have been developed and implemented by researchers using variants of data mining techniques, machine learning algorithms, or combinations of these. Dr Saravana Kumar N M, Eswari, Sampath P and Lavanya S (2015) implemented a system using Hadoop and the MapReduce technique for the analysis of diabetic data. This system predicts the type of diabetes and also the risks associated with it; it is Hadoop based and is economical for any healthcare organization [8]. Aiswarya Iyer (2015) used classification techniques to study hidden patterns in a diabetes dataset. Naïve Bayes and Decision Trees were used in this model, their performance was compared, and the effectiveness of both algorithms was shown as a result [9].
K. Rajesh and V. Sangeetha (2012) used a classification technique: they applied the C4.5 decision tree algorithm to find hidden patterns in the dataset for efficient classification [11]. Humar Kahramanli and Novruz Allahverdi (2008) used an artificial neural network (ANN) in combination with fuzzy logic to predict diabetes [12]. B.M. Patil, R.C. Joshi and Durga Toshniwal (2010) proposed a Hybrid Prediction Model that applies the simple K-means clustering algorithm and then a classification algorithm to the clustering result; the C4.5 decision tree algorithm is used to build the classifiers [13]. Mani Butwall and Shraddha Kumar (2015) proposed a model using a Random Forest Classifier to forecast diabetes behavior [10]. Nawaz Mohamudally and Dost Muhammad Khan (2011) used the C4.5 decision tree algorithm, a neural network, the K-means clustering algorithm and visualization to predict diabetes [14]. Ramraj Santhanam et al. (2017) used both XGBoost and Gradient Boosting algorithms to perform predictive analysis on different datasets and found them to be useful [15].

3. Motivation

Over the last decade, the proportion of people suffering from diabetes has increased dramatically. The current human lifestyle is the main reason for this increase. Three different types of errors can occur in current medical diagnostic procedures:

1. The false-negative type, in which the patient already has diabetes but the test results show that there is no diabetes.

2. The false-positive type, in which the patient is not actually diabetic but the test report states that he or she is diabetic.

3. The unclassifiable type, in which a system cannot diagnose a given case. This happens due to insufficient knowledge extraction from past data, so a given patient may be left unclassified.

In practice, however, patients should be expected to fall into the diabetic or non-diabetic categories. These diagnostic errors can lead to unnecessary treatment, or to no treatment at all when it is needed. To avoid or mitigate the severity of these impacts, it is necessary to create systems that use machine learning algorithms and data mining techniques that deliver accurate results and reduce human effort [16].

4. Definition of dataset

The PIMA diabetes dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases, and all patients here are females at least 21 years old of Pima Indian heritage. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

The attributes present in the dataset are as follows:
a. Pregnancies
b. Glucose
c. Blood pressure
d. Skin thickness
e. Insulin
f. Body mass index (BMI)
g. Diabetes pedigree function
h. Age

Table 1 - Dataset description

To understand the dataset, here are the detailed statistics:

1. Number of diabetic vs non-diabetic patients
2. Correlation between the attributes
3. Frequency of each attribute
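These statistics can be reproduced with a minimal Python sketch, shown below. The file name diabetes.csv and the Outcome label column are assumptions taken from the standard public distribution of the PIMA dataset, not details given in the paper.

import pandas as pd
import matplotlib.pyplot as plt

# Load the PIMA dataset; file and column names are assumed, not from the paper.
df = pd.read_csv("diabetes.csv")

print(df["Outcome"].value_counts())   # 1. number of diabetic (1) vs non-diabetic (0) patients
print(df.corr())                      # 2. correlation between the attributes
df.hist(figsize=(10, 8))              # 3. frequency (distribution) of each attribute
plt.show()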

1. Model building

Step 1 - Standardization of the data: Data standardization is the process of converting data to a common format so that it can be processed and analyzed. Standardizing this data helps us get a clear picture of the attributes and improves access to the most relevant and current information, which makes our analysis and reporting easier.

Step 2 - Splitting the data: Splitting the dataset is essential for an unbiased evaluation of prediction performance. We split our dataset randomly with a test size of 0.2 using train_test_split().

1. The training set is used to train, or fit, the model.
2. The test set is needed for an unbiased evaluation of the final model.

Step 3 - Algorithms used:

1. The K-nearest neighbours (KNN) algorithm uses 'feature similarity' to predict the values of new data points, meaning that a new data point is assigned a value based on how closely it matches the points in the training set.

2. XGBoost is an implementation of gradient boosted decision trees. In this algorithm, decision trees are created sequentially. Weights play an important role in XGBoost: weights are assigned to all the independent variables, which are then fed into a decision tree that predicts results. The weight of variables predicted incorrectly by the tree is increased, and these variables are then fed to the second decision tree. These individual classifiers/predictors are then ensembled to give a strong and more precise model. It can work on regression, classification, ranking and user-defined prediction problems.

3. Logistic regression is a classification algorithm used when the value of the target variable is categorical in nature. It is most commonly used when the data in question has a binary output, i.e. when each sample belongs to one class or the other, or is either a 0 or a 1.

4. Gradient Boosting is a popular boosting algorithm. In gradient boosting, each predictor corrects its predecessor's errors. The weights of the training instances are not tweaked; instead, each predictor is trained using the residual errors of its predecessor as labels.

5. The Random Forest, or Random Decision Forest, builds a set of decision trees (DT) from randomly selected subsets of the training set and then collects the votes from the different decision trees to decide the final prediction.

Step 4 - Approach used:

Start
    mn = [KNN(), XGBoost(), LogisticRegression(), GradientBoostClassifier(), RandomForestClassifier()]
    for (i = 0; i < 5; i++) do
        Model = mn[i];
        Model.fit();
        Model.predict();
        print(Accuracy(i), confusion_matrix, classification_report);
End
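As a concrete illustration of Steps 1-4, here is a minimal Python sketch using scikit-learn and the xgboost package. The file name diabetes.csv, the Outcome column name and the default hyperparameters are assumptions, not details given in the paper; the scaler is fitted on the training split only, a small refinement of the paper's order of Steps 1 and 2.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from xgboost import XGBClassifier

df = pd.read_csv("diabetes.csv")        # assumed local copy of the PIMA dataset
X = df.drop(columns=["Outcome"])        # the eight clinical attributes
y = df["Outcome"]                       # 1 = diabetic, 0 = non-diabetic

# Step 2: hold out 20% of the data for an unbiased evaluation (test_size=0.2).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 1: standardize the attributes to zero mean and unit variance.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Steps 3-4: fit the five classifiers in a loop and report the metrics used in the paper.
models = {
    "KNN": KNeighborsClassifier(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))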

2. Evaluation

This is the final step of the prediction model. Here, we evaluate the prediction results using various evaluation metrics such as classification accuracy, confusion matrix and F1-score.

Classification accuracy - This is the ratio of the number of correct predictions to the total number of input samples [16]:

    Accuracy = Number of correct predictions / Total number of input samples

Confusion matrix - This gives us a matrix as output and describes the complete performance of the model, where TP: True Positive; FP: False Positive; FN: False Negative; TN: True Negative [16].

Table-3 shows the confusion matrix and Table-4 shows the classification report.

Table-3: Confusion matrix

Table-4: Classification report

Accuracy can be calculated from the confusion matrix by dividing the sum of the values lying on the main diagonal by the total number of samples:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)

F1 score - This is used to measure a test's accuracy. The F1 score is the harmonic mean of precision and recall, with a range of [0, 1]. It tells you how precise your classifier is as well as how robust it is [16]:

    F1 = 2 * (Precision * Recall) / (Precision + Recall)

The F1 score tries to find the balance between precision and recall.

Precision - This is the number of correct positive results divided by the number of positive results predicted by the classifier [16]:

    Precision = TP / (TP + FP)

Recall - This is the number of correct positive results divided by the number of all relevant samples [16]:

    Recall = TP / (TP + FN)

3. Results

After applying the various machine learning algorithms to the dataset, we obtained the accuracies listed in Table-2.

Table-2: Accuracy of each algorithm

    Algorithm                   Accuracy
    KNN                         81%
    XGBoost                     82%
    Logistic Regression         76%
    Gradient Boosting           77%
    Random Forest Classifier    72%

We have plotted the accuracies against the algorithms; visualizing them helps us understand the variations among them clearly. XGBoost gives the highest accuracy of 82%.
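The comparison plot can be reproduced from Table-2 with a short matplotlib sketch, for example:

import matplotlib.pyplot as plt

# Accuracies as reported in Table-2.
algorithms = ["KNN", "XGBoost", "Logistic Regression", "Gradient Boosting", "Random Forest"]
accuracies = [81, 82, 76, 77, 72]

plt.bar(algorithms, accuracies)
plt.ylabel("Accuracy (%)")
plt.title("Accuracy of each algorithm on the PIMA dataset")
plt.xticks(rotation=30, ha="right")
plt.tight_layout()
plt.show()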

4. Conclusion

In this study, various machine learning algorithms were applied to the PIMA Indian Diabetes dataset for classification, and among them XGBoost provides up to 82% accuracy. Additionally, this study could be expanded to estimate how likely it is that people without diabetes will develop diabetes in the next few years. Systems developed using these machine learning algorithms could also be tuned to predict other diseases. The study could be further extended by introducing other machine learning algorithms to improve diabetes prediction.

5. References

[1] American Diabetes Association, "Diagnosis and classification of diabetes mellitus", Diabetes Care 2009;32(Suppl. 1):S62–7.

[2] http://diabetesindia.com/

[3] Anjana, R. M., Pradeepa, R., Deepa, M., Datta, M., Sudha, V., Unnikrishnan, R., Bhansali, A., Joshi, S. R., Joshi, P. P., Yajnik, C. S., Dhandhania, V. K. (2011), "Prevalence of diabetes and prediabetes (impaired fasting glucose and/or impaired glucose tolerance) in urban and rural India: Phase I results of the Indian Council of Medical Research–INdiaDIABetes (ICMR–INDIAB) study", Diabetologia 54(12): 3022–3027.

[4] https://my.clevelandclinic.org/health/diseases/7104-diabetes-mellitus-an-overview

[5] https://diabetes.org/diabetes/gestational-diabetes

[6] https://www.diabetes.co.uk/diabetes_care/blood-sugar-level-ranges.html

[7] Iyer, A., S, J., Sumbaly, R. (2015), "Diagnosis of Diabetes Using Classification Mining Techniques", International Journal of Data Mining & Knowledge Management Process 5, 1–14. doi:10.5121/ijdkp.2015.5101, arXiv:1502.03774.

[8] Dr Saravana Kumar N M, Eswari T, Sampath P and Lavanya S, "Predictive Methodology for Diabetic Data Analysis in Big Data", 2nd International Symposium on Big Data and Cloud Computing, 2015.

[9] Aiswarya Iyer, S. Jeyalatha and Ronak Sumbaly, "Diagnosis of Diabetes Using Classification Mining Techniques", International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 5, No. 1, January 2015.

[10] Mani Butwall and Shraddha Kumar, "A Data Mining Approach for the Diagnosis of Diabetes Mellitus using Random Forest Classifier", International Journal of Computer Applications, Volume 120, Number 8, 2015.

[11] K. Rajesh and V. Sangeetha, "Application of Data Mining Methods and Techniques for Diabetes Diagnosis", International Journal of Engineering and Innovative Technology (IJEIT), Volume 2, Issue 3, September 2012.

[12] Humar Kahramanli and Novruz Allahverdi, "Design of a Hybrid System for the Diabetes and Heart Disease", Expert Systems with Applications: An International Journal, Volume 35, Issue 1-2, July 2008.

[13] B.M. Patil, R.C. Joshi and Durga Toshniwal, "Association Rule for Classification of Type-2 Diabetic Patients", ICMLC '10: Proceedings of the 2010 Second International Conference on Machine Learning and Computing, February 9-11, 2010.

[14] Dost Muhammad Khan, Nawaz Mohamudally, "An Integration of K-means and Decision Tree (ID3) towards a more Efficient Data Mining Algorithm", Journal of Computing, Volume 3, Issue 12, December 2011.

[15] Ramraj Santhanam et al., "Experimenting XGBoost Algorithm for Prediction and Classification of Different Datasets", National Conference on Recent Innovations in Software Engineering and Computer Technologies (NCRISECT), 2017.

[16] Aishwarya Mujumdar, Dr Vaidehi V, "Diabetes Prediction using Machine Learning Algorithms", International Conference on Recent Trends in Advanced Computing, 2019.

