Analyze The Use of Machine Learning Models in The Pima Diabetes Data Set For Early Stage Detection
Authorized licensed use limited to: AMITY University. Downloaded on June 11,2023 at 11:05:16 UTC from IEEE Xplore. Restrictions apply.
model. The performance of both algorithms was compared, and the effectiveness of both was shown as a result [9]. K. Rajesh and V. Sangeetha (2012) used a classification technique: they applied the C4.5 decision tree algorithm to find hidden patterns in the dataset for efficient classification [11]. Humar Kahramanli and Novruz Allahverdi (2008) used an artificial neural network (ANN) in combination with fuzzy logic to predict diabetes [12]. B.M. Patil, R.C. Joshi and Durga Toshniwal (2010) proposed a Hybrid Prediction Model, which applies the simple K-means clustering algorithm and then a classification algorithm to the clustering result; the classifiers are built with the C4.5 decision tree algorithm [13]. Mani Butwall and Shraddha Kumar (2015) proposed a model using a Random Forest classifier to forecast diabetes behavior [10]. Nawaz Mohamudally and Dost Muhammad Khan (2011) used the C4.5 decision tree algorithm, a neural network, the K-means clustering algorithm and visualization to predict diabetes [14]. Ramraj Santhanam et al. (2017) used both XGBoost and Gradient Boosting algorithms to perform predictive analysis on different datasets and found them to be useful [15].
3. Motivation
Over the last decade, the proportion of people suffering from diabetes has increased dramatically. The current human lifestyle is the main reason for the increase in diabetes.
Three different types of errors can occur in current medical diagnostic procedures-
4. Definition of dataset
The PIMA diabetes dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases, and all patients here are females at least 21 years old of Pima Indian heritage. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
The attributes present in the dataset are as follows:
a. Pregnancies
b. Glucose
c. Blood pressure
d. Skin Thickness
e. Insulin
f. Body mass index (BMI)
g. Diabetes pedigree function
h. Age
Table 1- Dataset Description
To understand the dataset, here are the detailed statistics:
1. Number of Diabetic vs Nondiabetic patients-
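The count of diabetic versus non-diabetic patients mentioned in the dataset statistics can be obtained with a few lines of code. The following is a minimal sketch, not the authors' code. It assumes the data is available as a CSV file with a label column named "Outcome" (1 = diabetic, 0 = non-diabetic); that column name and the helper `count_outcomes` are assumptions, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

# Assumption: the label column is called "Outcome" (1 = diabetic, 0 = non-diabetic).
# The paper does not name the column; adjust this to match your copy of the data.
LABEL_COLUMN = "Outcome"


def count_outcomes(csv_file):
    """Count diabetic vs non-diabetic rows in an open CSV file object."""
    reader = csv.DictReader(csv_file)
    counts = Counter(int(row[LABEL_COLUMN]) for row in reader)
    return {"diabetic": counts[1], "non_diabetic": counts[0]}


# Made-up rows for illustration only; these are not records from the dataset.
sample = io.StringIO("Glucose,Outcome\n120,1\n95,0\n101,0\n")
print(count_outcomes(sample))
```

With the real file, the same function could be called as `count_outcomes(open(path, newline=""))`, where `path` is wherever the CSV is stored.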
3. Frequency of each attribute-
Step 2- Splitting the data: Splitting the dataset is essential for an unbiased evaluation of prediction performance. We split our dataset randomly with a test size of 0.2 using train_test_split().
1. The training set is used to train, or fit, the model.
2. The test set is needed for an unbiased evaluation of the final model.
Step 3- Algorithms used:
weight of variables predicted wrong by the tree is increased and these variables are then fed to the second decision tree. These individual classifiers/predictors are then combined into an ensemble to give a stronger and more precise model. It can work on regression, classification, ranking, and user-defined prediction problems.
Start
mn = [KNN(), XGBoost(), LogisticRegression(), GradientBoostClassifier(), RandomForestClassifier()]
for (i = 0; i < 5; i++) do
    Model = mn[i];
    Model.fit();
    Model.predict();
    print(Accuracy(i), confusion_matrix, classification_report);
End
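The split and the model loop described above can be sketched in Python with scikit-learn. This is an illustrative sketch under stated assumptions, not the authors' implementation: the data is a synthetic stand-in with 8 features generated by `make_classification` (so the example runs without the real file), the `random_state` values and `max_iter` setting are arbitrary choices, and XGBoost is included only if the separate `xgboost` package happens to be installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in data with 8 features (one per listed attribute).
# Replace X and y with the real feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Random split with a test size of 0.2, as described in Step 2.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
}

# XGBoost lives in a separate package; add it only when it can be imported.
try:
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier()
except ImportError:
    pass

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)       # train on the training set
    y_pred = model.predict(X_test)    # predict on the held-out test set
    results[name] = accuracy_score(y_test, y_pred)
    print(name, "accuracy:", results[name])
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
```

Because the data here is synthetic, the printed accuracies say nothing about the results reported in the paper; the sketch only shows the shape of the workflow.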
2. Evaluation
This is the final step of the prediction model. Here, we evaluate the prediction results using evaluation metrics such as classification accuracy, confusion matrix and f1-score.
Classification Accuracy- It is the ratio of the number of correct predictions to the total number of input samples [16].
Accuracy for the matrix can be calculated by taking the average of the values lying across the main diagonal. It is given as-
3. Results
After applying various machine learning algorithms to the dataset, we obtained the accuracies listed in Table-2.
Table-2
Algorithm   Accuracy
KNN         81%
XGBoost     82%
Table-3
Table-4
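The accuracy calculation from a confusion matrix, as described in the Evaluation section, can be illustrated with a short sketch. The original formula is not reproduced in the text; under the usual reading, accuracy is the sum of the main-diagonal entries (correct predictions) divided by the sum of all entries, which for a binary problem is (TP + TN) / (TP + TN + FP + FN). The function name and the example counts below are hypothetical.

```python
def accuracy_from_confusion_matrix(matrix):
    """Return the main-diagonal sum divided by the sum of all entries."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix has no samples")
    return correct / total


# Hypothetical 2x2 matrix laid out as [[TN, FP], [FN, TP]]; the counts are made up.
example = [[50, 10], [5, 35]]
print(accuracy_from_confusion_matrix(example))  # 0.85
```

The same function works for more than two classes, since only the diagonal and the total are used.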
4. Conclusion
In this study, various machine learning algorithms were applied to the PIMA Indians Diabetes dataset for classification, among which XGBoost provided up to 82% accuracy. Additionally, this study could be expanded to see how likely it is that people without diabetes will develop diabetes in the next few years. Systems developed using these machine learning algorithms could also be tuned to predict other diseases. The study could be further extended by introducing another machine learning algorithm to improve diabetes prediction.
5. References
[1] American Diabetes Association. Diagnosis and classification of diabetes mellitus. Diabetes Care 2009;32(Suppl. 1): S62–7.
[2] http://diabetesindia.com/
[3] Anjana, R. M., Pradeepa, R., Deepa, M., Datta, M., Sudha, V., Unnikrishnan, R., Bhansali, A., Joshi, S. R., Joshi, P. P., Yajnik, C. S., Dhandhania, V. K. (2011) “Prevalence of diabetes and prediabetes (impaired fasting glucose and/or impaired glucose tolerance) in urban and rural India: Phase I results of the Indian Council of Medical Research–INdiaDIABetes (ICMR–INDIAB) study.” Diabetologia 54 (12): 3022-3027.
[4] https://my.clevelandclinic.org/health/diseases/7104-diabetes-mellitus-an-overview
[5] https://diabetes.org/diabetes/gestational-diabetes
[6] https://www.diabetes.co.uk/diabetes_care/blood-sugar-level-ranges.html
[11] K. Rajesh and V. Sangeetha, “Application of Data Mining Methods and Techniques for Diabetes Diagnosis”, International Journal of Engineering and Innovative Technology (IJEIT) Volume 2, Issue 3, September 2012.
[12] Humar Kahramanli and Novruz Allahverdi, “Design of a Hybrid System for the Diabetes and Heart Disease”, Expert Systems with Applications: An International Journal, Volume 35 Issue 1-2, July, 2008.
[13] B.M. Patil, R.C. Joshi and Durga Toshniwal, “Association Rule for Classification of Type-2 Diabetic Patients”, ICMLC '10 Proceedings of the 2010 Second International Conference on Machine Learning and Computing, February 09 - 11, 2010.
[14] Dost Muhammad Khan, Nawaz Mohamudally, “An Integration of K-means and Decision Tree (ID3) towards a more Efficient Data Mining Algorithm”, Journal of Computing, Volume 3, Issue 12, December 2011.
[15] Ramraj Santhanam et al., “Experimenting XGBoost Algorithm for Prediction and Classification of Different Datasets”, National Conference on Recent Innovations in Software Engineering and Computer Technologies (NCRISECT) 2017.
[16] Aishwarya Mujumdar, Dr Vaidehi V, “Diabetes Prediction using Machine Learning Algorithms”, International Conference on Recent Trends in Advanced Computing 2019.