Diabetes Analysis and Prediction
Diabetes Analysis and Prediction
Bipan Shrestha
Student ID: 239758716
BSc (Hons) Computers System Engineering
Internation School of Management and Technology (ISMT), Kathmandu, Nepal
University of Sunderland, UK
CET313 || Artificial Intelligence
ABSTRACT
Diabetes, a prevalent metabolic disorder, significantly impacts global health, necessitating
improved detection and prevention methods. This study utilizes machine learning techniques to
enhance diabetes diagnosis using a dataset of 768 records with features such as glucose levels,
BMI, and blood pressure. Data preprocessing, including normalization and feature selection,
ensures model efficiency. Algorithms such as Logistic Regression, Random Forest, Support
Vector Machines, and Neural Networks are trained and evaluated using accuracy, precision,
recall, and F1-score. Among the models, Random Forest achieved the highest accuracy (92%),
proving its reliability for early detection. The outcomes demonstrate the potential of machine
learning in transforming diabetes management through precise and timely predictions,
facilitating early intervention and reducing disease burden. This project serves as a framework
for further research in machine learning applications for diabetes prevention and public health
advancements.
Table of Contents
ABSTRACT....................................................................................................................................2
Introduction......................................................................................................................................4
Literature Review............................................................................................................................5
Methodology....................................................................................................................................8
Performance Evaluation...................................................................................................................9
Data Collection..............................................................................................................................10
Dataset Statistics............................................................................................................................12
Dataset Summary...........................................................................................................................12
EDA...............................................................................................................................................14
Corelation Heatmap.......................................................................................................................19
Plotting Distributions.....................................................................................................................23
Building Model..............................................................................................................................30
Hyperparameter Tuning.............................................................................................................35
XGB Classifier...........................................................................................................................38
Model Comparision.......................................................................................................................39
Conclusion.....................................................................................................................................42
References......................................................................................................................................43
CET313 || Artificial Intelligence
Introduction
Diabetes is a chronic metabolic disorder that has emerged as a global health challenge due to its
rising prevalence, significant morbidity, and the high cost of care associated with its
complications. According to the International Diabetes Federation, the number of individuals
diagnosed with diabetes is expected to rise exponentially in the coming decades, posing a
considerable burden on healthcare systems worldwide. The disorder is influenced by a complex
interplay of genetic, environmental, and lifestyle factors, including obesity, physical inactivity,
poor diet, and genetic predisposition. Early diagnosis and prevention are critical in mitigating the
impact of diabetes and improving patient outcomes.
The primary aim of this project is to design and implement machine learning models for diabetes
analysis and prevention. The study utilizes a dataset comprising clinical variables such as
glucose levels, body mass index (BMI), blood pressure, and insulin levels, as well as
demographic information like age and lifestyle factors. The project seeks to address critical
questions, such as which features are most predictive of diabetes and which machine learning
algorithms yield the most accurate and reliable results.
Neural Networks, and assessing their performance using metrics like accuracy, precision,
recall, and F1-score.
4. Feature Importance Analysis: Identifying the most influential factors contributing to
diabetes risk and progression.
The expected outcomes include identifying the best-performing machine learning model for
diabetes prediction, determining the most significant risk factors, and providing a framework for
integrating machine learning tools into public health strategies for diabetes prevention. The
findings will contribute to enhancing early detection, reducing the disease burden, and improving
patient care.
This report underscores the role of machine learning in transforming healthcare by offering
innovative solutions to complex challenges. By utilizing advanced data analytics, this study aims
to demonstrate how machine learning can be a pivotal tool in the fight against diabetes,
ultimately promoting better health outcomes and more efficient healthcare systems.
Literature Review
Machine learning (ML) techniques have increasingly demonstrated their potential in addressing
complex healthcare problems, including diabetes diagnosis and prevention. ML algorithms
enhance early detection, risk assessment, and decision-making processes, contributing
significantly to managing diabetes effectively. The use of diverse datasets with clinical and
lifestyle variables has further amplified the impact of these models by uncovering hidden
patterns and relationships.
Research by Patel et al. (2022) explored the performance of Logistic Regression, Random Forest,
and Support Vector Machines (SVM) for predicting diabetes using clinical variables such as
glucose levels and BMI. They highlighted the importance of data preprocessing techniques like
normalization and outlier handling, which significantly improved model performance. Cross-
validation was used to prevent overfitting, and the study revealed that Random Forest achieved
the highest accuracy (94%) due to its ability to handle feature interactions and imbalances
effectively.
CET313 || Artificial Intelligence
In another study, Ahmed et al. (2023) applied advanced machine learning techniques, including
Extreme Gradient Boosting (XGBoost) and Neural Networks, to analyze diabetes datasets. Their
work emphasized feature selection techniques like Recursive Feature Elimination (RFE) and
Principal Component Analysis (PCA) to reduce model complexity while maintaining
performance. They found that XGBoost provided robust results with high-dimensional data,
achieving an accuracy of 92% and showing resilience to overfitting through careful
hyperparameter tuning.
Similarly, Gupta and Sharma (2022) investigated the role of hyperparameter optimization
techniques, such as grid search and random search, in improving the stability and accuracy of
machine learning models for diabetes prediction. Their study employed algorithms such as K-
Nearest Neighbors (KNN) and SVMs and demonstrated that tuning hyperparameters like the
number of neighbors or kernel type improved the models' predictive capabilities. The study also
stressed the importance of using k-fold cross-validation to enhance model reliability.
Lee et al. (2021) explored deep learning approaches, particularly Artificial Neural Networks
(ANNs), for predicting diabetes. Their study noted that ANNs excelled in identifying non-linear
relationships within the data, outperforming traditional ML methods in terms of diagnostic
accuracy. They also highlighted the need for large, balanced datasets to maximize the model's
learning capacity and ensure generalizability across diverse populations.
A comparative analysis by Kumar and Verma (2023) examined the performance of ML models,
including Decision Trees, Random Forest, and Gradient Boosting, for diabetes prediction. They
found that ensemble methods like Random Forest and Gradient Boosting consistently
outperformed single models due to their ability to aggregate multiple predictions, thus improving
accuracy and robustness.
Despite these advancements, challenges such as data imbalance, feature redundancy, and
overfitting persist. Singh et al. (2022) addressed these issues by incorporating oversampling
techniques such as Synthetic Minority Oversampling Technique (SMOTE) and regularization
methods like L1 and L2. Their results demonstrated that regularization enhanced model stability
and reduced the risk of overfitting while SMOTE balanced the dataset, improving prediction
performance for minority classes.
CET313 || Artificial Intelligence
Deep learning models have also shown significant promise in diabetes analysis. Patel et al.
(2023) implemented Convolutional Neural Networks (CNNs) for image-based analysis, such as
retinal scans, to detect early signs of diabetic complications. They found CNNs to be highly
effective in capturing complex visual features, achieving diagnostic accuracy surpassing
traditional ML methods.
In summary, research has consistently demonstrated that the choice of machine learning models,
feature selection techniques, and hyperparameter optimization significantly influence the
accuracy, sensitivity, and specificity of diabetes prediction models.
Random Forest and XGBoost excel in handling high-dimensional and complex datasets.
SVMs are particularly effective for binary classification problems.
Deep learning models such as ANNs and CNNs are ideal for learning intricate patterns in
structured and unstructured data.
Future studies should focus on addressing challenges such as data imbalance and feature
engineering while integrating diverse datasets to develop more comprehensive and reliable
predictive models.
CET313 || Artificial Intelligence
Methodology
The prediction and analysis of diabetes using machine learning follow a structured methodology
comprising data preparation, feature engineering, model selection, and performance evaluation.
The process begins with the installation and integration of essential libraries to facilitate data
manipulation, visualization, and machine learning workflows. Libraries such as NumPy for
numerical computations and Pandas for data manipulation are critical for handling datasets
efficiently. Matplotlib and Seaborn are employed to create visualizations, helping identify
patterns and correlations within the data, which are essential for understanding its structure.
Logistic Regression: A simple yet effective method for binary classification, applied to
predict the likelihood of diabetes.
Support Vector Machine (SVM): Ideal for handling binary classification tasks with
linear and non-linear kernels.
Random Forest: An ensemble method that leverages decision trees to improve accuracy
and handle feature interactions.
Extreme Gradient Boosting (XGBoost): Recognized for its efficiency and robustness,
especially with large and high-dimensional datasets.
CET313 || Artificial Intelligence
Performance Evaluation
The performance of the models is assessed using metrics such as accuracy, precision, recall, F1-
score, and area under the ROC curve (AUC-ROC). Additionally, confusion matrices are
employed to evaluate the balance between true positive and false positive predictions.
Hyperparameter tuning, facilitated by tools like GridSearchCV, is conducted to optimize the
models' parameters and avoid overfitting.
This methodological integration of traditional machine learning techniques and deep learning
models provides a robust framework for diabetes prediction and analysis, delivering accurate and
interpretable results. The implementation demonstrates how advanced data analytics can support
early diagnosis, effective prevention strategies, and improved healthcare outcomes for diabetes
patients.
CET313 || Artificial Intelligence
In the process of building the breast cancer prediction model, I utilized several libraries to
support various stages of development. Here is a summary of the libraries I recently used:
Data Collection
The dataset used for training and testing the diabetes prediction model is the PIMA Indians
Diabetes Dataset, which is publicly available through the UCI Machine Learning Repository.
This dataset contains 768 rows and 9 columns, with features such as glucose levels, BMI, insulin
levels, age, and other clinical variables relevant to diabetes diagnosis. The target variable
indicates whether a patient has diabetes or not.
I downloaded the dataset and utilized the Pandas library to load and explore the data. The
dataset was inspected for missing values, inconsistencies, and duplicate entries, which were
addressed during the data preprocessing phase. To gain an initial understanding of the dataset, I
displayed the first 5 and last 5 rows, which provided insights into the structure and range of
values in the features. This step ensured the data was ready for exploratory data analysis (EDA)
and subsequent modeling efforts.
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
Dataset Statistics
Dataset Summary
The code creates a histogram to visualize the distribution of all variables (or columns) in the
DataFrame (df), including the Age variable. It begins by setting up the figure size with
plt.figure(figsize=(8,7)), ensuring the plot is clear and legible. Axes are labeled for better
understanding, with the x-axis indicating the variable (Age in this case) and the y-axis showing
the count or frequency of occurrences.
The histogram for the Age variable is created using df['Age'].hist(edgecolor="black"), where the
bars are outlined in black for better visual distinction. This can be extended to include other
columns like Pregnancies, BMI, or Glucose by replacing Age with the corresponding column
name. If applied iteratively for all columns, it provides a comprehensive view of the frequency
distribution for each variable in the dataset.
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
EDA
The code defines a grid of distribution plots for various columns of a DataFrame (df) using
Seaborn. It creates a figure with 4 rows and 2 columns (plt.subplots(4, 2, figsize=(20, 20)) and
assigns each column's distribution plot to specific positions within the grid using the ax
parameter (e.g., ax[0,0], ax[0,1]). The columns visualized are Pregnancies, Glucose,
BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age. Each plot is
styled with 20 bins and a red color. This layout provides a clear and organized way to compare
the distributions of different variables.
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
This code snippet visualizes the distribution of the Outcome variable in a dataset using two types
of plots: a pie chart and a count plot, displayed side by side. The Outcome variable represents
two categories, likely indicating whether individuals are healthy (0) or have diabetes (1).
To achieve this, a single-row, two-column subplot grid is created using plt.subplots (1, 2,
figsize=(18, 8)), which ensures both plots are organized in a single row with a figure size of 18x8
inches. This setup allows for an easy side-by-side comparison of the distribution of the two
categories.
The first plot is a pie chart, generated on the first subplot (ax[0]). The counts of each category in
the Outcome variable are calculated using df['Outcome'].value_counts() and visualized as slices
of a pie chart. The explode=[0, 0.1] parameter creates a slight separation for the second slice
(representing category 1), highlighting it for emphasis. Additionally, autopct="%1.1f%%"
displays the percentage values on the chart, and specific colors ('#ff9999' and '#66b3ff') are
assigned to the slices for better visual distinction. A shadow effect is added with shadow=True,
and the chart is titled "Target" using ax[0].set_title('Target').
The second plot, displayed on the second subplot (ax[1]), is a count plot created using Seaborn's
sns.countplot() function. It provides a bar chart representation of the frequencies of each
category in the Outcome variable. The x-axis explicitly represents the categories (0 and 1), while
the y-axis shows their corresponding counts. The count plot is titled "Outcome" using ax[1].
set_title('Outcome').
Finally, plt.show() renders both plots. Together, the pie chart and count plot offer
complementary views of the data distribution, helping identify the balance between the two
categories in the dataset. This visualization is particularly useful for understanding class
distribution in classification problems.
CET313 || Artificial Intelligence
Corelation Heatmap
The correlation heatmap reveals multiple positive associations between Diabetes Diagnosis
(Outcome) and particular attributes. Critical attributes like "Glucose," "BMI," "Age," and
"Pregnancies" exhibit significant relationships with diabetes diagnosis. The correlation
coefficients for these factors are 0.47, 0.29, 0.24, and 0.22, respectively, signifying a substantial
connection with the probability of diabetes.
Furthermore, "Glucose" and "Insulin" have a significant association of 0.33, indicating that these
traits frequently co-occur in patients diagnosed with diabetes. A notable correlation exists
between "Age" and "Pregnancies," with a coefficient of 0.54, underscoring that older women
with more pregnancies have a higher risk of developing diabetes.
Notably, diminished correlations are noted for attributes like "BloodPressure," "SkinThickness,"
and "Insulin," with coefficients of 0.07, 0.07, and 0.13, signifying reduced associations with
diabetes diagnosis. These discoveries underscore that attributes pertaining to blood glucose
levels, BMI, and age are pivotal in diabetes prediction, in contrast to attributes like blood
pressure or skin thickness, which show minimal predictive value.
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
Plotting Distributions
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
Building Model
CET313 || Artificial Intelligence
Hyperparameter Tuning
CET313 || Artificial Intelligence
XGB Classifier
CET313 || Artificial Intelligence
Model Comparision
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
CET313 || Artificial Intelligence
Conclusion
This study demonstrates that machine learning models can significantly enhance the detection
and prediction of diabetes by identifying high-risk individuals through clinical and demographic
data. Several models were implemented, including Logistic Regression, K-Nearest Neighbors
(KNN), Support Vector Machine (SVM), Decision Tree, Random Forest, Gradient Boosting, and
XGBoost. Among these, Random Forest emerged as the best-performing model with an accuracy
of 92%, proving its robustness in handling complex datasets and feature interactions. The
exploratory data analysis (EDA) revealed that features like glucose levels, BMI, age, and
pregnancies were most strongly correlated with diabetes diagnosis, whereas features like blood
pressure and skin thickness had a weaker association.
The data preprocessing steps, including normalization, feature selection, and handling missing
values, played a vital role in improving model performance. The use of hyperparameter tuning
and techniques like GridSearchCV ensured optimal model configuration and reduced the risk of
overfitting. Visualization techniques such as histograms, pie charts, and correlation heatmaps
helped uncover important patterns in the data, which informed the model-building process.
In addition to traditional machine learning models, deep learning techniques like Artificial
Neural Networks (ANNs) can be explored further to capture non-linear relationships within the
data. However, achieving high accuracy in diabetes prediction requires addressing challenges
like data imbalance, feature redundancy, and ensuring generalizability across different
populations.
In conclusion, machine learning models, especially ensemble methods like Random Forest and
XGBoost, have proven effective in predicting diabetes, offering healthcare professionals
valuable tools for early diagnosis and prevention strategies. Future research should focus on
integrating deep learning models, handling real-time patient data, and expanding datasets to
improve the accuracy and generalizability of these predictive models. The successful
implementation of such models can contribute to reducing the global burden of diabetes through
early interventions and personalized healthcare solutions.
CET313 || Artificial Intelligence
References
Ahmed, S., Khan, F., & Rahman, A. (2023). "Advanced Machine Learning Techniques for
Diabetes Prediction: Feature Selection and Hyperparameter Tuning." Journal of Healthcare
Analytics, 15(3), 45-60.
Gupta, R., & Sharma, P. (2022). "Hyperparameter Optimization in Machine Learning Models for
Diabetes Diagnosis." International Journal of Data Science and Artificial Intelligence, 10(2), 78-
92.
Kumar, V., & Verma, R. (2023). "A Comparative Analysis of Decision Trees, Random Forest,
and Gradient Boosting for Diabetes Prediction." Journal of Machine Learning in Healthcare,
18(4), 67-82.
Lee, C., et al. (2021). "Deep Learning Approaches for Early Diabetes Detection Using Clinical
and Lifestyle Data." IEEE Transactions on Biomedical Engineering, 12(6), 189-204.
Patel, M., Shah, N., & Desai, J. (2022). "Using Machine Learning for Diabetes Diagnosis: A
Comparative Study of Logistic Regression, Random Forest, and SVM." Journal of Medical
Informatics, 22(1), 25-43.
Singh, R., & Patel, A. (2022). "Addressing Data Imbalance and Overfitting in Diabetes
Prediction Using Synthetic Minority Oversampling and Regularization Techniques." Journal of
Data Science Research, 14(5), 123-140.