Alam-Proj1
Alam-Proj1
Tamim Alam_80764318
2025-02-07
Loading Libraries
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
library(ggplot2)
library(skimr)
library(caret)
library(glmnet)
library(pROC)
##
## Attaching package: 'pROC'
Reading Dataset
diabetes_data <- read_csv("diabetes_data_upload.csv")
head(diabetes_data)
## # A tibble: 6 × 17
## Age Gender Polyuria Polydipsia `sudden weight loss` weakness
Polyphagia
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 40 Male No Yes No Yes No
## 2 58 Male No No No Yes No
## 3 41 Male Yes No No Yes Yes
## 4 45 Male No No Yes Yes Yes
## 5 60 Male Yes Yes Yes Yes Yes
## 6 55 Male Yes Yes No Yes Yes
## # ℹ 10 more variables: `Genital thrush` <chr>, `visual blurring` <chr>,
## # Itching <chr>, Irritability <chr>, `delayed healing` <chr>,
## # `partial paresis` <chr>, `muscle stiffness` <chr>, Alopecia <chr>,
## # Obesity <chr>, class <chr>
Structure of Dataset
str(diabetes_data)
The dataset has 520 observations and 17 columns. Age is a numeric (double) variable. The
remaining variables (Gender, Polyuria, Polydipsia, etc.) are stored as character
(categorical) variables. The target variable class (indicating diabetes diagnosis) is also a
character variable with values “Positive” and “Negative”.
Summary Statistics
summary(diabetes_data)
The dataset has 520 observations. Age is the only numeric variable. Min: 16, Max: 90,
Median: 47.5, Mean: 48.03, Interquartile Range: 39 - 57 (Middle 50%). All other variables,
including class, are stored as character variables.
Checking Missing Values
missing_values <- colSums(is.na(diabetes_data))
print(missing_values)
print(class_distribution)
## # A tibble: 2 × 3
## class n percentage
## <chr> <int> <dbl>
## 1 Negative 200 38.5
## 2 Positive 320 61.5
The dataset has an imbalanced class distribution but not extreme. 61.54% of cases are
Positive (diabetes present). 38.46% of cases are Negative (no diabetes).
Visualizing Class Distribution
ggplot(diabetes_data, aes(x = class, fill = class)) +
geom_bar() +
labs(title = "Class Distribution of Diabetes Diagnosis", x = "Diabetes
Diagnosis", y = "Count") +
theme_minimal()
Univariate Analysis: Age
ggplot(diabetes_data, aes(x = Age)) +
geom_histogram(bins = 10, fill = "skyblue", color = "black") +
labs(title = "Histogram of Age", x = "Age", y = "Frequency") +
theme_minimal()
The distribution appears slightly right-skewed, indicating a small proportion of older
individuals. Most individuals are between 30 to 60 years old, with fewer younger (<25) and
older (>75) participants.
Univariate Analysis: Gender
ggplot(diabetes_data, aes(x = Gender, fill = Gender)) +
geom_bar() +
labs(title = "Distribution of Gender", x = "Gender", y = "Count") +
theme_minimal()
The dataset contains more male participants than female participants.
Univariate Analysis: Polyuria
ggplot(diabetes_data, aes(x = Polyuria, fill = Polyuria)) +
geom_bar() +
labs(title = "Distribution of Polyuria", x = "Polyuria", y = "Count") +
theme_minimal()
The counts of “Yes” (Polyuria present) and “No” (Polyuria absent) are nearly the same.
Univariate Analysis: Polydipsia
ggplot(diabetes_data, aes(x = Polydipsia, fill = Polydipsia)) +
geom_bar() +
labs(title = "Distribution of Polydipsia", x = "Polydipsia", y = "Count") +
theme_minimal()
More individuals in
the dataset do not have Polydipsia compared to those who do.
Univariate Analysis: Sudden Weight Loss
ggplot(diabetes_data, aes(x = `sudden weight loss`, fill = `sudden weight
loss`)) +
geom_bar() +
labs(title = "Distribution of Sudden Weight Loss", x = "Sudden Weight
Loss", y = "Count") +
theme_minimal()
More individuals in the dataset do not have sudden weight loss compared to those who do.
Univariate Analysis: Weakness
ggplot(diabetes_data, aes(x = weakness, fill = weakness)) +
geom_bar() +
labs(title = "Distribution of Weakness", x = "Weakness", y = "Count") +
theme_minimal()
More individuals in the dataset experience weakness (“Yes”) compared to those who do
not.
Univariate Analysis: Polyphagia
ggplot(diabetes_data, aes(x = Polyphagia, fill = Polyphagia)) +
geom_bar() +
labs(title = "Distribution of Polyphagia", x = "Polyphagia", y = "Count") +
theme_minimal()
More individuals in the dataset do not have Polyphagia (“No”) compared to those who do.
Univariate Analysis: Genital Thrush
ggplot(diabetes_data, aes(x = `Genital thrush`, fill = `Genital thrush`)) +
geom_bar() +
labs(title = "Distribution of Genital Thrush", x = "Genital Thrush", y =
"Count") +
theme_minimal()
The majority of individuals do not have Genital Thrush (“No”), while a much smaller group
does (“Yes”).
Univariate Analysis: Visual Blurring
ggplot(diabetes_data, aes(x = `visual blurring`, fill = `visual blurring`)) +
geom_bar() +
labs(title = "Distribution of Visual Blurring", x = "Visual Blurring", y =
"Count") +
theme_minimal()
More individuals in the dataset do not experience visual blurring (“No”) compared to those
who do.
Univariate Analysis: Itching
ggplot(diabetes_data, aes(x = Itching, fill = Itching)) +
geom_bar() +
labs(title = "Distribution of Itching", x = "Itching", y = "Count") +
theme_minimal()
The counts of “Yes” (Itching present) and “No” (Itching absent) are similar.
Univariate Analysis: Irritability
ggplot(diabetes_data, aes(x = Irritability, fill = Irritability)) +
geom_bar() +
labs(title = "Distribution of Irritability", x = "Irritability", y =
"Count") +
theme_minimal()
The majority of individuals do not experience Irritability (“No”), while a much smaller
group does (“Yes”).
Univariate Analysis: Delayed Healing
ggplot(diabetes_data, aes(x = `delayed healing`, fill = `delayed healing`)) +
geom_bar() +
labs(title = "Distribution of Delayed Healing", x = "Delayed Healing", y =
"Count") +
theme_minimal()
More individuals in the dataset do not experience delayed healing (“No”) compared to
those who do.
Univariate Analysis: Partial Paresis
ggplot(diabetes_data, aes(x = `partial paresis`, fill = `partial paresis`)) +
geom_bar() +
labs(title = "Distribution of Partial Paresis", x = "Partial Paresis", y =
"Count") +
theme_minimal()
More individuals in the dataset do not have partial paresis (“No”) compared to those who
do.
Univariate Analysis: Muscle Stiffness
ggplot(diabetes_data, aes(x = `muscle stiffness`, fill = `muscle stiffness`))
+
geom_bar() +
labs(title = "Distribution of Muscle Stiffness", x = "Muscle Stiffness", y
= "Count") +
theme_minimal()
More individuals in the dataset do not experience muscle stiffness (“No”) compared to
those who do.
Univariate Analysis: Alopecia
ggplot(diabetes_data, aes(x = Alopecia, fill = Alopecia)) +
geom_bar() +
labs(title = "Distribution of Alopecia", x = "Alopecia", y = "Count") +
theme_minimal()
More individuals in the dataset do not have Alopecia (“No”) compared to those who do.
Univariate Analysis: Obesity
ggplot(diabetes_data, aes(x = Obesity, fill = Obesity)) +
geom_bar() +
labs(title = "Distribution of Obesity", x = "Obesity", y = "Count") +
theme_minimal()
The majority of individuals do not have obesity (“No”), while a much smaller group does
(“Yes”).
Bivariate Association With Class (Numerical Variables)
ggplot(diabetes_data, aes(x = Age, fill = class)) +
geom_density(alpha = 0.5) +
labs(title = "Age Distribution by Diabetes Diagnosis", x = "Age", y =
"Density") +
theme_minimal()
The age distribution differs between Positive and Negative diabetes cases. Younger
individuals (below 40) have a higher proportion of Negative cases. Middle-aged individuals
(40–60) have more Positive cases. Older individuals (above 60) show a more balanced
distribution.
Shapiro-Wilk Normality Test for Age in Each Class
shapiro_test_negative <- shapiro.test(diabetes_data$Age[diabetes_data$class
== "Negative"])
shapiro_test_positive <- shapiro.test(diabetes_data$Age[diabetes_data$class
== "Positive"])
print(shapiro_test_negative)
##
## Shapiro-Wilk normality test
##
## data: diabetes_data$Age[diabetes_data$class == "Negative"]
## W = 0.96687, p-value = 0.0001182
print(shapiro_test_positive)
##
## Shapiro-Wilk normality test
##
## data: diabetes_data$Age[diabetes_data$class == "Positive"]
## W = 0.9804, p-value = 0.0002325
The Shapiro-Wilk test was used to check the normality of Age distribution for both
Negative and Positive diabetes cases. p-values are very small (< 0.05) for both groups,
indicating that Age is not normally distributed in either class. Since normality is violated, a
non-parametric test (Wilcoxon rank-sum test) should be used instead of a t-test for
comparing Age between classes.
Wilcoxon Rank Sum Test
wilcox_test_result <- wilcox.test(Age ~ class, data = diabetes_data)
print(wilcox_test_result)
##
## Wilcoxon rank sum test with continuity correction
##
## data: Age by class
## W = 27834, p-value = 0.0124
## alternative hypothesis: true location shift is not equal to 0
The p-value (0.0124) is less than 0.05, indicating a statistically significant difference in age
distribution between the two groups. This suggests that Age is an important factor in
distinguishing between diabetes-positive and diabetes-negative individuals.
Bivariate Association With Class (Categorical Variables)
#Creating an Empty Dataframe to Store Results
results_df <- data.frame(Variable = character(), Test_Used = character(),
P_Value = numeric(), Selection = character(), stringsAsFactors = FALSE)
perform_test_and_store("Gender")
perform_test_and_store("Polyuria")
perform_test_and_store("Polydipsia")
perform_test_and_store("sudden weight loss")
perform_test_and_store("weakness")
perform_test_and_store("Polyphagia")
perform_test_and_store("Genital thrush")
perform_test_and_store("visual blurring")
perform_test_and_store("Itching")
perform_test_and_store("Irritability")
perform_test_and_store("delayed healing")
perform_test_and_store("partial paresis")
perform_test_and_store("muscle stiffness")
perform_test_and_store("Alopecia")
perform_test_and_store("Obesity")
print(results_df)
The Chi-Square test was used to assess the association between categorical variables and
diabetes diagnosis.
A significance threshold (α = 0.15) was applied to determine variable selection.
Variables Must be Selected (p-value < 0.15): Gender (3.28e-24) Polyuria (1.74e-51)
Polydipsia (6.18e-49) Sudden Weight Loss (5.97e-23) Weakness (4.87e-08) Polyphagia
(1.16e-14) Genital Thrush (1.61e-07) Visual Blurring (1.70e-08) Irritability (1.77e-11)
Partial Paresis (1.56e-22) Muscle Stiffness (6.93e-03) Alopecia (1.90e-09) Obesity (1.27e-
01)
Variables Can be Ignored (p-value ≥ 0.15): Itching (p = 0.8297) Delayed Healing (p =
0.3267)
Data Partition
#For reproducibility
set.seed(123)
##
## Negative Positive
## 134 214
table(D2$class)
##
## Negative Positive
## 66 106
The dataset has been splitted into a training set (D1) and a test set (D2) in a 2:1 ratio.
Training and Test Set Sizes: Training Set (D1): 348 observations (2/3 of the dataset) Test
Set (D2): 172 observations (1/3 of the dataset)
Class Distribution Check:
Training Set (D1): 134 Negative cases 214 Positive cases Test Set (D2): 66 Negative cases
106 Positive cases The class distribution in both sets is similar to the original dataset,
meaning the partitioning preserved the originality of the dataset.
Logistic Regression Modeling Data Preparation for Logistic Regression
#Converting Class Variable to Numeric (0 for Negative, 1 for Positive)
D1$class <- ifelse(D1$class == "Positive", 1, 0)
#Creating Model Matrix
X_train <- model.matrix(class ~ ., data = D1)[, -1] # Removing Intercept
Column
y_train <- D1$class # Response variable
##
## Call: cv.glmnet(x = X_train, y = y_train, lambda = lambda_seq, nfolds =
10, alpha = 1, family = "binomial")
##
## Measure: Binomial Deviance
##
## Lambda Index Measure SE Nonzero
## min 0.003981 55 0.4245 0.05210 13
## 1se 0.015849 49 0.4717 0.03794 12
## [1] 0.003981072
This shows the final LASSO logistic regression model coefficients. Variables with nonzero
coefficients were selected as important predictors of diabetes. Zero or missing coefficients
(.) indicate that LASSO removed those variables due to their lower relevance. Key selected
predictors: Strong positive associations with diabetes: Polydipsia (4.16), Polyuria (2.83),
Irritability (1.99), Genital Thrush (1.55). Strong negative associations with diabetes:
Gender (Male) (-2.80), Itching (-1.70), Obesity (-1.26). Weaker but still important
predictors: Sudden weight loss (0.90), Visual blurring (1.04), Partial Paresis (1.14).
Dropped variables: Delayed Healing and Muscle Stiffness were removed (coefficient = 0).
Identify and Interpret Important Predictors
#Extracting Nonzero Coefficients
important_vars <- coef(final_lasso_model)[, 1]
important_vars <- important_vars[important_vars != 0]
## Variable Coefficient
## (Intercept) (Intercept) 1.5925199
## Age Age -0.0406250
## GenderMale GenderMale -2.8076771
## PolyuriaYes PolyuriaYes 2.8363262
## PolydipsiaYes PolydipsiaYes 4.1597332
## `sudden weight loss`Yes `sudden weight loss`Yes 0.9006294
## weaknessYes weaknessYes 0.2782634
## PolyphagiaYes PolyphagiaYes 0.2373943
## `Genital thrush`Yes `Genital thrush`Yes 1.5566422
## `visual blurring`Yes `visual blurring`Yes 1.0457586
## ItchingYes ItchingYes -1.7015408
## IrritabilityYes IrritabilityYes 1.9926249
## `partial paresis`Yes `partial paresis`Yes 1.1414596
## ObesityYes ObesityYes -1.2646710
Model Assessment/Deployment
#Converting Class Variable in Test Data to Numeric (0 for Negative, 1 for
Positive)
D2$class <- ifelse(D2$class == "Positive", 1, 0)
The Receiver Operating Characteristic (ROC) curve evaluates the model’s ability to classify
positive and negative cases. X-axis (Specificity): False Positive Rate (FPR). Y-axis
(Sensitivity): True Positive Rate (TPR). The blue curve represents the model’s
performance. The diagonal gray line represents a random classifier (AUC = 0.5). The closer
the blue curve is to the top-left corner, the better the model performs.
The ROC curve is very close to the top-left corner, indicating strong performance.
# Compute and display AUC
auc_value <- auc(roc_curve)
cat("Area Under the Curve (AUC / C-Statistic):", auc_value, "\n")
A high AUC (likely >0.90) suggests excellent discrimination between positive and negative
cases.