0% found this document useful (0 votes)

7 views

Alam-Proj1

The document outlines a project analyzing a diabetes dataset with 520 observations and 17 variables, focusing on various health indicators and demographic information. It includes data loading, structure examination, summary statistics, missing values check, class distribution, and univariate analyses for different health symptoms. Visualizations are provided to illustrate the distribution of age, gender, and other health-related variables in relation to diabetes diagnosis.

Uploaded by

tamim.sfsab

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views

Alam-Proj1

Uploaded by

tamim.sfsab

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 32

Project 1

Tamim Alam_80764318

2025-02-07
Loading Libraries
library(readr)

## Warning: package 'readr' was built under R version 4.3.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.3

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':

##
## filter, lag

## The following objects are masked from 'package:base':

##
## intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(skimr)

## Warning: package 'skimr' was built under R version 4.3.3

library(caret)

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice

library(glmnet)

## Warning: package 'glmnet' was built under R version 4.3.3

## Loading required package: Matrix

## Loaded glmnet 4.1-8

library(pROC)

## Warning: package 'pROC' was built under R version 4.3.3

## Type 'citation("pROC")' for a citation.

##
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':

##
## cov, smooth, var

Reading Dataset
diabetes_data <- read_csv("diabetes_data_upload.csv")

## Rows: 520 Columns: 17

## ── Column specification
────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Polyuria, Polydipsia, sudden weight loss, weakness,
Polyph...
## dbl (1): Age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this
message.

head(diabetes_data)

## # A tibble: 6 × 17
## Age Gender Polyuria Polydipsia `sudden weight loss` weakness
Polyphagia
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 40 Male No Yes No Yes No
## 2 58 Male No No No Yes No
## 3 41 Male Yes No No Yes Yes
## 4 45 Male No No Yes Yes Yes
## 5 60 Male Yes Yes Yes Yes Yes
## 6 55 Male Yes Yes No Yes Yes
## # ℹ 10 more variables: `Genital thrush` <chr>, `visual blurring` <chr>,
## # Itching <chr>, Irritability <chr>, `delayed healing` <chr>,
## # `partial paresis` <chr>, `muscle stiffness` <chr>, Alopecia <chr>,
## # Obesity <chr>, class <chr>

Structure of Dataset
str(diabetes_data)

## spc_tbl_ [520 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

## $ Age : num [1:520] 40 58 41 45 60 55 57 66 67 70 ...
## $ Gender : chr [1:520] "Male" "Male" "Male" "Male" ...
## $ Polyuria : chr [1:520] "No" "No" "Yes" "No" ...
## $ Polydipsia : chr [1:520] "Yes" "No" "No" "No" ...
## $ sudden weight loss: chr [1:520] "No" "No" "No" "Yes" ...
## $ weakness : chr [1:520] "Yes" "Yes" "Yes" "Yes" ...
## $ Polyphagia : chr [1:520] "No" "No" "Yes" "Yes" ...
## $ Genital thrush : chr [1:520] "No" "No" "No" "Yes" ...
## $ visual blurring : chr [1:520] "No" "Yes" "No" "No" ...
## $ Itching : chr [1:520] "Yes" "No" "Yes" "Yes" ...
## $ Irritability : chr [1:520] "No" "No" "No" "No" ...
## $ delayed healing : chr [1:520] "Yes" "No" "Yes" "Yes" ...
## $ partial paresis : chr [1:520] "No" "Yes" "No" "No" ...
## $ muscle stiffness : chr [1:520] "Yes" "No" "Yes" "No" ...
## $ Alopecia : chr [1:520] "Yes" "Yes" "Yes" "No" ...
## $ Obesity : chr [1:520] "Yes" "No" "No" "No" ...
## $ class : chr [1:520] "Positive" "Positive" "Positive"
"Positive" ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Gender = col_character(),
## .. Polyuria = col_character(),
## .. Polydipsia = col_character(),
## .. `sudden weight loss` = col_character(),
## .. weakness = col_character(),
## .. Polyphagia = col_character(),
## .. `Genital thrush` = col_character(),
## .. `visual blurring` = col_character(),
## .. Itching = col_character(),
## .. Irritability = col_character(),
## .. `delayed healing` = col_character(),
## .. `partial paresis` = col_character(),
## .. `muscle stiffness` = col_character(),
## .. Alopecia = col_character(),
## .. Obesity = col_character(),
## .. class = col_character()
## .. )
## - attr(*, "problems")=<externalptr>

The dataset has 520 observations and 17 columns. Age is a numeric (double) variable. The
remaining variables (Gender, Polyuria, Polydipsia, etc.) are stored as character
(categorical) variables. The target variable class (indicating diabetes diagnosis) is also a
character variable with values “Positive” and “Negative”.
Summary Statistics
summary(diabetes_data)

## Age Gender Polyuria Polydipsia

## Min. :16.00 Length:520 Length:520 Length:520
## 1st Qu.:39.00 Class :character Class :character Class :character
## Median :47.50 Mode :character Mode :character Mode :character
## Mean :48.03
## 3rd Qu.:57.00
## Max. :90.00
## sudden weight loss weakness Polyphagia Genital thrush
## Length:520 Length:520 Length:520 Length:520
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## visual blurring Itching Irritability delayed healing
## Length:520 Length:520 Length:520 Length:520
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## partial paresis muscle stiffness Alopecia Obesity
## Length:520 Length:520 Length:520 Length:520
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## class
## Length:520
## Class :character
## Mode :character
##
##
##

The dataset has 520 observations. Age is the only numeric variable. Min: 16, Max: 90,
Median: 47.5, Mean: 48.03, Interquartile Range: 39 - 57 (Middle 50%). All other variables,
including class, are stored as character variables.
Checking Missing Values
missing_values <- colSums(is.na(diabetes_data))
print(missing_values)

## Age Gender Polyuria

Polydipsia
## 0 0 0
0
## sudden weight loss weakness Polyphagia Genital
thrush
## 0 0 0
0
## visual blurring Itching Irritability delayed
healing
## 0 0 0
0
## partial paresis muscle stiffness Alopecia
Obesity
## 0 0 0
0
## class
## 0

There are no missing values.

Frequency of Target Variable
class_distribution <- diabetes_data %>%
count(class) %>%
mutate(percentage = n / sum(n) * 100)

print(class_distribution)

## # A tibble: 2 × 3
## class n percentage
## <chr> <int> <dbl>
## 1 Negative 200 38.5
## 2 Positive 320 61.5

The dataset has an imbalanced class distribution but not extreme. 61.54% of cases are
Positive (diabetes present). 38.46% of cases are Negative (no diabetes).
Visualizing Class Distribution
ggplot(diabetes_data, aes(x = class, fill = class)) +
geom_bar() +
labs(title = "Class Distribution of Diabetes Diagnosis", x = "Diabetes
Diagnosis", y = "Count") +
theme_minimal()
Univariate Analysis: Age
ggplot(diabetes_data, aes(x = Age)) +
geom_histogram(bins = 10, fill = "skyblue", color = "black") +
labs(title = "Histogram of Age", x = "Age", y = "Frequency") +
theme_minimal()
The distribution appears slightly right-skewed, indicating a small proportion of older
individuals. Most individuals are between 30 to 60 years old, with fewer younger (<25) and
older (>75) participants.
Univariate Analysis: Gender
ggplot(diabetes_data, aes(x = Gender, fill = Gender)) +
geom_bar() +
labs(title = "Distribution of Gender", x = "Gender", y = "Count") +
theme_minimal()
The dataset contains more male participants than female participants.
Univariate Analysis: Polyuria
ggplot(diabetes_data, aes(x = Polyuria, fill = Polyuria)) +
geom_bar() +
labs(title = "Distribution of Polyuria", x = "Polyuria", y = "Count") +
theme_minimal()
The counts of “Yes” (Polyuria present) and “No” (Polyuria absent) are nearly the same.
Univariate Analysis: Polydipsia
ggplot(diabetes_data, aes(x = Polydipsia, fill = Polydipsia)) +
geom_bar() +
labs(title = "Distribution of Polydipsia", x = "Polydipsia", y = "Count") +
theme_minimal()
More individuals in
the dataset do not have Polydipsia compared to those who do.
Univariate Analysis: Sudden Weight Loss
ggplot(diabetes_data, aes(x = `sudden weight loss`, fill = `sudden weight
loss`)) +
geom_bar() +
labs(title = "Distribution of Sudden Weight Loss", x = "Sudden Weight
Loss", y = "Count") +
theme_minimal()
More individuals in the dataset do not have sudden weight loss compared to those who do.
Univariate Analysis: Weakness
ggplot(diabetes_data, aes(x = weakness, fill = weakness)) +
geom_bar() +
labs(title = "Distribution of Weakness", x = "Weakness", y = "Count") +
theme_minimal()
More individuals in the dataset experience weakness (“Yes”) compared to those who do
not.
Univariate Analysis: Polyphagia
ggplot(diabetes_data, aes(x = Polyphagia, fill = Polyphagia)) +
geom_bar() +
labs(title = "Distribution of Polyphagia", x = "Polyphagia", y = "Count") +
theme_minimal()
More individuals in the dataset do not have Polyphagia (“No”) compared to those who do.
Univariate Analysis: Genital Thrush
ggplot(diabetes_data, aes(x = `Genital thrush`, fill = `Genital thrush`)) +
geom_bar() +
labs(title = "Distribution of Genital Thrush", x = "Genital Thrush", y =
"Count") +
theme_minimal()
The majority of individuals do not have Genital Thrush (“No”), while a much smaller group
does (“Yes”).
Univariate Analysis: Visual Blurring
ggplot(diabetes_data, aes(x = `visual blurring`, fill = `visual blurring`)) +
geom_bar() +
labs(title = "Distribution of Visual Blurring", x = "Visual Blurring", y =
"Count") +
theme_minimal()
More individuals in the dataset do not experience visual blurring (“No”) compared to those
who do.
Univariate Analysis: Itching
ggplot(diabetes_data, aes(x = Itching, fill = Itching)) +
geom_bar() +
labs(title = "Distribution of Itching", x = "Itching", y = "Count") +
theme_minimal()
The counts of “Yes” (Itching present) and “No” (Itching absent) are similar.
Univariate Analysis: Irritability
ggplot(diabetes_data, aes(x = Irritability, fill = Irritability)) +
geom_bar() +
labs(title = "Distribution of Irritability", x = "Irritability", y =
"Count") +
theme_minimal()
The majority of individuals do not experience Irritability (“No”), while a much smaller
group does (“Yes”).
Univariate Analysis: Delayed Healing
ggplot(diabetes_data, aes(x = `delayed healing`, fill = `delayed healing`)) +
geom_bar() +
labs(title = "Distribution of Delayed Healing", x = "Delayed Healing", y =
"Count") +
theme_minimal()
More individuals in the dataset do not experience delayed healing (“No”) compared to
those who do.
Univariate Analysis: Partial Paresis
ggplot(diabetes_data, aes(x = `partial paresis`, fill = `partial paresis`)) +
geom_bar() +
labs(title = "Distribution of Partial Paresis", x = "Partial Paresis", y =
"Count") +
theme_minimal()
More individuals in the dataset do not have partial paresis (“No”) compared to those who
do.
Univariate Analysis: Muscle Stiffness
ggplot(diabetes_data, aes(x = `muscle stiffness`, fill = `muscle stiffness`))
+
geom_bar() +
labs(title = "Distribution of Muscle Stiffness", x = "Muscle Stiffness", y
= "Count") +
theme_minimal()
More individuals in the dataset do not experience muscle stiffness (“No”) compared to
those who do.
Univariate Analysis: Alopecia
ggplot(diabetes_data, aes(x = Alopecia, fill = Alopecia)) +
geom_bar() +
labs(title = "Distribution of Alopecia", x = "Alopecia", y = "Count") +
theme_minimal()
More individuals in the dataset do not have Alopecia (“No”) compared to those who do.
Univariate Analysis: Obesity
ggplot(diabetes_data, aes(x = Obesity, fill = Obesity)) +
geom_bar() +
labs(title = "Distribution of Obesity", x = "Obesity", y = "Count") +
theme_minimal()
The majority of individuals do not have obesity (“No”), while a much smaller group does
(“Yes”).
Bivariate Association With Class (Numerical Variables)
ggplot(diabetes_data, aes(x = Age, fill = class)) +
geom_density(alpha = 0.5) +
labs(title = "Age Distribution by Diabetes Diagnosis", x = "Age", y =
"Density") +
theme_minimal()
The age distribution differs between Positive and Negative diabetes cases. Younger
individuals (below 40) have a higher proportion of Negative cases. Middle-aged individuals
(40–60) have more Positive cases. Older individuals (above 60) show a more balanced
distribution.
Shapiro-Wilk Normality Test for Age in Each Class
shapiro_test_negative <- shapiro.test(diabetes_data$Age[diabetes_data$class
== "Negative"])
shapiro_test_positive <- shapiro.test(diabetes_data$Age[diabetes_data$class
== "Positive"])

print(shapiro_test_negative)

##
## Shapiro-Wilk normality test
##
## data: diabetes_data$Age[diabetes_data$class == "Negative"]
## W = 0.96687, p-value = 0.0001182

print(shapiro_test_positive)

##
## Shapiro-Wilk normality test
##
## data: diabetes_data$Age[diabetes_data$class == "Positive"]
## W = 0.9804, p-value = 0.0002325
The Shapiro-Wilk test was used to check the normality of Age distribution for both
Negative and Positive diabetes cases. p-values are very small (< 0.05) for both groups,
indicating that Age is not normally distributed in either class. Since normality is violated, a
non-parametric test (Wilcoxon rank-sum test) should be used instead of a t-test for
comparing Age between classes.
Wilcoxon Rank Sum Test
wilcox_test_result <- wilcox.test(Age ~ class, data = diabetes_data)
print(wilcox_test_result)

##
## Wilcoxon rank sum test with continuity correction
##
## data: Age by class
## W = 27834, p-value = 0.0124
## alternative hypothesis: true location shift is not equal to 0

The p-value (0.0124) is less than 0.05, indicating a statistically significant difference in age
distribution between the two groups. This suggests that Age is an important factor in
distinguishing between diabetes-positive and diabetes-negative individuals.
Bivariate Association With Class (Categorical Variables)
#Creating an Empty Dataframe to Store Results
results_df <- data.frame(Variable = character(), Test_Used = character(),
P_Value = numeric(), Selection = character(), stringsAsFactors = FALSE)

#Function to Perform Chi-square or Fisher’s test and store results

perform_test_and_store <- function(variable, alpha = 0.15) {
#Creating Contingency Table
table_var <- table(diabetes_data[[variable]], diabetes_data$class)

#Checking for Small Cell Counts

if (any(table_var < 5)) {
test_result <- fisher.test(table_var) # Use Fisher’s Exact Test
test_type <- "Fisher’s Exact Test"
} else {
test_result <- chisq.test(table_var) # Use Chi-Square Test
test_type <- "Chi-Square Test"
}

#Determining if Variable Should be Selected

selection <- ifelse(test_result$p.value <= alpha, "Select", "Don't Select")

#Appending Results to Dataframe

results_df <<- bind_rows(results_df,
data.frame(Variable = variable,
Test_Used = test_type,
P_Value = test_result$p.value,
Selection = selection))
}

perform_test_and_store("Gender")
perform_test_and_store("Polyuria")
perform_test_and_store("Polydipsia")
perform_test_and_store("sudden weight loss")
perform_test_and_store("weakness")
perform_test_and_store("Polyphagia")
perform_test_and_store("Genital thrush")
perform_test_and_store("visual blurring")
perform_test_and_store("Itching")
perform_test_and_store("Irritability")
perform_test_and_store("delayed healing")
perform_test_and_store("partial paresis")
perform_test_and_store("muscle stiffness")
perform_test_and_store("Alopecia")
perform_test_and_store("Obesity")

print(results_df)

## Variable Test_Used P_Value Selection

## 1 Gender Chi-Square Test 3.289704e-24 Select
## 2 Polyuria Chi-Square Test 1.740912e-51 Select
## 3 Polydipsia Chi-Square Test 6.187010e-49 Select
## 4 sudden weight loss Chi-Square Test 5.969166e-23 Select
## 5 weakness Chi-Square Test 4.869843e-08 Select
## 6 Polyphagia Chi-Square Test 1.165158e-14 Select
## 7 Genital thrush Chi-Square Test 1.609790e-02 Select
## 8 visual blurring Chi-Square Test 1.701504e-08 Select
## 9 Itching Chi-Square Test 8.297484e-01 Don't Select
## 10 Irritability Chi-Square Test 1.771483e-11 Select
## 11 delayed healing Chi-Square Test 3.266599e-01 Don't Select
## 12 partial paresis Chi-Square Test 1.565289e-22 Select
## 13 muscle stiffness Chi-Square Test 6.939096e-03 Select
## 14 Alopecia Chi-Square Test 1.909279e-09 Select
## 15 Obesity Chi-Square Test 1.271080e-01 Select

The Chi-Square test was used to assess the association between categorical variables and
diabetes diagnosis.
A significance threshold (α = 0.15) was applied to determine variable selection.
Variables Must be Selected (p-value < 0.15): Gender (3.28e-24) Polyuria (1.74e-51)
Polydipsia (6.18e-49) Sudden Weight Loss (5.97e-23) Weakness (4.87e-08) Polyphagia
(1.16e-14) Genital Thrush (1.61e-07) Visual Blurring (1.70e-08) Irritability (1.77e-11)
Partial Paresis (1.56e-22) Muscle Stiffness (6.93e-03) Alopecia (1.90e-09) Obesity (1.27e-
01)
Variables Can be Ignored (p-value ≥ 0.15): Itching (p = 0.8297) Delayed Healing (p =
0.3267)
Data Partition
#For reproducibility
set.seed(123)

#Defining Training Size (2/3 of the Dataset)

train_index <- createDataPartition(diabetes_data$class, p = 2/3, list =
FALSE)

#Creating Training and Test Datasets

D1 <- diabetes_data[train_index, ] # Training set (2/3 of data)
D2 <- diabetes_data[-train_index, ] # Test set (1/3 of data)

#Checking the Dimensions of Each Set

cat("Training Data Size:", nrow(D1), "\n")

## Training Data Size: 348

cat("Test Data Size:", nrow(D2), "\n")

## Test Data Size: 172

#Verifying Class Distribution Remains Similar

table(D1$class)

##
## Negative Positive
## 134 214

table(D2$class)

##
## Negative Positive
## 66 106

The dataset has been splitted into a training set (D1) and a test set (D2) in a 2:1 ratio.
Training and Test Set Sizes: Training Set (D1): 348 observations (2/3 of the dataset) Test
Set (D2): 172 observations (1/3 of the dataset)
Class Distribution Check:
Training Set (D1): 134 Negative cases 214 Positive cases Test Set (D2): 66 Negative cases
106 Positive cases The class distribution in both sets is similar to the original dataset,
meaning the partitioning preserved the originality of the dataset.
Logistic Regression Modeling Data Preparation for Logistic Regression
#Converting Class Variable to Numeric (0 for Negative, 1 for Positive)
D1$class <- ifelse(D1$class == "Positive", 1, 0)
#Creating Model Matrix
X_train <- model.matrix(class ~ ., data = D1)[, -1] # Removing Intercept
Column
y_train <- D1$class # Response variable

Fit LASSO Regularized Logistic Regression

#Defining Sequence of Lambda Values
lambda_seq <- 10^seq(3, -3, by = -0.1)

#Fitting LASSO Model

lasso_model <- cv.glmnet(X_train, y_train, alpha = 1, lambda = lambda_seq,
family = "binomial", nfolds = 10)
lasso_model

##
## Call: cv.glmnet(x = X_train, y = y_train, lambda = lambda_seq, nfolds =
10, alpha = 1, family = "binomial")
##
## Measure: Binomial Deviance
##
## Lambda Index Measure SE Nonzero
## min 0.003981 55 0.4245 0.05210 13
## 1se 0.015849 49 0.4717 0.03794 12

A LASSO-regularized logistic regression model was fitted using 10-fold cross-validation.

The model was trained using a sequence of lambda (λ) values ranging from 10^3 to 10^-3.
The best lambda values identified: Min λ (0.003981): Gives the lowest binomial deviance
with 13 nonzero coefficients.
Plotting Cross-Validation Error
plot(lasso_model)
The plot shows binomial deviance vs. log(λ) for the LASSO-regularized logistic regression
model. Each red dot represents a different λ value, with corresponding binomial deviance.
Vertical dashed lines indicate The optimal λ (minimum deviance). The top axis numbers
show the number of nonzero coefficients (selected features) at each λ value. As log(λ)
increases, more coefficients shrink to zero, leading to a simpler model.

Best Lambda (Minimizing Cross-Validation Error)

best_lambda <- lasso_model$lambda.min
print(best_lambda)

## [1] 0.003981072

Fit Final Model Using Best Lambda

#Train Final Model with Selected Lambda
final_lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda =
best_lambda, family = "binomial")

#Print Model Coefficients

coef(final_lasso_model)

## 17 x 1 sparse Matrix of class "dgCMatrix"

## s0
## (Intercept) 1.5925199
## Age -0.0406250
## GenderMale -2.8076771
## PolyuriaYes 2.8363262
## PolydipsiaYes 4.1597332
## `sudden weight loss`Yes 0.9006294
## weaknessYes 0.2782634
## PolyphagiaYes 0.2373943
## `Genital thrush`Yes 1.5566422
## `visual blurring`Yes 1.0457586
## ItchingYes -1.7015408
## IrritabilityYes 1.9926249
## `delayed healing`Yes .
## `partial paresis`Yes 1.1414596
## `muscle stiffness`Yes .
## AlopeciaYes .
## ObesityYes -1.2646710

This shows the final LASSO logistic regression model coefficients. Variables with nonzero
coefficients were selected as important predictors of diabetes. Zero or missing coefficients
(.) indicate that LASSO removed those variables due to their lower relevance. Key selected
predictors: Strong positive associations with diabetes: Polydipsia (4.16), Polyuria (2.83),
Irritability (1.99), Genital Thrush (1.55). Strong negative associations with diabetes:
Gender (Male) (-2.80), Itching (-1.70), Obesity (-1.26). Weaker but still important
predictors: Sudden weight loss (0.90), Visual blurring (1.04), Partial Paresis (1.14).
Dropped variables: Delayed Healing and Muscle Stiffness were removed (coefficient = 0).
Identify and Interpret Important Predictors
#Extracting Nonzero Coefficients
important_vars <- coef(final_lasso_model)[, 1]
important_vars <- important_vars[important_vars != 0]

#Convert to Dataframe for Visualization

important_vars_df <- data.frame(Variable = names(important_vars), Coefficient
= important_vars)

#Display Important Variables

print(important_vars_df)

## Variable Coefficient
## (Intercept) (Intercept) 1.5925199
## Age Age -0.0406250
## GenderMale GenderMale -2.8076771
## PolyuriaYes PolyuriaYes 2.8363262
## PolydipsiaYes PolydipsiaYes 4.1597332
## `sudden weight loss`Yes `sudden weight loss`Yes 0.9006294
## weaknessYes weaknessYes 0.2782634
## PolyphagiaYes PolyphagiaYes 0.2373943
## `Genital thrush`Yes `Genital thrush`Yes 1.5566422
## `visual blurring`Yes `visual blurring`Yes 1.0457586
## ItchingYes ItchingYes -1.7015408
## IrritabilityYes IrritabilityYes 1.9926249
## `partial paresis`Yes `partial paresis`Yes 1.1414596
## ObesityYes ObesityYes -1.2646710

Visualizing Important Predictors

ggplot(important_vars_df, aes(x = reorder(Variable, Coefficient), y =
Coefficient, fill = Coefficient)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Important Predictors from LASSO Logistic Regression", x =
"Variable", y = "Coefficient") +
theme_minimal()

Model Assessment/Deployment
#Converting Class Variable in Test Data to Numeric (0 for Negative, 1 for
Positive)
D2$class <- ifelse(D2$class == "Positive", 1, 0)

#Creating Model Matrix for Test Data

X_test <- model.matrix(class ~ ., data = D2)[, -1] #Removing Intercept
column
y_test <- D2$class # Response Variable

Generate Predictions Using Final LASSO Model

#Predicting Probabilities on Test Data
prob_predictions <- predict(final_lasso_model, newx = X_test, type =
"response")

#Converting Probabilities to a Numeric Vector

prob_predictions <- as.vector(prob_predictions)

Plotting the ROC Curve & Compute AUC (C-Statistic)

# Compute ROC curve
roc_curve <- roc(y_test, prob_predictions)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

# Plot ROC curve

plot(roc_curve, col = "blue", main = "ROC Curve for LASSO Logistic
Regression", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red") # Diagonal reference line

The Receiver Operating Characteristic (ROC) curve evaluates the model’s ability to classify
positive and negative cases. X-axis (Specificity): False Positive Rate (FPR). Y-axis
(Sensitivity): True Positive Rate (TPR). The blue curve represents the model’s
performance. The diagonal gray line represents a random classifier (AUC = 0.5). The closer
the blue curve is to the top-left corner, the better the model performs.
The ROC curve is very close to the top-left corner, indicating strong performance.
# Compute and display AUC
auc_value <- auc(roc_curve)
cat("Area Under the Curve (AUC / C-Statistic):", auc_value, "\n")

## Area Under the Curve (AUC / C-Statistic): 0.9694111

A high AUC (likely >0.90) suggests excellent discrimination between positive and negative
cases.

Episode 1 Emily in Paris
No ratings yet
Episode 1 Emily in Paris
18 pages
SCE and EBEEDM 2024 - Diabetes Playlist
No ratings yet
SCE and EBEEDM 2024 - Diabetes Playlist
6 pages
Pima Indian Diabetes Questions
No ratings yet
Pima Indian Diabetes Questions
6 pages
Predict Heart Disease
No ratings yet
Predict Heart Disease
55 pages
Stroke Prediction Dataset
No ratings yet
Stroke Prediction Dataset
48 pages
ADS Exp-1
No ratings yet
ADS Exp-1
3 pages
Diabetes Prediction Using Machine Learning
No ratings yet
Diabetes Prediction Using Machine Learning
20 pages
Pima Indians Diabetes Database Analysis - Kaggle
No ratings yet
Pima Indians Diabetes Database Analysis - Kaggle
37 pages
Diabetes_Prediction_1704256341
No ratings yet
Diabetes_Prediction_1704256341
17 pages
Mean Vector and Correlation Matrix in R - Jupyter Notebook
No ratings yet
Mean Vector and Correlation Matrix in R - Jupyter Notebook
7 pages
Diabetes
No ratings yet
Diabetes
97 pages
vertopal.com_python2025
No ratings yet
vertopal.com_python2025
25 pages
healthcare-project-simplilearn- Week1
No ratings yet
healthcare-project-simplilearn- Week1
6 pages
lab_8__(6)عفان عبدالله احمد_التكليف_
No ratings yet
lab_8__(6)عفان عبدالله احمد_التكليف_
18 pages
Heart Disease Indicator Prediction Model
No ratings yet
Heart Disease Indicator Prediction Model
17 pages
Exp 5
No ratings yet
Exp 5
7 pages
Mla - 2 (Cia - 2) - 20221013
No ratings yet
Mla - 2 (Cia - 2) - 20221013
14 pages
Logidtic_Regression_ASSIGNMENT
No ratings yet
Logidtic_Regression_ASSIGNMENT
13 pages
Diabetes EDA and Kears Modeling
No ratings yet
Diabetes EDA and Kears Modeling
26 pages
Understanding & Interpreting The Effects of Continuous Variables: The MCP (Marginscontplot) Command
No ratings yet
Understanding & Interpreting The Effects of Continuous Variables: The MCP (Marginscontplot) Command
18 pages
salil idc 2 (1)
No ratings yet
salil idc 2 (1)
11 pages
E_AI_Lab_EX_2and_3
No ratings yet
E_AI_Lab_EX_2and_3
9 pages
Stata All Command (Jahidul)
No ratings yet
Stata All Command (Jahidul)
13 pages
Univariate and Multivariate Analysis - Jupyter Notebook
No ratings yet
Univariate and Multivariate Analysis - Jupyter Notebook
5 pages
Aids
No ratings yet
Aids
88 pages
Assignment# 06
No ratings yet
Assignment# 06
16 pages
QT Report
No ratings yet
QT Report
20 pages
Health Related Quality of Life in Patients With Type - 2 Diabetes
No ratings yet
Health Related Quality of Life in Patients With Type - 2 Diabetes
16 pages
Logistic regression worksheet solution
No ratings yet
Logistic regression worksheet solution
3 pages
Ok 5
No ratings yet
Ok 5
7 pages
ML Proj Diabetes.pptx
No ratings yet
ML Proj Diabetes.pptx
51 pages
Soguero-Ruiz Et Al (2020) - Finding Associations Among Chronic Conditions by Bootstrap and Multiple Correspondence Analysis
No ratings yet
Soguero-Ruiz Et Al (2020) - Finding Associations Among Chronic Conditions by Bootstrap and Multiple Correspondence Analysis
8 pages
ML Practical 04
No ratings yet
ML Practical 04
20 pages
Python code
No ratings yet
Python code
12 pages
Programming With Python - Final Assignment - Valerie Riady Huette
No ratings yet
Programming With Python - Final Assignment - Valerie Riady Huette
11 pages
Computer Manual 2
No ratings yet
Computer Manual 2
65 pages
My Code
No ratings yet
My Code
7 pages
eda-ml-decision-tree.ipynb - Colab
No ratings yet
eda-ml-decision-tree.ipynb - Colab
20 pages
CH 1 Homework Sts
No ratings yet
CH 1 Homework Sts
5 pages
Multiple Disease Prediction Final Print Out
No ratings yet
Multiple Disease Prediction Final Print Out
46 pages
8.Perform Correlation and scatter plots (1)
No ratings yet
8.Perform Correlation and scatter plots (1)
5 pages
KNN For Classification
No ratings yet
KNN For Classification
5 pages
KNN For Classification
No ratings yet
KNN For Classification
4 pages
Asthama Disease Prediction Using Machine Learning !!!!: Importing Necessary Libraries
No ratings yet
Asthama Disease Prediction Using Machine Learning !!!!: Importing Necessary Libraries
55 pages
Homework 9 Solutions: Table (Type)
No ratings yet
Homework 9 Solutions: Table (Type)
6 pages
Paper On Diabetes
No ratings yet
Paper On Diabetes
4 pages
Diabetes and Glucose Correlation - IBM Machine Learning Training Project
No ratings yet
Diabetes and Glucose Correlation - IBM Machine Learning Training Project
10 pages
x23 Group 1 - Final Project cst383
No ratings yet
x23 Group 1 - Final Project cst383
25 pages
Aging Memory Data
No ratings yet
Aging Memory Data
6 pages
Eda Report
No ratings yet
Eda Report
8 pages
Week 1
No ratings yet
Week 1
38 pages
KNN - Jupyter Notebook (1)
No ratings yet
KNN - Jupyter Notebook (1)
7 pages
Diabetes Prediction
No ratings yet
Diabetes Prediction
1 page
Hcin620 m6 Lab6 Hanifahmutesi-Finalproject
No ratings yet
Hcin620 m6 Lab6 Hanifahmutesi-Finalproject
5 pages
Center For Diabetes Care
No ratings yet
Center For Diabetes Care
8 pages
Nutritional Reference Guide
100% (9)
Nutritional Reference Guide
362 pages
Diabetes & Other Disease_Dataset_Description
No ratings yet
Diabetes & Other Disease_Dataset_Description
2 pages
Documentation Code
No ratings yet
Documentation Code
20 pages
Capstone Project 2
No ratings yet
Capstone Project 2
15 pages
Dovdush_KN-305_lab3
No ratings yet
Dovdush_KN-305_lab3
2 pages
50 Python Concepts Every Developer Should Know
From Everand
50 Python Concepts Every Developer Should Know
Hernando Abella
No ratings yet
Java for Black Jack: Learn the Java Programming Language in One Session by Writing and Running a Java-Based Card Game Simulation
From Everand
Java for Black Jack: Learn the Java Programming Language in One Session by Writing and Running a Java-Based Card Game Simulation
U.Q. Magnusson
No ratings yet
WT Oet
No ratings yet
WT Oet
13 pages
Nurs FPX 4030 Assessment 2 Determining The Credibility of Evidence and Resources
No ratings yet
Nurs FPX 4030 Assessment 2 Determining The Credibility of Evidence and Resources
5 pages
SHPP.694442 15560863 31102023
No ratings yet
SHPP.694442 15560863 31102023
12 pages
Cycle 3 Day3
No ratings yet
Cycle 3 Day3
4 pages
Meglitinide Analogues Are Classified As Insulin Secretagogues and Have Actions and
No ratings yet
Meglitinide Analogues Are Classified As Insulin Secretagogues and Have Actions and
6 pages
Gestational Diabetes NCP-3
No ratings yet
Gestational Diabetes NCP-3
2 pages
1214 Bulletin of The Japanese Society of Scientific Fisheries Vol. 36, No. 12, 1970
No ratings yet
1214 Bulletin of The Japanese Society of Scientific Fisheries Vol. 36, No. 12, 1970
5 pages
Complications of Diabetes Mellitus
100% (1)
Complications of Diabetes Mellitus
46 pages
Diabetes Mellitus
No ratings yet
Diabetes Mellitus
37 pages
Jurnal Daun Kersen Untuk Diabetes
No ratings yet
Jurnal Daun Kersen Untuk Diabetes
11 pages
Anita Ramamurthy
No ratings yet
Anita Ramamurthy
3 pages
Diabetes Obesity Metabolism - 2023 - P Pin
No ratings yet
Diabetes Obesity Metabolism - 2023 - P Pin
8 pages
Afghanistan Essay
100% (2)
Afghanistan Essay
6 pages
Insulin & Oral Hypoglycemic Agents
No ratings yet
Insulin & Oral Hypoglycemic Agents
54 pages
CXXXXXXX
No ratings yet
CXXXXXXX
11 pages
SSC 15 Diabetes Combination
No ratings yet
SSC 15 Diabetes Combination
1 page
ECDMR-05-00136 Diabetes and The Power of Speech Dec 30 2020
No ratings yet
ECDMR-05-00136 Diabetes and The Power of Speech Dec 30 2020
5 pages
Heald 2016
No ratings yet
Heald 2016
4 pages
Faktor Risiko Dan Komorbiditas Migrain: (Risk Factor and Comorbidity of Migraine)
No ratings yet
Faktor Risiko Dan Komorbiditas Migrain: (Risk Factor and Comorbidity of Migraine)
14 pages
Metformin Hydro Chloride
No ratings yet
Metformin Hydro Chloride
3 pages
Report b2771712
No ratings yet
Report b2771712
16 pages
Topic 3 Endo
No ratings yet
Topic 3 Endo
3 pages
Alligator Pepper Cleansing As... - Favour Spiritu
0% (1)
Alligator Pepper Cleansing As... - Favour Spiritu
1 page
2023 Pharmacogenomics Diabetes FAA - 240118 - 071252
No ratings yet
2023 Pharmacogenomics Diabetes FAA - 240118 - 071252
24 pages
Data Information and Knowledge Lecture 3
No ratings yet
Data Information and Knowledge Lecture 3
6 pages
Comparative Analysis of Data Preparation Methods Employed in Prediction of Diabetes Mellitus Diagnosis Using eICU-CRD Dataset
No ratings yet
Comparative Analysis of Data Preparation Methods Employed in Prediction of Diabetes Mellitus Diagnosis Using eICU-CRD Dataset
6 pages
Cc2 Exam September
No ratings yet
Cc2 Exam September
7 pages
Blood Glucose Measurement OSCE Guide
No ratings yet
Blood Glucose Measurement OSCE Guide
9 pages

Alam-Proj1

Uploaded by

Alam-Proj1

Uploaded by

Project 1

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## The following objects are masked from 'package:stats':

## The following objects are masked from 'package:base':

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'skimr' was built under R version 4.3.3

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice

## Warning: package 'glmnet' was built under R version 4.3.3

## Loading required package: Matrix

## Loaded glmnet 4.1-8

## Warning: package 'pROC' was built under R version 4.3.3

## The following objects are masked from 'package:stats':

## Rows: 520 Columns: 17

## spc_tbl_ [520 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)

## Age Gender Polyuria Polydipsia

## Age Gender Polyuria

There are no missing values.

#Function to Perform Chi-square or Fisher’s test and store results

#Checking for Small Cell Counts

#Determining if Variable Should be Selected

#Appending Results to Dataframe

## Variable Test_Used P_Value Selection

#Defining Training Size (2/3 of the Dataset)

#Creating Training and Test Datasets

#Checking the Dimensions of Each Set

## Training Data Size: 348

cat("Test Data Size:", nrow(D2), "\n")

## Test Data Size: 172

#Verifying Class Distribution Remains Similar

Fit LASSO Regularized Logistic Regression

#Fitting LASSO Model

A LASSO-regularized logistic regression model was fitted using 10-fold cross-validation.

Best Lambda (Minimizing Cross-Validation Error)

Fit Final Model Using Best Lambda

#Print Model Coefficients

## 17 x 1 sparse Matrix of class "dgCMatrix"

#Convert to Dataframe for Visualization

#Display Important Variables

Visualizing Important Predictors

#Creating Model Matrix for Test Data

Generate Predictions Using Final LASSO Model

#Converting Probabilities to a Numeric Vector

Plotting the ROC Curve & Compute AUC (C-Statistic)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

# Plot ROC curve

## Area Under the Curve (AUC / C-Statistic): 0.9694111

You might also like