0% found this document useful (0 votes)
7 views

Alam-Proj1

The document outlines a project analyzing a diabetes dataset with 520 observations and 17 variables, focusing on various health indicators and demographic information. It includes data loading, structure examination, summary statistics, missing values check, class distribution, and univariate analyses for different health symptoms. Visualizations are provided to illustrate the distribution of age, gender, and other health-related variables in relation to diabetes diagnosis.

Uploaded by

tamim.sfsab
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views

Alam-Proj1

The document outlines a project analyzing a diabetes dataset with 520 observations and 17 variables, focusing on various health indicators and demographic information. It includes data loading, structure examination, summary statistics, missing values check, class distribution, and univariate analyses for different health symptoms. Visualizations are provided to illustrate the distribution of age, gender, and other health-related variables in relation to diabetes diagnosis.

Uploaded by

tamim.sfsab
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 32

Project 1

Tamim Alam_80764318

2025-02-07
Loading Libraries
library(readr)

## Warning: package 'readr' was built under R version 4.3.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.3

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':


##
## filter, lag

## The following objects are masked from 'package:base':


##
## intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(skimr)

## Warning: package 'skimr' was built under R version 4.3.3

library(caret)

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice

library(glmnet)

## Warning: package 'glmnet' was built under R version 4.3.3

## Loading required package: Matrix

## Loaded glmnet 4.1-8

library(pROC)

## Warning: package 'pROC' was built under R version 4.3.3


## Type 'citation("pROC")' for a citation.

##
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':


##
## cov, smooth, var

Reading Dataset
diabetes_data <- read_csv("diabetes_data_upload.csv")

## Rows: 520 Columns: 17


## ── Column specification
────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Polyuria, Polydipsia, sudden weight loss, weakness,
Polyph...
## dbl (1): Age
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this
message.

head(diabetes_data)

## # A tibble: 6 × 17
## Age Gender Polyuria Polydipsia `sudden weight loss` weakness
Polyphagia
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 40 Male No Yes No Yes No
## 2 58 Male No No No Yes No
## 3 41 Male Yes No No Yes Yes
## 4 45 Male No No Yes Yes Yes
## 5 60 Male Yes Yes Yes Yes Yes
## 6 55 Male Yes Yes No Yes Yes
## # ℹ 10 more variables: `Genital thrush` <chr>, `visual blurring` <chr>,
## # Itching <chr>, Irritability <chr>, `delayed healing` <chr>,
## # `partial paresis` <chr>, `muscle stiffness` <chr>, Alopecia <chr>,
## # Obesity <chr>, class <chr>

Structure of Dataset
str(diabetes_data)

## spc_tbl_ [520 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)


## $ Age : num [1:520] 40 58 41 45 60 55 57 66 67 70 ...
## $ Gender : chr [1:520] "Male" "Male" "Male" "Male" ...
## $ Polyuria : chr [1:520] "No" "No" "Yes" "No" ...
## $ Polydipsia : chr [1:520] "Yes" "No" "No" "No" ...
## $ sudden weight loss: chr [1:520] "No" "No" "No" "Yes" ...
## $ weakness : chr [1:520] "Yes" "Yes" "Yes" "Yes" ...
## $ Polyphagia : chr [1:520] "No" "No" "Yes" "Yes" ...
## $ Genital thrush : chr [1:520] "No" "No" "No" "Yes" ...
## $ visual blurring : chr [1:520] "No" "Yes" "No" "No" ...
## $ Itching : chr [1:520] "Yes" "No" "Yes" "Yes" ...
## $ Irritability : chr [1:520] "No" "No" "No" "No" ...
## $ delayed healing : chr [1:520] "Yes" "No" "Yes" "Yes" ...
## $ partial paresis : chr [1:520] "No" "Yes" "No" "No" ...
## $ muscle stiffness : chr [1:520] "Yes" "No" "Yes" "No" ...
## $ Alopecia : chr [1:520] "Yes" "Yes" "Yes" "No" ...
## $ Obesity : chr [1:520] "Yes" "No" "No" "No" ...
## $ class : chr [1:520] "Positive" "Positive" "Positive"
"Positive" ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Gender = col_character(),
## .. Polyuria = col_character(),
## .. Polydipsia = col_character(),
## .. `sudden weight loss` = col_character(),
## .. weakness = col_character(),
## .. Polyphagia = col_character(),
## .. `Genital thrush` = col_character(),
## .. `visual blurring` = col_character(),
## .. Itching = col_character(),
## .. Irritability = col_character(),
## .. `delayed healing` = col_character(),
## .. `partial paresis` = col_character(),
## .. `muscle stiffness` = col_character(),
## .. Alopecia = col_character(),
## .. Obesity = col_character(),
## .. class = col_character()
## .. )
## - attr(*, "problems")=<externalptr>

The dataset has 520 observations and 17 columns. Age is a numeric (double) variable. The
remaining variables (Gender, Polyuria, Polydipsia, etc.) are stored as character
(categorical) variables. The target variable class (indicating diabetes diagnosis) is also a
character variable with values “Positive” and “Negative”.
Summary Statistics
summary(diabetes_data)

## Age Gender Polyuria Polydipsia


## Min. :16.00 Length:520 Length:520 Length:520
## 1st Qu.:39.00 Class :character Class :character Class :character
## Median :47.50 Mode :character Mode :character Mode :character
## Mean :48.03
## 3rd Qu.:57.00
## Max. :90.00
## sudden weight loss weakness Polyphagia Genital thrush
## Length:520 Length:520 Length:520 Length:520
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## visual blurring Itching Irritability delayed healing
## Length:520 Length:520 Length:520 Length:520
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## partial paresis muscle stiffness Alopecia Obesity
## Length:520 Length:520 Length:520 Length:520
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## class
## Length:520
## Class :character
## Mode :character
##
##
##

The dataset has 520 observations. Age is the only numeric variable. Min: 16, Max: 90,
Median: 47.5, Mean: 48.03, Interquartile Range: 39 - 57 (Middle 50%). All other variables,
including class, are stored as character variables.
Checking Missing Values
missing_values <- colSums(is.na(diabetes_data))
print(missing_values)

## Age Gender Polyuria


Polydipsia
## 0 0 0
0
## sudden weight loss weakness Polyphagia Genital
thrush
## 0 0 0
0
## visual blurring Itching Irritability delayed
healing
## 0 0 0
0
## partial paresis muscle stiffness Alopecia
Obesity
## 0 0 0
0
## class
## 0

There are no missing values.


Frequency of Target Variable
class_distribution <- diabetes_data %>%
count(class) %>%
mutate(percentage = n / sum(n) * 100)

print(class_distribution)

## # A tibble: 2 × 3
## class n percentage
## <chr> <int> <dbl>
## 1 Negative 200 38.5
## 2 Positive 320 61.5

The dataset has an imbalanced class distribution but not extreme. 61.54% of cases are
Positive (diabetes present). 38.46% of cases are Negative (no diabetes).
Visualizing Class Distribution
ggplot(diabetes_data, aes(x = class, fill = class)) +
geom_bar() +
labs(title = "Class Distribution of Diabetes Diagnosis", x = "Diabetes
Diagnosis", y = "Count") +
theme_minimal()
Univariate Analysis: Age
ggplot(diabetes_data, aes(x = Age)) +
geom_histogram(bins = 10, fill = "skyblue", color = "black") +
labs(title = "Histogram of Age", x = "Age", y = "Frequency") +
theme_minimal()
The distribution appears slightly right-skewed, indicating a small proportion of older
individuals. Most individuals are between 30 to 60 years old, with fewer younger (<25) and
older (>75) participants.
Univariate Analysis: Gender
ggplot(diabetes_data, aes(x = Gender, fill = Gender)) +
geom_bar() +
labs(title = "Distribution of Gender", x = "Gender", y = "Count") +
theme_minimal()
The dataset contains more male participants than female participants.
Univariate Analysis: Polyuria
ggplot(diabetes_data, aes(x = Polyuria, fill = Polyuria)) +
geom_bar() +
labs(title = "Distribution of Polyuria", x = "Polyuria", y = "Count") +
theme_minimal()
The counts of “Yes” (Polyuria present) and “No” (Polyuria absent) are nearly the same.
Univariate Analysis: Polydipsia
ggplot(diabetes_data, aes(x = Polydipsia, fill = Polydipsia)) +
geom_bar() +
labs(title = "Distribution of Polydipsia", x = "Polydipsia", y = "Count") +
theme_minimal()
More individuals in
the dataset do not have Polydipsia compared to those who do.
Univariate Analysis: Sudden Weight Loss
ggplot(diabetes_data, aes(x = `sudden weight loss`, fill = `sudden weight
loss`)) +
geom_bar() +
labs(title = "Distribution of Sudden Weight Loss", x = "Sudden Weight
Loss", y = "Count") +
theme_minimal()
More individuals in the dataset do not have sudden weight loss compared to those who do.
Univariate Analysis: Weakness
ggplot(diabetes_data, aes(x = weakness, fill = weakness)) +
geom_bar() +
labs(title = "Distribution of Weakness", x = "Weakness", y = "Count") +
theme_minimal()
More individuals in the dataset experience weakness (“Yes”) compared to those who do
not.
Univariate Analysis: Polyphagia
ggplot(diabetes_data, aes(x = Polyphagia, fill = Polyphagia)) +
geom_bar() +
labs(title = "Distribution of Polyphagia", x = "Polyphagia", y = "Count") +
theme_minimal()
More individuals in the dataset do not have Polyphagia (“No”) compared to those who do.
Univariate Analysis: Genital Thrush
ggplot(diabetes_data, aes(x = `Genital thrush`, fill = `Genital thrush`)) +
geom_bar() +
labs(title = "Distribution of Genital Thrush", x = "Genital Thrush", y =
"Count") +
theme_minimal()
The majority of individuals do not have Genital Thrush (“No”), while a much smaller group
does (“Yes”).
Univariate Analysis: Visual Blurring
ggplot(diabetes_data, aes(x = `visual blurring`, fill = `visual blurring`)) +
geom_bar() +
labs(title = "Distribution of Visual Blurring", x = "Visual Blurring", y =
"Count") +
theme_minimal()
More individuals in the dataset do not experience visual blurring (“No”) compared to those
who do.
Univariate Analysis: Itching
ggplot(diabetes_data, aes(x = Itching, fill = Itching)) +
geom_bar() +
labs(title = "Distribution of Itching", x = "Itching", y = "Count") +
theme_minimal()
The counts of “Yes” (Itching present) and “No” (Itching absent) are similar.
Univariate Analysis: Irritability
ggplot(diabetes_data, aes(x = Irritability, fill = Irritability)) +
geom_bar() +
labs(title = "Distribution of Irritability", x = "Irritability", y =
"Count") +
theme_minimal()
The majority of individuals do not experience Irritability (“No”), while a much smaller
group does (“Yes”).
Univariate Analysis: Delayed Healing
ggplot(diabetes_data, aes(x = `delayed healing`, fill = `delayed healing`)) +
geom_bar() +
labs(title = "Distribution of Delayed Healing", x = "Delayed Healing", y =
"Count") +
theme_minimal()
More individuals in the dataset do not experience delayed healing (“No”) compared to
those who do.
Univariate Analysis: Partial Paresis
ggplot(diabetes_data, aes(x = `partial paresis`, fill = `partial paresis`)) +
geom_bar() +
labs(title = "Distribution of Partial Paresis", x = "Partial Paresis", y =
"Count") +
theme_minimal()
More individuals in the dataset do not have partial paresis (“No”) compared to those who
do.
Univariate Analysis: Muscle Stiffness
ggplot(diabetes_data, aes(x = `muscle stiffness`, fill = `muscle stiffness`))
+
geom_bar() +
labs(title = "Distribution of Muscle Stiffness", x = "Muscle Stiffness", y
= "Count") +
theme_minimal()
More individuals in the dataset do not experience muscle stiffness (“No”) compared to
those who do.
Univariate Analysis: Alopecia
ggplot(diabetes_data, aes(x = Alopecia, fill = Alopecia)) +
geom_bar() +
labs(title = "Distribution of Alopecia", x = "Alopecia", y = "Count") +
theme_minimal()
More individuals in the dataset do not have Alopecia (“No”) compared to those who do.
Univariate Analysis: Obesity
ggplot(diabetes_data, aes(x = Obesity, fill = Obesity)) +
geom_bar() +
labs(title = "Distribution of Obesity", x = "Obesity", y = "Count") +
theme_minimal()
The majority of individuals do not have obesity (“No”), while a much smaller group does
(“Yes”).
Bivariate Association With Class (Numerical Variables)
ggplot(diabetes_data, aes(x = Age, fill = class)) +
geom_density(alpha = 0.5) +
labs(title = "Age Distribution by Diabetes Diagnosis", x = "Age", y =
"Density") +
theme_minimal()
The age distribution differs between Positive and Negative diabetes cases. Younger
individuals (below 40) have a higher proportion of Negative cases. Middle-aged individuals
(40–60) have more Positive cases. Older individuals (above 60) show a more balanced
distribution.
Shapiro-Wilk Normality Test for Age in Each Class
shapiro_test_negative <- shapiro.test(diabetes_data$Age[diabetes_data$class
== "Negative"])
shapiro_test_positive <- shapiro.test(diabetes_data$Age[diabetes_data$class
== "Positive"])

print(shapiro_test_negative)

##
## Shapiro-Wilk normality test
##
## data: diabetes_data$Age[diabetes_data$class == "Negative"]
## W = 0.96687, p-value = 0.0001182

print(shapiro_test_positive)

##
## Shapiro-Wilk normality test
##
## data: diabetes_data$Age[diabetes_data$class == "Positive"]
## W = 0.9804, p-value = 0.0002325
The Shapiro-Wilk test was used to check the normality of Age distribution for both
Negative and Positive diabetes cases. p-values are very small (< 0.05) for both groups,
indicating that Age is not normally distributed in either class. Since normality is violated, a
non-parametric test (Wilcoxon rank-sum test) should be used instead of a t-test for
comparing Age between classes.
Wilcoxon Rank Sum Test
wilcox_test_result <- wilcox.test(Age ~ class, data = diabetes_data)
print(wilcox_test_result)

##
## Wilcoxon rank sum test with continuity correction
##
## data: Age by class
## W = 27834, p-value = 0.0124
## alternative hypothesis: true location shift is not equal to 0

The p-value (0.0124) is less than 0.05, indicating a statistically significant difference in age
distribution between the two groups. This suggests that Age is an important factor in
distinguishing between diabetes-positive and diabetes-negative individuals.
Bivariate Association With Class (Categorical Variables)
#Creating an Empty Dataframe to Store Results
results_df <- data.frame(Variable = character(), Test_Used = character(),
P_Value = numeric(), Selection = character(), stringsAsFactors = FALSE)

#Function to Perform Chi-square or Fisher’s test and store results


perform_test_and_store <- function(variable, alpha = 0.15) {
#Creating Contingency Table
table_var <- table(diabetes_data[[variable]], diabetes_data$class)

#Checking for Small Cell Counts


if (any(table_var < 5)) {
test_result <- fisher.test(table_var) # Use Fisher’s Exact Test
test_type <- "Fisher’s Exact Test"
} else {
test_result <- chisq.test(table_var) # Use Chi-Square Test
test_type <- "Chi-Square Test"
}

#Determining if Variable Should be Selected


selection <- ifelse(test_result$p.value <= alpha, "Select", "Don't Select")

#Appending Results to Dataframe


results_df <<- bind_rows(results_df,
data.frame(Variable = variable,
Test_Used = test_type,
P_Value = test_result$p.value,
Selection = selection))
}

perform_test_and_store("Gender")
perform_test_and_store("Polyuria")
perform_test_and_store("Polydipsia")
perform_test_and_store("sudden weight loss")
perform_test_and_store("weakness")
perform_test_and_store("Polyphagia")
perform_test_and_store("Genital thrush")
perform_test_and_store("visual blurring")
perform_test_and_store("Itching")
perform_test_and_store("Irritability")
perform_test_and_store("delayed healing")
perform_test_and_store("partial paresis")
perform_test_and_store("muscle stiffness")
perform_test_and_store("Alopecia")
perform_test_and_store("Obesity")

print(results_df)

## Variable Test_Used P_Value Selection


## 1 Gender Chi-Square Test 3.289704e-24 Select
## 2 Polyuria Chi-Square Test 1.740912e-51 Select
## 3 Polydipsia Chi-Square Test 6.187010e-49 Select
## 4 sudden weight loss Chi-Square Test 5.969166e-23 Select
## 5 weakness Chi-Square Test 4.869843e-08 Select
## 6 Polyphagia Chi-Square Test 1.165158e-14 Select
## 7 Genital thrush Chi-Square Test 1.609790e-02 Select
## 8 visual blurring Chi-Square Test 1.701504e-08 Select
## 9 Itching Chi-Square Test 8.297484e-01 Don't Select
## 10 Irritability Chi-Square Test 1.771483e-11 Select
## 11 delayed healing Chi-Square Test 3.266599e-01 Don't Select
## 12 partial paresis Chi-Square Test 1.565289e-22 Select
## 13 muscle stiffness Chi-Square Test 6.939096e-03 Select
## 14 Alopecia Chi-Square Test 1.909279e-09 Select
## 15 Obesity Chi-Square Test 1.271080e-01 Select

The Chi-Square test was used to assess the association between categorical variables and
diabetes diagnosis.
A significance threshold (α = 0.15) was applied to determine variable selection.
Variables Must be Selected (p-value < 0.15): Gender (3.28e-24) Polyuria (1.74e-51)
Polydipsia (6.18e-49) Sudden Weight Loss (5.97e-23) Weakness (4.87e-08) Polyphagia
(1.16e-14) Genital Thrush (1.61e-07) Visual Blurring (1.70e-08) Irritability (1.77e-11)
Partial Paresis (1.56e-22) Muscle Stiffness (6.93e-03) Alopecia (1.90e-09) Obesity (1.27e-
01)
Variables Can be Ignored (p-value ≥ 0.15): Itching (p = 0.8297) Delayed Healing (p =
0.3267)
Data Partition
#For reproducibility
set.seed(123)

#Defining Training Size (2/3 of the Dataset)


train_index <- createDataPartition(diabetes_data$class, p = 2/3, list =
FALSE)

#Creating Training and Test Datasets


D1 <- diabetes_data[train_index, ] # Training set (2/3 of data)
D2 <- diabetes_data[-train_index, ] # Test set (1/3 of data)

#Checking the Dimensions of Each Set


cat("Training Data Size:", nrow(D1), "\n")

## Training Data Size: 348

cat("Test Data Size:", nrow(D2), "\n")

## Test Data Size: 172

#Verifying Class Distribution Remains Similar


table(D1$class)

##
## Negative Positive
## 134 214

table(D2$class)

##
## Negative Positive
## 66 106

The dataset has been splitted into a training set (D1) and a test set (D2) in a 2:1 ratio.
Training and Test Set Sizes: Training Set (D1): 348 observations (2/3 of the dataset) Test
Set (D2): 172 observations (1/3 of the dataset)
Class Distribution Check:
Training Set (D1): 134 Negative cases 214 Positive cases Test Set (D2): 66 Negative cases
106 Positive cases The class distribution in both sets is similar to the original dataset,
meaning the partitioning preserved the originality of the dataset.
Logistic Regression Modeling Data Preparation for Logistic Regression
#Converting Class Variable to Numeric (0 for Negative, 1 for Positive)
D1$class <- ifelse(D1$class == "Positive", 1, 0)
#Creating Model Matrix
X_train <- model.matrix(class ~ ., data = D1)[, -1] # Removing Intercept
Column
y_train <- D1$class # Response variable

Fit LASSO Regularized Logistic Regression


#Defining Sequence of Lambda Values
lambda_seq <- 10^seq(3, -3, by = -0.1)

#Fitting LASSO Model


lasso_model <- cv.glmnet(X_train, y_train, alpha = 1, lambda = lambda_seq,
family = "binomial", nfolds = 10)
lasso_model

##
## Call: cv.glmnet(x = X_train, y = y_train, lambda = lambda_seq, nfolds =
10, alpha = 1, family = "binomial")
##
## Measure: Binomial Deviance
##
## Lambda Index Measure SE Nonzero
## min 0.003981 55 0.4245 0.05210 13
## 1se 0.015849 49 0.4717 0.03794 12

A LASSO-regularized logistic regression model was fitted using 10-fold cross-validation.


The model was trained using a sequence of lambda (λ) values ranging from 10^3 to 10^-3.
The best lambda values identified: Min λ (0.003981): Gives the lowest binomial deviance
with 13 nonzero coefficients.
Plotting Cross-Validation Error
plot(lasso_model)
The plot shows binomial deviance vs. log(λ) for the LASSO-regularized logistic regression
model. Each red dot represents a different λ value, with corresponding binomial deviance.
Vertical dashed lines indicate The optimal λ (minimum deviance). The top axis numbers
show the number of nonzero coefficients (selected features) at each λ value. As log(λ)
increases, more coefficients shrink to zero, leading to a simpler model.

Best Lambda (Minimizing Cross-Validation Error)


best_lambda <- lasso_model$lambda.min
print(best_lambda)

## [1] 0.003981072

Fit Final Model Using Best Lambda


#Train Final Model with Selected Lambda
final_lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda =
best_lambda, family = "binomial")

#Print Model Coefficients


coef(final_lasso_model)

## 17 x 1 sparse Matrix of class "dgCMatrix"


## s0
## (Intercept) 1.5925199
## Age -0.0406250
## GenderMale -2.8076771
## PolyuriaYes 2.8363262
## PolydipsiaYes 4.1597332
## `sudden weight loss`Yes 0.9006294
## weaknessYes 0.2782634
## PolyphagiaYes 0.2373943
## `Genital thrush`Yes 1.5566422
## `visual blurring`Yes 1.0457586
## ItchingYes -1.7015408
## IrritabilityYes 1.9926249
## `delayed healing`Yes .
## `partial paresis`Yes 1.1414596
## `muscle stiffness`Yes .
## AlopeciaYes .
## ObesityYes -1.2646710

This shows the final LASSO logistic regression model coefficients. Variables with nonzero
coefficients were selected as important predictors of diabetes. Zero or missing coefficients
(.) indicate that LASSO removed those variables due to their lower relevance. Key selected
predictors: Strong positive associations with diabetes: Polydipsia (4.16), Polyuria (2.83),
Irritability (1.99), Genital Thrush (1.55). Strong negative associations with diabetes:
Gender (Male) (-2.80), Itching (-1.70), Obesity (-1.26). Weaker but still important
predictors: Sudden weight loss (0.90), Visual blurring (1.04), Partial Paresis (1.14).
Dropped variables: Delayed Healing and Muscle Stiffness were removed (coefficient = 0).
Identify and Interpret Important Predictors
#Extracting Nonzero Coefficients
important_vars <- coef(final_lasso_model)[, 1]
important_vars <- important_vars[important_vars != 0]

#Convert to Dataframe for Visualization


important_vars_df <- data.frame(Variable = names(important_vars), Coefficient
= important_vars)

#Display Important Variables


print(important_vars_df)

## Variable Coefficient
## (Intercept) (Intercept) 1.5925199
## Age Age -0.0406250
## GenderMale GenderMale -2.8076771
## PolyuriaYes PolyuriaYes 2.8363262
## PolydipsiaYes PolydipsiaYes 4.1597332
## `sudden weight loss`Yes `sudden weight loss`Yes 0.9006294
## weaknessYes weaknessYes 0.2782634
## PolyphagiaYes PolyphagiaYes 0.2373943
## `Genital thrush`Yes `Genital thrush`Yes 1.5566422
## `visual blurring`Yes `visual blurring`Yes 1.0457586
## ItchingYes ItchingYes -1.7015408
## IrritabilityYes IrritabilityYes 1.9926249
## `partial paresis`Yes `partial paresis`Yes 1.1414596
## ObesityYes ObesityYes -1.2646710

Visualizing Important Predictors


ggplot(important_vars_df, aes(x = reorder(Variable, Coefficient), y =
Coefficient, fill = Coefficient)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Important Predictors from LASSO Logistic Regression", x =
"Variable", y = "Coefficient") +
theme_minimal()

Model Assessment/Deployment
#Converting Class Variable in Test Data to Numeric (0 for Negative, 1 for
Positive)
D2$class <- ifelse(D2$class == "Positive", 1, 0)

#Creating Model Matrix for Test Data


X_test <- model.matrix(class ~ ., data = D2)[, -1] #Removing Intercept
column
y_test <- D2$class # Response Variable

Generate Predictions Using Final LASSO Model


#Predicting Probabilities on Test Data
prob_predictions <- predict(final_lasso_model, newx = X_test, type =
"response")

#Converting Probabilities to a Numeric Vector


prob_predictions <- as.vector(prob_predictions)

Plotting the ROC Curve & Compute AUC (C-Statistic)


# Compute ROC curve
roc_curve <- roc(y_test, prob_predictions)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

# Plot ROC curve


plot(roc_curve, col = "blue", main = "ROC Curve for LASSO Logistic
Regression", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red") # Diagonal reference line

The Receiver Operating Characteristic (ROC) curve evaluates the model’s ability to classify
positive and negative cases. X-axis (Specificity): False Positive Rate (FPR). Y-axis
(Sensitivity): True Positive Rate (TPR). The blue curve represents the model’s
performance. The diagonal gray line represents a random classifier (AUC = 0.5). The closer
the blue curve is to the top-left corner, the better the model performs.
The ROC curve is very close to the top-left corner, indicating strong performance.
# Compute and display AUC
auc_value <- auc(roc_curve)
cat("Area Under the Curve (AUC / C-Statistic):", auc_value, "\n")

## Area Under the Curve (AUC / C-Statistic): 0.9694111

A high AUC (likely >0.90) suggests excellent discrimination between positive and negative
cases.

You might also like