Alam-Proj2

Tamim Alam_80764318

2025-02-28
Loading Necessary Libraries
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.3

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(caret)

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice

Ques 1
# Read the dataset (assuming the file is named "Shill Bidding Dataset.csv")
df <- read.csv("Shill Bidding Dataset.csv")

# Remove the first three columns (ID variables)


df <- df %>% select(-c(Record_ID, Auction_ID, Bidder_ID))

# Convert Class values: Change 0 to -1


df$Class <- ifelse(df$Class == 0, -1, 1)

# Display the structure of the updated dataset


str(df)

## 'data.frame': 6321 obs. of 10 variables:
##  $ Bidder_Tendency       : num  0.2 0.0244 0.1429 0.1 0.0513 ...
##  $ Bidding_Ratio         : num  0.4 0.2 0.2 0.2 0.222 ...
##  $ Successive_Outbidding : num  0 0 0 0 0 0 0 1 1 0.5 ...
##  $ Last_Bidding          : num  2.78e-05 1.31e-02 3.04e-03 9.75e-02 1.32e-03 ...
##  $ Auction_Bids          : num  0 0 0 0 0 ...
##  $ Starting_Price_Average: num  0.994 0.994 0.994 0.994 0 ...
##  $ Early_Bidding         : num  2.78e-05 1.31e-02 3.04e-03 9.75e-02 1.24e-03 ...
##  $ Winning_Ratio         : num  0.667 0.944 1 1 0.5 ...
##  $ Auction_Duration      : int  5 5 5 5 7 7 7 7 7 7 ...
##  $ Class                 : num  -1 -1 -1 -1 -1 -1 -1 1 1 1 ...

# Display first few rows


head(df)

##   Bidder_Tendency Bidding_Ratio Successive_Outbidding Last_Bidding Auction_Bids
## 1      0.20000000     0.4000000                     0 0.0000277778            0
## 2      0.02439024     0.2000000                     0 0.0131226852            0
## 3      0.14285714     0.2000000                     0 0.0030416667            0
## 4      0.10000000     0.2000000                     0 0.0974768519            0
## 5      0.05128205     0.2222222                     0 0.0013177910            0
## 6      0.03846154     0.1111111                     0 0.0168435847            0
##   Starting_Price_Average Early_Bidding Winning_Ratio Auction_Duration Class
## 1              0.9935928  0.0000277778     0.6666667                5    -1
## 2              0.9935928  0.0131226852     0.9444444                5    -1
## 3              0.9935928  0.0030416667     1.0000000                5    -1
## 4              0.9935928  0.0974768519     1.0000000                5    -1
## 5              0.0000000  0.0012417328     0.5000000                7    -1
## 6              0.0000000  0.0168435847     0.8000000                7    -1

Ques 2a
# Compute the number of unique values for each column
distinct_counts <- sapply(df, function(x) length(unique(x)))

# Print the distinct counts for each variable


print(distinct_counts)
## Bidder_Tendency Bidding_Ratio Successive_Outbidding
## 489 400 3
## Last_Bidding Auction_Bids Starting_Price_Average
## 5807 49 22
## Early_Bidding Winning_Ratio Auction_Duration
## 5690 72 5
## Class
## 2

Comment:
1. Binary variable: Class (2 unique values) is the target, taking -1 (normal behavior) and 1 (otherwise), so this is a binary classification problem.
2. Variables with very few unique values (likely categorical or discrete): Successive_Outbidding (3 unique values) describes whether a bidder successively outbids themselves, so it is likely ordinal (e.g., low, medium, high frequency). Auction_Duration (5 unique values) is limited to a few possible durations, so it is likely categorical or discrete numeric.
3. Variables with a moderate number of unique values (likely discrete numeric): Auction_Bids (49 unique values) counts bids, and auctions with shill bidding tend to have higher bid counts; since bid counts are whole numbers, it is discrete numeric. Starting_Price_Average (22 unique values) takes relatively few distinct values, meaning auctions often begin at specific price points rather than arbitrary values, so it is discrete numeric with limited granularity.
4. Variables with many unique values (likely continuous): Bidder_Tendency (489 unique values) measures how concentrated a bidder's activity is within a small group of sellers. Bidding_Ratio (400 unique values) indicates how frequently a bidder participates. Last_Bidding (5807 unique values) measures inactivity towards the end of the auction. Early_Bidding (5690 unique values) indicates whether a bidder places bids early in an auction. Winning_Ratio (72 unique values) represents how often a bidder wins; it is continuous but with a somewhat limited range.
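
As a quick check on these guesses (not part of the original output), the actual values behind the low-cardinality columns can be listed directly with base R's table():

# Inspect the values taken by the low-cardinality columns
table(df$Successive_Outbidding)
table(df$Auction_Duration)
table(df$Class)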
Ques 2b
# Check for missing values in the dataset
missing_values <- colSums(is.na(df))

# Print the missing values count for each column


print(missing_values)

## Bidder_Tendency Bidding_Ratio Successive_Outbidding
## 0 0 0
## Last_Bidding Auction_Bids Starting_Price_Average
## 0 0 0
## Early_Bidding Winning_Ratio Auction_Duration
## 0 0 0
## Class
## 0

Comment: The dataset has no missing values.


Ques 2c
# Convert dataset to long format for ggplot
df_long <- reshape2::melt(df, id.vars = "Class")

# Create parallel boxplots


ggplot(df_long, aes(x = variable, y = value)) +
geom_boxplot() +
labs(title = "Parallel Boxplot of Predictors",
x = "Predictors",
y = "Values") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better visibility

Comment: The parallel boxplot clearly shows that Auction_Duration has a much larger range than the other variables, most of which are tightly packed near zero and therefore contribute much less in their raw form. Some predictors, such as Successive_Outbidding, have a small range with a few outliers; others, such as Auction_Bids and Winning_Ratio, have a moderate range. Without scaling, variables with a large range will dominate distance-based models.
Kernel-based methods (such as SVM or the kernelized LDA used later in this project) are sensitive to feature scales: without scaling, features with higher magnitudes (e.g., Auction_Duration) influence the model more strongly than features with smaller values (e.g., Bidding_Ratio). Scaling ensures all features contribute comparably, so it is applied before fitting these models.
# Identify numeric columns (excluding Class)
numeric_features <- df %>% select(-Class)

# Standardize the numeric features (Z-score normalization)


df_scaled <- numeric_features %>% mutate(across(everything(), scale))

# Add the Class column back to the scaled dataset


df_scaled$Class <- df$Class

# Display the summary of the scaled data


summary(df_scaled)

## Bidder_Tendency.V1 Bidding_Ratio.V1 Successive_Outbidding.V1
## Min. :-0.723250 Min. :-0.881203 Min. :-0.371047
## 1st Qu.:-0.586115 1st Qu.:-0.640091 1st Qu.:-0.371047
## Median :-0.406126 Median :-0.337080 Median :-0.371047
## Mean : 0.000000 Mean : 0.000000 Mean : 0.000000
## 3rd Qu.: 0.122415 3rd Qu.: 0.296487 3rd Qu.:-0.371047
## Max. : 4.350740 Max. : 6.632156 Max. : 3.204238
## Last_Bidding.V1 Auction_Bids.V1 Starting_Price_Average.V1
## Min. :-1.2184234 Min. :-0.9073611 Min. :-0.9651145
## 1st Qu.:-1.0923288 1st Qu.:-0.9073611 1st Qu.:-0.9651145
## Median :-0.0583584 Median :-0.3476901 Median :-0.9651145
## Mean : 0.0000000 Mean : 0.0000000 Mean : 0.0000000
## 3rd Qu.: 1.0451096 3rd Qu.: 0.8734102 3rd Qu.: 1.0629895
## Max. : 1.4122205 Max. : 2.1807058 Max. : 1.0759356
## Early_Bidding.V1 Winning_Ratio.V1 Auction_Duration.V1
## Min. :-1.1310382 Min. :-0.8423123 Min. :-1.4656003
## 1st Qu.:-1.0611290 1st Qu.:-0.8423123 1st Qu.:-0.6547772
## Median :-0.1853496 Median :-0.8423123 Median : 0.1560459
## Mean : 0.0000000 Mean : 0.0000000 Mean : 0.0000000
## 3rd Qu.: 1.0401619 3rd Qu.: 1.1089099 3rd Qu.: 0.9668691
## Max. : 1.4948532 Max. : 1.4482529 Max. : 2.1831038
## Class
## Min. :-1.0000
## 1st Qu.:-1.0000
## Median :-1.0000
## Mean :-0.7864
## 3rd Qu.:-1.0000
## Max. : 1.0000

Comment: After scaling, the mean of every numeric variable is approximately 0 and the standard deviation is 1, confirming that z-score standardization was applied correctly. Most values now fall roughly within -1 to 1, though some extend further because of outliers; for example, Bidder_Tendency.V1 ranges from -0.72 to 4.35 and Winning_Ratio.V1 from -0.84 to 1.45. The transformation nonetheless brings all features onto a comparable scale.
The Class variable still takes only the two values -1 and 1, confirming that the target was not mistakenly scaled. Successive_Outbidding and Auction_Duration were also standardized; even though they have only a few unique values, this is fine for SVM or logistic regression and prevents numerical instability.
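
One presentational note: the ".V1" suffixes in the summary appear because scale() returns a one-column matrix, which mutate() stores as a matrix column. If ordinary numeric columns are preferred, an equivalent variant (a sketch producing the same standardized values) is:

# Standardize while keeping plain numeric columns (avoids the ".V1" matrix columns)
df_scaled <- df %>%
  mutate(across(-Class, ~ as.numeric(scale(.x))))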
Ques 2d
# Create a bar plot for Class distribution
ggplot(df_scaled, aes(x = as.factor(Class))) +
geom_bar(fill = "steelblue") +
labs(title = "Class Distribution", x = "Class", y = "Count") +
theme_minimal()

Comment: Severe class imbalance. The majority class (-1, normal bidding behavior) significantly outweighs the minority class (1, shill bidding behavior): from the plot, roughly 90% of the observations belong to Class -1 and only about 10% to Class 1.
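
The visual estimate can be confirmed exactly with a short check (a sketch; the counts are not reproduced here):

# Exact class counts and proportions behind the bar plot
table(df_scaled$Class)
round(prop.table(table(df_scaled$Class)), 3)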
Ques 3
# Set seed for reproducibility
set.seed(123)

# Define the splitting ratios


train_ratio <- 0.5 # 2/4
valid_ratio <- 0.25 # 1/4
test_ratio <- 0.25 # 1/4

# First, split into training (50%) and temp (50%) (valid + test)
train_index <- createDataPartition(df_scaled$Class, p = train_ratio, list =
FALSE)
train_data <- df_scaled[train_index, ]
temp_data <- df_scaled[-train_index, ]

# Split temp_data into validation (25%) and test (25%)


valid_index <- createDataPartition(temp_data$Class, p = valid_ratio /
(valid_ratio + test_ratio), list = FALSE)
valid_data <- temp_data[valid_index, ]
test_data <- temp_data[-valid_index, ]

# Check dimensions of the splits


dim(train_data)

## [1] 3161 10

# Check dimensions of the splits


dim(valid_data)

## [1] 1580 10

# Check dimensions of the splits


dim(test_data)

## [1] 1580 10
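
Because createDataPartition() balances the outcome across splits, the class proportions should be similar in the training, validation, and test sets; a quick check (a sketch, output omitted):

# Verify that the 2:1:1 split preserves the class proportions
sapply(list(train = train_data, valid = valid_data, test = test_data),
       function(d) round(prop.table(table(d$Class)), 3))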

Ques 4a
# Pool the training and validation datasets
D_prime <- rbind(train_data, valid_data)

# Separate predictors and response variable


X <- as.matrix(D_prime %>% select(-Class)) # Predictor variables
y <- as.numeric(D_prime$Class) # Response variable (-1,1)

# Define the negative log-likelihood function


neg_log_likelihood <- function(beta, X, y) {
linear_pred <- X %*% beta # Linear predictor X %*% beta
log_likelihood <- -sum(log(1 + exp(-y * linear_pred))) # Log-likelihood for labels y in {-1, 1}
return(-log_likelihood) # Return the negative log-likelihood (to be minimized)
}

# Initialize beta coefficients


beta_init <- rep(0, ncol(X))

# Optimize using BFGS method


optim_result <- optim(
par = beta_init,
fn = neg_log_likelihood,
X = X, y = y,
method = "BFGS",
hessian = TRUE
)

# Extract estimated coefficients


beta_hat <- optim_result$par

# Extract Hessian matrix (for standard errors)


hessian_matrix <- optim_result$hessian

# Compute standard errors from Hessian


se <- sqrt(diag(solve(hessian_matrix)))

# Compute p-values (Wald Test)


z_scores <- beta_hat / se
p_values <- 2 * (1 - pnorm(abs(z_scores))) # Two-tailed test

# Tabulate results
results <- data.frame(
Coefficient = beta_hat,
Std_Error = se,
Z_Score = z_scores,
P_Value = p_values
)

# Print the results


print(results)

## Coefficient Std_Error Z_Score P_Value
## 1 0.0036445904 0.05382070 0.067717268 0.9460107
## 2 0.0004977948 0.10537620 0.004723977 0.9962308
## 3 5.0963641229 0.13436311 37.929787126 0.0000000
## 4 0.1369996745 0.16035600 0.854347027 0.3929127
## 5 -0.0257401134 0.06780506 -0.379619344 0.7042280
## 6 -0.0145358099 0.06065458 -0.239649000 0.8106024
## 7 -0.0789644815 0.16200879 -0.487408618 0.6259688
## 8 0.0549160275 0.06822855 0.804883416 0.4208870
## 9 0.0137500248 0.04664752 0.294764320 0.7681739

# Check if optimization converged


if (optim_result$convergence == 0) {
print("Optimization converged successfully!")
} else {
print("Optimization did NOT converge.")
}

## [1] "Optimization converged successfully!"

Comment: Only one predictor is statistically significant. The third coefficient (Successive_Outbidding, β₃ = 5.0964, p-value ≈ 0) is highly significant in predicting shill bidding. Most other predictors are not significant: their p-values are large (e.g., 0.9460, 0.9962, 0.7042), so on their own they do not strongly help distinguish normal from shill bidding.
Convergence check: the message "Optimization converged successfully!" confirms that optim() minimized the negative log-likelihood, i.e., the fitted coefficients are at a (local) optimum.
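
Since the rows of the results table are indexed 1 through 9, attaching the predictor names (a small convenience step, assuming the same X matrix as above) makes it easier to match coefficients to variables:

# Label the coefficient table with the predictor names from X
rownames(results) <- colnames(X)
print(results)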
Ques 4b
# Convert Class from {-1,1} to {0,1}
D_prime$Class <- ifelse(D_prime$Class == -1, 0, 1)

# Fit logistic regression model using glm()


logit_model <- glm(Class ~ ., data = D_prime, family = binomial)

# Display summary of the model


summary(logit_model)

##
## Call:
## glm(formula = Class ~ ., family = binomial, data = D_prime)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.841774 0.493919 -13.852 < 2e-16 ***
## Bidder_Tendency 0.067916 0.115924 0.586 0.558
## Bidding_Ratio 0.004853 0.137284 0.035 0.972
## Successive_Outbidding 3.103406 0.199467 15.558 < 2e-16 ***
## Last_Bidding 0.466863 0.323492 1.443 0.149
## Auction_Bids 0.134472 0.198016 0.679 0.497
## Starting_Price_Average 0.048412 0.167136 0.290 0.772
## Early_Bidding -0.413480 0.321724 -1.285 0.199
## Winning_Ratio 2.485606 0.315598 7.876 3.38e-15 ***
## Auction_Duration 0.138105 0.130763 1.056 0.291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3220.28 on 4740 degrees of freedom
## Residual deviance: 422.68 on 4731 degrees of freedom
## AIC: 442.68
##
## Number of Fisher Scoring iterations: 9

Comment: The intercept is strongly negative, meaning the baseline probability of shill bidding (with all standardized predictors at 0, i.e., at their means) is very low. Successive_Outbidding has a very large positive coefficient (3.1034, p < 2e-16), making it the strongest predictor of shill bidding. Winning_Ratio also has a large positive coefficient (2.4856, p = 3.38e-15), meaning bidders with a higher winning ratio are more likely to be flagged as shill bidders.
Compared with the optim() fit in Ques 4a, which found only one significant predictor, glm() finds two (Successive_Outbidding and Winning_Ratio) and gives more stable standard errors; part of the difference is that the glm() model includes an intercept, whereas the optim() fit did not.
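
Because the predictors are standardized, each coefficient is the change in log-odds per one standard deviation of that predictor; exponentiating gives odds ratios (a sketch, output omitted):

# Odds ratios per one-standard-deviation increase in each predictor
round(exp(coef(logit_model)), 2)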
Ques 4c
# Convert Class from {0,1} to {-1,1}
D_prime$Class <- ifelse(D_prime$Class == 0, -1, 1)

# Extract test predictors (excluding the response variable)


X_test <- as.matrix(test_data %>% select(-Class))

# Add intercept column (column of 1s)


X_test <- cbind(1, X_test) # First column should be 1s

# Compute linear combination X'β (dot product)


linear_pred <- X_test %*% coef(logit_model)

# Apply logistic function


probabilities <- exp(linear_pred) / (1 + exp(linear_pred))

# Convert probabilities to class predictions using threshold 0.5


y_pred <- ifelse(probabilities > 0.5, 1, 0) # 1 if prob > 0.5, else 0

# Extract actual labels from test set


y_actual <- test_data$Class

# Compute accuracy
accuracy <- mean(y_pred == y_actual)
# Print accuracy
print(paste("Test Accuracy:", round(accuracy * 100, 2), "%"))

## [1] "Test Accuracy: 9.68 %"

Comment: The reported test accuracy of 9.68% looks far worse than even a naive majority-class baseline (~90%), but the number is driven by a label-coding mismatch rather than by the model itself: the predictions are coded {0, 1} while the test labels are {-1, 1}, so only cases predicted and labeled 1 are ever counted as correct.
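
Recoding the predictions to the same {-1, 1} labels as the test set makes the comparison meaningful; a minimal sketch (the resulting accuracy is not part of the original output):

# Recode predictions to {-1, 1} so they match test_data$Class
y_pred_pm1 <- ifelse(probabilities > 0.5, 1, -1)
accuracy_pm1 <- mean(y_pred_pm1 == y_actual)
print(paste("Test Accuracy (labels recoded):", round(accuracy_pm1 * 100, 2), "%"))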
Ques 5a
# Extract predictor matrices for training (X1), validation (X2), and test (X3)
X1 <- as.matrix(train_data %>% select(-Class))
X2 <- as.matrix(valid_data %>% select(-Class))
X3 <- as.matrix(test_data %>% select(-Class))

# Standardize X1 (Compute column means and standard deviations)


X1_scaled <- scale(X1)

# Extract means and standard deviations from X1


X1_mean <- attr(X1_scaled, "scaled:center")
X1_sd <- attr(X1_scaled, "scaled:scale")

# Scale X2 and X3 using the means and SDs from X1


X2_scaled <- scale(X2, center = X1_mean, scale = X1_sd)
X3_scaled <- scale(X3, center = X1_mean, scale = X1_sd)

# Convert response variables to numeric form


y1 <- train_data$Class # Training labels
y2 <- valid_data$Class # Validation labels
y3 <- test_data$Class # Test labels

# Compute mean vectors for each class in the training data


m_pos <- colMeans(X1_scaled[y1 == 1, ]) # Mean of class 1
m_neg <- colMeans(X1_scaled[y1 == -1, ]) # Mean of class -1

# Compute middle point m


m <- (m_pos + m_neg) / 2

# Define corrected primitive LDA function


primitive_lda <- function(X, m_pos, m_neg, m) {
# Ensure (m_pos - m_neg) is a column vector
w <- as.matrix(m_pos - m_neg) # Convert to column vector
w <- matrix(w, ncol = 1) # Ensure correct shape

# Compute discriminant function


scores <- X %*% w - drop(t(w) %*% m) # Ensure proper dimensions
# Assign class labels using sign function
predictions <- ifelse(scores > 0, 1, -1)
return(predictions)
}

# Make predictions on the test set


y_pred <- primitive_lda(X3_scaled, m_pos, m_neg, m)

accuracy <- mean(y_pred == y3)


print(paste("Test Accuracy of Primitive LDA:", round(accuracy * 100, 2),
"%"))

## [1] "Test Accuracy of Primitive LDA: 97.72 %"

Comment: The primitive LDA classifier achieved 97.72% test accuracy, far above the 9.68% reported for logistic regression in Ques 4c (a figure deflated by the label-coding mismatch noted there). This suggests that a simple linear decision boundary separates the two classes well in this dataset.
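
Given the roughly 90/10 class imbalance, overall accuracy alone can hide poor minority-class performance; a confusion matrix for the primitive LDA predictions (a sketch, output omitted) shows how each class is handled:

# Confusion matrix: predicted labels vs. actual test labels
table(Predicted = as.vector(y_pred), Actual = y3)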
Ques 5b
# Define Polynomial Kernel Function
poly_kernel <- function(X, Y, degree = 2, coef0 = 1) {
return((X %*% t(Y) + coef0)^degree) # Polynomial transformation
}

# Define function to train Kernelized LDA with Polynomial Kernel


kernel_lda_poly <- function(X_train, y_train, X_valid, y_valid, degree) {
# Compute Kernel Matrices
K_train <- poly_kernel(X_train, X_train, degree)
K_valid <- poly_kernel(X_valid, X_train, degree)

# Compute class means in feature space


m_pos <- colMeans(K_train[y_train == 1, , drop = FALSE]) # Ensure column format
m_neg <- colMeans(K_train[y_train == -1, , drop = FALSE]) # Ensure column format
m <- (m_pos + m_neg) / 2 # Middle point

# Ensure (m_pos - m_neg) is a column vector


w <- as.matrix(m_pos - m_neg)
w <- matrix(w, ncol = 1) # Ensure correct shape

# Compute decision function


scores <- K_valid %*% w - drop(t(w) %*% m) # Ensure proper dimensions

# Assign class labels using sign function


y_pred <- ifelse(scores > 0, 1, -1)
# Compute accuracy
return(mean(y_pred == y_valid))
}

# Try different polynomial degrees


degree_values <- seq(1, 5, by = 1) # Try degrees from 1 to 5
accuracies <- numeric(length(degree_values))

# Train and evaluate for each polynomial degree


for (i in seq_along(degree_values)) {
accuracies[i] <- kernel_lda_poly(X1_scaled, y1, X2_scaled, y2,
degree_values[i])
}

# Print best polynomial degree


best_degree <- degree_values[which.max(accuracies)]
print(paste("Best polynomial degree:", best_degree, "with Accuracy:",
max(accuracies)))

## [1] "Best polynomial degree: 2 with Accuracy: 0.963924050632911"

# Create a data frame for plotting


accuracy_df <- data.frame(Degree = degree_values, Accuracy = accuracies)

# Plot accuracy vs. polynomial degree


ggplot(accuracy_df, aes(x = Degree, y = Accuracy)) +
geom_line(color = "blue") +
geom_point(color = "red") +
labs(title = "Kernelized LDA Accuracy vs. Polynomial Degree",
x = "Polynomial Degree", y = "Validation Accuracy") +
theme_minimal()

Comment: The highest validation accuracy (~96.4%) occurs at degree = 2, suggesting that a quadratic decision boundary works best for the kernelized LDA. As the polynomial degree increases beyond 2, the validation accuracy declines, consistent with overfitting at higher degrees.
Ques 5c
# Merge training (D1) and validation (D2) to create D'
D_prime <- rbind(train_data, valid_data)

# Extract predictors and response variable


X_prime <- as.matrix(D_prime %>% select(-Class)) # Feature matrix X'
y_prime <- D_prime$Class # Labels for D'

# Extract test predictors (X3) and labels (y3)


X3 <- as.matrix(test_data %>% select(-Class))
y3 <- test_data$Class

# Standardize X' (Compute column means and standard deviations)


X_prime_scaled <- scale(X_prime)

# Extract means and standard deviations from X'


X_prime_mean <- attr(X_prime_scaled, "scaled:center")
X_prime_sd <- attr(X_prime_scaled, "scaled:scale")

# Scale X3 using the means and SDs from X'


X3_scaled <- scale(X3, center = X_prime_mean, scale = X_prime_sd)

# Use the best polynomial degree found previously


best_degree <- 2 # From previous tuning

# Compute Kernel Matrices


K_prime <- poly_kernel(X_prime_scaled, X_prime_scaled, best_degree)
K_test <- poly_kernel(X3_scaled, X_prime_scaled, best_degree)

# Compute class means in feature space


m_pos <- colMeans(K_prime[y_prime == 1, , drop = FALSE]) # Mean for class +1
m_neg <- colMeans(K_prime[y_prime == -1, , drop = FALSE]) # Mean for class -1
m <- (m_pos + m_neg) / 2 # Middle point

# Ensure correct shape for (m_pos - m_neg)


w <- as.matrix(m_pos - m_neg)
w <- matrix(w, ncol = 1) # Convert to column vector

# Compute decision function


scores <- K_test %*% w - drop(t(w) %*% m)

# Assign class labels using sign function


y_pred <- ifelse(scores > 0, 1, -1)

# Compute accuracy
test_accuracy <- mean(y_pred == y3)

# Print the test accuracy


print(paste("Test Accuracy with Polynomial Kernel (Degree =", best_degree,
"):", round(test_accuracy * 100, 2), "%"))

## [1] "Test Accuracy with Polynomial Kernel (Degree = 2 ): 95.63 %"

Comparison with previous models:
Logistic regression (glm() coefficients applied to the test set): ~9.68% reported accuracy (deflated by the label-coding mismatch discussed in Ques 4c).
Primitive LDA (no kernel): ~97.72% test accuracy, using a purely linear decision boundary.
Kernelized LDA (polynomial kernel, degree = 2): ~96.4% validation accuracy and 95.63% test accuracy; strong performance, though slightly below the primitive LDA on the test set.
