Alam-Proj2
Tamim Alam_80764318
2025-02-28
Loading Necessary Libraries
library(dplyr)
library(ggplot2)
library(caret)
Ques 1
# Read the dataset (assuming the file is named "Shill Bidding Dataset.csv")
df <- read.csv("Shill Bidding Dataset.csv")
Ques 2a
# Compute the number of unique values for each column
distinct_counts <- sapply(df, function(x) length(unique(x)))
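The specific counts cited in the comment below can be inspected directly, for example:
# Display the unique-value counts in ascending order
sort(distinct_counts)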
Comment:
1. Binary variable: Class (2 unique values) is the target variable, with -1 (normal behavior) and 1 (otherwise). This is a binary classification problem.
2. Variables with very few unique values (likely categorical or discrete):
- Successive_Outbidding (3 unique values): describes whether a bidder successively outbids themselves, so it is likely ordinal (e.g., low, medium, high frequency).
- Auction_Duration (5 unique values): auction durations are limited to a few possible values, so this is likely categorical or discrete numeric.
3. Variables with a moderate number of unique values (likely discrete numeric):
- Auction_Bids (49 unique values): auctions with shill bidding tend to have higher bid counts; since bid counts are whole numbers, this is discrete numeric.
- Starting_Price_Average (22 unique values): the number of distinct starting prices is relatively small, meaning auctions often begin at specific price points rather than arbitrary values; discrete numeric with limited granularity.
4. Variables with many unique values (likely continuous):
- Bidder_Tendency (489 unique values): measures how concentrated a bidder's activity is within a small group of sellers; the many distinct values suggest it is continuous.
- Bidding_Ratio (400 unique values): indicates how frequently a bidder participates; its high variation suggests it is continuous.
- Last_Bidding (5807 unique values): measures inactivity toward the end of the auction; the large number of values suggests a continuous variable.
- Early_Bidding (5690 unique values): indicates whether a bidder places bids early in an auction; another continuous variable.
- Winning_Ratio (72 unique values): represents how often a bidder wins; continuous but with a somewhat limited range.
Ques 2b
# Check for missing values in the dataset
missing_values <- colSums(is.na(df))
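The parallel boxplot referenced in the comment below is produced by code elided here; a minimal sketch, assuming base R boxplot() over the raw predictors:
# Parallel boxplots of the raw numeric predictors (Class excluded)
boxplot(df %>% select(-Class), las = 2, cex.axis = 0.7,
        main = "Parallel Boxplot of Raw Features")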
Comment: The parallel boxplot clearly shows that Auction_Duration has a much larger range than the other variables. Most other features are tightly packed near zero, meaning they contribute much less in their raw form. Some predictors, like Successive_Outbidding, have a small range with a few outliers; others, like Auction_Bids and Winning_Ratio, have a moderate range.
Kernel-based methods like SVM are sensitive to feature scales. Without scaling, features with larger magnitudes (e.g., Auction_Duration) dominate distance and kernel computations and therefore influence the model more than features with smaller values (e.g., Bidding_Ratio). Scaling ensures all features contribute comparably. Since we plan to use SVM (which relies on the kernel trick), scaling is mandatory to avoid feature dominance and to improve model performance.
Ques 2c
# Identify numeric columns (excluding Class)
numeric_features <- df %>% select(-Class)
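The df_scaled object used in later chunks is constructed in code elided here; a minimal sketch, assuming base R scale() for standardization (the ".V1" column suffixes in the summary discussed below suggest the author's exact call passed through an intermediate matrix conversion, so the column naming may differ slightly):
# Standardize each predictor to mean 0 and sd 1; re-attach Class unscaled
df_scaled <- as.data.frame(scale(numeric_features))
df_scaled$Class <- df$Class
summary(df_scaled)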
Comment: The mean of every scaled variable is approximately 0, the standard deviation of every feature is 1, and the values are now centered around 0, confirming that standardization was applied correctly.
Most features now fall within the range -1 to 1, though some values extend beyond it due to outliers. For example, Bidder_Tendency.V1 has min = -0.72 and max = 4.35, while Winning_Ratio.V1 has min = -0.84 and max = 1.44. Some features therefore contain outliers, but the transformation brings them all onto a comparable scale.
The Class variable still has only two unique values (-1 and 1), which confirms that the target variable was not mistakenly scaled. Successive_Outbidding and Auction_Duration were also standardized: even though these features have only a few unique values, scaling them is fine for SVM or logistic regression, as it prevents numerical instability.
Ques 2d
# Create a bar plot for Class distribution
ggplot(df_scaled, aes(x = as.factor(Class))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Class Distribution", x = "Class", y = "Count") +
  theme_minimal()
Ques 3
# First, split into training (50%) and temp (50%) (valid + test)
train_ratio <- 0.5
train_index <- createDataPartition(df_scaled$Class, p = train_ratio, list = FALSE)
train_data <- df_scaled[train_index, ]
temp_data <- df_scaled[-train_index, ]
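The valid_data and test_data sets are used below but their construction is elided; a minimal sketch, splitting temp_data evenly with the same createDataPartition() approach:
# Split the held-out half evenly into validation (25%) and test (25%)
valid_index <- createDataPartition(temp_data$Class, p = 0.5, list = FALSE)
valid_data <- temp_data[valid_index, ]
test_data <- temp_data[-valid_index, ]
dim(train_data); dim(valid_data); dim(test_data)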
## [1] 3161 10
## [1] 1580 10
## [1] 1580 10
Ques 4a
# Pool the training and validation datasets
D_prime <- rbind(train_data, valid_data)
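The tabulation below uses beta_hat, se, z_scores, and p_values from a manual fit whose code is elided; a minimal sketch, assuming optim() minimizes the binomial negative log-likelihood and standard errors come from the inverse Hessian (Class assumed coded 0/1 at this stage):
# Design matrix with intercept; response assumed coded 0/1
X <- model.matrix(Class ~ ., data = D_prime)
y <- D_prime$Class

# Negative log-likelihood of the logistic regression model
neg_loglik <- function(beta) {
  eta <- X %*% beta
  sum(log(1 + exp(eta)) - y * eta)
}

fit <- optim(rep(0, ncol(X)), neg_loglik, method = "BFGS", hessian = TRUE)
beta_hat <- fit$par
se <- sqrt(diag(solve(fit$hessian)))  # SEs from the inverse Hessian of the NLL
z_scores <- beta_hat / se
p_values <- 2 * pnorm(-abs(z_scores))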
# Tabulate results
results <- data.frame(
  Coefficient = beta_hat,
  Std_Error = se,
  Z_Score = z_scores,
  P_Value = p_values
)
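Ques 4b
The summary output below comes from R's built-in IRLS fitter; presumably a call along these lines:
# Fit the same logistic regression with glm() on the pooled data
glm_fit <- glm(Class ~ ., family = binomial, data = D_prime)
summary(glm_fit)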
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = D_prime)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.841774 0.493919 -13.852 < 2e-16 ***
## Bidder_Tendency 0.067916 0.115924 0.586 0.558
## Bidding_Ratio 0.004853 0.137284 0.035 0.972
## Successive_Outbidding 3.103406 0.199467 15.558 < 2e-16 ***
## Last_Bidding 0.466863 0.323492 1.443 0.149
## Auction_Bids 0.134472 0.198016 0.679 0.497
## Starting_Price_Average 0.048412 0.167136 0.290 0.772
## Early_Bidding -0.413480 0.321724 -1.285 0.199
## Winning_Ratio 2.485606 0.315598 7.876 3.38e-15 ***
## Auction_Duration 0.138105 0.130763 1.056 0.291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3220.28 on 4740 degrees of freedom
## Residual deviance: 422.68 on 4731 degrees of freedom
## AIC: 442.68
##
## Number of Fisher Scoring iterations: 9
Comment: The intercept is highly negative, meaning the baseline probability of shill bidding (when all predictors are 0) is very low. Successive_Outbidding has a very large positive coefficient (3.1034, p < 2e-16), making it the strongest predictor of shill bidding. Winning_Ratio also has a large positive coefficient (2.4856, p = 3.38e-15), meaning bidders with a high winning ratio are more likely to be shill bidders.
optim() found only one significant predictor, while glm() found two (Successive_Outbidding and Winning_Ratio) and produced more stable standard errors.
Ques 4c
# Convert Class from {0,1} to {-1,1}
D_prime$Class <- ifelse(D_prime$Class == 0, -1, 1)
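The y_pred and y_actual objects below come from elided prediction code; a minimal sketch, assuming the glm_fit model from above and a 0.5 probability threshold mapped into the {-1, 1} coding:
# Predicted probabilities on the test set, thresholded at 0.5
p_hat <- predict(glm_fit, newdata = test_data, type = "response")
y_pred <- ifelse(p_hat > 0.5, 1, -1)
y_actual <- test_data$Class  # assumed to use the same {-1, 1} coding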
# Compute accuracy
accuracy <- mean(y_pred == y_actual)
# Print accuracy
print(paste("Test Accuracy:", round(accuracy * 100, 2), "%"))
Comment: The logistic regression model performs very poorly on the test set: 9.68% accuracy is much worse than random guessing (~50% for a balanced dataset). Accuracy this far below chance usually points to a label-coding mismatch between y_pred and y_actual (e.g., {0, 1} versus {-1, 1}) rather than a genuinely uninformative model, so the coding is worth rechecking before drawing conclusions.
Ques 5a
# Extract predictor matrices for training (X1), validation (X2), and test (X3)
X1 <- as.matrix(train_data %>% select(-Class))
X2 <- as.matrix(valid_data %>% select(-Class))
X3 <- as.matrix(test_data %>% select(-Class))
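The "Primitive LDA" fit itself is elided; a minimal sketch of an equal-priors LDA rule (class means plus a pooled within-class covariance, classifying by Mahalanobis distance), assuming Class is coded {-1, 1}:
# Class means on the training data
y1 <- train_data$Class
mu_pos <- colMeans(X1[y1 == 1, ])
mu_neg <- colMeans(X1[y1 == -1, ])

# Pooled within-class covariance and its inverse
n_pos <- sum(y1 == 1); n_neg <- sum(y1 == -1)
S_pooled <- ((n_pos - 1) * cov(X1[y1 == 1, ]) +
             (n_neg - 1) * cov(X1[y1 == -1, ])) / (n_pos + n_neg - 2)
S_inv <- solve(S_pooled)

# Assign each row to the class whose mean is closer in Mahalanobis distance
lda_predict <- function(X) {
  d_pos <- mahalanobis(X, mu_pos, S_inv, inverted = TRUE)
  d_neg <- mahalanobis(X, mu_neg, S_inv, inverted = TRUE)
  ifelse(d_pos < d_neg, 1, -1)
}
mean(lda_predict(X3) == test_data$Class)  # test accuracy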
Comment: The Primitive LDA classifier achieved 97.72% accuracy, which is significantly
better than our earlier logistic regression model (~9.68% accuracy). This suggests that
LDA’s linear decision boundary works well for this dataset.
Ques 5b
# Define Polynomial Kernel Function
poly_kernel <- function(X, Y, degree = 2, coef0 = 1) {
  return((X %*% t(Y) + coef0)^degree)  # Polynomial transformation
}
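The kernelized LDA classifier itself is elided; for reference, the Gram matrices it needs can be built with poly_kernel(), and y3 denotes the test labels used in the accuracy check below (y_pred comes from the elided kernel-LDA classifier):
# Gram matrix among training points, and test-vs-train cross-kernel
K_train <- poly_kernel(X1, X1)  # n_train x n_train
K_test <- poly_kernel(X3, X1)   # n_test x n_train
y3 <- test_data$Class           # test labels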
# Compute accuracy
test_accuracy <- mean(y_pred == y3)
Comparison with Previous Models:
- Logistic regression (glm()): ~9.68% test accuracy (poor).
- Primitive LDA (without kernel): ~97.72% (worked well, but fully linear).
- Kernelized LDA (polynomial kernel, degree = 2): validation accuracy 96.5%, test accuracy 95.63% (best-performing model so far).