Assignment 4 Report
Logistic Regression
Key Concepts:
Sigmoid Function:
The core of logistic regression is the sigmoid function (also called the logistic function), which transforms
linear combinations of input features into values between 0 and 1. This is particularly useful for
probability estimation.
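For reference, the sigmoid is defined as

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = w^\top x + b,

so any real-valued score z is mapped to a value in (0, 1) that can be interpreted as a probability.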
Optimization:
The LR model learns from the training data by finding the weights that make the sigmoid function fit
the data, i.e., by maximizing the likelihood of all the training data points.
To reduce the overfitting and instability that can result from large weights, it is recommended to use a
regularization technique (e.g., L1 or L2 regularization).
Numerous optimization methods can be used to fit the parameters; in this assignment we used
gradient descent, which assumes that the objective is convex and differentiable, properties that are
satisfied by the loss function (the negative log-likelihood).
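As a reference for what gradient descent optimizes here, the (optionally L2-regularized) negative log-likelihood and the corresponding update rule can be written as follows, with learning rate \eta; the exact regularization term is whatever the implementation uses:

\mathcal{L}(w, b) = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\Big] + \lambda \lVert w \rVert_2^2, \qquad \hat{y}_i = \sigma(w^\top x_i + b)

w \leftarrow w - \eta \, \nabla_w \mathcal{L}, \qquad b \leftarrow b - \eta \, \frac{\partial \mathcal{L}}{\partial b}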
Support Vector Machine (SVM)
Key Concepts:
Hyperplane: In the context of classification, a hyperplane is a decision boundary that separates the
classes. For a binary classification problem in a two-dimensional space, the hyperplane would be a line.
Margin:
SVM works by finding the hyperplane that maximizes the margin, i.e., the distance between the nearest
points (support vectors) of each class to the hyperplane. Maximizing this margin helps in improving
generalization and reducing the risk of overfitting.
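In the linearly separable case this geometric idea has a standard formulation: the margin of the hyperplane w^\top x + b = 0 is 2 / \lVert w \rVert, so maximizing the margin is equivalent to

\min_{w, b} \; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i (w^\top x_i + b) \ge 1 \;\; \text{for all } i.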
Support Vectors:
The data points closest to the decision boundary are called support vectors. These points have a direct
impact on the position and orientation of the hyperplane.
Kernel Trick:
SVMs are particularly powerful in handling non-linearly separable data. By using a kernel function (such
as the Radial Basis Function or polynomial kernel), SVM maps the data to a higher-dimensional feature
space where it is easier to find a linear separation. Common kernels include the linear, polynomial, and
Radial Basis Function (RBF) kernels.
C-parameter:
The C parameter in SVM controls the trade-off between maximizing the margin and minimizing
classification errors. A high C value gives a smaller margin but fewer classification errors, while a low C
value allows for a larger margin but more errors.
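To make the roles of the kernel and the C parameter concrete, here is a minimal scikit-learn sketch; the parameter values shown are placeholders rather than the settings evaluated in the notebook:

from sklearn.svm import SVC

# Linear kernel, small C: wider margin, more tolerance for misclassified training points
linear_svm = SVC(kernel="linear", C=0.1)

# RBF kernel, larger C: non-linear boundary that fits the training data more tightly
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale")

# Polynomial kernel of degree 2
poly_svm = SVC(kernel="poly", degree=2, C=1.0)

# Typical usage: linear_svm.fit(X_train, y_train); y_pred = linear_svm.predict(X_test)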
K-Nearest Neighbors (KNN)
Key Concepts:
Distance Metric:
KNN relies on a distance metric (commonly Euclidean distance) to calculate the proximity of instances.
Other distance metrics, such as Manhattan, can also be used depending on the problem at hand.
K-Value:
The "k" in KNN represents the number of nearest neighbors to consider for classification or regression.
The optimal value of k can be determined via cross-validation. A small value of k may lead to overfitting,
while a large k may lead to underfitting.
For classification tasks, the predicted class is determined by majority voting among the "k" nearest
neighbors.
For regression tasks, the predicted value is the average of the values of the "k" nearest neighbors.
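As an illustration of how k, the distance metric, and the voting scheme can be tuned via cross-validation, here is a minimal scikit-learn sketch; the grid values are examples, not necessarily the grid used in the notebook:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values for k, the distance metric, and uniform vs. distance-weighted voting
param_grid = {
    "n_neighbors": [3, 5, 7],
    "metric": ["euclidean", "manhattan"],
    "weights": ["uniform", "distance"],
}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
# Typical usage: grid.fit(X_train, y_train); best_knn = grid.best_estimator_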
Steps for Feature Engineering
In the two functions, engineer_features and analyze_features, a series of key steps is followed to
process and explore the data:
Data Copying:
Both functions start by ensuring that the original DataFrame remains unchanged. In the
engineer_features function, a copy of the input DataFrame is created to avoid altering the original data
during feature engineering.
Feature Engineering:
Combining Features:
New features are derived by combining existing ones, such as the ratio of age to trestbps and chol to
thalach.
Transformations:
Mathematical transformations are applied to certain features to handle potential issues like zero or
negative values. For example, the chol_log feature is the logarithm of chol (with a small epsilon added),
and the oldpeak_sqrt feature is the square root of oldpeak.
Polynomial Features:
A new feature is created by squaring the age feature, adding more complexity to the model.
Scaling Features:
Numerical features are scaled using StandardScaler to standardize the data, which prevents features
with very different ranges from dominating the model.
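A condensed sketch of what the engineer_features steps just described might look like is given below; the derived-column names and the epsilon value are illustrative assumptions, and the actual implementation is in the code notebook:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # work on a copy so the original DataFrame is unchanged
    eps = 1e-6      # assumed small constant to avoid log(0) and division by zero

    # Combined (ratio) features
    df["age_trestbps_ratio"] = df["age"] / (df["trestbps"] + eps)
    df["chol_thalach_ratio"] = df["chol"] / (df["thalach"] + eps)

    # Transformations for skewed / non-negative features
    df["chol_log"] = np.log(df["chol"] + eps)
    df["oldpeak_sqrt"] = np.sqrt(df["oldpeak"].clip(lower=0))

    # Polynomial feature
    df["age_squared"] = df["age"] ** 2

    # Standardize the numerical columns (illustrative column list)
    num_cols = ["age", "trestbps", "chol", "thalach", "oldpeak",
                "age_trestbps_ratio", "chol_thalach_ratio",
                "chol_log", "oldpeak_sqrt", "age_squared"]
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])
    return df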
Visualizing Distributions: The analyze_features function visualizes the distribution of each feature in the
DataFrame using histograms or count plots, depending on whether the feature is numerical or
categorical (a minimal sketch of this function is given after this list).
It checks the relationship between each feature and the target variable. For numerical targets, a
scatterplot is used, and for categorical targets, a boxplot is applied.
Correlation Matrix:
A heat map of the correlation matrix is generated to understand the relationships between numerical
features.
Outlier Detection:
These steps serve to enhance the features for model performance and to explore and understand the
dataset’s structure and relationships.
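A minimal sketch of the analyze_features visualizations (not the notebook's exact code), assuming seaborn and matplotlib:

import matplotlib.pyplot as plt
import seaborn as sns

def analyze_features(df):
    # Distribution of each feature: histogram for numeric, count plot for categorical
    for col in df.columns:
        plt.figure()
        if df[col].dtype.kind in "if":   # integer or float column
            sns.histplot(df[col], kde=True)
        else:
            sns.countplot(x=df[col])
        plt.title(f"Distribution of {col}")
        plt.show()

    # Heat map of the correlation matrix of the numerical features
    plt.figure(figsize=(10, 8))
    sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
    plt.title("Correlation matrix")
    plt.show()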
Note: Please refer to the code notebook to see all the visualizations
Implementation details and results of Logistic Regression
The implementation of the LogisticRegression class follows the design of a typical binary logistic
regression model with support for gradient descent optimization and regularization techniques (L1 and
L2 regularization).
Initialization (__init__): The constructor initializes the essential parameters of the model, including the
learning rate, number of iterations for gradient descent, regularization method (if any), and the
regularization strength (lambda). Additionally, it sets up the weights and bias attributes, which will be
optimized during training.
Sigmoid Function (sigmoid): The logistic regression model relies on the sigmoid function to map linear
combinations of input features to probabilities. This function is essential for converting the model’s raw
output (z) into a probability value between 0 and 1.
Loss Function (loss_function): The loss function calculates the binary cross-entropy (log loss) between
the true labels (y) and the predicted labels (y_pred). It incorporates an optional regularization term. If
regularization is enabled (either L1 or L2), it adds a penalty to the loss, controlled by the lambda_reg
parameter, to prevent overfitting by discouraging large weights.
Gradient Descent (gradient_descent): This method implements the gradient descent algorithm to
minimize the loss function and update the weights and bias. During each iteration, the model computes
the gradients of the loss with respect to the weights and bias. The gradient for the weights is adjusted
with the addition of regularization (if applicable). The weights and bias are then updated by taking steps
proportional to the negative of these gradients, scaled by the learning rate.
Fit Method (fit): The fit method simply calls the gradient_descent function to train the model by
adjusting the weights and bias based on the provided training data (X, y).
Predict Method (predict): The predict method applies the learned weights and bias to the input data
(X), computes the linear combination (z), and then passes the result through the sigmoid function to
generate predicted probabilities. These probabilities are then thresholded at 0.5 to classify the outputs
as either class 0 or 1.
Overall, the design of this class follows a structured approach to implementing logistic regression with
both optimization via gradient descent and flexibility through regularization techniques.
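A condensed sketch of how such a class can be structured is shown below; the default hyperparameter values are illustrative, and the notebook contains the actual implementation:

import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000,
                 regularization=None, lambda_reg=0.01):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.regularization = regularization   # None, "l1", or "l2"
        self.lambda_reg = lambda_reg
        self.weights = None
        self.bias = 0.0

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_descent(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        for _ in range(self.n_iterations):
            y_pred = self.sigmoid(X @ self.weights + self.bias)
            dw = X.T @ (y_pred - y) / n_samples
            db = np.mean(y_pred - y)
            # add the regularization term to the weight gradient, if enabled
            if self.regularization == "l2":
                dw += self.lambda_reg * self.weights
            elif self.regularization == "l1":
                dw += self.lambda_reg * np.sign(self.weights)
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def fit(self, X, y):
        self.gradient_descent(X, y)

    def predict(self, X):
        probabilities = self.sigmoid(X @ self.weights + self.bias)
        return (probabilities >= 0.5).astype(int)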
Note: Please refer to the code notebook for the implementation details
Comparative Analysis of all Models
In this analysis, we evaluate the performance of Support Vector Machine (SVM), K-Nearest Neighbors
(KNN), and Logistic Regression (LR) models using various parameter settings, on both original data
(unprocessed features) and engineered data (features that have been preprocessed or transformed).
We focus on several evaluation metrics: accuracy, precision, recall, F1-score, and ROC AUC, to draw
comprehensive comparisons between these models and assess the impact of feature engineering on
their performance.
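For reference, these metrics can be computed with scikit-learn roughly as follows; model, X_test, and y_test are placeholders for whichever fitted model and test split is being evaluated:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_pred = model.predict(X_test)               # hard class labels
y_score = model.predict_proba(X_test)[:, 1]  # or model.decision_function(X_test) for SVMs

scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}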
Model Evaluation:
1. Logistic Regression (LR)
Original Data:
Accuracy: Logistic Regression's performance with original data is moderate, with accuracy ranging from
0.547 to 0.567. The L1- and L2-regularized models (L1 at 0.56765 and L2 at 0.56098) offer slight
improvements over the baseline.
Precision: Precision is noticeably higher in the regularized versions (L1: 0.757), even though the
accuracy is not as high.
Recall and F1-Score: The recall values are generally high, indicating LR captures many positive instances,
though it sacrifices precision in some cases. The F1-scores are lower due to this imbalance between
precision and recall.
ROC AUC: The ROC AUC for the original data is moderate (0.75), showing room for improvement.
Engineered Data:
Accuracy: The engineered data results show a significant improvement in accuracy, with LR achieving
0.818. Feature engineering is beneficial for logistic regression, providing a large boost in performance.
Precision and Recall: Precision drops slightly for engineered data compared to the original, but recall
improves significantly, leading to a balanced performance across both metrics. The F1-score is
considerably better, reflecting the model's improved performance.
ROC AUC: The ROC AUC for engineered data is 0.8975, indicating strong discriminatory power.
2. K-Nearest Neighbors (KNN)
Original Data:
Accuracy: KNN performs well on the original data, with high accuracy values (as high as 0.838 for k=3,
Euclidean distance). KNN's performance is highly influenced by the choice of distance metric and k
value.
Precision and Recall: Precision and recall are quite balanced, with higher values for smaller k values
(e.g., k=3). The Euclidean distance tends to yield better results than Manhattan distance. With weighted
KNN, precision improves slightly compared to the unweighted version, but recall can suffer.
ROC AUC: KNN's ROC AUC varies between 0.84 and 0.89, showing good performance in distinguishing
between classes, especially for smaller values of k and weighted distances.
Engineered Data:
Accuracy: Engineered data results in a drop in accuracy for KNN compared to original data. The highest
accuracy achieved is around 0.795 (k=3, Euclidean), suggesting that KNN struggles with engineered data,
possibly because of the extra dimensionality introduced by the new features (the curse of
dimensionality). Manhattan distance tends to perform slightly worse than Euclidean, especially for
engineered data.
Precision and Recall: Precision generally improves for k=5 and k=7 (Euclidean), though recall drops for
the same configurations, especially when using Manhattan distance. Weighted KNN does improve
precision but might result in lower recall.
ROC AUC: The ROC AUC scores are generally lower for engineered data, around 0.82 to 0.89, indicating
the model's reduced ability to distinguish between classes compared to original data.
3. Support Vector Machine (SVM)
Original Data:
Accuracy: SVM performance is solid, with high accuracy values (ranging from 0.831 to 0.838 for linear
kernels). The RBF kernel with C=1 performs particularly well, with an accuracy of 0.838 and precision at
0.835. The polynomial kernel with lower C values results in significant drops in performance, especially
for higher degrees (degree=2, C=0.1).
Precision and Recall: Precision is high for the linear kernel with lower values of C (0.1), but the recall
values are moderate. As C increases, recall improves slightly, but precision begins to drop slightly due to
overfitting.
ROC AUC: The ROC AUC is strong for linear kernels, peaking around 0.90 for C=10, showing excellent
discriminatory power. The RBF kernel also shows good ROC AUC scores, though slightly lower than the
linear kernel.
Engineered Data:
Accuracy: SVM sees a marginal drop in accuracy for engineered data, with the highest accuracy (0.898)
achieved with linear kernels at C=0.1.
Precision and Recall: Precision and recall are generally well-balanced for linear kernels, with recall
slightly improving compared to the original data. The polynomial and RBF kernels, especially for lower C
values (0.1), result in lower performance.
ROC AUC: The ROC AUC for engineered data is similar to the original data (0.88 to 0.90), indicating
strong classification ability, especially with linear kernels.
Comparative Analysis:
Model | Best Accuracy | Best Precision | Best Recall | Best F1-Score | Best ROC AUC
Logistic Regression | 0.818 (Engineered Data) | 0.806 (Engineered Data) | 0.897 (Engineered Data) | 0.840 (Engineered Data) | 0.897 (Engineered Data)
KNN | 0.838 (Original Data, k=3, Euclidean) | 0.850 (Original Data, k=3, Euclidean) | 0.896 (Original Data, k=5, Euclidean) | 0.859 (Original Data, k=5, Euclidean) | 0.891 (Original Data, k=7, Euclidean)
SVM | 0.901 (Original Data, Linear, C=10) | 0.857 (Original Data, Linear, C=10) | 0.910 (Original Data, Linear, C=0.1) | 0.855 (Original Data, Linear, C=10) | 0.900 (Original Data, Linear, C=10)
Model Insights:
Logistic Regression:
The engineered data provides a substantial improvement in performance, with both accuracy and ROC
AUC showing significant boosts.
Regularization (L1, L2) slightly improves the model's generalizability, but the impact on accuracy and
precision is minimal without feature engineering.
Logistic regression is most effective with well-engineered data, especially in situations where linear
relationships dominate.
K-Nearest Neighbors:
KNN performs well on original data, with the best performance achieved using Euclidean distance and
smaller values of k (e.g., k=3).
Weighted KNN offers better precision but can compromise recall. Feature engineering does not
significantly enhance KNN's performance, suggesting KNN might struggle with the increased
dimensionality or noise introduced by feature transformations.
For original data, KNN demonstrates robust performance, but the curse of dimensionality in engineered
data diminishes its effectiveness.
Support Vector Machine (SVM):
Linear SVM with C=0.1 performs exceptionally well on both original and engineered data, offering the
highest accuracy and ROC AUC scores.
The RBF kernel and polynomial kernels underperform, especially at lower values of C, highlighting the
advantages of simpler linear decision boundaries.
SVM is highly effective with original data and can still maintain good performance with engineered data.
It is the most robust model overall, particularly for datasets where linear separability is possible.
Conclusion:
Best Model for Original Data: SVM (Linear Kernel, C=10) stands out with the best accuracy and ROC AUC,
closely followed by KNN using Euclidean distance and k=3.
Best Model for Engineered Data: Logistic Regression with engineered features performs excellently in
terms of both accuracy and F1-score. It is a strong contender when feature engineering improves the
feature representation.
Best All-Rounder: SVM (Linear Kernel, C=10) provides the best balance between accuracy, precision,
recall, and ROC AUC across both datasets, making it the most reliable model.
Ultimately, feature engineering significantly benefits Logistic Regression, while KNN and SVM also
benefit from appropriate parameter tuning. Each model has strengths and weaknesses depending on
the data preprocessing steps and the task requirements.