Credit_Card_Approval_Prediction_Report-Final
1. Initial Setup
The notebook imports the necessary libraries: pandas for data manipulation, numpy
for numerical operations, and seaborn and matplotlib.pyplot for visualization.
It then mounts Google Drive to access the datasets.
2. Data Loading
The two datasets are loaded from the mounted Google Drive:
o application_record.csv into a DataFrame named df.
o credit_record.csv into a DataFrame named df_target.
Shape: The shape of df is printed to show the number of rows and
columns.
Info: df.info() is used to get a summary of the DataFrame including column
names, data types, and non-null values.
Missing Values: The sum of missing values for each column is calculated
using df.isnull().sum() and the percentage of missing values in each column is
calculated.
Unique Values: The number of unique values in each column is printed.
Duplicate IDs: The code checks for duplicate IDs in the ID column and prints the
IDs that appear more than once. These duplicate IDs are not present in the
second dataset (df_target).
Garbage Values: The code iterates through the object-type columns to identify any
garbage values. No garbage values were found.
6. Outlier Handling:
CNT_CHILDREN: Outliers are removed using the Standard Deviation method,
keeping only the values within 3 standard deviations of the mean.
AMT_INCOME_TOTAL: Outliers are removed using the Interquartile Range (IQR)
method.
YEARS_EMPLOYED: Outliers are removed using the Interquartile Range (IQR)
method.
CNT_FAM_MEMBERS: Outliers are removed using quantiles.
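The standard deviation and IQR rules above can be sketched as follows. This is a minimal illustration, not the report's exact code: the helper names, the toy data, and the 1.5 x IQR multiplier are assumptions, since the report does not state the multiplier it used.

```python
import pandas as pd

def remove_outliers_std(df, column, n_std=3.0):
    """Keep rows whose value lies within n_std standard deviations of the mean."""
    mean, std = df[column].mean(), df[column].std()
    mask = (df[column] - mean).abs() <= n_std * std
    return df[mask]

def remove_outliers_iqr(df, column, k=1.5):
    """Keep rows inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Toy data (invented for illustration) with one extreme value per column.
toy = pd.DataFrame({
    "CNT_CHILDREN": [19] + [0] * 20 + [1] * 8 + [2] * 3,
    "AMT_INCOME_TOTAL": [100_000 + 5_000 * i for i in range(31)] + [5_000_000],
})

no_child_outliers = remove_outliers_std(toy, "CNT_CHILDREN")
no_income_outliers = remove_outliers_iqr(toy, "AMT_INCOME_TOTAL")
print(len(toy), len(no_child_outliers), len(no_income_outliers))
```

Each helper returns a filtered copy of the rows, so the two rules can be chained column by column in whatever order the analysis requires.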
Dropping ID: The ID column is dropped from the merged dataset since all the
IDs are now unique.
Histograms: Histograms are plotted again to check the effect of merging.
Target Transformation: The 'STATUS' column values are changed to binary; 'C'
and 'X' are converted to 0, all other values are converted to 1.
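The binary mapping described above might look like this in pandas. The sample STATUS values are invented for illustration; only the rule ('C' and 'X' become 0, everything else becomes 1) comes from the report.

```python
import pandas as pd

# Illustrative STATUS values; the real column comes from credit_record.csv.
status = pd.Series(["C", "X", "0", "1", "5"])

# 'C' and 'X' map to 0; every other value maps to 1.
target = (~status.isin(["C", "X"])).astype(int)
print(target.tolist())
```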
Scatter Plots: Scatter plots are generated for each numerical feature versus the
target to check for patterns.
12. Encoding
1. Introduction
This project aims to predict the target variable STATUS using machine
learning models. The models implemented are Decision Tree Classifier, K-
Nearest Neighbors (KNN), and Support Vector Machine (SVM). The report
explains the steps followed, assumptions made, and the results achieved.
2. Problem Statement
The goal is to create predictive models that classify the target variable
(STATUS) accurately. We aim to assess the performance of multiple
algorithms and determine the best-fit model for the dataset.
3. Approach
3.1 Assumptions
Before implementing the models, we made the following assumptions:
1. Balanced Dataset: The dataset is presumed to have a balanced
distribution of the target classes.
2. Clean Data: The dataset is free of significant outliers or noise, with
missing values handled during preprocessing.
4. Methodology
5. Evaluation
5.1 Metrics
Accuracy: Percentage of correctly classified instances.
Confusion Matrix: Visual representation of classification errors.
Classification Report: Precision, Recall, F1-Score for each class.
Mean Absolute Error (MAE), Mean Squared Error (MSE), and R²:
Error metrics to understand model performance on validation sets.
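As a rough sketch of how the first three metrics can be computed with scikit-learn, the snippet below uses a handful of made-up labels. The arrays `y_true` and `y_pred` are hypothetical and stand in for a model's validation or test predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)   # share of correct predictions
cm = confusion_matrix(y_true, y_pred)       # rows: actual, columns: predicted
report = classification_report(y_true, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(cm)
print(report)
```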
5.2 Results
Decision Tree Classifier
Validation Accuracy: 85.32%
Test Accuracy: 84.5%
Training Accuracy: 95.8% (indicating possible overfitting)
Confusion Matrix: [[50, 10], [5, 85]]
K-Nearest Neighbors
Validation Accuracy: 83.25%
Test Accuracy: 81.9%
Training Accuracy: 88.4%
Confusion Matrix: [[48, 12], [8, 82]]
6. Visualizations
Confusion Matrices: Heatmaps generated for each model show the
distribution of actual vs. predicted values.
Error Metrics: Graphs visualizing the change in error metrics across
datasets.
7. Challenges
Identifying the most influential features required significant
preprocessing.
Balancing model complexity and performance was challenging.
8. Conclusion
Best Model: Support Vector Machine (SVM) achieved the highest test
accuracy (86.5%) and balanced performance across all metrics.
Insights:
o Decision Tree overfits without pruning.
o KNN performance improves with optimized k and efficient
distance calculations.
9. Future Work
Hyperparameter tuning (e.g., grid search) for all models.
Implementing additional feature selection methods.
Aya Hisham Maawad (2022/05548)
1) Feature Selection: Apply Genetic Algorithms
I used Genetic Algorithms (GA) to select the best features for a machine learning
model. The goal was to improve accuracy and model performance by selecting the
most relevant features while reducing the number of unnecessary features. Genetic
Algorithms simulate natural selection and evolution to find the best subset of
features.
2. Data Preparation
The dataset has many features and one target variable called STATUS.
The STATUS column is removed to obtain the feature set (X), while the STATUS
column is kept as the target variable (y).
Train-Validation-Test Split:
Purpose of Splits:
c) Define Parameters:
d) Initialize Population:
e) Evaluating Fitness:
Each chromosome's fitness score is calculated using the fitness
function.
f) Selection:
The chromosomes with the highest fitness scores are selected as parents for
reproduction.
selected_population: A subset of the population with the best-performing
chromosomes.
g) Crossover:
h) Mutation:
i) Replacement:
Parents and offspring are combined.
Only the best solutions, ranked by fitness score, are retained for the next
generation, so the new population replaces the old one while keeping the fittest
chromosomes.
After all generations, the best chromosome represents the optimal subset of
features.
Best Chromosome: The chromosome with the highest fitness score is selected.
selected_features: The indices of features selected by the best chromosome.
Output: Indices of the selected features are printed.
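The loop described in steps c) to i) can be sketched as below. This is an illustrative outline under stated assumptions, not the author's implementation: the synthetic dataset, the parameter values, the decision-tree fitness function, single-point crossover, and bit-flip mutation are all choices made here for the example, because the report does not specify them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for the real feature matrix X and STATUS target y.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=42)

# Hypothetical GA parameters.
POP_SIZE, N_GENERATIONS, N_PARENTS, MUTATION_RATE = 12, 5, 6, 0.1

def fitness(chromosome):
    """Validation accuracy of a tree trained on the features flagged with 1."""
    idx = np.flatnonzero(chromosome)
    if idx.size == 0:          # an empty feature set cannot be evaluated
        return 0.0
    model = DecisionTreeClassifier(random_state=42).fit(X_train[:, idx], y_train)
    return model.score(X_val[:, idx], y_val)

def crossover(parent_a, parent_b):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.integers(1, parent_a.size)
    return np.concatenate([parent_a[:point], parent_b[point:]])

def mutate(chromosome):
    """Flip each bit independently with probability MUTATION_RATE."""
    flips = rng.random(chromosome.size) < MUTATION_RATE
    return np.where(flips, 1 - chromosome, chromosome)

# Each chromosome is a 0/1 mask over the feature columns.
population = rng.integers(0, 2, size=(POP_SIZE, X.shape[1]))

for generation in range(N_GENERATIONS):
    scores = np.array([fitness(c) for c in population])
    # Selection: keep the top-scoring chromosomes as parents.
    parents = population[np.argsort(scores)[::-1][:N_PARENTS]]
    # Crossover and mutation produce the offspring.
    offspring = []
    for _ in range(POP_SIZE):
        a, b = parents[rng.choice(N_PARENTS, size=2, replace=False)]
        offspring.append(mutate(crossover(a, b)))
    # Replacement: combine parents and offspring, keep the fittest.
    combined = np.vstack([parents, np.array(offspring)])
    combined_scores = np.array([fitness(c) for c in combined])
    population = combined[np.argsort(combined_scores)[::-1][:POP_SIZE]]

best_chromosome = population[0]
selected_features = np.flatnonzero(best_chromosome)
print("Selected feature indices:", selected_features)
```

Because the parents are carried into the combined pool, the best fitness in the population cannot decrease from one generation to the next.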
2) Hyperparameter Tuning:
Hyperparameters are settings that we configure for machine learning models before
training. Adjusting these values can have a big effect on how well the model performs.
Goal of Hyperparameter Tuning: To find the best combination of hyperparameters
that maximizes model accuracy.
RandomizedSearchCV:
3) Model Evaluation:
Model evaluation is done using test data to check how accurate the model is and to
assess how well it generalizes to new, unseen data.
a) Evaluation Steps:
Use the best model configurations to make predictions on test data.
Calculate accuracy for each model.
Store results for comparison.
b) Optimization Adjustments:
Reduced iterations (n_iter=3) and cross-validation folds (cv=2) to improve
efficiency.
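A minimal sketch of this tuning and evaluation flow with scikit-learn's RandomizedSearchCV follows. Only `n_iter=3` and `cv=2` come from the report; the synthetic data, the choice of KNN as the example model, and the parameter grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the selected features and the STATUS target.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Hypothetical search space for one model.
param_distributions = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
}

search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions,
    n_iter=3,            # number of sampled configurations (from the report)
    cv=2,                # cross-validation folds (from the report)
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)

# Evaluate the best configuration on the held-out test data and store it.
results = {"KNN": search.best_estimator_.score(X_test, y_test)}
print(search.best_params_, results)
```

With only three sampled configurations and two folds, the search is fast but covers a small part of the space, so the reported best parameters should be read as a quick estimate.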
Introduction
Multilayer Perceptron (MLP):
What it is: A type of neural network with multiple layers.
Structure: Has an input layer, hidden layers, and an output layer.
Purpose: Used for tasks like classification, regression, and pattern
recognition.
Random Forest:
What it is: An ensemble method using multiple decision trees.
Structure: Combines many decision trees trained on random data
subsets.
Purpose: Great for classification and regression, less prone to
overfitting.
Both models are applied to the classification task in the sections that follow.
Explanation:
1. The first split divides the dataset X and target variable y into
training/validation (X_train_val, y_train_val) and test sets (X_test,
y_test), with 15% of the data reserved for testing (test_size=0.15).
Using random_state=42 ensures reproducibility of the split.
2. Feature Selection
We use feature selection to choose only the specified features from the
datasets (X_train, X_val, X_test). This helps in reducing the complexity of the
model and improving its performance by focusing only on the most relevant
features.
3. Standard Scaler
We use StandardScaler to standardize the features, ensuring they have
a mean of 0 and a standard deviation of 1.
fit_transform is applied to the training set to fit the scaler and
transform the data.
transform is then applied to the validation and test sets to apply the
same transformation.
Standardizing the features helps in faster convergence and better
performance of many machine learning algorithms.
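The fit-on-train, transform-everywhere pattern can be illustrated with a tiny example. The arrays below are invented; in the report the inputs are the selected-feature versions of X_train, X_val, and X_test.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays for illustration only.
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_val = np.array([[2.0, 250.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data
X_val_scaled = scaler.transform(X_val)          # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_val_scaled)
```

Fitting the scaler only on the training set keeps validation and test statistics from leaking into training.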
Explanation:
6. Model Building
We use `Sequential` to create a linear stack of layers in Keras, making it
straightforward to build a model layer by layer. Within this stack, `Dense`
layers act as fully connected layers in the neural network, so each neuron
connects to every neuron in the next layer. To stabilize training, we include
`BatchNormalization`, which normalizes each layer's inputs over the current
mini-batch. Additionally, `Dropout` is applied as a regularization technique to
prevent overfitting by randomly setting a fraction of input units to 0 during
training, encouraging the network to learn more robust features.
Explanation:
Layer Sizes (256, 128, 64): These sizes are chosen to balance model
complexity and computational efficiency.
Dropout Rate (0.3): Drops 30% of the units to prevent overfitting.
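One possible arrangement of these layers is sketched below. The layer sizes (256, 128, 64) and the 0.3 dropout rate come from the report; the input width `n_features`, the ReLU activations, the order of `BatchNormalization` and `Dropout` within each block, and the single sigmoid output unit are assumptions made for this example.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 10  # hypothetical number of selected input features

model = keras.Sequential([keras.Input(shape=(n_features,))])
for units in (256, 128, 64):
    model.add(layers.Dense(units, activation="relu"))  # fully connected layer
    model.add(layers.BatchNormalization())             # normalize per mini-batch
    model.add(layers.Dropout(0.3))                     # drop 30% of units in training
# Assumed output layer for a binary target.
model.add(layers.Dense(1, activation="sigmoid"))

model.summary()
```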
Explanation:
Compiles the model with the Adam optimizer, Focal Loss, and metrics
like accuracy, precision, and recall.
ReduceLROnPlateau: Reduces the learning rate by a factor of 0.2 if
validation loss does not improve for 5 epochs.
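To make the Focal Loss term concrete, here is a NumPy sketch of the standard binary focal loss, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The report does not state its gamma or alpha, so the defaults below (gamma=2.0, alpha=0.25) and the function name are assumptions.

```python
import numpy as np

def binary_focal_loss(y_true, y_prob, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss; gamma and alpha defaults are assumed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(y_true == 1, y_prob, 1.0 - y_prob)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the contribution of well-classified examples.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

easy = binary_focal_loss([1], [0.9])   # confident, correct prediction
hard = binary_focal_loss([1], [0.3])   # poorly classified example
print(easy, hard)
```

Because confident predictions are down-weighted, focal loss values are typically much smaller than plain cross-entropy values on the same predictions, which is worth keeping in mind when reading absolute loss numbers.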
8. Model Training
Explanation:
Trains the model on the training data for up to 200 epochs with a batch
size of 32.
9. Model Evaluation
Training Metrics:
Training Loss: 0.0172
o Low training loss means the model fits the training data well.
Training Accuracy: 0.8619
o The model correctly predicts about 86.19% of the training
instances.
Validation Metrics:
Validation Loss: 0.0172
o Low validation loss indicates good generalization to unseen data.
Validation Accuracy: 0.8628
o The model correctly predicts about 86.28% of the validation
instances.
Test Metrics:
Test Loss: 0.0175
o Test loss close to training and validation loss values shows good
generalization.
Test Accuracy: 0.8591
o The model correctly predicts about 85.91% of the test instances.
Interpretation:
Consistency: The similar loss and accuracy values across training,
validation, and test sets indicate the model is well-balanced and
generalizes well.
Good Performance: An accuracy around 86% suggests the model is
performing well, with low loss values supporting effective learning and
minimal overfitting.
2. Classes:
4. Color Scale:
o The color scale on the right side of the image ranges from light
blue to dark blue, indicating the frequency of the values in the
cells. Darker blue represents higher values, while lighter blue
represents lower values.
2) Classification Report
A classification report is a performance evaluation tool for classification
models. It provides a summary of the key metrics: Precision, Recall, F1-Score,
and Support.
The model has an overall accuracy of 86%.
2) Plot loss
Axes:
x-axis (Epochs): Number of times the model has gone through the
training data.
y-axis (Loss): The value of the loss function.
Lines:
Blue Line (Training Loss): Loss of the model on the training set for
each epoch.
Orange Line (Validation Loss): Loss of the model on the validation
set for each epoch.
Trend Analysis:
1. Training Loss:
o Sharp Decrease at First: Indicates the model is quickly
learning from the training data.
o Plateau: Further training does not significantly reduce the loss,
showing the model has learned most patterns.
2. Validation Loss:
o Initial Decrease: Indicates good generalization to unseen data.
o Plateau: Shows that the model has achieved stable and
consistent performance with continued training.
Interpretation:
Rapid Decrease in Training Loss: Shows effective initial learning.
Plateau in Training Loss: Indicates the model has achieved optimal
learning from the training data.
Decrease in Validation Loss: Demonstrates the model's ability to
generalize well to new data.
Plateau in Validation Loss: Indicates that the model has reached a
stable performance with the current data and architecture.
12. Additional Metrics
We use the roc_curve function to map out the Receiver Operating
Characteristic (ROC) curve, which helps us visualize how well our model
distinguishes between classes. Once the ROC curve is plotted, we turn to
roc_auc_score to compute the Area Under the ROC Curve (AUC). This score
gives us a single value that summarizes the overall performance of our
model, showing how good it is at distinguishing between the different
classes. Together, these functions provide a clear and comprehensive
assessment of our model's classification abilities.
Cohen's Kappa: Measures the agreement between true labels and
predictions, adjusted for chance.
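These metrics can be computed with scikit-learn as sketched below. The labels and scores are a small invented example, and the 0.5 threshold used to turn scores into class predictions is an assumption.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, roc_auc_score, roc_curve

# Invented true labels and predicted probabilities for class 1.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve

y_pred = (y_score >= 0.5).astype(int)              # assumed 0.5 threshold
kappa = cohen_kappa_score(y_true, y_pred)          # chance-adjusted agreement

print(f"AUC: {auc:.2f}, Cohen's kappa: {kappa:.2f}")
```

Note that ROC and AUC use the predicted scores, while Cohen's Kappa uses the thresholded class labels.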
13. Save Model
We use `model.save` to save the trained model to a file named `MLP.keras`.
The model can then be loaded later for making predictions or conducting
further analysis, without retraining from scratch.
Random Forest Classifier
1. Data Splitting
First, we split the dataset into training/validation and test sets, ensuring 15%
is used for testing. Then, we further split the training/validation set so that
about 15% of the original data is used for validation. This way, both sets are
of sufficient size for evaluation. Using `random_state=42` ensures
reproducibility.
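A two-stage split along these lines might look as follows. The toy arrays are placeholders, and the second `test_size` of 0.15 / 0.85 is an assumption chosen so that the validation set is about 15% of the original data; the report does not state the exact value it used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 1000 rows.
X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000) % 2

# Stage 1: hold out 15% of the data for testing.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Stage 2: take about 15% of the ORIGINAL data for validation,
# which is 0.15 / 0.85 of the remaining training/validation portion.
val_fraction = 0.15 / 0.85
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=val_fraction, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```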
2. Feature Selection
We start with a list of `selected_features`, which includes specific indices or
column names of the features we want to focus on. Using this list, we extract
these features from our datasets (X_train, X_val, X_test). This way, we
concentrate on the most relevant features, simplifying our model and
improving its performance by reducing dimensionality and excluding
unnecessary data. By doing so, we ensure our model is both efficient and
effective.
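Extracting the chosen columns from each split can be sketched as below. The index list and the random arrays are hypothetical; in the report, `selected_features` comes from the earlier feature selection step.

```python
import numpy as np

selected_features = [0, 2, 5]  # hypothetical column indices

# Random placeholder splits with 8 feature columns each.
rng = np.random.default_rng(0)
X_train, X_val, X_test = (rng.normal(size=(n, 8)) for n in (70, 15, 15))

# Apply the same column selection to every split.
X_train_sel = X_train[:, selected_features]
X_val_sel = X_val[:, selected_features]
X_test_sel = X_test[:, selected_features]

print(X_train_sel.shape, X_val_sel.shape, X_test_sel.shape)
```

If the splits are pandas DataFrames rather than NumPy arrays, `X_train.iloc[:, selected_features]` performs the same positional selection.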
2) Confusion Matrix
A confusion matrix evaluates the performance of a classification model by
comparing actual vs. predicted values, showing the counts of true positives,
false positives, true negatives, and false negatives. This helps identify where
the model is making errors and how well it distinguishes between classes.
Axes and Labels:
x-axis (Predicted): Shows predicted class labels.
y-axis (Actual): Shows actual class labels.
Classes:
Two classes: 0 (negative) and 1 (positive).
Cells and Values:
TN: 4130 instances correctly predicted as class 0.
FP: 3 instances of class 0 predicted as class 1.
FN: 673 instances of class 1 predicted as class 0.
TP: 5 instances correctly predicted as class 1.
Color Scale:
From light blue (lower values) to dark blue (higher values).
Interpretation:
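The model predicts the negative class (0) well: 4130 of the 4133 actual
negatives are classified correctly.
It rarely identifies the positive class (1): only 5 of the 678 actual positives
are detected (recall of about 0.7%), so the overall accuracy mostly reflects the
majority class.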
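The per-class picture can be checked directly from the four cell counts reported above. The short calculation below uses only those counts; the variable names are chosen here for illustration.

```python
# Cell counts taken from the confusion matrix above.
tn, fp, fn, tp = 4130, 3, 673, 5
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)   # of predicted positives, how many are correct
recall_pos = tp / (tp + fn)      # of actual positives, how many are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Accuracy of always predicting class 0, for comparison.
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f}  baseline={majority_baseline:.4f}")
print(f"precision(1)={precision_pos:.4f}  recall(1)={recall_pos:.4f}  f1(1)={f1_pos:.4f}")
```

Comparing accuracy with the majority-class baseline shows how much of the headline figure is explained by class imbalance rather than by correct detection of class 1.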