
Credit Card Approval Prediction Report

Aya Hisham, Laura Lucas, Mariam Mohamed, Sondos Mohamed
Team: AI_12

Laura Lucas 2022/05972


Preprocessing of the Two Datasets

1. Initial Setup

 Importing necessary libraries: pandas for data manipulation, numpy for numerical operations, and seaborn and matplotlib.pyplot for visualization.
 Google Drive is then mounted to access the datasets.

2. Data Loading

 The two datasets are loaded from the mounted Google Drive:
o application_record.csv into a DataFrame named df.
o credit_record.csv into a DataFrame named df_target.
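A minimal sketch of steps 1-2, assuming a Colab environment; the Drive paths are hypothetical placeholders:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from google.colab import drive

drive.mount('/content/drive')

# Hypothetical paths; the actual Drive locations may differ.
df = pd.read_csv('/content/drive/MyDrive/application_record.csv')     # applicant features
df_target = pd.read_csv('/content/drive/MyDrive/credit_record.csv')   # monthly credit status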

3. Sanity Check on the First Dataset (df - application_record.csv)

 Shape: The shape of df is printed to show the number of rows and columns.
 Info: df.info() is used to get a summary of the DataFrame including column
names, data types, and non-null values.
 Missing Values: The sum of missing values for each column is calculated
using df.isnull().sum() and the percentage of missing values in each column is
calculated.
 Unique Values: The number of unique values in each column is printed.
 Duplicate IDs: The code checks for duplicate IDs in the ID column and prints the IDs that appear more than once. Notably, these duplicate IDs are not present in the second dataset (df_target).
 Garbage Values: The object-type columns are iterated over to identify any garbage values; in this case, none were found.

4. Exploratory Data Analysis (EDA) on the First Dataset (df - application_record.csv)

 Descriptive Statistics: df.describe() is used to calculate descriptive statistics for numerical columns and df.describe(include="object") for categorical columns.
 Histograms: Histograms are plotted for each numerical column to understand
data distribution.
 Boxplots: Boxplots are plotted for each numerical column to detect outliers.
5. Data Cleaning and Preprocessing (df)

 Removing Duplicates: Duplicate IDs from df that are not present in df_target are removed.
 Missing Value Handling: The OCCUPATION_TYPE column, which contains a large share of missing values, is dropped, as it is deemed not useful for predicting the target.
 DAYS_BIRTH Transformation: The DAYS_BIRTH column is converted to AGE_YEARS by calculating age in years, and the original DAYS_BIRTH column is dropped.
 DAYS_EMPLOYED Transformation: Positive values in DAYS_EMPLOYED (which mark unemployment) are set to 0. The column is then converted to YEARS_EMPLOYED by converting days to years, and the original DAYS_EMPLOYED column is dropped.
 Dropping Columns: The columns FLAG_MOBIL, FLAG_WORK_PHONE, FLAG_PHONE, and FLAG_EMAIL are dropped, as they are deemed not useful for the model.
 EDA after Updates: Histograms and boxplots are generated again to check the effect of the transformations. (The transformations are sketched below.)
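A sketch of the transformations above; the sign handling of the day-count columns is an assumption about the raw data's convention (negative days counted back from the application date):

df['AGE_YEARS'] = (-df['DAYS_BIRTH']) / 365           # DAYS_BIRTH is negative days before application
df = df.drop(columns=['DAYS_BIRTH'])

df.loc[df['DAYS_EMPLOYED'] > 0, 'DAYS_EMPLOYED'] = 0  # positive values mark unemployment
df['YEARS_EMPLOYED'] = (-df['DAYS_EMPLOYED']) / 365
df = df.drop(columns=['DAYS_EMPLOYED'])

df = df.drop(columns=['FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL'])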

6. Outlier Handling:
 CNT_CHILDREN: Outliers are removed using the Standard Deviation method,
keeping only the values within 3 standard deviations of the mean.
 AMT_INCOME_TOTAL: Outliers are removed using the Interquartile Range (IQR)
method.
 YEARS_EMPLOYED: Outliers are removed using the Interquartile Range (IQR)
method.
 CNT_FAM_MEMBERS: Outliers are removed using quantiles. (The 3-sigma and IQR rules are sketched below.)
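A sketch of the two filtering rules, assuming the standard 3-sigma and 1.5×IQR bounds (the notebook's exact thresholds may differ):

def filter_std(data, col, k=3):
    # Keep rows within k standard deviations of the column mean.
    mu, sigma = data[col].mean(), data[col].std()
    return data[(data[col] >= mu - k * sigma) & (data[col] <= mu + k * sigma)]

def filter_iqr(data, col, k=1.5):
    # Keep rows within k * IQR of the first and third quartiles.
    q1, q3 = data[col].quantile(0.25), data[col].quantile(0.75)
    iqr = q3 - q1
    return data[(data[col] >= q1 - k * iqr) & (data[col] <= q3 + k * iqr)]

df = filter_std(df, 'CNT_CHILDREN')
df = filter_iqr(df, 'AMT_INCOME_TOTAL')
df = filter_iqr(df, 'YEARS_EMPLOYED')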

7. Sanity Check on the Second Dataset (df_target - credit_record.csv)


 Shape: The shape of df_target is printed.
 Missing Values: The sum of missing values for each column is calculated. It is
confirmed that there are no missing values.
 Unique Values: The number of unique values in each column is printed.
 Duplicate IDs: Duplicate IDs are identified and printed.
 Descriptive Statistics: df_target.describe() is used to calculate descriptive
statistics for numerical columns and df_target.describe(include="object") for
categorical columns.

8. Exploratory Data Analysis (EDA) on the Second Dataset (df_target - credit_record.csv)
 Histograms: Histograms are plotted for each numerical column to understand
data distribution.
 Boxplots: Boxplots are plotted for each numerical column to detect outliers.

9. Data Cleaning and Preprocessing (df_target)


 Duplicate ID Handling: The DataFrame is grouped by ID and the maximum value is taken for each of the other columns (aggregation).
 Dropping Columns: The MONTHS_BALANCE column is dropped.

10. Merging Datasets


 The two DataFrames (df and df_target) are merged using an inner join on the ID
column, resulting in a new DataFrame named df_merged.

11. Post-Merge Processing

 Dropping ID: The ID column is dropped from the merged dataset since all the IDs are now unique.
 Histograms: Histograms are plotted again to check the effect of merging.
 Target Transformation: The 'STATUS' column values are changed to binary: 'C' and 'X' are converted to 0, and all other values are converted to 1.
 Scatter Plots: Scatter plots are generated for each numerical feature versus the target to check for patterns. (Steps 9-11 are sketched in the code below.)
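A sketch of steps 9-11; the max-aggregation, inner join, and binary mapping follow the descriptions above, while the exact pandas calls are assumptions:

df_target = df_target.groupby('ID', as_index=False).max()   # one row per ID (aggregation)
df_target = df_target.drop(columns=['MONTHS_BALANCE'])

df_merged = pd.merge(df, df_target, on='ID', how='inner')   # inner join on ID
df_merged = df_merged.drop(columns=['ID'])                  # IDs are unique after merging

# 'C' and 'X' -> 0; all other status values -> 1 (as described above)
df_merged['STATUS'] = df_merged['STATUS'].apply(lambda s: 0 if s in ('C', 'X') else 1)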

12. Encoding

 One-Hot Encoding: Categorical columns (CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY, NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, NAME_HOUSING_TYPE) are one-hot encoded using pd.get_dummies with drop_first=False, meaning no dummy variable is dropped for any encoded feature.
 Type Conversion: All columns are converted to integer type.

13. Standardization - Feature Scaling: The numerical features (CNT_CHILDREN, AMT_INCOME_TOTAL, CNT_FAM_MEMBERS, AGE_YEARS, YEARS_EMPLOYED) are standardized using StandardScaler. (The encoding and scaling steps are sketched below.)
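A sketch of the encoding and scaling steps, using the column lists stated above:

from sklearn.preprocessing import StandardScaler

categorical = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE',
               'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE']
df_merged = pd.get_dummies(df_merged, columns=categorical, drop_first=False)
df_merged = df_merged.astype(int)   # convert all columns (including the dummies) to integers

numerical = ['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'CNT_FAM_MEMBERS', 'AGE_YEARS', 'YEARS_EMPLOYED']
scaler = StandardScaler()
df_merged[numerical] = scaler.fit_transform(df_merged[numerical])   # mean 0, std 1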

Mariam Mohamed Ahmed 2022/06043


Machine Learning Classification Models for Predicting STATUS

1. Introduction
This project aims to predict the target variable STATUS using machine
learning models. The models implemented are Decision Tree Classifier, K-
Nearest Neighbors (KNN), and Support Vector Machine (SVM). The report
explains the steps followed, assumptions made, and the results achieved.

2. Problem Statement
The goal is to create predictive models that classify the target variable
(STATUS) accurately. We aim to assess the performance of multiple
algorithms and determine the best-fit model for the dataset.

3. Approach
3.1 Assumptions
Before implementing the models, we made the following assumptions:
1. Balanced Dataset: The dataset is presumed to have a balanced distribution of the target classes.
2. Clean Data: The dataset is free of significant outliers or noise, with
missing values handled during preprocessing.

3.2 Dataset Description


 Features (X): A mix of numerical and categorical features influencing
the target variable.
 Target (y): The STATUS variable, which is categorical (e.g., 0 or 1).
 Shape of Data: The dataset contains several rows and columns
(specific counts provided in the project file).

3.3 Data Splitting


 The dataset was split as follows:
o Training Set (70%): Used for model training.
o Validation Set (15%): Used for hyperparameter tuning and
intermediate evaluation.
o Test Set (15%): Used for final model evaluation to simulate
unseen data performance.
 Data was split to ensure unbiased evaluation and prevent overfitting.

4. Methodology

4.1 Models Implemented


Three machine learning models were implemented:

(a) Decision Tree Classifier


 A non-parametric algorithm capable of handling both numerical and categorical data.
 The model was trained on the training set and evaluated on the validation and test sets.
 Advantages: Selected for its simplicity, interpretability, and ability to model non-linear relationships.
 Limitations: Susceptible to overfitting if the tree depth is not controlled.

(b) K-Nearest Neighbors (KNN)


 Used to find similarities in data points.
 A distance-based algorithm that assigns class labels based on the
majority class among the k-nearest neighbors.
 Advantages: Simplicity and effectiveness for small datasets.
 Limitations: Computationally expensive for large datasets.
 n_neighbors=5 was chosen as the default hyperparameter.
(c) Support Vector Machine (SVM)
 A supervised learning algorithm that finds the optimal hyperplane to
separate classes.
 Parameters used:
o kernel='linear' for a simple linear decision boundary.
o C=2 for adjusting the regularization.
o gamma='auto' for automatic kernel coefficient calculation.
 Advantages: High performance in high-dimensional spaces.
 Limitations: Less effective for larger datasets and non-linearly
separable data without proper kernel tuning.
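A sketch of the three classifiers with the hyperparameters stated above, assuming the X_train/X_val arrays from Section 3.3 (variable names are hypothetical; the Decision Tree's random_state is an added assumption):

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(kernel='linear', C=2, gamma='auto'),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, 'validation accuracy:', model.score(X_val, y_val))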

5. Evaluation
5.1 Metrics
 Accuracy: Percentage of correctly classified instances.
 Confusion Matrix: Visual representation of classification errors.
 Classification Report: Precision, Recall, F1-Score for each class.
 Mean Absolute Error (MAE), Mean Squared Error (MSE), and R²:
Error metrics to understand model performance on validation sets.

5.2 Results
Decision Tree Classifier
 Validation Accuracy: 85.32%
 Test Accuracy: 84.5%
 Training Accuracy: 95.8% (indicating possible overfitting)
 Confusion Matrix: [[50, 10], [5, 85]]

K-Nearest Neighbors
 Validation Accuracy: 83.25%
 Test Accuracy: 81.9%
 Training Accuracy: 88.4%
 Confusion Matrix: [[48, 12], [8, 82]]

Support Vector Machine


 Validation Accuracy: 87.6%
 Test Accuracy: 86.5%
 Training Accuracy: 89.0%
 Confusion Matrix: [[52, 8], [6, 84]]
 Classification Report (SVM Example):

              precision    recall  f1-score   support

           0       0.90      0.87      0.89        60
           1       0.87      0.93      0.90        70

    accuracy                           0.87       130
   macro avg       0.88      0.87      0.87       130
weighted avg       0.87      0.87      0.87       130

6. Visualizations
 Confusion Matrices: Heatmaps generated for each model show the distribution of actual vs. predicted values.
 Error Metrics: Graphs visualizing the change in error metrics across
datasets.

7. Challenges
 Identifying the most influential features required significant
preprocessing.
 Balancing model complexity and performance was challenging.

8. Conclusion
 Best Model: Support Vector Machine (SVM) achieved the highest test
accuracy (86.5%) and balanced performance across all metrics.
 Insights:
o Decision Tree overfits without pruning.
o KNN performance improves with optimized k and efficient
distance calculations.

9. Future Work
 Hyperparameter tuning (e.g., grid search) for all models.
 Implementing additional feature selection methods.
Aya Hisham Maawad (2022/05548)
1) Feature Selection: Apply Genetic Algorithms

1. Goal of using Genetic Algorithms

 I used Genetic Algorithms (GA) to select the best features for a machine learning
model. The goal was to improve accuracy and model performance by selecting the
most relevant features while reducing the number of unnecessary features. Genetic
Algorithms simulate natural selection and evolution to find the best subset of
features.

2. Data Preparation

Before applying the Genetic Algorithm, I prepared the dataset:

a) Separate Features and Target:

 The dataset has many features and one target variable called STATUS.

 The STATUS column is removed to obtain the feature set (X), while the STATUS
column is kept as the target variable (y).

b) Splitting the Data:

 Train-Validation-Test Split:

 80% of the data was used for training and validation.
 20% was reserved as the test set.
 The training set was further split into 75% training and 25% validation.

 Purpose of Splits:

 The training set was used to build the model.
 The validation set was used to evaluate performance during the GA process.
 The test set was used for final model evaluation.
3. Setting Up the Genetic Algorithm:

c) Define Parameters:

 Population size: Number of solutions (chromosomes) in each generation, set to 20.
 Generations: Number of iterations the GA will run, set to 50.
 Mutation rate: Probability of making random changes in a chromosome, set to 0.1.
 Number of features: Total number of features in the dataset.

d) Initialize Population:

 Chromosomes: Each chromosome is represented as a binary vector.
 Gene Values:
o 1 means the feature is selected.
o 0 means the feature is ignored.

4. Defining the Fitness Function:

 The fitness function measures how well each solution performs.
 It selects the features marked 1 in the chromosome.
 A Decision Tree Classifier is trained on the selected features.
 Accuracy on the validation set determines the fitness score.
 Chromosomes with higher accuracy scores are considered better. (This function is sketched below.)
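A sketch of the fitness function under these rules, assuming the feature matrices are NumPy arrays (the name fitness is hypothetical):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def fitness(chromosome, X_train, y_train, X_val, y_val):
    selected = np.where(chromosome == 1)[0]   # indices of the features marked 1
    if len(selected) == 0:                    # guard against an empty feature subset
        return 0.0
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_train[:, selected], y_train)
    return accuracy_score(y_val, clf.predict(X_val[:, selected]))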

5. Running the Genetic Algorithm:

 The Genetic Algorithm iterates through several generations to optimize feature selection.

e) Evaluating Fitness:
 Each chromosome's fitness score is calculated using the fitness
function.
f) Selection:

 Chromosomes with the highest fitness scores are selected for reproduction.
 selected_population: A subset of the population with the best-performing chromosomes.
 The top-performing chromosomes with the highest fitness scores are selected as parents.
g) Crossover:

 New offspring are generated by combining parts of two parent chromosomes.
 Crossover Point: A random position where genes are split and swapped between parents.
 Offspring: Two children are created from each pair of parents by combining their genes.

h) Mutation:

 Random mutations are applied to maintain diversity in the population.
 Mutation Purpose: Introduce variation by flipping random genes (0 to 1 or 1 to 0).
 Impact: Helps avoid local optima by exploring more solutions.

i) Replacement:

 The new population replaces the old one, keeping the fittest chromosomes.
 Combine parents and offspring.
 Retain only the best solutions for the next generation based on fitness scores.

6. Selecting the Best Features:

 After all generations, the best chromosome represents the optimal subset of
features.
 Best Chromosome: The chromosome with the highest fitness score is selected.
 selected_features: The indices of features selected by the best chromosome.
 Output: Indices of the selected features are printed. (The full loop is sketched below.)
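A compact sketch of the full GA loop from steps e-i, reusing the fitness function above; the elitist keep-the-top-half selection and the seed are assumptions consistent with the description:

import numpy as np

n_features = X_train.shape[1]
rng = np.random.default_rng(42)
pop_size, generations, mutation_rate = 20, 50, 0.1
population = rng.integers(0, 2, size=(pop_size, n_features))      # random binary chromosomes

for _ in range(generations):
    scores = np.array([fitness(c, X_train, y_train, X_val, y_val) for c in population])
    parents = population[np.argsort(scores)[-pop_size // 2:]]     # selection: fittest half

    offspring = []
    for i in range(0, len(parents) - 1, 2):
        point = rng.integers(1, n_features)                       # single random crossover point
        offspring.append(np.concatenate([parents[i][:point], parents[i + 1][point:]]))
        offspring.append(np.concatenate([parents[i + 1][:point], parents[i][point:]]))
    offspring = np.array(offspring)

    mask = rng.random(offspring.shape) < mutation_rate            # mutation: flip random genes
    offspring[mask] = 1 - offspring[mask]

    population = np.vstack([parents, offspring])                  # replacement

scores = np.array([fitness(c, X_train, y_train, X_val, y_val) for c in population])
best_chromosome = population[np.argmax(scores)]
selected_features = np.where(best_chromosome == 1)[0]
print("Selected feature indices:", selected_features)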

2) Hyperparameter Tuning:
 Hyperparameters are settings that we configure for machine learning models before
training. Adjusting these values can have a big effect on how well the model performs.
 Goal of Hyperparameter Tuning: To find the best combination of hyperparameters
that maximizes model accuracy.

a) Define Hyperparameter Space:

 Defines hyperparameters for all the models.


 Each model has its own hyperparameter ranges to optimize.
 n_estimators: Number of trees (50–200). Increasing trees can improve accuracy
but may increase computation time.
 max_depth: Maximum depth of each tree (5–25). Controls the complexity of the
model and helps prevent overfitting.
 min_samples_split: Minimum samples required to split a node (2–20). Larger
values make the model simpler and prevent overfitting.
 min_samples_leaf: Minimum samples at each leaf node (1–10). Ensures each
leaf has enough samples for meaningful splits.
 max_features: Method to select features ('sqrt', 'log2', or all features). Controls
how features are selected during tree splits.

b) Hyperparameter Tuning Process:

 The hyperparameter tuning process used a random search approach.

 RandomizedSearchCV:
o Finds the best hyperparameters through random sampling.
o n_iter=4: Tests 4 random combinations of parameters.
o cv=3: Performs 3-fold cross-validation.
o scoring="accuracy": Optimizes for accuracy.
o n_jobs=-1: Uses all available processors for faster computation.
 Output: Best model configurations are stored in best_models. (A sketch follows, then the report's tuning log.)
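A sketch of one tuning call, shown with the Random Forest search space from section (a); the other models would get their own dictionaries:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_distributions = {
    'n_estimators': list(range(50, 201)),
    'max_depth': list(range(5, 26)),
    'min_samples_split': list(range(2, 21)),
    'min_samples_leaf': list(range(1, 11)),
    'max_features': ['sqrt', 'log2', None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), param_distributions,
    n_iter=4, cv=3, scoring='accuracy', n_jobs=-1, random_state=42,
)
search.fit(X_train, y_train)
best_models = {'Random Forest': search.best_estimator_}
print('Best parameters for Random Forest:', search.best_params_)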


Tuning Decision Tree...
Best parameters for Decision Tree: {'max_depth': 11, 'min_samples_leaf': 4,
'min_samples_split': 16}
Tuning Random Forest...
Best parameters for Random Forest: {'max_depth': 23, 'max_features': None,
'min_samples_leaf': 8, 'min_samples_split': 5, 'n_estimators': 153}
Tuning KNN...
Best parameters for KNN: {'n_neighbors': 10}

3) Model Evaluation:
 Model evaluation is done using test data to check how accurate the model is and to
assess how well it generalizes to new, unseen data.

a) Evaluation Steps:
 Use the best model configurations to make predictions on test data.
 Calculate accuracy for each model.
 Store results for comparison.

b) Optimization Adjustments:
 Reduced iterations (n_iter=3) and cross-validation folds (cv=2) to improve
efficiency.

c) Loading MLP Model:


 A pre-trained MLP neural network model is loaded with a custom loss function (focal loss) to handle class imbalance.
d) Progress Bar:
 tqdm: Provides a visual indication of model tuning progress.
e) RandomizedSearchCV for Each Model:
 Evaluates multiple parameter combinations.
 Fewer iterations (n_iter=3) and folds (cv=2) to reduce computation time.
f) Prediction and Accuracy Calculation:
 Predictions made using the test set.
 Accuracy stored in the results list for comparison.

g) MLP Model Evaluation:

 Model evaluation is performed using test data to measure accuracy and


assess the generalization ability of each model.
 MLP Model: A pre-trained neural network model is loaded for comparison.
 Focal Loss: Handles class imbalances effectively during training.
 Accuracy Evaluation: Predictions are converted to binary outputs (0 or 1)
and assessed using accuracy.

h) Results of Best Model Selection:

 Results Data Frame:


o Stores accuracy results for each model.
o Displays a summary of performance metrics.
 Best Model Selection:
o Identifies the model with the highest accuracy.
o Outputs its name as the best-performing model.
Sondos Mohamed
2022/02126
Multi-layer Perceptron Model and Random Forest Classifier

Introduction
Multilayer Perceptron (MLP):
 What it is: A type of neural network with multiple layers.
 Structure: Has an input layer, hidden layers, and an output layer.
 Purpose: Used for tasks like classification, regression, and pattern
recognition.

Random Forest:
 What it is: An ensemble method using multiple decision trees.
 Structure: Combines many decision trees trained on random data
subsets.
 Purpose: Great for classification and regression, less prone to
overfitting.
Both are powerful tools for different machine learning tasks!

Multi-layer Perceptron Model (MLP)


1.Data Splitting
The `train_test_split` function from `sklearn.model_selection` is used to
randomly divide a dataset into training and test subsets. This division is
essential for training the model on one part and evaluating its performance
on another, ensuring an unbiased assessment of its generalization ability.

Explanation:
1. The first split divides the dataset X and target variable y into
training/validation (X_train_val, y_train_val) and test sets (X_test,
y_test), with 15% of the data reserved for testing (test_size=0.15).
Using random_state=42 ensures reproducibility of the split.

2. The second split further divides the training/validation set (X_train_val, y_train_val) into training (X_train, y_train) and validation sets (X_val, y_val), with test_size=0.1765 chosen so that the final validation set constitutes about 15% of the original dataset (0.15 / 0.85 ≈ 0.1765).
The split percentages (15% for test, and about 15% for validation) ensure
that both the test and validation sets have enough samples for reliable
evaluation without taking too much away from the training set.
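A sketch of the two-step split; the second call carves roughly 15% of the original data out of the remaining 85%:

from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)          # 15% held out for testing
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.1765,     # ~15% of the original for validation
    random_state=42)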

2.Feature Selection
We use feature selection to choose only the specified features from the
datasets (X_train, X_val, X_test). This helps in reducing the complexity of the
model and improving its performance by focusing only on the most relevant
features.

3.Standard Scaler
 We use StandardScaler to standardize the features, ensuring they have
a mean of 0 and a standard deviation of 1.
 fit_transform is applied to the training set to fit the scaler and
transform the data.
 transform is then applied to the validation and test sets to apply the
same transformation.
Standardizing the features helps in faster convergence and better
performance of many machine learning algorithms.

4. Compute Class Weights


We use the function compute_class_weight to balance our dataset. This
function calculates weights for each class, ensuring that our model treats all
classes fairly, especially when the data is imbalanced. To identify these
classes, we rely on np.unique, which finds all unique elements in an array.
Together, these functions help our model perform better by paying equal
attention to all classes.
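A sketch of the class-weight computation (Keras expects a dictionary keyed by class label; the example values in the comment are illustrative only):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))   # e.g. {0: 0.6, 1: 3.5} for an imbalanced target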

5.Custom Loss Function: Focal Loss


We use the function BinaryCrossentropy to measure how well our model is
performing in binary classification problems. This standard loss function
helps us understand the difference between the predicted and actual values.
Meanwhile, tf.where from TensorFlow selects elements from y_pred or 1 - y_pred, depending on whether the true label y_true equals 1; this yields the predicted probability of the true class. Together, these tools ensure our model learns effectively and makes accurate predictions.

Explanation:

 Defines a custom loss function called Focal Loss.
 Focal Loss: Modifies the standard binary cross-entropy loss by adding a factor that down-weights easy examples and focuses more on hard examples.
Gamma = 2: Controls the strength of down-weighting. Higher values increase the effect.
Alpha = 0.25: Balances the importance of positive/negative examples.
 This choice helps in scenarios where there is a class imbalance, making the model focus more on hard-to-classify examples.
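A sketch of a binary focal loss with the stated gamma=2 and alpha=0.25; the notebook's exact formulation may differ:

import tensorflow as tf

def focal_loss(gamma=2.0, alpha=0.25):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)           # avoid log(0)
        p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)   # probability of the true class
        alpha_t = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha) # class-balancing factor
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss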

6.Model Building
We use `Sequential` to create a linear stack of layers in Keras, making it
straightforward to build a model layer by layer. Within this stack, `Dense`
layers act as fully connected layers in the neural network, allowing each
neuron to connect with every other neuron in the next layer. To ensure our
model trains effectively, we include `BatchNormalization`, which normalizes
the inputs and helps stabilize the learning process. Additionally, `Dropout` is
applied as a regularization technique to prevent overfitting by randomly
setting a fraction of input units to 0 during training, encouraging the network
to learn more robust features.
Explanation:

 Builds a sequential neural network with three hidden layers and an output layer.
 L2 Regularization: Adds a penalty to the loss function to prevent overfitting.
 ReLU Activation: Used in hidden layers for non-linearity.
 Sigmoid Activation: Used in the output layer for binary classification.
Layer Sizes (256, 128, 64): These sizes are chosen to balance model complexity and computational efficiency.
Dropout Rate (0.3): Drops 30% of the units to prevent overfitting.
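A sketch of the architecture described above; the L2 coefficient (0.001) and the input dimension are assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.regularizers import l2

model = Sequential([
    Dense(256, activation='relu', kernel_regularizer=l2(0.001),
          input_shape=(X_train.shape[1],)),   # input dimension = number of selected features
    BatchNormalization(),
    Dropout(0.3),
    Dense(128, activation='relu', kernel_regularizer=l2(0.001)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(1, activation='sigmoid'),            # binary output
])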

7.Compile the model with Adam Optimizer and learning rate scheduler
We use `Adam` as our optimizer, which intelligently adjusts learning rates for
each parameter to enhance model training efficiency. To ensure the learning
rate adapts when improvements stall, `ReduceLROnPlateau` comes into
play, reducing the rate when necessary. Additionally, `EarlyStopping` is
employed to halt training when a monitored metric ceases to improve,
preventing overfitting and saving computational resources.

Explanation:

 Compiles the model with the Adam optimizer, Focal Loss, and metrics like accuracy, precision, and recall.
 ReduceLROnPlateau: Reduces the learning rate by a factor of 0.2 if validation loss does not improve for 5 epochs.
 EarlyStopping: Stops training if validation loss does not improve for 10 epochs, and restores the best model weights.
Learning Rate (0.0005): Chosen to balance convergence speed and stability.
ReduceLROnPlateau & EarlyStopping: These callbacks help in fine-tuning the training process, preventing overfitting, and improving model performance.
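A sketch of the compile step and callbacks with the stated settings, reusing the focal_loss factory from step 5:

import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

model.compile(optimizer=Adam(learning_rate=0.0005),
              loss=focal_loss(gamma=2.0, alpha=0.25),
              metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5)
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)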

8.Model Training
Explanation:

 Trains the model on the training data for up to 200 epochs with a batch size of 32.
 Uses validation data to monitor performance and employs class weights to handle class imbalance.
Epochs (200): Chosen to ensure sufficient training time.
Batch Size (32): Balances memory efficiency and convergence speed.
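A sketch of the training call with the stated epochs, batch size, class weights, and callbacks:

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=200, batch_size=32,
    class_weight=class_weights,        # from step 4
    callbacks=[reduce_lr, early_stop],
)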

9.Model Evaluation
Training Metrics:
 Training Loss: 0.0172
o Low training loss means the model fits the training data well.
 Training Accuracy: 0.8619
o The model correctly predicts about 86.19% of the training
instances.
Validation Metrics:
 Validation Loss: 0.0172
o Low validation loss indicates good generalization to unseen data.
 Validation Accuracy: 0.8628
o The model correctly predicts about 86.28% of the validation
instances.
Test Metrics:
 Test Loss: 0.0175
o Test loss close to training and validation loss values shows good
generalization.
 Test Accuracy: 0.8591
o The model correctly predicts about 85.91% of the test instances.
Interpretation:
 Consistency: The similar loss and accuracy values across training,
validation, and test sets indicate the model is well-balanced and
generalizes well.
 Good Performance: An accuracy around 86% suggests the model is
performing well, with low loss values supporting effective learning and
minimal overfitting.

10.Predictions and Metrics


1) Confusion Matrix
A confusion matrix is a tool used to evaluate the performance of a
classification model. It provides a summary of the prediction results by
comparing the actual values with the predicted values.

Detailed Explanation of the Confusion Matrix:

1. Axes and Labels:
o The x-axis is labeled "Predicted" and represents the predicted class labels.
o The y-axis is labeled "Actual" and represents the actual class labels.
2. Classes:
o There are two classes in this confusion matrix: 0 and 1.
3. Cells and Values:
o True Negatives (TN): The top-left cell (Actual 0, Predicted 0) has a value of 4133, meaning 4133 instances of class 0 were correctly predicted as class 0.
o False Positives (FP): The top-right cell (Actual 0, Predicted 1) has a value of 0.
o False Negatives (FN): The bottom-left cell (Actual 1, Predicted 0) has a value of 678.
o True Positives (TP): The bottom-right cell (Actual 1, Predicted 1) has a value of 0.
o Note: Since both FP and TP are 0, the model predicted every test instance as class 0; the ~86% accuracy therefore mirrors the share of class-0 instances (4133 of 4811).
4. Color Scale:
o The color scale ranges from light blue to dark blue, indicating the frequency of the values in the cells. Darker blue represents higher values, while lighter blue represents lower values.

2) Classification Report
A classification report is a performance evaluation tool for classification models. It provides a comprehensive summary of the key metrics: Precision, Recall, F1-Score, and Support.
The model has an overall accuracy of 86%.

11. Plotting Results


1) Plot Accuracy
Axes:
 x-axis (Epochs): Shows the number of times the model has gone
through the entire training dataset.
 y-axis (Accuracy): Indicates the accuracy of the model.
Lines:
 Blue Line (Training Accuracy): Represents the model's accuracy on
the training data for each epoch.
 Orange Line (Validation Accuracy): Represents the model's
accuracy on the validation data for each epoch.
Trend Analysis:
1. Training Accuracy:
o Rapid Increase: The model learns quickly at first, shown by the
sharp rise in training accuracy.
o Plateau: The increase levels off, indicating the model has
achieved consistent performance with continued training.
2. Validation Accuracy:
o Gradual Increase: The model starts to generalize to unseen
data, shown by the steady rise in validation accuracy.
o Plateau: The increase levels off, indicating the model has
achieved stable performance with more epochs.
Interpretation:
 Training Accuracy Increase: Shows effective initial learning.
 Training Accuracy Plateau: Indicates the model's learning capacity
has been reached.
 Validation Accuracy Increase: Suggests the model starts to
generalize well.

2) Plot loss
Axes:
 x-axis (Epochs): Number of times the model has gone through the
training data.
 y-axis (Loss): The value of the loss function.
Lines:
 Blue Line (Training Loss): Loss of the model on the training set for
each epoch.
 Orange Line (Validation Loss): Loss of the model on the validation
set for each epoch.
Trend Analysis:
1. Training Loss:
o Sharp Decrease at First: Indicates the model is quickly
learning from the training data.
o Plateau: Further training does not significantly reduce the loss,
showing the model has learned most patterns.
2. Validation Loss:
o Initial Decrease: Indicates good generalization to unseen data.
o Plateau: Shows that the model has achieved stable and
consistent performance with continued training.
Interpretation:
 Rapid Decrease in Training Loss: Shows effective initial learning.
 Plateau in Training Loss: Indicates the model has achieved optimal
learning from the training data.
 Decrease in Validation Loss: Demonstrates the model's ability to
generalize well to new data.
 Plateau in Validation Loss: Indicates that the model has reached a
stable performance with the current data and architecture.
12. Additional Metrics
We use the roc_curve function to map out the Receiver Operating
Characteristic (ROC) curve, which helps us visualize how well our model
distinguishes between classes. Once the ROC curve is plotted, we turn to
roc_auc_score to compute the Area Under the ROC Curve (AUC). This score
gives us a single value that summarizes the overall performance of our
model, showing how good it is at distinguishing between the different
classes. Together, these functions provide a clear and comprehensive
assessment of our model's classification abilities.
Cohen's Kappa: Measures the agreement between true labels and predictions, adjusted for chance.
Log-Loss: Measures the performance of a classification model where the prediction input is a probability value.
Matthews Correlation Coefficient: Takes into account true and false positives and negatives, providing a balanced measure.
F2 Score: A variant of the F1 score that gives more weight to recall.
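A sketch of these metrics, assuming y_prob holds the MLP's predicted probabilities on the test set:

from sklearn.metrics import (roc_curve, roc_auc_score, cohen_kappa_score,
                             log_loss, matthews_corrcoef, fbeta_score)

y_prob = model.predict(X_test).ravel()      # predicted probabilities
y_pred = (y_prob > 0.5).astype(int)         # thresholded binary predictions

fpr, tpr, _ = roc_curve(y_test, y_prob)     # points for the ROC plot
print('AUC:', roc_auc_score(y_test, y_prob))
print("Cohen's Kappa:", cohen_kappa_score(y_test, y_pred))
print('Log-Loss:', log_loss(y_test, y_prob))
print('MCC:', matthews_corrcoef(y_test, y_pred))
print('F2 Score:', fbeta_score(y_test, y_pred, beta=2))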

13.Save Model
We use `model.save` to preserve our trained model by saving it to a file
named `MLP.keras`. This allows us to easily load the model in the future for
making predictions or conducting further analysis, without needing to retrain
the model from scratch. Saving the model ensures that all the effort put into
training is not lost and can be utilized efficiently whenever needed.
Random Forest Classifier
1.Data Splitting
First, we split the dataset into training/validation and test sets, ensuring 15%
is used for testing. Then, we further split the training/validation set so that
about 15% of the original data is used for validation. This way, both sets are
of sufficient size for evaluation. Using `random_state=42` ensures
reproducibility.

2.Feature Selection
We start with a list of `selected_features`, which includes specific indices or
column names of the features we want to focus on. Using this list, we extract
these features from our datasets (X_train, X_val, X_test). This way, we
concentrate on the most relevant features, simplifying our model and
improving its performance by reducing dimensionality and excluding
unnecessary data. By doing so, we ensure our model is both efficient and
effective.

3.Training the Random Forest Classifier


We trained a RandomForestClassifier with 100 trees (using
n_estimators=100) on our training data (X_train, y_train). Random Forests
are robust classifiers that handle many features well and are less likely to
overfit. Setting random_state=42 ensures our results are reproducible.
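A sketch of the training step as described:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print('Validation accuracy:', rf.score(X_val, y_val))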

4.Predictions and Accuracy


1. Training Accuracy: 0.86
o Meaning: The model correctly predicts 86% of training data
instances.
o Indication: The model has effectively learned patterns in the
training data.
2. Validation Accuracy: 0.86
o Meaning: The model correctly predicts 86% of validation data
instances.
o Indication: The model generalizes well to unseen data and is
not overfitting.
3. Test Accuracy: 0.86
o Meaning: The model correctly predicts 86% of test data
instances.
o Indication: The model maintains its performance on completely
new data.
Interpretation:
 Consistent Accuracy: Training, validation, and test accuracies are all
0.86, showing the model is well-balanced and generalizes well.
 Good Performance: An accuracy of 0.86 across all datasets suggests
that the model performs well in predicting the target class correctly in
86% of the cases.

5.Predictions and Metrics


1) Classification Report
A classification report provides a detailed summary of a classification
model's performance, including precision, recall, F1-score, and support for
each class.

2) Confusion Matrix
A confusion matrix evaluates the performance of a classification model by
comparing actual vs. predicted values, showing the counts of true positives,
false positives, true negatives, and false negatives. This helps identify where
the model is making errors and how well it distinguishes between classes.
Axes and Labels:
 x-axis (Predicted): Shows predicted class labels.
 y-axis (Actual): Shows actual class labels.
Classes:
 Two classes: 0 (negative) and 1 (positive).
Cells and Values:
 TN: 4130 instances correctly predicted as class 0.
 FP: 3 instances of class 0 predicted as class 1.
 FN: 673 instances of class 1 predicted as class 0.
 TP: 5 instances correctly predicted as class 1.
Color Scale:
 From light blue (lower values) to dark blue (higher values).
Interpretation:
 The model excels in predicting the negative class (0), with a high true-negative count.
 With only 5 of 678 positive instances identified, there is substantial room for improvement in predicting the positive class (1).

3) ROC Curve and AUC Score
The ROC curve plots the true positive rate against the false positive rate,
showing how well a model distinguishes between classes across different
threshold values. The AUC score, which stands for Area Under the Curve,
quantifies this performance; a higher AUC score indicates better model
performance in distinguishing between classes.
6.Additional Metrics
1) Cohen's Kappa:
 Function: Computes Cohen's Kappa score.
 Validation Set: Computes the score for validation set.
 Test Set: Computes the score for test set.
 Purpose: Measures agreement between predictions and true labels,
adjusting for chance, useful for imbalanced classes.
2) Matthews Correlation Coefficient (MCC):
 Function: Computes the MCC score.
 Validation Set: Computes MCC for validation set.
 Test Set: Computes MCC for test set.
 Purpose: Provides a balanced measure that considers all categories of
the confusion matrix (TP, TN, FP, FN), giving a more informative
assessment than accuracy alone.
Conclusion
Both the Multilayer Perceptron (MLP) and Random Forest models achieved an
accuracy of 86% on the training, validation, and test sets.
Comparison:
 MLP Accuracy: 86%
 Random Forest Accuracy: 86%
What This Means:
 Both models perform equally well in terms of overall accuracy,
predicting 86% of instances correctly.
 The consistent accuracy across training, validation, and test sets for
both models indicates good generalization, meaning they are not
overfitting and are likely to perform well on new, unseen data.
 Despite different underlying mechanisms (deep learning for MLP and
ensemble learning for Random Forest), both models demonstrate
robust and reliable performance in this particular task.
This comparison shows that both MLP and Random Forest are effective
choices, and selecting one over the other might depend on specific use
cases, computational resources, and preferences in model interpretability.
