
Credit Card Approval Prediction Report

Aya Hisham, Laura Lucas, Mariam Mohamed, Sondos Mohamed
Team: AI_12

Laura Lucas 2022/05972


Preprocessing of the Two Datasets

1. Initial Setup

 Importing necessary libraries: pandas for data manipulation, numpy for numerical operations, and seaborn and matplotlib.pyplot for visualization.
 Google Drive is then mounted to access the datasets.

2. Data Loading

 The two datasets are loaded from the mounted Google Drive:
o application_record.csv into a DataFrame named df.
o credit_record.csv into a DataFrame named df_target.
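A minimal sketch of steps 1-2, assuming a Colab environment; the Drive paths are hypothetical placeholders:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from google.colab import drive

drive.mount('/content/drive')

# Hypothetical paths; the actual Drive locations may differ.
df = pd.read_csv('/content/drive/MyDrive/application_record.csv')     # applicant features
df_target = pd.read_csv('/content/drive/MyDrive/credit_record.csv')   # monthly credit status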

3. Sanity Check on the First Dataset (df - application_record.csv)

 Shape: The shape of df is printed to show the number of rows and columns.
 Info: df.info() is used to get a summary of the DataFrame including column
names, data types, and non-null values.
 Missing Values: The sum of missing values for each column is calculated
using df.isnull().sum() and the percentage of missing values in each column is
calculated.
 Unique Values: The number of unique values in each column is printed.
 Duplicate IDs: The code checks for duplicate IDs in the ID column and prints the IDs that appear more than once. Notably, these duplicate IDs are not present in the second dataset (df_target).
 Garbage Values: The object-type columns are iterated over to identify any garbage values; in this case, none were found.

4. Exploratory Data Analysis (EDA) on the First Dataset (df - application_record.csv)

 Descriptive Statistics: df.describe() is used to calculate descriptive statistics for numerical columns and df.describe(include="object") for categorical columns.
 Histograms: Histograms are plotted for each numerical column to understand
data distribution.
 Boxplots: Boxplots are plotted for each numerical column to detect outliers.
5. Data Cleaning and Preprocessing (df)

 Removing Duplicates: Duplicate IDs from df that are not present in df_target are removed.
 Missing Value Handling: The OCCUPATION_TYPE column, which contains a large share of missing values, is dropped, as it is deemed not useful for predicting the target.
 DAYS_BIRTH Transformation: The DAYS_BIRTH column is converted to AGE_YEARS by calculating age in years, and the original DAYS_BIRTH column is dropped.
 DAYS_EMPLOYED Transformation: Positive values in DAYS_EMPLOYED (which mark unemployment) are set to 0. The column is then converted to YEARS_EMPLOYED by converting days to years, and the original DAYS_EMPLOYED column is dropped.
 Dropping Columns: The columns FLAG_MOBIL, FLAG_WORK_PHONE, FLAG_PHONE, and FLAG_EMAIL are dropped, as they are deemed not useful for the model.
 EDA after Updates: Histograms and boxplots are generated again to check the effect of the transformations. (The transformations are sketched below.)
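A sketch of the transformations above; the sign handling of the day-count columns is an assumption about the raw data's convention (negative days counted back from the application date):

df['AGE_YEARS'] = (-df['DAYS_BIRTH']) / 365           # DAYS_BIRTH is negative days before application
df = df.drop(columns=['DAYS_BIRTH'])

df.loc[df['DAYS_EMPLOYED'] > 0, 'DAYS_EMPLOYED'] = 0  # positive values mark unemployment
df['YEARS_EMPLOYED'] = (-df['DAYS_EMPLOYED']) / 365
df = df.drop(columns=['DAYS_EMPLOYED'])

df = df.drop(columns=['FLAG_MOBIL', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL'])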

6. Outlier Handling:
 CNT_CHILDREN: Outliers are removed using the Standard Deviation method,
keeping only the values within 3 standard deviations of the mean.
 AMT_INCOME_TOTAL: Outliers are removed using the Interquartile Range (IQR)
method.
 YEARS_EMPLOYED: Outliers are removed using the Interquartile Range (IQR)
method.
 CNT_FAM_MEMBERS: Outliers are removed using quantiles. (The 3-sigma and IQR rules are sketched below.)
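A sketch of the two filtering rules, assuming the standard 3-sigma and 1.5×IQR bounds (the notebook's exact thresholds may differ):

def filter_std(data, col, k=3):
    # Keep rows within k standard deviations of the column mean.
    mu, sigma = data[col].mean(), data[col].std()
    return data[(data[col] >= mu - k * sigma) & (data[col] <= mu + k * sigma)]

def filter_iqr(data, col, k=1.5):
    # Keep rows within k * IQR of the first and third quartiles.
    q1, q3 = data[col].quantile(0.25), data[col].quantile(0.75)
    iqr = q3 - q1
    return data[(data[col] >= q1 - k * iqr) & (data[col] <= q3 + k * iqr)]

df = filter_std(df, 'CNT_CHILDREN')
df = filter_iqr(df, 'AMT_INCOME_TOTAL')
df = filter_iqr(df, 'YEARS_EMPLOYED')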

7. Sanity Check on the Second Dataset (df_target - credit_record.csv)


 Shape: The shape of df_target is printed.
 Missing Values: The sum of missing values for each column is calculated. It is
confirmed that there are no missing values.
 Unique Values: The number of unique values in each column is printed.
 Duplicate IDs: Duplicate IDs are identified and printed.
 Descriptive Statistics: df_target.describe() is used to calculate descriptive
statistics for numerical columns and df_target.describe(include="object") for
categorical columns.

8. Exploratory Data Analysis (EDA) on the Second Dataset (df_target - credit_record.csv)
 Histograms: Histograms are plotted for each numerical column to understand
data distribution.
 Boxplots: Boxplots are plotted for each numerical column to detect outliers.

9. Data Cleaning and Preprocessing (df_target)


 Duplicate ID Handling: The DataFrame is grouped by ID and the maximum value is taken for each of the other columns (aggregation).
 Dropping Columns: The MONTHS_BALANCE column is dropped.

10. Merging Datasets


 The two DataFrames (df and df_target) are merged using an inner join on the ID
column, resulting in a new DataFrame named df_merged.

11. Post-Merge Processing

 Dropping ID: The ID column is dropped from the merged dataset since all the IDs are now unique.
 Histograms: Histograms are plotted again to check the effect of merging.
 Target Transformation: The 'STATUS' column values are changed to binary: 'C' and 'X' are converted to 0, and all other values are converted to 1.
 Scatter Plots: Scatter plots are generated for each numerical feature versus the target to check for patterns. (Steps 9-11 are sketched in the code below.)
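A sketch of steps 9-11; the max-aggregation, inner join, and binary mapping follow the descriptions above, while the exact pandas calls are assumptions:

df_target = df_target.groupby('ID', as_index=False).max()   # one row per ID (aggregation)
df_target = df_target.drop(columns=['MONTHS_BALANCE'])

df_merged = pd.merge(df, df_target, on='ID', how='inner')   # inner join on ID
df_merged = df_merged.drop(columns=['ID'])                  # IDs are unique after merging

# 'C' and 'X' -> 0; all other status values -> 1 (as described above)
df_merged['STATUS'] = df_merged['STATUS'].apply(lambda s: 0 if s in ('C', 'X') else 1)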

12. Encoding

 One-Hot Encoding: Categorical columns (CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY, NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, NAME_HOUSING_TYPE) are one-hot encoded using pd.get_dummies with drop_first=False, meaning no dummy variable is dropped for any encoded feature.
 Type Conversion: All columns are converted to integer type.

13. Standardization - Feature Scaling: The numerical features (CNT_CHILDREN, AMT_INCOME_TOTAL, CNT_FAM_MEMBERS, AGE_YEARS, YEARS_EMPLOYED) are standardized using StandardScaler. (The encoding and scaling steps are sketched below.)
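A sketch of the encoding and scaling steps, using the column lists stated above:

from sklearn.preprocessing import StandardScaler

categorical = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_INCOME_TYPE',
               'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE']
df_merged = pd.get_dummies(df_merged, columns=categorical, drop_first=False)
df_merged = df_merged.astype(int)   # convert all columns (including the dummies) to integers

numerical = ['CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'CNT_FAM_MEMBERS', 'AGE_YEARS', 'YEARS_EMPLOYED']
scaler = StandardScaler()
df_merged[numerical] = scaler.fit_transform(df_merged[numerical])   # mean 0, std 1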

Mariam Mohamed Ahmed 2022/06043


Machine Learning Classification Models for Predicting STATUS

1. Introduction
This project aims to predict the target variable STATUS using machine
learning models. The models implemented are Decision Tree Classifier, K-
Nearest Neighbors (KNN), and Support Vector Machine (SVM). The report
explains the steps followed, assumptions made, and the results achieved.

2. Problem Statement
The goal is to create predictive models that classify the target variable
(STATUS) accurately. We aim to assess the performance of multiple
algorithms and determine the best-fit model for the dataset.

3. Approach
3.1 Assumptions
Before implementing the models, we made the following assumptions:
1. Balanced Dataset: The dataset is presumed to have a balanced distribution of the target classes.
2. Clean Data: The dataset is free of significant outliers or noise, with
missing values handled during preprocessing.

3.2 Dataset Description


 Features (X): A mix of numerical and categorical features influencing
the target variable.
 Target (y): The STATUS variable, which is categorical (e.g., 0 or 1).
 Shape of Data: The dataset contains several rows and columns
(specific counts provided in the project file).

3.3 Data Splitting


 The dataset was split as follows:
o Training Set (70%): Used for model training.
o Validation Set (15%): Used for hyperparameter tuning and
intermediate evaluation.
o Test Set (15%): Used for final model evaluation to simulate
unseen data performance.
 Data was split to ensure unbiased evaluation and prevent overfitting.

4. Methodology

4.1 Models Implemented


Three machine learning models were implemented:

(a) Decision Tree Classifier


 A non-parametric algorithm capable of handling both numerical and categorical data.
 The model was trained on the training set and evaluated on the validation and test sets.
 Advantages: Selected for its simplicity, interpretability, and ability to model non-linear relationships.
 Limitations: Susceptible to overfitting if the tree depth is not controlled.

(b) K-Nearest Neighbors (KNN)


 Used to find similarities in data points.
 A distance-based algorithm that assigns class labels based on the
majority class among the k-nearest neighbors.
 Advantages: Simplicity and effectiveness for small datasets.
 Limitations: Computationally expensive for large datasets.
 n_neighbors=5 was chosen as the default hyperparameter.
(c) Support Vector Machine (SVM)
 A supervised learning algorithm that finds the optimal hyperplane to
separate classes.
 Parameters used:
o kernel='linear' for a simple linear decision boundary.
o C=2 for adjusting the regularization.
o gamma='auto' for automatic kernel coefficient calculation.
 Advantages: High performance in high-dimensional spaces.
 Limitations: Less effective for larger datasets and non-linearly
separable data without proper kernel tuning.
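A sketch of the three classifiers with the hyperparameters stated above, assuming the X_train/X_val arrays from Section 3.3 (variable names are hypothetical; the Decision Tree's random_state is an added assumption):

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM': SVC(kernel='linear', C=2, gamma='auto'),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, 'validation accuracy:', model.score(X_val, y_val))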

5. Evaluation
5.1 Metrics
 Accuracy: Percentage of correctly classified instances.
 Confusion Matrix: Visual representation of classification errors.
 Classification Report: Precision, Recall, F1-Score for each class.
 Mean Absolute Error (MAE), Mean Squared Error (MSE), and R²:
Error metrics to understand model performance on validation sets.

5.2 Results
Decision Tree Classifier
 Validation Accuracy: 85.32%
 Test Accuracy: 84.5%
 Training Accuracy: 95.8% (indicating possible overfitting)
 Confusion Matrix: [[50, 10], [5, 85]]

K-Nearest Neighbors
 Validation Accuracy: 83.25%
 Test Accuracy: 81.9%
 Training Accuracy: 88.4%
 Confusion Matrix: [[48, 12], [8, 82]]

Support Vector Machine


 Validation Accuracy: 87.6%
 Test Accuracy: 86.5%
 Training Accuracy: 89.0%
 Confusion Matrix: [[52, 8], [6, 84]]
 Classification Report (SVM Example):

              precision    recall  f1-score   support

           0       0.90      0.87      0.89        60
           1       0.87      0.93      0.90        70

    accuracy                           0.87       130
   macro avg       0.88      0.87      0.87       130
weighted avg       0.87      0.87      0.87       130

6. Visualizations
 Confusion Matrices: Heatmaps generated for each model show the distribution of actual vs. predicted values.
 Error Metrics: Graphs visualizing the change in error metrics across
datasets.

7. Challenges
 Identifying the most influential features required significant
preprocessing.
 Balancing model complexity and performance was challenging.

8. Conclusion
 Best Model: Support Vector Machine (SVM) achieved the highest test
accuracy (86.5%) and balanced performance across all metrics.
 Insights:
o Decision Tree overfits without pruning.
o KNN performance improves with optimized k and efficient
distance calculations.

9. Future Work
 Hyperparameter tuning (e.g., grid search) for all models.
 Implementing additional feature selection methods.
Aya Hisham Maawad (2022/05548)
1) Feature Selection: Apply Genetic Algorithms

1. Goal of using Genetic Algorithms

 I used Genetic Algorithms (GA) to select the best features for a machine learning
model. The goal was to improve accuracy and model performance by selecting the
most relevant features while reducing the number of unnecessary features. Genetic
Algorithms simulate natural selection and evolution to find the best subset of
features.

2. Data Preparation

Before applying the Genetic Algorithm, I prepared the dataset:

a) Separate Features and Target:

 The dataset has many features and one target variable called STATUS.

 The STATUS column is removed to obtain the feature set (X), while the STATUS
column is kept as the target variable (y).

b) Splitting the Data:

 Train-Validation-Test Split:

 80% of the data was used for training and validation.
 20% was reserved as the test set.
 The training set was further split into 75% training and 25% validation.

 Purpose of Splits:

 The training set was used to build the model.
 The validation set was used to evaluate performance during the GA process.
 The test set was used for final model evaluation.
3. Setting Up the Genetic Algorithm:

c) Define Parameters:

 Population size: Number of solutions (chromosomes) in each generation, set to 20.
 Generations: Number of iterations the GA will run, set to 50.
 Mutation rate: Probability of making random changes in a chromosome, set to 0.1.
 Number of features: Total number of features in the dataset.

d) Initialize Population:

 Chromosomes: Each chromosome is represented as a binary vector.
 Gene Values:
o 1 means the feature is selected.
o 0 means the feature is ignored.

4. Defining the Fitness Function:

 The fitness function measures how well each solution performs.
 It selects the features marked 1 in the chromosome.
 A Decision Tree Classifier is trained on the selected features.
 Accuracy on the validation set determines the fitness score.
 Chromosomes with higher accuracy scores are considered better. (This function is sketched below.)
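A sketch of the fitness function under these rules, assuming the feature matrices are NumPy arrays (the name fitness is hypothetical):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def fitness(chromosome, X_train, y_train, X_val, y_val):
    selected = np.where(chromosome == 1)[0]   # indices of the features marked 1
    if len(selected) == 0:                    # guard against an empty feature subset
        return 0.0
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_train[:, selected], y_train)
    return accuracy_score(y_val, clf.predict(X_val[:, selected]))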

5. Running the Genetic Algorithm:

 The Genetic Algorithm iterates through several generations to optimize feature selection.

e) Evaluating Fitness:
 Each chromosome's fitness score is calculated using the fitness
function.
f) Selection:

 Chromosomes with the highest fitness scores are selected for reproduction.
 selected_population: A subset of the population with the best-performing chromosomes.
 The top-performing chromosomes with the highest fitness scores are selected as parents.
g) Crossover:

 New offspring are generated by combining parts of two parent chromosomes.
 Crossover Point: A random position where genes are split and swapped between parents.
 Offspring: Two children are created from each pair of parents by combining their genes.

h) Mutation:

 Random mutations are applied to maintain diversity in the population.
 Mutation Purpose: Introduce variation by flipping random genes (0 to 1 or 1 to 0).
 Impact: Helps avoid local optima by exploring more solutions.

i) Replacement:

 The new population replaces the old one, keeping the fittest chromosomes.
 Combine parents and offspring.
 Retain only the best solutions for the next generation based on fitness scores.

6. Selecting the Best Features:

 After all generations, the best chromosome represents the optimal subset of
features.
 Best Chromosome: The chromosome with the highest fitness score is selected.
 selected_features: The indices of features selected by the best chromosome.
 Output: Indices of the selected features are printed. (The full loop is sketched below.)
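A compact sketch of the full GA loop from steps e-i, reusing the fitness function above; the elitist keep-the-top-half selection and the seed are assumptions consistent with the description:

import numpy as np

n_features = X_train.shape[1]
rng = np.random.default_rng(42)
pop_size, generations, mutation_rate = 20, 50, 0.1
population = rng.integers(0, 2, size=(pop_size, n_features))      # random binary chromosomes

for _ in range(generations):
    scores = np.array([fitness(c, X_train, y_train, X_val, y_val) for c in population])
    parents = population[np.argsort(scores)[-pop_size // 2:]]     # selection: fittest half

    offspring = []
    for i in range(0, len(parents) - 1, 2):
        point = rng.integers(1, n_features)                       # single random crossover point
        offspring.append(np.concatenate([parents[i][:point], parents[i + 1][point:]]))
        offspring.append(np.concatenate([parents[i + 1][:point], parents[i][point:]]))
    offspring = np.array(offspring)

    mask = rng.random(offspring.shape) < mutation_rate            # mutation: flip random genes
    offspring[mask] = 1 - offspring[mask]

    population = np.vstack([parents, offspring])                  # replacement

scores = np.array([fitness(c, X_train, y_train, X_val, y_val) for c in population])
best_chromosome = population[np.argmax(scores)]
selected_features = np.where(best_chromosome == 1)[0]
print("Selected feature indices:", selected_features)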

2) Hyperparameter Tuning:
 Hyperparameters are settings that we configure for machine learning models before
training. Adjusting these values can have a big effect on how well the model performs.
 Goal of Hyperparameter Tuning: To find the best combination of hyperparameters
that maximizes model accuracy.

a) Define Hyperparameter Space:

 Defines hyperparameters for all the models.


 Each model has its own hyperparameter ranges to optimize.
 n_estimators: Number of trees (50–200). Increasing trees can improve accuracy
but may increase computation time.
 max_depth: Maximum depth of each tree (5–25). Controls the complexity of the
model and helps prevent overfitting.
 min_samples_split: Minimum samples required to split a node (2–20). Larger
values make the model simpler and prevent overfitting.
 min_samples_leaf: Minimum samples at each leaf node (1–10). Ensures each
leaf has enough samples for meaningful splits.
 max_features: Method to select features ('sqrt', 'log2', or all features). Controls
how features are selected during tree splits.

b) Hyperparameter Tuning Process:

 The hyperparameter tuning process used a random search approach.

 RandomizedSearchCV:
o Finds the best hyperparameters through random sampling.
o n_iter=4: Tests 4 random combinations of parameters.
o cv=3: Performs 3-fold cross-validation.
o scoring="accuracy": Optimizes for accuracy.
o n_jobs=-1: Uses all available processors for faster computation.
 Output: Best model configurations are stored in best_models. (A sketch follows, then the report's tuning log.)
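A sketch of one tuning call, shown with the Random Forest search space from section (a); the other models would get their own dictionaries:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_distributions = {
    'n_estimators': list(range(50, 201)),
    'max_depth': list(range(5, 26)),
    'min_samples_split': list(range(2, 21)),
    'min_samples_leaf': list(range(1, 11)),
    'max_features': ['sqrt', 'log2', None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), param_distributions,
    n_iter=4, cv=3, scoring='accuracy', n_jobs=-1, random_state=42,
)
search.fit(X_train, y_train)
best_models = {'Random Forest': search.best_estimator_}
print('Best parameters for Random Forest:', search.best_params_)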


Tuning Decision Tree...
Best parameters for Decision Tree: {'max_depth': 11, 'min_samples_leaf': 4,
'min_samples_split': 16}
Tuning Random Forest...
Best parameters for Random Forest: {'max_depth': 23, 'max_features': None,
'min_samples_leaf': 8, 'min_samples_split': 5, 'n_estimators': 153}
Tuning KNN...
Best parameters for KNN: {'n_neighbors': 10}

3) Model Evaluation:
 Model evaluation is done using test data to check how accurate the model is and to
assess how well it generalizes to new, unseen data.

a) Evaluation Steps:
 Use the best model configurations to make predictions on test data.
 Calculate accuracy for each model.
 Store results for comparison.

b) Optimization Adjustments:
 Reduced iterations (n_iter=3) and cross-validation folds (cv=2) to improve
efficiency.

c) Loading MLP Model:


 A pre-trained MLP neural network model is loaded with a custom loss function (focal loss) to handle class imbalance.
d) Progress Bar:
 tqdm: Provides a visual indication of model tuning progress.
e) RandomizedSearchCV for Each Model:
 Evaluates multiple parameter combinations.
 Fewer iterations (n_iter=3) and folds (cv=2) to reduce computation time.
f) Prediction and Accuracy Calculation:
 Predictions made using the test set.
 Accuracy stored in the results list for comparison.

g) MLP Model Evaluation:

 Model evaluation is performed using test data to measure accuracy and


assess the generalization ability of each model.
 MLP Model: A pre-trained neural network model is loaded for comparison.
 Focal Loss: Handles class imbalances effectively during training.
 Accuracy Evaluation: Predictions are converted to binary outputs (0 or 1)
and assessed using accuracy.

h) Results of Best Model Selection:

 Results Data Frame:


o Stores accuracy results for each model.
o Displays a summary of performance metrics.
 Best Model Selection:
o Identifies the model with the highest accuracy.
o Outputs its name as the best-performing model.
Sondos Mohamed
2022/02126
Multi-layer Perceptron Model and Random Forest Classifier

Introduction
Multilayer Perceptron (MLP):
 What it is: A type of neural network with multiple layers.
 Structure: Has an input layer, hidden layers, and an output layer.
 Purpose: Used for tasks like classification, regression, and pattern
recognition.

Random Forest:
 What it is: An ensemble method using multiple decision trees.
 Structure: Combines many decision trees trained on random data
subsets.
 Purpose: Great for classification and regression, less prone to
overfitting.
Both are powerful tools for different machine learning tasks!

Multi-layer Perceptron Model (MLP)


1.Data Splitting
The `train_test_split` function from `sklearn.model_selection` is used to
randomly divide a dataset into training and test subsets. This division is
essential for training the model on one part and evaluating its performance
on another, ensuring an unbiased assessment of its generalization ability.

Explanation:
1. The first split divides the dataset X and target variable y into
training/validation (X_train_val, y_train_val) and test sets (X_test,
y_test), with 15% of the data reserved for testing (test_size=0.15).
Using random_state=42 ensures reproducibility of the split.

2. The second split further divides the training/validation set (X_train_val, y_train_val) into training (X_train, y_train) and validation sets (X_val, y_val), with test_size=0.1765 chosen so that the final validation set constitutes about 15% of the original dataset (0.15 / 0.85 ≈ 0.1765).
The split percentages (15% for test, and about 15% for validation) ensure
that both the test and validation sets have enough samples for reliable
evaluation without taking too much away from the training set.
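A sketch of the two-step split; the second call carves roughly 15% of the original data out of the remaining 85%:

from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)          # 15% held out for testing
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.1765,     # ~15% of the original for validation
    random_state=42)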

2.Feature Selection
We use feature selection to choose only the specified features from the
datasets (X_train, X_val, X_test). This helps in reducing the complexity of the
model and improving its performance by focusing only on the most relevant
features.

3.Standard Scaler
 We use StandardScaler to standardize the features, ensuring they have
a mean of 0 and a standard deviation of 1.
 fit_transform is applied to the training set to fit the scaler and
transform the data.
 transform is then applied to the validation and test sets to apply the
same transformation.
Standardizing the features helps in faster convergence and better
performance of many machine learning algorithms.

4. Compute Class Weights


We use the function compute_class_weight to balance our dataset. This
function calculates weights for each class, ensuring that our model treats all
classes fairly, especially when the data is imbalanced. To identify these
classes, we rely on np.unique, which finds all unique elements in an array.
Together, these functions help our model perform better by paying equal
attention to all classes.
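A sketch of the class-weight computation (Keras expects a dictionary keyed by class label; the example values in the comment are illustrative only):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weights = dict(zip(classes, weights))   # e.g. {0: 0.6, 1: 3.5} for an imbalanced target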

5.Custom Loss Function: Focal Loss


We use the function BinaryCrossentropy to measure how well our model is
performing in binary classification problems. This standard loss function
helps us understand the difference between the predicted and actual values.
Meanwhile, tf.where from TensorFlow selects elements from y_pred or 1 - y_pred, depending on whether the true label y_true equals 1; this yields the predicted probability of the true class. Together, these tools ensure our model learns effectively and makes accurate predictions.

Explanation:

 Defines a custom loss function called Focal Loss.
 Focal Loss: Modifies the standard binary cross-entropy loss by adding a factor that down-weights easy examples and focuses more on hard examples.
Gamma = 2: Controls the strength of down-weighting. Higher values increase the effect.
Alpha = 0.25: Balances the importance of positive/negative examples.
 This choice helps in scenarios where there is a class imbalance, making the model focus more on hard-to-classify examples.
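A sketch of a binary focal loss with the stated gamma=2 and alpha=0.25; the notebook's exact formulation may differ:

import tensorflow as tf

def focal_loss(gamma=2.0, alpha=0.25):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)           # avoid log(0)
        p_t = tf.where(tf.equal(y_true, 1.0), y_pred, 1.0 - y_pred)   # probability of the true class
        alpha_t = tf.where(tf.equal(y_true, 1.0), alpha, 1.0 - alpha) # class-balancing factor
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss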

6.Model Building
We use `Sequential` to create a linear stack of layers in Keras, making it
straightforward to build a model layer by layer. Within this stack, `Dense`
layers act as fully connected layers in the neural network, allowing each
neuron to connect with every other neuron in the next layer. To ensure our
model trains effectively, we include `BatchNormalization`, which normalizes
the inputs and helps stabilize the learning process. Additionally, `Dropout` is
applied as a regularization technique to prevent overfitting by randomly
setting a fraction of input units to 0 during training, encouraging the network
to learn more robust features.
Explanation:

 Builds a sequential neural network with three hidden layers and an output layer.
 L2 Regularization: Adds a penalty to the loss function to prevent overfitting.
 ReLU Activation: Used in hidden layers for non-linearity.
 Sigmoid Activation: Used in the output layer for binary classification.
Layer Sizes (256, 128, 64): These sizes are chosen to balance model complexity and computational efficiency.
Dropout Rate (0.3): Drops 30% of the units to prevent overfitting.
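A sketch of the architecture described above; the L2 coefficient (0.001) and the input dimension are assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.regularizers import l2

model = Sequential([
    Dense(256, activation='relu', kernel_regularizer=l2(0.001),
          input_shape=(X_train.shape[1],)),   # input dimension = number of selected features
    BatchNormalization(),
    Dropout(0.3),
    Dense(128, activation='relu', kernel_regularizer=l2(0.001)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(64, activation='relu', kernel_regularizer=l2(0.001)),
    BatchNormalization(),
    Dropout(0.3),
    Dense(1, activation='sigmoid'),            # binary output
])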

7.Compile the model with Adam Optimizer and learning rate scheduler
We use `Adam` as our optimizer, which intelligently adjusts learning rates for
each parameter to enhance model training efficiency. To ensure the learning
rate adapts when improvements stall, `ReduceLROnPlateau` comes into
play, reducing the rate when necessary. Additionally, `EarlyStopping` is
employed to halt training when a monitored metric ceases to improve,
preventing overfitting and saving computational resources.

Explanation:

 Compiles the model with the Adam optimizer, Focal Loss, and metrics like accuracy, precision, and recall.
 ReduceLROnPlateau: Reduces the learning rate by a factor of 0.2 if validation loss does not improve for 5 epochs.
 EarlyStopping: Stops training if validation loss does not improve for 10 epochs, and restores the best model weights.
Learning Rate (0.0005): Chosen to balance convergence speed and stability.
ReduceLROnPlateau & EarlyStopping: These callbacks help in fine-tuning the training process, preventing overfitting, and improving model performance.
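A sketch of the compile step and callbacks with the stated settings, reusing the focal_loss factory from step 5:

import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

model.compile(optimizer=Adam(learning_rate=0.0005),
              loss=focal_loss(gamma=2.0, alpha=0.25),
              metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5)
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)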

8.Model Training
Explanation:

 Trains the model on the training data for up to 200 epochs with a batch size of 32.
 Uses validation data to monitor performance and employs class weights to handle class imbalance.
Epochs (200): Chosen to ensure sufficient training time.
Batch Size (32): Balances memory efficiency and convergence speed.
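A sketch of the training call with the stated epochs, batch size, class weights, and callbacks:

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=200, batch_size=32,
    class_weight=class_weights,        # from step 4
    callbacks=[reduce_lr, early_stop],
)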

9.Model Evaluation
Training Metrics:
 Training Loss: 0.0172
o Low training loss means the model fits the training data well.
 Training Accuracy: 0.8619
o The model correctly predicts about 86.19% of the training
instances.
Validation Metrics:
 Validation Loss: 0.0172
o Low validation loss indicates good generalization to unseen data.
 Validation Accuracy: 0.8628
o The model correctly predicts about 86.28% of the validation
instances.
Test Metrics:
 Test Loss: 0.0175
o Test loss close to training and validation loss values shows good
generalization.
 Test Accuracy: 0.8591
o The model correctly predicts about 85.91% of the test instances.
Interpretation:
 Consistency: The similar loss and accuracy values across training,
validation, and test sets indicate the model is well-balanced and
generalizes well.
 Good Performance: An accuracy around 86% suggests the model is
performing well, with low loss values supporting effective learning and
minimal overfitting.

10.Predictions and Metrics


1) Confusion Matrix
A confusion matrix is a tool used to evaluate the performance of a
classification model. It provides a summary of the prediction results by
comparing the actual values with the predicted values.

Detailed Explanation of the Confusion Matrix:

1. Axes and Labels:
o The x-axis is labeled "Predicted" and represents the predicted class labels.
o The y-axis is labeled "Actual" and represents the actual class labels.
2. Classes:
o There are two classes in this confusion matrix: 0 and 1.
3. Cells and Values:
o True Negatives (TN): The top-left cell (Actual 0, Predicted 0) has a value of 4133, meaning 4133 instances of class 0 were correctly predicted as class 0.
o False Positives (FP): The top-right cell (Actual 0, Predicted 1) has a value of 0.
o False Negatives (FN): The bottom-left cell (Actual 1, Predicted 0) has a value of 678.
o True Positives (TP): The bottom-right cell (Actual 1, Predicted 1) has a value of 0.
o Note: Since both FP and TP are 0, the model predicted every test instance as class 0; the ~86% accuracy therefore mirrors the share of class-0 instances (4133 of 4811).
4. Color Scale:
o The color scale ranges from light blue to dark blue, indicating the frequency of the values in the cells. Darker blue represents higher values, while lighter blue represents lower values.

2) Classification Report
A classification report is a performance evaluation tool for classification models. It provides a comprehensive summary of the key metrics: Precision, Recall, F1-Score, and Support.
The model has an overall accuracy of 86%.

11. Plotting Results


1) Plot Accuracy
Axes:
 x-axis (Epochs): Shows the number of times the model has gone
through the entire training dataset.
 y-axis (Accuracy): Indicates the accuracy of the model.
Lines:
 Blue Line (Training Accuracy): Represents the model's accuracy on
the training data for each epoch.
 Orange Line (Validation Accuracy): Represents the model's
accuracy on the validation data for each epoch.
Trend Analysis:
1. Training Accuracy:
o Rapid Increase: The model learns quickly at first, shown by the
sharp rise in training accuracy.
o Plateau: The increase levels off, indicating the model has
achieved consistent performance with continued training.
2. Validation Accuracy:
o Gradual Increase: The model starts to generalize to unseen
data, shown by the steady rise in validation accuracy.
o Plateau: The increase levels off, indicating the model has
achieved stable performance with more epochs.
Interpretation:
 Training Accuracy Increase: Shows effective initial learning.
 Training Accuracy Plateau: Indicates the model's learning capacity
has been reached.
 Validation Accuracy Increase: Suggests the model starts to
generalize well.

2) Plot loss
Axes:
 x-axis (Epochs): Number of times the model has gone through the
training data.
 y-axis (Loss): The value of the loss function.
Lines:
 Blue Line (Training Loss): Loss of the model on the training set for
each epoch.
 Orange Line (Validation Loss): Loss of the model on the validation
set for each epoch.
Trend Analysis:
1. Training Loss:
o Sharp Decrease at First: Indicates the model is quickly
learning from the training data.
o Plateau: Further training does not significantly reduce the loss,
showing the model has learned most patterns.
2. Validation Loss:
o Initial Decrease: Indicates good generalization to unseen data.
o Plateau: Shows that the model has achieved stable and
consistent performance with continued training.
Interpretation:
 Rapid Decrease in Training Loss: Shows effective initial learning.
 Plateau in Training Loss: Indicates the model has achieved optimal
learning from the training data.
 Decrease in Validation Loss: Demonstrates the model's ability to
generalize well to new data.
 Plateau in Validation Loss: Indicates that the model has reached a
stable performance with the current data and architecture.
12. Additional Metrics
We use the roc_curve function to map out the Receiver Operating
Characteristic (ROC) curve, which helps us visualize how well our model
distinguishes between classes. Once the ROC curve is plotted, we turn to
roc_auc_score to compute the Area Under the ROC Curve (AUC). This score
gives us a single value that summarizes the overall performance of our
model, showing how good it is at distinguishing between the different
classes. Together, these functions provide a clear and comprehensive
assessment of our model's classification abilities.
Cohen's Kappa: Measures the agreement between true labels and predictions, adjusted for chance.
Log-Loss: Measures the performance of a classification model where the prediction input is a probability value.
Matthews Correlation Coefficient: Takes into account true and false positives and negatives, providing a balanced measure.
F2 Score: A variant of the F1 score that gives more weight to recall.
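A sketch of these metrics, assuming y_prob holds the MLP's predicted probabilities on the test set:

from sklearn.metrics import (roc_curve, roc_auc_score, cohen_kappa_score,
                             log_loss, matthews_corrcoef, fbeta_score)

y_prob = model.predict(X_test).ravel()      # predicted probabilities
y_pred = (y_prob > 0.5).astype(int)         # thresholded binary predictions

fpr, tpr, _ = roc_curve(y_test, y_prob)     # points for the ROC plot
print('AUC:', roc_auc_score(y_test, y_prob))
print("Cohen's Kappa:", cohen_kappa_score(y_test, y_pred))
print('Log-Loss:', log_loss(y_test, y_prob))
print('MCC:', matthews_corrcoef(y_test, y_pred))
print('F2 Score:', fbeta_score(y_test, y_pred, beta=2))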

13.Save Model
We use `model.save` to preserve our trained model by saving it to a file
named `MLP.keras`. This allows us to easily load the model in the future for
making predictions or conducting further analysis, without needing to retrain
the model from scratch. Saving the model ensures that all the effort put into
training is not lost and can be utilized efficiently whenever needed.
Random Forest Classifier
1.Data Splitting
First, we split the dataset into training/validation and test sets, ensuring 15%
is used for testing. Then, we further split the training/validation set so that
about 15% of the original data is used for validation. This way, both sets are
of sufficient size for evaluation. Using `random_state=42` ensures
reproducibility.

2.Feature Selection
We start with a list of `selected_features`, which includes specific indices or
column names of the features we want to focus on. Using this list, we extract
these features from our datasets (X_train, X_val, X_test). This way, we
concentrate on the most relevant features, simplifying our model and
improving its performance by reducing dimensionality and excluding
unnecessary data. By doing so, we ensure our model is both efficient and
effective.

3.Training the Random Forest Classifier


We trained a RandomForestClassifier with 100 trees (using
n_estimators=100) on our training data (X_train, y_train). Random Forests
are robust classifiers that handle many features well and are less likely to
overfit. Setting random_state=42 ensures our results are reproducible.
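A sketch of the training step as described:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print('Validation accuracy:', rf.score(X_val, y_val))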

4.Predictions and Accuracy


1. Training Accuracy: 0.86
o Meaning: The model correctly predicts 86% of training data
instances.
o Indication: The model has effectively learned patterns in the
training data.
2. Validation Accuracy: 0.86
o Meaning: The model correctly predicts 86% of validation data
instances.
o Indication: The model generalizes well to unseen data and is
not overfitting.
3. Test Accuracy: 0.86
o Meaning: The model correctly predicts 86% of test data
instances.
o Indication: The model maintains its performance on completely
new data.
Interpretation:
 Consistent Accuracy: Training, validation, and test accuracies are all
0.86, showing the model is well-balanced and generalizes well.
 Good Performance: An accuracy of 0.86 across all datasets suggests
that the model performs well in predicting the target class correctly in
86% of the cases.

5.Predictions and Metrics


1) Classification Report
A classification report provides a detailed summary of a classification
model's performance, including precision, recall, F1-score, and support for
each class.

2) Confusion Matrix
A confusion matrix evaluates the performance of a classification model by
comparing actual vs. predicted values, showing the counts of true positives,
false positives, true negatives, and false negatives. This helps identify where
the model is making errors and how well it distinguishes between classes.
Axes and Labels:
 x-axis (Predicted): Shows predicted class labels.
 y-axis (Actual): Shows actual class labels.
Classes:
 Two classes: 0 (negative) and 1 (positive).
Cells and Values:
 TN: 4130 instances correctly predicted as class 0.
 FP: 3 instances of class 0 predicted as class 1.
 FN: 673 instances of class 1 predicted as class 0.
 TP: 5 instances correctly predicted as class 1.
Color Scale:
 From light blue (lower values) to dark blue (higher values).
Interpretation:
 The model excels in predicting the negative class (0), with a high true-negative count.
 With only 5 of 678 positive instances identified, there is substantial room for improvement in predicting the positive class (1).

3) ROC Curve and AUC Score
The ROC curve plots the true positive rate against the false positive rate,
showing how well a model distinguishes between classes across different
threshold values. The AUC score, which stands for Area Under the Curve,
quantifies this performance; a higher AUC score indicates better model
performance in distinguishing between classes.
6.Additional Metrics
1) Cohen's Kappa:
 Function: Computes Cohen's Kappa score.
 Validation Set: Computes the score for validation set.
 Test Set: Computes the score for test set.
 Purpose: Measures agreement between predictions and true labels,
adjusting for chance, useful for imbalanced classes.
2) Matthews Correlation Coefficient (MCC):
 Function: Computes the MCC score.
 Validation Set: Computes MCC for validation set.
 Test Set: Computes MCC for test set.
 Purpose: Provides a balanced measure that considers all categories of
the confusion matrix (TP, TN, FP, FN), giving a more informative
assessment than accuracy alone.
Conclusion
Both the Multilayer Perceptron (MLP) and Random Forest models achieved an
accuracy of 86% on the training, validation, and test sets.
Comparison:
 MLP Accuracy: 86%
 Random Forest Accuracy: 86%
What This Means:
 Both models perform equally well in terms of overall accuracy,
predicting 86% of instances correctly.
 The consistent accuracy across training, validation, and test sets for
both models indicates good generalization, meaning they are not
overfitting and are likely to perform well on new, unseen data.
 Despite different underlying mechanisms (deep learning for MLP and
ensemble learning for Random Forest), both models demonstrate
robust and reliable performance in this particular task.
This comparison shows that both MLP and Random Forest are effective
choices, and selecting one over the other might depend on specific use
cases, computational resources, and preferences in model interpretability.
