All Life Bank - AIML_ML_Project_low_code_notebook

AllLife Bank aims to convert liability customers into personal loan customers to increase its loan business. A successful previous campaign had a conversion rate of over 9%, prompting the marketing department to develop targeted campaigns. The project involves building a predictive model to identify potential loan customers based on various attributes and analyzing the dataset for insights.

 Problem Statement

Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers
are liability customers (depositors) with varying sizes of deposits. The number of customers
who are also borrowers (asset customers) is quite small, and the bank is interested in
expanding this base rapidly to bring in more loan business and in the process, earn more
through the interest on loans. In particular, the management wants to explore ways of
converting its liability customers to personal loan customers (while retaining them as
depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion
rate of over 9% success. This has encouraged the retail marketing department to devise
campaigns with better target marketing to increase the success ratio.

You as a Data scientist at AllLife bank have to build a model that will help the marketing
department to identify the potential customers who have a higher probability of purchasing
the loan.

Objective

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target more.
Data Dictionary

ID : Customer ID
Age : Customer's age in completed years
Experience : Number of years of professional experience
Income : Annual income of the customer (in thousand dollars)
ZIP Code : Home address ZIP code
Family : Family size of the customer
CCAvg : Average spending on credit cards per month (in thousand dollars)
Education : Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage : Value of house mortgage, if any (in thousand dollars)
Personal_Loan : Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
Securities_Account : Does the customer have a securities account with the bank? (0: No, 1: Yes)
CD_Account : Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
Online : Does the customer use internet banking facilities? (0: No, 1: Yes)
CreditCard : Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Please read the instructions carefully before starting the project.
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.

Blanks '___' are provided in the notebook that need to be filled with appropriate code to get the correct result. With every '___' blank, there is a comment that briefly describes what needs to be filled in.
Identify the task to be performed correctly, and only then proceed to write the required code.
Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw errors.
Please run the code cells sequentially from the beginning to avoid any unnecessary errors.
Add the results/observations (wherever mentioned) derived from the analysis to the presentation and submit the same.

 Importing necessary libraries


# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn

error: subprocess-exited-with-error

Getting requirements to build wheel did not run successfully.
exit code: 1

Traceback (most recent call last):
  ...
  File "C:\Users\conne\AppData\Local\Temp\pip-build-env-5akfrxos\overlay\Lib\...", in <module>
    register_finder(pkgutil.ImpImporter, find_on_path)
AttributeError: module 'pkgutil' has no attribute 'ImpImporter'. Did you mean: 'zipimporter'?

note: This error originates from a subprocess, and is likely not a problem with pip.
Note:

1. After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab) and run all cells sequentially from the next cell.

2. On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are installed to successfully execute the code in this notebook.
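
As a quick sanity check after the restart, the installed versions can be printed; a minimal sketch:

# verify that the pinned library versions are active after the kernel restart
import numpy, pandas, matplotlib, seaborn, sklearn

for lib in (numpy, pandas, matplotlib, seaborn, sklearn):
    print(lib.__name__, lib.__version__)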

# Libraries to help with reading and manipulating data


import pandas as pd
import numpy as np

# libraries to help with data visualization


import matplotlib.pyplot as plt
import seaborn as sns

# Library to split data


from sklearn.model_selection import train_test_split

# To build model for prediction


from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To get different metric scores


from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)

# to suppress unnecessary warnings


import warnings
warnings.filterwarnings("ignore")

 Loading the dataset

# uncomment the following lines if Google Colab is being used


# from google.colab import drive
# drive.mount('/content/drive')
Loan = pd.read_csv("C:\\Users\\conne\\OneDrive\\Desktop\\AI ML COURSE\\Loan_Modelling.csv")

# copying data to another variable to avoid any changes to original data


data = Loan.copy()

 Data Overview

 View the first and last 5 rows of the dataset.

data.head(5) ## Complete the code to view top 5 rows of the data

ID Age Experience Income ZIPCode Family CCAvg Education Mortgage ...

0 1 25 1 49 91107 4 1.6 1 0

1 2 45 19 34 90089 3 1.5 1 0

2 3 39 15 11 94720 1 1.0 1 0

3 4 35 9 100 94112 1 2.7 2 0

4 5 35 8 45 91330 4 1.0 2 0

data.tail(5) ## Complete the code to view last 5 rows of the data

ID Age Experience Income ZIPCode Family CCAvg Education Mortgage

4995 4996 29 3 40 92697 1 1.9 3 0

4996 4997 30 4 15 92037 4 0.4 1 85

4997 4998 63 39 24 93023 2 0.3 3 0

4998 4999 65 40 49 90034 3 0.5 2 0

4999 5000 28 4 83 92612 3 0.8 1 0

 Understand the shape of the dataset.

data.shape ## Complete the code to get the shape of the data

(5000, 14)
 Check the data types of the columns for the dataset

data.dtypes ## Complete the code to view the datatypes of the data

ID int64
Age int64
Experience int64
Income int64
ZIPCode int64
Family int64
CCAvg float64
Education int64
Mortgage int64
Personal_Loan int64
Securities_Account int64
CD_Account int64
Online int64
CreditCard int64
dtype: object

 Checking the Statistical Summary


data.describe().T ## Complete the code to print the statistical summary of the data

count mean std min 25% 50% 75% ...

ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25

Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.0

Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.0

Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.0

ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.0

Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.0

CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.5

Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.0

Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.0

Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.0

Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.0

CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.0

Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.0

CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.0


Most customers fall within a broad working-age range, with a median age of 45. This indicates the bank's customer base is largely in their prime earning years, making them potentially attractive targets for loans.
Negative values in Experience suggest data quality issues that need correction. Most customers have significant professional experience, which might correlate with stable incomes.
Wide variability in income indicates a diverse customer base, with a significant proportion earning under $98k (the 75th percentile). Higher-income customers could be prioritized for loan marketing, as they may be more likely to qualify for and accept larger loans.
Most customers belong to small families. This suggests marketing campaigns should focus on individual or small-family needs rather than larger family-specific messaging.
Significant variation in credit card spending indicates that high spenders might be more inclined toward loans, either to consolidate debt or manage expenses.
Customers with higher education levels (Graduate/Professional) may be more receptive to financial products due to better financial literacy and higher earning potential.
A high standard deviation in Mortgage suggests a mixed customer base, where a significant proportion may not have mortgages (median = 0). Those without mortgages could be targeted for personal loans.

 Dropping columns

data = data.drop(['ID'], axis=1) ## Complete the code to drop a column from the dataframe

Dropping ID column, since it is a running serial number and won't have any bearing on
the model

data.head()

Age Experience Income ZIPCode Family CCAvg Education Mortgage ...

0 25 1 49 91107 4 1.6 1 0

1 45 19 34 90089 3 1.5 1 0

2 39 15 11 94720 1 1.0 1 0

3 35 9 100 94112 1 2.7 2 0

4 35 8 45 91330 4 1.0 2 0

data.isnull().sum()

Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
dtype: int64
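
There are no missing values. A duplicate-row check is a natural companion to this step; a one-line sketch:

# count fully duplicated rows; 0 means every record is unique
print("Duplicate rows:", data.duplicated().sum())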

 Data Preprocessing

 Checking for Anomalous Values

data["Experience"].unique()

array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, -1, 34, 0, 38, 40, 33, 4, -2, 42, -3, 43], dtype=int64)

# checking for experience <0


data[data["Experience"] < 0]["Experience"].unique()

array([-1, -2, -3], dtype=int64)

# Correcting the experience values


data["Experience"].replace(-1, 1, inplace=True)
data["Experience"].replace(-2, 2, inplace=True)
data["Experience"].replace(-3, 3, inplace=True)

data["Education"].unique()

array([1, 2, 3], dtype=int64)


data.head()

Age Experience Income ZIPCode Family CCAvg Education Mortgage ...

0 25 1 49 91107 4 1.6 1 0

1 45 19 34 90089 3 1.5 1 0

2 39 15 11 94720 1 1.0 1 0

3 35 9 100 94112 1 2.7 2 0

4 35 8 45 91330 4 1.0 2 0

# data["Mortgage"].unique()
data["Personal_Loan"].unique()

array([0, 1], dtype=int64)

data["CD_Account"].unique()

array([0, 1], dtype=int64)

data["Online"].unique()

array([0, 1], dtype=int64)


 Feature Engineering

# checking the number of uniques in the zip code


data["ZIPCode"].nunique()

467
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
"Number of unique values if we take first two digits of ZIPCode: ",
data["ZIPCode"].str[0:2].nunique(),
)
data["ZIPCode"] = data["ZIPCode"].str[0:2]

data["ZIPCode"] = data["ZIPCode"].astype("category")

Number of unique values if we take first two digits of ZIPCode: 7
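
To see how the records spread across the seven two-digit prefixes before modeling, a quick tally can help; a small sketch:

# distribution of customers across the two-digit ZIP prefixes
print(data["ZIPCode"].value_counts())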

## Converting the data type of categorical features to 'category'


cat_cols = [
"Education",
"Personal_Loan",
"Securities_Account",
"CD_Account",
"Online",
"CreditCard",
"ZIPCode",
]
data[cat_cols] = data[cat_cols].astype("category")

 Exploratory Data Analysis (EDA)

 Univariate Analysis
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # for the histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram

# function to create labeled barplots

def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage

    plt.show()  # show the plot

 Observations on Age
data.dtypes

Age int64
Experience int64
Income int64
ZIPCode category
Family int64
CCAvg float64
Education category
Mortgage int64
Personal_Loan category
Securities_Account category
CD_Account category
Online category
CreditCard category
dtype: object

histogram_boxplot(data, "Age")
 Observations on Experience

histogram_boxplot(data, "Experience") ## Complete the code to create histogram_boxp

 Observations on Income
histogram_boxplot(data, "Income") ## Complete the code to create histogram_boxplot

 Observations on CCAvg
histogram_boxplot(data, "CCAvg") ## Complete the code to create histogram_boxplot

(Plot: combined histogram and boxplot of CCAvg; the y-axis shows Count and the x-axis shows CCAvg.)

 Observations on Mortgage
histogram_boxplot(data, "Mortgage") ## Complete the code to create histogram_boxpl

 Observations on Family
labeled_barplot(data, "Family", perc=True)

 Observations on Education
labeled_barplot(data, "Education") ## Complete the code to create labeled_barplot

 Observations on Securities_Account
labeled_barplot(data, "Securities_Account") ## Complete the code to create labele

 Observations on CD_Account
labeled_barplot(data, "CD_Account") ## Complete the code to create labeled_barplo

 Observations on Online
labeled_barplot(data, "Online") ## Complete the code to create labeled_barplot fo

 Observation on CreditCard
labeled_barplot(data, "CreditCard") ## Complete the code to create labeled_barplo

 Observation on ZIPCode
labeled_barplot(data, "ZIPCode") ## Complete the code to create labeled_barplot f

 Bivariate Analysis
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
### function to plot distributions wrt target

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

 Correlation check
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

(Correlation heatmap of the numerical features; the recoverable portion of the matrix:)

            Age    Experience  Income  Family
Age         1.00   0.99        -0.06   -0.05
Experience  0.99   1.00        -0.05   -0.05
Income      -0.06  -0.05       1.00    -0.16
Family      -0.05  -0.05       -0.16   1.00
CCAvg       -0.05  -0.05       0.65    -0.11
Mortgage    -0.01  -0.01       0.21    -0.02

 Let's check how a customer's interest in purchasing a loan varies with their education
stacked_barplot(data, "Education", "Personal_Loan")

Personal_Loan 0 1 All
Education
All 4520 480 5000
3 1296 205 1501
2 1221 182 1403
1 2003 93 2096
-------------------------------------------------------------------------------

Higher education levels correlate with higher loan acceptance rates. This trend may be due to:

Higher financial literacy and understanding of loan benefits.

Better income levels associated with advanced education.

 Personal_Loan vs Family
stacked_barplot(data, "Personal_Loan", "Family") ## Complete the code to plot stac

Family 1 2 3 4 All
Personal_Loan
All 1472 1296 1010 1222 5000
0 1365 1190 877 1088 4520
1 107 106 133 134 480
-------------------------------------------------------------------------------

 Personal_Loan vs Securities_Account
stacked_barplot(data, "Personal_Loan","Securities_Account") ## Complete the code to

Securities_Account 0 1 All
Personal_Loan
All 4478 522 5000
0 4058 462 4520
1 420 60 480
-------------------------------------------------------------------------------

 Personal_Loan vs CD_Account
stacked_barplot(data, "Personal_Loan", "CD_Account") ## Complete the code to plot s

CD_Account 0 1 All
Personal_Loan
All 4698 302 5000
0 4358 162 4520
1 340 140 480
-------------------------------------------------------------------------------

 Personal_Loan vs Online
stacked_barplot(data, "Personal_Loan", "Online") ## Complete the code to plot stack

Online 0 1 All
Personal_Loan
All 2016 2984 5000
0 1827 2693 4520
1 189 291 480
-------------------------------------------------------------------------------

 Personal_Loan vs CreditCard
stacked_barplot(data, "Personal_Loan", "CreditCard") ## Complete the code to plot s

CreditCard 0 1 All
Personal_Loan
All 3530 1470 5000
0 3193 1327 4520
1 337 143 480
-------------------------------------------------------------------------------

 Personal_Loan vs ZIPCode
stacked_barplot(data, "Personal_Loan", "ZIPCode") ## Complete the code to plot stac

ZIPCode 90 91 92 93 94 95 96 All
Personal_Loan
All 703 565 988 417 1472 815 40 5000
0 636 510 894 374 1334 735 37 4520
1 67 55 94 43 138 80 3 480
-------------------------------------------------------------------------------

 Let's check how a customer's interest in purchasing a loan varies with their age

distribution_plot_wrt_target(data, "Age", "Personal_Loan")


 Personal Loan vs Experience

distribution_plot_wrt_target(data, "Experience", "Personal_Loan") ## Complete the c


 Personal Loan vs Income

distribution_plot_wrt_target(data, "Income", "Personal_Loan") ## Complete the code


distribution_plot_wrt_target(data, "Income", "Personal_Loan") ## Complete the code
 Personal Loan vs CCAvg

distribution_plot_wrt_target(data, "CCAvg", "Personal_Loan") ## Complete the code t


 Data Preprocessing (contd.)

 Outlier Detection

# Select numerical columns only


numerical_data = data.select_dtypes(include=["float64", "int64"])

Q1 = numerical_data.quantile(0.25)  # the 25th percentile
Q3 = numerical_data.quantile(0.75)  # the 75th percentile

IQR = Q3 - Q1  # interquartile range (75th percentile - 25th percentile)

lower = Q1 - 1.5 * IQR  # lower bound; values below this are treated as outliers
upper = Q3 + 1.5 * IQR  # upper bound; values above this are treated as outliers

(
(data.select_dtypes(include=["float64", "int64"]) < lower)
| (data.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(data) * 100

Age 0.00
Experience 0.00
Income 1.92
Family 0.00
CCAvg 6.48
Mortgage 5.82
dtype: float64
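
Roughly 2-6% of values in Income, CCAvg, and Mortgage lie beyond the IQR fences. No treatment is applied here, since decision trees split on thresholds and are largely insensitive to outliers; for reference only, a capping sketch using the bounds computed above (data_capped is an illustrative name) would be:

# optional IQR capping, shown for reference only (NOT applied in this notebook)
data_capped = numerical_data.clip(lower=lower, upper=upper, axis=1)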

 Data Preparation for Modeling


# dropping Experience as it is perfectly correlated with Age
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]

X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)


X = X.astype(float)
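
The cast to float guards against get_dummies producing boolean dummy columns in newer pandas versions. A quick check that every feature ended up numeric; a sketch:

# confirm all encoded features are numeric
print(X.dtypes.value_counts())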

# Splitting data in train and test sets


X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)

print("Shape of Training set : ", X_train.shape)


print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))

Shape of Training set : (3500, 17)


Shape of test set : (1500, 17)
Percentage of classes in training set:
Personal_Loan
0 0.905429
1 0.094571
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0 0.900667
1 0.099333
Name: proportion, dtype: float64
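
The split above is not stratified; the class proportions still land close to the overall 9.6% positive rate (9.5% train vs 9.9% test). If an exact match were required, the target could be passed to stratify; a sketch of the alternative call (the _s names are illustrative):

# alternative split that preserves the Personal_Loan class ratio in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)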

 Model Building

 Model Evaluation Criterion

mention the model evaluation criterion here with proper reasoning

Primary Focus: Recall (to minimize missed loan takers) and F1-Score (to balance
precision and recall).

Supplementary Metrics: Precision (to optimize targeting) and Accuracy (as a general
indicator).
First, let's create functions to calculate different metrics and confusion matrix so that we don't
have to use the same code repeatedly for each model.

The model_performance_classification_sklearn function will be used to check model performance.
The confusion_matrix_sklearn function will be used to plot the confusion matrix.

# defining a function to compute different metrics to check performance of a classification model


def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )

    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
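
For comparison, scikit-learn ships a built-in confusion-matrix plot; a sketch of the equivalent call (here fitted_model is a placeholder for any trained classifier):

from sklearn.metrics import ConfusionMatrixDisplay

# built-in alternative to the helper above (raw counts, no percentages)
# ConfusionMatrixDisplay.from_estimator(fitted_model, X_test, y_test)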

 Decision Tree (sklearn default)

model = DecisionTreeClassifier(criterion="gini", random_state=1)


model.fit(X_train, y_train)

DecisionTreeClassifier(random_state=1)

 Checking model performance on training data


confusion_matrix_sklearn(model, X_train, y_train)

decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train

Accuracy Recall Precision F1

0 1.0 1.0 1.0 1.0

 Check Performance on Test Data

Evaluate the model's performance on the test data to verify if it generalizes well:

decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)


print("Test Performance:")
decision_tree_perf_test

Test Performance:
Accuracy Recall Precision F1

0 0.986 0.932886 0.926667 0.929766
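
For a per-class breakdown beyond these four headline metrics, sklearn's classification_report can complement the helper; a short sketch:

from sklearn.metrics import classification_report

# precision/recall/F1 for each class on the test set
print(classification_report(y_test, model.predict(X_test)))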


 Tree Depth

print(f"Tree Depth: {model.get_depth()}")


print(f"Number of Leaves: {model.get_n_leaves()}")

Tree Depth: 10
Number of Leaves: 49


 Visualizing the Decision Tree

feature_names = list(X_train.columns)
print(feature_names)

['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_2', 'Education_3']

plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

# Text report showing the rules of a decision tree -

print(tree.export_text(model, feature_names=feature_names, show_weights=True))

|--- Income <= 116.50


| |--- CCAvg <= 2.95
| | |--- Income <= 106.50
| | | |--- weights: [2553.00, 0.00] class: 0
| | |--- Income > 106.50
| | | |--- Family <= 3.50
| | | | |--- ZIPCode_93 <= 0.50
| | | | | |--- Age <= 28.50
| | | | | | |--- Education_2 <= 0.50
| | | | | | | |--- weights: [5.00, 0.00] class: 0
| | | | | | |--- Education_2 > 0.50
| | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | |--- Age > 28.50
| | | | | | |--- CCAvg <= 2.20
| | | | | | | |--- weights: [48.00, 0.00] class: 0
| | | | | | |--- CCAvg > 2.20
| | | | | | | |--- Education_3 <= 0.50
| | | | | | | | |--- weights: [7.00, 0.00] class: 0
| | | | | | | |--- Education_3 > 0.50
| | | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | |--- ZIPCode_93 > 0.50
| | | | | |--- Age <= 37.50
| | | | | | |--- weights: [2.00, 0.00] class: 0
| | | | | |--- Age > 37.50
| | | | | | |--- Income <= 112.00
| | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | | |--- Income > 112.00
| | | | | | | |--- weights: [1.00, 0.00] class: 0
| | | |--- Family > 3.50
| | | | |--- Age <= 32.50
| | | | | |--- CCAvg <= 2.40
| | | | | | |--- weights: [12.00, 0.00] class: 0
| | | | | |--- CCAvg > 2.40
| | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | |--- Age > 32.50
| | | | | |--- Age <= 60.00
| | | | | | |--- weights: [0.00, 6.00] class: 1
| | | | | |--- Age > 60.00
| | | | | | |--- weights: [4.00, 0.00] class: 0
| |--- CCAvg > 2.95
| | |--- Income <= 92.50
| | | |--- CD_Account <= 0.50
| | | | |--- Age <= 26.50
| | | | | |--- weights: [0.00, 1.00] class: 1
| | | | |--- Age > 26.50
| | | | | |--- CCAvg <= 3.55
| | | | | | |--- CCAvg <= 3.35
| | | | | | | |--- Age <= 37.50
| | | | | | | | |--- Age <= 33.50
| | | | | | | | | |--- weights: [3.00, 0.00] class: 0
| | | | | | | | |--- Age > 33.50
| | | | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | | | |--- Age > 37.50
| | | | | | | | |--- Income <= 82.50
| | | | | | | | | |--- weights: [23.00, 0.00] class: 0
| | | | | | | | |--- Income > 82.50
| | | | | | | | | |--- Income <= 83.50
| | | | | | | | | | |--- weights: [0.00, 1.00] class: 1
| | | | | | | | | |--- Income > 83.50
| | | | | | | | | | |--- weights: [5.00, 0.00] class: 0
# importance of features in the tree building (the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature; it is also known as the Gini importance)

print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)

Imp
Income 0.308098
Family 0.259255
Education_2 0.166192
Education_3 0.147127
CCAvg 0.048798
Age 0.033150
CD_Account 0.017273
ZIPCode_94 0.007183
ZIPCode_93 0.004682
Mortgage 0.003236
Online 0.002224
Securities_Account 0.002224
ZIPCode_91 0.000556
ZIPCode_92 0.000000
ZIPCode_95 0.000000
ZIPCode_96 0.000000
CreditCard 0.000000
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
 Checking model performance on test data

confusion_matrix_sklearn(model, X_test, y_test) ## Complete the code to create confusion matrix for test data

decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)


decision_tree_perf_test

Accuracy Recall Precision F1

0 0.986 0.932886 0.926667 0.929766

 Model Performance Improvement

 Pre-pruning

Note: The parameters provided below are a sample set. Feel free to update them and try out other combinations.

# Define the parameters of the tree to iterate over

# Earlier combinations tried:
# max_depth_values = np.arange(2, 11, 2)
# max_leaf_nodes_values = np.arange(10, 51, 10)
# min_samples_split_values = np.arange(10, 51, 10)
# max_depth_values = np.arange(3, 21, 2)  # explore deeper trees
# max_leaf_nodes_values = [20, 50, 100, 200]  # allow more splits to refine leaf nodes
# min_samples_split_values = [2, 10, 20, 50]  # explore smaller splits for better generalization
# max_depth_values = np.arange(3, 16, 2)

max_depth_values = np.arange(6, 15)
max_leaf_nodes_values = [20, 50, 100, 200]  # allow more splits to refine leaf nodes
min_samples_split_values = [2, 5, 10]  # explore smaller splits for better granularity

# Initialize variables to store the best model and its performance


best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# Iterate over all combinations of the specified parameter values

for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=1,
            )

            # Fit the model to the training data
            estimator.fit(X_train, y_train)

            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            # Calculate recall scores for training and test sets
            train_recall_score = recall_score(y_train, y_train_pred)
            test_recall_score = recall_score(y_test, y_test_pred)

            # Calculate the absolute difference between training and test recall scores
            score_diff = abs(train_recall_score - test_recall_score)

            # Update the best estimator if the current one has a smaller
            # train-test gap and a higher test recall
            if (score_diff < best_score_diff) & (test_recall_score > best_test_score):
                best_score_diff = score_diff
                best_test_score = test_recall_score
                best_estimator = estimator

# Print the best parameters


print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")

Best parameters found:


Max depth: 6
Max leaf nodes: 20
Min samples split: 2
Best test recall score: 0.9664429530201343
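
The manual triple loop is equivalent to a small grid search. The same sweep can be expressed with scikit-learn's GridSearchCV; a sketch (note it selects by cross-validated recall rather than the train/test-gap rule used above, so it may pick different parameters):

from sklearn.model_selection import GridSearchCV

# same parameter grid as the manual loop above
param_grid = {
    "max_depth": np.arange(6, 15),
    "max_leaf_nodes": [20, 50, 100, 200],
    "min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)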
# Fit the best algorithm to the data.
estimator = best_estimator
estimator.fit(X_train, y_train) ## Complete the code to fit model on train data

DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=20, random_state=1)

Checking performance on training data

# Predict labels for the training data
y_train_pred = estimator.predict(X_train)

confusion_matrix_sklearn(estimator, X_train, y_train) ## Complete the code to create confusion matrix for train data

decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train

Accuracy Recall Precision F1

0 0.973714 1.0 0.782506 0.877984


Visualizing the Decision Tree

plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -

print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))

|--- Income <= 92.50


| |--- CCAvg <= 2.95
| | |--- weights: [1344.67, 0.00] class: 0
| |--- CCAvg > 2.95
| | |--- CD_Account <= 0.50
| | | |--- CCAvg <= 3.95
| | | | |--- Mortgage <= 102.50
| | | | | |--- Income <= 68.50
| | | | | | |--- weights: [8.28, 0.00] class: 0
| | | | | |--- Income > 68.50
| | | | | | |--- weights: [21.54, 52.87] class: 1
| | | | |--- Mortgage > 102.50
| | | | | |--- weights: [11.60, 0.00] class: 0
| | | |--- CCAvg > 3.95
| | | | |--- weights: [23.19, 0.00] class: 0
| | |--- CD_Account > 0.50
| | | |--- weights: [0.00, 26.44] class: 1
|--- Income > 92.50
| |--- Family <= 2.50
| | |--- Education_3 <= 0.50
| | | |--- Education_2 <= 0.50
| | | | |--- Income <= 103.50
| | | | | |--- CCAvg <= 3.21
| | | | | | |--- weights: [22.09, 0.00] class: 0
| | | | | |--- CCAvg > 3.21
| | | | | | |--- weights: [2.76, 15.86] class: 1
| | | | |--- Income > 103.50
| | | | | |--- weights: [239.11, 0.00] class: 0
| | | |--- Education_2 > 0.50
| | | | |--- Income <= 110.00
| | | | | |--- CCAvg <= 2.90
| | | | | | |--- weights: [12.70, 0.00] class: 0
| | | | | |--- CCAvg > 2.90
| | | | | | |--- weights: [0.00, 10.57] class: 1
| | | | |--- Income > 110.00
| | | | | |--- weights: [3.87, 296.07] class: 1
| | |--- Education_3 > 0.50
| | | |--- Income <= 116.50
| | | | |--- CCAvg <= 1.10
| | | | | |--- weights: [7.73, 0.00] class: 0
| | | | |--- CCAvg > 1.10
| | | | | |--- weights: [9.94, 47.58] class: 1
| | | |--- Income > 116.50
| | | | |--- weights: [0.00, 327.79] class: 1
| |--- Family > 2.50
| | |--- Income <= 113.50
| | | |--- CCAvg <= 2.80
| | | | |--- Income <= 106.50
| | | | | |--- weights: [24.85, 0.00] class: 0
| | | | |--- Income > 106.50
| | | | | |--- Age <= 28.50
| | | | | | |--- weights: [4.97, 0.00] class: 0
| | | | | |--- Age > 28.50
| | | | | | |--- weights: [6.07, 31.72] class: 1
| | | |--- CCAvg > 2.80
| | | | |--- weights: [3.31, 95.17] class: 1
| | |--- Income > 113.50
| | | |--- weights: [3.31, 845.92] class: 1
# importance of features in the tree building (the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature; it is also known as the Gini importance)

print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)

Imp
Income 0.662361
Education_2 0.143155
CCAvg 0.087565
Education_3 0.050404
Family 0.039987
CD_Account 0.007829
Mortgage 0.004987
Age 0.003711
Online 0.000000
ZIPCode_91 0.000000
ZIPCode_92 0.000000
ZIPCode_93 0.000000
ZIPCode_94 0.000000
ZIPCode_95 0.000000
ZIPCode_96 0.000000
Securities_Account 0.000000
CreditCard 0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Checking performance on test data


confusion_matrix_sklearn(estimator, X_test, y_test) # Complete the code to get the confusion matrix for test data

decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test

Accuracy Recall Precision F1

0 0.963333 0.966443 0.742268 0.83965

 Post-pruning

clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)

ccp_alphas impurities

0 0.000000 0.000000

1 0.000186 0.001114

2 0.000214 0.001542

3 0.000242 0.002750

4 0.000250 0.003250

5 0.000268 0.004324

6 0.000272 0.004868

7 0.000276 0.005420

8 0.000381 0.005801

9 0.000527 0.006329

10 0.000625 0.006954

11 0.000700 0.007654

12 0.000769 0.010731

13 0.000882 0.014260

14 0.000889 0.015149

15 0.001026 0.017200

16 0.001305 0.018505

17 0.001647 0.020153

18 0.002333 0.022486

19 0.002407 0.024893

20 0.003294 0.028187

21 0.006473 0.034659

22 0.025146 0.084951

23 0.039216 0.124167

24 0.047088 0.171255
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha that prunes the whole tree, leaving the last tree, clfs[-1], with only one node.
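
For reference, cost-complexity pruning chooses the subtree minimizing R_alpha(T) = R(T) + alpha * |leaves(T)|, where R(T) is the total impurity of the tree's leaves; larger alphas therefore penalize tree size more heavily and yield smaller trees.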

clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)  ## Complete the code to fit decision tree on training data
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)

Number of nodes in the last tree is: 1 with ccp_alpha: 0.04708834100596766


clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]


depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
 Recall vs alpha for training and testing sets

recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)

fig, ax = plt.subplots(figsize=(15, 5))


ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
best_alpha = ccp_alphas[index_best_model]

print(best_model)

DecisionTreeClassifier(random_state=1)

print(best_alpha)

0.0

estimator_2 = DecisionTreeClassifier(
    ccp_alpha=best_alpha, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)

DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)

Checking performance on training data

confusion_matrix_sklearn(estimator, X_train, y_train) ## Complete the code to create confusion matrix for train data
# (note: as written this evaluates `estimator`, the pre-pruned model; use `estimator_2` to assess the post-pruned tree)

decision_tree_tune_post_train = model_performance_classification_sklearn(estimator, X_train, y_train)
decision_tree_tune_post_train

Accuracy Recall Precision F1

0 0.973714 1.0 0.782506 0.877984

Visualizing the Decision Tree

plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator_2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -

print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))

|--- Income <= 98.50


| |--- CCAvg <= 2.95
| | |--- weights: [374.10, 0.00] class: 0
| |--- CCAvg > 2.95
| | |--- CD_Account <= 0.50
| | | |--- CCAvg <= 3.95
| | | | |--- Income <= 81.50
| | | | | |--- Age <= 36.50
| | | | | | |--- Family <= 3.50
| | | | | | | |--- CCAvg <= 3.50
| | | | | | | | |--- Income <= 75.00
| | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | | |--- Income > 75.00
| | | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | |--- CCAvg > 3.50
| | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | |--- Family > 3.50
| | | | | | | |--- weights: [0.60, 0.00] class: 0
| | | | | |--- Age > 36.50
| | | | | | |--- ZIPCode_91 <= 0.50
| | | | | | | |--- Age <= 37.50
| | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | |--- Age > 37.50
| | | | | | | | |--- weights: [6.00, 0.00] class: 0
| | | | | | |--- ZIPCode_91 > 0.50
| | | | | | | |--- Education_3 <= 0.50
| | | | | | | | |--- weights: [0.00, 0.85] class: 1
| | | | | | | |--- Education_3 > 0.50
| | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | |--- Income > 81.50
| | | | | |--- Mortgage <= 152.00
| | | | | | |--- Securities_Account <= 0.50
| | | | | | | |--- CCAvg <= 3.05
| | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | | |--- CCAvg > 3.05
| | | | | | | | |--- CCAvg <= 3.85
| | | | | | | | | |--- ZIPCode_91 <= 0.50
| | | | | | | | | | |--- CreditCard <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | | | | |--- CreditCard > 0.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | |--- ZIPCode_91 > 0.50
| | | | | | | | | | |--- CCAvg <= 3.35
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | | | |--- CCAvg > 3.35
| | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | | | |--- CCAvg > 3.85
| | | | | | | | | |--- weights: [0.00, 2.55] class: 1
| | | | | | |--- Securities_Account > 0.50
| | | | | | | |--- Education_3 <= 0.50
| | | | | | | | |--- weights: [0.45, 0.00] class: 0
| | | | | | | |--- Education_3 > 0.50
| | | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | |--- Mortgage > 152.00
| | | | | | |--- Income <= 84.00
| | | | | | | |--- weights: [0.15, 0.00] class: 0
| | | | | | |--- Income > 84.00
| | | | | | | |--- weights: [0.90, 0.00] class: 0
| | | |--- CCAvg > 3.95
| | | | |--- weights: [6.75, 0.00] class: 0
# importance of features in the tree building (the importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature; it is also known as the Gini importance)

print(
pd.DataFrame(
estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)

Imp
Income 0.597264
Education_2 0.138351
CCAvg 0.078877
Education_3 0.067293
Family 0.066244
Age 0.018973
CD_Account 0.011000
Mortgage 0.005762
Securities_Account 0.004716
ZIPCode_94 0.004702
ZIPCode_91 0.003587
CreditCard 0.002428
ZIPCode_92 0.000802
Online 0.000000
ZIPCode_93 0.000000
ZIPCode_95 0.000000
ZIPCode_96 0.000000
importances = estimator_2.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Checking performance on test data


confusion_matrix_sklearn(estimator_2, X_test, y_test) # Complete the code to get the confusion matrix for test data

decision_tree_tune_post_test = model_performance_classification_sklearn(estimator_2, X_test, y_test)
decision_tree_tune_post_test

Accuracy Recall Precision F1

0 0.979333 0.852349 0.933824 0.891228

 Model Performance Comparison and Final Model Selection


# training performance comparison

models_train_comp_df = pd.concat(
    [decision_tree_perf_train.T, decision_tree_tune_perf_train.T, decision_tree_tune_post_train.T],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df

Training performance comparison:

           Decision Tree (sklearn default)  Decision Tree (Pre-Pruning)  Decision Tree (Post-Pruning)
Accuracy   1.0                              0.973714                     0.973714
Recall     1.0                              1.000000                     1.000000
Precision  1.0                              0.782506                     0.782506
F1         1.0                              0.877984                     0.877984

# test performance comparison

models_test_comp_df = pd.concat(
    [decision_tree_perf_test.T, decision_tree_tune_perf_test.T, decision_tree_tune_post_test.T],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df

Test set performance comparison:

           Decision Tree (sklearn default)  Decision Tree (Pre-Pruning)  Decision Tree (Post-Pruning)
Accuracy   0.986000                         0.963333                     0.979333
Recall     0.932886                         0.966443                     0.852349
Precision  0.926667                         0.742268                     0.933824
F1         0.929766                         0.839650                     0.891228

 Actionable Insights and Business Recommendations


What recommendations would you suggest to the bank?

 Actionable Insights
1. Key Drivers of Loan Acceptance:

Income: The most critical predictor of loan acceptance. Customers with higher incomes are more likely to accept loans, reflecting better financial stability and repayment ability.
Education Level: Graduate and advanced education levels correlate strongly with loan acceptance, likely due to better financial literacy and earning potential.
Credit Card Spending (CCAvg): High-spending customers are more inclined toward loans, possibly for debt consolidation or managing high expenses.
Family Size: Customers with larger families show a moderate likelihood of accepting loans, potentially driven by higher financial responsibilities.

2. Targeted Segments:

High-Income Customers: Focus on customers with income above the 75th percentile, as they are more likely to qualify for and accept loans.
Educated Customers: Graduate and professional-level education segments should be prioritized in marketing campaigns.
Financially Active Customers: Customers with high credit card spending are key targets for cross-selling loans.

3. Pruned Decision Tree Model:

The post-pruned decision tree model offers the best balance of recall (85.23%)
and precision (93.38%), making it suitable for identifying high-probability loan
takers while minimizing false positives.

Business Recommendations
1. Targeted Marketing Campaigns:

Use the model to identify customers with high probabilities of loan acceptance
based on their income, education, and spending habits.
Develop personalized loan offers tailored to these segments to increase
conversion rates.

2. Precision-Driven Strategy:

Leverage the high precision of the post-pruned model to minimize wasted marketing efforts on uninterested customers.
Focus on cost-effective campaigns by prioritizing customers with the highest predicted probabilities of loan acceptance.

3. Cross-Selling Opportunities:

Identify customers with existing financial products (e.g., CD accounts) for potential upselling or cross-selling of loans.
Highlight specific loan benefits, such as low interest rates for high-income earners or flexible repayment plans for large families.

4. Educational Outreach:

For customers with undergraduate education, consider campaigns focused on financial literacy to improve their understanding of personal loan benefits.
This could expand the target base for future campaigns.

5. Digital Marketing Channels:

Given that a significant portion of the bank's customers use online banking, prioritize digital marketing channels to reach these customers effectively.

6. Continuous Model Monitoring:

Deploy the post-pruned decision tree model in production, but continuously monitor its performance on new data to detect potential data drift.
Regularly retrain the model using updated customer data for sustained accuracy.

7. Customer Feedback Loop:

Collect and analyze feedback from customers who decline loans to identify
potential barriers (e.g., interest rates, terms) and improve future offerings.

Expected Benefits

- Improved conversion rates for personal loan campaigns by focusing on high-probability segments.
- Reduced marketing costs by targeting precise customer segments and minimizing false positives.
- Enhanced customer satisfaction through personalized and relevant loan offers.
