All Life Bank - AIML_ML_Project_low_code_notebook
Context
AllLife Bank is a US bank that has a growing customer base. The majority of these customers
are liability customers (depositors) with varying sizes of deposits. The number of customers
who are also borrowers (asset customers) is quite small, and the bank is interested in
expanding this base rapidly to bring in more loan business and in the process, earn more
through the interest on loans. In particular, the management wants to explore ways of
converting its liability customers to personal loan customers (while retaining them as
depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion
rate of over 9%. This has encouraged the retail marketing department to devise campaigns
with better target marketing to increase the success ratio.
As a Data Scientist at AllLife Bank, you have to build a model that will help the marketing
department identify the potential customers who have a higher probability of purchasing
the loan.
Objective
To predict whether a liability customer will buy a personal loan, to understand which customer
attributes are most significant in driving purchases, and to identify which segment of
customers to target more.
Data Dictionary
ID: Customer ID
Age: Customer's age in completed years
Experience: Number of years of professional experience
Income: Annual income of the customer (in thousand dollars)
ZIPCode: Home address ZIP code
Family: Family size of the customer
CCAvg: Average spending on credit cards per month (in thousand dollars)
Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage: Value of house mortgage, if any (in thousand dollars)
Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
Blanks '___' are provided in the notebook that need to be filled with appropriate code
to get the correct result. With every '___' blank, there is a comment that briefly describes
what needs to be filled in.
Identify the task to be performed correctly, and only then proceed to write the required
code.
Fill in the code wherever asked by comment lines like "# write your code here" or "#
complete the code". Running incomplete code may throw an error.
Please run the code cells sequentially from the beginning to avoid any unnecessary errors.
Add the results/observations (wherever mentioned) derived from the analysis to the
presentation and submit the same.
Note:
1. After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or
runtime (for Google Colab) and run all cells sequentially from the next cell.
2. On executing the above line of code, you might see a warning regarding package
dependencies. This warning can be ignored, as the above code ensures that all
necessary libraries and their dependencies are installed to successfully execute the
code in this notebook.
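A minimal sketch of the imports and data load that the later cells assume (the CSV file name is illustrative; use the actual dataset file):

# Core libraries used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn pieces used in the modelling sections
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, recall_score, precision_score, f1_score, confusion_matrix
)

# Load the dataset (file name assumed; adjust to the actual file)
data = pd.read_csv("Loan_Modelling.csv")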
Data Overview
   ID  Age  Experience  Income  ZIPCode  Family  CCAvg  Education  Mortgage
0   1   25           1      49    91107       4    1.6          1         0
1   2   45          19      34    90089       3    1.5          1         0
2   3   39          15      11    94720       1    1.0          1         0
4   5   35           8      45    91330       4    1.0          2         0
(5000, 14)
Check the data types of the columns for the dataset
ID int64
Age int64
Experience int64
Income int64
ZIPCode int64
Family int64
CCAvg float64
Education int64
Mortgage int64
Personal_Loan int64
Securities_Account int64
CD_Account int64
Online int64
CreditCard int64
dtype: object
Dropping columns
data = data.drop(['ID'], axis=1)  ## Complete the code to drop a column from the dataframe
Dropping ID column, since it is a running serial number and won't have any bearing on
the model
data.head()
   Age  Experience  Income  ZIPCode  Family  CCAvg  Education  Mortgage
0   25           1      49    91107       4    1.6          1         0
1   45          19      34    90089       3    1.5          1         0
2   39          15      11    94720       1    1.0          1         0
4   35           8      45    91330       4    1.0          2         0
data.isnull().sum()
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
dtype: int64
Data Preprocessing
data["Experience"].unique()
array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, -1, 34, 0, 38, 40, 33, 4, -2, 42, -3, 43], dtype=int64)
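The Experience column contains a few negative values (-1, -2, -3), which are not valid year counts. A minimal cleanup sketch, assuming the negatives are simply clipped at zero (replacing them with their absolute values is another common option):

# Replace invalid negative experience values with 0 (assumed cleanup step)
data["Experience"] = data["Experience"].clip(lower=0)
data["Experience"].unique()  # verify that no negative values remain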
data["Education"].unique()
# data["Mortgage"].unique()
data["Personal_Loan"].unique()
data["CD_Account"].unique()
data["Online"].unique()
Feature Engineering
The ZIPCode column has 467 unique values, too many levels to use directly, so we keep only the first two digits below and treat them as a category.
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
"Number of unique values if we take first two digits of ZIPCode: ",
data["ZIPCode"].str[0:2].nunique(),
)
data["ZIPCode"] = data["ZIPCode"].str[0:2]
data["ZIPCode"] = data["ZIPCode"].astype("category")
Univariate Analysis
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the col
    if bins:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
        )  # histogram with the requested number of bins
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram with default bins
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with the count (or percentage) displayed at the top of each bar

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # total number of observations
    count = data[feature].nunique()
    plt.figure(figsize=(count + 1 if n is None else n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data, x=feature, order=data[feature].value_counts().index[:n]
    )  # one bar per category level (top n levels if n is given)
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-position: center of the bar
        y = p.get_height()  # y-position: top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate each bar
    plt.show()
Observations on Age
data.dtypes
Age int64
Experience int64
Income int64
ZIPCode category
Family int64
CCAvg float64
Education category
Mortgage int64
Personal_Loan category
Securities_Account category
CD_Account category
Online category
CreditCard category
dtype: object
histogram_boxplot(data, "Age")
Observations on Experience
Observations on Income
histogram_boxplot(data, "Income") ## Complete the code to create histogram_boxplot
Observations on CCAvg
histogram_boxplot(data, "CCAvg") ## Complete the code to create histogram_boxplot
[Figure: histogram and boxplot of CCAvg (Count on the y-axis, CCAvg on the x-axis)]
Observations on Mortgage
histogram_boxplot(data, "Mortgage") ## Complete the code to create histogram_boxpl
Observations on Family
labeled_barplot(data, "Family", perc=True)
Observations on Education
labeled_barplot(data, "Education") ## Complete the code to create labeled_barplot
Observations on Securities_Account
labeled_barplot(data, "Securities_Account") ## Complete the code to create labele
Observations on CD_Account
labeled_barplot(data, "CD_Account") ## Complete the code to create labeled_barplo
Observations on Online
labeled_barplot(data, "Online") ## Complete the code to create labeled_barplot fo
Observation on CreditCard
labeled_barplot(data, "CreditCard") ## Complete the code to create labeled_barplo
Observation on ZIPCode
labeled_barplot(data, "ZIPCode") ## Complete the code to create labeled_barplot f
Bivariate Analysis
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
plt.legend(
loc="lower left", frameon=False,
)
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    # Minimal version: one histogram of the predictor per target class
    target_uniq = data[target].unique()
    fig, axs = plt.subplots(1, len(target_uniq), figsize=(12, 5))
    for ax, val in zip(axs, target_uniq):  # subplot titled with the target class value
        sns.histplot(data=data[data[target] == val], x=predictor, kde=True, ax=ax).set_title(f"{target} = {val}")
    plt.tight_layout()
    plt.show()
Correlation check
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Let's check how a customer's interest in purchasing a loan varies with their
education
stacked_barplot(data, "Education", "Personal_Loan")
Personal_Loan 0 1 All
Education
All 4520 480 5000
3 1296 205 1501
2 1221 182 1403
1 2003 93 2096
-------------------------------------------------------------------------------
Higher education levels correlate with higher loan acceptance rates. This trend may be
due to greater financial awareness and higher earning potential among graduate and advanced-degree customers.
Personal_Loan vs Family
stacked_barplot(data, "Personal_Loan", "Family") ## Complete the code to plot stac
Family 1 2 3 4 All
Personal_Loan
All 1472 1296 1010 1222 5000
0 1365 1190 877 1088 4520
1 107 106 133 134 480
-------------------------------------------------------------------------------
Personal_Loan vs Securities_Account
stacked_barplot(data, "Personal_Loan","Securities_Account") ## Complete the code to
Securities_Account 0 1 All
Personal_Loan
All 4478 522 5000
0 4058 462 4520
1 420 60 480
-------------------------------------------------------------------------------
Personal_Loan vs CD_Account
stacked_barplot(data, "Personal_Loan", "CD_Account") ## Complete the code to plot s
CD_Account 0 1 All
Personal_Loan
All 4698 302 5000
0 4358 162 4520
1 340 140 480
-------------------------------------------------------------------------------
Personal_Loan vs Online
stacked_barplot(data, "Personal_Loan", "Online") ## Complete the code to plot stack
Online 0 1 All
Personal_Loan
All 2016 2984 5000
0 1827 2693 4520
1 189 291 480
-------------------------------------------------------------------------------
Personal_Loan vs CreditCard
stacked_barplot(data, "Personal_Loan", "CreditCard") ## Complete the code to plot s
CreditCard 0 1 All
Personal_Loan
All 3530 1470 5000
0 3193 1327 4520
1 337 143 480
-------------------------------------------------------------------------------
Personal_Loan vs ZIPCode
stacked_barplot(data, "Personal_Loan", "ZIPCode") ## Complete the code to plot stac
ZIPCode 90 91 92 93 94 95 96 All
Personal_Loan
All 703 565 988 417 1472 815 40 5000
0 636 510 894 374 1334 735 37 4520
1 67 55 94 43 138 80 3 480
-------------------------------------------------------------------------------
Let's check how a customer's interest in purchasing a loan varies with their age
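A minimal sketch of this check, assuming the distribution helper defined above is used:

# Distribution of Age for loan takers vs. non-takers
distribution_plot_wrt_target(data, "Age", "Personal_Loan")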
Outlier Detection
num_cols = data.select_dtypes(include=["float64", "int64"])  # numerical columns only
Q1 = num_cols.quantile(0.25)  # first quartile
Q3 = num_cols.quantile(0.75)  # third quartile
IQR = Q3 - Q1  # inter-quartile range
lower = Q1 - 1.5 * IQR  # lower bound; values below this are treated as outliers
upper = Q3 + 1.5 * IQR  # upper bound; values above this are treated as outliers

# Percentage of outliers in each numerical column
((num_cols < lower) | (num_cols > upper)).sum() / len(data) * 100
Age 0.00
Experience 0.00
Income 1.92
Family 0.00
CCAvg 6.48
Mortgage 5.82
dtype: float64
Model Building
Primary Focus: Recall (to minimize missed loan takers) and F1-Score (to balance
precision and recall).
Supplementary Metrics: Precision (to optimize targeting) and Accuracy (as a general
indicator).
First, let's create functions to calculate different metrics and confusion matrix so that we don't
have to use the same code repeatedly for each model.
def model_performance_classification_sklearn(model, predictors, target):
    """Compute Accuracy, Recall, Precision, and F1 for a fitted classifier.
    model: classifier; predictors: independent variables; target: dependent variable
    """
    pred = model.predict(predictors)  # predictions on the given data
    df_perf = pd.DataFrame(
        {"Accuracy": accuracy_score(target, pred), "Recall": recall_score(target, pred),
         "Precision": precision_score(target, pred), "F1": f1_score(target, pred)}, index=[0])
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum()
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
DecisionTreeClassifier(random_state=1)
decision_tree_perf_train = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train
Evaluate the model's performance on the test data to verify if it generalizes well:
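A minimal sketch of this evaluation, assuming the helper defined above is used; the get_depth/get_n_leaves calls are assumptions that match the printed output:

# Baseline tree performance on the held-out test set (reused in the final comparison)
decision_tree_perf_test = model_performance_classification_sklearn(model, X_test, y_test)
print("Test Performance:")
print(decision_tree_perf_test)

# Size of the fully grown tree
print("Tree Depth:", model.get_depth())
print("Number of Leaves:", model.get_n_leaves())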
Test Performance:
Accuracy Recall Precision F1
Tree Depth: 10
Number of Leaves: 49
feature_names = list(X_train.columns)
print(feature_names)
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp
Income 0.308098
Family 0.259255
Education_2 0.166192
Education_3 0.147127
CCAvg 0.048798
Age 0.033150
CD_Account 0.017273
ZIPCode_94 0.007183
ZIPCode_93 0.004682
Mortgage 0.003236
Online 0.002224
Securities_Account 0.002224
ZIPCode_91 0.000556
ZIPCode_92 0.000000
ZIPCode_95 0.000000
ZIPCode_96 0.000000
CreditCard 0.000000
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Checking model performance on test data
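A minimal sketch of this check, assuming the confusion-matrix helper defined earlier is applied to the baseline model on the test data:

# Confusion matrix of the baseline tree on the test set
confusion_matrix_sklearn(model, X_test, y_test)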
Pre-pruning
Note: The parameters provided below are a sample set. Feel free to update them and try
out other combinations.
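The search loop that produces best_estimator is only partially shown below; a minimal sketch of the setup it assumes (the parameter grid values and the initializations are illustrative):

from sklearn.model_selection import ParameterGrid

# Illustrative grid of pre-pruning parameters; other combinations can be tried
parameters = {
    "max_depth": np.arange(2, 8),
    "min_samples_leaf": [5, 10, 20, 25],
    "max_leaf_nodes": [5, 10, 15, None],
}

best_score_diff = np.inf  # smallest train-test recall gap seen so far
best_test_score = 0.0     # best test recall seen so far
best_estimator = None

for params in ParameterGrid(parameters):
    estimator = DecisionTreeClassifier(random_state=1, **params)
    estimator.fit(X_train, y_train)
    train_recall_score = recall_score(y_train, estimator.predict(X_train))
    test_recall_score = recall_score(y_test, estimator.predict(X_test))
    score_diff = abs(train_recall_score - test_recall_score)  # train-test recall gap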
    # Update the best estimator and best score if the current one has a smaller
    # train-test gap and a higher test recall
    if (score_diff < best_score_diff) & (test_recall_score > best_test_score):
        best_score_diff = score_diff
        best_test_score = test_recall_score
        best_estimator = estimator
# Fit the best algorithm to the data.
estimator = best_estimator
estimator.fit(X_train, y_train) ## Complete the code to fit model on train data
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Importance of each feature in building the pre-pruned decision tree -
print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp
Income 0.662361
Education_2 0.143155
CCAvg 0.087565
Education_3 0.050404
Family 0.039987
CD_Account 0.007829
Mortgage 0.004987
Age 0.003711
Online 0.000000
ZIPCode_91 0.000000
ZIPCode_92 0.000000
ZIPCode_93 0.000000
ZIPCode_94 0.000000
ZIPCode_95 0.000000
ZIPCode_96 0.000000
Securities_Account 0.000000
CreditCard 0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test
Post-pruning
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
ccp_alphas impurities
0 0.000000 0.000000
1 0.000186 0.001114
2 0.000214 0.001542
3 0.000242 0.002750
4 0.000250 0.003250
5 0.000268 0.004324
6 0.000272 0.004868
7 0.000276 0.005420
8 0.000381 0.005801
9 0.000527 0.006329
10 0.000625 0.006954
11 0.000700 0.007654
12 0.000769 0.010731
13 0.000882 0.014260
14 0.000889 0.015149
15 0.001026 0.017200
16 0.001305 0.018505
17 0.001647 0.020153
18 0.002333 0.022486
19 0.002407 0.024893
20 0.003294 0.028187
21 0.006473 0.034659
22 0.025146 0.084951
23 0.039216 0.124167
24 0.047088 0.171255
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the
alpha value that prunes the whole tree, leaving the tree, clfs[-1], with only one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
clf.fit(X_train, y_train)  ## Complete the code to fit decision tree on train data
clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test, pred_test)
recall_test.append(values_test)
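A minimal sketch of how best_model and best_alpha could be selected, assuming the pruned tree with the highest test recall is chosen:

# Pick the pruned tree with the best recall on the test set
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
best_alpha = ccp_alphas[index_best_model]  # the corresponding ccp_alpha, reused below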
print(best_model)
DecisionTreeClassifier(random_state=1)
print(ccp_alpha)
0.04708834100596766
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=best_alpha, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator_2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Importance of each feature in building the post-pruned decision tree -
print(
pd.DataFrame(
estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp
Income 0.597264
Education_2 0.138351
CCAvg 0.078877
Education_3 0.067293
Family 0.066244
Age 0.018973
CD_Account 0.011000
Mortgage 0.005762
Securities_Account 0.004716
ZIPCode_94 0.004702
ZIPCode_91 0.003587
CreditCard 0.002428
ZIPCode_92 0.000802
Online 0.000000
ZIPCode_93 0.000000
ZIPCode_95 0.000000
ZIPCode_96 0.000000
importances = estimator_2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
decision_tree_tune_post_test = model_performance_classification_sklearn(estimator_2, X_test, y_test)
decision_tree_tune_post_test
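A minimal sketch of the post-pruned tree's training-set performance, assuming it is computed with the same helper (it is used in the comparison below):

decision_tree_tune_post_train = model_performance_classification_sklearn(
    estimator_2, X_train, y_train
)
decision_tree_tune_post_train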
models_train_comp_df = pd.concat(
    [decision_tree_perf_train.T, decision_tree_tune_perf_train.T, decision_tree_tune_post_train.T],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
models_test_comp_df = pd.concat(
    [decision_tree_perf_test.T, decision_tree_tune_perf_test.T, decision_tree_tune_post_test.T],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Actionable Insights
1. Key Drivers of Loan Acceptance:
Income: The most critical predictor of loan acceptance. Customers with higher incomes are more likely to accept loans, reflecting better financial stability and repayment ability.
Education Level: Graduate and advanced education levels correlate strongly with loan acceptance, likely due to better financial literacy and earning potential.
Credit Card Spending (CCAvg): High-spending customers are more inclined toward loans, possibly for debt consolidation or managing high expenses.
Family Size: Customers with larger families show a moderate likelihood of accepting loans, potentially driven by higher financial responsibilities.
2. Targeted Segments:
The post-pruned decision tree model offers the best balance of recall (85.23%)
and precision (93.38%), making it suitable for identifying high-probability loan
takers while minimizing false positives.
Business Recommendations
1. Targeted Marketing Campaigns:
Use the model to identify customers with high probabilities of loan acceptance
based on their income, education, and spending habits.
Develop personalized loan offers tailored to these segments to increase
conversion rates.
2. Precision-Driven Strategy:
Use the post-pruned decision tree, which balances high recall with high precision, so that campaigns reach likely loan takers while keeping false positives and marketing costs low.
3. Digital and Educational Outreach:
Given that a significant portion of the bank's customers use online banking, prioritize digital marketing channels to reach these customers effectively.
4. Customer Feedback:
Collect and analyze feedback from customers who decline loans to identify potential barriers (e.g., interest rates, terms) and improve future offerings.
Expected Benefits
- Improved conversion rates for personal loan campaigns by focusing on high-probability segments.
- Reduced marketing costs by targeting precise customer segments and minimizing false positives.
- Enhanced customer satisfaction through personalized and relevant loan offers.