Project Presentation
Project Presentation
Abstract: This project report aims to analyze the factors influencing a customer's decision to
subscribe to a term deposit and develop a predictive model to forecast the likelihood of
subscription. The report outlines the data collection process, data preprocessing techniques,
feature engineering approaches, model development, evaluation metrics, and interpretation of
results. The project findings provide insights for financial institutions to optimize their marketing
strategies and improve the subscription rate.
Problem Statement
Evaluation Metric
We will be using ROC-AUC for evaluation.
We will be defining a task to be performed and write the code to solve the task.
The tasks performed below should serve as a good guide regarding the steps that you
should go about a Machine Learning Problem. But kindly do not restrict yourself to only
the tasks that have been performed in this notebook and feel free to bring your ideas,skills
and strategies and implement them as well.
Word of caution
This template is just an example of a data-science pipeline, every data science problem is unique
and there are multiple ways to tackle them. Go through this template and try to leverage the
information in this while solving your hackathon problems but you may not be able to use all the
functions created here.
The data is related to direct marketing campaigns of a Portuguese banking institution. The
marketing campaigns were based on phone calls. Often, more than one contact to the same client
was required, in order to access if the product (bank term deposit) would be subscribed ('yes') or
not ('no') subscribed.
There are two datasets: train.csv with all examples (32950) and 21 inputs including the target
feature, ordered by date (from May 2008 to November 2010), very close to the data analyzed in
[Moro et al., 2014]
test.csv which is the test data that consists of 8238 observations and 20 features without the
target feature
Goal:- The classification goal is to predict if the client will subscribe (yes/no) a term deposit
(variable y).
Features
categorical,
default has credit in default? ('no','yes','unknown')
nominal
categorical,
housing has housing loan? ('no','yes','unknown')
nominal
categorical,
loan has personal loan? ('no','yes','unknown')
nominal
categorical,
contact contact communication type ('cellular','telephone')
nominal
categorical,
month last contact month of year ('jan', 'feb', 'mar', ..., 'nov', 'dec')
ordinal
day_of_ categorical,
last contact day of the week ('mon','tue','wed','thu','fri')
week ordinal
campaig number of contacts performed during this campaign and for this
numeric
n client (includes last contact)
Feature Feature Type Description
number of days that passed by after the client was last contacted
pdays numeric from a previous campaign (999 means client was not previously
contacted)
In [1]:
import numpy as np
import pandas as
pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwar
nings('ignore')
Loading Data
Modelling
Libraries
We will use the popular scikit-learn library to develop our machine learning algorithms. In
sklearn, algorithms are called Estimators and implemented in their own classes. For data
visualization, we will use the matplotlib and seaborn library. Below are common classes to load.
In [2]:
from sklearn.preprocessing import LabelEncoder,MinMaxScaler,StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier ,RandomForestClassifier ,GradientBo
ostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import Ridge,Lasso
from sklearn.metrics import roc_auc_score ,mean_squared_error,accuracy_score,classification_r
eport,roc_curve,confusion_matrix
import warnings
warnings.filterwarnings('ignore')
from scipy.stats.mstats import winsorize
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
pd.set_option('display.max_columns',None)
import six
import
sys
sys.modul
es['sklearn
.externals.
six'] = six
There are many Classification algorithms are present in machine learning, which are used for
different classification applications. Some of the main classification algorithms are as follows-
Logistic Regression
DecisionTree Classifier
RandomForest Classfier
The code we have written below internally splits the data into training data and validation data. It
then fits the classification model on the train data and then makes a prediction on the validation
data and outputs the scores for this prediction.
# Target
y = dataframe.iloc[:,-1]
[
6
]
:
# getting the auc roc curve
auc = roc_auc_score(y_val, y_scores)
print('Classification Report:')
print(classification_report(y_val,y_scores))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is',roc_auc_score(y_val, y_scores))
#fpr, tpr,
0 _ =0.90
roc_curve(y_test,
0.98 predictions[:,1])
0.93 5798
1 0.50 0.17 0.26 792
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
accuracy 0.88 659
plt.ylabel('TPR') 0
plt.title('ROC
macro avg curve') 0.70 0.57 0.60 6590
plt.show()
weighted avg 0.85 0.88 0.85 6590
Classification
ROC_AUC_SCORE Report:is 0.5742166403601381
precision
recall f1-
score
support
The above two steps are combined and run in a single cell for all the remaining models
respectively
In [7]:
# Run Decision Tree Classifier
model = DecisionTreeClassifier()
model.fit(x_train, y_train)
y_scores = model.predict(x_val)
auc = roc_auc_score(y_val, y_scores)
print('Classification Report:')
print(classification_report(y_val,y_scores))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is',roc_auc_score(y_val, y_scores))
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
Classification Report:
precision
recall f1-
score
support
accuracy 0.87 6590
macro
0 avg 0.69 0.69 0.69 6590
0.93
0.93
0.93
5798
1
0.46
weighted avg 0.87 0.87 0.87 6590
ROC_AUC_SCORE is 0.6924298608715649
In [8]:
from sklearn import tree
from sklearn.tree import export_graphviz # display the tree within a Jupyter notebook
from IPython.display import SVG
from graphviz import Source
from IPython.display import
display
from ipywidgets import interactive, IntSlider, FloatSlider, interact
import ipywidgets
from IPython.display import Image
from subprocess import call
import matplotlib.image as mpimg
I
n
[
9
]
:
@interact
def plot_tree(crit=["gini", "entropy"],
split=["best", "random"],
depth=IntSlider(min=1,max=30,valu
e=2,
continuous_update=False),
min_split=IntSlider(min=2,max=5,value=2, continuous_update=False),
min_leaf=IntSlider(min=1,max=5,value=1, continuous_update=False)):
estimator = DecisionTreeClassifier(random_state=0,
criterion=crit, splitter =
split, max_depth =
depth,
print('Decision Tree Training Accuracy: {:.3f}'.format(accuracy_score(y_train, estimator.predi
ct(x_train))))
print('Decision Tree Test Accuracy: {:.3f}'.format(accuracy_score(y_val, estimator.predict(x_
val))))
graph = Source(tree.export_graphviz(estimator,
out_file=None,
feature_names=x_train.columns,
class_names=['0', '1'],
filled = True))
display(Image(data=graph.pipe(format='png'
DecisionTreeClassifier(max_depth=2, random_state=0)
In [10]:
# run Random Forrest Classifier
model = RandomForestClassifier()
model.fit(x_train, y_train)
y_scores = model.predict(x_val)
auc = roc_auc_score(y_val, y_scores)
print('Classification Report:')
print(classification_report(y_val,y_scores))
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_val, y_scores)
print('ROC_AUC_SCORE is',roc_auc_score(y_val, y_scores))
plt.plot(false_positive_rate, true_positive_rate)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
Classification Report:
precision
recall f1-
score
support
accuracy 0.90 6590
macro0 avg 0.78 0.66 0.70 6590
weighted avg
0.92 0.88 0.90 0.89 6590
0.97
ROC_AUC_SCORE
0.94 is 0.662138372340166
5798
1
0.64
0.35
0.45
792
Feature Selection
Now that we have applied vanilla models on our data, we now have a basic understanding of
what our predictions look like. Let's now use feature selection methods for identifying the best
set of features for each model.
[
1
2
]
:
# Selecting 8 number of features
# Random Forrest classifier model
models = RandomForestClassifier()
# using rfe and selecting 8 features
rfe = RFE(models,8)
# fitting the model
rfe = rfe.fit(X,y)
# ranking features
feature_ranking = pd.Series(rfe.ranking_, index=X.columns)
plt.show()
print('Features to be selected for Random Forrest Classifier are:')
print(feature_ranking[feature_ranking.values==1].index.tolist())
print('===='*30)
Features to be selected for Random Forrest Classifier are:
['age', 'job', 'education', 'month', 'day_of_week', 'duration',
'campaign', 'poutcome']
===============================================
======================
In [13]:
# splitting the data into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=
y)
# selecting the data
rfc = RandomForestClassifier(random_state=42)
# fitting the data
rfc.fit(X_train, y_train)
# predicting the data
y_pred =
rfc.predict(X_test)
# feature importances
rfc_importances =
pd.Series(rfc.feature_i
mportances_,
index=X.columns).sort
_values().tail(10)
# plotting bar chart
according to feature
importance
rfc_importances.plot(kind='bar')
plt.show()
Observations :
We can test the features obtained from both the feature selection techniques by inserting these
features to the model and depending on which set of features perform better, we can retain them
for the model.
The Feature Selection techniques can differ from problem to problem and the techniques
applied for this problem may or may not work for the other problems. In those cases, feel
free to try out other methods like PCA, SelectKBest(), SelectPercentile(), tSNE etc.
In
[14]:
# splitting the data
x_train,x_val,y_train,y_val = train_test_split(X,y, test_size=0.3, random_state=42, stratify=y)
# selecting the classifier
rfc = RandomForestClassifier()
# selecting the parameter
param_grid = {
'max_features': ['auto', 'sqrt',
'log2'],
'max_depth' : [4,5,6,7,8],
'criterion' :['gini', 'entropy']
}
# using grid search with respective parameters
grid_search_model = GridSearchCV(rfc, param_grid=param_grid)
# fitting the model
grid_search_model.fit(x_train, y_train)
# printing the best parameters
print('Best Parameters are:',grid_search_model.best_params_)
Best Parameters are: {'criterion': 'gini', 'max_depth': 8,
'max_features': 'log2'}
Kindly note that SMOTE should always be applied only on the training data and not on the
validation and test data.
You can try experimenting with and without SMOTE and check for the difference in recall.
In [15]:
from sklearn.metrics import roc_auc_score,roc_curve,classification_report
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from yellowbrick.classifier import roc_auc
rfc.fit(X_sm, y_sm)
y_pred = rfc.predict(x_val)
print(classification_report(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
visualizer = roc_auc(rfc,X_sm,y_sm,x_val,y_val)
grid_search_random_forrest_best(X,y)
precision recall f1-score
support
[[6801 1922]
[ 309 853]]
Applying the grid search function for random forest only on the best features obtained
using Random Forest
In [16]:
grid_search_random_forrest_best(X[['age', 'job', 'education', 'month', 'day_of_week', 'duration', 'c
ampaign', 'poutcome']],y)
precision recall f1-score support
[[7099 1624]
[ 258 904]]
Ensembling
Ensemble learning uses multiple machine learning models to obtain better predictive
performance than could be obtained from any of the constituent learning algorithms alone. In the
below task, we have used an ensemble of three models
- RandomForestClassifier(), GradientBoostingClassifier(), LogisticRegression(). Feel free to
modify this function as per your requirements and fit more models or change the parameters for
every model.
In [17]:
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import VotingClassifier
[[7420 1303]
[ 358 804]]
In this task below, we will read the test file and store the Id column from the test file in a
variable Id. This column would be of use to us while submission since we need to have an Id
column in the submission file which is the same Id of the observations in the test data.
We have to perform the same preprocessing operations on the test data that we have performed
on the train data. For demonstration purposes, we have preprocessed the test data and this
preprocessed data is present in the csv file test_preprocessed.csv
We then make a prediction on the preprocessed test data using the Grid Search Logisitic
regression model. And as the final step, we concatenate this prediction with the Id column and
then convert this into a csv file which becomes the submission.csv
In
[19]
:
# Preprocessed Test File
test = pd.read_csv('../input/banking-project-term-deposit/new_test.csv')
test.head()
a O
jo mar educa defa hous lo cont mo day_of_ durat camp upoutc
b ital tion ult ing act week ion aign t ome
g
an nth [
e 1
9
]
3
0 4 0 6 0 0 0 0 3 3 131 5 :1
2
3 1
1 3 6 0 0 0 0 4 3 100 1 1
7 0
5
2 5 0 5 1 2 0 0 3 2 131 2 1
5
4
3 2 1 0 1 0 0 1 4 3 48 2 1
4
2
4 0 2 3 0 0 0 0 5 0 144 2 1
8
rfc = RandomForestClassifier()
# selecting the parameter
param_grid = {
'max_features': ['auto', 'sqrt',
'log2'],
'max_depth' : [4,5,6,7,8],
'criterion' :['gini', 'entropy']
}
# using grid search with
respective parameters
grid_search_model =
GridSearchCV(rfc,
param_grid=param_grid)