Naive Bayes vs Logistic Regression
5) default: has credit in default? (binary: “yes”,“no”)
6) balance: average yearly balance, in euros (numeric)
7) housing: has housing loan? (binary: “yes”,“no”)
8) loan: has personal loan? (binary: “yes”,“no”)
## related with the last contact of the current campaign:
9) contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
10) day: last contact day of the month (numeric)
11) month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
12) duration: last contact duration, in seconds (numeric)
## other attributes:
13) campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14) pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15) previous: number of contacts performed before this campaign and for this client (numeric)
16) poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
Output variable (desired target):
17) y: has the client subscribed a term deposit? (binary: “yes”,“no”)
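For reference, here is a minimal way the data and the plotting libraries used below could be loaded. The file name bank-full.csv is an assumption (the UCI bank marketing CSV is semicolon-separated), and seaborn is assumed for the confusion-matrix heatmaps further down:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # assumed; used for the confusion-matrix heatmaps below

# File name is an assumption; the UCI file uses ';' as the column separator
bank_df = pd.read_csv("bank-full.csv", sep=";")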
bank_df
contact day month duration campaign pdays previous poutcome y
0 unknown 5 may 261 1 -1 0 unknown no
1 unknown 5 may 151 1 -1 0 unknown no
2 unknown 5 may 76 1 -1 0 unknown no
3 unknown 5 may 92 1 -1 0 unknown no
4 unknown 5 may 198 1 -1 0 unknown no
… … … … … … … … … …
45206 cellular 17 nov 977 3 -1 0 unknown yes
45207 cellular 17 nov 456 2 -1 0 unknown yes
45208 cellular 17 nov 1127 5 184 3 success yes
45209 telephone 17 nov 508 4 -1 0 unknown no
45210 cellular 17 nov 361 2 188 11 other no
[3]: print(bank_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 45211 non-null int64
1 job 45211 non-null object
2 marital 45211 non-null object
3 education 45211 non-null object
4 default 45211 non-null object
5 balance 45211 non-null int64
6 housing 45211 non-null object
7 loan 45211 non-null object
8 contact 45211 non-null object
9 day 45211 non-null int64
10 month 45211 non-null object
11 duration 45211 non-null int64
12 campaign 45211 non-null int64
13 pdays 45211 non-null int64
14 previous 45211 non-null int64
15 poutcome 45211 non-null object
16 y 45211 non-null object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
None
[4]: print(bank_df.isnull().sum().sort_values(ascending=False).head(17))
age 0
day 0
poutcome 0
previous 0
pdays 0
campaign 0
duration 0
month 0
contact 0
job 0
loan 0
housing 0
balance 0
default 0
education 0
marital 0
y 0
dtype: int64
So we have no NaN values in this dataset, which reduces the amount of cleaning work. That's very good.
[6]: bank_df.describe()
pdays previous
count 45211.000000 45211.000000
mean 40.197828 0.580323
std 100.128746 2.303441
min -1.000000 0.000000
25% -1.000000 0.000000
50% -1.000000 0.000000
75% -1.000000 0.000000
max 871.000000 275.000000
The description shows that several columns, such as balance and pdays, contain a lot of outliers. That is not a problem for now; we will deal with them later. Next, let's make some visualizations to look at the distributions in the dataset in more detail.
2.0.2 Visualization
[9]: def set_of_visualization(df):
         df = df.copy()
         # Side-by-side boxplot and histogram
         fig, axes = plt.subplots(1, 2, figsize=(12, 6))
         # Boxplot of balance (assumed column; it is the one discussed below)
         axes[0].boxplot(df["balance"])
         axes[0].set_title("Balance")
         # Histogram of client age
         axes[1].hist(df["age"], bins=30, color="blue", edgecolor="k", alpha=0.7)
         axes[1].set_title("Age")
         plt.tight_layout()
         plt.show()

     set_of_visualization(bank_df)
Two things stand out in this visualization:
- The boxplot shows a large number of outliers on both the positive and the negative side, however the data is filtered. In the preprocessing step I will remove as many of these outliers as possible, because they would push the model toward heavy overfitting. I also expect this feature to carry a lot of predictive power for the target.
- The histogram shows that most clients are roughly between 30 and 40 years old.
def scatterplot(df):
    df = df.copy()
    # Mean balance for each age
    age_grouped = df.groupby("age")["balance"].mean()
    plt.figure(figsize=(8, 5))
    plt.bar(x=age_grouped.index, height=age_grouped.values,
            color="blue", edgecolor="k", alpha=0.7)
    plt.xlabel("Age")
    plt.ylabel("Balance")
    plt.title("Average balance by age")
    plt.show()

scatterplot(bank_df)
The bar plot shows that young clients do not have large balances; I think this is because young people have less money and are more likely to be taking out a loan from the bank. The oldest clients have the largest balances; I think they have been saving for retirement throughout their lives. This is also where the most extreme anomalies in the dataset come from. So let's see which age range pays its money back the fastest.
def time_visualization(df):
    # …
    plt.tight_layout()
    plt.show()

time_visualization(bank_df)
These plots confirm that the dataset contains a lot of outliers. Many columns have so many of them that they would bias the model and lead to heavy overfitting, which is not good. So for the next step we will first remove as many outliers as possible and then use scikit-learn for the preprocessing. Let's start.
def remove_outliers(df, columns):
    df_clean = df.copy()
    for col in columns:
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        # Keep only the rows inside the usual 1.5 * IQR fences
        lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
        df_clean = df_clean[(df_clean[col] >= lower) & (df_clean[col] <= upper)]
    return df_clean

# Function name and column list are assumptions; pdays and previous keep their -1 sentinel
bank_df_cleaned = remove_outliers(bank_df, ["age", "balance", "duration", "campaign"])
bank_df_cleaned.describe()
pdays previous
count 34967.000000 34967.000000
mean 42.587325 0.590042
std 103.504469 2.383307
min -1.000000 0.000000
25% -1.000000 0.000000
50% -1.000000 0.000000
75% -1.000000 0.000000
max 871.000000 275.000000
# Convert y to numeric: the ROC curve only works with numeric labels
df["y"] = df["y"].map({"no": 0, "yes": 1})
X = df[col_num + col_cat]
y = df["y"]
# Feature names produced by the preprocessor
feature_names = prepro.get_feature_names_out()
# Convert the transformed arrays back into DataFrames
X_train_transformed = pd.DataFrame(X_train_transformed, columns=feature_names)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=feature_names)
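The preprocessor prepro, the train/test split, and the transformed arrays come from a cell that does not appear above. A minimal sketch of such a setup, assuming a ColumnTransformer that scales the numeric columns and one-hot encodes the categorical ones (the split settings are illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Split first, then fit the preprocessor on the training part only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

prepro = ColumnTransformer([
    ("num", StandardScaler(), col_num),
    # dense output so the arrays can be wrapped in DataFrames (use sparse=False on older scikit-learn)
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), col_cat),
])

X_train_transformed = prepro.fit_transform(X_train)
X_test_transformed = prepro.transform(X_test)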
Let's start with the logistic regression model in this case.
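The cell that actually fits logit_model is not shown; a minimal sketch, assuming scikit-learn's LogisticRegression trained on the transformed features, might be:

from sklearn.linear_model import LogisticRegression

# max_iter raised so the solver converges on the one-hot encoded features
logit_model = LogisticRegression(max_iter=1000)
logit_model.fit(X_train_transformed, y_train)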
# Prediction
y_pred_logit = logit_model.predict(X_test_transformed)

print(classification_report(y_test, y_pred_logit))

# Confusion matrix heatmap (seaborn heatmap assumed for the plot itself)
sns.heatmap(confusion_matrix(y_test, y_pred_logit), annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predictions")
plt.ylabel("True Values")
plt.title("Confusion Matrix for Logistic Regression")
plt.show()
The logistic regression performs quite well, with an accuracy of 92%. The model is good overall, but its weak spot is the clients who actually subscribe to a term deposit: it does not predict them well. The precision is acceptable but not great, and the recall is poor. But what exactly are precision and recall? We will explain them below.
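Next, the same evaluation for Naive Bayes. Its fitting cell is not shown either; a minimal sketch, assuming scikit-learn's GaussianNB on the same transformed features, might be:

from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X_train_transformed, y_train)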
# Predictions
y_pred_nb = nb_model.predict(X_test_transformed)

print(classification_report(y_test, y_pred_nb))

# Confusion matrix heatmap (same plotting style as for logistic regression)
sns.heatmap(confusion_matrix(y_test, y_pred_nb), annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predictions")
plt.ylabel("True Values")
plt.title("Confusion Matrix for Naive Bayes")
plt.show()
Naive Bayes is not better than logistic regression in this case, and it shares the same problem: neither model identifies the subscribing clients very well. The recall is better for Naive Bayes, while the precision is better for logistic regression.
A recall of 29% is very weak: the model confuses a large number of instances. That is a real problem, because it means the model claims a client will not subscribe when in fact they would, so the campaign would skip exactly the clients it should be targeting. To read the recall, follow the row of the true class in the confusion matrix: here the recall for the subscribing clients is 265 / (265 + 642) ≈ 0.29. That is why recall is such an important metric in many cases.
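To make the definitions concrete: precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP, and FN are read off the confusion matrix. A minimal sketch that recomputes the recall quoted above by hand and then gets the full scores from scikit-learn's helpers (shown here on the Naive Bayes predictions; the same calls work for logistic regression):

from sklearn.metrics import precision_score, recall_score

# recall = TP / (TP + FN): reading the row of the true class quoted above
tp, fn = 265, 642
print("recall by hand:", tp / (tp + fn))   # 265 / 907 ≈ 0.292

# precision = TP / (TP + FP); scikit-learn computes both from the predictions
print("precision:", precision_score(y_test, y_pred_nb))
print("recall:   ", recall_score(y_test, y_pred_nb))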
y_prob_nb = nb_model.predict_proba(X_test_transformed)[:, 1]
y_prob_logit = logit_model.predict_proba(X_test_transformed)[:, 1]
# ROC Curve
fpr_nb, tpr_nb, _ = roc_curve(y_test, y_prob_nb)
fpr_logit, tpr_logit, _ = roc_curve(y_test, y_prob_logit)
# AUC score
roc_auc_nb = roc_auc_score(y_test, y_prob_nb)
roc_auc_logit = roc_auc_score(y_test, y_prob_logit)
plt.figure(figsize=(8,6))
plt.plot(fpr_nb, tpr_nb, color='blue', label=f'Naive Bayes (AUC = {roc_auc_nb:.4f})')
plt.plot(fpr_logit, tpr_logit, color='red', label=f'Logistic Regression (AUC = {roc_auc_logit:.4f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves")
plt.legend()
plt.show()
We can now say that logistic regression officially outperforms Naive Bayes in this case, although both models perform reasonably well overall.
joblib.dump(prepro,"pipeline_bank.pkl")
[71]: ['pipeline_bank.pkl']
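To reuse the saved preprocessor later, it can be reloaded with joblib and applied to new data; a minimal check, using the test features already available in this notebook:

import joblib

# Reload the fitted preprocessor and verify it still transforms the data
prepro_loaded = joblib.load("pipeline_bank.pkl")
print(prepro_loaded.transform(X_test).shape)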
2.0.9 Conclusion
The most important takeaways are:
- The dataset contains a lot of outliers.
- Both models perform reasonably well, but Logistic Regression outperforms Naive Bayes in this case.
- Both models struggle to detect the true positives correctly; precision and recall both suffer from the same problem, which is not good.
We will address this problem in future work and look at how to boost precision and recall so that the true positives are captured. Ensemble models such as Random Forest and other boosting methods should help improve these metrics, because many of them are quite robust to outliers.
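As a pointer for that follow-up, here is a minimal sketch of fitting such an ensemble on the same transformed features. RandomForestClassifier is only one possible choice and the hyperparameters are illustrative; class_weight="balanced" targets exactly the weak recall seen above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf_model = RandomForestClassifier(
    n_estimators=300, class_weight="balanced", random_state=42)
rf_model.fit(X_train_transformed, y_train)
print(classification_report(y_test, rf_model.predict(X_test_transformed)))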