0% found this document useful (0 votes)
9 views14 pages

7.email Spam Filtering Using Naive Bayes Classifier

Uploaded by

nanipavan830
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views14 pages

7.email Spam Filtering Using Naive Bayes Classifier

Uploaded by

nanipavan830
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

___

Back to Blog

DATA SCIENCE

Email Spam Filtering Using


Naive Bayes Classifier
9 minute read | June 14, 2021

Written by:
Sakshi Gupta

Naive Bayes is a probabilistic algorithm based on the Bayes


Theorem used for email spam filtering in data analytics. If you have
an email account, we are sure that you have seen emails being
categorised into different buckets and automatically being marked
important, spam, promotions, etc. Isn’t it wonderful to see machines
being so smart and doing the work for you?
More often than not, these labels added by the system are right. So
does this mean our email software is reading through every
communication and now understands what you as a user would
have done? Absolutely right! In this age and time of data analytics &
machine learning, automated filtering of emails happens via
algorithms like Naive Bayes Classifier, which apply the basic Bayes
Theorem on the data.

In this article, we will understand briefly about the Naive Bayes


Algorithm before we get our hands dirty and analyse a real email
dataset in Python. This blog is second in the series to understand
the Naive Bayes Algorithm. You can read part 1 here in the
introduction to Bayes Theorem & Naive Bayes Algorithm blog.

Source: GitHub

The Naive Bayes Classifier


Formula
One of the most simple yet powerful classifier algorithms, Naive
Bayes is based on Bayes’ Theorem Formula with an assumption of
independence among predictors. Given a Hypothesis A and
evidence B, Bayes’ Theorem calculator states that the relationship
between the probability of Hypothesis before getting the evidence
P(A) and the probability of the hypothesis after getting the evidence
P(A|B) is:

Source: Wikipedia

Here:

A, B = events
P(A|B) = probability of A given B is true
P(B|A) = probability of B given A is true
P(A), P(B) = the independent probabilities of A and B

This theorem, as explained in one of our previous articles, is mainly


used for classification techniques in data analytics. The Naive Bayes
theorem calculator pays an important role in spam detection of
emails.

Get To Know Other Data Science


Students

Peter Liu
Business Intelligence Analyst at Indeed

Read Story

Detecting Email Spam


Modern spam filtering software continuously struggles to
categorise the emails correctly. Unwanted spam & promotional
communication is the toughest of them all. Spam communication
algorithms must be iterated continuously since there is an ongoing
battle between spam filtering software and anonymous spam &
promotional mail senders. Naive Bayes Algorithm in data analytics
forms the base for text filtering in Gmail, Yahoo Mail, Hotmail & all
other platforms.

Like Naive Bayes, other classifier algorithms like Support Vector


Machine, or Neural Network also get the job done! Before we begin,
here is the dataset for you to download:
Email Spam Filtering Using Naive Bayes Algorithm
This would be a zipped file, attached in the email. Please allow users
to download this data.

Source: Sebastian Raschka

For convenience, we have already split the data into train & test files.
Let’s get into it:

import pandas as pd
# read training data & test data
df_train = pd.read_csv("training.csv")
df_test = pd.read_csv("test.csv")

Always review the first 5 rows of the dataset:

df_test.sample(5)
df_train.sample(5)

Your output for train dataset may look something like this:

type email
1779 Ham <p>Into thereis tapping said that scarce whose…
1646 Ham <p>Then many take the ghastly and rapping gaun…
534 Spam <p>Did parting are dear where fountain save ca…
288 Spam <p>His heart sea he care he sad day there anot…
1768 Ham <p>With ease explore. See whose swung door and…

And the output for test dataset would look something like this:

type email
58 Ham <p>Sitting ghastly me peering more into in the…
80 Spam <p>A favour what whilome within childe of chil…
56 Spam <p>From who agen to sacred breast unto will co…
20 Ham <p>Of to gently flown shrieked ashore such sad…
94 Spam <p>A charms his of childe him. Lowly one was b…

If you notice, you will realise that we have two columned CSV files
here. Type column contains whether the email is marked as Spam
or Ham & the email columns contains body (main text) of the email.
Both the train & test datasets have the same format.

Ensuring data consistency is of utmost importance in any data


analytics problem. Let’s do some descriptive statistics as the first
step in the data analytics problem, on our training data.

df_train.describe(include = 'all')

type email
count 2000 2000
unique 2 2000
top Spam <p>Along childe love and the but womans a the …
freq 1000 1

In the output, we will see that there are 2000 records. We have two
unique Type and 2000 unique emails. Let’s detail a little more about
Type column.

df_train.groupby('type').describe()

email
count unique top freq
type
Ham 1000 1000 <p>Broken if still art within lordly or the it… 1
<p>Along childe love and the but womans
Spam 1000 1000 1
a the …

As we can see, in our test data, we have an equal number (1000


each) of Spam and Ham. There is no duplicate data in the email
column. Let’s sanitise our data now.

import email_pre as ep
from gensim.models.phrases import Phrases

def do_process(row):
global bigram
temp = ep.preprocess_text(row.email,[ep.lowercase,
ep.remove_html,
ep.remove_esc_chars,
ep.remove_urls,
ep.remove_numbers,
ep.remove_punct,
ep.lemmatize,
ep.keyword_tokenize])

if not isinstance(temp,str):
print temp

return ' '.join(bigram[temp.split(" ")])

def phrases_train(sen_list,min_ =3):


if len(sen_list) <= 10:
print("too small to train! ")
return

if isinstance(sen_list,list):
try:
bigram = Phrases.load("email_EN_bigrams_spam")
bigram.add_vocab(sen_list)
bigram.save("email_EN_bigrams_spam")
print "retrain!"

except Exception as ex:


print "first "
bigram = Phrases(sen_list, min_count=min_, threshold=2)
bigram.save("email_EN_bigrams_spam")
print ex

Phrase Model train (we can run this once & save it)

train_email_list = [ep.preprocess_text(mail,[ep.lowercase,
ep.remove_html,
ep.remove_esc_chars,
ep.remove_urls,
ep.remove_numbers,
ep.remove_punct,
ep.lemmatize,
ep.keyword_tokenize]).split(" ") for mail in df_train.email.values]

print "after pre_process :"


print " "
print len(train_email_list)
print df_train.ix[22].email,">>"*80,train_email_list[22]

Here is the output after an initial pre_processing:

2000
<p>Him ah he more things long from mine for. Unto feel they seek
other adieu crime dote. Adversity pangs low. Soon light now time
amiss to gild be at but knew of yet bidding he thence made. Will
care true and to lyres and and in one this charms hall ancient
departed from. Bacchanals to none lay charms in the his most his
perchance the in and the uses woe deadly. Save nor to for that that
unto he. Thy in thy. Might parasites harold of unto sing at that in for
soils within rake knew but. If he shamed breast heralds grace once
dares and carnal finds muse none peace like way loved. If long
favour or flaunting did me with later will. Not calm labyrinth tear
basked little. It talethis calm woe sight time. Rake and to hall. Land
the a him uncouth for monks partings fall there below true sighed
strength. Nor nor had spoiled condemned glee dome monks him
few of sore from aisle shun virtues. Bidding loathed aisle a and if
that to it chill shades isle the control at. So knew with one will wight
nor feud time sought flatterers earth. Relief a would break at he if
break not scape.</p><p>The will heartless sacred visit few. The was
from near long grief. His caught from flaunting sacred care fame
said are such and in but a.</p> [‘ah’, ‘things’, ‘long’, ‘mine’, ‘unto’, ‘feel’,
‘seek’, ‘adieu’, ‘crime’, ‘dote’, ‘adversity’, ‘pangs’, ‘low’, ‘soon’, ‘light’, ‘time’,
‘amiss’, ‘gild’, ‘know’, ‘yet’, ‘bid’, ‘thence’, ‘make’, ‘care’, ‘true’, ‘lyres’, ‘one’,
‘charm’, ‘hall’, ‘ancient’, ‘depart’, ‘bacchanals’, ‘none’, ‘lay’, ‘charm’,
‘perchance’, ‘use’, ‘woe’, ‘deadly’, ‘save’, ‘unto’, ‘thy’, ‘thy’, ‘might’,
‘parasites’, ‘harold’, ‘unto’, ‘sing’, ‘soil’, ‘within’, ‘rake’, ‘know’, ‘sham’,
‘breast’, ‘herald’, ‘grace’, ‘dare’, ‘carnal’, ‘find’, ‘muse’, ‘none’, ‘peace’, ‘like’,
‘way’, ‘love’, ‘long’, ‘favour’, ‘flaunt’, ‘later’, ‘calm’, ‘labyrinth’, ‘tear’, ‘bask’,
‘little’, ‘talethis’, ‘calm’, ‘woe’, ‘sight’, ‘time’, ‘rake’, ‘hall’, ‘land’, ‘uncouth’,
‘monks’, ‘part’, ‘fall’, ‘true’, ‘sigh’, ‘strength’, ‘spoil’, ‘condemn’, ‘glee’,
‘dome’, ‘monks’, ‘sore’, ‘aisle’, ‘shun’, ‘virtues’, ‘bid’, ‘loathe’, ‘aisle’, ‘chill’,
‘shade’, ‘isle’, ‘control’, ‘know’, ‘one’, ‘wight’, ‘feud’, ‘time’, ‘seek’, ‘flatterers’,
‘earth’, ‘relief’, ‘would’, ‘break’, ‘break’, ‘scapethe’, ‘heartless’, ‘sacred’,
‘visit’, ‘near’, ‘long’, ‘grief’, ‘catch’, ‘flaunt’, ‘sacred’, ‘care’, ‘fame’, ‘say’]

df_train["class"] = df_train.type.replace(["Spam","Ham"],[0,1])
df_test["class"] = df_test.type.replace(["Spam","Ham"],[0,1])

Bigram Training

phrases_train(train_email_list,min_=3)
bigram = Phrases.load("email_EN_bigrams_spam")
len(bigram.vocab)

And let’s retrain again! Here is the output:

159158

print len(dict((key,value) for key, value in bigram.vocab.iteritems


You may get this as the output:

4974

df_train["clean_email"] = df_train.apply(do_process,axis=1) df_test


# df_train.head()
print "phrase found train:",df_train[df_train['clean_email'].str.co
print "phrase found test:",df_test[df_test['clean_email'].str.conta

Output

phrase found train: (371, 3)


phrase found test: (7, 3)

Let’s start training for Spam Detection now:

df_train.head()

Output

type email clean_email class


<p>But could then once could pomp glee glorious
0 Spam 0
pomp to nor that glee g… deign vex time childe…
<p>His honeyed and land honey land vile native ah ah
1 Spam 0
vile are so and native… like flash gild b…
<p>Tear womans his was tear womans tis eremites
2 Spam 0
by had tis her eremites… present dear know pro…
<p>The that and land. Cell land cell shun blazon
3 Spam 0
shun blazon passion… passion uncouth paphian …
<p>Sing aught through sing aught part things
4 Spam 0
partings things was sacr… sacred know passion pro…

Source: Medium
Source: TowardsDataScience

For the next section, you can proceed with the Naive Bayes part of
the algorithm:

from sklearn.pipeline
import Pipeline from sklearn.feature_extraction.text import CountVe
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
text_clf = Pipeline([('vect', CountVectorizer()),

('tfidf', TfidfTransformer()), ('clf', MultinomialNB()),

text_clf.fit(df_train.clean_email, df_train["class"])
predicted = text_clf.predict(df_test.clean_email)
from sklearn import metrics
array = metrics.confusion_matrix(df_test["class"], predicted)
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df_cm = pd.DataFrame(array, ["Spam","Ham"],


["Spam","Ham"])

sn.set(font_scale=1.4)#for label size


sn.heatmap(df_cm, annot=True,annot_kws={"size": 16})# font size

After running the Naive Bayes Algorithm, the output looks


something like this:
print metrics.classification_report(df_test["class"], predicted,
target_names=["Spam","Ham"])

We print the metrics to see something like this:

precision recall f1-score support

Spam 1.00 1.00 1.00 43


Ham 1.00 1.00 1.00 57

avg / total 1.00 1.00 1.00 100

In order to assess the model, we put the test data into our created
model after which we compare our results. As you can see in the
output above, it is visible that out of 43 spam mail, the model
successfully identifies all the 43 spam mails. And in the same way,
out of 57 ham mail, the model successfully identifies all the 57 Ham
mails.

Our application of the Bayes theorem formula in a Naive Bayes


Classifier technique is working successfully on this dataset. While it
is unusual to have 100% success from a model, we have been able
to achieve it due to the small size of training & testing datasets. All
we need to ensure is that the model trains with sufficient data. If this
happens, it will deliver more accurate results.

For further reading, see here the data science definition guide or
the data scientist job description.

Since you’re here…


Curious about a career in data science? Experiment with our
free data science learning path, or join our Data Science
Bootcamp, where you’ll get your tuition back if you don’t land a
job after graduating. We’re confident because our courses
work – check out our student success stories to get inspired.

About Sakshi Gupta


Sakshi is a Managing Editor at Springboard. She is
a technology enthusiast who loves to read and
write about emerging tech. She is a content
marketer with experience in the Indian and US
markets.

Related Articles

DATA SCIENCE MAY 19, 2023

What Do Data Scientists Make? [Data Scientist Salary


Guide]
DATA SCIENCE JULY 8, 2020

How Much Does a Data Scientist at Amazon Earn?

DATA SCIENCE OCTOBER 28, 2021

Future Scope of Machine Learning: What’s to Come


for Work & Business
Elevate your skills and broaden your
horizons.
ENTER YOUR EMAIL

[email protected] Subscribe for free

DATA + AI WHY SPRINGBOARD FOR PARTNERS

Data Analytics Bootcamp How it works Universities

Data Science Bootcamp Mentorship Businesses

Data Engineering Job guarantee Hire our


Bootcamp graduates
Student
Machine Learning outcomes Affiliates
Engineering and AI
Student stories
Bootcamp RESOURCES
Community
Introduction to Data
Payment options Blog
Analytics
Scholarships Compare
Data Science Prep
bootcamps

CYBERSECURITY
Free courses

Learn data
Cybersecurity Bootcamp
science

Learn coding
SOFTWARE ENGINEERING
Learn
Software Engineering cybersecurity
Bootcamp
Learn data
Software Engineering analytics
Bootcamp for Beginners

Software Engineering Prep ABOUT US

Our story
DESIGN Become a mentor

Join our team


UI/UX Design Bootcamp
Press inquiries:
UX Design Bootcamp
[email protected]
Introduction to Design

Take our All


quiz courses

© Copyright 2024
Terms Privacy Conduct Security Regulation Health Plan T&Cs

You might also like