Email Spam Filtering Using Naive Bayes Classifier
Written by:
Sakshi Gupta
Naive Bayes rests on Bayes' theorem:

P(A|B) = P(B|A) × P(A) / P(B)

Here:
A, B = events
P(A|B) = probability of A given B is true
P(B|A) = probability of B given A is true
P(A), P(B) = the independent probabilities of A and B
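As a quick numeric illustration (the probabilities below are made up purely for the example), suppose half of all mail is spam, the word "offer" appears in 20% of spam and only 2% of ham. Bayes' theorem then gives the probability that a mail containing "offer" is spam:

```python
# Hypothetical numbers, purely to illustrate Bayes' theorem.
p_spam = 0.5              # P(spam): prior probability a mail is spam
p_word_given_spam = 0.20  # P("offer" | spam)
p_word_given_ham = 0.02   # P("offer" | ham)

# Total probability of seeing the word:
# P(word) = P(word|spam) * P(spam) + P(word|ham) * P(ham)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # → 0.909
```

A Naive Bayes classifier applies this same update once per word, under the "naive" assumption that words occur independently given the class.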
For convenience, we have already split the data into train & test files.
Let’s get into it:
import pandas as pd
# read training data & test data
df_train = pd.read_csv("training.csv")
df_test = pd.read_csv("test.csv")
df_train.sample(5)
df_test.sample(5)
Your output for train dataset may look something like this:
type email
1779 Ham <p>Into thereis tapping said that scarce whose…
1646 Ham <p>Then many take the ghastly and rapping gaun…
534 Spam <p>Did parting are dear where fountain save ca…
288 Spam <p>His heart sea he care he sad day there anot…
1768 Ham <p>With ease explore. See whose swung door and…
And the output for test dataset would look something like this:
type email
58 Ham <p>Sitting ghastly me peering more into in the…
80 Spam <p>A favour what whilome within childe of chil…
56 Spam <p>From who agen to sacred breast unto will co…
20 Ham <p>Of to gently flown shrieked ashore such sad…
94 Spam <p>A charms his of childe him. Lowly one was b…
As you can see, these are two-column CSV files. The type column records whether the email is marked as Spam or Ham, and the email column contains the body (main text) of the email. Both the train and test datasets have the same format.
df_train.describe(include = 'all')
type email
count 2000 2000
unique 2 2000
top Spam <p>Along childe love and the but womans a the …
freq 1000 1
In the output, we can see that there are 2000 records, with two unique values in the type column and 2000 unique emails. Let's look at the type column in a little more detail.
df_train.groupby('type').describe()
                                email
      count  unique  top                                               freq
type
Ham   1000   1000    <p>Broken if still art within lordly or the it…  1
Spam  1000   1000    <p>Along childe love and the but womans a the …  1
import email_pre as ep
from gensim.models.phrases import Phrases

def do_process(row):
    # clean a single email body with the email_pre helpers
    temp = ep.preprocess_text(row.email, [ep.lowercase,
                                          ep.remove_html,
                                          ep.remove_esc_chars,
                                          ep.remove_urls,
                                          ep.remove_numbers,
                                          ep.remove_punct,
                                          ep.lemmatize,
                                          ep.keyword_tokenize])
    if not isinstance(temp, str):
        print(temp)
    return temp

def phrases_train(sen_list, min_=3):
    # update the saved bigram phrase model with new sentences
    if isinstance(sen_list, list):
        try:
            bigram = Phrases.load("email_EN_bigrams_spam")
            bigram.add_vocab(sen_list)
            bigram.save("email_EN_bigrams_spam")
            print("retrain!")
        except FileNotFoundError:
            # no saved model yet: train a fresh one (reconstructed branch)
            bigram = Phrases(sen_list, min_count=min_)
            bigram.save("email_EN_bigrams_spam")
            print("first train!")
Phrase model training (we can run this once and save it):
train_email_list = [ep.preprocess_text(mail,[ep.lowercase,
ep.remove_html,
ep.remove_esc_chars,
ep.remove_urls,
ep.remove_numbers,
ep.remove_punct,
ep.lemmatize,
ep.keyword_tokenize]).split(" ") for mail in df_train.email.values]
2000
Below, a raw email body is shown alongside its preprocessed token list:
<p>Him ah he more things long from mine for. Unto feel they seek
other adieu crime dote. Adversity pangs low. Soon light now time
amiss to gild be at but knew of yet bidding he thence made. Will
care true and to lyres and and in one this charms hall ancient
departed from. Bacchanals to none lay charms in the his most his
perchance the in and the uses woe deadly. Save nor to for that that
unto he. Thy in thy. Might parasites harold of unto sing at that in for
soils within rake knew but. If he shamed breast heralds grace once
dares and carnal finds muse none peace like way loved. If long
favour or flaunting did me with later will. Not calm labyrinth tear
basked little. It talethis calm woe sight time. Rake and to hall. Land
the a him uncouth for monks partings fall there below true sighed
strength. Nor nor had spoiled condemned glee dome monks him
few of sore from aisle shun virtues. Bidding loathed aisle a and if
that to it chill shades isle the control at. So knew with one will wight
nor feud time sought flatterers earth. Relief a would break at he if
break not scape.</p><p>The will heartless sacred visit few. The was
from near long grief. His caught from flaunting sacred care fame
said are such and in but a.</p> [‘ah’, ‘things’, ‘long’, ‘mine’, ‘unto’, ‘feel’,
‘seek’, ‘adieu’, ‘crime’, ‘dote’, ‘adversity’, ‘pangs’, ‘low’, ‘soon’, ‘light’, ‘time’,
‘amiss’, ‘gild’, ‘know’, ‘yet’, ‘bid’, ‘thence’, ‘make’, ‘care’, ‘true’, ‘lyres’, ‘one’,
‘charm’, ‘hall’, ‘ancient’, ‘depart’, ‘bacchanals’, ‘none’, ‘lay’, ‘charm’,
‘perchance’, ‘use’, ‘woe’, ‘deadly’, ‘save’, ‘unto’, ‘thy’, ‘thy’, ‘might’,
‘parasites’, ‘harold’, ‘unto’, ‘sing’, ‘soil’, ‘within’, ‘rake’, ‘know’, ‘sham’,
‘breast’, ‘herald’, ‘grace’, ‘dare’, ‘carnal’, ‘find’, ‘muse’, ‘none’, ‘peace’, ‘like’,
‘way’, ‘love’, ‘long’, ‘favour’, ‘flaunt’, ‘later’, ‘calm’, ‘labyrinth’, ‘tear’, ‘bask’,
‘little’, ‘talethis’, ‘calm’, ‘woe’, ‘sight’, ‘time’, ‘rake’, ‘hall’, ‘land’, ‘uncouth’,
‘monks’, ‘part’, ‘fall’, ‘true’, ‘sigh’, ‘strength’, ‘spoil’, ‘condemn’, ‘glee’,
‘dome’, ‘monks’, ‘sore’, ‘aisle’, ‘shun’, ‘virtues’, ‘bid’, ‘loathe’, ‘aisle’, ‘chill’,
‘shade’, ‘isle’, ‘control’, ‘know’, ‘one’, ‘wight’, ‘feud’, ‘time’, ‘seek’, ‘flatterers’,
‘earth’, ‘relief’, ‘would’, ‘break’, ‘break’, ‘scapethe’, ‘heartless’, ‘sacred’,
‘visit’, ‘near’, ‘long’, ‘grief’, ‘catch’, ‘flaunt’, ‘sacred’, ‘care’, ‘fame’, ‘say’]
df_train["class"] = df_train.type.replace(["Spam","Ham"],[0,1])
df_test["class"] = df_test.type.replace(["Spam","Ham"],[0,1])
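The classifier below is fit on a clean_email column, which is presumably produced by applying the preprocessing helper to each row (e.g. `df_train["clean_email"] = df_train.apply(do_process, axis=1)`). Since the email_pre module isn't included here, this sketch uses a minimal regex-based stand-in cleaner (`simple_clean` is hypothetical, not part of the original pipeline) just to show the shape of that step:

```python
import re
import pandas as pd

def simple_clean(text):
    """Minimal stand-in for the email_pre pipeline: lowercase,
    strip HTML tags, drop numbers and punctuation."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)  # drop numbers and punctuation
    return " ".join(text.split())          # normalise whitespace

df = pd.DataFrame({"email": ["<p>Win 1000 dollars NOW!</p>"]})
df["clean_email"] = df.email.apply(simple_clean)
print(df.clean_email[0])  # → win dollars now
```

The real pipeline additionally lemmatizes and tokenizes on keywords, but the row-wise `apply` pattern is the same.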
Bigram Training
phrases_train(train_email_list,min_=3)
bigram = Phrases.load("email_EN_bigrams_spam")
len(bigram.vocab)
159158
4974
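Under the hood, Phrases counts how often adjacent tokens co-occur across the corpus and merges frequent pairs into single `word_a_word_b` tokens. The idea can be sketched in plain Python (a toy frequency threshold, not gensim's actual scoring formula):

```python
from collections import Counter

def find_bigrams(token_lists, min_count=2):
    """Toy collocation detector: any adjacent token pair seen at
    least min_count times across the corpus becomes a bigram."""
    pair_counts = Counter()
    for tokens in token_lists:
        pair_counts.update(zip(tokens, tokens[1:]))
    return {f"{a}_{b}" for (a, b), n in pair_counts.items() if n >= min_count}

corpus = [["free", "offer", "click", "here"],
          ["limited", "free", "offer", "today"],
          ["click", "here", "now"]]
print(sorted(find_bigrams(corpus)))  # → ['click_here', 'free_offer']
```

gensim's Phrases uses a normalised score rather than a raw count, which is why its vocabulary above contains both unigrams and candidate bigrams.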
df_train.head()
Next, we can move on to the Naive Bayes part of the algorithm:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
text_clf.fit(df_train.clean_email, df_train["class"])
predicted = text_clf.predict(df_test.clean_email)
from sklearn import metrics
array = metrics.confusion_matrix(df_test["class"], predicted)

import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# plot the confusion matrix as a heatmap (class 0 = Spam, class 1 = Ham)
df_cm = pd.DataFrame(array, index=["Spam", "Ham"], columns=["Spam", "Ham"])
sn.heatmap(df_cm, annot=True, fmt="d")
To assess the model, we feed the test data into the trained model and compare the predictions against the true labels. As the confusion matrix above shows, the model correctly identifies all 43 spam mails, and likewise all 57 ham mails.
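Those confusion-matrix counts translate directly into the standard evaluation metrics. With all 43 spam and all 57 ham mails classified correctly, accuracy, precision, and recall all come out to 1.0 (a quick check in plain Python, using the counts reported above):

```python
# Confusion matrix counts from the test run above:
# all 43 spam and all 57 ham mails were classified correctly.
tp, fn = 43, 0   # spam correctly flagged / spam missed
tn, fp = 57, 0   # ham correctly passed / ham wrongly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of mails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how many were flagged

print(accuracy, precision, recall)  # → 1.0 1.0 1.0
```

On a larger, messier corpus these numbers would diverge, and precision versus recall is the trade-off to watch: a spam filter that wrongly flags ham (low precision) is usually worse than one that lets a little spam through.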