Hybrid Machine Learning Based E-Mail Spam Filtering Technique
INTRODUCTION
Many experiments are being conducted on spam mail to develop algorithms capable of identifying spam. Email filtering is generally based on content, which includes images, attachments, IP addresses, or headers that carry data about the recipient. As the volume of spam keeps increasing, [2] has formulated the problem of stopping such malicious attacks. Many individuals around the globe who respond to such attacks risk their financial or personal information, and to counter this the author has described several techniques. Many machine-learning-based methods have been used for electronic mail spam filtering, such as SVM and Artificial Immune System [9], anti-spam email filtering [3], a comparison of Naïve Bayes and a memory-based approach [4], Naïve Bayes and rule learning [5], neural networks and Bayesian classifiers [6], Bayesian filtering of junk email [7], and fuzzy similarity [8]. It is interesting to see whether the identified techniques have any impact on spam emails and how effectively they can stop spam messages before they enter the recipient's inbox [10]. Research has been conducted on existing methods for email spam detection, but their accuracy was quite low; hence performance in electronic mail spam detection needs to be improved. In this paper the proposed HYST combines the outcome probabilities obtained from different classifiers and computes the most likely classification of the email content as ham or spam.
There are many algorithms to classify spam and non-spam emails. To identify the best classification algorithm with respect to computational time, accuracy, misclassification rate, and precision, the algorithms are assessed on the Spambase dataset, where feature selection plays a major role before the algorithm itself is chosen. Some of the algorithms on which email spam filtering is based are described below.
In this work the authors extracted different categories of features from the Enron Spam dataset to find the best feature set for spam email filtering. Four categories of features were used: Bag-of-Words (BoW), Bigram Bag-of-Words, PoS Tag, and Bigram PoS Tag features. Bag-of-Words and Bigram Bag-of-Words alone are not sufficient to build an efficient spam email filtering model, because they lack features with high correlation to the target class. AdaBoost with J48, Random Forest, and the popular linear Support Vector Machine (SVM) trained with Sequential Minimal Optimization (SMO) are used as classifiers for model generation. Rare features are eliminated using a Naive Bayes score, and features are selected by their Information Gain value. A feature occurrence matrix is constructed and weighted with Term Frequency-Inverse Document Frequency (TF-IDF) values, and Singular Value Decomposition is employed as a matrix factorization technique. The experiments were carried out on individual feature models as well as ensemble models. The best individual feature models come from the PoS Tag and Bigram PoS Tag feature categories, and the best results from the individual feature categories are ensembled.
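The pipeline described above (TF-IDF weighting, SVD factorization, and a linear SVM classifier) can be sketched as follows. This is a minimal illustration: the toy messages and labels are invented, not drawn from the Enron corpus, and scikit-learn's LinearSVC stands in for an SMO-trained linear SVM.

```python
# Sketch of the described pipeline: bigram BoW features weighted by
# TF-IDF, factorized with truncated SVD, classified by a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

messages = [
    "win a free prize now", "cheap meds online",      # spam
    "meeting moved to monday", "see attached report"  # ham
]
labels = [1, 1, 0, 0]

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # BoW + bigram BoW features
    TruncatedSVD(n_components=2),         # SVD matrix factorization
    LinearSVC(),
)
pipeline.fit(messages, labels)
preds = pipeline.predict(messages)
print(preds)
```

On real data the vectorizer vocabulary would first be pruned by the Naive Bayes score and Information Gain steps mentioned above.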
2.5. Web Spam Corpus using Email Spam to Identify Web Spam
Automatically
In this work the authors researched how to detect web spam with the help of email spam detection techniques: by following the URLs found in email spam messages, they try to identify whether the linked web pages are spam. The Webb Spam Corpus is a very large sample of web spam (over two orders of magnitude larger than previously cited web spam data sets), and their automated collection technique allows researchers to quickly and easily obtain even more examples. The main challenge with any automated web spam classification technique is accurate labelling (as shown by the limited web spam sample sizes of previous research), and although this approach does not completely eliminate the problem, it does minimize the manual effort required: researchers simply need to identify a few false positives, as opposed to the arduous task of manually searching for a sufficiently large collection of web spam pages. This work could also be used to provide more effective parental controls on the Web. The Webb Spam Corpus contains a number of porn-related pages, as well as additional content that is not suitable for children; this content provides valuable insight into the characteristics of web spam pages and allows researchers to build more effective web content filters. In addition to its contributions to web filtering, the Webb Spam Corpus also provides a unique approach to email spam filtering.
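The seeding step of this approach — harvesting candidate web spam URLs from email spam — can be sketched in a few lines. The regex and the sample message below are illustrative assumptions, not part of the Webb Spam Corpus tooling.

```python
# Harvest URLs from an email spam message; each URL points to a page
# that becomes a candidate entry in a web spam corpus.
import re

spam_message = """Limited offer!!! Visit http://cheap-pills.example.com/buy
or http://win-big.example.net/prize?id=123 today."""

URL_PATTERN = re.compile(r"https?://[^\s]+")

candidate_web_spam = URL_PATTERN.findall(spam_message)
print(candidate_web_spam)
# each URL can then be fetched and its page added to the corpus
```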
3.1. INTRODUCTION
3.2.1 ADVANTAGES
Figure: 3.3.1
3.4 DATASET DESCRIPTION:
The attributes of the dataset's columns are defined as follows:
● 48 continuous real [0,100] attributes of type word_freq_WORD: the percentage of words in the e-mail that match WORD, i.e. 100 * (number of times WORD appears in the e-mail) / (total number of words in the e-mail). A "word" here is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
● 6 continuous real [0,100] attributes of type char_freq_CHAR: the percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / (total characters in the e-mail).
● 1 continuous real [1,...] attribute of type capital_run_length_average: the average length of uninterrupted sequences of capital letters.
● 1 continuous integer [1,...] attribute of type capital_run_length_longest: the length of the longest uninterrupted sequence of capital letters.
● 1 continuous integer [1,...] attribute of type capital_run_length_total: the sum of the lengths of uninterrupted sequences of capital letters, i.e. the total number of capital letters in the e-mail.
● 1 nominal {0,1} class attribute of type spam: denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
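The attribute definitions above can be computed directly from a message's text. The sketch below follows those definitions for one illustrative e-mail; the sample string is an assumption for demonstration only.

```python
# Compute Spambase-style attributes for one message, following the
# definitions above (word_freq, char_freq, capital_run_length_*).
import re

email = "FREE MONEY!!! Claim your FREE prize now"

# "word" = run of alphanumeric characters
words = re.findall(r"[A-Za-z0-9]+", email)
word_freq_free = 100 * sum(w.lower() == "free" for w in words) / len(words)

# char_freq for '!' = 100 * occurrences / total characters
char_freq_excl = 100 * email.count("!") / len(email)

# uninterrupted runs of capital letters
runs = [len(r) for r in re.findall(r"[A-Z]+", email)]
capital_run_length_average = sum(runs) / len(runs)
capital_run_length_longest = max(runs)
capital_run_length_total = sum(runs)

print(word_freq_free, capital_run_length_average,
      capital_run_length_longest, capital_run_length_total)
```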
Creators - Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt, Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
Donor - George Forman (gforman at nospam hpl.hp.com) 650-857-7835
Generated - June-July 1999
Class Distribution: 1813 spam e-mails (39.4%) and 2788 non-spam e-mails (60.6%)
The SPAM database can be found by searching for ‘spam base DOCUMENTATION’ at the
UCI Machine Learning Repository http://www.ics.uci.edu/~mlearn/MLRepository.html
Figure: 3.4.2
5. DESIGN
UML is a general-purpose visual modeling language used to visualize, specify, construct, and document any software system. UML is commonly used to model software systems, but it has no such restriction: it is also used to model non-software modules, such as the process flow in a manufacturing unit.
Class diagrams can be used both in the early phases of a project and during design activities. A class diagram consists of classes, associations, and generalizations, and can exist at various levels of detail. It defines the static structure of a system, which is divided into parts called classes, together with the relations between those classes and the methods belonging to each class.
The class diagram is considered the fundamental building block of object modeling. It is used for general conceptual modeling of the application and, later, for more detailed modeling that translates the models into program code. These diagrams can also be used for data modeling. The classes in such a diagram represent the main objects and interactions in the application, and the classes to be programmed.
A class is drawn as a box with three areas:
● The top area gives the name of the class.
● The middle area lists the attributes of the class.
● The bottom area lists the methods performed by the class.
Figure: 5.1.1
5.1.2 Use case diagram
Use case diagrams are also called behavioral diagrams; they represent a set of actions (use cases) that some modules (the subject) should perform in interaction with one or more external users of the system (actors). Each use case should provide the right outcome to the actors or other participants of the system. These diagrams are produced at an early stage of a project's development and represent how the final system is to be used. Use cases are a good way to describe the functional requirements of a software system; they are easy to understand, so they can be used in discussions with non-programmers. The participants in a UML use case diagram are use cases, one or more actors, and the associations and generalizations between them, as shown in the following diagram.
Figure: 5.1.2
5.1.3 Sequence Diagram:
Figure: 5.1.3
An activity diagram shows the flow through a program from a defined start point to an end point. Activity diagrams describe the workflow and behavior of a system. These diagrams are similar to state diagrams, since activities are the state of doing something; they describe the state of the activities by representing the sequence of activities performed, and can also show activities that are conditional or parallel. The essential elements of activity diagrams are activities, branches (conditions or decisions), transitions, forks, and joins.
Figure: 5.1.4
Data flow diagrams explain how information is processed by a system in terms of inputs and outputs. Data flow diagrams can be used to give a clear representation of any function, and a difficult process can be easily automated with the help of DFDs using free, easy-to-use tools. A DFD is a diagrammatic model for building and analyzing an information process: it explains the flow of information in a process based on its inputs and outputs, and so a DFD is also referred to as a process model. A DFD describes a technical process with the help of the data stored, the data flowing from one process to another, and the final result.
Figure: 5.1.5
DESCRIPTION OF FLOWCHART:
1. Once the E-Mail is received, check whether the sender is in the black list.
2. If yes, reject the mail and stop the process.
3. If the mail is not in the black list, verify whether the E-Mail is forged.
4. If the mail is forged, reject it; otherwise check whether the sender is in the white list.
5. If the mail is in the white list, deliver the mail and stop the process; if it is not, add the sender to the suspicious list and deliver the mail.
6. Finally, the E-Mail is placed in the user's inbox or spam folder.
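The flowchart steps above can be sketched directly in code. The list contents and the forgery check are placeholders: a real implementation would verify message headers rather than read a `forged` flag.

```python
# Sketch of the black-list / white-list filtering flow described above.
BLACK_LIST = {"spammer@bad.example"}
WHITE_LIST = {"friend@good.example"}
suspicious_list = set()

def is_forged(mail):
    # stand-in for real header/signature verification
    return mail.get("forged", False)

def filter_mail(mail):
    sender = mail["from"]
    if sender in BLACK_LIST:
        return "rejected"            # step 2
    if is_forged(mail):
        return "rejected"            # step 4
    if sender not in WHITE_LIST:
        suspicious_list.add(sender)  # step 5: deliver but flag
    return "delivered"               # steps 5-6

print(filter_mail({"from": "spammer@bad.example"}))
print(filter_mail({"from": "stranger@x.example"}))
```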
CHAPTER 6
IMPLEMENTATION
6. IMPLEMENTATION
6.1 Introduction
This stage is less creative compared with system design. It considers user training on the system and any required record changes; the system may require extensive user training, and the initial parameters of the system may have to be changed as a result of programming. A simple methodology is provided so that the user can understand the different functions clearly and quickly, and the proposed system is easy to execute. In general, implementation is the process of converting a new or revised system design into an operational one.
Python is used for:
● Software development,
● Mathematics,
● System scripting.
● Python can connect to database systems; it can also read and modify files.
● Python can be used to handle big data and to perform complex mathematics.
The latest major version of Python is Python 3, which we use in this work. However, Python 2, despite receiving nothing but security updates, is still quite popular.
6.3 ALGORITHM:
HYST is a classifier model that uses multiple sets of data to classify any given instance by majority vote. The dataset consists of 4600 records in total, 58 features, and 2 classes, i.e. spam and not spam. Each row of the table represents a separate record, and the columns represent features such as FREE and URGENT. Bagging is used to generate K sample sets, and attribute bagging is then applied; after obtaining the resulting set, the confusion matrix for HYST is prepared.
Table: 6.3.1
Step-1: We apply K iterations of bagging to create K trees in total.
Bagging (bootstrap aggregating): for a standard training set D of size n, bagging generates m new training sets Di, each of size n′, by sampling from D uniformly and with replacement. Because sampling is with replacement, some observations may be repeated in each Di.
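The bootstrap sampling just defined can be sketched in a few lines; the toy set of 10 records and the choice m = 3 are illustrative.

```python
# Bagging as defined above: m bootstrap samples of size n, drawn from D
# uniformly with replacement (so records may repeat within a sample).
import random

random.seed(0)
D = list(range(10))   # a toy training set of n = 10 records
m, n = 3, len(D)

bootstrap_samples = [[random.choice(D) for _ in range(n)] for _ in range(m)]
for Di in bootstrap_samples:
    print(sorted(Di))  # duplicates show sampling with replacement
```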
Table 2: A dataset which consists of 1530 E-Mails, 58 features and 2 classes selected from
table 1.
Table: 6.3.2
Step-2: For each of the K sample training sets, we apply attribute bagging and learn a decision tree; the variable chosen at any new node is the best variable (i.e., the one with the least misclassification error) within the extracted random subspace.
Table 3: A dataset which consists of 1530 E-Mails, 8 features and 2 classes selected from
table 2.
Table: 6.3.3
Here the number of records stays the same, 1530; we only take a random sample of the features, with replacement. This gives our first sample set, on which we then apply decision tree induction using information gain or gain ratio to create the tree.
                Predicted SPAM    Predicted HAM
Actual SPAM          676               15
Actual HAM            56              404

Misclassification rate = (FP + FN) / Total = (56 + 15) / 1151 ≈ 0.06
Attribute bagging: let each training object X_i (i = 1, ..., n) in the training sample set X = (X_1, X_2, ..., X_n) be a p-dimensional vector X_i = (x_i1, x_i2, ..., x_ip) described by p features (components). In random subspace sampling, one randomly selects r < p features from the p-dimensional data set X, obtaining an r-dimensional random subspace of the original p-dimensional feature space. The modified training set X^b = (X^b_1, X^b_2, ..., X^b_n) therefore consists of r-dimensional training objects X^b_i = (x^b_i1, x^b_i2, ..., x^b_ir) (i = 1, 2, ..., n), where each object keeps only the r selected components.
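Random subspace sampling as formalised above can be sketched as follows: select r < p features at random and train a decision tree on the reduced data. The synthetic data and the values n = 100, p = 12, r = 4 are illustrative assumptions.

```python
# One random-subspace step: pick r of the p features, then learn a
# decision tree on the r-dimensional projection of the training set.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, p, r = 100, 12, 4                      # records, features, subspace size
X = rng.random((n, p))
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # synthetic labels

feature_idx = rng.choice(p, size=r, replace=False)  # the r-feature subspace
tree = DecisionTreeClassifier().fit(X[:, feature_idx], y)
print(feature_idx, tree.score(X[:, feature_idx], y))
```

Repeating this for each of the K bootstrap samples and taking a majority vote over the resulting trees yields the ensemble described for HYST.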
6.4. CODING
# HYST combines bagging with attribute bagging (random subspaces);
# RandomForestClassifier is used here as a stand-in, since it
# implements exactly this bagging + random-subspace scheme
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

dataset = pd.read_csv('spambase.data', header=None)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 57].values

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

classifier = RandomForestClassifier(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
RESULT:
Accuracy: 93%
Misclassification rate: 6%
Precision: 96%
An Artificial Neural Network is a computing system inspired by biological neural networks. A neural network is a framework for many machine-learning-based algorithms and can readily process complex information.
Figure: 6.4.1
Step 1: Calculate the hidden layer values by multiplying the input values with the weights:
H1 = X1 * W1 + X2 * W2
Step 2: Apply the sigmoid activation function:
Out_H1 = 1 / (1 + e^(-H1))
Step 3: Similarly calculate the values for the hidden layers for all 'n' inputs.
Step 6: Compare the predicted result with the actual result and measure the generated error.
Step 7: Update the weights according to the generated error and apply back-propagation until the desired result is obtained.
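The forward-pass steps above (weighted sum, then sigmoid activation) can be written out in a few lines; the input and weight values below are illustrative.

```python
# Forward pass for one hidden neuron: weighted sum then sigmoid.
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

X1, X2 = 0.5, 0.8        # inputs
W1, W2 = 0.4, 0.3        # weights into hidden neuron H1

H1 = X1 * W1 + X2 * W2   # Step 1: weighted sum
out_H1 = sigmoid(H1)     # Step 2: activation
print(round(out_H1, 4))
```

During training, the error measured in Step 6 is propagated backwards to adjust W1 and W2 (Step 7).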
Code:
# Artificial Neural Network (requires Tensorflow/Theano and Keras installed)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from keras.models import Sequential
from keras.layers import Dense

# Spambase features are all numeric, so no label/one-hot encoding is needed
dataset = pd.read_csv('spambase.data', header=None)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 57].values

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Network architecture (the layer sizes below are an assumed configuration)
classifier = Sequential()
classifier.add(Dense(units=30, activation='relu', input_dim=57))
classifier.add(Dense(units=1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
classifier.fit(X_train, y_train, batch_size=10, epochs=100)

y_pred = (classifier.predict(X_test) > 0.5)
cm = confusion_matrix(y_test, y_pred)
RESULT:
Accuracy: 91%
Misclassification rate: 46
Precision: 89%
• The algorithm is trained using the training dataset, and it generates the classifier model for us.
• Naïve Bayes is based on the model of conditional probability: it represents the probability of a certain event occurring given that some other event has already taken place.
• In our case the Spambase dataset is a continuous dataset, so we choose a Gaussian Naïve Bayes algorithm.
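Gaussian Naïve Bayes, as chosen above for continuous features, fits a normal distribution to each feature per class and classifies by the resulting conditional probabilities. A minimal sketch with invented two-feature data (e.g. word frequencies):

```python
# Gaussian Naive Bayes on continuous features; the tiny arrays below
# are illustrative stand-ins for Spambase word-frequency columns.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[5.1, 0.9], [4.8, 1.1], [0.2, 0.1], [0.3, 0.2]])
y = np.array([1, 1, 0, 0])            # 1 = spam, 0 = ham

clf = GaussianNB().fit(X, y)
print(clf.predict([[4.5, 1.0]]))      # high word frequencies: spam-like
print(clf.predict([[0.25, 0.1]]))     # low word frequencies: ham-like
```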
Code:
# Naive Bayes
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

dataset = pd.read_csv('spambase.data', header=None)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 57].values

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
RESULT:
Accuracy: 79%
Misclassification rate: 20%
Precision: 67%
6.5. SCREENSHOTS
HYST:
Variable explorer of HYST:
Training Set of X:
Testing set of X:
Confusion Matrix for HYST:
Figure: 6.5.1
Naïve Bayesian:
Code:
Testing set of X:
Confusion Matrix for Naive Bayesian:
Figure: 6.5.2
Neural Network:
Coding:
Variable Explorer:
Training set of X:
Testing Set of X:
Confusion Matrix for Neural Network:
Figure: 6.5.3
CHAPTER 7
RESULTS
7.1 RESULTS:
Metric                   Naïve Bayesian   Neural Network   HYST
Accuracy                       79               91           93
Misclassification Rate         20               46            6
Precision                      67               89           96
Table: 7.1.1
The results are described in the form of a confusion matrix. A confusion matrix is a technique for summarizing the performance of a classification algorithm. The table is explained in detail below:
Misclassification rate = (FP + FN) / Total

True Positive Rate: when the actual class is yes, how often does the model predict yes?
TP Rate = TP / Actual Yes

False Positive Rate: when the actual class is no, how often does the model predict yes?
FP Rate = FP / Actual No

True Negative Rate: when the actual class is no, how often does the model predict no?
TN Rate = TN / Actual No

Precision: the number of correctly classified positive examples divided by the total number of predicted positive examples:
Precision = TP / (TP + FP)
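Applying these formulas to the HYST confusion matrix from Section 6.3 gives values close to the reported results. The orientation assumed below (rows = actual SPAM/HAM, columns = predicted SPAM/HAM) is an assumption about how that table was laid out.

```python
# Metrics computed from the HYST confusion matrix (676, 15 / 56, 404).
TP, FN = 676, 15    # actual spam row
FP, TN = 56, 404    # actual ham row

total = TP + FN + FP + TN
accuracy = (TP + TN) / total
misclassification = (FP + FN) / total
precision = TP / (TP + FP)

print(f"accuracy={accuracy:.3f}, "
      f"misclassification={misclassification:.3f}, "
      f"precision={precision:.3f}")
```

Accuracy (~0.94) and misclassification rate (~0.06) match the reported 93% and 6%.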
The accuracy of HYST is greater than that of the Naïve Bayesian and Neural Network classifiers.
The false positive rate is lower for the Neural Network and Naïve Bayesian classifiers.
Figure: 7.2.1
Figure: 7.2.2
Figure: 7.2.3
Figure: 7.2.4
Figure: 7.2.5
CHAPTER 8
CONCLUSION
8. CONCLUSION
To solve the problems in existing e-mail spam filtering techniques, the proposed work has introduced a new technique that uses the HYST algorithm to classify emails as spam or not in the most efficient way. The precision rate has been gradually increased by the proposed algorithm, and HYST performed very well, with an improvement of 5%. Future research will be concerned with attribute selection (i.e., feature selection) to improve the accuracy rate, because the electronic mail dataset contains a huge number of irrelevant attributes.
REFERENCES
[1] T. Subramaniam, H. A. Jalab, and A. Y. Taqa, "Overview of textual anti-spam filtering techniques," Int. J. Phys. Sci., vol. 5, pp. 1869-1882, 2010.
[2] E.-S. M. El-Alfy, "Learning Methods for Spam Filtering," International Journal of Computer Research, vol. 16, no. 4, 2008.
[3] Karl-Michael Schneider, "A Comparison of Event Models for Naïve Bayes Anti-Spam E-Mail Filtering," in Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 307-314, April 2003.
[6] Y. Yang and S. Elfayoumy, "Anti-spam filtering using neural networks and Bayesian classifiers," in Proceedings of the 2007 IEEE International Symposium on Computational Intelligence in Robotics and Automation, Jacksonville, FL, USA, June 2007.
[8] E.-S. M. El-Alfy and F. S. Al-Qunaieer, "A fuzzy similarity approach for automated spam filtering," in Proceedings of the IEEE International Conference on Computer Systems and Applications (AICCSA'08), Qatar, April 2008.
[9] K. Jain, "A Hybrid Approach for Spam Filtering using Support Vector Machine and Artificial Immune System," pp. 5-9, 2014.
[10] Le Zhang, Jingbo Zhu, and Tianshun Yao, "An Evaluation of Statistical Spam Filtering Techniques," ACM Transactions on Asian Language Information Processing, vol. 3, no. 4, pp. 243-269, December 2004.
[11] L. Zhuang, J. Dunagan, D. R. Simon, H. J. Wang, and J. D. Tygar, "Characterizing Botnets from Email Spam Records," in LEET'08: Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats, Article No. 2, 2008.
[12] Enrico Blanzieri and Anton Bryl, "A Survey of Learning-Based Techniques of Email Spam Filtering," Technical Report DIT-06-056, 2008.
[13] Steve Webb, James Caverlee, and Calton Pu, "Introducing the Webb Spam Corpus: Using Email Spam to Identify Web Spam Automatically," CEAS, 2006.
[14] D. Sculley and Gabriel M. Wachman, "Relaxed Online SVMs for Spam Filtering," in SIGIR 2007 Proceedings, 2007.
[15] B. Klimt and Y. Yang, "The Enron Corpus: A New Dataset for Email Classification Research," ECML, pp. 217-226, 2004.
[16] Jitesh Shetty and Jafar Adibi, "Discovering Important Nodes through Graph Entropy: The Case of Enron Email Database," KDD 2005, Chicago, Illinois, 2005.
[17] Shinjae Yoo, Yiming Yang, Frank Lin, and Il-Chul Moon, "Mining Social Networks for Personalized Email Prioritization," KDD'09, June 28-July 1, Paris, France, 2009.