Hybrid Machine Learning Based E-Mail Spam Filtering Technique

The document discusses various machine learning techniques that have been used for email spam filtering. It provides a literature review of 5 different techniques: 1) a backpropagation neural network algorithm, 2) an adaptive privacy policy prediction method, 3) a technique to extract an efficient feature set from email data, 4) a survey of various learning-based spam filtering techniques, and 5) a method to use URLs from email spam to identify web spam pages. The goal of the literature review is to analyze existing approaches to help identify the best classification algorithms for spam filtering based on factors like accuracy, computation time, and error rates.


CHAPTER 1

INTRODUCTION



1.1. INTRODUCTION
Email has become a powerful tool for information exchange. Spam has grown markedly in recent years as the importance and applications of the web and e-mail have grown. Unsolicited mail can originate from any location on the planet where internet access is readily available. The number of spam messages continues to increase despite the development of anti-spam services and technologies. To counter this growing problem, organizations must analyse the available tools and determine the best possible techniques against spam. Various measures, such as contracted anti-spam services, e-mail filtering gateways, corporate e-mail systems and end-user training, can benefit any organization. However, countering huge volumes of spam on a daily basis remains an unavoidable burden for users. Without anti-spam measures, spam would continue to deluge network systems, hamper employee productivity and consume bandwidth.

Many experiments have been conducted on spam mails to develop algorithms capable of identifying them. Email filtering is generally based on content, which includes images, attachments, IP addresses, or headers that carry data about the recipient. As the volume of spam keeps increasing, [2] formulated the problem of stopping such malicious attacks. Many individuals around the globe who respond to this type of attack risk their financial or personal information, and to counter this the author describes several techniques. Many machine-learning-based methods have been applied to e-mail spam filtering, such as SVM and Artificial Immune System [9], anti-spam email filtering [3], a comparison of Naïve Bayes and a memory-based approach [4], Naïve Bayes and rule learning [5], neural networks and Bayesian classifiers [6], Bayesian filtering of junk email [7], and fuzzy similarity [8]. It is interesting to see whether the identified techniques have any impact on spam emails and how effectively they can stop spam messages before they enter the recipient's inbox [10]. Research has been conducted on existing methods for email spam detection, but their accuracy was rather low; hence performance in e-mail spam detection needs to be improved. In this paper the proposed HYST combines the outcome probabilities obtained from different classifiers and computes the most likely classification of e-mail content as ham or spam.
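The combining step described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's exact HYST formulation: each base classifier reports a probability that a message is spam, and the combined score here is simply their average against a threshold.

```python
# Hypothetical sketch of the hybrid idea: average the spam probabilities
# reported by several classifiers, then threshold the combined score.
# The function names and the equal weighting are assumptions.

def hybrid_spam_score(probabilities):
    """Average the spam probabilities reported by several classifiers."""
    if not probabilities:
        raise ValueError("need at least one classifier output")
    return sum(probabilities) / len(probabilities)

def classify(probabilities, threshold=0.5):
    """Label a mail 'spam' when the combined score crosses the threshold."""
    return "spam" if hybrid_spam_score(probabilities) >= threshold else "ham"

print(classify([0.9, 0.8, 0.4]))  # two confident classifiers outvote one -> spam
```

Averaging is only one choice; a weighted vote or a learned meta-classifier would fit the same interface.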



CHAPTER 2
LITERATURE SURVEY



2. LITERATURE SURVEY

There are many algorithms for classifying spam and non-spam emails. To identify the best classification algorithm with respect to computational time, accuracy, misclassification rate and precision, feature selection on the Spambase dataset plays a major role, followed by the selection of the algorithm itself. Some of the algorithms on which email spam filtering is based are described below.

2.1. Email Spam Filtering Using a BPNN Classification Algorithm


In this paper the author describes a back-propagation neural network (BPNN) based email spam filtering algorithm consisting of an input layer, a hidden layer and an output layer. The network calculates the error rate and identifies whether a mail is spam or not, and is used as the training algorithm. To increase efficiency, the author applies a k-means clustering algorithm to the vector set in the pre-processing stage. Detecting spam emails with a neural network requires both a training phase and a testing phase. The proposed model consists of three modules: the primary stage is pre-processing, which includes the clustering algorithm; the secondary stage is neural network training; and the final module is the identification of spam and ham emails using a feed-forward artificial neural network. In this phase, 11 features are encoded as binary values (0 or 1), where 1 indicates that the feature appears in the tested email and 0 that it does not. The author experimented on a data set of 100 spam and 100 non-spam emails. After feature extraction and k-means clustering in the pre-processing stage, the dataset is trained with the BPNN algorithm to obtain a well-trained neural network, which is then used for analysis on the spam data set.
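The binary feature step above can be sketched as follows. The 11-item feature list here is a made-up example for illustration (the paper does not list its features); each feature becomes 1 if it appears in the tested email and 0 otherwise.

```python
# Illustrative sketch of the binary (0/1) feature encoding described above.
# SPAM_FEATURES is an invented example list, not the paper's actual features.

SPAM_FEATURES = ["free", "winner", "urgent", "click", "offer",
                 "money", "prize", "viagra", "credit", "cheap", "act now"]

def binary_feature_vector(email_text, features=SPAM_FEATURES):
    """Map an email to a 0/1 vector: 1 when the feature occurs in the text."""
    text = email_text.lower()
    return [1 if f in text else 0 for f in features]

vec = binary_feature_vector("You are a WINNER! Click for your free prize")
print(vec)
```

A vector like this would then be fed to the BPNN for the forward and backward passes.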

2.2. Adaptive Privacy Policy Prediction for Email Spam Filtering


The author proposes an email abstraction scheme providing adaptive privacy policy prediction, which helps users automate the privacy policy settings for their emails. The author investigates a more robust email abstraction scheme that uses the email layout structure to represent emails. The email abstraction procedure consists of tag extraction, tag reordering and, finally, an appending process. The tag extraction phase extracts each HTML tag, converts all paragraph tags to <my text> and generates a tentative email abstraction. The tag reordering phase assigns a new position to every tag, and the appending phase adds the anchor-set tags to generate the complete email abstraction. The adaptive privacy framework infers privacy preferences from the information available for a given email. The main objective of the appending phase is to reduce the probability that ham is successfully matched with reported spam when the tag length of an email abstraction is short.

2.3. Efficient Feature Set for Spam Email Filtering

In this work the author extracted different categories of features from the Enron Spam dataset to find the best feature set for spam email filtering. Four categories of features were used: Bag-of-Words (BoW), Bigram Bag-of-Words, PoS Tag and Bigram PoS Tag features. Bag-of-Words and Bigram Bag-of-Words alone are not sufficient to build an efficient spam email filtering model, owing to the absence of features highly correlated with the target class. AdaBoost with J48, Random Forest and the popular linear Support Vector Machine (SVM) trained with Sequential Minimal Optimization (SMO) are used as classifiers for model generation. Rare features are eliminated using a Naive Bayes score, and features are selected by Information Gain value. A feature occurrence matrix is constructed and weighted with Term Frequency-Inverse Document Frequency (TF-IDF) values, and Singular Value Decomposition is employed as a matrix factorization technique. The experiments were carried out on individual feature models as well as ensemble models. The best individual feature models were obtained from the PoS Tag and Bigram PoS Tag feature categories, and the best results came from ensembling the best individual feature categories.
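Of the pipeline above, the TF-IDF weighting step can be shown in miniature. This sketch covers only TF-IDF (the Naive Bayes scoring, Information Gain selection and SVD steps are omitted), and the whitespace tokenization is a simplification.

```python
# Simplified sketch of the TF-IDF weighting of a feature occurrence matrix:
# each term is weighted by term frequency times inverse document frequency.
import math

def tf_idf(docs):
    """Return (vocabulary, matrix) where matrix[d][t] is the TF-IDF weight."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = {w: sum(1 for d in docs if w in d.split()) for w in vocab}
    matrix = []
    for d in docs:
        words = d.split()
        row = [(words.count(w) / len(words)) * math.log(n / df[w]) for w in vocab]
        matrix.append(row)
    return vocab, matrix

docs = ["free money now", "meeting notes attached", "free offer now"]
vocab, m = tf_idf(docs)
```

Terms that occur in every document get weight zero, while terms concentrated in a few documents are boosted, which is why rare-but-discriminative spam words stand out.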

2.4. A Survey of Learning-Based Techniques of Email Spam Filtering


The author gives an overview of different spam filtering techniques and also discusses a few anti-spam protection approaches. Some of these methods focus mainly on email content, while others also consider parameters such as length, attachments, URLs, To, From, IP address, etc. Feature extraction methods are also used for image-based and content-based filtering. One proposed way of stopping spam is to enhance, or even replace, the existing standards of email transmission with new, spam-proof variants. The main drawback of the commonly used Simple Mail Transfer Protocol (SMTP) is that it provides no reliable mechanism for checking the identity of the message source. Overcoming this disadvantage, namely providing better means of sender identification, is the goal of the Sender Policy Framework (SPF, formerly interpreted as Sender Permitted From). It works as follows: the owner of a domain publishes the list of authorized outbound mail servers, allowing recipients to check whether a message that claims to come from that domain really originates there.

2.5. Web Spam Corpus: Using Email Spam to Identify Web Spam Automatically
The authors investigate how to detect web spam with the help of email spam detection techniques. By following the URLs found in email spam messages, they identify whether the linked web pages are spam. The resulting Webb Spam Corpus is a very large sample of Web spam (over two orders of magnitude larger than previously cited Web spam data sets), and the automated collection technique allows researchers to quickly and easily obtain even more examples. The main challenge for any automated Web spam classification technique is accurate labelling (as shown by the limited Web spam sample sizes in previous research); although the approach does not completely eliminate this problem, it does minimize the manual effort required: researchers need only identify a few false positives instead of manually searching for a sufficiently large collection of Web spam pages. The work could also be used to provide more effective parental controls on the Web. The Webb Spam Corpus contains a number of porn-related pages and other content not suitable for children; this content gives valuable insight into the characteristics of Web spam pages and allows researchers to build more effective Web content filters. In addition to its contributions to Web filtering, the Webb Spam Corpus also provides a unique approach to email spam filtering.
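The corpus-building idea above, harvesting the URLs inside spam messages so the pages they point to can be fetched and labelled, can be sketched with a regular expression. The pattern is a deliberate simplification of real URL syntax.

```python
# Sketch of collecting candidate Web-spam URLs from spam email bodies.
# The regex is a simplification: it grabs http(s) URLs up to whitespace
# or angle brackets, which is enough for illustration.
import re

URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")

def extract_urls(email_body):
    """Collect candidate Web-spam URLs from one spam message."""
    return URL_PATTERN.findall(email_body)

spam = "Cheap pills at http://spam.example.com/buy now! See https://scam.example.org too"
print(extract_urls(spam))
```

Each extracted URL would then be crawled and the fetched page added to the Web spam candidate set.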



CHAPTER 3
PROPOSED SYSTEM

3.1. INTRODUCTION



The Software Development Life Cycle (SDLC), in systems engineering, information systems, and software engineering, is the process of creating or altering systems, together with the models and methodologies that people use to develop them. In software engineering, the SDLC concept underpins many kinds of software development processes, which form the framework for planning and controlling the creation of an information system during software development.

3.1.1 EXISTING SYSTEM:


E-mail spamming is one of the major issues of the current era. E-mail spam is an advertisement for some company or product, or some kind of virus, that arrives in a client's mailbox without any notification. Different spam filtering techniques are used to solve this problem, and many algorithms have been applied to e-mail spamming.

Following are the existing methods applied:

1. BPNN (Back Propagation Neural Network):


The BPNN filtering algorithm, i.e. a feed-forward Artificial Neural Network trained with Back Propagation, classifies spam emails from genuine ones based on text classification. Spam and phishing emails are detected by running the BPNN algorithm over the dataset with a feed-forward neural network. The process includes forward and backward passes, after which the mail is recognized as spam or not based on the output. The training phase and the testing phase must be carried out on all instances of the dataset.

2. NBA (Naïve Bayesian Algorithm):


Naïve Bayes is based on the model of conditional probability: it computes the probability of a certain event occurring given that some other event has already taken place. In this algorithm, each word in the email is fed to the Naïve Bayes model, and the percentage of spam words in the given mail is calculated based on historical data. Training and testing are performed on the dataset.
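The conditional-probability idea above can be made concrete with a small worked example. The counts are invented for illustration, and equal spam/ham priors are assumed.

```python
# Illustrative sketch of Bayes' rule applied to a single word:
# estimate P(spam | word) from historical counts, assuming equal priors.
# The counts below are invented example data.

def p_spam_given_word(word_in_spam, total_spam, word_in_ham, total_ham):
    """Bayes estimate of P(spam | word) assuming equal class priors."""
    p_w_spam = word_in_spam / total_spam   # P(word | spam)
    p_w_ham = word_in_ham / total_ham      # P(word | ham)
    return p_w_spam / (p_w_spam + p_w_ham)

# Suppose 'free' appeared in 60 of 100 historical spam mails
# and in 5 of 100 historical ham mails:
print(p_spam_given_word(60, 100, 5, 100))  # about 0.923
```

A full Naïve Bayes filter multiplies such per-word evidence across all words of the mail (usually in log space) before comparing against a threshold.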
3.1.2. LIMITATIONS / DISADVANTAGES:



● Computationally costly and complicated as the data size increases.
● Low accuracy and precision.
● Reduced user productivity.
● Security issues.

3.2 PROPOSED SYSTEM:


To solve the problems of previous studies, this project uses HYST to classify spam and non-spam mails. It is one of the most popular and simplest methods for classification, and training on a large data sample is easier with HYST than with other classifiers.

3.2.1 ADVANTAGES

● Higher accuracy and precision

● Lesser misclassification rate.

● Easier way to implement the algorithm on the dataset.

3.3 SYSTEM MODEL

Figure: 3.3.1
3.4 DATASET DESCRIPTION:



The Spambase email database consists of 4601 instances with 58 attributes: 57 continuous attributes and 1 nominal class label. The last column of 'spambase.data' denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character occurred frequently in the e-mail, while the run-length attributes (55-57) measure the lengths of sequences of consecutive capital letters.

The attributes of each column are defined as follows:

● 48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times WORD appears in the e-mail) / total number of words in the e-mail. A "word" here is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
● 6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in the e-mail.
● 1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters.
● 1 continuous integer [1,...] attribute of type capital_run_length_longest = length of the longest uninterrupted sequence of capital letters.
● 1 continuous integer [1,...] attribute of type capital_run_length_total = sum of the lengths of uninterrupted sequences of capital letters = total number of capital letters in the e-mail.
● 1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
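A sketch of how such spambase-style attributes can be computed from a raw email follows. It shows one word_freq and one char_freq attribute plus the three capital-run-length attributes; the tokenization matches the "word" definition above.

```python
# Sketch of computing spambase-style attributes from one email's text.
import re

def spambase_features(text, word="free", char="!"):
    """Compute word_freq, char_freq and capital-run attributes for one mail."""
    # A "word" is a run of alphanumerics bounded by non-alphanumerics.
    words = re.findall(r"[A-Za-z0-9]+", text)
    word_freq = 100 * sum(1 for w in words if w.lower() == word) / len(words)
    char_freq = 100 * text.count(char) / len(text)
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]  # capital-letter runs
    return {
        "word_freq_" + word.upper(): word_freq,
        "char_freq_" + char: char_freq,
        "capital_run_length_average": sum(runs) / len(runs) if runs else 0,
        "capital_run_length_longest": max(runs, default=0),
        "capital_run_length_total": sum(runs),
    }

f = spambase_features("FREE money!! Totally FREE offer")
print(f)
```

For this example, "free" is 2 of 5 words (word_freq 40.0) and the capital runs are FREE, T, FREE (lengths 4, 1, 4).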

Creators - Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt, Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
Donor - George Forman (gforman at nospam hpl.hp.com) 650-857-7835
Generated - June-July 1999
Class Distribution: 1813 spam (39.4%) and 2788 non-spam (60.6%)

The Spambase database can be found by searching for 'spambase DOCUMENTATION' at the UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html



Figure: 3.4.1

Figure: 3.4.2



An electronic mail server is a system with a mail transfer agent (MTA) that performs these functions. Mail is transferred between email servers running dedicated software built on standardized protocols for managing mails and their varied content. An MTA generally accepts mail from another mail transfer agent or from a mail submission agent (MSA), with the transmission information evaluated via the Simple Mail Transfer Protocol. When the MTA receives an e-mail whose user is not hosted locally, the e-mail is transferred to another mail transfer agent. Each time this happens, the mail transfer agent adds a "Received" trace header at the top of the message's headers, which shows all the mail transfer agents that handled the mail before it reached the user's inbox. These emails are then directed to an Intelligent Spam Detection (ISD) system, software used to identify malicious email and stop such incoming mails from entering the recipient's inbox.

3.5. Training Module:


In the training module, we take 3450 mails to train on and extract the features using the spam filtering technique and an attribute bagging process. Any missing values are calculated and replaced by the computed values. All the mails are segregated into X and Y sets and trained using the HYST algorithm.

3.6. Testing Module:


In the testing module, we take 1151 mails for testing and extract the features using the spam filtering technique and the attribute bagging process. Any missing values are calculated and replaced by the computed values. The filtering technique is applied and one set is taken as the resultant output, for which the Information Gain is calculated. From the results we calculate the misclassification rate and precision and build a confusion matrix.
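The train/test bookkeeping above can be sketched end to end. The tiny dataset and the stand-in keyword "classifier" are invented for illustration and are not the project's HYST model; the point is the split, the prediction pass and the confusion-matrix counts.

```python
# Runnable sketch of the training/testing modules' bookkeeping with a
# deliberately simple stand-in classifier (words seen only in spam).

def train(mails):
    """'Training': collect words that only ever appear in spam mails."""
    spam_words = {w for text, label in mails if label == "spam" for w in text.split()}
    ham_words = {w for text, label in mails if label == "ham" for w in text.split()}
    return spam_words - ham_words

def predict(spam_words, text):
    return "spam" if any(w in spam_words for w in text.split()) else "ham"

def confusion_matrix(spam_words, mails):
    """Count (true label, predicted label) pairs over a test set."""
    counts = {("spam", "spam"): 0, ("spam", "ham"): 0,
              ("ham", "spam"): 0, ("ham", "ham"): 0}
    for text, label in mails:
        counts[(label, predict(spam_words, text))] += 1
    return counts

train_set = [("free prize now", "spam"), ("project meeting today", "ham")]
test_set = [("claim your prize", "spam"), ("meeting minutes", "ham")]
model = train(train_set)
cm = confusion_matrix(model, test_set)
print(cm)
```

The misclassification rate and precision of the later chapters fall straight out of these four counts.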



CHAPTER 4

SOFTWARE REQUIREMENT SPECIFICATIONS

4. SOFTWARE REQUIREMENT SPECIFICATION



4.1. Introduction

A Software Requirements Specification (SRS) is a complete description of the behaviour of the system to be developed. It includes a set of use cases that describe all the interactions the users will have with the software. In addition to use cases, the SRS also contains non-functional requirements, which impose constraints on the design or implementation (for example performance tuning, quality standards, or design constraints).

System requirements and specifications form a structured collection of information that embodies the requirements of a system. A business analyst, sometimes acting as a system analyst, is responsible for understanding the business needs of clients and stakeholders, helping them identify business problems and evaluate solutions. In the SDLC, business analysts perform a liaison function between the business side and the information technology department. Projects are subject to three kinds of requirements:

 Business requirements describe, in business terms, what must be delivered or accomplished to provide value.

 Product requirements describe the properties or details of a system or product.

 Process requirements describe the set of requirements used as input by the developing organization. For instance, process requirements could prescribe a preliminary study to analyse project feasibility, i.e. the likelihood that the system will be useful to the organization. The standard aim is to test the technical, operational and economic feasibility of adding new modules and debugging old ones. All projects are feasible given unlimited resources and unbounded time. There are a few aspects to consider in the feasibility study of the preliminary investigation:



4.2 Software Specifications

• Operating system: Windows 10
• Coding language: Python
• IDE: Spyder
• Development kit: Python 3.6

4.3 Hardware specifications

• System: Intel Core i5-4210U
• Hard disk: 750 GB
• Monitor: 14.0" diagonal touch-screen monitor
• Mouse: touchpad
• Internet connection: data card / broadband connection
CHAPTER 5
DESIGN

5. DESIGN

5.1 UML DIAGRAMS:


UML is a standard for specifying, constructing, documenting and visualizing the artifacts of software systems. UML was designed by the Object Management Group (OMG), and the UML 1.0 specification draft was proposed to the OMG in January 1997.

UML is a general-purpose visual modelling language used to visualize, specify, construct and document any software system. UML is commonly used to model software systems, but it has no restrictions or limits: it is also used to model non-software artifacts such as the process flow in a manufacturing unit.

5.1.1 Class Diagram

Class diagrams can be used both in the early phases of a project and during design activities. A class diagram consists of classes, associations and generalizations, and can exist at various levels of detail. It defines the static structure of a system, which is divided into parts called classes, together with the relations between those classes and the methods belonging to each class.

The class diagram is considered the fundamental building block of object modelling. It is used for general conceptual modelling of the application, and later for more detailed modelling that translates the models into program code. These diagrams can also be used for data modelling. The modules in this diagram represent the main objects and their interactions in the application, as well as the classes to be programmed.

A class is drawn as a box with three sections. In the class diagram below, each class contains three parts:

● The top section of the box gives the name of the class.

● The middle section lists the attributes of the class.

● The bottom section lists the methods performed by the class.
Figure: 5.1.1
5.1.2 Use case diagram

Use case diagrams are also called behavioural diagrams; they represent a set of actions (use cases) that some modules (subjects) should perform in collaboration with one or more external users of the system (actors). Each use case should provide the correct result to the actors or other participants of the system. These diagrams are produced at an early stage of project development and show how the final system will be used. Use cases are a good way to describe the functional requirements of a software system; they are easy to understand, so they can be used in discussions with non-programmers. The participants in a UML use case diagram are use cases, one or more actors, and the associations and generalizations between them, as shown in the following diagram.

Figure: 5.1.2
5.1.3 Sequence Diagram:

A sequence diagram shows the interaction between several objects through the messages that may be dispatched between them. The diagrams consist of interacting objects and actors, with messages between them, and usually focus the model on scenarios derived from use cases. A sequence diagram is also a useful input to the detailed class diagram.

Figure: 5.1.3

5.1.4 Activity Diagram

An activity diagram shows the flow through a program from a defined start point to an end point. Activity diagrams describe the workflow and behaviour of a system. These diagrams are similar to state diagrams because activities are the state of doing something. They depict the state of activities by showing the sequence of activities performed, and can also show activities that are conditional or parallel. The essential elements in activity diagrams are activities, branches (conditions or decisions), transitions, forks and joins.

The activity diagram is another important UML diagram for depicting the dynamic aspects of a system. It is similar to a flow chart, representing the flow from one activity to the next, where an activity can be seen as an operation performed by the system.

Figure: 5.1.4

5.1.5 DATAFLOW DIAGRAMS

Data flow diagrams (DFDs) explain how information is processed by a system in terms of inputs and outputs. Data flow diagrams can be used to give a clear representation of any function, and difficult processes can easily be automated with the help of DFDs using simple, freely downloadable tools. A DFD is a diagrammatic model for building and analysing the information process. A data flow diagram explains the flow of data in a process based on its inputs and outputs; any DFD can also be referred to as a process model. A DFD describes a technical process with the help of the data stored, the data flowing from one process to another, and the final result.

Figure: 5.1.5

5.1.6 FLOW CHART


Figure: 5.1.6

DESCRIPTION OF FLOWCHART:

1. Once the e-mail is received, check whether the sender is on the blacklist.
2. If yes, reject the mail and stop the process.
3. If the sender is not on the blacklist, verify whether the e-mail is forged.
4. If the mail is forged, reject it; otherwise check whether the sender is on the whitelist.
5. If the sender is on the whitelist, deliver the mail and stop the process; if not, add the sender to the suspicious list and deliver the mail.
6. Finally, the e-mail is placed in the user's inbox or spam folder.
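The flowchart's decisions above can be sketched as a single routing function. The list contents and the forgery check are placeholders; a real system would back them with DNS/SPF checks and persistent lists.

```python
# The flowchart steps, sketched as one routing decision per incoming mail.

def route_email(sender, is_forged, blacklist, whitelist, suspicious):
    """Return 'reject' or 'deliver', following the flowchart's decisions."""
    if sender in blacklist:      # steps 1-2: blacklisted -> reject
        return "reject"
    if is_forged:                # steps 3-4: forged -> reject
        return "reject"
    if sender not in whitelist:  # step 5: unknown sender -> mark suspicious
        suspicious.add(sender)
    return "deliver"             # step 6: goes to inbox or spam folder

suspicious = set()
print(route_email("[email protected]", False,
                  blacklist={"[email protected]"},
                  whitelist={"[email protected]"},
                  suspicious=suspicious))
```

Suspicious senders are still delivered (per step 5) but accumulate in a list for later review.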
CHAPTER 6

IMPLEMENTATION

6. IMPLEMENTATION

6.1 Introduction

The implementation phase is considered one of the most important tasks in the project, and it is a stage in which one must be careful, since all of the efforts put into the project interact closely with each other. This is the most critical stage in producing successful software and in giving the user confidence that the new software or system is functional and gives effective results. Each individual program goes through testing at development time using test data, verifying that the programs link together in the way specified. The new software or computer system and its environment are tested until the user is satisfied.

This phase is less creative than system design. It covers user training for the system and any required file conversions. The system may require substantial user training, and the initial parameters of the system may need to be changed as a result of programming. A simple methodology is provided so that the user can understand the different functions clearly and quickly. The proposed system is easy to implement. In general, implementation is the process of converting a new or revised system design into an operational one.

6.2 TECHNOLOGIES USED

Python is a well-known programming language. It was created in 1991 by Guido van Rossum.

It is used for:

● Web development (server-side),

● Software development,

● Mathematics,

● System scripting.

● Python can be used on a server to create web applications.

● Python can be used alongside software to create workflows.

● Python can connect to database systems. It can also read and modify files.

● Python can be used to handle big data and perform complex mathematics.

● Python can be used for rapid prototyping, or for production-ready software development.

Python works on various platforms:

 The latest major version of Python is Python 3, which is used in this project. However, Python 2, although no longer updated with anything besides security fixes, is still quite popular.

 Python can be written in a plain text editor. It is also possible to write Python in an Integrated Development Environment such as Thonny, PyCharm, NetBeans or Eclipse, which are especially useful when managing larger collections of Python files.

6.3 ALGORITHM:

Hybrid E-Mail spam filtering technique:

HYST is used as a classifier model that uses multiple sets of data and classifies any given instance by majority vote. The dataset used with this algorithm consists of 4600 records in total, 58 features and 2 classes, i.e. spam and not spam. Each row of the table represents a separate record, and the columns of the dataset represent features such as FREE and URGENT. We use the concept of bagging to generate K sets; in the next step, attribute bagging is applied. After obtaining the resultant set, the confusion matrix for HYST is prepared.

Description of the HYST Algorithm:

Table 1: A dataset which consists of 4600 E-Mails, 58 features and 2 classes.

F1 F2 F3 F4 F5 ------ ------- F58 CLASS


FA1 FB1 FC1 FD1 FE1 ------ ------- FD58 SPAM
FA2 FB2 FC2 FD2 FE2 ------ ------- ---- HAM
-- -- -- -- -- ------ ------- ----- --
-- -- -- -- -- ------ ------ ----- --
-- -- -- -- -- ----- ------ ----- --
-- -- -- -- -- --- ---- ---- ----
FAI FBI FCI FDI FEI ----- ------- FDI SPAM
-- -- -- -- -- ---- ----- ----- ------
-- --- -- -- -- ---- ----- ----- -----
-- --- --- --- --- ---- ---- ---- -----
---- ---- ---- ---- ---- ---- ----- ----- ----
FA4600 FB4600 FC4600 FD4600 FE4600 ------ -------- FD4600 HAM

Table: 6.3.1
Step-1: We apply K iterations of bagging to create total K number of trees.

Bagging (Bootstrap aggregating): For a standard training set D of size n, bagging generates m new training sets Di, each of size n', by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each Di.
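The bootstrap sampling just described can be sketched directly: each of the m new training sets is drawn from D uniformly with replacement, so records can repeat within a sample.

```python
# Bootstrap sampling (the core of bagging): draw m samples of a given
# size from the dataset, uniformly and with replacement.
import random

def bagging_samples(records, m, sample_size, seed=0):
    """Generate m bootstrap samples of the given size from the dataset."""
    rng = random.Random(seed)  # fixed seed for a reproducible sketch
    return [[rng.choice(records) for _ in range(sample_size)]
            for _ in range(m)]

dataset = list(range(10))  # stands in for the 4600 mail records
samples = bagging_samples(dataset, m=3, sample_size=5)
print(samples)
```

In the project's setting, `records` would be the 4600 mail rows and each sample (like Table 2's 1530 rows) would train one tree.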

Table 2: A dataset which consists of 1530 E-Mails, 58 features and 2 classes selected from
table 1.

F1       F2       F3       F4       F5       ...   F58      CLASS

FA1      FB1      FC1      FD1      FE1      ...   FD58     SPAM
FA2      FB2      FC2      FD2      FE2      ...   ...      HAM
...      ...      ...      ...      ...      ...   ...      ...
FAI      FBI      FCI      FDI      FEI      ...   FDI      SPAM
...      ...      ...      ...      ...      ...   ...      ...
FA1530   FB1530   FC1530   FD1530   FE1530   ...   FD1530   HAM

Table: 6.3.2

Step-2: For each of the K sampled training sets, we apply attribute bagging and learn a
decision tree; the variable chosen at each new node is the best variable (i.e., the one with the
least misclassification error) among the extracted random subspace.

Table 3: A dataset which consists of 1530 E-Mails, 8 features and 2 classes selected from
table 2.

F1       F8       F16      F23      F32      F45      F50      F58      CLASS

FA1      FB8      FC16     FD23     FE32     FF45     FG50     FH58     SPAM
...      ...      ...      ...      ...      ...      ...      ...      ...
FAI      FBI      FCI      FDI      FEI      ...      ...      FDI      SPAM
...      ...      ...      ...      ...      ...      ...      ...      ...
FA1530   FB1530   FC1530   FD1530   FE1530   ...      ...      FD1530   HAM

Table: 6.3.3

The number of records stays the same at 1530; we merely make a random selection of
features. This gives our first sample set, to which we apply a decision-tree induction
criterion such as information gain or gain ratio to grow the tree.
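As a sketch of the split criterion just mentioned, information gain can be computed from Shannon entropy; `entropy` and `information_gain` are illustrative helper names, and the four-record label set is made up:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, mask):
    """Entropy reduction from splitting `labels` by the boolean feature test `mask`."""
    left, right = labels[mask], labels[~mask]
    n = len(labels)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - child

y = np.array(["spam", "spam", "ham", "ham"])
split = np.array([True, True, False, False])   # a perfectly separating feature test
print(information_gain(y, split))               # 1.0 bit: the split is maximally informative
```

At each node the tree-growing procedure would evaluate this gain for every feature in the random subspace and split on the best one.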

Table 4: Confusion Matrix

              Predicted SPAM   Predicted HAM
Actual SPAM   676              15
Actual HAM    56               404

Misclassification Rate: how often is the classifier wrong?

Misclassification rate = (FP + FN) / Total = (56 + 15) / 1151 ≈ 0.06

The forest error rate depends upon:

1. The correlation between any two trees

2. The strength of each individual tree in the forest

Attribute bagging: let each training object X_i (i = 1, ..., n) in the training sample set
X = (X_1, X_2, ..., X_n) be a p-dimensional vector X_i = (x_i1, x_i2, ..., x_ip) described by p
features (components). In random subspace sampling, one randomly selects r < p features from
the p-dimensional data set X, obtaining an r-dimensional random subspace of the original
p-dimensional feature space. The modified training set X^b = (X^b_1, X^b_2, ..., X^b_n)
therefore consists of r-dimensional training objects X^b_i = (x^b_i1, x^b_i2, ..., x^b_ir)
(i = 1, 2, ..., n), where the r components x^b_ij (j = 1, 2, ..., r) are randomly selected from
the p components x_ij (j = 1, 2, ..., p) of the training vector X_i. Here b indexes the sampled
subspaces (b = 1, 2, ...).
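The random subspace sampling described above can be sketched as follows; `random_subspace` is a hypothetical helper, and distinct feature indices are assumed (r features chosen without repetition):

```python
import numpy as np

def random_subspace(X, r, rng=None):
    """Project an n x p training set onto r randomly chosen features (r < p)."""
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    features = rng.choice(p, size=r, replace=False)   # indices of the kept features
    return X[:, features], features

X = np.arange(40).reshape(5, 8)        # 5 training objects, p = 8 features
Xb, kept = random_subspace(X, r=3, rng=0)
print(Xb.shape)                         # (5, 3): same objects, fewer features
```

In the report's setting this corresponds to reducing the 58 Spambase features of each bootstrap sample to the 8 shown in Table 3.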

6.4. CODING

# Hybrid email spam filtering

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Email Dataset.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 57].values

# Splitting the dataset into the Training set and Test set
# (sklearn.cross_validation has been renamed sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting the ensemble classifier to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

RESULT:

Accuracy : 93

Misclassification rate : 6

Precision : 96

Artificial Neural Network Algorithm:

Artificial Neural Networks are computing systems inspired by biological neural
networks. A neural network provides a framework that many machine-learning
algorithms use to process complex information.
Figure: 6.4.1

Input: Randomly initialize weights near to zero.

Step 1: Calculate the hidden-layer value by multiplying the input values with the weights:

H1 = X1 * W1 + X2 * W2

Step 2: Apply the activation function to get the hidden-layer output:

OUT_H1 = 1 / (1 + e^(-H1))

Step 3: Similarly calculate the values for the remaining hidden units over all 'n' inputs.

Step 4: Calculate the output values 'Y' by multiplying OUT_H1 with the output-layer weights.

Step 5: Repeat until all output values (i.e. Y1, ..., Yn) are obtained.

Step 6: Compare the predicted result with the actual result and measure the generated error.

Step 7: Update the weights according to the generated error and apply backpropagation until
the desired result is obtained.
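Steps 1 and 2 can be sketched numerically; the inputs and the near-zero random weights below are made-up illustrations, not values from the trained network:

```python
import numpy as np

def sigmoid(h):
    """Step 2 activation: squashes H1 into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)
x = np.array([0.5, 0.1])              # two illustrative inputs X1, X2
w = rng.normal(scale=0.01, size=2)    # weights randomly initialised near zero

h1 = x @ w                            # Step 1: H1 = X1*W1 + X2*W2
out_h1 = sigmoid(h1)                  # Step 2: hidden-layer output
print(round(out_h1, 3))               # close to 0.5 while the weights stay tiny
```

With near-zero weights every hidden output starts near 0.5; backpropagation (Steps 6-7) then moves the weights away from zero to reduce the measured error.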

Code:
# Artificial Neural Network

# Installing Theano
# pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

# Installing Tensorflow
# Install Tensorflow from the website:
# https://www.tensorflow.org/versions/r0.12/get_started/os_setup.html

# Installing Keras
# pip install --upgrade keras

# Part 1 - Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Email Dataset.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 57].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
onehotencoder = OneHotEncoder(categorical_features = [1])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Part 2 - Now let's make the ANN!

# Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense

# Initialising the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 226))

# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

# Part 3 - Making the predictions and evaluating the model

# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

RESULT:

Accuracy : 91

Misclassification rate : 46

Precision : 89

Naïve Bayesian Algorithm:


The Naïve Bayesian algorithm is a simple classification algorithm that assumes the
input variables are independent of one another. In its basic form, Naïve Bayes works with
categorical data.
Figure: 6.4.2

• Naïve Bayes is a supervised learning algorithm based on Bayes' theorem.

• The algorithm is trained using the training dataset and generates the classifier model
for us.

• Naïve Bayes is based on the model of conditional probability: the probability of a
certain event occurring given that some other event has already taken place.

• In our case, the Spambase dataset contains continuous features; therefore we use the
Gaussian Naïve Bayes algorithm.

• P(spam | word) = P(spam) · P(word | spam) / [ P(spam) · P(word | spam) + P(ham) · P(word | ham) ]
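Plugging illustrative numbers into the formula shows how a single word shifts the spam probability; all probabilities below are made-up values, not estimates from the Spambase data:

```python
# Hypothetical probabilities for one word, e.g. "FREE"
p_spam, p_ham = 0.4, 0.6        # class priors P(spam), P(ham)
p_word_spam = 0.30              # P(word | spam)
p_word_ham = 0.01               # P(word | ham)

# Bayes' theorem: posterior probability that an email containing the word is spam
posterior = (p_spam * p_word_spam) / (
    p_spam * p_word_spam + p_ham * p_word_ham
)
print(round(posterior, 3))      # 0.952: the word is strong evidence of spam
```

Even with a lower prior for spam, a word that is thirty times more likely in spam than in ham pushes the posterior above 95%.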

Code:

# Naive Bayes

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Email Dataset.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 57].values

# Splitting the dataset into the Training set and Test set
# (sklearn.cross_validation has been renamed sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

RESULT:

Accuracy : 79

Misclassification rate : 20

Precision : 67

6.5. SCREENSHOTS

HYST:
Variable explorer of HYST:

Training Set of X:

Testing set of X:
Confusion Matrix for HYST:

Figure: 6.5.1

Naïve Bayesian:
Code:

Variable explorer of Naive Bayesian:


Training Set of X:

Testing set of X:
Confusion Matrix for Naive Bayesian:

Figure: 6.5.2

Neural Network:
Coding:

Variable Explorer:
Training set of X:

Testing Set of X:
Confusion Matrix for Neural Network:

Figure: 6.5.3
CHAPTER 7

RESULTS
7.1 RESULTS:

Metric                   Naïve Bayesian   Neural Network   HYST

Accuracy                 79               91               93
Misclassification Rate   20               46               6
True Positive Rate       92               90               87
False Positive Rate      29               7                2
True Negative Rate       70               92               97
Precision                67               89               96

Table: 7.1.1

The results are summarized in the form of a confusion matrix. A confusion matrix is a
technique for summarizing the performance of a classification algorithm. A detailed
explanation of the table is given below:

Accuracy: It is the ratio of correct predictions to total predictions made.

Accuracy = (TP + TN) / Total Predictions

Misclassification Rate: how often is the classifier wrong?

Misclassification rate = (FP + FN) / Total

True Positive Rate: when it is actually yes, how often does it predict yes?

TP Rate = TP / Actual Yes

False Positive Rate: when it is actually no, how often does it predict yes?

FP Rate = FP / Actual No

True Negative Rate: when it is actually no, how often does it predict no?

TN Rate = TN / Actual No

Precision: the number of correctly classified positive examples divided by the total number
of predicted positive examples.

Precision = TP / (TP + FP)
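The formulas above can be collected into one small helper (`metrics` is a hypothetical name); fed the HYST confusion-matrix counts from Table 4, it yields accuracy ≈ 0.94 and misclassification ≈ 0.06:

```python
def metrics(tp, fn, fp, tn):
    """All six table metrics computed from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "tp_rate": tp / (tp + fn),     # actual spam correctly flagged
        "fp_rate": fp / (fp + tn),     # ham wrongly flagged as spam
        "tn_rate": tn / (fp + tn),     # actual ham correctly passed
        "precision": tp / (tp + fp),
    }

# Counts taken from the HYST confusion matrix in Table 4
m = metrics(tp=676, fn=15, fp=56, tn=404)
print({k: round(v, 2) for k, v in m.items()})
```

The report rounds these ratios to whole percentages, so the printed values agree with the table only up to rounding.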

The values in the results table clearly show that:

 Accuracy of HYST is greater than that of Naïve Bayesian and Neural Network.

 Misclassification rate of HYST is much lower than that of the other existing algorithms.

 False positive rate of HYST is lower than that of Neural Network and Naïve Bayesian.

 True negative rate is highest for HYST.

 Precision of HYST is higher than that of the two existing algorithms.


7.2 Graphs:

Figure: 7.2.1
Figure: 7.2.2

Figure: 7.2.3

Figure: 7.2.4
Figure: 7.2.5
CHAPTER 8

CONCLUSION

8. CONCLUSION
In order to solve the problems in existing e-mail spam filtering techniques, the
proposed work has introduced a new technique that applies the HYST algorithm to classify
e-mails as spam or not in an efficient way. The precision rate has been gradually increased by
the proposed algorithm, and HYST performed very well with an improvement of 5%. Future
research will be concerned with attribute (feature) selection to improve the accuracy rate,
because the electronic-mail dataset contains a large number of irrelevant attributes.
REFERENCES
[1] T. Subramaniam, H. A. Jalab, and A. Y. Taqa, "Overview of textual anti-spam filtering
techniques," Int. J. Phys. Sci, vol. 5, pp. 1869-1882, 2010

[2] E.-S. M. El-Alfy, “Learning Methods for Spam Filtering,” International Journal of
Computer Research, vol. 16, no. 4, 2008.
[3] Karl-Michael Schneider: “A Comparison of Event Models for Naïve Bayes Anti-Spam E-
Mail Filtering.” In Proceedings of the 10th Conference of the European Chapter of the
Association for Computational Linguistics, Budapest, Hungary, pp. 307-314, April 2003.

[4] I. Androutsopoulos et al.: “Learning to Filter Spam E-mail: A Comparison of a Naïve
Bayesian and a Memory-based Approach.” In Proceedings of the Workshop on Machine
Learning and Textual Information Access, pp. 1-13, 2000.

[5] J. Provost, “Naïve-Bayes vs. rule-learning in classification of email,” The University of


Texas at Austin, Department of Computer Sciences, Technical Report AI-TR-99-284, 1999.

[6] Y. Yang, S. Elfayoumy, “Anti-spam filtering using neural networks and Bayesian
classifiers,” in Proceedings of the 2007 IEEE International Symposium on Computational
Intelligence in Robotics and Automation, Jacksonville, FL, USA, June 2007.

[7] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, “A Bayesian approach to filtering


junk e-mail,” in Proceedings of AAAI’98 Workshop on Learning for Text Categorization,
Madison, WI, July 1998.

[8] E.-S. M. El-Alfy and F. S. Al-Qunaieer, “A fuzzy similarity approach for automated spam
filtering,” in Proceedings of IEEE International Conference on Computer Systems and
Applications (AICCSA’08), Qatar, April 2008.

[9] K. Jain, “A Hybrid Approach for Spam Filtering using Support Vector Machine and
Artificial Immune System,” pp. 5–9, 2014.

[10] Le Zhang, Jingbo Zhu, Tianshun Yao: “An Evaluation of Statistical Spam Filtering
Techniques.” ACM Transactions on Asian Language Information Processing, Vol. 3, No. 4,
Pages 243-269, December, 2004.

[11] Zhuang, L., Dunagan, J., Simon, D.R., Wang, H.J., Tygar, J.D., “Characterizing Botnets
from Email Spam Records,” LEET’08: Proceedings of the 1st USENIX Workshop on Large-Scale
Exploits and Emergent Threats, Article No. 2, 2008.

[12] Enrico Blanzieri, Anton Bryl, “A Survey of Learning-Based Techniques of Email Spam
Filtering,” Technical Report # DIT-06-056, 2008.
[13] Steve Webb, James Caverlee, Calton Pu, “Introducing the Webb Spam Corpus: Using
Email Spam to Identify Web Spam Automatically,” CEAS, 2006.

[14] Sculley, D., Gabriel M. Wachman, “Relaxed Online SVMs for Spam Filtering,” SIGIR
2007 Proceedings, 2007.

[15] “The Enron Corpus: A New Dataset for Email Classification Research,” ECML (2004),
pp. 217-226.

[16] Jitesh Shetty, Jafar Adibi, “Discovering Important Nodes through Graph Entropy: The
Case of Enron Email Database,” KDD 2005, Chicago, Illinois, 2005.

[17] Shinjae Yoo, Yiming Yang, Frank Lin, Il-Chul Moon, “Mining Social Networks for
Personalized Email Prioritization,” KDD’09, June 28–July 1, Paris, France, 2009.
