
V. Christina et al. / (IJCSE) International Journal on Computer Science and Engineering
Vol. 02, No. 09, 2010, 3126-3129

Email Spam Filtering using Supervised Machine Learning Techniques

V. Christina#, S. Karpagavalli*, G. Suganya#
# M.Phil Research Scholar, Department of Computer Science (PG), P.S.G.R Krishnammal College for Women
* Senior Lecturer, GR Govindarajulu School of Applied Computer Technology

Abstract— E-mail spam, known as unsolicited bulk email (UBE), junk mail, or unsolicited commercial email (UCE), is the practice of sending unwanted e-mail messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients. Spam is prevalent on the Internet because the transaction cost of electronic communication is radically lower than that of any alternative form of communication. Many spam filters use different approaches to identify an incoming message as spam, ranging from whitelists/blacklists, Bayesian analysis, keyword matching, and mail header analysis to postage, legislation, and content scanning. Even so, we are still flooded with spam emails every day. This is not because the filters are not powerful enough; it is due to the swift adoption of new techniques by spammers and the inflexibility of spam filters in adapting to the changes. In our work, we employed supervised machine learning techniques to filter email spam messages. Widely used supervised machine learning techniques, namely the C4.5 decision tree classifier, Multilayer Perceptron, and Naïve Bayes classifier, are used for learning the features of spam emails, and the model is built by training with known spam emails and legitimate emails. The results of the models are discussed.

Keywords— Spam, Spam filter, Spammer, Mail header, Machine learning, Classifier

I. INTRODUCTION

The Internet has become an integral part of everyday life, and e-mail has become a powerful tool for information exchange. Along with the growth of the Internet and e-mail, there has been a dramatic growth in spam in recent years. Spam can originate from any location across the globe where Internet access is available. Despite the development of anti-spam services and technologies, the number of spam messages continues to increase rapidly. In order to address the growing problem, each organization must analyze the tools available to determine how best to counter spam in its environment. Tools such as the corporate e-mail system, e-mail filtering gateways, contracted anti-spam services, and end-user training provide an important arsenal for any organization. However, users cannot avoid the very serious problem of attempting to deal with large amounts of spam on a regular basis. If there are no anti-spam activities, spam will inundate network systems, kill employee productivity, steal bandwidth, and still be there tomorrow.

II. SPAM FILTER ARCHITECTURE AND METHODS

E-mail spam, known as unsolicited bulk email (UBE), junk mail, or unsolicited commercial email (UCE), is the practice of sending unwanted e-mail messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients. The technical definition of spam is: ‘An electronic message is "spam" if (A) the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients; and (B) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent’. The risks of filtering spam are that legitimate mail may sometimes be rejected or denied, or may be marked as spam. The risks of not filtering spam are that the constant flood of spam clogs networks and adversely impacts user inboxes, drains valuable resources such as bandwidth and storage capacity, causes productivity loss, and interferes with the expedient delivery of legitimate emails.

Spam filters can be implemented at all layers. Firewalls in front of the email server, or filters at the MTA (Mail Transfer Agent) or email server, provide an integrated anti-spam and anti-virus solution offering complete email protection at the network perimeter level, before unwanted or potentially dangerous email reaches the network. Spam filters can also be installed at the MDA (Mail Delivery Agent) level as a service to all customers. At the email client, users can have personalized spam filters that automatically filter mail according to the chosen criteria. Figure 1 shows the typical architecture of a spam filter.

The several different methods to identify incoming messages as spam include whitelists/blacklists, Bayesian analysis, mail header analysis, and keyword checking. A whitelist is a list of all addresses from which the user always wishes to receive mail.

Users can add email addresses, entire domains, or functional domains. An interesting option is an automatic whitelist management tool that eliminates the need for administrators to manually enter approved addresses on the whitelist and ensures that mail from particular senders or domains is never flagged as spam.
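As a minimal sketch of list-based filtering, the following Python fragment looks up a sender by full address or by domain. The function names, the sample entries, and the choice to let the whitelist take precedence are assumptions made for illustration; they are not taken from the paper.

```python
def sender_domain(address: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return address.rsplit("@", 1)[-1].lower()


def is_listed(address: str, entries: set[str]) -> bool:
    """True if the full address or its domain appears in the list.

    Entries are expected to be stored in lower case.
    """
    address = address.strip().lower()
    return address in entries or sender_domain(address) in entries


def list_verdict(address: str, whitelist: set[str], blacklist: set[str]) -> str:
    """Return 'accept', 'reject', or 'unknown' for a sender address.

    Assumption: the whitelist is consulted first, so a whitelisted
    sender is accepted even if its domain is also blacklisted.
    """
    if is_listed(address, whitelist):
        return "accept"
    if is_listed(address, blacklist):
        return "reject"
    return "unknown"


# Hypothetical list contents, for illustration only.
WHITELIST = {"alice@example.org", "trusted.example"}
BLACKLIST = {"spam.example"}
```

A sender that matches neither list returns "unknown" and would be passed on to the other filtering methods.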

ISSN : 0975-3397 3126



The number of records kept in the whitelist can be configured. When an overflow occurs, obsolete records are overwritten. A blacklist works similarly: it is a list of addresses from which the user never wants to receive mail. Mail header checking consists of a set of rules that, if a mail header matches, trigger the mail server to return the message: for example, messages that have a blank "From" field, that list many addresses from the same source in the "To" field, or that have too many digits in email addresses (a fairly popular method of generating false addresses). It also makes it possible to return messages by matching the language code declared in the header.

In Bayesian analysis, word probabilities (also known as likelihood functions) are used to compute the probability that an email containing a particular set of words belongs to either category. The contribution of each word is called the posterior probability and is computed using Bayes' theorem. The email's spam probability is then computed over all words in the email, and if the total exceeds a certain threshold, the filter marks the email as spam. Keyword checking is another method widely used in filtering spam. It works by scanning both the email subject and body. Using "conditions", i.e. combinations of keywords, is a good way to enhance filtering efficiency. We can specify combinations of words that must appear in a spam email and update the list; all messages that include these words will be blocked.

III. METHODOLOGY

Most spam filtering techniques are based on text categorization methods. Thus, filtering spam turns into a classification problem. In our work, rules are framed to extract a feature vector from each email. As the characteristics of discrimination are not well defined, it is more convenient to apply machine learning techniques. Three machine learning algorithms, the C4.5 decision tree classifier, Multilayer Perceptron, and Naïve Bayes classifier, are used for learning the classification model.

A. Multilayer Perceptron

The Multilayer Perceptron (MLP) network is the most widely used neural network classifier. MLP networks are general-purpose, flexible, nonlinear models consisting of a number of units organised into multiple layers. The complexity of the MLP network can be changed by varying the number of layers and the number of units in each layer. Given enough hidden units and enough data, it has been shown that MLPs can approximate virtually any function to any desired accuracy. In other words, MLPs are universal approximators. MLPs are valuable tools in problems where one has little or no knowledge about the form of the relationship between input vectors and their corresponding outputs.

B. C4.5 Decision Tree Induction

Decision tree classification generates its output as a binary-tree-like structure called a decision tree, in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision. A decision tree model contains rules to predict the target variable. This algorithm scales well, even where there are varying numbers of training examples and considerable numbers of attributes in large databases.

The J48 algorithm is an implementation of the C4.5 decision tree learner. This implementation produces decision tree models. The algorithm uses a greedy technique to induce decision trees for classification. A decision tree model is built by analyzing training data, and the model is used to classify unseen data. J48 generates decision trees, the nodes of which evaluate the existence or significance of individual features.

C. Naïve Bayes Classification

The Naïve Bayes classifier (NB) is a simple but effective classifier that has been used in numerous applications of information processing, including natural language processing and information retrieval. The Naïve Bayes technique is based on Bayes' theorem and is particularly suited to cases where the dimensionality of the inputs is high. Naïve Bayes classifiers assume that the effect of a variable value on a given class is independent of the values of the other variables. The Naïve Bayes inducer computes conditional probabilities of the classes given the instance and picks the class with the highest posterior. Depending on the precise nature of the probability model, Naïve Bayes classifiers can be trained very efficiently in a supervised learning setting.

IV. FEATURE EXTRACTION

The work is based on rules and uses a score-based system. The rules are framed by analyzing the mail header information, keyword matching, and the body of the message, and a relative score is assigned to each rule.
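A score-based rule system of this kind can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the three rules, their scores, the threshold value, and the dictionary representation of a message are all hypothetical. The sketch shows both uses of the rules, namely producing a binary feature vector for a classifier and summing the scores of the rules that fire.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical message representation: a dict of header and body fields.
Message = dict


@dataclass(frozen=True)
class Rule:
    name: str
    score: float                      # relative score (illustrative value)
    test: Callable[[Message], bool]   # True when the rule fires


# Illustrative rules; names, scores, and tests are assumptions.
RULES = [
    Rule("subject_missing", 1.5, lambda m: not m.get("subject")),
    Rule("from_equals_to", 2.0,
         lambda m: bool(m.get("from")) and m.get("from") == m.get("to")),
    Rule("body_missing", 1.0, lambda m: not m.get("body")),
]


def feature_vector(message: Message, rules=RULES) -> list[int]:
    """One binary feature per rule: 1 if the rule fires, else 0."""
    return [1 if rule.test(message) else 0 for rule in rules]


def total_score(message: Message, rules=RULES) -> float:
    """Sum the scores of all rules that fire for this message."""
    return sum(rule.score for rule in rules if rule.test(message))


def is_spam(message: Message, threshold: float = 3.0, rules=RULES) -> bool:
    """Mark the message as spam when its total score exceeds the threshold."""
    return total_score(message, rules) > threshold
```

For example, a message with an empty subject whose "from" and "to" fields are identical scores 1.5 + 2.0 = 3.5 under these assumed values and would be marked as spam at a threshold of 3.0.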


There are a number of rules framed by considering the various features that help to identify spam messages effectively. Each rule performs a test on the email, and each rule has a score. When an email is processed, it is tested against each rule. For each rule found to be true for an email, the score associated with the rule is added to the overall score for that email. Once all the rules have been applied, the total score for the email is compared to a threshold value. If the score exceeds the threshold, the email is marked as spam; otherwise it is classified as legitimate mail. The rules used in this work are listed in Table I.

TABLE I
SCHEME OF RULES ASSIGNED TO EACH SPAM FEATURE

From name meaningful
From domain name
Blocked IP
Apostrophe in From name
From name in Auto Whitelist (AWL)
From address in User’s Block list
From address in User’s White list
Content Type
Content Boundary exists
To name meaningful
To address Undisclosed recipients
To header original
From address and To address same
Is subject present
Subject content has obfuscate words
Is forwarded message
Is reply message
Subject Reply without reference header
Is message body exists
Sensual message
Repeated double quotes in body
Character set includes foreign language
More blank lines in body

Of these 23 rules, some are simple and some are associated with one another. A simple rule could search for the word ‘Viagra’ in the subject line of an email, while a complex rule may involve comparing an email against an online database of spam. Each rule adds to the overall score, so an email that triggers only one rule, for example because it uses the word ‘Viagra’, will not necessarily be marked as spam. However, if an email triggers several rules, it will have a combined score that could exceed the threshold, and the mail could be marked as spam.

V. EXPERIMENT AND RESULTS

The email spam filtering has been carried out using WEKA. WEKA is an open-source, portable, GUI-based workbench that collects state-of-the-art machine learning algorithms and data preprocessing tools.

The training dataset, a corpus of spam and legitimate messages, was generated from the mails that we received from our institute mail server over a period of six months. The mails were analyzed and 23 rules were identified that greatly ease the process of classifying spam messages. The corpus consists of 750 spam messages and 750 legitimate messages. From the corpus, the feature vectors are extracted by analyzing the message header, keyword checking, whitelist/blacklist, etc. The class labels are designated as L and S to represent legitimate and spam messages respectively.

The machine learning techniques Naïve Bayes classifier, C4.5 decision tree classifier, and Multilayer Perceptron are used for training on the dataset in the WEKA environment. The training is carried out with the feature vectors extracted by analyzing each message header, keyword checking, and whitelist/blacklist.

The performance of the trained models is evaluated using 10-fold cross validation for predictive accuracy. Predictive accuracy is used as a performance measure for email spam classification. It is measured as the ratio of the number of correctly classified instances in the test dataset to the total number of test cases. In spam filtering, false negatives just mean that some spam mails are classified as legitimate and moved to the inbox. False positives mean that legitimate emails are mistakenly identified as spam and moved to the spam folder or discarded.

For most users, missing legitimate email is an order of magnitude worse than receiving spam. The false positive rate of each classifier is therefore also considered in measuring its performance. The performance of the classifiers is summarized in Table II and shown in Fig. 2 and Fig. 3.

TABLE II
COMPARATIVE RESULTS OF THE CLASSIFIERS

Evaluation Criteria              Naïve Bayes   J48     MLP
Training time (secs)             0.15          0.20    138.05
Correctly Classified Instances   1479          1449    1490
Prediction Accuracy (%)          98.6          96.6    99.3
False Positive (%)               5             4       1

Fig. 2 Classification Accuracy (bar chart of accuracy in % for Naïve Bayes, J48, and MLP)

The performance of the three models was evaluated on three criteria: prediction accuracy, learning time, and false positive rate. Multilayer Perceptron predicts better than the other algorithms.
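The accuracy measure described above can be checked with a short Python sketch. The counts of correctly classified instances and the corpus size (750 spam plus 750 legitimate messages) come from the paper. The false positive function is an assumption: the paper does not state how its false positive percentage is normalized, so the sketch uses the common definition (legitimate messages flagged as spam divided by all legitimate messages), and the function and variable names are hypothetical.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Correctly classified instances as a percentage of all test cases."""
    return 100.0 * correct / total


def false_positive_rate_percent(false_positives: int, legitimate_total: int) -> float:
    """Assumed definition: legitimate messages flagged as spam,
    as a percentage of all legitimate messages."""
    return 100.0 * false_positives / legitimate_total


# Corpus size and correctly classified counts as reported in the paper.
TOTAL_MESSAGES = 750 + 750
CORRECT = {"Naive Bayes": 1479, "J48": 1449, "MLP": 1490}

for name, correct in CORRECT.items():
    print(f"{name}: {accuracy_percent(correct, TOTAL_MESSAGES):.1f}%")
```

Rounded to one decimal place, the computed values agree with the reported prediction accuracies of 98.6, 96.6, and 99.3 percent.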


Fig. 3 Learning Time of the Models (bar chart of build time in seconds for Naïve Bayes, J48, and MLP)

Multilayer Perceptron, the neural network classifier, takes more time to build the model. Naïve Bayes, the probabilistic classifier, and the decision tree model learn more rapidly on the given data set.

VI. CONCLUSION

Although many email spam filtering tools exist, spammers keep adopting new techniques, so email spam filtering remains a challenging problem for researchers. In our work, we generated a spam and legitimate message corpus from recent mails and employed machine learning techniques to build the model. The performance of the model was evaluated using 10-fold cross validation, and we observed that the Multilayer Perceptron classifier outperforms the other classifiers and that its false positive rate is also very low compared to the other algorithms. Email spam filters using this approach can be adopted either at the mail server or at the mail client side to reduce the amount of spam messages and to reduce the risk of productivity loss and of bandwidth and storage usage.
