Spam PDF
Abstract— E-mail spam, also known as unsolicited bulk email (UBE), junk mail, or unsolicited commercial email (UCE), is the practice of sending unwanted e-mail messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients. Spam is prevalent on the Internet because the transaction cost of electronic communication is radically lower than that of any alternative form of communication. Many spam filters use different approaches to identify an incoming message as spam, ranging from white list / black list, Bayesian analysis, keyword matching, mail header analysis, postage, and legislation to content scanning. Even so, we are still flooded with spam emails every day. This is not because the filters are not powerful enough; it is due to the swift adoption of new techniques by the spammers and the inflexibility of spam filters in adapting to the changes. In our work, we employed supervised machine learning techniques to filter email spam messages. Widely used supervised machine learning techniques, namely the C4.5 decision tree classifier, Multilayer Perceptron, and Naïve Bayes classifier, are used to learn the features of spam emails, and the models are built by training with known spam emails and legitimate emails. The results of the models are discussed.
II. SPAM FILTER ARCHITECTURE AND METHODS
E-mail spam, known as unsolicited bulk email (UBE), junk mail, or unsolicited commercial email (UCE), is the practice of sending unwanted e-mail messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients. The technical definition of spam is: ‘An electronic message is "spam" if (A) the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients; and (B) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent’. The risk in filtering spam is that legitimate mails may sometimes be rejected or denied, or may be marked as spam. The risk of not filtering spam is that the constant flood of spam not only clogs networks and adversely impacts user inboxes, but also drains valuable resources such as bandwidth and storage capacity, causes productivity loss, and interferes with the expedient delivery of legitimate emails.
A. MultiLayer Perceptron
There are a number of rules framed by considering the various features that help to identify spam messages effectively. Each rule performs a test on the email, and each rule has a score. When an email is processed, it is tested against each rule. For each rule found to be true for an email, the score associated with the rule is added to the overall score for that email. Once all the rules have been applied, the total score for the email is compared to a threshold value. If the score exceeds the threshold, the email is marked as spam; otherwise it is classified as legitimate mail. The rules used in this work are listed in Table I.
TABLE I
SCHEME OF RULES ASSIGNED TO EACH SPAM FEATURE
From name meaningful
From domain name
Blocked IP
Apostrophe in From name
From name in Auto Whitelist (AWL)
From address in User’s Block list
From address in User’s White list
Content Type
Content Boundary exists
To name meaningful
To address Undisclosed recipients
To header original
From address and To address same
Is subject present
Subject content has obfuscate words
Is forwarded message
Is reply message
Subject Reply without reference header
Is message body exists
Sensual message
Repeated double quotes in body
Character set includes foreign language
More blank lines in body
Among these 23 rules, some are simple and some are associated with one another. A simple rule could search for the word ‘Viagra’ in the subject line of an email, while a complex rule may involve comparing an email against an online database of spam. Each rule adds to the overall score, so an email that triggers only one rule, for example because it uses the word ‘Viagra’, will not necessarily be marked as spam. However, if an email triggers several rules, its combined score could exceed the threshold and the mail could be marked as spam.
V. EXPERIMENT AND RESULTS
The email spam filtering has been carried out using WEKA. Weka, an open-source, portable, GUI-based workbench, is a collection of state-of-the-art machine learning algorithms and data preprocessing tools.
The training dataset, a corpus of spam and legitimate messages, is generated from the mails that we received from our institute mail server over a period of six months. The mails are analyzed and 23 rules are identified that greatly ease the process of classifying spam messages. The corpus consists of 750 spam messages and 750 legitimate messages. From the corpus, the feature vectors are extracted by analyzing the message header, keyword checking, whitelist/blacklist, etc. The class labels are designated as L and S to represent legitimate and spam messages respectively.
The machine learning techniques Naïve Bayes classifier, C4.5 decision tree classifier, and Multilayer Perceptron are used for training on the dataset in the WEKA environment. The training is carried out with the feature vectors extracted by analyzing each message header, keyword checking, and whitelist/blacklist.
The performance of the trained models is evaluated using 10-fold cross validation for predictive accuracy. Predictive accuracy is used as a performance measure for email spam classification. The prediction accuracy is measured as the ratio of the number of correctly classified instances in the test dataset to the total number of test cases.
In spam filtering, false negatives just mean that some spam mails are classified as legitimate and moved to the inbox. False positives mean that legitimate emails are mistakenly identified as spam and moved to the spam folder or discarded. For most users, missing legitimate email is an order of magnitude worse than receiving spam. The false positive rate of each classifier is therefore also considered in measuring its performance. The performance of the classifiers is summarized in Table II and shown in Fig. 2 and Fig. 3.
TABLE II
COMPARATIVE RESULTS OF THE CLASSIFIERS
Evaluation Criteria              Naïve Bayes   J48     MLP
Training time (secs)             0.15          0.20    138.05
Correctly Classified Instances   1479          1449    1490
Prediction Accuracy (%)          98.6          96.6    99.3
False Positive (%)               5             4       1
Fig. 1 Classification Accuracy (accuracy in % for Naïve Bayes, J48, and MLP)
The performance of the three models was evaluated based on three criteria: prediction accuracy, learning time, and false positive rate. The Multilayer Perceptron predicts better than the other algorithms.
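As a worked check of the evaluation measures, the following Python sketch computes prediction accuracy as the ratio of correctly classified instances to the total number of test cases, using the counts reported in Table II and the 1500-message corpus. The function names are illustrative assumptions. The paper does not state how its false positive percentages were derived, so the false-positive helper assumes the conventional definition (legitimate messages flagged as spam, divided by all legitimate messages) and is shown only with made-up counts.

```python
def prediction_accuracy(correct: int, total: int) -> float:
    """Percentage of test cases that were classified correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def false_positive_rate(false_positives: int, legitimate_total: int) -> float:
    """Percentage of legitimate messages wrongly flagged as spam.

    Assumption: FP rate = false positives / all legitimate messages.
    """
    if legitimate_total <= 0:
        raise ValueError("legitimate_total must be positive")
    return 100.0 * false_positives / legitimate_total


# Corpus size from the paper: 750 spam + 750 legitimate messages.
TOTAL_MESSAGES = 750 + 750

# Correctly classified instances as reported in Table II.
CORRECT = {"Naive Bayes": 1479, "J48": 1449, "MLP": 1490}

accuracies = {
    name: prediction_accuracy(count, TOTAL_MESSAGES)
    for name, count in CORRECT.items()
}

for name, accuracy in accuracies.items():
    print(f"{name}: {accuracy:.1f}%")

# Hypothetical counts for illustration only (not taken from the paper).
example_fp_rate = false_positive_rate(false_positives=3, legitimate_total=750)
```

Rounded to one decimal place, the computed accuracies agree with the percentages listed in Table II.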
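The rule-scoring scheme described with Table I (each rule that fires adds its score, and the total is compared with a threshold) can be sketched in Python as follows. The rule predicates, the scores, the threshold of 5.0, and the dictionary-based email representation are all hypothetical placeholders chosen for illustration; the paper does not list the actual scores or the threshold it used.

```python
from typing import Callable, Dict, List, Tuple

Email = Dict[str, str]
# A rule is (name, test function, score added when the test is true).
Rule = Tuple[str, Callable[[Email], bool], float]

# Hypothetical rules and scores, loosely modelled on entries of Table I.
RULES: List[Rule] = [
    ("subject_missing", lambda e: not e.get("subject", "").strip(), 2.0),
    ("subject_has_spam_keyword", lambda e: "viagra" in e.get("subject", "").lower(), 3.0),
    ("from_equals_to", lambda e: bool(e.get("from")) and e.get("from") == e.get("to"), 2.5),
    ("undisclosed_recipients", lambda e: "undisclosed" in e.get("to", "").lower(), 1.5),
    ("body_missing", lambda e: not e.get("body", "").strip(), 2.0),
]

THRESHOLD = 5.0  # assumed value, not from the paper


def score_email(email: Email, rules: List[Rule] = RULES) -> float:
    """Sum the scores of every rule whose test is true for this email."""
    return sum(score for _name, test, score in rules if test(email))


def classify(email: Email, threshold: float = THRESHOLD) -> str:
    """Return 'S' (spam) if the total score exceeds the threshold, else 'L'."""
    return "S" if score_email(email) > threshold else "L"


# Triggers only the keyword rule: score 3.0, below the threshold.
one_rule = {
    "from": "a@example.com",
    "to": "b@example.com",
    "subject": "Viagra prices",
    "body": "Hello",
}

# Triggers keyword, same sender/recipient, and empty body: score 7.5.
several_rules = {
    "from": "x@example.com",
    "to": "x@example.com",
    "subject": "VIAGRA",
    "body": "",
}

print(classify(one_rule))
print(classify(several_rules))
```

The two sample messages mirror the point made in the text: a single keyword hit stays below the assumed threshold, while several rules firing together push the message over it.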
Fig. 1 Learning Time of the Models (learning time in seconds for Naïve Bayes, J48, and MLP)
The Multilayer Perceptron, the neural network classifier, consumes more time to build the model. Naïve Bayes, the probabilistic classifier, and the decision tree model tend to learn more rapidly for the given data set.
VI. CONCLUSION
... algorithms. Email spam filters using this approach can be adopted either at the mail server or at the mail client side to reduce the amount of spam messages and to reduce the risk of productivity loss and of bandwidth and storage usage.
REFERENCES
[1] Ahmed Khorsi, "An Overview of Content-based Spam Filtering Techniques", Informatica, vol. 31, no. 3, October 2007, pp. 269-277.
[2] Alistair McDonald, "SpamAssassin: A Practical Guide to Integration and Configuration", 1st Edition, Packt Publishers, 2004.
[3] Ian H. Witten, Eibe Frank, "Data Mining – Practical Machine Learning Tools and Techniques", 2nd Edition, Elsevier, 2005.