V. Christina et al. / (IJCSE) International Journal on Computer Science and Engineering
Vol. 02, No. 09, 2010, 3126-3129

Email Spam Filtering using Supervised Machine Learning Techniques

V. Christina#, S. Karpagavalli*, G. Suganya#
# M.Phil Research Scholar, Department of Computer Science (PG), P.S.G.R Krishnammal College for Women
* Senior Lecturer, GR Govindarajulu School of Applied Computer Technology

Abstract: E-mail spam, known as unsolicited bulk email (UBE), junk mail, or unsolicited commercial email (UCE), is the practice of sending unwanted e-mail messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients. Spam is prevalent on the Internet because the transaction cost of electronic communication is radically lower than that of any alternative form of communication. There are many spam filters that use different approaches to identify an incoming message as spam, ranging from whitelists/blacklists, Bayesian analysis, keyword matching, and mail header analysis to postage, legislation, and content scanning. Even so, we are still flooded with spam emails every day. This is not because the filters are not powerful enough; it is due to the swift adoption of new techniques by spammers and the inflexibility of spam filters in adapting to the changes. In our work, we employed supervised machine learning techniques to filter email spam messages. Three widely used supervised machine learning techniques, namely the C4.5 decision tree classifier, Multilayer Perceptron, and Naïve Bayes classifier, are used for learning the features of spam emails, and the model is built by training with known spam emails and legitimate emails. The results of the models are discussed.

Keywords: Spam, Spam filter, Spammer, Mail header, Machine learning, Classifier

I. INTRODUCTION

The Internet has become an integral part of everyday life, and e-mail has become a powerful tool for information exchange. Along with the growth of the Internet and e-mail, there has been a dramatic growth in spam in recent years. Spam can originate from any location across the globe where Internet access is available. Despite the development of anti-spam services and technologies, the number of spam messages continues to increase rapidly. In order to address the growing problem, each organization must analyze the tools available to determine how best to counter spam in its environment. Tools such as the corporate e-mail system, e-mail filtering gateways, contracted anti-spam services, and end-user training provide an important arsenal for any organization. However, users cannot avoid the very serious problem of having to deal with large amounts of spam on a regular basis. If there are no anti-spam activities, spam will inundate network systems, kill employee productivity, steal bandwidth, and still be there tomorrow.

II. SPAM FILTER ARCHITECTURE AND METHODS

E-mail spam, known as unsolicited bulk email (UBE), junk mail, or unsolicited commercial email (UCE), is the practice of sending unwanted e-mail messages, frequently with commercial content, in large quantities to an indiscriminate set of recipients. The technical definition of spam is 'An electronic message is "spam" if (A) the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients; and (B) the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent'. The risk in filtering spam is that legitimate mail may sometimes be rejected or denied, or may be marked as spam. The risks of not filtering spam are that the constant flood of spam not only clogs networks and adversely impacts user inboxes, but also drains valuable resources such as bandwidth and storage capacity, causes productivity loss, and interferes with the expedient delivery of legitimate emails.

Spam filters can be implemented at all layers. Firewalls placed in front of the email server, or filters at the MTA (Mail Transfer Agent) or email server, provide an integrated anti-spam and anti-virus solution offering complete email protection at the network perimeter level, before unwanted or potentially dangerous email reaches the network. Spam filters can also be installed at the MDA (Mail Delivery Agent) level as a service to all customers. At the email client, users can have personalized spam filters that automatically filter mail according to the chosen criteria. Figure 1 shows the typical architecture of a spam filter.

Several different methods are used to identify incoming messages as spam: whitelist/blacklist, Bayesian analysis, mail header analysis, and keyword checking. A whitelist is a list of all addresses from which the user always wishes to receive mail.

Users can add email addresses, entire domains, or functional domains. An interesting option is an automatic whitelist management tool, which eliminates the need for administrators to manually enter approved addresses on the whitelist and ensures that mail from particular senders or domains is never flagged as spam.
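As an illustration only, a list lookup that accepts either full addresses or entire domains could be sketched as follows. The function names and the choice to let the whitelist take precedence over the blacklist are assumptions made for this sketch, not details taken from the paper.

```python
def normalize(address):
    """Lower-case and trim an address so comparisons are case-insensitive."""
    return address.strip().lower()


def domain_of(address):
    """Return the part after the last '@' (the whole string if there is none)."""
    return normalize(address).rpartition("@")[2]


def matches(address, entries):
    """True if the full address or its domain appears in the list entries."""
    cleaned = {entry.strip().lower() for entry in entries}
    return normalize(address) in cleaned or domain_of(address) in cleaned


def list_verdict(sender, whitelist, blacklist):
    """Hypothetical policy: whitelist wins, then blacklist, else undecided."""
    if matches(sender, whitelist):
        return "accept"
    if matches(sender, blacklist):
        return "reject"
    return "unknown"


if __name__ == "__main__":
    whitelist = {"example.org", "friend@example.com"}
    blacklist = {"bulkmail.example"}
    for sender in ("Friend@Example.com", "x@bulkmail.example", "new@site.example"):
        print(sender, list_verdict(sender, whitelist, blacklist))
```

A message with an "unknown" verdict would then be passed on to the other methods, such as header analysis or keyword checking.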

ISSN : 0975-3397 3126


The number of records can be configured. When an overflow occurs, obsolete records are overwritten. A blacklist works similarly to competitive alternatives: it is a list of addresses from which the user never wants to receive mail. Mail header checking consists of a set of rules that, when a mail header matches, trigger the mail server to return the message. Such rules catch messages that have a blank "From" field, that list many addresses from the same source in the "To" field, or that have too many digits in email addresses (a fairly popular method of generating false addresses). Header checking also makes it possible to return messages by matching the language code declared in the header.

In Bayesian analysis, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. This contribution is called the posterior probability and is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold, the filter marks the email as spam. Keyword checking is another method widely used in filtering spam. It works by scanning both the email subject and body. Using "conditions", i.e. combinations of keywords, is a good way to enhance filtering efficiency. We can specify combinations of words that must appear in a spam email and update the list. All messages that include these words will be blocked.

III. METHODOLOGY

Most spam filtering techniques are based on text categorization methods. Thus filtering spam turns into a classification problem. In our work, rules are framed to extract a feature vector from each email. As the characteristics of discrimination are not well defined, it is more convenient to apply machine learning techniques. Three machine learning algorithms, the C4.5 decision tree classifier, Multilayer Perceptron, and Naïve Bayes classifier, are used for learning the classification model.

A. MultiLayer Perceptron

The Multilayer Perceptron (MLP) network is the most widely used neural network classifier. MLP networks are general-purpose, flexible, nonlinear models consisting of a number of units organised into multiple layers. The complexity of the MLP network can be changed by varying the number of layers and the number of units in each layer. Given enough hidden units and enough data, it has been shown that MLPs can approximate virtually any function to any desired accuracy. In other words, MLPs are universal approximators. MLPs are valuable tools in problems where one has little or no knowledge about the form of the relationship between input vectors and their corresponding outputs.

B. C4.5 Decision Tree Induction

Decision tree classification generates its output as a binary-tree-like structure called a decision tree, in which each branch node represents a choice between a number of alternatives and each leaf node represents a classification or decision. A decision tree model contains rules to predict the target variable. This algorithm scales well, even where there are varying numbers of training examples and considerable numbers of attributes in large databases.

The J48 algorithm is an implementation of the C4.5 decision tree learner. This implementation produces decision tree models. The algorithm uses a greedy technique to induce decision trees for classification. A decision tree model is built by analyzing training data, and the model is then used to classify unseen data. J48 generates decision trees whose nodes evaluate the existence or significance of individual features.

C. Naïve Bayes Classification

The Naïve Bayes classifier (NB) is a simple but effective classifier that has been used in numerous information processing applications, including natural language processing and information retrieval. The Naïve Bayes technique is based on Bayes' theorem and is particularly suited to cases where the dimensionality of the inputs is high. Naïve Bayes classifiers assume that the effect of a variable value on a given class is independent of the values of the other variables. The Naïve Bayes inducer computes conditional probabilities of the classes given the instance and picks the class with the highest posterior. Depending on the precise nature of the probability model, Naïve Bayes classifiers can be trained very efficiently in a supervised learning setting.

IV. FEATURE EXTRACTION

The work is based on rules and uses a score-based system. The rules are framed by analyzing the mail header information, keyword matching, and the body of the message. A relative score is assigned to each rule.
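A score-based rule system of this kind might be sketched as below. This is a minimal illustration under stated assumptions: the four rules are loosely modeled on entries of Table I, while the rule names, the scores, the threshold of 2.0, and the dictionary-based message format are all hypothetical and are not the values used in the paper. The same rule outcomes can serve both as a total score and as a binary feature vector for a classifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Message = Dict[str, str]  # hypothetical message format: header/body fields as strings


@dataclass
class Rule:
    name: str
    score: float                      # hypothetical relative score for the rule
    test: Callable[[Message], bool]   # returns True when the rule fires


# Illustrative rules only; names and scores are assumptions.
RULES: List[Rule] = [
    Rule("subject_missing", 1.5, lambda m: not m.get("subject", "").strip()),
    Rule("from_equals_to", 1.0,
         lambda m: bool(m.get("from")) and m.get("from") == m.get("to")),
    Rule("undisclosed_recipients", 1.0,
         lambda m: "undisclosed" in m.get("to", "").lower()),
    Rule("body_missing", 0.5, lambda m: not m.get("body", "").strip()),
]


def feature_vector(message: Message) -> List[int]:
    """One binary feature per rule: 1 if the rule fires, else 0."""
    return [1 if rule.test(message) else 0 for rule in RULES]


def total_score(message: Message) -> float:
    """Sum of the scores of all rules that fire on the message."""
    return sum(rule.score for rule in RULES if rule.test(message))


def is_spam(message: Message, threshold: float = 2.0) -> bool:
    """Mark the message as spam when its total score exceeds the threshold."""
    return total_score(message) > threshold


if __name__ == "__main__":
    msg = {"from": "a@x.example", "to": "a@x.example", "subject": "", "body": "hi"}
    print(feature_vector(msg), total_score(msg), is_spam(msg))
```

Because a single firing rule contributes only its own score, a message usually needs several rules to fire before it crosses the threshold.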


A number of rules are framed by considering the various features that help identify spam messages effectively. Each rule performs a test on the email, and each rule has a score. When an email is processed, it is tested against each rule. For each rule found to be true for an email, the score associated with the rule is added to the overall score for that email. Once all the rules have been applied, the total score for the email is compared to a threshold value. If the score exceeds the threshold, the email is marked as spam; otherwise it is classified as legitimate mail. The rules used in this work are listed in Table I.

TABLE I
SCHEME OF RULES ASSIGNED TO EACH SPAM FEATURE

From name meaningful
From domain name
Blocked IP
Apostrophe in From name
From name in Auto Whitelist (AWL)
From address in User's Block list
From address in User's White list
Content Type
Content Boundary exists
To name meaningful
To address Undisclosed recipients
To header original
From address and To address same
Is subject present
Subject content has obfuscated words
Is forwarded message
Is reply message
Subject Reply without reference header
Message body exists
Sensual message
Repeated double quotes in body
Character set includes foreign language
More blank lines in body

Of these 23 rules, some are simple and some are associated with one another. A simple rule could search for the word 'Viagra' in the subject line of an email, while a complex rule may involve comparing an email against an online database of spam. Each rule adds to the overall score, so an email that triggers only one rule, for example through the use of the word 'Viagra', will not necessarily be marked as spam. However, if an email triggers several rules, it will have a combined score that could be over the threshold, and the mail could be marked as spam.

V. EXPERIMENT AND RESULTS

The email spam filtering has been carried out using WEKA. WEKA, an open-source, portable, GUI-based workbench, is a collection of state-of-the-art machine learning algorithms and data preprocessing tools.

The training dataset, a corpus of spam and legitimate messages, is generated from the mails that we received from our institute mail server over a period of six months. The mails are analyzed, and 23 rules are identified that greatly ease the process of classifying spam messages. The corpus consists of 750 spam messages and 750 legitimate messages. From the corpus, the feature vectors are extracted by analyzing the message header, keyword checking, whitelist/blacklist, etc.

The class labels are designated as L and S to represent legitimate and spam messages respectively.

The machine learning techniques Naïve Bayes classifier, C4.5 decision tree classifier, and Multilayer Perceptron are used for training on the dataset in the WEKA environment.

The training is carried out with the feature vectors extracted by analyzing each message header, keyword checking, and whitelist/blacklist.

The performance of the trained models is evaluated using 10-fold cross validation for predictive accuracy. Predictive accuracy is used as the performance measure for email spam classification. The prediction accuracy is measured as the ratio of the number of correctly classified instances in the test dataset to the total number of test cases. In spam filtering, false negatives just mean that some spam mails are classified as legitimate and moved to the inbox. False positives mean that legitimate emails are mistakenly identified as spam and moved to the spam folder or discarded.

For most users, missing legitimate email is an order of magnitude worse than receiving spam. The false positive rate of each classifier is therefore also considered when measuring its performance. The performance of the classifiers is summarized in Table II and shown in Fig. 2 and Fig. 3.

TABLE II
COMPARATIVE RESULTS OF THE CLASSIFIERS

Evaluation Criteria              Naïve Bayes   J48     MLP
Training time (secs)             0.15          0.20    138.05
Correctly Classified Instances   1479          1449    1490
Prediction Accuracy (%)          98.6          96.6    99.3
False Positive (%)               5             4       1

[Figure: bar chart of Accuracy % by classifier: Naïve Bayes 98.6, J48 96.6, MLP 99.3]

Fig. 2 Classification Accuracy

The performance of the three models was evaluated on three criteria: prediction accuracy, learning time, and false positive rate. The Multilayer Perceptron predicts better than the other algorithms.
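The accuracy figures in Table II can be checked with a few lines of Python. The sketch below assumes that, under 10-fold cross validation, each of the 1500 messages (750 spam plus 750 legitimate) is tested exactly once, so the total number of test cases is 1500. The false positive helper uses one common definition (the share of legitimate messages predicted as spam); the paper does not state which denominator it uses, so that definition is an assumption, as are the function names.

```python
# Corpus size from the paper: 750 spam + 750 legitimate messages.
TOTAL_INSTANCES = 750 + 750

# Correctly classified instances as reported in Table II.
CORRECT = {"Naive Bayes": 1479, "J48": 1449, "MLP": 1490}


def accuracy_percent(n_correct, n_total):
    """Prediction accuracy: correctly classified instances over all test cases."""
    return 100.0 * n_correct / n_total


def false_positive_percent(labels, predictions):
    """Assumed definition: percentage of legitimate ('L') messages predicted as spam ('S')."""
    legit_predictions = [p for y, p in zip(labels, predictions) if y == "L"]
    if not legit_predictions:
        return 0.0
    return 100.0 * sum(p == "S" for p in legit_predictions) / len(legit_predictions)


accuracies = {
    name: round(accuracy_percent(count, TOTAL_INSTANCES), 1)
    for name, count in CORRECT.items()
}

if __name__ == "__main__":
    for name, value in accuracies.items():
        print(f"{name}: {value}%")
```

The rounded values agree with the Prediction Accuracy row of Table II (98.6, 96.6, and 99.3).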


[Figure: bar chart of Build Time (sec) by classifier: Naïve Bayes 0.15, J48 0.2, MLP 138.05]

Fig. 3 Learning Time of the Models

The Multilayer Perceptron, the neural network classifier, consumes more time to build the model. Naïve Bayes, the probabilistic classifier, and the decision tree model tend to learn more rapidly for the given data set.

VI. CONCLUSION

Although many email spam filtering tools exist, the persistence of spammers and their adoption of new techniques make email spam filtering a challenging problem for researchers. In our work, we generated a corpus of spam and legitimate messages from the latest mails and employed machine learning techniques to build the model. The performance of the model was evaluated using 10-fold cross validation, and it was observed that the Multilayer Perceptron classifier outperforms the other classifiers and that its false positive rate is also very low compared to the other algorithms. Email spam filters using this approach can be adopted either at the mail server or at the mail client side to reduce the amount of spam messages and to reduce the risk of productivity loss and of bandwidth and storage usage.

REFERENCES

[1] Ahmed Khorsi, "An Overview of Content-based Spam Filtering Techniques", Informatica, vol. 31, no. 3, October 2007, pp. 269-277.
[2] Alistair McDonald, "SpamAssassin: A Practical Guide to Integration and Configuration", 1st Edition, Packt Publishers, 2004.
[3] Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques", 2nd Edition, Elsevier, 2005.
