
25th International Conference on Information Technology (IT)

Zabljak, 16 - 20 February 2021

Decision Tree Model for Email Classification


Ivana Cavor
2021 25th International Conference on Information Technology (IT) | 978-1-7281-9103-4/20/$31.00 ©2021 IEEE | DOI: 10.1109/IT51528.2021.9390143

Abstract— In addition to its undeniable benefits, the development of the Internet has led to many undesirable security effects. Spam emails are one of the most challenging issues faced by Internet users. Spam refers to all emails of unsolicited content that arrive in a user's mailbox. Spam can lead to network congestion and blocking, or even damage the system for receiving and sending electronic messages. Appropriate separation of spam from legitimate email has therefore become quite important. This paper presents a new approach to feature selection together with the Iterative Dichotomiser 3 (ID3) algorithm for generating a decision tree for email classification. The experimental results indicate that the proposed model achieves very high accuracy.

I. INTRODUCTION

The Internet, as a "network of networks", has expanded the possibilities of communication and content distribution. The email system is one of the most effective and commonly used means of communication [1]. Unfortunately, the continuous rise in the number of email users has led to a massive increase in spam. Spam emails are usually sent in bulk and do not target individual recipients. Whether commercial in nature or not, spam can cause serious problems in electronic communication. Spam produces a huge amount of unsolicited data and thus strains network bandwidth and storage capacity. The sheer volume of spam makes it difficult for users of email services to distinguish useful messages from unsolicited ones. Managing and filtering email is therefore an important challenge, and the purpose of filtering is to detect and isolate spam.

There are two main approaches to spam detection: the first is based on analysis of the email header, the second on analysis of the email body. Spam filters usually combine the two. The email header includes fields such as From, To, Subject, CC (Carbon Copy) and BCC (Blind Carbon Copy), which often reveal the nature of the email; recent studies have shown that the information provided by the header is quite important [2], [3]. Content-based filtering relies on the assumption that the body of a spam email differs from that of a legitimate ("ham") email. In recent years, a number of Machine Learning (ML) and data mining techniques have been employed to classify email messages based on their content. Classification methods such as Naive Bayes, Support Vector Machines, Decision Trees, Random Forests and Neural Networks are commonly used to build efficient email classifiers [4]. For most classification problems, the process of feature extraction and selection from the email body is very important. In this paper, semantic properties of the email content are used for feature selection and reduction. To detect spam efficiently, various preprocessing steps need to be performed, such as stop-word removal, stemming and term-frequency weighting [5], [6], [7]. The aim is to preserve the most important features and to reduce computational demand. After feature selection, the ID3 algorithm is used to generate a decision tree that categorizes emails as spam or ham [8], [9]. The proposed approach is evaluated using accuracy, precision and recall, and its performance is measured against the size of the dataset and the number of features.

This paper is organized as follows. Section II explains the proposed approach to spam detection in detail, Section III summarizes the results, and Section IV gives the conclusion.

II. SPAM DETECTION SYSTEM

This section presents the proposed Spam Detection (SD) system in detail. The system goes through two stages: training and testing. The training stage has four modules: Data preparation, Feature selection, Feature reduction and Classification. The testing stage consists of the Data preparation and Classification modules. The SD process is presented in Fig. 1 and the proposed procedure is briefly explained in the sections below.

Figure 1. Spam detection process

Ivana Cavor (email: [email protected])


Faculty of Maritime Studies Kotor, University of Montenegro,
Put I Bokeljske brigade 44, 85330 Kotor, Montenegro.


A. Email dataset

The dataset used for classification consists of 4000 entries [10], of which 3465 are ham and 535 are spam messages. The dataset is divided into two subsets: a training set and a testing set. The size of the dataset assigned for training can affect the system's performance, as shown further on.

B. Preprocessing of dataset

The email dataset needs to be preprocessed before feature selection. It is well known that spam mails usually contain phone numbers, email addresses, website URLs, money amounts, and a lot of whitespace and punctuation. Instead of removing such terms, in each training example they are replaced with a specific string as follows:

1. Replace email addresses with 'emailaddr'
2. Replace URLs with 'httpaddr'
3. Replace money symbols with 'moneysymb'
4. Replace phone numbers with 'phonenumbr'
5. Replace other numbers with 'numbr'

Punctuation is also removed from the text, all whitespace (spaces, line breaks, tabs) is collapsed to a single space, and the entire dataset is lowercased. The sentences are then split into words known as tokens; each email is tokenized in order to reveal characteristic spam words. Stop words, i.e. words without linguistic meaning such as 'a', 'an', 'the' and 'is', are also removed. The next step in the preprocessing stage is stemming: "Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes" [11]. The preprocessing phase is quite important as it reduces the search space for efficient feature extraction and selection.

C. Feature extraction and selection

In this process the emails are analyzed to find the features (words) that would be most useful in the classification stage. The main idea is to find words that occur frequently in the dataset, or words that carry relatively high importance for determining the class of an email. For identifying an email as spam, the task is to understand whether there are specific words, or sequences of words, that determine whether an email is spam or not. For this purpose, the Term Frequency (TF) method is used. TF is a numerical statistic intended to reflect how important a word is to a document in a corpus; its value is directly proportional to the number of times the word appears in the document. Fig. 2 illustrates a word cloud of common words in spam email, where the size of each word is proportional to its occurrence in spam emails. Words like 'free', 'txt' and 'call' have large TF weights, which makes them good indicators of spam.

Figure 2. Visual representation of important words for spam email

The TF method is used to represent the text data for the ML algorithm; such a representation is needed because it is hard to compute directly on textual data. Accordingly, the frequency of all words in the preprocessed spam dataset is calculated and the twenty most frequent spam words are selected as features. Next, the occurrence of each feature in an email is mapped into the feature matrix shown in Table I. To further enhance the accuracy of the ML algorithm, one more feature is added: the total number of important spam words in a given email. The experimental results indicate that this feature has the biggest effect on the classification decision. In fact, for most features it turns out that what matters is not how many times a certain spam word occurs in an email but whether it appears at all. This observation enables dimensionality reduction, since features that have no influence on the class label can be discarded. The feature reduction has made the data less sparse and more statistically significant for the classification algorithm.

TABLE I
FEATURE MATRIX: EACH ROW REPRESENTS AN EMAIL WITH THE FEATURES PRESENTED IN COLUMNS

EMAIL     Numbr  Call  Txt  Free  Claim  Httpaddr  Moneysymb  Total spam words  DECISION/CLASS
Email 1     0     1     0    0     0       0          0              1               Ham
Email 2     2     0     0    1     1       1          0              4               Spam
Email 3     1     0     0    3     0       0          0              2               Spam
Email 4     1     0     0    0     0       0          0              0               Ham
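The preprocessing steps of Section II-B can be sketched as follows. This is a minimal illustration: the regular expressions, the tiny stop-word list and the crude suffix-stripper are assumptions for demonstration, not the authors' exact implementation.

```python
import re

# A tiny stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "or", "in", "of", "at"}

def preprocess(email: str) -> list:
    """Normalize an email body and return its tokens."""
    text = email.lower()
    # Steps 1-5: replace entities with placeholder strings instead of removing them.
    text = re.sub(r"\S+@\S+", " emailaddr ", text)
    text = re.sub(r"(https?://\S+|www\.\S+)", " httpaddr ", text)
    text = re.sub(r"[$£€]", " moneysymb ", text)
    text = re.sub(r"\b(\+?\d[\d\-\s]{7,}\d)\b", " phonenumbr ", text)
    text = re.sub(r"\d+", " numbr ", text)
    # Strip punctuation, collapse whitespace, tokenize, drop stop words.
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # A crude suffix-chopping stand-in for a real stemmer (e.g. Porter).
    tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
    return tokens

print(preprocess("Call 08001234567 now to claim your FREE prize at http://win.example.com!"))
```

The replacement order matters: email addresses and URLs are rewritten before digits, so that a phone number inside a URL is not double-substituted.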

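The feature construction of Section II-C (the most frequent spam words as columns, plus a total-count column, as in Table I) can be sketched as follows. The toy corpus, k=3 instead of twenty, and the plain sum used for the total column are illustrative assumptions; the paper's Table I suggests a closely related but not identical count.

```python
from collections import Counter

def build_feature_matrix(tokenized_emails, labels, k=20):
    """Select the k most frequent words in spam emails as features and map
    each email to per-word counts plus a 'total spam words' column."""
    # Term frequency computed over the spam subset only.
    spam_counts = Counter()
    for tokens, label in zip(tokenized_emails, labels):
        if label == "spam":
            spam_counts.update(tokens)
    features = [word for word, _ in spam_counts.most_common(k)]
    matrix = []
    for tokens in tokenized_emails:
        counts = [tokens.count(word) for word in features]
        # Extra feature: total occurrences of the selected spam words.
        matrix.append(counts + [sum(counts)])
    return features, matrix

emails = [["free", "call", "numbr"], ["meeting", "tomorrow"], ["free", "free", "txt"]]
labels = ["spam", "ham", "spam"]
feats, M = build_feature_matrix(emails, labels, k=3)
print(feats, M)
```

Each row of `M` corresponds to one email, in the same layout as Table I: one count per selected word, then the total, ready for the classifier.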
D. Decision tree

A decision tree uses a tree-like model to represent a number of possible decision paths as well as their potential outcomes [13]. Each decision tree node represents a feature, each branch represents a decision, and each leaf represents an outcome (class). Decision trees can be used to predict the class of an unknown query instance by building a model trained on a set of labeled data. Each training example is characterized by a number of descriptive features or attributes, which can have either nominal or continuous values.

A decision tree consists of a root node, internal nodes and leaf nodes. Internal nodes represent the conditions on which the tree splits into branches, and the leaf nodes represent the possible outcomes of each path. Each internal node typically has two or more nodes extending from it. "When classifying an unknown instance, the unknown instance is routed down the tree according to the values of the attributes in the successive nodes and when a leaf is reached the instance is classified according to class assigned to the leaf" [14]. The main advantage of the decision tree is that it is easy to follow and understand. Fig. 3 presents an example of a typical decision tree. The words "free" and "money" are typical spam words and are used as features: if the word "free" appears more than two times in an email, the email is classified as spam; otherwise we ask whether the email contains the word "money". If "money" appears more than three times, the email is classified as spam; otherwise it is ham.

Figure 3. An example of decision tree

The ID3 algorithm builds the decision tree based on entropy and information gain. "Entropy measures the impurity of an arbitrary collection of samples while the information gain calculates the reduction in entropy by partitioning the sample according to a certain attribute" [15]. If the target attribute (class) takes on n different values, then the entropy of a sample S relative to this n-wise classification is defined as shown in (1):

    Entropy(S) = Σ_{i=1..n} −p_i log2(p_i)    (1)

where p_i is the proportion (probability) of S belonging to class C_i.

Information gain is calculated to decide how to split the attributes further in the tree; the attribute with the highest information gain is always preferred. Entropy and information gain are related by (2):

    gain(S, A_i) = Entropy(S) − Entropy_{A_i}(S)    (2)

where Entropy_{A_i}(S) is the expected entropy if attribute A_i is used to partition the data.

The algorithm was implemented according to the following steps:

1. Create a root node.
2. Calculate the entropy of the whole (sub-)dataset.
3. Calculate the information gain for each feature and select the feature with the largest information gain.
4. Assign the (root) node the label of the feature with maximum information gain. Grow an outgoing branch for each value of that feature and add unlabeled nodes at the ends.
5. Split the dataset along the values of the maximum-information-gain feature and remove this feature from the dataset.
6. For each sub-dataset, repeat steps 3 to 5 until a stopping criterion is satisfied.

Since the chosen features have continuous values, performing a binary split requires converting continuous values to nominal ones. This is done with a threshold value: the value that offers maximum information gain for that attribute. For example, the information gain is maximized when the threshold is equal to two for the total_spam_words feature from Table I.

III. EXPERIMENTAL RESULTS

The performance of the proposed SD system is evaluated using accuracy, precision and recall. To compute these measures a confusion matrix is created, with four outputs:

1. True Positive (TP): the number of instances correctly classified as spam.
2. True Negative (TN): the number of instances correctly classified as ham.
3. False Positive (FP): the number of instances incorrectly classified as spam.
4. False Negative (FN): the number of instances incorrectly classified as ham.

Table II represents the confusion matrix for email spam classification.

TABLE II
CONFUSION MATRIX

               Predicted HAM     Predicted SPAM
Actual HAM     True Negative     False Positive
Actual SPAM    False Negative    True Positive
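Equations (1) and (2), together with the threshold-based binarization of a continuous attribute described in Section II-D, can be sketched as follows. The helper names and the toy total_spam_words column are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i), as in eq. (1)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """gain(S, A) = Entropy(S) - expected entropy after a binary split, eq. (2)."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    expected = (len(left) / len(labels)) * entropy(left) \
             + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - expected

def best_threshold(values, labels):
    """Pick the split point with maximum information gain for one attribute."""
    candidates = sorted(set(values))[:-1]  # splitting above the maximum is useless
    return max(candidates, key=lambda t: information_gain(values, labels, t))

# An invented total_spam_words column with its class labels, in the style of Table I.
total_spam_words = [1, 0, 2, 2, 3, 4, 5]
classes = ["ham", "ham", "ham", "ham", "spam", "spam", "spam"]
print(best_threshold(total_spam_words, classes))  # → 2
```

On this toy column the threshold 2 separates the classes perfectly, so the information gain there equals the full entropy of the sample, mirroring the paper's observation for total_spam_words.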

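The four counts of the confusion matrix (Table II) determine accuracy, precision and recall directly; a minimal sketch with invented example labels, taking 'spam' as the positive class:

```python
def confusion_matrix(actual, predicted):
    """Count TP, TN, FP, FN with 'spam' as the positive class, as in Table II."""
    tp = sum(a == "spam" and p == "spam" for a, p in zip(actual, predicted))
    tn = sum(a == "ham" and p == "ham" for a, p in zip(actual, predicted))
    fp = sum(a == "ham" and p == "spam" for a, p in zip(actual, predicted))
    fn = sum(a == "spam" and p == "ham" for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall as in (3), (4) and (5)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

actual    = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]
tp, tn, fp, fn = confusion_matrix(actual, predicted)
print(metrics(tp, tn, fp, fn))
```

Note that precision penalizes false positives (ham flagged as spam) while recall penalizes false negatives (spam that slips through), which is why the paper reports both alongside accuracy.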
Accordingly, accuracy, precision and recall can be defined as follows:

    accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)

    precision = TP / (TP + FP)    (4)

    recall = TP / (TP + FN)    (5)

For a classifier, accuracy is the proportion of all testing examples that the classifier predicts correctly; precision is the ratio of the number of correctly classified spam emails to the total number of emails predicted as spam; and recall is the proportion of emails correctly classified as spam among all spam emails. The performance of the proposed SD system is measured against the size of the dataset and the number of features. The results are presented in Table III.

TABLE III
CLASSIFICATION RESULTS BASED ON DATASET SIZE AND FEATURE SIZE

Dataset size   Feature size   Accuracy [%]   Precision [%]   Recall [%]
    1000            7            97.4           92.01           87.21
    1000            3            96.63          85.61           88.51
    1500            7            97.32          92.28           86.21
    1500            3            96.56          85.62           87.77
    3000            7            97.2           91.52           85.71
    3000            3            96.3           83.96           87.30

Datasets of different sizes are used for measuring the performance. For example, with 1000 emails and 7 features used for training, the decision tree classifier reached an accuracy of 97.4%, with precision and recall of 92.01% and 87.21% respectively. Reducing the number of features decreases the accuracy to 96.63%, with precision and recall of 85.61% and 88.51% respectively. The dataset size only slightly affects accuracy: the accuracy for 1500 and 3000 training examples was 97.32% and 97.2% respectively.

IV. CONCLUSION

In this paper, decision tree-based classification is employed for spam email detection, and a novel approach for feature selection and reduction is presented. It is shown that the system achieves high accuracy with a few features and a relatively small training dataset. In the near future, it is planned to incorporate other classifiers and to compare their performance with the proposed approach.

REFERENCES

[1] P. Sharma and U. Bhardwaj, "Machine Learning based Spam E-Mail Detection", in International Journal of Intelligent Engineering & Systems, vol. 11, no. 3, 2017.
[2] A. S. Rajput, J. S. Sohal and V. Athavale, "Email Header Feature Extraction using Adaptive and Collaborative approach for Email Classification", in International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, vol. 8, issue 7S, May 2019.
[3] P. Kulkarni, J. R. Saini and H. Acharya, "Effect of Header-based Features on Accuracy of Classifiers for Spam Email Classification", in International Journal of Advanced Computer Science and Applications (IJACSA), vol. 11, no. 3, 2020.
[4] E. G. Dada, S. B. Joseph, H. Chiroma, S. Abdulhamid, A. Adetunmbi and O. E. Ajibuwa, "Machine learning for email spam filtering: review, approaches and open research problems", in Heliyon, June 2019.
[5] E. M. Bahgat, S. Rady, W. Gad and I. F. Moawad, "Efficient email classification approach based on semantic methods", in Ain Shams Eng. J., vol. 9, no. 4, pp. 3259-3269, December 2018.
[6] F. Ruskanda, "Study on the Effect of Preprocessing Methods for Spam Email Detection", in Indonesian Journal on Computing (Indo-JC), vol. 4, p. 109, March 2019.
[7] A. Sharma, D. Manisha and D. R. Jain, "Data Pre-Processing in Spam Detection", in International Journal of Science Technology & Engineering (IJSTE), vol. 1, issue 11, May 2015.
[8] L. Shi, Q. Wang, X. Ma, M. Weng and H. Qiao, "Spam Email Classification Using Decision Tree Ensemble", in Journal of Computational Information Systems 8, March 2012.
[9] S. Balamurugan and R. Rajaram, "Suspicious E-mail Detection via Decision Tree: A Data Mining Approach", January 2007.
[10] T. A. Almeida and J. M. Gomez Hidalgo, SMS Spam Collection, UCI Machine Learning Repository, viewed 12 September 2020, https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
[11] C. D. Manning, P. Raghavan and H. Schutze, "Introduction to Information Retrieval", Cambridge University Press, 2008.
[12] A. Bhowmick and S. M. Hazarika, "Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends", 2016.
[13] J. Grus, "Data Science from Scratch: First Principles with Python", O'Reilly Media, Inc., April 2015.
[14] I. H. Witten and E. Frank, "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations", Morgan Kaufmann, San Francisco, 2000.
[15] T. Kristensen and G. Kumar, "Entropy based disease classification of proteomic mass spectrometry data of the human serum by a support vector machine", Proceedings, 2005 IEEE International Joint Conference on Neural Networks, 2005.

