44 Decision Tree Model for Email Classification
Abstract: In addition to its undeniable benefits, the development of the Internet has led to many undesirable security effects. Spam emails are one of the most challenging issues faced by Internet users. Spam refers to all emails of unsolicited content that arrive in a user’s email box. Spam can often lead to network congestion and to blocking of, or even damage to, the system for receiving and sending electronic messages. Thus, appropriate classification of spam email from legitimate email has become quite important. This paper presents a new approach for feature selection and the Iterative Dichotomiser 3 (ID3) algorithm designed to generate the decision tree for email classification. The experimental results indicate that the proposed model achieves high accuracy (96.3% to 97.4% in the reported experiments).

I. INTRODUCTION

The Internet as a “network of networks” has expanded the possibilities of communication and placement of content. The email system is one of the most effective and commonly used means of communication [1]. Unfortunately, the continuous rise in the number of email users has led to a massive increase in spam emails. Spam emails are usually sent in bulk and do not target individual recipients. Whether commercial in nature or not, spam emails can cause serious problems in electronic communication. Spam emails produce a huge amount of unsolicited data and thus affect network bandwidth and storage capacity. Due to the large number of spam emails sent to users of email services, it is difficult to distinguish useful from unsolicited emails. Thus, managing and filtering emails is an important challenge. The purpose of filtering is to detect and isolate spam emails.

[...] selection from email body is very important. In this paper, semantic properties of email content are used for feature selection and reduction. In order to detect spam emails efficiently, various preprocessing steps need to be performed, such as stop word removal, stemming and term frequency computation [5], [6], [7]. The aim is to preserve the most important features and to reduce the computational demand. After feature selection, the ID3 algorithm is used to generate a decision tree that categorizes emails as spam or ham [8], [9]. The proposed approach is evaluated using accuracy, precision and recall. The performance of the proposed system is measured against the dataset size and the feature size.

This paper is organized as follows. Section II explains the proposed approach for spam detection in detail. Section III summarizes the results, while Section IV gives the conclusion.

II. SPAM DETECTION SYSTEM

This section presents the proposed Spam Detection (SD) system in detail. The system goes through two stages: training and testing. The training stage has four modules: Data preparation, Feature selection, Feature reduction and Classification. The testing stage consists of the Data preparation and Classification modules. The SD process is presented in Fig. 1 and the proposed procedure is briefly explained in the sections below.
Authorized licensed use limited to: University of Prince Edward Island. Downloaded on May 15,2021 at 17:09:19 UTC from IEEE Xplore. Restrictions apply.
A. Email dataset

The dataset used for classification consists of 4000 entries [10]. It contains 3465 ham and 535 spam messages. The dataset is divided into two subsets: a training set and a testing set. The size of the dataset assigned for training can affect the system's performance, as shown further on.

B. Preprocessing of dataset

The email dataset needs to be preprocessed before feature selection is performed. It is well known that spam mails usually contain phone numbers, email addresses, website URLs, money amounts, and a lot of whitespace and punctuation. Instead of removing these terms, for each training example they are replaced with a specific string as follows:

[...] task is to understand whether there are any specific words or sequences of words that determine whether an email is spam or not. For this purpose, the Term Frequency (TF) method is used. TF can be defined as a numerical statistic intended to reflect how important a word is to a document in a corpus. The TF value is directly proportional to the number of times a word appears in a document. Fig. 2 illustrates a word cloud of common words in spam email. The size of a word in Fig. 2 is proportional to its occurrence in spam emails. Words like ‘free’, ‘txt’ and ‘call’ have large TF weights, which makes them good indicators of spam.
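The token replacement and term-frequency steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the replacement strings `numbr`, `httpaddr` and `moneysymb` are borrowed from the feature names in Table I, while `emailaddr`, the regular expressions, and the function names `normalize` and `term_frequency` are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical replacement rules. Token names follow the feature names of
# Table I where possible; the patterns themselves are assumptions.
RULES = [
    (re.compile(r"https?://\S+|www\.\S+"), " httpaddr "),  # website URLs
    (re.compile(r"\S+@\S+"), " emailaddr "),               # email addresses
    (re.compile(r"[$£€]"), " moneysymb "),                 # money symbols
    (re.compile(r"\d+"), " numbr "),                       # phone numbers, amounts
]


def normalize(text):
    """Lowercase the text, replace special terms with tokens, drop punctuation."""
    text = text.lower()
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    text = re.sub(r"[^\w\s]", " ", text)  # remove remaining punctuation
    return text.split()                   # split() also collapses whitespace


def term_frequency(tokens):
    """Count how many times each token occurs in one message."""
    return Counter(tokens)


message = "FREE entry! Call 08001234567 or txt WIN to 80086, claim £100 at http://example.com"
tf = term_frequency(normalize(message))
```

URLs are replaced before digits so that numbers inside an address are not turned into separate `numbr` tokens.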
TABLE I
FEATURE MATRIX: EACH ROW REPRESENTS AN EMAIL WITH THE FEATURES PRESENTED IN COLUMNS

                              FEATURES
EMAIL    Numbr  Call  Txt  Free  Claim  Httpaddr  Moneysymb  Total spam words  DECISION/CLASS
Email 1  0      1     0    0     0      0         0          1                 Ham
Email 2  2      0     0    1     1      1         0          4                 Spam
Email 3  1      0     0    3     0      0         0          2                 Spam
Email 4  1      0     0    0     0      0         0          0                 Ham
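One way to read Table I is that each row holds the per-message counts of a fixed list of feature tokens. The sketch below builds such a row from an already tokenized message. The function name `feature_row` and the lowercase token spellings are assumptions, and the "Total spam words" column is left out because this excerpt does not state how it is computed.

```python
from collections import Counter

# Feature tokens in the column order of Table I (lowercase spelling assumed).
FEATURES = ["numbr", "call", "txt", "free", "claim", "httpaddr", "moneysymb"]


def feature_row(tokens, features=FEATURES):
    """Return the count of each feature token in one tokenized message."""
    counts = Counter(tokens)
    return [counts[name] for name in features]  # missing tokens count as 0


# Illustrative tokens only, not taken from the dataset.
row = feature_row(["free", "call", "numbr", "numbr", "hello"])
```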
D. Decision tree

A decision tree uses a tree-like model to represent a number of possible decision paths as well as their potential outcomes [13]. Each decision tree node represents a feature, each branch represents a decision and each leaf represents an outcome (class or decision). Decision trees [...]

Information gain is calculated to split the attributes further in the tree. The attribute with the highest information gain is always preferred first. Entropy and information gain are related by (2):

$$\mathrm{gain}(S, A_i) = \mathrm{Entropy}(S) - \mathrm{Entropy}_{A_i}(S) \tag{2}$$
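To make the relation in (2) concrete, the following sketch computes the information gain of each feature on the four example rows of Table I and picks the attribute that ID3 would place at the root. It rests on assumptions that this excerpt does not confirm: Entropy(S) is taken to be the Shannon entropy with base-2 logarithm, and Entropy_{A_i}(S) is taken to be the size-weighted average entropy of the subsets obtained by splitting S on attribute A_i, which is the usual ID3 definition. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

FEATURES = ["Numbr", "Call", "Txt", "Free", "Claim", "Httpaddr", "Moneysymb"]

# The four rows of Table I (feature counts, class label).
ROWS = [
    ([0, 1, 0, 0, 0, 0, 0], "Ham"),
    ([2, 0, 0, 1, 1, 1, 0], "Spam"),
    ([1, 0, 0, 3, 0, 0, 0], "Spam"),
    ([1, 0, 0, 0, 0, 0, 0], "Ham"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split_entropy(rows, index):
    """Weighted entropy after splitting the rows on the feature at `index`."""
    groups = defaultdict(list)
    for values, label in rows:
        groups[values[index]].append(label)
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def information_gain(rows, index):
    """gain(S, A_i) = Entropy(S) - Entropy_{A_i}(S), as in (2)."""
    labels = [label for _, label in rows]
    return entropy(labels) - split_entropy(rows, index)


def best_attribute(rows):
    """Name of the feature with the highest information gain."""
    best = max(range(len(FEATURES)), key=lambda i: information_gain(rows, i))
    return FEATURES[best]


root = best_attribute(ROWS)
```

On these four rows the Free column separates the two classes completely, so it has the largest gain and would be chosen first. A full ID3 implementation would repeat the same selection recursively on each resulting subset.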
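The evaluation measures used in the results, accuracy, precision and recall as given in (3)-(5), can be computed from the four confusion counts as in the minimal sketch below. The function names, the label strings and the toy predictions are assumptions for illustration only; they are not data from the paper.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count TP, TN, FP and FN, treating `positive` as the spam class."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) following (3), (4) and (5)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels, made up for the example.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]
counts = confusion_counts(y_true, y_pred)
accuracy, precision, recall = metrics(*counts)
```

The zero checks only guard against division by zero when a class is never predicted or never present.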
$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{5}$$

For a classifier, accuracy is the proportion of all testing examples that the classifier predicted correctly, precision is the ratio of the number of correctly classified spam emails to the total number of emails predicted as spam, and recall is the proportion of emails correctly classified as spam among all spam emails. The performance of the proposed SD system is measured against the dataset size and the feature size. The results are presented in Table III.

TABLE III
CLASSIFICATION RESULTS BASED ON DATASET SIZE AND FEATURE SIZE

Dataset size  Feature size  Accuracy [%]  Precision [%]  Recall [%]
1000          7             97.4          92.01          87.21
1000          3             96.63         85.61          88.51
1500          7             97.32         92.28          86.21
1500          3             96.56         85.62          87.77
3000          7             97.2          91.52          85.71
3000          3             96.3          83.96          87.30

Datasets of different sizes are used for measuring the performance. For example, with 1000 emails and 7 features used for the training process, the decision tree classifier reached an accuracy of 97.4%, with precision and recall of 92.01% and 87.21% respectively. Reducing the number of features to 3 decreases the accuracy to 96.63%, with precision and recall of 85.61% and 88.51% respectively. The dataset size affects accuracy only slightly: the accuracy for 1500 and 3000 training examples with 7 features was 97.32% and 97.2% respectively.

IV. CONCLUSION

In this paper, decision tree-based classification is employed for spam email detection. A novel approach for feature selection and reduction is also presented. It is shown that the system achieves high accuracy with a few features and with a relatively small training dataset. In the near future, it is planned to incorporate other classifiers and to compare their performance with the proposed approach.

REFERENCES

[1] P. Sharma and U. Bhardwaj, “Machine Learning based Spam E-Mail Detection”, in International Journal of Intelligent Engineering & Systems, vol. 11, no. 3, 2017.
[2] A. S. Rajput, J. S. Sohal and V. Athavale, “Email Header Feature Extraction using Adaptive and Collaborative approach for Email Classification”, in International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, vol. 8, Issue 7S, May 2019.
[3] P. Kulkarni, J. R. Saini and H. Acharya, “Effect of Header-based Features on Accuracy of Classifiers for Spam Email Classification”, in International Journal of Advanced Computer Science and Applications (IJACSA), vol. 11, no. 3, 2020.
[4] E. G. Dada, S. B. Joseph, H. Chiroma, S. Abdulhamid, A. Adetunmbi, E. Opeyemi and Ajibuwa, “Machine learning for email spam filtering: review, approaches and open research problems”, in Heliyon, June 2019.
[5] E. M. Bahgat, S. Rady, W. Gad and I. F. Moawad, “Efficient email classification approach based on semantic methods”, in Ain Shams Eng. J., vol. 9, no. 4, pp. 3259-3269, December 2018.
[6] F. Ruskanda, “Study on the Effect of Preprocessing Methods for Spam Email Detection”, in Indonesian Journal on Computing (Indo-JC), 4. 109, March 2019.
[7] A. Sharma, Manisha, D. Manisha and D. R. Jain, “Data Pre-Processing in Spam Detection”, in International Journal of Science Technology & Engineering (IJSTE), vol. 1, Issue 11, May 2015.
[8] L. Shi, Q. Wang, X. Ma, M. Weng and H. Qiao, “Spam Email Classification Using Decision Tree Ensemble”, in Journal of Computational Information Systems 8, March 2012.
[9] S. Balamurugan and R. Rajaram, “Suspicious E-mail Detection via Decision Tree: A Data Mining Approach”, January 2007.
[10] T. A. Almeida and J. M. Gomez Hidalgo, SMS Spam Collection, UCI Machine Learning Repository, viewed 12 September 2020, https://archive.ics.uci.edu/ml/datasets/sms+spam+collection
[11] C. D. Manning, P. Raghavan and H. Schutze, “Introduction to Information Retrieval”, Cambridge University Press, 2008.
[12] A. Bhowmick and S. M. Hazarika, “Machine Learning for E-mail Spam Filtering: Review, Techniques and Trends”, 2016.
[13] J. Grus, “Data Science from Scratch: First Principles with Python”, O'Reilly Media, Inc., April 2015.
[14] I. H. Witten and E. Frank, “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations”, Morgan Kaufmann, San Francisco, 2000.
[15] T. Kristensen and G. Kumar, “Entropy based disease classification of proteomic mass spectrometry data of the human serum by a support vector machine”, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.