Personalized Classification of Non-Spam Emails Using Machine Learning Techniques
Abstract - With the advent of computer networks and communications, emails have become one of the most widely accepted communication means, which is faster, more reliable, cheaper, and accessible from anywhere. Due to the increased use of email communications, day-to-day computer users, particularly corporate users, find it cumbersome to filter the most important and urgent emails out of the large number of emails they receive on a given business day. Enterprise email systems are able to automatically identify spam emails, but there are still many non-urgent and unimportant emails among the non-spam emails which cannot be filtered by conventional spam filter programs. Though it may be feasible to set up some static rules and categorize some of the e-mails, the practicality and sustainability of such rules are questionable because of the number of rules required and their limited validity period, as such rules may become redundant after some time. Thus, it is desirable to have a filtering system for non-spam emails that filters out unimportant emails based on the user's past behaviour. While research on identifying spam e-mails is widely available, research on further classifying the non-spam e-mails is lacking. The purpose of this research is to provide a machine learning-based solution to classify non-spam e-mails considering the importance of such e-mails. As part of the research, several machine learning models have been developed and trained using non-spam e-mails from the personal mailbox of the first author of this research. The results showed significant accuracy, particularly with the decision tree, random forest and deep neural network algorithms. This paper presents the modelling details and the results obtained.

Keywords - logistic regression, non-spam email classification, Random Forest algorithms, supervised learning, Support Vector Machines

I. INTRODUCTION

With the advent of computer networks and communications, emails have become one of the most widely accepted communication means, being faster, more reliable, cheaper, and accessible from anywhere. Due to the increased use of e-mail communications, day-to-day computer users, particularly corporate users, receive a large number of emails. Sometimes it can be very hard to go through all of them within the day due to their large volume.

The importance of the e-mails received cannot be gauged without reading their content, as a result of which considerable time is wasted in reading even unimportant e-mails, eroding productive time in the corporate world.

Today most enterprise-level email solutions provide spam filtering features in a very sophisticated manner. However, these spam filters are of little or no use in this scenario, as the emails concerned are not spam but merely unimportant. Moreover, whether an email is important or not is subjective and also contextual. Hence, it is not possible to have a generalized set of rules to filter out such unimportant emails.

This paper presents a machine learning-based method to filter out unimportant non-spam emails based on the respective user's past behaviour when using the mailbox. Several classification models have been trained using the past data of the first author's mailbox, to classify new emails as either important or not important. The rest of the paper is organized as follows: Section II - Literature Review, Section III - Solution, Section IV - Methodology and Approach, Section V - Results, Section VI - Future Work and Section VII - Conclusion.

II. LITERATURE REVIEW

Previous research and experiments pertaining to e-mail classification have focused mainly on improving spam e-mail filtering and on classifying e-mails into different pre-defined categories such as sport, travel, appointments, social media and personal.

Most of the research has been carried out to find which machine learning algorithm is better for spam email classification [1], [4], [5]. The previous literature has not captured the subject matter of this project, personalized classification of non-spam e-mails, which is a novel area of research.

The following shortfalls have been observed in the previous research in this area:

• Lack of personalization when categorizing emails
• Generic data being used
• Limited algorithms
• Lack of feature engineering work
• Lack of end-to-end implementation

However, there are two research publications which seem to be quite similar. In [2], the authors discuss an approach in which personalized priority is given to emails. However, that research focuses more on evaluating the better model out of a regression-based model and a multi-class classification model using Support Vector Machines. Moreover, they used a very limited number of features, such as the from, to and cc addresses, the title, and the body text of the email message, and they used term weighting to construct the document vector. Other than the above limitations, another perceived limitation is that the training sample was also very small.
In the other research, the authors attempt to classify emails into predefined categories such as Shopping, Social Media, Appointments, Parenting, Personal, Travel, etc. [3]. Their approach had been to create a data dictionary and use a basic keyword search to do the classification. Notably, it does not consider whether the email is important or not.
III. THE SOLUTION
Fig. 1 illustrates the overview of the solution proposed in this paper. The existing emails in the mailbox of the user are used as data to train the classification model. Following the steps of the standard machine learning pipeline, features are extracted from the dataset either directly or through feature engineering, and some features are embedded as personalized parameters as well.

TABLE I. DEFAULT EMAIL ATTRIBUTES
C. Personalized Parameters
The following set of personalized parameters was used
during the feature engineering process so that the
classification will be personalized.
• NAME
• EMAIL
• SYSTEMS
• SUPERVISORS
• REPORTEES
• AUTOMATED EMAILS
After the training, these parameters can be adjusted so that users can further improve the accuracy as well as accommodate changes in the environment over time, such as changes in supervisors and changes in systems.
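As an illustration only, the personalized parameters above can be held in a small per-user configuration that the feature engineering code reads. The parameter names mirror the list above, while the dictionary layout, the example values and the helper function below are assumptions rather than the authors' implementation.

```python
# Hypothetical per-user configuration; the keys mirror the parameters listed above,
# the values are placeholders to be filled in for each user.
PERSONALIZED_PARAMETERS = {
    "NAME": "First Author",
    "EMAIL": "[email protected]",
    "SYSTEMS": ["JIRA", "Jenkins"],            # internal systems that send notifications
    "SUPERVISORS": ["[email protected]"],
    "REPORTEES": ["[email protected]"],
    "AUTOMATED_EMAILS": ["[email protected]"],
}

def sender_flags(sender_address: str, params: dict) -> dict:
    """Derive simple personalized features from the sender address."""
    sender = sender_address.lower()
    return {
        "from_supervisor": int(sender in [s.lower() for s in params["SUPERVISORS"]]),
        "from_reportee": int(sender in [s.lower() for s in params["REPORTEES"]]),
        "is_automated": int(sender in [s.lower() for s in params["AUTOMATED_EMAILS"]]),
    }
```

Keeping these values in configuration rather than inside the trained model is what allows them to be adjusted later when supervisors or systems change.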
D. Class Label and the Balance of the Data Set

The class label of an email is taken directly from the email attribute "Unread". Almost half of the emails in the first author's inbox were unread. To be certain that the unread emails are truly unimportant, this labelling was verified by manual inspection of 10 random samples containing both read and unread emails, each with a sample size of 25 emails. Table III contains the results of the accuracy check. According to the table, the label accuracy with the default labels is 95% and hence the proposed labelling criterion is acceptable for this research. This approach saved a lot of time that would otherwise have been spent on manual annotation.
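A minimal sketch of how the label can be derived from the "Unread" attribute and how random samples can be drawn for the manual accuracy check; the pandas representation and column names are assumptions, not the authors' exact code.

```python
import pandas as pd

def add_class_label(emails: pd.DataFrame) -> pd.DataFrame:
    """Derive the class label from the mailbox flag: unread -> unimportant (0), read -> important (1)."""
    labelled = emails.copy()
    labelled["important"] = (~labelled["unread"]).astype(int)
    return labelled

def draw_verification_samples(emails: pd.DataFrame, n_samples: int = 10,
                              sample_size: int = 25, seed: int = 42):
    """Draw small random samples (containing both read and unread emails) for manual inspection."""
    return [emails.sample(n=sample_size, random_state=seed + i) for i in range(n_samples)]
```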
TABLE III. CLASS DISTRIBUTION ACCURACY

Sample  Label   Correct  Incorrect  Total  Accuracy %
1       Unread  15       1          16     94%
1       Read    8        1          9      89%
2       Unread  13       0          13     100%
2       Read    12       1          13     92%
3       Unread  10       0          10     100%
3       Read    15       1          16     94%
4       Unread  11       0          11     100%
4       Read    14       1          15     93%
5       Unread  18       2          20     90%
5       Read    7        0          7      100%
6       Unread  10       1          11     91%
6       Read    15       1          16     94%
7       Unread  10       0          10     100%
7       Read    15       0          15     100%
8       Unread  9        1          10     90%
8       Read    16       2          18     89%
9       Unread  14       0          14     100%
9       Read    11       1          12     92%
10      Unread  11       0          11     100%
10      Read    14       1          15     93%
Average                                    95%

E. Statistics About the Data Set

Total number of records: 15090
Total number of features: 79

Fig. 5. Class distribution

Fig. 5 shows the balance of the data set. The data set is balanced enough, and the size of the data set is also quite sufficient.

F. Data Pre-processing

While the data is being read from the emails, the attribute extraction is done on the fly, and all the other pre-processing activities, such as null or empty value checks and categorical-to-numerical conversion with Boolean flags, are performed in parallel.

Once the feature set was finalized, the emails were read from the Outlook inbox and the desired features were extracted by iterating over each and every email and constructing the data set. While constructing the data set, all the validations were performed and the features were transformed to binary and integer values, so that no additional cleansing and transformation steps are needed. The following is the high-level process of data extraction used.

Fig. 6. Model development and training process

All personalized parameters were configured so that the extracted data set can be used directly for training. The following feature transformations were carried out during the feature engineering process (a sketch follows the list):

• Categorical feature handling
• Empty value handling
• Deriving new features
• Data type conversions
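A sketch of the kinds of transformations listed above, written with pandas; the column names are illustrative and do not reproduce the actual 79-feature set.

```python
import pandas as pd

def preprocess(emails: pd.DataFrame) -> pd.DataFrame:
    """Apply the transformations listed above on illustrative columns."""
    df = emails.copy()

    # Empty value handling: treat a missing subject or body as an empty string.
    df["subject"] = df["subject"].fillna("")
    df["body"] = df["body"].fillna("")

    # Categorical feature handling: one-hot encode the sender-assigned importance flag.
    df = pd.get_dummies(df, columns=["importance"], prefix="importance")

    # Deriving new features, e.g. whether the user was addressed directly or has attachments.
    df["directly_addressed"] = (df["to_count"] > 0).astype(int)
    df["has_attachment"] = (df["attachment_count"] > 0).astype(int)

    # Data type conversions: Boolean flags to 0/1 integers for the classifiers.
    bool_cols = df.select_dtypes(include="bool").columns
    df[bool_cols] = df[bool_cols].astype(int)
    return df
```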
G. Training the Models

Before training the models, the dataset was split into two sets, one for training and one for testing, on a 70/30 basis. The standard scaler was used for scaling or standardizing the features, especially for the logistic regression, support vector machine, neural network and KNN models.
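The split, scaling and training described above can be sketched with scikit-learn as follows; the synthetic data stands in for the extracted email features, and the estimators are shown with default settings rather than the tuned parameters reported later.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Stand-in for the extracted email feature matrix X and the class label y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70/30 split into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Standard scaling, used especially for the distance- and gradient-based models.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "knn": KNeighborsClassifier(),
    "neural_network": MLPClassifier(max_iter=500),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
}

for name, model in models.items():
    scaled = name in {"logistic_regression", "svm", "knn", "neural_network"}
    model.fit(X_train_s if scaled else X_train, y_train)
    accuracy = model.score(X_test_s if scaled else X_test, y_test)
    print(f"{name}: test accuracy {accuracy:.3f}")
```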
The standard scaler standardizes each feature value x using the mean μ and the standard deviation σ of that feature:

$z = \frac{x - \mu}{\sigma}$  (1)
On the other hand, the Min-Max Normalization uses the
minimum and maximum values in a given attribute to
calculate the normalized value. The following formula
explains the calculation of the normalized value using the min-
max normalization method.
$X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$  (2)
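For illustration, formulas (1) and (2) can be computed directly on a small example; the values below are arbitrary.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # arbitrary example values for one feature

# Formula (1): z-score standardization, z = (x - mu) / sigma
z = (x - x.mean()) / x.std()

# Formula (2): min-max normalization, X_norm = (X - X_min) / (X_max - X_min)
x_norm = (x - x.min()) / (x.max() - x.min())

print(z)       # approximately [-1.34, -0.45, 0.45, 1.34]
print(x_norm)  # [0.0, 0.333..., 0.666..., 1.0]
```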
Fig. 7. Model deployment
The following are the best performances gained from each of the algorithms after optimization.

TABLE V. MODEL PERFORMANCE AFTER PARAMETER TUNING

However, manual validation was also done using random samples after the classification. To pick a random sample, a random position in the list of emails was selected and the 10 subsequent emails were taken from that point.
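A minimal sketch of the sampling procedure just described, assuming the classified emails are available as an ordered list.

```python
import random

def pick_validation_sample(classified_emails, sample_size=10, seed=None):
    """Pick a random starting position and take the next `sample_size` consecutive emails."""
    rng = random.Random(seed)
    start = rng.randrange(0, len(classified_emails) - sample_size + 1)
    return classified_emails[start:start + sample_size]
```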
TABLE VI. RANDOM SAMPLE EVALUATION FOR MOVED EMAILS

TABLE VII. RANDOM SAMPLE EVALUATION FOR SKIPPED EMAILS

According to these evaluations, 93% (see Table VI) of the emails classified as Ignore were correctly classified, since all three models were used to confirm that classification, and 83% (see Table VII) of the emails retained in the inbox were correctly classified.

Different types of machine learning algorithms were tried out and their performance was evaluated. The Logistic Regression, Support Vector Machine and K-Nearest Neighbour algorithms did not yield the envisaged results for the dataset used, compared with the other algorithms. The findings of the research support that the Decision Tree, Random Forest and Neural Network-based algorithms gave a better performance. The findings are inclined more towards the rules-based algorithms, Decision Tree and Random Forest, in terms of gauging the expected results.

After implementing the solution, the observations derived from the several random samples were appealing. The observations support that the accuracy of the important class was around 82% while the accuracy of the unimportant class was about 94%. The highest accuracy reported for the individual models was 85% for "unimportant" emails, which was further improved with the composite model.
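The way the three models are combined is not spelled out in this excerpt; one plausible reading, shown purely as an assumption, is that an email is moved to Ignore only when every model predicts the unimportant class.

```python
import numpy as np

def composite_ignore(models, X):
    """Mark an email for the Ignore folder only when all fitted models agree it is unimportant (class 0).
    `models` is a list of already fitted classifiers; X is the feature matrix (both assumed)."""
    predictions = np.column_stack([m.predict(X) for m in models])
    return (predictions == 0).all(axis=1)  # True -> move to Ignore, False -> keep in the inbox
```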
VI. FUTURE WORK
Had the manual labelling been done by going through each and every e-mail, the outcome of the project could have been improved further.

As the "Word Cloud" is not actually a static data set, it needs to be re-constructed and fed into the agent over time, as the nature of the email subjects and contents keeps changing.

For future research, the possibility of making the model simpler and more robust could be explored, by measuring the effectiveness of each of the features and using only the most effective features for the model.
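One way to carry out the feature-effectiveness measurement suggested above, shown as a sketch rather than the authors' plan, is to rank the features by random forest importance and keep only the strongest ones.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_top_features(X, y, feature_names, keep=20):
    """Rank features by random forest importance and return the `keep` most effective ones."""
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, y)
    ranked = np.argsort(forest.feature_importances_)[::-1]
    return [feature_names[i] for i in ranked[:keep]]
```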
VII. CONCLUSION
In this work, an approach for personalized email classification, powered by supervised machine learning models, was introduced. The primary objective of the classification is to correctly classify emails as "important" or "unimportant", which is a highly personalized classification. Many features were engineered to give a personal flavour to the classification. With respect to the classes, if an email is classified as "unimportant" then that classification should be highly accurate, while recall should be high for the "important" class. In this context, an abundance of caution was exercised in selecting the algorithm that works best for this classification, where the envisaged targets were high accuracy, good performance and integration with the email client.
REFERENCES

[1] G. H. Al-Rawashdeh and R. B. Mamat, "Comparison of four email classification algorithms using WEKA," International Journal of Computer Science and Information Security, vol. 17, no. 2, pp. 42-54, 2019.
[2] S. Yoo, Y. Yang, and J. Carbonell, "Modeling personalized email prioritization: classification-based and regression-based approaches," in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 729-738, 2011.
[3] S. P. Gautam, "Email classification using a self-learning technique based on user preferences," Master's dissertation, Dept. of Computer Science, North Dakota State University, Fargo, ND, 2015.
[4] T. Ssebulime, "Email classification using machine learning techniques," Master's dissertation, Dept. of Computer Science, Bournemouth University, United Kingdom, 2022.
[5] M. Javed, "The best machine learning algorithm for email classification," Towards Data Science, https://towardsdatascience.com (accessed 17 March 2022).