
Smart Computing and Systems Engineering, 2022

Department of Industrial Management, Faculty of Science, University of Kelaniya, Sri Lanka

Paper No: SC-26 Smart Computing

Personalized Classification of Non-Spam Emails Using Machine Learning Techniques

2022 International Research Conference on Smart Computing and Systems Engineering (SCSE) | 978-1-6654-7375-0/22/$31.00 ©2022 IEEE | DOI: 10.1109/SCSE56529.2022.9905110

Harsha Dinendra*, Chathura Rajapakse, P. P. G. Dinesh Asanka
Faculty of Graduate Studies, University of Kelaniya, Sri Lanka
[email protected], [email protected], [email protected]

Abstract - With the advent of computer networks and communications, emails have become one of the most widely accepted communication means, being faster, more reliable, cheaper, and accessible from anywhere. Due to the increased use of email communications, day-to-day computer users, particularly corporate users, find it cumbersome to filter the most important and urgent emails out of the large number of emails they receive on a given business day. Enterprise email systems are able to automatically identify spam emails, but there are still many non-urgent and unimportant emails among the non-spam emails which cannot be filtered by conventional spam filter programs. Though it may be feasible to set up some static rules and categorize some of the e-mails, the practicality and sustainability of such rules are questionable due to their sheer number and limited validity period, as such rules may become redundant after some time. Thus, it is desirable to have a filtering system for non-spam emails that filters out unimportant emails based on the user's past behaviour. Despite the availability of research on identifying spam e-mails, research on further classifying the non-spam e-mails is lacking. The purpose of this research is to provide a machine learning-based solution to classify non-spam e-mails according to their importance. As part of the research, several machine learning models have been developed and trained using non-spam e-mails from the personal mailbox of the first author. The results showed significant accuracy, particularly with the decision tree, random forest and deep neural network algorithms. This paper presents the modelling details and the results obtained.

Keywords - logistic regression, non-spam email classification, Random Forest algorithms, supervised learning, Support Vector Machines

I. INTRODUCTION

With the advent of computer networks and communications, emails have become one of the most widely accepted communication means: faster, more reliable, cheaper, and accessible from anywhere. Due to the increased use of e-mail communications, day-to-day computer users, particularly corporate users, receive a large number of emails. Sometimes it can be very hard to go through all of them within the day due to their volume.

The importance of received e-mails cannot be gauged without reading their content, as a result of which considerable time is wasted reading even unimportant e-mails, eroding productive time in the corporate world.

Today, most enterprise-level email solutions provide spam filtering features in a very sophisticated manner. However, spam filters are of little or no use in this scenario, as these emails are not spam but merely unimportant. Moreover, whether an email is important or not is subjective and contextual. Hence, it is not possible to have a generalized set of rules to filter out such unimportant emails.

This paper presents a machine learning-based method to filter out unimportant non-spam emails based on the respective user's past behaviour when using the mailbox. Several classification models have been trained using the past data of the first author's mailbox to classify new emails as either important or not important. The remainder of the paper is organized as follows: Section II, Literature Review; Section III, Solution; Section IV, Methodology and Approach; Section V, Results; Section VI, Future Work; and Section VII, Conclusion.

II. LITERATURE REVIEW

Previous research and experiments pertaining to e-mail classification have mainly focused on improving spam e-mail filtering and on classifying e-mails into different pre-defined categories such as sport, travel, appointment, social media and personal.

Most of the research has been carried out to find which machine learning algorithm is better for spam email classification [1], [4], [5]. The previous literature has not captured the subject matter of this project, personalized classification of non-spam e-mails, which is a novel area of research.

The following shortfalls have been observed in the previous research in this area:

 Lack of personalization when categorizing emails
 Generic data being used
 Limited algorithms
 Lack of feature engineering work
 Lack of end-to-end implementation

However, there are two research publications which seem quite similar. In [2], the authors discuss an approach in which personalized priority is given to emails. However, that research focuses more on evaluating the better model out of a regression-based model and a multi-class classification model using Support Vector Machines. Moreover, they used a very limited number of features, such as the from, to and cc addresses, title, and body text of the email message, and they used term weighting to construct the document vector. Beyond the above limitations, another perceived limitation is that the training sample was very small.


In the other research, the authors attempt to classify emails into predefined categories such as Shopping Domain, Social Media, Appointments, Parenting, Personal, Travel, etc. [3]. Their approach was to create a data dictionary and use a basic keyword search to do the classification. Notably, it gives no consideration to whether the email is important or not.

III. THE SOLUTION

Figure 1 illustrates the overview of the solution proposed in this paper. The existing emails in the mailbox of the user are used as data to train the classification model. Following the steps of the standard machine learning pipeline, features are extracted from the dataset either straightforwardly or through feature engineering, and some features are embedded as personalized parameters as well.

Fig. 1. Overall solution

Thereafter, the above solution is executed iteratively for each new email received in the mailbox, to process and automatically categorize it accordingly. During this process, a scheduled Python agent keeps scanning the mailbox for any unread emails. Once a new email is received, the agent first extracts that email's attributes and then extracts the required features in order to pass them to the machine learning model. If the predicted class received from the machine learning model is 'unimportant', the email is moved to the ignore folder. This process continues iteratively as a background process of the email client.

IV. METHODOLOGY AND APPROACH

This section presents the details of the methodology and approach used to implement the solution explained in Section III.

A. Data Collection

The required data to train the machine learning model were extracted from the personal mailbox of the first author, using the "win32com" Python library. The attributes available for extraction include subject, body, sender, receiver, date-time, email type, size, attachments, importance, etc. Table I contains the list of all attributes that can be obtained from an email.

TABLE I. DEFAULT EMAIL ATTRIBUTES

B. Feature Engineering and Selection

After examining all the attributes and their values, certain attributes were dropped because there was no variance in their values, i.e., empty or constant values. Other email attributes were used either directly or as derived features, as shown in Figure 6.

The content of the email is a very crucial piece of information for deciding the importance of a particular email. The email content is provided by the "Subject" and "Body" attributes. Since those free-text values are neither numerical nor categorical, a word cloud is generated to identify keywords in the subject as well as the body for all the emails in the training set. Once the keywords are identified for each and every email, the term frequency (TF) is calculated for each word and used as a feature to represent the email subject and body.

Apart from the TF identified from the word cloud, some other features were also extracted from the email, such as: new email, reply email, forwarded email, whether the email is addressed to the owner of the mailbox, whether it is addressed to someone else, the length of the email body, whether it is a reply to an email of the owner of the mailbox, etc. Table II contains the features derived from the email attributes.
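The term-frequency features described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' code, and the stop-word list and keyword set are hypothetical placeholders.

```python
from collections import Counter
import re

# Hypothetical subset of a stop-word list
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "for", "on"}

def keyword_term_frequencies(text, keywords):
    """Compute term frequency (count / total tokens) for each keyword
    in an email subject or body, after dropping stop words."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {kw: counts[kw] / total for kw in keywords}

tf = keyword_term_frequencies("Please review the attached report and reply",
                              ["review", "reply", "meeting"])
```

In the paper's pipeline, the keyword list would come from the word clouds built over the training corpus rather than being passed in manually.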


TABLE II. DERIVED EMAIL FEATURES

Email Attribute   Derived Feature
To                Only_to_me
To                Only_me_in_to
CC                Only_me_in_cc
To                Me_in_to
CC                Me_in_cc
To, CC            Supervisor_copy
Sender            Sent_by_supervisor
Sender            Sent_by_me
Sender            Sent_by_reportee
Sender            Sent_by_systems
EmailType         Internal_email
RecievedTime      Office_time
RecievedTime      Weekday
Subject           Reply
Subject           Forward
Subject           New
Attachments       Attachment_Count
Importance        Importance
Body              Reply_to_my_email
Body              Addressed_to_me
Body              Addressed_to_others
Body              Includes_questions
Body              Important_tags
Body/Sender       Automated_emails
Body              Body_length
Size              Size
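Most of the derived features in Table II are Boolean flags computed from the raw attributes. A simplified sketch of a few of them, with a hypothetical mailbox owner and supervisor address, might look like:

```python
def derive_features(email, owner="[email protected]",
                    supervisors=("[email protected]",)):
    """Derive a few of the Table II flags from raw email attributes
    (illustrative only; addresses are made-up placeholders)."""
    to = [a.strip().lower() for a in email.get("to", "").split(";") if a.strip()]
    cc = [a.strip().lower() for a in email.get("cc", "").split(";") if a.strip()]
    subject = email.get("subject", "").lower()
    return {
        "Only_to_me": to == [owner],                      # sole direct recipient
        "Me_in_to": owner in to,
        "Me_in_cc": owner in cc,
        "Supervisor_copy": any(s in to + cc for s in supervisors),
        "Reply": subject.startswith("re:"),
        "Forward": subject.startswith("fw:"),
        "Attachment_Count": email.get("attachments", 0),
        "Body_length": len(email.get("body", "")),
    }

feats = derive_features({"to": "[email protected]", "cc": "[email protected]",
                         "subject": "RE: Budget", "body": "Numbers attached.",
                         "attachments": 1})
```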

A Chi-squared test was carried out for each of the independent variables against the target variable to identify significant features, and the features with P < 0.05 were selected for model training.

Once the email attributes had been extracted, two corpora were constructed by collecting all the subjects of the emails and all the bodies of the emails. Then, a word cloud was generated from each corpus. Before generating the word cloud, the corpus was passed through a stemming process to generalize the terms, and stop words were removed.

The word clouds generated for the subjects and bodies of the first author's emails are shown in Figure 2.

Fig. 2. Word Clouds for subject and body of emails

The words identified for inclusion in the feature set are shown in Figure 3.

Fig. 3. Top words from Word Clouds

The finalized feature set is then constructed as shown in Figure 4; some of the features were derived from email attributes.

Fig. 4. Email feature mapping
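The Chi-squared screening described in this section tests each Boolean feature against the binary class label. A stdlib-only sketch for one feature is below; the counts are made up, and in practice a library routine such as scipy.stats.chi2_contingency would be used.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table [[a, b], [c, d]]
    (feature present/absent vs. class important/unimportant)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        expected = row * col / n
        stat += (obs - expected) ** 2 / expected
    return stat

# Made-up counts: the feature fires mostly on important emails
stat = chi_square_2x2(40, 10, 15, 35)
significant = stat > 3.841  # critical value at P = 0.05 with 1 degree of freedom
```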

C. Personalized Parameters
The following set of personalized parameters was used
during the feature engineering process so that the
classification will be personalized.
 NAME
 EMAIL
 SYSTEMS


 SUPERVISORS
 REPORTEES
 AUTOMATED EMAILS

After the training, these parameters can be adjusted so that users can further improve the accuracy as well as accommodate changes in the environment over time, such as changes in supervisors and changes in systems.

D. Class Label and the Balance of the Data Set

The class label of an email is taken directly from the email attribute "Unread". Almost half of the emails in the first author's inbox were unread. To verify that the unread emails are truly unimportant, a manual inspection was carried out on 10 samples of around 25 emails each, covering both read and unread categories. Table III contains the results of the accuracy check. According to the table, the label accuracy with the default labels is 95%, and hence the proposed labelling criteria are acceptable for this research. This approach saved a lot of the time that manual annotation would have taken.

TABLE III. CLASS DISTRIBUTION ACCURACY

Sample  Label   Correct  Incorrect  Total  Accuracy %
1       Unread  15       1          16     94%
1       Read    8        1          9      89%
2       Unread  13       0          13     100%
2       Read    12       1          13     92%
3       Unread  10       0          10     100%
3       Read    15       1          16     94%
4       Unread  11       0          11     100%
4       Read    14       1          15     93%
5       Unread  18       2          20     90%
5       Read    7        0          7      100%
6       Unread  10       1          11     91%
6       Read    15       1          16     94%
7       Unread  10       0          10     100%
7       Read    15       0          15     100%
8       Unread  9        1          10     90%
8       Read    16       2          18     89%
9       Unread  14       0          14     100%
9       Read    11       1          12     92%
10      Unread  11       0          11     100%
10      Read    14       1          15     93%
Average                                    95%

E. Statistics About the Data Set

Total number of records: 15090
Total number of features: 79

Fig. 5. Class distribution

The data set is balanced enough, and its size is also quite sufficient.

F. Data Pre-processing

While the data is being read from the emails, the attribute extraction is done on the fly, and all the other pre-processing activities, such as null or empty value checks and categorical-to-numerical conversion with Boolean flags, are performed in parallel.

Once the feature set was finalized, the emails were read from the Outlook inbox and the desired features were extracted by iterating over each and every email to construct the data set. While constructing the data set, all the validations were performed and features were transformed to binary and integer values, so that no additional cleansing or transformation steps are needed. Figure 6 shows the high-level model development and training process used.

Fig. 6. Model development and training process

All personalized parameters were configured so that the extracted data set can be used directly for training. The following feature transformations were carried out during the feature engineering process:

 Categorical feature handling
 Empty value handling
 Derived feature creation
 Data type conversions
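The on-the-fly transformations listed above amount to a small per-email mapping step. A hedged sketch, with the feature names assumed for illustration:

```python
def to_numeric_row(features, expected):
    """Convert a derived-feature dict into a fixed-order numeric row:
    booleans become 0/1, missing or empty values become 0, and
    numeric values (counts, lengths, sizes) pass through unchanged."""
    row = []
    for name in expected:
        value = features.get(name)
        if value is None or value == "":
            row.append(0)           # empty value handling
        elif isinstance(value, bool):
            row.append(int(value))  # Boolean flag -> 0/1
        else:
            row.append(value)       # already numeric
    return row

row = to_numeric_row({"Reply": True, "Attachment_Count": 2, "Body_length": ""},
                     ["Reply", "Forward", "Attachment_Count", "Body_length"])
```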
G. Training the Models

Before training the models, the dataset was split into two sets, one for training and one for testing, on a 70/30 basis. A standard scaler was used for scaling or standardizing the features, especially for logistic regression, support vector machine, neural network and KNN.
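The split-and-standardize step can be sketched without scikit-learn; in the paper, scikit-learn's StandardScaler plays the scaling role. The data below is a toy stand-in.

```python
import random

def split_and_standardize(xs, ys, train_frac=0.7, seed=42):
    """Shuffle, split 70/30, then z-score-standardize a single feature
    column using training-set statistics only (avoiding test-set leakage)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_frac)
    train_idx, test_idx = idx[:cut], idx[cut:]
    mu = sum(xs[i] for i in train_idx) / len(train_idx)
    sigma = (sum((xs[i] - mu) ** 2 for i in train_idx) / len(train_idx)) ** 0.5 or 1.0
    z = [(x - mu) / sigma for x in xs]  # z = (x - mu) / sigma, as in Eq. (1)
    return ([(z[i], ys[i]) for i in train_idx],
            [(z[i], ys[i]) for i in test_idx])

train, test = split_and_standardize(list(range(10)), [0, 1] * 5)
```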


H. Rescaling of Features

Since the numerical values have different ranges, certain algorithms need standardized or normalized values so that features with larger value ranges do not dominate the accuracy of the model. Therefore, rescaling of features was required, specifically for the logistic regression and SVM techniques. The two commonly used methods considered in this project are Standardization and Min-Max Normalization.

Standardization uses the well-known z-score normalization as its basis, represented by the formula below. StandardScaler of the sklearn package in Python is an implementation of this scaler.

    z = (x - μ) / σ                              (1)

On the other hand, Min-Max Normalization uses the minimum and maximum values of a given attribute to calculate the normalized value, as the following formula shows.

    X_norm = (X - X_min) / (X_max - X_min)       (2)

MinMaxScaler of the sklearn package is an implementation of this scaler. Notably, the standardization approach is used in this research.

I. Machine Learning Models and Training

The following supervised machine learning techniques were used to compare the results with default parameters:

 Decision Tree
 Random Forest
 Logistic Regression
 Support Vector Machines
 Neural Network
 K-Nearest Neighbor (KNN)

Based on the initial performance of each algorithm, a few machine learning algorithms were selected for optimization with hyperparameter tuning.

J. Evaluations

Evaluation of the results was mostly done using the confusion matrix, specifically using 'Recall' and 'Precision'. For important emails, a higher "Recall" is desirable, as we need to avoid False Negative cases, whereas for unimportant emails, a higher "Precision" is desirable, to avoid False Positive cases.

K. Deployment

Once the model was trained and optimized, an agent program was developed to integrate it with the email client. Accordingly, when any new email arrives, the agent reads the email, extracts the features, applies the trained model and identifies the class. Based on the class, it moves the email to the "ignore" folder in the email inbox. In order to improve the accuracy on unimportant emails, all trained models with higher accuracy were used together as a composite ML model. Therefore, an email is moved to the Ignore folder only if the majority of the models classify the email as unimportant. This was implemented based on the majority voting method. Figure 7 depicts the deployment process of the model with this simple logic.

Fig. 7. Model deployment

V. RESULTS

The following are the results of each of the machine learning algorithms with their default parameters.

TABLE IV. INITIAL MODEL PERFORMANCE

Algorithm               Class        Precision  Recall  F1    SME
Decision Tree           Important    0.66       0.85    0.75  0.30
Decision Tree           Unimportant  0.77       0.53    0.62  0.30
Random Forest           Important    0.70       0.78    0.74  0.29
Random Forest           Unimportant  0.73       0.64    0.68  0.29
Logistic Regression     Important    0.72       0.72    0.72  0.29
Logistic Regression     Unimportant  0.69       0.69    0.69  0.29
Support Vector Machine  Important    0.66       0.64    0.65  0.36
Support Vector Machine  Unimportant  0.62       0.63    0.63  0.36
Neural Network          Important    0.72       0.80    0.76  0.26
Neural Network          Unimportant  0.75       0.66    0.70  0.26
KNN                     Important    0.72       0.72    0.72  0.29
KNN                     Unimportant  0.69       0.70    0.70  0.29

For unimportant emails, precision is more important. As per the observations, Decision Tree, Random Forest and Neural Network have shown better results.
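The composite model's majority-vote rule described under Deployment can be sketched as follows; the model callables and the 'unimportant' label are stand-ins for the trained classifiers.

```python
def majority_vote_unimportant(models, features, threshold=None):
    """Return True (move the email to the Ignore folder) only if at least
    `threshold` models (default: a strict majority) predict 'unimportant'."""
    votes = sum(1 for model in models if model(features) == "unimportant")
    needed = threshold if threshold is not None else len(models) // 2 + 1
    return votes >= needed

# Three stand-in 'models' disagreeing on the same email
models = [lambda f: "unimportant", lambda f: "unimportant", lambda f: "important"]
move_to_ignore = majority_vote_unimportant(models, {})
```

Raising the threshold to require unanimity trades recall on the unimportant class for precision, which matches the paper's preference for avoiding false positives when moving emails out of the inbox.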


Fig. 8. Unimportant email classification comparison

For important emails, recall is more important. We can see that Decision Tree, Random Forest and Neural Network have shown better results here as well.

Fig. 9. Important email classification comparison

Logistic regression, SVM and KNN did not provide convincing results. The precision/recall values shown in the classification report are quite similar and low, which means those models are not responding properly to this data set.

A. Hyperparameter Tuning

Decision Tree, Random Forest and Neural Network were taken to the next step for further optimization. The same data set was used with various hyperparameter combinations. Scikit-learn's GridSearchCV library was used to search for the most optimal parameters.

The following are the best performances gained from each of the algorithms after optimization.

TABLE V. MODEL PERFORMANCE AFTER PARAMETER TUNING

Fig. 10. Unimportant email classification comparison

Fig. 11. Important email classification comparison

Decision Tree has shown better performance compared to the other two. The highest precision for the "Unimportant" class is 85% and the highest recall for the "Important" class is 91%.

B. Deployment

Once the optimal values were identified for each of the algorithms, the machine learning models were finalized and used inside the agent, which is deployed as a background process to scan and classify all new emails in the inbox. Only if all models classify the same email as unimportant is the email moved to the "ignore" folder.

C. Evaluation

Each model was technically evaluated with the Standard Mean Error method as well as GridSearch and cross-validation techniques.

In addition, manual validation was done using random samples after the classification. In order to pick a random sample, a random position in the emails was selected and 10 subsequent emails were taken from that point.
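GridSearchCV exhaustively evaluates hyperparameter combinations with cross-validation. A stdlib-only sketch of the same idea follows; the parameter grid and scoring function are hypothetical stand-ins for cross-validated accuracy.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in the grid and return the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # in practice: mean cross-validated accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical decision-tree grid; the scoring stub favours a deeper tree
grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5]}
best, score = grid_search(grid, lambda p: p["max_depth"] - 0.1 * p["min_samples_leaf"])
```

With scikit-learn, GridSearchCV(estimator, param_grid, cv=5) performs this search with the cross-validation loop built in.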


TABLE VI. RANDOM SAMPLE EVALUATION FOR MOVED EMAILS

According to this, 93% (see Table VI) of the emails classified as Ignore were correct, because all three models were used to confirm the classification. 83% (see Table VII) of the emails which were retained in the inbox were accurate.

TABLE VII. RANDOM SAMPLE EVALUATION FOR SKIPPED EMAILS

VI. FUTURE WORK

Had the manual labelling been done by going through each and every e-mail, the outcome of the project could have been improved further.

As the "Word Cloud" is not actually a static data set, it needs to be re-constructed and fed into the agent periodically, as the nature of the email subjects and contents keeps changing over time.

For future research, the possibility of making the model simpler and more robust could be explored, by measuring the effectiveness of each of the features and using only the most effective features in the model.

VII. CONCLUSION

In this work, an approach for personalized email classification was introduced, powered by supervised machine learning models. The primary objective of the classification is to correctly classify emails as "important" and "unimportant", which is a highly personalized classification. Many features were engineered to give the classification its personal flavour. In terms of the classes, precision should be high when an email is classified as "unimportant", while recall should be high for the "important" class. In this context, an abundance of caution was exercised in selecting the right algorithm for this classification, where the envisaged target was to achieve high accuracy, good performance and integration with the email client.

Different types of machine learning algorithms were tried out and their performance evaluated. The Logistic Regression, Support Vector Machines and K-Nearest Neighbour algorithms did not yield the envisaged results for the dataset used, compared with the other algorithms. The findings of the research support that the Decision Tree, Random Forest and Neural Network-based algorithms gave better performance, and are inclined more towards the rule-based algorithms, Decision Tree and Random Forest, in terms of gauging the expected results.

After implementing the solution, the observations derived from the several random samples were appealing. The observations support that the accuracy of the important class was around 82% while the accuracy of the unimportant class was about 94%. The highest accuracy reported by an individual model for "unimportant" emails was 85%, which was further improved with the composite model.

REFERENCES

[1] G. H. Al-Rawashdeh and R. B. Mamat, "Comparison of four email classification algorithms using WEKA", International Journal of Computer Science and Information Security, vol. 17, no. 2, pp. 42-54, 2019.
[2] S. Yoo, Y. Yang, and J. Carbonell, "Modeling personalized email prioritization: classification-based and regression-based approaches", in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 729-738, 2011.
[3] S. P. Gautam, "Email classification using a self-learning technique based on user preferences", Master's dissertation, Dept. of Computer Science, North Dakota State Univ., Fargo, ND, 2015.
[4] T. Ssebulime, "Email classification using machine learning techniques", Master's dissertation, Dept. of Computer Science, Bournemouth Univ., United Kingdom, 2022.
[5] M. Javed, "The best machine learning algorithm for email classification", Towardsdatascience, https://towardsdatascience.com (accessed 17 March 2022).
