Personalized Classification of Non-Spam Emails Using Machine Learning Techniques
Abstract - With the advent of computer networks and communications, emails have become one of the most widely accepted communication means, which is faster, more reliable, cheaper, and accessible from anywhere. Due to the increased use of email communications, day-to-day computer users, particularly corporate users, find it cumbersome to filter the most important and urgent emails out of the large number of emails they receive on a given business day. Enterprise email systems are able to automatically identify spam emails, but there are still many non-urgent and unimportant emails among the non-spam emails which cannot be filtered by conventional spam filter programs. Though it may be feasible to set up some static rules and categorize some of the e-mails, the practicality and sustainability of such rules are questionable because of the number of rules required and their limited validity period, as such rules may become redundant after some time. Thus, it is desirable to have a filtering system for non-spam emails that filters out unimportant emails based on the user's past behaviour. While research on identifying spam e-mails is widely available, research on further classifying the non-spam e-mails is lacking. The purpose of this research is to provide a machine learning-based solution to classify non-spam e-mails considering the importance of such e-mails. As part of the research, several machine learning models have been developed and trained using non-spam e-mails from the personal mailbox of the first author of this research. The results showed significant accuracy, particularly with the decision tree, random forest and deep neural network algorithms. This paper presents the modelling details and the results obtained.

Keywords - logistic regression, non-spam email classification, Random Forest algorithms, supervised learning, Support Vector Machines

I. INTRODUCTION

With the advent of computer networks and communications, emails have become one of the most widely accepted communication means, being faster, more reliable, cheaper, and accessible from anywhere. Due to the increased use of e-mail communications, day-to-day computer users, particularly corporate users, receive a large number of emails. Sometimes it can be very hard to go through all of them within the day due to their large volume.

The importance of the e-mails received cannot be gauged without reading their content, as a result of which considerable time is wasted in reading even unimportant e-mails, eroding productive time in the corporate world.

Today most enterprise-level email solutions provide spam filtering features in a very sophisticated manner. However, these spam filters are of little or no use in this scenario, as the emails concerned are not spam but merely unimportant. Moreover, whether an email is important or not is subjective and also contextual. Hence, it is not possible to have a generalized set of rules to filter out such unimportant emails.

This paper presents a machine learning-based method to filter out unimportant non-spam emails based on the respective user's past behaviour when using the mailbox. Several classification models have been trained using the past data of the first author's mailbox, to classify new emails as either important or not important. The rest of the paper is organized as follows: Section II - Literature Review, Section III - Solution, Section IV - Methodology and Approach, Section V - Results, Section VI - Future Work and Section VII - Conclusion.

II. LITERATURE REVIEW

Previous research and experiments pertaining to e-mail classification have focused mainly on improving spam e-mail filtering and on classifying e-mails into different pre-defined categories such as sport, travel, appointments, social media and personal.

Most of the research has been carried out to find which machine learning algorithm is better for spam email classification [1], [4], [5]. The previous literature has not captured the subject matter of this project, personalized classification of non-spam e-mails, which is a novel area of research.

The following shortfalls have been observed in the previous research in this area:

• Lack of personalization when categorizing emails
• Generic data being used
• Limited algorithms
• Lack of feature engineering work
• Lack of end-to-end implementation

However, there are two research publications which seem to be quite similar. In [2], the authors discuss an approach in which personalized priority is given to emails. However, that research focuses more on evaluating the better model out of a regression-based model and a multi-class classification model using Support Vector Machines. Moreover, they used a very limited number of features, such as the from, to and cc addresses, the title, and the body text of the email message, and they used term weighting to construct the document vector. Other than the above limitations, another perceived limitation is that the training sample was also very small.
In the other research, the authors attempt to classify emails into predefined categories such as Shopping, Social Media, Appointments, Parenting, Personal, Travel, etc. [3]. Their approach had been to create a data dictionary and use a basic keyword search to do the classification. Notably, it does not consider whether the email is important or not.
III. THE SOLUTION
Fig. 1 illustrates the overview of the solution proposed in this paper. The existing emails in the mailbox of the user are used as data to train the classification model. Following the steps of the standard machine learning pipeline, features are extracted from the dataset either directly or through feature engineering, and some features are embedded as personalized parameters as well.

TABLE I. DEFAULT EMAIL ATTRIBUTES
C. Personalized Parameters
The following set of personalized parameters was used
during the feature engineering process so that the
classification will be personalized.
• NAME
• EMAIL
• SYSTEMS
• SUPERVISORS
• REPORTEES
• AUTOMATED EMAILS
After the training, these parameters can be adjusted so that users can further improve the accuracy as well as accommodate changes in the environment over time, such as changes in supervisors and changes in systems.
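As an illustration only, the personalized parameters above can be held in a small per-user configuration that the feature engineering code reads. The parameter names mirror the list above, while the dictionary layout, the example values and the helper function below are assumptions rather than the authors' implementation.

```python
# Hypothetical per-user configuration; the keys mirror the parameters listed above,
# the values are placeholders to be filled in for each user.
PERSONALIZED_PARAMETERS = {
    "NAME": "First Author",
    "EMAIL": "[email protected]",
    "SYSTEMS": ["JIRA", "Jenkins"],            # internal systems that send notifications
    "SUPERVISORS": ["[email protected]"],
    "REPORTEES": ["[email protected]"],
    "AUTOMATED_EMAILS": ["[email protected]"],
}

def sender_flags(sender_address: str, params: dict) -> dict:
    """Derive simple personalized features from the sender address."""
    sender = sender_address.lower()
    return {
        "from_supervisor": int(sender in [s.lower() for s in params["SUPERVISORS"]]),
        "from_reportee": int(sender in [s.lower() for s in params["REPORTEES"]]),
        "is_automated": int(sender in [s.lower() for s in params["AUTOMATED_EMAILS"]]),
    }
```

Keeping these values in configuration rather than inside the trained model is what allows them to be adjusted later when supervisors or systems change.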
D. Class Label and the Balance of the Data Set

The class label of an email is taken directly from the email attribute "Unread". Almost half of the emails in the first author's inbox were unread. To be certain that the unread emails are truly unimportant, this labelling was verified by manual inspection of 10 random samples containing both read and unread emails, each with a sample size of 25 emails. Table III contains the results of the accuracy check. According to the table, the label accuracy with the default labels is 95% and hence the proposed labelling criterion is acceptable for this research. This approach saved a lot of time that would otherwise have been spent on manual annotation.
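A minimal sketch of how the label can be derived from the "Unread" attribute and how random samples can be drawn for the manual accuracy check; the pandas representation and column names are assumptions, not the authors' exact code.

```python
import pandas as pd

def add_class_label(emails: pd.DataFrame) -> pd.DataFrame:
    """Derive the class label from the mailbox flag: unread -> unimportant (0), read -> important (1)."""
    labelled = emails.copy()
    labelled["important"] = (~labelled["unread"]).astype(int)
    return labelled

def draw_verification_samples(emails: pd.DataFrame, n_samples: int = 10,
                              sample_size: int = 25, seed: int = 42):
    """Draw small random samples (containing both read and unread emails) for manual inspection."""
    return [emails.sample(n=sample_size, random_state=seed + i) for i in range(n_samples)]
```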
TABLE III. CLASS DISTRIBUTION ACCURACY

Sample  Label   Correct  Incorrect  Total  Accuracy %
1       Unread  15       1          16     94%
1       Read    8        1          9      89%
2       Unread  13       0          13     100%
2       Read    12       1          13     92%
3       Unread  10       0          10     100%
3       Read    15       1          16     94%
4       Unread  11       0          11     100%
4       Read    14       1          15     93%
5       Unread  18       2          20     90%
5       Read    7        0          7      100%
6       Unread  10       1          11     91%
6       Read    15       1          16     94%
7       Unread  10       0          10     100%
7       Read    15       0          15     100%
8       Unread  9        1          10     90%
8       Read    16       2          18     89%
9       Unread  14       0          14     100%
9       Read    11       1          12     92%
10      Unread  11       0          11     100%
10      Read    14       1          15     93%
Average                                    95%

E. Statistics About the Data Set

Total number of records: 15090
Total number of features: 79

Fig. 5. Class distribution

Fig. 5 shows the balance of the data set. The data set is balanced enough, and the size of the data set is also quite sufficient.

F. Data Pre-processing

While the data is being read from the emails, the attribute extraction is done on the fly, and all the other pre-processing activities, such as null or empty value checks and categorical-to-numerical conversion with Boolean flags, are performed in parallel.

Once the feature set was finalized, the emails were read from the Outlook inbox and the desired features were extracted by iterating over each and every email and constructing the data set. While constructing the data set, all the validations were performed and the features were transformed to binary and integer values, so that no additional cleansing and transformation steps are needed. The following is the high-level process of data extraction used.

Fig. 6. Model development and training process

All personalized parameters were configured so that the extracted data set can be used directly for training. The following feature transformations were carried out during the feature engineering process (a sketch follows the list):

• Categorical feature handling
• Empty value handling
• Deriving new features
• Data type conversions
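A sketch of the kinds of transformations listed above, written with pandas; the column names are illustrative and do not reproduce the actual 79-feature set.

```python
import pandas as pd

def preprocess(emails: pd.DataFrame) -> pd.DataFrame:
    """Apply the transformations listed above on illustrative columns."""
    df = emails.copy()

    # Empty value handling: treat a missing subject or body as an empty string.
    df["subject"] = df["subject"].fillna("")
    df["body"] = df["body"].fillna("")

    # Categorical feature handling: one-hot encode the sender-assigned importance flag.
    df = pd.get_dummies(df, columns=["importance"], prefix="importance")

    # Deriving new features, e.g. whether the user was addressed directly or has attachments.
    df["directly_addressed"] = (df["to_count"] > 0).astype(int)
    df["has_attachment"] = (df["attachment_count"] > 0).astype(int)

    # Data type conversions: Boolean flags to 0/1 integers for the classifiers.
    bool_cols = df.select_dtypes(include="bool").columns
    df[bool_cols] = df[bool_cols].astype(int)
    return df
```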
G. Training the Models

Before training the models, the dataset was split into two sets, one for training and one for testing, on a 70/30 basis. The standard scaler was used for scaling or standardizing the features, especially for the logistic regression, support vector machine, neural network and KNN models.
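The split, scaling and training described above can be sketched with scikit-learn as follows; the synthetic data stands in for the extracted email features, and the estimators are shown with default settings rather than the tuned parameters reported later.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Stand-in for the extracted email feature matrix X and the class label y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 70/30 split into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Standard scaling, used especially for the distance- and gradient-based models.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "knn": KNeighborsClassifier(),
    "neural_network": MLPClassifier(max_iter=500),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
}

for name, model in models.items():
    scaled = name in {"logistic_regression", "svm", "knn", "neural_network"}
    model.fit(X_train_s if scaled else X_train, y_train)
    accuracy = model.score(X_test_s if scaled else X_test, y_test)
    print(f"{name}: test accuracy {accuracy:.3f}")
```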
The standard scaler standardizes each feature value x using the mean μ and the standard deviation σ of that feature:

$z = \frac{x - \mu}{\sigma}$  (1)
On the other hand, the Min-Max Normalization uses the
minimum and maximum values in a given attribute to
calculate the normalized value. The following formula
explains the calculation of the normalized value using the min-
max normalization method.
$X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$  (2)
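For illustration, formulas (1) and (2) can be computed directly on a small example; the values below are arbitrary.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])  # arbitrary example values for one feature

# Formula (1): z-score standardization, z = (x - mu) / sigma
z = (x - x.mean()) / x.std()

# Formula (2): min-max normalization, X_norm = (X - X_min) / (X_max - X_min)
x_norm = (x - x.min()) / (x.max() - x.min())

print(z)       # approximately [-1.34, -0.45, 0.45, 1.34]
print(x_norm)  # [0.0, 0.333..., 0.666..., 1.0]
```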
Fig. 7. Model deployment
The following are the best performances gained from each of the algorithms after optimization.

TABLE V. MODEL PERFORMANCE AFTER PARAMETER TUNING

However, manual validation was also done using random samples after the classification. To pick a random sample, a random position in the list of emails was selected and the 10 subsequent emails were taken from that point.
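A minimal sketch of the sampling procedure just described, assuming the classified emails are available as an ordered list.

```python
import random

def pick_validation_sample(classified_emails, sample_size=10, seed=None):
    """Pick a random starting position and take the next `sample_size` consecutive emails."""
    rng = random.Random(seed)
    start = rng.randrange(0, len(classified_emails) - sample_size + 1)
    return classified_emails[start:start + sample_size]
```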
TABLE VI. RANDOM SAMPLE EVALUATION FOR MOVED EMAILS

TABLE VII. RANDOM SAMPLE EVALUATION FOR SKIPPED EMAILS

According to these evaluations, 93% (see Table VI) of the emails classified as Ignore were correctly classified, since all three models were used to confirm that classification, and 83% (see Table VII) of the emails retained in the inbox were correctly classified.

Different types of machine learning algorithms were tried out and their performance was evaluated. The Logistic Regression, Support Vector Machine and K-Nearest Neighbour algorithms did not yield the envisaged results for the dataset used, compared with the other algorithms. The findings of the research support that the Decision Tree, Random Forest and Neural Network-based algorithms gave a better performance. The findings are inclined more towards the rules-based algorithms, Decision Tree and Random Forest, in terms of gauging the expected results.

After implementing the solution, the observations derived from the several random samples were appealing. The observations support that the accuracy of the important class was around 82% while the accuracy of the unimportant class was about 94%. The highest accuracy reported for the individual models was 85% for "unimportant" emails, which was further improved with the composite model.
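The way the three models are combined is not spelled out in this excerpt; one plausible reading, shown purely as an assumption, is that an email is moved to Ignore only when every model predicts the unimportant class.

```python
import numpy as np

def composite_ignore(models, X):
    """Mark an email for the Ignore folder only when all fitted models agree it is unimportant (class 0).
    `models` is a list of already fitted classifiers; X is the feature matrix (both assumed)."""
    predictions = np.column_stack([m.predict(X) for m in models])
    return (predictions == 0).all(axis=1)  # True -> move to Ignore, False -> keep in the inbox
```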
VI. FUTURE WORK
Had the manual labelling been done by going through each and every e-mail, the outcome of the project could have been improved further.

As the "Word Cloud" is not actually a static data set, it needs to be re-constructed and fed into the agent over time, as the nature of the email subjects and contents keeps changing.

For future research, the possibility of making the model simpler and more robust could be explored, by measuring the effectiveness of each of the features and using only the most effective features for the model.
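One way to carry out the feature-effectiveness measurement suggested above, shown as a sketch rather than the authors' plan, is to rank the features by random forest importance and keep only the strongest ones.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_top_features(X, y, feature_names, keep=20):
    """Rank features by random forest importance and return the `keep` most effective ones."""
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, y)
    ranked = np.argsort(forest.feature_importances_)[::-1]
    return [feature_names[i] for i in ranked[:keep]]
```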
VII. CONCLUSION
In this work, an approach for personalized email classification, powered by supervised machine learning models, was introduced. The primary objective of the classification is to correctly classify emails as "important" or "unimportant", which is a highly personalized classification. Many features were engineered to give a personal flavour to the classification. With respect to the classes, if an email is classified as "unimportant" then that classification should be highly accurate, while recall should be high for the "important" class. In this context, an abundance of caution was exercised in selecting the algorithm that works best for this classification, where the envisaged targets were high accuracy, good performance and integration with the email client.
REFERENCES

[1] G. H. Al-Rawashdeh and R. B. Mamat, "Comparison of four email classification algorithms using WEKA," International Journal of Computer Science and Information Security, vol. 17, no. 2, pp. 42-54, 2019.
[2] S. Yoo, Y. Yang, and J. Carbonell, "Modeling personalized email prioritization: classification-based and regression-based approaches," in Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 729-738, 2011.
[3] S. P. Gautam, "Email classification using a self-learning technique based on user preferences," Master's dissertation, Dept. of Computer Science, North Dakota State University, Fargo, ND, 2015.
[4] T. Ssebulime, "Email classification using machine learning techniques," Master's dissertation, Dept. of Computer Science, Bournemouth University, United Kingdom, 2022.
[5] M. Javed, "The best machine learning algorithm for email classification," Towards Data Science, https://towardsdatascience.com (accessed 17 March 2022).