
Final Report - Smart and Fast Email Sorting

Antonin Bas - Clement Mennesson

1 Project's Description

Some people receive hundreds of emails a week, and sorting all of them into different categories
(e.g. Stanford, Studies, Holidays, Next Week, Internship, Friends, Graduate Activities) can be time-consuming. Most email clients provide a sorting mechanism based on rules specified by the user:
sender's email address, keywords in the subject... The aim of this project is to develop a machine
learning algorithm which learns from the user's manual sorting of emails in order to quickly
take over the processing of new emails.
We do not want to have to wait for the user to label a sufficiently high number of emails in
each category before being able to make a prediction. We therefore implement an
online learning algorithm, which makes a prediction for each incoming email as soon as it arrives
(even for the first ones, although those early tentative labellings are likely to be erroneous). However, the
algorithm will not display its prediction to the user and move the email to a specific folder unless
it is confident in its prediction. Indeed, an application which frequently mislabels emails or, worse,
loses important emails by mistakenly placing them in a garbage folder rarely consulted by the
user, is simply not worth using. An interesting part of the project is therefore to evaluate the
degree of confidence of the algorithm in its predictions.
It is interesting to note that at least one open-source solution, POPFile, already exists which
fulfils this task. POPFile is an email proxy running Naive Bayes, which inserts itself between the
email server and the email client. However, in our opinion POPFile is not well adapted to
webmail clients: the user has to run POPFile as a daemon on a machine which stays on. If
the machine is turned off, the email does not get sorted anymore, which prevents mobile-only
Internet users from sorting their email with POPFile. Therefore, one of our goals is to develop an
email sorter which uses as few computing resources as possible. Such a product might be able to run
directly on web servers, and could be used directly by webmail providers.

2 Email Processing

2.1 Tokenization

An important part of our work was to obtain well-formatted data on which to test our algorithms.
We have developed an Email Retriever / Tokenizer / Labelling tool, with a user interface (UI).

Email Retriever We connect to a given IMAP email server (in our case Gmail) using Oracle's
JavaMail API, and retrieve the Inbox content.

Labelling The downloaded emails are displayed one by one via the UI to the user, who can choose
to give them a label or to leave them unlabelled.

Tokenizer An email can have different content types: text/plain, text/html or multipart/alternative,
when for example it combines a text/plain and a text/html representation of the same message. Our tool only supports emails with at least one text/plain part, which is the part we
use for the tokenization; this is sufficient for our tests. Furthermore, most email clients tend
to include a copy of the original email when you choose the Reply option. We have chosen to
discard this copy from our input: for now, we consider that having the same subject is a good
enough indication that two emails are related, and that the extra information is not needed.
For the tokenization, we use the Stanford Log-linear Part-of-Speech Tagger, which includes a
stemming algorithm. We discard punctuation tokens. Web addresses and email addresses are
respectively replaced by tokens HTTPADDR_domain and EMAILADDR_domain, where
domain is the domain name of the web / email address. The mail's subject, as well as the
recipient's and sender's addresses, are also tokenized. The final output is a one-line text file (per
mail) containing the label and the number of occurrences of each known token. Since we want
to design an online learning algorithm, the representation of an email does not have a fixed
length: each incoming email is likely to increase the number of known tokens and thus the
length of an email representation.
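The following Python fragment sketches the token-counting step in isolation. It is illustrative only: the regular expressions and the exact HTTPADDR_ / EMAILADDR_ token spellings are simplifying assumptions, and the POS-tagger stemming and subject / address tokenization of the real tool are omitted.

import re
from collections import Counter

# Simplified stand-ins for the address-replacement rules; the real tool
# uses the Stanford POS tagger and also tokenizes subject and addresses.
URL_RE = re.compile(r'https?://(?:www\.)?([\w.-]+)\S*')
EMAIL_RE = re.compile(r'[\w.+-]+@([\w.-]+)')

def tokenize(text):
    """Return a token -> occurrence-count map for one email body."""
    # Replace web / email addresses by domain-tagged tokens.
    text = URL_RE.sub(lambda m: ' HTTPADDR_' + m.group(1).replace('.', '_') + ' ', text)
    text = EMAIL_RE.sub(lambda m: ' EMAILADDR_' + m.group(1).replace('.', '_') + ' ', text)
    # Lowercase, split on non-word characters, and drop punctuation tokens.
    return Counter(t for t in re.split(r'\W+', text.lower()) if t)

counts = tokenize("Meeting notes at http://cs.stanford.edu/cs229, write to staff@cs.stanford.edu")
# Counter({'meeting': 1, 'notes': 1, 'at': 1, 'httpaddr_cs_stanford_edu': 1,
#          'write': 1, 'to': 1, 'emailaddr_cs_stanford_edu': 1})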

2.2 Other Features

There are several features one could select instead to lower the dimension of the problem. Such
features include, for instance: the size of the email, the number of recipients, the time period when
it was sent... Using these features as regressors, one could expect to obtain results using supervised
learning techniques. However, the loss of information is huge, since the actual content of the email is
never taken into account. Email tokenization seems a better approach. Besides, it is always possible
to identify the most relevant tokens for a particular category and only feed those to the algorithm.
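For concreteness, here is a hypothetical sketch of such a low-dimensional feature extractor, using Python's standard email module; the feature choices simply mirror the list above.

from email import message_from_string
from email.utils import parsedate_to_datetime

def metadata_features(raw):
    """Extract the low-dimensional features named above: body size,
    number of recipients, and the hour of day the email was sent.
    A sketch: assumes a simple, non-multipart message."""
    msg = message_from_string(raw)
    size = len(msg.get_payload())                    # crude body length
    to = msg.get('To') or ''
    n_recipients = sum(1 for a in to.split(',') if a.strip())
    date = msg.get('Date')
    hour = parsedate_to_datetime(date).hour if date else 0
    return [float(size), float(n_recipients), float(hour)]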

3 Description of the Data Sets

The algorithms presented are tested with two different data sets obtained from one of our Gmail
boxes.
The table below shows the distribution of the data sets: number of categories, total number of
emails, proportion of unlabelled emails (no label given by the user), average number of emails per
category, and lowest / highest number of emails in one category.
Data Set   Nb of categories   Nb of emails   Unlabelled    Average number   Lowest   Highest
1          10                 565            118 (20.9%)   44.7             6        212
2          4                  715            68 (9.5%)     161.8            18       500

Table 1: Distribution of the data sets


The first data set specifies 10 very specific categories, sometimes with very few emails per category.
For the second data set, we have tried to simplify the task of the classifier: there are fewer categories,
and more emails per category. For both data sets, it would be impossible to define objective criteria to
sort the emails (i.e. Gmail's current sorting model with user-defined categories could not be applied).
Given the Gmail box we consider, these data sets are both reasonable email allocations. We do
not try to fool the algorithm by wrongly labelling some emails. This way, we make sure there is
no bias in data selection and that we are able to analyze the different behaviours of our algorithm
accurately. We want to insist on the fact that both data sets contain unlabelled emails: we
do not want to sort the inbox completely, as we assume most users do not.

4 Error and Hesitation

Each data set contains emails the user does not want to classify. Consequently, we distinguish
between error and hesitation and introduce a criterion to measure the degree of confidence of the
algorithm in its prediction.

Error: The algorithm makes a prediction for the email's category or box, which it is confident about,
but the prediction is wrong: the email ends up in the wrong category, or receives a label whereas
the user did not want it labelled. The algorithm should try to minimize the number of errors,
as an important email could end up in a box of little interest to the user.

Hesitation: The algorithm makes a prediction for the email's category or box, which it is not confident about. The email receives no label and is left in the main Inbox. Therefore, the user has
to sort the email himself, but there is no risk of losing an important email. A hesitation is
considered preferable to an error.
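One reasonable way to encode this distinction when scoring the algorithm (our encoding, not necessarily the report's exact bookkeeping) is to treat "no label" as None on both sides:

def outcome(prediction, true_label):
    """Score one email under the error / hesitation distinction above.
    None means 'no label': an unconfident algorithm predicts None, and
    an email the user left unsorted has true label None."""
    if prediction is None:
        # The email stays in the Inbox for the user to sort (or ignore).
        return "correct" if true_label is None else "hesitation"
    # A confident prediction is an error if the email belongs elsewhere
    # or should have stayed unlabelled.
    return "correct" if prediction == true_label else "error"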

5 First Approach: Multinomial Naive Bayes Framework

5.1 First Implementation

We present here a first implementation of email sorting using the Multinomial Naive Bayes
framework with Laplace smoothing. It is a natural extension of the spam / non-spam Naive Bayes
algorithm used in many spam classifiers, except that we now have several categories, each with its
own Naive Bayes classifier. At each step of the algorithm, i.e. for each new incoming email, the
training set grows by one, and the dictionary / token list is extended. As in the spam / non-spam
approach, we calculate the probability of the new email belonging to each category. Note
that these probabilities do not necessarily sum to 1. The most obvious example is when
an email does not belong to any category. We describe below the different steps of the algorithm for
each incoming email. We implemented it in Matlab.

1. Train the algorithm with the n sorted emails.

2. Assign the (n + 1)-th email to a category based on the words contained in the current dictionary, and
qualify the confidence of the prediction. The predicted category is the one with the maximum classification
probability. The prediction is confident when this probability is higher
than some value α, and all the other classification probabilities are lower than some value β.
In the final tests, we choose α = 0.98 and β = 0.5. Given the results observed, the choice of β
has more impact, as classification probabilities often end up being 0 or 1. A high β amounts to
a high level of confidence in the training.

3. Check the correctness of the labelling. In case of mislabelling, record an error and reassign the
new email to its correct category.

4. Expand the token list and update the representation of the training set (n + 1 emails) in this
new dictionary.

This algorithm is computationally demanding: one Naive Bayes classifier is maintained for each
category, and has to be updated for each incoming email.
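A minimal Python sketch of this loop is given below (the authors' implementation was in Matlab). It keeps one binary "category vs. rest" multinomial Naive Bayes classifier per category, with Laplace smoothing, and applies the α / β confidence rule; the data structures and smoothing details are our own assumptions.

import math
from collections import defaultdict

class OnlineEmailSorter:
    """One binary multinomial Naive Bayes classifier (category vs. rest)
    per category, Laplace-smoothed, updated after every email."""

    def __init__(self, categories, alpha=0.98, beta=0.5):
        self.cats = list(categories)
        self.alpha, self.beta = alpha, beta
        self.vocab = set()
        self.tok = {c: defaultdict(int) for c in self.cats}  # token counts per category
        self.tot = {c: 0 for c in self.cats}                 # total tokens per category
        self.docs = {c: 0 for c in self.cats}                # emails per category

    def _posterior(self, counts, c):
        """P(y = c | email) from a two-class (c vs. rest) model."""
        V = max(len(self.vocab), 1)
        n = sum(self.docs.values())
        log_c = math.log((self.docs[c] + 1) / (n + 2))       # smoothed priors
        log_r = math.log((n - self.docs[c] + 1) / (n + 2))
        tot_r = sum(self.tot.values()) - self.tot[c]
        for t, k in counts.items():
            tc = self.tok[c].get(t, 0)
            tr = sum(self.tok[o].get(t, 0) for o in self.cats) - tc
            log_c += k * math.log((tc + 1) / (self.tot[c] + V))
            log_r += k * math.log((tr + 1) / (tot_r + V))
        m = max(log_c, log_r)                                # stable normalization
        pc, pr = math.exp(log_c - m), math.exp(log_r - m)
        return pc / (pc + pr)

    def predict(self, counts):
        """Step 2: return a category if confident, else None (a hesitation)."""
        p = {c: self._posterior(counts, c) for c in self.cats}
        best = max(p, key=p.get)
        if p[best] > self.alpha and all(p[c] < self.beta for c in self.cats if c != best):
            return best
        return None

    def update(self, counts, label):
        """Steps 3-4: learn the confirmed label and grow the dictionary."""
        self.vocab.update(counts)
        self.docs[label] += 1
        for t, k in counts.items():
            self.tok[label][t] += k
            self.tot[label] += k

In an actual inbox, predict would be called on each incoming email and update once the user confirms or corrects the label.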

5.2 Adding Relevance

The assignment of an email generally relies on keywords. Consequently, we select the most relevant
tokens for each category to compute a relevance filter, excluding high-frequency tokens (which appear
in all categories) and tokens which only appear sporadically in a category. Suppose we know
$\phi_{j,i} = P(\text{token} = j \mid y = i)$ for all $(i, j)$; the relevance $r_{i,j}$ of token $j$ for category $i$ is:

r_{i,j} = \log \frac{\phi_{j,i}}{\sum_{l \neq i} \phi_{j,l}}

The number of relevant tokens is a key parameter. A low number sharpens the classification but is
sensitive to the variability of content within a category, whereas a large number gives too much credit
to each classification, thus increasing the hesitation rate. The number of relevant tokens, as well as
how they are computed, impacts the algorithm's running time. Recomputing the list for each
new incoming email is very time-consuming. However, once the categories are more stable, it is
not necessary to recompute the list at every step, and classifying incoming emails becomes faster (a
reduced corpus means fewer operations).
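A sketch of the relevance-based selection, with our own simplifications: phi is assumed to hold Laplace-smoothed estimates of P(token = j | y = i), and the min_phi cutoff stands in for the report's filtering of sporadic tokens.

import math

def top_relevant_tokens(phi, category, k=50, min_phi=1e-4):
    """Rank a category's tokens by the relevance score r_{i,j} above
    and keep the k best. phi[c][t] ~ P(token = t | y = c), smoothed."""
    others = [c for c in phi if c != category]
    def relevance(t):
        # Denominator: the token's probability mass in all other categories.
        return math.log(phi[category][t] /
                        sum(phi[c].get(t, 1e-12) for c in others))
    candidates = [t for t, p in phi[category].items() if p >= min_phi]
    return sorted(candidates, key=relevance, reverse=True)[:k]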

5.3 Results

Data set 1: 3.6% error and 13% hesitation, considering the 50 most relevant tokens for each
category, α = 0.98, β = 0.5.
Data set 2: 3.6% error and 2.94% hesitation, considering the 100 most relevant tokens for each
category, α = 0.98, β = 0.5.
The error rate seems to be less sensitive to the number of relevant tokens than the hesitation rate.
In fact, after a short training period, the algorithm becomes very accurate and stops mislabelling
emails, which explains the low error rate past a few hundred emails. What happens when we add
more tokens, especially with the first data set (10 boxes), is that the probability of belonging to a
category becomes higher than 0.5 for several categories, which results in more hesitation.
There is no clear rule for choosing the number of relevant tokens, but we suggest keeping it to a
small fraction of the total dictionary (around 5%).

6 K-Means Clustering

While Naive Bayes produces good results, we also consider another approach, inspired
by unsupervised learning, which may lead to a reduced running time. Each mail category is
represented by a vector (centroid) in an m-dimensional space, where m is the size of the dictionary;
m grows with each incoming email we add to the training set. We represent incoming emails by binary
vectors (token presence / absence). We use two different metrics to calculate the distance between
a new mail and a centroid: the first is the traditional L2 norm, the second is the scalar
product of the new email vector and the centroid.

6.1 Algorithm

1. Label the incoming email by finding the closest centroid, based on the tokens currently
contained in the dictionary. The prediction is confident when the mail is markedly closer to one
centroid than to the others; we use the ratio of the distances and some threshold value to
decide whether the mail will actually be classified or not.

2. Check the correctness of the labelling. In case of mislabelling, record an error and reassign
the new email correctly.

3. Expand the token list and update the centroids (new tokens, new dimension). The space
dimension increases by one for each new token.

This algorithm is a little faster than our Naive Bayes approach, since for each new incoming email
we only have to update one centroid, and this update requires few operations.
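A sketch of one step, under our own simplifying assumptions: a fixed vocabulary of size m (the real algorithm grows m online), scalar-product scoring, and a hypothetical ratio threshold for the confidence test.

import numpy as np

def classify(x, centroids, labels, ratio=0.8):
    """Step 1: score the binary email vector x against every centroid
    with the scalar product (larger = closer). Confident only when the
    runner-up score is at most `ratio` times the best score."""
    scores = centroids @ x
    order = np.argsort(scores)[::-1]          # best first
    best, second = order[0], order[1]
    if scores[best] > 0 and scores[second] <= ratio * scores[best]:
        return labels[best]
    return None                               # hesitation

def learn(x, i, centroids, counts):
    """Step 3: fold the confirmed email into category i's centroid,
    kept as the running mean of the category's email vectors."""
    centroids[i] = (counts[i] * centroids[i] + x) / (counts[i] + 1)
    counts[i] += 1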

6.2 Adding Relevance

As for Naive Bayes, we can use only the most significant tokens in our predictions, to reduce the
dimension across which we make a prediction. Let $\mu$ denote the matrix of centroids. Given a
category $i$ and a token $j$, we define:

r_{i,j} = \frac{\mu(i, j)}{\sum_{l \neq i} \mu(l, j)}

6.3 Results

6.3.1 Choice of Distance

The L2 norm is the first natural choice for a distance, but it gives poor results. The scalar product
of the new email vector and a centroid is a projection rather than a distance, because the closer
you are to a centroid, the higher the norm of the projection. However, this second approach gives
better results both in terms of error and hesitation, and it is the measure used for plotting the
curves presented below. Note that using the scalar product makes the algorithm look a lot
like Naive Bayes.
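To make the difference concrete, here is a toy comparison of the two measures (hypothetical numbers):

import numpy as np

x = np.array([1., 0., 1., 1., 0.])        # binary vector of a new email
c = np.array([0.9, 0.1, 0.8, 0.2, 0.0])   # one category's centroid

l2 = np.linalg.norm(x - c)   # distance: smaller = closer
dot = x @ c                  # projection: larger = closer, and each shared
                             # token contributes its weight in the centroid,
                             # much like a per-token score in Naive Bayes
print(l2, dot)               # 0.8366..., 1.9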
6.3.2 Analysis

The two curves show the error-hesitation trade-off for the two data sets as we vary the threshold
parameter. A low error rate comes at the cost of a high hesitation rate, and vice versa. The algorithm
produces satisfying results, especially for the second data set: 3% error for 10% hesitation when the
threshold is set correctly. However, we do not see how to provide a heuristic for the choice of this threshold.

Figure 1: Error-hesitation curve for the first data set

Figure 2: Error-hesitation curve for the second data set

6.4 Additional Remark: Giving More Weight to Recent Emails

In some cases (especially for the second data set), the lexical field of the emails in a given category
can change, for instance with a new topic of discussion. We thought this might impact the
precision of the algorithm, so we modified it to give more weight (in the centroids)
to the most recent emails in the category. This slightly lowered the hesitation rate, but we did not
judge the improvement sufficient to justify the additional computing time.
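The report does not specify the weighting scheme; an exponentially decaying update is one natural way to implement it (the decay rate lam here is a hypothetical parameter):

def learn_with_decay(x, i, centroids, lam=0.1):
    """Recency-weighted variant of the centroid update: each new email
    moves category i's centroid by a fixed fraction lam, so older
    emails' influence decays geometrically and the centroid can track
    a drifting lexical field."""
    centroids[i] = (1.0 - lam) * centroids[i] + lam * x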

7 Conclusion

We obtained very good performance with Naive Bayes. However, we could not complete our
secondary objective, which was to devise an alternative algorithm with similar (or slightly inferior)
results to Naive Bayes but a lower running time. Our K-means adaptation does take less time
to execute than our Naive Bayes implementation, but it is not as accurate and requires finer tuning
of its parameters. In fact, we realize now that an online machine learning algorithm will always require
some heavy computation at each step, since the model needs to be updated for each new example.
In the absence of a fast algorithm, one can always try to reduce the frequency of the updates and
check whether accuracy is significantly reduced.
