Email classification using machine learning algorithms

Isak Jonsson

Mini Projects, Examensarbete 15 hp
May 2022
Contents

1 Introduction
2 Theory
  2.1 String to numbers
  2.2 Bias–variance tradeoff
  2.3 Cross validation
  2.4 k-nearest neighbors
  2.5 Adaptive boosting vs Random forest
  2.6 Artificial neural network
  2.7 API
3 Implementation
  3.1 Python libraries
  3.2 Implementation of k-nearest neighbors, Adaptive boosting and Random forest
  3.3 Artificial neural network
  3.4 Assembling the data
  3.5 Implementation of the APIs
  3.6 Final implementation
5 Conclusions
  5.1 Model conclusion
  5.2 Feedback loop conclusion
  5.3 Final conclusion
6 Further work
7 Populärvetenskaplig sammanfattning
8 Appendix
1 Introduction
In an increasingly digitized world, building models that separate data into categories has become a relevant design space for different models and algorithms. The goal of this project is to construct a machine learning algorithm that separates emails into two different categories. Furthermore, an algorithm will be constructed that improves over time and becomes better at categorising emails.

This will be done by first constructing a dataset reflecting real-world data. Since it would be nearly impossible to collect enough data from a normal email account, the data is instead collected from two different online forums. The collected data is used to train and tune four different machine learning methods: k-nearest neighbors, adaptive boosting, random forest and an artificial neural network. The best performing algorithm will be incorporated into the final product: a feedback loop machine learning algorithm, where new data is presented to the algorithm and tested. If the algorithm is confident that a new data point belongs to the predicted class, it incorporates the point into the dataset; over time the dataset grows larger and the algorithm adapts to new data and trends.

The goals of the project are to understand what makes a good machine learning algorithm for email classification, and to understand how it can be incorporated into something that grows over time. The project's main limitation is time: only four different methods will be tested, and the growing algorithm will not be able to run for a substantial time.
2 Theory
2.1 String to numbers
In machine learning, most methods use numerical values as their parameters when classifying. This presents a problem for email classification, since emails mostly contain strings of letters and numbers. With the help of the Scikit-learn library this problem can be solved with the built-in function TfidfVectorizer, which builds on the theory of the TF-IDF transformation. TF-IDF stands for term frequency-inverse document frequency, and it transforms text into a numerical vector. It is built on two concepts: term frequency (TF) and inverse document frequency (IDF).

Term frequency counts the occurrences of each term in a string and collects the counts in a matrix whose rows represent the strings and whose columns represent the distinct terms across all documents. Document frequency counts the number of documents in which a specific term occurs. Inverse document frequency indicates the weight of the term: it reduces the weight of terms that show up throughout the whole data set. IDF is presented in equation (1),
idf_i = \log\left(\frac{n}{df_i}\right)    (1)

where idf_i is the IDF score for a specific term i, df_i is the number of documents in which term i appears, and n is the total number of documents. The final TF-IDF score combines the TF and IDF matrices and is calculated with equation (2). [8]

\mathrm{tfidf}_i = \mathrm{tf}_i \cdot \mathrm{idf}_i    (2)
Figure 1: TF-IDF transformation for "What video games can I play with my father? I am 17M and my father is 45, I am wondering what games would be good to play together. I ask because most of the games I play aren't his style, He is"

Figure 1 shows a sparse matrix where every row corresponds to a distinct word with a value.
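As an illustration of how the transformation is typically invoked, the following is a minimal sketch using Scikit-learn's TfidfVectorizer; the example strings are stand-ins rather than the project's actual data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two illustrative documents standing in for the collected Reddit titles.
docs = [
    "What video games can I play with my father?",
    "Russia cuts gas to Poland and Bulgaria",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per distinct term

print(vectorizer.get_feature_names_out())  # the distinct terms across all documents
print(X.toarray())                         # the TF-IDF score of each term in each document
```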
2.2 Bias–variance tradeoff

Bias is the difference between the correct value and the prediction the model produces. A model with higher bias oversimplifies the problem and pays less attention to the training data, which leads to a higher error on both the training and test data.

Variance describes how spread out the fitted data is; a model with higher variance pays more attention to the training data. This leads to lower generalization on data it has not yet seen: a model with high variance will perform well on training data but not on test data.
2.3 Cross validation

K-fold divides the data into k different splits and uses one of the splits as test data and the rest as training data. This is performed once for every split and results in an even distribution when testing the accuracy of a given method. In this project the chosen k is 10, and the mean accuracy and standard deviation are calculated for each method. [4]
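A minimal sketch of this procedure with Scikit-learn is shown below; the synthetic data is a stand-in for the vectorized emails, and the classifier choice is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized email dataset.
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=4), X, y, cv=kfold)

# One accuracy score per split; report the mean and standard deviation.
print("mean accuracy:", scores.mean())
print("standard deviation:", scores.std())
```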
2.4 k-nearest neighbors

The distance between data points is measured with the Euclidean distance, given in equation (4),

D = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}    (4)

where D is the Euclidean distance, (x_1, y_1) are the coordinates of one data point and (x_2, y_2) are the coordinates of the second data point. This distance is calculated for all the data points and helps the algorithm classify depending on the distance. [3]
With the help of the Scikit-learn library the algorithm can be executed in Python using the function KNeighborsClassifier(n_neighbors = k). The parameter k was chosen with the help of k-fold cross validation; depending on k, different accuracies can be achieved, as presented in the results. Generally, a lower k means higher variance and a higher k means higher bias. [7]
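As a sketch of how k can be chosen with k-fold cross validation (again with synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Scan candidate values of k; a low k tends toward high variance,
# a high k toward high bias.
for k in range(1, 11):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=kfold)
    print(f"k = {k:2d}: mean accuracy = {scores.mean():.3f}")
```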
2.5 Adaptive boosting vs Random forest
Adaptive boosting and random forest build on the theory of bagging and boosting, which are methods where the data set is tested multiple times. In bagging, the data set is split into n different splits, each split is trained on, and the result is the mean over all the splits. In boosting, the data set is trained and improved over n iterations: depending on the errors of the previous model, the algorithm learns from its mistakes.

Both adaptive boosting and random forest use decision trees as their method of learning. A decision tree is a method where the data is classified using a tree structure; as an example see figure (6).

For this project, different tree depths will be tested and compared. The splits are determined by minimizing the Gini index, which is calculated by equation (5),
\mathrm{Gini\ index} = \sum_{m=1}^{M} \hat{\pi}_{lm} (1 - \hat{\pi}_{lm})    (5)

where M is the total number of classes and \hat{\pi}_{lm} is the probability of picking a certain class. In the case of adaptive boosting, the weights are calculated with the help of equation (6),

\mathrm{Weight} = \mathrm{learning\ rate} \cdot \log\left(\frac{1 - \mathrm{error}}{\mathrm{error}}\right)    (6)

where the error is the percentage of errors in the prediction and the learning rate is the original weight of the tree. [5]
Random forest and adaptive boosting are implemented with the help of the Scikit-learn library, using the function RandomForestClassifier() for random forest and AdaBoostClassifier() for adaptive boosting. The hyper parameters are n estimators and random state for random forest, and n estimators and learning rate for adaptive boosting. Both algorithms use DecisionTreeClassifier() as their classification model, with different depths. N estimators is the number of splits/estimators, random state controls the random state, and learning rate is the weight applied to each classifier. The hyper parameters are tuned later to maximise the accuracy.

In general, random forest reduces the variance of the model and adaptive boosting reduces the bias of the model. Depending on the results, it will become clear which method is most suited for the data set. [7] [1]
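A minimal sketch of both classifiers, using the hyper parameter values that the results section settles on; note that the keyword for the base model is estimator in recent Scikit-learn versions (base_estimator in versions before 1.2):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the vectorized email dataset.
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# Random forest: the tree depth is controlled directly through max_depth.
forest = RandomForestClassifier(n_estimators=90, max_depth=50, random_state=0)
forest.fit(X, y)

# Adaptive boosting over depth-1 decision trees ("stumps").
boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # base_estimator in scikit-learn < 1.2
    n_estimators=70,
    learning_rate=1.4,
    random_state=0,
)
boost.fit(X, y)
```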
2.6 Artificial neural network

The input layer is the first layer; it has a set number of nodes corresponding to the training values of the data. Each node in one layer is connected to each node in the next layer. When training the network, the weights of each node are changed to reduce the cost function. The output and weights are calculated with equation (7) and equation (8),

\sum_{i=1}^{m} w_i x_i + \mathrm{bias} = w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + \mathrm{bias}    (7)
\mathrm{Output} = f(x) = \begin{cases} 1, & \text{if } \sum_{i=1}^{m} w_i x_i + \mathrm{bias} \geq 0 \\ 0, & \text{if } \sum_{i=1}^{m} w_i x_i + \mathrm{bias} < 0 \end{cases}    (8)
where i is the index of the sample and m is the number of samples. In equation (7), w_i denotes the weights that are tuned during training, and the final output in equation (8) is either a 1 or a 0, which can be interpreted as the node being on or off. When data passes through a node, the given weights are applied and the output is passed through a given activation function. If the values sent to the node satisfy the given requirements, the node sends data to the next layer. Each node's input values are determined by its predecessors. This structure is called a feedforward network and is the structure that will be used in this project.
When training the model, the goal is to maximise the accuracy, evaluated with the help of a cost function. In this project different cost functions will be tested and compared: mean square error (equation (9)), Poisson (equation (10)) and binary crossentropy (equation (11)),

\mathrm{Mean\ square\ error} = \frac{1}{2m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2    (9)

\mathrm{Poisson} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \log(\hat{y}_i) \right)    (10)

\mathrm{Binary\ crossentropy} = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right)    (11)

where i is the index of the sample, \hat{y} is the predicted outcome, y is the real value and m is the number of samples. These cost functions are used together with gradient descent to find a local minimum. Gradient descent has the update rule in equation (12),

w_{\mathrm{new}} = w - \alpha \frac{\partial C}{\partial w}    (12)

where \alpha is the learning rate and C is the cost function.
In this project the Keras library will be implemented and optimised. The model is defined as model = Sequential() and the layers are added as model.add(Dense(7490, activation = x)), where x is the activation function. The two activation functions used in this project are sigmoid and relu. Sigmoid is described in equation (13) and relu in equation (14). [2]

f(x) = \frac{e^x}{e^x + 1}    (13)

f(x) = \max(0, x)    (14)

Figure 4: Sigmoid and relu activation functions. Source: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
The model will be trained using epochs and a batch size. One epoch is one pass in which all the training data is passed forward and backward through the artificial neural network. The network cannot be trained on all the data at once; therefore the data is sent through the network in batches, and the number of samples per batch is called the batch size. For this project the model will be trained with 20 epochs and a batch size of 10. In other words, every data point will be trained on exactly 20 times, and every epoch will process 10 data points at a time during training. [10]
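A minimal sketch of the network described above; the input dimension is an assumption based on the Dense(7490, ...) call quoted earlier, and X_train/y_train are placeholders for the vectorized data:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
# One hidden layer; the width 7490 mirrors the value quoted in the text.
model.add(Dense(7490, activation="sigmoid", input_dim=7490))
model.add(Dense(1, activation="sigmoid"))  # single output node: r/gaming vs r/worldnews

# Mean squared error is one of the cost functions compared in the results.
model.compile(loss="mean_squared_error", optimizer="adam", metrics=["accuracy"])

# 20 passes over the data, 10 samples per weight update:
# model.fit(X_train, y_train, epochs=20, batch_size=10)
```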
2.7 API
API stands for Application Programming Interface and is software that connects an application to a server (website). For the Gmail API an OAuth agreement has to be signed; OAuth is a standard for access delegation that gives the user permission to share data between the two programs. The APIs used in this project are REST APIs, where REST stands for representational state transfer. A REST API uses HTTP requests, which in other words means that the API sends a request to the server where the data is stored and extracts data by reading it, updating it and more. In this project, reading the data will be the main focus, since the REST API will mainly be used to read emails from a certain email account. A REST API uses a GET request to collect data, which in other words means requesting data from a specific server. The data can be transferred in several different formats; JSON is one of the more popular ones and is the one that the Gmail API builds on. JSON is a compact text-based format that is used when data is exchanged. [9]
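As a hedged sketch of how such a GET request looks through Google's official Python client (the file name and scope here are assumptions for illustration):

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]

# Load previously authorized OAuth credentials (token.json is an assumed file name).
creds = Credentials.from_authorized_user_file("token.json", SCOPES)
service = build("gmail", "v1", credentials=creds)

# GET request: list the ids of the most recent messages for the account.
response = service.users().messages().list(userId="me", maxResults=10).execute()

for ref in response.get("messages", []):
    # Each message is returned as JSON; print a short preview of the body.
    msg = service.users().messages().get(userId="me", id=ref["id"]).execute()
    print(msg["snippet"])
```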
3 Implementation
3.1 Python libraries
For the implementation of the k-nearest neighbors, adaptive boosting and random forest algorithms the Scikit-learn Python library was used. Scikit-learn is a free library designed for machine learning and is used extensively throughout the project. Some of the most used functions are:

• Train test split: used to divide the data into training data and test data.
• KFold: used for cross validation.
• TfidfVectorizer: used for converting strings to numbers.
• AdaBoostClassifier: used for the adaptive boosting algorithm.
• RandomForestClassifier: used for the random forest algorithm.
• DecisionTreeClassifier: used for the adaptive boosting and random forest algorithms.
• KNeighborsClassifier: used for the k-nearest neighbors algorithm.
For the artificial neural network the TensorFlow-Keras library was used. Keras is a free library designed for artificial neural networks, and some of the more notable functions are:

• Sequential: the model type used for this artificial neural network.
• Dense: the layers used between the input and output layers.
• Dropout: a layer that drops out some of the data to avoid overfitting.

Other notable Python libraries used were Pandas and NumPy. Pandas is used to read the data from csv files and convert it into the Pandas format, and also to transfer some of the data into other files. NumPy is used to perform some of the necessary calculations in the code. For the Gmail API, notable functions are:

• Credentials: checks the credentials for the given email account.
• build: builds the service that extracts the emails from the given email account.
The hyper parameters are chosen by taking the mean accuracy after running the model through the data with the help of k-fold cross validation. For the different methods this corresponds to: the number of neighbors (k) for k-nearest neighbors; n estimators and random state for random forest; and n estimators and learning rate for adaptive boosting. For both random forest and adaptive boosting different tree depths are also tuned, as sketched below.
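A sketch of this tuning procedure for one hyper parameter (n estimators for random forest, with synthetic stand-in data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# The candidate value with the best k-fold mean accuracy is kept.
for n in (10, 30, 50, 70, 90):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    scores = cross_val_score(model, X, y, cv=kfold)
    print(f"n_estimators = {n:3d}: mean accuracy = {scores.mean():.3f}")
```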
• r/worldnews: Russia cuts gas to Poland, Bulgaria, West vows arms for Kyiv | AP
News
• r/worldnews: France's Atos moves Russian services to India and Turkey amid Ukraine war
• r/gaming: I still love how BoTW’s 2014 demo looked back then. Wish they could
keep it that way.
This data was collected until a dataset of around 1000 samples was assembled. The
dataset distribution was 50/50 and the data would be used to tune the different models.
IFTTT is a website that works like an API between different websites. For this project the IFTTT API was used to extract Reddit posts and send them to a specific email address. This was done by creating a short script that told the API to send an email every time a post was created on r/worldnews or r/gaming.
3.6 Final implementation

The final implementation combines all of the above parts, and the final result will be a network that grows and becomes better over time. This is done by first establishing the most suitable machine learning algorithm: after testing all the above algorithms, the method that yields the best accuracy is incorporated in the feedback loop. The Google Gmail API and the IFTTT API supply the new data for the algorithm. The model is trained on the original data, and the new data is tested on the model. If the model determines, above a certain threshold, that a new data point belongs to r/worldnews or r/gaming, it incorporates the point into the original dataset. The correct class of every data point is also stored so that future testing can be done.
The algorithm will be uploaded to a server where the dataset can grow and become larger. After 2 weeks the dataset will be tested and a conclusion on how well it performed will be presented in the results. The data set will also be tested against the same algorithm without the feedback loop, to conclude whether the feedback loop decreases the number of misclassifications.
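A minimal sketch of one iteration of this feedback loop; the threshold value and the helper structure are assumptions for illustration, not the project's actual code:

```python
import numpy as np
from scipy.sparse import vstack

THRESHOLD = 0.95  # assumed confidence level for absorbing a new point

def feedback_step(model, vectorizer, X_train, y_train, new_texts):
    """Classify incoming emails and absorb confident predictions into the dataset."""
    X_new = vectorizer.transform(new_texts)
    probabilities = model.predict_proba(X_new)

    for i, probs in enumerate(probabilities):
        if probs.max() >= THRESHOLD:
            # Confident prediction: grow the training set with the new point.
            X_train = vstack([X_train, X_new[i]])
            y_train = np.append(y_train, probs.argmax())

    # Retrain on the enlarged dataset so the model adapts to new data and trends.
    model.fit(X_train, y_train)
    return model, X_train, y_train
```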
4.2 k-nearest neighbors
The k-nearest neighbors algorithm was tested for different values of k, and the resulting accuracies are displayed in figure (7) and table (1).

Table 1: Best value of k with its mean accuracy and mean standard deviation, using k-fold = 10.

Setting up the model with k = 4 and using 80% of the data as training data yields the confusion matrix:
                          True class
                      r/gaming   r/worldnews   Total
Predicted r/gaming        87         10          97
Predicted r/worldnews      5         98         103
Total                     92        108         200
From these results it is clear that the model is not biased toward either of the classes. The final result of the model is 15 misclassifications out of 200.
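A confusion matrix of this kind can be produced with Scikit-learn as sketched below (synthetic stand-in data; note that Scikit-learn's convention puts true classes on the rows and predicted classes on the columns, transposed relative to the table above):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# 80/20 split as in the text.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

model = KNeighborsClassifier(n_neighbors=4).fit(X_train, y_train)
print(confusion_matrix(y_test, model.predict(X_test)))
```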
4.3 Adaptive boosting

The 3 hyper parameters tuned were max depth, n estimators and learning rate. The results of the hyper tuning are displayed in figures (8), (9) and (10).

Figure 10: Mean accuracy for different values of n estimators

From the hyper parameter tuning, the values chosen for the model were: max depth = 1, learning rate = 1.4 and n estimators = 70. The result of this model is displayed in table (2).

Table 2: Best values for n estimators, max depth and learning rate with their mean accuracy and mean standard deviation, using k-fold = 10.
Setting up the model with the tuned hyper parameters and using 80% of the data as training data yields the confusion matrix:

                          True class
                      r/gaming   r/worldnews   Total
Predicted r/gaming        89         23         112
Predicted r/worldnews      3         85          88
Total                     92        108         200

The confusion matrix shows that there is some bias in the results, since 23 data points from r/worldnews are wrongly classified as r/gaming. The final result of the model is 26 misclassifications out of 200.
4.4 Random forest

The hyper parameters tuned were max depth, n estimators and max samples. The results of the hyper tuning are displayed in figures (11), (12) and (13).

Figure 11: Mean accuracy for different values of max depth

Figure 12: Mean accuracy for different values of max samples

Figure 13: Mean accuracy for different values of n estimators

From the hyper parameter tuning, the values chosen for the model were: max depth = 50, max samples = 900 and n estimators = 90. The result of this model is displayed in table (3).

Table 3: Best values for n estimators, max depth and max samples with their mean accuracy and mean standard deviation, using k-fold = 10.
Setting up the model with the tuned hyper parameters and using 80% of the data as training data yields the confusion matrix:

                          True class
                      r/gaming   r/worldnews   Total
Predicted r/gaming        89         26         115
Predicted r/worldnews      3         82          85
Total                     92        108         200

The confusion matrix shows that there is some bias in the results, since 26 data points from r/worldnews are wrongly classified as r/gaming. The final result of the model is 29 misclassifications out of 200.
4.5 Artificial neural network

The configurations varied were the activation function, cost function, number of layers and total number of nodes. Different configurations were tested using 20 epochs and a batch size of 10, and the results are displayed in table (4).

Table 4: Different configurations of the artificial neural network with their resulting accuracies.

From the results in table (4), the best configuration for the artificial neural network used the sigmoid activation function, the mean squared error cost function, 1 layer and 7,717 total nodes, giving accuracy 0.980 and standard deviation 0.099. Setting up the model with 80% of the data as training data and training with 20 epochs and a batch size of 10 yields figures (15) and (14) and the confusion matrix below.
Figure 15: Mean loss for different epochs
                          True class
                      r/gaming   r/worldnews   Total
Predicted r/gaming        89          1          90
Predicted r/worldnews      3        107         110
Total                     92        108         200

From the confusion matrix it is clear that the model classifies correctly and is the most accurate of all the above models. The model classified 196 out of 200 points correctly.
Figure 16: Misclassifications over iterations

From figure (16) it is clear that the misclassifications can be regarded as noise, since they only make up 0.7% of the data and the curve appears to be linear. After 2 weeks the dataset had grown to 8,464 samples, the total number of misclassifications had grown to 74, and the result is shown in figure (17).

From figure (17) it is also clear that the misclassifications can be regarded as noise; the total number of errors is now 0.8% of the data, and the curve still appears to be linear.
When testing the algorithm against the same algorithm without a feedback loop, it produced the results displayed in table (5).

Table 5: Total number of misclassifications for the two different models.

From table (5) it is clear that the incorporated feedback loop achieved a lower misclassification rate than the algorithm without it. The result is around 50% fewer misclassifications, which shows that the feedback loop is something that is necessary when classifying emails.
5 Conclusions
5.1 Model conclusion
The final results of all the different machine learning methods showed that machine learning algorithms are a suitable approach to classifying emails. The most suited method was the artificial neural network, which follows the theory, since it should generally perform the best out of the 4 methods. The two methods that stood out were random forest and adaptive boosting, since they performed underwhelmingly compared to the other methods. This could be a result of the dataset being too small, and future testing would be needed to conclude whether random forest and adaptive boosting are suited for email classification.
6 Further work
This project explores some of the applications of machine learning algorithms, and it is clear that there are many avenues that are not yet explored. Some notable actions that could be taken in the future are:

• Add more classes, to explore how it affects the models and the growing network. Adding more classes would greatly increase the complexity of the models and make the final product harder to construct.

• Test more machine learning algorithms. There are many more machine learning algorithms that would need testing to reach a conclusion about the best algorithm for classifying emails.

• Start with a bigger or smaller initial dataset. To truly understand the convergence rate of the growing network, different initial datasets would need to be tested.

• Let the feedback loop machine learning run for a longer time. The growing network was only run for 2 weeks and could be explored further.

• Use BERT instead of the TF-IDF transformation. BERT is an already trained model that gives numerical vector representations to words, which could prove to be a better solution than the TF-IDF transformation.
References

[1] Lilly Chen. Basic Ensemble Learning (Random Forest, AdaBoost, Gradient Boosting) - Step by Step Explained. url: https://towardsdatascience.com/basic-ensemble-learning-random-forest-adaboost-gradient-boosting-step-by-step-explained-95d49d1e2725.
[2] Francois Chollet et al. Keras. 2015. url: https://github.com/fchollet/keras.
[3] Antony Christopher. K-Nearest Neighbor. url: https://medium.com/swlh/k-nearest-neighbor-ca2593d7a3c4.
[4] Cross-validation: evaluating estimator performance. url: https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229.
[5] Decision Trees for Classification: A Machine Learning Algorithm. url: https://www.xoriant.com/blog/product-engineering/decision-trees-machine-learning-algorithm.html.
[6] Neural Networks. url: https://www.ibm.com/cloud/learn/neural-networks.
[7] F. Pedregosa et al. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[8] Luthfi Ramadhan. TF-IDF Simplified. January 20, 2021. url: https://towardsdatascience.com/tf-idf-simplified-aba19d5f5530.
[9] REST APIs. url: https://www.ibm.com/cloud/learn/rest-apis.
[10] Sagar Sharma. Epoch vs Batch Size vs Iterations. url: https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9.
[11] Seema Singh. Understanding the Bias-Variance Tradeoff. url: https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229.
7 Populärvetenskaplig sammanfattning (Popular science summary)

In an increasingly digitized world, where much of our communication happens over the internet, models and algorithms that can separate and classify content are becoming ever more important. Every day millions of messages are sent, and with the help of different programs these messages can be separated into different categories. Companies such as Google and Microsoft have spent considerable resources on building algorithms that can distinguish content and classify emails. In this effort, statistical machine learning has been the main focus, and there is always room for improvement.

Machine learning is a tool used to separate data, and in this project it is used to separate emails into two different classes. The data is taken from two different online forums (Reddit), namely r/worldnews and r/gaming. Four different methods were tested: k-nearest neighbors, adaptive boosting, random forest and artificial neural network. The results showed that the artificial neural network could classify the data with the smallest margin of error. The project was extended by building a program that could grow and become better over time. This was done by taking the best machine learning model and setting up a feedback loop. The result was a program that made predictions on incoming data and updated itself when it was sufficiently confident. This made the program smarter, and over time it became better at classifying data it had never seen before.
8 Appendix

GitHub repository for code and data:
https://github.com/khasec/Email-classification-using-machine-learning-algorithms-Appendix