Final Paper

The document discusses using a pre-trained BERT model for universal spam detection by training it on four different datasets. It achieved over 97% accuracy on the combined model, demonstrating the effectiveness of transfer learning for this task. The research aims to build an efficient spam detection model and evaluates performance on different testing sizes and batch sizes.

Proceedings of the 55th Hawaii International Conference on System Sciences | 2022

Universal Spam Detection using Transfer Learning of BERT Model


Vijay Srinivas Tida
Center of Advanced Computer Studies
School of Computing and Informatics
University of Louisiana at Lafayette
[email protected]

Sonya Hsu
Informatics Program
School of Computing and Informatics
University of Louisiana at Lafayette
[email protected]

Abstract

Deep learning transformer models have become important by training on text data based on self-attention mechanisms. This manuscript demonstrates a novel universal spam detection model using Google's pre-trained Bidirectional Encoder Representations from Transformers (BERT) base uncased model with four datasets to efficiently classify ham or spam emails in real-time scenarios. Different methods for the Enron, SpamAssassin, Lingspam, and Spamtext message classification datasets were used to train models individually, from which a single model was obtained with acceptable performance on the four datasets. The Universal Spam Detection Model (USDM) was trained with four datasets and leveraged hyperparameters from each model. The combined model was finetuned with the same hyperparameters from these four models separately. When each model is evaluated on its corresponding dataset, the F1-score is at or above 0.9 in the individual models. An overall accuracy reached 97%, with an F1 score of 0.96. Research results and implications are discussed.

1. Introduction

Spam email is defined by Blanzieri and Bryl [1] as unsolicited bulk messages, and it constituted 53.95 percent of email traffic in the year 2020 [2]. Spam messages create problems because of the internet's availability to most people worldwide. The widespread use of email in many companies results from its quick and easily accessible distribution of information to many people. Spammers access devices in phishing attacks by enticing users to click on the spam link [3]. Another type of threat is email spoofing, in which users believe that the message was sent by a person they know [4]. Spam detection tools and techniques are developed by companies so that users will get a better experience. Google's Gmail is one of the largest mail networks, claiming 99.9% success for its spam filtering technique [5][6]. Machine learning and deep learning techniques help identify spam emails automatically through trained models [7].

A new entrant to spam processing, transformers have showcased their deep learning impact on Natural Language Processing (NLP) applications [8]. A significant reduction in training time was achieved, for better efficiency. Previous methods like Recurrent Neural Networks, Long Short-Term Memory (LSTM) [9], and Gated Recurrent Units (GRU) [10] need to wait for the information from the previous time step. The data is processed sequentially as the model progresses, which makes it challenging to capture long-range dependencies without considering the previously used data points. Instead, transformers parallelize the computation and embed word position with position encoding. A multiheaded self-attention mechanism handles the inputs and their long-range dependencies [11]. Later, transformers were proposed as the base of many pre-trained models. Google's BERT (Bidirectional Encoder Representations from Transformers) model has become more popular because it shows higher efficiency in real-world applications and has a more straightforward structure. Transformers usually have an encoder-decoder architecture, whereas BERT uses only the encoder part, discarding the decoder part. BERT models were trained on a huge amount of data from Book Corpus and English Wikipedia [12]. These models usually produce two outputs. The first output is used for sequence prediction applications like named entity recognition and part-of-speech tagging. The other output is used for classification applications like sentiment analysis and fake news detection.

This research aimed to build an efficient universal spam detection model using the pre-trained BERT base uncased model. In this work, the researchers use the second output from Google's pre-trained BERT base uncased model, with the help of the Hugging Face Transformers library [13], to find whether the given mail is spam or not, since it is a classification problem. The takeaways from this work are three-fold:

By adding various layers over the output vector of length 768, the researchers analyzed how to find the sequence length, learning rate, and model architecture for each dataset. Firstly, the pre-trained BERT uncased base model was trained on four datasets separately. Secondly, the preprocessing of the hyperparameters, i.e., sequence length and learning rate, using the individually trained models speeds up the final model architecture selection and training process. Thirdly, the proposed universal model was evaluated with recall, precision, F1-score, and accuracy for different testing sizes and mini-batch sizes from five datasets. The proposed model is the first approach to detecting spam messages using multiple datasets to train a single model.

The article flows as follows: Section 2 illustrates the previous research, whereas Section 3 explains the methodology with detailed descriptions of the datasets and the modeling of the universal model. Section 4 illustrates the final model results, and Section 5 concludes the manuscript with a discussion of results, research implications, and future research.

2. Literature review

Many researchers have shown their effort in spam detection mechanisms. This literature review classifies them as machine learning, deep learning, and some combine-based/other approaches. Machine learning and deep learning are approaches to solving real-world problems like image classification and language processing. Machine learning approaches perform well on a small amount of data, whereas deep learning approaches require massive data to surpass the

URI: https://hdl.handle.net/10125/80263
978-0-9981331-5-7
(CC BY-NC-ND 4.0)
performance of machine learning approaches [13][14]. These two are addressed briefly in the following sections.

2.1. Machine Learning-based approaches

Machine learning means designing a machine to learn how to solve a particular task like regression and classification. Machine learning is given rules or instructions of algorithms to extract features from the data to solve the given task. In machine learning, the programmer should extract features manually.

Harisinghaney [16] tried to detect text and image-based spam emails with the help of the k-nearest neighbor (KNN) algorithm, Naïve Bayes, and reverse DBSCAN algorithms. The Enron dataset is preprocessed to extract the email text, using specific feature extraction techniques like Tesseract, before any algorithmic approach is applied to the data. The final performance is reported with the help of metrics such as precision, sensitivity, and accuracy. Results indicated that all algorithms performed notably well. Youn [17] proposed an ontology-based spam filtering approach and applied the J48 decision tree on the UCI email dataset. The extended work can be seen in [18], where Bahgat proposed an SVM classifier based on semantic feature selection on the Enron dataset, which showed 94% accuracy. WordNet ontology with some semantic-based methods, principal component analysis, and correlation feature selection methods were used to reduce the number of features to the maximum extent. However, results indicated that logistic regression performed very well. Laorden [19] developed a Word Sense Disambiguation preprocessing step before applying machine learning algorithms to detect spam data. Finally, results indicate a 2 to 6% increase in the precision score when applied on the Ling Spam and TREC datasets. George [20] showed that a KNN-based approach achieves higher accuracy when compared to a Feed Forward Neural Network. Khonji [21] made an extension for Lexical URL Analysis with the help of the random forest algorithm and applied it to spam detection on their own dataset. Jáñez-Martin [22] built a combined model of TF-IDF and SVM that showed a 95.39% F1-score, and the fastest spam classification was achieved with the help of the TF-IDF and NB approach. Alberto [23] explained deception detection using various machine learning algorithms with the help of neural networks, random forests, etc., and paved a path for a new research direction.

2.2. Deep Learning-based approaches

Deep learning mimics the human brain to solve the given task without human intervention [24]. Deep learning uses a neural network with multiple layers and many parameters. In deep learning, automatic extraction of features is accomplished by giving the architecture shape with some hyperparameters. Presently, massive data sets are becoming available to support deep learning approaches. As a result, deep learning techniques show promising results compared to machine learning counterparts in most aspects.

Faris [25] used a proposed Feed Forward Neural Network on the Spam Assassin dataset, in which the Krill Herd algorithm is used for feature extraction. Results indicated that using the Krill Herd algorithm for feature extraction showed better results when compared to other popular training algorithms like backpropagation and the genetic-based approach. Raj [26] used the Long Short Term Memory approach and showed an accuracy of about 97% over the Lingspam dataset. Rahman [27] used Bidirectional Long Short Term Memory and achieved 98% accuracy on the Lingspam and SPMDC datasets based on their separate implementations. Jie [28] discussed the unsupervised deep learning approach, which can be used for fake content detection in social media.

2.3. Combine based / other approaches

Faris [29] proposed a PSO-based Wrapper with a Random Forest algorithm that effectively detects spam messages. Ajaz [30] used a secure hash algorithm with the Naïve Bayes feature extraction method for spam filtering. Van Wanrooij [31] used an IP-based approach and showed their implementations with a better false-positive rate. Lin's [32] system identified spamming botnets with the Bloom filter, which yielded higher precision and values. Esquivel [33] developed IP reputation lists and constantly updated them to perform better than existing models. Regex [34] automatically detected spam/ham mails using regular expressions. This work showed significant performance with minimal computing resources. Marie-Saint [35] introduced the firefly algorithm with SVM and worked with Arabic text. This article concluded that the proposed method outperformed SVM alone. Later, in [36], Natarajan proposed Enhanced Cuckoo Search (ECS) for bloom filter optimization. Results showed that ECS outperformed normal Cuckoo search.

Until now, researchers have shown significant efforts to detect spam emails with the help of machine learning, deep learning, and combined approaches. Most of them tend to design their models based on a specific dataset, which might cause problems in real-world applications. Cross learning of data is necessary to achieve real-time efficiency through advanced deep learning models like transformers and techniques like transfer learning for Natural Language Processing applications. In this work, the researchers address this issue using four publicly available datasets and show significant performance in detecting spam messages. This work is carried out by designing the model using the output vector from Google's BERT model, first individually on the four datasets with acceptable performance. Then all datasets are combined and the same process is followed, which showed better performance. The basic terminology is provided in the next section for a better understanding of the methodology and modeling.

Jie [37] discussed a bilingual multi-type spam detection model using M-BERT, which used image-based spam detection and achieved an accuracy of about 96%. Lee's research [38] showed an accuracy rate of 87% using the CATBERT model proposed by Sophos AI, by collecting phishing emails. The BERT model was also used for other applications like fake news detection, lie detection, and sentiment analysis. Jie [39] also discussed unsupervised deep learning, which is suggested for fake content detection in social media. Further, Barsever [40] proposed a model with a new generative adversarial network to detect lies. In Man's research paper [41], he proposed a sentiment analysis algorithm based on BERT and Convolutional Neural Network with an accuracy rate of 90.5% and 85.2%, respectively.

2.3.1. Transformers
Sequential computation load reduction has been a major problem for NLP applications over a long time [42]. Despite many proposed solutions, NLP is still burdened by linear or logarithmic dependency as the sequence grows [43] [44]. Transformers have a simpler architecture without any convolutional [45] or recurrent layers [9]. The change in architecture reduced the problem to a constant number of operations by averaging attention-weighted positions, which is realized through Multi-Head Attention and positional embeddings [11]. Transformer models outperformed the existing models with less training cost. Many transformer-based models have been invented [46][47][48][49][50][12][51][52]. However, BERT and GPT-2 (Generative Pre-trained Transformer) became the most popular models among the released versions [12][51].

2.3.2. Self-attention
Self-attention is the mechanism used to determine the interdependence of tokens in the given input sequence. Self-attention encodes a token by taking information from other tokens. It relies on three weight matrices, which produce the query, key, and value vectors and are learned during the training process. Multiheaded self-attention is the extension of self-attention, consisting of multiple sets of the query, key, and value vectors built into transformers [11]. However, this entire process will be managed by the transformers library [11]. The number of heads in multiheaded self-attention is set according to the user requirements and can be considered a hyperparameter.

2.3.3. Transfer learning
Transfer learning applies the knowledge acquired by pre-trained models to a specific application based on user needs. Usually, these pre-trained models are trained on big datasets, so the model's weights contain a lot of information. By fine-tuning the pre-trained model and adding some layers over the pre-trained model's output, the researchers use the same weights from the base model.

2.3.4. Parameters and hyper-parameters
Model parameters are the weights and biases, over which the programmer has no direct control. Once the model is defined, their values are changed accordingly during training. On the other side, hyperparameters are given by the user according to the need. For models with better accuracy, hyperparameter selection plays a crucial role, and good values can be obtained by tuning 3].

2.3.5. Activation functions
Activation functions help determine the neural network's output by applying some non-linear function to the corresponding output of neurons [54]. The Rectified Linear Unit (ReLU) is applied to the neuron outputs at hidden layers [55]. The mathematical equation for ReLU is stated in equation 1:

$$ f(x) = \max(0, x) \tag{1} $$

where f(x) is the ReLU activation and x is the input to the function.

Another activation function that is used for multi-class classification problems is the softmax function. The mathematical equation of the softmax function can be stated in equation 2 [56]:

$$ f(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \tag{2} $$

where f(x_i) is the softmax activation, x_i is the i-th input to the function, and the index j runs over the classes.

An advanced version of softmax is the log softmax function, which applies a logarithm to the existing softmax function. The mathematical equation can be stated in equation 3 [57]:

$$ f(x_i) = \log\left(\frac{\exp(x_i)}{\sum_j \exp(x_j)}\right) \tag{3} $$

where f(x_i) is the log softmax activation, x_i is the i-th input to the function, and the index j runs over the classes.

2.3.6. Loss Function
The loss function is often considered the cost function used to evaluate the model performance for given weights [58]. Cross-entropy loss is usually used for measuring the performance of a classification-based model whose output probabilities lie between 0 and 1. Cross-entropy can be represented by equation 4:

$$ \text{Cross-entropy loss} = -\left(y \log(p) + (1 - y)\log(1 - p)\right) \tag{4} $$

where y is the true label (0 or 1) and p is the predicted probability of the class labeled 1.

3. Methodology

3.1. Datasets description

This project used four publicly available datasets, and these are processed such that only the content of the samples is used for training the model.

3.1.1. Ling-spam dataset
This dataset [59] consists of 2893 samples separated into two classes: 1) 481 spam messages and 2) 2412 ham messages. This dataset can be accessed from the Kaggle website and was prepared by modifying the Linguist List. The samples in the dataset focus mainly on job postings, software discussion, and research opportunities.

3.1.2. Spam text messages dataset
This dataset [60] consists of 5574 samples separated into two classes: 1) 724 spam messages and 2) 4,850 ham messages. The samples in this dataset were collected from the mobile phone spam research-related area. This dataset was also accessed from Kaggle and was prepared from the UCI Machine Learning Repository.

3.1.3. Enron dataset
This dataset [61] consists of 32,638 emails separated into two classes: 1) 16,544 spam mails and 2) 16,094 ham mails. This dataset can be considered as one of the standard benchmarks in spam classification. This dataset covers a wide range of samples from almost all available options.

3.1.4. Spam assassin
This dataset [62] consists of 6047 emails separated into 1) 1897 spam mails and 2) 4150 ham mails. This dataset can also be considered as one of the standard benchmarks in spam classification. This dataset has two classification

levels for ham messages, easy and hard, and one for spam messages. However, the unified model combines these two kinds of ham messages into one group.
This is a supervised learning model, so each sample is labeled 1 or 0 according to whether it is spam or not, and the model is then trained under certain conditions. Label encoding is done for all these samples, in which '0' denotes the ham type and '1' denotes the spam type. Along with the text data, these encoded labels are used for training the model and testing its performance.

tagging. The outputs against the unique tokens were discarded.
In this work, the researchers used the second approach, where the finetuning process is made by using linear layers, dropout layers [63], batch normalization layers [64], Rectified Linear Unit (ReLU) activations, and log softmax activation, with Xavier initialization [65] of weights, added at the end in the classifier part of the pre-trained model. The main reason for adding the dropout layer is to avoid overfitting, batch normalization is used to reduce internal covariate shift, and Xavier initialization helps the designed model converge faster.
3.2. Data preprocessing and Model Selection Process

The selection of a pre-trained model is essential for the task. Data preprocessing is considered an essential step for any natural language processing task. However, when using pre-trained models, some rules need to be followed according to the conditions under which the model was trained. The BERT model can be used for applications like generating text embeddings, text classification, named entity recognition, and question answering. The BERT model is well suited to classifying spam messages, and it is available in many versions. The base model suits the needs of this research, especially for spam detection. This version contains only 12 encoders with 110 million parameters, which are sufficient for our application.
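As a rough check on the 110 million figure, the parameter count of the base model can be reproduced with simple arithmetic. The sketch below assumes the published configuration of the BERT base uncased model (a vocabulary of 30,522 tokens, 512 positions, 2 segment types, hidden size 768, feed-forward size 3,072, 12 encoder layers) and also counts the pooling layer; apart from the 12 encoders and the vector length 768, these values are not stated in this paper, and the function name is only illustrative.

```python
def bert_base_parameter_count(
    vocab_size=30522,    # assumed: vocabulary size of the base uncased model
    max_positions=512,   # assumed: number of position embeddings
    type_vocab=2,        # assumed: segment (token type) embeddings
    hidden=768,          # output vector length used in this paper
    intermediate=3072,   # assumed: inner size of the feed-forward network
    num_layers=12,       # 12 encoders, as stated above
):
    """Count weights and biases of a BERT-style encoder stack."""
    layer_norm = 2 * hidden  # one scale and one bias per hidden unit
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias vector.
    attention = 4 * (hidden * hidden + hidden)
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    encoder_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden  # dense layer over the [CLS] output
    return embeddings + num_layers * encoder_layer + pooler


total = bert_base_parameter_count()
print(f"{total:,} parameters (about {total / 1e6:.0f} million)")
# prints: 109,482,240 parameters (about 109 million)
```

Under these assumptions the encoder layers account for roughly 85 million of the parameters, and the token embeddings for most of the rest, which is consistent with the rounded figure of 110 million.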
Unlike transformers, BERT uses only the encoder unit, and the decoder part is discarded, as the name suggests and as shown in Fig. 1. Each encoder consists of the same layers as its transformer counterpart, namely Self-Attention and Feed Forward Neural Networks, as shown in Fig. 2. Hence, BERT is considered a language-based model rather than a sequence-to-sequence-based model. Bidirectional means that the input sequence is processed from both directions, so that the model can learn from both directions to predict a word in its context with better efficiency. This model was trained on Wikipedia's unlabelled text corpus (2,500 million words) and the book corpus (800 million words). The word representations obtained from the intermediate layers through different weights after training will be helpful for our application in detecting whether the given input sample is spam or ham. At the end of the model, the classifier is designed to perform better by adding some neural network layers.

Figure 1. Architecture of BERT
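The bidirectional behaviour comes from the self-attention sublayer inside each encoder: every token's output is a weighted average over all positions, both to its left and to its right. A minimal single-head sketch in plain Python follows. It assumes identity query, key, and value projections (a trained model learns separate weight matrices for these) and uses made-up 2-dimensional token vectors purely for illustration, so it is not the implementation used in this work.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the token vectors themselves
    (identity projections). Returns (outputs, attention_weights).
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Score the query against every position. No mask is applied, so
        # earlier and later tokens both contribute (bidirectional context).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        row = softmax(scores)
        weights.append(row)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        )
    return outputs, weights


# Hypothetical 2-dimensional embeddings for a three-token input.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for position, row in enumerate(weights):
    print(position, [round(w, 3) for w in row])
```

Each printed row sums to 1, and the middle token places non-zero weight on both its left and its right neighbour, which is the sense in which the encoder reads the sequence from both directions.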
Figure 2. Encoder internal structure [11]

The input sequence is fed directly into the tokenizer, as the BERT model requires no preprocessing steps. However, some preprocessing steps helped to reduce the sequence length, which is considered one of the hyperparameters of the model. The tokenizer handles the input sequence and performs certain operations on the input data. These operations include tokenizing, contextual and positional encoding, padding, adding unique tokens like [CLS], [SEP], and [PAD], and finally converting the tokenized data into integer sequences. The [CLS] and [SEP] tokens are placed at the start and end of the sequence, respectively. The tokenizer implementation can be used directly with the help of the Hugging Face Transformers library. The output of the architecture consists of two parts. The first output is used for text classification, in which the outputs against the [CLS] tokens are considered. This classifier usually consists of a linear layer and a log softmax function. The remaining outputs are used for sequence prediction applications like named entity recognition, parts of speech

There are two ways to use BERT in our applications. The first approach is to train the whole model using the pre-trained weights as initial weights, which requires massive data samples and more computational resources, as all the weights are updated after each step. In the second approach, none of the pre-trained weights are updated, which requires fewer data samples and fewer computational resources. This approach was used.
Hyperparameter tuning is a crucial step for the model to give better performance. Hyperparameters in our proposed model include the sequence length, number of layers, number of neurons, selection of optimizer, learning rate, number of epochs, minibatch size, and selection of layers. The final model should have the best performance by tuning these hyperparameters. This can be done by changing the sequence

length, varying the number of neurons in the layers, adding layers, and changing the learning rate of the optimizer until the accuracy of the model increases. Here the Adam optimizer is used for updating the weights in the training process, as it has several advantages like computational efficiency and less memory usage with faster training time [66]. In the proposed work, the four datasets present different distributions of samples for the ham and spam classes. Except for the Enron dataset, the datasets do not have equal numbers of spam and ham samples. Designing the model architecture is challenging, as the samples from different datasets have different sequence lengths and different sample distributions. To avoid the problems of biased models, models were first trained separately on the four datasets to analyze the suitable conditions with acceptable performance. Training the standard model from the individual datasets is a crucial step because the Enron dataset has more samples, making the model biased to some extent. While training on the individual datasets, it is tough to find a standard architecture that performs well for all four datasets.

3.3. Final modeling
The finalized model was obtained by hyperparameter tuning. It has three fully connected linear layers with batch normalization layers, dropout layers, and some activation functions, which can be seen in Fig. 3. In the finalized model, the input is taken from the output of the [CLS] token side to increase the detection of spam messages. This finalized model has an input vector of length 768, which is passed to a linear layer that contains 175 neurons. This linear layer accepts an input vector of length 768 and produces an output vector of length 175, hence the shape (768, 175). After this, a dropout layer with factor 0.1 is placed over the linear layer to ignore 10% of the neuron outputs from the linear layer, which reduces the overfitting of the model. The batch normalization layer is used to make the training faster by reducing generalization error. The ReLU activation function is applied to the batch normalized outputs. Again, a dropout layer with factor 0.1 is placed at the output of the batch normalization.

The combined model performed the best, with precision and recall values close to their maximum values. The combination minimized the false positives and false negatives. A false positive is a ham message classified as a spam message, whereas a false negative is a spam message classified as a ham message. Adding the dropout layer before and after the batch normalization layer could avoid the differences between false positives and false negatives in the model trained on the combined dataset. The dropout layers produced higher precision and recall values with better accuracy and F1-score. Accuracy and F1-score are the metrics for evaluating the model performance. Accuracy determines how good the model is at classifying the sampled data into their corresponding classes. The F1-score metric takes the distribution of data samples into account.

The exact process is repeated one more time, and then a linear layer with shape (100, 2) is placed with log softmax activation to classify whether the input is ham or spam. If the output at the end is '0', the input sample is considered ham; if it is '1', the input sample is considered spam. After hyperparameter tuning, this finalized model showed better performance compared to other model implementations. The selection of these layers and neurons is based on hyperparameter tuning. The Adam optimizer [66] is used here to update the weights, as it helps the model converge faster. The learning rate for this optimizer is 3e-4, which showed promising results when compared to other values. The loss function used for this model is a cross-entropy loss.

Figure 3. Final model architecture for classifier

When tuning the hyperparameters, setting the individual models to the same sequence length of 40 caused the Lingspam dataset model to show a drastic decrease in F1-score, to 0.7. If a sample contains more words than the sequence length of the pre-trained BERT model, the excess is discarded, so the maximum useful content in the sample was retained by deleting the words in samples that were three letters or shorter. Since our purpose of using the model is to detect spam messages, we deleted the words with a string length of three or less, which helped in reducing the sequence length. After this preprocessing step, the modified sample is fed as input to the tokenizer. Reducing the length of the samples helped improve the F1-score of the Lingspam dataset without much effect on the performance of the other models, which showed an F1-score above 0.9. After the model architecture is designed with the same hyperparameters for the different datasets, the datasets are combined and trained under the same conditions. The finalized model architecture shown in Fig. 3 resulted in the highest F1-score, at 0.96. The hyperparameters of this finalized model were then varied further with different mini-batch sizes, train-valid-test data distributions, and numbers of epochs, as further explained in the Hugging Face Transformers library. This library is a collection of transformer implementations that researchers can use directly for their projects. These implementations are compatible with TensorFlow and PyTorch APIs [67]. For this project, Google's BERT base uncased model from this library is used with PyTorch. To further improve the performance, gradient clipping, set to 1.0 [68], was used to protect the model from the exploding gradient problem, together with a model checkpointing process, which makes the model save the weights

corresponding to the less validation loss. The PyTorch The formula for defining precision is defined in equation 7
framework was used for training the model, which has [71]:
predefined data loaders to help create batches, shuffle and load
the data in parallel using multiprocessing [69][70].

4. Results

This section discusses the performance metrics and the results obtained from the finalized model discussed in section Ⅳ. The minibatch size was varied from 16 to 1024, the train-valid-test distributions were 60:20:20, 70:15:15, and 80:10:10, and the number of epochs was set to 200. Several performance metrics were considered while evaluating the proposed model; they are explained below.

Table 1. Confusion Matrix

Real Result    Test Result (Predicted)
               HAM     SPAM
HAM            TN      FP
SPAM           FN      TP

4.1. Accuracy

The confusion matrix in Table 1 helps to visualize the model's performance. It consists of four elements, defined as follows:

a) TN: True Negative, in which a ham sample is predicted as ham.
b) TP: True Positive, in which a spam sample is predicted as spam.
c) FP: False Positive, in which a ham sample is predicted as spam.
d) FN: False Negative, in which a spam sample is predicted as ham.

Accuracy measures how well the model performed overall: it is the number of correct predictions divided by the number of all predictions. The sklearn library was used to compute this metric. Accuracy is defined in equation 5 [71]:

$$\text{Accuracy} = \frac{TN + TP}{TP + FN + FP + TN} \tag{5}$$

4.2. Recall

Recall is the fraction of all spam samples in the dataset that are correctly predicted as spam. Recall is defined in equation 6 [71]:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$

4.3. Precision

Precision is the fraction of samples predicted as spam that are actually spam. In other words, how many samples were correctly predicted as spam out of the total number of samples predicted positive? Precision is defined in equation 7:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{7}$$

4.4. F1-Score

The F1-score is defined as the harmonic mean of the precision and recall values. The F1-score is defined in equation 8 [71]:

$$\text{F1-score} = \frac{2 \times (\text{precision} \times \text{recall})}{\text{precision} + \text{recall}} \tag{8}$$

To analyze the finalized version of the model, different batch sizes, the number of epochs, and different train-valid-test data distributions were considered. Batch size is the number of input samples fed to the model at a time. The number of epochs is the number of times the same training samples are fed to the model during the training phase. The train-valid-test data distribution divides the given dataset into three parts: training, validation, and testing. In this work, 60:20:20, 70:15:15, and 80:10:10 train-valid-test distributions were used. The training part is used to train the model. The validation part is used to check whether the model is learning properly during the training phase. The testing part is used in the final step to analyze the performance of the trained model.

Table 2. F1 and accuracy values for the different datasets

Dataset        Minibatch size   Distribution   Highest F1-score   Accuracy
SpamAssassin   128              70:15:15       0.9764             0.98
Enron          128              70:15:15       0.9720             0.97
LingSpam       512              80:10:10       0.9400             0.98
SpamText       128              80:10:10       0.9396             0.98
Combined       128              70:15:15       0.9608             0.97

Table 3. Corresponding precision and recall values

Dataset        Minibatch size   Distribution   Precision   Recall
SpamAssassin   128              70:15:15       0.96        0.99
Enron          128              70:15:15       0.96        0.98
LingSpam       512              80:10:10       0.90        0.98
SpamText       128              80:10:10       0.95        0.93
Combined       128              70:15:15       0.95        0.97

Figure 4. Accuracy: F1 values with precision and recall values [chart; series: Highest F1-score, Precision, Recall, Highest Accuracy; vertical axis from 0.85 to 1]

The finalized model was first evaluated on each of the four datasets separately.
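As an illustration, the four metrics of equations 5 to 8 can be computed directly from the confusion-matrix counts of Table 1. The sketch below is a minimal pure-Python version with spam as the positive class. The class name `ConfusionCounts`, the function names, and the example counts are hypothetical and chosen only for illustration; the paper itself reports using the sklearn library for these computations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts, with spam as the positive class."""
    tn: int  # ham predicted as ham
    fp: int  # ham predicted as spam
    fn: int  # spam predicted as ham
    tp: int  # spam predicted as spam


def accuracy(c: ConfusionCounts) -> float:
    # Equation 5: correct predictions over all predictions.
    return (c.tn + c.tp) / (c.tp + c.fn + c.fp + c.tn)


def recall(c: ConfusionCounts) -> float:
    # Equation 6: correctly predicted spam over all real spam.
    return c.tp / (c.tp + c.fn)


def precision(c: ConfusionCounts) -> float:
    # Equation 7: correctly predicted spam over all samples predicted as spam.
    return c.tp / (c.tp + c.fp)


def f1_score(c: ConfusionCounts) -> float:
    # Equation 8: harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * (p * r) / (p + r)


# Hypothetical counts for illustration only; these are not results from the paper.
# Zero denominators (e.g. no predicted spam) are not handled in this sketch.
counts = ConfusionCounts(tn=80, fp=5, fn=3, tp=12)
metrics = {
    "accuracy": accuracy(counts),
    "recall": recall(counts),
    "precision": precision(counts),
    "f1": f1_score(counts),
}
print(metrics)
```

Because the F1-score is a harmonic mean, it sits closer to the lower of precision and recall, which is why it is a stricter summary than accuracy when ham and spam are imbalanced.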

Once acceptable performance was reached on the individual datasets, the same process was repeated on the combination of all four datasets. The number of epochs was set to 200, at which the models for all four individual datasets showed acceptable performance. Over these 200 epochs, the minibatch size was varied from 16 to 1024. After model evaluation, the best cases for the individual and combined datasets are shown in Tables 2 and 3. Finally, the highest accuracy, F1-score, and corresponding precision and recall values for the four individual datasets and the combined dataset are visualized in Fig. 4. The results indicated that a minibatch size of 128 was the best fit for four of the five models. Using the hyperparameters from the individually trained models, the combined dataset achieved 97% accuracy with an F1-score of 0.96. Although the LingSpam dataset showed higher accuracy at minibatch size 512, the combined dataset performed better at a minibatch size of 128.

5. Conclusion and Future work

Previous research reported results on individual datasets using several techniques. Specifically, those models were trained and evaluated for accuracy on individual datasets; however, their performance might vary if they were replicated on other datasets. This manuscript addressed this issue by combining all datasets when training the model for better accuracy. The combined model outperformed those models trained on individual datasets.

In this manuscript, the USDM is proposed to resolve the differing results obtained on individual datasets. It can be helpful for real-time spam classification, since the USDM uses BERT with a combination of different datasets as input to the designed model. Based on the models individually trained on multiple datasets, and with dropout layers added above and below the batch normalization layers, the USDM performed better in terms of F1 score, with acceptable precision and recall values. The designed model achieved 97% accuracy at an F1 score of 0.96. As new transformer models are released frequently, a better model may be used for better spam detection with less training time.

Another problem worth mentioning is overfitted models in real-time classification. A model trained on a single dataset may not generalize to samples from other sources. With more data samples added, the model performs better than the overfitted model.

This manuscript can be extended to various applications, e.g., fake news detection on social media platforms and inappropriate content filtering from online sources. Deep learning models achieve higher accuracy when fed more data samples. Deep learning research focused mainly on a specific dataset might not have a desirable outcome. This manuscript presented the first attempt at combining datasets using a deep learning approach to provide a better model. Further research on combining multiple datasets is encouraged to further validate the USDM.

6. References

[1] E. Blanzieri and A. Bryl, "A survey of learning-based techniques of email spam filtering," Artif. Intell. Rev., 2008.
[2] "Spam: share of global email traffic 2014-2020," https://www.statista.com/statistics/420391/spam-email-traffic-share/.
[3] W. Feng, J. Sun, L. Zhang, C. Cao, and Q. Yang, "A support vector machine based naive Bayes algorithm for spam filtering," in 2016 IEEE 35th International Performance Computing and Communications Conference, IPCCC 2016.
[4] K. Pandove, A. Jindal, and R. Kumar, "Email Spoofing," Int. J. Comput. Appl., 2010.
[5] "The Most Popular Email Providers in the U.S.A." [Online]. Available: https://blog.shuttlecloud.com/the-most-popular-email-providers-in-the-u-s-a/.
[6] "Google Says Its AI Catches 99.9 Percent of Gmail Spam." [Online]. Available: https://www.wired.com/2015/07/google-says-ai-catches-99-9-percent-gmail-spam/.
[7] E. G. Dada, J. S. Bassi, H. Chiroma, S. M. Abdulhamid, A. O. Adetunmbi, and O. E. Ajibuwa, "Machine learning for email spam filtering: review, approaches and open research problems," Heliyon, 2019.
[8] T. Wolf et al., "Transformers: State-of-the-art natural language processing," arXiv. 2019.
[9] A. Sherstinsky, "Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) network," Phys. D Nonlinear Phenom., 2020.
[10] J. Chung, "Gated Recurrent Neural Networks on Sequence Modeling arXiv:1412.3555v1 [cs.NE] 11 Dec 2014," Int. Conf. Mach. Learn., 2015.
[11] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.
[12] J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 2019.
[13] A. M. R. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, "HuggingFace's Transformers: State-of-the-art Natural Language Processing," arXiv:1910.
[14] M. Copeland, "What's the Difference Between Artificial Intelligence, Machine Learning and Deep Learning?" 2016. [Online]. Available: https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/. [Accessed: 02-Aug-2021].
[15] A. Ng, "Deep Learning." [Online]. Available: http://cs229.stanford.edu/materials/CS229-DeepLearning.pdf. [Accessed: 02-Aug-2021].
[16] A. A. A. Harisinghaney, A. Dixit, S. Gupta, "Text and image based spam email classification using knn, naïve bayes and reverse dbscan algorithm," Int. Conf. Reliab. Inf. Technol. (ICROIT), pp. 153–155, 2014.

[17] S. Youn and D. McLeod, "Efficient spam email filtering using adaptive ontology," in Proceedings - International Conference on Information Technology-New Generations, ITNG 2007.
[18] E. M. Bahgat, S. Rady, W. Gad, and I. F. Moawad, "Efficient email classification approach based on semantic methods," Ain Shams Eng. J., 2018.
[19] C. Laorden, I. Santos, B. Sanz, G. Alvarez, and P. G. Bringas, "Word sense disambiguation for spam filtering," Electron. Commer. Res. Appl., 2012.
[20] N. G. M. J. and L. E. George, "Methodologies to Detect Phishing Emails," Sch. Press. 2013.
[21] M. Khonji, Y. Iraqi, and A. Jones, "Lexical URL analysis for discriminating phishing and legitimate websites," in ACM International Conference Proceeding Series, 2011.
[22] F. Jáñez-Martino, E. Fidalgo, S. González-Martínez, and J. Velasco-Mata, "Classification of spam emails through hierarchical clustering and supervised learning," arXiv. 2020.
[23] A. A. Ceballos Delgado, W. Glisson, N. Shashidhar, J. Mcdonald, G. Grispos, and R. Benton, "Deception Detection Using Machine Learning," in Proceedings of the 54th Hawaii International Conference on System Sciences, 2021.
[24] A. C. Ian Goodfellow, Yoshua Bengio, Deep Learning. MIT Press, 2015.
[25] H. Faris, I. Aljarah, and J. Alqatawna, "Optimizing Feedforward neural networks using Krill Herd algorithm for E-mail spam detection," in 2015 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies, AEECT 2015.
[26] H. Raj, Y. Weihong, S. K. Banbhrani, and S. P. Dino, "LSTM based short message service (SMS) modeling for spam classification," in ACM International Conference Proceeding Series, 2018.
[27] S. E. Rahman and S. Ullah, "Email Spam Detection using Bidirectional Long Short Term Memory with Convolutional Neural Network," in 2020 IEEE Region 10 Symposium, TENSYMP 2020.
[28] J. Tao, X. Fang, and L. Zhou, "Unsupervised Deep Learning for Fake Content Detection in Social Media," in Proceedings of the 54th Hawaii International Conference on System Sciences, 2021.
[29] H. Faris, I. Aljarah, and B. Al-Shboul, "A hybrid approach based on particle swarm optimization and random forests for e-mail spam filtering," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture
[30] V. S. S. Ajaz, M. T. Nafis, "Spam mail detection using hybrid secure hash-based naive classifier," Int. J. Adv. Comput. Sci. vol. 8, pp. 1195–1199, 2017.
[31] W. Van Wanrooij and A. Pras, "Filtering spam from bad neighborhoods," Int. J. Netw. Manag., 2010.
[32] P. C. Lin, P. H. Lin, P. R. Chiou, and C. T. Liu, "Detecting spamming activities by network monitoring with Bloom filters," in International Conference on Advanced Communication Technology, ICACT, 2013.
[33] H. Esquivel, A. Akella, and T. Mori, "On the effectiveness of IP reputation for spam filtering," in 2010 2nd International Conference on Communication Systems and Networks, COMSNETS 2010.
[34] D. Ruano-Ordás, F. Fdez-Riverola, and J. R. Méndez, "Using evolutionary computation for discovering spam patterns from e-mail samples," Inf. Process. Manag., 2018.
[35] S. Larabi Marie-Sainte and N. Alalyani, "Firefly Algorithm based Feature Selection for Arabic Text Classification," J. King Saud Univ. - Comput. Inf. Sci., 2020.
[36] A. Natarajan and S. Subramanian, "Bloom filter optimization using Cuckoo Search," in 2012 International Conference on Computer Communication and Informatics, ICCCI 2012.
[37] J. Cao and C. Lai, "A bilingual multi-type spam detection model based on M-BERT," in 2020 IEEE Global Communications Conference, GLOBECOM 2020 - Proceedings, 2020.
[38] R. H. Younghoo Lee, Joshua Saxe, "CATBERT: Context-Aware Tiny BERT for Detecting Social Engineering Emails," arXiv, 2020.
[39] J. Tao, X. Fang, and L. Zhou, "Unsupervised Deep Learning for Fake Content Detection in Social Media," in Proceedings of the 54th Hawaii International Conference on System Sciences, 2021.
[40] D. Barsever, S. Singh, and E. Neftci, "Building a Better Lie Detector with BERT: The Difference between Truth and Lies," in Proceedings of the International Joint Conference on Neural Networks, 2020.
[41] R. Man and K. Lin, "Sentiment analysis algorithm based on bert and convolutional neural network," in Proceedings of IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers, IPEC 2021, 2021.
[42] J. F. Kolen and S. C. Kremer, "Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies," in A Field Guide to Dynamical Recurrent Networks, 2010.
[43] Ł. Kaiser and S. Bengio, "Can active memory replace attention?," in Advances in Neural Information Processing Systems, 2016.
[44] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," in 34th International Conference on Machine Learning, ICML 2017.
[45] S. Albawi, T. A. Mohammed, and S. Al-Zawi, "Understanding of a convolutional neural network," in Proceedings of 2017 International Conference on Engineering and Technology, ICET 2017.
[46] W. Liu et al., "K-BERT: Enabling language representation with knowledge graph," arXiv. 2019.
[47] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," arXiv. 2019.
[48] P. He, X. Liu, J. Gao, and W. Chen, "DeBERTa: Decoding-enhanced BERT with Disentangled Attention," arXiv. 2020.
[49] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," in ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, 2020.

[50] N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher, "CTRL: A conditional transformer language model for controllable generation," arXiv. 2019.
[51] Radford Alec, Wu Jeffrey, Child Rewon, Luan David, Amodei Dario, and Sutskever Ilya, "Language Models are Unsupervised Multitask Learners | Enhanced Reader," OpenAI Blog, 2019.
[52] W. Su et al., "VL-BERT: Pre-training of generic visual-linguistic representations," arXiv. 2019.
[53] J. Brownlee, "What is the Difference Between a Parameter and a Hyperparameter?" 2017. [Online]. Available: https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/.
[54] C. E. Nwankpa, W. Ijomah, A. Gachagan, and S. Marshall, "Activation functions: Comparison of trends in practice and research for deep learning," arXiv. 2018.
[55] A. F. M. Agarap, "Deep Learning using Rectified Linear Units (ReLU)," arXiv. 2018.
[56] P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for activation functions," arXiv. 2017.
[57] A. De Brébisson and P. Vincent, "An exploration of softmax alternatives belonging to the spherical loss family," in 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings, 2016.
[58] K. Janocha and W. M. Czarnecki, "On loss functions for deep neural networks in classification," Schedae Informaticae, 2016.
[59] I. Androutsopoulos, "Ling-Spam," 2000.
[60] "SMS spam collection dataset," 2016. [Online]. Available: https://www.kaggle.com/uciml/sms-spam-collection-dataset.
[61] I. Androutsopoulos, V. Metsis, and G. P., "The Enron-Spam Datasets," 2006. [Online]. Available: //www2.aueb.gr/users/ion/data/enron-spam/.
[62] "Index of /Old/Publiccorpus," 2002. [Online]. Available: https://spamassassin.apache.org/old/publiccorpus.
[63] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., 2014.
[64] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in 32nd International Conference on Machine Learning, ICML 2015.
[65] X. Glorot and Y. Bengio, "Xavier Initialization," J. Mach. Learn. Res., 2010.
[66] O. Konur, "Adam Optimizer," Energy Education Science and Technology Part B: Social and Educational Studies. 2013.
[67] "HuggingFace Transformers Library." [Online]. Available: https://huggingface.co/transformers/quicktour.html.
[68] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in 30th International Conference on Machine Learning, ICML 2013.
[69] N. S. Murthy, "Datasets and Dataloaders in PyTorch." [Online]. Available: https://medium.com/analytics-vidhya/datasets-and-dataloaders-in-pytorch-b1066892b759.
[70] "Pytorch API." [Online]. Available: https://pytorch.org/.
[71] "Metrics and scoring: quantifying the quality of predictions." [Online]. Available: https://scikit-learn.org/stable/modules/model_evaluation.html.
