
Spam-T5: Benchmarking Large Language Models for Few-Shot Email Spam Detection

Maxime Labonne* and Sean Moran

JPMorgan Chase
{maxime.labonne,sean.j.moran}@jpmchase.com
arXiv:2304.01238v3 [cs.CL] 7 May 2023

Abstract. This paper investigates the effectiveness of large language models (LLMs) in email spam detection by comparing prominent models from three distinct families: BERT-like, Sentence Transformers, and Seq2Seq. Additionally, we examine well-established machine learning techniques for spam detection, such as Naïve Bayes and LightGBM, as baseline methods. We assess the performance of these models across four public datasets, utilizing different numbers of training samples (full training set and few-shot settings). Our findings reveal that, in the majority of cases, LLMs surpass the performance of the popular baseline techniques, particularly in few-shot scenarios. This adaptability renders LLMs uniquely suited to spam detection tasks, where labeled samples are limited in number and models require frequent updates. Additionally, we introduce Spam-T5, a Flan-T5 model that has been specifically adapted and fine-tuned for the purpose of detecting email spam. Our results demonstrate that Spam-T5 surpasses baseline models and other LLMs in the majority of scenarios, particularly when there are a limited number of training samples available. Our code is publicly available at https://github.com/jpmorganchase/llm-email-spam-detection.

Keywords: Spam detection · Large language models · Few-shot learning.

1 Introduction

Email communication continues to be an essential part of our daily lives, facilitating efficient asynchronous communication globally for personal and business users alike. Given this prominence, email is also a prime target for fraudulent and malicious activities such as spam and phishing attacks. Spam email can cause a multitude of problems, ranging from user inconvenience arising from unsolicited communications, to overload of computational resources on servers and security compromise arising from fraudulent links and malware attachments in the emails that are designed to attack personal and business security. It is estimated that in 2022, almost 49% of emails sent over the internet were spam¹, highlighting the continued prevalence of the problem and the need to explore ever more sophisticated machine learning techniques to reduce the volume. Techniques for automatically detecting and filtering out spam emails are critical for enabling usability of personal and business email services and continue to attract significant research interest [10].

* Corresponding author
¹ https://securelist.com/spam-phishing-scam-report-2022/108692/
The detection of spam emails presents several challenges, which we catego-
rize as data imbalance, data distribution shift and adversarial drift. Despite the
prevalence of email spam on the internet, one of the main obstacles in train-
ing spam detection models is the rarity of labeled datasets of fraudulent emails,
which makes it difficult to obtain a representative sample for training effec-
tive machine learning models. This problem is further exacerbated for private,
company-internal applications of email spam filtering, where there can be even
less labeled data available due to the nature of the task (e.g. filtering out fraud
and phishing attacks for specific user-groups such as high net worth individuals).
This leads to the problem of imbalanced learning, where the model may simply
not have enough fraudulent samples to learn from, resulting in poor detection
performance.
In addition, email communication is constantly evolving through time, pro-
viding an additional challenge. The changing nature of spam emails due to evo-
lution and techniques used by spammers can lead to data distribution shifts in
the dataset [20,38,41,44]. The problem of data shift arises when the distribution
of the training and test data is not the same, which can occur in real-world
scenarios. For example, at a certain point in time, the word “Casino” could be
indicative of spam, but the relative importance of this word might change (drift)
through time. This violates the fundamental assumption of supervised learning
and can cause classification models to fail to generalize in the deployment envi-
ronment over time. Continual refresh of such models with recent representative
data from the domain is critical.
Furthermore, the environment is highly adversarial and characterized by a
constantly developing arms race between spammers and email service owners,
mainly driven by the lucrative gains that can be made through successful fraud
and phishing attempts [20]. Attackers have effectively created a cottage industry
that constantly devises new and clever ways to bypass spam filters, resulting in
an adversarial drift. They focus their efforts on outmatching the textual filters by
perturbing the data extracted from the email body and legible headers. Common
strategies involve obfuscation techniques, disguising the content of the email, or
manipulating the header information.
To address these significant challenges, we claim that a promising approach
is to use few-shot learning [48] to train classifiers that can detect fraudulent
emails with limited samples. By using this approach, we can reduce the need for
large labeled datasets and build classifiers that better generalize to unseen and
constantly evolving data.
In this paper, we evaluate the performance of Large Language Models (LLMs)
for sequence classification in a few-shot learning setting, compared to traditional
machine learning techniques. Few-shot learning is ideal for the spam detection
task in which the prevalence of the anomalous class (spam) is much less than the
normal class (ham). Our main contribution is the development of a benchmark
for traditional machine learning algorithms and LLMs on the four most popular
datasets in spam detection. We evaluate the performance of these models in
both a traditional supervised learning setting and a few-shot learning setting.
Furthermore, we introduce a novel model, Spam-T5, which is a fine-tuned version
of Flan-T5 specifically designed for email spam detection. Our findings show
that Spam-T5 outperforms all other models on this benchmark. To facilitate
further research in this area, we make our code and data publicly available at
https://github.com/jpmorganchase/llm-email-spam-detection.

2 Related Work
2.1 Spam Detection using Machine Learning
The email spam detection task has been well explored in the literature [10] and
is commonly used in undergraduate machine learning courses as an introductory
use-case for study and learning. Despite the familiarity of the email spam de-
tection task, it continues to provide real challenges to practitioners due to data
distribution drift, the adversarial nature of the environment in which the task
is embedded, and class imbalance. Significant research effort has been expended
on this task, and sustained research is necessary to keep ahead of ever-evolving
data and the changing landscape of spamming techniques.
The email spam detection task is framed as the development of an effective
automated algorithm for differentiating spam versus ham (non-spam) emails.
Spam detection methods have been classified in terms of rules-based, collabora-
tive, and content-based approaches [25]. Rules-based approaches include check-
ing incoming email addresses against white lists of allowed email addresses, black
lists of commonly used spam email sources, and hand-crafted rules (e.g. that look
for empty To: fields or a certain combination of keywords). Collaborative ap-
proaches compute, for example, a hash function on the content of example spam
which is shared with the community and the hash function compared to new
emails to detect spam [5, 39, 50]. These simple methods are unable to generalize
well and are brittle [9], which is why content-based methods involving machine
learning have been explored [4].
For content-based methods, early methods explored conventional machine
learning methods and variations thereof, including Naïve Bayes classifiers [35],
KNN [12], Support Vector Machines (SVMs) [8, 15, 43, 46], and Boosting trees
(e.g. XGBoost) [6]. The task is frequently formulated as a binary classification
problem where a classifier is trained on a representative dataset of spam and
ham (non-spam) emails, and learns to classify the data points into these two
classes. Performance is measured on generalization to new email data received,
for example, on standard benchmark datasets [2, 26, 37] or in a live production
operating environment for industrial use-cases. Studies have also been conducted
on the effectiveness of various hand-crafted feature representations for input into
the machine learning classifiers [25]. Other work has sought to tackle the concept
drift problem, which is a defining aspect of the spam detection task [38, 41],
and also to improve conventional classification techniques for the task [3, 18].
Generally speaking, conventional machine learning methods are faster to train
and evaluate compared to deep neural networks, but have less modeling capacity
to capture the complex patterns inherent in the data.

2.2 Spam Detection using Large Language Models (LLMs)

The field of machine learning has been revolutionized with the emergence of
Large Language Models (LLMs), a powerful suite of deep learning methods that
have exceptional abilities in understanding and learning from the structural,
relational, and semantic patterns inherent in natural language datasets [24].
LLMs are extreme multi-taskers, able to replace most bespoke models for natural
language tasks. For example they are very capable at text generation, sentiment
detection, question answering, and text summarization. The release of ChatGPT
and more recently GPT4 [28] to the public made waves around the world that
continue to reverberate given the naturalness and human-like response from the
model². Despite the popularity and importance of email spam detection, there is little prior work that explores the benefits of LLMs for the task. We address the gap in this paper.

² https://openai.com/blog/chatgpt, https://openai.com/research/gpt-4
Given the robust popularity of the field, the literature on LLMs is large and varied, with new advances occurring rapidly and on a daily basis. We focus on the suite of benchmark models that are commonly used as baselines in academic papers and in operational systems in industry, including RoBERTa [24],
SetFit [45], and Flan-T5 [49]. These benchmark models have the advantage of
publicly released code and weights, enabling comparison and evaluation on new
tasks. Underlying almost all of the recent innovations in the field is the semi-
nal Transformer architecture introduced by Google Research in 2017 [47]. The
Transformer architecture dispenses with recurrent and convolutional layers, ad-
vocating the primary use of attention mechanisms for learning, which prove
to be massively more scalable to big data. This technology was incorporated in
the several generations (GPT-n) of Generative Pre-Trained Transformers (GPT)
models [30], to spectacular effect across many natural language understanding
tasks [28]. Aside from the older GPT-2 model [31], the recent suite of GPT models is closed-source, with weights hidden behind an API, making comparison impossible.
Among the open-source and widely available models are RoBERTa, SetFit
and Flan-T5. BERT (Bidirectional Encoder Representations from Transform-
ers) [14] addresses the issue of existing models only learning from previous to-
kens in text, creating more powerful (bi-directional) Transformer representations
that gain from knowledge from an extended context. BERT has subsequently
developed into a performant off-the-shelf model for many tasks. A subsequent,
and popular evolution of BERT is embodied in the RoBERTa model [24] that
improves BERT through a refined training methodology involving larger mini-
batches and learning rates. SetFit (Sentence Transformer Fine Tuning) [45] is a recently proposed learning framework for efficiently (using orders of magnitude fewer parameters) fine-tuning sentence transformer models [34]. Finally, and in
the same spirit as SetFit, Flan-T5 [7] is an improved model fine-tuning approach
that leverages datasets spanning a larger number of tasks framed as instructions, as
applied to the T5 LLM [32]. The Flan fine-tuning methodology leads to signifi-
cant gains on standard natural language understanding tasks. For the first time
in the literature, we explore RoBERTa, SetFit, Flan-T5 and compare each to
conventional models in this paper for the email spam detection task.
In terms of spam detection, some authors have applied early deep models
to the task, including LSTM [17] and BERT architectures [11], although the
literature currently is very sparse. We contribute to the field by exploring more
recent LLM architectures in the few-shot learning paradigm.

3 Methods
3.1 Problem Formulation
Let D = {(xi , yi )}n be a dataset of emails, where each email xi is represented as
a feature vector in Rd and labeled as either spam (yi = 1) or not spam (yi = 0),
also called ham. The spam detection classification problem can be formulated
as learning a function f : Rd → {0, 1} that maps an email feature vector to a
binary label, such that f (xi ) approximates yi for all email examples i in the
dataset.
This problem can be approached using supervised learning techniques, where
a training set Dtrain is used to learn the function f . Specifically, the goal is to
find the optimal parameters θ for a classifier fθ that minimizes a suitable loss
function L(θ), such as the cross-entropy loss:
L(θ) = − (1/n) Σ_{i=1}^{n} [ yi log(fθ(xi)) + (1 − yi) log(1 − fθ(xi)) ]    (1)

where n is the number of emails in the training set. The optimal parameters
θ∗ can be found by minimizing the loss function:
θ∗ = argminθ L(θ)    (2)

Once the optimal parameters are found, the learned classifier fθ∗ can be used
to predict the spam label for new emails not seen during training.
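To make Eqs. (1)-(2) concrete, the following sketch minimizes the binary cross-entropy loss for a simple linear classifier fθ by gradient descent; it is an illustration (assuming PyTorch and precomputed feature vectors), not the training procedure used in the paper.

import torch
import torch.nn as nn

def train_classifier(X, y, epochs=100, lr=0.1):
    """Minimize the cross-entropy loss of Eq. (1) for a linear classifier f_theta."""
    d = X.shape[1]
    model = nn.Sequential(nn.Linear(d, 1), nn.Sigmoid())   # f_theta: R^d -> (0, 1)
    loss_fn = nn.BCELoss()                                  # binary cross-entropy, Eq. (1)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # approximates argmin_theta, Eq. (2)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(X).squeeze(1), y)
        loss.backward()
        optimizer.step()
    return model

# X: (n, d) float tensor of email feature vectors; y: (n,) float tensor of 0/1 labels.
# classifier = train_classifier(X, y)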
However, in some scenarios, we may not have access to a large labeled dataset like Dtrain. This is where few-shot learning comes into play. Few-shot learning is a variant of supervised learning that aims to learn from a small amount of labeled data [48]. In the few-shot learning setting, we are given a small support set S of labeled examples, where S = {(xi, yi)}k with k ≪ n, and we need to learn a function f that can generalize to new examples not seen during training. This setting is particularly relevant in the context of spam detection, where labeled examples are scarce and models require frequent updates to account for data shift and adversarial drift.
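For illustration, a k-shot support set S can be drawn from a labeled pool as follows (a hypothetical helper, not part of the paper's released code); it simply samples k examples while keeping both classes represented.

import random

def sample_support_set(dataset, k, seed=0):
    """Draw a support set S = {(x_i, y_i)} of size k << n with both ham and spam present."""
    rng = random.Random(seed)
    spam = [ex for ex in dataset if ex[1] == 1]
    ham = [ex for ex in dataset if ex[1] == 0]
    n_spam = max(1, k // 2)              # keep at least one spam example
    support = rng.sample(spam, n_spam) + rng.sample(ham, k - n_spam)
    rng.shuffle(support)
    return support

# support = sample_support_set(list(zip(emails, labels)), k=16)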

3.2 Data Pipeline for Text Data


Traditional text classification using machine learning involves preprocessing text
data and extracting useful features. The resulting numerical features can be used
to train machine learning models [21, 23].

Fig. 1: Flowchart of the traditional preprocessing steps used for text classification: raw text → tokenize words → remove stopwords → stem or lemmatize → preprocessed text.

Preprocessing. The goal of preprocessing is to transform the raw text into a cleaner and more structured representation to apply a feature extraction algorithm (see Figure 1).
The first step is tokenization, which involves splitting the input text into in-
dividual words, phrases, or other units of meaning. The second step is to remove
stopwords, which are common words such as “the”, “and”, or “in” that do not
carry much semantic information. They can be safely discarded without affecting
the meaning of the text. The third step is to apply stemming or lemmatization,
which are techniques for reducing words to their base or root form. Stemming
involves removing suffixes and prefixes from words, while lemmatization involves
mapping words to their canonical or dictionary form.
Our preprocessing pipeline includes word tokenization and stemming, using the Porter stemming algorithm³ [29].

³ https://www.nltk.org/_modules/nltk/stem/porter.html
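A minimal sketch of this pipeline using NLTK is given below; the stopword-removal step follows the general flow of Figure 1, while the paper's own pipeline explicitly mentions only word tokenization and Porter stemming, so the remaining details are assumptions.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(raw_text: str) -> str:
    """Tokenize, drop stopwords, and apply Porter stemming (cf. Figure 1)."""
    tokens = word_tokenize(raw_text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return " ".join(STEMMER.stem(t) for t in tokens)

# preprocess("Win a FREE casino bonus now!")  ->  "win free casino bonu"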

Feature Extraction. The next step is to extract relevant features from the
preprocessed text. Several algorithms can be considered for this task, such as
bag-of-words [16] and Word2Vec [27]. In this study, we will focus on the popular
Term Frequency-Inverse Document Frequency (tf-idf) encoding.
The tf-idf encoding is a common approach to representing text documents
as numerical vectors, which can be fed to machine learning models. The main
idea behind this encoding is to give a higher weight to words that are frequent
in a particular document (term frequency) but rare in other documents (in-
verse document frequency). This helps to capture the unique characteristics of
a document and distinguish it from other documents in the corpus.
Formally, the tf-idf encoding of a term t in a document d can be defined as:

tf-idf(t, d) = tf(t, d) × idf(t) (3)



where tf(t, d) is the term frequency of term t in document d, and idf(t) is the
inverse document frequency of term t calculated as:
idf(t) = log(N / nt)    (4)
where N is the total number of documents in the corpus, and nt is the number
of documents that contain term t. The logarithmic function dampens the effect
of rare terms with very low values of nt .
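As a sketch, Eqs. (3)-(4) can be computed directly as below; in practice a library vectorizer such as scikit-learn's TfidfVectorizer is typically used instead (note that its defaults add smoothing and normalization on top of the plain formulas).

import math
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

def idf(term, corpus):
    """Eq. (4): idf(t) = log(N / n_t), with n_t the number of documents containing t."""
    N = len(corpus)
    n_t = sum(1 for doc in corpus if term in doc.split())
    return math.log(N / n_t)

def tf_idf(term, doc, corpus):
    """Eq. (3): term frequency in the document times the inverse document frequency."""
    return Counter(doc.split())[term] * idf(term, corpus)

# Library equivalent (max_features is the hyperparameter tuned later, see Table 2):
# X = TfidfVectorizer(max_features=1000).fit_transform(preprocessed_texts)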

4 Experimental Setup

4.1 Datasets

We leveraged four widely recognized spam datasets: the Ling-Spam dataset, SMS
Spam Collection, SpamAssassin Public Corpus, and Enron Email dataset. These
datasets were chosen for their popularity in the field of spam detection [1,36,42]
and the diversity of communication channels they represent, including SMS,
mailing lists, and other sources.

Ling-Spam Dataset. The Ling-Spam dataset (2003) is a collection of messages used in experiments related to spam detection [37]. The dataset is a mixture of
spam messages and legitimate messages sent via the Linguist list, a moderated
mailing list about linguistics. The corpus contains 2893 messages, with 2412
being legitimate messages obtained by randomly downloading digests from the
list’s archives and removing server-added text. The remaining 481 messages are
spam messages received by one of the authors, translating to a spam rate of
approximately 16%. The Linguist messages cover various topics, including job
postings and software availability announcements.

SMS Spam Collection. The SMS Spam Collection⁴ (2011) is a dataset of 5,574 SMS messages in English that have been tagged as either ham or spam [2]. The dataset was collected from various free or free-for-research sources on the internet. These sources include a UK forum where cell phone users report SMS spam messages, a dataset of legitimate messages collected for research at the National University of Singapore, a list of ham messages from a PhD thesis, and the SMS Spam Corpus v.0.1 Big. This is an imbalanced dataset with a spam rate of approximately 13%.

⁴ https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

SpamAssassin Public Corpus. The SpamAssassin dataset⁵ is a publicly available collection of email messages suitable for testing spam filtering systems. The dataset includes 6,047 messages, with a spam ratio of approximately 31%. The messages were either posted to public forums, sent to the creator, or originated as newsletters from public news websites. The corpus is divided into five parts: spam, easy ham, hard ham, easy ham 2, and spam 2, with varying difficulty levels in differentiating spam from non-spam messages. In this study, we used the most recent versions of the five parts, from 2003 and 2005.

⁵ https://spamassassin.apache.org/old/publiccorpus/

Enron Email Dataset. The Enron Email dataset⁶, also known as the Enron Corpus, was collected in 2002 during the investigation into the bankruptcy of the Enron Corporation [26]. This dataset was generated by 158 employees and contains over 600,000 emails. We use a smaller version of this dataset, which contains 33,716 emails, labeled as spam or ham. The Enron Email dataset is balanced and contains 17,171 spam emails and 16,545 ham emails (spam rate of approximately 49%).

⁶ https://www.cs.cmu.edu/~enron/
All these datasets were preprocessed by removing duplicates, NaN values,
and empty messages. Figure 2 shows the proportions of spam and ham emails
in every preprocessed dataset.

[Figure 2: spam/ham proportions per dataset. Ling: 2,876 samples, 16.27% spam; SMS: 5,169 samples, 12.63% spam; SpamAssassin: 6,051 samples, 31.37% spam; Enron: 30,493 samples, 47.82% spam.]

Fig. 2: Distribution of spam and ham messages across the datasets. Notably, the Ling, SMS, and SpamAssassin datasets exhibit an imbalanced learning scenario, where ham messages far outnumber spam messages. Conversely, the Enron dataset is characterized by a balanced distribution of spam and ham messages.
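The cleanup step described above (dropping duplicates, NaN values, and empty messages) might look like the following pandas sketch; the column names are assumptions rather than the released code's actual schema.

import pandas as pd

def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicates, NaN values, and empty messages before splitting into train/test."""
    df = df.dropna(subset=["text", "label"])
    df = df[df["text"].str.strip() != ""]
    df = df.drop_duplicates(subset="text")
    return df.reset_index(drop=True)

# enron = clean_dataset(pd.read_csv("enron.csv"))   # hypothetical file name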

4.2 Large Language Models


We evaluate three large language models in this experiment (RoBERTa, SetFit,
and Flan-T5) from three different families of architectures (BERT-like, Sentence
Transformers, Seq2Seq).

RoBERTa. RoBERTa (2019) is a pretrained model for natural language processing [24] that builds on the success of BERT [14]. RoBERTa improves on
BERT by modifying key hyperparameters, such as removing the next-sentence
pretraining objective and training with larger mini-batches and learning rates.
Additionally, RoBERTa explores training on a larger amount of data and for a
longer period than BERT.
We used the HuggingFace implementation of the RoBERTa model
(roberta-base) with the byte-level Byte-Pair Encoding (BPE) tokenizer [31].
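A minimal sketch of fine-tuning this model for binary sequence classification with the transformers library is given below; the hyperparameter values follow Table 1, but the exact training loop in the released code may differ.

from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")   # byte-level BPE tokenizer
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

args = TrainingArguments(output_dir="roberta-spam",
                         num_train_epochs=10,               # values from Table 1
                         per_device_train_batch_size=16,
                         per_device_eval_batch_size=8,
                         learning_rate=5e-5)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds.map(tokenize, batched=True),
#                   eval_dataset=test_ds.map(tokenize, batched=True))
# trainer.train()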

SetFit. SetFit (2022) is an efficient and prompt-free framework for few-shot fine-tuning of Sentence Transformers (ST) [34, 45]. By fine-tuning a pretrained
ST on a small number of text pairs in a contrastive Siamese manner, SetFit
generates rich text embeddings that are used to train a classification head. This
approach requires no prompts or verbalizers, and achieves high accuracy with
significantly fewer parameters than existing techniques.
We used the SetFit implementation from the setfit library⁷,
combined with the sentence-transformers implementation of MPNet
(sentence-transformers/all-mpnet-base-v2) [40], using the WordPiece
tokenizer [51]. Our implementation generates 20 training pairs. The distance
between the resulting embeddings is measured using the cosine similarity.
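The setup described above can be sketched with the setfit library as follows; the tiny inline dataset is only illustrative, and how the "20 training pairs" maps onto setfit's num_iterations argument is an assumption on our part.

from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Tiny illustrative support set; the experiments use the four public datasets instead.
train_ds = Dataset.from_dict({
    "text": ["WIN a free prize, reply now!", "Meeting moved to 3pm, agenda attached."],
    "label": [1, 0],
})

model = SetFitModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,   # embedding pairs compared via cosine similarity
    num_iterations=20,                 # controls how many contrastive pairs are generated
    num_epochs=3, batch_size=16, learning_rate=2e-5,   # values from Table 1
)
trainer.train()
preds = model(["Claim your casino bonus today"])   # predicted labels for new texts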

Flan-T5. Flan-T5 (2022) is a family of models based on T5 (2019) [33], an encoder-decoder transformer architecture trained on multiple language tasks.
The Flan-T5 models have undergone instruction-finetuning on over 1,800 lan-
guage tasks, leading to a significant enhancement in their reasoning skills and
promptability. However, it is worth noting that the Flan-T5 models were not
trained to perform spam detection tasks.
We used the HuggingFace implementation of the Flan-T5 model (google/flan-t5-base) with the SentencePiece tokenizer [22]. Our experimentation in-
cluded the small version of the Flan-T5 model (80M parameters), but it demon-
strated limited generalization capabilities, which is why it was excluded from
this study.
The Flan-T5 model is a Seq2Seq model that is capable of generating textual
outputs, as opposed to binary labels or probabilities. To leverage the capabilities
of this model for spam detection, we fine-tuned it as a new task, introducing a
dedicated prefix of “classify as ham or spam:” to every sample. As a result,
the model was trained to correctly output either “ham” or “spam” based on
the input text. To obtain numerical values for classification metrics, a post-
processing step was utilized to map the textual labels to 0 and 1. We call this modified model, fine-tuned on the spam detection task, Spam-T5.
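A simplified sketch of this prefix formulation and of the post-processing step that maps the generated text back to binary labels is shown below; the full Seq2Seq fine-tuning loop is omitted.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

PREFIX = "classify as ham or spam: "
LABEL_TO_ID = {"ham": 0, "spam": 1}

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")   # SentencePiece tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def make_training_pair(text: str, label: int):
    """Build a (prefixed input, textual target) pair for Seq2Seq fine-tuning."""
    inputs = tokenizer(PREFIX + text, truncation=True, return_tensors="pt")
    inputs["labels"] = tokenizer("spam" if label == 1 else "ham", return_tensors="pt").input_ids
    return inputs

def predict(text: str) -> int:
    """Generate a textual label and map it back to 0/1 for the classification metrics."""
    inputs = tokenizer(PREFIX + text, truncation=True, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=2)
    decoded = tokenizer.decode(output[0], skip_special_tokens=True).strip().lower()
    return LABEL_TO_ID.get(decoded, 0)   # default to ham if the output is unexpected

# predict("You have won $1,000,000, click here!")   # -> 1 once the model is fine-tuned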

Fine-tuning details. We found that the batch size, learning rate, and number
of epochs are all important hyperparameters that can affect how well the model
generalizes to new data and how quickly it converges during training. Table 1
details the specific values used to fine-tune these models.
⁷ https://github.com/huggingface/setfit

Table 1: Hyperparameters used to fine-tune each large language model.


Model Train batch size Eval batch size LR Epochs
RoBERTa 16 8 5e-5 10
SetFit 16 16 2e-5 3
Spam-T5 8 8 5e-5 5

4.3 Baseline techniques

We selected six baseline models that perform well in spam detection: Naïve Bayes
(NB), Logistic Regression (LR), K-Nearest Neighbors (KNN), SVM, XGBoost,
and LightGBM.
NB is a probabilistic model that assumes independence among features, mak-
ing it fast and efficient for large datasets. LR is a linear model that uses a sigmoid
function to predict binary outcomes. KNN is a non-parametric model that clas-
sifies data based on the proximity of its neighbors. In our implementation, we
selected one neighbor (K = 1). SVM is a linear model that finds the optimal
hyperplane to separate data into different classes. We implemented it with a
sigmoid kernel function and a gamma of 1.0. XGBoost is a high-performing
implementation of gradient boosting that utilizes a level-wise strategy to build
decision trees in parallel and optimize the quality of splits in the training set. We
set its learning rate to 0.01 with 150 estimators. LightGBM is another gradient
boosting framework that shares many of XGBoost’s advantages, but differs in its
leaf-wise construction of decision trees and use of a highly optimized histogram-
based algorithm. Likewise, we set its learning rate to 0.01 with 20 leaves.
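The baseline configurations described above can be sketched with scikit-learn, XGBoost, and LightGBM as follows; any hyperparameter not stated in the text is left at its library default, and the multinomial variant of Naïve Bayes is an assumption.

from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

baselines = {
    "NB": MultinomialNB(),
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=1),                # K = 1
    "SVM": SVC(kernel="sigmoid", gamma=1.0),                   # sigmoid kernel, gamma = 1.0
    "XGBoost": XGBClassifier(learning_rate=0.01, n_estimators=150),
    "LightGBM": LGBMClassifier(learning_rate=0.01, num_leaves=20),
}

# for name, clf in baselines.items():
#     clf.fit(X_train_tfidf, y_train)      # tf-idf feature matrices from Section 3.2
#     print(name, clf.score(X_test_tfidf, y_test))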

Fine-tuning details. We found that the optimal number of tf-idf features was model-dependent. In order to tune this number, a stratified 5-fold cross-
validation technique was employed. This involved training the model with dif-
ferent numbers of tf-idf features and selecting the number that resulted in the
highest performance. Table 2 shows the optimal numbers of features for each
model.

Table 2: Number of features generated by the tf-idf algorithm for each model. These
numbers were found using grid search on the validation set.
Model NB LR KNN SVM XGBoost LightGBM
# features 1000 500 150 3000 2000 3000
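As a sketch, the feature-count selection described above can be reproduced with a grid search over TfidfVectorizer's max_features using stratified 5-fold cross-validation; the candidate grid below is an assumption based on the values reported in Table 2.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
param_grid = {"tfidf__max_features": [150, 500, 1000, 2000, 3000]}

search = GridSearchCV(pipeline, param_grid,
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
                      scoring="f1")
# search.fit(train_texts, train_labels)
# search.best_params_   # e.g. {"tfidf__max_features": 1000} for NB, matching Table 2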

Table 3: Test F1 score, precision, and recall performance of the six baselines and three
LLMs for each dataset.
Ling SMS SA Enron
Model F1 P R F1 P R F1 P R F1 P R
NB 1.00 1.00 1.00 0.89 0.82 0.98 0.87 0.83 0.91 0.96 0.96 0.96
LR 0.98 0.96 1.00 0.87 0.78 0.98 0.92 0.89 0.96 0.97 0.98 0.96
KNN 0.93 0.96 0.90 0.81 0.74 0.89 0.92 0.88 0.95 0.91 0.94 0.89
SVM 1.00 1.00 1.00 0.90 0.83 0.98 0.94 0.92 0.97 0.98 0.99 0.97
XGBoost 0.92 0.94 0.90 0.78 0.65 0.98 0.94 0.92 0.96 0.91 0.98 0.85
LightGBM 0.95 0.96 0.94 0.87 0.82 0.93 0.98 0.98 0.98 0.98 0.99 0.96
RoBERTa 0.97 0.98 0.96 0.95 0.97 0.92 0.97 0.98 0.95 0.99 0.99 0.99
SetFit 0.99 0.98 1.00 0.96 0.97 0.95 0.95 0.96 0.94 — — —
Spam-T5 0.99 0.98 1.00 0.95 0.97 0.94 0.96 0.96 0.96 0.99 0.99 1.00
† We excluded results from the SetFit model on the full Enron training set because it
did not achieve a meaningful result after 104 hours of training.

5 Results

5.1 Full Training Sets

We assess the performance of machine learning and large language models when
they are provided with complete access to the training set. The complete training
set in this context refers to 80% of the entire dataset. To evaluate the perfor-
mance of each model, we employ three metrics: F1 score (F1), precision (P),
recall (R). The outcomes of the evaluation are presented in Table 3.
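For reference, these per-dataset metrics can be computed with scikit-learn, treating spam (label 1) as the positive class:

from sklearn.metrics import precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Return (precision, recall, F1) with spam (label 1) as the positive class."""
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                  average="binary", pos_label=1)
    return p, r, f1

# evaluate([1, 0, 1, 1], [1, 0, 0, 1])   # -> (1.0, 0.666..., 0.8)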
We found that large language models outperformed baseline models on the
SMS and Enron datasets. However, we also observed that LLMs did not perform
better than baseline models on the Ling and SpamAssassin datasets. Among all
the datasets, the SMS dataset showed the most significant difference between the
best-performing baseline model and the worst-performing large language model
in terms of F1 score, with a difference of 0.05 points.
Our experiments show that the Spam-T5 model had the best overall perfor-
mance, with an average F1 score of 0.9742. The RoBERTa and SetFit models
also surpassed the baseline models with the same score of 0.9670. Among the
baseline models, the SVM approach performed the best, achieving an average
F1 score of 0.9560. Conversely, the XGBoost model was the least effective, with
an average F1 score of 0.8842. These outcomes indicate that LLMs are superior
to traditional machine learning algorithms in most spam detection scenarios.

5.2 Few-shot Learning

In the few-shot learning setting, we evaluated the performance of each model after being trained on k ∈ {4, 8, 16, 32, 64, 128, 256, Full} samples. The results of our analysis are presented in Table 4.

Table 4: Test F1 score for each model using different numbers of training samples
(macro average using the four datasets). The “Full” column corresponds to the com-
plete training sets (80% of the original datasets).
Number of training samples
Model 4 8 16 32 64 128 256 Full
NB 0.145 0.210 0.211 0.243 0.361 0.505 0.663 0.930
LR 0.153 0.195 0.210 0.248 0.353 0.420 0.599 0.927
KNN 0.516 0.523 0.596 0.591 0.603 0.688 0.733 0.887
SVM 0.155 0.267 0.288 0.334 0.531 0.732 0.858 0.952
XGBoost 0.000 0.079 0.351 0.431 0.600 0.666 0.767 0.877
LightGBM 0.000 0.000 0.000 0.000 0.455 0.608 0.703 0.948
RoBERTa 0.241 0.174 0.575 0.738 0.459 0.915 0.929 0.970
SetFit 0.215 0.339 0.557 0.855 0.887 0.929 0.941 0.967
Spam-T5 0.544 0.534 0.619 0.726 0.806 0.864 0.933 0.974

The performance profiles of NB, LR, and SVM are similar, with a noticeable
improvement on larger datasets. In contrast, KNN achieves relatively higher
F1 scores for smaller training sets, with scores exceeding 0.5 for sizes of 4 and
8. However, its performance plateaus as the number of shots increases, with
a maximum score of 0.887 on the full datasets. Gradient-boosted tree models,
such as XGBoost and LightGBM, exhibit underwhelming results on the smallest
dataset sizes (4, 8, 16, and 32). Their performance rapidly improves with an
increase in the training set size, culminating in scores of 0.877 and 0.948 on the
full datasets, respectively.

RoBERTa’s performance is somewhat inconsistent across training set sizes, starting at 0.241 for size 4, dropping to 0.174 for size 8, and then increasing to
0.970 on the full datasets. In contrast, SetFit consistently improves in perfor-
mance as the number of samples increases, achieving an F1 score of 0.967 on
the full datasets. This model performs best on dataset sizes 32, 64, 128, and
256, indicating that SetFit is more suitable for this particular type of “medium”
few-shot learning. Spam-T5, on the other hand, is the best-performing model
in very-few-shot scenarios, i.e., when there are only 4–16 samples available for
training. Its performance steadily increases with more samples, achieving the
highest F1 score on the full datasets.

Figure 3 illustrates the consistent superiority of LLMs over the baseline mod-
els in terms of F1 score across all training sample sizes. Furthermore, Table 5
presents the mean F1 scores of every model, and shows that Spam-T5 achieves
the highest overall performance with an F1 score of 0.7498. These results can be
attributed to Spam-T5’s high accuracy in the very-few-shot setting and consis-
tent robustness across all sample sizes.

[Figure 3: average F1 score vs. number of training samples (4 to Full) for RoBERTa, SetFit, Spam-T5, and the best baseline model.]

Fig. 3: Comparison of test F1 scores achieved by LLMs vs. the best baseline model for every number of training samples (macro average using four datasets).

Table 5: Mean test F1 scores and standard deviations across all numbers of training samples (macro average using four datasets).

Model      F1 score
NB         0.4085 ± 0.2734
LR         0.3880 ± 0.2621
KNN        0.6421 ± 0.1234
SVM        0.5146 ± 0.3005
XGBoost    0.4716 ± 0.3159
LightGBM   0.3392 ± 0.3871
RoBERTa    0.6253 ± 0.3139
SetFit     0.7112 ± 0.2990
Spam-T5    0.7498 ± 0.1718

6 Discussion
The results of our experiments provide insights into the strengths and limitations
of LLMs for spam detection. Specifically, our findings suggest that LLMs, such
as RoBERTa and MPNet (model used by SetFit), perform well in general but
are outperformed by Spam-T5 in the very-few-shot setting. This difference in
performance can be attributed to the number of parameters, with Spam-T5
having 250M parameters compared to RoBERTa’s 125M and MPNet’s 110M.
Moreover, our results indicate a clear trade-off between accuracy and runtime
when using LLMs and baseline techniques for spam detection, as illustrated in
Figure 4. While LLMs are more robust and perform better in most cases, they
require long training and inference times. In contrast, baseline techniques are
faster but do not obtain the same level of accuracy. This suggests that LLMs
achieve improved sample efficiency at the expense of computational efficiency.
This trade-off highlights the need to consider the specific requirements of the
task, such as the available computational resources and the desired level of ac-
curacy.
Figure 4 also shows a surprising increase in inference time for the baseline
models as the number of training samples increases (i.e., the number of test
samples decreases). This trend is counterintuitive, as we would expect a similar
trend to that of the LLMs, where the inference time decreases since there are
fewer samples to process. This is due to the sigmoid kernel function used by the
SVM model. As the number of training samples increases, the sigmoid kernel
requires more computational effort during the inference phase, leading to the
observed increase in inference time.
The practical application of LLMs is hindered by their substantial compu-
tational requirements for training and deployment, which necessitate the use of
specialized hardware, including GPUs and TPUs. Addressing this limitation is

[Figure 4: two log-scale panels showing average training time and average inference time, in seconds, vs. number of training samples (4 to Full) for RoBERTa, SetFit, Spam-T5, and the average of the baseline techniques.]

Fig. 4: Training and inference times for the three LLMs and the average times for the baseline techniques (macro average using the four datasets). Here, “inference time” corresponds to the execution time to process the entire test set. We exclude training and inference times from the SetFit model on the full Enron dataset because it did not achieve a meaningful score after 104 hours of training.

an active area of research, with numerous techniques that reduce the memory
footprint and computational resources required by LLMs. For instance, some ap-
proaches have explored the use of 8-bit floating-point formats to reduce the mem-
ory requirements, as demonstrated in [13]. Other methods, such as LoRA [19],
aim to reduce the computational resources required to train and deploy LLMs.

7 Conclusion

Our study demonstrates the effectiveness of LLMs for email spam detection, par-
ticularly in few-shot scenarios. Experiments show that LLMs outperform well-
established baseline techniques in terms of F1 score across all training sample
sizes. Furthermore, our solution, Spam-T5, achieves the highest overall perfor-
mance with an average F1 score of 0.7498.
These findings suggest that LLMs, and specifically Spam-T5, could be a
valuable tool for addressing the ongoing challenges in spam detection. By in-
corporating a limited number of fraudulent samples, we can update models and
enhance their performance without the need for extensive data labeling efforts.
This approach simplifies the task of building robust models that can handle
dynamic data distributions, thus offering a practical and effective solution to
real-world problems.
In order to deploy LLMs in real-world applications, future work will need to
focus on reducing the computational requirements of these models. One approach
to achieving this goal involves developing techniques that minimize the memory
footprint and computational resources required by LLMs, such as those explored
in recent studies.

8 Ethical Statement

As spam detection is an essential component of email systems and other communication platforms, using effective language models in this domain can lead
to more accurate filtering and improved user experience. However, our research
raises ethical concerns about the potential misuse of such language models for
censorship. The ability to classify messages as spam or non-spam could be used
to suppress or filter out content that does not align with certain political or so-
cial agendas. We recognize the importance of safeguarding against such misuse
and promoting responsible use of machine learning tools in the public domain.
Furthermore, the development and deployment of large language models have
a significant environmental impact, consuming significant amounts of energy
and contributing to carbon emissions. We acknowledge the potential ecological
consequences of our research and call for greater attention to the environmental
sustainability of machine learning models and their applications.

References

1. Agboola, O.: Spam Detection Using Machine Learning and Deep Learning. LSU
Doctoral Dissertations (2022)
2. Almeida, T.A., Hidalgo, J.M.G., Yamakami, A.: Contributions to the study
of sms spam filtering: New collection and results. In: Proceedings of the
11th ACM Symposium on Document Engineering. p. 259–262. DocEng
’11, Association for Computing Machinery, New York, NY, USA (2011).
https://doi.org/10.1145/2034691.2034742, https://doi.org/10.1145/2034691.
2034742
3. Alspector, J.: SVM-based filtering of e-mail spam with content-specific misclassification costs. In: Proceedings of the Workshop on Text Mining (2001)
4. Androutsopoulos, I., Paliouras, G., Michelakis, E.: Learning to filter
unsolicited commercial e-mail (2006)
5. Attenberg, J., Weinberger, K., Dasgupta, A., Smola, A., Zinkevich, M.: Collab-
orative email-spam filtering with the hashing trick. In: Proceedings of the Sixth
Conference on Email and Anti-Spam (CEAS 2009) (01 2009)
6. Carreras, X., Salgado, J.: Boosting trees for anti-spam email filtering. ArXiv
cs.CL/0109015 (10 2001)
7. Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang,
X., Dehghani, M., Brahma, S., Webson, A., Gu, S.S., Dai, Z., Suzgun, M., Chen,
X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang,
S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E.H.,
Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q.V., Wei, J.: Scaling instruction-
finetuned language models (2022)
8. Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20(3), 273–297
(1995)
9. Cranor, L.F., LaMacchia, B.A.: Spam! Commun. ACM 41(8), 74–83
(aug 1998). https://doi.org/10.1145/280324.280336, https://doi.org/10.1145/
280324.280336

10. Dada, E.G., Bassi, J.S., Chiroma, H., Abdulhamid, S.M., Adetunmbi,
A.O., Ajibuwa, O.E.: Machine learning for email spam filtering: re-
view, approaches and open research problems. Heliyon 5(6), e01802
(2019). https://doi.org/https://doi.org/10.1016/j.heliyon.2019.e01802,
https://www.sciencedirect.com/science/article/pii/S2405844018353404
11. Debnath, K., Kar, N.: Email spam detection using deep learning approach. In: 2022
International Conference on Machine Learning, Big Data, Cloud and Parallel Com-
puting (COM-IT-CON). vol. 1, pp. 37–41 (2022). https://doi.org/10.1109/COM-
IT-CON54601.2022.9850588
12. Delany, S., Cunningham, P., Coyle, L.: An assessment of case-based rea-
soning for spam filtering. Artif. Intell. Rev. 24, 359–378 (11 2005).
https://doi.org/10.1007/s10462-005-9006-6
13. Dettmers, T., Lewis, M., Belkada, Y., Zettlemoyer, L.: Llm.int8(): 8-bit matrix
multiplication for transformers at scale (2022)
14. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding (2019)
15. Drucker, H., Wu, D., Vapnik, V.: Support vector machines for spam cate-
gorization. IEEE Transactions on Neural Networks 10(5), 1048–1054 (1999).
https://doi.org/10.1109/72.788645
16. Harris, Z.S.: Distributional structure. WORD 10(2-3), 146–162 (1954).
https://doi.org/10.1080/00437956.1954.11659520, https://doi.org/10.1080/
00437956.1954.11659520
17. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation
9(8), 1735–1780 (1997)
18. Hovold, J.: Naive bayes spam filtering using word-position-based attributes. In:
Proceedings of the Second Conference on Email and Anti-Spam, http://www.ceas.cc (2005)
19. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen,
W.: Lora: Low-rank adaptation of large language models (2021)
20. Jáñez-Martino, F., Alaiz, R., González-Castro, V., Fidalgo, E., Alegre, E.: A review
of spam email detection: analysis of spammer strategies and the dataset shift prob-
lem. Artificial Intelligence Review 56 (05 2022). https://doi.org/10.1007/s10462-
022-10195-4
21. Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L.,
Brown, D.: Text classification algorithms: A survey. Information 10(4) (2019).
https://doi.org/10.3390/info10040150, https://www.mdpi.com/2078-2489/10/4/
150
22. Kudo, T., Richardson, J.: Sentencepiece: A simple and language independent sub-
word tokenizer and detokenizer for neural text processing (2018)
23. Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., Yu, P.S., He, L.: A survey
on text classification: From traditional to deep learning. ACM Trans. Intell. Syst.
Technol. 13(2) (apr 2022), https://doi.org/10.1145/3495162
24. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining
approach (2019), http://arxiv.org/abs/1907.11692, cite arxiv:1907.11692
25. Méndez, J.R., Fdez-Riverola, F., Díaz, F., Iglesias, E.L., Corchado, J.M.: A com-
parative performance study of feature selection methods for the anti-spam filtering
domain. In: Perner, P. (ed.) Advances in Data Mining. Applications in Medicine,
Web Mining, Marketing, Image and Signal Mining. pp. 106–120. Springer Berlin
Heidelberg, Berlin, Heidelberg (2006)

26. Metsis, V., Androutsopoulos, I., Paliouras, G.: Spam filtering with naive bayes –
which naive bayes? (2006), http://citeseer.ist.psu.edu/757874.html
27. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed represen-
tations of words and phrases and their compositionality (2013)
28. OpenAI: Gpt-4 technical report (2023)
29. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–
137 (1980). https://doi.org/10.1108/eb046814, http://www.emeraldinsight.com/
doi/abs/10.1108/eb046814
30. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language un-
derstanding by generative pre-training (2018)
31. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Lan-
guage models are unsupervised multitask learners (2018), https://d4mucfpksywv.
cloudfront.net/better-language-models/language-models.pdf
32. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y.,
Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-
to-text transformer. Journal of Machine Learning Research 21(140), 1–67 (2020),
http://jmlr.org/papers/v21/20-074.html
33. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li,
W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text
transformer (2020)
34. Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-
networks. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) EMNLP/IJCNLP (1).
pp. 3980–3990. Association for Computational Linguistics (2019), http://dblp.
uni-trier.de/db/conf/emnlp/emnlp2019-1.html#ReimersG19
35. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to
filtering junk e-mail. In: AAAI-98 Workshop on Learning for Text Categorization.
pp. 55–62 (1998)
36. Sahmoud, T., Mikki, D.M.: Spam detection using bert (2022)
37. Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D.,
Stamatopoulos, P.: A memory-based approach to anti-spam filtering for mailing
lists. Inf. Retr. 6(1), 49–73 (jan 2003). https://doi.org/10.1023/A:1022948414856,
https://doi.org/10.1023/A:1022948414856
38. Scholz, M., Klinkenberg, R.: An ensemble classifier for drifting concepts. In: Pro-
ceedings of the Second International Workshop on Knowledge Discovery from Data
Streams (IWKDDS’05) (2005)
39. Shi, W., Xie, M., Huang, Y.: Collaborative spam filtering technique based on mime
fingerprints. In: 2011 9th World Congress on Intelligent Control and Automation.
pp. 225–230 (2011). https://doi.org/10.1109/WCICA.2011.5970733
40. Song, K., Tan, X., Qin, T., Lu, J., Liu, T.Y.: Mpnet: Masked and permuted pre-
training for language understanding (2020)
41. Syed, N.A., Liu, H., Sung, K.K.: Handling concept drifts in incremental
learning with support vector machines. In: Proceedings of the Fifth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining.
p. 317–321. KDD ’99, Association for Computing Machinery, New York, NY,
USA (1999). https://doi.org/10.1145/312129.312267, https://doi.org/10.1145/
312129.312267
42. Tida, V.S., Hsu, S.: Universal spam detection using transfer learning of bert model
(2022)
43. Torabi, Z., Nadimi-Shahraki, M.H., Nabiollahi, A.: Efficient support vector ma-
chines for spam detection: A survey. (IJCSIS) International Journal of Computer
Science and Information Security, Vol. 13, No. 1, January 2015 13 (01 2015)

44. Tsymbal, A.: The problem of concept drift: definitions and related work. Tech.
Rep. TCD-CS-2004-15, The University of Dublin, Trinity College, Department of
Computer Science, Dublin, Ireland (2004)
45. Tunstall, L., Reimers, N., Jo, U.E.S., Bates, L., Korat, D., Wasserblat, M., Pereg,
O.: Efficient few-shot learning without prompts (2022)
46. Vapnik, V.N.: The nature of statistical learning theory. Springer-Verlag New York,
Inc. (1995)
47. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg,
U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.)
Advances in Neural Information Processing Systems. vol. 30. Curran Associates,
Inc. (2017), https://proceedings.neurips.cc/paper_files/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
48. Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few exam-
ples: A survey on few-shot learning. ACM Comput. Surv. 53(3) (jun 2020).
https://doi.org/10.1145/3386252, https://doi.org/10.1145/3386252
49. Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M.,
Le, Q.V.: Finetuned language models are zero-shot learners. In: International Con-
ference on Learning Representations (2021)
50. Wittel, G., Wu, S.: On Attacking Statistical Spam Filters. In: Proc. of the Con-
ference on Email and Anti-Spam (CEAS). Mountain View, CA, USA (July 2004),
http://www.ceas.cc/papers-2004/index.html
51. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun,
M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X.,
Lukasz Kaiser, Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian,
G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O.,
Corrado, G., Hughes, M., Dean, J.: Google’s neural machine translation system:
Bridging the gap between human and machine translation (2016)
