Spam-T5: Benchmarking Large Language Models For Few-Shot Email Spam Detection
Maxime Labonne and Sean Moran
JPMorgan Chase
{maxime.labonne,sean.j.moran}@jpmchase.com
1 Introduction
Email communication continues to be an essential part of our daily lives, facili-
tating efficient asynchronous communication globally for personal and business
users alike. Given this prominence, email is also a prime target for fraudulent and
malicious activities such as spam and phishing attacks. Spam email can cause a
multitude of problems ranging from user inconvenience arising from unsolicited
communications, to overload of computational resources on servers and secu-
rity compromise arising from fraudulent links and malware attachments in the
emails that are designed to attack personal and business security. It is estimated
that in 2022, almost 49% of emails sent over the internet were spam¹, highlighting
the continued prevalence of the problem and the need to explore ever more
effective detection methods.

¹ https://securelist.com/spam-phishing-scam-report-2022/108692/
2 Related Work
2.1 Spam Detection using Machine Learning
The email spam detection task has been well explored in the literature [10] and
is commonly used in undergraduate machine learning courses as an introductory
use-case for study and learning. Despite the familiarity of the email spam de-
tection task, it continues to provide real challenges to practitioners due to data
distribution drift, the adversarial nature of the environment in which the task
is embedded, and class imbalance. Significant research effort has been expended
on this task, and sustained research is necessary to keep ahead of ever-evolving
data and the changing landscape of spamming techniques.
The email spam detection task is framed as the development of an effective
automated algorithm for differentiating spam versus ham (non-spam) emails.
Spam detection methods have been classified in terms of rules-based, collabora-
tive, and content-based approaches [25]. Rules-based approaches include check-
ing incoming email addresses against white lists of allowed email addresses, black
lists of commonly used spam email sources, and hand-crafted rules (e.g. that look
for empty To: fields or a certain combination of keywords). Collaborative ap-
proaches compute, for example, a hash function on the content of example spam
which is shared with the community and the hash function compared to new
emails to detect spam [5, 39, 50]. These simple methods are unable to generalize
well and are brittle [9], which is why content-based methods involving machine
learning have been explored [4].
For content-based methods, early methods explored conventional machine
learning methods and variations thereof, including Naïve Bayes classifiers [35],
KNN [12], Support Vector Machines (SVMs) [8, 15, 43, 46], and Boosting trees
(e.g. XGBoost) [6]. The task is frequently formulated as a binary classification
problem where a classifier is trained on a representative dataset of spam and
ham (non-spam) emails, and learns to classify the data points into these two
classes. Performance is measured on generalization to new email data received,
for example, on standard benchmark datasets [2, 26, 37] or in a live production
operating environment for industrial use-cases. Studies have also been conducted
on the effectiveness of various hand-crafted feature representations for input into
the machine learning classifiers [25]. Other work has sought to tackle the concept
drift problem, which is a defining aspect of the spam detection task [38, 41],
and also to improve conventional classification techniques for the task [3, 18].
Generally speaking, conventional machine learning methods are faster to train
and evaluate compared to deep neural networks, but have less modeling capacity
to capture the complex patterns inherent in the data.
The field of machine learning has been revolutionized with the emergence of
Large Language Models (LLMs), a powerful suite of deep learning methods that
have exceptional abilities in understanding and learning from the structural,
relational, and semantic patterns inherent in natural language datasets [24].
LLMs are extreme multi-taskers, able to replace most bespoke models for natural
language tasks. For example, they are very capable at text generation, sentiment
detection, question answering, and text summarization. The release of ChatGPT
and more recently GPT4 [28] to the public made waves around the world that
continue to reverberate given the naturalness and human-like response from the
model². Despite the popularity and importance of email spam detection, there
is little prior work that explores the benefits of LLMs for the task. We address
the gap in this paper.
Given the robust popularity of the field, the literature on LLMs is vast and
varied, with new advances occurring rapidly, on a daily basis. We focus
on the suite of benchmark models that are commonly used as baselines in aca-
demic papers and in operational systems in industry, including RoBERTa [24],
SetFit [45], and Flan-T5 [49]. These benchmark models have the advantage of
publicly released code and weights, enabling comparison and evaluation on new
tasks. Underlying almost all of the recent innovations in the field is the semi-
nal Transformer architecture introduced by Google Research in 2017 [47]. The
Transformer architecture dispenses with recurrent and convolutional layers, ad-
vocating the primary use of attention mechanisms for learning, which prove
to be massively more scalable to big data. This technology was incorporated in
the several generations (GPT-n) of Generative Pre-Trained Transformers (GPT)
models [30], to spectacular effect across many natural language understanding
tasks [28]. Aside from the older GPT-2 model [31], the recent suite of GPT
models is closed-source, with weights hidden behind an API, making direct
comparison impossible.
Among the open-source and widely available models are RoBERTa, SetFit
and Flan-T5. BERT (Bidirectional Encoder Representations from Transform-
ers) [14] addresses the issue of existing models only learning from previous to-
kens in text, creating more powerful (bi-directional) Transformer representations
that gain from knowledge from an extended context. BERT has subsequently
developed into a performant off-the-shelf model for many tasks. A subsequent,
and popular evolution of BERT is embodied in the RoBERTa model [24] that
improves BERT through a refined training methodology involving larger mini-
batches and learning rates. SetFit (Sentence Transformer Fine Tuning) [45] is a
framework for efficient few-shot fine-tuning of Sentence Transformers that does
not require prompts.

² https://openai.com/blog/chatgpt, https://openai.com/research/gpt-4
3 Methods
3.1 Problem Formulation
Let D = {(x_i, y_i)}_{i=1}^{n} be a dataset of emails, where each email x_i is
represented as a feature vector in R^d and labeled as either spam (y_i = 1) or
not spam (y_i = 0), also called ham. The spam detection classification problem
can be formulated as learning a function f : R^d → {0, 1} that maps an email
feature vector to a binary label, such that f(x_i) approximates y_i for all email
examples i in the dataset.
This problem can be approached using supervised learning techniques, where
a training set Dtrain is used to learn the function f . Specifically, the goal is to
find the optimal parameters θ for a classifier f_θ that minimizes a suitable loss
function L(θ), such as the cross-entropy loss:

\[ L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log f_\theta(x_i) + (1 - y_i) \log\bigl(1 - f_\theta(x_i)\bigr) \right] \tag{1} \]
where n is the number of emails in the training set. The optimal parameters
θ* can be found by minimizing the loss function:

\[ \theta^* = \operatorname*{arg\,min}_{\theta} L(\theta) \tag{2} \]
Once the optimal parameters are found, the learned classifier f_θ* can be used
to predict the spam label for new emails not seen during training.
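As a minimal illustration of Eqs. (1) and (2) (a sketch, not the paper's implementation; the toy feature vectors are invented stand-ins for email features), the snippet below computes the cross-entropy loss and delegates the minimization to scikit-learn's logistic-regression solver, which fits one possible choice of f_θ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Eq. (1): average binary cross-entropy over the training set."""
    y_prob = np.clip(y_prob, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Toy stand-ins for email feature vectors x_i in R^d and labels y_i (spam = 1).
X_train = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.7], [0.2, 0.9]])
y_train = np.array([1, 1, 0, 0])

# Fitting the classifier realizes Eq. (2): the solver searches for the
# parameters theta that minimize a (regularized) version of Eq. (1).
clf = LogisticRegression().fit(X_train, y_train)
print(cross_entropy(y_train, clf.predict_proba(X_train)[:, 1]))
print(clf.predict([[0.85, 0.2]]))  # predicted label for an unseen email
```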
However, in some scenarios, we may not have access to a large labeled dataset
like Dtrain . This is where few-shot learning comes into play. Few-shot learning
is a variant of supervised learning that aims to learn from a small amount of
labeled data [48]. In the few-shot learning setting, we are given a small support
set S of labeled examples, where S = {(x_i, y_i)}_{i=1}^{k} with k ≪ n, and we need to
learn a function f that can generalize to new examples not seen during training.
This setting is particularly relevant in the context of spam detection, where
labeled examples are scarce and require frequent updates to account for data
shift and adversarial drift.
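One way to construct such a support set, assuming a balanced draw from each class (the paper does not specify its sampling scheme), is sketched below:

```python
import numpy as np

def sample_support_set(X, y, k, seed=0):
    """Draw a k-shot support set S = {(x_i, y_i)} with k << n,
    taking k // 2 examples from each class (ham = 0, spam = 1)."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=k // 2, replace=False)
        for label in (0, 1)
    ])
    return X[idx], y[idx]

# Usage: X_s, y_s = sample_support_set(X, y, k=8)
```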
Fig. 1: Flowchart of the traditional preprocessing steps used for text classification.
Feature Extraction. The next step is to extract relevant features from the
preprocessed text. Several algorithms can be considered for this task, such as
bag-of-words [16] and Word2Vec [27]. In this study, we will focus on the popular
Term Frequency-Inverse Document Frequency (tf-idf) encoding.
The tf-idf encoding is a common approach to representing text documents
as numerical vectors, which can be fed to machine learning models. The main
idea behind this encoding is to give a higher weight to words that are frequent
in a particular document (term frequency) but rare in other documents (in-
verse document frequency). This helps to capture the unique characteristics of
a document and distinguish it from other documents in the corpus.
Formally, the tf-idf encoding of a term t in a document d can be defined as:

\[ \text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t) \tag{3} \]

where tf(t, d) is the term frequency of term t in document d, and idf(t) is the
inverse document frequency of term t, calculated as:

\[ \text{idf}(t) = \log \frac{N}{n_t} \tag{4} \]

where N is the total number of documents in the corpus, and n_t is the number
of documents that contain term t. The logarithmic function dampens the effect
of rare terms with very low values of n_t.
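In practice, this encoding does not need to be implemented by hand. Below is a hedged sketch with scikit-learn (the toy corpus is invented; note that scikit-learn's default idf adds smoothing, log((1 + N) / (1 + n_t)) + 1, rather than the plain form of Eq. (4)):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "win a free prize now",           # spam-like
    "meeting rescheduled to monday",  # ham-like
    "free entry in a prize draw",     # spam-like
]

# max_features caps the vocabulary size, mirroring the grid-searched
# feature counts reported later in Table 2.
vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(corpus)  # sparse (n_documents, n_terms) matrix
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))
```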
4 Experimental Setup
4.1 Datasets
We leveraged four widely recognized spam datasets: the Ling-Spam dataset, SMS
Spam Collection, SpamAssassin Public Corpus, and Enron Email dataset. These
datasets were chosen for their popularity in the field of spam detection [1,36,42]
and the diversity of communication channels they represent, including SMS,
mailing lists, and other sources.
SpamAssassin Public Corpus. This corpus has a spam rate of approximately
31%. The messages were either posted to public forums, sent to the creator, or
originated as newsletters from public news websites. The corpus is divided into
five parts: spam, easy ham, hard ham, easy ham 2, and spam 2, with varying
difficulty levels in differentiating spam from non-spam messages. In this study,
we used the most recent versions of the five parts, from 2003 and 2005.
Enron Email Dataset. The Enron Email dataset, also known as the Enron
Corpus, was collected in 2002 during the investigation into the bankruptcy of
the Enron Corporation [26]. This dataset was generated by 158 employees and
contains over 600,000 emails. We use a smaller version of this dataset, which
contains 33,716 emails, labeled as spam or ham. The Enron Email dataset is
balanced and contains 17,171 spam emails and 16,545 ham emails (spam rate of
approximately 49%).
All these datasets were preprocessed by removing duplicates, NaN values,
and empty messages. Figure 2 shows the proportions of spam and ham emails
in every preprocessed dataset.
Fig. 2: Distribution of spam and ham messages across the datasets. Notably, the Ling,
SMS, and SpamAssassin datasets exhibit an imbalanced learning scenario, where the
prevalence of spam messages outweighs that of ham messages. Conversely, the Enron
dataset is characterized by a balanced distribution of spam and ham messages.
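A minimal sketch of the cleaning step described above, using pandas (the column names "text" and "label" are assumptions, not the paper's exact schema):

```python
import pandas as pd

def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicates, NaN values, and empty messages, as described above."""
    df = df.drop_duplicates(subset="text")
    df = df.dropna(subset=["text", "label"])
    df = df[df["text"].str.strip().astype(bool)]  # drop empty/whitespace-only texts
    return df.reset_index(drop=True)
```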
RoBERTa. RoBERTa [24] improves upon BERT by removing the next-sentence
prediction pretraining objective and training with larger mini-batches and
learning rates. Additionally, RoBERTa explores training on a larger amount of
data and for a longer period than BERT.
We used the HuggingFace implementation of the RoBERTa model
(roberta-base) with the byte-level Byte-Pair Encoding (BPE) tokenizer [31].
Fine-tuning details. We found that the batch size, learning rate, and number
of epochs are all important hyperparameters that can affect how well the model
generalizes to new data and how quickly it converges during training. Table 1
details the specific values used to fine-tune these models.
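For illustration, a minimal fine-tuning sketch with the HuggingFace Trainer is shown below. The two-example dataset and the hyperparameter values are placeholders, not the settings of Table 1:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # byte-level BPE
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # binary spam/ham classification head

# Placeholder corpus; in practice, built from the datasets of Section 4.1.
train_ds = Dataset.from_dict({"text": ["win a free prize now",
                                       "agenda for monday's meeting"],
                              "label": [1, 0]})
train_ds = train_ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                            padding="max_length"), batched=True)

# Placeholder hyperparameters, not the grid-searched values of Table 1.
args = TrainingArguments(output_dir="roberta-spam", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```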
⁷ https://github.com/huggingface/setfit
We selected six baseline models that perform well in spam detection: Naïve Bayes
(NB), Logistic Regression (LR), K-Nearest Neighbors (KNN), SVM, XGBoost,
and LightGBM.
NB is a probabilistic model that assumes independence among features, mak-
ing it fast and efficient for large datasets. LR is a linear model that uses a sigmoid
function to predict binary outcomes. KNN is a non-parametric model that clas-
sifies data based on the proximity of its neighbors. In our implementation, we
selected one neighbor (K = 1). SVM is a linear model that finds the optimal
hyperplane to separate data into different classes. We implemented it with a
sigmoid kernel function and a gamma of 1.0. XGBoost is a high-performing
implementation of gradient boosting that utilizes a level-wise strategy to build
decision trees in parallel and optimize the quality of splits in the training set. We
set its learning rate to 0.01 with 150 estimators. LightGBM is another gradient
boosting framework that shares many of XGBoost’s advantages, but differs in its
leaf-wise construction of decision trees and use of a highly optimized histogram-
based algorithm. Likewise, we set its learning rate to 0.01 with 20 leaves.
Table 2: Number of features generated by the tf-idf algorithm for each model. These
numbers were found using grid search on the validation set.
Model NB LR KNN SVM XGBoost LightGBM
# features 1000 500 150 3000 2000 3000
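The sketch below assembles these baselines with the hyperparameters stated above and the tf-idf feature counts of Table 2; the Naïve Bayes variant (multinomial) is an assumption, as the text does not name one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

baselines = {
    "NB": make_pipeline(TfidfVectorizer(max_features=1000), MultinomialNB()),
    "LR": make_pipeline(TfidfVectorizer(max_features=500), LogisticRegression()),
    "KNN": make_pipeline(TfidfVectorizer(max_features=150),
                         KNeighborsClassifier(n_neighbors=1)),  # K = 1
    "SVM": make_pipeline(TfidfVectorizer(max_features=3000),
                         SVC(kernel="sigmoid", gamma=1.0)),
    "XGBoost": make_pipeline(TfidfVectorizer(max_features=2000),
                             XGBClassifier(learning_rate=0.01, n_estimators=150)),
    "LightGBM": make_pipeline(TfidfVectorizer(max_features=3000),
                              LGBMClassifier(learning_rate=0.01, num_leaves=20)),
}
# Each pipeline can then be trained directly on raw texts: pipe.fit(texts, labels)
```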
Table 3: Test F1 score, precision, and recall performance of the six baselines and three
LLMs for each dataset.
Ling SMS SA Enron
Model F1 P R F1 P R F1 P R F1 P R
NB 1.00 1.00 1.00 0.89 0.82 0.98 0.87 0.83 0.91 0.96 0.96 0.96
LR 0.98 0.96 1.00 0.87 0.78 0.98 0.92 0.89 0.96 0.97 0.98 0.96
KNN 0.93 0.96 0.90 0.81 0.74 0.89 0.92 0.88 0.95 0.91 0.94 0.89
SVM 1.00 1.00 1.00 0.90 0.83 0.98 0.94 0.92 0.97 0.98 0.99 0.97
XGBoost 0.92 0.94 0.90 0.78 0.65 0.98 0.94 0.92 0.96 0.91 0.98 0.85
LightGBM 0.95 0.96 0.94 0.87 0.82 0.93 0.98 0.98 0.98 0.98 0.99 0.96
RoBERTa 0.97 0.98 0.96 0.95 0.97 0.92 0.97 0.98 0.95 0.99 0.99 0.99
SetFit† 0.99 0.98 1.00 0.96 0.97 0.95 0.95 0.96 0.94 — — —
Spam-T5 0.99 0.98 1.00 0.95 0.97 0.94 0.96 0.96 0.96 0.99 0.99 1.00
† We excluded results from the SetFit model on the full Enron training set because it
did not achieve a meaningful result after 104 hours of training.
5 Results
We assess the performance of machine learning and large language models when
they are provided with complete access to the training set. The complete training
set in this context refers to 80% of the entire dataset. To evaluate the perfor-
mance of each model, we employ three metrics: F1 score (F1), precision (P),
and recall (R). The outcomes of the evaluation are presented in Table 3.
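These metrics can be computed directly with scikit-learn; a small sketch with placeholder labels, treating spam as the positive class:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 1]  # placeholder ground-truth test labels
y_pred = [1, 0, 1, 0, 0, 1]  # placeholder model predictions
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```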
We found that large language models outperformed baseline models on the
SMS and Enron datasets. However, we also observed that LLMs did not perform
better than baseline models on the Ling and SpamAssassin datasets. Among all
the datasets, the SMS dataset showed the most significant difference between the
best-performing baseline model and the worst-performing large language model
in terms of F1 score, with a difference of 0.05 points.
Our experiments show that the Spam-T5 model had the best overall perfor-
mance, with an average F1 score of 0.9742. The RoBERTa and SetFit models
also surpassed the baseline models with the same score of 0.9670. Among the
baseline models, the SVM approach performed the best, achieving an average
F1 score of 0.9560. Conversely, the XGBoost model was the least effective, with
an average F1 score of 0.8842. These outcomes indicate that LLMs are superior
to traditional machine learning algorithms in most spam detection scenarios.
Table 4: Test F1 score for each model using different numbers of training samples
(macro average using the four datasets). The “Full” column corresponds to the com-
plete training sets (80% of the original datasets).
Number of training samples
Model 4 8 16 32 64 128 256 Full
NB 0.145 0.210 0.211 0.243 0.361 0.505 0.663 0.930
LR 0.153 0.195 0.210 0.248 0.353 0.420 0.599 0.927
KNN 0.516 0.523 0.596 0.591 0.603 0.688 0.733 0.887
SVM 0.155 0.267 0.288 0.334 0.531 0.732 0.858 0.952
XGBoost 0.000 0.079 0.351 0.431 0.600 0.666 0.767 0.877
LightGBM 0.000 0.000 0.000 0.000 0.455 0.608 0.703 0.948
RoBERTa 0.241 0.174 0.575 0.738 0.459 0.915 0.929 0.970
SetFit 0.215 0.339 0.557 0.855 0.887 0.929 0.941 0.967
Spam-T5 0.544 0.534 0.619 0.726 0.806 0.864 0.933 0.974
The performance profiles of NB, LR, and SVM are similar, with a noticeable
improvement on larger datasets. In contrast, KNN achieves relatively higher
F1 scores for smaller training sets, with scores exceeding 0.5 for sizes of 4 and
8. However, its performance plateaus as the number of shots increases, with
a maximum score of 0.887 on the full datasets. Gradient-boosted tree models,
such as XGBoost and LightGBM, exhibit underwhelming results on the smallest
dataset sizes (4, 8, 16, and 32). Their performance rapidly improves with an
increase in the training set size, culminating in scores of 0.877 and 0.948 on the
full datasets, respectively.
Figure 3 illustrates the consistent superiority of LLMs over the baseline mod-
els in terms of F1 score across all training sample sizes. Furthermore, Table 5
presents the mean F1 scores of every model, and shows that Spam-T5 achieves
the highest overall performance with an F1 score of 0.7498. These results can be
attributed to Spam-T5’s high accuracy in the very-few-shot setting and consis-
tent robustness across all sample sizes.
Fig. 3: Comparison of test F1 scores achieved by LLMs vs. the best baseline model
for every number of training samples (macro average using four datasets).

Table 5: Mean test F1 scores and standard deviations across all numbers of
training samples (macro average using four datasets).

Model      F1 score
NB         0.4085 ± 0.2734
LR         0.3880 ± 0.2621
KNN        0.6421 ± 0.1234
SVM        0.5146 ± 0.3005
XGBoost    0.4716 ± 0.3159
LightGBM   0.3392 ± 0.3871
RoBERTa    0.6253 ± 0.3139
SetFit     0.7112 ± 0.2990
Spam-T5    0.7498 ± 0.1718
6 Discussion
The results of our experiments provide insights into the strengths and limitations
of LLMs for spam detection. Specifically, our findings suggest that LLMs, such
as RoBERTa and MPNet (the model used by SetFit), perform well in general but
are outperformed by Spam-T5 in the very-few-shot setting. This difference in
performance can be attributed to the number of parameters, with Spam-T5
having 250M parameters compared to RoBERTa’s 125M and MPNet’s 110M.
Moreover, our results indicate a clear trade-off between accuracy and runtime
when using LLMs and baseline techniques for spam detection, as illustrated in
Figure 4. While LLMs are more robust and perform better in most cases, they
require long training and inference times. In contrast, baseline techniques are
faster but do not obtain the same level of accuracy. This suggests that LLMs
achieve improved sample efficiency at the expense of computational efficiency.
This trade-off highlights the need to consider the specific requirements of the
task, such as the available computational resources and the desired level of ac-
curacy.
Figure 4 also shows a surprising increase in inference time for the baseline
models as the number of training samples increases (i.e., the number of test
samples decreases). This trend is counterintuitive, as we would expect a similar
trend to that of the LLMs, where the inference time decreases since there are
fewer samples to process. This is due to the sigmoid kernel function used by the
SVM model. As the number of training samples increases, the sigmoid kernel
requires more computational effort during the inference phase, leading to the
observed increase in inference time.
The practical application of LLMs is hindered by their substantial compu-
tational requirements for training and deployment, which necessitate the use of
specialized hardware, including GPUs and TPUs. Addressing this limitation is
Fig. 4: Training and inference times for the three LLMs and the average times for the
baseline techniques (macro average using the four datasets). Here, “inference time”
corresponds to the execution time to process the entire test set. We exclude training
and inference times from the SetFit model on the full Enron dataset because it did not
achieve a meaningful score after 104 hours of training.
an active area of research, with numerous techniques that reduce the memory
footprint and computational resources required by LLMs. For instance, some ap-
proaches have explored the use of 8-bit floating-point formats to reduce the mem-
ory requirements, as demonstrated in [13]. Other methods, such as LoRA [19],
aim to reduce the computational resources required to train and deploy LLMs.
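As a hedged sketch of how these two techniques combine in practice (the model name and LoRA settings are illustrative assumptions, not the paper's configuration), a Flan-T5 checkpoint can be loaded in 8-bit precision [13] and wrapped with LoRA adapters [19] using the peft library:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# 8-bit weight loading [13]; requires the bitsandbytes package.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base",
                                              load_in_8bit=True)

# LoRA [19]: inject low-rank adapters into the attention projections
# so that only a small fraction of parameters is updated during training.
config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32,
                    target_modules=["q", "v"], lora_dropout=0.05)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
```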
7 Conclusion
Our study demonstrates the effectiveness of LLMs for email spam detection, par-
ticularly in few-shot scenarios. Experiments show that LLMs outperform well-
established baseline techniques in terms of F1 score across all training sample
sizes. Furthermore, our solution, Spam-T5, achieves the highest overall perfor-
mance with an average F1 score of 0.7498.
These findings suggest that LLMs, and specifically Spam-T5, could be a
valuable tool for addressing the ongoing challenges in spam detection. By in-
corporating a limited number of fraudulent samples, we can update models and
enhance their performance without the need for extensive data labeling efforts.
This approach simplifies the task of building robust models that can handle
dynamic data distributions, thus offering a practical and effective solution to
real-world problems.
In order to deploy LLMs in real-world applications, future work will need to
focus on reducing the computational requirements of these models. One approach
to achieving this goal involves developing techniques that minimize the memory
footprint and computational resources required by LLMs, such as those explored
in recent studies.
8 Ethical Statement
References
1. Agboola, O.: Spam Detection Using Machine Learning and Deep Learning. LSU
Doctoral Dissertations (2022)
2. Almeida, T.A., Hidalgo, J.M.G., Yamakami, A.: Contributions to the study
of sms spam filtering: New collection and results. In: Proceedings of the
11th ACM Symposium on Document Engineering. p. 259–262. DocEng
’11, Association for Computing Machinery, New York, NY, USA (2011).
https://doi.org/10.1145/2034691.2034742
3. Alspector, J.: Svm-based filtering of e-mail spam with content-specific misclassification costs. In: Proceedings of the Workshop on Text Mining (2001)
4. Androutsopoulos, I., Paliouras, G., Michelakis, E.: Learning to filter
unsolicited commercial e-mail (2006)
5. Attenberg, J., Weinberger, K., Dasgupta, A., Smola, A., Zinkevich, M.: Collab-
orative email-spam filtering with the hashing trick. In: Proceedings of the Sixth
Conference on Email and Anti-Spam (CEAS 2009) (01 2009)
6. Carreras, X., Salgado, J.: Boosting trees for anti-spam email filtering. ArXiv
cs.CL/0109015 (10 2001)
7. Chung, H.W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang,
X., Dehghani, M., Brahma, S., Webson, A., Gu, S.S., Dai, Z., Suzgun, M., Chen,
X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang,
S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E.H.,
Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q.V., Wei, J.: Scaling instruction-
finetuned language models (2022)
8. Cortes, C., Vapnik, V.: Support-vector networks. Machine learning 20(3), 273–297
(1995)
9. Cranor, L.F., LaMacchia, B.A.: Spam! Commun. ACM 41(8), 74–83 (aug 1998).
https://doi.org/10.1145/280324.280336
10. Dada, E.G., Bassi, J.S., Chiroma, H., Abdulhamid, S.M., Adetunmbi,
A.O., Ajibuwa, O.E.: Machine learning for email spam filtering: re-
view, approaches and open research problems. Heliyon 5(6), e01802 (2019).
https://doi.org/10.1016/j.heliyon.2019.e01802
11. Debnath, K., Kar, N.: Email spam detection using deep learning approach. In: 2022
International Conference on Machine Learning, Big Data, Cloud and Parallel Com-
puting (COM-IT-CON). vol. 1, pp. 37–41 (2022).
https://doi.org/10.1109/COM-IT-CON54601.2022.9850588
12. Delany, S., Cunningham, P., Coyle, L.: An assessment of case-based rea-
soning for spam filtering. Artif. Intell. Rev. 24, 359–378 (11 2005).
https://doi.org/10.1007/s10462-005-9006-6
13. Dettmers, T., Lewis, M., Belkada, Y., Zettlemoyer, L.: Llm.int8(): 8-bit matrix
multiplication for transformers at scale (2022)
14. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding (2019)
15. Drucker, H., Wu, D., Vapnik, V.: Support vector machines for spam cate-
gorization. IEEE Transactions on Neural Networks 10(5), 1048–1054 (1999).
https://doi.org/10.1109/72.788645
16. Harris, Z.S.: Distributional structure. WORD 10(2-3), 146–162 (1954).
https://doi.org/10.1080/00437956.1954.11659520
17. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation
9(8), 1735–1780 (1997)
18. Hovold, J.: Naive bayes spam filtering using word-position-based attributes. In:
Proceedings of the Second Conference on Email and Anti-Spam, http://www.ceas.cc
(2005)
19. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen,
W.: Lora: Low-rank adaptation of large language models (2021)
20. Jáñez-Martino, F., Alaiz, R., González-Castro, V., Fidalgo, E., Alegre, E.: A review
of spam email detection: analysis of spammer strategies and the dataset shift prob-
lem. Artificial Intelligence Review 56 (05 2022).
https://doi.org/10.1007/s10462-022-10195-4
21. Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L.,
Brown, D.: Text classification algorithms: A survey. Information 10(4) (2019).
https://doi.org/10.3390/info10040150
22. Kudo, T., Richardson, J.: Sentencepiece: A simple and language independent sub-
word tokenizer and detokenizer for neural text processing (2018)
23. Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., Yu, P.S., He, L.: A survey
on text classification: From traditional to deep learning. ACM Trans. Intell. Syst.
Technol. 13(2) (apr 2022), https://doi.org/10.1145/3495162
24. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M.,
Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining
approach (2019), http://arxiv.org/abs/1907.11692
25. Méndez, J.R., Fdez-Riverola, F., Dı́az, F., Iglesias, E.L., Corchado, J.M.: A com-
parative performance study of feature selection methods for the anti-spam filtering
domain. In: Perner, P. (ed.) Advances in Data Mining. Applications in Medicine,
Web Mining, Marketing, Image and Signal Mining. pp. 106–120. Springer Berlin
Heidelberg, Berlin, Heidelberg (2006)
26. Metsis, V., Androutsopoulos, I., Paliouras, G.: Spam filtering with naive bayes –
which naive bayes? (2006), http://citeseer.ist.psu.edu/757874.html
27. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed represen-
tations of words and phrases and their compositionality (2013)
28. OpenAI: Gpt-4 technical report (2023)
29. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980).
https://doi.org/10.1108/eb046814
30. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language un-
derstanding by generative pre-training (2018)
31. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language
models are unsupervised multitask learners (2019), https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
32. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y.,
Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-
to-text transformer. Journal of Machine Learning Research 21(140), 1–67 (2020),
http://jmlr.org/papers/v21/20-074.html
33. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li,
W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text
transformer (2020)
34. Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-
networks. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) EMNLP/IJCNLP (1).
pp. 3980–3990. Association for Computational Linguistics (2019),
http://dblp.uni-trier.de/db/conf/emnlp/emnlp2019-1.html#ReimersG19
35. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to
filtering junk e-mail. In: AAAI-98 Workshop on Learning for Text Categorization.
pp. 55–62 (1998)
36. Sahmoud, T., Mikki, D.M.: Spam detection using bert (2022)
37. Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D.,
Stamatopoulos, P.: A memory-based approach to anti-spam filtering for mailing
lists. Inf. Retr. 6(1), 49–73 (jan 2003). https://doi.org/10.1023/A:1022948414856
38. Scholz, M., Klinkenberg, R.: An ensemble classifier for drifting concepts. In: Pro-
ceedings of the Second International Workshop on Knowledge Discovery from Data
Streams (IWKDDS’05) (2005)
39. Shi, W., Xie, M., Huang, Y.: Collaborative spam filtering technique based on mime
fingerprints. In: 2011 9th World Congress on Intelligent Control and Automation.
pp. 225–230 (2011). https://doi.org/10.1109/WCICA.2011.5970733
40. Song, K., Tan, X., Qin, T., Lu, J., Liu, T.Y.: Mpnet: Masked and permuted pre-
training for language understanding (2020)
41. Syed, N.A., Liu, H., Sung, K.K.: Handling concept drifts in incremental
learning with support vector machines. In: Proceedings of the Fifth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining.
p. 317–321. KDD ’99, Association for Computing Machinery, New York, NY,
USA (1999). https://doi.org/10.1145/312129.312267
42. Tida, V.S., Hsu, S.: Universal spam detection using transfer learning of bert model
(2022)
43. Torabi, Z., Nadimi-Shahraki, M.H., Nabiollahi, A.: Efficient support vector ma-
chines for spam detection: A survey. (IJCSIS) International Journal of Computer
Science and Information Security, Vol. 13, No. 1, January 2015 13 (01 2015)
44. Tsymbal, A.: The problem of concept drift: definitions and related work. Tech.
Rep. TCD-CS-2004-15, The University of Dublin, Trinity College, Department of
Computer Science, Dublin, Ireland (2004)
45. Tunstall, L., Reimers, N., Jo, U.E.S., Bates, L., Korat, D., Wasserblat, M., Pereg,
O.: Efficient few-shot learning without prompts (2022)
46. Vapnik, V.N.: The nature of statistical learning theory. Springer-Verlag New York,
Inc. (1995)
47. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg,
U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.)
Advances in Neural Information Processing Systems. vol. 30. Curran Associates,
Inc. (2017), https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
48. Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few exam-
ples: A survey on few-shot learning. ACM Comput. Surv. 53(3) (jun 2020).
https://doi.org/10.1145/3386252
49. Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M.,
Le, Q.V.: Finetuned language models are zero-shot learners. In: International Con-
ference on Learning Representations (2021)
50. Wittel, G., Wu, S.: On Attacking Statistical Spam Filters. In: Proc. of the Con-
ference on Email and Anti-Spam (CEAS). Mountain View, CA, USA (July 2004),
http://www.ceas.cc/papers-2004/index.html
51. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun,
M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X.,
Lukasz Kaiser, Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian,
G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O.,
Corrado, G., Hughes, M., Dean, J.: Google’s neural machine translation system:
Bridging the gap between human and machine translation (2016)