A Novel Approach for Spam Detection Using Natural Language Processing With AMALS Models
Corresponding authors: Mohammad Asef Hossaini ([email protected]), Hasmat Malik ([email protected]), and
Asyraf Afthanorhan ([email protected])
This work was supported in part by Intelligent Prognostic Private Ltd., Delhi, India; and in part by Badghis University, Badghis,
Afghanistan.
ABSTRACT To enhance their business operations, organizations leverage the big data ecosystem to
manage vast volumes of information effectively. Achieving this objective requires analyzing textual
data while safeguarding data integrity and applying robust measures, such as spam filters, for
organizing and validating data. Various methodologies can be employed, including Word2Vec,
bag-of-words, BERT, and term frequency-inverse document frequency (TF-IDF). Nevertheless, none of
these solutions effectively addresses the problem of data scarcity, which can leave missing
information in the collected documents. To address this problem properly, it is necessary to employ
a strategy that categorizes each document by subject matter and uses statistical approaches for
approximation. This research paper presents a novel approach for spam detection using natural
language processing. The proposed strategy utilizes a least-squares model to adjust topics and
incorporates gradient descent and alternating least-squares (i.e., AMALS) models for estimating
missing data. TF-IDF and uniform-distribution methods perform the estimation. The performance
evaluation reveals that the suggested technique achieves a performance of 98%, superior to the
existing industry TF-IDF model, in accurately predicting spam within big data ecosystems. With this
model, the environment of an organization or a company can be protected from spamming and other
attacks that could expose its data to unauthorized users.
INDEX TERMS Artificial intelligence, big data, machine learning, spam detection.
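The abstract refers to alternating least squares (ALS) for estimating missing data. The following is a minimal sketch of that general idea only, not the authors' AMALS model: it fits a rank-1 factorization to the observed entries of a small score matrix and uses it to fill the gaps. The function name `als_rank1`, the regularization value, the iteration count, and the toy matrix are all assumptions made for illustration.

```python
def als_rank1(matrix, reg=1e-6, iters=100):
    """Fill missing entries (None) of `matrix` using a rank-1 ALS fit.

    Illustrative sketch: the matrix is approximated as u[i] * v[j],
    and u and v are updated in turn with the other held fixed.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows  # row factors
    v = [1.0] * n_cols  # column factors
    for _ in range(iters):
        # Hold v fixed: each u[i] has a closed-form ridge solution.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            num = sum(matrix[i][j] * v[j] for j in obs)
            den = sum(v[j] ** 2 for j in obs) + reg
            u[i] = num / den
        # Hold u fixed: each v[j] has a closed-form ridge solution.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            num = sum(matrix[i][j] * u[i] for i in obs)
            den = sum(u[i] ** 2 for i in obs) + reg
            v[j] = num / den
    # Keep observed values; replace only the missing ones.
    return [[matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
             for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical document-term scores with one missing entry.
scores = [[1.0, 2.0, 3.0],
          [2.0, 4.0, None],
          [3.0, 6.0, 9.0]]
filled = als_rank1(scores)
```

Because the toy matrix is exactly rank 1, the missing entry is recovered closely; real document-term matrices would need a higher rank and a tuned regularization term.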
might potentially contain harmful content that is concealed within attachments or URLs, posing a risk to the security of the host system [2]. Spam refers to the transmission of unsolicited and irrelevant messages or emails by an individual or entity to a large number of recipients using various means of information dissemination, such as email or other communication channels [3]. Therefore, there is a significant need for robust security measures for the email system. Spam emails have the potential to contain malicious software such as viruses, rats, and Trojans. This approach is predominantly employed by attackers to entice consumers into internet services. The individuals in question can transmit unsolicited emails that include attachments with a variety of file extensions. These attachments may contain URLs that have been manipulated to direct users to websites that engage in hazardous activities, such as spamming and fraudulent behaviour. As a result, users may experience detrimental consequences, including data or financial fraud and identity theft [4], [5]. Numerous email service providers offer their users the capability to establish rule-based filters that automatically categorize incoming emails based on keywords. However, this methodology seems to be of limited utility: it is complex to set up, and users are reluctant to personalize their emails, which leaves their email accounts vulnerable to spam attacks.
Over the past few decades, the Internet of Things (IoT) has emerged as an integral aspect of contemporary society, seeing significant and rapid expansion. The IoT has emerged as a crucial element within the context of smart cities. There exists a multitude of social media applications and platforms that are based on IoT technology. The proliferation of the IoT has led to a significant escalation in the prevalence of spamming issues. Researchers have put forth a range of spam detection techniques to identify and eliminate spam content and individuals engaging in spamming activities. The current methods for spam identification can be broadly classified into two categories: behaviour pattern-based approaches and semantic pattern-based approaches. These methodologies possess inherent restrictions and disadvantages. The proliferation of spam emails has expanded notably in tandem with the emergence and widespread adoption of the Internet and global communication [6]. Spam messages are produced globally through the Internet, employing techniques to conceal the identity of the attacker. Numerous antispam methods and approaches have been developed; yet, the prevalence of spam remains significantly elevated. The most perilous forms of unsolicited electronic communications are malicious emails that include hyperlinks directing recipients to websites designed to inflict harm upon the victim's data. The presence of spam emails has the potential to impede server response times due to the occupation of server memory or capacity. To effectively identify and prevent the proliferation of spam emails, organizations undertake a meticulous assessment of the various instruments at their disposal to address this escalating problem. Several well-known techniques for identifying and analyzing incoming emails to detect spam include whitelisting/blacklisting [7], email header analysis, and keyword verification, among others.
According to estimates provided by social networking professionals, over 40% of accounts on social networks are utilized for spam [8]. Spammers employ widely used social networking technologies to selectively target distinct segments, review pages, or fan pages and discreetly embed hyperlinks that direct users to pornographic and other commercial websites. These websites are typically associated with false accounts and aim to promote the sale of illicit products. The malicious emails that are disseminated to persons or organizations of a similar nature have recurring characteristics. Through a thorough examination of these key points, one can enhance the efficacy of identifying and detecting such forms of electronic correspondence. The classification of emails into spam and non-spam categories can be achieved by the application of artificial intelligence (AI) [9]. One alternative approach to solving this problem involves extracting features from the headers, subject, and body of the messages. Once the data has been extracted and categorized according to its characteristics, the messages can be classified into two groups: spam or ham. Currently, spam detection is frequently accomplished with learning-based classifiers [10]. In learning-based classification, detection operates under the assumption that spam emails possess distinct properties that can be used to distinguish them from valid emails [11]. Several aspects contribute to the heightened complexity of the spam identification process in learning-based models: spam subjectivity, concept drift, linguistic difficulties, processing overhead, and text latency.
Prominent multinational firms like Amazon have established an extensive infrastructure comprising numerous servers and databases. These resources are utilized not only for the storage of literary works but also to accommodate a substantial volume of product-related data. These data facilities have been intentionally created to attain optimal productivity and have the potential to be offered as services to other organizations [1]. Various forms of structured data are grouped inside big data ecosystems. However, text data often lacks structure and necessitates analysis to offer additional services utilizing consumer big data. The features of company and customer actions in the online environment may be captured through the use of textual communication [2]. Natural Language Processing (NLP) methodologies for the analysis of unstructured textual data encompass approaches such as Word2Vec and bag-of-words.
Bag-of-words (BOW), Bidirectional Encoder Representations from Transformers (BERT), and term frequency-inverse document frequency (TF-IDF) are three commonly used techniques in natural language processing (NLP).
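As a brief illustration of the TF-IDF weighting named above, the sketch below computes the classic form, term frequency multiplied by log(N / document frequency), for a tiny tokenized corpus. The function name `tf_idf` and the three sample documents are hypothetical, and production libraries typically apply smoothing and normalization that this sketch omits.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / document length and idf = log(N / df),
    without the smoothing that many libraries add.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: "free" occurs in every document.
docs = [["win", "free", "prize"],
        ["meeting", "free", "lunch"],
        ["free", "report"]]
weights = tf_idf(docs)
```

A term that appears in every document, such as "free" here, receives weight zero, while a term unique to one document receives the largest weight. This is also why sparse or missing terms cause trouble: a term absent from a document simply has no entry.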
Nevertheless, the task of analyzing surface-level textual information obtained through NLP poses challenges, particularly regarding the scarcity and omission of textual data. To address this issue, traditional models employ a range of methodologies in conjunction with machine learning and statistical methods. Furthermore, several models have conducted comparisons and experiments on document clustering matrices by transforming the document-word matrix into a document-factor scoring matrix [12]. Nevertheless, the issue of sparsity continues to have an impact on the performance of document clustering. This paper introduces a novel approach for spam identification using NLP. The proposed technique combines the ratios of topic-alternating least squares (TALS), approximations gradient descent (AMGD), and approximations alternating least squares (AMALS) models:
• The TALS framework categorizes feature-related concerns by putting them into the process of addressing sparsity issues and approximating them through the utilization of a probability distribution. This approach aims to enhance the predictability and suitability of the features.
• The AMGD algorithm employs a gradient descent (GD) function as well as a uniform distribution to address the issue of missing information by approximating the model.
• The remaining scarcity issue is addressed by AMALS through the implementation of alternating least squares (ALS), L2 normalization, and uniform distribution.
This research presents a unique machine learning methodology to address the challenges of shortage and missing information in large-scale data documents.
• This research successfully reduces the performance gap between the testing and training sets of documents.
• This study provides a novel NLP-based spam detection model that exhibits enhanced performance in comparison to the conventional term frequency-inverse document frequency (TF-IDF) approach.
• This study presents a new finding that supports the advantages of utilizing the ALS function in conjunction with the GD algorithm for effectively classifying spam text inside a large-scale data environment.
The subsequent sections of this work are organized in the following manner. Section II provides an overview of the background. Section III explains the underlying factors that drive the research endeavour. Section IV introduces the recommended methodology. Section V provides an analysis and assessment of the subject matter, while Section VI serves as the concluding section, summarizing the main findings and implications of the study.
An instance of learning-based models can be observed in the form of an extreme learning machine (ELM). The present study introduces a contemporary machine-learning approach designed for feedforward neural networks, specifically focusing on architectures with a single hidden layer [12]. When compared to standard neural networks, it effectively addresses issues related to slow training speed and overfitting. In the ELM framework, a single iteration cycle is sufficient. Due to its enhanced capacity for generalization, robustness, and controllability, this method has gained widespread adoption across various domains. This study examines various machine-learning techniques utilized in the context of spam identification. The contributions made by our team are categorized as follows:
• The present paper examines a range of machine learning-based filters for spam, exploring their architectural design and evaluating their respective advantages and disadvantages. In addition, we engaged in a discussion regarding the fundamental characteristics of unsolicited email communications, commonly referred to as spam.
• A complete examination of the proposed strategies and the nature of spam revealed some intriguing research gaps in the field of spam detection and filtering.
• This section presents a discussion on open research topics and future research objectives aimed at enhancing email security and spam email filtration through the utilization of machine learning algorithms.
• In this paper, the authors examine the existing obstacles encountered by spam filtering algorithms and analyze the impact of these challenges on the efficiency of the models.
• This paper presents a thorough examination of several machine learning techniques and concepts, with a specific focus on their application in the field of spam identification.
• The paper classifies several machine learning-based spam detection approaches to gain a comprehensive understanding of their underlying principles.
• This section presents a range of potential avenues for future research in the field of spam detection and filtration. These areas aim to enhance the detection capabilities and bolster the security of email platforms.
II. LITERATURE REVIEW
Email spam refers to the dissemination of fraudulent or unsolicited bulk messages through various accounts or automated systems. The proliferation of unsolicited emails, commonly referred to as spam, has exhibited a steady upward trend, emerging as a prevalent issue during the past ten years. Spam email addresses are commonly obtained through the utilization of spambots, which are automated programs designed to scour the Internet for email addresses. The utilization of machine learning techniques has significantly contributed to the identification and detection of spam. Researchers are employing a range of models and strategies to advance the development of innovative spam detection and filtering models [13]. In their
study, McMahan et al. [14] conducted a comprehensive survey on the topic of email spam detection. They focused on employing a supervised approach that incorporates feature selection techniques. The authors discuss the knowledge discovery process employed in the context of spam detection systems. In addition, the authors provide detailed explanations of numerous strategies and technologies that have been presented for the detection of spam. This survey also discusses the selection of features using N-gram analysis. The N-Gram [15], [16] algorithm is a predictive method employed for estimating the likelihood of the subsequent word appearing after identifying the N − 1 preceding words inside a sentence or text corpus. The N-Gram model employs probabilistic methods to anticipate the subsequent word. The study conducts a comparative analysis of different approaches to email spam detection, including both machine learning techniques, such as multilayer perceptron neural network, support vector machine, and Naïve Bayes, and non-machine learning methods, such as signatures, blacklists and whitelists, and mail header verification.
In their publication, Saleh et al. [19] provide an extensive survey of smart spam email detection. The authors engage in a comprehensive examination of security vulnerabilities associated with electronic mail, with a particular emphasis on spam emails. The discourse encompasses an exploration of the breadth of spam analysis, as well as an examination of several methodologies employed in both machine learning and non-machine learning approaches for spam identification and filtration. The researchers concluded that there is a significant prevalence of supervised learning algorithms, as evidenced by the adoption rates, in the context of email spam detection [18]. The authors assert that the primary reason for the widespread adoption of supervised learning is the high level of accuracy and consistency exhibited by supervised techniques. The researchers also discussed multialgorithm frameworks and determined that such frameworks are more efficient than their single-algorithm counterparts. It has been observed that the majority of research endeavours that use email content to identify spam, namely phishing emails, mostly rely on word-based classification and clustering techniques.
Sun et al. [21] provide a comprehensive overview of learning-based methodologies employed in the domain of email spam filtering. This study discusses the issue of spam and presents a comprehensive analysis of learning-based spam filtering techniques. The authors elucidate diverse characteristics of spam emails. This study examines the impact of spam emails on various domains. This study also examines the diverse economic and ethical concerns associated with spam. The prevalent antispam strategy involves the utilization of learning-based filtering, which has undergone significant advancements. The filters that are frequently employed rely on diverse classification approaches that are applied to the different elements of email communications. This paper posits that the Naïve Bayes classifier occupies a distinct place among the various learning algorithms employed in the context of spam filtering. The tool exhibits remarkable efficiency and clarity, yielding outcomes of great accuracy.
In their study, Bhuiyan et al. [22] provide a comprehensive analysis of contemporary methodologies employed in email spam filtering. The authors provide an overview of several spam filtering methodologies and evaluate the performance of several suggested systems by examining multiple metrics. The authors discuss the efficacy of various approaches employed to filter spam. Some have achieved favourable outcomes, while others are endeavouring to integrate alternative methods to enhance their level of accuracy. Despite their overall success, experts remain concerned about the various challenges encountered in spam filtering technologies. The researchers are endeavouring to develop an advanced spam filtering mechanism capable of comprehending vast quantities of multimedia data to effectively filter out spam emails. The authors conclude that the predominant approach for email spam filtering involves the utilization of the Naïve Bayes and Support Vector Machine (SVM) algorithms. To evaluate the efficacy of spam filtration models, it is possible to train these models using many datasets, such as the "ECML" and UCI datasets [21].
In their study, Ferrag et al. [24] conducted a comprehensive examination of deep learning techniques utilized in intrusion detection systems as well as spam detection datasets. The authors examined detection systems that rely on deep learning models, subsequently assessing the efficacy of those models. The researchers analyzed a total of 35 widely recognized cyber datasets, which were then classified into seven distinct groups. These categories encompass datasets that are classified as Internet traffic-based, networking traffic-based, Intranet traffic-based, electric network-based, virtualized private network-based, Android apps-based, IoT traffic-based, and Internet linked device-based datasets. The researchers concluded that deep learning models exhibit superior performance compared to classical machine learning and lexical models in the context of intrusion and spam detection.
In their study, Vyas et al. [25] provide a comprehensive analysis of supervised machine-learning techniques employed in the context of spam email filtering. The researchers concluded that the Naïve Bayes method exhibits superior speed and satisfactory precision compared to the other methods reviewed, except SVM and ID3. Support Vector Machines (SVM) and Iterative Dichotomiser 3 (ID3) algorithms provide higher precision compared to the Naïve Bayes algorithm, albeit at the cost of significantly increased system construction time. A trade-off exists between the factors of timing and precision. The authors conclude that the
choice of learning algorithm is contingent upon the specific circumstances and the desired levels of accuracy and efficiency. It is asserted that to develop a more resilient spam filtering architecture, careful consideration should be given to all components of the email.
This survey study examines three primary categories of machine-learning techniques that can be employed for spam filtering. In this study, we undertake a comprehensive examination of multiple scholarly articles, analyzing the suggested methodologies and deliberating on the obstacles encountered in spam detection and filtration systems. This paper also examines the merits and drawbacks of the proposed methodologies for spam identification and filtration that have not been previously evaluated.
filter, which involves the extraction of header data from the email, occurs during the second stage. Subsequently, a series of blacklist filters are implemented to effectively identify and intercept emails originating from the blacklist file, thereby mitigating the influx of spam emails. Following this phase, rule-based filters are employed to identify the sender by utilizing the subject line and parameters specified by the user. The utilization of allowance and task filters is achieved through the implementation of a technique that enables the account holder to initiate the transmission of mail [26].
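The staged filtering described in this part of the paper (header extraction, then blacklist filters, then user-defined rule-based filters, then allowance filters) can be sketched as a simple chain of checks. This is an illustrative sketch only: the function name `staged_filter`, the label strings, the sample addresses, and the exact stage order are assumptions, and the sketch uses only Python's standard `email` package for header parsing.

```python
from email import message_from_string
from email.utils import parseaddr


def staged_filter(raw_email, blacklist, whitelist, subject_rules):
    """Classify a raw message with a fixed chain of filter stages."""
    msg = message_from_string(raw_email)
    # Header stage: pull out the sender address and the subject line.
    sender = parseaddr(msg.get("From", ""))[1].lower()
    subject = (msg.get("Subject") or "").lower()
    # Blacklist stage: intercept mail from known bad senders.
    if sender in blacklist:
        return "spam"
    # Rule-based stage: user-specified phrases in the subject line.
    if any(phrase in subject for phrase in subject_rules):
        return "spam"
    # Allowance stage: senders the account holder has approved.
    if sender in whitelist:
        return "ham"
    # No stage decided; a content-based classifier could take over here.
    return "unknown"


# Hypothetical configuration and messages.
blacklist = {"promo@bad.example"}
whitelist = {"alice@corp.example"}
subject_rules = ["free prize"]

r1 = staged_filter("From: Promo <promo@bad.example>\nSubject: Hello\n\nHi",
                   blacklist, whitelist, subject_rules)
r2 = staged_filter("From: Alice <alice@corp.example>\nSubject: Win a FREE prize\n\nHi",
                   blacklist, whitelist, subject_rules)
r3 = staged_filter("From: Alice <alice@corp.example>\nSubject: Minutes\n\nHi",
                   blacklist, whitelist, subject_rules)
r4 = staged_filter("From: Bob <bob@other.example>\nSubject: Lunch\n\nHi",
                   blacklist, whitelist, subject_rules)
```

In this ordering a rule match overrides an approved sender, as the second message shows; a deployment might prefer to consult the allowance list first.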
client and enterprise-level spam filtering process is depicted in Figure 2.
The Internet of Things (IoT), encompassing both its objects and networks, is susceptible to network and physical threats as well as privacy breaches. Figure 4 provides a visual representation of the primary categories of attacks targeting the IoT.
• Malicious node attack. The perpetrator physically inserts the malicious node into the group of nodes.
• Man-in-the-middle attack. In this particular form of attack, the assailant covertly intercepts the conversation between two nodes across the Internet. The perpetrator acquires crucial information by eavesdropping on private conversations.
• Sybil attack. This is a type of security threat that involves an adversary creating many fake identities in a network to gain control or manipulate the system. The compromised node illicitly appropriates the reputation of legitimate nodes and assumes the role of a trustworthy node.
A study conducted by Nozomi Networks reveals a notable rise in attacks and threats targeting Operational Technology (OT) and Internet of Things (IoT) networks over the first six months of 2020. Figure 5 illustrates the frequency of cyber attacks on IoT devices throughout different years.
V. MACHINE LEARNING
Machine learning is widely recognized as a significant and valuable implementation of artificial intelligence (AI), enabling computer systems to autonomously acquire knowledge and improve their performance without the need for explicit programming [34]. The basic objective of machine learning algorithms is to construct automated systems that enable the retrieval and utilization of training data, hence enabling improved decision-making in the future. The primary goal of machine learning methods is to acquire knowledge autonomously, without requiring human interaction. Machine learning encompasses three primary categories that are employed for a wide range of activities.
FIGURE 6. Methodologies of ML.
Over the past decade, scholars have endeavoured to enhance the efficacy of email communication beyond its current state. The implementation of spam filtering techniques for email systems is widely recognized as a crucial measure in safeguarding email networks [35]. Numerous scholarly publications have been dedicated to employing diverse machine-learning methodologies to detect and manage spam emails. However, certain areas within this research domain remain unexplored or inadequately addressed. The study of junk mail is a prominent and compelling area of research that addresses existing knowledge gaps [36]. Numerous studies have been conducted to enhance the reliability and use of email communication by employing various techniques in spam classification. This study aims to provide a concise overview of several machine-learning techniques and approaches currently employed in the field of email spam detection. This research additionally assesses the prevailing machine learning methodologies, namely K-Nearest Neighbors (KNN), Support Vector Machines (SVM), random forest, and Naïve Bayes.
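To make the supervised-learning setting concrete, the sketch below trains a small multinomial Naïve Bayes classifier, one of the methods this paper assesses, on labelled token lists and then labels new messages. It is a minimal illustration with Laplace (add-one) smoothing, not the configuration used in any of the surveyed studies; the function names `train_nb` and `predict_nb` and the four training messages are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """samples: list of (tokens, label) pairs. Returns the fitted counts."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for tokens, label in samples:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in samples for t in tokens}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior for `tokens`."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled messages.
training = [(["win", "free", "prize"], "spam"),
            (["free", "cash", "now"], "spam"),
            (["meeting", "at", "noon"], "ham"),
            (["project", "report", "attached"], "ham")]
model = train_nb(training)
label_a = predict_nb(model, ["free", "prize"])
label_b = predict_nb(model, ["project", "meeting"])
```

The per-word probabilities are multiplied (added in log space) as if words occurred independently given the class, which is the simplifying assumption that gives the method its name.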
in this research included a total of 1000 electronic mail messages, out of which 350 (35%) were identified as spam, while the remaining 650 (65%) were classified as legitimate messages (ham).
Verma and Sofat employed the supervised machine learning technique ID3 (Quinlan, 1986) to construct decision trees for the given task [46]. Additionally, they utilized the hidden Markov model [47] to estimate the probabilities of various occurrences, which were then combined to categorize emails as either junk mail or ham [47]. The suggested model initially classifies emails as either spam or valid by assessing the overall probability of each email based on the later classification of email phrases. Subsequently, the system proceeds to construct decision trees for individual emails. This analysis utilizes the Enron dataset [48], which has a total of 5172 emails. Out of the total 5172 emails analyzed, 2086 were identified as spam, while an equal number of 2086 were classified as legitimate emails. The model can classify emails as either spam or ham by utilizing the feature set derived from the Enron dataset. An error rate of 11% was obtained when utilizing the fitness function from the scikit-learn library in the suggested model. The model achieved an accuracy rate of 89% on the provided dataset.
The email classification methodology for IoT systems presented by Li et al. [49] is founded on the principles of supervised machine learning. The employed methodology involves a multiview approach that prioritizes the acquisition of more comprehensive data for classification. A dataset with two distinct feature sets, namely internal and external, is generated. The suggested methodology has the potential to be applied to both labelled and unlabelled data, and its effectiveness was assessed using two datasets inside an authentic network setting. The findings of this study suggest that the multiview model yields higher levels of accuracy compared to the straightforward approach of email classification. Ultimately, the multiview model is compared with other existing models.
Subasi et al. proposed a spam filtering methodology that utilizes various decision tree algorithms [50]. The objective of their study was to assess the accuracy of these algorithms and determine the most effective one for their specific dataset. The researchers applied various algorithms, including classification and regression tree (CART), NBT, C4.5, LAD tree, REP Tree, random forest, and rotation forest, to the dataset to perform email classification. The findings of the study indicate that the customized random forest model outperformed other decision tree models in terms of accuracy when applied to publicly available datasets.
SVM: The support vector machine (SVM) is a crucial and highly esteemed machine learning model [51]. Technically, the SVM is defined as a discriminative supervised learning classifier. It operates by utilizing labelled examples during the training phase and produces a hyperplane as its output, which is used to categorize new data [52]. Objects in a given set are separated based on their respective class memberships using decision planes. The classification principle of linear SVMs is depicted in Figure 9. The diagram includes many circular and star-shaped entities, which are referred to as objects. These objects can be classified into one of two categories, specifically stars or dots. The separation between the green objects and the brown objects is determined by the separating line. The objects located on the bottom half of the plane are brown stars, while the objects situated on the top half of the plane are green dots. This distinction indicates that the two kinds of objects have been categorized into separate classes. When presented with a new object, specifically a black circle, the model will utilize the training instances provided during the training phase to categorize the circle into one of the available classes.
In their study, Banday and Jan [53] provide a comprehensive analysis of the statistical spam filter methodology. The filters are designed with Naïve Bayes, support vector machines (SVM), KNN, and regression trees [54]. Various supervised machine learning methods are employed, and the obtained results are assessed by metrics such as precision, recall, and accuracy. Based on the application of these machine learning techniques, it was shown that the dataset yielded optimal results when utilizing classification and regression trees (CART) [55] as well as Naïve Bayes classifiers. According to this method, the cost of false positive instances is higher than that of false negative instances in the context of spam filtering.
FIGURE 9. Linear SVMs.
The approach proposed by Zeng et al. [56] aims to identify and classify spammers as well as spam communications within a given social network. In contemporary society, the use of social media has become ubiquitous, with a substantial portion of individuals devoting a significant portion of their time to engaging in interpersonal communication with their close acquaintances. Spammers exploit diverse social media networks as well as the content posted by users to disseminate malicious content, ads, information, and other undesirable materials within the accounts of social media users. This study examines the methods for identifying and detecting posts or information with malicious intent on social media sites. The researchers in this study employ the Sina Weibo social network and utilize a machine learning method known
124306 VOLUME 12, 2024
R. Agarwal et al.: Novel Approach for Spam Detection Using NLP With AMALS Models
as support vector machine (SVM) to identify and classify spammers. The dataset employed in this study comprised 16 million messages obtained from many individuals. A set of 18 features was used as the feature vector. The network’s users fall into two groups: legitimate users and spammers. The model’s training phase used 80% of the available data, with the remaining 20% allocated for testing. To improve the precision of the results, a 1:2 ratio of spammers to non-spammers was used in the training dataset. The suggested model achieves a classification accuracy of 99.5% for distinguishing between spammers and non-spammers, as reported in [57].
Jamil et al. [58] describe a fitness framework that uses Internet of Things (IoT)-enabled blockchain technology and machine learning approaches. The proposed model consists of two parts. The first is a network that uses blockchain technology to secure the sensing devices. It also incorporates smart contracts to manage relationships, as well as an inference engine that reveals hidden insights and actionable information from data collected from IoT sensors and user devices. The enhanced smart contract provides customers with an application that enables real-time monitoring, enhanced control, and fast access to multiple devices dispersed across diverse domains. The primary objective of the inference engine module is to analyze the data collected from the IoT environment to identify hidden patterns and extract valuable information. This process supports efficient decision-making and convenient services. According to the researchers, the proposed model has the potential to enhance system throughput and optimize resource utilization. The technology suggested in that work has potential applications in many domains, such as healthcare and smart enterprises.
Olatunji [59] developed a spam filtering program using support vector machine and extreme learning machine techniques. The researcher used a commonly employed dataset to construct the spam detection model. The support vector machine (SVM) achieved an accuracy of 94.06%, while the extreme learning machine (ELM) model achieved 93.04%, so the SVM outperformed the ELM by a marginal 1.1%. The author noted that this accuracy gain of SVM over ELM is minimal, and that training the SVM took more time than training the ELM. This suggests that where the timeliness of detection is of utmost importance, as in real-time systems, the ELM spam detector is preferable to the SVM spam detector. Tretyakov provided an extensive analysis of different machine-learning methodologies employed in email spam filtering [60]. This study compared precision outcomes with false positives present against precision outcomes after the removal of false positives. The findings show that the results obtained after removing false positives exhibited improved accuracy and reliability compared with the earlier results.

VII. NB CLASSIFIER
The Naïve Bayes classifier is derived from Bayes’ theorem. It assumes that the predictors are independent, meaning that knowledge of one feature does not influence the value of any other feature. Naïve Bayes classifiers are easy to construct, as they do not require an iterative procedure. Furthermore, they are efficient on large datasets while maintaining a good degree of accuracy. Despite its simplicity, Naïve Bayes is well recognized for often outperforming other classification approaches across a range of problems.
Rusland et al. [61] investigate email spam filtering using the Naïve Bayes machine learning method. Two datasets were used and evaluated on accuracy, F-measure, precision, and recall. Naïve Bayes is a classification algorithm that uses probability theory to assign class labels to instances. Specifically, it calculates the likelihood by examining the frequency and combination of values in a given dataset. The study employs a three-step approach to email filtering: preprocessing, feature selection, and classification of the selected features with the Naïve Bayes classifier. Preprocessing begins by removing conjunctions, articles, and stop words from the email content. Subsequently, the researchers used the WEKA program [64] to generate two distinct datasets, namely the spam data dataset and the Spambase dataset. The mean accuracy achieved across the two datasets was 89.59%, with the spam data dataset exhibiting a higher accuracy of 91.13%. The accuracy achieved on the Spambase dataset was 82.54%. Precision averaged 83% on the spam data dataset and 88% on the Spambase dataset. It has been asserted that the Naïve Bayes classifier performs better when applied to the Spambase data than to the spam data.
Sharma and Sahni discuss the use of machine learning algorithms to detect spam in Internet of Things (IoT) devices [62]. The researchers employed five machine-learning models and analyzed their outcomes using a range of performance indicators. A large number of input features were used to train the proposed models. The spam score of each model is computed from the input attributes. This score serves as an indicator of the reliability and credibility of an IoT device, taking into account a range of pertinent aspects. The proposed methodology is verified by employing
the REFIT home automation dataset [63]. The authors assert that their system exhibits superior spam detection capabilities compared to existing systems. Their work applies to smart homes as well as other environments where smart devices are used.
In their study, Singh and Batra examined the application of multiple machine-learning techniques to email spam identification [64]. The work examines machine learning methods and their practical application to various datasets. The best method for email spam detection, in terms of precision and accuracy, is identified by evaluating multiple machine learning techniques. The researchers concluded that the Multinomial Naïve Bayes algorithm yields the most favourable outcomes. However, this approach has certain drawbacks stemming from its reliance on class-conditional independence, so there are instances where the classifier misclassifies certain inputs. The study observed that ensemble models yielded superior and more dependable outcomes than Multinomial Naïve Bayes. The approach described in the study is limited to detecting spam solely from the content of the email body.
Sattu introduced a semi-supervised machine learning approach for spam identification in social Internet of Things (IoT) platforms [65]. An ensemble-based framework comprising four classifiers was employed. The architecture relies on probabilistic data structures (PDS), including a Quotient Filter (QF) for querying the databases of URLs, spam users, and spam keywords. Additionally, Locality Sensitive Hashing (LSH) is employed for similarity searches. The model uses an adaptive weighted voting strategy to reach its decision, taking into account the output of each classifier. The hybrid sampling technique reduces computational effort by selectively collecting data based on each classifier. The findings suggest that the methodology holds potential for effectively detecting spam in large datasets. The efficacy of the model was assessed by comparing PDS with conventional data models, using common evaluation criteria such as accuracy, recall, and F-score.
ANNs: The artificial neural network (ANN), commonly referred to as a neural network (NN), is a computational model derived from the functional characteristics of biological neural networks [66]. A neural network consists of many sets of interconnected neurons, in which information is processed through computational connections. In most cases, an ANN is an adaptive system whose structure changes according to the internal or external information that flows through it during the learning phase. Contemporary neural networks are non-linear tools for the analysis of statistical data. They are frequently employed in situations where there exist intricate connections between inputs and outcomes or atypical performance patterns. Figure 10 illustrates the basic architecture of a neural network.
This section elaborates on various proposed strategies for detecting and preventing email spam using neural networks.
FIGURE 10. Architecture of NN.
The approach proposed by Faris et al. [67] aims to detect spam within online social networks. The authors’ research centres on combining spam messages across different social networking platforms. The researchers collected 1937 tweets classified as spam and 10943 tweets classified as ham from the Twitter platform for analysis. In addition, 1338 spam posts and 9285 ham posts were used in the analysis. In the Twitter spam dataset (TSD), 75.6% of the tweets identified as spam contained URL links, while the other 24.4% consisted of phrases only. Among a total of 10,942 tweets categorized as ham, 62.9% had both URL links and words, while the remaining 37.1% consisted only of words. According to the findings, approximately 32.8% of the spam posts in the Facebook spam dataset (FSD) contain web links, while the remaining 67.2% consist solely of textual content [68]. Of the 9285 posts classified as ham, 95.1% contain web links, while the remaining 4.9% consist solely of textual content. The researchers used the twenty most frequent feature words extracted from the Facebook spam and Twitter spam datasets. The TSD and FSD are each partitioned into a training dataset and a testing dataset. These datasets were used to train several machine learning classifiers, including Naïve Bayes, logistic regression, random tree, random forest, and Bayes Net. After analyzing the precision of the classifiers, the researchers combined the spam dataset from Facebook with the training dataset of Twitter and, likewise, the spam dataset from Twitter with the training dataset of Facebook. They then used the merged datasets to train and evaluate the classifiers. Finally, the researchers compare the classifiers’ outcomes on the two social networks in terms of precision, accuracy, recall, and the F-1 measure. It was
discovered that the precision of aggregated datasets surpassed that of alternative datasets [68], [69].
TABLE 1. Comparison of supervised techniques for spam filtering.
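The accuracy, precision, recall, and F-1 measures used to compare classifiers throughout this review can be computed directly from the counts of true and false positives and negatives. The following is a minimal sketch on made-up labels; the function name `classification_metrics` and the example data are illustrative assumptions and do not come from any of the cited studies.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and F-1 for a binary spam/ham task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions for eight messages.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here two spam messages are caught, one is missed, and one ham message is wrongly flagged, so accuracy is 6/8 while precision, recall, and F-1 are each 2/3. A filter can therefore look strong on accuracy alone while still flagging legitimate mail, which is why the surveyed studies report several measures together.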
The spammer detection technique developed by Makkar and Kumar [70] uses a collaborative neural network in Internet of Things (IoT) applications. The authors introduce a spam detection mechanism named Cospam, designed specifically for IoT applications. First, the user and the speech content at various time intervals are treated as sequences of features. Next, a collaborative neural network model is applied. It comprises three models, namely the Bi-AE model, the GCN model, and the LSTM model, which are used to determine the characteristics of the user. Finally, a series of tests was carried out to assess the efficacy of the proposed methodology. The model demonstrated a 5% increase in accuracy over currently employed methods for detecting spammers. Cospam requires more time than existing techniques because of its large number of parameters.
In the realm of the Internet of Things (IoT), Zavvar et al. [71] introduced a deep learning framework aimed at identifying and mitigating web spam. The method improves the cognitive capabilities of search engines so that they can identify web spam effectively. Its efficacy lies in its ability to eliminate spam pages using a website’s rank score, which is computed by a search engine. The framework leverages the capabilities of deep learning. The first application of the LSTM model for spam detection has since been extended to several domains, including weather forecasting. The study compares the suggested model with ten distinct machine learning models and uses the standard WEBSPAM-UK 2007 dataset. The dataset is preprocessed using a technique referred to as ‘‘Split by Oversampling and Train by Underfitting.’’ The proposed model achieved an accuracy of 95.25%. After system optimization techniques were applied, its accuracy rose to 96.96%.
In their publication, Zavvar et al. [72] discuss spam detection. They propose a methodology that combines particle swarm optimization and neural networks for feature selection, with support vector machines (SVM) used for spam classification and separation. The researchers compared the proposed methodology with alternative methods, namely a self-organizing map and k-means clustering, using area under the curve (AUC) characteristics. This study employs the UCI base dataset to assess the effectiveness of spam categorization and proposes a spam detection methodology based on the Particle Swarm Optimization-Artificial Neural Network (PSO-ANN) and Adaptive Neuro-Fuzzy Inference System (ANFIS) algorithms. The training dataset consisted of 70% of the data, while the remaining 30% was allocated to evaluating the models. The Root Mean Square Error (RMSE), Normalized Root Mean Square Error (NRMSE), and Standard Deviation (STD) were 0.08733, 0.0185, and 0.08742, respectively, during the testing phase. The findings indicate that the proposed approach exhibits favourable accuracy and performance in the detection of spam emails. Table 1 provides a summary of the supervised machine-learning algorithms that have been described for spam identification.
This paper will examine several significant issues encountered by spam filters:
• The proliferation of data on the Internet, with its diverse range of properties, presents a significant obstacle for spam detection systems.
• Evaluating the features of spam filters is challenging along several dimensions, including temporal, writing-style, semantic, and statistical aspects.
• The majority of models are trained using balanced datasets, whereas self-learning models are not feasible.
• Spam detection models are susceptible to adversarial machine-learning approaches that can significantly undermine their efficacy. Adversaries can launch a diverse range of attacks during the training and testing stages of machine learning models. They can manipulate training data to induce misclassification by a classifier, a technique known as a poisoning attack. They can also craft adversarial samples during the testing phase to avoid detection, referred to as an evasion attack. Furthermore, they can acquire sensitive training data by exploiting a learning model, constituting a privacy attack.
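To make the poisoning attack mentioned in the last item concrete, the following minimal sketch trains a small multinomial Naïve Bayes filter on toy messages and then retrains it after an attacker injects spam-like messages labelled as ham. The toy messages, the function names `train_nb` and `predict`, and the use of add-one smoothing are illustrative assumptions; this is not the AMALS model and not the method of any study cited here.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")


def train_nb(samples):
    """Count words and documents per class from (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in samples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set().union(*word_counts.values())
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        # Smoothed log prior of the class.
        score = math.log((doc_counts[label] + 1) / (total_docs + len(LABELS)))
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical clean training set.
clean = [
    ("win free prize now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
# Poisoning: the attacker injects spam-like text carrying the ham label.
poison = [("free prize now", "ham")] * 3

message = "free prize"
clean_label = predict(train_nb(clean), message)
poisoned_label = predict(train_nb(clean + poison), message)
print(clean_label)     # spam
print(poisoned_label)  # ham
```

Three mislabelled messages are enough to flip the decision on this toy data because they shift both the class prior and the per-class word frequencies. Real filters train on far more data, so the effort an attacker needs would differ, but the mechanism is the same.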
least-squares model to modify themes and incorporate gradient descent and AMALS models for estimating missing data. The technique achieves a performance of 98%, outperforming existing industry TF-IDF models in accurate spam prediction within big data ecosystems. The paper also discusses various spam detection techniques, including Cospam, a collaborative neural network model for IoT applications, and the challenges faced by spam filters, such as the proliferation of data, the evaluation of spam filters’ features, the training of models on balanced datasets, and the vulnerability of models to adversarial machine learning approaches.
The emergence of deep fake technology poses a significant problem for spam detection systems, as it can propagate inaccurate information. Future research should focus on real-life data for training experiments and models, rather than manually generated datasets. Hybrid algorithms, deep learning techniques, clustering techniques, blockchain concepts, and collaboration between linguistics and psycholinguistics experts can enhance accuracy and efficiency in spam detection. Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) can improve the performance of spam filters. Standardized labelled datasets and supplementary features such as IP addresses and geographical locations can enhance the accuracy and dependability of spam detection algorithms. The Multinomial Naïve Bayes algorithm, a probabilistic learning technique, is widely used in Natural Language Processing (NLP) for spam detection. It provides comparatively the best results for spam detection, with an accuracy of up to 98%. Future research should explore further avenues in this field. For future work, the security network will be required to improve the consistency of the result and maintain more accuracy than this model.

ACKNOWLEDGMENT
The authors would like to acknowledge the support from Intelligent Prognostic Private Limited, Delhi, India, for carrying out this research work. They would also like to acknowledge the support from Badghis University, Badghis 3351, Afghanistan, for carrying out this research work.

REFERENCES
[1] B. Reaves, L. Blue, D. Tian, P. Traynor, and K. R. B. Butler, ‘‘Detecting SMS spam in the age of legitimate bulk messaging,’’ in Proc. 9th ACM Conf. Secur. Privacy Wireless Mobile Netw., Jul. 2016, pp. 165–170.
[2] W. Z. Khan, M. K. Khan, F. T. Bin Muhaya, M. Y. Aalsalem, and H.-C. Chao, ‘‘A comprehensive study of email spam botnet detection,’’ IEEE Commun. Surveys Tuts., vol. 17, no. 4, pp. 2271–2295, 4th Quart., 2015.
[3] S. K. Tuteja and N. Bogiri, ‘‘Email spam filtering using BPNN classification algorithm,’’ in Proc. Int. Conf. Autom. Control Dyn. Optim. Techn. (ICACDOT), Sep. 2016, pp. 915–919.
[4] D. Burnes, M. DeLiema, and L. Langton, ‘‘Risk and protective factors of identity theft victimization in the United States,’’ Preventive Med. Rep., vol. 17, Mar. 2020, Art. no. 101058.
[5] F. Cassim, ‘‘Protecting personal information in the era of identity theft: Just how safe is our personal information from identity thieves?’’ Potchefstroom Electronic Law J./Potchefstroomse Elektroniese Regsblad, vol. 18, no. 2, pp. 68–110, Mar. 2015.
[6] S. Madakam, R. Ramaswamy, and S. Tripathi, ‘‘Internet of Things (IoT): A literature review,’’ J. Comput. Commun., vol. 3, no. 5, pp. 164–173, 2015.
[7] D. S. Ibrahim, ‘‘Hybrid approach to detect spam emails using preventive and curing techniques,’’ J. Al-Qadisiyah Comput. Sci. Math., vol. 10, no. 3, pp. 16–24, Aug. 2018.
[8] M. Salb, L. Jovanovic, M. Zivkovic, E. Tuba, A. Elsadai, and N. Bacanin, ‘‘Training logistic regression model by enhanced moth flame optimizer for spam email classification,’’ in Computer Networks and Inventive Communication Technologies. Singapore: Springer, 2022, pp. 753–768.
[9] S. S. Roy and V. M. Viswanatham, ‘‘Classifying spam emails using artificial intelligent techniques,’’ Int. J. Eng. Res. Afr., vol. 22, pp. 152–161, Feb. 2016.
[10] Y. Pathak, P. K. Shukla, A. Tiwari, S. Stalin, S. Singh, and P. K. Shukla, ‘‘Deep transfer learning based classification model for COVID-19 disease,’’ IRBM, vol. 43, no. 2, pp. 87–92, Apr. 2022.
[11] A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti, and M. Alazab, ‘‘A comprehensive survey for intelligent spam email detection,’’ IEEE Access, vol. 7, pp. 168261–168295, 2019.
[12] H. Malik, A. Afthanorhan, N. A. Amirah, and N. Fatema, ‘‘Machine learning approach for targeting and recommending a product for project management,’’ Mathematics, vol. 9, no. 16, p. 1958, Aug. 2021.
[13] S. P. Osborne, ‘‘From public service-dominant logic to public service logic: Are public service organizations capable of co-production and value co-creation?’’ Public Manage. Rev., vol. 20, no. 2, pp. 225–231, Feb. 2018.
[14] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas, ‘‘Communication-efficient learning of deep networks from decentralized data,’’ in Proc. 20th Int. Conf. Artif. Intell. Statist., Apr. 2017, pp. 1273–1282.
[15] E. G. Dada, J. S. Bassi, H. Chiroma, S. M. Abdulhamid, A. O. Adetunmbi, and O. E. Ajibuwa, ‘‘Machine learning for email spam filtering: Review, approaches and open research problems,’’ Heliyon, vol. 5, no. 6, Jun. 2019, Art. no. e01802.
[16] G. Das, B. P. Biswal, S. Kandambeth, V. Venkatesh, G. Kaur, M. Addicoat, T. Heine, S. Verma, and R. Banerjee, ‘‘Chemical sensing in two dimensional porous covalent organic nanosheets,’’ Chem. Sci., vol. 6, no. 7, pp. 3931–3939, 2015.
[17] M. Habib, M. Faris, R. Qaddoura, A. Alomari, and H. Faris, ‘‘A predictive text system for medical recommendations in telemedicine: A deep learning approach in the Arabic context,’’ IEEE Access, vol. 9, pp. 85690–85708, 2021.
[18] P. K. Mallick, S. Mishra, and G.-S. Chae, ‘‘Digital media news categorization using Bernoulli document model for web content convergence,’’ Pers. Ubiquitous Comput., vol. 27, no. 3, pp. 1087–1102, Jun. 2023.
[19] S. A. K. Saleh, H. M. Adly, A. A. Abdelkhaliq, and A. M. Nassir, ‘‘Serum levels of selenium, zinc, copper, manganese, and iron in prostate cancer patients,’’ Current Urol., vol. 14, no. 1, pp. 44–49, Mar. 2020.
[20] S. Douzi, F. A. AlShahwan, M. Lemoudden, and B. E. Ouahidi, ‘‘Hybrid email spam detection model using artificial intelligence,’’ Int. J. Mach. Learn. Comput., vol. 10, no. 2, pp. 316–322, Feb. 2020.
[21] G. Sun, S. Li, T. Chen, X. Li, and S. Zhu, ‘‘Active learning method for Chinese spam filtering,’’ Int. J. Performability Eng., vol. 13, no. 4, p. 511, 2017.
[22] M. S. H. Bhuiyan, M. Y. Miah, S. C. Paul, T. D. Aka, O. Saha, M. M. Rahaman, M. J. I. Sharif, O. Habiba, and M. Ashaduzzaman, ‘‘Green synthesis of iron oxide nanoparticle using carica papaya leaf extract: Application for photocatalytic degradation of remazol yellow RR dye and antibacterial activity,’’ Heliyon, vol. 6, no. 8, Aug. 2020, Art. no. e04603.
[23] Z. S. Torabi, M. H. Nadimi-Shahraki, and A. Nabiollahi, ‘‘Efficient support vector machines for spam detection: A survey,’’ Int. J. Comput. Sci. Inf. Secur., vol. 13, no. 1, p. 11, 2015.
[24] M. A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, ‘‘Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,’’ J. Inf. Secur. Appl., vol. 50, Feb. 2020, Art. no. 102419.
[25] S. Vyas, E. Zaganjor, and M. C. Haigis, ‘‘Mitochondria and cancer,’’ Cell, vol. 166, no. 3, pp. 555–566, 2016.
[26] S. Vyas, M. D. Golub, D. Sussillo, and K. V. Shenoy, ‘‘Computation through neural population dynamics,’’ Annu. Rev. Neurosci., vol. 43, no. 1, pp. 249–275, Jul. 2020.
[27] J. Leggott, ‘‘The royal philharmonic Goes to the bathroom: The music of Monty Python,’’ in And Now for Something Completely Different: Critical Approaches to Monty Python, vol. 75. Edinburgh, U.K.: Edinburgh Univ. Press, 2020. [Online]. Available: https://www.degruyter.com/document/doi/10.1515/9781474475174-008/pdf?licenseType=restricted
[28] K. Cunningham, D. Headey, A. Singh, C. Karmacharya, and P. P. Rana, ‘‘Maternal and child nutrition in nepal: Examining drivers of progress from the mid-1990s to 2010s,’’ Global Food Secur., vol. 13, pp. 30–37, Jun. 2017.
[29] J. J. Siegel, Stocks for the Long Run: The Definitive Guide to Financial Market Returns & Long-Term Investment Strategies. New York, NY, USA: McGraw-Hill, 2021.
[30] S. Yadav, A. Saini, A. Dhamija, and Y. Narnauli, ‘‘Discerning spam in social networking sites,’’ Adv. Vis. Comput., Int. J., vol. 3, no. 2, pp. 1–10, Jun. 2016.
[31] S. Duman, K. Kalkan-Cakmakci, M. Egele, W. Robertson, and E. Kirda, ‘‘EmailProfiler: Spearphishing filtering with header and stylometric features of emails,’’ in Proc. IEEE 40th Annu. Comput. Softw. Appl. Conf. (COMPSAC), vol. 1, Jun. 2016, pp. 408–416.
[32] M. Elhoseny, G. Ramírez-González, O. M. Abu-Elnasr, S. A. Shawkat, N. Arunkumar, and A. Farouk, ‘‘Secure medical data transmission model for IoT-based healthcare systems,’’ IEEE Access, vol. 6, pp. 20596–20608, 2018.
[33] S. Park, A. X. Zhang, L. S. Murray, and D. R. Karger, ‘‘Opportunities for automating email processing: A need-finding study,’’ in Proc. CHI Conf. Human Factors Comput. Syst., May 2019, pp. 1–12.
[34] H. Bhuiyan, A. Ashiquzzaman, T. I. Juthi, S. Biswas, and J. Ara, ‘‘A survey of existing e-mail spam filtering methods considering machine learning techniques,’’ Global J. Comput. Sci. Technol., vol. 18, no. 2, pp. 20–29, 2018.
[35] D. Sipahi, G. Dalkiliç, and M. H. Özcanhan, ‘‘Detecting spam through their sender policy framework records,’’ Secur. Commun. Netw., vol. 8, no. 18, pp. 3555–3563, Dec. 2015.
[36] M. Bassiouni, M. Ali, and E. A. El-Dahshan, ‘‘Ham and spam e-mails classification using machine learning techniques,’’ J. Appl. Secur. Res., vol. 13, no. 3, pp. 315–331, Jul. 2018.
[37] M. N. I. Ahsan, T. Nahian, A. A. Kafi, M. I. Hossain, and F. M. Shah, ‘‘An ensemble approach to detect review spam using hybrid machine learning technique,’’ in Proc. 19th Int. Conf. Comput. Inf. Technol. (ICCIT), Dec. 2016, pp. 388–394.
[38] H. Malik, R. Sharma, and S. Mishra, ‘‘Fuzzy reinforcement learning based intelligent classifier for power transformer faults,’’ ISA Trans., vol. 101, pp. 390–398, Jun. 2020.
[39] R. M. A. Mohammad, ‘‘A lifelong spam emails classification model,’’ Appl. Comput. Informat., vol. 20, no. 1, pp. 35–54, Jan. 2024.
[40] J. Johnson, B. Hariharan, L. Van Der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick, ‘‘Inferring and executing programs for visual reasoning,’’ in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2989–2998.
[41] I. D. Foster, J. Larson, M. Masich, A. C. Snoeren, S. Savage, and K. Levchenko, ‘‘Security by any other name: On the effectiveness of provider based email security,’’ in Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur., Oct. 2015, pp. 450–464.
[42] D. J. Larsson, A. Andremont, J. Bengtsson-Palme, K. K. Brandt, A. M. D. R. Husman, P. Fagerstedt, J. Fick, C. F. Flach, W. H. Gaze, M. Kuroda, and K. Kvint, ‘‘Critical knowledge gaps and research needs related to the environmental dimensions of antibiotic resistance,’’ Environ. Int., vol. 117, pp. 132–138, Aug. 2018.
[43] H. Takhmiri and D. A. Haroonabadi, ‘‘Identifying valid email spam emails using decision tree,’’ Int. J. Comput. Appl. Technol. Res., vol. 5, no. 2, pp. 61–65, Jan. 2016.
[44] S. E. Kille, Mapping Between X, document 400 and RFC 822 (RFC987), 1986.
[45] I. Rish, ‘‘An empirical study of the naive Bayes classifier,’’ in Proc. Workshop Empirical Methods Artif. Intell., 2001, vol. 3, no. 22, pp. 41–46.
[46] M. Verma, D. Divya, and S. Sofat, ‘‘Techniques to detect spammers in Twitter - A survey,’’ Int. J. Comput. Appl., vol. 85, no. 10, pp. 27–32, Jan. 2014.
[47] S. Fine, Y. Singer, and N. Tishby, ‘‘The hierarchical hidden Markov model: Analysis and applications,’’ Mach. Learn., vol. 32, pp. 41–62, Jul. 1998.
[48] P. S. Keila and D. B. Skillicorn, ‘‘Structure in the enron email dataset,’’ Comput. Math. Org. Theory, vol. 11, no. 3, pp. 183–199, Oct. 2005.
[49] W. Li, W. Meng, Z. Tan, and Y. Xiang, ‘‘Design of multi-view based email classification for IoT systems via semi-supervised learning,’’ J. Netw. Comput. Appl., vol. 128, pp. 56–63, Feb. 2019.
[50] A. Subasi, J. Kevric, and M. Abdullah Canbaz, ‘‘Epileptic seizure detection using hybrid machine learning methods,’’ Neural Comput. Appl., vol. 31, no. 1, pp. 317–325, Jan. 2019.
[51] H. Malik, A. Almutairi, and M. A. Alotaibi, ‘‘Power quality disturbance analysis using data-driven EMD-SVM hybrid approach,’’ J. Intell. Fuzzy Syst., vol. 42, no. 2, pp. 669–678, Jan. 2022.
[52] Q. Wang, Y. Guan, and X. Wang, ‘‘SVM-based spam filter with active and online learning,’’ in Proc. TREC, 2006, pp. 1–8.
[53] M. T. Banday and T. R. Jan, ‘‘Effectiveness and limitations of statistical spam filters,’’ 2009, arXiv:0910.2540.
[54] W. Peng, L. Huang, J. Jia, and E. Ingram, ‘‘Enhancing the naive Bayes spam filter through intelligent text modification detection,’’ in Proc. 17th IEEE Int. Conf. Trust, Secur. Privacy Comput. Commun./12th IEEE Int. Conf. Big Data Sci. Eng. (TrustCom/BigDataSE), Aug. 2018, pp. 849–854.
[55] D. Steinberg and P. Colla, ‘‘CART: Classification and regression trees,’’ in The Top ten Algorithms in Data Mining, vol. 9. London, U.K.: Taylor & Francis, 2009, p. 179. [Online]. Available: https://www.taylorfrancis.com/chapters/edit/10.1201/9781420089653-17/cart-classification-regression-trees-dan-steinberg
[56] Z. Zeng, X. Zheng, G. Chen, and Y. Yu, ‘‘Spammer detection on Weibo social network,’’ in Proc. IEEE 6th Int. Conf. Cloud Comput. Technol. Sci., Dec. 2014, pp. 881–886.
[57] C. Lin, J. He, Y. Zhou, X. Yang, K. Chen, and L. Song, ‘‘Analysis and identification of spamming behaviors in sina Weibo microblog,’’ in Proc. 7th Workshop Social Netw. Mining Anal., Aug. 2013, pp. 1–9.
[58] F. Jamil, H. K. Kahng, S. Kim, and D.-H. Kim, ‘‘Towards secure fitness framework based on IoT-enabled blockchain network integrated with machine learning algorithms,’’ Sensors, vol. 21, no. 5, p. 1640, Feb. 2021.
[59] S. O. Olatunji, ‘‘Improved email spam detection model based on support vector machines,’’ Neural Comput. Appl., vol. 31, no. 3, pp. 691–699, Mar. 2019.
[60] K. Tretyakov, ‘‘Machine learning techniques in spam filtering,’’ in Proc. Data Mining Problem-Oriented Seminar (MTAT), May 2004, vol. 3, no. 177, pp. 60–79.
[61] N. F. Rusland, N. Wahid, S. Kasim, and H. Hafit, ‘‘Analysis of Naïve Bayes algorithm for email spam filtering across multiple datasets,’’ IOP Conf. Ser., Mater. Sci. Eng., vol. 226, no. 1, Aug. 2017, Art. no. 012091.
[62] A. K. Sharma and S. Sahni, ‘‘A comparative study of classification algorithms for spam email data analysis,’’ Int. J. Comput. Sci. Eng., vol. 3, no. 5, pp. 1890–1895, 2011.
[63] N. Kumar and S. Sonowal, ‘‘Email spam detection using machine learning algorithms,’’ in Proc. 2nd Int. Conf. Inventive Res. Comput. Appl. (ICIRCA), Jul. 2020, pp. 108–113.
[64] A. Singh and S. Batra, ‘‘Ensemble based spam detection in social IoT using probabilistic data structures,’’ Future Gener. Comput. Syst., vol. 81, pp. 359–371, Apr. 2018.
[65] N. Sattu, ‘‘A study of machine learning algorithms on email spam classification,’’ M.S. thesis, Dept. Appl. Comput. Sci., Graduate School Southeast Missouri State Univ., Cape Girardeau, MO, USA, 2020. [Online]. Available: https://www.proquest.com/openview/a165c1b42d9c959784792ae606b130a4/1?pq-origsite=gscholar&cbl=18750&diss=y
[66] H. Xu, W. Sun, and A. Javaid, ‘‘Efficient spam detection across online social networks,’’ in Proc. IEEE Int. Conf. Big Data Anal. (ICBDA), Mar. 2016, pp. 1–6.
[67] H. Faris, I. Aljarah, and J. Alqatawna, ‘‘Optimizing feedforward neural networks using Krill Herd algorithm for e-mail spam detection,’’ in Proc. IEEE Jordan Conf. Appl. Electr. Eng. Comput. Technol. (AEECT), Nov. 2015, pp. 1–5.
[68] A. H. Wang, ‘‘Detecting spam bots in online social networking sites: A machine learning approach,’’ in Proc. IFIP Annu. Conf. Data Appl. Secur. Privacy, Berlin, Germany: Springer, Jun. 2010, pp. 335–342.
[69] Z. Guo, Y. Shen, A. K. Bashir, M. Imran, N. Kumar, D. Zhang, and K. Yu, ‘‘Robust spammer detection using collaborative neural network in Internet-of-Things applications,’’ IEEE Internet Things J., vol. 8, no. 12, pp. 9549–9558, Jun. 2021.
[70] A. Makkar and N. Kumar, ‘‘An efficient deep learning-based scheme for web spam detection in IoT environment,’’ Future Gener. Comput. Syst., vol. 108, pp. 467–487, Jul. 2020.
[71] M. Zavvar, M. Rezaei, and S. Garavand, ‘‘Email spam detection using combination of particle swarm optimization and artificial neural network and support vector machine,’’ Int. J. Modern Educ. Comput. Sci., vol. 8, no. 7, pp. 68–74, Jul. 2016.