This research paper presents a novel spam detection approach using natural language processing (NLP) with AMALS models, which enhances data integrity and addresses issues of data scarcity in spam filtering. The proposed method demonstrates superior performance, achieving a 98% accuracy rate compared to traditional TF-IDF models, by employing a least-squares model and gradient descent techniques for estimating missing data. The study highlights the importance of robust spam detection mechanisms to protect organizations from unauthorized data access and malicious activities.

Received 9 February 2024, accepted 15 April 2024, date of publication 18 April 2024, date of current version 13 September 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3391023

A Novel Approach for Spam Detection Using Natural Language Processing With AMALS Models
RUCHI AGARWAL1 , ANSHITA DHOOT2 , SURYA KANT3 , VIMAL SINGH BISHT3 ,
HASMAT MALIK 4,5 , (Senior Member, IEEE), MD. FAHIM ANSARI 5 ,
ASYRAF AFTHANORHAN 6,7 , AND MOHAMMAD ASEF HOSSAINI 8
1 Department of Computer Applications, JIMS Engineering Management Technical Campus, Greater Noida 201303, India
2 Department Phystech, School of Radio Engineering and Computer Technology, Moscow Institute of Physics and Technology, 141701 Moscow, Russia
3 Department of Electronics and Communication Engineering, Graphic Era Hill University, Bhimtal 263136, India
4 Department of Electrical Power Engineering, Faculty of Electrical Engineering, Universiti Technologi Malaysia (UTM), Johor Bahru 81310, Malaysia
5 Department of Electrical Engineering, Graphic Era (Deemed to be University), Dehradun 248002, India
6 Artificial Intelligence for Islamic Civilization and Sustainability, Universiti Sultan Zainal Abidin (UniSZA), Kuala Nerus, Terengganu 21300, Malaysia
7 Operation Research and Management Sciences Universiti Sultan Zainal Abidin (UniSZA), Kuala Nerus, Terengganu 21300, Malaysia
8 Department of Physics, Badghis University, Badghis 3351, Afghanistan

Corresponding authors: Mohammad Asef Hossaini ([email protected]), Hasmat Malik ([email protected]), and
Asyraf Afthanorhan ([email protected])
This work was supported in part by Intelligent Prognostic Private Ltd., Delhi, India; and in part by Badghis University, Badghis,
Afghanistan.

ABSTRACT To improve their operations, organizations rely on the big data ecosystem to manage vast volumes of information effectively. Doing so requires analyzing textual data while safeguarding data integrity and organizing and validating data through spam filters. Various methodologies can be employed, including Word2Vec, bag-of-words, BERT, and term frequency-inverse document frequency (TF-IDF). Nevertheless, none of these solutions effectively addresses the problem of data scarcity, which can leave missing information in the collected documents. To address this problem, it is necessary to employ a strategy that categorizes each document by subject matter and uses statistical approaches for approximation. This research paper presents a novel approach for spam detection using natural language processing. The proposed strategy utilizes a least-squares model to modify themes and incorporates gradient descent and alternating least-squares (i.e., AMALS) models for estimating missing data. TF-IDF and uniform-distribution methods perform the estimation. The performance evaluation reveals that the suggested technique exhibits a superior performance of 98% compared to the existing industry TF-IDF model in accurately predicting spam within big data ecosystems. With this model, the environment of an organization or a company can be protected from spamming and other attacks that could expose its data to unauthorized users.

INDEX TERMS Artificial intelligence, big data, machine learning, spam detection.

The associate editor coordinating the review of this manuscript and approving it for publication was Manuel Rosa-Zurera.
2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 12, 2024, page 124298
R. Agarwal et al.: Novel Approach for Spam Detection Using NLP With AMALS Models

I. INTRODUCTION

In the contemporary age of information technology, the process of transferring information has become significantly streamlined and expedited. Numerous platforms exist that enable users to disseminate knowledge globally. Email is often regarded as one of the most straightforward, cost-effective, and expeditious means of disseminating information around the globe. However, owing to their inherent simplicity, electronic mail (email) systems are susceptible to several forms of malicious activities, with the most prevalent and perilous being unsolicited bulk messages, commonly referred to as spam [1]. The receipt of unsolicited emails that are unrelated to one’s interests is generally undesirable since it results in the wastage of recipients’ time and resources. In addition, it is important to note that emails
might potentially contain harmful content that is concealed within attachments or URLs, posing a risk to the security of the host system [2]. Spam refers to the transmission of unsolicited and irrelevant messages or emails by an individual or entity to a large number of recipients using various means of information dissemination, such as email or other communication channels [3]. Therefore, there is a significant need for robust security measures in place for the email system. Spam emails have the potential to contain malicious software such as viruses, remote access Trojans (RATs), and other Trojans. This approach is predominantly employed by attackers to entice consumers into internet services. The individuals in question can transmit unsolicited emails that include attachments with a variety of file extensions. These attachments may contain URLs that have been manipulated to direct users to websites that engage in hazardous activities, such as spamming and fraudulent behaviour. As a result, users may experience detrimental consequences, including data or financial fraud, as well as identity theft [4], [5]. Numerous email service providers offer their users the capability to establish rule-based filters that automatically categorize incoming emails based on keywords. However, this methodology seems to be of limited utility as it presents challenges in terms of complexity, and users exhibit a reluctance to personalize their emails, rendering their email accounts vulnerable to spam attacks.

Over the past few decades, the Internet of Things (IoT) has emerged as an integral aspect of contemporary society, seeing significant and rapid expansion. The IoT has emerged as a crucial element within the context of smart cities. There exists a multitude of social media applications and platforms that are based on IoT technology. The proliferation of the IoT has led to a significant escalation in the prevalence of spamming issues. Researchers have put forth a range of spam detection techniques to identify and eliminate spam content and individuals engaging in spamming activities. The current methods for spam identification can be broadly classified into two categories: behaviour pattern-based approaches and semantic pattern-based approaches. These methodologies possess inherent restrictions and disadvantages. The proliferation of spam emails has experienced a notable expansion in tandem with the emergence and widespread adoption of the Internet and global communication [6]. Spam messages are produced globally through the utilization of the Internet, employing techniques to conceal the identity of the attacker. Numerous antispam methods and approaches have been developed; yet, the prevalence of spam remains significantly elevated. The most perilous forms of unsolicited electronic communications are malicious emails that include hyperlinks directing recipients to websites designed to inflict harm upon the victim’s data. The presence of spam emails has the potential to impede server response times due to the occupation of server memory or capacity. To effectively identify and prevent the proliferation of spam emails, organizations undertake a meticulous assessment of the various instruments at their disposal to address this escalating problem. Several well-known techniques for identifying and analyzing incoming emails to detect spam include whitelisting/blacklisting [7], email header analysis, and keyword verification, among others.

According to estimates provided by social networking professionals, over 40% of accounts on social networks are utilized for spam [8]. Spammers employ widely used social networking technologies to selectively target distinct segments, review pages, or fan pages and discreetly embed hyperlinks that direct users to pornographic and other commercial websites. These websites are typically associated with false accounts and aim to promote the sale of illicit products. The poisonous emails that are disseminated to persons or organizations of a similar nature have recurring characteristics. Through a thorough examination of these key points, one can enhance the efficacy of identifying and detecting such forms of electronic correspondence. The classification of emails into spam and non-spam categories can be achieved by the application of artificial intelligence (AI) [9]. One alternative approach to solving this problem involves extracting features from the headers, subject, and body of the messages. Once the data has been extracted and categorized according to its characteristics, messages can be classified into two groups: spam or ham. Currently, spam detection is frequently accomplished by the utilization of learning-based classifiers [10]. In the context of learning-based classification, the approach to detection operates under the assumption that spam emails possess distinct properties that can be used to distinguish them from valid emails [11]. Several aspects contribute to the heightened complexity of the spam identification process in learning-based models. The elements encompassed in this context are spam subjectivity, concept drift, linguistic difficulties, overhead processing, and text latency.

Prominent multinational firms like Amazon have established an extensive infrastructure comprising numerous servers and databases. These resources are utilized not only for the storage of literary works but also to accommodate a substantial volume of product-related data. The aforementioned data facilities have been intentionally created to attain optimal productivity and have the potential to be offered as services to other organizations [1]. Various forms of structured data are grouped inside big data ecosystems. However, text data often lacks structure and necessitates analysis to offer additional services utilizing consumer big data. The capturing of the features of company and customer actions in the online environment may be achieved through the use of textual communication [2]. The utilization of Natural Language Processing (NLP) methodologies for the analysis of unstructured textual data encompasses approaches such as Word2Vec and bag-of-words.

Bag-of-words (BOW), Bidirectional Encoder Representations from Transformers (BERT), and term frequency–inverse document frequency (TF-IDF) are three commonly used techniques in natural language processing (NLP).
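
As a minimal illustrative sketch, and not the implementation evaluated in this paper, the bag-of-words and TF-IDF representations named above can be computed as follows. The function names `bag_of_words` and `tf_idf`, the whitespace tokenizer, the unsmoothed weighting tf × ln(N/df), and the three toy documents are all assumptions introduced here for illustration.

```python
import math
from collections import Counter


def bag_of_words(doc):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(doc.lower().split())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    Libraries often apply smoothing, so their values may differ.
    """
    counts = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append(
            {t: (n / total) * math.log(n_docs / df[t]) for t, n in c.items()}
        )
    return weights


# Hypothetical toy corpus: two spam-like messages and one ordinary message.
docs = ["win a free prize now", "meeting agenda for monday", "free prize inside"]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

In this sketch, a term that appears in every document receives a weight of zero, while a term confined to a single document is weighted most heavily. Any term absent from a document has no entry at all, which is one simple way to see how sparse such representations become.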

Nevertheless, the task of analyzing surface-level textual information obtained through Natural Language Processing (NLP) poses challenges, particularly regarding the scarcity and omission of textual data. To address this issue, traditional models employ a range of methodologies in conjunction with machine learning and statistical methods. Furthermore, several models have conducted comparisons and experiments on document clustering matrices by transforming the document-word matrix into a document-factor scoring matrix [12]. Nevertheless, the issue of sparsity continues to have an impact on the performance of document clustering. This paper introduces a novel approach for spam identification using natural language processing (NLP). The proposed technique combines the ratios of topic-altering least squares (TALS), approximations gradient descent (AMGD), and approximations alternating least squares (AMALS) models:

• The TALS framework categorizes feature-related concerns by putting them into the process of addressing sparsity issues and approximating them through the utilization of a probability distribution. This approach aims to enhance the predictability and suitability of the features.
• The AMGD algorithm employs a gradient descent (GD) function as well as a uniform distribution to address the issue of missing information by approximating the model.
• The remaining scarcity issue is addressed by AMALS through the implementation of alternating least squares (ALS), L2 normalization, and uniform distribution.

This research presents a unique machine learning methodology to address the challenges of shortage and missing information in large-scale data documents.

• This research successfully reduces the performance gap between the testing and training sets of documents.
• This study provides a novel natural language processing (NLP)-based spam detection model that exhibits enhanced performance in comparison to the conventional term frequency-inverse document frequency (TF-IDF) approach.
• This study presents a new finding that supports the advantages of utilizing the ALS function in conjunction with the GD algorithm for effectively classifying spam text inside a large-scale data environment.

The subsequent sections of this work are organized in the following manner. Section II provides an overview of the backdrop. Section III provides an elucidation of the underlying factors that drive the research endeavour. Section IV introduces the recommended methodology. Section V of the paper provides an analysis and assessment of the subject matter, while Section VI serves as the concluding section, summarizing the main findings and implications of the study.

An instance of learning-based models can be observed in the form of an extreme learning machine (ELM). The present study introduces a contemporary machine-learning approach designed for feedforward neural networks, specifically focusing on architectures with a solitary hidden layer [12]. When compared to standard neural networks, it effectively addresses issues related to sluggish training speed and overfitting. In the ELM framework, a single iteration cycle is sufficient. Due to its enhanced capacity for generalization, robustness, and controllability, this method has gained widespread adoption across various domains. This study examines various machine-learning techniques utilized in the context of spam identification. The contributions made by our team are categorized as follows:

• The present paper examines a range of machine learning-based filters for spam, exploring their architectural design and evaluating their respective advantages and disadvantages. In addition, we engaged in a discussion regarding the fundamental characteristics of unsolicited email communications, commonly referred to as spam.
• A complete examination of the proposed strategies and the nature of spam revealed some intriguing research gaps in the field of spam detection and filtering.
• This section presents a discussion on open research topics and future research objectives aimed at enhancing email security and spam email filtration through the utilization of machine learning algorithms.
• In this paper, the authors examine the existing obstacles encountered by spam filtering algorithms and analyze the impact of these challenges on the efficiency of the models.
• This paper presents a thorough examination of several machine learning techniques and concepts, with a specific focus on their application in the field of spam identification.
• The paper classifies several machine learning techniques-based spam detection approaches to gain a comprehensive understanding of their underlying principles.
• This section presents a range of potential avenues for future research in the field of spam detection and filtration. These areas aim to enhance the detection capabilities and bolster the security of email platforms.

II. LITERATURE REVIEW

Email spam refers to the dissemination of fraudulent or unsolicited bulk messages through various accounts or automated systems. The proliferation of unsolicited emails, commonly referred to as spam, has exhibited a steady upward trend, emerging as a prevalent issue during the past ten years. Spam emails are commonly obtained through the utilization of spambots, which are automated programs designed to scour the Internet for email addresses. The utilization of machine learning techniques has significantly contributed to the identification and detection of unsolicited and unwanted emails commonly referred to as spam. Researchers are employing a range of models and strategies to advance the development of innovative spam detection and filtering models [13]. In their
study, McMahan et al. [14] conducted a comprehensive survey on the topic of email spam detection. They focused on employing a supervised approach that incorporates feature selection techniques. The authors engage in a discourse regarding the knowledge discovery process employed in the context of spam detection systems. In addition, the authors provide detailed explanations of numerous strategies and technologies that have been presented for the detection of spam. This survey also discusses the selection of features using N-gram analysis. The N-Gram [15], [16] algorithm is a predictive method employed for estimating the likelihood of the subsequent word after observing the preceding N − 1 words in a sentence or text corpus. The N-Gram model employs probabilistic methods to anticipate the subsequent word. The study conducts a comparative analysis of different ways for email spam detection, including both machine learning techniques such as multilayer perceptron neural network, support vector machine, and Naïve Bayes, as well as non-machine learning methods such as signatures, blacklists and whitelists, and mail header verification.

In their publication, Saleh et al. [19] provide an extensive examination of the topic of smart spam email detection through the use of a survey. The authors engage in a comprehensive examination of security vulnerabilities associated with electronic mail, with a particular emphasis on spam emails. The discourse encompasses an exploration of the breadth of spam analysis, as well as an examination of several methodologies employed in both machine learning and non-machine learning approaches for spam identification and filtration. The researchers concluded that there is a significant prevalence of supervised learning algorithms, as evidenced by the adoption rates, in the context of email spam detection [18]. The authors assert that the primary reason for the widespread adoption of supervised learning is the high level of accuracy and consistency exhibited by supervised techniques. The researchers also engaged in a discussion on multialgorithm frameworks and determined that such frameworks exhibit more efficiency compared to their single-algorithm counterparts. It has been observed that the majority of research endeavours involving the use of email content to identify spam, namely phishing emails, mostly rely on word-based classification as well as clustering techniques.

Sun et al. [21] provide a comprehensive overview of learning-based methodologies employed in the domain of email spam filtering. This study discusses the issue of spam and presents a comprehensive analysis of learning-based spam filtering techniques. The authors elucidate diverse characteristics of unsolicited electronic communications commonly referred to as spam emails. This study examines the impact of spam emails on various domains. This study also examines the diverse economic and ethical concerns associated with spam. The prevalent antispam strategy involves the utilization of learning-based filtering, which has undergone significant advancements. The filters that are frequently employed rely on diverse classification approaches that are applied to the different elements of email communications. This paper considers whether the Naïve Bayes classifier occupies a distinct place among various learning algorithms employed in the context of spam filtering. The tool exhibits remarkable efficiency and clarity, yielding outcomes of great accuracy.

In their study, Bhuiyan et al. [22] provide a comprehensive analysis of contemporary methodologies employed in email spam filtering. The authors provide an overview of several spam filtering methodologies and evaluate the performance of several suggested systems by examining multiple metrics through a comprehensive analysis. The authors engage in a discussion regarding the efficacy of various approaches employed to filter unsolicited and unwanted emails commonly referred to as spam. Certain individuals have achieved favourable outcomes, while others are endeavouring to integrate alternative methods to enhance their level of accuracy. Despite their overall success, experts remain concerned about the various challenges encountered in spam filtering technologies. The researchers are endeavouring to develop an advanced spam filtering mechanism capable of comprehending vast quantities of multimedia data to effectively filter out spam emails. The authors conclude that the predominant approach for email spam filtering involves the utilization of the Naïve Bayes and Support Vector Machine (SVM) algorithms. To evaluate the efficacy of spam filtration models, it is possible to train these models using many datasets, such as the “ECML” and UCI datasets [21].

In their study, Ferrag et al. [24] conducted a comprehensive examination of deep learning techniques utilized in intrusion detection systems as well as spam detection datasets. The authors engaged in a comprehensive examination of detection systems that rely on deep learning models, subsequently assessing the efficacy of those models. The researchers analyzed a total of 35 widely recognized cyber datasets, which were then classified into seven distinct groups. The aforementioned categories encompass datasets that are classified as Internet traffic-based, networking traffic-based, Intranet traffic-based, electric network-based, virtualized private network-based, Android apps-based, IoT traffic-based, and Internet linked device-based datasets. The researchers concluded that deep learning models exhibit superior performance compared to classical machine learning and lexical models in the context of intrusion as well as spam detection.

In their study, Vyas et al. [25] provide a comprehensive analysis of supervised machine-learning techniques employed in the context of spam email filtering. The researchers concluded that the Naïve Bayes method exhibits superior speed and satisfactory precision compared to the other methods reviewed, except SVM and ID3. Support Vector Machines (SVM) and Iterative Dichotomiser 3 (ID3) algorithms provide higher precision compared to the Naïve Bayes algorithm, albeit at the cost of significantly increased system construction time. A trade-off exists between the factors of timing and precision. The authors conclude that the
choice of learning algorithm is contingent upon the specific circumstances and the desired levels of accuracy and efficiency. It is asserted that to develop a more resilient spam filtering architecture, careful consideration should be given to all components of the email.

This survey study examines three primary categories of machine-learning techniques that can be employed for spam filtering. In this study, we undertake a comprehensive examination of multiple scholarly articles, analyzing the suggested methodologies and deliberating on the obstacles encountered in spam detection and filtration systems. This paper also examines the merits and drawbacks of the proposed methodologies for spam identification and filtration that have not been previously evaluated.

III. SPAM DETECTION

The origin of the term “spam” can be traced back to a Monty Python episode [23], whereby the Hormel canned meat product is humorously exaggerated and repetitively emphasized. The term “spam” was reportedly first employed in 1978 to refer to unsolicited email. However, its prevalence grew significantly in the mid-1990s, extending beyond academic and research communities [24]. One common type of spam is advance-fee fraud, wherein a recipient is sent an electronic communication with a proposition that purportedly leads to a reward. In the contemporary technological era, the scammer or spammer presents a narrative wherein an unlucky individual requires immediate financial assistance, enabling the fraudster to amass a significantly larger sum of money, which they would subsequently distribute amongst themselves. The individual engaging in fraudulent activities may choose to either generate financial gains or cease all forms of communication once the unsuspecting victim makes the agreed-upon payment.

A. METHODOLOGY FOR SPAM FILTERING FOR IoT PLATFORMS & EMAIL

The prevalence of unsolicited emails, sometimes referred to as spam, is experiencing a notable surge across several domains including marketing, chain communication, stock market tips, politics, and education [24]. At present, multiple organizations are engaged in the development of diverse approaches and algorithms aimed at enhancing the effectiveness of spam detection and filtering processes. In this section, we examine several filtering procedures to have a comprehensive understanding of the filtering process.

1) METHOD OF SPAM FILTERING

The standard spam filtering mechanism is the filtering system that employs a predefined set of rules and operates as a classifier based on these protocols. Figure 1 depicts a conventional approach to the filtration of unsolicited electronic communications, commonly referred to as spam. The initial phase involves the implementation of content filters, which employ artificial intelligence methodologies to discern and identify spam [25]. The implementation of the email headers filter, which involves the extraction of header data from the email, occurs during the second stage. Subsequently, a series of blacklist filters are implemented to effectively identify and intercept emails originating from the blacklist file, thereby mitigating the influx of spam emails. Following this phase, rule-based filters are employed to identify the sender by utilizing the subject line and parameters specified by the user. The utilization of allowance and task filters is achieved through the implementation of a technique that enables the account holder to initiate the transmission of mail [26].

FIGURE 1. Email filtering process.

2) SPAM FILTERING ON THE CLIENT SIDE

A client refers to an individual who can utilize the Internet as well as an email network to transmit or receive electronic mail [27]. Client-side spam detection provides various rules and techniques to ensure the secure delivery of communications between individuals and organizations. To facilitate the transfer of data, a client should implement several pre-existing frameworks on their system. These systems establish connections with client mail agents and carry out the task of filtering the client’s mailbox by composing, accepting, and managing incoming emails [28], [29].

3) SPAM FILTERING AT ENTERPRISE LEVEL

Email spam detection at the enterprise level involves the implementation of diverse filtering frameworks on the server. These frameworks are responsible for managing the mail transfer agent as well as categorizing the received emails as either spam or legitimate (ham) [30]. The system client continuously and successfully utilizes the enterprise filtering methodology to filter emails on a network. Current approaches to spam detection employ a scoring system to evaluate emails. This principle outlines the specification of a rating function, which generates a score for each message. The categorization of messages as either junk mail or ham is determined through the assignment of certain scores and ranks [31]. Due to the varying tactics employed by spammers, a list-based technique is frequently employed to automatically block their communications, necessitating continual modifications to all associated activities. The reproduction of Figure 2 has been sourced from the work of Bhuiyan et al. [22]. The architecture of both the
client and enterprise-level spam filtering process is depicted in Figure 2.

FIGURE 2. Enterprise-level spam process.

4) CASE-BASED SPAM FILTERING

The case-based or sample-based spam filtering system is a widely recognized and traditional machine-learning approach for detecting spam [32]. Figure 3 depicts a standard case-based filtering framework. The filtering process in question involves multiple stages, facilitated by the collecting method. In the initial phase, data (namely, emails) is gathered. Subsequently, the primary transition persists using the preprocessing procedures executed via the graphical user interface of the client. These steps involve delineating abstraction and selecting the method for classifying email data. The overall process is then tested using vector expression, resulting in the classification of the data into two distinct categories: spam and legitimate email.

FIGURE 3. Standard case of filtering framework.

IV. UNITS

The Internet of Things (IoT) refers to a network of interconnected objects that are connected to the Internet and capable of collecting and transmitting data wirelessly, without requiring human involvement. The IoT facilitates the seamless integration and deployment of physical items in various geographical locations. In the given context, the effective management and monitoring of network performance pose significant challenges and necessitate the implementation of robust privacy and security solutions. To address security concerns in IoT applications, it is imperative to prioritize the protection of privacy against various threats, including but not limited to intrusions, phishing attempts, DoS attacks, spamming, and malware. The IoT system, encompassing both its objects and networks, exhibits susceptibility to network and physical threats as well as privacy breaches. Figure 4 provides a visual representation of the primary categories of attacks targeting the Internet of Things (IoT).

FIGURE 4. Primary categories of IoT.

The enumerated instances of attacks targeting Internet of Things (IoT) systems are presented as follows:

• A self-promotion attack: This attack involves a hacked node attempting to gain superiority over all other nodes within the IoT environment for a specific recommendation.
• A bad-mouthing attack: In this attack, a compromised node provides an incorrect recommendation, potentially undermining the trustworthiness of a trusted node. As a result, the services provided by the trusted node experience a decline.
• A ballot-stuffing attack: Within the IoT ecosystem, a compromised node can amplify the functionality and effectiveness of other compromised nodes, giving those nodes an opportunity to provide their services. This phenomenon is commonly referred to as the collusion recommendation attack.
• An opportunistic service attack: In this form of attack, a compromised node actively cooperates with other malicious nodes in order to execute the bad-mouthing and ballot-stuffing attacks.
• An on-off attack: In this style of attack, the infiltrated node exhibits substandard service provision, as it engages in the random execution of detrimental services.
• Node tampering: The perpetrator manipulates the malevolent node and obtains targeted data, including a security key.


• Malicious node attack: The attacker physically inserts the malicious node into the group of nodes.
• Man-in-the-middle attack: The attacker covertly intercepts the conversation between two nodes across the Internet and acquires crucial information by eavesdropping on private conversations.
• Sybil attack: An adversary creates many fake identities in a network to gain control of or manipulate the system. The compromised node illicitly appropriates the reputation of legitimate nodes and assumes the role of a trustworthy node.

A study conducted by Nozomi Networks reveals a notable rise in attacks and threats targeting Operational Technology (OT) and Internet of Things (IoT) networks over the first six months of 2020. Figure 5 illustrates the frequency of cyber assaults on IoT devices in different years.

FIGURE 5. Frequency of cyber assaults.

Machine learning methodologies have proven highly effective in preventing and detecting such attacks. Numerous research projects have sought to identify and mitigate these difficulties, as outlined in Section V.

V. MACHINE LEARNING

Machine learning is widely recognized as a significant and valuable application of artificial intelligence (AI), enabling computer systems to acquire knowledge autonomously and improve their performance without explicit programming [34]. The basic objective of machine learning algorithms is to construct automated systems that retrieve and use data for training. The initial stage of the learning process involves acquiring labelled data, commonly referred to as the training dataset. The user's input can take several forms, such as real-life experiences, reviews, examples, or feedback. These forms serve to identify patterns within the data, enabling improved decision-making in the future. The primary goal of machine learning methods is to acquire knowledge autonomously, without human interaction. Machine learning encompasses three primary categories that are employed for a wide range of activities.

Over the past decade, scholars have endeavoured to make email communication more effective than it currently is. Spam filtering for email systems is widely recognized as a crucial measure in safeguarding email networks [35]. Numerous publications have applied diverse machine-learning methods to detect and manage spam emails. However, certain areas within this research domain remain unexplored or inadequately addressed. The study of junk mail is a prominent area of research that addresses existing knowledge gaps [36]. Numerous studies have sought to improve the reliability and use of email communication by employing various spam classification techniques. This study aims to provide a concise overview of several machine-learning techniques and approaches currently employed in email spam detection. It additionally assesses the prevailing machine learning methods, namely K-Nearest Neighbors (KNN), Support Vector Machines (SVM), random forest, and Naïve Bayes.

A. SPAM FILTERING BASED ON ML

Machine learning plays a crucial role in processing large volumes of data efficiently. While it generally yields fast and precise results in identifying undesirable content, it may require additional time and money to train the models adequately for optimal performance. The combination of machine learning, artificial intelligence (AI), and cognitive computing has the potential to enhance the processing of large datasets. Figure 6 illustrates a range of machine learning methodologies.

FIGURE 6. Methodologies of ML.

1) SUPERVISED ML

Supervised ML algorithms are machine learning models that require annotated data to learn and make predictions. The models are first trained on labelled training data and subsequently used to make predictions about future events.

FIGURE 7. Supervised learning method of ML.
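As a minimal sketch of this train-then-predict workflow, the Python fragment below fits a very small multinomial Naïve Bayes spam filter on a handful of labelled messages and then classifies unseen ones. The toy messages, the function names `train` and `predict`, and the use of Laplace (add-one) smoothing are assumptions made for illustration; they are not taken from any of the surveyed systems.

```python
import math
from collections import Counter, defaultdict


def train(labelled_messages):
    """Count words per class from (text, label) pairs (the labelled training data)."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    class_counts = Counter()             # label -> number of messages
    for text, label in labelled_messages:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log posterior for an unseen message."""
    word_counts, class_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior: share of training messages carrying this label.
        score = math.log(class_counts[label] / total_messages)
        words_in_class = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Laplace smoothing (an assumption of this sketch) avoids log(0).
            score += math.log(
                (word_counts[label][word] + 1) / (words_in_class + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labelled training data.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]

model = train(TRAINING)
spam_guess = predict(model, "claim your free prize")
ham_guess = predict(model, "project meeting monday")
```

The labelled examples are consumed only by `train`; `predict` sees nothing but the learned counts, which mirrors the separation between the training phase and later predictions on new data.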

To clarify, these models start by examining a pre-existing training dataset, from which they derive a methodology for predicting success ratings. After appropriate training, the system can generate predictions for any new data that is relevant to the data provided by the user during the training phase [38]. In addition, the learning algorithm compares the generated output against the desired output and detects mistakes in order to refine the model. Supervised learning thus relies on labelled data during the training phase, enabling the model to make predictions on unseen data. It has applications in diverse problem domains, including assessing the attractiveness of advertisements, classifying spam emails, recognizing faces, and categorizing objects. The method of supervised learning is depicted in Figure 7.

VI. DT CLASSIFIER

The decision tree classifier is a machine learning approach that has gained significant popularity in classification during the past decade [39]. The technique takes a straightforward approach to categorization problems. A decision tree classifier is a set of precisely defined questions about the properties of test records. Each answer leads to a subsequent question, and the questioning continues until a definitive conclusion is reached and recorded [40]. Tree-based decision algorithms are a class of models generated by an iterative or recursive process from the available data. The objective of decision tree-based algorithms is to forecast the value of a target variable from a given set of input values. The technique uses a hierarchical tree structure to address classification and regression problems [41]. The basic structure underlying the decision tree is depicted in Figure 8.

FIGURE 8. Decision tree structure.

Several decision tree algorithms include the following:
• The random forest algorithm is a popular machine-learning technique that combines many decision trees to make predictions.
• Classification and regression trees (CART) are a type of decision tree algorithm used for both classification and regression tasks.
• C4.5 and C5.0 are specific versions of decision tree algorithms that use information gain and gain ratio measures to construct the trees.
• The chi-square test is a statistical method used to determine the independence of two categorical variables.

This section examines various decision tree algorithms that have been developed to detect and prevent email spam:

Kille et al. examine a spam filtering methodology that uses random forest algorithms to categorize spam emails, together with active learning techniques to improve classification accuracy [43]. The researchers used a dataset of email messages in the RFC 822 (Internet) format [44] and partitioned each email into two distinct portions. Next, they calculate the term frequency and inverse document frequency (TF/IDF) for all features present in each email. To construct the training dataset, a clustering technique is used to label a collection of emails. After evaluating the cluster prototype emails for training purposes, the researchers ran experiments with supervised machine learning methods, namely random forests, Naïve Bayes, support vector machine, and KNN [45]. The findings indicate that the random forest method classifies the data most effectively, with an accuracy of 95.2%.

Takhmiri and Haroonabadi [43] propose an alternative approach for spam detection that uses a fuzzy decision tree in conjunction with the Naïve Bayes method. The bake voting algorithm is employed to extract patterns of spam behaviour. This behaviour is shown due to the absence of overt qualities in the tangible realm. The degree of cross-linking utilized to explicate or depict personalities is both sensible and impartial. Decision trees employ fuzzy Mamdani rules to classify spam and ham emails. The authors then apply the Naïve Bayes classifier [45] to the dataset. Finally, the voting process partitions the votes into more manageable segments. This solution provides an optimum weight that may be applied to the derived percentages to reach a higher level of accuracy. The dataset utilized


in this research included a total of 1000 email messages, of which 350 (35%) were identified as spam and the remaining 650 (65%) were classified as legitimate messages (ham).

Verma and Sofat employed the supervised machine learning technique ID3 (Quinlan, 1986) to construct decision trees for the task [46]. They also used the hidden Markov model [47] to estimate the probabilities of various occurrences, which were then combined to categorize emails as either junk mail or ham [47]. The suggested model first classifies emails as either spam or valid by assessing the overall probability of each email based on the later classification of email phrases. The system then constructs decision trees for individual emails. The analysis uses the Enron dataset [48], which has a total of 5172 emails. Of these, 2086 were identified as spam, while 2086 were classified as legitimate emails. The model can classify emails as either spam or ham using the feature set derived from the Enron dataset. An 11% inaccuracy was obtained when using the fitness function from the sk-learn library in the suggested model; that is, the model achieved an accuracy of 89% on the dataset.

The email classification methodology for IoT systems presented by Li et al. [49] is based on supervised machine learning. It uses a multiview approach that prioritizes acquiring more comprehensive data for classification. A dataset with two distinct feature sets, internal and external, is generated. The methodology can be applied to both labelled and unlabelled data, and its effectiveness was assessed using two datasets in a real network environment. The findings suggest that the multiview model achieves higher accuracy than the straightforward approach to email classification. Finally, the multiview model is compared with other existing models.

Subasi et al. proposed a spam filtering methodology that uses various decision tree algorithms [50]. The objective of their study was to assess the accuracy of these algorithms and determine the most effective one for their specific dataset. The researchers applied several algorithms to the dataset for email classification, including classification and regression tree (CART), NBT, C4.5, LAD tree, REP Tree, random forest, and rotation forest. The findings indicate that the customized random forest model outperformed the other decision tree models in accuracy when applied to publicly available datasets.

SVM: The support vector machine (SVM) is a crucial and highly regarded machine learning model [51]. Formally, the SVM is a discriminative supervised learning classifier. It uses labelled examples during the training phase and produces a hyperplane as its output, which is used to categorize new data [52]. Objects in a given set are separated according to their class memberships using decision planes. The classification principle of linear SVMs is depicted in Figure 9. The diagram includes many circular and star-shaped entities, referred to as objects. These objects can be classified into one of two categories, stars or dots. The separating line determines whether an item is green or brown. The objects in the bottom half of the plane are brown stars, while the objects at the top of the plane are green dots. This indicates that the two kinds of object have been placed in separate classes. When presented with a new object, specifically a black circle, the model uses the training instances provided during the training phase to assign the circle to one of the available classes.

FIGURE 9. Linear SVMs.

In their study, Banday and Jan [53] provide a comprehensive analysis of statistical spam filters. The filters are designed with Naïve Bayes, support vector machines (SVM), KNN, and regression trees [54]. Various supervised machine learning methods are employed, and the results are assessed with metrics such as precision, recall, and accuracy. With these techniques, the dataset yielded the best results with classification and regression trees (CART) [55] and Naïve Bayes classifiers. According to this method, the computational cost of evaluating false positive instances is higher than that of false negative instances in spam filtering.

The approach proposed by Zeng et al. [56] aims to identify and classify spammers and spam communications within a given social network. Social media use has become ubiquitous, with many people spending a significant portion of their time communicating with close acquaintances. Spammers exploit social media networks and the content posted by users to spread malicious content, ads, information, and other undesirable material within the accounts of social media users. This study examines methods for identifying and detecting posts or information with malicious intent on social media sites. The researchers use the Sina Weibo social network and a machine learning method known

as support vector machine (SVM) to identify and classify spammers. The dataset comprised 16 million messages obtained from many individuals. A set of 18 features was used as a component of the vector set. The network's users can be classified into two distinct groups: legitimate users and spammers. The model was trained on 80% of the available data, with the remaining 20% used for testing. To improve the precision of the results, a ratio of 1:2 between spammers and non-spammers was used in the training dataset. The suggested model achieves a classification accuracy of 99.5% for distinguishing spammers from non-spammers, as reported in [57].

Jamil et al. [58] describe a fitness framework that uses IoT-enabled blockchain technology and machine learning approaches. The model consists of two distinct parts. The first is a network that uses blockchain technology to secure sensing devices. It also incorporates smart contracts to facilitate relationships, as well as an inference engine that reveals concealed insights and actionable information from data collected from IoT sensors and user devices. The enhanced smart contract provides customers with an application that enables real-time monitoring, enhanced control, and fast access to multiple devices dispersed across diverse domains. The primary objective of the inference engine module is to analyze the data collected from the IoT environment to identify hidden patterns and extract valuable information. This aids efficient decision-making and the provision of convenient services. According to the researchers, the proposed model can improve system throughput and optimize resource utilization. The technology has potential applications in many domains, such as healthcare and intelligent enterprises.

The spam filtering program developed by Olatunji [59] employs support vector machine and extreme learning machine techniques. The researcher used a commonly employed dataset to construct the spam detection model. The support vector machine (SVM) achieved an accuracy of 94.06%, while the extreme learning machine (ELM) model achieved 93.04%, so the SVM outperformed the ELM by a marginal 1.1%. The author suggested that the accuracy improvement of the SVM over the ELM is minimal. Where the timeliness of detection is of utmost importance, as in real-time systems, it is therefore advisable to prefer the ELM spam detector over the SVM spam detector. The research found that the SVM was more accurate, but that training the SVM took more time than training the ELM.

Tretyakov provided an extensive analysis of different machine-learning methodologies employed in email spam filtering [60]. This study compared precision outcomes with false positives present against precision outcomes after the removal of false positives. The results after the removal of false positives showed improved accuracy and reliability compared to previous iterations.

VII. NB CLASSIFIER

The Naïve Bayes classifier is derived from Bayes' theorem. It assumes that the predictors are independent, meaning that knowing the value of one attribute does not influence the value of any other attribute. Naïve Bayes classifiers are easy to construct, as they do not require an iterative procedure. They are also notably efficient on extensive datasets while maintaining a commendable degree of accuracy. Despite its simplicity, Naïve Bayes is well recognized for performing better than other classification approaches across a range of problems.

In their study, Rusland et al. [61] investigate email spam filtering using the Naïve Bayes machine learning method. Two datasets were used and analyzed on the metrics of accuracy, F-measure, precision, and recall. Naïve Bayes is a classification algorithm that uses probability theory to assign class labels to instances. Specifically, it calculates the likelihood by examining the frequency and combination of values present in a given dataset. The study employs a three-step approach to email filtering: preprocessing, feature selection, and implementation of the features through the Naïve Bayes classifier. Preprocessing removes conjunctions, articles, and stop words from the content of the email. The researchers then used the WEKA program [64] to generate two distinct datasets, namely the spam data and the Spambase dataset. The mean accuracy across the two datasets was 89.59%, with the spam data achieving 91.13% and the Spambase dataset 82.54%. The precision for the spam data was 83% on average, whereas for the Spambase dataset it was 88%. It has been asserted that the Naïve Bayes classifier performs better on the Spambase data than on the spam data.

Sharma and Sahni published a study on the use of machine learning algorithms to detect spam in Internet of Things (IoT) devices [62]. The researchers employed five machine-learning models and analyzed their results using a range of performance indicators. A large number of input features were used to train the proposed models. The spam score of each model is computed from the input attributes. This score indicates the reliability and credibility of an IoT device, taking a range of pertinent aspects into account. The proposed methodology is verified by employing


the REFIT home automation dataset [63]. The authors assert that their system detects spam better than existing systems in use. Their work applies to smart homes and other environments where intelligent devices are employed.

In their study, Singh and Batra examined the application of multiple machine-learning techniques to email spam identification [64]. The paper examines machine learning methodologies and their practical application to various datasets. The method with the best precision and accuracy for email spam detection is identified by evaluating multiple machine learning techniques. The researchers concluded that the Multinomial Naïve Bayes algorithm yields the most favourable outcomes. However, this approach has certain drawbacks that stem from its reliance on class-conditional independence, so there are instances where the machine misclassifies certain inputs. The study also observed that ensemble models yielded better and more dependable results than Multinomial Naïve Bayes. The approach is limited to detecting spam solely from the content of the email body.

Sattu introduced a semi-supervised machine learning approach for spam identification in social Internet of Things (IoT) platforms [65]. An ensemble-based framework with four classifiers was employed. The architecture relies on probabilistic data structures (PDS), including a Quotient Filter (QF) for querying the databases of URLs, spam users, and spam keywords. In addition, Locality Sensitive Hashing (LSH) is used for similarity searches. The model employs an adaptive weighted voting strategy to minimize its decision-making process, taking into account the output of each classifier. The hybrid sampling technique reduces computational effort by selectively collecting data based on each classifier. The findings suggest that the methodology holds potential for detecting spam effectively in extensive datasets. The efficacy of the model was assessed by comparing PDS with conventional data models, using common assessment criteria such as accuracy, recall, and F-score.

ANNs: The artificial neural network (ANN), commonly referred to as the neural network (NN), is a computational model derived from the functional characteristics of biological neural networks [66]. A neural network consists of many sets of interconnected neurons, in which information is processed through computational connections. In most cases, an ANN is an adaptive system whose structure changes based on the internal or external information that flows through it during the learning phase. Contemporary neural networks are non-linear methods for analysing statistical data. They are frequently employed where there are intricate connections between inputs and outcomes or atypical performance patterns. Figure 10 illustrates the fundamental architecture of a neural network.

FIGURE 10. Architecture of NN.

This section elaborates on various proposed strategies for detecting and preventing email spam using neural networks.

The approach proposed by Faris et al. [67] aims to detect spam within online social networks. The authors' research centres on combining unsolicited messages across different social networking platforms. Using the Twitter platform, the researchers collected 1937 tweets classified as spam and 10943 tweets classified as ham for analysis. In addition, 1338 spam posts and 9285 ham posts were used. In the Twitter Spam Detection (TSD) data, 75.6% of the tweets identified as spam had URL links, while the other 24.4% consisted of distinct phrases. Among the 10,942 tweets categorized as ham, 62.9% had both URL links and words, while the remaining 37.1% consisted only of words. Approximately 32.8% of the spam posts in FSD contain web links, while the remaining 67.2% consist solely of text [68]. Of the 9285 posts classified as ham, 95.1% contain web links, while the remaining 4.9% consist solely of text. The researchers used the twenty most frequent feature words extracted from the Facebook spam and Twitter spam datasets. The TSD and FSD are each partitioned into a training dataset and a testing dataset. These datasets were used to train various machine learning classifiers, including Naïve Bayes, logistic regression, random tree, random forest, and Bayes Net. After analyzing the precision of the classifiers, the researchers combined the Facebook spam dataset with the Twitter training dataset, and likewise the Twitter spam dataset with the Facebook training datasets. They then used the merged dataset to train and evaluate the classifiers. Finally, the researchers compare the classifiers' results on the two social networks in terms of precision, accuracy, recall, and the F-1 measure. It was discovered that the precision of aggregated datasets surpassed that of alternative datasets [68], [69].

TABLE 1. Comparison of supervised techniques for spam filtering.
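The evaluation measures named in this comparison (precision, recall, accuracy, and the F-1 measure) can all be computed from the counts of true and false positives and negatives. The following Python sketch assumes that "spam" is the positive class; the function names and the two label lists are invented for illustration and do not reproduce results from any cited study.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn) with `positive` treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1   # spam correctly flagged
            else:
                fp += 1   # ham wrongly flagged as spam
        else:
            if truth == positive:
                fn += 1   # spam that slipped through
            else:
                tn += 1   # ham correctly passed
    return tp, fp, fn, tn


def evaluate(y_true, y_pred, positive="spam"):
    """Compute precision, recall, accuracy and F1 for non-empty label lists."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Hypothetical ground truth and classifier output for eight messages.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

counts = confusion_counts(y_true, y_pred)
scores = evaluate(y_true, y_pred)
```

Because the ham class is usually larger than the spam class, accuracy alone can look high while precision and recall on spam remain modest, which is one reason the surveyed studies tend to report several measures together.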
The spammer detection technique developed by Makkar and Kumar [70] uses a collaborative neural network in the context of Internet of Things (IoT) applications. The authors introduce a spam detection mechanism named Cospam, designed specifically for IoT applications. Initially, the individual and the speech content at various time intervals are treated as sequences of features. The next phase uses a cooperative neural network model that comprises three distinct models, namely the Bi-AE model, the GCN model, and the LSTM model. These models are used to determine the characteristics or attributes of the user. Finally, a series of tests was carried out to assess the efficacy of the proposed methodology. The model demonstrated a 5% increase in accuracy compared to currently employed methods for detecting spammers. The time required for Cospam is
greater compared to existing techniques due to the presence
of numerous parameters.
In the realm of the Internet of Things (IoT), Zavvar et al. [71] introduced a deep learning framework aimed at identifying and mitigating web spam. The method improves the ability of search engines to identify instances of web spam effectively. The efficacy of this strategy lies in its ability to eliminate spam pages using a website's rank score, which is derived from calculations performed by a search engine. The framework employed in their study leverages the comprehensive capabilities of deep learning. The first application of the LSTM model for spam detection has since been extended to several domains, including weather forecasting. The study compares the suggested model with ten distinct machine learning models and uses the WEBSPAM-UK 2007 standardized dataset. The dataset is preprocessed using a technique referred to as “Split by Oversampling as well as Train by Underfitting.” The proposed model achieved an accuracy of 95.25%. After system optimization techniques were applied, its accuracy rose to 96.96%.
In their publication, Zavvar et al. [72] discuss the topic of spam detection. They propose a methodology that integrates particle swarm optimization techniques and neural networks for feature selection. In addition, support vector machines (SVM) were employed for spam classification and segregation. The researchers compared the proposed methodology with alternative methodologies, namely a self-organizing map and k-means clustering, using area under the curve (AUC) characteristics. The study employs the UCI base dataset to assess the effectiveness of spam categorization and proposes a spam detection methodology based on the Particle Swarm Optimization-Artificial Neural Network (PSO-ANN) and Adaptive Neuro-Fuzzy Inference System (ANFIS) algorithms. The training dataset consisted of 70% of the data, while the remaining 30% was allocated to evaluate the models. The Root Mean Square Error (RMSE), Normalized Root Mean Square Error (NRMSE), and Standard Deviation (STD) were examined, yielding values of 0.08733, 0.0185, and 0.08742, respectively, during the testing phase. The findings indicate that the proposed approach exhibits favourable levels of accuracy and performance in the detection of spam emails. Table 1 provides a summary of the supervised machine-learning algorithms that have been described for the purpose of spam identification.
This paper will examine several significant issues encountered by spam filters:
• The proliferation of data on the Internet, characterized by its diverse range of properties, presents a significant obstacle for spam detection systems.
• Evaluating the features of spam filters poses challenges in various dimensions, including temporal, writing-style, semantic, and statistical aspects.
• The majority of models are trained using datasets that are balanced in nature, whereas self-learning models are not feasible.
• Spam detection models are susceptible to adversarial machine-learning approaches that can significantly undermine their efficacy. During the training and testing stages of machine learning models, adversaries can launch a diverse range of attacks. They can manipulate training data to induce misclassification by a classifier, a technique known as a poisoning attack. They can also generate adversarial samples during the testing phase to avoid detection, referred to as an evasion attack. Furthermore, they can acquire sensitive training data by exploiting a learning model, which constitutes a privacy attack.

VOLUME 12, 2024 124309

R. Agarwal et al.: Novel Approach for Spam Detection Using NLP With AMALS Models

• The emergence of deep fake technology poses a significant problem for spam detection systems. Neural network models, such as GPT-2 and GPT-3, are used to generate, modify, and stylize images and videos. Image generation models, namely BigGAN, StyleGAN, and CycleGAN, are also employed for this purpose. Deep fakes have the potential to propagate inaccurate information.

FIGURE 11. Actual and predicted sets of Ham and Spam.

FIGURE 12. Faceted distribution of label and unnamed (Category) available in the dataset.

VIII. RESEARCH GAP AND PROBLEM
This section examines the areas of research that have not yet been addressed and the unresolved issues within the field of spam detection and filtration. Future research should use real-life data for training experiments and models rather than manually generated datasets, because models trained on fake datasets perform notably poorly when applied to real-life data, as highlighted in multiple scholarly articles. Presently, the field of spam detection employs reinforcement learning, supervised learning, and unsupervised learning algorithms. Hybrid algorithms, however, could deliver better accuracy and efficiency in future research. Feature extraction can likewise be improved through deep learning techniques. Clustering techniques in spam filtering, specifically for relevance feedback with dynamic updating, have the potential to improve the clustering of spam and ham messages. In addition to machine learning, blockchain concepts and models hold potential for future applications in email spam detection.
In the future, there is potential for collaboration between linguistics and psycholinguistics experts in the manual annotation of datasets. This collaboration might lead to the creation of spam datasets that are both successful and adhere to standardized practices, characterized by high dimensionality. The performance of spam filters could also be enhanced by leveraging Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). These technologies offer improved processing speed, higher classification accuracy, reduced energy consumption, enhanced flexibility, and the ability to analyze data in real time. Furthermore, future research should focus on providing standardized labelled datasets that researchers can use for training classifiers. The accuracy and dependability of spam detection algorithms can also be improved by incorporating supplementary features into the dataset, such as the IP address and geographical location of the spammer. The subsequent sections outline other avenues for further research and highlight unresolved issues within the field of spam identification.
Over the past two decades, the scientific community has focused significantly on spam identification and filtration. The rationale for extensive research in this domain stems from its significant and far-reaching implications, particularly for customer behaviour and the prevalence of counterfeit reviews. The survey encompasses a range of machine learning methods and models that researchers have presented to identify and mitigate spam in emails as well as IoT systems. The study classifies learning approaches as supervised, unsupervised, and reinforcement learning, compares several methodologies, and presents an overview of the key insights gained from each category. It concludes that a majority of the suggested solutions for detecting spam in email and Internet of Things (IoT) systems rely on supervised machine learning approaches. Creating a labelled dataset for training a supervised model is essential and requires a significant amount of time. In the domain of spam detection, supervised learning algorithms, specifically Support Vector Machines (SVM) and Naïve Bayes, have been observed to outperform alternative models. This paper offers a thorough examination of various algorithms used in the identification and filtering of email spam, along with an exploration of potential avenues for future research in this field.

IX. METHODOLOGY
The Multinomial Naive Bayes algorithm constitutes a probabilistic learning technique commonly employed in the field
of Natural Language Processing (NLP). The algorithm used in this study is grounded in Bayes' theorem, which enables it to predict the class of various textual forms, including but not limited to emails and newspaper articles. The algorithm computes the likelihood of each tag for a given sample and then outputs the tag with the greatest likelihood.
The Naive Bayes technique is widely recognized for its efficacy in analyzing text input and addressing classification problems involving several classes. To understand how the Naive Bayes classifier works, it is first necessary to grasp Bayes' theorem, on which it is built.
Bayes' theorem, originally proposed by Thomas Bayes, is a mathematical formula that enables the calculation of the likelihood of an event's occurrence by incorporating prior knowledge regarding conditions associated with the event. The formula upon which it is based is as follows:

$$P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)} \qquad (1)$$

Here, P(A|B) is the probability of class A given the presence of predictor B. P(B) is the prior probability of predictor B, and P(A) is the prior probability of class A. The conditional probability P(B|A) is the likelihood of predictor B given class A. This model provides comparatively the best results for spam detection, with an accuracy of up to 98%.
Table 2 compares different ML methods: Multinomial NB achieves 98% accuracy, SVM 96.37%, K-NN 97%, random forest 96%, and AdaBoost+DT 97.50%.

TABLE 2. Comparative table of different methods of ML.

X. RESULTS AND DISCUSSION
In this model, the prediction architecture was applied to a dataset of 5171 entries and four columns, which hold integer (int64) data along with two object-type columns, and which occupies 161.78+ KB of memory. The Multinomial NB model achieves an accuracy of 0.9777458722182341. This value is high enough to process the facet distribution and is better than that of the existing model for spam detection on this dataset. The model consists of a pipeline that includes Count Vectorizer and Multinomial NB.

FIGURE 13. Graphical representations of values and distributions of spam emails.

FIGURE 14. Graph shows the different categorical distribution of dataset “ham” and “spam”.

XI. CONCLUSION
This research paper presents a novel approach for spam detection using natural language processing, utilizing a
least-squares model to modify themes and incorporate gradient descent and AMALS models for estimating missing data. The technique achieves 98% accuracy, outperforming existing industry TF-IDF models in spam prediction within big data ecosystems. The paper also discusses various spam detection techniques, including Cospam, a collaborative neural network model for IoT applications, as well as the challenges faced by spam filters, such as the proliferation of data, evaluating spam filters' features, training models using balanced datasets, and the vulnerability of models to adversarial machine learning approaches.
The emergence of deep fake technology poses a significant problem for spam detection systems, as it can propagate inaccurate information. Future research should focus on real-life data for training experiments and models, rather than manually generated datasets. Hybrid algorithms, deep learning techniques, clustering techniques, blockchain concepts, and collaboration between linguistics and psycholinguistics experts can enhance accuracy and efficiency in spam detection. Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) can improve the performance of spam filters. Standardized labelled datasets and supplementary features such as IP addresses and geographical locations can enhance the accuracy and dependability of spam detection algorithms. The Multinomial Naive Bayes algorithm, a probabilistic learning technique, is widely used in Natural Language Processing (NLP) for spam detection. It provides comparatively the best results for spam detection, with an accuracy of up to 98%. Future research should explore further avenues in this field. For future work, the security network will be required to improve the consistency of the result and maintain more accuracy than this model.

ACKNOWLEDGMENT
The authors would like to acknowledge the support of Intelligent Prognostic Private Limited, Delhi, India, and of Badghis University, Badghis 3351, Afghanistan, in carrying out this research work.

REFERENCES
[1] B. Reaves, L. Blue, D. Tian, P. Traynor, and K. R. B. Butler, “Detecting SMS spam in the age of legitimate bulk messaging,” in Proc. 9th ACM Conf. Secur. Privacy Wireless Mobile Netw., Jul. 2016, pp. 165–170.
[2] W. Z. Khan, M. K. Khan, F. T. Bin Muhaya, M. Y. Aalsalem, and H.-C. Chao, “A comprehensive study of email spam botnet detection,” IEEE Commun. Surveys Tuts., vol. 17, no. 4, pp. 2271–2295, 4th Quart., 2015.
[3] S. K. Tuteja and N. Bogiri, “Email spam filtering using BPNN classification algorithm,” in Proc. Int. Conf. Autom. Control Dyn. Optim. Techn. (ICACDOT), Sep. 2016, pp. 915–919.
[4] D. Burnes, M. DeLiema, and L. Langton, “Risk and protective factors of identity theft victimization in the United States,” Preventive Med. Rep., vol. 17, Mar. 2020, Art. no. 101058.
[5] F. Cassim, “Protecting personal information in the era of identity theft: Just how safe is our personal information from identity thieves?” Potchefstroom Electronic Law J./Potchefstroomse Elektroniese Regsblad, vol. 18, no. 2, pp. 68–110, Mar. 2015.
[6] S. Madakam, R. Ramaswamy, and S. Tripathi, “Internet of Things (IoT): A literature review,” J. Comput. Commun., vol. 3, no. 5, pp. 164–173, 2015.
[7] D. S. Ibrahim, “Hybrid approach to detect spam emails using preventive and curing techniques,” J. Al-Qadisiyah Comput. Sci. Math., vol. 10, no. 3, pp. 16–24, Aug. 2018.
[8] M. Salb, L. Jovanovic, M. Zivkovic, E. Tuba, A. Elsadai, and N. Bacanin, “Training logistic regression model by enhanced moth flame optimizer for spam email classification,” in Computer Networks and Inventive Communication Technologies. Singapore: Springer, 2022, pp. 753–768.
[9] S. S. Roy and V. M. Viswanatham, “Classifying spam emails using artificial intelligent techniques,” Int. J. Eng. Res. Afr., vol. 22, pp. 152–161, Feb. 2016.
[10] Y. Pathak, P. K. Shukla, A. Tiwari, S. Stalin, S. Singh, and P. K. Shukla, “Deep transfer learning based classification model for COVID-19 disease,” IRBM, vol. 43, no. 2, pp. 87–92, Apr. 2022.
[11] A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti, and M. Alazab, “A comprehensive survey for intelligent spam email detection,” IEEE Access, vol. 7, pp. 168261–168295, 2019.
[12] H. Malik, A. Afthanorhan, N. A. Amirah, and N. Fatema, “Machine learning approach for targeting and recommending a product for project management,” Mathematics, vol. 9, no. 16, p. 1958, Aug. 2021.
[13] S. P. Osborne, “From public service-dominant logic to public service logic: Are public service organizations capable of co-production and value co-creation?” Public Manage. Rev., vol. 20, no. 2, pp. 225–231, Feb. 2018.
[14] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. Y. Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proc. 20th Int. Conf. Artif. Intell. Statist., Apr. 2017, pp. 1273–1282.
[15] E. G. Dada, J. S. Bassi, H. Chiroma, S. M. Abdulhamid, A. O. Adetunmbi, and O. E. Ajibuwa, “Machine learning for email spam filtering: Review, approaches and open research problems,” Heliyon, vol. 5, no. 6, Jun. 2019, Art. no. e01802.
[16] G. Das, B. P. Biswal, S. Kandambeth, V. Venkatesh, G. Kaur, M. Addicoat, T. Heine, S. Verma, and R. Banerjee, “Chemical sensing in two dimensional porous covalent organic nanosheets,” Chem. Sci., vol. 6, no. 7, pp. 3931–3939, 2015.
[17] M. Habib, M. Faris, R. Qaddoura, A. Alomari, and H. Faris, “A predictive text system for medical recommendations in telemedicine: A deep learning approach in the Arabic context,” IEEE Access, vol. 9, pp. 85690–85708, 2021.
[18] P. K. Mallick, S. Mishra, and G.-S. Chae, “Digital media news categorization using Bernoulli document model for web content convergence,” Pers. Ubiquitous Comput., vol. 27, no. 3, pp. 1087–1102, Jun. 2023.
[19] S. A. K. Saleh, H. M. Adly, A. A. Abdelkhaliq, and A. M. Nassir, “Serum levels of selenium, zinc, copper, manganese, and iron in prostate cancer patients,” Current Urol., vol. 14, no. 1, pp. 44–49, Mar. 2020.
[20] S. Douzi, F. A. AlShahwan, M. Lemoudden, and B. E. Ouahidi, “Hybrid email spam detection model using artificial intelligence,” Int. J. Mach. Learn. Comput., vol. 10, no. 2, pp. 316–322, Feb. 2020.
[21] G. Sun, S. Li, T. Chen, X. Li, and S. Zhu, “Active learning method for Chinese spam filtering,” Int. J. Performability Eng., vol. 13, no. 4, p. 511, 2017.
[22] M. S. H. Bhuiyan, M. Y. Miah, S. C. Paul, T. D. Aka, O. Saha, M. M. Rahaman, M. J. I. Sharif, O. Habiba, and M. Ashaduzzaman, “Green synthesis of iron oxide nanoparticle using carica papaya leaf extract: Application for photocatalytic degradation of remazol yellow RR dye and antibacterial activity,” Heliyon, vol. 6, no. 8, Aug. 2020, Art. no. e04603.
[23] Z. S. Torabi, M. H. Nadimi-Shahraki, and A. Nabiollahi, “Efficient support vector machines for spam detection: A survey,” Int. J. Comput. Sci. Inf. Secur., vol. 13, no. 1, p. 11, 2015.
[24] M. A. Ferrag, L. Maglaras, S. Moschoyiannis, and H. Janicke, “Deep learning for cyber security intrusion detection: Approaches, datasets, and comparative study,” J. Inf. Secur. Appl., vol. 50, Feb. 2020, Art. no. 102419.
[25] S. Vyas, E. Zaganjor, and M. C. Haigis, “Mitochondria and cancer,” Cell, vol. 166, no. 3, pp. 555–566, 2016.
[26] S. Vyas, M. D. Golub, D. Sussillo, and K. V. Shenoy, “Computation through neural population dynamics,” Annu. Rev. Neurosci., vol. 43, no. 1, pp. 249–275, Jul. 2020.
[27] J. Leggott, “The royal philharmonic Goes to the bathroom: The music of Monty Python,” in And Now for Something Completely Different: Critical Approaches to Monty Python, vol. 75. Edinburgh, U.K.: Edinburgh Univ. Press, 2020. [Online]. Available: https://www.degruyter.com/document/doi/10.1515/9781474475174-008/pdf?licenseType=restricted
[28] K. Cunningham, D. Headey, A. Singh, C. Karmacharya, and P. P. Rana, “Maternal and child nutrition in nepal: Examining drivers of progress from the mid-1990s to 2010s,” Global Food Secur., vol. 13, pp. 30–37, Jun. 2017.
[29] J. J. Siegel, Stocks for the Long Run: The Definitive Guide to Financial Market Returns & Long-Term Investment Strategies. New York, NY, USA: McGraw-Hill, 2021.
[30] S. Yadav, A. Saini, A. Dhamija, and Y. Narnauli, “Discerning spam in social networking sites,” Adv. Vis. Comput., Int. J., vol. 3, no. 2, pp. 1–10, Jun. 2016.
[31] S. Duman, K. Kalkan-Cakmakci, M. Egele, W. Robertson, and E. Kirda, “EmailProfiler: Spearphishing filtering with header and stylometric features of emails,” in Proc. IEEE 40th Annu. Comput. Softw. Appl. Conf. (COMPSAC), vol. 1, Jun. 2016, pp. 408–416.
[32] M. Elhoseny, G. Ramírez-González, O. M. Abu-Elnasr, S. A. Shawkat, N. Arunkumar, and A. Farouk, “Secure medical data transmission model for IoT-based healthcare systems,” IEEE Access, vol. 6, pp. 20596–20608, 2018.
[33] S. Park, A. X. Zhang, L. S. Murray, and D. R. Karger, “Opportunities for automating email processing: A need-finding study,” in Proc. CHI Conf. Human Factors Comput. Syst., May 2019, pp. 1–12.
[34] H. Bhuiyan, A. Ashiquzzaman, T. I. Juthi, S. Biswas, and J. Ara, “A survey of existing e-mail spam filtering methods considering machine learning techniques,” Global J. Comput. Sci. Technol., vol. 18, no. 2, pp. 20–29, 2018.
[35] D. Sipahi, G. Dalkiliç, and M. H. Özcanhan, “Detecting spam through their sender policy framework records,” Secur. Commun. Netw., vol. 8, no. 18, pp. 3555–3563, Dec. 2015.
[36] M. Bassiouni, M. Ali, and E. A. El-Dahshan, “Ham and spam e-mails classification using machine learning techniques,” J. Appl. Secur. Res., vol. 13, no. 3, pp. 315–331, Jul. 2018.
[37] M. N. I. Ahsan, T. Nahian, A. A. Kafi, M. I. Hossain, and F. M. Shah, “An ensemble approach to detect review spam using hybrid machine learning technique,” in Proc. 19th Int. Conf. Comput. Inf. Technol. (ICCIT), Dec. 2016, pp. 388–394.
[38] H. Malik, R. Sharma, and S. Mishra, “Fuzzy reinforcement learning based intelligent classifier for power transformer faults,” ISA Trans., vol. 101, pp. 390–398, Jun. 2020.
[39] R. M. A. Mohammad, “A lifelong spam emails classification model,” Appl. Comput. Informat., vol. 20, no. 1, pp. 35–54, Jan. 2024.
[40] J. Johnson, B. Hariharan, L. Van Der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick, “Inferring and executing programs for visual reasoning,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2017, pp. 2989–2998.
[41] I. D. Foster, J. Larson, M. Masich, A. C. Snoeren, S. Savage, and K. Levchenko, “Security by any other name: On the effectiveness of provider based email security,” in Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur., Oct. 2015, pp. 450–464.
[42] D. J. Larsson, A. Andremont, J. Bengtsson-Palme, K. K. Brandt, A. M. D. R. Husman, P. Fagerstedt, J. Fick, C. F. Flach, W. H. Gaze, M. Kuroda, and K. Kvint, “Critical knowledge gaps and research needs related to the environmental dimensions of antibiotic resistance,” Environ. Int., vol. 117, pp. 132–138, Aug. 2018.
[43] H. Takhmiri and D. A. Haroonabadi, “Identifying valid email spam emails using decision tree,” Int. J. Comput. Appl. Technol. Res., vol. 5, no. 2, pp. 61–65, Jan. 2016.
[44] S. E. Kille, Mapping Between X.400 and RFC 822, document RFC 987, 1986.
[45] I. Rish, “An empirical study of the naive Bayes classifier,” in Proc. Workshop Empirical Methods Artif. Intell., 2001, vol. 3, no. 22, pp. 41–46.
[46] M. Verma, D. Divya, and S. Sofat, “Techniques to detect spammers in Twitter - A survey,” Int. J. Comput. Appl., vol. 85, no. 10, pp. 27–32, Jan. 2014.
[47] S. Fine, Y. Singer, and N. Tishby, “The hierarchical hidden Markov model: Analysis and applications,” Mach. Learn., vol. 32, pp. 41–62, Jul. 1998.
[48] P. S. Keila and D. B. Skillicorn, “Structure in the enron email dataset,” Comput. Math. Org. Theory, vol. 11, no. 3, pp. 183–199, Oct. 2005.
[49] W. Li, W. Meng, Z. Tan, and Y. Xiang, “Design of multi-view based email classification for IoT systems via semi-supervised learning,” J. Netw. Comput. Appl., vol. 128, pp. 56–63, Feb. 2019.
[50] A. Subasi, J. Kevric, and M. Abdullah Canbaz, “Epileptic seizure detection using hybrid machine learning methods,” Neural Comput. Appl., vol. 31, no. 1, pp. 317–325, Jan. 2019.
[51] H. Malik, A. Almutairi, and M. A. Alotaibi, “Power quality disturbance analysis using data-driven EMD-SVM hybrid approach,” J. Intell. Fuzzy Syst., vol. 42, no. 2, pp. 669–678, Jan. 2022.
[52] Q. Wang, Y. Guan, and X. Wang, “SVM-based spam filter with active and online learning,” in Proc. TREC, 2006, pp. 1–8.
[53] M. T. Banday and T. R. Jan, “Effectiveness and limitations of statistical spam filters,” 2009, arXiv:0910.2540.
[54] W. Peng, L. Huang, J. Jia, and E. Ingram, “Enhancing the naive Bayes spam filter through intelligent text modification detection,” in Proc. 17th IEEE Int. Conf. Trust, Secur. Privacy Comput. Commun./12th IEEE Int. Conf. Big Data Sci. Eng. (TrustCom/BigDataSE), Aug. 2018, pp. 849–854.
[55] D. Steinberg and P. Colla, “CART: Classification and regression trees,” in The Top ten Algorithms in Data Mining, vol. 9. London, U.K.: Taylor & Francis, 2009, p. 179. [Online]. Available: https://www.taylorfrancis.com/chapters/edit/10.1201/9781420089653-17/cart-classification-regression-trees-dan-steinberg
[56] Z. Zeng, X. Zheng, G. Chen, and Y. Yu, “Spammer detection on Weibo social network,” in Proc. IEEE 6th Int. Conf. Cloud Comput. Technol. Sci., Dec. 2014, pp. 881–886.
[57] C. Lin, J. He, Y. Zhou, X. Yang, K. Chen, and L. Song, “Analysis and identification of spamming behaviors in sina Weibo microblog,” in Proc. 7th Workshop Social Netw. Mining Anal., Aug. 2013, pp. 1–9.
[58] F. Jamil, H. K. Kahng, S. Kim, and D.-H. Kim, “Towards secure fitness framework based on IoT-enabled blockchain network integrated with machine learning algorithms,” Sensors, vol. 21, no. 5, p. 1640, Feb. 2021.
[59] S. O. Olatunji, “Improved email spam detection model based on support vector machines,” Neural Comput. Appl., vol. 31, no. 3, pp. 691–699, Mar. 2019.
[60] K. Tretyakov, “Machine learning techniques in spam filtering,” in Proc. Data Mining Problem-Oriented Seminar (MTAT), May 2004, vol. 3, no. 177, pp. 60–79.
[61] N. F. Rusland, N. Wahid, S. Kasim, and H. Hafit, “Analysis of Naïve Bayes algorithm for email spam filtering across multiple datasets,” IOP Conf. Ser., Mater. Sci. Eng., vol. 226, no. 1, Aug. 2017, Art. no. 012091.
[62] A. K. Sharma and S. Sahni, “A comparative study of classification algorithms for spam email data analysis,” Int. J. Comput. Sci. Eng., vol. 3, no. 5, pp. 1890–1895, 2011.
[63] N. Kumar and S. Sonowal, “Email spam detection using machine learning algorithms,” in Proc. 2nd Int. Conf. Inventive Res. Comput. Appl. (ICIRCA), Jul. 2020, pp. 108–113.
[64] A. Singh and S. Batra, “Ensemble based spam detection in social IoT using probabilistic data structures,” Future Gener. Comput. Syst., vol. 81, pp. 359–371, Apr. 2018.
[65] N. Sattu, “A study of machine learning algorithms on email spam classification,” M.S. thesis, Dept. Appl. Comput. Sci., Graduate School Southeast Missouri State Univ., Cape Girardeau, MO, USA, 2020. [Online]. Available: https://www.proquest.com/openview/a165c1b42d9c959784792ae606b130a4/1?pq-origsite=gscholar&cbl=18750&diss=y
[66] H. Xu, W. Sun, and A. Javaid, “Efficient spam detection across online social networks,” in Proc. IEEE Int. Conf. Big Data Anal. (ICBDA), Mar. 2016, pp. 1–6.
[67] H. Faris, I. Aljarah, and J. Alqatawna, “Optimizing feedforward neural networks using Krill Herd algorithm for e-mail spam detection,” in Proc. IEEE Jordan Conf. Appl. Electr. Eng. Comput. Technol. (AEECT), Nov. 2015, pp. 1–5.
[68] A. H. Wang, “Detecting spam bots in online social networking sites: A machine learning approach,” in Proc. IFIP Annu. Conf. Data Appl. Secur. Privacy, Berlin, Germany: Springer, Jun. 2010, pp. 335–342.
[69] Z. Guo, Y. Shen, A. K. Bashir, M. Imran, N. Kumar, D. Zhang, and K. Yu, “Robust spammer detection using collaborative neural network in Internet-of-Things applications,” IEEE Internet Things J., vol. 8, no. 12, pp. 9549–9558, Jun. 2021.
[70] A. Makkar and N. Kumar, “An efficient deep learning-based scheme for web spam detection in IoT environment,” Future Gener. Comput. Syst., vol. 108, pp. 467–487, Jul. 2020.
[71] M. Zavvar, M. Rezaei, and S. Garavand, “Email spam detection using combination of particle swarm optimization and artificial neural network and support vector machine,” Int. J. Modern Educ. Comput. Sci., vol. 8, no. 7, pp. 68–74, Jul. 2016.
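
As an illustration of the Multinomial Naive Bayes classification described in Sections IX and X, the following is a minimal sketch in plain Python and not the implementation used in this study. It assumes whitespace tokenization as a stand-in for a count vectorizer and add-one (Laplace) smoothing, neither of which is specified in the paper. The names `tokenize`, `MultinomialNaiveBayes`, and `bayes_posterior` are hypothetical, and the four training messages are invented toy data rather than the 5171-entry dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (assumed stand-in for a count vectorizer)."""
    return text.lower().split()


def bayes_posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' theorem, Eq. (1): P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior_a * likelihood_b_given_a / evidence_b


class MultinomialNaiveBayes:
    """Bag-of-words Multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; 1.0 is an assumed default

    def fit(self, texts, labels):
        self.n_docs = len(labels)
        self.class_counts = Counter(labels)      # number of documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_words = {c: sum(wc.values()) for c, wc in self.word_counts.items()}
        return self

    def log_scores(self, text):
        """Unnormalized log-posterior per class: log P(c) + sum of log P(w|c)."""
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.class_counts:
            score = math.log(self.class_counts[c] / self.n_docs)
            for w in tokenize(text):
                if w not in self.vocab:
                    continue  # words unseen in training carry no evidence
                numerator = self.word_counts[c][w] + self.alpha
                denominator = self.total_words[c] + self.alpha * vocab_size
                score += math.log(numerator / denominator)
            scores[c] = score
        return scores

    def predict(self, text):
        """Return the class (tag) with the greatest score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy messages for illustration only.
train_texts = [
    "win cash prize now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch on monday",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = MultinomialNaiveBayes().fit(train_texts, train_labels)
print(model.predict("claim your free cash"))   # spam
print(model.predict("monday meeting agenda"))  # ham
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.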

No ratings yet
1822 b Deleted
38 pages
02 JCCE2202192 Online
No ratings yet
02 JCCE2202192 Online
5 pages
E-Mail Spam Detection Using Machine Learning KNN
No ratings yet
E-Mail Spam Detection Using Machine Learning KNN
5 pages
Writing Good Emails
From Everand
Writing Good Emails
IntroBooks Team
No ratings yet
TechCorner 15 - C-More Remote Access, Data Log, FTP File Transfer, and Email (Tutorial)
No ratings yet
TechCorner 15 - C-More Remote Access, Data Log, FTP File Transfer, and Email (Tutorial)
15 pages
Sales Productivity
No ratings yet
Sales Productivity
1,104 pages
Basic ICT Skills
No ratings yet
Basic ICT Skills
10 pages
Attack Secure Boot of SEP: Windknown@pangu
No ratings yet
Attack Secure Boot of SEP: Windknown@pangu
46 pages
CUET-UG - 2024 English/hindi
No ratings yet
CUET-UG - 2024 English/hindi
3 pages
Funciones 4.6 Write A Program To Prompt The User For Hours and Rate Per Hour Using Raw - Input To
No ratings yet
Funciones 4.6 Write A Program To Prompt The User For Hours and Rate Per Hour Using Raw - Input To
4 pages
User Guide
No ratings yet
User Guide
32 pages
Certificate of Completion: Electronic Record and Signature Disclosure
No ratings yet
Certificate of Completion: Electronic Record and Signature Disclosure
6 pages
jitu (1)
No ratings yet
jitu (1)
36 pages
(123doc) - Mon-Nghe-Tieng-Anh-2-En22-Ehou
No ratings yet
(123doc) - Mon-Nghe-Tieng-Anh-2-En22-Ehou
34 pages
How To Increase Withdrawal Limit On
No ratings yet
How To Increase Withdrawal Limit On
2 pages
CPE236 Diagram 5125
No ratings yet
CPE236 Diagram 5125
6 pages
Networking Lab 2
No ratings yet
Networking Lab 2
5 pages
Breaking The Chains of WFM Paradigms For Dummies
No ratings yet
Breaking The Chains of WFM Paradigms For Dummies
53 pages
Bigovpn Configuration Guide: 1. VPN Introduce
No ratings yet
Bigovpn Configuration Guide: 1. VPN Introduce
11 pages
CSA -M3 -Ktunotes.in
No ratings yet
CSA -M3 -Ktunotes.in
12 pages
Cloud Service Security & Application Vulnerability: April 2015
No ratings yet
Cloud Service Security & Application Vulnerability: April 2015
9 pages
Answers
No ratings yet
Answers
162 pages
Mentee Handbook 2023
No ratings yet
Mentee Handbook 2023
17 pages
Survey365 Information 2020
No ratings yet
Survey365 Information 2020
3 pages
DIRECTORY
No ratings yet
DIRECTORY
2 pages
Activity 123 Purcom
No ratings yet
Activity 123 Purcom
4 pages
Chapter 1 to 3 Yesefren
No ratings yet
Chapter 1 to 3 Yesefren
49 pages
SunPower Remittance Advice_ Payment Reference Number - 748940
No ratings yet
SunPower Remittance Advice_ Payment Reference Number - 748940
4 pages
Office Nic Email
No ratings yet
Office Nic Email
4 pages


