An Effective Detection Approach For Phishing URL Using ResMLP
India
4 Department of Information and Communication Engineering, Sunchon National University, Jeollanam-do, Suncheon 57922, South Korea
ABSTRACT Phishing websites, mimicking legitimate counterparts, pose significant threats by stealing
user information through deceptive Uniform Resource Locators (URLs). Traditional blacklists struggle
to identify dynamic URLs, necessitating advanced detection mechanisms. In this study, we propose an
effective approach utilizing residual pipelining for phishing URL detection. Our method extracts common
URL features and sentiments, employing a residual pipeline comprising convolutional and inverted residual
blocks. These resultant features are then fed into a Multi-Layer Perceptron (MLP) for classification.
We evaluate the efficacy of our approach against traditional algorithms using a Kaggle dataset. Our results
demonstrate superior accuracy, precision, F1 Score, and recall, showcasing its effectiveness in mitigating
phishing threats. Utilizing a residual pipeline made up of convolutional and inverted residual blocks, we start
our method by identifying similar URL features and sentiments. We also use domain age research to figure
out how long URLs have been around. Additionally, the lexical study of URL structure makes our method
more useful, resulting in impressive accuracy. With an accuracy of 98.29%, this research highlights the
importance of innovative techniques in combating evolving cyber threats. Future research directions could
focus on enhancing the model’s robustness against adversarial attacks and integrating real-time monitoring
for proactive defense strategies.
has increased more than twice as much as it did in 2022. December 2021 witnessed 316,747 attacks,
reported by the Anti-Phishing Working Group (APWG) [2], the highest count in its history. Bank-related
phishing assaults accounted for 23.2% of all phishing attacks in the fourth quarter of 2021, according
to OpSec Security, a founding member of APWG.
Anti-phishing is the technique of preventing phishing attacks in which attackers try to get sensitive
information through deception. Attackers' tactics and methods of targeting have advanced significantly
as common phishing techniques have become more transparent to the general public. Many businesses
have created anti-phishing systems [3] to reduce these hazards; however, these tools are not the last
layer of defense. Anti-phishing software is a platform or series of software services that can identify
malicious inbound messages that pose as authentic or try to gain trust through social engineering.
It also allows users to create whitelists and blacklists for message filtering and takes preventative
measures when necessary [4]. However, these are insufficient to battle phishing, since attackers make
use of one-time phishing URLs. Machine learning techniques are used to deal with this trick, relying
on an integrated classifier that examines the properties of sample URLs [5], [6] to make judgments
about new, emerging ones [21]. Likewise, deep learning-based methods [8] have been developed that can
classify data more accurately than traditional ML models.
This research work introduces a novel approach that leverages a residual pipelining methodology to
enhance the efficacy of traditional detection mechanisms. URL features, along with a few sentimental
features, are collected as part of feature extraction and transformed into a matrix. This matrix is
then fed into the residual pipeline module, which consists of convolution layers and inverted residual
layers. After that, the obtained result is fed into an output block where the actual classification
of the URL takes place. In response to the escalating sophistication of phishing attacks, we provide
an effective solution using a hybrid feature set. This collection includes different hyperlink
information and URL character sequence characteristics, culminating in the creation of the feature
vectors necessary for training our anti-phishing model. Ultimately, rigorous testing reveals that the
accuracy reaches up to 98.295%, which outperforms the conventional methods.
Our anti-phishing solution is designed to fulfill the stringent requirements essential for robust
detection and prevention of phishing attacks. It prioritizes high detection efficiency, real-time
detection capabilities, target independence, and third-party independence. By minimizing false
positives and maximizing true positives, our method ensures timely prediction of phishing attempts
while maintaining adaptability to emerging threats without reliance on external services.
Key contributions of our research include:
• The proposal of a phishing detection approach that integrates residual pipelining methodology,
offering enhanced performance and resilience against evolving cyber threats.
• Additionally, we introduce a comprehensive feature set comprising novel and existing features,
boosting the effectiveness of our detection mechanism.
• Through extensive experimentation and evaluation, we validate the precision and accuracy of our
method in identifying legitimate websites while minimizing false positive rates.
To present our research findings logically, the subsequent sections are organized as follows. The
section on Literature Review thoroughly examines relevant publications that form the basis of our
research and offer insights into current phishing detection techniques and approaches. The methodology
section provides comprehensive details about the technical principles and theoretical foundations of
our suggested methodology, which are crucial for placing our research in context. The dataset that we
used for our experiments is presented in the section Dataset and Experimental Results, along with a
detailed analysis of the findings of our research. In the result section, we also provide in-depth
analyses of the performance of our proposed approach across various evaluation metrics. Concisely
summarising our results, the conclusion section explores future directions for phishing detection
research and development.
II. RELATED WORKS
Phishing has long been one of the most popular cyber-attack strategies used by bad actors. The
problems posed by phishing websites have been addressed by numerous studies. Many techniques for
detecting phishing websites have been proposed, including blacklist-based and heuristic-based
techniques. The statistics from the training dataset have a substantial impact on the weights in the
heuristic-based approach. Blacklists [9], which are datasets of malicious URLs, are still used by
several internet companies. However, a blacklist cannot forecast outcomes for a new URL that has not
yet been added to the list, and attackers are increasingly using one-time URLs to carry out attacks.
To address this issue, several approaches have been developed.
Xiao et al. [10] developed CNN-MHSA, a combination of a Convolutional Neural Network (CNN) and
multi-head self-attention mechanisms, to detect phishing websites. In this method, feature extraction
and weight calculation are performed independently by duplicating the input matrix into two. The
self-attention mechanism then aids in identifying whether websites are malicious or benign. This
method exhibits strong performance in differentiating phishing websites from authentic ones by
utilizing the CNN's capability for spatial feature learning and self-attention for capturing
long-range relationships. The improved efficiency and interpretability of CNN-MHSA can be seen in its
ability to decouple the weight calculation procedure from feature extraction. It is crucial to
recognize that, even with the encouraging results, the complex neural network architecture may need a
significant amount of computing power for both
79368 VOLUME 12, 2024
S. Remya et al.: Effective Detection Approach for Phishing URL Using ResMLP
training and inference. The efficacy of the model can also be impacted by factors such as the variety
and quality of the training data as well as the dynamic tactics employed by phishing opponents.
For precise phishing detection, Weiping Wang et al. [11] established a method based on Recurrent
Convolutional Neural Networks (PDRCNN). A two-dimensional tensor representation generated by the
PDRCNN is given as an input for classification purposes. By utilizing these attributes, the model can
identify temporal and spatial patterns of URL data, which improves the identification of phishing
efforts. The need for large labeled datasets, high computational resource requirements, and
vulnerability to overfitting are some of the drawbacks of PDRCNN, despite its advantages such as its
ability to handle sequential data and adapt to various URL formats. PDRCNN cannot be successfully used
in actual cybersecurity applications until these problems are fixed.
The Phishing Hybrid Feature-Based Classifier (PHFBC), designed by Zuhair et al. [12], combines
recursive feature subset selection with ML approaches to produce a comprehensive phishing detection
system. With a set of features gathered from phishing and legitimate websites, their objective was to
accurately classify phishing. PHFBC incorporates decision tree and Naive Bayes models using a
statistical measure known as the Phish Ratio. Though it is an innovative technique, it has limitations
such as being susceptible to feature replication, requiring laborious and error-prone manual feature
extraction, and having trouble selecting the best features for different phishing scenarios.
Furthermore, parameters like representativeness in response to changing phishing strategies could
have an impact on how successful PHFBC is. Resolving these issues would improve PHFBC's resilience and
suitability for use in actual phishing detection situations.
Ramesh et al. [13] provided a technique for identifying phishing webpages and their target domains
through simulation analysis. Utilizing row and column sums, the technique determines target linkages,
producing a parasitic matrix that depicts the connection between two sites. However, scalability
problems with this method could appear when handling huge datasets or intricate website architecture.
Additionally, how well the technique works may depend on the accuracy of the presumed correlations
discovered and the consistency of the row and column total computations. The human-generated parasitic
matrix may also produce biased or erroneous results, and the system may not be able to adapt to
evolving phishing or website design trends.
Cova et al. [14] intended to comprehend the basic design and application of phishing kits to determine
the methods of deception used by hackers to hide backdoors they had installed and to educate
interested parties about the techniques that phishers normally use to send phished data. Although
their research offers insightful information about the strategies and methods used in phishing
attempts, there might be some restrictions on it. For example, the availability and variety of
phishing kits evaluated, as well as the timeliness and accuracy of the data acquired, could limit the
efficacy. Furthermore, the study generalizes the findings to more extensive phishing threats and
attack scenarios. Apart from that, it might be difficult to analyze the dynamic nature of phishing
attempts and the quick development of phishing strategies. Notwithstanding these possible drawbacks,
this investigation adds a great deal to our knowledge of how phishing kits are used and helps develop
stronger defences against phishing attacks.
The novel technique ‘‘Antiphishing through Phishing Target Discovery,’’ by Liu et al. [15], aims to
detect possible phishing websites through an analysis of their parasitic community structure.
Identifying the principal phishing target webpage and exposing ‘‘parasitic’’ connections are the goals
of this method, which collects webpages that are either directly or indirectly linked to a certain
suspicious webpage. However, there is a chance that this approach may fail, especially in cases when
the parasitic community structure is dynamic or complex. Furthermore, the quality and completeness of
the web link data used for analysis, as well as the variety of phishing strategies and techniques
used by the attackers, could have an impact on the accuracy of the methodology.
CANTINA is a content-based method developed by Zhang et al. [16] that analyses character scores and
extracts keywords from website texts to identify phishing websites. Notwithstanding its inventive
methodology, CANTINA can encounter various constraints. For example, the use of TF-IDF analysis and
character scores alone may miss less obvious signs of phishing activity, like visual cues or
contextual components. In addition, the model's efficacy might be restricted by the level of accuracy
and significance of the selected keywords, in addition to the possibility of false positives or false
negatives in the Google search results. Furthermore, using Google search rankings as a metric for
legitimacy could lead to bias and inaccurate results, especially when phishing websites alter search
engine results or genuine websites are not highly ranked.
To create a classification model, Marchal et al. [17] used a feature extraction technique, extracting
212 features and applying Gradient Boosting. Although this methodology is a thorough attempt to
capture several aspects of phishing websites, it might run into issues with feature selection and
model complexity. The number of features that are extracted may cause problems like overfitting,
particularly if some of the features are irrelevant to the purpose of phishing detection. Furthermore,
the process of manually extracting features can be time-consuming and may eliminate important details
from phishing websites, which could reduce the ability of the model to identify new and developing
phishing techniques. Furthermore, considering large-scale deployment scenarios when computational
resources are limited, the selection of Gradient Boosting as the classification algorithm
may provide interpretability and scalability issues for the model.
Self-structuring neural networks are the basis of an inventive technique proposed by Mohammad et al.
[18] for comprehending phishing websites. For all its advantages, such as a self-organized neural
network and a high level of noise tolerance, this method may have a few disadvantages. Neural network
complexity can be a disadvantage since it can lead to problems with the interpretability of the models
and processing performance. The effectiveness may also be influenced by the training data, given the
dynamic and ever-changing nature of phishing attacks. Furthermore, the diverse and representative
nature of the training dataset, along with the neural network's capacity to generalize across various
phishing scenarios and attack vectors, could have an impact on the model's performance. Addressing
these limitations will be crucial for ensuring the practical utility and effectiveness of the proposed
approach.
Nguyen et al. [19] developed a single-layer neural network, which computes heuristics and generates
weights using the network. This is an effective technique for phishing detection. The single-layer
architecture of this approach makes it simple and computationally efficient, but it might have
drawbacks. For example, the single-layer neural network's limited ability to identify complex patterns
and relationships in the data may limit the efficacy. Additionally, in situations when the underlying
data distribution is extremely complex or unpredictable, relying solely on heuristics for feature
extraction and weight computation may result in less-than-ideal performance. Additionally,
single-layer neural networks' lack of depth and limited representational capacity may limit the
capacity to generalize, which could impair their effectiveness in phishing cases that have yet to be
discovered or developed.
Zhang and Li [20] introduced the Borderline-SMOTE Deep Belief Network (DBN). Improved detection
accuracy and model robustness are two potential advantages of this approach, but it may also have some
disadvantages. An example of this would be the representativeness and quality of the training data,
which could affect the effectiveness, particularly considering the challenges that imbalanced datasets
in phishing detection tasks may pose. Moreover, the computational complexity of training deep neural
networks like DBN and the additional overhead caused by oversampling techniques like Borderline-SMOTE
could threaten the scalability and efficiency of the model, particularly in real-time or
resource-constrained contexts. Addressing these limitations will be essential for ensuring the
practical viability and effectiveness of the proposed phishing detection method in real-world
cybersecurity applications.
Leveraging online learning with n-grams as a technique for phishing website identification was
proposed by Verma and Das [21]. They divide URLs into n-grams to detect phishing websites. Even though
this strategy has advantages, like its adaptability to new phishing strategies and its capacity to
manage streaming data, it may also have disadvantages. Other variables could impact the outcome,
including the choice of n-grams, the level of information in the feature representation, and the
effectiveness of online learning algorithms in processing massive amounts of data quickly. The method
may also not work as well or last as long if it relies too much on online training, which can lead to
problems with model drift and concept drift over time.
Using deep learning-based multidimensional features, Yang et al. [22] suggested an approach to find
fake websites. Because they use deep learning to identify features related to character sequences
from URLs, their method makes fast classification possible. Dimensionality reduction, pattern
recognition, and one-hot encoding and embedding of URLs are used in this method to try to find
complicated patterns that point to phishing activities.
For finding fake websites, Sun et al. [23] proposed a new method using graph neural networks. Unlike
traditional feature-based systems, the one created by Sun et al. does an excellent job of capturing
the complex relationships between URLs. The network architecture is examined extensively, and subtle
patterns linked to phishing are found using a graph neural network. Utilizing the structural
information found in URLs and the intricate links between them, their method greatly enhances the
accuracy of detection. There will be significant advantages over current feature engineering methods
if this new method can regularly and accurately spot complex phishing attempts.
In order to deal with the problem of not having adequate information, Chen et al. [24] suggested a
deep transfer learning system optimized for finding phishing emails. Their method works well with new
datasets that don't have a lot of labeled data because it uses models that have already been trained
and transfers knowledge from high-quality datasets. To get high recognition accuracy with minimal
training data, this method works well for generalizing models, which is helpful when it's hard to get
cases that have been labeled. Their method combines domain knowledge with transferable information.
Therefore, it will be possible to make detection systems that are more flexible and reliable.
Asiri et al. [25] came up with a new way to use deep reinforcement learning to find hacking attempts
in real time so that security can be proactive against attempts that change all the time. Because it
is always changing and learning from how people use URLs, their system gets better at finding things
over time. A flexible learning method is used to make the model quickly respond to new phishing
threats as they appear. This is done by using real-world data about how users interact with the
system. Their system transforms into a strong defense that can find and stop phishing attempts very
well through a routine of observation, action, and reward. This approach can open a new era of
flexible cybersecurity solutions. These solutions provide high-level security and can adapt quickly
to changing cyber threat situations.
These strategies may have problems despite their benefits, such as being able to naturally learn
hierarchical representations from raw data and capturing complex correlations between attributes. The
efficacy of the model may be affected, for instance, by the quantity of labeled training data as well
as the computational resources needed to train deep learning models. Practical application in
real-world settings may also be hampered by the interpretability of the learned representations and
the approach's scalability to handle large-scale datasets. It is important to tackle these constraints
to guarantee the dependability and effectiveness of the proposed phishing detection technique in
practical cybersecurity implementations. The summary of the existing state of the art is shown in
Table 1.
III. METHODOLOGY
With the increasing threat of phishing attacks, our research aims to create a strong method for
spotting phishing URLs. Central to our approach is the integration of MLP within a residual pipelining
framework. This innovative amalgamation of methodologies aims to capitalize on the strengths of each
approach, thereby enhancing the efficacy and accuracy of phishing website detection.
A. SYSTEM ARCHITECTURE
The overall architecture of the proposed approach, depicted in Figure 1, is divided into four phases.
Features are extracted in Phase 1. Phase 2 performs feature vectorization to create a unique feature
vector for every webpage. Phase 3 performs the machine learning step. Phase 4 determines whether the
provided webpage is phishing or not.
1) FEATURE EXTRACTION
An integral part of our methodology is the feature extraction procedure from URLs, which is a critical
step in the detection pipeline. We carefully choose and extract emotive attributes from 25 different
URLs to use as input features in our detection model. A few instances of the numerous variables
covered by these characteristics are the length of the URL, the host, the directory, the TLD, and the
number of special characters like @, -, ., =, and ?. Moreover, we recognize that affective dimensions
are important in determining the legitimacy of URLs and account for them by considering variables such
as domain age, domain registration duration, and Google indexing status. Table 2 displays the features
that
we have carefully chosen to capture the nuances of patterns and characteristics present in phishing
URLs. As a result, our detection system gained a thorough understanding of accurate classification.
Our detection model, which aims to reliably and precisely distinguish between phishing and authentic
URLs, is trained using the extracted attributes as its basis. Making use of the wide range of
parameters readily accessible, our model applies sophisticated machine-learning techniques to identify
minute details and irregularities indicative of phishing activities. Furthermore, by using deep
learning techniques, our model is better equipped to detect malicious URLs since it can find intricate
patterns and relationships in the data. Our research attempts to clear the path for more powerful and
efficient defenses against the ubiquitous threat of phishing assaults in the digital realm by using
this integrated and meticulously developed strategy. Figure 2 displays the model overview.
The feature extractor is designed to include the contextual sentiment score for each URL by using the
GloVe and Natural Language Toolkit (NLTK) tools. Each URL in the dataset is preprocessed and
tokenized. The preprocessing includes the removal of stop words, trimming, etc. Each token is then
passed on to the scikit-learn text vectorizers to get a sentiment score. It is not possible to feed
raw data, a sequence of symbols, directly into most learning algorithms, because most of them expect
numerical feature vectors of a fixed size rather than unstructured text documents of varying length.
To handle this, scikit-learn offers tools for the most popular methods of extracting numerical
features from text, such as:
• tokenizing: The strings are tokenized, and each potential token is assigned an integer id. The token
separators may include whitespace and punctuation marks.
• counting: The number of times each token appears in a document is recorded.
• normalizing: Tokens are weighted and normalised, with diminishing importance given to tokens that
appear in most samples.
Here are the definitions for features and samples: The frequency with which each unique token appears
(normalised or not) is considered a feature. For a particular document, the vector containing all of
those token frequencies is regarded as a multivariate sample.
B. FEATURE VECTORIZATION
Vectorization is the process of converting a set of text documents into numerical feature vectors. In
this process,
a matrix can be used to represent a corpus of documents, where each row represents a document and each
column represents a token. Tokenization, counting, and normalisation are combined into the ‘‘Bag of
Words’’ or ‘‘Bag of n-grams’’ representation. Word occurrences are used to characterise the documents,
with no consideration given to the terms' relative positions within the text.
Some terms will be prevalent in a massive text corpus; hence, they carry relatively little significant
information about the document's real contents. Usually, one applies the TF-IDF transform to re-weight
the count features into floating point values suitable for a classifier. Here tf denotes term
frequency, while tf-idf denotes term frequency multiplied by inverse document frequency.
For example, TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False) might be
used with its default parameters. The term frequency is defined as the number of times a word appears
in a particular document. It is multiplied by the idf component, which is calculated as:
$$\mathrm{idf}(x) = \log\frac{1 + n}{1 + \mathrm{df}(x)} + 1 \tag{1}$$
In the document set, df(x) is the number of documents that include word x, and n is the total number
of documents in the document set. Subsequently, the Euclidean norm is used to normalize the resulting
TF-IDF vectors.
$$u_{\mathrm{norm}} = \frac{u}{\lVert u \rVert_2} = \frac{u}{\sqrt{u_1^2 + u_2^2 + \cdots + u_n^2}} \tag{2}$$
The term weighting method was initially created for information retrieval and is useful for grouping
and classifying documents. The calculation of TF-IDF is shown in the next section.
$$\mathrm{idf}(x) = \log\frac{n}{1 + \mathrm{df}(x)} \tag{3}$$
In positional feature extraction, the TF-IDF transformer obtains each weight from the token position
in the GloVe dataset, which is defined by the NLTK toolkit and assigned by the scikit-learn
transformers. Normalization is needed because the matrix acquired during feature extraction contains
floating point values. The min-max normalization operation rescales a set of data. The original set's
smallest value is mapped to 0, and the largest value in the original set is mapped to 1. Every other
value is assigned a value between these two bounds. The lower bound is denoted by min(y) and the upper
bound is denoted by max(y). The normalized value (y') can be represented as:
$$y' = \frac{y - \min(y)}{\max(y) - \min(y)} \tag{4}$$
Following normalization, the entire dataset is divided into training and test sets in an 80:20 ratio.
C. PSEUDOCODE
The goal of the proposed Phishing URL Detection Algorithm, presented in Algorithm 1, is to provide a
dependable model for phishing URL detection. To guarantee data quality, the method begins with
preprocessing a dataset D made up of URLs to eliminate null values and duplicates. It then utilizes a
feature extraction technique to identify characteristics indicative of phishing activity from every
URL in D. Afterwards, a list L to hold the features is created. The software appends to L the features
it computes for every URL u in D. After processing each URL, the technique leverages the
characteristics gathered and stored in L to train a machine learning model M. The output is this
trained model M, which can identify URLs as phishing or authentic based on the attributes that have
been extracted. The algorithm provides a systematic framework for building a phishing detection model,
leveraging machine learning techniques to enhance cybersecurity measures against fraudulent online
activities.
Algorithm 1 Phishing URL Detection Algorithm
Require: URL dataset D
Ensure: Phishing detection model M
1: Preprocess dataset D to remove duplicates and null values
2: Extract features from URLs in D using feature extraction algorithm
3: Initialize empty list L
4: for each URL u in D do
5:     Calculate features for URL u
6:     Append features to L
7: end for
8: Train machine learning model M using features in L
9: return Trained model M
IV. RESIDUAL PIPELINE
The residual pipeline enhances the overall system performance and is a crucial component of the entire
architecture. Residual blocks were introduced to address the issue of the vanishing gradient. The
technique primarily used here is the skip connection, which connects layer activations to subsequent
layers by skipping portions of the intermediate layers. Regularisation will bypass any layer that
reduces architecture performance, which is an advantage of using this type of skip link.
Figure 3 shows the overview of the residual pipeline block. The input to the residual pipeline
includes 27 URL properties, 64 filters, and two classes. It consists of convolutional blocks and seven
inverted residual blocks which execute asynchronously. The convolutional block consists of a 3 × 3
convolution layer followed by a batch normalization layer, where the batch size chosen is 32. The ReLU
activation function turns the provided input to the necessary output with the specified range. The
output matrix from the convolutional block is fed into the inverted residual blocks, which conduct
different operations including convolution, separable convolution, batch normalization and activation.
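As an illustration of the weighting and normalization steps in Equations (1), (2), and (4), the following is a minimal pure-Python sketch. It is not the authors' implementation: the tokenized URLs in `docs` are hypothetical, and the helper names `smooth_idf`, `l2_normalize`, `min_max`, and `tfidf_rows` are chosen here for illustration only.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """Equation (1): idf(x) = log((1 + n) / (1 + df(x))) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


def l2_normalize(vec):
    """Equation (2): divide a vector by its Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def min_max(values):
    """Equation (4): rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def tfidf_rows(token_docs):
    """Build one L2-normalized TF-IDF row per tokenized document."""
    vocab = sorted({t for doc in token_docs for t in doc})
    n = len(token_docs)
    # df(x): number of documents that contain token x.
    df = {t: sum(1 for doc in token_docs if t in doc) for t in vocab}
    rows = []
    for doc in token_docs:
        # Term frequency (raw count) multiplied by the idf component.
        row = [doc.count(t) * smooth_idf(n, df[t]) for t in vocab]
        rows.append(l2_normalize(row))
    return vocab, rows


# Hypothetical tokenized URLs, for illustration only.
docs = [["login", "secure", "bank"], ["bank", "home"], ["login", "login", "verify"]]
vocab, rows = tfidf_rows(docs)
```

Each row of `rows` corresponds to one document and each column to one token in `vocab`, matching the matrix layout described in the feature vectorization section; `min_max` can then be applied column by column before the 80:20 split.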
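The skip connection used by the residual blocks can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's architecture: a small dense transform (`linear`) stands in for the convolution, separable convolution, and batch normalization inside a block, and all names and values below are hypothetical.

```python
def relu(vec):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in vec]


def linear(vec, weights, bias):
    """Dense transform standing in for the block's convolutional layers (assumption)."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def residual_block(x, weights, bias):
    """Skip connection: the input x is added back to the transformed branch F(x)."""
    fx = relu(linear(x, weights, bias))
    return relu([xi + fi for xi, fi in zip(x, fx)])


# With all-zero weights the learned branch contributes nothing,
# so the block passes its (non-negative) input through unchanged.
x = [0.5, 1.0, 2.0]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3
identity_out = residual_block(x, zero_w, zero_b)
```

The zero-weight case shows why skip connections help against vanishing gradients: even when the transformed branch is uninformative, the input signal still reaches the next layer through the identity path.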
V. OUTPUT BLOCK
After obtaining the result from the residual block, the output block comes into action. Figure 4 depicts the output block structure. Max pooling is performed first; it takes the maximum value in each window to gradually shrink the spatial size of the representation. The pooled matrix is then flattened into a single column. After flattening, a neural network processes the resulting large input vector. The dense layer, which is fully connected to the layer before it, changes the output’s dimension. Typically, a dropout layer is used after a dense layer. Finally, an activation is applied; the softmax activation function is used here because it always returns a value between 0 and 1. As a result, when the weighted sum of the inputs is very small or negative it is mapped close to 0.0, and when it is very large it is mapped close to 1.0. The result from the output block is a floating-point value. A threshold of 0.5 is set: values below the threshold are assigned to the lower class (0), and all others to the higher class (1). Class 0 represents benign URLs and class 1 represents phishing URLs. A sample output is shown in Table 3.
FIGURE 4. Training and detection phases-output block.
VI. EXPERIMENTAL RESULTS AND DISCUSSIONS
A. EXPERIMENTAL DATA
The underlying experimental data is taken from the Kaggle dataset [26]. URLs and their types are included in the data collection. Type indicates if a URL is phishing or safe. The dataset includes 651,191 URLs and their types; of these, 32,520 are malware URLs, 94,111 are phishing URLs, 96,457 are defacement URLs, and 428,103 are benign URLs. From this, only the benign and phishing URLs are selected for the experiment, giving 522,214 URL samples, of which 94,111 are phishing and 428,103 are benign. The sample dataset is shown in Table 4.
Batch size denotes the maximum number of URLs that our model can handle concurrently, while ‘‘epoch’’ denotes the number of complete passes over the training set. In this study, the number of epochs is set to 50. After each epoch, the loss is monitored; if the same error occurs for all fifty iterations, execution is stopped, which indicates that the system is not correctly configured. Accuracy is monitored throughout the epochs, and whenever the best accuracy is observed it is saved and the model is trained using this saved
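The epoch bookkeeping described in this subsection can be sketched as follows. This is only an illustrative sketch: the function name `track_epochs`, the list of per-epoch `(loss, accuracy)` pairs, and the demo values are all hypothetical, and the point at which a real training loop would write the model weights to disk is marked only by a comment. Reading "the same error occurs for all fifty iterations" as "the loss never changes over all epochs" is an assumption.

```python
def track_epochs(history, max_epochs=50):
    """Scan per-epoch (loss, accuracy) pairs and report the best checkpoint.

    Returns (best_epoch, best_accuracy, stalled). `stalled` is True only if
    all `max_epochs` epochs ran and the loss was identical in every one,
    which is taken here as the sign of a misconfigured system.
    """
    best_epoch, best_accuracy = None, float("-inf")
    losses = []
    for epoch, (loss, accuracy) in enumerate(history[:max_epochs], start=1):
        losses.append(loss)
        if accuracy > best_accuracy:
            # A real training loop would save the model weights here.
            best_epoch, best_accuracy = epoch, accuracy
    stalled = len(losses) == max_epochs and len(set(losses)) == 1
    return best_epoch, best_accuracy, stalled


# Hypothetical metrics for illustration only (not results from this paper).
demo_history = [(0.60, 0.81), (0.42, 0.90), (0.35, 0.93), (0.36, 0.92)]
print(track_epochs(demo_history, max_epochs=4))  # (3, 0.93, False)
```

Keeping only the best-accuracy checkpoint means a late epoch with worse accuracy does not overwrite an earlier, better model.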
B. IMPLEMENTATION OF DOMAIN AGE ANALYSIS
Adding domain age analysis to our proposed approach is a critical step in improving the robustness and efficiency of our detection model. This feature is implemented by first gathering WHOIS information for every URL in our collection, and then determining the age of the domain that hosts the URL.
For each URL in our dataset, we use WHOIS data to obtain detailed information on domain registration. Access to this data tells us about the age and legitimacy of the domains associated with the URLs. Using the WHOIS information, we can find the exact date on which the domain for each URL was created. This creation date is taken as the starting point for determining the domain’s age. Next, we compute how old the domain is in days, months, or years by subtracting the date when the domain was created from the current date. The level of detail needed for our study dictates how the interval between the two dates is expressed.
Adding domain age analysis to our detection process improved the detection accuracy of our model. Our model can differentiate between real and malicious URLs by looking at the age of the domain that hosts them. The analysis of domain age has had a significant impact on our results, such as:
• Addition of domain age analysis has led to a big drop in false results. Knowing the difference between real websites and phishing URLs has made our model more accurate and lowered the number of false positives.
• Our method finds and avoids future computer threats by looking at domain ages. The range of threats we can detect has improved. To help stop phishing attempts
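The age computation described in this subsection can be sketched as below. This is a minimal sketch that assumes the WHOIS creation date has already been retrieved and parsed into a `datetime.date`; the WHOIS lookup itself is not shown, and the function name `domain_age` and the example dates are hypothetical.

```python
from datetime import date


def domain_age(created, today):
    """Return the domain age as (days, whole_months, whole_years).

    The age is `today - created`, i.e. the creation date is subtracted
    from the current date.
    """
    if created > today:
        raise ValueError("creation date lies in the future")
    days = (today - created).days
    months = (today.year - created.year) * 12 + (today.month - created.month)
    if today.day < created.day:
        months -= 1  # the current month has not been completed yet
    return days, months, months // 12


# Hypothetical creation dates for illustration only.
print(domain_age(date(2023, 11, 20), date(2024, 1, 15)))  # (56, 1, 0)
print(domain_age(date(2010, 3, 1), date(2024, 3, 1)))     # (5114, 168, 14)
```

A very young domain, such as the 56-day-old one above, would then be available to the classifier as one more numeric feature alongside the URL-based ones.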
TABLE 7. Comparison of phishing URL detection models.
Different textual content features are representations of text data used as input for machine learning algorithms, each capturing distinct aspects of the text. TF-IDF at the word level assesses the significance of individual words in a document relative to a collection, while TF-IDF at the N-gram level considers sequences of words or characters. Character patterns are analysed using TF-IDF character level representation, which is helpful for languages with complex morphology. By counting the instances of words in documents, count vectors offer efficiency and simplicity in situations where word frequency is crucial. Because word sequence vectors maintain word order when encoding word sequences, they are essential for applications like text generation [39]. Character sequence vectors are useful for analyzing complex writing systems and identifying misspelled words since character sequences are encoded to represent text. The best representation strategy must be found through experimentation because it depends on several variables, including properties, task complexity, and algorithm requirements.
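To make the character-level TF-IDF representation concrete, the sketch below computes it with the standard library only. It is a hedged illustration rather than the feature extractor used in the experiments: it uses the unsmoothed textbook weighting (term frequency times ln(N / document frequency)), which differs from the smoothed variants found in common libraries, and the function names and the two example URLs are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf(documents, n=3):
    """Return one {ngram: weight} dict per document.

    tf  = occurrences of the n-gram / number of n-grams in the document
    idf = ln(number of documents / number of documents containing the n-gram)
    """
    counts = [Counter(char_ngrams(doc, n)) for doc in documents]
    # Iterating a Counter yields each distinct n-gram once per document.
    doc_freq = Counter(gram for c in counts for gram in c)
    total_docs = len(documents)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({gram: (k / total) * math.log(total_docs / doc_freq[gram])
                        for gram, k in c.items()})
    return vectors


# Hypothetical URLs: the second swaps the letter "l" for the digit "1".
urls = ["paypal.com/login", "paypa1.com/login"]
vectors = tfidf(urls)
print(sorted(g for g, w in vectors[0].items() if w > 0))  # ['al.', 'l.c', 'pal']
```

N-grams shared by every document receive zero weight, so only the trigrams touching the altered character stand out, which is the kind of misspelling signal character-level features are meant to capture.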
From Table 6, it can be observed that the performance varies depending on the classifier and the type of textual content features used. For instance, MLP consistently achieves high precision, recall, F1 Score, AUC, and accuracy across different types of textual content features, indicating its robustness and effectiveness in capturing complex patterns in the data. When it comes to more complex textual content features, such as word sequence vectors and character sequence vectors, alternative classifiers do better than Naive Bayes. The text categorization results show how the behaviour of each classifier depends on the textual content features it is given. Figure 7 offers a visual representation of the comparisons for each group.
FIGURE 7. Comparison of precision, recall, F1 score, and accuracy for various classifiers across different feature representations. Each subplot represents the performance metrics for a specific feature representation category, including TF-IDF word level, TF-IDF N-gram level, TF-IDF character level, count vectors, word sequences vectors, and character sequences vectors.
D. LEXICAL ANALYSIS OF URL STRUCTURE
Our proposed approach to finding phishing URLs involves thoroughly studying the lexical structure of the URL and looking for subtle cues that could indicate a phishing attempt. In this part, we look for patterns in the syntax and meaning of the URL’s domain name, path, and query parameters.
Checking the domain name for misspellings or ambiguities helps us spot scam attempts. Phishing websites often use domain names that closely resemble those of real domains. For example, attackers may change the wording or individual characters slightly. Our approach is designed to detect these unusual occurrences and identify potentially hazardous URLs [40].
The query parameters and URL path are the first places we search for signs of phishing attempts. Phishing URLs may use a convoluted path structure or a large number of query parameters to hide their true intent. Our model can analyze user input and detect abnormal activity, such as potential phishing attempts.
Our analysis of URLs is comprehensive, covering both syntactic and semantic aspects. Examining the context and meaning of the URL sections can reveal semantic inconsistencies or conflicts. Malicious URLs may use domain names that don’t relate to the content of the webpage or have an unusual combination of path segments and query parameters.
We incorporate lexical analysis of the URL structure into our detection method to enhance our model’s understanding of URL properties and their security ramifications. This enhanced analysis allows our model to detect phishing
attempts even when the only signals present are minute ones that would be missed by previous detection techniques.
The method we propose can consistently differentiate between legitimate and malicious URLs by analyzing them component by component. Our approach is designed to identify suspicious patterns and highlight them, enhancing the accuracy of phishing attempt detection while lowering false positives and negatives. We enhanced the detection system’s ability to handle newly emerging cyber threats by incorporating lexical analysis into our model. Our technology is designed to handle even advanced phishing techniques, thanks to its continuous learning and discovery of new patterns that may indicate malicious activity. We can conduct experiments with and without this feature to evaluate performance and examine the results obtained using lexical analysis of URL structure. A comparison of performance is presented in Table 7.
VII. CONCLUSION AND FUTURE WORK
Phishing website assaults are a serious and growing risk to Internet users, as seen by the rise in incidents in recent times. Daily and hourly, a multitude of users inadvertently engage with phishing URLs, perpetuating the risk of cyber exploitation. Attackers favor phishing because it exploits human vulnerabilities, taking advantage of the innate trust users place in seemingly authentic links, and evades conventional security measures. Although extensive research endeavors have been undertaken to counter these threats, achieving optimal detection accuracy remains an ongoing pursuit.
This research work aims to discern and categorize URLs into either phishing or benign classes. Evaluation metrics encompassing Accuracy, Precision, Recall, and F1 Score underscore the superior performance of the proposed system. Looking ahead, future endeavors may explore the expansion of this work into a multi-class classification framework. Meanwhile, efforts to optimize the residual pipeline, which currently comprises seven inverted residual blocks, will focus on streamlining and reducing the complexity of this architectural component.
REFERENCES
[1] A. Van der Merwe, M. Loock, and M. Dabrowski, ‘‘Characteristics and responsibilities involved in a phishing attack,’’ in Proc. Winter Int. Symp. Inf. Commun. Technol., 2005, pp. 249–254.
[2] (4th Quart., 2021). APWG Phishing Activity Trends Report. [Online]. Available: https://docs.apwg.org/reports/apwg/_trends_report_q4_2021.pdf
[3] B. Liang, M. Su, W. You, W. Shi, and G. Yang, ‘‘Cracking classifiers for evasion: A case study on the Google’s phishing pages filter,’’ in Proc. Int. Conf. World Wide Web (WWW), Montreal, QC, Canada, 2016, pp. 345–356.
[4] Q. Cui, G. V. Jourdan, G. V. Bochmann, R. Couturier, and I. V. Onut, ‘‘Tracking phishing attacks over time,’’ in Proc. 26th Int. Conf. World Wide Web (WWW), Perth, WA, Australia, 2017, pp. 667–676.
[5] H. Y. Abutair and A. Belghith, ‘‘Using case-based reasoning for phishing detection,’’ Proc. Comput. Sci., vol. 109, pp. 281–288, Jan. 2017.
[6] M. Al-Janabi, E. D. Quincey, and P. Andras, ‘‘Using supervised machine learning algorithms to detect suspicious URLs in online social networks,’’ in Proc. IEEE/ACM Int. Conf. Adv. Social Netw. Anal. Mining, Sydney, NSW, Australia, Jul. 2017, pp. 1104–1111.
[7] A. Blum, B. Wardman, T. Solorio, and G. Warner, ‘‘Lexical feature based phishing URL detection using online learning,’’ in Proc. 3rd ACM Workshop Artif. Intell. Secur., Chicago, IL, USA, Oct. 2010, pp. 54–60.
[8] A. C. Bahnsen, E. C. Bohorquez, S. Villegas, J. Vargas, and F. A. González, ‘‘Classifying phishing URLs using recurrent neural networks,’’ in Proc. APWG Symp. Electron. Crime Res. (eCrime), Scottsdale, AZ, USA, Apr. 2017, pp. 1–8.
[9] R. Aravindhan, R. Shanmugalakshmi, K. Ramya, and C. Selvan, ‘‘Certain investigation on web application security: Phishing detection and phishing target discovery,’’ in Proc. 3rd Int. Conf. Adv. Comput. Commun. Syst. (ICACCS), Coimbatore, India, Jan. 2016, pp. 1–10.
[10] X. Xiao, D. Zhang, G. Hu, Y. Jiang, and S. Xia, ‘‘CNN–MHSA: A convolutional neural network and multi-head self-attention combined approach for detecting phishing websites,’’ Neural Netw., vol. 125, pp. 303–312, May 2020.
[11] W. Wang, F. Zhang, X. Luo, and S. Zhang, ‘‘PDRCNN: Precise phishing detection with recurrent convolutional neural networks,’’ Secur. Commun. Netw., vol. 2019, Oct. 2019, Art. no. 2595794, doi: 10.1155/2019/2595794.
[12] H. Zuhair and A. Selamat, ‘‘Phishing hybrid feature-based classifier by using recursive features subset selection and machine learning algorithms,’’ in Proc. 3rd Int. Conf. Reliable Inf. Commun. Technol. (IRICT). Springer, 2018, pp. 267–277.
[13] G. Ramesh, J. Gupta, and P. G. Gamya, ‘‘Identification of phishing webpages and its target domains by analyzing the feign relationship,’’ J. Inf. Secur. Appl., vol. 35, pp. 75–84, Aug. 2017, doi: 10.1016/j.jisa.2017.06.001.
[14] M. Cova, C. Kruegel, and G. Vigna, ‘‘There is no free phish: An analysis of ‘free’ and live phishing kits,’’ in Proc. WOOT, Jul. 2008, pp. 1–8.
[15] L. Wenyin, G. Liu, B. Qiu, and X. Quan, ‘‘Antiphishing through phishing target discovery,’’ IEEE Internet Comput., vol. 16, no. 2, pp. 52–61, Mar. 2012.
[16] Y. Zhang, J. I. Hong, and L. F. Cranor, ‘‘Cantina: A content-based approach to detecting phishing web sites,’’ in Proc. 16th Int. Conf. World Wide Web, Banff, AB, Canada, May 2007, pp. 639–648.
[17] S. Marchal, J. François, R. State, and T. Engel, ‘‘PhishStorm: Detecting phishing with streaming analytics,’’ IEEE Trans. Netw. Service Manage., vol. 11, no. 4, pp. 458–471, Dec. 2014.
[18] R. M. Mohammad, F. Thabtah, and L. McCluskey, ‘‘Predicting phishing websites based on self-structuring neural network,’’ Neural Comput. Appl., vol. 25, no. 2, pp. 443–458, Aug. 2014.
[19] L. A. Tuan Nguyen, B. L. To, H. K. Nguyen, and M. H. Nguyen, ‘‘An efficient approach for phishing detection using single-layer neural network,’’ in Proc. Int. Conf. Adv. Technol. Commun. (ATC), Hanoi, Vietnam, Oct. 2014, pp. 435–440.
[20] J. Zhang and X. Li, ‘‘Phishing detection method based on borderline-smote deep belief network,’’ in Security, Privacy, and Anonymity in Computation, Communication, and Storage (SpaCCS) (Lecture Notes in Computer Science), vol. 10658, G. Wang, M. Atiquzzaman, Z. Yan, and K. K. Choo Eds. Cham, Switzerland: Springer, 2017, pp. 45–53.
[21] R. Verma and A. Das, ‘‘What’s in a URL: Fast feature extraction and malicious URL detection,’’ in Proc. 3rd ACM Int. Workshop Secur. Privacy Anal., Scottsdale, Arizona, USA, Mar. 2017, pp. 55–63.
[22] P. Yang, G. Zhao, and P. Zeng, ‘‘Phishing website detection based on multidimensional features driven by deep learning,’’ IEEE Access, vol. 7, pp. 15196–15209, 2019.
[23] H. Sun, Z. Liu, S. Wang, and H. Wang, ‘‘Adaptive attention-based graph representation learning to detect phishing accounts on the Ethereum blockchain,’’ IEEE Trans. Netw. Sci. Eng., vol. 11, no. 3, pp. 2963–2975, May 2024, doi: 10.1109/tnse.2024.3355089.
[24] M. W. Shaukat, R. Amin, M. M. A. Muslam, A. H. Alshehri, and J. Xie, ‘‘A hybrid approach for alluring ads phishing attack detection using machine learning,’’ Sensors, vol. 23, no. 19, p. 8070, Sep. 2023.
[25] S. Asiri, Y. Xiao, S. Alzahrani, S. Li, and T. Li, ‘‘A survey of intelligent detection designs of HTML URL phishing attacks,’’ IEEE Access, vol. 11, pp. 6421–6443, 2023.
[26] M. Sameen, K. Han, and S. O. Hwang, ‘‘PhishHaven—An efficient real-time AI phishing URLs detection system,’’ IEEE Access, vol. 8, pp. 83425–83443, 2020.
[27] S. He, B. Li, H. Peng, J. Xin, and E. Zhang, ‘‘An effective cost-sensitive XGBoost method for malicious URLs detection in imbalanced dataset,’’ IEEE Access, vol. 9, pp. 93089–93096, 2021.
[28] X. Xiao, Z. Wang, Q. Li, S. Xia, and Y. Jiang, ‘‘Back-propagation neural network on Markov chains from system call sequences: A new approach for detecting Android malware with system call sequences,’’ IET Inf. Secur., vol. 11, no. 1, pp. 8–15, Jan. 2017.
[29] D. Antoniades, I. Polakis, G. Kontaxis, E. Athanasopoulos, S. Ioannidis, E. P. Markatos, and T. Karagiannis, ‘‘We.B: The web of short URLs,’’ in Proc. 20th Int. Conf. World Wide Web, Mar. 2011, pp. 715–724.
[30] N. Ketkar and J. Moolayil, ‘‘Convolutional neural networks,’’ in Deep Learning with Python: Learn Best Practices of Deep Learning Models With PyTorch, 2021, pp. 197–242.
[31] D. Ciregan, U. Meier, and J. Schmidhuber, ‘‘Multi-column deep neural networks for image classification,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, Jun. 2012, pp. 3642–3649.
[32] D. Bahdanau, K. Cho, and Y. Bengio, ‘‘Neural machine translation by jointly learning to align and translate,’’ Sep. 2014, arXiv:1409.0473.
[33] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, ‘‘Recurrent models of visual attention,’’ in Proc. 27th Int. Conf. Neural Inf. Process. Syst. (NIPS), Montreal, QC, Canada, 2014, pp. 2204–2212.
[34] M. T. Luong, H. Pham, and C. D. Manning, ‘‘Effective approaches to attention-based neural machine translation,’’ Aug. 2015, arXiv:1508.04025.
[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. 31st Conf. Neural Inf. Process. Syst., Long Beach, CA, USA, 2017, pp. 1–11.
[36] T. Berners-Lee, L. Masinter, and M. McCahill, Uniform Resource Locators (URL), document RFC 1738, 1994.
[37] Y. Kim, ‘‘Convolutional neural networks for sentence classification,’’ in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Doha, Qatar, 2014, pp. 1746–1751.
[38] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
[39] M. J. Pillai, S. Remya, V. Devika, S. Ramasubbareddy, and Y. Cho, ‘‘Evasion attacks and defense mechanisms for machine learning-based web phishing classifiers,’’ IEEE Access, vol. 12, pp. 19375–19387, 2024.
[40] S. Remya, M. J. Pillai, C. Arjun, S. Ramasubbareddy, and Y. Cho, ‘‘Enhancing security in LLNs using a hybrid trust-based intrusion detection system for RPL,’’ IEEE Access, vol. 12, pp. 58836–58850, 2024, doi: 10.1109/access.2024.3391918.
S. REMYA received the Ph.D. degree in computer science and engineering from Vellore Institute of Technology, Vellore Campus. She is currently an Assistant Professor with the Department of Computer Science and Engineering, School of Computing, Amrita Vishwa Vidyapeetham, Amritapuri Campus, Kollam, Kerala, India. Her research interests include deep learning, data science, computer vision, security, and smart environments.
MANU J. PILLAI received the Ph.D. degree in computer science and engineering from the National Institute of Technology, Calicut. He is currently an Associate Professor with the Department of Computer Science and Engineering, TKM College of Engineering, Kollam, Kerala, India. His research interests include wireless networks, deep learning, and smart environments.
KAJAL K. NAIR received the B.Tech. degree from Kerala Technological University (KTU) and the M.Tech. degree from the TKM College of Engineering, Kollam, Kerala, where she demonstrated outstanding academic performance. Her research interests include cybersecurity, with a specific focus on identifying phishing attacks.
SOMULA RAMA SUBBAREDDY received the Ph.D. degree in computer science and engineering from VIT University, Vellore, India, in 2022. He was a Postdoctoral Researcher with the Department of Information and Communication, Sunchon National University, South Korea, in 2024. He is currently an Assistant Professor with the Department of Information Technology, Vallurupalli Nageswara Rao Vignana Jyothi Institute of Engineering and Technology, Hyderabad. He has more than 40 publications in reputed journals and conferences. His research interests include mobile cloud computing, the IoT, machine learning, and edge computing.
YONG YUN CHO received the Ph.D. degree in computer engineering from Soongsil University. He is currently a Professor with the Department of Information and Communication Engineering, Sunchon National University. His main research interests include system software, embedded software, and ubiquitous computing.