IEEE_Format_Paper
APPROACH
ABSTRACT
Although the web has become part and parcel of our daily lives, it also provides
anonymity for people who undertake malicious acts such as phishing. To deceive
victims, a phisher can use various methods, such as social engineering and
counterfeit websites, to steal personal and corporate account IDs, usernames, and
passwords, among others. Several techniques have been proposed to detect phishing
websites, but phishers have come up with their own ways of evading them. Machine
learning is one of the most effective approaches for identifying these malicious
activities because most phishing attacks share common characteristics that machine
learning can recognize. This paper compares different machine learning models for
predicting phishing websites.
Keywords: Phishing, Classification, Cybercrime, Machine Learning
INTRODUCTION
RELATED WORKS
Many investigations have been done in the past regarding phishing detection.
Different methods, challenges and future prospects have been dealt with in the
studies. A selection of significant contributions in the area of phishing detection and
prevention is as follows.
Mittal et al. (2022) conducted a study of phishing detection that focuses on NLP
methods such as tokenization, stemming, and Bag-of-Words. Their textual analysis
was thorough, but no attachment analysis was carried out; they therefore called for
further work incorporating file-based phishing detection.
Rawal et al. (2017) applied SVM, Naive Bayes, and Random Forest algorithms to detect
email-based phishing. Although their models performed commendably in terms of
accuracy, the study was limited by a small dataset whose email samples were not
diverse.
Alattas et al. (2022) used SVM and Random Forest in email phishing detection but
encountered problems associated with imbalanced datasets. They speculated that
attachment analysis can be a worthy future approach to enhance their model
robustness.
Most of the works reviewed recognized the limited diversity of their datasets and the
potential for further development of their models, particularly by increasing the
size of the dataset and enhancing the detection strategies in response to new forms
of phishing attacks.
PHISHING TECHNIQUES
Users can be targeted and their personal information obtained using different
technologies. With advancements in technology, cyber criminals have also changed
the techniques they use.
To protect yourself from phishing, you must know what a cyber criminal is doing
and understand how to counter any kind of phishing attack that may come your
way.
Phishing
Email/Spam
Email spam is among the most commonly seen phishing methods: one email is sent out
simultaneously to thousands or millions of users asking for personal details, which
the phishers can then use for illegal activities. Most often the messages contain a
short description stating that users need to provide credentials in order to update
account information, change details, or verify an account quickly. Sometimes they
prompt users to fill out a form via a link in the email in order to access a new
service.
Content Splicing
Content splicing is a method in which phishers modify part of the content of a
trusted web page. The altered content steers the user to a page outside the official
website, where he or she is asked to fill in personal details.
Web-based distribution
Link Policy
Link Policy is a technique in which phishers send links that lead to malicious
websites. When the user clicks on the deceptive link, the phisher's website opens
instead of the webpage named in the link. Hovering the cursor over a link reveals its
actual address and helps users avoid being misled by it.
Voice Phishing
In phone phishing, the phisher calls the user and asks them to dial a number. The
aim is to obtain bank account information over the phone. Phone phishing is usually
perpetrated by callers using a fake identity.
SMS Phishing
SMS phishing messages, for example, usually carry links to phishing websites that
try to deceive victims into giving out personal details.
Keyloggers
Keyloggers record what a user types, and the captured keystrokes go to hackers who
can extract passwords and other information from them. To keep such software from
logging important financial data, secure sites let users log in by clicking on a
virtual keyboard with the mouse.
Malware
Some phishing scams rely on malware that runs on the recipient's computer. Emails
sent to individuals by phishers may contain malware; once the recipient clicks on
it, the malware starts running. Sometimes malware is also bundled with a downloaded
file.
Trojan Horses
A Trojan horse is a kind of malware that appears to function legitimately but in
fact allows unauthorized access to the user's accounts through the local machine.
Credentials that the user submits are collected and sent to cyber criminals.
Ransomware
Ransomware makes a device or its data unavailable until a ransom payment has been
made. On personal computers, ransomware is a form of malware that sneaks into a
user's machine through a social engineering attack in which the user is prompted to
click on, open, or visit links.
Malicious Advertising
List based:
Well-known browsers like Mozilla Firefox, Microsoft Edge, and Google Chrome use lists
as a way to identify phishing websites. There are two kinds of list: the whitelist
and the blacklist. The whitelist contains valid URLs that can be accessed through
the browser; if a URL is whitelisted, the browser can download the webpage. The
blacklist, on the other hand, contains phishing or scam URLs, and the browser itself
refuses to download a blacklisted web page. There is one major shortcoming: even a
minute variation in a URL may be enough to bypass the list, so the list has to be
updated regularly to stop new phishing URLs from getting through. A further
protection strategy against phishing links in emails relies on selected features
that can differentiate fake websites from real ones.
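As a minimal sketch of the list lookup described above, the following Python snippet performs an exact-match check against two small lists. The URLs, the list contents, and the function names `normalize` and `check_url` are hypothetical and chosen only for illustration; real browsers use far larger lists and more elaborate matching.

```python
from urllib.parse import urlsplit

# Hypothetical example lists; a real browser list would hold many more entries.
WHITELIST = {"https://www.example-bank.com/login"}
BLACKLIST = {"http://examp1e-bank.com/login"}


def normalize(url: str) -> str:
    """Lower-case the scheme and host and drop trailing slashes from the path."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def check_url(url: str) -> str:
    """Return 'allow', 'block', or 'unknown' for an exact-match list lookup."""
    key = normalize(url)
    if key in WHITELIST:
        return "allow"
    if key in BLACKLIST:
        return "block"
    return "unknown"


print(check_url("http://examp1e-bank.com/login"))   # listed phishing URL
print(check_url("http://examp1e-bank.com/login2"))  # slight variation, not listed
```

The second call illustrates the shortcoming noted in the text: a one-character change to the path is enough for the URL to miss the blacklist, which is why such lists need frequent updates.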
Such a system draws on several sources of information, e.g., URLs, text content
(such as emails), DNS records from domain name resolution services, digital
certificates (e.g., SSL), and web traffic (i.e., IP packets). Its success rate
depends on, among other things, the specific techniques used to train the models,
such as the choice of classification algorithm. One advantage of these technologies
is their ability to recognize zero-day phishing attacks automatically, as they
occur, which was not possible before.
The digital landscape is still a war zone, where phishing attacks constantly threaten
both individuals and organizations. These attacks aim to steal sensitive information
such as credentials or financial data, or to mislead users into following malicious
links that can install malware on their devices or redirect them to fake websites.
Machine learning (ML) has become a potent weapon against this evolving threat. This
analysis examines the ML techniques commonly used to detect phishing, highlighting
their strengths, their limitations, and the points to consider when implementing
them.
Email header properties: Sender address (fake), recipient address, email size,
timestamps.
Logistic Regression: A simple yet powerful linear model for binary classification
that distinguishes between legitimate and phishing emails. It predicts the
likelihood that an email belongs to one class or the other (legitimate or phishing)
based on the extracted features. Logistic regression is easy to understand and
computationally efficient, making it a good starting point for phishing detection.
However, it may struggle with the complex feature relationships seen in some
phishing attempts.
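To illustrate how logistic regression turns extracted features into a likelihood, the sketch below applies the logistic (sigmoid) function to a weighted sum of features. The feature names, weights, and bias are hypothetical values standing in for what training would produce; they are not taken from the experiments in this paper.

```python
import math

# Hypothetical feature weights and bias, as if learned during training.
WEIGHTS = {"has_ip_in_url": 2.0, "num_links": 0.3, "sender_known": -2.5}
BIAS = -1.0


def phishing_probability(features: dict) -> float:
    """Apply the logistic (sigmoid) function to a weighted sum of features."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def classify(features: dict, threshold: float = 0.5) -> str:
    """Label an email 'phishing' when its probability reaches the threshold."""
    if phishing_probability(features) >= threshold:
        return "phishing"
    return "legitimate"


suspicious = {"has_ip_in_url": 1, "num_links": 4, "sender_known": 0}
trusted = {"has_ip_in_url": 0, "num_links": 1, "sender_known": 1}
print(classify(suspicious), classify(trusted))
```

Because the score is a weighted sum, each weight can be read directly as the direction and strength of a feature's influence, which is the interpretability advantage mentioned above.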
K-Nearest Neighbours (KNN): An algorithm that classifies data points based on their
similarity to labelled data points in the training set. In phishing detection, an
email is considered phishing when its features closely resemble those of known
phishing emails in the training data. KNN is relatively simple to implement and
reasonably easy to interpret. Nevertheless, on very large datasets prediction can be
slow because of the computational load, and performance may depend greatly on the
choice of "K" (the number of nearest neighbours taken into account).
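The neighbour vote described above can be sketched in a few lines of Python. The two-dimensional feature vectors and their labels are made-up illustrative data, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter

# Hypothetical training set: (feature vector, label) pairs.
TRAINING = [
    ((0.9, 0.8), "phishing"),
    ((0.8, 0.9), "phishing"),
    ((0.7, 0.7), "phishing"),
    ((0.1, 0.2), "legitimate"),
    ((0.2, 0.1), "legitimate"),
    ((0.3, 0.3), "legitimate"),
]


def knn_predict(sample, training=TRAINING, k=3):
    """Return the majority label among the k training points nearest to sample."""
    by_distance = sorted(training, key=lambda item: math.dist(sample, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_predict((0.85, 0.75)))
print(knn_predict((0.15, 0.25)))
```

Every prediction sorts the whole training set by distance, which makes the cost on very large datasets visible in the code itself.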
Support Vector Machines (SVM): These powerful algorithms aim to separate legitimate
emails from phishing ones in a high-dimensional feature space. SVMs handle complex
feature relationships effectively and are usually not sensitive to noise in the
dataset. They can be especially valuable for unbalanced datasets, where phishing
emails make up only a small share of all emails collected. However, training SVMs
on big datasets tends to be computationally expensive, and their decision-making
process can be less transparent than that of simpler models.
Decision trees: A decision tree is a set of branching queries on the features of an
email. The tree moves through the characteristics of the email and ends in a node
that gives the classification (legitimate or phishing). Decision trees are easy to
understand, since their decision-making process is simple to follow. They can also
handle numeric and categorical attributes without much need for data preprocessing.
On the flip side, decision trees grown without proper controls may overfit, and their
performance may be affected by the order in which features are selected during tree
construction.
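A decision tree can be read as nested conditional statements. The hand-written tree below is only a sketch of that idea: the feature names and split thresholds are hypothetical, whereas a real tree would learn them from training data.

```python
def classify_email(email: dict) -> str:
    """Walk a small hand-written tree; each 'if' is one branching query."""
    if email["sender_domain_matches"]:
        # Sender looks genuine: flag only mail with many links and urgent wording.
        if email["num_links"] > 5 and email["urgent_wording"]:
            return "phishing"
        return "legitimate"
    # Sender domain does not match the claimed organization.
    if email["asks_for_credentials"]:
        return "phishing"
    return "phishing" if email["num_links"] > 2 else "legitimate"


sample = {
    "sender_domain_matches": False,
    "num_links": 1,
    "urgent_wording": True,
    "asks_for_credentials": True,
}
print(classify_email(sample))
```

The example mixes a numeric attribute (`num_links`) with yes/no attributes, reflecting the point that trees handle both kinds without preprocessing.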
The scalable nature of machine learning models makes them a good fit for companies
that are bombarded with high volumes of emails every day. This scalability ensures
that email protection remains effective even as the number of emails increases. To
support real-time ML-based email analysis, some organizations may choose to use
cloud-based solutions or invest in high-performance computing resources for training
and deploying ML models.
However, it is important to note that while machine learning has proved to be a
powerful tool for detecting phishing, it should be combined with other security
measures to form a comprehensive plan that maximizes the level of protection. The
main points are as follows:
Employee training and user education: When it comes to phishing attacks, the
employees are usually in the front line. Organizations need to invest in user
education programs that will help train their staff in detecting popular phishing
techniques and ensure safe email practices.
Secure Email Gateways (SEGs): SEGs can be deployed to scan both incoming and
outgoing emails for malicious content such as phishing attempts. They employ
anti-virus and anti-phishing technologies to detect suspicious emails before they
reach the recipient's inbox.
Email authentication protocols (DMARC, SPF, DKIM): These protocols protect against
email spoofing, a technique frequently used by attackers for phishing purposes.
DMARC lets companies specify how email recipients should treat emails claiming to
originate from a given domain, thereby preventing domain spoofing. SPF and DKIM, in
turn, help validate that the sender is genuine.
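As a rough illustration of how a DMARC policy is expressed, the sketch below parses a DMARC DNS TXT record into its tag/value pairs and reads the requested policy from the `p` tag. The record shown is an illustrative one for the placeholder domain example.com, and the helper names `parse_dmarc` and `dmarc_policy` are hypothetical; a production validator would follow the full DMARC specification.

```python
def parse_dmarc(record: str) -> dict:
    """Split a DMARC TXT record into its tag=value pairs."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        tags[name.strip().lower()] = value.strip()
    return tags


def dmarc_policy(record: str) -> str:
    """Return the requested policy: 'none', 'quarantine', or 'reject'."""
    tags = parse_dmarc(record)
    if tags.get("v") != "DMARC1" or "p" not in tags:
        raise ValueError("not a DMARC policy record")
    return tags["p"].lower()


# Illustrative record for the placeholder domain example.com.
record = "v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.com"
print(dmarc_policy(record))
```

A policy of `reject` asks receiving servers to refuse mail that fails authentication for the domain, while `quarantine` and `none` request progressively weaker handling.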
One of the major challenges we faced was the scarcity of phishing datasets. Many
research papers on phishing detection have been published, but most of them do not
provide the dataset used in the research. An ideal dataset contains a standard set
of characteristics recorded for each phishing website. The dataset we used in our
research has 1000 entries, indexed from 0 to 999, and a total of 50 columns. Each
website is marked as either legitimate or phishing. The features of our dataset are
as follows:
The mean and standard deviation of all features are given below:-
Evaluation Metrics:-
Recall measures the percentage of phishing websites that the model manages to
detect, reflecting the model's effectiveness. The F1 score is the harmonic mean of
precision and recall.
Let NL→L be the number of legitimate websites classified as legitimate, NL→P be the
number of legitimate websites misclassified as phishing, NP→L be the number of
phishing websites misclassified as legitimate, and NP→P be the number of phishing
websites classified as phishing. Thus the following equations hold.
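The equations themselves are not reproduced in the text. Assuming the standard definitions with phishing treated as the positive class, which is consistent with the description of recall given earlier, they would read as follows:

```latex
\begin{align}
\text{Accuracy}  &= \frac{N_{L \to L} + N_{P \to P}}
                         {N_{L \to L} + N_{L \to P} + N_{P \to L} + N_{P \to P}} \\
\text{Precision} &= \frac{N_{P \to P}}{N_{L \to P} + N_{P \to P}} \\
\text{Recall}    &= \frac{N_{P \to P}}{N_{P \to L} + N_{P \to P}} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}
                         {\text{Precision} + \text{Recall}}
\end{align}
```

For instance, with hypothetical counts of 90 phishing websites detected, 10 missed, and 5 legitimate websites flagged by mistake, recall is 90/100 = 0.90 and precision is 90/95, or about 0.947.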
Experimental Results :-
In our study, we used various machine learning models for phishing detection, namely
Logistic Regression, AdaBoost, Random Forest, Gradient Boosting, SVM, Stacking
Classifier, Voting Classifier, XGBoost, and GaussianNB.
We evaluate the accuracy, precision, recall, and F1 score of these models and compare
them to find the best-performing model for the dataset. The table below shows the
comparison of the accuracy, precision, recall, and F1 score of these models.
In our findings, we observed that the classifiers differ in their capabilities and
performance metrics. The support vector machine (SVM) produced notably different
results across kernels, with the RBF kernel showing the best performance, which
suggests that its non-linear classification ability suits our dataset. However, we
recognize the importance of meticulous hyperparameter tuning through
cross-validation, especially for models like SVM. Overfitting is a real concern, and
cross-validation helps balance fit to the training data against generalization.
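The cross-validation procedure mentioned above can be sketched generically as follows. This is not the tuning code used in the experiments: the fold-splitting scheme, the function names, and the toy threshold model are all assumptions made for illustration. In a hyperparameter search, `cross_val_accuracy` would be called once per candidate setting (for example, per SVM kernel) and the setting with the best average score kept.

```python
def k_fold_indices(n_samples: int, n_folds: int):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(fit, predict, X, y, n_folds=5):
    """Average held-out accuracy of a model over n_folds splits."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), n_folds):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Toy model (hypothetical): a threshold halfway between the two class means.
def fit_threshold(X, y):
    mean0 = sum(x for x, label in zip(X, y) if label == 0) / y.count(0)
    mean1 = sum(x for x, label in zip(X, y) if label == 1) / y.count(1)
    return (mean0 + mean1) / 2


def predict_threshold(threshold, x):
    return 1 if x >= threshold else 0


X = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]  # one made-up feature per website
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]       # 0 = legitimate, 1 = phishing
print(cross_val_accuracy(fit_threshold, predict_threshold, X, y))
```

Because every score is computed on samples the model did not see during fitting, a setting that merely memorizes the training data is penalized, which is how cross-validation guards against overfitting.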
Random Forest stood out as one of the best performers, with high accuracy, robustness
against noise and outliers, and efficient feature selection capabilities. These
observations align with the 98% accuracy and 98.5% F1 score of Random Forest in our
experimental analysis. However, despite these advantages, tuning its numerous
parameters for optimal performance was challenging.
XGBoost was also one of the best models for our dataset; its strengths are its speed
and its regularization for variance reduction. In our experimental analysis, its
accuracy was 98.8% and its F1 score was also 98.8%. However, we also noted the
algorithm's complexity and the expertise required to tune it effectively for the
challenges in our dataset.
Logistic regression performed fairly well, with an accuracy of 92.8% and an F1 score
of 92.8%. Its simplicity and efficient training make it easy to use and a valuable
choice for binary classification tasks. Despite its linear nature, it handles feature
scaling well and provides clear and useful insights into features through coefficient
analysis.
AdaBoost and Gradient Boosting are ensemble techniques, with accuracies of 97.5% and
97.95% and F1 scores of 97.54% and 97.98%, respectively. AdaBoost's ability to
combine weak learners into a strong classifier makes it resistant to overfitting on
low-noise datasets. Its simplicity and ease of use made it a suitable choice for our
experiments. Gradient Boosting is an iterative training process in which each new
model corrects the errors of the previous ones, leading to strong predictive
performance. Its ability to handle both numerical and categorical features made it a
suitable choice for our experiments.
The Stacking Classifier and Voting Classifier demonstrated high accuracy and
precision in ensemble learning experiments. The Stacking Classifier combined
multiple classification models, optimizing ensemble predictions, but may require
more computational resources. The Voting Classifier combined strengths from
different algorithms, improving model robustness.
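The idea behind a voting classifier can be sketched with simple majority (hard) voting. The three rule-based base classifiers and their feature names are hypothetical stand-ins for trained models such as those compared above, and the sketch does not cover stacking, where a further model learns how to combine the base predictions.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most base classifiers (hard voting)."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical base classifiers, each a simple rule on a feature dictionary.
def rule_url_length(site):
    return "phishing" if site["url_length"] > 75 else "legitimate"


def rule_has_https(site):
    return "legitimate" if site["has_https"] else "phishing"


def rule_domain_age(site):
    return "phishing" if site["domain_age_days"] < 30 else "legitimate"


BASE_CLASSIFIERS = [rule_url_length, rule_has_https, rule_domain_age]


def voting_classifier(site):
    """Combine the base classifiers' labels by majority vote."""
    return majority_vote([clf(site) for clf in BASE_CLASSIFIERS])


site = {"url_length": 90, "has_https": False, "domain_age_days": 400}
print(voting_classifier(site))
```

With an odd number of base classifiers and two labels, a tie cannot occur, and a single mistaken classifier is outvoted by the other two; this is one way to see the robustness gain described above.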
In the study, ten different classifiers, including Logistic Regression, Decision Tree,
Support Vector Machine (SVM), AdaBoost, Random Forest, Gradient Boosting, and
XGBoost, were evaluated on a phishing website dataset. The findings indicated that
ensemble classifiers, especially Random Forest and XGBoost, performed better with
respect to computation time and predictive accuracy. As is well established for
classification tasks in practice, ensemble methods, which combine many weak learners
into one strong learner, proved useful. On low-noise data, AdaBoost was robust to
overfitting and easy to interpret; on noisy samples, however, it showed drawbacks,
namely very long learning times and probable distortion of the results. In addition,
this algorithm was slower in execution than Random Forest and XGBoost.
As part of the current work, we also developed a hybrid model consisting of SVM and
Random Forest, aiming to enhance the detection of phishing websites by utilizing the
advantages of both classifiers. This hybrid approach is intended to offer high
classification performance by combining the high accuracy of Random Forest with the
effective decision boundaries of the SVM.