Phishing Websites Features
One of the challenges faced by our research was the unavailability of reliable training datasets. In fact,
this challenge faces any researcher in the field. Although plenty of articles about predicting
phishing websites using data mining techniques have been disseminated these days, no reliable
training dataset has been published publicly, perhaps because there is no agreement in the literature on
the definitive features that characterize phishing websites; hence, it is difficult to shape a dataset that
covers all possible features.
In this article, we shed light on the important features that have proved to be sound and effective in
predicting phishing websites. In addition, we propose some new features, experimentally assign new
rules to some well-known features, and update some other features.
Rule: IF The Domain Part has an IP Address → Phishing
      Otherwise → Legitimate
We have been able to update this feature rule by using a method based on frequency and thus
improving upon its accuracy.
Rule: IF TinyURL → Phishing
      Otherwise → Legitimate
Rule: IF URL Having @ Symbol → Phishing
      Otherwise → Legitimate
Rule: IF The Position of the Last Occurrence of "//" in the URL > 7 → Phishing
      Otherwise → Legitimate
Rule: IF Domain Name Part Includes (−) Symbol → Phishing
      Otherwise → Legitimate
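As an illustration, the IP address, "@" symbol, "//" position and (−) symbol rules can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function and key names are hypothetical, the position of "//" is counted from 1, and only plain dotted-decimal or IPv6 hosts are recognised as IP addresses (hexadecimal encodings are not handled).

```python
import ipaddress
from urllib.parse import urlsplit


def label(is_phishing):
    """Map a boolean test result to the labels used in the rules."""
    return "Phishing" if is_phishing else "Legitimate"


def host_is_ip(host):
    """True if the host is a literal IPv4/IPv6 address (hex forms not handled)."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_rules(url):
    """Apply four address-bar rules to a URL (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    return {
        "ip_address": label(host_is_ip(host)),
        "at_symbol": label("@" in url),
        # Assumption: position is 1-based, so add 1 to the 0-based index.
        "double_slash": label(url.rfind("//") + 1 > 7),
        "dash_in_domain": label("-" in host),
    }
```

For example, url_rules("http://125.98.3.123/fake.html") labels the "ip_address" feature as Phishing, while the other three features remain Legitimate.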
1.8. HTTPS (Hyper Text Transfer Protocol with Secure Sockets Layer)
The existence of HTTPS is very important in giving the impression of website legitimacy, but this is
clearly not enough. The authors in (Mohammad, Thabtah and McCluskey 2012) (Mohammad,
Thabtah and McCluskey 2013) suggest checking the certificate associated with HTTPS, including how
far the certificate issuer is trusted and the age of the certificate. Certificate Authorities that are consistently
listed among the top trustworthy names include: “GeoTrust, GoDaddy, Network Solutions, Thawte,
Comodo, Doster and VeriSign”. Furthermore, by testing out our datasets, we find that the minimum
age of a reputable certificate is two years.
Rule: IF Using HTTPS and Issuer Is Trusted and Age of Certificate ≥ 1 Year → Legitimate
      Using HTTPS and Issuer Is Not Trusted → Suspicious
      Otherwise → Phishing
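The three-way HTTPS rule can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name and parameters are hypothetical, the issuer name and certificate age are assumed to have been extracted beforehand, and the one-year threshold follows the rule as written above.

```python
# Issuer names taken from the list quoted in the text.
TRUSTED_ISSUERS = {
    "GeoTrust", "GoDaddy", "Network Solutions", "Thawte",
    "Comodo", "Doster", "VeriSign",
}


def https_rule(uses_https, issuer, cert_age_years):
    """Classify a site from its HTTPS usage, issuer and certificate age."""
    trusted = issuer in TRUSTED_ISSUERS
    if uses_https and trusted and cert_age_years >= 1:
        return "Legitimate"
    if uses_https and not trusted:
        return "Suspicious"
    # No HTTPS, or a trusted issuer with a certificate younger than one year.
    return "Phishing"
```

Note that, under the rule as stated, a trusted issuer with a certificate younger than one year falls into the "Otherwise" branch.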
Rule: IF Domain Expires in ≤ 1 Year → Phishing
      Otherwise → Legitimate
1.10.Favicon
A favicon is a graphic image (icon) associated with a specific webpage. Many existing user agents
such as graphical browsers and newsreaders show the favicon as a visual reminder of the website identity
in the address bar. If the favicon is loaded from a domain other than that shown in the address bar,
then the webpage is likely to be considered a Phishing attempt.
Rule: IF Favicon Loaded From External Domain → Phishing
      Otherwise → Legitimate
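A minimal sketch of the favicon check is given below. It assumes the favicon reference has already been read from the page; the function name is hypothetical, and "same domain" is approximated by an exact host name comparison, which is stricter than comparing registered domains.

```python
from urllib.parse import urljoin, urlsplit


def favicon_rule(page_url, favicon_href):
    """Compare the host serving the favicon with the host of the page."""
    page_host = urlsplit(page_url).hostname
    # Resolve relative references such as "/favicon.ico" against the page URL.
    icon_host = urlsplit(urljoin(page_url, favicon_href)).hostname
    return "Phishing" if icon_host != page_host else "Legitimate"
```

A relative reference such as "/favicon.ico" resolves to the page's own host and is therefore labelled Legitimate.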
Rule: IF Port # is of the Preferred Status → Phishing
      Otherwise → Legitimate
Table 1 Common ports to be checked
PORT   Service   Meaning                                                               Preferred Status
445    SMB       Providing shared access to files, printers, serial ports              Close
1433   MSSQL     Store and retrieve data as requested by other software applications   Close
Rule: IF Using HTTP Token in Domain Part of The URL → Phishing
      Otherwise → Legitimate
1.Request URL
Request URL examines whether the external objects contained within a webpage such as images,
videos and sounds are loaded from another domain. In legitimate webpages, the webpage address and
most of the objects embedded within the webpage share the same domain.
Rule: IF % of Request URL < 22% → Legitimate
      % of Request URL between 22% and 61% → Suspicious
      Otherwise → Phishing
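One possible way to compute this percentage is sketched below using only the Python standard library. Several points are assumptions made for illustration: the class and function names are hypothetical, only the src attribute of a handful of tags is inspected, a page with no embedded objects is treated as Legitimate, and the upper bound of the Suspicious band is taken as inclusive (≤ 61%).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ObjectCollector(HTMLParser):
    """Collect the src attribute of embedded objects (assumed tag list)."""
    TAGS = {"img", "video", "audio", "source", "embed"}

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag in self.TAGS and src:
            self.sources.append(src)


def request_url_rule(page_url, html):
    """Label a page by the share of objects loaded from another host."""
    collector = ObjectCollector()
    collector.feed(html)
    if not collector.sources:
        return "Legitimate"  # assumption: nothing embedded, nothing external
    page_host = urlsplit(page_url).hostname
    external = sum(
        urlsplit(urljoin(page_url, src)).hostname != page_host
        for src in collector.sources
    )
    percent = 100 * external / len(collector.sources)
    if percent < 22:
        return "Legitimate"
    if percent <= 61:  # assumption: upper bound is inclusive
        return "Suspicious"
    return "Phishing"
```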
2.URL of Anchor
An anchor is an element defined by the <a> tag. This feature is treated exactly as “Request URL”.
However, for this feature we examine:
1. If the <a> tags and the website have different domain names. This is similar to request URL
feature.
2. If the anchor does not link to any webpage, e.g.:
A. <a href="#">
B. <a href="#content">
C. <a href="#skip">
D. <a href="JavaScript:void(0)">
Rule: IF % of URL Of Anchor < 31% → Legitimate
      % of URL Of Anchor ≥ 31% And ≤ 67% → Suspicious
      Otherwise → Phishing
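The anchor feature can be sketched in the same spirit. This is an illustrative sketch only: the names are hypothetical, an anchor counts towards the percentage if it points to another host or if its href starts with "#" or "javascript", and a page without anchors is assumed to be Legitimate.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "href" in attributes:
            self.hrefs.append(attributes["href"] or "")


def counts_against_page(page_url, href):
    """True for null anchors (#..., javascript...) and for other hosts."""
    lowered = href.strip().lower()
    if lowered.startswith("#") or lowered.startswith("javascript"):
        return True
    target_host = urlsplit(urljoin(page_url, href)).hostname
    return target_host != urlsplit(page_url).hostname


def url_of_anchor_rule(page_url, html):
    collector = AnchorCollector()
    collector.feed(html)
    if not collector.hrefs:
        return "Legitimate"  # assumption: no anchors to judge
    flagged = sum(counts_against_page(page_url, h) for h in collector.hrefs)
    percent = 100 * flagged / len(collector.hrefs)
    if percent < 31:
        return "Legitimate"
    if percent <= 67:
        return "Suspicious"
    return "Phishing"
```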
Rule: IF Using "mail()" or "mailto:" Function to Submit User Information → Phishing
      Otherwise → Legitimate
6.Abnormal URL
This feature can be extracted from the WHOIS database. For a legitimate website, identity is typically
part of its URL.
Rule: IF The Host Name Is Not Included In URL → Phishing
      Otherwise → Legitimate
3.HTML and JavaScript based Features
1.Website Forwarding
The fine line that distinguishes phishing websites from legitimate ones is how many times a website
has been redirected. In our dataset, we find that legitimate websites have been redirected at most
once. On the other hand, phishing websites containing this feature have been redirected at least 4
times.
Rule: IF # of Redirect Pages ≤ 1 → Legitimate
      # of Redirect Pages ≥ 2 And < 4 → Suspicious
      Otherwise → Phishing
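Expressed as code, the forwarding rule is a simple threshold test. The sketch below assumes the number of redirects has already been counted by whatever crawler fetched the page; the function name is hypothetical.

```python
def forwarding_rule(redirect_count):
    """Classify a website by how many times it was redirected."""
    if redirect_count <= 1:
        return "Legitimate"
    if redirect_count < 4:  # i.e. 2 or 3 redirects
        return "Suspicious"
    return "Phishing"
```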
Rule: IF Right Click Disabled → Phishing
      Otherwise → Legitimate
Rule: IF Popup Window Contains Text Fields → Phishing
      Otherwise → Legitimate
5.IFrame Redirection
IFrame is an HTML tag used to display an additional webpage within the one that is currently shown.
Phishers can make use of the “iframe” tag and make it invisible, i.e., without frame borders. In this
regard, phishers make use of the “frameBorder” attribute which causes the browser to render a visual
delineation.
Rule: IF Using iframe → Phishing
      Otherwise → Legitimate

4.Domain based Features
1.Age of Domain
This feature can be extracted from the WHOIS database (Whois 2005). Most phishing websites live for a
short period of time. By reviewing our dataset, we find that the minimum age of the legitimate
domain is 6 months.
Rule: IF Age Of Domain ≥ 6 months → Legitimate
      Otherwise → Phishing
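The age test can be sketched as below, assuming the domain's creation date has already been obtained from a WHOIS lookup and is passed in by the caller. The function name is hypothetical, and age is counted in whole calendar months.

```python
from datetime import date


def age_of_domain_rule(created, today):
    """Label a domain by its age in whole calendar months."""
    months = (today.year - created.year) * 12 + (today.month - created.month)
    if today.day < created.day:
        months -= 1  # the current month has not been completed yet
    return "Legitimate" if months >= 6 else "Phishing"
```

For instance, a domain created on 15 January is labelled Legitimate from 15 July of the same year onwards.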
2.DNS Record
For phishing websites, either the claimed identity is not recognized by the WHOIS database (Whois
2005) or no records are found for the hostname (Pan and Ding 2006). If the DNS record is empty or not
found then the website is classified as “Phishing”, otherwise it is classified as “Legitimate”.
Rule: IF No DNS Record For The Domain → Phishing
      Otherwise → Legitimate
3.Website Traffic
This feature measures the popularity of the website by determining the number of visitors and the
number of pages they visit. However, since phishing websites live for a short period of time, they may
not be recognized by the Alexa database (Alexa the Web Information Company., 1996). By reviewing
our dataset, we find that, in the worst case, legitimate websites rank among the top 100,000.
Furthermore, if the domain has no traffic or is not recognized by the Alexa database, it is classified as
“Phishing”. Otherwise, it is classified as “Suspicious”.
Rule: IF Website Rank < 100,000 → Legitimate
      Website Rank > 100,000 → Suspicious
      Otherwise → Phishing
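A sketch of the traffic rule follows. It assumes the rank has been looked up beforehand and is passed as None when the domain has no traffic or is not recognised. The function name is hypothetical, and a rank of exactly 100,000, which the rule does not cover explicitly, is treated here as Suspicious.

```python
def website_traffic_rule(rank):
    """Classify a website by its traffic rank (None means unranked)."""
    if rank is None:
        return "Phishing"  # no traffic or not recognised
    if rank < 100_000:
        return "Legitimate"
    # Assumption: a rank of exactly 100,000 is grouped with lower-ranked sites.
    return "Suspicious"
```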
4.PageRank
PageRank is a value ranging from “0” to “1”. PageRank aims to measure how important a webpage is
on the Internet. The greater the PageRank value the more important the webpage. In our datasets, we
find that about 95% of phishing webpages have no PageRank. Moreover, we find that the remaining
5% of phishing webpages may reach a PageRank value up to “0.2”.
Rule: IF PageRank < 0.2 → Phishing
      Otherwise → Legitimate
5.Google Index
This feature examines whether a website is in Google’s index or not. When a site is indexed by
Google, it is displayed on search results (Webmaster resources, 2014). Usually, phishing webpages are
accessible only for a short period and, as a result, many phishing webpages may not be found on the
Google index.
Rule: IF Webpage Indexed by Google → Legitimate
      Otherwise → Phishing
Rule: IF Host Belongs to Top Phishing IPs or Top Phishing Domains → Phishing
      Otherwise → Legitimate