Cantina content based approach to detect phishing websites

CANTINA is a content-based approach to detecting phishing websites that examines the text-based content of websites along with surface characteristics like the age of the domain and presence of known images. It uses the TF-IDF algorithm to generate lexical signatures of web pages by calculating term frequency-inverse document frequency values for words. It then tests the page by searching for its lexical signature and checking if the domain matches top search results to determine if the page is legitimate or a phishing attempt. The system aims to detect phishing URLs and links to protect users from disclosing personal information on fraudulent websites or through malicious emails.

CANTINA
A Content-Based Approach to Detecting Phishing Web Sites

ABSTRACT
• CANTINA is a content-based approach.
• Examines whether the content is legitimate or not.
• Detects phishing URLs and links.

INTRODUCTION
• Phishing
A kind of attack in which victims are tricked by spoofed emails and fraudulent web sites into giving up personal information.
• How many phishing sites are there?
9,255 unique phishing sites were reported in June 2006 alone.
• How much does phishing cost each year?
$1 billion to $2.8 billion per year.

EXISTING SYSTEM
• NetCraft (surface characteristics)
• SpoofGuard (surface characteristics and blacklist)
• Cloudmark (blacklist)

PROPOSED SYSTEM
• Detects phishing websites.
• Examines text-based content along with surface characteristics.
• Surface characteristics include:
- Age of domain.
- Known images.
- Suspicious URL.
- Suspicious links.
• Detects phishing links in users' email.

TF-IDF ALGORITHM
• Term Frequency (TF)
– The number of times a given term appears in a specific document
– A measure of the importance of the term within that document
• Inverse Document Frequency (IDF)
– A measure of how rare a term is across an entire collection of documents; terms that appear in many documents get a low IDF
• A high TF-IDF weight means a high TF in the given document and a low document frequency across the collection
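The two quantities above can be sketched in a few lines. The slides do not give an exact formula, so this assumes one common formulation, weight = TF × log(N / DF), where N is the number of documents and DF is the number of documents containing the term. The function names and the toy corpus are made up for illustration.

```python
import math
from collections import Counter

def term_frequency(term, doc_tokens):
    """Number of times `term` appears in one document."""
    return Counter(doc_tokens)[term]

def inverse_document_frequency(term, corpus):
    """log(N / DF): large for rare terms, 0 for terms found in every document."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # term never seen in the corpus (assumed convention)
    return math.log(len(corpus) / df)

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF weight of `term` in `doc_tokens`, relative to `corpus`."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)

# Toy corpus (illustrative only): each document is a list of tokens.
corpus = [
    ["ebay", "sign", "in", "ebay", "account"],
    ["bank", "sign", "in", "account"],
    ["news", "sign", "in", "weather"],
]
page = corpus[0]

for term in ("ebay", "account", "sign"):
    print(term, round(tf_idf(term, page, corpus), 3))
```

In this toy corpus "ebay" appears twice in the page and in no other document, so it gets the highest weight, while "sign" appears in every document and gets a weight of 0.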

MODULES
• Parsing the web pages
• Generating the lexical signature
• Testing Process
• Report Generation

Parsing the web pages
• Links, anchor tags, form tags and attachments in the web page are turned into the corresponding text links, HTML links, etc.
• Done by parsing each text element.
• Uses the HTML Parser API, which extracts information from HTML code.
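As a rough illustration of this module, the sketch below pulls visible text, anchor links and form targets out of a page. It uses Python's standard `html.parser` module as a stand-in for the HTML Parser API named on the slide; the class name `PageParser`, the helper `parse_page` and the sample HTML are assumptions made for the example.

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collects visible text, anchor hrefs and form actions from a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.form_actions = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "form":
            self.form_actions.append(attrs.get("action") or "")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())

def parse_page(html):
    """Return the text, links and form actions found in `html`."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return {
        "text": " ".join(parser.text_parts),
        "links": parser.links,
        "form_actions": parser.form_actions,
    }

sample_html = """
<html><head><title>Sign in</title><style>p {color: red}</style></head>
<body><p>Welcome to eBay</p>
<a href="http://example.com/help">Help</a>
<form action="http://example.com/login"><input name="user"></form>
</body></html>
"""
parsed = parse_page(sample_html)
print(parsed)
```

The extracted text feeds the lexical-signature step, while the links and form targets are the kind of surface evidence the proposed system inspects.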

Generating the lexical signature
• The TF-IDF algorithm is used to generate lexical signatures.
• The TF-IDF value is calculated for each word in a document.
• The words with the highest values are selected as the signature.
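The selection step above can be sketched as follows. This is a minimal example, not the deck's implementation: it assumes document frequencies are already available as a dictionary, uses the TF × log(N / DF) weighting as one common choice, and treats the signature length `k` as a free parameter. The names `tokenize`, `lexical_signature` and the frequency table are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def lexical_signature(page_text, doc_freq, n_docs, k=5):
    """Return the k terms of `page_text` with the highest TF-IDF weight.

    `doc_freq` maps a term to the number of corpus documents containing it;
    terms missing from the table are treated as appearing in one document.
    """
    counts = Counter(tokenize(page_text))
    scores = {
        term: tf * math.log(n_docs / max(doc_freq.get(term, 1), 1))
        for term, tf in counts.items()
    }
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:k]

# Made-up document frequencies over a pretend corpus of 1000 documents.
doc_freq = {
    "the": 1000, "to": 1000, "in": 900, "your": 800,
    "sign": 400, "account": 300, "auction": 20, "ebay": 5,
}
page_text = "Sign in to your eBay account. eBay auction: sign in to eBay."
signature = lexical_signature(page_text, doc_freq, n_docs=1000, k=3)
print(signature)
```

Frequent but uninformative words such as "to" and "in" drop out, leaving the terms that distinguish this page from the rest of the collection.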

Testing Process
• Feed this lexical signature to a search engine.
• Check whether the domain name of the current web page matches the domain name of any of the top N search results.
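The domain check can be sketched as below. The search step itself is left out: `result_urls` stands for whatever list of URLs a search engine returned for the lexical signature, and the URLs shown are made up. Comparing host names after stripping a leading "www." is a simplification assumed here; a fuller version would compare registered domains.

```python
from urllib.parse import urlparse

def domain_of(url):
    """Lowercased host name of `url`, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host

def matches_top_results(page_url, result_urls, n=10):
    """True if the page's domain equals the domain of any top-n result."""
    page_domain = domain_of(page_url)
    if not page_domain:
        return False
    return any(domain_of(url) == page_domain for url in result_urls[:n])

# Hypothetical search results for the signature of an eBay sign-in page.
result_urls = [
    "https://www.ebay.com/",
    "https://signin.ebay.com/",
    "https://en.wikipedia.org/wiki/EBay",
]

print(matches_top_results("https://www.ebay.com/signin", result_urls))
print(matches_top_results("http://ebay-secure-login.example.net/signin", result_urls))
```

A page whose domain appears among the top results is treated as a match; a look-alike page hosted on an unrelated domain is not.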

Report Generation
• If a page is legitimate, it returns “legitimate”.
• If a page is phishing, it returns “phishing”.

APPLICATIONS
• Used to detect fraudulent websites and emails.
• Protects users from giving up personal information such as credit card numbers, bank details and account passwords.
• Used to detect suspicious links in email.

CONCLUSION
• Content-based approach for detecting phishing websites.
• User-friendly interface.
• Anti-phishing website that protects users from giving away their personal information.

[Figures: slide 7, REAL EBAY WEBPAGE; slide 8, FAKE EBAY WEBPAGE (screenshots not reproduced)]
