Web Mining
Resemblance and containment
Fingerprinting
Introduction
• Web mining is the application of data mining techniques to find information patterns in web data.
• Web mining helps improve the power of web search engines by identifying and classifying web pages and web documents.
• Web mining is a branch of data mining concentrating on the World Wide Web as the primary data source, including all of its components, from web content and server logs to everything in between.
• The data extracted from the web can include a range of content, such as textual information, structured data like
lists and tables, as well as multimedia elements like images, videos, and audio.
Introduction
• The evolution of the web has been accompanied by the development of techniques to manage the massive amount of data it generates.
• As of January 2024, there are approximately 1.98 billion websites on the internet.
• Early web directories grouped similar web pages together and relied on human reviewers to manually tag pages based on keywords.
Introduction
• Web mining is commonly divided into three categories:
• Content Mining
• Structure Mining
• Usage Mining
Web Content Mining
1. Focus
• Web content mining is concerned with extracting relevant knowledge from the
contents of individual web pages.
2. Exclusion
• It does not consider how other web pages link to or interact with the given page.
3. Basic Approach
4. Problems
• Scarcity
Occurs when queries result in very few or no search results.
• Abundance
Occurs when queries generate an overwhelming number of search results.
5. Root Cause
• Both problems are due to the nature of web data, which is typically in semi-structured HTML format and scattered across multiple pages.
Web document clustering
Purpose
• Web document clustering manages large numbers of documents using keywords.
Core Idea
• The aim is to create meaningful clusters of web pages rather than just providing a
ranked list of pages.
Techniques
• Clustering methods like K-means and agglomerative clustering are used to form
these clusters.
Input Attributes
• Clustering typically uses a vector of words and their frequencies from each web
page as input.
Limitations
• These clustering techniques often do not produce satisfactory results, because they rely on keyword frequencies alone and ignore the order in which words appear (a small sketch of the frequency-based approach follows below).
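• A minimal sketch of the frequency-based approach described above, using scikit-learn's CountVectorizer and KMeans; the sample page texts and the choice of two clusters are illustrative assumptions, not part of the original material.

# Minimal sketch: cluster toy "web pages" by word-frequency vectors with K-means.
# The sample texts and the choice of 2 clusters are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

pages = [
    "machine learning tutorials and courses",      # hypothetical page text
    "deep learning and machine learning guides",
    "cheap flights hotel deals travel offers",
    "travel guides flights and hotel booking",
]

X = CountVectorizer().fit_transform(pages)          # word-frequency vector per page
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                                       # e.g. [0 0 1 1]: similar pages group together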
Web Content Mining - Suffix Tree Clustering (STC)
• Suffix Tree Clustering (STC) performs clustering based on phrases rather than on keyword frequency.
• Suffix Tree Clustering (STC) is a method used to cluster web documents based on their content using
suffix trees.
• Suffix Tree: A suffix tree is a data structure that represents all the suffixes of a given text,
allowing efficient string operations.
• STC leverages suffix trees to identify and group similar patterns or substrings across different web
documents.
Web Content Mining - Suffix Tree Clustering (STC)
Documents that share the same root-to-leaf sequence of words in the suffix tree (i.e., a common phrase) are grouped into the same cluster.
Web Content Mining - Suffix Tree Clustering (STC)
STC considers the sequence of phrases in a document and thus tries to cluster documents in a more meaningful manner.
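• A full STC implementation builds a generalized suffix tree over all documents; the simplified sketch below only collects the word phrases shared between documents, which illustrates the phrase-based grouping idea. The sample documents and the minimum phrase length are assumptions.

# Simplified illustration of phrase-based grouping in the spirit of STC.
# A real STC builds a generalized suffix tree; here we simply collect the
# contiguous word phrases that documents share (base clusters).
from collections import defaultdict

docs = {
    "d1": "cat ate cheese",          # hypothetical documents
    "d2": "mouse ate cheese too",
    "d3": "cat ate mouse too",
}

def phrases(text, min_len=2):
    # All contiguous word sequences of length >= min_len.
    words = text.split()
    return {tuple(words[i:j])
            for i in range(len(words))
            for j in range(i + min_len, len(words) + 1)}

base_clusters = defaultdict(set)     # phrase -> documents containing it
for doc_id, text in docs.items():
    for p in phrases(text):
        base_clusters[p].add(doc_id)

for phrase, members in base_clusters.items():
    if len(members) > 1:             # phrases shared by 2+ documents form clusters
        print(" ".join(phrase), "->", sorted(members))
# e.g. "ate cheese" -> ['d1', 'd2'] and "cat ate" -> ['d1', 'd3']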
Web Content Mining - Resemblance and containment
Resemblance
• Measures how similar two documents are to each other.
• Range: 0 to 1.
Containment
• Measures whether one document is contained within another.
• Range: 0 to 1, where 0 indicates no containment and 1 indicates that one document is fully contained within the other.
Web Content Mining - Resemblance and containment
• Resemblance R(X, Y) is defined as:
R(X, Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|
where S(X) and S(Y) are the sets of shingles for documents X and Y respectively.
• So, resemblance is the number of shingles common to documents X and Y, divided by the total number of distinct shingles across both documents.
Web Content Mining - Resemblance and containment
• Containment C(X, Y) is defined as:
C(X, Y) = |S(X) ∩ S(Y)| / |S(X)|
where S(X) and S(Y) are the sets of shingles for documents X and Y respectively.
• So, containment is the number of shingles common to documents X and Y, divided by the number of shingles in the original document X.
Web Content Mining - Fingerprinting
• Fingerprinting is a technique used to compare pairs of documents for similarity.
• It works by dividing a document into contiguous, overlapping sequences of words (shingles).
• For instance, consider two documents with the respective content given below.
• Document 1: I love machine learning.
• Document 2: I love artificial intelligence.
• For the above two documents, consider every shingle of length two.
• Document 1 yields the shingles {"I love", "love machine", "machine learning"} and Document 2 yields {"I love", "love artificial", "artificial intelligence"}; only "I love" is common to both.
• Hence the resemblance is 1/5 (one shared shingle out of five distinct shingles) and the containment of Document 1 in Document 2 is 1/3.
• Shingling is a technique used to compare documents and identify similarities between them.
• Shingles are essentially overlapping sequences of words or characters extracted from a
document.
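• The sketch below computes length-two shingles for the two example documents above and then applies the resemblance and containment formulas given earlier; the helper function names are illustrative.

# Shingling plus resemblance/containment for the two example documents above.
def shingles(text, w=2):
    # Set of overlapping word sequences (shingles) of length w.
    words = text.lower().strip(".").split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(sx, sy):
    return len(sx & sy) / len(sx | sy)      # |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|

def containment(sx, sy):
    return len(sx & sy) / len(sx)           # |S(X) ∩ S(Y)| / |S(X)|

s1 = shingles("I love machine learning.")
s2 = shingles("I love artificial intelligence.")

print(resemblance(s1, s2))   # 1/5 = 0.2   (only "i love" is shared)
print(containment(s1, s2))   # 1/3 ≈ 0.33  (one of Document 1's three shingles)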
Web Usage Mining
• Web usage mining involves extracting useful information from log data regarding user interactions with web pages.
• The goal is to predict user behavior and make web pages more customer-centric, enhancing monetization and business strategies.
• Example Application:
• Ad Investment: If most visitors to a page come from Facebook rather than Twitter, investing more
in Facebook ads could be more profitable.
• Data Collection:
• Typical Data: Logs of user interactions, including page visits, timestamps, and referral sources.
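• As an illustration, a single interaction record from such a log might look like the following; the field names are assumptions, not a standard log format.

# Illustrative structure of one interaction log record (field names assumed).
log_record = {
    "user_id": "u123",                    # or a session/cookie identifier
    "page": "/products/laptops",          # page visited
    "timestamp": "2024-01-15T10:32:07Z",  # when the visit happened
    "referrer": "facebook.com",           # where the visitor came from
}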
Web Usage Mining
• In order to perform web usage mining, information such as page visits, timestamps, and referral sources is usually collected.
• Analysis Techniques:
• Association Mining:
• Identifies relationships between pages, such as discovering which pages are commonly visited
together.
• Clustering:
• Groups similar user behaviors to uncover patterns.
• Return Visits: Unlike in market basket analysis, users can return to pages, which complicates the direct application of transaction-based models.
• Analysis might reveal that visiting page A is often followed by visiting page B with high confidence.
• Pages can be restructured based on these associations, e.g., by merging content from page B into page
A to enhance user experience.
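• A minimal sketch of this kind of association analysis over page-visit sessions: it counts how often pages co-occur in a session and reports rules such as A → B with their confidence. The sessions and the confidence threshold are illustrative assumptions.

# Minimal association-analysis sketch over page-visit sessions.
# The sessions and the 0.7 confidence threshold are illustrative assumptions.
from collections import Counter
from itertools import permutations

sessions = [
    ["A", "B", "C"],       # each list = pages visited in one user session
    ["A", "B"],
    ["A", "C"],
    ["B", "C"],
    ["A", "B", "D"],
]

pair_counts = Counter()
page_counts = Counter()
for pages in sessions:
    unique = set(pages)                    # users may revisit a page; count it once
    page_counts.update(unique)
    pair_counts.update(permutations(unique, 2))

for (a, b), together in pair_counts.items():
    confidence = together / page_counts[a]     # confidence(a -> b) = support(a, b) / support(a)
    if confidence >= 0.7:
        print(f"{a} -> {b}  confidence={confidence:.2f}")
# e.g. "A -> B  confidence=0.75": most sessions that visit A also visit B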
Web Structure Mining
• The HITS algorithm (Hyperlink-Induced Topic Search, also known as hubs and authorities) analyzes the hyperlink structure of the web in order to rank web pages.
• It was developed during the early days of the web, when web page directories were prevalent.
• Hubs: Pages that link to many other pages. They are considered as directories or resource lists.
• Authorities: Pages that are linked to by many hubs. They are considered as authoritative sources on
specific topics.
• HITS is based on the assumption that web pages which act as directories are not themselves authorities on any information but act as hubs pointing to various web pages that may be authorities on the required information.
Web Structure Mining
A hyperlinked web structure
•H1, H2, and H3 are considered hubs. They act as directories or aggregators of information, linking out to
various pages. These hubs don't hold the information themselves but help direct users to pages that are
believed to have valuable content.
•A, B, C, and D are the web pages that are linked to by the hubs. These pages are considered authorities because they are linked to by multiple hubs. The more links a page receives from different hubs, the higher its authority.
• Step 1: Present the given web structure as an adjacency matrix in order to perform further calculations.
Let the required adjacency matrix be A
• Step 3: Assume the initial hub weight vector h to be all ones and calculate the authority weight vector as a = Aᵀ · h, i.e., by multiplying the transpose of matrix A with the initial hub weight vector.
Web Structure Mining
• Step 4: Calculate the updated hub weight vector as h = A · a, i.e., by multiplying the adjacency matrix A with the authority weight vector obtained in Step 3.
Web Structure Mining
• The web structure can then be updated with the computed hub and authority weights for each node.
Web Structure Mining
• Based on these weights, we can say that web page N4 has the highest authority for the given keyword, as it is linked to by the most high-ranking hubs.
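• The sketch below runs the hub/authority updates from the steps above (a = Aᵀ·h, then h = A·a) on an assumed adjacency matrix; the exact matrix from the figure is not reproduced, and normalization is added so the weights converge.

# Minimal HITS sketch following the steps above: a = A^T h, then h = A a.
# The adjacency matrix is an assumed example (rows = hubs H1..H3,
# columns = pages A..D); it is not the exact matrix from the figure.
import numpy as np

A = np.array([
    [1, 1, 0, 1],   # H1 links to A, B, D
    [0, 1, 1, 1],   # H2 links to B, C, D
    [1, 0, 1, 1],   # H3 links to A, C, D
], dtype=float)

h = np.ones(A.shape[0])          # Step 3: initial hub weights are all 1

for _ in range(20):              # repeat Steps 3-4 until the weights settle
    a = A.T @ h                  # authority weights
    h = A @ a                    # updated hub weights
    a /= np.linalg.norm(a)       # normalize so the values stay bounded
    h /= np.linalg.norm(h)

print("authority weights (A, B, C, D):", np.round(a, 3))
print("hub weights (H1, H2, H3):", np.round(h, 3))
# Page D gets the highest authority here because all three assumed hubs link to it.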
• Over the years the Internet has become increasingly sophisticated and so has the World Wide Web
with it.