Web Mining
Resemblance and containment
Fingerprinting
Introduction
• Web mining is the application of data mining techniques to find information patterns in web data.
• Web mining helps improve the power of web search engines by identifying and classifying web pages and web documents.
• Web mining is a branch of data mining concentrating on the World Wide Web as the primary data source, including all of its components, from web content and server logs to everything in between.
• The data extracted from the web can include a range of content, such as textual information, structured data like
lists and tables, as well as multimedia elements like images, videos, and audio.
Introduction
• The evolution of the web has been accompanied by the development of techniques to manage the massive amount of data it generates.
• As of January 2024, there are approximately 1.98 billion websites on the internet.
• Early web directories grouped similar web pages together and relied on human reviewers to manually tag pages based on keywords.
Introduction
• Web mining is commonly divided into three categories:
• Content Mining
• Structure Mining
• Usage Mining
Web Content Mining
1. Focus
• Web content mining is concerned with extracting relevant knowledge from the
contents of individual web pages.
2. Exclusion
• It does not consider how other web pages link to or interact with the given page.
3. Basic Approach
4. Problems
• Scarcity
Occurs when queries result in very few or no search results.
• Abundance
Occurs when queries generate an overwhelming number of search results.
5. Root Cause
• Both problems are due to the nature of web data, which is typically in semi-structured HTML format and scattered across multiple pages.
Web document clustering
Purpose
• Web document clustering manages large numbers of documents using keywords.
Core Idea
• The aim is to create meaningful clusters of web pages rather than just providing a
ranked list of pages.
Techniques
• Clustering methods like K-means and agglomerative clustering are used to form
these clusters.
Input Attributes
• Clustering typically uses a vector of words and their frequencies from each web
page as input.
Limitations
• These clustering techniques often do not produce satisfactory results, because they rely on keyword frequencies alone and ignore the order in which words appear (a small sketch of the frequency-based approach follows below).
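• A minimal sketch of the frequency-based approach described above, using scikit-learn's CountVectorizer and KMeans; the sample page texts and the choice of two clusters are illustrative assumptions, not part of the original material.

# Minimal sketch: cluster toy "web pages" by word-frequency vectors with K-means.
# The sample texts and the choice of 2 clusters are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

pages = [
    "machine learning tutorials and courses",      # hypothetical page text
    "deep learning and machine learning guides",
    "cheap flights hotel deals travel offers",
    "travel guides flights and hotel booking",
]

X = CountVectorizer().fit_transform(pages)          # word-frequency vector per page
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                                       # e.g. [0 0 1 1]: similar pages group together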
Web Content Mining - Suffix Tree Clustering (STC)
• Suffix Tree Clustering (STC) performs clustering based on phrases rather than on keyword frequency.
• Suffix Tree Clustering (STC) is a method used to cluster web documents based on their content using
suffix trees.
• Suffix Tree: A suffix tree is a data structure that represents all the suffixes of a given text,
allowing efficient string operations.
• STC leverages suffix trees to identify and group similar patterns or substrings across different web
documents.
Web Content Mining - Suffix Tree Clustering (STC)
Documents that share the same root-to-leaf sequence of words in the suffix tree (i.e., a common phrase) are grouped into the same cluster.
Web Content Mining - Suffix Tree Clustering (STC)
STC considers the sequence of phrases in a document and thus tries to cluster documents in a more meaningful manner.
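• A full STC implementation builds a generalized suffix tree over all documents; the simplified sketch below only collects the word phrases shared between documents, which illustrates the phrase-based grouping idea. The sample documents and the minimum phrase length are assumptions.

# Simplified illustration of phrase-based grouping in the spirit of STC.
# A real STC builds a generalized suffix tree; here we simply collect the
# contiguous word phrases that documents share (base clusters).
from collections import defaultdict

docs = {
    "d1": "cat ate cheese",          # hypothetical documents
    "d2": "mouse ate cheese too",
    "d3": "cat ate mouse too",
}

def phrases(text, min_len=2):
    # All contiguous word sequences of length >= min_len.
    words = text.split()
    return {tuple(words[i:j])
            for i in range(len(words))
            for j in range(i + min_len, len(words) + 1)}

base_clusters = defaultdict(set)     # phrase -> documents containing it
for doc_id, text in docs.items():
    for p in phrases(text):
        base_clusters[p].add(doc_id)

for phrase, members in base_clusters.items():
    if len(members) > 1:             # phrases shared by 2+ documents form clusters
        print(" ".join(phrase), "->", sorted(members))
# e.g. "ate cheese" -> ['d1', 'd2'] and "cat ate" -> ['d1', 'd3']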
Web Content Mining - Resemblance and containment
Resemblance
• Measures how similar two documents are to each other.
• Range: 0 to 1.
Containment
• Measures whether one document is contained within another.
• Range: 0 to 1, where 0 indicates no containment and 1 indicates that one document is fully contained within the other.
Web Content Mining - Resemblance and containment
• Resemblance R(X, Y) is defined as:
R(X, Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|
where S(X) and S(Y) are the sets of shingles for documents X and Y respectively.
• So, resemblance is the number of shingles common to documents X and Y, divided by the total number of distinct shingles across both documents.
Web Content Mining - Resemblance and containment
• Containment C(X, Y) is defined as:
C(X, Y) = |S(X) ∩ S(Y)| / |S(X)|
where S(X) and S(Y) are the sets of shingles for documents X and Y respectively.
• So, containment is the number of shingles common to documents X and Y, divided by the number of shingles in the original document X.
Web Content Mining - Fingerprinting
• Fingerprinting is a technique used to compare pairs of documents for similarity.
• It works by dividing a document into contiguous, overlapping sequences of words (shingles).
• For instance, consider two documents with the respective content given below.
• Document 1: I love machine learning.
• Document 2: I love artificial intelligence.
• For the above two documents, consider every shingle of length two.
• Document 1 yields the shingles {"I love", "love machine", "machine learning"} and Document 2 yields {"I love", "love artificial", "artificial intelligence"}; only "I love" is common to both.
• Hence the resemblance is 1/5 (one shared shingle out of five distinct shingles) and the containment of Document 1 in Document 2 is 1/3.
• Shingling is a technique used to compare documents and identify similarities between them.
• Shingles are essentially overlapping sequences of words or characters extracted from a
document.
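• The sketch below computes length-two shingles for the two example documents above and then applies the resemblance and containment formulas given earlier; the helper function names are illustrative.

# Shingling plus resemblance/containment for the two example documents above.
def shingles(text, w=2):
    # Set of overlapping word sequences (shingles) of length w.
    words = text.lower().strip(".").split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(sx, sy):
    return len(sx & sy) / len(sx | sy)      # |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|

def containment(sx, sy):
    return len(sx & sy) / len(sx)           # |S(X) ∩ S(Y)| / |S(X)|

s1 = shingles("I love machine learning.")
s2 = shingles("I love artificial intelligence.")

print(resemblance(s1, s2))   # 1/5 = 0.2   (only "i love" is shared)
print(containment(s1, s2))   # 1/3 ≈ 0.33  (one of Document 1's three shingles)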
Web Usage Mining
• Web usage mining involves extracting useful information from log data regarding user interactions with web pages.
• The goal is to predict user behavior and make web pages more customer-centric, enhancing monetization and business strategies.
• Example Application:
• Ad Investment: If most visitors to a page come from Facebook rather than Twitter, investing more
in Facebook ads could be more profitable.
• Data Collection:
• Typical Data: Logs of user interactions, including page visits, timestamps, and referral sources.
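• As an illustration, a single interaction record from such a log might look like the following; the field names are assumptions, not a standard log format.

# Illustrative structure of one interaction log record (field names assumed).
log_record = {
    "user_id": "u123",                    # or a session/cookie identifier
    "page": "/products/laptops",          # page visited
    "timestamp": "2024-01-15T10:32:07Z",  # when the visit happened
    "referrer": "facebook.com",           # where the visitor came from
}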
Web Usage Mining
• In order to perform web usage mining, information such as page visits, timestamps, and referral sources is usually collected.
• Analysis Techniques:
• Association Mining:
• Identifies relationships between pages, such as discovering which pages are commonly visited
together.
• Clustering:
• Groups similar user behaviors to uncover patterns.
• Return Visits: Unlike in market basket analysis, users can return to pages, which complicates the direct application of transaction-based models.
• Analysis might reveal that visiting page A is often followed by visiting page B with high confidence.
• Pages can be restructured based on these associations, e.g., by merging content from page B into page
A to enhance user experience.
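• A minimal sketch of this kind of association analysis over page-visit sessions: it counts how often pages co-occur in a session and reports rules such as A → B with their confidence. The sessions and the confidence threshold are illustrative assumptions.

# Minimal association-analysis sketch over page-visit sessions.
# The sessions and the 0.7 confidence threshold are illustrative assumptions.
from collections import Counter
from itertools import permutations

sessions = [
    ["A", "B", "C"],       # each list = pages visited in one user session
    ["A", "B"],
    ["A", "C"],
    ["B", "C"],
    ["A", "B", "D"],
]

pair_counts = Counter()
page_counts = Counter()
for pages in sessions:
    unique = set(pages)                    # users may revisit a page; count it once
    page_counts.update(unique)
    pair_counts.update(permutations(unique, 2))

for (a, b), together in pair_counts.items():
    confidence = together / page_counts[a]     # confidence(a -> b) = support(a, b) / support(a)
    if confidence >= 0.7:
        print(f"{a} -> {b}  confidence={confidence:.2f}")
# e.g. "A -> B  confidence=0.75": most sessions that visit A also visit B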
Web Structure Mining
• The HITS algorithm (Hyperlink-Induced Topic Search, also known as hubs and authorities) analyzes the hyperlink structure of the web in order to rank web pages.
• It was developed during the early days of the web, when web page directories were prevalent.
• Hubs: Pages that link to many other pages. They are considered as directories or resource lists.
• Authorities: Pages that are linked to by many hubs. They are considered as authoritative sources on
specific topics.
• HITS is based on the assumption that web pages which act as directories are not themselves authorities on any information but act as hubs pointing to various web pages that may be authorities on the required information.
Web Structure Mining
A hyperlinked web structure
•H1, H2, and H3 are considered hubs. They act as directories or aggregators of information, linking out to
various pages. These hubs don't hold the information themselves but help direct users to pages that are
believed to have valuable content.
•A, B, C, and D are the web pages that are linked to by the hubs. These pages are considered authorities because they are linked to by multiple hubs. The more links a page receives from different hubs, the higher its authority.
• Step 1: Present the given web structure as an adjacency matrix in order to perform further calculations.
Let the required adjacency matrix be A
• Step 3: Assume the initial hub weight vector h to be all ones and calculate the authority weight vector as a = Aᵀ · h, i.e., by multiplying the transpose of matrix A with the initial hub weight vector.
Web Structure Mining
• Step 4: Calculate the updated hub weight vector as h = A · a, i.e., by multiplying the adjacency matrix A with the authority weight vector obtained in Step 3.
Web Structure Mining
• The web structure can then be updated with the computed hub and authority weights for each node.
Web Structure Mining
• Based on these weights, we can say that web page N4 has the highest authority for the given keyword, as it is linked to by the most high-ranking hubs.
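• The sketch below runs the hub/authority updates from the steps above (a = Aᵀ·h, then h = A·a) on an assumed adjacency matrix; the exact matrix from the figure is not reproduced, and normalization is added so the weights converge.

# Minimal HITS sketch following the steps above: a = A^T h, then h = A a.
# The adjacency matrix is an assumed example (rows = hubs H1..H3,
# columns = pages A..D); it is not the exact matrix from the figure.
import numpy as np

A = np.array([
    [1, 1, 0, 1],   # H1 links to A, B, D
    [0, 1, 1, 1],   # H2 links to B, C, D
    [1, 0, 1, 1],   # H3 links to A, C, D
], dtype=float)

h = np.ones(A.shape[0])          # Step 3: initial hub weights are all 1

for _ in range(20):              # repeat Steps 3-4 until the weights settle
    a = A.T @ h                  # authority weights
    h = A @ a                    # updated hub weights
    a /= np.linalg.norm(a)       # normalize so the values stay bounded
    h /= np.linalg.norm(h)

print("authority weights (A, B, C, D):", np.round(a, 3))
print("hub weights (H1, H2, H3):", np.round(h, 3))
# Page D gets the highest authority here because all three assumed hubs link to it.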
• Over the years the Internet has become increasingly sophisticated and so has the World Wide Web
with it.