Web Mining

Uploaded by

cmptup2020

• Introduction
• Web Content Mining
  • Web document clustering
  • Suffix Tree Clustering (STC)
  • Resemblance and containment
  • Fingerprinting
• Web Usage Mining
• Web Structure Mining
  • Hyperlink Induced Topic Search (HITS) algorithm

Introduction

• Web mining is the application of data mining techniques to find information patterns in web data.

• Web mining helps improve the power of web search engines by identifying web pages and classifying
web documents.

• Web mining is very useful to e-commerce websites and e-services.

• Web mining is a branch of data mining that concentrates on the World Wide Web as the primary data source,
including all of its components, from web content and server logs to everything in between.

• The data extracted from the web can include a range of content, such as textual information, structured data like
lists and tables, as well as multimedia elements like images, videos, and audio.
• The points below trace the evolution of the web and the techniques developed to manage the massive
amount of data it generates.

1. Exponential Growth of Websites and Data

• Since the creation of the first web page by Tim Berners-Lee in 1991, the number of websites has
grown exponentially, reaching 1.8 billion by 2018.
• This growth has led to a massive increase in the amount of data available online.
• As of January 2024, there are approximately 1.98 billion websites on the internet.

2. Early Organization of Web Data


• Initially, web directories were used to organize and categorize websites.

• These directories grouped similar web pages together and relied on human reviewers to manually
tag pages based on keywords.

3. Development of Search Engines

• As the web expanded, search engines emerged to automate the process of finding relevant
information.
• Unlike directories, search engines could crawl the web and index pages, allowing for more efficient
retrieval of data.
4. Introduction of Web Mining
• Web mining refers to the application of data mining techniques and machine learning to analyze and
extract useful information from web data.
• This involves using methods to analyze website content, structure, and user behavior to improve
search engine rankings.
• Overall, the evolution from web directories to advanced search engines and web mining techniques
reflects the need to efficiently manage and extract value from the vast amounts of information available
on the internet.
• Web mining is divided into three parts:

• Web Content Mining

• Web Structure Mining

• Web Usage Mining
Web Content Mining

1. Focus

• Web content mining is concerned with extracting relevant knowledge from the
contents of individual web pages.

2. Exclusion

• It does not consider how other web pages link to or interact with the given page.

3. Basic Approach

• A common method involves analyzing the location and frequency of keywords.


4. Problems
• Scarcity
Occurs when queries result in very few or no search results.
• Abundance
Occurs when queries generate an overwhelming number of search results.

5. Root Cause
• Both problems are due to the nature of web data, which is typically in semi-
structured HTML format and scattered across multiple pages.
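
The basic approach of analyzing keyword location and frequency can be sketched in Python as follows. This is a minimal illustration, not a method taken from the slides: the function name `keyword_stats`, the choice of "title versus body" as the notion of location, and the sample page are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_stats(title, body, keywords):
    """For each keyword, report whether it is in the title, how often it
    occurs in the body, and the word index of its first occurrence."""
    title_words = set(tokenize(title))
    body_words = tokenize(body)
    counts = Counter(body_words)
    stats = {}
    for keyword in keywords:
        first = body_words.index(keyword) if keyword in counts else None
        stats[keyword] = {
            "in_title": keyword in title_words,
            "frequency": counts[keyword],
            "first_position": first,
        }
    return stats


# Hypothetical page used only for illustration.
stats = keyword_stats(
    title="Web Mining Basics",
    body="Web mining applies data mining to web data.",
    keywords=["mining", "data", "search"],
)
for keyword, info in stats.items():
    print(keyword, info)
```

A keyword that appears in the title and early in the body would usually be weighted more heavily than one that appears once near the end, but the weighting itself is left out of this sketch.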
Web document clustering

• Web document clustering is an approach to managing a large number of
documents based on keywords.
• The core idea is to form meaningful clusters of web pages instead
of returning a list of web pages arranged by their rank.
• Cluster analysis techniques, namely K-means and agglomerative
clustering, can be used to achieve this goal.
• The input attribute set for these clustering techniques is generally a
vector of words and their frequencies in a given web page.
• However, such clustering techniques do not give adequate results.
Web Content Mining - Web document clustering

Purpose
• Web document clustering manages large numbers of documents using keywords.

Core Idea
• The aim is to create meaningful clusters of web pages rather than just providing a
ranked list of pages.

Techniques
• Clustering methods like K-means and agglomerative clustering are used to form
these clusters.

Input Attributes
• Clustering typically uses a vector of words and their frequencies from each web
page as input.

Limitations
• These clustering techniques often do not produce satisfactory results.
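
As a rough sketch of clustering on word-frequency vectors, the Python below builds a term-frequency vector per document and applies a simple single-link agglomerative procedure. The cosine similarity measure, the `threshold` stopping rule, and the four sample documents are assumptions made for illustration; the slides do not specify them.

```python
from collections import Counter
from math import sqrt


def term_vector(text):
    """Vector of words and their frequencies for one document."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(u[word] * v[word] for word in u if word in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def agglomerative(docs, threshold):
    """Single-link agglomerative clustering: repeatedly merge the two most
    similar clusters until the best similarity drops below the threshold."""
    vectors = [term_vector(d) for d in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[a] for j in clusters[b])
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        if best[0] < threshold:
            break
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters


# Hypothetical documents used only for illustration.
docs = [
    "cheap flights to paris",
    "cheap flights to rome",
    "python list sort",
    "python sort list of tuples",
]
clusters = agglomerative(docs, threshold=0.5)
print(clusters)
```

Because the vectors only record how often each word occurs, two pages that use the same words in a different order look identical to this procedure.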
Web Content Mining - Suffix Tree Clustering (STC)

• Suffix Tree Clustering (STC) clusters web documents based on the phrases they contain rather than on the
frequency of keywords.

• Suffix tree: a data structure that represents all the suffixes of a given text,
allowing efficient string operations.

• STC uses suffix trees to identify and group similar patterns or substrings across different web
documents.

STC works as follows:

Step 1: Obtain text from the web page.
        For every sentence in the text, filter out common words and punctuation.
        Convert the remaining words into their root form.

Step 2: Build a tree from the list of words obtained in step 1.

Step 3: Compare the trees obtained from the various documents.
        Trees that have the same root-to-leaf sequence of words are grouped into
        the same cluster.

STC considers the sequence of phrases in the document and thus tries to cluster the documents in a
more meaningful manner.
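
The three STC steps can be sketched in Python under some simplifying assumptions. The code uses a plain word-level suffix trie built from nested dictionaries rather than a compact suffix tree, a single shared trie in place of comparing one tree per document, and no stemming (the root-form conversion of step 1 is skipped). The `STOPWORDS` set and the sample documents are hypothetical.

```python
STOPWORDS = {"a", "an", "the", "is", "of", "to", "and", "in"}  # assumed, tiny list


def clean(sentence):
    """Step 1: strip punctuation, lower-case, and drop common words."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    return [w for w in words if w and w not in STOPWORDS]


def new_node():
    return {"docs": set(), "children": {}}


def add_document(root, doc_id, text):
    """Step 2: insert every word-level suffix of every sentence into the trie,
    recording which documents pass through each node."""
    for sentence in text.split("."):
        words = clean(sentence)
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node["children"].setdefault(word, new_node())
                node["docs"].add(doc_id)


def base_clusters(node, phrase=(), min_docs=2):
    """Step 3: every phrase (root-to-node path) shared by at least
    min_docs documents defines a candidate cluster."""
    found = {}
    for word, child in node["children"].items():
        path = phrase + (word,)
        if len(child["docs"]) >= min_docs:
            found[path] = set(child["docs"])
            found.update(base_clusters(child, path, min_docs))
    return found


# Hypothetical documents used only for illustration.
documents = {
    1: "The cat ate cheese.",
    2: "The mouse ate cheese too.",
    3: "Dogs chase cats.",
}
root = new_node()
for doc_id, text in documents.items():
    add_document(root, doc_id, text)

clusters = base_clusters(root)
for phrase, doc_ids in clusters.items():
    print(" ".join(phrase), sorted(doc_ids))
```

Documents 1 and 2 end up grouped because they share the phrase "ate cheese". Merging overlapping candidate clusters into final clusters is not shown here.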
Web Content Mining - Resemblance and containment

• Resemblance and containment are techniques used to compare documents based on their content similarity.

• They are used to improve query results by removing duplicate or nearly identical web pages.


Resemblance
• Measures how similar two documents are to each other.
• Range: 0 to 1.
  • 1: the documents are identical.
  • Close to 1: high similarity.
  • Close to 0: low similarity.

Containment
• Measures whether one document is contained within another.
• Range: 0 to 1.
  • 1: document X is completely contained within document Y.
  • 0: no containment.

• To define resemblance and containment mathematically, the concept of shingles is used.
• The document is divided into contiguous sequences of words of length L.
• These sequences are called shingles.
• For two given documents X and Y, the resemblance R(X, Y) is defined as:

$$R(X, Y) = \frac{|S(X) \cap S(Y)|}{|S(X) \cup S(Y)|}$$

where S(X) and S(Y) are the sets of shingles for documents X and Y respectively.

• So, resemblance is equal to the number of shingles that are common to the two documents X and Y,
divided by the total number of unique shingles in both documents.

• Containment is equal to the number of shingles that are common to the two documents
X and Y, divided by the total number of shingles in the original document X.
• Containment C(X, Y) is defined as:

$$C(X, Y) = \frac{|S(X) \cap S(Y)|}{|S(X)|}$$

where S(X) and S(Y) are the sets of shingles for documents X and Y respectively.
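
The two definitions can be checked with a short Python sketch. The function names, the shingle length of three, and the two sample documents are assumptions chosen for illustration; the example is picked so that X is fully contained in Y while the two documents are far from identical.

```python
def shingle_set(text, length):
    """Set of contiguous word sequences (shingles) of the given length."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w]
    return {tuple(words[i:i + length]) for i in range(len(words) - length + 1)}


def resemblance(x, y, length):
    """R(X, Y): common shingles divided by unique shingles in both documents."""
    sx, sy = shingle_set(x, length), shingle_set(y, length)
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0


def containment(x, y, length):
    """C(X, Y): common shingles divided by the shingles of document X."""
    sx, sy = shingle_set(x, length), shingle_set(y, length)
    return len(sx & sy) / len(sx) if sx else 0.0


# Hypothetical documents used only for illustration.
doc_x = "the quick brown fox jumps"
doc_y = "the quick brown fox jumps over the lazy dog"

r = resemblance(doc_x, doc_y, length=3)
c = containment(doc_x, doc_y, length=3)
print(f"R(X, Y) = {r:.3f}, C(X, Y) = {c:.3f}")
```

Here X has 3 shingles, Y has 7, and all 3 of X's shingles also occur in Y, so C(X, Y) = 3/3 = 1 while R(X, Y) = 3/7.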
Web Content Mining - Fingerprinting

• Fingerprinting compares every pair of documents.
• It works by dividing each document into contiguous sequences of words (shingles) of every possible
length.
• Shingles are overlapping sequences of words or characters extracted from a
document; comparing them identifies similarities between documents.
• For instance, consider two documents with the content given below.
  • Document 1: I love machine learning.
  • Document 2: I love artificial intelligence.
• For these two documents, consider every sequence of shingle length two.
  • Document 1: "I love", "love machine", "machine learning"
  • Document 2: "I love", "love artificial", "artificial intelligence"
• Only 1 out of 3 sequences ("I love") matches.
• This match count is used to measure the similarity between the two documents.
• The method is very accurate, but it is very inefficient for documents with large
numbers of words.
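
The worked example above can be reproduced with a small Python sketch. The helper names are assumptions, and the code follows the slide's description literally (shingles of every possible length, every pair of documents compared) rather than any particular production fingerprinting scheme.

```python
from itertools import combinations


def tokenize(text):
    """Lower-case the text and strip surrounding punctuation from each word."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w]


def shingles(words, length):
    """Set of contiguous word sequences of one fixed length."""
    return {tuple(words[i:i + length]) for i in range(len(words) - length + 1)}


def all_shingles(words):
    """Shingles of every possible length, from 1 up to the whole document."""
    result = set()
    for length in range(1, len(words) + 1):
        result |= shingles(words, length)
    return result


documents = {
    "Document 1": "I love machine learning.",
    "Document 2": "I love artificial intelligence.",
}
tokens = {name: tokenize(text) for name, text in documents.items()}

# Compare every pair of documents on their length-two shingles.
for first, second in combinations(tokens, 2):
    common = shingles(tokens[first], 2) & shingles(tokens[second], 2)
    print(first, "vs", second, "->", sorted(common))

# Number of shingles needed when every possible length is used.
fingerprint_size = len(all_shingles(tokens["Document 1"]))
print("Shingles of every length for Document 1:", fingerprint_size)
```

A document of n distinct words has up to n(n + 1)/2 shingles when every length is used, and n documents require n(n - 1)/2 pairwise comparisons, which is why the approach becomes expensive for long documents and large collections.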
Web Usage Mining
• Web usage mining deals with extracting useful information from log data about users' interactions with web
pages.
• The primary goal is to analyze these interactions in order to enhance user experience and optimize business
strategies.
• By predicting user behavior, businesses can make their web pages more customer-centric and tailor their
marketing and monetization efforts to user needs and preferences.

• Example Application:
• Ad Investment: If most visitors to a page come from Facebook rather than Twitter, investing more
in Facebook ads could be more profitable.

• Data Collection:
• Typical Data: Logs of user interactions, including page visits, timestamps, and referral sources.
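
The ad investment example can be sketched in Python by counting referral sources per page. The record layout (`page`, `timestamp`, `referrer`) mirrors the typical data listed above, but the field names and the sample log entries are hypothetical.

```python
from collections import Counter

# Hypothetical log records used only for illustration.
log_records = [
    {"page": "/offers", "timestamp": "2024-01-05T10:00:00", "referrer": "facebook.com"},
    {"page": "/offers", "timestamp": "2024-01-05T10:02:00", "referrer": "twitter.com"},
    {"page": "/offers", "timestamp": "2024-01-05T10:07:00", "referrer": "facebook.com"},
    {"page": "/about", "timestamp": "2024-01-05T10:09:00", "referrer": "twitter.com"},
    {"page": "/offers", "timestamp": "2024-01-05T10:15:00", "referrer": "facebook.com"},
]


def referrer_counts(records, page):
    """Count how many visits to the given page came from each referral source."""
    return Counter(r["referrer"] for r in records if r["page"] == page)


counts = referrer_counts(log_records, "/offers")
top_source, top_visits = counts.most_common(1)[0]
print(counts)
print("Top referral source:", top_source, "with", top_visits, "visits")
```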
• To perform web usage mining, the following information is usually collected.

[Table: Important parameters for web usage mining]


• Analysis Techniques:
• Association Mining:
• Identifies relationships between pages, such as discovering which pages are commonly visited
together.
• Clustering:
• Groups similar user behaviors to uncover patterns.

• Return Visits: Unlike market basket analysis, users can return to pages, complicating the direct
application of transaction-based models.

• Analysis might reveal that visiting page A is often followed by visiting page B with high confidence.

• Pages can be restructured based on these associations, e.g., by merging content from page B into page
A to enhance user experience.
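
The "page A is often followed by page B" rule can be quantified as a confidence value. The Python below is a minimal sketch: the session data is hypothetical, and the decision to count a session once, using the first visit to A and any later visit to B, is an assumption made to cope with return visits.

```python
def followed_by_confidence(sessions, a, b):
    """Confidence of the rule 'a is followed by b': the share of sessions
    containing a in which b is visited at some point after the first a."""
    with_a = 0
    a_then_b = 0
    for pages in sessions:
        if a in pages:
            with_a += 1
            first_a = pages.index(a)
            if b in pages[first_a + 1:]:
                a_then_b += 1
    return a_then_b / with_a if with_a else 0.0


# Hypothetical sessions: each list is the ordered page sequence of one visit.
sessions = [
    ["A", "B", "C"],
    ["A", "C", "A", "B"],  # the user returns to A before reaching B
    ["B", "A"],            # B is visited before A, so it does not count
    ["A", "B"],
    ["C"],
]

confidence = followed_by_confidence(sessions, "A", "B")
print(f"confidence(A -> B) = {confidence:.2f}")
```

Four sessions contain A and three of them visit B afterwards, so the confidence is 0.75.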
Web Structure Mining

• Web structure mining obtains information from the hyperlinked structure of the web.
• Its goal is to understand the relationships between web pages through their links.

Role of Web Structure:
• Ranking: web structure helps in ranking web pages based on their importance and relevance.
• Authority identification: determines which pages are considered authoritative on specific topics.
• Hub identification: finds websites that link to many authoritative pages, acting as hubs in the web structure.

Two algorithms that use web structure:
• HITS algorithm: uses web structure in order to identify hubs and authorities.
• PageRank algorithm: uses web structure in order to rank web pages.

Hyperlink Induced Topic Search (HITS) algorithm

• The HITS algorithm (also known as hubs and authorities) analyzes the hyperlinked structure of
the web in order to rank web pages.

• It was developed by Jon Kleinberg in 1999, during the early days of the web when web page directories were
prevalent.

• HITS is based on the concepts of hubs and authorities.

• Hubs: Pages that link to many other pages. They are considered as directories or resource lists.

• Authorities: Pages that are linked to by many hubs. They are considered as authoritative sources on
specific topics.

• HITS is based on the assumption that web pages which act as directories are not themselves authorities on any
information, but act as hubs pointing to various web pages which may be authorities on the required
information.

[Figure: A hyperlinked web structure]

• H1, H2, and H3 are considered hubs. They act as directories or aggregators of information, linking out to
various pages. These hubs don't hold the information themselves but help direct users to pages that are
believed to have valuable content.
• A, B, C, and D are the web pages that are linked to by the hubs. These pages are considered authorities
because they are linked to by multiple hubs. The more links from different hubs a page receives, the higher its
authority.

• Good hubs are those that link to many authoritative pages.
• Good authorities are those that receive links from many different hubs.
• The HITS algorithm evaluates both the hubness and authority of pages to determine their relevance and
importance within a web structure.

• Consider an example in order to understand the working of the HITS algorithm.
• The figure shows a web page structure where each node represents a web page and the arrows show
the hyperlinks between the pages.

• Step 1: Represent the given web structure as an adjacency matrix in order to perform further calculations.
Let the required adjacency matrix be A.

[Table: Adjacency matrix representing web structure]

• Step 2: Prepare the transpose of matrix A.

• Step 3: Assume the initial hub weight vector to be a vector of ones and calculate the authority weight vector by
multiplying the transpose of matrix A with the initial hub weight vector.

• Step 4: Calculate the updated hub weight vector by multiplying the adjacency matrix A with the
authority weight vector obtained in step 3.
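
In matrix form, steps 3 and 4 can be written as below. The symbols $h_0$ (initial hub weight vector), $a$ (authority weight vector) and $h$ (updated hub weight vector) are notation introduced here for illustration, and the convention $A_{ij} = 1$ when page $i$ links to page $j$ is assumed.

$$h_0 = (1, 1, \dots, 1)^T, \qquad a = A^T h_0, \qquad h = A\,a$$

Element-wise, $a_j = \sum_i A_{ij}\, h_{0,i}$ and $h_i = \sum_j A_{ij}\, a_j$: a page's authority weight is the sum of the hub weights of the pages that link to it, and a page's hub weight is the sum of the authority weights of the pages it links to.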

• Based on the calculations above, the graph can be updated with the hub and authority weights of each page.

• This completes a single iteration of the HITS algorithm.


• To obtain more accurate results, steps 3 and 4 can be repeated to obtain updated
authority weight and hub weight vectors.
• From the above calculations, we can rank hubs and authorities and display authorities in search
results in order of decreasing authority weight.
• For instance, in the figure, web page N4 has the highest authority for some keyword because it is linked to by the
most high-ranking hubs.
• Over the years, the Internet has become increasingly sophisticated, and so has the World Wide Web
with it.
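
Steps 1 to 4 of the HITS walkthrough can be sketched in plain Python. The four-page graph below is hypothetical (the adjacency matrix from the original figure is not reproduced here), and the function names are assumptions. No normalization is applied, which matches the single iteration described above; practical implementations usually rescale both vectors after each iteration so the values do not grow without bound.

```python
def transpose(matrix):
    """Step 2: transpose of the adjacency matrix."""
    return [list(column) for column in zip(*matrix)]


def multiply(matrix, vector):
    """Matrix-vector product using plain lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def hits(adjacency, iterations=1):
    """Run HITS for a number of iterations, starting from hub weights of 1."""
    hub = [1] * len(adjacency)
    authority = [0] * len(adjacency)
    adjacency_t = transpose(adjacency)
    for _ in range(iterations):
        authority = multiply(adjacency_t, hub)   # step 3
        hub = multiply(adjacency, authority)     # step 4
    return hub, authority


# Step 1: hypothetical graph; A[i][j] = 1 means page i links to page j.
pages = ["N1", "N2", "N3", "N4"]
A = [
    [0, 1, 1, 1],  # N1 links to N2, N3, N4
    [0, 0, 0, 1],  # N2 links to N4
    [0, 0, 0, 1],  # N3 links to N4
    [0, 0, 0, 0],  # N4 links to nothing
]

hub, authority = hits(A, iterations=1)
ranking = sorted(pages, key=lambda p: authority[pages.index(p)], reverse=True)
print("hub weights:      ", dict(zip(pages, hub)))
print("authority weights:", dict(zip(pages, authority)))
print("authorities by rank:", ranking)
```

In this made-up graph N4 receives links from every other page, so it ends up with the highest authority weight, while N1, which links to the most pages, ends up as the strongest hub.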
