Data Mining Unit 5
• Introduction
• Web mining
• Web content mining
• Web structure mining
• Web usage mining
• Text mining
• Episode rule discovery for texts
• Hierarchy of categories
• Text clustering
Introduction
• Data mining: turning data into knowledge
• Web mining applies data mining
techniques to extract and uncover knowledge
from web documents and services.
• Web: A huge, widely-distributed, highly
heterogeneous, semi-structured,
hypertext/hypermedia, interconnected
information repository
The Web is a huge collection of documents plus:
• Hyper-link information
• Access and usage information
Web Mining
• Web mining is the application of data mining
techniques to extract knowledge
from web data such as web content, web structure,
and web usage data.
• It is the process of discovering useful and
previously unknown information from web data.
• Web data is :-
• Web content :- text, images, records, etc.
• Web structure :- hyperlinks, tags, etc.
• Web usage :- http logs, app server logs, etc.
Web Content Mining
• Web content mining is performed by extracting useful
information from the content of a web page or site.
• It includes the extraction of structured
data/information from web pages and the identification,
matching, and integration of semantically similar data.
• Web content may consist of text, images,
audio, video, etc. When the content is text, it is also
known as text mining.
• It uses Natural Language Processing and
Information Retrieval techniques to mine the data.
Web Structure Mining
• A typical Web graph consists of Web
pages as nodes and hyperlinks as edges connecting
two related pages.
• Web structure mining is the process of discovering
structure information from the web.
• This type of mining can be performed either at the
(intra-page) document level or the (inter-page)
hyperlink level.
• The research at the hyperlink level is also called
Hyperlink Analysis
Web Structure Terminology
• Web-graph : A directed graph that represents
the Web.
• Node : Each Web page represents a node of
the Web-graph.
• Link : Each hyperlink on the Web is a directed edge
of the Web-graph.
• In-degree : The number of distinct links that point
to a node.
• Out-degree : The number of distinct links
originating at a node that point to other nodes.
• Directed Path : It is a sequence of links, starting
from a node say r that can be followed to reach
another node say t.
• Shortest Path : The path with the shortest length
out of all the paths between nodes p and q.
• Diameter : The maximum shortest-path length
over all pairs of nodes p and q in the Web-graph,
i.e. the longest of all the shortest paths.
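The terms above can be illustrated on a tiny graph. The following is a minimal sketch, assuming a made-up four-page Web-graph stored as an adjacency list; the page names and the helper function names are hypothetical and chosen only for illustration. Unreachable pairs are skipped when computing the diameter, which is one common convention and an assumption here.

```python
from collections import deque

# Hypothetical Web-graph: each page maps to the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def out_degree(graph, node):
    """Number of distinct links originating at node."""
    return len(set(graph.get(node, [])))

def in_degree(graph, node):
    """Number of distinct pages that link to node."""
    return sum(1 for targets in graph.values() if node in targets)

def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the number of links on the
    shortest directed path, or None if goal is unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def diameter(graph):
    """Longest shortest path over all ordered pairs that are connected
    (assumption: unreachable pairs are ignored)."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    lengths = [
        shortest_path_length(graph, p, q)
        for p in nodes for q in nodes if p != q
    ]
    reachable = [d for d in lengths if d is not None]
    return max(reachable) if reachable else 0

print(in_degree(links, "C"), out_degree(links, "A"), diameter(links))
```

In this toy graph, page C has the highest in-degree, and the longest shortest path runs from D to B via C and A.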
Web Search
• There are two approaches:
• PageRank: for discovering the most important
pages on the Web (as used in Google)
• Hubs and authorities: a more detailed
evaluation of the importance of Web pages
Basic definition of importance:
• A page is important if important pages link to it
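The recursive definition above ("important if important pages link to it") is usually resolved by iterating until the scores settle. Below is a minimal sketch of that idea using power iteration on a made-up four-page graph. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the function name `page_rank` are all assumptions for illustration, not details taken from these slides.

```python
def page_rank(graph, damping=0.85, iterations=50):
    """Iteratively share each page's rank among the pages it links to."""
    nodes = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start with uniform rank

    for _ in range(iterations):
        # Every page keeps a small baseline share (assumed damping model).
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this page's rank evenly across its out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: each page maps to the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = page_rank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up with the highest score because three pages link to it, while D, which no page links to, keeps only the baseline share.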
Page Rank (1)
Examples:
60% of clients who accessed
/home/products/file1.html, followed the path
/home ==> /home/whatsnew ==> /home/products
==> /home/products/file1.html
(Olympics Web site) 30% of clients who accessed sport
Information Retrieval
Given: A source of textual documents
A user query (text-based)
Find: A ranked set of documents that are
relevant to the query
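One common way to produce such a ranked set is to score each document against the query with TF-IDF weights and cosine similarity. The sketch below is a minimal illustration under stated assumptions: the three documents, the whitespace tokenizer, the smoothed IDF formula, and the function name `rank_documents` are all hypothetical choices, not something prescribed by these slides.

```python
import math
from collections import Counter

def tokenize(text):
    """Very simple tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()

def rank_documents(documents, query):
    """Return (document index, score) pairs, best match first.
    Documents with a zero score are left out."""
    tokenized = [tokenize(d) for d in documents]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def idf(term):
        # Smoothed inverse document frequency (one of several variants).
        return math.log((1 + n) / (1 + df[term])) + 1.0

    def vectorize(tokens):
        return {t: c * idf(t) for t, c in Counter(tokens).items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query_vec = vectorize(tokenize(query))
    scored = [
        (i, cosine(query_vec, vectorize(tokens)))
        for i, tokens in enumerate(tokenized)
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [(i, s) for i, s in scored if s > 0]

# Hypothetical document collection and query.
docs = [
    "web mining extracts knowledge from web data",
    "text clustering groups similar documents",
    "page rank scores web pages by their links",
]
results = rank_documents(docs, "web mining")
for index, score in results:
    print(index, round(score, 3), docs[index])
```

Purely term-based matching like this would not connect a query for "buy" with a document that says "purchase".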
Intelligent Information Retrieval
Meaning of words
• Synonyms: “buy” / “purchase”
• Ambiguity: “bat” (baseball vs. mammal)
Find:
Sentences with relevant information
Extract the relevant information and
predetermined format
Clustering
Given:
A source of textual
documents
Similarity measure