Data Mining Unit 5

Unit 5 Web and Text Mining

• Introduction
• Web mining
• Web content mining
• Web structure mining
• Web usage mining
• Text mining
• Episode rule discovery for texts
• Hierarchy of categories
• Text clustering
Introduction
• Data mining: turn data into knowledge
• Web mining applies data mining techniques to
extract and uncover knowledge from web
documents and services.
• Web: a huge, widely distributed, highly
heterogeneous, semi-structured,
hypertext/hypermedia, interconnected
information repository
The Web is a huge collection of documents plus
• Hyperlink information
• Access and usage information
Web Mining
• Web mining is the application of data mining
techniques to extract knowledge from web data
such as Web content, Web structure,
and Web usage data.
• It is the process of discovering useful,
previously unknown information from web data.
• Web data includes:
• Web content: text, images, records, etc.
• Web structure: hyperlinks, tags, etc.
• Web usage: HTTP logs, application server logs, etc.
Web Content Mining
• Web content mining is performed by extracting useful
information from the content of a web page/site.
• It includes extraction of structured
data/information from web pages, and identification,
matching, and integration of semantically similar data.
• The web content may consist of text, images,
audio, video, etc. It is also known as text mining.
• It uses Natural Language Processing and
Information Retrieval techniques for mining the data
Web Structure Mining
• The structure of a typical Web graph consists of Web
pages as nodes and hyperlinks as edges connecting
two related pages.
• Web structure mining is the process of discovering
structure information from the web.
• This type of mining can be performed either at the
(intra-page) document level or at the (inter-page)
hyperlink level.
• Research at the hyperlink level is also called
Hyperlink Analysis
Web Structure Terminology
• Web graph: A directed graph that represents
the Web.
• Node: Each Web page is a node of
the Web graph.
• Link: Each hyperlink on the Web is a directed edge
of the Web graph.
• In-degree: The number of distinct links that point
to a node.
• Out-degree: The number of distinct links
originating at a node that point to other nodes.
• Directed path: A sequence of links, starting
from a node r, that can be followed to reach
another node t.
• Shortest path: The path with the shortest length
out of all the paths between nodes p and q.
• Diameter: The maximum of all the shortest
paths between pairs of nodes p and q, over all
pairs of nodes p and q in the Web graph.
Web Search
• There are two approaches:
• PageRank: for discovering the most important
pages on the Web (as used in Google)
• Hubs and authorities (HITS): a more detailed
evaluation of the importance of Web pages
Basic definition of importance:
• A page is important if important pages link to it
Page Rank (1)

• Simple solution: create a stochastic matrix of
the Web:
– Each page i corresponds to row i and column i
of the matrix.
– If page j has n successors (links), then the ij-th
cell of the matrix is 1/n if page i is one of
these n successors of page j, and 0 otherwise.
• The intuition behind this matrix:
initially, each page has 1 unit of importance. At each round,
each page shares the importance it has among its
successors and receives new importance from its
predecessors.
• The importance of each page reaches a limit after some
steps.
• That importance is also the probability that a Web
surfer, starting at a random page and following random
links from each page, will be at the page in question after
a long series of links.
HITS Algorithm
• Hyperlink-Induced Topic Search
Authorities
• Relevant pages of the highest quality on a
broad topic
Hubs
• Pages that link to a collection of authoritative
pages on a broad topic
The approach consists of two phases:
• It uses the query terms to collect a starting set of pages
(e.g., 200 pages) from an index-based search engine, the
root set of pages.
• The root set is expanded into a base set by including all
the pages that the root set pages link to, and all the
pages that link to a page in the root set, up to a desired
size cutoff, such as 2000-5000 pages.
• A weight-propagation phase is then initiated. This is an
iterative process that determines numerical estimates of
hub and authority weights
Web Usage Mining
• A Web site is a collection of inter-related files on one or more
Web servers.
• Web usage mining is the discovery of meaningful patterns from
data generated by client-server transactions on one or more Web sites.
Typical sources of data:
• Automatically generated data stored in server access logs,
referrer logs, agent logs, and client-side cookies.
• User profiles.
• Metadata: page attributes, content attributes, usage
data.
• Web servers, Web proxies, and client applications can
quite easily capture Web usage data.
• Web Server Log: a file created by the
server to record all the activities it performs.
• For example, when a user enters a URL into the browser's
address bar or requests a page by clicking on a link.
• For each page request, the web server records
information in its log such as the URL,
whether the request was successful, the user's IP
address, the time and date, etc.
Web Usage Mining – Three Phases
Path and Usage Pattern Discovery
Types of path/usage information:
 Most frequent paths traversed by users
 Entry and exit points
 Distribution of user session duration
 Examples:
 60% of clients who accessed
/home/products/file1.html followed the path
/home ==> /home/whatsnew ==> /home/products
==> /home/products/file1.html
 (Olympics Web site) 30% of clients who accessed sport-specific
pages started from the Sneakpeek page.
 65% of clients left the site after 4 or fewer references.
Search Engines for Web
Mining
The number of Internet hosts exceeded:
 1,000 in 1984
 10,000 in 1987
 100,000 in 1989
 1,000,000 in 1992
 10,000,000 in 1996
 100,000,000 in 2000
Search engine components

 Spider (a.k.a. crawler/robot) – builds the corpus
 Collects web pages recursively:
• For each known URL, fetch the page, parse it, and extract new URLs
• Repeat
 Additional pages come from direct submissions and other sources
 The indexer – creates inverted indexes
 Various policies with respect to which words are indexed, capitalization,
support for Unicode, stemming, support for phrases, etc.
 Query processor – serves query results
 Front end – query reformulation, word stemming,
capitalization, optimization of Booleans, etc.
 Back end – finds matching documents and ranks them
Web Search Products and Services
 AltaVista
 DB2 Text Extender
 Excite
 Fulcrum
 Glimpse (academic)
 Google!
 Infoseek Internet
 Infoseek Intranet
 Inktomi (HotBot)
 Lycos
 PLS
 SMART (academic)
 Oracle Text Extender
 Verity
 Yahoo!
Three examples of search strategies
 Rank web pages based on popularity
 Rank web pages based on word frequency
 Match query to an expert database
All the major search engines use a mixed
strategy in ranking web pages and
responding to queries
Text Mining
• The objective of Text Mining is to exploit
information contained in textual documents in
various ways, including discovery of patterns
and trends in data, associations among entities,
predictive rules, etc.
The results can be important both for:
• the analysis of the collection, and
• providing intelligent navigation and browsing
methods.
• Data mining in text: find something useful and
surprising from a text collection.
Types of text mining
1. Keyword (or term) based association analysis
2. Automatic document (topic) classification
3. Similarity detection
• cluster documents by a common author
• cluster documents containing information from a
common source
4. Sequence analysis: predicting a recurring event,
discovering trends
5. Anomaly detection: find information that
violates usual patterns
6. Discovery of frequent phrases
7. Text segmentation (into logical chunks)
8. Event detection and tracking
Text Mining vs. Information Retrieval

Information Retrieval
Given: a source of textual documents and
a user query (text based)
Find: a ranked set of documents that are
relevant to the query
Intelligent Information Retrieval
 Meaning of words
 Synonyms: "buy" / "purchase"
 Ambiguity: "bat" (baseball vs. mammal)
 Order of words in the query
 "hot dog stand in the amusement park" vs.
"hot amusement stand in the dog park"
 User dependency for the data
 Direct feedback
 Indirect feedback
 Authority of the source
 IBM is more likely to be an authoritative source than my
distant second cousin
Intelligent Web Search
Combine the intelligent IR tools:
 Meaning of words
 Order of words in the query
 User dependency for the data
 Authority of the source
with the unique web features:
 Retrieve hyperlink information
 Utilize hyperlinks as input
Information Extraction
Given:
 A source of textual documents
 A well-defined, limited query (text based)
Find:
 Sentences with relevant information
 Extract the relevant information and
ignore non-relevant information (important!)
 Link related information and output it in a
predetermined format
Clustering
Given:
 A source of textual documents
 A similarity measure
• e.g., how many words are common
in these documents
Find:
• Several clusters of documents
that are relevant to each other