Week 1


WEB MINING RESEARCH: A SURVEY

Raymond Kosala and Hendrik Blockeel


ACM SIGKDD, July 2000

Intro to Web Mining


Outline
 Introduction
 Web Mining
 Web Content Mining
 Web Structure Mining
 Web Usage Mining
 Conclusion & Exam Questions
Introduction
 Why do we need it?
 What is it?
 How is it different from classical data
mining?
 How big is the Web?
 What are the problems?
 What is the role of web mining?
 What are the subtasks of web mining?
 What is the web mining taxonomy?
Why do we need Web Mining?
 Explosive growth in the amount of content on
the Internet
 Web search engines return thousands of
results, which makes them difficult to browse
 Online repositories are growing rapidly
 Using web mining, web documents can easily be
BROWSED, ORGANISED and CATALOGED
with minimal human intervention
What is the Purpose of Web
Mining?

Discovering useful information from the
World-Wide Web and its usage patterns
Background
 Huge amount of information online
 WWW is a fertile area for data mining
 Crossroads of research from other fields
such as
databases, IR, AI, NLP, etc.
 Data is distributed over Internet
 Presence of undesired data along with
relevant information
 Data from various sources must be gathered
into a local repository for further analysis
How does it differ from “classical”
Data Mining?
 The web is not a relation
 Textual information and linkage structure
 Usage data is huge and growing rapidly
 Google’s usage logs are bigger than their web crawl
 Data generated per day is comparable to largest
conventional data warehouses
 Ability to react in real-time to usage patterns
 No human in the loop
How big is the Web ?
 Number of pages
 Technically, infinite
 Because of dynamically generated content
 Lots of duplication (30-40%)
 Best estimate of “unique” static HTML
pages comes from search engine claims
 Google = 8 billion, Yahoo = 20 billion
 Lots of marketing hype
 76,184,000 web sites (Feb 2006), Netcraft survey:
http://news.netcraft.com/archives/web_server_survey.html
The web as a graph
 Pages = nodes, hyperlinks = edges
 Ignore content
 Directed graph
 High linkage
 10-20 links/page on average
 Power-law degree distribution
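As a minimal sketch of this graph view, the snippet below stores a handful of pages as a directed adjacency list and counts in-degrees and out-degrees. The page names and links are invented for illustration; a real crawl would have millions of nodes.

```python
from collections import Counter

# Hypothetical toy crawl: page -> pages it links to (directed edges).
links = {
    "home": ["news", "shop", "about"],
    "news": ["home", "shop"],
    "shop": ["home"],
    "about": ["home"],
}

# Out-degree: number of hyperlinks leaving each page.
out_degree = {page: len(targets) for page, targets in links.items()}

# In-degree: number of hyperlinks pointing at each page.
in_degree = Counter(target for targets in links.values() for target in targets)

# Degree distribution: how many pages share each in-degree value.
in_degree_histogram = Counter(in_degree[page] for page in links)

print(out_degree)
print(dict(in_degree))
print(dict(in_degree_histogram))
```

On a real crawl, plotting the histogram on log-log axes is the usual way to check for the power-law shape mentioned above.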
Structure of Web graph
 Let’s take a closer look at structure
 Broder et al (2000) studied a crawl of 200M
pages and other smaller crawls
 Bow-tie structure
 Not a “small world”
Bow-tie Structure
What can the graph tell us?
 Distinguish “important” pages from
unimportant ones
 Page rank
 Discover communities of related pages
 Hubs and Authorities
 Detect web spam
 Trust rank
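To make the idea of ranking "important" pages concrete, here is a simplified PageRank sketch using power iteration. The damping factor of 0.85, the iteration count, and the toy graph are assumptions chosen for illustration; this is not a description of any search engine's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Assumes every link target also appears as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page graph.
toy_links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page "c" collects links from three pages and ends up with the highest score, while "d", which nobody links to, keeps only the baseline share.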
Searching the Web

[Figure: the Web, content aggregators, content consumers]
Ads vs. search results
 Search advertising is the revenue model
 Multi-billion-dollar industry
 Advertisers pay for clicks on their ads
 Interesting problems
 What ads to show for a search?
 If I’m an advertiser, which search terms
should I bid on and how much to bid?
Two Approaches to
Analyzing Data
 Machine Learning approach
 Emphasizes sophisticated algorithms e.g.,
Support Vector Machines
 Data sets tend to be small, fit in memory
 Data Mining approach
 Emphasizes big data sets (e.g., in the
terabytes)
 Data cannot even fit on a single disk!
 Necessarily leads to simpler algorithms
Web Mining: Problems
 The “abundance” problem
 Limited coverage of the Web
 Limited query interface based on
keyword-oriented search
 Limited customization to individual users
 Dynamic and semi-structured content
Role of web mining
 Finding relevant information
 Creating new knowledge out of the information available on the Web
 Personalization
 Learning about consumers or individual
users.
Finding relevant
information
 People browse or use the search service
when they want to find specific
information
 Use simple keyword query
 Based on similarity
 Low precision, because many results are irrelevant
 Low recall, due to the inability to index all the information on the Web
Creating new knowledge
 Related to the problem of finding relevant information
 Extract potentially useful knowledge (data
mining)
 Use the web as a knowledge base for
decision making
Personalization
 Problem associated with type and
presentation of information
 People differ in the contents and
presentation they prefer
Learning about consumers
 Related to the personalization problem
 Knowing what the customers do and
want
 Sub-problem: mass customization
 Sub-problems: effective web site design and
management, and marketing
Other Approaches
Web mining is NOT the only approach
 Database approach (DB)

 Information retrieval (IR)

 Natural language processing (NLP)

 In-depth syntactic and semantic analysis


 Web document community
 Standards, manually appended meta-
information, maintained directories, etc
Direct vs. Indirect Web
Mining
 Web mining techniques can be used to
solve the information overload problems:
 Directly
Attack the problem with web mining techniques
E.g. newsgroup agent classifies news as relevant
 Indirectly
Used as part of a bigger application that
addresses problems
E.g. used to create index terms for a web search
service
The Research
 Converging research from: Database,
information retrieval, and artificial
intelligence (specifically NLP and
machine learning)
 Focusing on research from the machine
learning point of view
Outline
 Introduction
 Web Mining
 Web Content Mining
 Web Structure Mining
 Web Usage Mining
 Conclusion & Exam Questions
What is it?
 Web mining - data mining techniques to
automatically discover and extract
information from web documents/services
www

Knowledge
Web Mining : Data Mining
On the Web
 A term coined by Etzioni in 1996
Web Mining: Definition
 “Web mining refers to the overall
process of discovering potentially useful
and previously unknown information or
knowledge from the Web data.”
 Can be viewed as four subtasks
 Not the same as Information Retrieval

 Not the same as Information Extraction


Web Mining v. Data Mining
 Structure (or lack of it)
 Textual information and linkage structure
 Scale
 Data generated per day is comparable to
largest conventional data warehouses
 Speed
 Often need to react to evolving usage
patterns in real-time (e.g., merchandising)
Web Mining : Subtasks
 Resource Finding
 Task of retrieving intended web-documents
 Information Selection & Pre-processing
 Automatic selection and pre-processing of specific
information from retrieved web resources
 Generalization
 Automatic Discovery of patterns in web sites
 Analysis
 Validation and / or interpretation of mined patterns
Web Mining: Not IR
 Information retrieval (IR) is the
automatic retrieval of all relevant
documents while at the same time
retrieving as few of the non-relevant
documents as possible
 Web document classification, which is a
Web Mining task, could be part of an IR
system (e.g. indexing for a search
engine)
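The IR goal stated above (retrieve all relevant documents while retrieving as few non-relevant ones as possible) is usually measured with precision and recall. The following is a small sketch; the document identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    found = retrieved & relevant
    precision = len(found) / len(retrieved) if retrieved else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents returned, 5 documents truly relevant.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7", "d8", "d9"],
)
print(precision, recall)
```

Here two of the four returned documents are relevant (precision 0.5) and two of the five relevant documents were found (recall 0.4).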
Web Mining: Not IE
 Information extraction (IE) aims to
extract the relevant facts from given
documents
 IE systems for the general Web are not
feasible
 Most focus on specific Web sites or content
Web Mining and Machine
Learning
 Machine learning is concerned with the
development of algorithms and techniques
that allow computers to "learn".
 Web mining is NOT learning from the Web.
 Some applications of machine learning on
the web are NOT Web Mining
 Methods used for Web Mining are NOT
limited to machine learning
 However, there is a close relationship
between web mining and machine
learning
Web Mining: The Agent
Paradigm
 User Interface Agents
 information retrieval agents, information
filtering agents, & personal assistant
agents.
 Distributed Agents
 distributed agents for knowledge discovery
or data mining.
 Problem solving by a group of agents
 Mobile Agents
Outline
 Introduction
 Web Mining
 Web Content Mining
 Web Structure Mining
 Web Usage Mining
 Conclusion & Exam Questions
Web Mining Taxonomy

[Figure: Web Mining divides into Web Usage Mining, Web Structure Mining, and Web Content Mining]


Web Mining Taxonomy (1)
Web Mining
 Web Content Mining
 Identify information within given web pages
 Distinguish personal home pages from other web pages
 Web Structure Mining
 Uses interconnections between web pages to give weight to the pages
 Web Usage Mining
 Understand access patterns and trends to improve structure
 Web Structure Mining: The structure of a
typical web graph consists of Web pages as nodes
and hyperlinks as edges connecting
two related pages. It can be regarded as the
process of discovering structure information
from the web.
 Web Usage Mining: It focuses on techniques
that could predict user behavior while the user
interacts with the web.
 Web Content Mining: It emphasizes the
content of the web page. It is an automatic
process that extracts patterns from web pages
and goes beyond keyword extraction alone.
Web Mining Categories
 Web Content Mining
 Discovering useful information from web
contents/data/documents.
 Web Structure Mining
 Discovering the model underlying link structures
(topology) on the Web. E.g. discovering authorities
and hubs
 Web Usage Mining
 Make sense of data generated by surfers
 Usage data from logs, user profiles, user sessions,
cookies, user queries, bookmarks, mouse clicks and
scrolls, etc.
Web Content Data Structure
 Unstructured – free text
 Semi-structured – HTML
 More structured – Table or Database
generated HTML pages
 Multimedia data – receive less attention
than text or hypertext
Outline
 Introduction
 Web Mining
 Web Content Mining
 Web Structure Mining
 Web Usage Mining
 Conclusion & Exam Questions
Web Content Mining: IR
View
 Unstructured Documents
 Bag of words, or phrase-based feature
representation
 Features can be boolean or frequency
based
 Features can be reduced using different
feature selection techniques
 Word stemming, combining morphological
variations into one feature
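The feature representations listed above can be sketched in a few lines. The stemmer here is a deliberately crude suffix stripper used only as a stand-in for a real stemming algorithm, and the sample sentence is invented.

```python
import re
from collections import Counter


def crude_stem(word):
    """Very rough suffix stripping (illustrative only, not a real stemmer)."""
    for suffix in ("ing", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Frequency-based bag-of-words features over stemmed tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(token) for token in tokens)


doc = "Web pages link to other web pages; linking pages form a web graph."

frequency_features = bag_of_words(doc)   # term -> count
boolean_features = set(frequency_features)  # term present or not

print(frequency_features.most_common(3))
print(sorted(boolean_features))
```

Stemming merges "link" and "linking" into one feature, and "pages" into "page". Feature selection (for example dropping very common words such as "to" and "a") would be a natural next step and is not shown here.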
Web Content Mining: IR
View
 Semi-Structured Documents
 Uses richer representations for features,
based on information from the document
structure (typically HTML and hyperlinks)
 Uses common data mining methods
(whereas unstructured might use more text
mining methods)
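As a sketch of what "richer representations based on document structure" can look like, the snippet below uses Python's standard `html.parser` module to separate title words, heading words, and outgoing hyperlinks. The class name, the choice of tags, and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class StructureFeatures(HTMLParser):
    """Collects title words, heading words, and outgoing links from HTML."""

    HEADINGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.features = {"title": [], "heading": [], "links": []}
        self._current = None  # which text bucket we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag in self.HEADINGS:
            self._current = "heading"
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.features["links"].append(href)

    def handle_endtag(self, tag):
        if tag == "title" or tag in self.HEADINGS:
            self._current = None

    def handle_data(self, data):
        # Only text inside <title> or a heading is kept as a structural feature.
        if self._current:
            self.features[self._current].extend(data.lower().split())


# Hypothetical page used only for this example.
page = """<html><head><title>Web Mining Survey</title></head>
<body><h1>Content Mining</h1><p>Plain text here.</p>
<a href="http://example.org/structure">structure</a></body></html>"""

parser = StructureFeatures()
parser.feed(page)
features = parser.features
print(features)
```

A learner could then weight title and heading words differently from body text, or use the collected links as additional features.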
Web Content Mining: DB
View
 Tries to infer the structure of a Web site or
transform a Web site to become a
database
 Better information management
 Better querying on the Web
 Can be achieved by:
 Finding the schema of Web documents
 Building a Web warehouse
 Building a Web knowledge base
 Building a virtual database
Web Content Mining: DB
View
 Mainly uses the Object Exchange Model
(OEM)
 Represents semi-structured data (some
structure, no rigid schema) by a labeled graph
 Process typically starts with manual
selection of Web sites for content mining
 Main application: building a structural
summary of semi-structured data (schema
extraction or discovery)
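A rough sketch of the idea: semi-structured data is held as a labeled structure without a fixed schema, and a structural summary is derived by collecting the distinct label paths. Nested dictionaries stand in for an OEM-style labeled graph here (a real OEM graph may share objects and contain cycles, which this tree-shaped sketch ignores). The data is invented.

```python
def label_paths(node, prefix=()):
    """Return the set of distinct label paths reachable from a node."""
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            # A label may lead to one sub-object or to several.
            children = child if isinstance(child, list) else [child]
            for item in children:
                path = prefix + (label,)
                paths.add(path)
                paths |= label_paths(item, path)
    return paths


# Hypothetical semi-structured data: the two entries do not share a schema.
site = {
    "restaurant": [
        {"name": "Chef Chu", "address": {"street": "El Camino", "city": "Los Altos"}},
        {"name": "Saigon", "address": "Mountain View", "price": "cheap"},
    ]
}

summary = sorted(".".join(path) for path in label_paths(site))
for path in summary:
    print(path)
```

The resulting path list acts as a discovered schema: it shows that "address" is sometimes atomic and sometimes has "street" and "city" parts, and that "price" is optional.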
Web Content Mining

Web Content Mining
 Agent Based Approach
 Intelligent Search Agent
 Information Filtering & Categorization
 Personalized Web Agent
 Database Approach
 Multilevel Databases
 Web Query Systems
Intelligent Search Agents
 Concentrate on searching relevant information
using the characteristics of a particular domain to
interpret and organize the collected information.

 It can be further classified into two types:


 Interpretation Based on Pre-Specified Information:
 Examples: Harvest, FAQFinder, Information Manifold, OCCAM
 Interpretation Based on Unfamiliar Source:
 Example: ShopBot
ShopBot
 A ShopBot is an autonomous software
agent that combs the Internet, providing
users with low-priced products or product
recommendations.
 A ShopBot basically looks for product
information from a variety of vendor
sites using general information about
the product domain.
 The following example displays a
ShopBot at www.allbookstores.com.
Information Filtering &
Categorization
 This makes use of various
information retrieval techniques and
characteristics of hypertext web
documents to interpret and
categorize data.

 Examples: HyPursuit, BO (Bookmark


Organizer).
Bookmark Organizer (BO)
 Makes use of hierarchical clustering techniques
and involves user interaction to organize a
collection of web documents.

 It operates in two modes:


 Automatic
 Manual

 Frozen Nodes: In a hierarchical structure, if we


freeze a node N, then the subtree rooted at N
represents a coherent group of documents
Additions & Deletions in BO
 For addition, we can use either of the following
two mechanisms:
 Fully Automatic: Makes use of an extremely precise
search algorithm to find the most relevant frozen cluster
to insert into.
 Semi Automatic: Insert the bookmark, and then climb up
the tree to find the closest frozen ancestor, and then re-
cluster the sub folder.

 When we delete, we must re-cluster the
containing sub folder.
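The slides do not specify BO's clustering algorithm, so the following is only a generic sketch of bottom-up (agglomerative) hierarchical clustering applied to bookmarks. It assumes Jaccard similarity over word sets, represents a merged cluster by the union of its members' words, and stops at an arbitrary threshold; the bookmark titles are invented.

```python
def jaccard(a, b):
    """Jaccard similarity between two non-empty word sets."""
    return len(a & b) / len(a | b)


def agglomerative(docs, threshold=0.2):
    """Repeatedly merge the two most similar clusters.

    Stops when no pair of clusters reaches the similarity threshold.
    Each cluster is (set of bookmark names, union of their words).
    """
    clusters = [({name}, set(text.split())) for name, text in docs.items()]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < threshold:
            break
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(names) for names, _ in clusters]


# Hypothetical bookmarks: name -> descriptive words.
bookmarks = {
    "python-docs": "python programming language docs",
    "rust-docs": "rust programming language docs",
    "world-news": "world news headlines",
    "breaking-news": "breaking world news",
}
folders = agglomerative(bookmarks)
print(folders)
```

In BO's terms, a user could then freeze one of the resulting folders so that later insertions and deletions only re-cluster the unfrozen parts of the tree.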
Personalized Web Agents
 This category of Web agents learns user
preferences and discovers Web information
sources based on these preferences, and those of
other individuals with similar interests.

 Examples:
 WebWatcher
 PAINT
 Syskill&Webert
 GroupLens
 Firefly
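The phrase "those of other individuals with similar interests" describes what is commonly called collaborative filtering. The sketch below shows the general idea with cosine similarity over page ratings; it is not the actual algorithm of any system listed above, and the users, pages, and ratings are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[item] * v[item] for item in shared)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def recommend(target, ratings):
    """Find the most similar other user and suggest pages the target has not rated."""
    others = [name for name in ratings if name != target]
    neighbour = max(others, key=lambda name: cosine(ratings[target], ratings[name]))
    unseen = {
        page: score
        for page, score in ratings[neighbour].items()
        if page not in ratings[target]
    }
    return neighbour, sorted(unseen, key=unseen.get, reverse=True)


# Hypothetical ratings: user -> {page: rating from 1 to 5}.
ratings = {
    "alice": {"/ml": 5, "/db": 1},
    "bob": {"/ml": 4, "/db": 1, "/nlp": 5},
    "carol": {"/ml": 1, "/db": 5, "/sports": 4},
}
neighbour, suggestions = recommend("alice", ratings)
print(neighbour, suggestions)
```

Alice's ratings resemble Bob's more than Carol's, so the page Bob liked that Alice has not seen is suggested.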
Multilevel Databases

[Figure: source data types: text, image, audio, video, maps, games]


Levels of a MLDB
 Layer 0 :
 Unstructured, massive and global information base.

 Layer 1:
 Derived from lower layers.
 Relatively structured.
 Obtained by data analysis, transformation &
Generalization.

 Higher Layers (Layer n):


 Further generalization to form smaller, better structured
databases for more efficient retrieval.
Web Query System
 These systems attempt to make use of:
 Standard database query language – SQL
 Structural information about web documents
 Natural language processing for queries made in www
searches.

 Examples:
 WebLog: Restructuring extracted information from Web
sources.
 W3QL: Combines structure query (organization of
hypertext) and content query (information retrieval
techniques).
Architecture of a Global
MLDB
[Figure: Source 1, Source 2, ..., Source n; Resource Discovery (MLDB); Generalized Data; Concept Hierarchy; Higher Levels; Knowledge Discovery]
Outline
 Introduction
 Web Mining
 Web Content Mining
 Web Structure Mining
 Web Usage Mining
 Conclusion & Exam Questions
Web Structure Mining
 Interested in the structure between
Web documents (not within a
document)
 Inspired by the study of social
networks and citation analysis
 Example: PageRank – Google
 Application: Discovering micro-communities
in the Web
 Measuring the “completeness” of a
Web site
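The hubs-and-authorities idea mentioned earlier in the deck is another link-analysis technique in this family. Below is a simplified sketch of the iterative score updates: a good hub links to good authorities, and a good authority is linked to by good hubs. The iteration count and the toy graph are assumptions for illustration.

```python
import math


def hits(links, iterations=30):
    """Simplified hubs-and-authorities iteration on a dict of page -> outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        authority = {page: 0.0 for page in pages}
        for page, targets in links.items():
            for target in targets:
                authority[target] += hub[page]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {page: sum(authority[t] for t in links.get(page, [])) for page in pages}
        # Normalize both score vectors so they do not grow without bound.
        for scores in (authority, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, authority


# Hypothetical graph: two portal pages pointing at three content sites.
toy_links = {
    "portal1": ["siteA", "siteB"],
    "portal2": ["siteA", "siteB", "siteC"],
    "siteA": [],
}
hub_scores, authority_scores = hits(toy_links)
print(max(hub_scores, key=hub_scores.get))
print(sorted(authority_scores, key=authority_scores.get, reverse=True)[:2])
```

In this toy graph "portal2" is the best hub, and "siteA" and "siteB" tie as the strongest authorities because both portals point to them.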
But…
Retrieving relevant
information from the
web is like
finding the needle in
the haystack...
 The Web is highly volatile (its content
keeps changing), distributed (it is located
everywhere) and heterogeneous
(its format is unconstrained).
 The Web is a huge chaotic information
space without central authority.
 The Web is noisy.


Outline
 Introduction
 Web Mining
 Web Content Mining
 Web Structure Mining
 Web Usage Mining
 Conclusion & Exam Questions
Web Usage Mining
 Tries to predict user behavior from
interaction with the Web
 Wide range of data (logs)
 Web client data
 Proxy server data
 Web server data
 Two common approaches
 Map usage data into relational tables before
using adapted data mining techniques
 Use log data directly by utilizing special pre-
processing techniques
Web Usage Mining
 Typical problems: Distinguishing among
unique users, server sessions,
episodes, etc in the presence of
caching and proxy servers
 Often Usage Mining uses some
background or domain knowledge
E.g. site topology, Web content, etc
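As an illustration of the pre-processing involved, the sketch below parses a few web server log lines and groups requests into sessions. It assumes log lines in Common Log Format, identifies users by host address only, and uses a 30-minute inactivity timeout; all three are simplifying assumptions, and as noted above, caching and proxy servers make host-based identification unreliable in practice. The log lines are invented.

```python
import re
from datetime import datetime, timedelta

# Assumed Common Log Format: host ident user [time] "METHOD url proto" status size
LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)[^"]*" \d{3} \S+'
)


def parse(line):
    """Extract (host, timestamp, url) from one log line."""
    host, stamp, url = LOG_PATTERN.match(line).groups()
    return host, datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z"), url


def sessionize(lines, timeout=timedelta(minutes=30)):
    """Group requests per host; a gap longer than `timeout` starts a new session."""
    sessions = []   # list of (host, [urls in order])
    last_seen = {}  # host -> (time of last request, index of its open session)
    for host, when, url in sorted(map(parse, lines), key=lambda rec: rec[1]):
        previous = last_seen.get(host)
        if previous and when - previous[0] <= timeout:
            index = previous[1]
            sessions[index][1].append(url)
        else:
            sessions.append((host, [url]))
            index = len(sessions) - 1
        last_seen[host] = (when, index)
    return sessions


# Hypothetical log excerpt.
log_lines = [
    '10.0.0.1 - - [10/Feb/2006:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [10/Feb/2006:10:01:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [10/Feb/2006:10:05:00 +0000] "GET /papers.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Feb/2006:11:30:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
]
sessions = sessionize(log_lines)
for host, urls in sessions:
    print(host, urls)
```

The first host produces two sessions because its last request arrives 85 minutes after the previous one. The resulting session list is the kind of table that data mining techniques can then be applied to.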
Advantage
 This technology has enabled
e-commerce to do personalized
marketing.
 Government agencies are using
this technology to classify threats
and fight against terrorism.
 Companies can establish better
customer relationships.
Disadvantage
 The most criticized ethical issue
involving web mining is the invasion
of privacy.
 There is no law preventing companies that
collect the data from trading it.
 Some mining algorithms might use
controversial attributes like sex, race,
religion, or sexual orientation to
categorize individuals.
 Applications de-individualize the users
by judging them by their mouse clicks.

Outline
 Introduction
 Web Mining
 Web Content Mining
 Web Structure Mining
 Web Usage Mining
 Conclusion & Exam Questions
Conclusions
 The paper tried to resolve confusion with
regard to the term Web Mining
 Differentiated from IR and IE
 Suggested three Web mining categories
 Content, Structure, and Usage Mining
 Briefly described approaches for the three
categories
 Explored connection with agent paradigm
Exam Question #1
 Question: Outline the main
characteristics of Web information.

 Answer: Web information is huge,
diverse, and dynamic.
Exam Question #2
 Question: How can data mining techniques
be used in Web information analysis?
Give at least two examples.
 Classification: classification on server logs
using decision tree, Naïve-Bayes classifier to
discover the profiles of users belonging to a
particular class
 Clustering: Clustering can be used to group
users exhibiting similar browsing patterns.
 Association Analysis: association analysis can
be used to relate pages that are most often
referenced together in a single server session.
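The association-analysis answer can be sketched as counting which pairs of pages occur together in the same server session and keeping the pairs whose support reaches a threshold. The sessions and the 0.5 support threshold are invented for illustration; full association-rule mining algorithms such as Apriori extend this to larger itemsets.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(sessions, min_support=0.5):
    """Return page pairs whose support (fraction of sessions) is at least min_support."""
    pair_counts = Counter()
    for pages in sessions:
        # Count each unordered pair at most once per session.
        for pair in combinations(sorted(set(pages)), 2):
            pair_counts[pair] += 1
    total = len(sessions)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


# Hypothetical server sessions (pages visited in each session).
toy_sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
pairs = frequent_pairs(toy_sessions)
print(pairs)
```

Two pairs appear in half of the sessions: "/home" with "/products", and "/cart" with "/products". A site designer might use such pairs to decide which pages should link to each other.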
Exam Question #3
 Question: What are the three main areas
of interest for Web mining?

 Answer: (1) Web Content
(2) Web Structure
(3) Web Usage

And Raymond Kosala, Hendrik Blockeel,
and Shan Huang!

THANK YOU!
