Week 1
RESEARCH A
SURVEY
http://news.netcraft.com/archives/web_server_survey.html
The web as a graph
Pages = nodes, hyperlinks = edges
Ignore content
Directed graph
High linkage
10-20 links/page on average
Power-law degree distribution
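The graph view above can be sketched in a few lines of Python. The page names and the `links` dictionary are made up for illustration; a real crawl would have millions of pages, and only then would a power-law shape show up in the degree histogram.

```python
from collections import Counter

# Hypothetical toy web: each page maps to the pages it links to (directed edges).
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def out_degrees(graph):
    """Number of hyperlinks leaving each page."""
    return {page: len(targets) for page, targets in graph.items()}

def in_degrees(graph):
    """Number of hyperlinks pointing at each page."""
    counts = Counter({page: 0 for page in graph})
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)

def degree_histogram(degrees):
    """Map each degree value to the number of pages having that degree."""
    return dict(Counter(degrees.values()))

def average_out_degree(graph):
    """Average number of links per page."""
    return sum(len(t) for t in graph.values()) / len(graph)
```

On a large crawl, plotting `degree_histogram(in_degrees(links))` on log-log axes is the usual way to check for a power-law degree distribution.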
Structure of Web graph
Let’s take a closer look at structure
Broder et al. (2000) studied a crawl of 200M
pages and other smaller crawls
Bow-tie structure
Not a “small world”
Bow-tie Structure
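A rough sketch of how the main bow-tie regions could be separated on a small graph. It assumes we already know one `seed` page inside the central strongly connected core, which is a simplification; the function names and the toy graph are hypothetical and not taken from the Broder et al. study.

```python
from collections import deque

def reachable(graph, start):
    """Breadth-first search: all pages reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def reverse(graph):
    """Same pages, every hyperlink flipped."""
    rev = {}
    for src, targets in graph.items():
        rev.setdefault(src, [])
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev

def bow_tie(graph, seed):
    """Split pages into core (SCC), IN, OUT and everything else, given a seed in the core."""
    forward = reachable(graph, seed)             # pages the core can reach
    backward = reachable(reverse(graph), seed)   # pages that can reach the core
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    core = forward & backward
    return {
        "SCC": core,
        "IN": backward - core,
        "OUT": forward - core,
        "OTHER": nodes - forward - backward,     # tendrils, disconnected pages
    }

# Hypothetical toy graph: in1 -> core (c1 <-> c2) -> out1, plus an isolated page.
toy = {"in1": ["c1"], "c1": ["c2"], "c2": ["c1", "out1"], "out1": [], "island": []}
parts = bow_tie(toy, "c1")
```

IN pages can reach the core but cannot be reached from it; OUT pages are the opposite. Two searches from the seed (one on the graph, one on the reversed graph) are enough to tell them apart.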
What can the graph tell us?
Distinguish “important” pages from
unimportant ones
PageRank
Discover communities of related pages
Hubs and Authorities
Detect web spam
TrustRank
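To make the first two ideas concrete, here is a minimal sketch of PageRank and of Hubs and Authorities (HITS) by simple iteration. The toy graph, the damping value 0.85 and the iteration counts are assumptions for illustration; the code also assumes every linked page appears as a key in the graph.

```python
import math

# Hypothetical toy web: page -> pages it links to.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration: a page is important if important pages link to it."""
    nodes = list(graph)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in nodes}
        for page, targets in graph.items():
            # A page with no out-links spreads its rank over all pages.
            receivers = targets if targets else nodes
            share = damping * rank[page] / len(receivers)
            for target in receivers:
                new[target] += share
        rank = new
    return rank

def hits(graph, iterations=50):
    """Hubs point to good authorities; authorities are pointed to by good hubs."""
    nodes = list(graph)
    hub = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        auth = {p: 0.0 for p in nodes}
        for page, targets in graph.items():
            for target in targets:
                auth[target] += hub[page]
        hub = {p: sum(auth[t] for t in graph[p]) for p in nodes}
        # Normalise both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

ranks = pagerank(web)
hubs, authorities = hits(web)
```

In this toy graph page "c" receives links from three pages, so both PageRank and the authority score single it out. TrustRank follows the same iteration pattern as PageRank but starts from a hand-picked set of trusted pages; it is not shown here.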
Searching the Web
Knowledge
Web Mining: Data Mining
on the Web
A term coined by Etzioni in 1996
Web Mining: Definition
“Web mining refers to the overall
process of discovering potentially useful
and previously unknown information or
knowledge from the Web data.”
Can be viewed as four subtasks
Not the same as Information Retrieval
Web Mining Taxonomy
Web Mining
Intelligent Search Agent
Information Filtering & Categorization
Personalized Web Agent
Multilevel Databases
Web Query Systems
Intelligent Search Agents
Concentrate on searching relevant information
using the characteristics of a particular domain to
interpret and organize the collected information.
Examples:
WebWatcher
PAINT
Syskill&Webert
GroupLens
Firefly
Multilevel Databases
Layer 1:
Derived from lower layers.
Relatively structured.
Obtained by data analysis, transformation &
generalization.
Examples:
WebLog: Restructuring extracted information from Web
sources.
W3QL: Combines structure query (organization of
hypertext) and content query (information retrieval
techniques).
Architecture of a Global MLDB
[Figure: diagram with labels Source 1, Source 2, ..., Source n; Generalized Data; Concept Hierarchy; Higher Levels; Resource Discovery (MLDB); Knowledge Discovery]
Outline
Introduction
Web Mining
Web Content Mining
Web Structure Mining
Web Usage Mining
Conclusion & Exam Questions
Web Structure Mining
Interested in the structure between
Web documents (not within a
document)
Inspired by the study of social
Web site
But…
Retrieving relevant
information from the
Web is like
finding a needle in
a haystack.
The Web is highly volatile (its content
keeps changing), distributed (it is located
everywhere) and heterogeneous
(its format is unrestricted).
customer relationship
Disadvantage
The most criticized ethical issue
involving web mining is the invasion
of privacy
there is no law preventing them from
THANK YOU!