Datamining
WAREHOUSING - 18BIT55S
UNIT IV: Web Data Mining: Introduction – Web
Terminology and Characteristics – Locality and
Hierarchy in the Web – Web Content Mining – Web
Usage Mining – Web Structure Mining – Web Mining
Software. Search Engines: Search Engine Functionality
- Search Engine Architecture – Ranking of Web Pages.
TEXT BOOK
G.K. Gupta, “Introduction to Data Mining with Case
Studies”, Prentice Hall of India (Pvt) Ltd, India, 2008.
Prepared by : Mrs. G. Shashikala, Assistant Professor, PG
Department of Information Technology
Web mining
Web mining is the application of data mining techniques
to find interesting and potentially useful knowledge from
web data. Either the hyperlink structure of the web, the
web log data, or both may be used in the mining process.
Web mining is divided into three categories:
1 Web content mining : it deals with discovering useful
information or knowledge from web page contents
2 Web structure mining : it deals with discovering and
modelling the link structure of the web
3 Web usage mining : it deals with understanding user
behaviour in interacting with the web or with a website
The following are the major differences between searching
conventional text and searching the web:
1 Hyperlinks : Text documents do not have hyperlinks, while
links are very important components of web documents
2 Types of information : Web pages differ in structure, quality and
usefulness. Web pages consist of text, frames, multimedia
objects, animation and other types of information, while documents
mainly consist of text but may have tables, diagrams and figures
3 Dynamics : Text documents do not change unless a new
edition of a book appears, while web pages change frequently
4 Quality : Text documents are usually of high quality, but
much of the information on the web is of low quality
5 Huge size : Although some libraries are very large, the web
in comparison is much larger
6 Document use : Compared to the use of conventional
documents, the use of web documents is very different
Web terminology and characteristics:
Some of the web terminology, based on W3C definitions, is given below:
The World Wide Web (WWW) is the set of all the nodes which are
interconnected by hypertext links.
A link expresses one or more relationships between two or more resources.
Links may also be established within a document by using anchors.
A web page is a collection of information consisting of one or more web
resources, identified by a single URL. A web site is a collection of
interlinked web pages, including a home page, residing at the same network
location.
In addition to simple text, HTML allows embedding of images, sounds and
video streams
A client browser is the primary user interface to the web. It is a program
which allows a person to view the contents of web pages and to navigate
from one page to another.
A uniform resource locator (URL) is an identifier for an abstract or physical
resource, for example a server and the file path or index. URLs are location
dependent, and each URL contains four distinct parts, namely the protocol
type (e.g. http), the name of the web server, the directory path and the file name.
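As an illustration, the four parts can be pulled apart with Python's standard urllib.parse module; the URL below is a made-up example:

```python
from urllib.parse import urlparse

# Split an example URL into the four parts described above
# (the URL itself is invented for illustration)
url = "http://www.example.com/docs/unit4/notes.html"
parts = urlparse(url)

protocol = parts.scheme                              # protocol type: "http"
server = parts.netloc                                # web server name
directory, _, filename = parts.path.rpartition("/")  # directory path and file name
# directory -> "/docs/unit4", filename -> "notes.html"
```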
A web server serves web pages to client machines using HTTP
so that a browser can display them
A client is the role adopted by an application when it is
retrieving a web resource
A proxy is an intermediary which acts as both a server and
the client for the purpose of retrieving resources on behalf
of other clients. Clients using a proxy know that the proxy
is present and that it is an intermediary
A domain name server is a distributed database of name to
address mappings
A cookie is the data sent by a web server to a web
client, to be stored locally by the client and sent back to
the server on subsequent requests
Locality and Hierarchy in the web
Most social structures tend to organize themselves as
hierarchies. The web shows a strong hierarchical
structure.
Web pages can be classified into several types :
1 Home page or the head page : represents an entry point
for the web site of an enterprise
2 Index page : assists the user to navigate through the
enterprise’s web site
3 Reference page : provides some basic information that is
used by a number of pages. For example, a link to a page that
provides the enterprise’s privacy policy
4 Content page : provides content and are often the leaf
nodes of a tree
Web content mining
This deals with discovering useful information from the
web
The algorithm proposed by Brin for this task is called Dual Iterative
Pattern Relation Extraction (DIPRE). It works as follows:
1 Sample : Start with a sample S provided by the user
2 Occurrences : Find occurrences of tuples starting with
those in S. Once tuples are found, the context of every
occurrence is saved. Let these be O. (S → O)
3 Patterns : Generate patterns based on the set of
occurrences O. This requires generating patterns with
similar contexts. (O → P)
4 Match patterns : The web is now searched for the
patterns
5 Stop if enough matches are found; else go to Step 2
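The five steps can be sketched as a small Python loop over an in-memory "web" of three documents; the corpus, the seed (title, author) tuple and the pattern shapes are all invented for illustration and are far simpler than real DIPRE patterns:

```python
import re

# Toy "web" corpus and seed tuple (all invented for illustration)
corpus = [
    "The book Ulysses by James Joyce is a classic.",
    "The book Dubliners by James Joyce was published in 1914.",
    "The book Hamlet by William Shakespeare remains popular.",
]
tuples = {("Ulysses", "James Joyce")}          # Step 1: sample S

for _ in range(5):
    # Step 2: find occurrences of known tuples, saving their context
    patterns = set()
    for title, author in tuples:
        for doc in corpus:
            m = re.search(re.escape(title) + r"(.+?)" + re.escape(author), doc)
            if m:
                # Step 3: the pattern here is just the text between
                # title and author (real DIPRE patterns are richer)
                patterns.add(m.group(1))
    # Step 4: match the patterns against the whole corpus
    new = set()
    for middle in patterns:
        regex = r"book (\w+)" + re.escape(middle) + r"([A-Z]\w+ [A-Z]\w+)"
        for doc in corpus:
            for m in re.finditer(regex, doc):
                new.add((m.group(1), m.group(2)))
    if new <= tuples:                          # Step 5: stop when nothing new
        break
    tuples |= new
# tuples now also contains (Dubliners, James Joyce) and
# (Hamlet, William Shakespeare)
```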
Web usage mining
The objective of web usage mining is to understand and predict user behaviour in interacting with the
web or with the website in order to improve the quality of service
Using some tools, the following information may be obtained:
Number of hits
Number of visitors
Visitor referring website
Visitor referral website
Entry point
Visitor time and duration
Path analysis
Visitor IP address
Browser type
Platform
Cookies
It is also desirable to collect information on:
Path traversed
Conversion rates
Impact of advertising
Impact of promotions
Website design
Customer segmentation
Enterprise search
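Once log records are parsed, several of the statistics listed above reduce to simple counting; a minimal sketch, where the records and field names are invented for illustration:

```python
from collections import Counter

# Toy parsed web-log records (all values are invented)
log = [
    {"ip": "10.0.0.1", "page": "/index.html",    "browser": "Firefox"},
    {"ip": "10.0.0.2", "page": "/products.html", "browser": "Chrome"},
    {"ip": "10.0.0.1", "page": "/products.html", "browser": "Firefox"},
]

hits = len(log)                                # number of hits
visitors = len({r["ip"] for r in log})         # number of visitors (distinct IPs)
browsers = Counter(r["browser"] for r in log)  # browser type breakdown
# most requested page: a rough stand-in for entry-point / path analysis
top_page, top_count = Counter(r["page"] for r in log).most_common(1)[0]
```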
Web structure mining
The aim of web structure mining is to discover the link structure or the model
that is assumed to underlie the web. The Hyperlink Induced Topic Search
(HITS) algorithm is used for this. The HITS algorithm has two major steps:
1 Sampling step : it collects relevant web pages for a given topic
2 Iterative step : it finds hubs and authorities using the information
collected during sampling
Step 1 - Sampling step
The HITS algorithm expands the root set R into a base set S by using the
following algorithm:
1 Let S = R
2 For each page in S, do steps 3 to 5
3 Let T be the set of all pages S points to
4 Let F be the set of all pages that point to S
5 Let S = S + T + some or all of F
6 Delete all links with the same domain name
7 This S is returned
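The sampling steps above can be sketched on a toy link graph (step 6, removing same-domain links, is skipped since the toy pages have no domains); the graph and root set are invented for illustration:

```python
# Toy link graph: page -> pages it points to (invented for illustration)
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["A"],
    "E": ["A"],
}
# Reverse map: page -> pages that point to it
backlinks = {p: [q for q, outs in links.items() if p in outs] for p in links}

R = {"A"}                                 # root set for the given topic
S = set(R)                                # step 1: let S = R
for page in list(S):                      # step 2: for each page in S
    T = set(links.get(page, []))          # step 3: pages S points to
    F = set(backlinks.get(page, []))      # step 4: pages that point to S
    S = S | T | F                         # step 5: S = S + T + some/all of F
# step 7: S is returned as the base set
```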
Step 2 - Finding hubs and authorities
1 Let a page p have a non-negative authority weight Xp
and a non-negative hub weight Yp
2 The weights are normalized so their squared sum for
each type of weight is 1, since only the relative weights are
important
3 For a page p, the value of Xp is updated to be the sum of
Yq over all pages q that link to p
4 For a page p, the value of Yp is updated to be the sum of
Xq over all pages q that p links to
5 Continue with steps 2 to 4 unless a termination condition has
been reached
6 On termination, the output of the algorithm is the set of
pages with the largest Xp and Yp weights
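The hub/authority iteration can be sketched in a few lines; the link graph and the fixed iteration count (standing in for a real convergence test) are invented for illustration:

```python
import math

# Toy link graph (invented): page -> pages it points to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = sorted(links)

x = {p: 1.0 for p in pages}   # authority weights Xp
y = {p: 1.0 for p in pages}   # hub weights Yp

for _ in range(50):           # step 5: iterate until (approximately) converged
    # step 3: Xp <- sum of Yq over all pages q that link to p
    x = {p: sum(y[q] for q in pages if p in links[q]) for p in pages}
    # step 4: Yp <- sum of Xq over all pages q that p links to
    y = {p: sum(x[q] for q in links[p]) for p in pages}
    # step 2: normalise so the squared weights of each type sum to 1
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    x = {p: v / nx for p, v in x.items()}
    y = {p: v / ny for p, v in y.items()}

# step 6: pages with the largest weights
best_authority = max(x, key=x.get)   # C, since A, B and D all point to it
best_hub = max(y, key=y.get)         # A, since it points to good authorities
```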
Web mining software
123LogAnalyzer
Analog (from Dr. Stephen Turner)
Azure Web Log Analyser
ClickTracks
Datanautics G2 and Insight 5
LiveStats.NET
NetTracker Web Analytics
Nihuo Web Log Analyser
WebAnalyst from Megaputer
WebLog Expert 3.5
WebTrends 7 from NetIQ
WUM – Web Utilization Miner
Search Engines
Introduction