Chapter 12: Web Usage Mining: An Introduction
Chapter written by Bamshad Mobasher
Bing Liu 3
Web server logs
1 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
http://dataminingresources.blogspot.com/
2 2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf - 200 4096
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
http://maya.cs.depaul.edu/~classes/cs589/papers.html
3 2006-02-01 08:01:28 2.3.4.5 - GET /classes/ds575/papers/hyperlink.pdf - 200
318814 HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)
http://www.google.com/search?hl=en&lr=&q=hyperlink+analysis+for+the+web+survey
4 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/announce.html - 200 3794
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/
5 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/styles2.css - 200 1636
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/announce.html
6 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/header.gif - 200 6027
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/announce.html
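As a rough illustration of how the fields in entries like these might be pulled apart, the sketch below splits one entry on whitespace. The field names follow W3C extended log format conventions and are an assumption, since the slide does not label the columns; the leading entry number (1, 2, ...) is treated as a slide annotation and left out.

```python
from datetime import datetime

# Assumed field order, inferred from the sample entries (not stated on the slide).
FIELDS = ["date", "time", "c_ip", "cs_username", "cs_method", "cs_uri_stem",
          "cs_uri_query", "sc_status", "sc_bytes", "cs_version", "cs_host",
          "cs_user_agent", "cs_referer"]

def parse_entry(line):
    """Split one whitespace-delimited log entry into a dict of named fields."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # Combine date and time into a single timestamp for later sessionization.
    record["timestamp"] = datetime.strptime(
        record["date"] + " " + record["time"], "%Y-%m-%d %H:%M:%S")
    record["sc_status"] = int(record["sc_status"])
    record["sc_bytes"] = int(record["sc_bytes"])
    return record

# First sample entry, joined onto one logical line.
entry = ("2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 "
         "HTTP/1.1 maya.cs.depaul.edu "
         "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) "
         "http://dataminingresources.blogspot.com/")
record = parse_entry(entry)
print(record["c_ip"], record["cs_uri_stem"], record["sc_status"])
```

Splitting on whitespace works here only because the user-agent string encodes spaces as "+"; logs in other formats would need a different tokenizer.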
Web usage mining process
Data preparation
Pre-processing of web usage data
Data cleaning
Remove irrelevant references and fields in server logs
Remove references due to spider navigation
Remove erroneous references
Add missing references due to caching (done after sessionization)
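The first three cleaning steps can be sketched as a simple filter over parsed log records. This is a minimal sketch under assumptions: the record layout, the list of embedded-object suffixes, the user-agent fragments used to spot spiders, and the status-code rule for "erroneous" are all illustrative choices, not taken from the slides. The "missing.html" and "ExampleBot" records are made up for the example.

```python
# Hypothetical suffixes of embedded objects that are not pageviews themselves.
IRRELEVANT_SUFFIXES = (".css", ".gif", ".jpg", ".png", ".js", ".ico")
# Hypothetical user-agent fragments suggesting a spider; real lists are much longer.
SPIDER_HINTS = ("bot", "crawler", "spider")

def is_clean(record):
    """Return True if a log record should be kept for usage mining."""
    uri = record["uri"].lower()
    agent = record["agent"].lower()
    if uri.endswith(IRRELEVANT_SUFFIXES):
        return False  # irrelevant reference (style sheet, image, ...)
    if uri == "/robots.txt" or any(hint in agent for hint in SPIDER_HINTS):
        return False  # reference due to spider navigation
    if record["status"] >= 400:
        return False  # erroneous reference (assumed: any 4xx or 5xx status)
    return True

def clean(records):
    """Keep only the records that pass every cleaning rule."""
    return [r for r in records if is_clean(r)]

log = [
    {"uri": "/classes/cs480/announce.html", "agent": "Mozilla/4.0", "status": 200},
    {"uri": "/classes/cs480/styles2.css", "agent": "Mozilla/4.0", "status": 200},
    {"uri": "/classes/cs480/header.gif", "agent": "Mozilla/4.0", "status": 200},
    {"uri": "/classes/cs480/missing.html", "agent": "Mozilla/4.0", "status": 404},
    {"uri": "/classes/cs589/papers.html", "agent": "ExampleBot/1.0", "status": 200},
]
kept = clean(log)
print([r["uri"] for r in kept])
```

The fourth step, adding references lost to caching, is not part of this filter because it runs after sessionization.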
Identify sessions (sessionization)
Sessionization strategies
Sessionization heuristics
Sessionization example
User identification
User identification: an example
Pageview
Path completion
Client- or proxy-side caching can often result in missing access references to those pages or objects that have been cached.
For instance, if a user returns to a page A during the same session, the second access to A will likely result in viewing the previously downloaded version of A that was cached on the client side, and therefore no request is made to the server.
As a result, the second reference to A is not recorded in the server logs.
Missing references due to caching
Path completion
Path completion is the problem of inferring missing user references due to caching.
Effective path completion requires extensive knowledge of the link structure within the site.
Referrer information in server logs can also be used to disambiguate the inferred paths.
The problem gets much more complicated in frame-based sites.
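One common heuristic for path completion can be sketched as follows. It is a minimal sketch, not the method of the slides: when a request's referrer is not the page the user was last seen on, the user is assumed to have pressed "back" through cached pages until reaching the referrer, and those revisits are inserted. When no referrer is logged, the sketch falls back on the link structure and picks the most recent earlier page that links to the requested one. The function name, the page labels A to D, and the `links` mapping are hypothetical.

```python
def complete_path(requests, links):
    """Infer cached revisits missing from one session's logged requests.

    requests: list of (page, referrer) pairs in time order; referrer may be None.
    links: dict mapping each page to the set of pages it links to.
    Returns the page sequence with inferred back-navigation inserted.
    """
    path = []
    for page, referrer in requests:
        if path:
            source = referrer
            if source is None:
                # No referrer logged: use link structure to find the most
                # recent earlier page that links to the requested page.
                source = next(
                    (p for p in reversed(path) if page in links.get(p, ())), None)
            if source is not None and source in path and path[-1] != source:
                # Position of the most recent visit to the source page.
                last = len(path) - 1 - path[::-1].index(source)
                # Replay the pages between it and the current page, backwards.
                path.extend(reversed(path[last:-1]))
        path.append(page)
    return path

links = {"A": {"B", "D"}, "B": {"C"}}
logged = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
print(complete_path(logged, links))
```

Here the server only sees A, B, C, D, but the referrer of D is A, so the completed path adds the cached revisits to B and A before D. Several backtracking paths can be consistent with the same log, which is why this stays a heuristic.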
Integrating with e-commerce events
Events are either product oriented or visit oriented.
They are used to track and analyze the conversion of browsers to buyers.
The major difficulty with e-commerce events is defining and implementing the events for a site. However, in contrast to clickstream data, getting reliable preprocessed data is not a problem.
Another major challenge is successful integration with clickstream data.
Product-Oriented Events
Product View
  Occurs every time a product is displayed on a page view
  Typical types: image, link, text
Product Click-through
  Occurs every time a user “clicks” on a product to get more information
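To make the two event types concrete, the sketch below counts product views and product click-throughs per product and reports the fraction of views that led to a click. The event tuple layout, the event-type strings, and the product identifiers are hypothetical; the slides do not define a storage format or this particular ratio.

```python
from collections import Counter

def click_through_rates(events):
    """Return, per product, click-throughs divided by product views.

    events: iterable of (event_type, product_id) pairs (assumed layout).
    """
    views = Counter()
    clicks = Counter()
    for event_type, product_id in events:
        if event_type == "product_view":
            views[product_id] += 1
        elif event_type == "product_click_through":
            clicks[product_id] += 1
    # Only products that were viewed at least once get a rate.
    return {p: clicks[p] / views[p] for p in views}

events = [
    ("product_view", "p1"),
    ("product_view", "p1"),
    ("product_click_through", "p1"),
    ("product_view", "p2"),
]
rates = click_through_rates(events)
print(rates)
```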
Product-Oriented Events
Web usage mining process
Integration with page content
Integration with link structure
E-commerce data analysis
Session analysis
OLAP
Data mining
Data mining (cont.)
Some usage mining applications
Personalization application
Standard approaches
Summary
Web usage mining has emerged as the essential tool for realizing more personalized, user-friendly, and business-optimal Web services.
The key is to use user clickstream data for many mining purposes.
Traditionally, Web usage mining has been used by e-commerce sites to organize their sites and to increase profits.
It is now also used by search engines to improve search quality and to evaluate search results, and by many other applications.