Webmining I
Web Mining
Anushri Gupta (105390464)
Gaurao Bardia (105390862)
Ankush Chadha (105571759)
Krati Jain (105571032)
Group: 9
Course Instructor: Prof. Anita Wasilewska
State University of New York at Stony Brook
References
Mining the Web: Discovering Knowledge from Hypertext Data by Soumen Chakrabarti (Morgan-Kaufmann Publishers)
Web Mining: Accomplishments & Future Directions by Jaideep Srivastava
The World Wide Web: Quagmire or Gold Mine by Oren Etzioni
http://www.galeas.de/webmining.html
Overview
Web Mining
[Figure: overview of Web mining. Example mining result: sequential patterns with support >= 40% and their supporting customers.]
Classifications
Clustering
Association
Document Classification
Supervised Learning
Supervised learning is a machine learning technique for creating a function from training data.
Documents are categorized into classes whose labels appear in the training data.
The learned function predicts a class label for the input object (called classification).
Techniques used are:
Nearest Neighbor Classifier
Feature Selection
Decision Tree
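To make the nearest neighbor idea concrete, here is a minimal sketch of a k-nearest-neighbor document classifier. The slides do not specify a document representation or similarity measure, so the term-frequency vectors, the cosine similarity, the helper names and the toy training set are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_vector(text):
    # Assumed representation: raw term counts over whitespace-separated tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(training, text, k=3):
    # Rank labeled training documents by similarity to the query document,
    # then take a majority vote among the k most similar ones.
    query = tf_vector(text)
    ranked = sorted(training,
                    key=lambda pair: cosine(tf_vector(pair[0]), query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled training documents.
TRAINING = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sports"),
    ("player scores goal in match", "sports"),
]

if __name__ == "__main__":
    print(knn_classify(TRAINING, "football player scores"))
```

The training set acts as the model: no function is fitted ahead of time, and all the work happens when a new document is classified.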
Feature Selection
Removes terms in the training documents which are
statistically uncorrelated with the class labels
Simple heuristics
Stop words like “a”, “an”, “the”, etc.
Empirically chosen thresholds for discarding “too frequent” or “too rare” terms
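The simple heuristics above can be sketched as a document-frequency filter. This covers only the stop-word and threshold heuristics, not the statistical test against class labels. The threshold values, the function name and the sample documents are assumptions for illustration.

```python
from collections import Counter

# Stop words named on the slide; a real list would be longer.
STOP_WORDS = {"a", "an", "the"}

def select_terms(documents, min_df=2, max_df_ratio=0.7):
    # Document frequency: the number of documents in which a term occurs.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    n = len(documents)
    # Keep terms that are not stop words, not "too rare" (df < min_df)
    # and not "too frequent" (df / n > max_df_ratio).
    return {
        term
        for term, df in doc_freq.items()
        if term not in STOP_WORDS and df >= min_df and df / n <= max_df_ratio
    }

# Hypothetical training documents.
DOCS = [
    "the web is a graph",
    "the web log is large",
    "a graph of the web",
    "the server log is large",
    "web server",
]

if __name__ == "__main__":
    print(sorted(select_terms(DOCS)))
```

With these assumed thresholds, "web" is dropped as too frequent (4 of 5 documents) and "of" as too rare (1 document).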
Document Clustering
Unsupervised learning: a data set of input objects is gathered without class labels, and the documents are grouped by similarity
Hierarchical
Bottom-Up
Top-Down
Partitional
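As an illustration of the bottom-up hierarchical approach, the sketch below starts with one cluster per document and repeatedly merges the two closest clusters. The slides do not name a distance or a linkage rule, so the Jaccard distance on term sets and single-link merging are assumptions, as are the function names and sample documents.

```python
def jaccard_distance(a, b):
    # 1 - |intersection| / |union| over two term sets.
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def agglomerative(documents, num_clusters):
    # Bottom-up clustering: every document starts as its own cluster.
    term_sets = [set(d.lower().split()) for d in documents]
    clusters = [[i] for i in range(len(documents))]
    while len(clusters) > max(num_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single link (assumed): distance of the closest pair of members.
                d = min(jaccard_distance(term_sets[x], term_sets[y])
                        for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Each cluster is a list of document indices.
    return clusters

# Hypothetical unlabeled documents.
DOCS = [
    "web log mining",
    "web log analysis",
    "football match result",
    "football match report",
]

if __name__ == "__main__":
    print(agglomerative(DOCS, 2))
```

A top-down method would go the other way, starting from one cluster that holds every document and splitting it recursively.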
Semi-Supervised Learning
A collection of documents is available
A subset of the collection has known labels
Goal: to label the rest of the collection.
Approach
Train a supervised learner using the labeled subset.
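A minimal sketch of this setup: a learner is trained on the labeled subset and then used to label the rest of the collection. The learner here (per-class term counts with an overlap score) is an assumption chosen for brevity, not the method from the slides; all names and sample documents are hypothetical.

```python
from collections import Counter, defaultdict

def train(labeled):
    # Assumed learner: accumulate term counts for each class label.
    model = defaultdict(Counter)
    for text, label in labeled:
        model[label].update(text.lower().split())
    return model

def predict(model, text):
    # Score each label by the summed counts of the document's terms;
    # iterating labels in sorted order makes ties deterministic.
    words = text.lower().split()
    return max(sorted(model),
               key=lambda label: sum(model[label][w] for w in words))

def label_rest(labeled, unlabeled):
    # Train on the labeled subset, then label the remaining documents.
    model = train(labeled)
    return [(text, predict(model, text)) for text in unlabeled]

# Hypothetical collection: a small labeled subset and the unlabeled rest.
LABELED = [
    ("stock market shares", "finance"),
    ("football match goal", "sports"),
]
UNLABELED = ["market shares rise", "goal in football"]

if __name__ == "__main__":
    for text, label in label_rest(LABELED, UNLABELED):
        print(label, "<-", text)
```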
[Figure: client/server exchange. The CLIENT sends a REQUEST and the SERVER returns a REPLY; the exchange is repeated for Page A and Page B. Date shown in the figure: March 8, 1997.]
Web Usage Mining Model
Web Usage Data Preprocessing
DATA CLEANING
- Clean/Filter raw data to eliminate redundancy
LOGICAL CLUSTERS
- Notion of Single User Transaction
Data Cleaning
There are a variety of files accessed as a result of a request by a
client to view a particular Web page.
These include image, sound and video files, executable CGI files,
coordinates of clickable regions in image map files, and HTML files.
Thus the server logs contain many entries that are redundant or
irrelevant for the data mining tasks.
Example: a browser request for Page1.html also fetches a.gif and b.gif.
This produces 3 entries in the server log for the same user request,
hence redundancy.
Data Cleaning cont…
Server log fields: Hostname | Date : Time | Request
SOLUTION
All log entries with filename suffixes such as gif, jpeg, GIF, JPEG, JPG
and map are removed from the log.
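The cleaning step can be sketched as a suffix filter over parsed log entries. The entry layout (a dict with `host`, `time` and `request` keys) and the function names are assumptions for illustration; real server logs need to be parsed into this shape first.

```python
# Suffixes named on the slide; matching is case-insensitive,
# so GIF, JPEG and JPG are covered as well.
REMOVED_SUFFIXES = ("gif", "jpeg", "jpg", "map")

def is_page_request(path):
    # Ignore any query string, then compare the filename suffix.
    filename = path.split("?", 1)[0]
    suffix = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return suffix not in REMOVED_SUFFIXES

def clean_log(entries):
    # Keep only entries whose requested file is not an embedded resource.
    return [e for e in entries if is_page_request(e["request"])]

# Hypothetical parsed log: one page view that produced three entries.
LOG = [
    {"host": "client1", "time": "10:00:01", "request": "/Page1.html"},
    {"host": "client1", "time": "10:00:01", "request": "/a.gif"},
    {"host": "client1", "time": "10:00:02", "request": "/b.gif"},
]

if __name__ == "__main__":
    for entry in clean_log(LOG):
        print(entry["host"], entry["time"], entry["request"])
```

After cleaning, only the Page1.html entry remains, so the single user request is counted once.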
Logical Clusters
Representation of a Single User Transaction.
One of the significant factors which distinguish Web mining from other
data mining activities is the method used for identifying user transactions
The clustering is based on comparing pairs of log entries and
determining the similarity between them by means of some kind of
distance measure.
PROBLEMS:
Δt = Time Gap
Volume of data
Structural complexity of web sites
User Session: a compact sequence of web accesses by a user
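One common way to turn the time gap Δt into user sessions is to start a new session whenever the gap between consecutive accesses from the same host exceeds a threshold. The sketch below assumes that approach; the 30-minute threshold, the tuple layout `(host, timestamp_in_seconds, page)` and the function name are assumptions, not taken from the slides.

```python
from collections import defaultdict

def sessionize(entries, max_gap_seconds=1800):
    # Group accesses by host, ordered by time.
    by_host = defaultdict(list)
    for host, timestamp, page in sorted(entries, key=lambda e: (e[0], e[1])):
        by_host[host].append((timestamp, page))

    sessions = []
    for host, accesses in by_host.items():
        current = [accesses[0][1]]
        for (prev_t, _), (t, page) in zip(accesses, accesses[1:]):
            # A time gap larger than the threshold closes the current session.
            if t - prev_t > max_gap_seconds:
                sessions.append((host, current))
                current = []
            current.append(page)
        sessions.append((host, current))
    return sessions

# Hypothetical cleaned log entries: (host, seconds since start, page).
ENTRIES = [
    ("h1", 0, "A"),
    ("h1", 60, "B"),
    ("h1", 4000, "C"),
    ("h2", 10, "A"),
]

if __name__ == "__main__":
    for host, pages in sessionize(ENTRIES):
        print(host, pages)
```

Identifying users by hostname alone is a simplification, since proxies and shared machines can map several users to one host.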
Visualization to obtain :
- understanding of the structure of a particular website
- web surfers’ behavior when visiting that site
Due to the large dataset and the structural complexity of the sites, 3D
visual representations are used.
In parallel, Web Server Log files are downloaded and processed through
a sessionizer and a LOGML file is generated.
What is the typical behavior of a user entering our website at page A from
the ‘Discounted Book Sales’ link on a referrer web page B of another web
site?