This document discusses evaluation in information retrieval. It describes standard test collections which consist of a document collection, queries on the collection, and relevance judgments. It also discusses various evaluation measures used in information retrieval like precision, recall, F-measure, mean average precision, and kappa statistic which measure reliability of relevance judgments. R-precision and normalized discounted cumulative gain are also summarized as important single number evaluation measures.
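As a concrete refresher on the set-based measures listed above, here is a minimal sketch of precision, recall, F-measure, and average precision (the document ids are invented for illustration):

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and balanced F-measure."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(ranked, relevant):
    """Precision averaged over the ranks at which relevant documents appear."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

p, r, f1 = precision_recall_f1(["d1", "d2", "d3"], ["d1", "d3", "d4"])
ap = average_precision(["d1", "d2", "d3"], ["d1", "d3", "d4"])
```

Mean average precision is then just this average precision averaged over a set of queries.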
This document discusses link analysis and PageRank, an algorithm for identifying important nodes in large network graphs. It begins with an overview of graph data structures and the goal of identifying influential nodes. It then introduces PageRank, explaining its basic assumptions and showing examples of how it calculates node importance scores. The document discusses problems with the initial PageRank approach and how it was improved with the Complete PageRank algorithm. Finally, it briefly introduces Topic-sensitive PageRank, which aims to identify important nodes related to specific topics.
This document gives a brief description of the three mining techniques, the differences and similarities between them, and finally the techniques they share.
The PageRank algorithm calculates the importance of web pages based on the structure of incoming links. It models a random web surfer that randomly clicks on links, and also occasionally jumps to a random page. Pages are given more importance if they are linked to by other important pages. The algorithm represents this as a Markov chain and computes the PageRank scores through an iterative process until convergence. It has the advantages of being resistant to spam and efficiently pre-computing scores independently of user queries.
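The iterative process described above can be sketched as a plain power iteration (the three-page graph is invented for illustration; dangling pages with no outlinks would need extra handling in a real implementation):

```python
def pagerank(links, damping=0.85, tol=1e-9):
    """Power iteration on the 'random surfer' Markov chain.
    links maps each page to the pages it links to; every page
    is assumed to have at least one outlink."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    while True:
        new = {}
        for p in pages:
            # teleport term plus rank flowing in from every page linking to p
            incoming = sum(ranks[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        if max(abs(new[p] - ranks[p]) for p in pages) < tol:
            return new
        ranks = new

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
```

Because the scores are query-independent, they can be computed offline exactly as the summary notes.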
Web mining is the application of data mining techniques to discover patterns from the World Wide Web. As the name suggests, this is information gathered by mining the web.
This document discusses web usage mining. It begins by defining web mining and its three categories: web content mining, web structure mining, and web usage mining. The main focus is on web usage mining, which involves discovering user navigation patterns and predicting user behavior. The key processes of web usage mining are preprocessing raw data, pattern discovery using algorithms, and pattern analysis. Pattern discovery techniques discussed include statistical analysis, clustering, classification, association rules, and sequential patterns. Potential applications are personalized recommendations, system improvements, and business intelligence. The document concludes by discussing future research directions such as usage mining on the semantic web and analyzing discovered patterns.
The document discusses the World Wide Web and information retrieval on the web. It provides background on how the web was developed by Tim Berners-Lee in 1990 using HTML, HTTP, and URLs. It then discusses some key differences in information retrieval on the web compared to traditional library systems, including the presence of hyperlinks, heterogeneous content, duplication of content, exponential growth in the number of documents, and lack of stability. It also summarizes some challenges in web search including the expanding nature of the web, dynamically generated content, influence of monetary contributions on search results, and search engine spamming.
This document discusses web scraping and data extraction. It defines scraping as converting unstructured data like HTML or PDFs into machine-readable formats by separating data from formatting. Scraping legality depends on the purpose and terms of service - most public data is copyrighted but fair use may apply. The document outlines the anatomy of a scraper including loading documents, parsing, extracting data, and transforming it. It also reviews several scraping tools and libraries for different programming languages.
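The load/parse/extract/transform anatomy outlined above can be sketched with Python's standard-library HTML parser; the `PriceScraper` class and the inline sample page are invented for illustration (in practice the document would be loaded over HTTP first):

```python
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Extracts the text of <span class="price"> elements from a page."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []
    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True
    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False
    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())

# Load: a fetched document stands in for this inline string.
html = ('<ul><li><span class="price">$9.99</span></li>'
        '<li><span class="price">$14.50</span></li></ul>')
scraper = PriceScraper()
scraper.feed(html)                                        # parse + extract
values = [float(p.lstrip("$")) for p in scraper.prices]   # transform
```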
Top (10) challenging problems in data mining (Ahmedasbasb)
This document outlines the top 10 challenging problems in data mining, as presented by Dr. Ali Haroun. It introduces data mining and some common techniques. The top 10 problems are then each described in one or two paragraphs: (1) developing a unifying theory of data mining, (2) scaling up for high dimensional and high speed data, (3) mining sequence and time series data, (4) mining complex knowledge from complex data, (5) data mining in a network setting, (6) distributed data mining and mining multi-agent data, (7) data mining for biological and environmental problems, (8) data mining process-related problems, (9) security, privacy, and data integrity, and (10) dealing with non-static, unbalanced, and cost-sensitive data.
The PageRank and HITS techniques are used for ranking the relevance of web pages through analysis of the hyperlink structure that links pages together.
Text mining seeks to extract useful information from unstructured text documents. It involves preprocessing the text, identifying features, and applying techniques from data mining, machine learning and natural language processing to discover patterns. The core operations of text mining include analyzing distributions of concepts, identifying frequent concept sets and associations between concepts. Text mining systems aim to analyze document collections over time to identify trends, ephemeral relationships and anomalous patterns.
Data mining is the process of automatically discovering useful information from large data sets. It draws from machine learning, statistics, and database systems to analyze data and identify patterns. Common data mining tasks include classification, clustering, association rule mining, and sequential pattern mining. These tasks are used for applications like credit risk assessment, fraud detection, customer segmentation, and market basket analysis. Data mining aims to extract unknown and potentially useful patterns from large data sets.
This document presents an overview of web mining techniques. It discusses how web mining uses data mining algorithms to extract useful information from the web. The document classifies web mining into three categories: web structure mining, web content mining, and web usage mining. It provides examples and explanations of techniques for each category such as document classification, clustering, association rule mining, and sequential pattern mining. The document also discusses opportunities and challenges of web mining as well as sources of web usage data like server logs.
This document provides an overview of text mining and web mining. It defines data mining and describes the common data mining tasks of classification, clustering, association rule mining and sequential pattern mining. It then discusses text mining, defining it as the process of analyzing unstructured text data to extract meaningful information and structure. The document outlines the seven practice areas of text mining as search/information retrieval, document clustering, document classification, web mining, information extraction, natural language processing, and concept extraction. It provides brief descriptions of the problems addressed within each practice area.
This document discusses data mining and its applications. It defines data mining as using algorithms to discover patterns in large data sets beyond simple analysis. It then provides examples of data mining applications, including market basket analysis, education, manufacturing, customer relationship management, fraud detection, research analysis, criminal investigation, and bioinformatics. The document also outlines the typical stages of the data mining process: data understanding, data preparation, modeling, evaluation, and deployment.
Graph mining analyzes structured data like social networks and the web through graph search algorithms. It aims to find frequent subgraphs using Apriori-based or pattern growth approaches. Social networks exhibit characteristics like densification and heavy-tailed degree distributions. Link mining analyzes heterogeneous, multi-relational social network data through tasks like link prediction and group detection, facing challenges of logical vs statistical dependencies and collective classification. Multi-relational data mining searches for patterns across multiple database tables, including multi-relational clustering that utilizes information across relations.
This document provides an overview of the PageRank algorithm. It begins with background on PageRank and its development by Brin and Page. It then introduces the concepts behind PageRank, including how it uses the link structure of webpages to determine importance. The core PageRank algorithm is explained, modeling the web as a graph and calculating page importance based on both the number and quality of inbound links. Iterative methods like power iteration are described for approximating solutions. Examples are given to illustrate PageRank calculations over multiple iterations. Implementation details, applications, advantages/disadvantages are also discussed at a high level. Pseudocode is included.
This document discusses text and web mining. It defines text mining as analyzing huge amounts of text data to extract information. It discusses measures for text retrieval like precision and recall. It also covers text retrieval and indexing methods like inverted indices and signature files. Query processing techniques and ways to reduce dimensionality like latent semantic indexing are explained. The document also discusses challenges in mining the world wide web due to its size and dynamic nature. It defines web usage mining as collecting web access information to analyze paths to accessed web pages.
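A minimal sketch of the inverted-index idea mentioned above (the toy document collection and helper names are invented for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "web mining basics", 2: "text mining basics", 3: "web usage logs"}
index = build_inverted_index(docs)
# A conjunctive query intersects the posting lists of its terms
hits = set(index["web"]) & set(index["mining"])
```

Precision and recall can then be measured against such a query's results exactly as the summary describes.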
This document provides an overview of web mining and summarizes key concepts. It begins with definitions of data mining and web mining. The document then discusses three categories of web mining: web content mining, web usage mining, and web structure mining. Various matrix expressions used to represent web data are also introduced, including document-keyword co-occurrence matrices, adjacency matrices, and usage matrices. Finally, two common similarity functions - Pearson correlation coefficient and cosine similarity - are outlined.
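The two similarity functions mentioned above can be sketched as follows; the Pearson correlation is computed here as cosine similarity of mean-centred vectors, and the sample vectors are invented for illustration:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(u, v):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine_similarity([a - mu for a in u], [b - mv for b in v])

u, v = [1, 2, 3, 4], [2, 4, 6, 8]
# v is a scaled copy of u, so both measures reach their maximum of 1
cos_uv = cosine_similarity(u, v)
r_uv = pearson(u, v)
```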
This document provides an overview of how to become a data scientist from scratch. It discusses the key skills needed, which include mathematics/statistics, computer programming, and business knowledge. It then covers various topics required for a data science career like mathematics, programming languages, data wrangling, analysis, machine learning, deep learning, big data, and additional skills like NLP and CV. The document also lists learning outcomes, best online resources, blogs, books, and packages to learn data science from the ground up.
Data Mining: What is Data Mining?
History
How Data Mining Works
Data Mining Techniques
Data Mining Process (The Cross-Industry Standard Process)
Data Mining: Applications
Advantages and Disadvantages of Data Mining
Conclusion
The document provides an overview of data warehousing and OLAP technology. It defines a data warehouse as a subject-oriented, integrated collection of historical data used for analysis and decision making. It describes key properties of data warehouses including being subject-oriented, integrated, time-variant, and non-volatile. It also discusses dimensional modeling, data cubes, and OLAP for analyzing aggregated data.
Web mining involves applying data mining techniques to automatically discover and extract information from web documents and services. It has three main types: web content mining, which extracts useful information from web document contents; web structure mining, which analyzes the hyperlink structure of websites; and web usage mining, which involves discovering patterns from user interactions on websites. Popular algorithms for web mining include PageRank for web structure mining and HITS for determining both hub and authority pages.
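As a rough illustration of the hub/authority idea behind HITS mentioned above, here is a minimal sketch of the iterative updates (the tiny example graph is invented for illustration):

```python
import math

def hits(links, iterations=50):
    """Iterative hub/authority updates with L2 normalisation each round."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority of p: sum of hub scores of pages linking to p
        auth = {p: sum(hub[q] for q in links if p in links.get(q, []))
                for p in pages}
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {p: a / norm for p, a in auth.items()}
        # hub of p: sum of authority scores of the pages p links to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(h * h for h in hub.values()))
        hub = {p: h / norm for p, h in hub.items()}
    return hub, auth

graph = {"A": ["B", "C"], "B": ["C"], "C": []}
hub, auth = hits(graph)
```

Here A links out heavily and so scores as a hub, while C is heavily linked to and so scores as an authority.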
The document summarizes a technical seminar on web-based information retrieval systems. It discusses information retrieval architecture and approaches, including syntactical, statistical, and semantic methods. It also covers web search analysis techniques like web structure analysis, content analysis, and usage analysis. The document outlines the process of web crawling and types of crawlers. It discusses challenges of web structure, crawling and indexing, and searching. Finally, it concludes that as unstructured online information grows, information retrieval techniques must continue to improve to leverage this data.
This document discusses web mining and its various types. Web mining involves using data mining techniques to discover useful information from web documents and usage patterns. It can involve content mining of text, images, video and audio to extract useful information. It also includes structure mining, which analyzes the hyperlink structure between documents and within documents. Additionally, web usage mining analyzes log files from web servers and applications to discover interesting usage patterns. The document outlines the differences between traditional data mining and web mining. It provides examples of applications of web mining such as information retrieval, network management and e-commerce.
Supervised vs Unsupervised vs Reinforcement Learning | Edureka
YouTube: https://youtu.be/xtOg44r6dsE
In this PPT on Supervised vs Unsupervised vs Reinforcement learning, we’ll be discussing the types of machine learning and we’ll differentiate them based on a few key parameters. The following topics are covered in this session:
1. Introduction to Machine Learning
2. Types of Machine Learning
3. Supervised vs Unsupervised vs Reinforcement learning
4. Use Cases
This document provides an overview of machine learning including: definitions of machine learning; types of machine learning such as supervised learning, unsupervised learning, and reinforcement learning; applications of machine learning such as predictive modeling, computer vision, and self-driving cars; and current trends and careers in machine learning. The document also briefly profiles the history and pioneers of machine learning and artificial intelligence.
1. The document proposes techniques to improve search performance by matching schemas between structured and unstructured data sources.
2. It involves constructing schema mappings using named entities and schema structures. It also uses strategies to narrow the search space to relevant documents.
3. The techniques were shown to improve search accuracy and reduce time/space complexity compared to existing methods.
ODAM: an optimized distributed association rule mining algorithm (synopsis) (Mumbai Academisc)
This document proposes ODAM, an optimized distributed association rule mining algorithm. It aims to discover rules based on higher-order associations between items in distributed textual documents that are neither vertically nor horizontally distributed, but rather a hybrid of the two. Modern organizations have geographically distributed data stored locally at each site, making centralized data mining infeasible due to high communication costs. Distributed data mining emerged to address this challenge. ODAM reduces communication costs compared to previous distributed ARM algorithms by mining patterns across distributed databases without requiring data consolidation.
Kuan-ming Lin is interested in data mining, particularly mining biological databases, web documents, and the semantic web. He has skills in data mining techniques including machine learning, feature selection, and support vector machines. He has published papers on data integration of microarray data and structure prediction of HIV coreceptors. He hopes to continue a career in data mining and cloud computing.
Data mining refers to the process of analysing data from different perspectives and summarizing it into useful information.
Data mining software is one of a number of tools used for analysing data. It allows users to analyse data from many different dimensions and angles, categorize it, and summarize the relationships identified.
Data mining is about techniques for finding and describing structural patterns in data.
Data mining is the process of finding correlations or patterns among fields in large relational databases.
It is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
The document provides an introduction to information retrieval, including its history, key concepts, and challenges. It discusses how information retrieval aims to retrieve relevant documents from a collection to satisfy a user's information need. The main challenge in information retrieval is determining relevance, as relevance depends on personal assessment and can change based on context, time, location, and device. The document outlines the major issues and developments in the field over time from the 1950s to present day.
The document provides an introduction to information retrieval, including its history, key concepts, and challenges. It discusses how information retrieval aims to retrieve relevant documents from a collection to satisfy a user's information need. The main challenge in information retrieval is determining relevance, as relevance depends on personal assessment, task, context, time, location, and device. Three main issues in information retrieval are determining relevance, representing documents and queries, and developing effective retrieval models and algorithms.
This document discusses different types of web mining techniques. It begins by defining web mining as the application of data mining techniques to discover and extract information from web data. The three main types of web mining are discussed as web content mining, web structure mining, and web usage mining. Web content mining involves mining the actual contents within web pages and documents. Web structure mining mines the hyperlink structure of websites to determine how web pages are linked together. Web usage mining mines web server logs to discover user browsing patterns and behaviors.
A web content mining application for detecting relevant pages using Jaccard ... (IJECE, IAES)
The tremendous growth in text data available from a variety of sources raises many obstacles to discovering meaningful information. This technological advance has dispersed texts across millions of web sites, and unstructured texts are densely packed with information, so discovering valuable and interesting relationships in them demands considerable computer processing. Text mining has therefore developed into an attractive area of study for obtaining organized and useful data. One purpose of this research is to discuss text pre-processing in the automobile marketing domain in order to create a structured database. Regular expressions were used to extract data from unstructured vehicle advertisements, resulting in a well-organized database; the authors manually developed rule-based methods for extracting structured data from unstructured web pages. The information retrieved from these advertisements supports a systematic search for noteworthy attributes. There are numerous approaches to query recommendation, and it is vital to understand which one should be employed. This research therefore also attempts to determine the optimal similarity value for query suggestions based on user-supplied parameters by comparing MySQL pattern matching and Jaccard similarity.
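As a rough sketch of token-set Jaccard similarity of the kind the paper compares against MySQL pattern matching (the advertisement strings below are invented, not taken from the paper's data):

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the word-token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

ads = ["toyota corolla 2015 low mileage",
       "honda civic 2015",
       "toyota corolla 2018"]
query = "toyota corolla"
# Rank stored advertisements by similarity to the user's query
ranked = sorted(ads, key=lambda ad: jaccard(query, ad), reverse=True)
```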
Business Intelligence: A Rapidly Growing Option through Web MiningIOSR Journals
This document discusses web mining techniques for business intelligence. It begins with an introduction to web mining and its subfields of web content mining, web structure mining, and web usage mining. It then focuses on web usage mining, describing the process of preprocessing log data, discovering patterns using techniques like statistical analysis and association rule mining, and analyzing the patterns. The goal is to understand customer behavior and improve business functions like marketing through data collected from web servers, proxy servers, and clients.
A Study Web Data Mining Challenges And Application For Information ExtractionScott Bou
This document discusses challenges in web data mining for information extraction. It outlines how web data varies from structured to unstructured, posing challenges for data mining techniques. Some key challenges discussed are the quality of keyword-based searches, effectively extracting information from the deep web which contains searchable databases, limitations of manually constructed directories, and the need for semantics-based queries. The document argues that addressing these challenges will require improved web mining techniques to fully utilize the vast information available on the web.
This document discusses interactive visualization techniques for information retrieval. It begins by stating that information retrieval systems often return many results, some more relevant than others. While search engines have grown, problems remain with low precision and recall. Visualization techniques can help users better understand retrieval results. The document then reviews several visualization methods like tree views, title views, and bubble views that can enhance web information retrieval systems by helping users browse, filter, and reformulate queries. It argues visualization is an effective tool for dealing with large numbers of documents returned in web searches.
The International Journal of Engineering and Science (The IJES)theijes
The document provides an overview of various web content mining tools. It begins with an introduction to web mining, distinguishing between web structure mining, web content mining, and web usage mining. It then discusses web content mining in more detail. The document proceeds to describe several specific web content mining tools - Screen-scraper, Automation Anywhere 6.1, Web Info Extractor, Mozenda, and Web Content Extractor. It provides details on the features and capabilities of each tool. Finally, the document concludes by comparing the tools based on usability, ability to record data, and capability to extract structured and unstructured web data.
International Journal of Engineering Research and Development (IJERD)IJERD Editor
journal publishing, how to publish research paper, Call For research paper, international journal, publishing a paper, IJERD, journal of science and technology, how to get a research paper published, publishing a paper, publishing of journal, publishing of research paper, reserach and review articles, IJERD Journal, How to publish your research paper, publish research paper, open access engineering journal, Engineering journal, Mathemetics journal, Physics journal, Chemistry journal, Computer Engineering, Computer Science journal, how to submit your paper, peer reviw journal, indexed journal, reserach and review articles, engineering journal, www.ijerd.com, research journals,
yahoo journals, bing journals, International Journal of Engineering Research and Development, google journals, hard copy of journal
Abstract: In many fields, such as industry, commerce, government, and education, knowledge discovery and data
mining can be immensely valuable to the subject of Artificial Intelligence. Because of the recent increase in
demand for KDD techniques, such as those used in machine learning, databases, statistics, knowledge acquisition,
data visualisation, and high performance computing, knowledge discovery and data mining have grown in
importance. By employing standard formulas for computational correlations, we hope to create an integrated
technique that can be used to filter web world social information and find parallels between similar tastes of
diverse user information in a variety of settings
The document summarizes techniques for web mining, which involves mining web content, structure, and usage data. Web content mining extracts useful information from web page content and structures. Web structure mining analyzes the hyperlink structure between pages to determine important pages and group similar pages. Web usage mining analyzes server logs to discover general access patterns and customize websites for individual users based on their behavior. Text mining extends traditional data mining to unstructured text data through features like word occurrences and relationships.
Web Mining for an Academic Portal: The case of Al-Imam Muhammad Ibn Saud Isla...IOSR Journals
This document discusses using web mining techniques like association rule mining to build an academic portal for Al-Imam Muhammad Ibn Saud Islamic University. It proposes building an information system where web data mining and semantic web technologies are applied using association rule algorithms. This would allow building ontologies for new knowledge and classifying that knowledge to add to composed knowledge databases. The paper examines using techniques like association rule mining on web server logs and document contents and structures to extract patterns and associate web pages and documents. This could help build a semantic portal and retrieve integrated information through the portal.
This document provides a literature survey and comparison of different techniques for web mining, including web structure mining, web usage mining, and web content mining. It summarizes various page ranking algorithms and models like PageRank, Weighted PageRank, HITS, General Utility Mining, and Topological Frequency Utility Mining. The document compares these algorithms and models based on the type of web mining activity, whether they consider website topology, their processing approach, and limitations. It aims to help compare techniques for analyzing the structure, usage, and content of websites.
Here is documentation part for the night vision technology which is a latest everything technology.Have the look at these and enjoy the documentation,for more updates please follow me @shobha rani
Hi guys , here is new presentation which is related to password authentication named as Graphical Password Authentication.Here i have covered all the topics which are related to GPA .I will also provide a documentation regarding this topic if u need .So please comment below for the document and fallow @shobha rani
Here is a another presentation based on latest data storage technology which is called as 3D optical data storage.here i have covered all the related topics.If u need documentation for this presentation please let me know in n=below comments.so that i will share u @shobha rani.
Here is a Presentation regarding web mining which is a blooming technology in the industry,here i have covered all the topics required for presentation. Hope u enjoy it.Please encourage to post more presentation documents.I can provide u the document also ,if anyone need comment below.
This document provides an overview of night vision technology. It discusses the history of night vision beginning in Germany in the 1930s. It describes how night vision works using either thermal imaging or image enhancement to detect infrared light. The document outlines the different generations of night vision devices and their improvements. It lists common night vision equipment like scopes, goggles, and cameras. Applications of night vision technology include military, hunting, surveillance, and automobiles. The future of night vision may allow sharing images between devices over long distances.
Hi guys this a PPT for brain gate technology which is a blooming technology in industry.Here i have explained what are req for presentation.Hope u enjoy and if u like my presentation please encourage me by fallowing,commenting and liking.If u need documentation part please comment below.
PPT based on Human Computer Interface whch is easier to understand and carryout the presentation in conferences..if u need documentation please make a comment down...enjoy the ppt..have a good luck
This document provides an overview of cluster computing. It defines a computer cluster as a set of loosely or tightly connected computers that work together to perform tasks like a single system. The document outlines how a cluster works by distributing tasks from a job to nodes in the cluster. It describes different types of clusters and components like nodes and networks. Advantages include reduced costs, scalability and availability, while disadvantages include need for parallel programming skills and more complex administration. Limitations include high latency and low bandwidth between nodes. In conclusion, cluster computing provides good price to performance and reliability due to lack of single point of failure.
Multi-currency in odoo accounting and Update exchange rates automatically in ...Celine George
Most business transactions use the currencies of several countries for financial operations. For global transactions, multi-currency management is essential for enabling international trade.
How to manage Multiple Warehouses for multiple floors in odoo point of saleCeline George
The need for multiple warehouses and effective inventory management is crucial for companies aiming to optimize their operations, enhance customer satisfaction, and maintain a competitive edge.
Exploring Substances:
Acidic, Basic, and
Neutral
Welcome to the fascinating world of acids and bases! Join siblings Ashwin and
Keerthi as they explore the colorful world of substances at their school's
National Science Day fair. Their adventure begins with a mysterious white paper
that reveals hidden messages when sprayed with a special liquid.
In this presentation, we'll discover how different substances can be classified as
acidic, basic, or neutral. We'll explore natural indicators like litmus, red rose
extract, and turmeric that help us identify these substances through color
changes. We'll also learn about neutralization reactions and their applications in
our daily lives.
by sandeep swamy
The ever evoilving world of science /7th class science curiosity /samyans aca...Sandeep Swamy
The Ever-Evolving World of
Science
Welcome to Grade 7 Science4not just a textbook with facts, but an invitation to
question, experiment, and explore the beautiful world we live in. From tiny cells
inside a leaf to the movement of celestial bodies, from household materials to
underground water flows, this journey will challenge your thinking and expand
your knowledge.
Notice something special about this book? The page numbers follow the playful
flight of a butterfly and a soaring paper plane! Just as these objects take flight,
learning soars when curiosity leads the way. Simple observations, like paper
planes, have inspired scientific explorations throughout history.
GDGLSPGCOER - Git and GitHub Workshop.pptxazeenhodekar
This presentation covers the fundamentals of Git and version control in a practical, beginner-friendly way. Learn key commands, the Git data model, commit workflows, and how to collaborate effectively using Git — all explained with visuals, examples, and relatable humor.
Dr. Santosh Kumar Tunga discussed an overview of the availability and the use of Open Educational Resources (OER) and its related various issues for various stakeholders in higher educational Institutions. Dr. Tunga described the concept of open access initiatives, open learning resources, creative commons licensing attribution, and copyright. Dr. Tunga also explained the various types of OER, INFLIBNET & NMEICT initiatives in India and the role of academic librarians regarding the use of OER.
In this ppt I have tried to give basic idea about Diabetic peripheral and autonomic neuropathy ..from Levine textbook,IWGDF guideline etc
Hope it will b helpful for trainee and physician
Unit 5: Dividend Decisions and its theoriesbharath321164
decisions: meaning, factors influencing dividends, forms of dividends, dividend theories: relevance theory (Walter model, Gordon model), irrelevance theory (MM Hypothesis)
Ultimate VMware 2V0-11.25 Exam Dumps for Exam SuccessMark Soia
Boost your chances of passing the 2V0-11.25 exam with CertsExpert reliable exam dumps. Prepare effectively and ace the VMware certification on your first try
Quality dumps. Trusted results. — Visit CertsExpert Now: https://www.certsexpert.com/2V0-11.25-pdf-questions.html
Envenomation is the process by which venom is injected by the bite or sting of a venomous animal such as a snake, scorpion, spider, or insect. Arthropod bite is nothing but a sharp bite or sting by ants, fruit flies, bees, beetles, moths, or hornets. Though not a serious condition, arthropod bite can be extremely painful, with redness and mild to severe swelling around the site of the bite
Geography Sem II Unit 1C Correlation of Geography with other school subjectsProfDrShaikhImran
The correlation of school subjects refers to the interconnectedness and mutual reinforcement between different academic disciplines. This concept highlights how knowledge and skills in one subject can support, enhance, or overlap with learning in another. Recognizing these correlations helps in creating a more holistic and meaningful educational experience.
Geography Sem II Unit 1C Correlation of Geography with other school subjectsProfDrShaikhImran
Web Mining
SEMINAR
ON
WEB MINING
Abstract: With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This covers the automatic search of information resources available online, i.e. Web content mining, and the discovery of user access patterns from Web servers, i.e. Web usage mining. In this paper we present a detailed statistical formulation and experimental results to show how web mining can be utilized to identify potential customers. Web usability is an important and sometimes controversial research area. We propose an integrated system for web mining and usability study where four core modules are designed to address the fundamental issues in usability analysis. As an example of cross-module analysis, we apply association rule mining to the link structure obtained from the web mining module to automatically discover menus and structures in a web site.
Keywords: web mining, potential customer.
1. INTRODUCTION
With the explosive growth of information sources
available on the World Wide Web, it has become
increasingly necessary for users to utilize auto-
mated tools in order to find, extract, filter, and
evaluate the desired information and resources. In
addition, with the transformation of the web into
the primary tool for electronic commerce, it is
imperative for organizations and companies, who
have invested millions in Internet and Intranet
technologies, to track and analyze user access
patterns. These factors give rise to the necessity of
creating server-side and client-side intelligent
systems that can
effectively mine for knowledge both across the
Internet and in particular web localities.
At present most users rely on search engines, such as www.google.com, to find the information they require. However, the target of a Web search engine is only to discover resources on the Web. Each search engine has its own characteristics and employs different algorithms to index, rank, and present web documents. But because all these search engines are built on exact keyword matching, and their query languages are artificial, with restricted syntax and vocabulary rather than natural language, there are defects that no search engine can overcome.
Narrow search scope: Web pages indexed by any search engine are only a tiny part of all the pages on the WWW, and the pages returned when a user submits a query are another tiny part of the pages the search engine has indexed.
Low precision: The user cannot browse all the pages one by one, and most pages are irrelevant to the user's intent; they are highlighted and returned by the search engine merely because they contain the keywords.
Web mining techniques could be used to solve the information overload problem directly or indirectly. However, Web mining techniques are not the only tools. Techniques and work from other research areas, such as Databases (DB), Information Retrieval (IR), Natural Language Processing (NLP), and the Web document community, could also be used.
Information retrieval
Information retrieval is the art and science of
searching for information in documents, searching
for documents themselves, searching for metadata
which describes documents, or searching within
databases, whether relational standalone databases
or hypertext networked databases such as the
Internet or intranets, for text, sound, images or
data.
Natural language processing
Natural language processing (NLP) is concerned
with the interactions between computers and
human (natural) languages. NLP is a form of
human-to-computer interaction where the
elements of human language, be it spoken or
written, are formalized so that a computer can
perform value-adding tasks based on that
interaction.
Natural language understanding is sometimes
referred to as an AI-complete problem, because
natural-language recognition seems to require
extensive knowledge about the outside world and
the ability to manipulate it.
The purpose of Web mining is to develop methods
and systems for discovering models
of objects and processes on the World Wide Web
and for web-based systems that show adaptive
performance. Web Mining integrates three parent areas: Data Mining (we use this term here also for the closely related areas of Machine Learning and Knowledge Discovery), Internet technology and the World Wide Web, and the more recent Semantic Web.
The World Wide Web has made an enormous
amount of information electronically accessible.
The use of email, news and markup languages like
HTML allow users to publish and read documents
at a world-wide scale and to communicate via chat
connections, including information in the form of
images and voice records. The HTTP protocol that
enables access to documents over the network via
Web browsers created an immense improvement
in communication and access to information. For
some years these possibilities were used mostly in
the scientific world but recent years have seen an
immense growth in popularity, supported by the
wide availability of computers and broadband
communication. The use of the internet for tasks other than finding information and direct communication is increasing, as can be seen from the interest in "e-activities" such as e-commerce, e-learning, e-government, and e-science.
Independently of the development of the Internet,
Data Mining expanded out of the academic world
into industry. Methods and their potential became
known outside the academic world and
commercial toolkits became available that allowed
applications at an industrial scale. Numerous
industrial applications have shown that models
can be constructed from data for a wide variety of
industrial problems. The World-Wide Web is an
interesting area for Data Mining because huge
amounts of information are available. Data
Mining methods can be used to analyze the
behavior of individual users, access patterns of
pages or sites, properties of collections of
documents.
Almost all standard data mining methods are
designed for data that are organized as multiple
“cases” that are comparable and can be viewed as
instances of a single pattern, for example patients
described by a fixed set of symptoms and
diseases, applicants for loans, customers of a shop.
A “case” is typically described by a fixed set of
features (or variables). Data on the Web have a
different nature. They are not so easily
comparable and have the form of free text, semi-
structured text (lists, tables) often with images and
hyperlinks, or server logs. The aim to learn
models of documents has given rise to the interest
in Text Mining methods for modeling documents
in terms of properties of documents. Learning
from the hyperlink structure has given rise to
graph-based methods, and server logs are used to
learn about user behavior.
Instead of searching for a document that matches
keywords, it should be possible to combine
information to answer questions. Instead of
retrieving a plan for a trip to Hawaii, it should be
possible to automatically construct a travel plan
that satisfies certain goals and uses opportunities
that arise dynamically. This gives rise to a wide
range of challenges. Some of them concern the
infrastructure, including the interoperability of
systems and the languages for the exchange of
information rather than data. Many challenges are
in the area of knowledge representation, discovery
and engineering. They include the extraction of
knowledge from data and its representation in a
form understandable by arbitrary parties, the
intelligent questioning and the delivery of answers
to problems as opposed to conventional queries
and the exploitation of formerly extracted
knowledge in this process.
2. WEB MINING
Web mining is the integration of information
gathered by traditional data mining methodologies
and techniques with information gathered over the
World Wide Web.
Data mining is also called knowledge discovery and data mining (KDD). It is the extraction of useful patterns from data sources, e.g. databases, texts, the web, images, etc. Patterns must be valid, novel, potentially useful, and understandable. The classic data mining tasks are:
Classification: mining patterns that can classify future (new) data into known classes.
Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items. E.g., {Cheese, Milk} → Bread [sup = 5%, conf = 80%].
Clustering: identifying a set of similarity groups in the data.
Sequential pattern mining: a sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.
Fig. 1 The Data Mining (KDD) Process
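As a toy illustration of the association-rule task above, the support and confidence of a rule such as {Cheese, Milk} → Bread can be computed directly over a small transaction list (the baskets below are invented for the example):

```python
# Toy illustration: support and confidence of a rule X -> Y
# over a list of transactions (market baskets).

def rule_stats(transactions, x, y):
    """Return (support, confidence) of the rule X -> Y."""
    x, y = set(x), set(y)
    n = len(transactions)
    both = sum(1 for t in transactions if x | y <= set(t))
    antecedent = sum(1 for t in transactions if x <= set(t))
    support = both / n
    confidence = both / antecedent if antecedent else 0.0
    return support, confidence

baskets = [
    ["Cheese", "Milk", "Bread"],
    ["Cheese", "Milk", "Bread", "Eggs"],
    ["Milk", "Bread"],
    ["Cheese", "Milk"],
]
sup, conf = rule_stats(baskets, ["Cheese", "Milk"], ["Bread"])
print(f"sup={sup:.0%}, conf={conf:.0%}")  # sup=50%, conf=67%
```

A rule is usually kept only when both figures exceed user-chosen minimum support and confidence thresholds, which is exactly what the bracketed [sup, conf] values in the example rule express.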
Just as data mining aims at discovering valuable
information that is hidden in conventional
databases, the emerging field of web mining aims
at finding and extracting relevant information that
is hidden in Web-related data, in particular hyper-
text documents published on the Web. Web
Mining is the extraction of interesting and
potentially useful patterns and implicit
information from artifacts or activity related to the
World Wide Web. There are roughly three
knowledge discovery domains that pertain to web
mining: Web Content Mining, Web Structure
Mining, and Web Usage Mining. Web content
mining is the process of extracting knowledge
from the content of documents or their
descriptions. Web document text mining, resource
discovery based on concepts indexing or agent
based technology may also fall in this category.
Web structure mining is the process of inferring
knowledge from the World Wide Web
organization and links between references and
referents in the Web. Finally, web usage mining,
also known as Web Log Mining, is the process of
extracting interesting patterns in web access logs.
The Web is a collection of inter-related files on one or more Web servers. Web mining is a multi-disciplinary effort that draws techniques from
fields like information retrieval, statistics,
machine learning, natural language processing,
and others. Web mining has a different character from traditional data mining. First, the objects of Web mining are a large number of Web documents which are heterogeneously distributed, and each data source is itself heterogeneous; second, the Web document itself is semi-structured or unstructured and lacks semantics that a machine can understand.
3. HISTORY
The term "Web Mining" was first used in [E1996], where it was defined in a 'task-oriented' manner; an alternate 'data-oriented' definition was given in [CMS1997]. The first panel discussion on the topic was held at ICTAI 1997 [SM1997], and it remains a continuing forum:
WebKDD workshops with ACM SIGKDD, 1999, 2000, 2001, 2002, …; 60–90 attendees
SIAM Web analytics workshop 2001,
2002, …
Special issues of DMKD journal,
SIGKDD Explorations
Papers in various data mining conferences
& journals
Surveys [MBNL 1999, BL 1999,
KB2000]
This area of research is so huge today due to the
tremendous growth of information sources
available on the Web and the recent interest in e-
commerce. Web mining is used to understand
customer behavior, evaluate the effectiveness of a
particular Web site, and help quantify the success
of a marketing campaign.
3.1. Web Mining Subtasks
Web mining can be decomposed into the following subtasks:
1. Resource finding: the task of retrieving
intended Web documents. By resource
finding we mean the process of retrieving
the data that is either online or offline from
the text sources available on the web such
as electronic newsletters, electronic
newswire, the text contents of HTML
documents obtained by removing HTML
tags, and also the manual selection of Web
resources.
2. Information selection and pre-
processing: automatically selecting and
pre-processing specific information from
retrieved Web resources. It is a kind of
transformation processes of the original
data retrieved in the IR process. These
transformations could be either a kind of
pre-processing that are mentioned above
such as stop words, stemming, etc. or a
pre-processing aimed at obtaining the
desired representation such as finding
phrases in the training corpus,
transforming the representation to
relational or first order logic form, etc.
3. Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites. Machine
learning or data mining techniques are
typically used in the process of
generalization. Humans play an important
role in the information or knowledge
discovery process on the Web since the
Web is an interactive medium.
4. Analysis: validating and/or interpretation
of the mined patterns.
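The stop-word removal and stemming mentioned in subtask 2 can be sketched in a few lines. The stop-word list and suffix rules below are deliberately minimal illustrations, not a real stemmer such as Porter's:

```python
# Minimal sketch of text pre-processing: stop-word removal plus a
# crude suffix-stripping "stemmer" (word lists are assumed/toy).

STOP_WORDS = {"the", "of", "and", "to", "a", "is", "in"}

def crude_stem(word):
    # Strip a few common English suffixes, keeping a minimum stem length.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [w.lower() for w in text.split()]
    return [crude_stem(w) for w in tokens if w not in STOP_WORDS]

print(preprocess("Mining the usage patterns of Web documents"))
# ['min', 'usage', 'pattern', 'web', 'document']
```

In a real system this step would use a proper tokenizer and an established stemmer, but the shape of the transformation, raw text in, normalized term list out, is the same.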
4. CHALLENGES OF WEB
MINING
1. Today World Wide Web is flooded with
billions of static and dynamic web pages
created with programming languages such
as HTML, PHP and ASP. It is a significant challenge to find useful and relevant information on the web.
2. Creating knowledge from available
information.
3. As the coverage of information is very
wide and diverse, personalization of the
information is a tedious process.
4. Learning customer and individual user
patterns.
5. Complexity of Web pages far exceeds the
complexity of any conventional text
document. Web pages on the internet lack
uniformity and standardization.
6. Much of the information present on web is
redundant, as the same piece of
information or its variant appears in many
pages.
7. The web is noisy i.e. a page typically
contains a mixture of many kinds of
information like, main content,
advertisements, copyright notice,
navigation panels.
8. The web is dynamic, information keeps on
changing constantly. Keeping up with the
changes and monitoring them are very
important.
9. The Web is not only about disseminating information but also about services. Many Web sites and pages enable people to perform operations with input parameters, i.e., they provide services.
10. The most important challenge faced is
Invasion of Privacy. Privacy is considered
lost when information concerning an
individual is obtained, used, or
disseminated, when it occurs without their
knowledge or consent.
Techniques to Address the Problem
4.1 Preprocessing Technique: Web Robots
When attempting to detect web robots from a
stream it is desirable to monitor both the Web
server log and activity on the client-side. What we
are looking for is to distinguish single Web
sessions from each other. A Web session is a
series of requests to web pages, i.e. visits to web
pages. Since the navigation patterns of web robots differ from those of human users, the contribution from web robots has to be eliminated before proceeding with any further data mining, i.e. when we are looking into the web usage behaviour of real users.
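A minimal sessionization sketch is shown below, assuming time-sorted (ip, timestamp, url) log entries and a 30-minute inactivity timeout; both the log format and the timeout are illustrative assumptions:

```python
# Hypothetical sketch: split server-log hits into Web sessions.
# Requests from the same client separated by more than 30 minutes
# of inactivity start a new session.

from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

def sessionize(hits):
    """hits: list of (client_ip, timestamp, url), assumed time-sorted."""
    sessions = {}  # ip -> list of sessions, each a list of (time, url)
    for ip, ts, url in hits:
        user = sessions.setdefault(ip, [])
        if user and ts - user[-1][-1][0] <= TIMEOUT:
            user[-1].append((ts, url))       # continue current session
        else:
            user.append([(ts, url)])         # start a new session
    return sessions

t0 = datetime(2024, 1, 1, 10, 0)
hits = [
    ("1.2.3.4", t0, "/index.html"),
    ("1.2.3.4", t0 + timedelta(minutes=5), "/page2.html"),
    ("1.2.3.4", t0 + timedelta(hours=2), "/index.html"),  # new session
]
print(len(sessionize(hits)["1.2.3.4"]))  # 2
```

Once hits are grouped into sessions like this, robot sessions can be filtered out and the remaining sessions fed to the pattern-mining stage.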
One problem with identifying web robots is
that they might hide their identity behind a facade
looking a lot like conventional web browsers.
Standard approaches to robot detection will fail to
detect camouflaged web robots. As web robots are used for tasks like website indexing (e.g. by Google) or the detection of broken links, they have to exist. There is a special file on each domain called "robots.txt" which, according to the Robot Exclusion Standard [M. Koster, 1994], is examined by a robot in order to prevent it from visiting certain pages of no interest. Malicious web robots, however, are not guaranteed to follow the advice in robots.txt.
The classes chosen for evaluation are Temporal
Features, Page Features, Communication Features
and Path Features. It is desirable to be able to detect the presence of a web robot after as few requests as possible; this is of course a trade-off between computational effort and result accuracy. A simple decision model for determining the class of a visitor is as follows: first, check whether the visitor requested robots.txt, in which case it is labeled as a robot; second, match the visitor against a list of previously known robots; third, search for the empty referer "-", since robots seldom assign any value to the referer field, this is a rewarding place to look. If a robot is found, the list of known robots is updated with the new one.
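The three-step decision model just described can be sketched as follows; the session fields and the seed list of known robots are assumptions made for the example:

```python
# Sketch of the three-step robot-detection decision model.
# Session fields and the seed list of known robots are illustrative.

KNOWN_ROBOTS = {"Googlebot", "Bingbot"}  # hypothetical seed list

def classify_visitor(session, known_robots=KNOWN_ROBOTS):
    """session: dict with 'agent', 'requests' (paths) and 'referers'."""
    # Step 1: a request for robots.txt marks the visitor as a robot.
    if "/robots.txt" in session["requests"]:
        known_robots.add(session["agent"])  # remember the new robot
        return "robot"
    # Step 2: match against the list of previously known robots.
    if session["agent"] in known_robots:
        return "robot"
    # Step 3: an all-empty ("-") referer field is a strong robot signal.
    if session["referers"] and all(r == "-" for r in session["referers"]):
        known_robots.add(session["agent"])
        return "robot"
    return "human"

visitor = {"agent": "Mozilla/5.0", "requests": ["/robots.txt", "/"],
           "referers": ["-", "-"]}
print(classify_visitor(visitor))  # robot
```

Note that each rule alone is weak (a curious human may fetch robots.txt too); in practice these checks only seed the training labels for the feature-based classifiers mentioned above.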
4.1.1 Avoiding Mislabeled Sessions
To avoid mislabeling of sessions, an ensemble filtering
approach [C. Brodley et al., 1999] is used, where
the idea is to instead of just one model for
classification, build several models which are used
to find classification errors via finding single
mislabeled sessions.
The set of models acquired is used to classify all sessions respectively. For each session, the number of false negative and false positive classifications is counted. A large number of false positive classifications implies that the session is currently labeled as a non-robot despite being predicted to be a robot by most of the models. A large number of false negative classifications implies that the session may in fact be a non-robot even though it carries the robot label.
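The ensemble-filtering idea can be sketched like this: several models classify every session, and a session whose current label disagrees with the majority of the predictions is flagged as possibly mislabeled. The feature name and the toy thresholds below are invented for illustration:

```python
# Hedged sketch of ensemble filtering for mislabeled sessions.

def flag_mislabeled(sessions, models, threshold=0.5):
    """sessions: list of (features, is_robot_label); models: predict fns.
    Flag a session when more than `threshold` of the models disagree
    with its current label."""
    flagged = []
    for features, label in sessions:
        votes = [model(features) for model in models]
        disagreements = sum(1 for v in votes if v != label)
        if disagreements / len(models) > threshold:
            flagged.append((features, label))
    return flagged

# Three toy "models" that call a session a robot above a request rate.
models = [lambda f, t=t: f["req_per_min"] > t for t in (30, 50, 80)]
sessions = [({"req_per_min": 100}, False),  # labeled human, looks robotic
            ({"req_per_min": 2}, False)]    # labeled human, looks human
print(flag_mislabeled(sessions, models))  # flags only the first session
```

The flagged sessions would then be re-examined or relabeled before the final usage-mining pass, which is the point of the filtering step.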
4.2 Mining Issues
4.2.1 Indirect Association
Common association methods often employ patterns that connect objects to each other. Sometimes, on the other
hand, it might be valuable to consider indirect
association between objects. Indirect association is
used to e.g. represent the behaviour of distinct
user groups.
4.2.2 Clustering
With the growth of the World
Wide Web it can be very time consuming to
analyze every web page on its own. Therefore it is
a good idea to cluster web pages based on
attributes that can be considered similar to find
successful and less successful attributes and
patterns.
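The clustering idea above can be sketched with a greedy single pass over pages using Jaccard similarity of their word sets; the similarity threshold and the pages themselves are invented for the example:

```python
# Toy sketch: group web pages by Jaccard similarity of their word sets,
# merging each page into the first sufficiently similar cluster.

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cluster_pages(pages, threshold=0.4):
    """pages: dict url -> set of words. Greedy single-pass clustering."""
    clusters = []  # each entry: (representative word set, [urls])
    for url, words in pages.items():
        for rep, urls in clusters:
            if jaccard(words, rep) >= threshold:
                urls.append(url)
                break
        else:
            clusters.append((set(words), [url]))
    return [urls for _, urls in clusters]

pages = {
    "/a": {"web", "mining", "usage"},
    "/b": {"web", "mining", "content"},
    "/c": {"night", "vision"},
}
print(cluster_pages(pages))  # [['/a', '/b'], ['/c']]
```

Real systems would use richer feature vectors and a proper clustering algorithm (e.g. k-means), but the goal is the same: avoid analyzing every page on its own by grouping similar ones.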
5. TAXONOMY OF WEB
MINING
In general, Web mining tasks can be classified
into three categories:
1. Web content mining,
2. Web structure mining and
3. Web usage mining.
However, there are two other different approaches
to categorize Web mining. In both, the categories
are reduced from three to two: Web content
mining and Web usage mining. In one, Web
structure is treated as part of Web Content while
in the other Web usage is treated as part of Web
Structure. All of the three categories focus on the
process of knowledge discovery of implicit,
previously unknown and potentially useful
information from the Web. Each of them focuses
on different mining objects of the Web.
Fig. 2 Taxonomy of Web mining
5.1. Web content mining
Web content mining is an automatic process that
goes beyond keyword extraction. Since the
content of a text document presents no machine-readable semantics, some approaches have suggested restructuring the document content in a
representation that could be exploited by
machines. The usual approach to exploit known
structure in documents is to use wrappers to map
documents to some data model. Techniques using
lexicons for content interpretation are yet to come.
There are two groups of web content mining
strategies: Those that directly mine the content of
documents and those that improve on the content
search of other tools like search engines.
Web Content Mining deals with discovering
useful information or knowledge from web page
contents. Web content mining analyzes the
content of Web resources. Content data is the
collection of facts that are contained in a web
page. It consists of unstructured data such as free
texts, images, audio, video, semi-structured data
such as HTML documents, and a more structured
data such as data in tables or database generated
HTML pages. The primary Web resources that are
mined in Web content mining are individual
pages. They can be used to group, categorize,
analyze, and retrieve documents. Web content mining can be examined from two points of view:
5.1.1. Agent-Based Approach
This approach aims to assist or improve information finding and to filter the information delivered to users. It can be placed into the following three categories:
a. Intelligent Search Agents: These agents
search for relevant information using
domain characteristics and user profiles to
organize and interpret the discovered
information.
b. Information Filtering/ Categorization:
These agents use information retrieval
techniques and characteristics of open
hypertext Web documents to automatically
retrieve, filter, and categorize them.
c. Personalized Web Agents: These agents
learn user preferences and discover Web
information based on these preferences,
and preferences of other users with similar
interest.
1. Intelligent Search Agents:
Several intelligent Web agents have been
developed that search for relevant information
using domain characteristics and user profiles
to organize and interpret the discovered
information. Agents such as Harvest, FAQ Finder, Information Manifold, OCCAM, and ParaSite rely either on pre-specified domain
information about particular types of
documents, or on hard coded models of the
information sources to retrieve and interpret
documents. Agents such as ShopBot and ILA
(Internet Learning Agent) interact with and
learn the structure of unfamiliar information
sources. ShopBot retrieves product
information from a variety of vendor sites
using only general information about the
product domain. ILA learns models of various
information sources and translates these into
its own concept hierarchy.
2. Information Filtering/Categorization:
A number of Web agents use various information retrieval techniques and characteristics of open hypertext Web documents to automatically retrieve, filter, and categorize them. BO (Bookmark Organizer) combines hierarchical clustering techniques and user interaction to
organize a collection of Web documents based on
conceptual information.
3. Personalized Web Agents:
This category of Web agents learns user preferences and discovers Web information sources based on these preferences, and those of other individuals with similar interests (using collaborative filtering). A few recent examples of such agents include WebWatcher, PAINT, and Syskill & Webert. For example, Syskill & Webert utilizes a user profile and learns to rate Web pages of interest using a Bayesian classifier.
5.1.2. Database Approach
The database approach aims at modeling the data on the Web in a more structured form in order to apply standard database querying mechanisms and data mining applications to analyze it. The two main categories are:
Multilevel databases: The main idea behind this
approach is that the lowest level of the database
contains semi-structured information stored in
various Web sources, such as hypertext
documents. At the higher level(s) meta data or
generalizations are extracted from lower levels
and organized in structured collections, i.e.
relational or object-oriented databases.
Web query systems: Many Web-based query systems and languages utilize standard database query languages such as SQL, structural information about Web documents, and even natural language processing for the queries that are used in World Wide Web searches. W3QL combines structure queries, based on the organization of hypertext documents, with content queries, based on information retrieval techniques. WebLog, a logic-based query language for restructuring, extracts information from Web information sources. TSIMMIS extracts data from heterogeneous and semi-structured information sources and correlates them to generate an integrated database representation of the extracted information.
5.2. WEB STRUCTURE MINING
The World Wide Web can reveal more information than just the information contained in documents. For example, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness or perhaps the variety of topics covered in the document. This can be compared to bibliographical citations: when a paper is cited often, it ought to be important. The PageRank and CLEVER methods take advantage of the information conveyed by links to find pertinent web pages. By means of counters, higher levels accumulate the number of artifacts subsumed by the concepts they hold; counters of hyperlinks into and out of documents retrace the structure of the web artifacts summarized.
Web structure mining is the process of
discovering structure information from the web.
The structure of a typical web graph consists of
web pages as nodes, and hyperlinks as edges
connecting related pages. This can be further
divided into two kinds based on the kind of
structure information used.
Fig. 3 Web graph structure
Hyperlinks
A hyperlink is a structural unit that connects a
location in a web page to a different location,
either within the same web page or on a different
web page. A hyperlink that connects to a different
part of the same page is called an Intra-document
hyperlink, and a hyperlink that connects two
different pages is called an inter-document
hyperlink.
Document Structure
In addition, the content within a Web page can
also be organized in a tree structured format,
based on the various HTML and XML tags within
the page. Mining efforts here have focused on
automatically extracting document object model
(DOM) structures out of documents.
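A small illustration of extracting such a DOM-like tree view of a page, using only the standard library's HTMLParser. The class name, the nesting-path representation, and the HTML snippet are assumptions made for this sketch:

```python
# Sketch of mining a tree-structured (DOM-like) view of a page: record the
# nesting path of every start tag encountered while parsing the HTML.
from html.parser import HTMLParser

class DomSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags
        self.paths = []      # nesting path of each start tag seen

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.append("/".join(self.stack))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

parser = DomSketch()
parser.feed("<html><body><table><tr><td>cell</td></tr></table></body></html>")
print(parser.paths)
# ['html', 'html/body', 'html/body/table', 'html/body/table/tr',
#  'html/body/table/tr/td']
```

Real extraction systems build a full node tree rather than paths, but the same tag-nesting information drives both.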
Web structure mining focuses on the hyperlink
structure within the Web itself. The different
objects are linked in some way. Simply applying
the traditional processes and assuming that the
events are independent can lead to wrong
conclusions. However, the appropriate handling of
the links could lead to potential correlations, and
then improve the predictive accuracy of the
learned models.
Two algorithms that have been proposed to deal with those potential correlations are:
1. HITS and
2. PageRank.
5.2.1. PageRank
PageRank is a metric for ranking hypertext documents that determines the quality of these documents. The key idea is that a page has high rank if it is pointed to by many highly ranked pages, so the rank of a page depends upon the ranks of the pages pointing to it. This process is applied iteratively until the rank of every page is determined.
The rank of a page p can thus be written as:

PR(p) = d/n + (1 - d) * Σ_{(q,p) ∈ E} PR(q) / OutDegree(q)

Here, n is the number of nodes in the graph, E is the set of hyperlink edges (so the sum runs over all pages q that link to p), OutDegree(q) is the number of hyperlinks on page q, and the damping factor d is the probability that at each page the random surfer will get bored and request another random page.
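The iterative computation can be sketched directly from that formula. The toy graph, the iteration count, and the d = 0.15 teleport probability are illustrative assumptions:

```python
# Minimal sketch of iterative PageRank following
# PR(p) = d/n + (1-d) * sum over q->p of PR(q)/OutDegree(q),
# where d is the probability of jumping to a random page.

def pagerank(links, d=0.15, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    n = len(links)
    pr = {p: 1.0 / n for p in links}          # uniform starting ranks
    for _ in range(iters):
        new = {}
        for p in links:
            # Sum rank contributions from every page q that links to p.
            incoming = sum(pr[q] / len(links[q])
                           for q in links if p in links[q])
            new[p] = d / n + (1 - d) * incoming
        pr = new
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # prints "c": linked to by both other pages
```

Note the ranks remain a probability distribution (they sum to 1) because every page here has at least one outgoing link; dangling pages need extra handling in a full implementation.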
5.2.2. HITS
Hyperlink-induced topic search (HITS) is an
iterative algorithm for mining the Web graph to
identify topic hubs and authorities. Authorities are
the pages with good sources of content that are
referred by many other pages or highly ranked
pages for a given topic; hubs are pages with good
sources of links. The algorithm takes as input search results returned by traditional text indexing techniques and filters these results to identify
hubs and authorities. The number and weight of
hubs pointing to a page determine the page's
authority. The algorithm assigns weight to a hub
based on the authoritativeness of the pages it
points to. If many good hubs point to a page p,
then authority of that page p increases. Similarly if
a page p points to many good authorities, then hub
of page p increases.
After the computation, HITS outputs the pages
with the largest hub weight and the pages with the
largest authority weights, which is the search
result of a given topic.
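The mutual hub/authority update described above can be sketched as follows; the toy graph is an assumed example with two obvious hubs and two obvious authorities:

```python
# Sketch of the HITS iteration: a page's authority is the sum of the hub
# scores of pages pointing to it; its hub score is the sum of the authority
# scores of pages it points to; both vectors are normalized each round.
import math

def hits(links, iters=50):
    hubs = {p: 1.0 for p in links}
    auth = {p: 1.0 for p in links}
    for _ in range(iters):
        # Authority of p grows with the hub weight of pages linking to p.
        auth = {p: sum(hubs[q] for q in links if p in links[q]) for p in links}
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {p: a / norm for p, a in auth.items()}
        # Hub weight of p grows with the authority of the pages p points to.
        hubs = {p: sum(auth[t] for t in links[p]) for p in links}
        norm = math.sqrt(sum(h * h for h in hubs.values()))
        hubs = {p: h / norm for p, h in hubs.items()}
    return hubs, auth

links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "a1": [], "a2": []}
hubs, auth = hits(links)
print(max(auth, key=auth.get) in ("a1", "a2"))  # prints True
```

As the text says, the final output for a topic would be the pages with the largest hub and authority weights.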
5.3. WEB USAGE MINING
Web usage mining is a process of extracting useful information from server logs, i.e., users' history. Web usage mining is the process of finding out what users are looking for on the Internet.
Web usage mining focuses on techniques that
could predict the behavior of users while they are
interacting with the WWW. It collects the data
from Web log records to discover user access
patterns of Web pages. Usage data captures the
identity or origin of web users along with their
browsing behavior at a web site.
Web servers record and accumulate data about
user interactions whenever requests for resources
are received. Analyzing the web access logs of
different web sites can help understand the user
behavior and the web structure, thereby improving
the design of this colossal collection of resources.
There are two main tendencies in Web Usage
Mining driven by the applications of the
discoveries: General Access Pattern Tracking and
Customized Usage Tracking. The general access
pattern tracking analyzes the web logs to
understand access patterns and trends. These
analyses can shed light on better structure and
grouping of resource providers. Many web analysis tools exist, but they are limited and usually unsatisfactory. We have designed a web
log data mining tool, Web Log Miner, and
proposed techniques for using data mining and
OnLine Analytical Processing (OLAP) on treated
and transformed web access files. Applying data
mining techniques on access logs unveils
interesting access patterns that can be used to
restructure sites in a more efficient grouping,
pinpoint effective advertising locations, and target
specific users for specific selling ads.
Customized usage tracking analyzes individual
trends. Its purpose is to customize web sites to
users. The information displayed, the depth of the
site structure and the format of the resources can
all be dynamically customized for each user over
time based on their access patterns.
While it is encouraging and exciting to see the
various potential applications of web log file
analysis, it is important to know that the success
of such applications depends on what and how
much valid and reliable knowledge one can
discover from the large raw log data. Current web
servers store limited information about the
accesses. Some scripts custom-tailored for some
sites may store additional information. However,
for an effective web usage mining, an important
cleaning and data transformation
step before analysis may be needed.
In the use and mining of Web data, the most direct source of data is the Web log files on the Web server. Web log files record the visitor's browsing behavior very clearly. Web log files include the server log, agent log and client log (IP address, URL, page reference, access time, cookies, etc.).
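Pulling the fields just listed out of a log line can be sketched with a regular expression over a Common Log Format entry with a referer field appended. The pattern and the sample line are illustrative assumptions, not a universal log grammar:

```python
# Sketch of extracting IP address, URL, access time, status and referer from
# a server log line (Common Log Format plus a quoted referer field).
import re

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) \S+ "(?P<referer>[^"]*)"'
)

line = ('192.0.2.1 - - [10/Oct/2024:13:55:36 +0000] '
        '"GET /index.html HTTP/1.0" 200 2326 "-"')
m = LOG_RE.match(line)
print(m.group("ip"), m.group("url"), m.group("referer"))
# 192.0.2.1 /index.html -
```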
There are several available research projects and
commercial products that analyze those patterns
for different purposes. The applications generated
from this analysis can be classified as
personalization, system improvement, site
modification, business intelligence and usage
characterization.
The Web Mining Architecture
Fig. 4 Web Usage Mining Process
Web usage mining can be decomposed into the following three main sub-tasks:
Fig. 5 Web usage mining process
5.3.1. Pre-processing
It is necessary to perform data preparation to convert the raw data for further processing. The data actually collected are generally incomplete, redundant and ambiguous. In order to mine the knowledge more effectively, pre-processing the collected data is essential. Preprocessing can provide accurate, concise data
for data mining. Data preprocessing includes data cleaning, user identification, user session identification, access path supplement and transaction identification.
The main task of data cleaning is to remove redundant Web log data that is not associated with the useful data, narrowing the scope of data objects. Identifying the individual user must be done after data cleaning. The purpose of user identification is to identify each user uniquely. It can be accomplished by means of cookie technology, user registration techniques and investigative rules.
User session identification should be done on the basis of the user identification. The purpose is to divide each user's access information into several separate session processes. The simplest way is to use a time-out estimation approach: when the time interval between page requests exceeds a given value, the user is assumed to have started a new session.
Because of the widespread use of page caching technology and proxy servers, the access path recorded by the Web server access logs may not be the complete access path of users. An incomplete access log does not accurately reflect the user's access patterns, so it is necessary to supplement the access path. Path supplement can be achieved by using the Web site topology to analyze the pages.
Transaction identification is based on the user's session recognition; its purpose is to divide or combine sessions into transactions according to the demands of the data mining tasks, in order to make them appropriate for data mining analysis.
5.3.2. Pattern discovery
Pattern discovery mines effective, novel,
potentially useful and ultimately understandable
information and knowledge using mining
algorithm. Its methods include statistical analysis,
classification analysis, association rule discovery,
sequential pattern discovery, clustering analysis,
and dependency modeling.
Statistical Analysis: Statistical analysts may perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) based on different variables such as page views, viewing time and length of a navigational path when analyzing the session file. By analyzing the statistical information contained in the periodic web system report, the extracted report can be potentially useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
Association Rules: In the web domain, the
pages, which are most often referenced
together, can be put in one single server
session by applying the association rule
generation. Association rule mining
techniques can be used to discover
unordered correlation between items found
in a database of transactions.
Clustering analysis: Clustering analysis is
a technique to group together users or data
items (pages) with the similar
characteristics. Clustering of user
information or pages can facilitate the
development and execution of future
marketing strategies.
Classification analysis: Classification is
the technique to map a data item into one
of several predefined classes. The
classification can be done by using
supervised inductive learning algorithms such as decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines, etc.
Sequential Pattern: This technique intends to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. Sequential patterns
also include some other types of temporal
analysis such as trend analysis, change
point detection, or similarity analysis.
Dependency Modeling: The goal of this
technique is to establish a model that is
able to represent significant dependencies
among the various variables in the web
domain. The modeling technique provides
a theoretical framework for analyzing the
behavior of users, and is potentially useful
for predicting future web resource
consumption.
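The association rule discovery named above can be sketched for the pairwise case: count how often two pages co-occur in a session and keep rules whose support and confidence clear thresholds. The thresholds and sessions are assumed values; real systems use Apriori-style mining for larger itemsets:

```python
# Toy sketch of association rule discovery over sessions: mine rules
# X -> Y between pages that are often referenced together, reporting
# those above assumed support and confidence thresholds.
from itertools import combinations

def pair_rules(sessions, min_support=0.4, min_conf=0.6):
    n = len(sessions)
    counts, singles = {}, {}
    for s in sessions:
        items = set(s)
        for i in items:
            singles[i] = singles.get(i, 0) + 1
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    rules = []
    for (a, b), c in counts.items():
        if c / n >= min_support:                 # pair is frequent enough
            for x, y in ((a, b), (b, a)):
                conf = c / singles[x]            # P(y in session | x in session)
                if conf >= min_conf:
                    rules.append((x, y, round(conf, 2)))
    return rules

sessions = [["/a", "/b"], ["/a", "/b", "/c"], ["/a"], ["/b", "/a"]]
print(pair_rules(sessions))
# [('/a', '/b', 0.75), ('/b', '/a', 1.0)]
```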
5.3.3. Pattern Analysis
Pattern Analysis is a final stage of the whole web
usage mining. The goal of this process is to
eliminate the irrelevant rules or patterns and to
understand, visualize and to extract the interesting
rules or patterns from the output of the pattern
discovery process. The output of web mining
algorithms is often not in the form suitable for
direct human consumption, and thus need to be
transform to a format can be assimilate easily.
There are two most common approaches for the
patter analysis. One is to use the knowledge query
mechanism such as SQL, while another is to
construct multi-dimensional data cube before
perform OLAP operation.
6. APPLICATIONS OF WEB
MINING
Web mining techniques can be applied to understand and analyze such data and turn it into actionable information that can support a web-enabled electronic business in improving its marketing, sales and customer support operations.
Based on the patterns found and the original cache
and log data, many applications can be developed.
Some of them are:
In order to achieve personalized service, a business first has to obtain and collect information on clients to grasp customers' spending habits, hobbies, consumer psychology, etc.; it can then provide targeted, personalized service. Obtaining consumer spending behavior patterns is very difficult with the traditional marketing approach, but it can be done using Web mining techniques.
Early on in the life of Amazon.com, its visionary CEO Jeff Bezos observed: “In a traditional (brick-and-mortar) store, the main effort is in getting a customer to the store. Once a customer is in the store they are likely to make a purchase, since the cost of going to another store is high, and thus the marketing budget (focused on getting the customer to the store) is in general much higher than the in-store customer experience budget (which keeps the customer in the store). In the case of an on-line store, getting in or out requires exactly one click, and thus the main focus must be on customer experience in the store.” This fundamental observation has been the driving force behind Amazon's comprehensive approach to personalized customer experience, based on the mantra “a personalized store for every customer.” A host of Web mining techniques, e.g. associations between pages visited, click-path analysis, etc., are used to improve the customer's experience during a store visit. Knowledge gained from Web mining is the key intelligence behind Amazon's features such as instant recommendations, purchase circles, wish-lists, etc.
6.1. Improve the website design
The attractiveness of a site depends on the reasonable design of its content and organizational structure. Web mining can provide details of user behavior, giving web site designers a basis for decision making to improve the design of the site.
6.2. System Improvement
Performance and other service quality attributes are crucial to user satisfaction with services such as databases, networks, etc. Similar qualities are expected by the users of Web services. Web usage mining provides the key to understanding Web traffic behavior, which can in turn be used to develop policies for Web caching, network transmission, load balancing, or data distribution.
Security is an acutely growing concern for Web-
based services, especially as electronic commerce
continues to grow at an exponential rate. Web
usage mining can also provide patterns which are
useful for detecting intrusion, fraud, attempted
break-ins, etc.
6.3. Predicting trends
Web mining can predict trends within the retrieved information to indicate future values. For
example, an electronic auction company provides
information about items to auction, previous
auction details, etc. Predictive modeling can be utilized to analyze the existing information and to estimate the values of auctioned items or the number of people participating in future auctions.
The predicting capability of the mining
application can also benefit society by identifying
criminal activities.
6.4. To carry out intelligent business
A customer's visit cycle in network marketing activities can be divided into four steps: being attracted, presence, purchase and departure. Web mining technology can uncover customers' motivations by analyzing customer click-stream information, in order to help sales staff make reasonable strategies, customize personalized pages for customers, and carry out targeted information feedback and advertising. In short, in e-commerce network marketing, using Web mining techniques to analyze large amounts of data can reveal the laws of consumption of goods and customers' access patterns, helping businesses develop effective marketing strategies and enhance enterprise competitiveness.
Companies can establish better customer relationships by giving customers exactly what they need. Companies can understand the needs of the customer better and react to customer needs faster. They can find, attract and retain customers; they can save on production costs by utilizing the acquired insight into customer requirements. They can increase profitability through target pricing based on the profiles created. They can even find the customer who might defect to a competitor; the company can then try to retain that customer with promotional offers, thus reducing the risk of losing the customer.
7. RESEARCH DIRECTIONS
The techniques being applied to Web content
mining draw heavily from the work on
information retrieval, databases, intelligent agents,
etc. Since most of these techniques are well
known and reported elsewhere, we have focused
on Web usage mining in this survey instead of
Web content mining. In the following we provide
some directions for future research.
7.1 Data Pre-Processing for Mining
Web usage data is collected in various ways, each
mechanism collecting attributes relevant for its
purpose. There is a need to pre-process the data to
make it easier to mine for knowledge.
Specifically, we believe that issues such as
instrumentation and data collection, data
integration and transaction identification need to
be addressed. Clearly improved data quality can
improve the quality of any analysis on it. A
problem in the Web domain is the inherent
conflict between the analysis needs of the analysts
(who want more detailed usage data collected),
and the privacy needs of users (who want as little
data collected as possible). This has led to the development of cookie files on one side and cache busting on the other. The emerging OPS standard on collecting profile data may be a compromise on what can and will be collected. However, it is not clear how much compliance with it can be expected. Hence, there will be a continual need to develop better instrumentation and data collection techniques, based on whatever is possible and allowable at any point in time. Portions of Web usage data exist in sources as diverse as Web server logs, referral logs, registration files, and index server logs. Intelligent integration and
index server logs. Intelligent integration and
correlation of information from these diverse
sources can reveal usage information which may
not be evident from any one of them. Techniques
from data integration should be examined for this
purpose. Web usage data collected in various logs
is at a very fine granularity. Therefore, while it
has the advantage of being extremely general and
fairly detailed, it also has the corresponding
drawback that it cannot be analyzed directly, since
the analysis may start focusing on micro trends
rather than on the macro trends. On the other
hand, the issue of whether a trend is micro or
macro depends on the purpose of a specific
analysis.
Hence, we believe there is a need to group individual data collection events into groups, called Web transactions, before feeding them to the mining system. While techniques to do so have been proposed, more attention needs to be given to this issue.
7.2 The Mining Process
The key component of Web mining is the mining
process itself. As discussed in this paper, Web
mining has adapted techniques from the field of
data mining, databases, and information retrieval,
as well as developing some techniques of its own,
e.g. path analysis. A lot of work still remains to be
done in adapting known mining techniques as well
as developing new ones. Web usage mining
studies reported to date have mined for association
rules, temporal sequences, clusters, and path
expressions. As the manner in which the Web is
used continues to expand, there is a continual need
to figure out new kinds of knowledge about user
behavior that needs to be mined. The quality of a
mining algorithm can be measured both in terms
of how effective it is in mining for knowledge and
how efficient it is in computational terms. There
will always be a need to improve the performance
of mining algorithms along both these dimensions.
Usage data collection on the Web is incremental
in nature. Hence, there is a need to develop
mining algorithms that take as input the existing
data, mined knowledge, and the new data, and
develop a new model in an efficient manner.
Usage data collection on the Web is also
distributed by its very nature. If all the data were
to be integrated before mining, a lot of valuable
information could be extracted. However, an
approach of collecting data from all possible
server logs is both non-scalable and impractical.
Hence, there needs to be an approach where
knowledge mined from various logs can be
integrated together into a more comprehensive
model.
7.3 Analysis of Mined Knowledge
The output of knowledge mining algorithms is
often not in a form suitable for direct human
consumption, and hence there is a need to develop
techniques and tools for helping an analyst better
assimilate it. Issues that need to be addressed in
this area include usage analysis tools and
interpretation of mined knowledge.
There is a need to develop tools which incorporate
statistical methods, visualization, and human
factors to help better understand the mined
knowledge. Section 4 provided a survey of the
current literature in this area. One of the open
issues in data mining, in general, and Web mining,
in particular, is the creation of intelligent tools that
can assist in the interpretation of mined
knowledge. Clearly, these tools need to have
specific knowledge about the particular problem
domain to do any more than filtering based on
statistical attributes of the discovered rules or
patterns. In Web mining, for example, intelligent
agents could be developed that based on
discovered access patterns, the topology of the
Web locality, and certain heuristics derived from
user behavior models, could give
recommendations about changing the physical
link structure of a particular site.
8. WEB MINING PROS & CONS
8.1. PROS
Web mining essentially has many advantages, which makes this technology attractive to corporations, including government agencies. This technology has enabled e-commerce to do personalized marketing, which eventually results in higher trade volumes. Government agencies are using this technology to classify threats and fight against terrorism. The predicting capability of mining applications can benefit society
by identifying criminal activities.
Prospects
The future of Web Mining will to a large extent
depend on developments of the Semantic Web.
The role of Web technology continues to increase in industry, government, education and entertainment.
This means that the range of data to which Web
Mining can be applied also increases. Even
without technical advances, the role of Web
Mining technology will become larger and more
central. The main technical advances will be in
increasing the types of data to which Web Mining
can be applied. In particular Web Mining for text,
images and video/audio streams will increase the
scope of current methods. These are all active
research topics in Data Mining and Machine
Learning and the results of this can be exploited
for Web Mining.
The second type of technical advance comes from
the integration of Web Mining with other
technologies in application contexts. Examples are
information retrieval, ecommerce, business
process modeling, instruction, and health care.
The widespread use of web-based systems in these
areas makes them amenable to Web Mining.
In this section we outline current generic practical
problems that will be addressed, technology
required for these solutions, and research issues
that need to be addressed for technical progress.
Knowledge Management
Knowledge Management is generally viewed as a field of great industrial importance. Systematic management of the knowledge that is available in an organization can increase the organization's ability to make optimal use of that knowledge and to react effectively to new developments, threats and opportunities. Web Mining technology creates the opportunity to integrate knowledge management more tightly with business processes.
Standardization efforts that use Semantic Web technology and the availability of ever more data about business processes on the internet create opportunities for Web Mining technology.
widespread use of Web Mining for Knowledge
Management requires the availability of low-
threshold Web Mining tools that can be used by
non-experts and that can flexibly be integrated in a
wide variety of tools and systems.
E-commerce
The increased use of XML/RDF to describe
products, services and business processes
increases the scope and power of Data Mining
methods in e-commerce. Another direction is the
use of text mining methods for modeling
technical, social and commercial developments.
This requires advances in text mining and
information extraction.
E-learning
The Semantic Web provides a way of organizing
teaching material, and usage mining can be
applied to suggest teaching materials to a learner.
This opens opportunities for Web Mining. For
example, a recommending approach can be
followed to find courses or teaching material for a
learner. The material can then be organized with clustering techniques, and ultimately be shared on the web again, e.g., within a peer-to-peer network.
Web mining methods can be used to construct a
profile of user skills, competence or knowledge
and of the effect of instruction. Another possibility
is to use web mining to analyze student
interactions for teaching purposes. The internet
supports students who collaborate during learning.
Web mining methods can be used to monitor this
process, without requiring the teacher to follow
the interactions in detail. Current web mining
technology already provides a good basis for this.
Research and development must be directed
toward important characteristics of interactions
and to integration in the instructional process.
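The recommending approach mentioned above can be sketched in a minimal way. In this illustration the course descriptions, the learner-profile text, and the bag-of-words cosine similarity are all hypothetical stand-ins for the richer skill and usage profiles the text describes:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # cosine similarity between two bag-of-words term-count vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(profile: str, materials: dict, k: int = 2):
    # rank teaching materials by similarity to the learner's profile text
    p = Counter(profile.lower().split())
    scored = [(cosine(p, Counter(text.lower().split())), title)
              for title, text in materials.items()]
    return [title for score, title in sorted(scored, reverse=True)[:k] if score > 0]

materials = {  # hypothetical course descriptions
    "Intro to SQL": "relational databases tables queries sql joins",
    "Web Mining 101": "web mining usage logs clustering patterns",
    "Poetry Workshop": "verse meter rhyme creative writing",
}
print(recommend("student interested in web usage mining and clustering", materials))
# -> ['Web Mining 101']
```

A real system would of course build the profile from observed interactions rather than a free-text description, but the ranking step has this general shape.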
E-government
Many activities in governments involve large
collections of documents. Think of regulations,
letters, announcements, reports. Managing access
and availability of this amount of textual
information can be greatly facilitated by a
combination of Semantic Web standardization and
text mining tools. Many internal processes in
government involve documents, both textual and
structured. Web mining creates the opportunity to
analyze these governmental processes and to
create models of the processes and the information
involved. It seems likely that standard ontologies
will be used in governmental organizations, and
the resulting standardization will make Web
Mining more widely applicable and more powerful
than it currently is. The issues involved are those
of Knowledge Management. Governmental
activities that involve the general public also offer
many opportunities for Web Mining. Like shops,
governments that offer services via the internet
can analyze their customers' behavior to improve
their services.
Information about social processes can be
observed and monitored using Web Mining, in the
the style of marketing analyses. Examples are the
analysis of research proposals for the European
Commission and the development of tools for
monitoring and structuring internet discussions on
non-political issues. The enabling technologies
here are more advanced information extraction
methods and tools.
Health care
Medicine is one of the Web’s fastest-growing
areas. It profits from Semantic Web technology in
a number of ways. First, as a means of organizing
medical knowledge: for example, the widely used
International Classification of Diseases taxonomy
and its variants serve to organize telemedicine
portal content and interfaces. The Unified
Medical Language System
(https://www.nlm.nih.gov/research/umls) integrates
this classification and many others. Second, health
care institutions can profit from interoperability
between the different clinical information systems
and semantic representations of member
institutions’ organization and services. Usage
analyses of medical sites can be employed for
purposes such as Web site evaluation and the
inference of design guidelines for international
audiences, or the detection of epidemics. In
general, similar issues arise, and the same
methods can be used for analysis and design as in
other content classes of Web sites. Some of the
facets of Semantic Web Mining that we have
mentioned in this article form specific challenges,
in particular: the privacy and security of patient
data, the semantics of visual material, and the
cost-induced pressure towards national and
international integration of Web resources.
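One facet mentioned above, detecting epidemics from usage of medical sites, can be illustrated with a deliberately crude sketch. The daily query counts and the threshold rule below are hypothetical, not a validated surveillance method:

```python
def spikes(series, window=5, factor=2.0):
    # flag a day whose count exceeds `factor` times the mean of the
    # preceding `window` days -- a crude signal of anomalous interest
    flagged = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if series[i] > factor * baseline:
            flagged.append(i)
    return flagged

# hypothetical daily counts of symptom-related queries to a medical portal
counts = [12, 14, 11, 13, 12, 15, 13, 41, 55, 60]
print(spikes(counts))
# -> [7, 8, 9]
```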
E-science
In E-Science two main developments are visible.
One is the use of text mining and Data Mining for
information extraction to extract information from
large collections of textual documents. Much
information is “buried” in the huge scientific
literature and can be extracted by combining
knowledge about the domain and information
extraction. Enabling technology for this is
information extraction in combination with
knowledge representation and ontologies. The
other development is large scale data collection
and data analysis. This also requires a common
conceptualization and organization of the
information using ontologies. However, this form
of collaboration also needs a common
methodology, and it needs to be extended with
other means of communication.
Web mining for images and video and audio
streams
So far, efforts in Semantic Web research have
addressed mostly written documents. Recently this
has been broadened to include sound, voice and
images. Images and parts of images are annotated
with terms from ontologies.
Privacy and security
A factor that limits the application of Web Mining
is the need to protect users' privacy. Web Mining
uses data that are available on the web anyway,
but Data Mining makes it possible to induce
general patterns that, applied to personal data, can
infer information that should remain private.
Recent research addresses this problem by
searching for selective restrictions on access to
data that still allow the induction of general
patterns but at the same time preserve a preset
uncertainty about individuals, thereby protecting
their privacy.
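A minimal illustration of preserving a preset uncertainty about individuals is k-anonymity-style generalization. The ages and band widths below are invented for the example; real schemes generalize several quasi-identifiers at once:

```python
from collections import Counter

def band(age, width):
    # generalize an exact age into a band of the given width, e.g. 23 -> "20-29"
    lo = age - age % width
    return f"{lo}-{lo + width - 1}"

def k_anonymize_ages(ages, k):
    # widen the age bands until every band holds at least k individuals,
    # so no released value can be traced to fewer than k people
    for width in (1, 5, 10, 20, 50, 100):
        bands = [band(a, width) for a in ages]
        if all(c >= k for c in Counter(bands).values()):
            return bands
    return bands

ages = [23, 24, 27, 31, 34, 36, 62, 64]  # hypothetical user ages
print(k_anonymize_ages(ages, 2))
# -> ['20-29', '20-29', '20-29', '30-39', '30-39', '30-39', '60-69', '60-69']
```

Patterns mined from the banded data still show the general age structure, but no band identifies fewer than k people.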
Information extraction with formalized
knowledge
We briefly reviewed the use of concept hierarchies
and thesauri for information extraction. If
knowledge is represented in more general formal
Semantic Web languages such as OWL, there are
in principle stronger possibilities to exploit this
knowledge for information extraction.
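As a toy sketch of thesaurus-driven extraction, surface terms can be mapped to concepts and then up a concept hierarchy. The terms and the two-level hierarchy below are invented, not taken from any real ontology:

```python
# toy thesaurus: surface term -> concept (hypothetical entries)
THESAURUS = {"flu": "Influenza", "influenza": "Influenza", "measles": "Measles"}
# toy hierarchy: concept -> parent concept (hypothetical)
HIERARCHY = {"Influenza": "ViralDisease", "Measles": "ViralDisease",
             "ViralDisease": "Disease"}

def extract_concepts(text):
    # map each known surface term to its concept, then add all ancestors
    found = set()
    for token in text.lower().replace(",", " ").split():
        concept = THESAURUS.get(token)
        while concept:
            found.add(concept)
            concept = HIERARCHY.get(concept)
    return found

print(sorted(extract_concepts("Reported flu cases rose sharply")))
# -> ['Disease', 'Influenza', 'ViralDisease']
```

Richer OWL knowledge would additionally allow reasoning over relations between concepts, not just subsumption as here.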
In summary, the main foreseen developments are:
– The extensive use of annotated documents
facilitates the application of Data Mining
techniques to documents.
– The use of a standardized format and a
standardized vocabulary for information on the
web will increase the effect and use of Web
Mining.
– The Semantic Web goal of large-scale
construction of ontologies will require the use of
Data Mining methods, in particular to extract
knowledge from text.
8.2. CONS
Web mining by itself does not create issues, but
when this technology is applied to data of a
personal nature it can raise concerns. The most
criticized ethical issue involving web mining is the
invasion of privacy. Privacy is considered lost
when information concerning an individual is
obtained, used, or disseminated, especially if this
occurs without their knowledge or consent. The
obtained data are analyzed and clustered to form
profiles; the data are made anonymous before
clustering so that no personal profiles result.
These applications thus de-individualize users by
judging them by their mouse clicks. De-
individualization can be defined as the tendency
to judge and treat people on the basis of group
characteristics instead of on their own individual
characteristics and merits.
Another important concern is that companies
collecting data for a specific purpose might use
them for a totally different purpose, which
essentially violates the user's interests. The
growing trend of selling personal data as a
commodity encourages website owners to trade
personal data obtained from their sites. This trend
has increased the amount of data being captured
and traded, increasing the likelihood of one's
privacy being invaded. The companies that buy
the data are obliged to make it anonymous, and
these companies are considered the authors of any
specific release of mining patterns. They are
legally responsible for the contents of the release;
any inaccuracies in the release can result in
serious lawsuits, but there is no law preventing
them from trading the data.
Some mining algorithms might use controversial
attributes such as sex, race, religion, or sexual
orientation to categorize individuals. Such
practices may violate anti-discrimination
legislation. The applications make it hard to
identify the use of such controversial attributes,
and there is no strong rule against using such
algorithms with such attributes. This could result
in the denial of a service or privilege to an
individual based on race, religion, or sexual
orientation; at present this situation can only be
avoided by the high ethical standards maintained
by the data mining company. The collected data
are made anonymous so that neither the data nor
the derived patterns can be traced back to an
individual. Although this may look as if it poses
no threat to privacy, a great deal of extra
information can be inferred by combining separate
pieces of data about a user.
9. CONCLUSION
The term Web mining has been used to refer to
techniques that encompass a broad range of issues.
However, while meaningful and attractive, this
very broadness has caused Web mining to mean
different things to different people, and there is a
need to develop a common vocabulary. Towards
this goal we proposed a definition of Web mining
and developed a taxonomy of the various ongoing
efforts related to it. Next, we presented a survey of
the research in this area, concentrating on Web
usage mining. We provided a detailed survey of
the efforts in this area, even though the survey is
short because of the area's newness. We provided
a general architecture of a system for Web usage
mining, and identified the issues and problems in
this area that require further research and
development.
As the Web and its usage continue to grow, so
does the opportunity to analyze Web data and
extract all manner of useful knowledge from it.
The past few years have seen the emergence of
Web mining as a rapidly growing area, due to the
efforts of the research community as well as
various organizations practicing it. The key
component of web mining is the mining process
itself. We have described the key computer
science contributions made in this field, including
an overview of web mining, a taxonomy of web
mining, the prominent successful applications,
and some promising areas of future research.