
DATA MINING AND WAREHOUSING

(For B.Sc. (CS) & BCA, Semester VI)

VIDHYAA ARTS AND SCIENCE COLLEGE


KONGANAPURAM

III B.Sc. (CS) & BCA

Department of Computer Science

Version: 2019 – 2020


Contents
UNIT – I
1.1 Introduction
1.1.1 Data Mining Applications
1.1.2 Data Mining Techniques
1.1.3 Some Data Mining Case Studies
1.1.4 The Future of Data Mining
1.1.5 Data Mining Software
1.2 Association Rules Mining
1.2.1 Introduction
1.2.2 Basics
1.2.3 The Task and a Naive Algorithm
1.2.4 The Apriori Algorithm
1.2.5 Improving the Efficiency of the Apriori Algorithm
1.2.6 Mining Frequent Patterns without Candidate Generation (FP-Growth)
1.2.7 Performance Evaluation of Algorithms
UNIT – II
2.1 Classification
2.1.1 Introduction
2.1.2 Decision Tree
2.1.3 Overfitting and Pruning
2.1.4 Decision Tree Rules
2.1.5 Naive Bayes Method
2.1.6 Estimating Predictive Accuracy of Classification Methods
2.1.7 Other Evaluation Criteria for Classification Methods
2.1.8 Classification Software
UNIT – III
3.1 Cluster Analysis
3.1.1 Cluster Analysis
3.1.2 Types of Data
3.1.3 Computing Distance
3.1.4 Types of Cluster Analysis Methods
3.1.5 Partitional Methods
3.1.6 Hierarchical Methods
3.1.7 Density-Based Methods
3.1.8 Dealing with Large Databases
3.1.9 Quality and Validity of Cluster Analysis Methods
3.1.10 Cluster Analysis Software
UNIT – IV
4.1 Web Data Mining
4.1.1 Introduction
4.1.2 Web Terminology and Characteristics
4.1.3 Locality and Hierarchy in the Web
4.1.4 Web Content Mining
4.1.5 Web Usage Mining
4.1.6 Web Structure Mining
4.2 Search Engines
4.2.1 Search Engine Functionality
4.2.2 Search Engine Architecture
4.2.3 Ranking of Web Pages
UNIT – V
5.1 Data Warehousing
5.1.1 Introduction
5.1.2 Operational Data Sources
5.1.3 Data Warehousing
5.1.4 Data Warehouse Design
5.1.5 Guidelines for Data Warehouse Implementation
5.1.6 Data Warehouse Metadata
5.2 Online Analytical Processing (OLAP)
5.2.1 Introduction
5.2.2 Characteristics of OLAP Systems
5.2.3 Multidimensional View and Data Cube
5.2.4 Data Cube Implementation
5.2.5 Data Cube Operations – OLAP Implementation Guidelines
SYLLABUS
DATA MINING AND WAREHOUSING
UNIT–I
Introduction: Data mining applications – data mining techniques – data mining
case studies – the future of data mining – data mining software. Association rules
mining: Introduction – basics – the task and a naive algorithm – the Apriori algorithm –
improving the efficiency of the Apriori algorithm – mining frequent patterns without
candidate generation (FP-growth) – performance evaluation of algorithms.
UNIT–II
Classification: Introduction – decision tree – overfitting and pruning – decision tree rules –
naive Bayes method – estimating predictive accuracy of classification methods – other
evaluation criteria for classification methods – classification software.
UNIT–III
Cluster analysis: Cluster analysis – types of data – computing distances – types of
cluster analysis methods – partitional methods – hierarchical methods – density-based
methods – dealing with large databases – quality and validity of cluster analysis methods
– cluster analysis software.
UNIT–IV
Web data mining: Introduction – web terminology and characteristics – locality and
hierarchy in the web – web content mining – web usage mining – web structure mining –
web mining software. Search engines: Search engine functionality – search engine
architecture – ranking of web pages.
UNIT–V
Data warehousing: Introduction – operational data sources – data warehousing – data
warehouse design – guidelines for data warehouse implementation – data warehouse
metadata. Online analytical processing (OLAP): Introduction – characteristics of OLAP
systems – multidimensional view and data cube – data cube implementation – data cube
operations – OLAP implementation guidelines.

UNIT – I

1.1 INTRODUCTION

a. Data mining:
Data mining is a collection of techniques for the efficient automated discovery of
previously unknown, valid, novel, useful and understandable patterns in large
databases. The patterns must be actionable so that they may be used in an enterprise's
decision-making process.
1.1.1 Data mining applications

Data mining is being used for a wide variety of applications.


1. Prediction and description:
Data mining may be used to answer questions like “would this customer buy a
product?” or “is this customer likely to leave?” Data mining techniques may also be
used for sales forecasting and analysis. Usually the techniques involve selecting some or
all the attributes of the objects available in a database to predict other variables of
interest.
2. Relationship marketing:
Data mining can help in analyzing customer profiles, discovering sales triggers,
and in identifying critical issues that determine client loyalty and help in improving
customer retention. This also includes analyzing customer profiles and improving
direct marketing plans.
3. Customer profiling:
It is the process of using the relevant and available information to describe the
characteristics of a group of customers or ordinary consumers and the drivers of their
purchasing decisions. Profiling can help an enterprise identify its most valuable
customers so that the enterprise may differentiate between their needs and values.
4. Outliers identification and detecting fraud:
There are many uses of data mining in identifying outliers, fraud or unusual
cases. This might be as simple as identifying unusual expense claims by staff,
identifying anomalies in expenditure between similar units of an enterprise, perhaps
during auditing, or identifying fraud, for example involving credit or phone cards.
5. Customer segmentation:
It is a way to assess and view individuals in the market based on their status and
needs. Data mining can be used for customer segmentation, for promoting cross-
selling of services, and for increasing customer retention.
Data mining may also be used for branch segmentation and for evaluating the
performance of various banking channels, such as phone or online banking.
6. Web site design and promotion:
Web mining may be used to discover how users navigate a web site and the
results can help in improving the site design and making it more visible on the web.
1.1.2 Data mining techniques:

a. Association rules mining or market basket analysis:


Association rules mining is a technique that analyses a set of transactions, like
those captured at a supermarket checkout, each transaction being a list of products or
items purchased by one customer.
The aim of association rules mining is to determine which items are purchased
together frequently; this information may be used for cross-selling. Association rules
mining has many applications other
than market basket analysis, including applications in marketing, customer
segmentation, medicine, electronic commerce, classification, clustering, web mining,
bioinformatics, and finance.
A simple algorithm called the Apriori algorithm may be used to find
associations, but the algorithm becomes inefficient for applications where the data is
very large.
i. Supervised classification:
Supervised classification is an important data mining technique that has its
origins in machine learning. Supervised classification is appropriate to use if the data is
known to have a small number of classes, the classes are already known and some
training data with their classes known is available.


This can be used in predicting the class to which an object or individual is likely
to belong. This is useful, for example, in predicting whether an individual is likely to
respond to a direct mail solicitation, in identifying a good candidate for a surgical
procedure, or in identifying a good risk for granting a loan or insurance.
One of the most widely used supervised classification techniques is the decision
tree. The decision tree technique is widely used because it generates easily
understandable rules for classifying data.
b. Cluster analysis:
Cluster analysis is similar to classification but, in contrast to supervised
classification, cluster analysis is useful when the classes are not already known and
training data is not available.
The aim of cluster analysis is to find groups that are very different from each
other in a collection of data. Cluster analysis breaks up a single collection of perhaps
diverse data into a number of groups.
One of the most widely used cluster analysis methods is the k-means
algorithm, which requires that the user specify not only the number of clusters but also
their starting seeds.
c. Web data mining:
Searching the web has become an everyday experience for millions of people
from all over the world. From its beginning in the early 1990s, the web had grown to
more than four billion pages in 2004, and perhaps would grow to more than eight
billion pages by the end of 2006.
d. Search engines:
Search engines are huge databases of web pages as well as software packages for
indexing and retrieving the pages that enable users to find information of interest to
them. Normally the search engine databases of web pages are built and updated
automatically by web crawlers.


e. Data warehousing and OLAP:


Data warehousing is a process by which an enterprise collects data from the whole
enterprise to build a single version of the truth. This information is useful for decision
makers and may also be used for data mining.
A data warehouse can be of real help in data mining since data cleaning and
other problems of collecting data would have already been overcome.
1.1.3 Some Data Mining Case Studies

a. Aviation-Wipro’s frequent flyer program:


Wipro has reported a study of frequent flyer data from an Indian airline. Before
carrying out data mining, the data was selected and prepared. It was decided to use
only the three most common sectors.
For example, the airline did not know customers' marital status, or their income
or their reasons for taking a journey. These results provided the airline with a better
understanding of its business and may have helped it to refine flight programming to
better suit the customer.
b. Astronomy:
An interesting data mining application area is astronomy. Astronomers produce
huge amounts of data every night on the fluctuating intensity of around 20 million stars
which are classified by their spectra and their surface temperature.
Some 90% of stars are called main sequence stars including some stars that are
very large, very hot, and blue in color. The main sequence stars are fuelled by nuclear
fusion and are very stable, lasting billions of years.
As an example, when a clustering program was used to group a large amount of
astronomical data, four classes were found, corresponding to stars, galaxies with bright central
cores, galaxies without bright central cores, and stars with a visible “fuzz” around them.
The clustering program found unexpected but meaningful results without any
understanding of the astronomical data.


c. Banking and finance:


Banking and finance is a rapidly changing competitive industry. The industry is
using data mining for a variety of tasks including building customer profiles to better
understand the customer, to identify fraud, to evaluate risk in personal and home loans,
and to better forecast stock prices, interest rates, exchange rates and commodity prices.
d. Climate:
A study has been reported on atmospheric and oceanic parameters that cause
drought in the state of Nebraska in the USA.
o Standardized precipitation index(SPI)
o Palmer drought severity index
o Southern oscillation index
o Multivariate ENSO
o Pacific/North American index
o North Atlantic oscillation index
o Pacific decadal oscillation index
e. Crime prevention:
Data mining techniques were used to link serious sexual crimes to other
crimes that might have been committed by the same offenders. The data used
related to more than 2000 offences involving a variety of sexual crimes.
f. Direct mail service:
A direct mail company held a list of a large number of potential customers.
To carry out data mining, the company had to first prepare data, which included
sampling the data to select a subset of customers including those who responded to direct
mail and those that did not.
Using the decision tree approach, the company was able to identify the
characteristics of customers who were more likely to respond and was thus able to
reduce the number of customers it mailed to while simultaneously improving the
response rate.


g. Healthcare:
Data mining has been used in a variety of applications in healthcare. For example, in
drug testing, data mining may assist in isolating those patients for whom the drug is
most effective or for whom the drug is having unintended side effects.
Data mining has been used in determining factors influencing the survival
rate of heart transplant patients when the survival rate data were available for
a significant number of patients over several years.
Another study aimed to predict the length of hospital stay
for patients suffering from spinal cord injuries. The study required data
validation and data cleaning.
h. Manufacturing:
Data mining tools have been used to identify factors that lead to critical
manufacturing situations in order to warn engineers of impending problems. Data
mining applications have also been reported in power stations, petrochemical plants
and other types of manufacturing plants. For example, in a study involving a power
station, the company wanted to reduce its operating costs.
i. Marketing and Electronic Commerce:
One of the most widely discussed applications of data mining is that by
Amazon.com, which uses simple data mining tools to recommend to customers
what other products they might consider buying.
The rules use the customers’ own history of purchases as well as purchases
by similar customers. Strange results can sometimes be found in data mining. In
one data mining study, it was found that customers tended to favour one side of the
store where the specials were put and did not shop in the whole store.
j. Telecommunications:
The telecommunication industry in many countries is in turmoil due to
deregulation. The telecommunications business is changing through consolidation in
the market place and the convergence of new technologies.
o For example, video, data, voice.


o They also have to deal with technologies like voice over IP.
o A widely discussed data mining application in the telecommunications
industry is churn analysis.
The telecommunication company has to deal with a large number of variables
including the cost of local calls, the cost of international calls, the mobile phone plans,
the installation and disconnection rates, customer satisfaction data, and data about
customers who do not pay their bills or do not pay them on time.
1.1.4 The Future Of Data Mining

Since most of the time spent in data mining is actually spent on data extraction, data
cleaning and data manipulation, it is expected that technologies like data warehousing will
grow in importance. It has been found that as much as 40% of all collected data contains
errors.
To deal with such a large error rate, there is likely to be more emphasis in the future on
data warehousing, data cleaning and data extraction. Business users often find the
techniques difficult to understand and integrate into business processes.
The academic community is more interested in developing new techniques that
perform better than those that are already known. Data mining techniques depend upon a lot
of careful analysis of the business and a good understanding of the techniques and
software available.
Many data mining techniques are not based on sound theoretical background.
More theory regarding all data mining techniques and practices is also likely to be the
focus of data mining efforts in the future.
1.1.5 Data Mining Software

There is considerable data mining software available on the market. Most major
computing companies, like IBM, Oracle and Microsoft, are providing data mining
packages. Some data mining software can be expensive while other software is
available free and therefore a user should spend some time selecting an appropriate tool
for the task they face.


Data mining tools:


A more extensive list is available at www.kdnugget.co/software/index.html.
Other sites providing good information on data mining tools are
http://www.business.com/directory/copmuters_and_software/software_applications
/data_management/data_mining/and
http://fuzzy.cs.Uni-Magdeburg.de/~borgelt/software.html.

a. List of software packages:


i. Angoss software:
 It has data mining software called Knowledge STUDIO.
 It is a complete data mining package that includes facilities for
classification, cluster analysis and prediction.
 Knowledge STUDIO provides a visual, easy-to-use interface.
 Angoss also has another package called Knowledge SEEKER that is
designed to support decision tree classification.
ii. CART and MARS:
 This software from Salford Systems includes the CART decision tree.
 MARS predictive modeling, automated regression, TreeNet
classification and regression, modeling, clustering and anomaly
detection.
a).Clementine:
This is a well-known and comprehensive package that provides association rules,
classification, cluster analysis, factor analysis, forecasting, prediction and sequence
discovery.
b).Data miner software kit:
It is a collection of data mining tools offered in combination with the book
Predictive Data Mining.


c).DB Miner technologies:


This provides techniques for association rules, classification and cluster analysis.
It interfaces with SQL Server and is able to use some of the facilities of SQL Server.
d).Enterprise Miner:
It provides a user-friendly icon-based GUI front-end using their process model
called SEMMA (Sample, Explore, Modify, Model, and Assess).
e).Ghost Miner:
It includes data preprocessing, feature selection, k-nearest neighbors, neural
nets, decision trees, clustering and visualization.
h).Intelligent Miner:
This is a comprehensive data mining package from IBM. Its functionality includes
association rules, classification, cluster analysis, prediction, sequential patterns and time
series.
i).JDA intellect:
JDA Software Group has a comprehensive package called JDA Intellect that
provides facilities for association rules, classification, cluster analysis, and prediction.
j).Mantas:
This is a small company that was a spin-off from SRA International. This is now
designed to focus on detecting and analyzing suspicious behavior in financial markets
and to assist companies in complying with global regulations.
k).MCabiX from Diagnose:
It is a complete and affordable data mining toolbox, including decision tree,
neural networks, association rules and visualization.
l).MineSet:
Originally developed by SGI, now owned and further developed by Purple
Insight. This specializes in visualization and provides a variety of visualization tools
including the scatter visualizer, tree visualizer, statistics visualizer and the map visualizer.


m).Mining Mart:
This package was developed at the University of Dortmund in Germany. The
software focuses on data cleaning and provides a graphical tool for data preprocessing.
n).Oracle:
Users of Oracle therefore have access to techniques for association rules,
classification and prediction. Oracle Data Mining provides a graphical user interface.
o).Weka 3:
A collection of machine learning algorithms for solving data mining problems. It
is written in Java and runs on almost any platform.
p).Software evaluation and selection:
Many factors must be considered in evaluating the suitability of software:
 Product and vendor information.
 Total cost of the ownership.
 Performance.
 Functionality and modularity.
 Training and support.
 Reporting facilities and visualization.
 Usability.

1.2 Association Rules Mining

1.2.1 Introduction

Association rules mining or market basket analysis analyses a large database of
transactions with the aim of finding association rules. This has many applications other

transactions with the aim of finding association rules. This has many applications other
than market basket analysis, including application in marketing, customer
segmentation, medicine, electronic commerce, classification, clustering, web mining,
bioinformatics, and finance.


1.2.2 Basics

We can define the association rules mining terminology by using an example of a
small shop. Assume that the shop sells only a small variety of products.
Bread Cheese Coffee
Juice Milk Tea
Biscuits Newspaper Sugar
We assume that the shopkeeper keeps records of what each customer purchases.
 The shopkeeper wants to find which products are sold together
frequently.
 From the example, if sugar and tea are two items that are sold together
frequently, then the shopkeeper might consider having a sale on one of
them in the hope that it will not only increase the sale of that item but also
increase the sale of the other. We will not find a solution to this example.
 We now define some terminology.
 We assume that the number of items the shop stocks is n; in our example
n = 9 and these items are represented by the item set I = {i1, i2, …, in}.
 We assume that there are N transactions, with N = 10 in our example.
 We denote them by T = {t1, t2, …, tN}, each with a unique identifier and each
specifying a subset of items from the item set I purchased by one
customer.
 Let each transaction of m items be {i1, i2, …, im} with m ≤ n.
 Typically, transactions differ in the number of items. We now find the
association relationships, given a large number of transactions, such that
items that tend to occur together are identified.
 Association rules are often written as X → Y, meaning that whenever X
appears, Y also tends to appear.
 X and Y may be single items or sets of items. X is often
referred to as the rule’s antecedent and Y as the consequent.


 X → Y is a probabilistic relationship. It indicates only that X and Y have
been found together frequently in the given data and does not show a
causal relationship implying that buying of X by a customer causes
him/her to buy Y.
 Suppose items X and Y appear together in only 10% of the transactions
but whenever X appears there is an 80% chance that Y also appears.
 The 10% presence of X and Y together is called the support (or prevalence)
of the rule and the 80% chance is called the confidence (or
predictability) of the rule.
 The support and confidence are measures of the interestingness of the rule.
 A high level of support indicates that the rule is frequent enough for the
business to be interested in it.
 A high level of confidence shows that the rule is true often enough to justify
a decision based on it.
 Support of X is the number of times it appears in the database divided by
N and support for X and Y together is the number of times they appear
together divided by N.
Therefore, using P(X) to mean the probability of X in the database, we have:
Support(X) = (number of times X appears) / N = P(X)
Support(XY) = (number of times X and Y appear together) / N = P(X ∩ Y)
 Confidence for X → Y is defined as the ratio of the support for X and Y
together to the support for X.
 If X appears much more frequently than X and Y appear together, the
confidence will be low. It does not depend on how frequently Y appears.
Confidence(X → Y) = Support(XY) / Support(X) = P(X ∩ Y) / P(X) = P(Y/X)
 P(Y/X) is the probability of Y once X has taken place, also called the
conditional probability of Y given X.
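To make the definitions above concrete, here is a minimal Python sketch (my own illustration, not part of the original notes) that computes support and confidence for a candidate rule over a list of transactions; the transactions themselves are made up, using item names from the shop example.

# Minimal sketch: computing support and confidence for a rule X -> Y.
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset.issubset(t)) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of X -> Y = support of X and Y together / support of X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Hypothetical transactions using items from the shop described above.
transactions = [
    {"Bread", "Cheese", "Tea"},
    {"Sugar", "Tea", "Milk"},
    {"Bread", "Sugar", "Tea"},
    {"Juice", "Milk"},
]
print(support({"Sugar", "Tea"}, transactions))       # P(Sugar and Tea) = 0.5
print(confidence({"Sugar"}, {"Tea"}, transactions))  # P(Tea/Sugar) = 1.0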


1.2.3 The task and a naive algorithm

Let us consider a naive brute force algorithm to do the task. Consider the
following example: we have only four items for sale (Bread, Cheese, Juice and Milk) and
only four transactions.
We have to find the association rules with a minimum Support of 50% and
minimum Confidence of 75%.
Transaction table
Transaction ID Items
100 Bread, Cheese
200 Bread, Cheese, Juice
300 Bread, Milk
400 Cheese, Juice, Milk
The basis of our naive algorithm is as follows. We can list all the combinations of
the items that we have in stock and find which of these combinations are frequent, and
then we can find the association rules that have the required confidence from these frequent
combinations.
The four items and all the combinations of these four items and their frequencies
of occurrence in the transaction database are given in the following table.
The list of all item sets and their frequencies
Item sets Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
(Bread, Cheese) 2
(Bread, Juice) 1
(Bread, Milk) 1
(Cheese, Juice) 2
(Cheese, Milk) 1
(Juice, Milk) 1
(Bread, Cheese, Juice) 1
(Bread, Cheese, Milk) 0
(Bread, Juice, Milk) 0
(Cheese, Juice, Milk) 1
(Bread, Cheese, Juice, Milk) 0
From the above table, since we require a minimum support of 50%, we find the item
sets that occur in at least two transactions. Such item sets are called frequent.
The list of frequencies shows that all four items are frequent.
The frequent item sets are given in the following table.


Item sets Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
(Bread, Cheese) 2
(Cheese, Juice) 2
 We can now proceed to determine if the two 2-itemsets (Bread, Cheese)
and (Cheese, Juice) lead to association rules with the required confidence of
75%.
 Every 2-itemsets (A, B) can lead to two rules A→B and B→A if both
satisfy the required confidence.
 The confidence of A→B is given by the support for A and B together divided
by the support for A.
We have four possible rules and their confidence as follows:


Bread → Cheese with confidence of 2/3 = 67%
Cheese → Bread with confidence of 2/3 = 67%
Cheese → Juice with confidence of 2/3 = 67%
Juice → Cheese with confidence of 100%
 Therefore only the last rule Juice → Cheese has confidence above the
minimum 75% required and qualifies. Rules that meet or exceed the user-
specified minimum confidence are called confident. A code sketch of this
brute-force approach is given below.
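The brute-force procedure just described can be written directly in Python. The following sketch is my own illustration based on the worked example (four items, four transactions, 50% minimum support, 75% minimum confidence); it is not code from the original notes.

# Naive (brute force) association rules mining: enumerate every itemset,
# keep the frequent ones, then derive the confident rules from them.
from itertools import combinations

transactions = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
items = sorted(set().union(*transactions))
min_support, min_confidence = 0.5, 0.75
N = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t) / N

# List all combinations of the items and keep the frequent ones.
frequent = {}
for k in range(1, len(items) + 1):
    for itemset in combinations(items, k):
        s = support(itemset)
        if s >= min_support:
            frequent[frozenset(itemset)] = s

# From each frequent itemset of two or more items, generate confident rules.
for itemset, s in frequent.items():
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            conf = s / support(antecedent)
            if conf >= min_confidence:
                consequent = set(itemset - set(antecedent))
                print(set(antecedent), "->", consequent,
                      f"(support {s:.0%}, confidence {conf:.0%})")
# Output: only {'Juice'} -> {'Cheese'} qualifies, as found above.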
1.2.4 The Apriori Algorithm

The basic algorithm for finding the association rules was first proposed in 1993.
In 1994, an improved algorithm was proposed and called the Apriori algorithm.
This may be considered to consist of two parts. In the first part, those item sets
that exceed the minimum support requirement are found; such item sets are called frequent
item sets. In the second part, the association rules that meet the minimum confidence
requirement are found from the frequent item sets.
a.First part – Frequent item sets:
This part is carried out in the following steps; steps 2 and 3 are repeated until no new frequent item sets are found.
i.Step 1:
Scan all transactions and find all frequent items that have support above p%. Let
these frequent items be L1.
ii.Step 2:
Build potential sets of k items from Lk-1 by using pairs of item sets in Lk-1 such
that each pair has the first k-2 items in common. The k-2 common items and the one
remaining item from each of the two item sets are combined to form a k-item set. This
gives the candidate set Ck. This step is called apriori-gen.
iii.Step 3:
Scan all transactions and find all k-item sets in Ck that are frequent. The frequent
set obtained is Lk.
Terminate when no further frequent item sets are found, otherwise continue with
step 2. The main notation for association rule mining used in the Apriori algorithm is
the following: 1. A k-item set is a set of k items. 2. The set Ck is a set of candidate k-
item sets that are potentially frequent. 3. The set Lk is a subset of Ck and is the set of k-
item sets that are frequent. Some of the issues that arise in the Apriori algorithm are:
a.Computing L1:
We scan the disk-resident database only once to obtain L1. An item vector of
length n with a count for each item stored in the main memory may be used. Once the
scan of the database is finished and the count for each item found, the items that meet
the support criterion can be identified and L1 determined.
b.Apriori-gen function:
This is step 2 of the Apriori algorithm. It takes an argument Lk-1 and returns a
set of all candidate k-itemsets. In computing C3 from L2, we organize L2 so that the
itemsets are sorted in lexicographic order. Observe that if an itemset
in C3 is (a,b,c) then L2 must have the itemsets (a,b) and (a,c), since all subsets of a frequent
itemset must be frequent.
c.Pruning:
Once a candidate set Ck has been produced, we can prune some of the candidate
item sets by checking that all subsets of every item set in the set are frequent. For
example, if we have derived {a,b,c} from {a,b} and {a,c}, then we check that {b,c} is
also in L2. If it is not, {a,b,c} may be removed from C3.
The task of such pruning becomes harder as the number of items in the item set
grows, but the number of large item sets tends to be small.
d.Apriori subset function:
To improve the efficiency of searching, the candidate item sets Ck are stored in a
hash tree. Each leaf node is reached by traversing the tree whose root is at depth 1. Each
internal node of depth d points to all the related nodes at depth d+1 and the branch to
be taken is determined by applying the hash function.
e.Transactions storage:
We assume the data is too large to be stored in the main memory. It may be
stored as a set of transactions, each transaction being a sequence of item numbers.


f.Computing L2:
Assuming that C2 is available in the main memory, each candidate pair needs to
be tested to find if the pair is frequent. Given that C2 is likely to be large, this testing
must be done efficiently. In one scan, each transaction can be checked for the candidate
pairs.
g.Second Part – Finding the rules:
To find the association rules from the frequent item sets, we take a large frequent
item set, say p, and find each nonempty subset a. The rule a → (p-a) is possible if it
satisfies the minimum confidence. The confidence of this rule is given by support(p)/support(a).
Since confidence is given by support(p)/support(a), it is clear that if for some a
the rule a → (p-a) does not have the minimum confidence, then all rules like b → (p-
b), where b is a subset of a, will also not have the required confidence, since support(b) cannot be
smaller than support(a).
Another way to improve rule generation is to consider rules like (p-a) → a. If this
rule has the minimum confidence then all rules (p-b) → b will also have minimum
confidence if b is a subset of a, since (p-b) has more items than (p-a), given that b is
smaller than a, and so (p-b) cannot have support higher than that of (p-a).
Once again this can be used in improving the efficiency of rule generation. In
both the improvements noted above, the total number of items, and therefore
support(p), stays the same. A sketch of the full algorithm is given below.
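The following Python sketch is my own illustration of the two parts described above (not code from the notes) for small, in-memory data: step 2 joins pairs of (k-1)-itemsets that share their first k-2 items (apriori-gen), candidates with a non-frequent subset are pruned, and rules a -> (p-a) are then generated from each frequent itemset p.

# Sketch of the Apriori algorithm for a list of transactions given as Python sets.
from itertools import combinations

def apriori(transactions, min_support, min_confidence):
    N = len(transactions)
    support = {}                       # frozenset -> support of each frequent itemset

    def count(candidates):
        """Scan all transactions and keep the candidates that are frequent."""
        frequent = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / N
            if s >= min_support:
                frequent[c] = s
        return frequent

    # Step 1: frequent 1-itemsets (L1).
    Lk = count({frozenset([item]) for t in transactions for item in t})
    support.update(Lk)

    k = 2
    while Lk:
        # Step 2 (apriori-gen): join (k-1)-itemsets sharing their first k-2 items.
        prev = sorted(tuple(sorted(s)) for s in Lk)
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:k - 2] == b[:k - 2]:
                c = frozenset(a) | frozenset(b)
                # Pruning: every (k-1)-subset of c must itself be frequent.
                if all(frozenset(sub) in Lk for sub in combinations(c, k - 1)):
                    candidates.add(c)
        # Step 3: scan the transactions to find which candidates are frequent.
        Lk = count(candidates)
        support.update(Lk)
        k += 1

    # Second part: generate rules a -> (p - a) from each frequent itemset p.
    rules = []
    for p, sp in support.items():
        for r in range(1, len(p)):
            for a in map(frozenset, combinations(p, r)):
                conf = sp / support[a]
                if conf >= min_confidence:
                    rules.append((set(a), set(p - a), sp, conf))
    return support, rules

# Example, using the four transactions of the earlier worked example:
# support, rules = apriori([{"Bread", "Cheese"}, {"Bread", "Cheese", "Juice"},
#                           {"Bread", "Milk"}, {"Cheese", "Juice", "Milk"}], 0.5, 0.75)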
1.2.5 Improving the efficiency of the Apriori algorithm

The Apriori algorithm is resource intensive for large sets of transactions that
have a large set of frequent items. The major reasons for this may be summarized as
follows:
1. The number of candidate itemsets grows quickly and can result in huge
candidate sets. The larger the candidate set, the higher the processing cost for
scanning the transaction database to find the frequent itemsets. The performance
of the Apriori algorithm in the later stages therefore is not so much of a concern.


2. The Apriori algorithm requires many scans of the database. If n is the length of
the longest itemset, then (n+1) scans are required.
3. Many trivial rules are derived and it can often be difficult to extract the most
interesting rules from all the rules derived.
4. Some rules can be inexplicable and very fine grained, for example, toothbrush
was the most frequently sold item on Thursday mornings.
5. Redundant rules are generated. For example, if A → B is a rule then any rule AC
→ B is redundant. A number of approaches have been suggested to avoid
generating redundant rules.
6. Apriori assumes sparsity since the number of items in each transaction is small
compared with the total number of items.
A number of techniques for improving the performance of the Apriori algorithm have
been suggested. They can be classified into four categories:
 Reduce the number of candidate itemsets.
 Reduce the number of transactions.
 Reduce the number of Comparisons.
 Reduce the candidate sets efficiently.
Some algorithms that use one or more of the above approaches are:
 Apriori-TID
 Direct Hashing and Pruning(DHP)
 Dynamic Itemset Counting (DIC)
 Frequent Pattern Growth.
1.2.6 Mining frequent patterns without candidate generation (fp-growth):

This algorithm uses an approach that is different from that used by methods
based on the Apriori algorithm. The major difference between frequent pattern-growth
(FP-growth) and the other algorithms is that FP-growth does not generate the
candidates, it only tests.


a.Motivation for the FP-tree method is as follows:


Only the frequent items are needed to find the association rules, so it is best to
find the frequent items and ignore the others. If the frequent items can be stored in a
compact structure, then the original transaction database does not need to be used
repeatedly.
If multiple transactions share a set of frequent items, it may be possible to merge
the shared sets with the number of occurrences registered as count.
b.Generating the FP-tree:
The algorithm is as follows; a code sketch of this construction is given after the worked example below.
1. Scan the transaction database once, as in the Apriori algorithm, to find
the frequent items and their support.
2. Sort the frequent items in descending order of their support.
3. Initially, start creating the FP-tree with a root “null”.
4. Get the first transaction from the transaction database. Remove all non-
frequent items and list the remaining items according to the order in the
sorted frequent items.
5. Use the transaction to construct the first branch of the tree with each node
corresponding to a frequent item and showing that item’s frequency,
which is 1 for the first transaction.
6. Get the next transaction from the transaction database. Remove all non-
frequent items and list the remaining items according to the order of the sorted frequent items.
7. Insert the transaction in the tree using any common prefix that may
appear. Increase the item counts.
8. Continue with step 6 until all transactions in the database are processed.
Example- FP-tree


Table 1: Transaction database
Transaction ID   Items
100              Bread, Cheese, Eggs, Juice
200              Bread, Cheese, Juice
300              Bread, Milk
400              Bread, Juice, Milk
500              Cheese, Juice, Milk

Table 2: Frequent items for the database (items are sorted by their frequency)
Item     Frequency
Bread    4
Juice    4
Cheese   3
Milk     3

Now we remove the items that are not frequent from the transactions and order
the items according to their frequency.
Table 3: Database after removing the non-frequent items and reordering
Transaction ID   Items
100              Bread, Juice, Cheese
200              Bread, Juice, Cheese
300              Bread, Milk
400              Bread, Juice, Milk
500              Juice, Cheese, Milk

c.Now we start building the FP-tree.


We use the symbols B, J, C, and M for the items. An FP-tree consists of nodes. A
node contains three fields: an item name, a count, and a node link. The count tells the
number of occurrences the path has in the transaction database. The node link is a link
to the next node in the FP-tree containing the same item name or a null pointer if this
node is the last one with this name.


The FP-tree also consists of a header table with an entry for each item-set and a
link to the first item in the tree with the same name. This linking is done to make
traversal of the tree more efficient. Nodes with the same name in the tree are linked via
the dotted node-links.

STEPS: FP-tree
1. The tree is built by making a root node labeled NULL. A node is made for each
frequent item in the first transaction and the count is set to 1.
2. The first transaction {B, J, and C} is inserted in the empty tree with the root node
labeled NULL. Each of these items is given a frequency count of 1.
3. The second transaction, which is identical to the first, is inserted; it increases the
frequency count of each item on that branch to 2.
4. Next {B, M} is inserted. This requires that a node for M be created. The counter
for B goes to 3 and M is set to 1.
5. The next transaction {B, J, and M} results in the counters for B and J going up to 4
and 3 respectively and a new node for M with a count of 1.
6. The last transaction {J, C and M} results in a brand new branch for the tree which
is shown on the right-hand side in the above figure.

The nodes near the root of the tree are more frequent than those further down
the tree. The height of an FP-tree is always equal to the maximum number of frequent items
in a transaction, excluding the root node. The FP-tree is compact and often orders
of magnitude smaller than the transaction database. Once the FP-tree is constructed
the transaction database is not required.
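The construction just described can be sketched in Python as follows. This is my own illustration of the idea (a node with an item name, a count, a link to its parent and children, and a node-link, plus a header table); it is not code from the notes, and it builds the tree for the five transactions of the example above.

# Sketch of FP-tree construction (two passes over the transaction database).
class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item name ("null" for the root)
        self.count = 0            # number of transactions sharing this path
        self.parent = parent
        self.children = {}        # item name -> child FPNode
        self.node_link = None     # next node in the tree with the same item name

def build_fp_tree(transactions, min_count):
    # Pass 1: find the frequent items and sort them by descending frequency.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    frequent = {i: c for i, c in freq.items() if c >= min_count}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))

    root = FPNode("null", None)
    header = {}                   # item name -> first node with that item name
    # Pass 2: insert each transaction, keeping only frequent items in sorted order.
    for t in transactions:
        node = root
        for item in [i for i in order if i in t]:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                # Chain the new node into the header table's node-links.
                child.node_link = header.get(item)
                header[item] = child
            node = node.children[item]
            node.count += 1
    return root, header, order

transactions = [
    {"Bread", "Cheese", "Eggs", "Juice"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Bread", "Juice", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
root, header, order = build_fp_tree(transactions, min_count=3)
print(order)   # ['Bread', 'Juice', 'Cheese', 'Milk'], as in Table 2
print({item: child.count for item, child in root.children.items()})
# {'Bread': 4, 'Juice': 1}: one branch rooted at B and a second branch rooted at J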
d.Mining the FP-tree for frequent items:
To find the frequent item-sets we should note that for any frequent item a, all the
frequent item-sets containing a can be obtained by following a’s node-links,
starting from a’s head in the FP-tree header.
The mining on the FP-tree structure is done using an algorithm called the
frequent pattern growth (FP-growth). This algorithm starts with the least frequent
item, which is the last item in the header table.
By using the above example we find the frequent item-sets. We start with the
item M and find the following patterns:
BM (1)
BJM (1)
JCM (1)

No frequent item-set is discovered from these since no item-set appears three times.
Next we look at C and find the following:
BJC (2)
JC (1)


These two patterns give us a frequent item-set JC (3). Looking at J, the next
frequent item in the table, we obtain:
BJ (3)
J (1)

Again we obtain a frequent item-set BJ (3). There is no need to follow links from
item B as there are no other frequent item-sets.
Advantages of the FP-tree approach:
An advantage of the FP-tree algorithm is that it avoids scanning the database more than
twice to find the support counts. FP-growth completely eliminates the costly candidate
generation, which can be particularly expensive for the Apriori algorithm for the candidate set C2.
FP-growth algorithm is better than the Apriori algorithm when the transaction
database is huge and the minimum support count is low.
FP-growth algorithm uses a more efficient structure to mine patterns when the
database grows.
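In practice a ready-made implementation is often used. The sketch below is my own illustration and assumes the third-party mlxtend library (and pandas) is installed; it runs FP-growth on a one-hot encoding of the five example transactions and then derives the association rules.

# FP-growth via the mlxtend library (assumed installed); illustrative only.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["Bread", "Cheese", "Eggs", "Juice"],
    ["Bread", "Cheese", "Juice"],
    ["Bread", "Milk"],
    ["Bread", "Juice", "Milk"],
    ["Cheese", "Juice", "Milk"],
]

# One-hot encode the transactions into a boolean DataFrame.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# Frequent itemsets with minimum support of 60% (at least 3 of the 5 transactions).
frequent = fpgrowth(onehot, min_support=0.6, use_colnames=True)
print(frequent)

# Association rules that reach 75% confidence.
rules = association_rules(frequent, metric="confidence", min_threshold=0.75)
print(rules[["antecedents", "consequents", "support", "confidence"]])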


1.2.7 Performance evaluation of algorithms:

Performance evaluation has been carried out on a number of different association


mining algorithms. One study compared a number of methods, including the
Apriori, CHARM and FP-growth methods, using real-world data as well as artificial
data. It was concluded that:
1. The FP-growth method was usually better than the best implementation
of the Apriori algorithm.
2. CHARM was also usually better than Apriori. In some cases, CHARM
was better than the FP-growth.
3. Apriori was generally better than the other algorithms if the support required
was high, since high support leads to a smaller number of frequent items,
which suits the Apriori algorithm.
4. At very low support, the number of frequent items becomes
large and none of the algorithms were able to handle large frequent
sets gracefully.
A 2003 performance evaluation of programs found two algorithms to be the best:
 An efficient implementation of the FP-tree algorithm.
 An algorithm that combined a number of algorithms using multiple
heuristics.
The performance evaluation also included algorithms for closed item-set mining
as well as for maximal item-set mining.


UNIT – II

2.1 Classification

2.1.1 Introduction

Classification is a classical problem extensively studied by statisticians and
machine learning researchers. Classification is the separation or ordering of objects (or
things) into classes. If the classes are created without looking at the data (non-
empirically), the classification is called apriori classification.
If the classes are created empirically (by looking at the data), the classification is
called posteriori classification. In most literature on classification it is assumed that the
classes have been defined apriori and classification then consists of training the system
so that when a new object is presented to the trained system it is able to assign the
object to one of the existing classes. This approach is called supervised learning. Some
techniques are available for posteriori or unsupervised classification in which the classes
are determined based on the given data.
Data mining has generated renewed interest in classification. Since the databases
in data mining are often large, new classification techniques have been developed to
deal with millions of objects having perhaps dozens or even hundreds of attributes.
A classification process in which classes have been pre-defined needs a method
that will train the classification system to allocate objects to the classes. The training
sample is a set of sample data where for each sample the class is already known.
This attribute is known for the training data but for data other than the training
data (we call this other data the test data) we assume that the value of the attribute is
unknown and is to be determined by the classification method. This attribute may be
considered as the output of all other attributes and is often referred to as the output
attribute or the dependent attribute.
The attributes other than the output attribute are called the input attributes or
the independent attributes. In supervised learning schemes, it is assumed that we have
sufficient training data to build an accurate model during the training phase. The model
that we build from the training data is never 100% accurate and classification based on
the model will always lead to errors in some cases. In spite of such errors, classification
can be useful for prediction and better understanding of the data.
The attributes may be of different types. Attributes whose domain is numerical
are called numerical attributes while attributes whose domain is not numerical are
called categorical attributes. Categorical attributes may be ordered (e.g. a student’s
grade) or may be unordered (e.g. gender). Usually, the dependent attribute is assumed
to be categorical if it is a classification problem and then the value of the attribute is the
class label.
If the problem is not a classification problem, the dependent attribute may be
numerical. Usually such problems are called regression problems. Obviously, the two
problems are closely related and one type of problem may sometimes be converted to
another type, if necessary, by simple transformation of variables from either categorical
to continuous (by converting categories to numerical values which may not be always
possible) or from continuous to categorical (by bracketing numerical values and
assigning categories, e.g. salaries may be assigned high, medium and low categories).
Binning of continuous data into categories is quite simple although the selection of
ranges can have a significant impact on the results.
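As a small illustration of such bracketing (my own sketch, not from the notes; the salary values and bin edges are made up), the pandas cut function converts a numerical attribute into ordered categories:

# Converting a continuous attribute (salary) into low/medium/high categories.
import pandas as pd

salaries = pd.Series([18000, 42000, 65000, 91000, 120000])
salary_band = pd.cut(
    salaries,
    bins=[0, 30000, 80000, float("inf")],   # arbitrary bracket boundaries
    labels=["low", "medium", "high"],
)
print(salary_band)
# The choice of bin edges can have a significant impact on later results.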
Classification has many applications, for example prediction of customer
behavior (e.g. predicting direct mail responses or identifying telecom customers that
might switch companies) and identifying fraud. A classification method may use the
sample to derive a set of rules for allocating new applications to either of the two
classes.
Supervised learning requires training data, while testing requires additional data
which has also been pre-classified. Classifying the test data and comparing the results
with the known results can then determine the accuracy. The number of cases classified
correctly provides us with an estimate of the accuracy of the model. Although accuracy is a
very useful metric, it does not provide sufficient information about the utility of the model.


Our aim is to find highly accurate models that are easy to understand and which
are efficient when dealing with large datasets. There are a number of classification
methods; here we discuss only the decision tree and naive Bayes techniques.
2.1.2 Decision Tree

A decision tree is a popular classification method that results in a flow-
chart-like tree structure where each node denotes a test on an attribute value and each
branch represents an outcome of the test. The tree leaves represent the classes.
As an example of a decision tree, we show a possible result in figure 3.1 of
classifying the data in Table 3.1.

Table 3.1 Training data for a classification problem

Name Eggs Pouch Flies Feathers Class


Owl Yes No Yes Yes Bird
Cockatoo Yes No Yes Yes Bird
Emu Yes No No Yes Bird
Penguin Yes No No Yes Bird

Figure 3.1: A decision tree for the data in Table 3.1. The root node tests "Pouch?":
if Yes, the class is Marsupial; if No, a second node tests "Feathers?", giving class
Bird if Yes and class Mammal if No.


Each node of the tree is a decision while the leaves are the classes. Clearly this is
a very simple example that is unlike problems in real life.
A decision tree is a model that is both predictive and descriptive. It is a tree that
displays relationships found in the training data. The tree consists of zero or more
internal nodes and one or more leaf nodes with each internal node being a decision
node having two or more child nodes.
Each node of the tree represents a choice between a number of alternatives (in
“20 questions” the choices are binary) and each leaf node represents a classification or a
decision. The training process that generates the tree is called induction.
The decision tree technique is popular since the rules generated are easy to
describe and understand, the technique is fast unless the data is very large, and there is a
variety of software available.
It should be noted that there needs to be a balance between the number of
training samples and the number of independent attributes. Generally, the number of
training samples required is likely
to be relatively small if the number of independent attributes is small and the number
of training samples required is likely to be large when the number of attributes is large.
It is not unusual to have hundreds of objects in the training sample, although the
examples in these notes consider only a small training set.
The complexity of a decision tree increases as the number of attributes increases,
although in some situations it has been found that only a small number of attributes can
determine the class to which an object belongs and the rest of the attributes have little or
no impact.
The quality of training data usually plays an important role in determining the
quality of the decision tree. If there are a number of classes, then there should normally
be sufficient training data available that belongs to each of the classes.
It is of course not going to be possible to model the most infrequent situations. If
one tried to do that then we have a condition that is called an “overtrained model”
which produces errors because unusual cases were present in the training data.


Measuring the quality of a decision tree itself is an interesting problem.


Classification accuracy determined using test data is obviously a good measure but
other measures may be used. These include average cost and worst case cost of
classifying an object.
A decision tree may be able to classify the training data with 100% accuracy, but that
does not imply that the tree will be just as accurate on test data that was not part of the
training set. It may be that the tree’s performance on test data will be well below 100% if
the training data was not a good sample of the data population.
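As a quick illustration of inducing a decision tree from training data (my own sketch using the scikit-learn library, assumed installed; it is not part of the notes), the attributes of Table 3.1 can be encoded as 0/1 values. The last two rows are hypothetical additions, made up here only so that the training data contains more than one class:

# Fitting and inspecting a small decision tree with scikit-learn.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: Eggs, Pouch, Flies, Feathers (1 = yes, 0 = no).
X = [
    [1, 0, 1, 1],   # Owl       -> Bird
    [1, 0, 1, 1],   # Cockatoo  -> Bird
    [1, 0, 0, 1],   # Emu       -> Bird
    [1, 0, 0, 1],   # Penguin   -> Bird
    [0, 1, 0, 0],   # hypothetical marsupial example
    [0, 0, 0, 0],   # hypothetical mammal example
]
y = ["Bird", "Bird", "Bird", "Bird", "Marsupial", "Mammal"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the induced rules and classify a new, unseen animal.
print(export_text(tree, feature_names=["Eggs", "Pouch", "Flies", "Feathers"]))
print(tree.predict([[0, 1, 0, 0]]))   # -> ['Marsupial']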
2.1.3 Overfitting and pruning

The decision tree building algorithm continues until all leaf nodes are single-
class nodes, or no more attributes are available for splitting a node that has objects of
more than one class.
When the objects being classified have a large number of attributes and a tree of
maximum possible depth is built, the tree quality may not be high since the tree is built
to deal correctly with the training set.
Some branches of the tree may reflect anomalies due to noise or outliers in the
training samples. Such decision trees are a result of overfitting the training data and
may result in poor accuracy for unseen samples.
According to Occam's razor principle (due to the medieval philosopher
William of Occam) it is best to posit that the world is inherently simple and to choose
the simplest model from similar models, since the simplest model is more likely to be a
better model.

We can therefore “shave off” nodes and branches of a decision tree, essentially
replacing a whole subtree by a leaf node, if it can be established that the expected error
rate in the subtree is greater than that in the single leaf. This makes the classifier
simpler. A simple model has less chance of introducing inconsistencies, ambiguities
and redundancies.


Pruning is a technique to make an overfitted decision tree simpler and more
general.
There are a number of techniques for pruning a decision tree by removing some
splits and the subtrees created by them. One approach involves removing branches from a
"fully grown" tree to obtain a sequence of progressively pruned trees. The accuracy of
these trees is then computed and a pruned tree that is accurate enough and simple
enough is selected. It is advisable to use a set of data different from the training data to
decide which is the "best pruned tree".
Another approach is called pre-pruning in which tree construction is halted
early. Essentially a node is not split if this would result in the goodness measure of the
tree falling below a threshold. It is, however, quite difficult to choose an appropriate
threshold.
2.1.4 Decision tree rules

The decision tree method is a popular and relatively simple supervised
classification method in which each node of the tree specifies a test of some
attribute and each branch from the node corresponds to one of the values of the
attribute.
Each path from the root to a leaf of the decision tree therefore consists of
attribute tests, finally reaching a leaf that describes the class. The popularity of decision
trees is partly due to the ease of understanding the rules that the nodes specify. One
could even use the rules specified by a decision tree to retrieve data from a relational
database satisfying the rules using SQL.
There are a number of advantages in converting a decision tree to rules. Decision
rules make it easier to make pruning decisions since it is easier to see the context of each
rule. Also, converting to rules removes the distinction between attribute tests that occur
near the root of the tree and those that occur near the leaves.
IF-THEN rules may be derived based on the various paths from the root to the
leaf nodes. Although this simple approach will lead to as many rules as there are leaf
nodes, rules can often be combined to produce a smaller set of rules.


If Gender = "Male" then Class = B
If Gender = "Female" and Married = "Yes" then Class = C, else Class = A

Once all the rules have been generated, it may be possible to simplify them.
Rules with only one antecedent (e.g. If Gender = "Male" then Class = B) cannot be further
simplified, so we only consider those with two or more antecedents. It may be possible
to eliminate unnecessary rule antecedents that have no effect on the conclusion reached
by the rule. Some rules may be unnecessary and these may be removed. In some cases a
number of rules that lead to the same class may be combined.
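The rules above can be expressed directly as program logic. Below is a minimal sketch under the assumption that each object is held as a dictionary of attribute values; the record layout and function name are illustrative only, not part of any particular package.

# A minimal sketch of the two rules above as a classification function.
# The record is assumed to be a dictionary of attribute values (illustrative).
def classify(record):
    if record["Gender"] == "Male":
        return "B"
    if record["Gender"] == "Female" and record["Married"] == "Yes":
        return "C"
    return "A"

# A hypothetical record to classify
print(classify({"Gender": "Female", "Married": "No"}))    # prints A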

The quality of the decision tree and that of the rules depends on the quality of the
training sample. If the training sample is not a good representation of the population,
then one should be careful in reading too much into rules derived.
2.1.5 Naive Bayes Method

The Naïve Bayes method is based on the work of Thomas Bayes (1702-
1761). Bayes was a British minister and his theory was published only after his death. It
is a mystery what Bayes wanted to do with such calculations.
Bayesian classification is quite different from the decision tree approach. In
Bayesian classification we have a hypothesis that the given data belongs to a particular
class. We then calculate the probability of the hypothesis being true. The approach
requires only one scan of the whole data. Also, if at some stage additional
training data becomes available, then each training example can incrementally increase or
decrease the probability that a hypothesis is correct.
Before we define Bayes' theorem we will define some notation. The
expression P(A) refers to the probability that event A will occur. P(A|B) stands for the
probability that event A will happen, given that event B has already happened. In other
words, P(A|B) is the conditional probability of A based on the condition that B has
already happened.


Bayes' theorem:

P(A|B) = P(B|A) P(A) / P(B)

It follows from the definitions of conditional probability:

P(A|B) = P(A & B) / P(B)
P(B|A) = P(A & B) / P(A)

Dividing the first of these two equations by the second gives us Bayes' theorem.
Continuing with A and B being courses, we can compute the conditional probability if
we know what the probability of passing both courses is, that is P(A & B), and what
the probabilities of passing A and B separately are. If an event has already happened
then we divide the joint probability P(A & B) by the probability of what has just
happened and obtain the conditional probability.
Once the probabilities have been computed for all the classes, we simply assign X to
the class that has the highest conditional probability.
Let us consider how the probabilities P(Ci|X) may be calculated.
P(Ci|X) = [P(X|Ci) P(Ci)] / P(X)

 P(Ci|X) is the probability of the object X belonging to class Ci.
 P(X|Ci) is the probability of obtaining attribute values X if we know that the object
belongs to class Ci.
 P(Ci) is the probability of any object belonging to class Ci without any other
information.
 P(X) is the probability of obtaining attribute values X whatever class the object
belongs to.
The probabilities we need to compute are P(X|Ci), P(Ci) and P(X). Actually
the denominator P(X) is independent of Ci and is not required to be known, since it is
the same for every class and we are interested only in comparing the probabilities P(Ci|X).


Therefore we only need to compute P(X|Ci) and P(Ci) for each class.
Computing P(Ci) is rather easy since we count the number of instances of each class in
the training data and divide each by the total number of instances. This may not be the
most accurate estimate of P(Ci), but we have very little information, the training
sample, and no other information to obtain a better estimate.
To compute P(X|Ci) we use a naïve approach by assuming that all
attributes of X are independent, which is often not true.
Using the independence assumption and the training data, we compute an estimate
of the probability of obtaining the data X by estimating the probability of each of the
attribute values, counting the frequency of those values for class Ci.
We then determine the class allocation of X by computing P(X|Ci) P(Ci) for
each of the classes and allocating X to the class with the largest value.
The essence of the Bayesian approach is that the probability of the dependent attribute
can be estimated by computing estimates of the probabilities of the independent attributes.
It is possible to use this approach even if the values of all the independent attributes
are not known, since we can still estimate the probabilities of the attribute values that we
do know.
Example 3.3 – Naïve Bayes Method

Owns Home?   Married   Gender   Employed   Credit Rating   Risk class
Yes          Yes       Male     Yes        A               B
No           No        Female   Yes        A               A
Yes          Yes       Female   Yes        B               C
Yes          No        Male     No         B               B
No           Yes       Female   Yes        B               C
No           No        Female   Yes        B               A
No           No        Male     No         B               B
Yes          No        Female   Yes        A               A
No           Yes       Female   Yes        A               C
Yes          Yes       Female   Yes        A               C

There are 10 (s = 10) samples and three classes:
Credit risk class A = 3
Credit risk class B = 3
Credit risk class C = 4
The prior probabilities are obtained by dividing these frequencies by the total
number in the training data:
P(A) = 0.3, P(B) = 0.3, P(C) = 0.4
If the data that is presented to us is {yes, no, female, yes, A} for the five attributes,
we can compute the posterior probability for each class.
P(X|Ci) = P({yes, no, female, yes, A}|Ci) = P(Owns home = yes|Ci) × P(Married = no|Ci) ×
P(Gender = female|Ci) × P(Employed = yes|Ci) × P(Credit Rating = A|Ci)
Using expressions like that given above, we are able to compute the three
posterior probabilities for the three classes, namely that the person with attribute
values X has credit risk class A, class B or class C. We compute P(X|Ci) P(Ci) for each
of the three classes given P(A) = 0.3, P(B) = 0.3 and P(C) = 0.4, and these values are the
basis for comparing the three classes.
To compute P(X|Ci) = P({yes, no, female, yes, A}|Ci) for each of the classes, we
need the following probabilities for each:

P(Owns home = yes|Ci)
P(Married = no|Ci)
P(Gender = female|Ci)
P(Employed = yes|Ci)
P(Credit Rating = A|Ci)


Probability of events in the naïve Bayes method

Owns Home?   Married   Gender   Employed   Credit Rating   Risk class
No           No        Female   Yes        A               A
No           No        Female   Yes        B               A
Yes          No        Female   Yes        A               A
1/3          1         1        1          2/3             Probability of having {yes, no, female, yes, A} attribute values given risk class A

Yes          Yes       Male     Yes        A               B
Yes          No        Male     No         B               B
No           No        Male     No         B               B
2/3          2/3       0        1/3        1/3             Probability of having {yes, no, female, yes, A} attribute values given risk class B

Yes          Yes       Female   Yes        B               C
No           Yes       Female   Yes        B               C
No           Yes       Female   Yes        A               C
Yes          Yes       Female   Yes        A               C
0.5          0         1        1.0        0.5             Probability of having {yes, no, female, yes, A} attribute values given risk class C

Given these estimates of the probabilities, we can compute
P(X|A) = 2/9
P(X|B) = 0
P(X|C) = 0
Multiplying by the prior probabilities gives P(X|A)P(A) = (2/9)(0.3) ≈ 0.067, while the
corresponding values for classes B and C are zero, so X is allocated to credit risk class A.


The naïve Bayes method assumes that all attributes are independent and that the training
sample is a good sample for estimating probabilities.
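As a concrete illustration, the following is a minimal sketch in Python of the naïve Bayes computation for Example 3.3. The training rows and the object X are taken from the example above; the variable names are otherwise illustrative.

from collections import Counter

# Training data from Example 3.3:
# (Owns Home?, Married, Gender, Employed, Credit Rating) -> Risk class
rows = [
    (("Yes", "Yes", "Male",   "Yes", "A"), "B"),
    (("No",  "No",  "Female", "Yes", "A"), "A"),
    (("Yes", "Yes", "Female", "Yes", "B"), "C"),
    (("Yes", "No",  "Male",   "No",  "B"), "B"),
    (("No",  "Yes", "Female", "Yes", "B"), "C"),
    (("No",  "No",  "Female", "Yes", "B"), "A"),
    (("No",  "No",  "Male",   "No",  "B"), "B"),
    (("Yes", "No",  "Female", "Yes", "A"), "A"),
    (("No",  "Yes", "Female", "Yes", "A"), "C"),
    (("Yes", "Yes", "Female", "Yes", "A"), "C"),
]
X = ("Yes", "No", "Female", "Yes", "A")          # the object to be classified

class_counts = Counter(cls for _, cls in rows)
scores = {}
for cls, count in class_counts.items():
    prior = count / len(rows)                    # P(Ci)
    likelihood = 1.0                             # P(X|Ci) under the independence assumption
    for j, value in enumerate(X):
        matches = sum(1 for attrs, c in rows if c == cls and attrs[j] == value)
        likelihood *= matches / count            # P(attribute j = value | Ci)
    scores[cls] = prior * likelihood             # P(X|Ci) P(Ci)

print(scores)                                    # A: 0.3 * 2/9 (approx. 0.067); B and C: 0
print("Allocated class:", max(scores, key=scores.get))   # A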
2.1.6 Estimating predictive accuracy of classification methods

The accuracy of a classification method is the ability of the method to correctly


determine the class of a randomly selected data instance. It may be expressed as the
probability of correctly classifying unseen data. Estimating the accuracy of a supervised
classification method can be difficult if only the training data is available and all of that
data has been used in building the model. In such a situation, overoptimistic predictions
are often made regarding the accuracy of the model. The accuracy estimation problem is
much easier when much more data is available than is required for training the model.
Often when a great deal of past data is available, only a part of this data is used
for training and the rest, which does not include the training set, can be used for
testing the method.
The training set should normally be obtained by random sampling, since just
using the last n cases or using just data from the head office can lead to bias in the
training data, which should be avoided.
Different sets of training data would often lead to somewhat different models,
and therefore testing is very important in determining how accurate each model is.
Accuracy may be measured using a number of metrics. These include sensitivity,
specificity, precision and accuracy.
The methods for estimating errors include holdout, random sub-sampling, k-fold
cross-validation and leave-one-out. Some of the methods may be used when the data
available is limited.
Let us assume that the test data has a total of T objects. When testing a method
we find that C of the T objects are correctly classified.
The error rate then may be defined as
Error rate = (T – C) / T
This error rate is only an estimate of the true error rate and it is expected
to be a good estimate if the number of test data T is large and representative of the

population. A method of estimating the error rate is considered biased if it either tends to
underestimate the error or tends to overestimate it.
The advantage of using a confusion matrix is that it not only tells us how many objects
got misclassified but also which misclassifications occurred.
A confusion matrix for three classes

                         True class
Predicted class       1      2      3
       1              8      1      1
       2              2      9      2
       3              0      0      7

Using the above table, we can define the terms "false positive" (FP) and "false
negative" (FN).
False positive cases are those that did not belong to a class but were allocated to it.
False negative cases, on the other hand, are cases that belong to a class but were not
allocated to it.
We now define sensitivity and specificity:
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
where TP (true positives) is the total number of correctly classified objects and TN
(true negatives) is the total number of objects that did not get classified to a class they
did not belong to.
Consider class 1 in the above table. There are 10 objects that belong to this class and
20 do not. Of these 10, only 8 are classified correctly. In total, 24 objects are classified
correctly. Out of the 20 that did not belong to class 1, 2 objects are classified wrongly as
belonging to it. So we have TP = 8, TN = 18, FN = 2 and FP = 2. For class 2, TP = 9,
TN = 16, FN = 1 and FP = 4, and for class 3, TP = 7, TN = 20, FN = 3 and FP = 0.
Sensitivity = TP / (TP + FN) = 24/30 = 80%


Specificity = TN / (TN + FP) = 54/60 = 90%
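The calculation above can also be carried out directly from the confusion matrix. The following is a minimal sketch using numpy (an assumption, since no particular software is prescribed here); the matrix is the one given above.

import numpy as np

# Confusion matrix from above: rows = predicted class, columns = true class
cm = np.array([[8, 1, 1],
               [2, 9, 2],
               [0, 0, 7]])
total = cm.sum()

TP = FP = FN = TN = 0
for i in range(cm.shape[0]):
    tp = cm[i, i]                      # predicted i and truly i
    fp = cm[i, :].sum() - tp           # predicted i but truly another class
    fn = cm[:, i].sum() - tp           # truly i but predicted another class
    tn = total - tp - fp - fn          # neither predicted nor truly i
    TP, FP, FN, TN = TP + tp, FP + fp, FN + fn, TN + tn

print("Sensitivity =", TP / (TP + FN))   # 24/30 = 0.8
print("Specificity =", TN / (TN + FP))   # 54/60 = 0.9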


Sometimes terms like "false acceptance rate" (FAR) and "false rejection rate" (FRR) are also used.
Methods for estimating the accuracy:
1. Holdout method
2. Random sub-sampling method
3. K-fold cross-validation method
4. Leave-one-out method
5. Bootstrap method
1. Holdout Method
The holdout method (sometimes called the test sample method) requires a
training set and a test set. The sets are mutually exclusive. It may be that only one
dataset is available which has been divided into two subsets, the training subset and the
test or holdout subset. Once the classification method produces the model using the
training set, the test set can be used to estimate the accuracy.

2. Random sub-sampling method


Random sub-sampling is very much like the holdout method except that it does
not rely on a single test set. Essentially, the holdout estimation is repeated several times
and the accuracy estimate is obtained by computing the mean of the several trials. Random
sub-sampling is likely to produce better error estimates than those from the holdout
method.
3. K-fold cross-validation method
In K-fold cross-validation, the available data is randomly divided into k disjoint
subsets of approximately equal size. One of the subsets is used as the test set and the
remaining k - 1 sets are used for building the classifier. The test set is then used to estimate
the accuracy. This is done repeatedly k times so that each subset is used as a test subset
once. The accuracy estimate is then the mean of the estimates for each of the classifiers.
Cross-validation has been tested extensively and has been found to generally work
well when sufficient data is available.
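A minimal sketch of k-fold cross-validation is given below. The functions build_classifier and accuracy are hypothetical placeholders for whatever classification method and scoring function are being evaluated; only the splitting and averaging logic follows the description above.

import numpy as np

def k_fold_accuracy(X, y, k, build_classifier, accuracy, seed=0):
    """Estimate accuracy by k-fold cross-validation.

    X and y are numpy arrays of data and class labels.
    build_classifier(X_train, y_train) returns a fitted model and
    accuracy(model, X_test, y_test) returns a score; both are placeholders.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))            # random division of the data
    folds = np.array_split(indices, k)           # k disjoint subsets
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = build_classifier(X[train_idx], y[train_idx])
        scores.append(accuracy(model, X[test_idx], y[test_idx]))
    return np.mean(scores)                       # mean of the k estimates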


4. Leave-one out method


Leave-one-out is a simpler version of k-fold cross-validation. In this method, one
of the training samples is taken out and the model is generated using the remaining
training data. Once the model is built, the one remaining sample is used for testing and
the result is coded as 1 or 0 depending on whether it was classified correctly or not. The
average of such results provides an estimate of the accuracy.
The leave-one-out method is useful when the dataset is small. For large training
datasets, leave-one-out can become expensive since many iterations are required.
Leave-one-out is unbiased but has high variance and is therefore not particularly reliable.
5. Bootstrap Method
In this method, given a dataset of size n, a bootstrap sample is randomly selected
uniformly with replacement by sampling n times and used to build a model. It can be
shown that only 63.2% of these samples are unique.
The error in building the model is estimated by using the remaining 36.8% of
objects that are not in the bootstrap sample.
The final error is then computed as 0.368 times the training error plus 0.632 times
the testing error. The figures 0.632 and 0.368 are based on the following: if n samples
are selected with replacement from n available objects to form the training data, then
the expected percentage of unique objects in the training data is 0.632 or 63.2%, and the
remaining 0.368 or 36.8% of the initial objects do not appear in the training sample and
are used as test data.
This is repeated and the average of the error estimates is obtained. The bootstrap
method is unbiased and, in contrast to leave-one-out, has low variance, but many
iterations are needed for good error estimates if the sample is small.
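The following is a minimal sketch of the bootstrap error estimation described above (with the 0.368/0.632 weighting). The functions build_classifier and error_rate are hypothetical placeholders.

import numpy as np

def bootstrap_632_error(X, y, build_classifier, error_rate, iterations=100, seed=0):
    """Estimate the error rate using the bootstrap method described above."""
    rng = np.random.default_rng(seed)
    n = len(X)
    estimates = []
    for _ in range(iterations):
        boot = rng.integers(0, n, size=n)            # n samples with replacement
        oob = np.setdiff1d(np.arange(n), boot)       # the objects left out (about 36.8%)
        model = build_classifier(X[boot], y[boot])
        train_err = error_rate(model, X[boot], y[boot])
        test_err = error_rate(model, X[oob], y[oob])
        estimates.append(0.368 * train_err + 0.632 * test_err)
    return np.mean(estimates)                        # average of the error estimates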
K-fold cross-validation, leave-one-out and bootstrap techniques are used in
many decisions tree packages including CART and Clementine.


2.1.7 Other evaluation criteria for classification methods

We have discussed several methods for estimating the predictive accuracy of classification methods
as well as some techniques for improving accuracy. Other criteria for evaluating
classification methods, like those for all data mining techniques, are:
1. Speed
2. Robustness
3. Scalability
4. Interpretability
5. Goodness of the model
6. Flexibility
7. Time complexity

1.Speed
Speed involves not just the time or computation cost of constructing a model, it
also includes the time required to learn to use the model. Obviously, a user wishes to
minimize both times although it has to be understood that any significant data mining
project will take time to plan and prepare the data.
2.Robustness
Data errors are common, in particular when data is being collected from a
number of sources and errors may remain even after data cleaning. It is therefore
desirable that a method be able to produce good results in spite of some errors and
missing values in datasets.
3.Scalability
Data mining methods were originally designed for small datasets. Many have
been modified to deal with large problems. Given that large datasets are becoming
common, it is desirable that a method continues to work efficiently for large disk-
resident databases.


4.Interpretability
A data mining professional's task is to ensure that the results of data mining are
explained to the decision makers. It is therefore desirable that the end-user be able to
understand and gain insight from the results produced by the classification method.
5.Goodness of the model
For a model to be effective, it needs to fit the problem that is being solved. For
example, in decision tree classification, it is desirable to find a decision tree of the
"right" size and compactness that is also accurate.
2.1.8 Classification Software

The decision tree is one of the major techniques included in these packages. Some
packages include the naïve Bayes method as well as other classification methods. The
list of software given here is for classification only. It should be noted that different
software using the same technique may not produce the same results since there are
subtle differences in the techniques used.
This is not a comprehensive list, but it includes some of the most widely used
classification software. A more comprehensive classification software list is available at
the KDnuggets site (http://www.kdnuggets.com/software/classification.html).
 C4.5, version 8 of the "classic" decision-tree tool, developed by J.R. Quinlan, is
available at (http://www.rulequest.com/Personal).
 C5.0/See5 from Rulequest Research are designed to deal with large datasets. The
software constructs classifiers in the form of a decision tree or a set of if-then-else
rules, and uses boosting to reduce errors on unseen data. Links to a large number of
published case studies of its use are available at:
(http://www.rulequest.com/see5-pubs.html).
 CART 5.0 and TreeNet from Salford Systems are well known decision tree
software packages. TreeNet provides boosting; CART is the decision tree
software. The package incorporates facilities for data pre-processing and
predictive modeling including bagging and arcing. (http://www.salford-
systems.com/).


 DTREG, from a company with the same name, generates classification trees
when the classes are categorical and regression decision trees when the classes
are numerical intervals, and finds the optimal tree size. In both cases, the
attribute values may be discrete or numerical. The software modules TreeBoost and
Decision Tree Forest generate an ensemble of decision trees: in TreeBoost each tree is
generated based on input from the previous tree, while Decision Tree Forest
generates the trees independently of each other. (http://www.dtreg.com/).
 Model Builder for decision trees from Fair Isaac specializes in credit card and
fraud detection. It offers software for decision trees, including advanced tree-
building software that leverages data and business expertise to guide the user
in strategy development
(http://www.fairisaac.com/Fairisaac/solution/Product+Index/Model+Builder
/).

 OC1 (Oblique Classifier 1), written in ANSI C, is a decision tree system that accepts
numerical attribute values. It builds decision trees with linear combinations of
one or more attributes at each internal node.
(http://www.cs.jhu.edu/~salzberg/announce-ocl.html)
 Quandstone system version 5 by Quandstone comprises a number of modules
including five variants of decision tree algorithms and six varieties of regression-
based scorecard models. The system provides data extraction, management, pre-
processing and visualization, plus customer profiling, segmentation and
geographical display. (http://www.quandstone.com/).
 Shih Data Miner, from a company of the same name, includes a tree builder to build
decision trees using a variety of split algorithms. The software handles missing
values and provides tools for testing including cross-validation, pruning and
misclassification costs. (http://www.shih.be/dataminer/technical.html)


 SMILES provides new splitting criteria, non-greedy search, new partitions, and
extraction of several different solutions.
(http://www.dsic.upv.es/~flip/smiles/)
 NBC: a simple Naïve Bayes classifier, written in awk.
(http://scant.org/nbc/nbc.html)


UNIT-III

3.1 Cluster Analysis

Cluster analysis is a collection of methods that assists the user in putting


different objects from a collection of objects into different groups. The aim of cluster
analysis is exploratory, to find if data naturally falls into meaningful groups with small
within-group variations and large between-group variation. Often we may not have a
hypothesis that we are trying to test.
The aim is to find any interesting grouping of the data. It is possible to define
cluster analysis as an optimization problem in which a given function consisting of
within-cluster (intra-cluster) similarity and between-cluster (inter-cluster) dissimilarity
needs to be optimized. Since finding an exact optimum over all possible groupings is
computationally infeasible for all but the smallest datasets, clustering methods only try
to find an approximate or local optimum solution.
3.1.1 Cluster Analysis

a. Desired Features Of Cluster Analysis


 (For large datasets) Scalability: Data mining problems can be large and
therefore it is desirable that a cluster analysis method be able to deal with small
as well as large problems gracefully. The method should also scale well to
datasets in which the number of attributes is large.
 (For large datasets) Only one scan of the dataset : For large problems, the data
must be stored on the disk and the cost of I/O from the disk can then become
significant in solving the problem.
 (For large datasets) Ability to stop and resume : When the dataset is very large,
cluster analysis may require considerable processor time to complete the task. In
such cases, it is desirable that the task be able to be stopped and then resumed
when convenient.
 Minimal input parameters: The cluster analysis method should not expect too
much guidance from the user.


 Robustness : Most data obtained from a variety of sources has errors. It is


therefore desirable that a cluster analysis method be able to deal with noise,
outliers and missing values gracefully.
 Ability to discover different cluster shapes: Clusters come in different shapes
and not all clusters are spherical. Some applications require that various shapes
be considered.
 Different data types: Many problems have a mixture of data types, for
example, numerical, categorical and even textual. It is therefore desirable that a
cluster analysis method be able to deal with not only numerical data but also
Boolean and categorical data.
 Result independent of data input order : Although this is a simple requirement,
not all methods satisfy it. It is therefore desirable that a cluster analysis method
not be sensitive to data input order.
3.1.2 Types of data

Datasets come in a number of different forms. The data may be quantitative,


binary, nominal or ordinal. Quantitative (or numerical) data is quite common, for
example, weight, marks, height, price, salary, and count. There are a number of
methods for computing similarity between quantitative data.
Binary data is also quite common, for example, gender, marital status. As we
have noted earlier, computing similarity or distance between categorical variables is not
as simple as for quantitative data but a number of methods have been proposed.
Qualitative nominal data is similar to binary data but may take more than two
values and has no natural order, for example religion, foods or colours. Qualitative
ordinal (or ranked) data is similar to nominal data except that the data has an order
associated with it, for example, grades A, B, C, D and sizes S, M, L, XL.
The problem of measuring distance between ordinal variables is different than
for nominal variables since the order of the values is important. One method of
computing distance involves transferring the values to numeric values according to

their rank. Other types of data are also possible. For example, data may include text
strings or a sequence of Web pages visited.
3.1.3 Computing Distance

Cluster analysis methods are based on measuring similarity between objects by


computing the distance between each pair. Distance is a well understood concept that
has a number of simple properties.
1. Distance is always positive.
2. Distance from point x to itself is always zero.
3. Distance from point x to point y cannot be greater than the sum of the distance
from x to some other point z and the distance from z to y.
4. Distance from x to y is always the same as from y to x.
Let the distance between two points x and y (both vectors) be D(x, y). We now define a
number of distance measures (a short code sketch of these measures is given after the definitions).
1.Euclidean distance
 Euclidean distance or the L2 norm of the difference vector is most commonly
used to compute distances and has an intuitive appeal but the largest valued
attribute may dominate the distance.
D(x, y) = (∑ (xi - yi)^2)^(1/2)
 Euclidean distance measure is more appropriate when the data is not
standardized, but as noted above the distance measure can be greatly affected by
the scale of the data.
2.Manhattan distance.
 Another commonly used distance metric is the Manhattan distance or the L1
norm of the difference vector. In most cases, the results obtained by the
Manhattan distance are similar to those obtained by using the Euclidean
distance.
D(x, y) = ∑ |xi - yi|


3.Chebychev distance
 This distance metric is based on the maximum attribute difference. It is also
called the L∞ (L-infinity) norm of the difference vector.
D(x, y) = Max |xi - yi|
4.Categorical data distance
 This distance measure may be used if many attributes have categorical values
with only a small number of values (e.g. binary values). Let N be the total
number of categorical attributes.
D(x, y) = (number of attributes for which xi and yi differ) / N
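A minimal sketch of the four distance measures, using numpy, is given below; the example vectors are illustrative.

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))     # L2 norm of the difference vector

def manhattan(x, y):
    return np.sum(np.abs(x - y))             # L1 norm of the difference vector

def chebychev(x, y):
    return np.max(np.abs(x - y))             # maximum attribute difference

def categorical(x, y):
    # proportion of categorical attributes on which the two objects differ
    return sum(a != b for a, b in zip(x, y)) / len(x)

x = np.array([3.0, 5.0, 1.0])
y = np.array([1.0, 2.0, 2.0])
print(euclidean(x, y), manhattan(x, y), chebychev(x, y))
print(categorical(["Yes", "No", "Female"], ["Yes", "Yes", "Male"]))    # 2/3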

3.1.4 Types of cluster analysis methods

3.1.5 Partitional Methods

Partitional methods obtain a single-level partition of objects. These methods usually
are based on greedy heuristics that are used iteratively to obtain a local optimum
solution.
 Each cluster has at least one object and each object belongs to only one cluster.
Objects may be relocated between clusters as the clusters are refined.
 Often these methods require that the number of cluster be specified apriori and
this number usually does not change during the processing.

a. Hierarchical methods
 Hierarchical methods obtain a nested partition of the objects resulting in a tree of
clusters.
 These methods either start with one cluster and then split into smaller and
smaller clusters (called divisive or top down) or start with each object in an
individual cluster and then try to merge similar clusters into larger and larger
clusters (called agglomerative or bottom up).


b. Density-based methods
 In this class of methods, typically for each data point in a cluster, at least a
minimum number point must exist within a given radius.
c.Grid-based methods
 In this class of methods, the object space rather than the data is divided into a
grid.
 Grid partitioning is based on characteristics of the data, and such methods can
deal with non-numeric data more easily. Grid-based methods are not affected by
data ordering.
d.Model-based methods
 A model is assumed, perhaps based on a probability distribution. Essentially the
algorithm tries to build clusters with a high level of similarity within them and a
low level of similarity between them.
 Similarity measurement is based on the mean values and the algorithm tries to
minimize the squared-error function.
A simple taxonomy of cluster analysis methods is presented in Figure 4.1.

Figure 4.1 Taxonomy of cluster analysis methods.


3.1.6 Hierarchical Methods

 The hierarchical methods attempt to capture the structure of the data by
constructing a tree of clusters. Two types of hierarchical approaches are
possible, as given below:
a) Agglomerative approach (the bottom-up approach)
b) Divisive approach (the top-down approach)
Distance Between Clusters
The hierarchical clustering methods require distances between clusters to be
computed. These distance metrics are often called linkage metrics. Computing distances
between large clusters can be expensive. We will discuss the following methods for
computing distances between clusters (a short code sketch of these measures follows their descriptions):
1. Single-link algorithm
2. Complete-link algorithm
3. Centroid algorithm
4. Average-link algorithm
5. Ward's minimum-variance algorithm
1.Single-link
The single-link (or the nearest neighbour) algorithm is the simplest algorithm for
computing distance between two clusters. The algorithm determines the distance
between two clusters as the minimum of the distances between all pairs of points (a, x)
where a is from the first cluster and x is from the second.
The algorithm therefore requires that all pairwise distances be computed and the
smallest distance (or the shortest link) found.

2.Complete-link
The complete-link algorithm is also called the farthest neighbours algorithm. In
this algorithm, the distance between two clusters is defined as the maximum of the
pairwise distances (a, x). Therefore if there are m elements in one cluster and n in the
other, all mn pairwise distances must be computed and the largest chosen.


Both single-link and complete-link measures have their difficulties. In the single-
link algorithm, each cluster may have an outlier and the two outliers may be nearby,
so the distance between the two clusters would be computed to be small. Single-link
can form a chain of objects as clusters are combined, since there is no constraint on
the distance between objects. On the other hand, the two outliers may be very far away
although the clusters are nearby, and the complete-link algorithm will then compute the
distance as large.
3.Centroid
The centroid algorithm computes the distance between two clusters as the
distance between the average point of each of the two clusters. Usually the squared
Euclidean distance between the centroids is used.
4.Average-link
The average-link algorithm computes the distance between two clusters as the
average of all pairwise distances between an object from one cluster and another from
the other cluster. Therefore if there are m elements in one cluster and n in the other,
there are mn distances to be computed, added and divided by mn.
This approach also generally works well. It tends to join clusters with small
variances. Figure 4.5 shows two clusters A and B and the average-link distance between
them.

Figure 4.5 The average-link distance between two clusters.


5.Ward's minimum-variance method

The method generally works well and results in creating small tight clusters. Ward's
distance is the difference between the total within-cluster sum of squares for the two
clusters separately and the within-cluster sum of squares resulting from merging the
two clusters.
An expression for Ward's distance may be derived. It may be expressed as follows:
DW(A, B) = NA NB DC(A, B) / (NA + NB)
where DW(A, B) is Ward's minimum-variance distance between clusters A
and B with NA and NB objects in them respectively, and DC(A, B) is the centroid distance
between the two clusters computed as the squared Euclidean distance between their
centroids.
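A minimal sketch of the five inter-cluster distance measures, using numpy, is given below for two small illustrative clusters A and B. The squared Euclidean distance between centroids is used for the centroid and Ward's measures, as noted above.

import numpy as np
from itertools import product

def pairwise(A, B):
    # distances between every object of A and every object of B
    return np.array([np.linalg.norm(a - b) for a, b in product(A, B)])

def single_link(A, B):
    return pairwise(A, B).min()              # smallest of the mn distances

def complete_link(A, B):
    return pairwise(A, B).max()              # largest of the mn distances

def average_link(A, B):
    return pairwise(A, B).mean()             # average of the mn distances

def centroid_distance(A, B):
    # squared Euclidean distance between the two centroids
    return np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)

def ward_distance(A, B):
    nA, nB = len(A), len(B)
    return nA * nB * centroid_distance(A, B) / (nA + nB)

A = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]])
B = np.array([[6.0, 5.0], [7.0, 6.0]])
print(single_link(A, B), complete_link(A, B), average_link(A, B),
      centroid_distance(A, B), ward_distance(A, B))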
i.Agglomerative method:
 The agglomerative clustering method tries to discover such structure given a
dataset.
 The basic idea of the agglomerative method is to start out with n clusters for n
data points, that is, each cluster consisting of a single data point.
 Using a measure of distance, at each step of the method, the method merges two
nearest clusters, thus reducing the number of clusters and building successively
larger clusters.
 The process continues until the required number of clusters has been obtained or
all the data points are in one cluster.
 The agglomerative method leads to hierarchical clusters in which at each step we
build larger and larger clusters that include increasingly dissimilar objects.
 The agglomerative method is basically a bottom-up approach which involves the
following steps.
An implementation may, however, include some variation of these steps; a short code sketch follows the list below.
1. Allocate each point to a cluster of its own. Thus we start with n clusters for n
objects.


2. Create a distance matrix by computing distances between all pairs of clusters either
using,for example, the single-link metric or the complete-link metric. Some other
metric may also be used. Sort these distances in ascending order.
3. Find the two clusters that have the smallest distance between them.
4. Remove the pair of objects and merge them.
5. If there is only one cluster left then stop.
6. Compute all distances from the new cluster and update the distance matrix after
the merger and go to Step 3.
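A minimal sketch of the agglomerative method is given below under the assumption that SciPy is available: linkage builds the sequence of merges using, for example, the single-link metric, and fcluster stops when the required number of clusters has been obtained. The data is illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: each row is an object, each column an attribute
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.2],
              [5.2, 4.8], [9.0, 0.5], [8.7, 0.9]])

Z = linkage(X, method="single")                    # sequence of merges, single-link metric
labels = fcluster(Z, t=3, criterion="maxclust")    # stop when 3 clusters remain
print(labels)                                      # cluster label for each object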
ii.Divisive Hierarchical Method
 The divisive method is the opposite of the agglomerative method.
 This method starts with the whole dataset as one cluster and then proceeds to
recursively divide the cluster into two sub-clusters, continuing until each cluster has only
one object or some other stopping criterion has been reached. There are two
types of divisive methods:
1. Monothetic: It splits a cluster using only one attribute at a time. An attribute that
has the most variation could be selected.
2. Polythetic: It splits a cluster using all of the attributes together. Two clusters far
apart could be built based on distance between objects.
A typical polythetic divisive method works like the following:
1. Decide on a method of measuring the distance between two objects.
2. Create a distance matrix by computing distances between all pairs of objects
within the cluster. Sort these distances in ascending order.
3. Find the two objects that have the largest distance between them. They are the
most dissimilar objects.
4. If the distance between the two objects is smaller than the pre-specified threshold
and there is no other cluster that needs to be divided then stop, otherwise continue.
5. Use the pair of objects as seeds of a K-means method to create two new clusters.


6. If there is only one object in each cluster then stop otherwise continue with Step
2.
In the above method, we need to resolve the following two issues:
• Which cluster to split next?
• How to split a cluster?
a.Which cluster to split next?
1. Split the clusters in some sequential order.
2. Split the cluster that has the largest number of objects.
3. Split the cluster that has the largest variation within it.
b.How to split a cluster?
We used a simple approach for splitting a cluster based on distance between the objects
in the cluster. A distance matrix is created and the two most dissimilar objects are
selected as seeds of two new clusters. The K-means method is then used to split the
cluster.
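A minimal sketch of the splitting step described above is given below: the two most dissimilar objects are chosen as seeds and every object is assigned to the nearer seed. This is a single K-means-style assignment pass; a full implementation would iterate the K-means steps. The data is illustrative.

import numpy as np

def split_cluster(X):
    """Split one cluster into two using its most dissimilar pair as seeds."""
    # distance matrix between all pairs of objects in the cluster
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(D), D.shape)     # the two most dissimilar objects
    seeds = X[[i, j]]
    # assign every object to the nearer of the two seeds (one K-means-style pass)
    nearest = np.argmin(np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=-1), axis=1)
    return X[nearest == 0], X[nearest == 1]

left, right = split_cluster(np.array([[1.0, 1.0], [1.2, 0.8], [6.0, 6.0], [5.5, 6.2]]))
print(left)
print(right)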
Advantages of the Hierarchical Approach
1. Hierarchical methods are conceptually simpler and can be implemented easily.
2. Hierarchical methods can provide clusters at different levels of granularity.
Disadvantages of the Hierarchical Approach
1. The hierarchical methods do not include a mechanism by which objects that have
been incorrectly put in a cluster may be reassigned to another cluster.
2. The time complexity of hierarchical methods can be shown to be O(n³).
3. The distance matrix requires O(n²) space and becomes very large for a large
number of objects.
4. Different distance metrics and scaling of data can significantly change the results.
3.1.7 Density-Based Methods

In this method clusters are high-density collections of data that are separated by
a large space of low-density data (which is assumed to be noise). For each data point in a
cluster, at least a minimum number of points must exist within a given distance.


Data that is not within such high-density clusters is regarded as outliers or noise.
The idea behind density-based clustering is that the clusters are dense regions of
probability density in the data set.
DBSCAN (density based spatial clustering of applications with noise) is one
example of a density-based method for clustering.
It requires two input parameters: the size of the neighbourhood (R) and the
minimum number of points in the neighbourhood (N). Essentially these two parameters
determine the density within the clusters that the user is willing to accept, since they
specify how many points must be in a region.
The number of points not only determines the density of acceptable clusters but
it also determines which objects will be labelled outliers or noise. Objects are declared
to be outliers if there are few other objects in their neighbourhood. The size parameter R
determines the size of the clusters found. If R is big enough, there would be one big
cluster and no outliers. If R is small, there will be small dense clusters and there might
be many outliers.
a.Concepts of DBSCAN method
1. Neighbourhood: The neighbourhood of an object y is defined as all the objects that
are within the radius R from y.
2. Core object: An object y is called a core object if there are N objects within its
neighbourhood.
3. Proximity: Two objects are defined to be in proximity to each other if they belong
to the same cluster. Object x 1 is in proximity to object x 2 if two conditions are
satisfied:
(a) The objects are close enough to each other, i.e. within a distance of R.
(b) x 2 is a core object as defined above.
4. Connectivity: Two objects x1 and xn are connected if there is a path or chain of
objects x1, x2, …, xn from x1 to xn such that each consecutive pair of objects is in proximity.
Basic Algorithm for Density-based Clustering:
1. Select values of R and N.


2. Arbitrarily select an object p.


3. Retrieve all objects that are connected to p, given R and N.
4. If p is a core object, a cluster is formed.
5. If p is a border object, no objects are in its proximity. Choose another object. Go to
Step 3.
6. Continue the process until all of the objects have been processed.
Essentially the above algorithm forms clusters by connecting neighbouring core
objects and those non-core objects that are on the boundaries of clusters. The remaining non-
core objects are labelled outliers.
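A minimal sketch of density-based clustering is given below under the assumption that scikit-learn is available; the neighbourhood size R corresponds to the eps parameter and the minimum number of points N to min_samples. The data is illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],     # one dense region
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],     # another dense region
              [4.5, 0.5]])                            # an isolated object

# R maps to eps, N maps to min_samples
labels = DBSCAN(eps=0.6, min_samples=3).fit_predict(X)
print(labels)          # cluster labels; -1 marks objects labelled as outliers/noise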
3.1.8 Dealing with large databases

Most clustering methods implicitly assume that all data is accessible in the
main memory. Often the size of the database is not considered, but a method requiring
multiple scans of data that is disk-resident could be quite inefficient for large problems.
a.K-Means Method for Large Databases
One modification of the K-means method deals with data that is too large to fit in the main
memory. The method first picks the number of clusters and their seed centroids and
then attempts to classify each object as belonging to one of the following three groups
(a sketch of the summary statistics kept for the discard set is given after the list):
(a) Those that are certain to belong to a cluster. These objects together are called the
discard set. Some information about these objects is computed and saved. This
includes the number of objects n, a vector sum of all attribute values of the n
objects (a vector S) and a vector sum of squares of all attribute values of the n
objects (a vector Q).
(b) The objects are however sufficiently far away from each cluster's centroid that
they cannot yet be put in the discard set of objects. These objects together are
called the compression set.
(c) The remaining objects that are too difficult to assign to either of the two groups
above.These objects are called the retained set and are stored as individual
objects.
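A minimal sketch of the summary statistics (n, S, Q) kept for the discard set is given below, showing how the centroid and per-attribute variance can be recovered without storing the individual objects; the function names and data are illustrative.

import numpy as np

def summarise(objects):
    """Keep only (n, S, Q) for the objects placed in the discard set."""
    objects = np.asarray(objects, dtype=float)
    n = len(objects)
    S = objects.sum(axis=0)            # vector sum of attribute values
    Q = (objects ** 2).sum(axis=0)     # vector sum of squares of attribute values
    return n, S, Q

def centroid_and_variance(n, S, Q):
    mean = S / n
    variance = Q / n - mean ** 2       # per-attribute variance from the summaries alone
    return mean, variance

n, S, Q = summarise([[2.0, 4.0], [4.0, 8.0], [3.0, 6.0]])
print(centroid_and_variance(n, S, Q))  # centroid (3, 6); variances (2/3, 8/3)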


b.Hierarchical Method for Large Databases—Concept of Fractionation


Dealing with large datasets is difficult using hierarchical methods since the
methods require an N x N distance matrix to be computed for N objects.
It is based on the idea of splitting the data into manageable subsets called "fractions"
and then applying a hierarchical method to each fraction. The concept is called
fractionation.
ALGORITHM:
1. Split the large dataset into fractions of size M.
2. The hierarchical clustering technique being used is applied to each fraction. Let the
number of clusters so obtained from all the fractions be C.
3. For each of the C clusters, compute the mean of the attribute values of the objects
in it. Let this mean vector be mi, i = 1, ..., C. These cluster means are called meta-
observations.
4. If the number of meta-observations C is too large (greater than M), go to Step 1, otherwise
apply the same hierarchical clustering technique to the meta-observations
obtained in Step 3.
5. Allocate each object of the original dataset to the cluster with the nearest mean
obtained in Step 4.
The cost of this algorithm is linear in terms of the size of the dataset. The accuracy of
the results is related to the accuracy of the clustering algorithm used.
3.1.9 Quality and validity of cluster analysis methods

Evaluation is particularly difficult since all clustering methods will produce


clusters even if there are no meaningful clusters in the data. In cluster analysis there is
no test data available as there often is in classification. Also, even if the process has been
successful and clusters have been found, two different methods may produce different
clusters.
Different methods are biased towards different types of clusters; for example, a method
may favour compact clusters because it minimizes the maximum distance between the
points in a cluster. The results of K-means may in the first instance be evaluated by
examining each attribute's mean for

each cluster in an attempt to assess how far each cluster is from the other. Another
approach is based on computing within cluster variation (I) and between clusters
variation (E). These variations may be computed as follows:
Let the number of clusters be k and let the clusters be Ci, i = 1, ..., k. Let the total
number of objects be N and let the number of objects in cluster Ci be Mi, so that
M1 + M2 + … + Mk = N
The within-cluster variation for a cluster Ci is defined as the average squared
distance of each object from the centroid of the cluster. That is, if mi is the centroid of
the cluster Ci, then the centroid is given by
mi = (∑ xj) / Mi, the sum being taken over the objects xj in Ci.
The between cluster distances E may now be computed given the centroids of the
clusters. It is the average sum of squares of pairwise distances between the centroids of
the k clusters. Evaluating the quality of clustering methods or results of a cluster
analysis is a challenging task.
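A minimal sketch, using numpy, of computing the within-cluster variation I and the between-cluster variation E as defined above is given below; the data and the partition into clusters are illustrative.

import numpy as np
from itertools import combinations

# Illustrative data already partitioned into k = 2 clusters
clusters = [np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]),
            np.array([[6.0, 5.0], [7.0, 6.0], [6.0, 7.0]])]

centroids = [C.mean(axis=0) for C in clusters]

# Within-cluster variation: average squared distance of each object from its centroid
I = [np.mean(np.sum((C - m) ** 2, axis=1)) for C, m in zip(clusters, centroids)]

# Between-cluster variation: average sum of squares of pairwise centroid distances
E = np.mean([np.sum((a - b) ** 2) for a, b in combinations(centroids, 2)])

print("Within-cluster variation:", I)
print("Between-cluster variation:", E)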
The quality of a method involves a number of criteria:
1. Efficiency of the method
2. Ability of the method to deal with noisy and missing data
3. Ability of the method to deal with large problems
4. Ability of the method to deal with a variety of attribute types and magnitudes
Once several clustering results have been obtained, they can be combined, hopefully
providing better results than those from just one clustering of the given data.
Experiments using these techniques suggest that they can lead to improved quality and
robustness of results.
3.1.10 Cluster analysis software

A more comprehensive list of cluster analysis software is available at
http://www.kdnuggets.com/software/clustering.html. ClustanGraphics7 from
Clustan offers a variety of clustering methods including K-means, density-based and
hierarchical cluster analysis.


UNIT-IV
4.1 Web Data Mining

4.1.1 Introduction

Web mining is the application of data mining techniques to find interesting and
potentially useful knowledge from web data. It is normally expected that either the
hyperlink structure of the web or the web log data or both have been used in the mining
process.
Web mining may be divided in to several categories:
1. Web content mining
It deals with discovering useful information or knowledge from web page
contents. It goes well beyond using keywords in a search engine, and focuses on
the web page contents rather than the links.
2. Web Structure Mining:
It deals with discovering and modeling the link structures of the web. Work has
been carried out to model the web based on the topology of the hyperlinks.
3. Web Usage Mining:
It deals with understanding the user behavior in interacting with the web or with
a web site. One of the aims is to obtain information that may assist web site
reorganization or assist site adaptation to better suit the user.
4.1.2 Web terminology and characteristics

The web is seen as having a two tier architecture. The first tier is the web server
that serves the information to the client machine and the second tier is the client that
displays that information to the user.
 This architecture is supported by three web standards, namely HTML (Hyper
Text Markup Language) for defining the web document content, URLs (Uniform
Resource Locators) for naming and identifying remote information resources in
the global web world, and HTTP (Hyper-Text Transfer Protocol) for managing
the transfer of information from the server to the client.


 Web terminology is based on the work of the World Wide Web Consortium (W3C)


 The World Wide Web (WWW) is the set of all nodes which are interconnected by
hyper text links.
 A link expresses one or more relationships between two or more resources. Links
may also be established within a document by using anchors.
 A Web page is a collection of information, consisting of one or more web
resources, intended to be rendered simultaneously & identified by a single URL.
 A Web site is a collection of interlinked web pages, including a home page,
residing at the same network location.
 A Client browser is the primary user interface to the web. It is a program which
allows a person to view the content of web pages and for navigating from one
page to another.
 A Uniform Resource Locator (URL) is an identifier for an abstract or physical
resource, for example a server and a file path or index. URLs are location
dependent and each URL consists of four distinct parts:
 Protocol types (usually http)
 Name of the web server
 Directory path
 File Name
If a file is not specified, index.html is assumed.
 A Web server serves web pages using http to client machines so that a browser
can display them.
 A Client is the role adopted by an application when it is retrieving a web
resource.
 A Proxy is an intermediary which acts as both a server and a client for the
purpose of retrieving resources on behalf of other clients.


 A Domain name server is a distributed database of name to address mapping.


When a DNS server looks up a computer name, it either finds it in its list, or asks
another DNS server which knows more names.
 A Cookie is the data sent by a web server to a web client, to be stored locally by
the client and sent back to the server on subsequent requests.
 Obtaining information from the web using a search engine is called “pull” while
information sent to users is called information “push”.
 EX: users may register with a site and then information is sent (pushed) to such
users without them requesting it.
a. Graph Terminology
 A "directed graph" is a set of nodes (which correspond to pages on the web) denoted
by V and edges (which correspond to links on the web) denoted by E. That is, a
graph is (V, E) where all edges are directed.
 An "undirected graph" may also be represented by nodes and edges (V, E) but the
edges have no direction specified.
A directed graph is like the links that point from one page to another on the web, while
an undirected graph is not like the pages and links on the web unless we assume the
possibility of traversal in both directions. A graph may be searched either by a breadth-first
search or by a depth-first search.
Breadth-first search is based on first searching all the nodes that can be reached from the
node where the search is starting and, once these nodes have been searched, searching the
nodes at the next level that can be reached from those nodes, and so on. Depth-first search is
based on searching any unvisited descendants of a given node first, then visiting the
node and then any sibling nodes.
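A minimal sketch of a breadth-first search over a small, illustrative web link graph (represented as an adjacency list of page names) is given below.

from collections import deque

# An illustrative web link graph: page -> pages it links to
links = {
    "home": ["students", "staff", "research"],
    "students": ["courses", "admissions"],
    "staff": ["research"],
    "research": [],
    "courses": [],
    "admissions": [],
}

def breadth_first(start):
    visited, order, queue = {start}, [], deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links[page]:          # visit out-links level by level
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order

print(breadth_first("home"))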
"Diameter of a graph" – the maximum of the minimum distances between all possible
ordered node pairs (u, v), that is, the maximum number of links that one would need to follow
starting from any page u to reach any page v, assuming that the best path has been
followed.


An investigation of the structure of the web was carried out and it was found that the
web is not a ball of highly connected nodes. It displayed structural properties and the web
could be divided into the following five components:
1. The Strongly Connected Core (SCC): This part of the web was found to consist
of about 30% of the web, which is still very large given more than 4 billion pages
on the web in 2004.
2. The IN Group: This part of the web was found to consist of about 20% of the
web.
Main Property- pages in the group can reach the SCC but cannot be reached from
it.
3. The OUT Group: This part of the web was found to consist of about 20% of the
web.
Main Property - pages in the group can be reached from the SCC but cannot reach
the SCC.

4. Tendrils: This part of the web was found to consist of about 20% of the web.
Main Property - pages cannot be reached from the SCC and cannot reach the SCC.
This does not imply that these pages have no linkages to pages outside the group,
since they could well have linkages from the IN group and to the OUT group.
5. The Disconnected Group: This part of the web was found to be less than 10% of
the web and is essentially disconnected from the rest of the web world.
Ex: personal pages at many sites that link to no other page and have no links to
them.

b.Size Of the Web


The deep web includes information stored in searchable databases that is often
inaccessible to search engines. This information can often only be accessed by using
each website's interface. Some of this information may be available only to subscribers.


The shallow web is the information on the web that the search engines can access
without accessing the web databases. It has been estimated that the deep web is about
500 times the size of the shallow web. The web pages are very dynamic, changing
almost daily. Many new web pages are added to the web every day.
Perhaps web pages are used for other purposes as well, for example,
communicating information to a small number of individuals via the web rather than via
email. In many cases this makes good sense (wasting disk storage in mailboxes is
avoided).
But if this grows, then a very large number of web pages with a short life
span and low connectivity to other pages are generated each day.
Large numbers of websites disappear every day and create many problems on
the web. Links from even well known sites do not always work. Not all results of a
search engine search are guaranteed to work. To overcome these problems, web pages
are categorized as follows:
1. A web page that is guaranteed not to change over time.
2. A Web page that will not delete any content, may add content / links but the
page will not disappear.
3. A web page that may change content / links but the page will not disappear.
4. A web page without any guarantee.
c. Web Metrics
There are a number of properties of the web (other than its size and structure) that are
useful to measure. Some of the measures are based on distances between the
nodes of the web graph. It is possible to define how well connected a node is by using
the concept of the centrality of a node.
Centrality may be out-centrality, which is based on distances measured from the
node using its out-links, while in-centrality is based on distances measured from other
nodes that are connected to the node using its in-links. Based on these metrics, it is
possible to define the concept of compactness, which varies from 0 to 1: 0 for a completely
disconnected web graph and 1 for a fully connected web graph.


Perhaps the most important measurements about web pages are about a page’s
relevance and its quality. Relevance is usually related to a user query since it concerns
the user finding pages that are relevant to his query and may be defined in a number
of ways.
A simple approach involves relevance to be measured by the number of query
terms that appear in the page. Another approach is based on in-links from relevant
pages. In this, relevance of a page may be defined as the sum of the number of query
terms in the pages that refer to the page.
Relevance may also use the concept of co-citation. In co-citation, if both pages ‘a’
and ‘b’ point to a page ‘c’ then it is assumed that ‘a’ and ‘b’ have a common interest.
Similarly if ‘a’ points to both ‘b’ and ‘c’, then we assume that ‘b’ and ‘c’ also share a
common interest.
Quality is not determined by page content, since there is no automatic means of
evaluating the quality of content; instead it is determined by the link structure. Example: if
page ‘a’ points to page ‘b’ then it is assumed that page ‘a’ is endorsing page ‘b’ and we
can have some confidence in the quality of page ‘b’.
4.1.3 Locality and hierarchy in the web

The web shows a strong hierarchical structure. A website of any enterprise has the
homepage as the root of the tree and as we go down the tree we find more detailed
information about the enterprise.
Example: The homepage of a university will provide basic information and then
links, for example, to: Prospective students, staff, research, and information for current
students, information for current staff.
The prospective students node will have a number of links: courses offered, admission
requirements, scholarships available, semester dates, etc. The web also has a strong locality
feature, to the extent that almost two-thirds of all links are to sites within the enterprise
domain.


The one-third of links that point to sites outside the enterprise domain have a higher percentage of
broken links. Web sites often fetch information from a database to ensure that the information
is accurate and timely.
Web pages can be classified as:
1. Homepage or the head page: These pages represent an entry for the website of
an enterprise(so frequently visited) or a section within the enterprise or an
individual’s web page.
2. Index page: These pages assist the user to navigate through the enterprise
website. A homepage may also act as an index page.
3. Reference page: These pages provide some basic information that is used by a
number of other pages. Ex: each page in a website may have a link to a page that
provides the enterprise’s privacy policy.
4. Content Page: These pages only provide content and have little role in assisting a
user’s navigation. Often they are larger in size, have few out-links and are often
the leaf nodes of a tree.
Web site structure and content are based on careful analysis of what the most
common user questions are. The content should be organized into categories in a way
that makes traversing the site simple and logical.
Careful web user data mining can help. A number of simple principles have been
developed to design the structure and content of a web site.
Three basic principles are:
1. Relevant Linkage principle: It is assumed that links from a page point to other
relevant resources. Links are assumed to reflect the judgement of the page creator;
by providing a link to another page, the creator recommends the other
relevant page.
2. Topical unity principle: It is assumed that web pages that are co-cited (that is,
linked from the same pages) are related.
3. Lexical affinity principle: It is assumed that the text and the links within a page
are relevant to each other.


Unfortunately not all pages follow these basic principles, resulting in difficulties for
web users and web mining researchers.
4.1.4 Web Content Mining

Web content mining deals with discovering useful information from the web.
When we use search engines like Google to search contents on the web, search engines
find pages based on the location and frequency of keywords on the page although some
now use the concept of page rank.
If we consider the types of queries that are posed to search engines, we find that if
the query is very specific we face the scarcity problem, while if the query is for a broad topic we
face the abundance problem. Because of these problems, keyword search is only
suggestive of document relevance; the user is then left with the task of finding the
relevant documents.
Brin presents an example of finding content on the web. It shows how relevant
information from a wide variety of sources, presented in a wide variety of formats, may be
integrated for the user.
The example involves extracting a relation of books in the form of (author, title) from
the web, starting with a small sample list.
The problem is defined as: we wish to build a relation R that has a number of
attributes. The information about tuples of R is found on web pages but is unstructured.
The aim is to extract the information and structure it with a low error rate. The algorithm
proposed is called Dual Iterative Pattern Relation Extraction (DIPRE). It works as
follows:

1. Sample: Start with a sample S provided by the user.


2. Occurrence: Find occurrences of tuples starting with those in S. Once tuples are
found, the context of every occurrence is saved. Let these occurrences be O
(O is derived from S).
3. Patterns: Generate patterns based on the set of occurrences O. This requires
generating patterns with similar contexts.


4. Match Patterns: The web is now searched for the patterns.

5. Stop: If enough tuples have been found, stop; else go to step 2.

Here the pattern is defined as a tuple like (order, URL prefix, prefix, middle,
suffix). It may then match strings like (author, middle, title) or (title, middle, author).
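To make the flow of the algorithm concrete, here is a rough Python sketch of one DIPRE-style iteration under strong simplifying assumptions (a tiny in-memory set of "pages", a single sample pair and naive regular-expression patterns). It is not Brin's actual implementation, only an illustration of the occurrence, pattern and match cycle described above.

    # A rough sketch of one DIPRE-style iteration: occurrences of a known
    # (author, title) pair are located in page text, a (prefix, middle, suffix)
    # pattern is derived from their context, and the pattern is then matched
    # against other pages to extract new pairs.
    import re

    sample = [("Isaac Asimov", "The Robots of Dawn")]     # the user-supplied sample S
    pages = [
        "books: Isaac Asimov, author of The Robots of Dawn, and more",
        "books: Arthur C. Clarke, author of Rendezvous with Rama, and more",
    ]

    # Step 2: find occurrences of the sample tuples and record their context.
    occurrences = []
    for page in pages:
        for author, title in sample:
            m = re.search(re.escape(author) + "(.*?)" + re.escape(title), page)
            if m:
                prefix = page[: m.start()][-7:]           # a little text before the author
                suffix = page[m.end(): m.end() + 7]       # a little text after the title
                occurrences.append((prefix, m.group(1), suffix))

    # Steps 3 and 4: turn each occurrence into a pattern and match it on all pages.
    new_tuples = set()
    for prefix, middle, suffix in occurrences:
        pattern = re.escape(prefix) + "(.+?)" + re.escape(middle) + "(.+?)" + re.escape(suffix)
        for page in pages:
            for author, title in re.findall(pattern, page):
                new_tuples.add((author.strip(), title.strip()))

    print(new_tuples)   # includes ('Arthur C. Clarke', 'Rendezvous with Rama')

A real implementation would of course crawl far more pages, keep the URL prefix and ordering information in the pattern, and check the reliability of each generated pattern before using it.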
In step 2 it is assumed that the occurrence consists of title and author with a
variety of other information. Web pages are usually semi-structured (ex: HTML
documents). Database-generated HTML pages are more structured; however, many web pages
consist of unstructured free text data, which makes it more difficult to extract
information from them.
The semi-structured web pages may be represented based on the HTML
structures inside the documents. Some may also use hyperlink structure. A different
approach called database approach involves inferring the structure of a website and
transforming the website content into a database.
Web pages are published without any control on them. Some of the sophisticated
searching tools being developed are based on the work of the artificial intelligence
community. These are called Intelligent Web Agents. Some web agents replace the work
of search engines in finding information. Ex: ShopBot finds information about a product
that the user is searching for.
a. Web Document Clustering
Web document clustering is an approach to find relevant documents on a topic
or about query keywords. Search engines return a huge, unmanageable list of documents,
and finding the useful ones is often tedious.
A user could apply clustering to the set of documents returned by a search engine
with the aim of finding meaningful clusters that are easier to interpret. It is not
necessary to insist that a document can only belong to one cluster, since in some cases it
is justified to have a document belong to two or more clusters.
Web clustering may be based on content alone, may be based on both content &
links or based only on links. One approach that is specifically designed for web


document cluster analysis is Suffix Tree Clustering (STC) & it uses a phrase based
clustering approach rather than using a single word frequency.
In STC, the key requirements of a web document clustering algorithm include:

i.Relevance: This is most obvious requirement. We want clusters that are relevant to
user query.
ii.Browsable summaries: The clusters must be easy to understand. The user should be
quickly able to browse the description of a cluster.
iii.Snippet tolerance: The clustering method should not require whole documents &
should be able to produce relevant clusters based only on the information that the
search engine returns.
iv.Performance: The clustering method should be able to process the results of search
engine quickly & provide the resulting clusters to the user.

The STC algorithm consists of document cleaning, identifying base clusters and
combining base clusters. STC only requires the snippets of text that are returned by the
search engine; the cleaning step removes unwanted text that might get in the way of identifying
common phrases (ex: prefixes, punctuation).
The second step of the algorithm uses the cleaned snippets to mark sentence
boundaries and then builds a suffix tree to efficiently identify sets of documents that
share common phrases.
Once the suffix tree has been constructed, a number of base clusters can be derived
by reading the tree from the root to each leaf, which gives a phrase. If such a phrase belongs
to more than two snippets, then the phrase is considered a base cluster. Each base
cluster is given a score based on the number of words in the phrase and the number of
snippets that have that phrase.
Many base clusters will be overlapping, since documents often share more than
just one phrase. So clusters must be consolidated. A similarity index for every pair of


base clusters is created based on the documents that are in each cluster. If the base
clusters are similar, they are combined.
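A much simplified sketch of the base-cluster idea is given below. For brevity it indexes shared word n-grams instead of building a real suffix tree, treats any phrase shared by at least two snippets as a base cluster, scores it by coverage times phrase length, and merges clusters whose document sets overlap strongly; the snippets and thresholds are illustrative assumptions, not the actual STC implementation.

    # Illustrative sketch of STC-style base clusters from search-result snippets.
    from itertools import combinations

    snippets = {
        1: "data mining on the web",
        2: "web data mining techniques",
        3: "clustering search engine results",
    }

    def phrases(text, max_len=3):
        """Yield all word phrases of length 2..max_len from a snippet."""
        words = text.lower().split()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                yield " ".join(words[i:i + n])

    # Base clusters: phrases shared by at least two snippets, with a simple score.
    base = {}
    for doc_id, text in snippets.items():
        for phrase in set(phrases(text)):
            base.setdefault(phrase, set()).add(doc_id)
    base = {p: docs for p, docs in base.items() if len(docs) >= 2}
    scores = {p: len(docs) * len(p.split()) for p, docs in base.items()}

    # Merge base clusters whose document sets overlap strongly (similarity > 0.5).
    merged = []
    for p, q in combinations(base, 2):
        overlap = len(base[p] & base[q]) / min(len(base[p]), len(base[q]))
        if overlap > 0.5:
            merged.append((p, q, base[p] | base[q]))

    print(base, scores, merged, sep="\n")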
b.Finding Similar Web pages
It has been found that almost 30% of all web pages are very similar to other
pages. For example:
1. A local copy may have been made to enable faster access to the material.
2. FAQs are duplicated since such pages may be used frequently.
3. Online documentation of popular software like Unix may be duplicated for local
use.
4. There are mirror sites that copy highly accessed sites to reduce traffic.
In some cases, documents are not identical, for example because the copies have different
formatting or because several documents have been joined together to build a single document.
Copying a single web page is called replication and copying an entire web site is called mirroring.
Similarity between web pages usually means content-based similarity. It is also
possible to consider link-based and usage-based similarity. Link-based similarity is related to the
concept of co-citation and is used for discovering a core set of web pages on a topic.
Usage-based similarity is useful in grouping pages or users into meaningful groups.
Content-based similarity is based on comparing the textual content of the web pages. Non-text
content is not considered.
We define two concepts:
1. Resemblance: Resemblance of two documents is defined to be a number between
0 and 1, with 1 indicating that the two documents are virtually identical and any
value close to 1 indicating that the documents are very similar.
2. Containment: Containment of one document in another is defined as a number
between 0 and 1 indicating that the first document is completely contained in the
second.
A number of approaches exist to assess the similarity of documents. One brute-force
approach is to compare two documents using software like ‘diff’ in the Unix OS,
which compares the two documents as files. Other string comparison algorithms may be


used to find how many characters need to be deleted, changed or added to transform
one document to other, but this is expensive for comparing millions of documents.

 Issues in document matching are:


 If we are looking to compare millions of documents, then the storage requirement
of the method should not be large.
 Documents may be in HTML, PDF, and MS Word. They need to be converted to
text for comparison, which may introduce errors.
 The method should be robust, i.e., it should not be possible to avoid the matching
process with small changes to a document.
c.Finger Printing
An approach for comparing large number of documents is based on the idea of
fingerprinting documents. A document may be divided into all possible substrings of
length L. These substrings are called shingles. Based on the shingles we can define
resemblance R(X,Y) and containment C(X,Y) between two documents X and Y as
follows.
 We assume S(X) and S(Y) to be set of shingles for documents X and Y
respectively.
R(X,Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|
C(X,Y) = |S(X) ∩ S(Y)| / |S(X)|
Following algorithm may be used to find similar documents:
1. Collect all the documents that one wishes to compare.
2. Choose a suitable shingle width and compute the shingles for each document
3. Compare the shingles for each pair of documents.
4. Identify those documents that are similar.
This algorithm is inefficient for a large collection of documents. The web is very large
and the algorithm requires enormous storage to store the shingles and takes a long time. This
approach is called full fingerprinting.


Example:
We find if the two documents with the following content are similar:
Document1: “the web continues to grow at a fast rate”
Document2: “the web is growing at a fast rate”

First Step: Making a set of shingles from the documents. We obtain the below shingles if
we assume the shingle length to be 3 words.
Shingles in Doc1 Shingles in Doc2
The web continues The web is
Web continues to Web is growing
Continues to grow Is growing at
To grow at Growing at a
Grow at a At a fast
At a fast A fast rate
A fast rate

Comparing two sets of shingles we find that only 2 of them are identical. So the
documents are not very similar. We illustrate the approach using 3 shingles that are the
shortest in the number of letters including spaces.
Shingles in Doc1    Number of letters    Shingles in Doc2    Number of letters
The web continues 17 The web is 10
Web continues to 16 Web is growing 14
Continues to grow 17 Is growing at 13
To grow at 10 Growing at a 12
Grow at a 9 At a fast 9
At a fast 9 A fast rate 11
A fast rate 11


We select 3 shortest shingles for comparison. For first document, these are “to
grow at”, “grow at a”, “at a fast”. For second document, these are “the web is”, “at a
fast”, “a fast rate”. There is only one match out of three shingles.
False positives would be obtained for documents like “the Australian economy
is growing at a fast rate”, which shares these short shingles without being very similar to the
documents above. So, small-length shingles cause many false positives while
larger shingles result in more false negatives. A better approach involves randomly
selecting shingles of various lengths.
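The full-fingerprinting calculation for the two example documents can be sketched in a few lines of Python. The shingle width of three words and the two documents are taken from the discussion above, and the values printed at the end are simply R(X,Y) and C(X,Y) as defined earlier.

    # Word shingles, resemblance and containment for the two example documents.
    def shingles(text, width=3):
        words = text.lower().split()
        return {" ".join(words[i:i + width]) for i in range(len(words) - width + 1)}

    doc1 = "the web continues to grow at a fast rate"
    doc2 = "the web is growing at a fast rate"

    s1, s2 = shingles(doc1), shingles(doc2)
    resemblance = len(s1 & s2) / len(s1 | s2)    # R(X,Y)
    containment = len(s1 & s2) / len(s1)         # C(X,Y)

    print(sorted(s1 & s2))                        # the 2 shared shingles
    print(round(resemblance, 2), round(containment, 2))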
Issues in comparing documents using fingerprinting are:
 Should the shingle length be in number of words or characters?
 How much storage would be needed to store shingles?
 Should upper and lower-case letters be treated differently?
 Should punctuation marks be ignored?
 Should end of paragraph be ignored?
4.1.5 Web Usage Mining

The objective in web usage mining is to be able to understand and predict
user behaviour in interacting with the web or with a website in order to improve the
quality of service. Other aims are to obtain information and discover usage patterns that
may assist web site design/redesign, perhaps to assist navigation through the site.
The mined data includes web data repositories such as logs of users’
interactions with the web, web server logs, proxy server logs, browser logs, etc.
Information collected in the web server logs includes information about the access,
referrer and agent. Access information is about the user’s interaction with the server,
referrer information is about the referring page and the agent information is about the
browser used.
Some of the routine information may be obtained using tools that are available at
a low cost. Using such tools, it is possible to find at least the following information (a
minimal log-parsing sketch is given after the list):
1. Number of hits: Number of times each page in the web site has been viewed.
2. Number of visitors: Number of users who came to the site.


3. Visitor referring website: Web site URL of the site the user came from.
4. Visitor referral website: web site URL of the site where the user went when he left the
web site.
5. Entry point: which web site page the user entered from
6. Visitor time and duration: time and day of visit and how long the visitor
browsed the site.
7. Path analysis: List of path of pages that the user took
8. Visitor IP address: this helps in finding which part of the world the user came
from.
9. Browser type
10. Platform
11. Cookies
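As a minimal illustration of how some of this information is extracted, the sketch below counts hits per page and distinct visitor IP addresses from a few common-log-format lines; the log lines themselves are invented for the example and a real analysis tool would parse many more fields (referrer, agent, time spent, and so on).

    # Counting hits per page and visitors from web server access log lines.
    from collections import Counter

    log_lines = [
        '10.0.0.1 - - [12/Mar/2020:10:01:33 +0000] "GET /index.html HTTP/1.1" 200 5120',
        '10.0.0.2 - - [12/Mar/2020:10:02:10 +0000] "GET /courses.html HTTP/1.1" 200 2048',
        '10.0.0.1 - - [12/Mar/2020:10:03:55 +0000] "GET /index.html HTTP/1.1" 200 5120',
    ]

    hits = Counter()          # number of hits per page
    visitors = set()          # distinct visitor IP addresses

    for line in log_lines:
        ip = line.split()[0]
        page = line.split('"')[1].split()[1]      # the requested URL in the request string
        visitors.add(ip)
        hits[page] += 1

    print(hits.most_common())      # e.g. [('/index.html', 2), ('/courses.html', 1)]
    print(len(visitors), "visitors")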
This simple information can assist an enterprise to achieve the following:
1. Shorten the path to high visit pages
2. Eliminate or combine low visit pages
3. Redesign some pages including homepage to help user navigation
4. Redesign some pages so that the search engines can find them
5. Help evaluate effectiveness of an advertising campaign
In web usage mining it may be desirable to collect information on:
i.Path traversed: What paths do the customers traverse? What are the most commonly
traversed paths? These patterns need to be analyzed and acted upon.
ii.Conversion rates: What are the basket-to-buy rates for each product?
iii.Impact of advertising: Which banners are pulling in the most traffic? What is their
conversion rate?
iv.Impact of promotions: Which promotions generate the most sales?
v.Web site design: Which links do the customers click most frequently? Which links do
they buy from most frequently?
vi.Customer segmentation: What are the features of customers who stop without
buying? Where do the most profitable customers come from?


vii.Enterprise search: Which customers use enterprise search? Are they likely to
purchase? What do they search for?
Web usage mining also deals with caching and proxy servers. Not all page
requests are recorded in the server log because pages are often cached and served from
the cache when the users request them again.
Caching occurs at several levels. The browser itself has a cache to keep copies of
recently visited pages, which may be retained only for a short time. An enterprise maintains
a proxy server to reduce internet traffic and to improve performance.
The proxy server intercepts any request for web pages from any user in the enterprise
and serves the page from the proxy if a copy that has not expired is resident in the proxy. Use
of caches and proxies reduces the number of hits that the web server log records.
One may be interested in the behaviors of users, not just in one session but over
several sessions. Returning visitors may be recognized using cookies(which is issued by
the web server and stored on the client). The cookie is then presented to web server on
return visits.
Log data analysis has been investigated using the following techniques:
1. Using association rules
2. Using cluster analysis

In usage analysis, we want to know the sequence of nodes visited by a user,


which is not possible using association rules analysis. It should be noted that, in
association rules mining, the order of the pages visited does not play any role. We are only interested in
finding which items appear with which other items frequently.
In fact, if pages A and B appear together in an association rule, it is difficult to
tell if A was traversed first or B was.


4.1.6 Web Structure Mining

 The aim of web structure mining is to discover the link structure or the model that is
assumed to underlie the web. The model may be based on the topology of
hyperlinks.
 This can help in discovering similarity between sites or in discovering authority
site for a particular topic or in discovering overview or survey sites that point to
many authority sites (such sites are called hubs)
 The links on web pages provide a useful source of information that may be
bound together in web searches.
 Kleinberg has developed a connectivity analysis algorithm called Hyperlink-
Induced Topic Search(HITS) based on the assumption that links represent
human judgement.
 Algorithm also assumes that for any topic, there are a set of “authorities” sites
that are relevant on the topic and there are “hub” sites that contain useful links to
relevant sites on the topic.
 HITS algorithm is based on the idea that if the creator of page p provides a link
to page q, then p gives some authority on page q.
 But not all links confer authority, since some may be for navigational purposes,
e.g. links to the home page.

a.Definitions required for HITS:


A graph is a set of vertices and edges (V, E). The vertices correspond to the
pages and the edges correspond to the links.
A directed edge (p,q) indicates a link from page p to page q.

b.HITS algorithm has 2 major steps:

1. Sampling step: It collects a set of relevant web pages given a topic


2. Iterative step: It finds hubs and authorities using the information collected
during sampling.

Example:
For a query “web mining” the algorithm involves carrying out the sampling step
by querying a search engine and then using the most highly ranked web pages
retrieved for determining the authorities and hubs.
Posing a query to a search engine often results in abundance. In some cases, the
search engine may not retrieve all relevant pages for the query. Query for java may not
retrieve pages for object-oriented programming, some of which may also be relevant.
Step-1: Sampling step

First step involves finding a subset of nodes or a sub graph S, which are relevant
authoritative pages. To obtain such a sub graph, the algorithm starts with a root set of,
say 200 pages selected from the results of searching for the query in a traditional search
engine.
Let the root set be R. Starting from R, we wish to obtain a set S that has the
following properties:
1. S is relatively small
2. S is rich in relevant pages given the query
3. S contains most of the strongest authorities
HITS expand the root set R into a base set S by using the following algorithm:
1. Let S=R
2. For each page in S, do steps 3 to 5
3. Let T be the set of all pages S points to
4. Let F be the set of all pages that point to S
5. Let S = S + T + some or all of F (some if F is large)
6. Delete all links with the same domain name.
7. This S is returned


Set S is called the base set for the given query. We find the hubs and authorities from
the base set as follows:
 One approach is ordering pages by the count of their out-degree and the count of
their in-degree.
 Before starting the hubs and authorities step of the algorithm, HITS removes all links
between pages on the same web site (or same domain) in step 6, since links between
pages on the same site are often for navigational purposes, not for conferring authority.

Step-2 : Finding Hubs and Authorities

Algorithm for finding hubs and authorities is:

1. Let a page p have a non-negative authority weight Xp and a non-negative hub


weight Yp. Pages with large weights Xp will be classified to be the
authorities(similarly for hubs).
2. Weights are normalized so their squared sum for each type of weight is 1 since
only the relative weights are important.
3. For a page p, value of Xp is updated to be the sum of Yq over all pages Q that
link to p.
4. For a page p, value of Yp is updated to be the sum of Xq over all pages q that p
links to.
5. Continue with step 2 unless a termination condition has been reached.
6. On termination, the output of the algorithm is a set of pages with the largest Xp
weights that can be assumed to be authorities and those with the largest Yp
weights that can be assumed to be hubs.

Example:
Assume a query has been specified. The first step has been carried out and a set of
pages that contain most of the authorities has been found.


Let the set be given by the following 10 pages. Their out-links are listed below:
Page A (out-links to G,H,I,J)
Page B (out-links to A,G,H,I,J)
Page C (out-links to B,G,H,I,J)
Page D (out-links to B,G,H,J)
Page E (out-links to B,H,I,J)
Page F (out-links to B,G,I,J)
Page G (out-links to H,I)
Page H (out-links to G,I,J)
Page I (out-links to H)
Page J (out-links to F,G,H,I)

Connection matrix for this information is as follows:

Page A B C D E F G H I J
A 0 0 0 0 0 0 1 1 1 1
B 1 0 0 0 0 0 1 1 1 1
C 0 1 0 0 0 0 1 1 1 1
D 0 1 0 0 0 0 1 1 0 1
E 0 1 0 0 0 0 0 1 1 1
F 0 1 0 0 0 0 1 0 1 1
G 0 0 0 0 0 0 0 1 1 0
H 0 0 0 0 0 0 1 0 1 1
I 0 0 0 0 0 0 0 1 0 0
J 0 0 0 0 0 1 1 1 1 0

Every row in the matrix shows the out-links from the web page that is identified
in the first column. Every column shows the in-links from the web page that is
identified in the first row of the table


Another representation is:

Web page In-links Out-links


A B G,H,I,J
B C,D,E,F A,G,H,I,J
C None B, G,H,I,J
D None B, G,H,J
E None B, G,I,J
F J B, G,I,J
G A,B,C,D,F,H,J H,I
H A,B,C,D,E,G,I,J G,I,J
I A,B,C,E,F,G,H,J H
J A,B,C,D,E,F,H F,G,H,I
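As a rough illustration, the hub and authority iteration described above can be applied directly to this ten-page example. The sketch below is only an illustration, not the original HITS code; it uses a fixed number of iterations as the termination condition and prints the pages with the largest authority and hub weights.

    # The HITS hub/authority iteration applied to the ten-page example above.
    out_links = {
        "A": ["G", "H", "I", "J"],
        "B": ["A", "G", "H", "I", "J"],
        "C": ["B", "G", "H", "I", "J"],
        "D": ["B", "G", "H", "J"],
        "E": ["B", "H", "I", "J"],
        "F": ["B", "G", "I", "J"],
        "G": ["H", "I"],
        "H": ["G", "I", "J"],
        "I": ["H"],
        "J": ["F", "G", "H", "I"],
    }

    pages = list(out_links)
    auth = {p: 1.0 for p in pages}   # Xp: authority weights
    hub = {p: 1.0 for p in pages}    # Yp: hub weights

    def normalise(weights):
        """Scale weights so that the sum of their squares is 1."""
        norm = sum(w * w for w in weights.values()) ** 0.5
        return {p: w / norm for p, w in weights.items()}

    for _ in range(50):              # fixed iteration count as the termination condition
        # authority of p = sum of the hub weights of the pages that link to p
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # hub weight of p = sum of the authority weights of the pages that p links to
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        auth, hub = normalise(auth), normalise(hub)

    print(sorted(auth.items(), key=lambda kv: -kv[1])[:3])  # strongest authorities
    print(sorted(hub.items(), key=lambda kv: -kv[1])[:3])   # strongest hubs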

c.Problems with the HITS Algorithm:

Although the algorithm works well for most queries, it does not work well for some
others. There are a number of reasons for this:

d.Hubs and Authorities:


A clear-cut distinction between hubs and authorities may not be appropriate since
many sites are hubs as well as authorities

e.Topic drift:
Certain arrangements of tightly connected documents, perhaps due to mutually
reinforcing relationships between hosts, can dominate HITS computation. These
documents may not be the most relevant to query sometimes.


f.Automatically generated links:


Some of the links are computer generated and represent no human judgement
but HITS still gives them equal importance.

g.Non-relevant documents:
Some queries can return non-relevant documents among the highly ranked results
and this can then lead to erroneous results from HITS.

h.Efficiency:
The real-time performance of the algorithm is not good, given the steps that involve
finding sites that are pointed to by pages in the root set.

i.Proposals to modify HITS are:


More careful selection of the base set to reduce the possibility of topic drift. One
may argue that the in-link information is more important than the out-links
information. A hub can become important just by pointing to a lot of authorities.

j.Web Communities:
A web community is generated by a group of individuals that share a common
interest. Ex: religious group, sports. Web communities may be discovered by exploring
the web as a graph and identifying sub-graph that have a strong link structure within
them but weak associations with other parts of the graph. Such subgraphs may be
called web-communities or cyber-communities.
The idea of cyber-communities is to find all web communities on any topic by
processing the whole web graph.


4.1.7 Web Mining Software


Some of the software packages listed below are commercial products:
123LogAnalyzer - low cost web mining software that provides an overview of a
web site’s performance, statistics about web server activity, web pages viewed, and files
downloaded and images that were accessed.
Analog - an ultra fast, scalable, highly configurable log analyser that works on any operating
system and is free.
Azure web log analyzer - finds the usual web information including popular pages,
number of visitors, and what browser and computer they used.
Click Tracks - offers a number of modules to provide web site analysis.
Datanautica - web mining software for data collection, processing, analysis and reporting.
NetTracker web analytics - analyses log files; comprehensive analysis of the behaviour
of visitors to Internet and intranet sites.
Nihuo Web log analyzer - provides reports on how many visitors came to the web site,
where they came from, which pages they viewed, how long they spent on the site and more.
Web log Expert 3.5 - produces reports that include this information: accessed
files, paths through the site, search engines, browsers, OS and more.
WUM: Web Utilization Miner - an open source project; Java-based web mining software for
log file preparation, discovery of sequential patterns, etc.
4.2 Search Engines

a.Define search engines


A number of web sites provide a large amount of interesting information about
search engines. Search engines are huge databases of web pages as well as software packages
for indexing and retrieving the pages, which enable users to find information of interest to
them.
Normally the search engine databases of web pages are built & updated
automatically by web crawlers. Every search engine provides advertising (sometimes
called sponsored links). There are a variety of ways businesses advertise on the search
engines.


Commonly whenever a particular keyword is typed by the user, related


sponsored links appear with the results of the search.

b.Goals of web search:


It has been suggested that the information needs of the user may be divided into
three classes:
1. Navigational: Primary information need in these queries is to reach a web site that
the user has in mind. The user may know that the site exists or may have visited the site
earlier but does not know the URL.
2. Informational: Primary information need in such queries is to find a website that
provides useful information about a topic of interest. User does not have a particular
web site in mind. Ex: User wants to investigate IT career prospects in Kolkata.
3. Transactional: Primary need in such queries is to perform some kind of transaction.
User may or may not know the target web site: Ex: to buy a book, user may wish to go
to amazon.com.

c.Quality of search results:


Results from a search engine ideally should satisfy the following quality requirements:
1. Precision: Only relevant documents should be returned.
2. Recall: All the relevant documents should be returned.
3. Ranking: A ranking of the documents providing some indication of the relative
ranking of the results should be returned.
4. First Screen: First page of results should include the most relevant results.
5. Speed: Results should be provided quickly since users have little patience.

4.2.1 Search Engines Functionality

A search engine is a rather complex collection of software modules. A search


engine carries out a variety of tasks. These include:


1. Collecting Information:
A search engine would normally collect web pages or information about them by
web crawling or by human submission of web pages.
2. Evaluating and categorizing information:
In some cases, fro example, when web pages are submitted to a directory, it may
be necessary to evaluate and decide whether a submitted page should be selected.
3. Creating a database and creating indexes:
The information collected needs to be stored either in a database or some kind of
file system. Indexes must be created so that the information may be searched efficiently.
4. Computing ranks of the web documents:
A variety of methods are being used to determine the rank of each page retrieved
in response to a user query. The information used may include frequency of keywords,
value of in-links & out-links from the page and frequency of use of the page.
5. Checking queries and executing them:
Queries posed by users need to be checked, fro example, fro spelling errors and
whether words in the query are recognizable. Once checked, a query is executed by
searching the search engine database.
6. Presenting results:
How the search engine presents the results to the user is important. The search
engine must determine what results to present and how to display them.
7. Profiling the users:
To improve search performance, the search engines carry out user profiling the
deals with the way users use search engines.


4.2.2 Search Engines Architecture

No two search engines are exactly the same in terms of sizing, indexing
techniques, page ranking algorithms or speed of search.
A typical search engine architecture is shown below. It consists of many components,
including the following three major components, to carry out the functions that were
listed above.
 The crawler and the indexer: It collects pages from the web, creates and maintains
the index.
 The user interface: It allows users to submit queries and enables result presentation.
 The database and the query server: It stores information about the web pages and
processes the query and returns results.

Typical architecture of a search engine

[Figure: typical architecture of a search engine, showing the web, the web crawler, page
submission, page evaluation, indexing, query checking, query representation, strategy
selection, query execution, user profiling, result presentation and history maintenance,
together with the users.]

The crawler:
The crawler (or spider or robot or bot) is an application program that carries out a
task similar to graph traversal. It is given a set of starting URLs that it uses to
automatically traverse the web by retrieving a page, initially from the starting set.
Some search engines use a number of distributed crawlers. Since a crawler has a
relatively simple task, it is not CPU-bound; it is bandwidth-bound. A web crawler must
take into account the load (bandwidth, storage) on the search engine machines and
sometimes also on the machines being traversed in guiding its traversal.
A web crawler starts with a given set of URLs and fetches those pages. This
continues until no new pages are found. Each page found by the crawler is often not
stored as a separate file otherwise four billion pages would require managing four
billion files. Usually lots of pages are stuffed into one file.

Crawlers follow an algorithm like the following:


1. Find base URLs-a set of known & working hyperlinks are collected
2. Build a queue-put the base URLs in the queue as more are discovered.
3. Retrieve the next page-retrieve the next page in the queue, process and store in the
search engine database.
4. Add to the queue-check if the out-links of the current page have already been
processed. Add the unprocessed out-links to the queue of URLs.
5. Continue the process until some stopping criteria are met.
A crawler would often have a traversal strategy. It may traverse the web breadth
first or depth first, or the traversal may be priority based.
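A minimal breadth-first crawler following the five steps above might look like the sketch below. The seed URL, the crude regular expression for out-links, the politeness delay and the stopping criterion of 100 stored pages are all illustrative assumptions rather than what any particular search engine does.

    # A minimal breadth-first crawler sketch following the five steps above.
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen
    import re, time

    seeds = ["http://www.example.com/"]          # step 1: base URLs (hypothetical)
    queue = deque(seeds)                         # step 2: build a queue
    seen = set(seeds)
    store = {}                                   # stands in for the search engine database

    LINK_RE = re.compile(r'href="(http[^"]+)"')

    while queue and len(store) < 100:            # step 5: simple stopping criterion
        url = queue.popleft()                    # step 3: retrieve the next page
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                             # skip unreachable pages
        store[url] = html                        # process and store the page
        for link in LINK_RE.findall(html):       # step 4: add unprocessed out-links
            link = urljoin(url, link)
            if link not in seen:
                seen.add(link)
                queue.append(link)
        time.sleep(1)                            # be polite to the servers being traversed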

b.The Indexer:
Given the size of the web & the number of documents that current search
engines have in their databases, an indexer is essential to reduce the cost of query
evaluation.


Many researchers recommended an inverted file structure in which each


document is essentially indexed on every word other than the most common words.
Google uses this approach.
Indexing can be a very resource intensive process and is often CPU bound since
it involves a lot of analysis of each page.

Building an index requires document analysis and term extraction. Term
extraction may involve extracting all the words from each page, elimination of stop
words (common words like “the”, “it”, “and”, “that”) and stemming (transforming
words like “computer”, “computing” and “computation” into one word, say
“computer”) to help build, for example, the inverted file structure.
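A toy sketch of term extraction is shown below. The stop-word list and the crude suffix-stripping "stemmer" are illustrative stand-ins for the real resources (full stop-word lists, Porter-style stemmers) an indexer would use.

    # Term extraction with stop-word removal and a deliberately crude stemmer.
    STOP_WORDS = {"the", "it", "and", "that", "a", "of", "to", "in", "is"}
    SUFFIXES = ("ation", "ing", "ers", "er", "s")

    def stem(word):
        """Strip a known suffix so that computer/computing/computation collapse together."""
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def extract_terms(page_text):
        words = [w.lower().strip(".,;:!?\"'()") for w in page_text.split()]
        return [stem(w) for w in words if w and w not in STOP_WORDS]

    print(extract_terms("The computer and computing: a computation that matters"))
    # -> ['comput', 'comput', 'comput', 'matt']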
A good indexing approach is to create an index that will relate keywords to
pages. The overheads of maintaining such indexes can be high.
An inverted index, rather than listing keywords for each page, lists pages for
each keyword. This description is not quite accurate because an inverted database
normally still maintains a collection of pages, including their content, in some way.

As an example for inverted index consider the data structure shown below:

Words Page Id
Data 10,18,26,41
Mining 26,30,31,41
Search 72,74,75,76
Engine 72,74,79,82
Google 72,74,90

This index specifies a number of keywords and the pages that contain the
keywords. If we were searching for “data mining” we look for the pages for each keyword


“data” and “mining” and then find the intersection of the two lists and obtain page
numbers 26 and 41.
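Using the small index above, answering a multi-keyword query is essentially an intersection of postings lists, as the following short sketch shows.

    # Answering a two-keyword query from the inverted index above.
    inverted_index = {
        "data":   [10, 18, 26, 41],
        "mining": [26, 30, 31, 41],
        "search": [72, 74, 75, 76],
        "engine": [72, 74, 79, 82],
        "google": [72, 74, 90],
    }

    def answer(query):
        """Intersect the postings lists of every keyword in the query."""
        postings = [set(inverted_index.get(word, [])) for word in query.lower().split()]
        return sorted(set.intersection(*postings)) if postings else []

    print(answer("data mining"))   # -> [26, 41]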
The challenging part is when the two lists are too large. The usual approach is to split the
lists across many machines to find the intersection quickly. When the index is
distributed over many machines, the distribution may be based on a local inverted
index or a global inverted index.
A local inverted index results in each machine searching for candidate documents for
each query, while the global inverted index is so distributed that each machine is
responsible for only some of the index terms.
c.Updating the index

As the crawler updates the search engine database, the inverted index must also
be updated. Depending on how the index is stored, incremental updating may be
relatively easy but at some stage after many incremental updates it may become
necessary to rebuild the whole index, which normally will take substantial resources.

d.User profiling
Currently most search engines provide just one type of interface to the user. They
provide an input box in which the user types in the keywords & then waits for the
results.
Interfaces other than those currently provided are being investigated. They
include form fill-in or a menu or even a simple natural language interface.

e.Query Server
First of all, a search engine needs to receive the query and check the spelling of
keywords that the user has typed. If the search engine cannot recognize the keywords
in the language or proper nouns it is desirable to suggest alternative spellings to the
user.


Once the keywords are found to be acceptable, the query may need to be
transformed. Stemming is then carried out and stop words are removed unless they
form part of a phrase that the user wants to search for.

f.Query composition
Query composition is a major problem. Often a user has to submit a query a
number of times using somewhat different keywords before more or less the “right”
result is obtained.
A search engine providing query refinement based on user feedback would be
useful. Search engines often cache the results of a query and can then use the cached results
if the refined query is a modification of a query that has already been processed.
g.Query Processing

Search engine query processing is different from query processing in relational


database systems. In database systems, query processing requires that attribute values
match exactly the values provided in the query.
In search engine query processing, an exact match is not always necessary. A
partial match or a fuzzy match may be sufficient. Some search engines provide an
indication of how well the document matches the user query.

A major search engine will run a collection of machines, several replicating the
index. A query may then be broadcast to all that are not heavily loaded. If the query
consists of a number of keywords, then the indexes will need to be searched a number
of times(intersection of results to be found).

A search engine may put weights on different words in a query. Ex: if query has
words “going” “to” “paris”, then the search engine will put greater weight on “paris”.


h.Caching query results

Although more bandwidth is available and cheap, the delay in fetching pages from
the web continues. A common approach is to use web caches and proxies as intermediaries
between the client browser processes and the machines serving the web pages. A web
cache or proxy is an intermediary that acts as a server on behalf of some other server by
supplying web resources without contacting that server.

Essentially a cache is supposed to catch a user’s request, get the page that the
user has requested if one is not available in the cache & then save a copy of it in the
cache. It is assumed that if a page is saved, there is a strong likelihood that the
same or another user will request it again.
Web has very popular sites which are requested very frequently. The proxy
satisfies such requests without going to the web server. A proxy cache must have
algorithms to replace pages in the hope that the pages being stored in the cache are
fresh and are likely to be pages that users are likely to request.
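The sketch below illustrates one possible replacement policy, least-recently-used eviction; a real proxy cache would also honour expiry information. The ProxyCache class and the example URLs are hypothetical, intended only to show the "serve from cache, otherwise fetch, store and possibly evict" behaviour described above.

    # A tiny proxy-cache sketch with least-recently-used (LRU) replacement.
    from collections import OrderedDict

    class ProxyCache:
        def __init__(self, capacity=1000):
            self.capacity = capacity
            self.pages = OrderedDict()          # URL -> page content

        def fetch(self, url, fetch_from_origin):
            """Serve from the cache if present; otherwise fetch, store and maybe evict."""
            if url in self.pages:
                self.pages.move_to_end(url)     # mark as recently used
                return self.pages[url]
            page = fetch_from_origin(url)       # go to the web server
            self.pages[url] = page
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict the least recently used page
            return page

    cache = ProxyCache(capacity=2)
    cache.fetch("http://example.com/a", lambda u: "<html>A</html>")
    cache.fetch("http://example.com/b", lambda u: "<html>B</html>")
    print(cache.fetch("http://example.com/a", lambda u: "refetched"))  # served from cache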

Advantages of proxy caching are:


 It reduces average latency in fetching web pages, reduces network traffic & reduces
load on busy web servers.
 Since all user requests go through the cache, it is possible for site managers to monitor
who is accessing what information and, if necessary, block some sites.
 A part of user’s hard disk is dedicated to cache to store pages that the user has
browsed. Browser cache is useful when the user hits the back button since the most
recently visited pages are likely to be in the cache and do not need to be fetched
again.


i.Results Presentation
All search engines present results ordered according to relevance. Therefore, before
presentation of the results to the user, the results must be ranked. Relevance ranking is based
on page popularity in terms of the number of back links to the page.
Another issue is how much of what has been found should be presented to the user. Most search
engines present the retrieved items as a list on the screen with some information for
each item, ex: type of document, its URL & a brief extract from the document.
4.2.3 Ranking of Web pages

The web contains a large number of documents published without any quality
control. It is important to find methods for determining the importance and quality of pages,
given a topic.
Search engines differ in size & so number of documents that they index may be
quite different. No two search engines will have exactly the same pages on a given
topic.
a.Page Rank Algorithm
Google has the most well-known ranking algorithm, called the Page Rank
Algorithm that has been claimed to supply top ranking pages that are relevant. This is
based on using the hyperlinks as indicators of a page’s importance. It is almost like
vote counting in an election.
Every unique page is assigned a Page Rank. If a lot of pages vote for a page by
linking to it then the page that is being linked to will be considered important. Votes cast by
a link farm (a page with many links) are given less importance than votes cast by an
article that only links to a few pages.
Internal site links also count in assessing page rank. Page rank has no
relationship with the content of the page. Page Rank was developed based on a
probability model of a random surfer visiting the page. The
probability of a random surfer clicking on a link may be estimated based on the
number of links on that page. The probability of the random surfer reaching a page is
the sum of the probabilities of the random surfer following links to that page. The model also


introduces a damping factor-that the random surfer sometimes does not click any link
& jumps to another page. If the damping factor is d(assumed to be between 0 & 1)
then the probability of the surfer jumping off to other page is assumed to be 1-d.
The higher the value of d, the more likely the surfer is to follow one of the links.
Given that the surfer has a 1-d probability of jumping to some random page, every page has
a minimum page rank of 1-d.
 Algorithm works as follows:
 Let A be the page whose Page Rank PR(A) is required. Let C(T1) be the number of
links going out from page T1. Page Rank of A is then given by:
PR(A) = (1-d) + d (PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn)),
where T1, …, Tn are the pages that link to A and d is the damping factor.
 To compute a page rank of A, we need to know the page rank of each page that
points to it. And the number of out-links from each of those pages.
Example:
Let us consider an example of three pages. We are given the following information:
 Damping factor=0.8
 Page A has an out-link to B
 Page B has an out-link to A & another to C
 Page C has an out-link to A
 Starting page rank for each page is 1
We may show the links between the pages as below:

[Figure: three web pages A, B and C, with links A→B, B→A, B→C and C→A]


Page rank equations may be written as:

PR(A)=0.2+0.4PR(B)+0.8PR(C)
PR(B)=0.2+0.8PR(A)
PR(C)=0.2+0.4PR(B)
These are three linear equations in three unknowns, so we may solve them. We write them
as follows if we replace PR(A) by a and the others similarly:
a - 0.4b - 0.8c = 0.2
b - 0.8a = 0.2
c - 0.4b = 0.2
The solution of the above equations is: a = PR(A) = 1.19; b = PR(B) = 1.15; c = PR(C) = 0.66
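The same three-page example can also be solved iteratively rather than algebraically, which is closer to how page ranks are computed in practice on large graphs. The short sketch below applies the PR formula repeatedly until the values settle and reproduces the figures above.

    # Iterative page rank for the three-page example with damping factor d = 0.8.
    d = 0.8
    out_links = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
    pr = {page: 1.0 for page in out_links}          # starting page rank of 1 for each page

    for _ in range(100):                            # iterate until the values settle
        new_pr = {}
        for page in out_links:
            incoming = [q for q in out_links if page in out_links[q]]
            new_pr[page] = (1 - d) + d * sum(pr[q] / len(out_links[q]) for q in incoming)
        pr = new_pr

    print({page: round(rank, 2) for page, rank in pr.items()})
    # -> {'A': 1.19, 'B': 1.15, 'C': 0.66}, matching the algebraic solution above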

b.Weakness of the algorithm

One weakness is that a link cannot always be considered a recommendation, and people
are now trying to influence the links. The major weakness is that the algorithm focuses on the
importance of a page rather than on its relevance given the user query.

A side effect of page rank therefore can be something like this. Ex: Suppose
amazon.com decides to get into the real estate business. Obviously its real estate web
site will have links from all other amazon.com sites,
which have no relevance whatsoever to real estate. When a user then searches for real
estate, other businesses, whatever their quality, will appear further down the list.

Because of the nature of the algorithm, page rank does not deal with new pages
fairly since it makes high page rank pages even more popular by serving them at the


top of the results. Thus the rich get richer & the poor get poorer. It takes time for a new
page to become popular, even if the new page is of high quality.
i. Other issues in Page Ranking

 Some page rank algorithms consider the following in determining page rank:
 Page title including keywords
 Page content including links, keywords in link text, spelling errors, length of the
page.
 Out-links including relevance of links to the content.
 In-links &their importance
 In-links text-keywords
 In-linking page content
 Traffic including the number of unique visitors & the time spent on the page.

ii.Yahoo! Web Rank

Yahoo! has developed its own page ranking algorithm, called Web
Rank. The rank is calculated by analyzing the web page text, title and description as well as
associated links and other unique document characteristics.


UNIT-V
5.1 Data warehousing

5.1.1 Introduction

Data warehousing is a process for assembling and managing data from various
sources for the purpose of gaining a single detailed view of an enterprise. Although
there are several definitions of data warehouse, a widely accepted definition by Inmon
(1992) is an integrated subject-oriented and time-variant repository of information in
support of management's decision making process. The definition is similar to the
definition of an ODS except that an ODS is a current-valued data store while a data
warehouse is a time-variant repository of data.

The primary aims in building a data warehouse are to provide a single version of
the truth about the enterprise information and to provide good performance for ad hoc
management queries required for enterprise analysis to manipulate, animate and
synthesize enterprise information.

The benefits of implementing a data warehouse are as follows:


• To provide a single version of truth about enterprise information. This may
appear rather obvious but it is not uncommon in an enterprise for two
database systems to have two different versions of the truth. In many years of
working in universities, I have rarely found a university in which everyone
agrees with financial figures of income and expenditure at each reporting time
during the year.
• To speed up ad hoc reports and queries that involve aggregations across
many attributes (that is, many GROUP BY's) which are resource intensive. The
managers require trends, sums and aggregations that allow, for example,
comparing this year's performance to last year's or preparation of forecasts for
next year.


• To provide a system in which managers who do not have a strong technical


background are able to run complex queries. If the managers are able to access
the information they require, it is likely to reduce the bureaucracy around the
managers.
As noted for ODS, to achieve the benefits listed above, it is necessary to build a
separate data warehouse. This database is very different from OLTP systems and also
different from an ODS. We first summarize the differences between OLTP and data
warehouse systems in Table.

Comparing OLTP and data warehouse systems (Based on Oracle. 2002)


Property                      OLTP                  Data warehouse
Nature of the database        3NF                   Multidimensional
Indexes                       Few                   Many
Joins                         Many                  Some
Duplicated data               Normalized data       Denormalized data
Derived data and aggregates   Rare                  Common
Queries                       Mostly predefined     Mostly ad hoc
Nature of queries             Mostly simple         Mostly complex
Updates                       All the time          Not allowed, only refreshed
Historical data               Often not available   Essential

A useful way of showing the relationship between OLTP systems, a data warehouse
and an ODS is given in the figure. The data warehouse, as noted earlier, is more like an
enterprise's long-term memory. The objectives in building the two systems, the ODS and the
data warehouse, are somewhat conflicting and therefore the two databases are likely to
have different schemas.


5.1.2 Operational Data Sources

To meet the information needs of management staff, one possible solution is to
build a separate database; one such approach is called the Operational Data Store (ODS)
approach. An ODS is designed to provide a consolidated view of the enterprise’s
current operational information.
An ODS has been defined as a subject-oriented, integrated, volatile, current-valued
data store, containing only corporate detailed data.
The ODS is:
1. Subject-oriented-That is , it is organized around the major data subject of
an enterprise.
2. Integrated- That is , it is the collection of subject oriented data from a
variety of system to provide an enterprise wide view of the data.
3. Current valued- That is, ODS is up-to-date and reflects the current status
of the information.
4. Volatile-That is, the data in the ODS changes frequently as new
information refreshes the ODS.
5. Detailed- That is, the ODS is detailed enough to serve the needs of the
operational management staff in the enterprise.
An ODS does not include historical data, since the OLTP system data is
changing all the time. An ODS may also be used as an interim database for a data
warehouse.
 Since the contents of an ODS are updated frequently, it may also be used to
quickly perform relatively simple queries (such as finding the status of a
customer order).
 An ODS may be viewed as an enterprise’s short-term memory in that it stores
only very recent information.


ODS and DW Architecture

A typical ODS structure was shown in the figure. It involved extracting information
from source systems by using ETL processes and then storing the information in the
ODS. The ODS could then be used for producing a variety of reports for management.

The architecture of a system that includes an ODS and a data warehouse shown in
Figure is more complex. It involves extracting information from source systems by
using an ETL process and then storing the information in a staging database.

The daily changes also come to the staging area. Another ETL process is used to
transform information from the staging area to populate the ODS. The ODS is then used
for supplying information via another ETL process to the data warehouse, which in turn
feeds a number of data marts that generate the reports required by management.

It should be noted that not all ETL processes in this architecture involve data
cleaning, some may only involve data extraction and transformation to suit the target
systems.


5.1.3 Data Warehousing

A data warehouse does not store real-time data and does not require real-time
updates while the ODS does. The ODS does not have historical information but the data
warehouse does.
A data warehouse can be large but growing only slowly over time. These
differences between an ODS and a data warehouse are summarized in Table

Table Comparison of the ODS and data warehouse (Based on IBM, 2001)

A data mart may be used as a proof of data warehouse concept. Data marts can
also create difficulties by setting up "silos of information" although one may build
dependent data marts, which are populated from the central data warehouse.
Data marts are often the common approach for building a data warehouse since
the cost curve for data marts tends to be more linear. A centralized data warehouse
project can be very resource intensive and requires significant investment at the
beginning although overall costs over a number of years for a centralized data
warehouse and for decentralized data marts are likely to be similar.

5.1.4 Data warehousing Design

There are a number of ways of conceptualizing a data warehouse. One approach


is to view it as a three-level structure. The lowest level consists of the OLTP and legacy
systems, the middle level consists of the global or central data warehouse while the top
level consists of local data warehouses.

Whatever the architecture, a data warehouse needs to have a data model that can
form the basis for implementing it


To develop a data model we view a data warehouse as a multidimensional


structure consisting of dimensions, since that is an intuitive model that matches the
types of OLAP queries posed by management.

A dimension (essentially an attribute) is an ordinate within a multidimensional


structure consisting of a list of ordered values (sometimes called members) just like the
x-axis and y-axis values on a two-dimensional graph. A member is a distinct value (or
name or identifier) for the dimension.

Dimensions often have hierarchies that show parent/child relationships between


the members of a dimension. For example, the dimension country may have a hierarchy
that divides the world into continents and continents into regions followed by regions
into countries if such a hierarchy is useful for the application .
In the multidimensional view of data, values that depend on the dimensions are
called measures. Sometimes it is said that a measure is a numeric attribute of a fact,
for example, the number of students or sales or total fees paid. More than one measure may be
associated with a fact; for example, one may be interested in not only storing the number of items sold but
also the total revenue.
A data warehouse model often consists of a central fact table and a set of
surrounding dimension tables on which the facts depend. Such a model is called a star
schema because of the shape of the model representation. A simple example of such a
schema, recording the number of students from each country under each scholarship scheme, is shown below.


A characteristic of a star schema is that all the dimensions directly link to the fact
table. The fact table may look like the table below and the dimension tables may look
like the ones that follow.

Table: An example of the fact table

Year      Degree name   Country name   Scholarship name   Number
2003-01   BSc           Australia      Govt               35
1999-02   MBBS          Canada         None               50
2000-02   LLB           USA            ABC                22
1999-01   BCom          UK             Commonwealth       7
2001-02   LLB           Australia      Equity             2

The first dimension is the degree dimension. An example of this dimension table is
given below.

Table: An example of the degree dimension table

Name   Faculty    Scholarship eligibility   Number of semesters
BSc    Science    Yes                       6
MBBS   Medicine   No                        10
LLB    Law        Yes                       8
BCom   Business   No                        6
BA     Arts       No                        6

Star schemas may be refined into snowflake schemas if we wish to provide support for
dimension hierarchies by allowing the dimension tables to have sub-tables to represent
the hierarchies.
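The star schema above can be sketched in SQL as follows. The table and column names follow the example tables, but the data types and the two smaller dimension tables are assumptions made only for illustration.

    -- Dimension tables (country_dim could also carry the hierarchy columns sketched earlier).
    CREATE TABLE degree_dim (
        degree_name             VARCHAR(20) PRIMARY KEY,
        faculty                 VARCHAR(20),
        scholarship_eligibility CHAR(3),
        number_of_semesters     INTEGER
    );
    CREATE TABLE country_dim     (country_name     VARCHAR(30) PRIMARY KEY);
    CREATE TABLE scholarship_dim (scholarship_name VARCHAR(30) PRIMARY KEY);

    -- The fact table links directly to every dimension, which is the defining
    -- characteristic of a star schema.
    CREATE TABLE enrolment_fact (
        year_semester      CHAR(7),    -- the Year column of the example, e.g. '2003-01'
        degree_name        VARCHAR(20) REFERENCES degree_dim(degree_name),
        country_name       VARCHAR(30) REFERENCES country_dim(country_name),
        scholarship_name   VARCHAR(30) REFERENCES scholarship_dim(scholarship_name),
        number_of_students INTEGER     -- the measure (the Number column of the example)
    );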


5.1.5 Guidelines For Data Warehouse Implementation

A data warehouse may be built as a centralized data warehouse, a data
warehouse with data marts, or a distributed data warehouse, depending on the needs of
an enterprise. The benefit of the data mart approach is that each data mart can have a
local view of the data and much processing can be carried out in the local data mart.
A distributed data warehouse, on the other hand, is suitable for a large
international organization with offices in several countries.

a.Implementation Steps
These steps are based on the work of Chaudhuri and Dayal.
1. Requirements analysis and capacity planning:
As in other projects, the first step in data warehousing involves defining
enterprise needs, defining the architecture, carrying out capacity planning and selecting
the hardware and software tools. This step will involve consulting with senior
management as well as with the various stakeholders.

2. Hardware integration:
Once the hardware and software have been selected, they need to be put together
by integrating the servers, the storage devices and the client software tools.
3. Modeling:
This is a major step that involves designing the warehouse schema and views. It
may involve using a modeling tool if the data warehouse is complex.
4. Physical modeling:
This involves designing the physical data warehouse organization, data
placement, data partitioning, deciding on access methods and indexing.
5. Sources:
The data for the data warehouse is likely to come from a number of data sources.
This step involves identifying and connecting the sources using gateways, ODBC
drivers or other wrappers.


6. ETL:
This may involve identifying a suitable ETL tool vendor and purchasing and
implementing the tool. This may include customizing the tool to suit the needs of the
enterprise.

7. Populate the data warehouse:
Once everything is working satisfactorily, the ETL tools may be used to populate
the warehouse given the schema and view definitions (a minimal sketch of this step is
given after step 9).
8. User applications:
This step involves designing and implementing the applications required by the
end users.
9. Roll-out the warehouse and applications:
Once the data warehouse has been populated and the end-user applications
tested, the warehouse system and the applications may be rolled out for the user
community to use.
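As a minimal sketch of step 7, assuming the star schema sketched earlier and a hypothetical ODS table ods_enrolment holding one row per enrolment, the fact table could be populated as follows; a real ETL tool would wrap such statements with key lookups, scheduling and error handling.

    -- Populate the fact table at its grain by aggregating the ODS data
    -- (hypothetical ods_enrolment table; illustration only).
    INSERT INTO enrolment_fact (year_semester, degree_name, country_name,
                                scholarship_name, number_of_students)
    SELECT   year_semester, degree_name, country_name, scholarship_name, COUNT(*)
    FROM     ods_enrolment
    GROUP BY year_semester, degree_name, country_name, scholarship_name;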

b.Implementation Guidelines
These are general guidelines; not all of them are applicable to every data warehouse
project.
1. Build incrementally:
Data warehouses must be built incrementally; typically one or more data marts
are built first. An enterprise data warehouse can then be implemented in an iterative
manner, allowing all data marts to extract information from the data warehouse.
2. Need a champion:
A data warehouse project must have a champion who is willing to carry out
considerable research into the expected costs and benefits of the project. Experience has
shown that having a champion can help the adoption and success of data warehousing
projects.


3. Senior management support:
A data warehouse project must be fully supported by senior management.
Experience shows that top management support is essential for the success of a data
warehousing project.
4. Ensure quality:
Only data that has been cleaned and is of a quality understood by the
organization should be loaded into the data warehouse. Improved data quality also
ensures that the data warehouse will be increasingly recognized as the single version of
the truth in the enterprise.

5. Corporate strategy:
The objectives of the project must be clearly defined before the start of the
project. Given the importance of senior management support for a data warehousing
project, the project’s fit with the corporate strategy is essential.

6. Business plan:
A business plan with estimates of the costs and benefits of the project must be
clearly understood by the stakeholders. Without such understanding, rumors about
expenditure and benefits can become the only source of information, undermining the
project.

7. Training:
A data warehouse project must not overlook data warehouse training
requirements. Training of users and professional development of the project team may
also be required since it is a complex task and the skills of the project team are critical to
the success of the project.

8. Adaptability:
The project should build in adaptability so that changes may be made to the data
warehouse if and when required.


9. Joint management:
The project must be managed by both IT and business professionals in the
enterprise. To ensure good communications with the stakeholders and that the project
is focused on assisting the enterprise’s business, business professionals must be
involved in the project along with technical professionals.

5.1.6 Data Warehouse Metadata

a.Definition:
Metadata has been defined as "all of the information in the data warehouse
environment that is not the actual data itself".
Metadata is structured data which describes the characteristics of a resource. It is
stored in the system itself and can be queried using tools that are available on the
system (a minimal sketch of such a query is given after the examples below).
We now give several examples of metadata that should be familiar to the reader:
1. A library catalogue may be considered metadata. It contains a number of
predefined elements representing specific attributes of a resource, and each
element can have one or more values.
2. The table of contents and the index in a book may be considered metadata for
the book
3. Suppose we are told that a data element about a person has the value 80. By
itself this value is meaningless; the description of what it represents (for
example, the person's weight in kilograms) is the metadata about the data 80.
4. Yet another example of metadata concerns the tables and figures in a
document such as this one. A table has a name, and the column names of the
table may be considered metadata.
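As a small illustration of metadata that is stored in the system itself, most relational DBMSs expose their catalogue through queryable views. INFORMATION_SCHEMA is defined in the SQL standard, although the exact views and columns available differ between products.

    -- Query the system catalogue for metadata about the enrollment table:
    -- the column names and data types describe the data without being the data.
    SELECT table_name, column_name, data_type
    FROM   information_schema.columns
    WHERE  table_name = 'enrollment';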
In the context of a data warehouse, metadata needs to be much more
comprehensive. It may be classified into two groups:
1. Back room metadata
2. Front room metadata


Much important information is included in the back room metadata. This could
include information on the source systems the ETL process uses and their schemas.
The front room metadata is more descriptive and could include information needed by
the users, for example, user security privileges, various statistics about usage,
information on network security and so on.

Recently, a metadata standard called the Common Warehouse Metamodel
(CWM) has been developed.

5.2 Online Analytical Processing (OLAP)

a.OLAP
OLAP is primarily a software technology concerned with fast analysis of enterprise
information. OLAP systems are data warehouse front-end software tools that make
aggregate data available efficiently, for advanced analysis, to an enterprise's managers.
It is essential that an OLAP system provides facilities for a manager to pose ad hoc
complex queries to obtain the information that he or she requires.
Another term that is being used increasingly is business intelligence. It is
sometimes used to mean both data warehousing and OLAP. At other times it has been
defined as a user-centered process of exploring data, data relationships and trends,
thereby helping to improve overall decision making.
OLAP and data warehouses are based on a multidimensional conceptual view of
the enterprise data. For example, the following table gives a multidimensional view of
data for two dimensions.

Country     BSc   MBBS   BCom   ALL
India        10     15     25    50
Australia     5     15     50    70
USA           0     20     15    35
ALL          15     50     90   155


Although OLAP systems are sometimes viewed as a generalization of
spreadsheets, spreadsheets are not really suitable for OLAP in spite of the user-friendly
interface that they provide, because spreadsheets tie data storage too tightly to the
presentation.

5.2.1 Introduction

a.OLAP Definition
OLAP is the dynamic enterprise analysis required to create, manipulate, animate
and synthesize information from exegetical, contemplative and formulaic data analysis
models.
Another definition is that OLAP is software technology that enables analysts,
managers and executives to gain insight into data through fast, consistent, interactive
access to a wide variety of possible views of information that has been transformed
from raw data to reflect the real dimensionality of the enterprise as understood by the
user.
An even simpler definition is that OLAP is fast analysis of shared
multidimensional information for advanced analysis. This definition (sometimes called
FASMI) implies that most OLAP queries should be answered within seconds.

5.2.2 Characteristics of OLAP Systems

There are several important characteristics of OLAP systems, best understood by
contrasting them with OLTP systems:

1. Users: OLTP systems are designed for office workers while OLAP systems are
designed for decision makers. An OLAP system is likely to be accessed only by a
select group of managers and may be used only by dozens of users.
2. Functions: OLTP systems are mission-critical. They support an enterprise's day-
to-day operations and are mostly performance and availability driven. OLAP
systems are management-critical; they support an enterprise's decision support
function using analytical investigation.


3. Nature: Although SQL queries return a set of records, OLTP systems are
designed to process one record at a time. OLAP systems are not designed to deal
with individual customer records.
4. Design: OLTP database systems are designed to be application-oriented while
OLAP systems are designed to be subject-oriented.
5. Data: OLTP systems normally deal only with the current status of information.
OLAP systems require historical data over several years since trends are more
important in decision making.
6. Kind of use: OLTP systems are used for read and write operations while OLAP
systems normally do not update the data.

b.Comparison of OLAP and OLTP systems:

Property            OLTP                            OLAP
Nature of users     Operations workers              Decision makers
Functions           Mission-critical                Management-critical
Nature of queries   Mostly simple                   Mostly complex
Nature of usage     Mostly repetitive               Mostly ad hoc
Nature of design    Application oriented            Subject oriented
Number of users     Thousands                       Dozens
Nature of data      Current, detailed, relational   Historical, summarized, multidimensional
Updates             All the time                    Usually not allowed

c.FASMI Characteristics:
The name is derived from the first letters of the following characteristics:


1. Fast: Most OLAP queries should be answered very quickly, perhaps within
seconds. One approach is to pre-compute the most commonly queried
aggregates and compute the remaining aggregates on the fly.
2. Analytic: An OLAP system must provide rich analytic functionality, and it is
expected that most OLAP queries can be answered without any programming.
3. Shared: An OLAP system is a shared resource, although it is unlikely to be
shared by hundreds of users.
4. Multidimensional: Whatever OLAP software is being used, it must provide a
multidimensional conceptual view of the data.
5. Information: The system should be able to handle a large amount of input data.
d.Codd's OLAP Characteristics:
1. Multidimensional conceptual view: By requiring a multidimensional view, it is
possible to carry out operations like slice and dice.
2. Accessibility (OLAP as a mediator): The OLAP system should sit between data
sources (e.g. a data warehouse) and an OLAP front-end.
3. Batch extraction vs interpretive: An OLAP system should provide
multidimensional data staging plus partial pre-calculation of aggregates in
large multidimensional databases.
4. Multi-user support: OLAP software should provide many normal database
operations including retrieval, update, concurrency control, integrity and
security.
5. Storing OLAP results: Read-write OLAP applications should not be implemented
directly on live transaction data if OLTP source systems are supplying
information to the OLAP system directly.
6. Extraction of missing values: The OLAP system should distinguish missing
values from zero values.


5.2.3 Multidimensional View and Data Cube

The multidimensional view of data is in some ways a natural view of any
enterprise for managers. The triangle diagram below shows that as we go higher in the
triangle hierarchy the managers' need for detailed information declines.

A typical university management hierarchy:
• Senior executives (V-C, Deans)
• Department and faculty management (Heads)
• Daily operations (Registrar, HR, Finance)

For example, consider student data that has only three dimensions. The first relation
(student) provides information about students. The second relation (enrolment)
provides information about the student's degree and the semester of first enrolment.
The third relation (degree) provides information about degrees, including the
department that offers each degree.

Student(Student_id, Student_name, Country, DOB, Address)
Enrollment(Student_id, Degree_id, SSemester)
Degree(Degree_id, Degree_name, Degree_length, Fee, Department)

Student_id   Student_name   Country   DOB        Address
8656789      Sindhu         India     1/1/1980   Erode
8700020      Sita           Canada    2/2/1981   Salem
Table 1: The relation student


Student_id   Degree_id   SSemester
8900020      1256        2000-01
8700074      3271        2000-01
Table 2: The relation enrollment

Degree_id   Degree_name   Degree_length   Fee   Department
1256        BIT           6               18    CS
2345        BSc           6               20    CS
Table 3: The relation degree

A table that summarizes this type of information may be represented by a
two-dimensional spreadsheet.

Country     BSc   MBBS   BCom   ALL
India        10     15     25    50
Australia     5     15     50    70
USA           0     20     15    35
ALL          15     50     90   155

The table above shows two-dimensional aggregates for semester 2000-01. Such
tables, one for each semester, together form a three-dimensional cube. Each of the edges
of the cube represents a dimension, and each dimension has a number of members.
Different types of measures may behave differently in computation. For example,
we can easily add the numbers of students across two such tables; measures of this
kind are called additive.
For the three-dimensional cube formed by these tables, the following groupings
of aggregates are possible:
1. Null
2. Degree
3. Semester


4. Country
5. Degree, semester
6. Semester, country
7. Degree, country
8. All

For example, all of the aggregations in the list above can be built by queries like:

SELECT Degree_id, count(*) FROM enrollment GROUP BY Degree_id;

The aggregates generated by such queries are essentially a fact table representation.
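All eight groupings can also be produced by a single statement on systems that support the SQL:1999 GROUP BY CUBE extension; support and syntax vary between DBMSs, and where the extension is unavailable each grouping can be computed by a separate GROUP BY query and combined with UNION ALL. The sketch below joins the relations defined earlier.

    -- One query producing every combination of the three dimensions,
    -- from the grand total up to (degree, semester, country).
    SELECT   d.Degree_name, e.SSemester, s.Country, COUNT(*) AS number_of_students
    FROM     Enrollment e
    JOIN     Student s ON s.Student_id = e.Student_id
    JOIN     Degree  d ON d.Degree_id  = e.Degree_id
    GROUP BY CUBE (d.Degree_name, e.SSemester, s.Country);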

5.2.4 Data Cube Implementation

Some possible approaches to data cube implementation are:

1. Pre-compute and store all: This means that perhaps millions of aggregates will
need to be computed and stored. Although this is the best solution as far as
query response time is concerned, it is impractical since the resources required
to compute the aggregates and to store them will be prohibitively large for a
large data cube. Indexing large amounts of data is also expensive.
2. Pre-compute (and store) none: This means that the aggregates are computed on-
the-fly using the raw data whenever a query is posed. This approach does not
require additional space for storing the cube but the query response time is likely
to be very poor for large data cubes.
3. Pre-compute and store some: This means that we pre-compute and store the
most frequently queried aggregates and compute others as the need arises. The
more aggregates we are able to pre-compute, the better the query performance
(a minimal sketch of this approach is given below).
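A common way to realise the "pre-compute and store some" approach is a materialized view over the most frequently queried aggregate, refreshed periodically. CREATE MATERIALIZED VIEW is supported by several DBMSs (for example Oracle and PostgreSQL), though the syntax and refresh options vary; the sketch below uses the enrollment example.

    -- Store one frequently used aggregate; other aggregates are computed on the fly.
    CREATE MATERIALIZED VIEW enrolment_by_degree_semester AS
    SELECT   Degree_id, SSemester, COUNT(*) AS number_of_students
    FROM     Enrollment
    GROUP BY Degree_id, SSemester;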
Data cube products use different techniques for pre-computing aggregates and storing
them. The two main models are:


1. ROLAP (Relational OLAP model)


2. MOLAP (Multidimensional OLAP)
a.ROLAP
• ROLAP uses a relational DBMS to implement an OLAP environment.
• It may be considered a bottom-up approach.
• The data is likely to be stored in a denormalized structure; a normalized
  database avoids redundancy but is usually not appropriate for high
  performance.
• The advantage of using ROLAP is that it is more easily used with existing
  relational DBMSs, and the data can be stored efficiently using tables since
  no zero facts need to be stored.
• The major disadvantage of the ROLAP model is its poor query
  performance.
• Proponents of the MOLAP model have called the ROLAP model
  SLOWLAP.
b.MOLAP
• MOLAP is based on using a multidimensional DBMS rather than a data
  warehouse to store and access data.
• It may be considered a top-down approach to OLAP.
• MOLAP systems do not have a standard approach to storing and
  maintaining their data; they often use special-purpose file systems.
• The MOLAP implementation is usually exceptionally efficient.
• The disadvantage of MOLAP is that it is likely to be more expensive than
  ROLAP.
• MOLAP is easier to use and therefore may be suitable for inexperienced
  users.


Comparison of MOLAP and ROLAP

Property          MOLAP                                           ROLAP
Data structure    Multidimensional database using sparse arrays   Relational tables
Disk space        Separate database for the data cube;            May not require any space other than
                  large for large data cubes                      that available in the data warehouse
Retrieval         Fast (pre-computed)                             Slow
Scalability       Limited (cubes can be very large)               Excellent
Best suited for   Inexperienced users, limited set of queries     Experienced users, queries change
                                                                  frequently
DBMS facilities   Usually weak                                    Usually very strong

5.2.5 Data Cube Operations and OLAP Implementation Guidelines

A number of operations may be applied to data cubes. The common ones are:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot
1.Roll-up
Roll-up is like zooming out on the data cube. It is required when the user needs
further abstraction or less detail. This operation performs further aggregations on the
data.


2.Drill-down
Drill-down is like zooming in on the data and is therefore the reverse of roll-up.
It is the appropriate operation when the user needs further detail, wants to partition the
data more finely, or wants to focus on particular values of certain dimensions. It adds
more detail to the data.

3.Slice and dice

Slice and dice are operations for browsing the data in the cube. The terms refer to
the ability to look at information from different viewpoints. A slice is a subset of the
cube corresponding to a single value for one or more members of a dimension. The
dice operation is similar to slice but does not involve reducing the number of
dimensions.

4.Pivot or rotate
The pivot operation is used when the user wishes to re-orient the view of the
data cube. It may involve swapping the rows and columns, or moving one of the row
dimensions into the column dimension.
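Using the enrollment example, the first three operations can be sketched as ordinary SQL queries; pivoting is usually handled by the OLAP front-end rather than by a query. The queries below are illustrative only.

    -- Roll-up: aggregate to degree-level totals (less detail).
    SELECT   d.Degree_name, COUNT(*) AS number_of_students
    FROM     Enrollment e JOIN Degree d ON d.Degree_id = e.Degree_id
    GROUP BY d.Degree_name;

    -- Drill-down: add the semester detail back in (more detail).
    SELECT   d.Degree_name, e.SSemester, COUNT(*) AS number_of_students
    FROM     Enrollment e JOIN Degree d ON d.Degree_id = e.Degree_id
    GROUP BY d.Degree_name, e.SSemester;

    -- Slice: fix a single member of one dimension (semester 2000-01).
    SELECT   d.Degree_name, s.Country, COUNT(*) AS number_of_students
    FROM     Enrollment e
    JOIN     Degree  d ON d.Degree_id  = e.Degree_id
    JOIN     Student s ON s.Student_id = e.Student_id
    WHERE    e.SSemester = '2000-01'
    GROUP BY d.Degree_name, s.Country;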

a.Guidelines for OLAP Implementation

There are a number of guidelines for the successful implementation of OLAP.
They are:
1. Vision: The OLAP team must, in consultation with the users, develop a clear
vision for the OLAP system. This vision, including the business objectives, should
be clearly defined, understood and shared by the stakeholders.
2. Senior management support: The OLAP project should be fully supported by
the senior managers. Since a data warehouse may have been developed already,
this should not be difficult.


3. Selecting an OLAP tool: The OLAP team should familiarize themselves with the
ROLAP and MOLAP tools available in the market. Since tools are quite different,
careful planning may be required in selecting a tool that is appropriate for the
enterprise. In some situations, a combination of ROLAP and MOLAP may be
most effective.
4. Corporate strategy: The OLAP Strategy should fit with the enterprise strategy
and business objectives. A good fit will result in the OLAP tools being used more
widely.
5. Focus on the users: The OLAP project should be focused on the users. Users
should, in consultation with the technical professionals, decide what tasks will be
done first and what will be done later.
6. Joint management: The OLAP project must be managed by both IT and
business professionals in the enterprise. Many other people should be involved in
supplying ideas. An appropriate committee structure may be necessary to channel
these ideas.

7. Review and adapt: Regular reviews of the project may be required to ensure that
the project is meeting the current needs of the enterprise.


DATA MINING AND WAREHOUSING


Time: 3 Hrs                                                Max: 75 Marks
SECTION -A
(10X02=20)
Answer ALL questions
1. Define data mining
2. What are the different types of data mining techniques?
3. Define Classification
4. What is decision tree?
5. What is clustering?
6. What are density-based methods?
7. Define web mining.
8. What are the goals of web search?
9. Define Data warehouse.
10. Define OLAP System.

SECTION -B (05X05=25)
Answer ALL questions, choosing either (a) or (b)
11. a) Write short notes on data mining applications
(or)
b) Explain about APRIORI Algorithm
12. a) Explain the basic concept of classifications
(or)
b) Explain decision tree rules
13. a) Explain about Hierarchical methods
(or)
b) Explain about density-based methods

14. a) Briefly describe graph terminology.


(or)
b) Explain search engine architecture in detail.

15. a) Explain ODS and DW Architectures


(or)
b) Explain OLAP and its characteristics.

SECTION - C (03X10=30)

Answer any THREE questions

16. Discuss about the different types of data mining techniques.


17. Explain decision tree in detail.
18. Explain about various types of cluster analysis methods
19. Explain web content mining in detail
20. Explain Data warehouse and its benefits in detail.


DATA MINING AND WAREHOUSING


Time: 3 Hrs                                                Max: 75 Marks

SECTION –A (10X02=20)

Answer ALL questions


1. Define Association rule mining.
2. Mention the advantages of FP-Tree approach.
3. Define Pruning.
4. Define Robustness.
5. Define Classification
6. What is decision tree?
7. Define cookies.
8. What are the goals of web search?
9. What are the different operation of Data cube?
10. Differentiate OLTP and OLAP System

SECTION -B (05X05=25)
Answer ALL questions, choosing either (a) or (b)

11. a) Write short notes on the future of data mining.

(or)
b) Write about the performance of various algorithms for association rule mining.
12. a) Explain the naïve Bayes method
(or)
b) Explain the software classification
13. a) Explain about Hierarchical methods
(or)
b) Explain about density-based methods


14. a) Explain web metrics


(or)
b) Explain page rank algorithm.

15. a) Describe Data warehouse Design.


(or)
b) Explain the guidelines for OLAP implementation.

SECTION - C (03X10=30)

Answer any THREE questions

16. Discuss about the different types of data mining techniques


17. Explain the basic concept of classifications.
18. Explain about various types of cluster analysis methods
19. Explain search engine architecture in detail.
20. Describe Data warehouse metadata in detail


VIDHYAA ARTS AND SCIENCE COLLEGE


DEPARTMENT OF COMPUTER SCIENCE &APPLICATIONS
OBJECTIVE TEST

Subject Code: Paper Title: Data Mining and Warehousing

1. KDD stands for ___________


a) Knowledge discovery in databases b) Knowledge device of database
c) Knowledge discovery in device d) none
2. Data mining software called ________STUDIO
a) Knowledge b) Network c) Port d) Protocol
3. The _________algorithm is resource intensive for large sets of transactions
a) Apriori b) clustering c) classification d) none
4. Classification is the separation or ordering of _________ into the classes
a) objects b) classes c) polymorphism d) All of these
5. A _______is a popular classification method that results in a flow-chart like tree
structure
a) decision tree b) binary c) cluster d) none
6. _________ is a technique used to simplify an overfitted decision tree
a) Pruning b) cluster c) classification d) All of these
7. Cluster analysis methods are based on measuring similarity between _______
a) objects b) classes c) displaying d) composition
8. _______ is the set of all nodes which are interconnected by hypertext links.
a) WWW b) LAN c)MAN d) none
9. Web content mining deals with _______useful information from the web
a) discover b) plaintext c) secure d) HTTP
10. Web terminology is based on the work of _________
a) W3C b) WC3 c) WWW d) W3W
