Data Mining and Warehousing
UNIT – I
1.1 INTRODUCTION
a. Data mining:
Data mining is a collection of techniques for the efficient automated discovery of
previously unknown, valid, novel, useful and understandable patterns in large
databases. The patterns must be actionable so that they may be used in an enterprise's
decision-making process.
1.1.1 Data mining applications
Data mining can be used in predicting the class to which an object or individual is likely
to belong. This is useful, for example, in predicting whether an individual is likely to
respond to a direct mail solicitation, in identifying a good candidate for a surgical
procedure, or in identifying a good risk for granting a loan or insurance.
One of the most widely used supervised classification techniques is the decision
tree. The decision tree technique is widely used because it generates easily
understandable rules for classifying data.
b. Cluster analysis:
Cluster analysis is similar to classification but, in contrast to supervised
classification, cluster analysis is useful when the classes are not known in advance and
training data is not available.
The aim of cluster analysis is to find groups of objects such that the groups are very
different from each other. Cluster analysis breaks up a single collection of perhaps
diverse data into a number of groups.
One of the most widely used cluster analysis methods is the k-means
algorithm, which requires that the user specify not only the number of clusters but also
their starting seeds.
c. Web data mining:
Searching the web has become an everyday experience for millions of people
from all over the world. From its beginning in the early 1990s, the web had grown to
more than four billion pages in 2004, and perhaps would grow to more than eight
billion pages by the end of 2006.
d. Search engines:
Search engines are huge databases of web pages as well as software packages for
indexing and retrieving the pages, enabling users to find information of interest to
them. Normally the search engine databases of web pages are built and updated
automatically by web crawlers.
g. Healthcare:
Data mining has a variety of applications in healthcare. For example, in
drug testing, data mining may assist in isolating those patients for whom the drug is
most effective or for whom the drug is having unintended side effects.
Data mining has been used in determining factors influencing the survival
rate of heart transplant patients when survival rate data were available for
a significant number of patients over several years.
Another study aimed to predict the length of hospital stay
for patients suffering from spinal cord injuries. The study required data
validation and data cleaning.
h. Manufacturing:
Data mining tools have been used to identify factors that lead to critical
manufacturing situations in order to warn engineers of impending problems. Data
mining applications have also been reported in power stations, petrochemical plants
and other types of manufacturing plants. For example, in a study involving a power
station, the company wanted to reduce its operating costs.
i. Marketing and Electronic Commerce:
One of the most widely discussed applications of data mining is that by
Amazon.com, which uses simple data mining tools to recommend to customers
what other products they might consider buying.
The rules use the customers’ own history of purchases as well as purchases
by similar customers. Strange results can sometimes be found in data mining. In
one data mining study, it was found that customers tended to favour one side of the
store where the specials were put and did not shop in the whole store.
j. Telecommunications:
The telecommunication industry in many countries is in turmoil due to
deregulation. The telecommunications business is changing through consolidation in
the market place and the convergence of new technologies.
o For example, video, data, voice.
o They also have to deal with technologies like voice over IP.
o A widely discussed data mining application in the telecommunications
industry is churn analysis.
The telecommunication company has to deal with a large number of variables
including the cost of local calls, the cost of international calls, the mobile phone plans,
the installation and disconnection rates, customer satisfaction data, and data about
customers who do not pay their bills or do not pay them on time.
1.1.4 The Future Of Data Mining
Since most of the time spent in data mining is actually spent on data extraction, data
cleaning and data manipulation, it is expected that technologies like data warehousing will grow
in importance. It has been found that as much as 40% of all collected data contains
errors.
To deal with such a large error rate, there is likely to be more emphasis in the future on
data warehousing and on data cleaning and extraction tools. Business users often find the
techniques difficult to understand and integrate into business processes.
The academic community is more interested in developing new techniques that
perform better than those that are already known. Data mining projects depend upon a lot
of careful analysis of the business and a good understanding of the techniques and
software available.
Many data mining techniques are not based on sound theoretical background.
More theory regarding all data mining techniques and practices is also likely to be the
focus of data mining efforts in the future.
1.1.5 Data Mining Software
There is considerable data mining software available on the market. Most major
computing companies, like IBM, Oracle and Microsoft, are providing data mining
packages. Some data mining software can be expensive while other software is
available free, and therefore a user should spend some time selecting an appropriate tool
for the task they face.
m).Mining Mart:
This package was developed at the University of Dortmund in Germany. The
software focuses on data cleaning and provides a graphical tool for data preprocessing.
n).Oracle:
Users of Oracle therefore have access to techniques for association rules,
classification and prediction. Oracle Data Mining is accessed through a graphical user interface.
o).Weka 3:
A collection of machine learning algorithms for solving data mining problems. It
is written in Java and runs on almost any platform.
p).Software evaluation and selection:
Many factors must be considered in evaluating the suitability of software:
Product and vendor information.
Total cost of the ownership.
Performance.
Functionality and modularity.
Training and support.
Reporting facilities and visualization.
Usability.
1.2.1 Introduction
1.2.2 Basics
Let us consider a naïve brute force algorithm to do the task. Consider the
following example: we have only four items for sale (Bread, Cheese, Juice and Milk) and
only four transactions.
We have to find the association rules with a minimum Support of 50% and
minimum Confidence of 75%.
Transaction table
  Transaction ID    Items
  100               Bread, Cheese
  200               Bread, Cheese, Juice
  300               Bread, Milk
  400               Cheese, Juice, Milk
The basis of our naïve algorithm is as follows. We list all the combinations of
the items that we have in stock, find which of these combinations are frequent, and
then find the association rules that have the required confidence from these frequent
combinations.
The four items and all the combinations of these four items and their frequencies
of occurrence in the transaction database are given in the following table.
The list of all item sets and their frequencies
  Item set                        Frequency
  Bread                           3
  Cheese                          3
  Juice                           2
  Milk                            2
  (Bread, Cheese)                 2
  (Bread, Juice)                  1
  (Bread, Milk)                   1
  (Cheese, Juice)                 2
  (Cheese, Milk)                  1
  (Juice, Milk)                   1
  (Bread, Cheese, Juice)          1
  (Bread, Cheese, Milk)           0
  (Bread, Juice, Milk)            0
  (Cheese, Juice, Milk)           1
  (Bread, Cheese, Juice, Milk)    0
Since we require a minimum support of 50%, we find from the above table the item
sets that occur in at least two transactions. Such item sets are called frequent.
The list of frequencies shows that all four items are frequent, as are the pairs (Bread, Cheese) and (Cheese, Juice).
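To make the brute force approach concrete, here is a minimal Python sketch (not part of the original text) that enumerates every combination of the four items, counts how often each occurs in the four example transactions, and keeps those that meet the 50% support threshold:

from itertools import combinations

# The four example transactions (transaction ID -> items purchased)
transactions = {
    100: {"Bread", "Cheese"},
    200: {"Bread", "Cheese", "Juice"},
    300: {"Bread", "Milk"},
    400: {"Cheese", "Juice", "Milk"},
}

items = {"Bread", "Cheese", "Juice", "Milk"}
min_support = 0.5  # 50%, i.e. at least 2 of the 4 transactions

frequent = {}
# Enumerate every non-empty combination of items (the naive, exponential part)
for size in range(1, len(items) + 1):
    for itemset in combinations(sorted(items), size):
        count = sum(1 for t in transactions.values() if set(itemset) <= t)
        if count / len(transactions) >= min_support:
            frequent[itemset] = count

for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, count)
# Prints the four single items plus ('Bread', 'Cheese') and ('Cheese', 'Juice')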
The basic algorithm for finding the association rules was first proposed in 1993.
In 1994, an improved algorithm was proposed and called the Apriori algorithm.
This may be considered to consist of two parts. In the first part, those item sets
that exceed the minimum support requirement are found; such item sets are called frequent
item sets. In the second part, the association rules that meet the minimum confidence
requirement are found from the frequent item sets.
a.First part – Frequent item sets:
This part is carried out in three steps; Steps 2 and 3 are repeated until no new frequent item sets are found.
i.Step 1:
Scan all transactions and find all frequent items that have support above p%. Let
the set of these frequent items be L1.
ii.Step 2:
Build potential sets of k items from Lk-1 by using pairs of item sets in Lk-1 such
that each pair has the first k-2 items in common. The k-2 common items and the one
remaining item from each of the two item sets are combined to form a k-item set. The
collection of k-item sets built in this way is the candidate set Ck. This step is called apriori-gen.
iii.Step 3:
Scan all transactions and find all k-item sets in Ck that are frequent. The frequent
set obtained is Lk.
Terminate when no further frequent item sets are found; otherwise continue with
Step 2. The main notation for association rule mining used in the Apriori algorithm is
the following:
1. A k-item set is a set of k items.
2. The set Ck is a set of candidate k-item sets that are potentially frequent.
3. The set Lk is a subset of Ck and is the set of k-item sets that are frequent.
Some of the issues that need to be considered for the Apriori algorithm are:
a.Computing L1:
We scan the disk-resident database only once to obtain L1. An item vector of
length n with count for each item stored in the main memory may be used. Once the
scan of the database is finished and the count for each item found, the items that meet
the support criterion can be identified and L1 determined.
b.Apriori-gen function:
This is Step 2 of the Apriori algorithm. It takes as argument Lk-1 and returns a
set of all candidate k-itemsets. In computing C3 from L2, we organize L2 so that the
itemsets are sorted in lexicographic order. Observe that if an itemset
in C3 is (a, b, c) then L2 must contain the itemsets (a, b) and (a, c), since all subsets of an
itemset in C3 must be frequent.
c.Pruning:
Once a candidate set Ck has been produced, we can prune some of the candidate
item sets by checking that all subsets of every item set in the set are frequent. For
example, if we have derived {a, b, c} from {a, b} and {a, c}, then we check that {b, c} is
also in L2. If it is not, {a, b, c} may be removed from C3.
The task of such pruning becomes harder as the number of items in the item set
grows, but the number of large item sets tends to be small.
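The join and prune steps described above can be sketched as follows; this is a rough illustration in Python, and the function names apriori_gen and prune are my own, not from any library:

from itertools import combinations

def apriori_gen(L_prev, k):
    """Build candidate k-itemsets C_k from the frequent (k-1)-itemsets L_prev."""
    L_prev = sorted(L_prev)  # lexicographic order, each itemset a sorted tuple
    candidates = set()
    for i in range(len(L_prev)):
        for j in range(i + 1, len(L_prev)):
            a, b = L_prev[i], L_prev[j]
            # join step: the two itemsets must agree on their first k-2 items
            if a[:k - 2] == b[:k - 2]:
                candidates.add(tuple(sorted(set(a) | set(b))))
    return candidates

def prune(candidates, L_prev, k):
    """Prune step: keep a candidate only if all its (k-1)-subsets are frequent."""
    L_prev = set(L_prev)
    return {c for c in candidates
            if all(subset in L_prev for subset in combinations(c, k - 1))}

# Example: L2 from the four-transaction example used earlier
L2 = [("Bread", "Cheese"), ("Cheese", "Juice")]
C3 = prune(apriori_gen(L2, 3), L2, 3)
print(C3)  # empty set: the two frequent pairs do not share their first item,
           # so no 3-item candidate is generated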
d.Apriori subset function:
To improve the efficiency of searching, the candidate item sets Ck are stored in a
hash tree. Each leaf node is reached by traversing the tree whose root is at depth 1. Each
internal node of depth d points to all the related nodes at depth d+1 and the branch to
be taken is determined by applying the hash function.
e.Transactions storage:
Since we assume the data is too large to be stored in the main memory, it may be
stored as a set of transactions, each transaction being a sequence of item numbers.
f.Computing L2:
Assuming that C2 is available in the main memory, each candidate pair needs to
be tested to find if the pair is frequent. Given that C2 is likely to be large, this testing
must be done efficiently. In one scan, each transaction can be checked for the candidate
pairs.
g.Second Part – Finding the rules:
To find the association rules from the frequent item sets, we take a large frequent
item set, say p, and find each nonempty subset a. The rule a -> (p - a) is possible if it
satisfies the minimum confidence. The confidence of this rule is given by support(p)/support(a).
Since the confidence is given by support(p)/support(a), it is clear that if for some a
the rule a -> (p - a) does not have the minimum confidence, then all rules like b -> (p - b),
where b is a subset of a, will also not have the minimum confidence, since support(b) cannot be
smaller than support(a).
Another way to improve rule generation is to consider rules like (p-a)->a. If this
rule has the minimum confidence then all rules (p-b)->b will also have minimum
confidence if b is a subset of ‘a’ since (p-b) has more items than (p-a), given that ‘b’ is
smaller than ‘a’ and so cannot have support higher than that of (p-a).
Once again this can be used in improving the efficiency of rule generation. In
both the improvements noted above, the total number of items, and therefore the
support(p),stays the same.
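As a rough sketch of this second part, the code below generates rules a -> (p - a) from a dictionary of frequent itemsets and their support counts, keeping only rules whose confidence support(p)/support(a) meets the minimum; the data is the earlier four-transaction example, and the function name is illustrative:

from itertools import combinations

def generate_rules(frequent, min_confidence):
    """frequent maps frozensets of items to their support counts."""
    rules = []
    for p, support_p in frequent.items():
        if len(p) < 2:
            continue
        # try every non-empty proper subset a of p as the rule antecedent
        for size in range(1, len(p)):
            for a in combinations(p, size):
                a = frozenset(a)
                confidence = support_p / frequent[a]
                if confidence >= min_confidence:
                    rules.append((set(a), set(p - a), confidence))
    return rules

# Frequent itemsets (and their counts) from the earlier example
frequent = {
    frozenset(["Bread"]): 3, frozenset(["Cheese"]): 3,
    frozenset(["Juice"]): 2, frozenset(["Milk"]): 2,
    frozenset(["Bread", "Cheese"]): 2, frozenset(["Cheese", "Juice"]): 2,
}
for antecedent, consequent, conf in generate_rules(frequent, 0.75):
    print(antecedent, "->", consequent, f"confidence={conf:.2f}")
# Only {'Juice'} -> {'Cheese'} survives, with confidence 2/2 = 1.00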
1.2.5 Improving the efficiency of the apriori algorithm
The Apriori algorithm is resource intensive for large sets of transactions that
have a large set of frequent items. The major reasons for this may be summarized as
follows:
1. The number of candidate itemsets grows quickly and can result in huge
candidate sets. The larger the candidate set, the higher the processing cost for
scanning the transaction database to find the frequent itemsets. Since the candidate
sets are largest in the early iterations (in particular C2), the performance
of the Apriori algorithm in the later stages is therefore not so much of a concern.
2. The Apriori algorithm requires many scans of the database. If n is the length of
the longest itemset, then (n+1) scans are required.
3. Many trivial rules are derived and it can often be difficult to extract the most
interesting rules from all the rules derived.
4. Some rules can be inexplicable and very fine grained, for example, toothbrush
was the most frequently sold item on Thursday mornings.
5. Redundant rules are generated. For example, if A -> B is a rule then any rule
AC -> B is redundant. A number of approaches have been suggested to avoid
generating redundant rules.
6. Apriori assumes sparsity since the number of items in each transaction is small
compared with the total number of items.
A number of techniques for improving the performance of the Apriori algorithm have
been suggested. They can be classified into four categories:
Reduce the number of candidate itemsets.
Reduce the number of transactions.
Reduce the number of Comparisons.
Reduce the candidate sets efficiently.
Some algorithms that use one or more of the above approaches are:
Apriori-TID
Direct Hashing and Pruning(DHP)
Dynamic Itemset Counting (DIC)
Frequent Pattern Growth.
1.2.6 Mining frequent patterns without candidate generation (fp-growth):
This algorithm uses an approach that is different from that used by methods
based on the Apriori algorithm. The major difference is that frequent pattern growth
(FP-growth) does not generate candidate itemsets and then test them; instead it builds a
compact FP-tree and mines the frequent patterns directly from it.
Now we remove the items that are not frequent from the transactions and order
the items according to their frequency.
Table 3: Database after removing the non-frequent items and reordering
  Transaction ID    Items
  100               Bread, Juice, Cheese
  200               Bread, Juice, Cheese
  300               Bread, Milk
  400               Bread, Juice, Milk
  500               Juice, Cheese, Milk
The FP-tree also consists of a header table with an entry for each item and a
link to the first node in the tree with the same item name. This linking is done to make
traversal of the tree more efficient. Nodes with the same name in the tree are linked via
the dotted node-links.
STEPS: FP-tree
1. The tree is built by making a root node labeled NULL. A node is made for each
frequent item in the first transaction and the count is set to 1.
2. The first transaction {B, J, and C} is inserted in the empty tree with the root node
labeled NULL. Each of these items is given a frequency count of 1.
3. The second transaction, which is identical to the first, is inserted; it increases the
frequency counts of these items to 2.
4. Next {B, M} is inserted. This requires that a node for M be created. The counter
for B goes to 3 and M is set to 1.
5. The next transaction {B, J, and M} results in the counters for B and J going up to 4
and 3 respectively and a new node for M with a count of 1.
6. The last transaction {J, C and M} results in a brand new branch for the tree which
is shown on the right-hand side in the above figure.
The nodes near the root of the tree are more frequent than those further down
the tree. The height of an FP-tree is always equal to the maximum number of items
in a transaction, excluding the root node. The FP-tree is compact and often orders
of magnitude smaller than the transaction database. Once the FP-tree is constructed
the transaction database is not required.
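A minimal sketch of how such an FP-tree might be built for the five reordered transactions in Table 3 is shown below; the class and field names (FPNode, node_link, header) are illustrative assumptions, not a standard API:

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item name, or None for the root
        self.count = 1
        self.parent = parent
        self.children = {}        # item -> FPNode
        self.node_link = None     # link to the next node with the same item name

def insert_transaction(root, transaction, header):
    """Insert one (already ordered) transaction into the tree rooted at root."""
    node = root
    for item in transaction:
        if item in node.children:
            node.children[item].count += 1
        else:
            child = FPNode(item, node)
            node.children[item] = child
            # maintain the header table: chain nodes with the same item name
            if item not in header:
                header[item] = child
            else:
                link = header[item]
                while link.node_link is not None:
                    link = link.node_link
                link.node_link = child
        node = node.children[item]

# Transactions with non-frequent items removed and items ordered by frequency
transactions = [
    ["Bread", "Juice", "Cheese"],
    ["Bread", "Juice", "Cheese"],
    ["Bread", "Milk"],
    ["Bread", "Juice", "Milk"],
    ["Juice", "Cheese", "Milk"],
]

root, header = FPNode(None, None), {}
root.count = 0
for t in transactions:
    insert_transaction(root, t, header)

print({item: node.count for item, node in root.children.items()})
# {'Bread': 4, 'Juice': 1} -- Bread heads the main branch, Juice starts the second branch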
d.Mining the FP-tree for frequent items:
To find the frequent item-sets we should note that for any frequent item a, all the
frequent item-sets containing a can be obtained by following a's node-links,
starting from a's head in the FP-tree header.
The mining on the FP-tree structure is done using an algorithm called the
frequent pattern growth (FP-growth). This algorithm starts with the least frequent
item, which is the last item in the header table.
By using the above example we find the frequent item-sets. We start with the
item M and find the following patterns:
BM (1)
BJM (1)
JCM (1)
No frequent item-set is discovered from these since no item-set appears three times.
Next we look at C and find the following:
BJC (2)
JC (1)
These two patterns give us a frequent item-set JC (3). Looking at J, the next
frequent item in the table, we obtain:
BJ (3)
J (1)
Again we obtain a frequent item-set BJ (3). There is no need to follow links from
item B as there are no other frequent item-sets.
Advantages of the FP-tree approach:
An advantage of the FP-tree algorithm is that it avoids scanning the database more than twice to find
the support counts. The FP-tree approach completely eliminates costly candidate generation,
which can be expensive, in particular for the Apriori algorithm's candidate set C2.
FP-growth algorithm is better than the Apriori algorithm when the transaction
database is huge and the minimum support count is low.
FP-growth algorithm uses a more efficient structure to mine patterns when the
database grows.
UNIT – II
2.1 Classification
2.1.1 Introduction
The model that we build from the training data is never 100% accurate, and classification based on
the model will always lead to errors in some cases. In spite of such errors, classification
can be useful for prediction and better understanding of the data.
The attributes may be of different types. Attributes whose domain is numerical
are called numerical attributes while attributes whose domain is not numerical are
called categorical attributes. Categorical attributes may be ordered (e.g. a student’s
grade) or may be unordered (e.g. gender). Usually, the dependent attribute is assumed
to be categorical if it is a classification problem and then the value of the attribute is the
class label.
If the problem is not a classification problem, the dependent attribute may be
numerical. Usually such problems are called regression problems. Obviously, the two
problems are closely related and one type of problem may sometimes be converted to
another type, if necessary, by simple transformation of variables from either categorical
to continuous (by converting categories to numerical values which may not be always
possible) or from continuous to categorical (by bracketing numerical values and
assigning categories, e.g. salaries may be assigned high, medium and low categories).
Binning of continuous data into categories is quite simple although the selection of
ranges can have a significant impact on the results.
Classification has many applications, for example prediction of customer
behavior (e.g. predicting direct mail responses or identifying telecom customers that
might switch companies) and identifying fraud. A classification method may use the
sample to derive a set of rules for allocating new applications to either of the two
classes.
Supervised learning requires training data, while testing requires additional data
which has also been pre-classified. Classifying the test data and comparing the results
with the known results can then determine the accuracy. The number of cases classified
correctly provides us with an estimate of the accuracy of the model. Although accuracy is a
very useful metric, it does not provide sufficient information about the utility of the model.
Our aim is to find highly accurate models that are easy to understand and which
are efficient when dealing with large datasets. There are a number of classification
methods; here we only discuss the decision tree and naive Bayes techniques.
2.1.2 Decision Tree
Figure: a simple decision tree. The root node asks "Pouch?"; Yes leads to the class Marsupial, while No leads to the question "Feathers?", where Yes gives the class Bird and No gives the class Mammal.
Each node of the tree is a decision while the leaves are the classes. Clearly this is
a very simple example that is unlike problems in real life.
A decision tree is a model that is both predictive and descriptive. It is a tree that
displays relationships found in the training data. The tree consists of zero or more
internal nodes and one or more leaf nodes, with each internal node being a decision
node having two or more child nodes.
Each node of the tree represents a choice between a number of alternatives (in
“20 questions” the choices are binary) and each leaf node represents a classification or a
decision. The training process that generates the tree is called induction.
The decision tree technique is popular since the rules generated are easy to
describe and understand, the technique is fast unless the data is very large, and there is a
variety of software available.
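As an illustration of how a decision tree can be induced and its rules inspected in practice, here is a small sketch using scikit-learn (assuming it is installed); the tiny training set is invented purely for demonstration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [owns_home, married] encoded as 0/1, class = credit risk
X = [[1, 1], [0, 0], [1, 0], [0, 1], [1, 1], [0, 0]]
y = ["good", "bad", "good", "bad", "good", "bad"]

# Induce the tree from the training sample
tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)

# The tree can be printed as easily understandable if-then rules
print(export_text(tree, feature_names=["owns_home", "married"]))

# Classify a previously unseen case
print(tree.predict([[1, 0]]))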
It should be noted that there needs to be a balance between the number of
training samples and the number of independent attributes. Generally, the number of
training samples required is likely to be relatively small if the number of independent
attributes is small, and the number of training samples required is likely to be large
when the number of attributes is large.
It is not unusual to have hundreds of objects in the training sample although the
examples in these notes must consider only a small training set.
The complexity of a decision tree increases as the number of attributes increases,
although in some situations it has been found that only a small number of attributes can
determine the class to which an object belongs and the rest of the attributes have little or
no impact.
The quality of training data usually plays an important role in determining the
quality of the decision tree. If there are a number of classes, then there should normally
be sufficient training data available that belongs to each of the classes.
It is of course not going to be possible to model the most infrequent situations. If
one tried to do that then we have a condition that is called an “overtrained model”
which produces errors because unusual cases were present in the training data.
The decision tree building algorithm continues until all leaf nodes are single-class
nodes, or until no more attributes are available for splitting a node that has objects of
more than one class.
When the objects being classified have a large number of attributes and a tree of
maximum possible depth is built, the tree quality may not be high since the tree is built
to deal correctly with the training set.
Some branches of the tree may reflect anomalies due to noise or outliers in the
training samples. Such decision trees are a result of overfitting the training data and
may result in poor accuracy for unseen samples.
According to the Occam's razor principle (due to the medieval philosopher
William of Occam) it is best to posit that the world is inherently simple and to choose
the simplest model from similar models since the simplest model is more likely to be a
better model.
We can therefore “shave off” nodes and branches of a decision tree, essentially
replacing a whole subtree by a leaf node, if it can be established that the expected error
rate in the subtree is greater than that in the single leaf. This makes the classifier
simpler. A simple model has less chance of introducing inconsistencies, ambiguities
and redundancies.
Once all the rules have been generated, it may be possible to simplify the rules.
Rules with only one antecedent (e.g. if gender = "Male" then class = B) cannot be further
simplified, so we only consider those with two or more antecedents. It may be possible
to eliminate unnecessary rule antecedents that have no effect on the conclusion reached
by the rule. Some rules may be unnecessary and these may be removed. In some cases a
number of rules that lead to the same class may be combined.
The quality of the decision tree and that of the rules depends on the quality of the
training sample. If the training sample is not a good representation of the population,
then one should be careful in reading too much into rules derived.
2.1.5 Naive Bayes Method
The Naïve Bayes method is based on the work of Thomas Bayes (1702-1761).
Bayes was a British minister and his theory was published only after his death. It
is a mystery what Bayes wanted to do with such calculations.
Bayes Classification is quite different from the decision tree approach. In
Bayesian classification we have a hypothesis that the given data belongs to a particular
class. We then calculate the probability of the hypothesis being true. The approach
requires only one scan of the whole data. Also, if at some stage there is additional
training data, then each training example can incrementally increase or decrease the
probability that the hypothesis is correct.
Before we define Bayes' theorem we will define some notation. The
expression P(A) refers to the probability that event A occurs. P(A|B) stands for the
probability that event A will happen, given that event B has already happened. In other
words P (A|B) is the conditional probability of A based on the condition that B has
already happened.
Baye’s theorem:-
P(A|B)=P(B|A)P(A)/P(B)
P(A|B)=P(A & B)/P(B)
P(B/A)= P(A & B)/P(A)
Dividing the first equation by the second gives us the Baye’s theorem.
Continuing with A and B being courses, we can compute the conditional probability if
we knew what the probability of passing both courses was, that is P(A & B), and what
the probabilities of passing A and B separately were. If an event has already happened
then we divide the joint probability P(A & B) with the probability of what has just
happened and obtain the conditional probability.
Once the probabilities have been computed for all the classes, we simply assign X to
the class that has the highest conditional probability.
Let us consider how the probabilities P(Ci|X) may be calculated:
P(Ci|X) = [P(X|Ci)P(Ci)]/P(X)
Therefore we only need to compute P(X|Ci) and P(Ci) for each class.
Computing P(Ci) is rather easy, since we count the number of instances of each class in
the training data and divide each by the total number of instances. This may not be the
most accurate estimation of P(Ci), but we have very little information, only the training
sample, and no other way to obtain a better estimation.
To compute P(X|Ci) we use a naïve approach by assuming that all
attributes of X are independent, which is often not true.
Using the independence of attributes assumption and based on the training data,
we compute an estimate of the probability of obtaining the data X by
estimating the probability of each of the attribute values, counting the frequency of
those values for class Ci.
We then determine the class allocation of X by computing P(X|Ci)P(Ci) for
each of the classes and allocating X to the class with the largest value.
The essence of the Bayesian approach is that the probability of the dependent attribute can be
estimated by computing estimates of the probabilities of the independent attributes.
It is possible to use this approach even if values of all the independent attributes
are not known since we can still estimate the probabilities of the attribute values that we
know.
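A rough sketch of this computation for categorical attributes is given below; it is plain Python with invented helper names, simply counting class frequencies and per-class attribute value frequencies and then multiplying the estimates as described:

from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Estimate P(Ci) and P(attribute value | Ci) by counting."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    for record, label in zip(records, labels):
        for i, value in enumerate(record):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def classify(x, class_counts, value_counts):
    """Allocate x to the class with the largest P(x | Ci) * P(Ci)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / total                              # estimate of P(Ci)
        for i, value in enumerate(x):
            score *= value_counts[(c, i)][value] / count   # independence assumption
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

# Tiny invented example: attributes (owns_home, married), class = risk
records = [("yes", "yes"), ("no", "no"), ("yes", "yes"), ("yes", "no"), ("no", "no")]
labels = ["B", "A", "C", "B", "A"]
model = train_naive_bayes(records, labels)
print(classify(("yes", "no"), *model))   # allocates the new case to class B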
Example 3.3 – Naïve Bayes Method
  Owns Home?   Married   Gender   Employed   Credit Rating   Risk Class
  Yes          Yes       Male     Yes        A               B
  No           No        Female   Yes        A               A
  Yes          Yes       Female   Yes        B               C
  Yes          No        Male     No         B               B
  No           Yes       Female   Yes        B               C
  No           No        Female   Yes        B               A
  No           No        Male     No         B               B
Given estimates of the probabilities ,we can compute the posterior probabilities
as
P(X|A) = 2/9
P(X|B) = 0
P(X|C) = 0
The naïve Bayes method assumes that all attributes are independent and that the training
sample is a good sample for estimating the probabilities.
2.1.6 Estimating predictive accuracy of classification methods
population. A method of estimating the error rate is considered biased if it either tends to
underestimate the error or tends to overestimate it.
The advantage of using this matrix is that it not only tells us how many got
misclassified but also what misclassification occurred.
A confusion matrix for three classes
                        True class
  Predicted class       1    2    3
  1                     8    1    1
  2                     2    9    2
  3                     0    0    7
Using the above table , we can define the terms “false positive “(FP) and “false
negative”(FN).
False positive cases are those that did not belong to a class but were allocated to
it.
False negative on the other hand are cases that belong to a class but were not allocated
to it.
We now define sensitivity and specificity
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
where TP (true positives) is the total number of objects correctly classified to their class
and TN (true negatives) is the total number of objects that did not get classified to a class
they did not belong to.
Consider class 1 in the above table. There are 10 objects that belong to this class and
20 do not. Of these 10, only 8 are classified correctly. In total, 24 objects are classified
correctly. Out of the 20 that did not belong to class 1, 2 objects are classified wrongly as
belonging to it. So we have TP = 8, TN = 18, FN = 2 and FP = 2. For class 2, TP = 9, TN = 16,
FN = 1 and FP = 4, and for class 3, TP = 7, TN = 20, FN = 3 and FP = 0.
Sensitivity = TP / (TP + FN) = 24/30 = 80%
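A small sketch of these calculations for the three-class confusion matrix above; the per-class loop reproduces the TP, FP, FN and TN counts worked out in the text:

# confusion[i][j] = number of objects of true class j predicted as class i
confusion = [
    [8, 1, 1],
    [2, 9, 2],
    [0, 0, 7],
]
total = sum(sum(row) for row in confusion)

for c in range(3):
    tp = confusion[c][c]
    fp = sum(confusion[c][j] for j in range(3) if j != c)   # predicted c, but not class c
    fn = sum(confusion[i][c] for i in range(3) if i != c)   # class c, predicted elsewhere
    tn = total - tp - fp - fn
    print(f"class {c + 1}: TP={tp} FP={fp} FN={fn} TN={tn}")

# Overall sensitivity as computed in the text: total TP / (total TP + total FN)
total_tp = sum(confusion[c][c] for c in range(3))
print("sensitivity =", total_tp / total)   # 24/30 = 0.8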
1.Speed
Speed involves not just the time or computation cost of constructing a model; it
also includes the time required to learn to use the model. Obviously, a user wishes to
minimize both times although it has to be understood that any significant data mining
project will take time to plan and prepare the data.
2.Robustness
Data errors are common, in particular when data is being collected from a
number of sources and errors may remain even after data cleaning. It is therefore
desirable that a method be able to produce good results in spite of some errors and
missing values in datasets.
3.Scalability
Data mining methods were originally designed for small datasets. Many have
been modified to deal with large problems. Given that large datasets are becoming
common, it is desirable that a method continues to work efficiently for large disk-
resident databases.
4.Interpretability
It is the job of the data mining professional to ensure that the results of data mining are
explained to the decision makers. It is therefore desirable that the end-user be able to
understand and gain insight from the results produced by the classification method.
Goodness of the model
For a model to be effective, it needs to fit the problem that is being solved. For
example, in decision tree classification, it is desirable to find a decision tree of the
"right" size that combines compactness with accuracy.
2.1.8 Classification Software
The decision tree is one of the major techniques in these packages. Some include the
naïve Bayes method as well as other classification methods. The list of software given
here is for classification only. It should be noted that different software using the same
technique may not produce the same results since there are subtle differences in the
techniques used.
This is not a comprehensive list but it includes some of the most widely used
classification software. A more comprehensive classification software list is available at
kdnuggets site (http://www.kdnuggets.com/software/classification.html).
C4.5, version 8 of the "classic" decision-tree tool, developed by J. R. Quinlan, is
available at (http://www.rulequest.com/Personal).
C5.0/See5 from RuleQuest Research is designed to deal with large data sets. It
constructs classifiers in the form of decision trees or sets of if-then-else rules. The
software uses boosting to reduce errors on unseen data. Links to a large number of
published case studies of its use are available at:
(http://www.rulequest.com/see5-pubs.html).
CART 5.0 and TreeNet from Salford Systems are well known decision tree
software packages. TreeNet provides boosting, while CART is the decision tree
software. The package incorporates facilities for data pre-processing and
predictive modeling including bagging and arcing. (http://www.salford-
systems.com/).
DTREG, from a company with the same name, generates classification trees
when the classes are categorical and regression decision trees when the classes
are numerical intervals, and finds the optimal tree size. In both cases, the
attribute values may be discrete or numerical. The software modules TreeBoost and
Decision Tree Forest generate ensembles of decision trees. In TreeBoost, each tree is
generated based on input from the previous tree, while Decision Tree Forest
generates the trees of the ensemble independently of each other. (http://www.dtreg.com/).
Model Builder for decision trees, from Fair Isaac, specializes in credit card and
fraud detection. It offers software for decision trees, including advanced tree-
building software that leverages data and business expertise to guide the user
in strategy development
(http://www.fairisaac.com/Fairisaac/solution/Product+Index/Model+Builder
/).
UNIT-III
their rank. Other types of data are also possible. For example, data may include text
strings or a sequence of Web pages visited.
3.1.3 Computing Distance
3.Chebychev distance
This distance metric is based on the maximum attribute difference. It is also
called the L∞ (maximum) norm of the difference vector.
D(x, y) = Max |xi - yi|
4.Categorical data distance
This distance measure may be used if many attributes have categorical values
with only a small number of values (e.g. binary values). Let N be the total
number of categorical attributes.
D(x, y) = (number of attributes for which xi and yi differ)/N
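The sketch below expresses these measures as plain Python functions; the Euclidean and Manhattan distances are included on the assumption that they are the first two metrics of this section (their definitions are not shown above):

import math

def euclidean(x, y):
    # presumably metric 1: L2 norm of the difference vector (an assumption here)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # presumably metric 2: L1 norm, the sum of absolute attribute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    # metric 3: the maximum attribute difference (L-infinity norm)
    return max(abs(a - b) for a, b in zip(x, y))

def categorical(x, y):
    # metric 4: fraction of categorical attributes on which the objects differ
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)

print(chebyshev([1, 5, 3], [2, 2, 3]))          # 3
print(categorical(["M", "red"], ["F", "red"]))  # 0.5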
a. Hierarchical methods
Hierarchical methods obtain a nested partition of the objects resulting in a tree of
clusters.
These methods either start with one cluster and then split into smaller and
smaller clusters (called divisive or top down) or start with each object in an
individual cluster and then try to merge similar clusters into larger and larger
clusters (called agglomerative or bottom up).
b. Density-based methods
In this class of methods, typically for each data point in a cluster, at least a
minimum number of points must exist within a given radius.
c.Grid-based methods
In this class of methods, the object space rather than the data is divided into a
grid.
Grid partitioning is based on characteristics of the data and such methods can
deal with non-numeric data more easily. Grid-based methods are not affected by
data ordering.
d.Model-based methods
A model is assumed, perhaps based on a probability distribution. Essentially the
algorithm tries to build clusters with a high level of similarity within them and a
low level of similarity between them.
Similarity measurement is based on the mean values and the algorithm tries to
minimize the squared-error function.
A simple taxonomy of cluster analysis methods is presented in Figure 4.1.
2.Complete-link
The complete-link algorithm is also called the farthest neighbours algorithm. In
this algorithm, the distance between two clusters is defined as the maximum of the
pairwise distances between objects in the two clusters. Therefore if there are m elements in one cluster and n in the
other, all mn pairwise distances must be computed and the largest chosen.
Both single-link and complete-link measures have their difficulties. In the single-
link algorithm, each cluster may have an outlier and the two outliers may be nearby
and so the distance between the two clusters would be computed to be small. Single-
link can form a chain of objects as clusters are combined since there is no constraint on
the distance between objects. On the other hand, the two outliers may be very far away
although the clusters are nearby and the complete-link algorithm will compute the
distance as large.
3.Centroid
The centroid algorithm computes the distance between two clusters as the
distance between the average point of each of the two clusters. Usually the squared
Euclidean distance between the centroids is used.
4.Average-link
The average-link algorithm computes the distance between two clusters as the
average of all pairwise distances between an object from one cluster and another from
the other cluster. Therefore if there are m elements in one cluster and n in the other,
there are mn distances to be computed, added and divided by mn.
This approach also generally works well. It tends to join clusters with small
variances.Figure 4.5 shows two clusters A and B and the average-link distance between
them.
Figure 4.5 The average-link distance between two clusters.
i.Agglomerative Hierarchical Method
A typical agglomerative method works as follows:
1. Allocate each object to a cluster of its own, so that we start with as many clusters as objects.
2. Create a distance matrix by computing distances between all pairs of clusters either
using, for example, the single-link metric or the complete-link metric. Some other
metric may also be used. Sort these distances in ascending order.
3. Find the two clusters that have the smallest distance between them.
4. Remove the pair of clusters from the distance matrix and merge them.
5. If there is only one cluster left then stop.
6. Compute all distances from the new cluster and update the distance matrix after
the merger and go to Step 3.
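In practice the agglomerative procedure above is rarely coded by hand; the following sketch uses SciPy's hierarchical clustering (assuming SciPy is available) with the single-link metric on a few invented 2-D points:

from scipy.cluster.hierarchy import linkage, fcluster

# Invented 2-D points: two visually separate groups
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]

# Agglomerative clustering with the single-link (nearest neighbour) metric;
# method="complete" or method="average" would give the other variants above.
merges = linkage(points, method="single")

# Cut the resulting tree (dendrogram) into two clusters
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]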
ii.Divisive Hierarchical Method
The divisive method is the opposite of the agglomerative method. It starts with the
whole dataset as one cluster and then proceeds to recursively divide the cluster into two
sub-clusters, continuing until each cluster has only one object or some other stopping
criterion has been reached. There are two types of divisive methods:
1. Monothetic: It splits a cluster using only one attribute at a time. An attribute that
has the most variation could be selected.
2. Polythetic: It splits a cluster using all of the attributes together. Two clusters far
apart could be built based on distance between objects.
A typical polythetic divisive method works like the following:
1. Decide on a method of measuring the distance between two objects.
2. Create a distance matrix by computing distances between all pairs of objects
within the cluster. Sort these distances in ascending order.
3. Find the two objects that have the largest distance between them. They are the
most dissimilar objects.
4. If the distance between the two objects is smaller than the pre-specified threshold
and there is no other cluster that needs to be divided then stop, otherwise continue.
5. Use the pair of objects as seeds of a K-means method to create two new clusters.
6. If there is only one object in each cluster then stop otherwise continue with Step
2.
In the above method, we need to resolve the following two issues:
• Which cluster to split next?
• How to split a cluster?
a.Which cluster to split next?
1. Split the clusters in some sequential order.
2. Split the cluster that has the largest number of objects.
3. Split the cluster that has the largest variation within it.
b.How to split a cluster?
We used a simple approach for splitting a cluster based on distance between the objects
in the cluster. A distance matrix is created and the two most dissimilar objects are
selected as seeds of two new clusters. The K-means method is then used to split the
cluster.
Advantages of the Hierarchical Approach
1. Hierarchical methods are conceptually simpler and can be implemented easily.
2. Hierarchical methods can provide clusters at different levels of granularity
Disadvantages of the Hierarchical Approach
1. The hierarchical methods do not include a mechanism by which objects that have
been incorrectly put in a cluster may be reassigned to another cluster.
2. The time complexity of hierarchical methods can be shown to be O(n³).
3. The distance matrix requires O(n²) space and becomes very large for a large
number of objects.
4. Different distance metrics and scaling of data can significantly change the results.
3.1.7 Density-Based Methods
In this method clusters are high-density collections of data that are separated by
a large space of low-density data (which is assumed to be noise). For each data point in a
cluster, at least a minimum number of points must exist within a given distance.
Data that is not within such high-density clusters is regarded as outliers or noise.
The idea behind density-based clustering is that the clusters are dense regions of
probability density in the data set.
DBSCAN (density based spatial clustering of applications with noise) is one
example of a density-based method for clustering.
It requires two input parameters: the size of the neighbourhood (R) and the
minimum number of points in the neighbourhood (N). Essentially these two parameters determine
the density within the clusters the user is willing to accept since they specify how many
points must be in a region.
The number of points not only determines the density of acceptable clusters but
it also determines which objects will be labelled outliers or noise. Objects are declared
to be outliers if there are few other objects in their neighbourhood. The size parameter R
determines the size of the clusters found. If R is big enough, there would be one big
cluster and no outliers. If R is small, there will be small dense clusters and there might
be many outliers.
a.Concepts of DBSCAN method
1. Neighbourhood: The neighbourhood of an object y is defined as all the objects that
are within the radius R from y.
2. Core object: An object y is called a core object if there are N objects within its
neighbourhood.
3. Proximity: Two objects are defined to be in proximity to each other if they belong
to the same cluster. Object x 1 is in proximity to object x 2 if two conditions are
satisfied:
(a) The objects are close enough to each other, i.e. within a distance of R.
(b) x 2 is a core object as defined above.
4. Connectivity: Two objects x1 and x2 are connected if there is a path or chain of
objects from x1 to x2 such that each consecutive pair of objects in the chain is in proximity.
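A small sketch using scikit-learn's DBSCAN implementation (assuming scikit-learn is installed); its eps parameter plays the role of R and min_samples the role of N:

from sklearn.cluster import DBSCAN

# Two dense invented groups plus one isolated point (an outlier)
points = [[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
          [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],
          [9.0, 0.0]]

# eps is the neighbourhood radius R, min_samples the minimum number of points N
model = DBSCAN(eps=0.5, min_samples=3).fit(points)
print(model.labels_)   # e.g. [0 0 0 1 1 1 -1]; the label -1 marks noise/outliers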
Basic Algorithm for Density-based Clustering:
1. Select values of R and N.
Most clustering methods implicitly assume that all data is accessible in the
main memory. Often the size of the database is not considered, but a method requiring
multiple scans of data that is disk-resident could be quite inefficient for large problems.
a.K-Means Method for Large Databases
One modification of the K-means method deals with data that is too large to fit in the main
memory. The method first picks the number of clusters and their seed centroids and
then attempts to classify each object as belonging to one of the following three groups:
(a) Those that are certain to belong to a cluster. These objects together are called the
discard set. Some information about these objects is computed and saved. This
includes the number of objects n, a vector sum of all attribute values of the n
objects (a vector S) and a vector sum of squares of all attribute values of the n
objects (a vector Q).
(b) Those objects that are sufficiently far away from each cluster's centroid that
they cannot yet be put in the discard set of objects. These objects together are
called the compression set.
(c) The remaining objects that are too difficult to assign to either of the two groups
above.These objects are called the retained set and are stored as individual
objects.
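A rough sketch of the summary that might be kept for the discard set described in (a); the class is an illustrative assumption. Keeping only n, the sum vector S and the sum-of-squares vector Q is enough to recompute the centroid (S/n) and the per-attribute variance (Q/n - (S/n)^2) without retaining the objects themselves:

class ClusterSummary:
    """Summary statistics kept for objects placed in the discard set."""

    def __init__(self, num_attributes):
        self.n = 0                           # number of objects summarized
        self.S = [0.0] * num_attributes      # per-attribute sum of values
        self.Q = [0.0] * num_attributes      # per-attribute sum of squared values

    def add(self, x):
        self.n += 1
        for i, value in enumerate(x):
            self.S[i] += value
            self.Q[i] += value * value

    def centroid(self):
        return [s / self.n for s in self.S]

    def variance(self):
        return [q / self.n - (s / self.n) ** 2 for s, q in zip(self.S, self.Q)]

summary = ClusterSummary(2)
for obj in [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]:
    summary.add(obj)
print(summary.centroid(), summary.variance())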
each cluster in an attempt to assess how far each cluster is from the other. Another
approach is based on computing within cluster variation (I) and between clusters
variation (E). These variations may be computed as follows:
Let the number of clusters be k and let the clusters be Ci, i = 1, ..., k. Let the total
number of objects be N and let the number of objects in cluster Ci be Mi, so that
M1 + M2 + ... + Mk = N
The within-cluster variation of the objects in a cluster Ci is defined as the
average squared distance of each object from the centroid of the cluster. That is, if mi is
the centroid of the cluster Ci, then the mean of the cluster is given by
mi = (1/Mi) ∑ xj, where the sum is over all objects xj in Ci
The between cluster distances E may now be computed given the centroids of the
clusters. It is the average sum of squares of pairwise distances between the centroids of
the k clusters. Evaluating the quality of clustering methods or results of a cluster
analysis is a challenging task.
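As a rough sketch (assuming squared Euclidean distance, consistent with the description above), the within-cluster variation I and the between-cluster variation E can be computed as follows:

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(cluster):
    n = len(cluster)
    return [sum(x[i] for x in cluster) / n for i in range(len(cluster[0]))]

def within_cluster_variation(cluster):
    """Average squared distance of each object from the cluster centroid (I)."""
    m = centroid(cluster)
    return sum(squared_distance(x, m) for x in cluster) / len(cluster)

def between_cluster_variation(clusters):
    """Average squared pairwise distance between the cluster centroids (E)."""
    centroids = [centroid(c) for c in clusters]
    pairs = [(i, j) for i in range(len(centroids)) for j in range(i + 1, len(centroids))]
    return sum(squared_distance(centroids[i], centroids[j]) for i, j in pairs) / len(pairs)

clusters = [[[1.0, 1.0], [1.2, 0.8]], [[5.0, 5.0], [4.8, 5.2]]]
print([within_cluster_variation(c) for c in clusters])
print(between_cluster_variation(clusters))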
The quality of a method involves a number of criteria:
1. Efficiency of the method
2. Ability of the method to deal with noisy and missing data
3. Ability of the method to deal with large problems
4. Ability of the method to deal with a variety of attribute types and magnitudes
Once several clustering results have been obtained, they can be combined, hopefully
providing better results than those from just one clustering of the given data.
Experiments using these techniques suggest that they can lead to improved quality and
robustness of results.
3.1.10 Cluster analysis software
UNIT-IV
4.1 Web Data Mining
4.1.1 Introduction
Web mining is the application of data mining techniques to find interesting and
potentially useful knowledge from web data. It is normally expected that either the
hyperlink structure of the web or the web log data or both have been used in the mining
process.
Web mining may be divided into several categories:
1. Web content mining
It deals with discovering useful information or knowledge from web page
contents. It goes well beyond using the keywords in the search engines. It focuses on
the web pages contents rather than the links.
2. Web Structure Mining:
It deals with discovering and modeling the link structures of the web. Work has
been carried out to model the web based on the topology of the hyperlinks.
3. Web Usage Mining:
It deals with understanding the user behavior in interacting with the web or with
a web site. One of the aims is to obtain information that may assist web site
reorganization or assist site adaptation to better suit the user.
4.1.2 Web terminology and characteristics
The web is seen as having a two tier architecture. The first tier is the web server
that serves the information to the client machine and the second tier is the client that
displays that information to the user.
This architecture is supported by three web standards, namely HTML( Hyper
Text Markup Language) for defining the web document content, URLs( Uniform
Resource Locators) for naming and identifying remote information resources in
the global web world, and HTTP (Hyper- Text Transfer Protocol) for managing
the transfer of information from the server to the client.
An investigation of the structure of the web was carried out and it was found that the
web is not a ball of highly connected nodes. It displayed structural properties, and the web
could be divided into the following five components:
1. The Strongly Connected Core (SCC): This part of the web was found to consist
of about 30% of the web, which is still very large given the more than four billion pages
on the web in 2004.
2. The IN Group: This part of the web was found to consist of about 20% of the
web.
Main Property- pages in the group can reach the SCC but cannot be reached from
it.
3. The OUT Group: This part of the web was found to consist of about 20% of the
web.
Main Property- pages in the group can be reached from the SCC but cannot reach
the SCC.
4. Tendrils: This part of the web was found to consist of about 20% of the web.
Main Property- pages cannot be reached by the SCC and cannot reach the SCC.
It does not imply that these pages have no linkages to pages outside the group
since they could well have linkages from the IN group and to the OUT group.
5. The Disconnected Group: This part of the web was found to be less than 10% of
the web and is essentially disconnected from the rest of the web world.
Ex: personal pages at many sites that link to no other page and have no links to
them.
61
The shallow web is the information on the web that the search engines can access
without accessing the web databases. It has been estimated that the deep web is about
500 times the size of the shallow web. The web pages are very dynamic, changing
almost daily. Many new web pages are added to the web every day.
Perhaps web pages are used for other purposes as well, for example,
communicating information to a small number of individuals via the web rather than via
email. In many cases, this makes good sense (wasting disk storage in mailboxes is
avoided).
But if this grows, then a very large number of such web pages with a short life
span and low connectivity to other pages are generated each day.
Large numbers of websites disappear every day and create many problems on
the web. Links from even well known sites do not always work. Not all results of a
search engine search are guaranteed to work. To overcome these problems, web pages
are categorized as follows:
1. A web page that is guaranteed not to change over time.
2. A Web page that will not delete any content, may add content / links but the
page will not disappear.
3. A web page that may change content / links but the page will not disappear.
4. A web page without any guarantee.
c. Web Metrics
There are a number of properties of the web (other than its size and structure) that
are useful to measure. Some of the measures are based on distances between the
nodes of the web graph. It is possible to define how well connected a node is by using
the concept of centrality of a node.
Centrality may be out-centrality, which is based on distances measured from the
node using its out-links, while in-centrality is based on distances measured from other nodes
that are connected to the node using its in-links. Based on these metrics, it is possible to
define the concept of compactness, which varies from 0 to 1: 0 for a completely
disconnected web graph and 1 for a fully connected web graph.
Perhaps the most important measurements about web pages are about a page’s
relevance and its quality. Relevance is usually related to a user query since it concerns
the user finding pages that are relevant to his query, and may be defined in a number
of ways.
A simple approach involves relevance to be measured by the number of query
terms that appear in the page. Another approach is based on in-links from relevant
pages. In this, relevance of a page may be defined as the sum of the number of query
terms in the pages that refer to the page.
Relevance may also use the concept of co-citation. In co-citation, if both pages ‘a’
and ‘b’ point to a page ‘c’ then it is assumed that ‘a’ and ‘b’ have a common interest.
Similarly if ‘a’ points to both ‘b’ and ‘c’, then we assume that ‘b’ and ‘c’ also share a
common interest.
Quality is not determined by page content, since there is no automatic means of
evaluating the quality of content; instead it is determined by the link structure. Example: If
page ‘a’ points to page ‘b’ then it is assumed that page ‘a’ is endorsing page ‘b’ and we
can have some confidence in the quality of page ‘b’.
4.1.3 Locality and hierarchy in the web
The web shows a strong hierarchical structure. A website of any enterprise has the
homepage as the root of the tree and as we go down the tree we find more detailed
information about the enterprise.
Example: The homepage of a university will provide basic information and then
links, for example, to: Prospective students, staff, research, and information for current
students, information for current staff.
The prospective students node will have a number of links: courses offered, admission
requirements, scholarships available, semester dates, etc. The web also has a strong locality
feature, to the extent that almost two-thirds of all links are to sites within the enterprise
domain.
The one-third of links that point to sites outside the enterprise domain have a higher percentage of
broken links. Web sites fetch information from a database to ensure that the information
is accurate and timely.
Web pages can be classified as:
1. Homepage or the head page: These pages represent an entry point for the website of
an enterprise (so they are frequently visited) or a section within the enterprise or an
individual’s web page.
2. Index page: These pages assist the user to navigate through the enterprise
website. A homepage may also act as an index page.
3. Reference page: These pages provide some basic information that is used by a
number of other pages. Ex: each page in a website may have a link to a page that
provides the enterprise's privacy policy.
4. Content Page: These pages only provide content and have little role in assisting a
user's navigation. Often they are larger in size, have few out-links and are often
the leaf nodes of a tree.
Web site structure and content are based on careful analysis of what the most
common user questions are. The content should be organized into categories in a way
that makes traversing the site simple and logical.
Careful web user data mining can help. A number of simple principles have been
developed to design the structure and content of a web site.
Three basic principles are:
1. Relevant Linkage principle: It is assumed that links from a page point to other
relevant resources. Links are assumed to reflect the judgment of the page creator.
By providing a link to another page, the creator recommends that page as
relevant.
2. Topical unity principle: It is assumed that web pages that are co-cited( that is
linked from the same pages) are related.
3. Lexical affinity principle: It is assumed that the text and the links within a page
are relevant to each other.
Unfortunately not all pages follow these basic principles, resulting in difficulties for
web users and web mining researchers.
4.1.4 Web Content Mining
Web content mining deals with discovering useful information from the web.
When we use search engines like Google to search contents on the web, search engines
find pages based on the location and frequency of keywords on the page although some
now use the concept of page rank.
If we consider the type of queries that are posed to search engines, we find that if
the query is very specific we face the scarcity problem. If the query is for a broad topic, we
face the abundance problem. Because of these problems, keyword search engine results are
only suggestive of document relevance. The user is then left with the task of finding the
relevant documents.
Brin presents an example of finding content on the web. It shows how relevant
information from a wide variety of sources, presented in a wide variety of formats, may be
integrated for the user.
The example involves extracting a relation of books in the form of (author, title) from
the web starting with a small sample list.
The problem is defined as follows: we wish to build a relation R that has a number of
attributes. The information about tuples of R is found on web pages but is unstructured.
The aim is to extract the information and structure it with a low error rate. The algorithm
proposed is called Dual Iterative Pattern Relation Extraction (DIPRE). It works as
follows:
Here the pattern is defined as a tuple like (order, URL prefix, prefix, middle,
suffix). It may then match strings like (author, middle, title) or (title, middle, author).
In Step 2 it is assumed that the occurrence consists of title and author with a
variety of other information. Web pages are usually semi-structured (e.g. HTML
documents). Database-generated HTML pages are more structured; however, many web pages
consist of unstructured free text data, which makes it more difficult to extract
information from them.
The semi-structured web pages may be represented based on the HTML
structures inside the documents. Some may also use hyperlink structure. A different
approach called database approach involves inferring the structure of a website and
transforming the website content into a database.
Web pages are published without any control on them. Some of the sophisticated
searching tools being developed are based on the work of the artificial intelligence
community. These are called Intelligent Web Agents. Some web agents replace the work
of search engines in finding information. Ex: ShopBot finds information about a product
that the user is searching for.
a. Web Document Clustering
Web document clustering is an approach to find relevant documents on a topic
or about query keywords. Search engines return huge, unmanageable lists of documents,
and finding the useful ones is often tedious.
The user could apply clustering to a set of documents returned by search engines
with the aim of finding meaningful clusters that are easier to interpret. It is not
necessary to insist that a document can only belong to one cluster, since in some cases it
is justified to have a document belong to two or more clusters.
Web clustering may be based on content alone, may be based on both content &
links or based only on links. One approach that is specifically designed for web
66
document cluster analysis is Suffix Tree Clustering (STC) & it uses a phrase based
clustering approach rather than using a single word frequency.
In STC, the key requirements of a web document clustering algorithm include:
i.Relevance: This is most obvious requirement. We want clusters that are relevant to
user query.
ii.Browsable summaries: The cluster must e easy to understand. User should be
quickly able to browse the description of a cluster.
iii.Snippet tolerance: The clustering method should not require whole documents &
should be able to produce relevant clusters based only on the information that the
search engine returns.
iv.Performance: The clustering method should be able to process the results of search
engine quickly & provide the resulting clusters to the user.
67
base clusters is created based on the documents that are in each cluster. If the base
clusters are similar, they are combined.
b.Finding Similar Web pages
It has been found that almost 30% of all web pages are very similar to other
pages. For example:
1. A local copy may have been made to enable faster access to the material.
2. FAQs are duplicated since such pages may be used frequently.
3. Online documentation of popular software like Unix may be duplicated fro local
use.
4. There are mirror sites that copy highly accessed sites to reduce traffic.
In some cases, documents are not identical because different formatting might be
joined together to build a single document. Copying a single web page is called
replication and copying an entire web site is called mirroring.
Similarity between web pages usually means content-based similarity. It is also
possible to consider link-based and usage-based similarity. Link-based is related to the
concept of co-citation and is used for discovering a core set of web page on a topic.
Usage-based is useful in grouping pages or users in to meaningful groups.
Content-based is based on comparing textual content of the web pages. Non- text
contents are not considered.
We define two concepts:
1. Resemblance: Resemblance of two documents is define to be a number between
0 and 1 with 1 indicating that the two documents are virtually identical and any
value close to 1 indicating that the documents are very similar.
2. Containment: Containment of one document in another is defined as a number
between 0 and 1 indicating that the first document is completely contained in the
second.
Number of approaches is there to assess the similarity of documents. One Brute
force approach is to compare two documents using software like ‘diff’ in Unix OS,
which compares two documents as files. Other string comparison algorithms may be
68
used to find how many characters need to be deleted, changed or added to transform
one document to other, but this is expensive for comparing millions of documents.
69
Example:
We find if the two documents with the following content are similar:
Document1: “the web continues to grow at a fast rate”
Document2: “the web is growing at the fast rate”
First Step: Making a set of shingles from the documents. We obtain the below shingles if
we assume the shingle length to be 3 words.
Shingles in Doc1 Shingles in Doc2
The web continues The web is
Web continues to Web is growing
Continues to grow Is growing at
To grow at Growing at a
Grow at a At a fast
At a fast A fast rate
A fast rate
Comparing two sets of shingles we find that only 2 of them are identical. So the
documents are not very similar. We illustrate the approach using 3 shingles that are the
shortest in the number of letters including spaces.
Shingles in Doc1 Number of Shingles in Doc2 Number of letters
Letters
The web continues 17 The web is 10
Web continues to 16 Web is growing 14
Continues to grow 17 Is growing at 13
To grow at 10 Growing at a 12
Grow at a 9 At a fast 9
At a fast 9 A fast rate 11
A fast rate 11
70
We select 3 shortest shingles for comparison. For first document, these are “to
grow at”, “grow at a”, “at a fast”. For second document, these are “the web is”, “at a
fast”, “a fast rate”. There is only one match out of three shingles.
False negatives would be obtained for documents like “the Australian economy
is growing at a fast rate”. So, small length shingles cause many false positives while
larger shingles result in more false negatives. A better approach involves randomly
selecting shingles of various lengths.
Issues in comparing documents using fingerprinting are:
Should the shingle length be in number of words or characters?
How much storage would be needed to store shingles?
Should upper and lower-case letters be treated differently?
Should punctuation marks be ignored?
Should end of paragraph be ignored?
4.1.5 Web Usage Mining
71
3. Visitor referring website: Web site URL of the site the user came from.
4. Visitor referral website: web site URL of the site where the user went he left the
web site.
5. Entry point: which web site page the user entered from
6. Visitor time and duration: time and day of visit and how long the visitor
browsed the site.
7. Path analysis: List of path of pages that the user took
8. Visitor IP address: this helps in finding which part of the world the user came
from.
9. Browser type
10. Platform
11. Cookies
This simple information can assist an enterprise to achieve the following:
1. Shorten the path to high visit pages
2. Eliminate or combine low visit pages
3. Redesign some pages including homepage to help user navigation
4. Redesign some pages so that the search engines can find them
5. Help evaluate effectiveness of an advertising campaign
Web usage mining may be desirable to collect information on:
i.Path traversed: What paths do the customers traverse? What are the most commonly
traversed paths? These patterns need to be analyzed and acted upon.
ii.Conversion rates: What are the basket-to-buy rates for each product?
Impact of advertising: Which banners are pulling in the most traffic? What is their
conversion rate?
iii.Impact of promotions: Which promotions generate the most sales?
iv.Web site design: Which links do the customers click most frequently? What links do
they buy from most frequently?
v.Customer segmentation: What are the features of customers who stop without
buying? Where do most profitable customers come from?
72
vi.Enterprise search: Which customer use enterprise search? Are they likely to
purchase? What they search?
Web usage mining also deals with catching and proxy servers. Not all page
requests are recorded in the server log because pages are often cached and served from
the cache when the users request them again.
Caching occurs at several levels. Browsers itself has a cache to keep copies of
recently visited pages which may be retained only for a short time. Enterprise maintains
a proxy server to reduce internet traffic and to improve performance.
Proxy server interprets any request for web pages from any user in the enterprise
and server the from the proxy if a copy that has not expired is resident in the proxy.Use
of caches and proxies reduces the number of hits that the web server log records.
One may be interested in the behaviors of users, not just in one session but over
several sessions. Returning visitors may be recognized using cookies(which is issued by
the web server and stored on the client). The cookie is then presented to web server on
return visits.
Log data analysis has been investigated using the following techniques:
1. Using association rules
2. Using cluster analysis
73
Aim of web structure mining is to discover the link structure or the model that is
assumed to underline the web. Model may be based on the topology of
hyperlinks.
This can help in discovering similarity between sites or in discovering authority
site for a particular topic or in discovering overview or survey sites that point to
many authority sites (such sites are called hubs)
The links on web pages provide a useful source of information that may be
bound together in web searches.
Kleinberg has developed a connectivity analysis algorithm called Hyperlink-
Induced Topic Search(HITS) based on the assumption that links represent
human judgement.
Algorithm also assumes that for any topic, there are a set of “authorities” sites
that are relevant on the topic and there are “hub” sites that contain useful links to
relevant sites on the topic.
HITS algorithm is based on the idea that if the creator of page p provides a link
to page q, then p gives some authority on page q.
But not all links give authority, since some may be for navigational purposes,
Exlinks to the home page.
74
2. Iterative step: It finds hubs and authorities using the information collected
during sampling.
Example:
For a query “web mining” the algorithm involves carrying out the sampling step
by querying a search engine and then using the most highly ranked web pages
retrieved for determining the authorities and hubs.
Posing a query to a search engine often results in abundance. In some cases, the
search engine may not retrieve all relevant pages for the query. Query for java may not
retrieve pages for object-oriented programming, some of which may also be relevant.
Step-1: Sampling step
First step involves finding a subset of nodes or a sub graph S, which are relevant
authoritative pages. To obtain such a sub graph, the algorithm starts with a root set of,
say 200 pages selected from the results of searching for the query in a traditional search
engine.
Let the root set be R. Starting from R, we wish to obtain a set S that has the
following properties:
1. S is relatively small
2. S is rich in relevant pages given the query
3. S contains most of the strongest authorities
HITS expand the root set R into a base set S by using the following algorithm:
1. Let S=R
2. For each page in S, do steps 3 to 5
3. Let T be the set of all pages S points to
4. Let F be the set of all pages that points to S
5. Let S=S + T + some or all of F(some if F is large)
6. Delete all links with the same domain name.
7. This S is returned
75
Set S is called base set for the given query. We find the hubs and authorities from
the base set as follows:
One approach is ordering them by the count of their out-degree and the count of
their in-degree.
Before starting hubs and authorities step of algorithm, HITS removes all links
between pages on the same web site(or same domain) in step 6 i.e. links between
pages on same site for navigational purposes, not for conferring authority.
Example:
Assume a query has been specified. First step has been carries out and a set of
pages that contain most of the authorities has been found.
76
Let the set be given by the following 10 pages. Their out-links are listed below:
Page A (out-links to G,H,I,J)
Page B (out-links to A,G,H,I,J)
Page C (out-links to B,G,H,I,J)
Page D (out-links to B,G,H,,J)
Page E (out-links to B,H,I,J)
Page F (out-links to B,G,I,J)
Page G (out-links to H,I,)
Page H (out-links to G,I,J)
Page I (out-links to H)
Page J (out-links to F,G,H,I)
Page A B C D E F G H I J
A 0 0 0 0 0 0 1 1 1 1
B 1 0 0 0 0 0 1 1 1 1
C 0 1 0 0 0 0 1 1 1 1
D 0 1 0 0 0 0 1 1 0 1
E 0 1 0 0 0 0 0 1 1 1
F 0 1 0 0 0 0 1 0 1 1
G 0 0 0 0 0 0 0 1 1 0
H 0 0 0 0 0 0 1 0 1 1
I 0 0 0 0 0 0 0 1 0 0
J 0 0 0 0 0 1 1 1 1 0
Every row in the matrix shows the out-links from the web page that is identified
in the first column. Every column shows the in-links from the web page that is
identified in the first row of the table
77
The algorithm works well for most queries, it does not work well for some
others. There are a number of reason for this:
e.Topic drift:
Certain arrangements of tightly connected documents, perhaps due to mutually
reinforcing relationships between hosts, can dominate HITS computation. These
documents may not be the most relevant to query sometimes.
78
g.Non-relevant documents:
Some queries can return non-relevant documents in the highly ranked queries
and this can then lead to erroneous results from HITs.
h.Efficiency:
The real-time performance of the algorithm is not given the steps that involve
finding sites that are pointed to by pages in the root pages
j.Web Communities:
A web community is generated by a group of individuals that share a common
interest. Ex: religious group, sports. Web communities may be discovered by exploring
the web as a graph and identifying sub-graph that have a strong link structure within
them but weak associations with other parts of the graph. Such subgraphs may be
called web-communities or cyber-communities.
The idea of cyber-communities is to find all web communities on any topic by
processing the whole web graph.
79
80
81
1. Collecting Information:
A search engine would normally collect web pages or information about them by
web crawling or by human submission of web pages.
2. Evaluating and categorizing information:
In some cases, fro example, when web pages are submitted to a directory, it may
be necessary to evaluate and decide whether a submitted page should be selected.
3. Creating a database and creating indexes:
The information collected needs to be stored either in a database or some kind of
file system. Indexes must be created so that the information may be searched efficiently.
4. Computing ranks of the web documents:
A variety of methods are being used to determine the rank of each page retrieved
in response to a user query. The information used may include frequency of keywords,
value of in-links & out-links from the page and frequency of use of the page.
5. Checking queries and executing them:
Queries posed by users need to be checked, fro example, fro spelling errors and
whether words in the query are recognizable. Once checked, a query is executed by
searching the search engine database.
6. Presenting results:
How the search engine presents the results to the user is important. The search
engine must determine what results to present and how to display them.
7. Profiling the users:
To improve search performance, the search engines carry out user profiling the
deals with the way users use search engines.
82
No two search engines are exactly the same in terms of sizing, indexing
techniques, page ranking algorithms or speed of search.
Typical search engine architecture is shown below. It consists of many components
including the following three major components to carry out the functions that were
lists above.
The crawler and the indexer: It collects pages from the web, creates and maintains
the index.
The user interface: It allows users to submit queries and enables result presentation.
The database and the query server: It stores information about the web pages and
processes the query and returns results.
Query Representation
checking
Strategy Indexing
User
Selection
Profiling
Query
execution
Result
Presentatio
n
History
Maintenanc
e 83
The crawler:
The crawler or spider or robot or bot) is an application program that carries out a
task similar to graph traversal. It is given set of starting URLs that it uses to
automatically traverse the web by retrieving a page, initially from the starting set.
Some search engines use a number of distributed crawlers. Since a crawler has a
relatively simple task, it is not CPU-bound. It is band width bound. A web crawler must
take into account the load (bandwidth, storage) on the search engine machines and
sometimes also on the machines being traversed in guiding its traversal.
A web crawler starts with a given set of URLs and fetches those pages. This
continues until no new pages are found. Each page found by the crawler is often not
stored as a separate file otherwise four billion pages would require managing four
billion files. Usually lots of pages are stuffed into one file.
b.The Indexer:
Given the size of the web & the number of documents that current search
engines have in their databases, an indexer is essential to reduce the cost of query
evaluation.
84
As an example for inverted index consider the data structure shown below:
Words Page Id
Data 10,18,26,41
Mining 26,30,31,41
Search 72,74,75,76
Engine 72,74,79,82
Google 72,74,90
This index specifies a nu7mber of keywords and the pages that contain the
keywords. If we were searching for “data mining” we look for pages for each keyword
85
“data” and “mining” and then find the interaction of the two lists and obtain page
numbers 26& 41.
Challenging part is when the two lists are too large. Usual approach is to split the
lists across many machines to find the intersection quickly. When the index is
distributed over many machines, the distribution may be based on a local inverted
index or global inverted index.
Local inverted index results in ach machine searching for candidate documents fro
each query while the global inverted index is o distributed that each machine is
responsible for only some of the index terms.
c.Updating the index
As the crawler updates the search engine database, the inverted index must also
be updated. Depending on how the index is stored, incremental updating may be
relatively easy but at some stage after many incremental updates it may become
necessary to rebuild the whole index, which normally will take substantial resources.
d.User profiling
Currently most search engines provide just one type of interface to the user. They
provide an input box in which the user types in t he keywords & then waits for the
results.
Interfaces other than those currently provided are being investigated. They
include form fill-in or a menu or even a simple natural language interface.
e.Query Server
First of all, a search engine needs to receive the query and check the spelling of
keywords that the user has typed. If the search engine cannot recognize the keywords
in the language or proper nouns it is desirable to suggest alternative spellings to the
user.
86
Once the keywords are found to be acceptable, the query may need to be
transformed. Stemming is then carried out and stop words are removed unless they
form part of a phrase that the user wants to search for.
f.Query composition
Query composition is a major problem. Often a user has to submit a query a
number of times using some what different keywords before more or less the “right”
result is obtained.
A search engine providing query refinement based on user feedback would be
useful. Search engines often cache the results of a query and can then the cached results
if the refined query is a modification of a query that has already been processed.
g.Query Processing
A major search engine will run a collection of machines, several replicating the
index. A query may then be broadcast to all that are not heavily loaded. If the query
consists of a number of keywords, then the indexes will need to be searched a number
of times(intersection of results to be found).
A search engine may put weights on different words in a query. Ex: if query has
words “going” “to” “paris”, then the search engine will put greater weight on “paris”.
87
Although more bandwidth is available & cheap, the delay in fetching pages from
the web continues. Common approach is to use web caches & proxies as intermediates
between the client browser processes and the machines serving the web pages. Web
cache or proxy is an intermediary that acts as server on behalf of some other server by
supplying web resources without contacting server.
Essentially a cache is supposed to catch a user’s request, get the page that the
user has requested if one is not available in the cache & then save a copy of it in the
cache. It is assumed that if a page is saved, there is a strong likelihood that the
same/ither user will request it again.
Web has very popular sites which are requested very frequently. The proxy
satisfies such requests without going to the web server. A proxy cache must have
algorithms to replace pages in the hope that the pages being stored in the cache are
fresh and are likely to be pages that users are likely to request.
88
i.Results Presentation
All search engines results ordered according to relevance. Therefore, before
presentation of results to user, the results must be ranked. Relevance ranking is based
on page popularity in terms of number of back links to the page.
How much of what has been found should be presented to the user. Most search
engines present the retrieved items as a list on the screen with some information for
each item, ex: type of document, its URL & a brief extract from the document.
4.2.3 Ranking of Web pages
89
introduces a damping factor-that the random surfer sometimes does not click any link
& jumps to another page. If the damping factor is d(assumed to be between 0 & 1)
then the probability of the surfer jumping off to other page is assumed to be 1-d.
Higher the value of d, surfer will follow one of the links. Given that the surfer has 1-d
probability of jumping to some random page, every page has a minimum page rank of
1-d.
Algorithm works as follows:
Let A be the page whose Page Rank PR(A) is required. Let C(T1) be the number of
links going out from page T1. Page Rank of A is then given by:
PR(A)=(1-d)+d(PR(T1)/C(T1)+PR(T2)/C(T2)+… Where d-damping
factor.
To compute a page rank of A, we need to know the page rank of each page that
points to it. And the number of out-links from each of those pages.
Example:
Let us consider an example of three pages. We are given the following information:
Damping factor=0.8
Page A has an out-link to B
Page B has an out-link to A & another to C
Page C has an out-link to A
Starting page rank for each page is 1
We may show the links between the pages as below:
A B
90
PR(A)=0.2+0.4PR(B)+0.8PR(C)
PR(B)=0.2+0.8PR(A)
PR(C)=0.2+0.4PR(B)
These are three linear equations in 3 unknowns, we may solve them. We write them
as follow of we replace PR(A) by a & others similarly
a-0.4 b-0.8 c=0.2
b-0.8 a =0.2
c-0.4 b =0.2
Solution of the above equation is: a=PR(A)=1.19; b=PR(B)=1.15; C=PR(C)=0.66
A side effect of page rank therefore can be something like this. Ex: Suppose
amazon.com decides to get into the real estate business. Obviously its real estate web
site will have links from all other amazon.com decides to get into the real estate
business.
Obviously its real estate web site will have links from all other amazon.com sites
which have no relevance whatsoever with real estate. When a user then searches for real
estate businesses whatever their quality will appear further down the list.
Because of the nature of the algorithm, page rank does not deal with new pages
fairly since it makes high page rank pages even more popular by serving them at the
91
top of the results. Thus the rich get richer & the poor get poorer. It takes time for a new
page to become popular, even if the new page is of high quality.
i. Other issues in Page Ranking
Some page rank algorithms consider the following in determining page rank:
Page title including keywords
Page content including links, keywords in link text, spelling errors, length of the
page.
Out-links including relevance of links to the content.
In-links &their importance
In-links text-keywords
In-linking page content
Traffic including the number of unique visitors & the time spent on the page.
Yahoo! Has developed its own page ranking algorithm. Algorithm is called Web
Rank. Rank is calculated by analyzing the web page text, title &description as well as
associated links & other unique document characteristics.
92
UNIT-V
5.1 Data warehousing
5.1.1 Introduction
Data warehousing is a process for assembling and managing data from various
sources for the purpose of gaining a single detailed view of an enterprise. Although
there are several definitions of data warehouse, a widely accepted definition by Inmon
(1992) is an integrated subject-oriented and time-variant repository of information in
support of management's decision making process. The definition is similar to the
definition of an ODS except that an ODS is a current-valued data store while a data
warehouse is a time-variant repository of data.
The primary aims in building a data warehouse are to provide a single version of
the truth about the enterprise information and to provide good performance for ad hoc
management queries required for enterprise analysis to manipulate, animate and
synthesize enterprise information.
93
A useful way of showing the relationship between OLTP systems, a data warehouse
and an ODS is given in Figure , The data warehouse, as noted earlier, is more like an
enterprise's long-term memory. The objectives in building the two systems. ODS and
data warehouse, are somewhat conflicting and therefore the two databases are likely to
have different schemas.
94
95
The architecture of a system that includes an ODS and a data warehouse shown in
Figure is more complex. It involves extracting information from source systems by
using an ETL process and then storing the information in a staging database.
The daily changes also come to the staging area. Another ETL process is used to
transform information from the staging area to populate the ODS. The ODS is then used
for supplying information via another ETL process to the data warehouse which' in turn
feeds a number of data marts that generate the reports required by management.
It should be noted that not all ETL processes in this architecture involve data
cleaning, some may only involve data extraction and transformation to suit the target
systems.
96
A data warehouse, does not store real-time data and does not require real-time
updates while the ODS does. The ODS does not have historical information but the data
warehouse does.
A data warehouse can be large but growing only slowly over time. These
differences between an ODS and a data warehouse are summarized in Table
Table Comparison of the ODS and data warehouse (Based on IBM, 2001)
A data mart may be used as a proof of data warehouse concept. Data marts can
also create difficulties by setting up "silos of information" although one may build
dependent data marts, which are populated from the central data warehouse.
Data marts are often the common approach for building a data warehouse since
the cost curve for data marts tends to be more linear. A centralized data warehouse
project can be very resource intensive and requires significant investment at the
beginning although overall costs over a number of years for a centralized data
warehouse and for decentralized data marts are likely to be similar.
Whatever the architecture, a data warehouse needs to have a data model that can
form the basis for implementing it
97
98
99
100
101
102
Star schemas may be refined into snowflake schemas if we wish to provide support for
dimension hierarchies by allowing the dimension tables to have sub tables to represent
A characteristic of a star schema is that all the dimensions directly link to the fact table. The fact table
may look like Table and the dimension tables may look like Tables
Tabic- An example of the fact table
hierarchies.
103
a.Implementation Steps
These steps are based on the work of Chauduri and Dayal.
1. Requirements analysis and capacity planning:
In other projects, the first step in data warehousing involves defining
enterprise needs, defining architecture, carrying out capacity planning and selecting the
hardware and software tools. This step will involve consulting with senior management
as well as with the various stake holders.
2. Hardware integration:
Once the hardware and software have been selected, they need to be put together
by integrating the servers, the storage devices and the client software tools.
3. Modeling:
It is a major step that involves designing the warehouse scheme and views. This
may involve using a modeling tool if the data warehouse is complex.
4. Physical modeling:
This involves designing the physical data warehouse organization, data
placement, data partitioning, deciding on access methods and indexing.
5. Sources:
The data for the data warehouse is likely to come from a number of data sources.
This step involves identifying and connecting the sources using gateways, ODBC
drivers or other wrappers.
104
6. ETL:
This may involve identifying a suitable ETL tool vendor and purchasing and
implementing the tool. This may include customizing the tool to suit the needs of the
enterprise.
b.Implementation Guidelines
These are general guidelines, not all of them applicable to every data warehouse
project.
1. Build incrementally:
Data warehouses must be built incrementally. An enterprise data warehouse can
then be implemented in an iterative manner allowing all data marts to extract
information from the data warehouse.
2. Need a champion:
A data ware house project must have a champion who is willing to carry out
considerable reach in to expected cost and benefits of the project. This has shown that
having a champion can help adoption and success of data warehousing projects.
105
5. Corporate strategy:
The objectives of the project must be clearly defined before the start of the
project. Given the importance of senior management support for a data warehousing
project, the project’s fit with the corporate strategy is essential.
6. Business plan:
Without such understanding, rumors about expenditure and benefits can become
the only source of information, understanding the project.
7. Training:
A data warehouse project must not overlook data warehouse training
requirements. Training of users and professional development of the project team may
also be required since it is a complex task and the skills of the project team are critical to
the success of the project.
8. Adaptability:
The project should build in adaptability so that changes may be made to the data
warehouse if and when required.
106
9. Joint management:
The project must be managed by both IT and business professionals in the
enterprise. To ensure good communications with the stakeholders and that the project
is focused on assisting the enterprise’s business, business professionals must be
involved in the project along with technical professionals.
a.Definition:
Metadata has been defined as “all of the information in the data warehouse
environment that is not the actual data itself”.
It is a structured data which describes the characteristics of the resource. Metadata is
stored in the system itself and can be queried using tools that are available on the
system
We now give several example of metadata that should be familiar to be reader:
1. A library catalogue may be considered metadata. It contains a number of
predefined elements representing specific attributes of a resource, and each
element can have one or more values.
2. The table of contents and the index in a book may be considered metadata for
the book
3. Suppose we say that a data element about a person is 80. Therefore is the
metadata about the data 80.
4. Yet another example of metadata is about the tables and figures in a
document like this book. A table has a name and there are column names of
the table that may be considered metadata.
In the context of a data warehouse, Metadata needs to be much more
comprehensive. It may be classified in to two groups:
1. Back room metadata
2. Front room metadata
107
Much important information included in the back room metadata. This could
include information on what source systems the ETL process uses and their schemas.
The front room metadata is more descriptive and could include information needed by
the users, for example, user security privileges, various statistics about usage,
information on network security and so on.
a.OLAP
It is a primarily a software technology concerned with fast analysis of enterprise
information. OLAP systems are data warehouse front-end software tools to make
aggregate data available efficiently, for advanced analysis, to an enterprise’s managers.
It is essential that an OLAP system provides facilities for a manager to pose and
hoc complex queries to obtain the information that requires.
Another term that is being used increasingly is business intelligence. It is
sometimes used to mean both data warehousing and OLAP. Other times, it has been
defined as a user-centered process of exploring data, data relationships and trends,
thereby helping to improve overall decision making.
OLAP and data warehouse are based on a multidimensional conceptual view of
the enterprise data. For example,
India 10 15 25 50
Australia 5 15 50 70
USA 0 20 15 35
ALL 15 50 90 155
108
5.2.1 Introduction
a. OLAP DEFINITION
OLAP is the dynamic enterprise analysis required to create, manipulate, animate
and synthesis information from exegetical, contemplative and formulaic data analysis
models.
Another definition of OLAP is that OLAP is software technology that enables
analysts, managers and executives to gain insight into data through fast, consistent,
interactive access to a wide variety of possible views of information that has been
transformed from raw data to reflect the real dimensionality of the enterprise as
understood by the user.
An even simpler definition is that OLAP is fast analysis of shares
multidimensional information for advanced analysis. This definition (sometimes called
FASMI) implies that most OLAP queries should be answered within seconds.
109
3. Nature: Although SQL queries return a set of records, OLTP systems are
designed to process one record at a time. OLAP systems are not designed to deal
with individual customer records.
4. Design: OLTP database systems are designed to be application-oriented while
OLAP systems are designed to be subject-oriented.
5. Data: OLTP systems normally deal only with the current status of information.
OLAP systems require historical data over several years since trends are more
important in decision making.
6. Kind of use: OLTP systems are used for read and write operations while OLAP
systems normally do not update the data.
OLTP OLAP
Property
Nature of users Operations workers Decision makers
Functions Mission-critical Management-critical
Nature of queries Mostly simple Most complex
Nature of usage Mostly repetitive Mostly ad hoc
Nature of design Application oriented Subject oriented
Nature of users Thousands Dozens
Nature of data Current, detailed. relational Historical, summarized,
multidimensional
Updates All the time Usually not allowed
c.FASMI Characteristics:
The name derived from the first letters of the characteristics, are:
110
1. Fast: most OLAP queries should be answered very quickly perhaps within
seconds. One approach is to pre-compute the most commonly queried
aggregates and compute the remaining on-the-fly.
2. Analytic: An OLAP system must provide rich analytic functionality and it is
expected that most OLAP queries can be answered without any programming.
3. Shared: An OLAP system is a shared resource although it is unlikely to be shared
by hundreds of users.
4. Multidimensional: OLAP software is being used; it must provide z
multidimensional conceptual view of the data.
5. Information: The system should be handling a large amount of input data.
Codd’s OLAP characteristics:
1. Multidimensional conceptual view: By requiring a multidimensional view, it is
possible to carry out operation like slice and dice.
2. Accessibility (OLAP as a mediator): The OLAP system should be sitting between
data sources (e.g. a data warehouse) and an OLAP front-end.
3. Batch extraction vs interpretive: An OLAP system should provide
multidimensional data staging plus partial pre calculation of aggregates \in
large multidimensional database.
4. Multi-user support: OLAP software should provide many normal database
operations including retrieval, update, concurrency control, integrity and
security.
5. Storing OLAP results: Read-write OLAP applications should not be implemented
directly on live transaction data if OLTP source systems are supplying
information to the OLAP system directly.
6. Extraction of missing values: The OLAP system should distinguish missing
values from zero values.
111
For example deals with student data that has only three dimension. The first relation
(student) provides information about students. The second relation (enrolment)
provides information about the student’s degree and the semester of the first
enrollment. The third relation (degree) provides information about university and
department.
112
India 10 15 25 50
Australia 5 15 50 70
USA 0 20 15 35
ALL 15 50 90 155
A two dimensional table of aggregate for semester 200-01. The table together
now forms a three- dimensional cube. Each of the edges in the cube represented a
dimension. Each dimension has a number of members.
Different types of measures may behave differently in computation. We can also
easily add the number of students in two tables such measures are called additive.
The cube formed by the above tables.
In the three dimensional cube there are some types of questions are possible.
1. Null
2. Degree
3. Semester
113
4. Country
5. Degrees, semester
6. Semester, country
7. Degree, country
8. All
For example, all of the aggregation in the list above can be built by queries like:
114
115
Best suited for Inexperienced users, limited set of Experienced users, queries
queries change frequently
A number of operations may be applied to data cubes. The common ones are:
1. Roll-up
2. Drill-down
3. Slice and dice
4. Pivot
1.Roll-up
Roll-up is like zooming out on the data cube. It is required when the user needs
further abstraction or less detail. This operation performs further aggregations on the
data.
116
2.Drill-down
Drill-down is like zooming in on the data and is therefore the reverse of roll-up.
It is appropriate operation when the user needs further details or when the user wants
to partition more finely or wants to focus on some particular values of certain
dimension. It adds more details to data.
4.Pivot or rotate
The pivot operation is used when the user wishes to re-orient the view of the
data cube. It may involve swapping the rows and columns or moving one of the row
dimensions in to the column dimension.
117
3. Selecting an OLAP tool: The OLAP team should familiarize themselves with the
ROLAP and MOLAP tools available in the market. Since tools are quite different,
careful planning may be required in selecting a tool that is appropriate for the
enterprise. In some situations, a combination of ROLAP and MOLAP may be
most effective.
4. Corporate strategy: The OLAP Strategy should fit with the enterprise strategy
and business objectives. A good fit will result in the OLAP tools being used more
widely.
5. Focus on the users: The OLAP project should be focused on the users. Users
should, in consultation with the technical professionals, decide what task will be
done first what done be later.
6. Joint management: The OLAP project must be managed by both the IT and
business professionals, Many other people should be involved in supplying
ideas. An appropriate committee structure may be necessary to channel these
ideas.
7. Review and adapt: Regular reviews of the project may be required to ensure that
the project is meeting the current needs of the enterprise.
118
SECTION -B (05X05=25)
Answer ALL questions, choosing either (a) or (b)
11. a) Write short notes on data mining applications
(or)
b) Explain about APRIORI Algorithm
12. a) Explain the basic concept of classifications
(or)
b Explain decision tree rules
13. a) Explain about Hierarchical methods
(or)
b) Explain about density-based methods
119
(or)
b Explain search engine architecture in detail.
SECTION - C (03X10=30)
120
SECTION –A (10X02=20)
SECTION -B (05X05=25)
Answer ALL questions, choosing either (a) or (b)
121
SECTION - C (03X10=30)
122
123