
Unit-4

1.1 Classification and Prediction:

Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
Classification predicts categorical (discrete, unordered) labels, whereas prediction models continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either
safe or risky, or a prediction model to predict the expenditures of potential customers on
computer equipment given their income and occupation.
A predictor is constructed that predicts a continuous-valued function, or ordered value, as
opposed to a categorical label.
Regression analysis is a statistical methodology that is most often used for numeric prediction.
Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.
Most algorithms are memory resident, typically assuming a small data size. Recent data mining
research has built on such work, developing scalable classification and prediction techniques
capable of handling large disk-resident data.

1.1.1 Issues Regarding Classification and Prediction:


1.Preparing the Data for Classification and Prediction:
The following preprocessing steps may be applied to the data to help improve the accuracy,
efficiency, and scalability of the classification or prediction process.
(i)Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise (by applying
smoothing techniques) and to treat missing values (e.g., by replacing a missing value
with the most commonly occurring value for that attribute, or with the most probable value
based on statistics).



Although most classification algorithms have some mechanisms for handling noisy or
missing data, this step can help reduce confusion during learning.
(ii)Relevance analysis:
Many of the attributes in the data may be redundant.
Correlation analysis can be used to identify whether any two given attributes are
statistically related.
For example, a strong correlation between attributes A1 and A2 would suggest that one of
the two could be removed from further analysis.
A database may also contain irrelevant attributes. Attribute subset selection can be used in
these cases to find a reduced set of attributes such that the resulting probability distribution
of the data classes is as close as possible to the original distribution obtained using all
attributes.
Hence, relevance analysis, in the form of correlation analysis and attribute subset selection,
can be used to detect attributes that do not contribute to the classification or prediction task.
Such analysis can help improve classification efficiency and scalability.
(iii)Data Transformation and Reduction
The data may be transformed by normalization, particularly when neural networks or
methods involving distance measurements are used in the learning step.
Normalization involves scaling all values for a given attribute so that they fall within a
small specified range, such as -1 to +1 or 0 to 1.
The data can also be transformed by generalizing it to higher-level concepts. Concept
hierarchies may be used for this purpose. This is particularly useful for continuous
valued attributes.

For example, numeric values for the attribute income can be generalized to discrete
ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can
be generalized to higher-level concepts, like city.
Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principal components analysis to discretization techniques, such
as binning, histogram analysis, and clustering.
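As a small illustration of these transformations, the Python sketch below (with made-up income values) applies min-max normalization to the range [0, 1] and a simple equal-width binning into low, medium, and high ranges:

import numpy as np

income = np.array([23000, 41000, 58000, 76000, 95000], dtype=float)

# Min-max normalization: scale each value into the range [0, 1]
normalized = (income - income.min()) / (income.max() - income.min())

# Equal-width binning into three discrete ranges: low, medium, high
edges = np.linspace(income.min(), income.max(), num=4)     # 3 equal-width intervals
labels = np.array(["low", "medium", "high"])
binned = labels[np.digitize(income, edges[1:-1])]

print(normalized)   # values between 0 and 1
print(binned)       # ['low' 'low' 'medium' 'high' 'high']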



1.1.2 Comparing Classification and Prediction Methods:
 Accuracy:
The accuracy of a classifier refers to the ability of a given classifier to correctly predict
the class label of new or previously unseen data (i.e., tuples without class label
information).
The accuracy of a predictor refers to how well a given predictor can guess the value of
the predicted attribute for new or previously unseen data.
 Speed:
This refers to the computational costs involved in generating and using the
given classifier or predictor.
 Robustness:
This is the ability of the classifier or predictor to make correct predictions
given noisy data or data with missing values.
 Scalability:
This refers to the ability to construct the classifier or predictor efficiently
given large amounts of data.
 Interpretability:
This refers to the level of understanding and insight that is provided by the classifier or
predictor.
Interpretability is subjective and therefore more difficult to assess.
1.2 Classification by Decision Tree Induction:
Decision tree induction is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure, where
 Each internal node denotes a test on an attribute.
 Each branch represents an outcome of the test.
 Each leaf node holds a class label.
 The topmost node in a tree is the root node.



The construction of decision tree classifiers does not require any domain knowledge or
parameter setting, and is therefore appropriate for exploratory knowledge discovery.
Decision trees can handle high dimensional data.
Their representation of acquired knowledge in tree form is intuitive and generally easy to
assimilate by humans.
The learning and classification steps of decision tree induction are simple and fast.
In general, decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many application
areas, such as medicine, manufacturing and production, financial analysis, astronomy, and
molecular biology.

1.2.1 Algorithm For Decision Tree Induction:



The algorithm is called with three parameters:
 Data partition
 Attribute list
 Attribute selection method

The parameter attribute list is a list of attributes describing the tuples.


Attribute selection method specifies a heuristic procedure for selecting the attribute that
“best” discriminates the given tuples according to class.
The tree starts as a single node, N, representing the training tuples in D.



If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with
that class.
All of the terminating conditions are explained at the end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion.
The splitting criterion tells us which attribute to test at node N by determining the “best”
way to separate or partition the tuples in D into individual classes.

There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, … ,av}, based on the training data.

1 A is discrete-valued:

In this case, the outcomes of the test at node N correspond directly to the known
values of A.
A branch is created for each known value, aj, of A and labeled with that value.
A need not be considered in any future partitioning of the tuples.

2 A is continuous-valued:

In this case, the test at node N has two possible outcomes, corresponding to the conditions
A <= split point and A > split point, respectively, where split point is the split-point
returned by Attribute selection method as part of the splitting criterion.

3 A is discrete-valued and a binary tree must be produced:

The test at node N is of the form “A ∈ SA?”.


SA is the splitting subset for A, returned by Attribute selection method as part of the splitting
criterion. It is a subset of the known values of A.
Figure: The three partitioning scenarios: (a) A is discrete-valued; (b) A is continuous-valued; (c) A is discrete-valued and a binary tree must be produced.

1.3 Bayesian Classification:

Bayesian classifiers are statistical classifiers.


They can predict class membership probabilities, such as the probability that a given tuple
belongs to a particular class.
Bayesian classification is based on Bayes’ theorem.

1.3.1 Bayes’ Theorem:


Let X be a data tuple. In Bayesian terms, X is considered “evidence”, and it is described by
measurements made on a set of n attributes.


Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis
H holds given the “evidence” or observed data tuple X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes’ theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X).
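Written out in the notation above, Bayes' theorem takes its standard form:

P(H|X) = P(X|H) P(H) / P(X)

where P(H) is the prior probability of H, P(X|H) is the probability of X conditioned on H, and P(X) is the prior probability of X.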

1.3.2 Naïve Bayesian Classification:

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, …,xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, …, An.

2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict
that X belongs to the class having the highest posterior probability, conditioned on X.

That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci|X) > P(Cj|X)   for 1 ≤ j ≤ m, j ≠ i.

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis. By Bayes' theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X).

3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).



4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption
of class conditional independence is made. This presumes that the values of the attributes are
conditionally independent of one another, given the class label of the tuple. Thus,

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training
tuples. For each attribute, we look at whether the attribute is categorical or continuous-valued.
For instance, to compute P(X|Ci), we consider the following:
 If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value
xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
 If Ak is continuous-valued, then we need to do a bit more work, but the calculation is
pretty straightforward.
A continuous-valued attribute is typically assumed to have a Gaussian distribution with a
mean µ and standard deviation σ, defined by

g(x, µ, σ) = (1 / (√(2π) σ)) e^(−(x − µ)² / (2σ²)),   so that   P(xk|Ci) = g(xk, µCi, σCi).

5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci.
The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
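For the continuous-attribute case, the following self-contained Python sketch trains and applies such a classifier; the two-attribute training set, class names, and query tuple are invented for illustration, and Gaussian class-conditional densities are assumed:

import numpy as np

# Hypothetical training data: two numeric attributes, two classes
X = np.array([[25.0, 40.0], [30.0, 55.0], [45.0, 60.0], [50.0, 62.0]])
y = np.array(["C1", "C1", "C2", "C2"])

def train_gaussian_nb(X, y):
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {
            "prior": len(Xc) / len(X),        # P(Ci)
            "mean": Xc.mean(axis=0),          # per-attribute mean for class Ci
            "std": Xc.std(axis=0) + 1e-9,     # per-attribute std (avoid division by zero)
        }
    return model

def predict(model, x):
    best_class, best_score = None, -np.inf
    for c, p in model.items():
        # log P(Ci) + sum_k log g(x_k; mean, std), using class-conditional independence
        log_gauss = -0.5 * np.log(2 * np.pi * p["std"] ** 2) \
                    - (x - p["mean"]) ** 2 / (2 * p["std"] ** 2)
        score = np.log(p["prior"]) + log_gauss.sum()
        if score > best_score:
            best_class, best_score = c, score
    return best_class

model = train_gaussian_nb(X, y)
print(predict(model, np.array([28.0, 50.0])))   # expected to favour class C1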

1.4 A Multilayer Feed-Forward Neural Network:


The backpropagation algorithm performs learning on a multilayer feed-
forward neural network.

It iteratively learns a set of weights for prediction of the class label of tuples.



A multilayer feed-forward neural network consists of an input layer, one or
more hidden layers, and an output layer.

Example:

The inputs to the network correspond to the attributes measured for each training tuple. The
inputs are fed simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a second layer
known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The
number of hidden layers is arbitrary.
The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network's prediction for the given tuples.

1.4.1 Classification by Backpropagation:


Backpropagation is a neural network learning algorithm.
A neural network is a set of connected input/output units in which each connection has a
weight associated with it.
During the learning phase, the network learns by adjusting the weights so as to be able to
predict the correct class label of the input tuples.



Neural network learning is also referred to as connectionist learning due to the connections
between units.

Neural networks involve long training times and are therefore more suitable for applications
where this is feasible.
Backpropagation learns by iteratively processing a data set of training tuples, comparing
the network’s prediction for each tuple with the actual known target value.
The target value may be the known class label of the training tuple (for classification
problems) or a continuous value (for prediction).
For each training tuple, the weights are modified so as to minimize the mean squared error
between the network's prediction and the actual target value. These modifications are made
in the “backwards” direction, that is, from the output layer, through each hidden layer, down
to the first hidden layer; hence the name backpropagation.
Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.
Advantages:

Advantages of neural networks include their high tolerance of noisy data as well as their
ability to classify patterns on which they have not been trained.
They can be used when you may have little knowledge of the relationships between
attributes and classes.
They are well-suited for continuous-valued inputs and outputs, unlike most decision tree
algorithms.
They have been successful on a wide array of real-world data, including handwritten
character recognition, pathology and laboratory medicine, and training a computer to
pronounce English text.
Neural network algorithms are inherently parallel; parallelization techniques can be used
to speed up the computation process.
Process:
Initialize the weights:

The weights in the network are initialized to small random numbers



ranging from -1.0 to 1.0, or -0.5 to 0.5. Each unit has a bias associated with it. The biases are
similarly initialized to small random numbers.



Each training tuple, X, is processed by the following steps.

Propagate the inputs forward:

First, the training tuple is fed to the input layer of the network. The inputs pass through the input
units, unchanged. That is, for an input unit j, its output, Oj, is equal to its input value, Ij. Next, the
net input and output of each unit in the hidden and output layers are computed. The net input to a
unit in the hidden or output layers is computed as a linear combination of its inputs.
Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to
it in the previous layer. Each connection has a weight. To compute the net input to the unit, each
input connected to the unit is multiplied by its corresponding weight, and this is summed:

Ij = Σi wij Oi + θj

where wij is the weight of the connection from unit i in the previous layer to unit j;
Oi is the output of unit i from the previous layer; and
θj is the bias of the unit. The bias acts as a threshold in that it serves to vary the activity of the unit.

Each unit in the hidden and output layers takes its net input and then applies an activation
function to it. Using the logistic (sigmoid) function, the output of unit j is Oj = 1 / (1 + e^(−Ij)).



Backpropagate the error:

The error is propagated backward by updating the weights and biases to reflect the error of
the network's prediction. For a unit j in the output layer, the error Errj is computed by

Errj = Oj (1 − Oj)(Tj − Oj)

where Oj is the actual output of unit j, and Tj is the known target value of the given
training tuple.
The error of a hidden layer unit j is

Errj = Oj (1 − Oj) Σk Errk wjk

where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and
Errk is the error of unit k.
Weights are updated by the following equations, where Δwij is the change in weight wij and l is the learning rate:

Δwij = (l) Errj Oi,    wij = wij + Δwij

Biases are updated by the equations below:

Δθj = (l) Errj,    θj = θj + Δθj


Algorithm:
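As a concrete illustration of the algorithm, the following Python sketch performs one training epoch of backpropagation over a tiny one-hidden-layer network, using the sigmoid activation and the update equations above; the data, layer sizes, and learning rate are chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])   # training tuples
T = np.array([[1.0], [1.0], [0.0], [0.0]])                        # known target values
l = 0.5                                                           # learning rate

# Initialize weights and biases to small random numbers in (-0.5, 0.5)
W1, b1 = rng.uniform(-0.5, 0.5, (2, 3)), rng.uniform(-0.5, 0.5, 3)  # input -> hidden
W2, b2 = rng.uniform(-0.5, 0.5, (3, 1)), rng.uniform(-0.5, 0.5, 1)  # hidden -> output

sigmoid = lambda I: 1.0 / (1.0 + np.exp(-I))

for x, t in zip(X, T):                 # one epoch over the training tuples
    # Propagate the inputs forward
    O1 = sigmoid(x @ W1 + b1)          # hidden-layer outputs
    O2 = sigmoid(O1 @ W2 + b2)         # output-layer outputs

    # Backpropagate the error
    Err2 = O2 * (1 - O2) * (t - O2)            # output-layer error
    Err1 = O1 * (1 - O1) * (W2 @ Err2)         # hidden-layer error

    # Update weights and biases
    W2 += l * np.outer(O1, Err2); b2 += l * Err2
    W1 += l * np.outer(x, Err1);  b1 += l * Err1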

1.5 k-Nearest-Neighbor Classifier:


Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a
given test tuple with training tuples that are similar to it.
The training tuples are described by n attributes. Each tuple represents a point in an n-
dimensional space. In this way, all of the training tuples are stored in an n-dimensional
pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the
pattern space for the k training tuples that are closest to the unknown tuple. These k training
tuples are the k nearest neighbors of the unknown tuple.



Closeness is defined in terms of a distance metric, such as Euclidean distance.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and
X2 = (x21, x22, …, x2n), is

dist(X1, X2) = √( Σi (x1i − x2i)² ),  with the sum taken over i = 1, …, n.

In other words, for each numeric attribute, we take the difference between the corresponding
values of that attribute in tuple X1and in tuple X2, square this difference, and accumulate it.
The square root is taken of the total accumulated distance count.
Min-max normalization can be used to transform a value v of a numeric attribute A to v' in
the range [0, 1] by computing

v' = (v − minA) / (maxA − minA)

where minA and maxA are the minimum and maximum values of attribute A.

For k-nearest-neighbor classification, the unknown tuple is assigned the most common
class among its k nearest neighbors.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to
it in pattern space.
Nearest neighbor classifiers can also be used for prediction, that is, to return a real-valued
prediction for a given unknown tuple.
In this case, the classifier returns the average value of the real-valued labels associated
with the k nearest neighbors of the unknown tuple.
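A minimal numpy sketch of the procedure just described, combining min-max normalization with a majority vote over the k nearest neighbors; the sample data and the value of k are chosen arbitrarily:

import numpy as np
from collections import Counter

X_train = np.array([[25, 40000], [30, 61000], [45, 80000], [50, 30000]], dtype=float)
y_train = np.array(["yes", "yes", "no", "no"])
k = 3

# Min-max normalize each attribute to [0, 1] so no attribute dominates the distance
mins, maxs = X_train.min(axis=0), X_train.max(axis=0)
norm = lambda X: (X - mins) / (maxs - mins)
X_norm = norm(X_train)

def knn_classify(x):
    d = np.sqrt(((X_norm - norm(x)) ** 2).sum(axis=1))    # Euclidean distances
    nearest = np.argsort(d)[:k]                            # indices of the k closest tuples
    return Counter(y_train[nearest]).most_common(1)[0][0]  # most common class among them

print(knn_classify(np.array([28, 55000], dtype=float)))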

1.6 Other Classification Methods:

1.6.1 Genetic Algorithms:

Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning
starts as follows.
An initial population is created consisting of randomly generated rules. Each rule can be



represented by a string of bits. As a simple example, suppose that samples in a given
training set are described by two Boolean attributes, A1 and A2, and that there are
two classes, C1 and C2.
The rule “IF A1 AND NOT A2 THEN C2” can be encoded as the bit string “100”, where
the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit
represents the class.
Similarly, the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as “001”.
If an attribute has k values, where k > 2, then k bits may be used to encode the attribute’s
values.
Classes can be encoded in a similar fashion.
Based on the notion of survival of the fittest, a new population is formed to consist of
the fittest rules in the current population, as well as offspring of these rules.
Typically, the fitness of a rule is assessed by its classification accuracy on a set of training
samples.
Offspring are created by applying genetic operators such as crossover and mutation.
In crossover, substrings from pairs of rules are swapped to form new pairs of rules.
In mutation, randomly selected bits in a rule’s string are inverted.
The process of generating new populations based on prior populations of rules continues
until a population, P, evolves where each rule in P satisfies a prespecified fitness
threshold.
Genetic algorithms are easily parallelizable and have been used for classification as
well as other optimization problems. In data mining, they may be used to evaluate the
fitness of other algorithms.
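A toy sketch of this bit-string scheme is given below; the training samples, population size, crossover point, and mutation rate are invented, and a rule's classification accuracy on the samples serves as its fitness:

import random
random.seed(1)

# Hypothetical training samples: (A1, A2, class), with the class encoded as 0 or 1
samples = [(1, 0, 1), (1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 0, 1)]

def fitness(rule):
    # rule is a 3-bit string: bits for A1, A2, and the predicted class
    a1, a2, cls = (int(b) for b in rule)
    hits = sum(1 for s in samples if (s[0], s[1]) == (a1, a2) and s[2] == cls)
    covered = sum(1 for s in samples if (s[0], s[1]) == (a1, a2))
    return hits / covered if covered else 0.0      # accuracy of the rule on covered samples

def crossover(r1, r2, point=1):
    # Swap substrings of a pair of rules to form a new pair of rules
    return r1[:point] + r2[point:], r2[:point] + r1[point:]

def mutate(rule, rate=0.1):
    # Invert randomly selected bits in the rule's string
    return "".join(b if random.random() > rate else str(1 - int(b)) for b in rule)

population = ["".join(random.choice("01") for _ in range(3)) for _ in range(6)]
for _ in range(20):                                  # generate new populations
    population.sort(key=fitness, reverse=True)
    parents = population[:2]                         # survival of the fittest
    children = crossover(*parents)
    population = parents + [mutate(c) for c in children] + population[2:4]

print(max(population, key=fitness))                  # best-scoring rule in the final population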

1.6.2 Fuzzy Set Approaches:


Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership
that a certain value has in a given category. Each category then represents a fuzzy set.
Fuzzy logic systems typically provide graphical tools to assist users in converting attribute
values to fuzzy truth-values.
Fuzzy set theory is also known as possibility theory.



It was proposed by Lotfi Zadeh in 1965 as an alternative to traditional two-valued logic and
probability theory.
It lets us work at a high level of abstraction and offers a means for dealing with imprecise
measurement of data.
Most important, fuzzy set theory allows us to deal with vague or inexact facts.
Unlike the notion of traditional “crisp” sets, where an element either belongs to a set S or to its
complement, in fuzzy set theory elements can belong to more than one fuzzy set.
Fuzzy set theory is useful for data mining systems performing rule-based classification.
It provides operations for combining fuzzy measurements.
Several procedures exist for translating the resulting fuzzy output into a defuzzified or crisp
value that is returned by the system.
Fuzzy logic systems have been used in numerous areas for classification, including
market research, finance, health care, and environmental engineering.

Example:

1.7 Regression Analysis:


Regression analysis can be used to model the relationship between one or more independent
or predictor variables and a dependent or response variable which is continuous-valued.
In the context of data mining, the predictor variables are the attributes of interest describing
the tuple (i.e., making up the attribute vector).



In general, the values of the predictor variables are known.



The response variable is what we want to predict.

1.7.1 Linear Regression:


Straight-line regression analysis involves a response variable, y, and a single predictor
variable x.
It is the simplest form of regression, and models y as a linear function of x.
That is, y = b + wx,
where the variance of y is assumed to be constant, and b and w are regression coefficients
specifying the Y-intercept and slope of the line.
The regression coefficients, w and b, can also be thought of as weights, so that we can
equivalently write y = w0 + w1x.
These coefficients can be solved for by the method of least squares, which estimates the
best-fitting straight line as the one that minimizes the error between the actual data and
the estimate of the line.
Let D be a training set consisting of values of predictor variable, x, for some population and
their associated values for response variable, y. The training set contains |D| data points of
the form(x1, y1), (x2, y2), … , (x|D|, y|D|).
The regression coefficients can be estimated using this method with the following equations:

w1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²,    w0 = ȳ − w1 x̄,

where x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|.
The coefficients w0 and w1 often provide good approximations to otherwise complicated
regression equations.
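A short numpy sketch of the least-squares estimates above, applied to a made-up training set D of (x, y) pairs:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # predictor values
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])    # response values

x_mean, y_mean = x.mean(), y.mean()
w1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()   # slope
w0 = y_mean - w1 * x_mean                                              # Y-intercept

print(f"y = {w0:.2f} + {w1:.2f} x")          # fitted straight line
print(w0 + w1 * 6.0)                         # predict the response for x = 6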
1.7.2 Multiple Linear Regression:
It is an extension of straight-line regression so as to involve more than one predictor
variable.



It allows response variable y to be modeled as a linear function of, say, n predictor
variables or attributes, A1, A2, …, An, describing a tuple, X.
An example of a multiple linear regression model based on two predictor attributes or
variables, A1 and A2, is y = w0+w1x1+w2x2
where x1 and x2 are the values of attributes A1 and A2, respectively, in X.
Multiple regression problems are instead commonly solved with the use of statistical
software packages, such as SAS, SPSS, and S-Plus.

1.7.3 Nonlinear Regression:


It can be modeled by adding polynomial terms to the basic linear model.
By applying transformations to the variables, we can convert the nonlinear model into a linear
one that can then be solved by the method of least squares.
Polynomial regression is a special case of multiple regression. That is, the addition of high-order
terms like x², x³, and so on, which are simple functions of the single variable x, can be
considered equivalent to adding new independent variables.
Transformation of a polynomial regression model to a linear regression model:
Consider a cubic polynomial relationship given by
y = w0 + w1x + w2x² + w3x³
To convert this equation to linear form, we define new variables:
x1 = x, x2 = x², x3 = x³
It can then be converted to linear form by applying the above assignments, resulting in
the equation y = w0 + w1x1 + w2x2 + w3x3,
which is easily solved by the method of least squares using software for regression analysis.
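This transformation can be sketched in a few lines of numpy (the sample data are invented); the cubic fit reduces to an ordinary linear least-squares problem over the new variables x1, x2, x3:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.2, 9.5, 29.0, 66.0])             # roughly cubic, made-up data

# New variables: x1 = x, x2 = x^2, x3 = x^3 (plus a column of 1s for w0)
A = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
w, *_ = np.linalg.lstsq(A, y, rcond=None)              # [w0, w1, w2, w3]

print(w)                                               # coefficients of the cubic model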

1.8 Classifier Accuracy:


The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier.
In the pattern recognition literature, this is also referred to as the overall recognition rate of
the classifier, that is, it reflects how well the classifier recognizes tuples of the various classes.



The error rate or misclassification rate of a classifier M is simply 1 − Acc(M), where
Acc(M) is the accuracy of M.
The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples
of different classes.
True positives refer to the positive tuples that were correctly labeled by the classifier.
True negatives are the negative tuples that were correctly labeled by the classifier.
False positives are the negative tuples that were incorrectly labeled as positive.
To assess how well the classifier recognizes each class, the sensitivity and specificity measures can be used.
Accuracy is a function of sensitivity and specificity:

sensitivity = t_pos / pos
specificity = t_neg / neg
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity · pos / (pos + neg) + specificity · neg / (pos + neg)

where t_pos is the number of true positives, pos is the number of positive tuples,
t_neg is the number of true negatives, neg is the number of negative tuples, and
f_pos is the number of false positives.
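These measures are easy to compute from the confusion-matrix counts; a small sketch with made-up counts:

t_pos, f_neg = 90, 10        # outcomes for the positive tuples (pos = 100)
t_neg, f_pos = 940, 60       # outcomes for the negative tuples (neg = 1000)

pos, neg = t_pos + f_neg, t_neg + f_pos
sensitivity = t_pos / pos
specificity = t_neg / neg
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)

print(sensitivity, specificity, precision, accuracy)   # 0.9, 0.94, 0.6, ~0.936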



 CLUSTERING IN DATA MINING

Clustering is an unsupervised machine learning technique that groups data points into clusters so
that objects in the same group are similar to one another. The process of grouping a set of
physical or abstract objects into classes of similar objects is called clustering. Clustering helps to
split data into several subsets. Each of these subsets contains data similar to each other, and these
subsets are called clusters. A cluster is a collection of data objects that are similar to one another
within the same cluster and are dissimilar to the objects in other clusters. Clustering is also called
data segmentation in some applications because clustering partitions large data sets into groups
according to their similarity. Clustering can also be used for outlier detection.

For example, suppose we are a marketing manager with a new, tempting product to sell. We are
sure that the product would bring enormous profit, as long as it is sold to the right people. So how
can we tell who is best suited for the product from our company's huge customer base? If the
customer base is divided into clusters, we can make an informed decision about who we think is
best suited for this product.

Figure 1: Application of Clustering Algorithm

• In machine learning, clustering is an example of unsupervised learning. Unlike


classification, clustering and unsupervised learning do not rely on predefined classes and
class-labeled training examples. For this reason, clustering is a form of learning by
observation, rather than learning by examples.



A Categorization of Major Clustering Methods

(1) Partitioning methods

A partitioning method classifies the data into k groups, which together satisfy the following requirements:

each group must contain at least one object, and


each object must belong to exactly one group
• It then uses an iterative relocation technique that attempts to improve the partitioning by
moving objects from one group to another.
• The general criterion of a good partitioning is that objects in the same cluster are “close”
or related to each other, whereas objects of different clusters are “far apart” or very
different.
• Popular heuristic methods include:
(1) the k-means algorithm, where each cluster is represented by the mean value of
the objects in the cluster, and
(2) the k-medoids algorithm, where each cluster is represented by one of the
objects located near the center of the cluster.

1 (a) Centroid-Based Technique: The k-Means Method


• The k-means algorithm takes the input parameter, k, and partitions a set of n objects into
k clusters so that the resulting intracluster similarity is high but the intercluster
similarity is low.
• Cluster similarity is measured in regard to the mean value of the objects in a cluster
• First, it randomly selects k of the objects, each of which initially represents a cluster
mean or center.
For each of the remaining objects, an object is assigned to the cluster to which it is
the most similar, based on the distance between the object and the cluster mean.

Figure 2: Clustering of a set of objects based on the k-means method
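A compact numpy sketch of the k-means loop just described; the random data and k = 2 are used purely for illustration:

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(20, 2)) + np.repeat([[0, 0], [5, 5]], 10, axis=0)   # two blobs
k = 2

centers = data[rng.choice(len(data), size=k, replace=False)]   # random initial means
for _ in range(10):
    # Assign each object to the cluster with the nearest mean (Euclidean distance)
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Recompute each cluster mean from its assigned objects
    # (no empty-cluster handling; fine for this toy, well-separated data)
    centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])

print(centers)    # final cluster means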

1(b) Representative Object-Based Technique: The k-Medoids Method



• The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data.
• Instead of taking the mean value of the objects in a cluster as a reference point, we can
pick actual objects to represent the clusters, using one representative object per cluster.
• Each remaining object is clustered with the representative object to which it is the most
similar.
• The partitioning method is then performed based on the principle of minimizing the sum
of the dissimilarities between each object and its corresponding reference point; that is, an
absolute-error criterion E = Σj Σ(p in Cj) |p − oj| is used, where p is the point in space
representing a given object in cluster Cj, and oj is the representative object of Cj. In general,
the algorithm iterates until, eventually, each representative object is actually the medoid,
or most centrally located object, of its cluster.
• Case 1: p currently belongs to representative object, oj. If oj is replaced by o random as a
representative object and p is closest to one of the other representative objects, oi,
• i not equal j, then p is reassigned to oi.
• Case 2: p currently belongs to representative object, oj. If oj is replaced by o random as a
representative object and p is closest to o random, then p is reassigned to o random.
• Case 3: p currently belongs to representative object, oi, i not equal j. If oj is replaced by
o random as a representative object and p is still closest to oi, then the assignment does
not change.
• Case 4: p currently belongs to representative object, oi, i not equal j. If oj is replaced by o
random as a representative object and p is closest to o random, then p is reassigned to o
random.

Figure 3: Four cases of the function k-medoids clustering

(2) Hierarchical methods



A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical clustering method works by grouping data objects into a tree of clusters.

Classification:

o Agglomerative & Divisive Hierarchical Clustering


o CURE
o Chameleon

There are two approaches to improving the quality of hierarchical clustering:

perform careful analysis of object “linkages” at each hierarchical partitioning, such


as in Chameleon, or
integrate hierarchical agglomeration and other approaches by first using a
hierarchical agglomerative algorithm to group objects into microclusters, and then
performing macroclustering on the microclusters using another clustering method
such as iterative relocation, as in BIRCH

(2a) Agglomerative Hierarchical Clustering

• A hierarchical method can be classified as either agglomerative or divisive. The agglomerative,
or bottom-up, approach starts with each object forming a separate group. It successively merges the
objects or groups that are close to one another, until all of the groups are merged into one
(the topmost level of the hierarchy), or until a termination condition is satisfied.

• The divisive approach, also called the top-down approach, starts with all of the objects in the
same cluster.

• In each successive iteration, it subdivides a cluster into smaller and smaller pieces, until
each object forms a cluster on its own or until certain termination conditions are satisfied,
such as each cluster being within a certain diameter threshold.

Figure 4: Agglomerative and Divisve Hierarchical clustering



EXAMPLE:

• The figure shows AGNES (AGglomerative NESting), an agglomerative hierarchical


clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering
method, to a data set of five objects,{a, b, c, d, e }.

• Initially, AGNES places each object into a cluster of its own.

• The clusters are then merged step-by-step according to some criterion.

• For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2


form the minimum Euclidean distance between any two objects from different clusters.

• This is a single-linkage approach in that each cluster is represented by all of the objects in
the cluster, and the similarity between two clusters is measured by the similarity of the
closest pair of data points belonging to different clusters.

• The cluster merging process repeats until all of the objects are eventually merged to form
one cluster.

• In DIANA, all of the objects are used to form one initial cluster.

• The cluster is split according to some principle, such as the maximum Euclidean distance
between the closest neighboring objects in the cluster.

• The cluster splitting process repeats until, eventually, each new cluster contains only a
single object
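A brief single-linkage sketch in the spirit of AGNES, using scipy's hierarchical clustering routines (assuming scipy is installed); the five 2-D points stand in for the objects {a, b, c, d, e}:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five objects, a..e, as 2-D points (made-up coordinates)
points = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.2], [9.0, 1.0]])

# Single linkage: merge the two clusters whose closest members are nearest
Z = linkage(points, method="single", metric="euclidean")

# Cut the dendrogram into, say, 3 clusters
print(fcluster(Z, t=3, criterion="maxclust"))   # cluster label per object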

(2b) CURE

The CURE algorithm assumes a Euclidean distance. It allows clusters to assume any
shape. It uses a collection of representative points to represent clusters.

Figure 5: Clusters of different shapes



For example, a dataset of engineers and humanities people is shown with their salary
and age.

Figure 6: Dataset representation in terms of salary and age

Forming two clusters over this dataset of engineers and humanities people produces clusters that
overlap with each other, which does not give a useful solution.

Figure 7: Two cluster formation

Creating three clusters improves the segregation, but even after cluster formation one cluster
still contains values from both groups of the dataset.



Figure 8: Three cluster formation

Algorithm for cure:

Pass 1 of 2:

Pick a random sample of points that fit in main memory.


Cluster sample points hierarchically to create the initial clusters.
Pick representatives points:
o For each cluster, pick k (e.g., 4) representative points, as dispersed as
possible
o Move each representative point a fixed fraction (e.g., 20%) toward the
centroid of the cluster

Figure 9: Representative points or remote points in cluster



Figure 10: Remote points moving 20% toward centroid

Pass 2 of 2:

Now, rescan the whole dataset and visit each point p in the data set.
Place it in the “closest cluster”:
o Closest: the cluster with the representative point closest to p, among all the
representative points of all the clusters.

(2C) Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling

• Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to


determine the similarity between pairs of clusters.

• It was derived based on the observed weaknesses of two hierarchical clustering


algorithms: ROCK (ignores cluster nearness) and CURE (ignores cluster
interconnectivity)

How does Chameleon work?

• Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph, where


each vertex of the graph represents a data object, and there exists an edge between two
vertices (objects) if one object is among the k-most-similar objects of the other.

• The edges are weighted to reflect the similarity between objects. Chameleon uses a graph
partitioning algorithm to partition the k-nearest-neighbor graph into a large number of
relatively small subclusters.



• It then uses an agglomerative hierarchical clustering algorithm that repeatedly merges
subclusters based on their similarity.

• To determine the pairs of most similar subclusters, it takes into account both the
interconnectivity as well as the closeness of the clusters

Figure 11: Chameleon – Hierarchical clustering based on k-nearest and dynamic modeling

Figure 12: Overall Framework of Chameleon



(3) Density-Based Methods

To discover clusters with arbitrary shape, density-based clustering methods have been
developed.

These typically regard clusters as dense regions of objects in the data space that are
separated by regions of low density (representing noise).

DBSCAN grows clusters according to a density-based connectivity analysis.

OPTICS extends DBSCAN to produce a cluster ordering obtained from a wide range
of parameter settings

(3a) DBSCAN: A Density-Based Clustering Method Based on Connected Regions


with Sufficiently High Density

• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density


based clustering algorithm.

• The algorithm grows regions with sufficiently high density into clusters and discovers
clusters of arbitrary shape in spatial databases with noise.

• It defines a cluster as a maximal set of density-connected points.

• The basic ideas of density-based clustering involve a number of new definitions

The neighborhood within a radius ɛ of a given object is called the ɛ-neighborhood of the
object.

If the ɛ-neighborhood of an object contains at least a minimum number, MinPts, of


objects, then the object is called a core object



Figure 13: Density reachability and density connectivity in density-based clustering

Example:

Consider the figure for a given ɛ, represented by the radius of the circles, and, say, let
MinPts = 3.

• Of the labeled points, m, p, o, and r are core objects because each is in an ɛ-neighborhood
containing at least three points.

• q is directly density-reachable from m. m is directly density-reachable from p and vice


versa.

• q is (indirectly) density-reachable from p because q is directly density-reachable from m


and m is directly density-reachable from p.

• However, p is not density-reachable from q because q is not a core object. Similarly, r


and s are density-reachable from o, and o is density-reachable from r.

• o, r, and s are all density-connected.

How does DBSCAN find clusters

• A density-based cluster is a set of density-connected objects that is maximal with respect


to density-reachability.



• DBSCAN searches for clusters by checking the ɛ-neighborhood of each point in the
database.

• If the ɛ-neighborhood of a point p contains more than MinPts points, a new cluster with p as a
core object is created.

• DBSCAN then iteratively collects directly density-reachable objects from these core
objects, which may involve the merge of a few density-reachable clusters.

• The process terminates when no new point can be added to any cluster.
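A minimal sketch of this search in pure Python; the points, ɛ, and MinPts are arbitrary, and border-point handling is simplified:

import numpy as np

points = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3], [5, 5], [5.1, 5.2], [5.2, 4.9], [9, 9]])
eps, min_pts = 0.6, 3

def neighbors(i):
    return [j for j in range(len(points)) if np.linalg.norm(points[i] - points[j]) <= eps]

labels = [None] * len(points)         # None = unvisited, -1 = noise
cluster_id = 0
for i in range(len(points)):
    if labels[i] is not None:
        continue
    seeds = neighbors(i)
    if len(seeds) < min_pts:          # not a core object
        labels[i] = -1
        continue
    cluster_id += 1                   # start a new cluster around core object i
    labels[i] = cluster_id
    while seeds:                      # collect directly density-reachable objects
        j = seeds.pop()
        if labels[j] in (None, -1):
            labels[j] = cluster_id
            nj = neighbors(j)
            if len(nj) >= min_pts:    # j is also a core object: expand further
                seeds.extend(nj)
    # the cluster is now a maximal set of density-connected points

print(labels)                         # e.g. [1, 1, 1, 2, 2, 2, -1]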

(3b) OPTICS: Ordering Points to Identify the Clustering Structure

• Rather than produce a data set clustering explicitly, OPTICS computes an augmented
cluster ordering for automatic and interactive cluster analysis.

• This ordering represents the density-based clustering structure of the data.

It contains information that is equivalent to density-based clustering obtained from a wide


range of parameter settings

• To construct the different clusterings simultaneously, the objects should be processed in


a specific order.

• This order selects an object that is density-reachable with respect to the lowest ɛ value, so
that clusters with higher density (lower ɛ) will be finished first.

Based on this idea, two values need to be stored for each object—core-distance and
reachability-distance:

• The core-distance of an object p is the smallest ɛ' value that makes {p} a core object. If p
is not a core object, the core-distance of p is undefined.

• The reachability-distance of an object q with respect to another object p is the greater


value of the core-distance of p and the Euclidean distance between p and q. If p is not a
core object, the reachability-distance between p and q is undefined.



Figure 13 : OPTICS Terminology

Core-distance and reachability-distance

• Figure illustrates the concepts of core distance and reachability-distance.

• Suppose that ɛ =6 mm and MinPts=5. The core distance of p is the distance, ɛ’, between p
and the fourth closest data object.

• The reachability-distance of q1 with respect to p is the core-distance of p (i.e., ɛ’ =3 mm)


because this is greater than the Euclidean distance from p to q1.

• The reachability distance of q2 with respect to p is the Euclidean distance from p to q2


because this is greater than the core-distance of p.

4. Grid-Based Methods

• The grid-based clustering approach uses a multi resolution grid data structure.

• It quantizes the object space into a finite number of cells that form a grid structure on
which all of the operations for clustering are performed.

• The main advantage of the approach is its fast processing time, which is typically
independent of the number of data objects, yet dependent on only the number of cells in
each dimension in the quantized space.

• Some typical examples of the grid-based approach include STING, which explores
statistical information stored in the grid cells; Wave Cluster, which clusters objects using
a wavelet transform method; and CLIQUE, which represents a grid-and density-based
approach for clustering in high-dimensional data space



4 a. STING: Statistical Information Grid

• STING is a grid-based multi resolution clustering technique in which the spatial area is
divided into rectangular cells.

• There are usually several levels of such rectangular cells corresponding to different levels
of resolution, and these cells form a hierarchical structure:

• each cell at a high level is partitioned to form a number of cells at the next lower level.

• Statistical information regarding the attributes in each grid cell (such as the mean,
maximum, and minimum values) is precomputed and stored

4 b. CLIQUE: A Dimension-Growth Subspace Clustering Method

• CLIQUE (CLustering InQUEst) was the first algorithm proposed for dimension-growth
subspace clustering in high-dimensional space.

• In dimension-growth subspace clustering, the clustering process starts at single-


dimensional subspaces and grows upward to higher-dimensional ones.

• Because CLIQUE partitions each dimension like a grid structure and determines whether
a cell is dense based on the number of points it contains, it can also be viewed as an
integration of density-based and grid-based clustering methods

The ideas of the CLIQUE clustering algorithm are outlined as follows.

• Given a large set of multidimensional data points, the data space is usually not uniformly
occupied by the data points.

• CLIQUE’s clustering identifies the sparse and the “crowded” areas in space (or units),
thereby discovering the overall distribution patterns of the data set.

• A unit is dense if the fraction of total data points contained in it exceeds an input model
parameter



Figure 14: Density and Grid based clustering

Difference between STING and OPTICS


The method of identifying similar groups of data in a data set is called clustering. Entities in each
group are comparatively more similar to entities of that group than to those of the other groups.

STING (Statistical Information Grid) and OPTICS (Ordering Points To Identify the Clustering
Structure) are clustering algorithms used in unsupervised learning. They are machine learning
techniques used to group the given input data points into clusters on the basis of their attributes.
STING is a grid-based clustering algorithm, while OPTICS is a density-based clustering algorithm.

A point-by-point comparison of STING and OPTICS:

1. STING is an abbreviation for Statistical Information Grid, whereas OPTICS is an abbreviation
for Ordering Points To Identify the Clustering Structure.
2. STING is a grid-based clustering algorithm, whereas OPTICS is a density-based clustering algorithm.
3. STING is concerned not with the data points themselves but with the value space that surrounds
the data points, whereas OPTICS searches the data space for areas of varied density of data points.
4. STING uses a multi-dimensional grid data structure that quantizes space into a finite number of
cells, whereas OPTICS is an extension of density-based spatial clustering of applications with noise (DBSCAN).
5. Properties of the STING clustering algorithm:
 The spatial area is divided into rectangular cells.
 There are several levels of cells at different levels of resolution.
 A high-level cell is partitioned into several low-level cells.
 Statistical attributes are stored in each cell; for instance, mean, maximum, and minimum are
some of the statistical measures used.
 Statistical information is calculated for each cell, and the types of distribution calculated are
normal and exponential.
Properties of the OPTICS clustering algorithm:
 It is an extension of DBSCAN that addresses parameter settings which can lead to the
discovery of unacceptable clusters.
 The core distance is the smallest distance that makes a point a core point.
 Two important parameters are required for OPTICS: epsilon (“eps”) and minimum points (“MinPts”).
 The parameter eps defines the radius of the neighborhood around a point P, and MinPts is the
minimum number of neighbors within the eps radius.
 Density is the number of points within a specified radius r (eps).
6. STING has relatively less computational complexity, whereas OPTICS has relatively more
computational complexity.

STING Algorithm:

1. Determine a layer, to begin with.


2. For each cell of this layer, we calculate the confidence interval (or estimated range) of
probability that this cell is relevant to the query.
3. From the interval calculated above, we label the cell as relevant or not relevant.
4. If this is the bottom layer, then end the process.
5. We go down the hierarchy structure by one level. Go to Step 2 for those levels that form the
relevant cells of the higher-level layer.

STING Hierarchy Diagram :



OPTICS Algorithm:
The core distance of a point P is the smallest distance such that the neighborhood of P has at least MinPts
points.
The reachability distance of q1 with respect to p is the core distance of p (ɛ').
The reachability distance of q2 with respect to p is the Euclidean distance between p and q2.



Conclusion
We have now covered the main methods of data clustering in data mining: partitioning,
hierarchical, density-based, and grid-based approaches.

2.1 Association Rule Mining:


Association rule mining is a popular and well researched method for discovering interesting
relations between variables in large databases.
It is intended to identify strong rules discovered in databases using different measures of
interestingness.
Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules.
Problem Definition:
The problem of association rule mining is defined as:

Let I = {i1, i2, …, in} be a set of binary attributes called items.
Let D = {t1, t2, …, tm} be a set of transactions called the database.
Each transaction in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form X ⇒ Y,
where X, Y ⊆ I and X ∩ Y = ∅.
The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand-side or LHS) and
consequent (right-hand-side or RHS) of the rule, respectively.
Example:
To illustrate the concepts, we use a small example from the supermarket domain. The set of items
is I = {milk, bread, butter, beer}, and a small database containing the items (1 codes
presence and 0 absence of an item in a transaction) is shown in the table below.
An example rule for the supermarket could be {butter, bread} ⇒ {milk}, meaning that if
butter and bread are bought, customers also buy milk.
Example database with 4 items and 5 transactions



Transaction ID milk bread butter beer

1 1 1 0 0

2 0 0 1 0

3 0 0 0 1

4 1 1 1 0

5 0 1 0 0

2.1.1 Important concepts of Association Rule Mining:

The support supp(X) of an itemset X is defined as the proportion of transactions in the
data set which contain the itemset:

supp(X) = (number of transactions containing X) / (total number of transactions).

In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2, since it
occurs in 20% of all transactions (1 out of 5 transactions).

The confidence of a rule X ⇒ Y is defined as

conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).

For example, the rule {butter, bread} ⇒ {milk} has a confidence of 0.2 / 0.2 = 1.0
in the database, which means that for 100% of the transactions containing
butter and bread the rule is correct (100% of the times a customer buys butter and bread,
milk is bought as well). Confidence can be interpreted as an estimate of the
probability P(Y | X), the probability of finding the RHS of the rule in transactions under
the condition that these transactions also contain the LHS.

The lift of a rule is defined as

lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y)),

or the ratio of the observed support to that expected if X and Y were independent. For example, the
rule {milk, bread} ⇒ {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.

The conviction of a rule X ⇒ Y is defined as

conv(X ⇒ Y) = (1 − supp(Y)) / (1 − conf(X ⇒ Y)).

For example, the rule {milk, bread} ⇒ {butter} has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2,
and can be interpreted as the ratio of the expected frequency that X occurs without Y (that
is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent divided by the observed frequency of incorrect predictions.
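These measures can be checked directly against the example database with a few lines of Python:

transactions = [
    {"milk", "bread"},            # T1
    {"butter"},                   # T2
    {"beer"},                     # T3
    {"milk", "bread", "butter"},  # T4
    {"bread"},                    # T5
]

def supp(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"butter", "bread"}, {"milk"}
confidence = supp(X | Y) / supp(X)
lift = supp(X | Y) / (supp(X) * supp(Y))

print(supp({"milk", "bread", "butter"}))   # 0.2
print(confidence)                          # 1.0
print(lift)                                # 2.5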

2.2 Market basket analysis:

This process analyzes customer buying habits by finding associations between the different items
that customers place in their shopping baskets. The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are frequently purchased
together by customers. For instance, if customers are buying milk, how likely are they to also buy
bread (and what kind of bread) on the same trip to the supermarket. Such information can lead to
increased sales by helping retailers do selective marketing and plan their shelf space.



Example:

If customers who purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales of both
items. In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance, after
deciding on an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software and may decide to purchase a
home security system as well. Market basket analysis can also help retailers plan which items to
put on sale at reduced prices. If customers tend to purchase computers and printers together, then
having a sale on printers may encourage the sale of printers as well as computers.

2.3 Frequent Pattern Mining:


Frequent pattern mining can be classified in various ways, based on the following criteria:



1. Based on the completeness of patterns to be mined:

We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the
maximal frequent itemsets, given a minimum support threshold.
We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match
frequent itemsets, top-k frequent itemsets, and so on.

2. Based on the levels of abstraction involved in the rule set:

Some methods for association rule mining can find rules at differing levels of abstraction.

For example, suppose that a set of association rules mined includes the following rules,
where X is a variable representing a customer:

buys(X, “computer”) => buys(X, “HP printer”)            (1)

buys(X, “laptop computer”) => buys(X, “HP printer”)     (2)

In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g.,
“computer” is a higher-level abstraction of “laptop computer”).
3. Based on the number of data dimensions involved in the rule:

If the items or attributes in an association rule reference only one dimension, then it is a
single-dimensional association rule.
buys(X, “computer”) => buys(X, “antivirus software”)

If a rule references two or more dimensions, such as the dimensions age, income, and buys,
then it is a multidimensional association rule. The following rule is an example of a
multidimensional rule:
age(X, “30…39”) ^ income(X, “42K…48K”) => buys(X, “high resolution TV”)



4. Based on the types of values handled in the rule:
If a rule involves associations between the presence or absence of items, it is a Boolean
association rule.
If a rule describes associations between quantitative items or attributes, then it is a
quantitative association rule.

5. Based on the kinds of rules to be mined:


Frequent pattern analysis can generate various kinds of rules and other interesting
relationships.
Association rule mining can generate a large number of rules, many of which are
redundant or do not indicate a correlation relationship among itemsets.
The discovered associations can be further analyzed to uncover statistical correlations,
leading to correlation rules.

6. Based on the kinds of patterns to be mined:


Many kinds of frequent patterns can be mined from different kinds of data sets.
Sequential pattern mining searches for frequent subsequences in a sequence data set, where
a sequence records an ordering of events.
For example, with sequential pattern mining, we can study the order in which items are
frequently purchased. For instance, customers may tend to first buy a PC, followed by a
digital camera, and then a memory card.
Structured pattern mining searches for frequent substructures in a structured data
set. Single items are the simplest form of structure.
Each element of an itemset may contain a subsequence, a subtree, and so on.
Therefore, structured pattern mining can be considered as the most general form of frequent
pattern mining.



2.4 Efficient Frequent Itemset Mining Methods:
2.4.1 Finding Frequent Itemsets Using Candidate Generation:The
Apriori Algorithm

Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The resulting
set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used
to find L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
A two-step process is followed in Apriori, consisting of join and prune actions.
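A compact Python sketch of this level-wise search, run on the nine-transaction example that follows (min_sup = 2, as in the worked steps below):

from itertools import combinations

D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
     {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
min_sup = 2

def support_count(candidates):
    return {c: sum(c <= t for t in D) for c in candidates}

# L1: frequent 1-itemsets, found by one scan of D
items = {i for t in D for i in t}
Lk = {c for c, n in support_count({frozenset([i]) for i in items}).items() if n >= min_sup}

while Lk:
    print(sorted(sorted(s) for s in Lk))            # the frequent k-itemsets
    k = len(next(iter(Lk)))
    # Join step: build candidate (k+1)-itemsets from Lk
    Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    # Prune step: drop any candidate with an infrequent k-subset (Apriori property)
    Ck = {c for c in Ck if all(frozenset(s) in Lk for s in combinations(c, k))}
    # One full scan of D keeps the candidates that satisfy minimum support
    Lk = {c for c, n in support_count(Ck).items() if n >= min_sup}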



Example:

TID List of item IDs


T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3

There are nine transactions in this database, that is, |D| = 9.



Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets,
C1. The algorithm simply scans all of the transactions in order to count the number of
occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min sup = 2. The set of frequent
1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying
minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate
a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step
because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is
accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets
in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3: from the join step, we first get C3 = L2 ⋈
L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the
Apriori property that all subsets of a frequent itemset must also be frequent, we can determine
that the four latter candidates cannot possibly be frequent.

7. The transactions in D are scanned in order to determine L3, consisting of those candidate
3-itemsets in C3 having minimum support.
8. The algorithm uses L3 x L3 to generate a candidate set of 4-itemsets, C4. Although the join
results in {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent,
so C4 is empty and the algorithm terminates, as sketched below.
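The complete level-wise procedure can be sketched in Python as follows. This is a minimal illustration rather than an optimized implementation: the transactions and minimum support count are those of the worked example above, and the candidate-generation helper is a simplified equivalent of the join and prune steps.

from itertools import combinations

# Transaction database from the worked example above (minimum support count = 2)
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2

def count_support(candidates, transactions):
    """One full database scan: count how often each candidate itemset occurs."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:                      # candidate is contained in the transaction
                counts[c] += 1
    return counts

def apriori_gen(Lk, k):
    """Simplified join and prune: build (k+1)-itemsets from the items of Lk and keep
    only those whose k-subsets are all frequent; this yields the same candidates as
    the join-and-prune steps described above."""
    items = sorted({i for itemset in Lk for i in itemset})
    return {frozenset(c) for c in combinations(items, k + 1)
            if all(frozenset(s) in Lk for s in combinations(c, k))}

def frequent_from(candidates):
    """Scan once, keep candidates meeting MIN_SUP, and record their supports."""
    Lk = set()
    for c, n in count_support(candidates, transactions).items():
        if n >= MIN_SUP:
            Lk.add(c)
            support_counts[c] = n
    return Lk

# Level-wise search: L1 -> L2 -> L3 -> ... until no more frequent itemsets are found
support_counts = {}                                                      # support of every frequent itemset
Lk = frequent_from({frozenset([i]) for t in transactions for i in t})    # L1
k = 1
while Lk:
    Lk = frequent_from(apriori_gen(Lk, k))                               # L(k+1)
    k += 1

for itemset, count in sorted(support_counts.items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), count)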



2.4.2 Generating Association Rules from Frequent Itemsets:
Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them.



Example:
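A minimal sketch of the rule-generation step is given below. It assumes a dictionary support_counts (such as the one built in the Apriori sketch above) that maps every frequent itemset to its support count; for each frequent itemset l and each nonempty proper subset s, the rule s => (l - s) is output when its confidence, support_count(l) / support_count(s), meets a minimum confidence threshold.

from itertools import combinations

def generate_rules(support_counts, min_conf=0.7):
    """For every frequent itemset l and every nonempty proper subset s of l,
    output the rule s => (l - s) when confidence = support(l) / support(s)
    meets the minimum confidence threshold."""
    rules = []
    for l, sup_l in support_counts.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for antecedent in combinations(l, r):
                s = frozenset(antecedent)
                conf = sup_l / support_counts[s]   # s is frequent by the Apriori property
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Using the support_counts dictionary from the Apriori sketch above:
# for a, c, conf in generate_rules(support_counts, min_conf=0.7):
#     print(sorted(a), "=>", sorted(c), " (confidence %.2f)" % conf)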

2.5 Mining Multilevel Association Rules:

For many applications, it is difficult to find strong associations among data items at low
or primitive levels of abstraction due to the sparsity of data at those levels.
Strong associations discovered at high levels of abstraction may represent commonsense
knowledge.
Therefore, data mining systems should provide capabilities for mining association rules
at multiple levels of abstraction, with sufficient flexibility for easy traversal among
different abstraction spaces.
Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the
calculation of frequent itemsets at each concept level, starting at concept level 1 and
working downward in the hierarchy toward the more specific concept levels, until no more
frequent itemsets can be found.

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts. Data can be generalized by replacing low-level concepts within
the data by their higher-level concepts, or ancestors, from a concept hierarchy.



The concept hierarchy has five levels, respectively referred to as levels 0 to 4, starting with level 0
at the root node for all.

Here, Level 1 includes computer, software, printer & camera, and computer accessory.
Level 2 includes laptop computer, desktop computer, office software, antivirus software
Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.
Level 4 is the most specific abstraction level of this hierarchy.
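The generalization step can be sketched as follows. The hierarchy fragment and item names below are illustrative assumptions rather than a reproduction of the figure; transactions are rewritten by replacing each item with its ancestor at a higher concept level, after which the frequent itemset mining procedure of Section 2.4 can be rerun at that level.

# Illustrative (hypothetical) fragment of a concept hierarchy: item -> parent concept.
hierarchy = {
    "IBM desktop computer": "desktop computer",
    "Dell desktop computer": "desktop computer",
    "Microsoft office software": "office software",
    "desktop computer": "computer",
    "laptop computer": "computer",
    "office software": "software",
    "computer": "all",
    "software": "all",
}

def generalize(item, levels_up):
    """Replace an item by its ancestor, climbing the hierarchy levels_up steps."""
    for _ in range(levels_up):
        item = hierarchy.get(item, item)     # stay put once the root is reached
    return item

def generalize_transactions(transactions, levels_up):
    """Rewrite every transaction at a higher concept level; frequent itemset mining
    can then be rerun on the result at that level."""
    return [{generalize(i, levels_up) for i in t} for t in transactions]

raw = [{"IBM desktop computer", "Microsoft office software"},
       {"Dell desktop computer", "Microsoft office software"}]
print(generalize_transactions(raw, levels_up=1))   # mine at the "desktop computer" level
print(generalize_transactions(raw, levels_up=2))   # mine at the "computer" level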

Parallel and Distributed Data Mining


1. Introduction
Data mining is a process of nontrivial extraction of implicit, previously unknown, and
potentially useful information (such as knowledge rules, constraints, and regularities) from
data in databases. In fact, the term “knowledge discovery” is more general than the
term “data mining.” Data mining is usually viewed as one step in the process of
knowledge discovery, although the two terms are often treated as synonyms in the
computer literature. The entire life cycle of knowledge discovery includes steps such as data
cleaning, data integration, data selection, data transformation, data mining, pattern
evaluation, and knowledge presentation.

Fig. 1. Life cycle of knowledge discovery


Data cleaning is to remove noise and inconsistent data. Data integration is to combine data
from multiple data sources, such as a database and data warehouse. Data selection is to
retrieve data relevant to the task. Data transformation is to transform data into appropriate
forms. Data mining is to apply intelligent methods to extract data patterns. Pattern
evaluation is to identify the truly interesting patterns based on some interestingness
measures. Knowledge presentation is to visualize and present the mined knowledge to
the user. There are many data mining techniques, such as association rule mining,



classification, clustering, sequential pattern mining, etc.
Since this chapter focuses on parallel and distributed data mining, let us turn our
attention to those concepts.

2. Distributed data mining


Data mining algorithms deal predominantly with simple data formats (typically flat
files), but there is increasing focus on mining complex and advanced data
types such as object-oriented, spatial, and temporal data. Another aspect of this growth
and evolution of data mining systems is the move from stand-alone systems using
centralized and local computational resources towards supporting increasing levels of
distribution. As data mining technology matures and moves from a theoretical domain to
the practitioner’s arena there is an emerging realization that distribution is very much a
factor that needs to be accounted for.
Databases in today’s information age are inherently distributed. Organizations that
operate in global markets need to perform data mining on distributed data sources
(homogeneous / heterogeneous) and require cohesive and integrated knowledge from
this data. Such organizational environments are characterized by a geographical
separation of users from the data sources. This inherent distribution of data sources and
large volumes of data involved inevitably leads to exorbitant communications costs.
Therefore, it is evident that the traditional data mining model, involving the co-location of
users, data, and computational resources, is inadequate when dealing with distributed
environments. The development of data mining along this dimension has led to the
emergence of distributed data mining. The need to address specific issues associated
with the application of data mining in distributed computing environments is the
primary objective of distributed data mining. Broadly, data mining environments
consist of users, data, hardware and the mining software (this includes both the mining
algorithms and any other associated programs). Distributed data mining addresses the
impact of distribution of users, software and computational resources on the data mining
process. There is general consensus that distributed data mining is the process of mining
data that has been partitioned into one or more physically/geographically distributed
subsets.
The significant factors, which have led to the emergence of distributed data mining
from centralized mining, are as follows:
 The need to mine distributed subsets of data, the integration of which is non- trivial and expensive.
 The performance and scalability bottlenecks of data mining.
 Distributed data mining provides a framework for scalability, which allows large, high-dimensional datasets to be split
into smaller subsets that can be processed with individual computational resources.
Distributed Data Mining (DDM) is a branch of the field of data mining that offers a
framework to mine distributed data paying careful attention to the distributed data and
computing resources. In the DDM literature, one of two assumptions is commonly
adopted as to how data is distributed across sites: homogeneously and heterogeneously.
Both adopt the conceptual viewpoint that the data tables at each site are
partitions of a single global table. In the homogeneous case, the global table is
horizontally partitioned. The tables at each site are subsets of the global table; they have
exactly the same attributes. In the heterogeneous case the table is vertically partitioned,
each site contains a collection of columns (sites do not have the same attributes).
However, each tuple at each site is assumed to contain a unique identifier to facilitate
matching. It is important to stress that the global table viewpoint is strictly conceptual.
It is not necessarily assumed that such a table was physically realized and partitioned to
form the tables at each site.
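The two partitioning viewpoints can be illustrated with a small, hypothetical global table (the attribute names below are invented for illustration):

# Conceptual global table: each tuple carries a unique identifier plus attributes.
global_table = [
    {"id": 1, "age": 34, "income": 52000, "buys_pc": "yes"},
    {"id": 2, "age": 45, "income": 61000, "buys_pc": "no"},
    {"id": 3, "age": 29, "income": 40000, "buys_pc": "yes"},
]

# Homogeneous case (horizontal partitioning): every site holds the same
# attributes but a disjoint subset of the rows.
site_A = global_table[:2]
site_B = global_table[2:]

# Heterogeneous case (vertical partitioning): each site holds a subset of the
# columns; the unique identifier is kept everywhere so tuples can be matched.
site_X = [{"id": r["id"], "age": r["age"]} for r in global_table]
site_Y = [{"id": r["id"], "income": r["income"], "buys_pc": r["buys_pc"]} for r in global_table]

print(site_A, site_B, sep="\n")
print(site_X, site_Y, sep="\n")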

3. Parallel and distributed data mining


The enormity and high dimensionality of the datasets typically available as input to
association rule discovery make it an ideal problem for solving on multiple processors in
parallel. The primary reasons are the memory and CPU speed limitations faced by single
processors. Thus it is critical to design efficient parallel algorithms for the task. Another
reason for parallel algorithms is that many transaction databases are already stored in
parallel databases, or they are distributed at multiple sites to begin with. The cost of
bringing them all to one site or one computer for serial discovery of association rules can
be prohibitively expensive.
For compute-intensive applications, parallelisation is an obvious means for improving
performance and achieving scalability. A variety of techniques may be used to distribute the
workload involved in data mining over multiple processors. Four major classes of parallel
implementations are distinguished. The classification tree in Figure 2 demonstrates this
distinction. The first distinction made in this tree is between task-parallel and data-parallel
approaches.

Fig. 2. Methods of Parallelism


Task-parallel algorithms assign portions of the search space to separate processors. The task
parallel approaches can again be divided into two groups. The first group is based on
a Divide and Conquer strategy that divides the search space and assigns each partition to a
specific processor. The second group is based on a task queue that dynamically assigns
small portions of the search space to a processor whenever it becomes available. A task
parallel implementation of decision tree induction will form tasks associated with
branches of the tree. A Divide and Conquer approach seems a natural reflection of the
recursive nature of decision trees.
However, the task-parallel implementation suffers from load-balancing problems
caused by uneven distributions of records between branches. The success of a task-parallel
implementation of decision trees seems to be highly dependent on the structure of the data
set. The second class of approaches, called data parallel, distributes the data set over the
available processors. Data-parallel approaches come in two flavors. A partitioning based on
records will assign non-overlapping sets of records to each of the processors. Alternatively a
partitioning of attributes will assign sets of attributes to each of the processors. Attribute-
based approaches are based on the observation that many algorithms can be expressed in



terms of primitives that consider every attribute in turn. If attributes are distributed over
multiple processors, these primitives may be executed in parallel. For example, when
constructing decision trees, at each node in the tree, all independent attributes are
considered, in order to determine the best split at that point.
There are two basic parallel approaches that have come to be used in recent times – work partitioning and data partitioning.

Work Partitioning - These methods assign different view computations to different
processors. Consider, for example, the lattice for a four-dimensional data cube. If the
dimensions are named “ABCD”, 15 views (one for each non-empty subset of the
dimensions) need to be computed. Given
a parallel computer with p processors, work partitioning schemes partition the set of
views into p groups and assign the computation of the views in each group to a different
processor. The main challenges for these methods are load balancing and scalability.
Data Partitioning - These methods work by partitioning the raw data set into p subsets
and store each subset locally on one processor. All views are computed on every
processor but only with respect to the subset of data available at each processor. A
subsequent merge procedure is required to agglomerate the data across processors. The
advantage of data partitioning methods is that they do not require all processors to have
access to the entire raw data set. Each processor only requires a local copy of a portion
of the raw data which can, e.g., be stored on its local disk. This makes such methods
feasible for shared-nothing parallel machines.
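A rough sketch contrasting the two approaches for p processors is given below; the round-robin splits are only one possible scheme and ignore the load-balancing issues noted above.

from itertools import combinations

dimensions = ["A", "B", "C", "D"]

# Work partitioning: the 15 views (non-empty subsets of "ABCD") are split into
# p groups; each group of view computations is assigned to a different processor.
views = ["".join(v) for r in range(1, 5) for v in combinations(dimensions, r)]

def work_partition(views, p):
    return {proc: views[proc::p] for proc in range(p)}     # round-robin assignment

# Data partitioning: the raw records are split into p subsets instead; every
# processor computes all views over its local subset only, and a merge step
# agglomerates the partial results afterwards.
def data_partition(records, p):
    return [records[i::p] for i in range(p)]

print(work_partition(views, p=3))
print(data_partition(list(range(10)), p=3))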

4. Why parallelize data mining?


Data-mining applications fall into two groups based on their intent. In some
applications, the goal is to find explanations for the most variable elements of the data
set, that is, to find and explain the outliers. In other applications, the goal is to understand
the variations of the majority of the data set elements, with little interest in the outliers.
Scientific data mining seems to be mostly of the first kind, whereas commercial
applications seem to be of the second kind (“understand the buying habits of most of
our customers”). In applications of the first kind, parallel computing seems to be
essential. In applications of the second kind, the question is still open because it is not
known how effective sampling from a large data set might be at answering broader
questions. Parallel computing thus has considerable potential as a tool for data mining,
but it is not yet completely clear whether it represents the future of data mining.

5. Technologies
 Parallel computing
Single systems with many processors work on same problem.

 Distributed computing
Many systems loosely coupled by a scheduler to work on related problems.

 Grid Computing (Meta Computing)
Many systems tightly coupled by software, perhaps geographically distributed, are
made to work together on single problems or on related problems.



5.1 Properties of algorithms for association discovery
Most algorithms for association discovery follow the same general procedure, based on the
sequential Apriori algorithm. The basic idea is to make multiple passes over the database,
building larger and larger groups of associations on each pass. Thus, the first pass
determines the "items" that occur most frequently in all the transactions in the database;
each subsequent pass builds a list of possible frequent item tuples based on the results of
the previous pass, and then scans the database, discarding those tuples that do not occur
frequently in the database. The intuition is that for any set of items that occurs frequently,
all subsets of that set must also occur frequently.
Notice that, for large association sets, this algorithm and its derivatives must make many
passes over a potentially enormous database. It is also typically implemented using a hash
tree, a complex data structure that exhibits very poor locality (and thus poor cache behavior).
Although there exist workable sequential algorithms for data mining (such as Apriori,
above), there is a desperate need for a parallel solution for most realistic-sized problems.
The most obvious (and most compelling) argument for parallelism revolves around
database size. The databases used for data mining are typically extremely large, often
containing the details of the entire history of a company's standard transactional databases.
As these databases grow past hundreds of gigabytes towards a terabyte or more, it becomes
nearly impossible to process them on a single sequential machine, for both time and space
reasons: no more than a fraction of the database can be kept in main memory at any given
time, and the amount of local disk storage and bandwidth needed to keep the sequential
CPU supplied with data is enormous. Additionally, with an algorithm such as Apriori
that requires many complete passes over the database, the actual running time required
to complete the algorithm becomes excessive.
The basic approach to parallelizing association-discovery data mining is via database
partitioning. Each available node in the networking environment is assigned a subset of
the database records, and computes independently on that subset, usually using a
variation on the sequential Apriori algorithm. All of the parallel data mining algorithms
require some amount of global all-to-all or all-to-one communication to coordinate the
independent nodes.

5.2 Problems in developing parallel algorithms for distributed environment


There are several problems in developing parallel algorithms for association discovery
data mining in a distributed environment, which is the setting considered here.
These are:
 Data distribution: One of the benefits of parallel and distributed data mining is that each node can potentially
work with a reduced-size subset of the total database. A parallel algorithm in distributed environment must
effectively distribute data to allow each node to make independent progress with its incomplete view of the entire
database.
 I/O minimization: Even with good data distribution, parallel data mining algorithms must strive to minimize the
amount of I/O they perform to the database.
 Load balancing: To maximize the effect/efficiency of parallelism, each workstation must have approximately
the same amount of work to do. Although a good initial data distribution can help provide load-balancing, with
some algorithms, periodic data redistribution is required to obtain good overall load-balancing.
 Avoiding duplication: Ideally, no workstation should do redundant work (work already performed by another
node).
 Minimizing communication: An ideal parallel data mining algorithm allows all workstations to operate
asynchronously, without having to stall frequently for global barriers or for communication delays.
 Maximizing locality: As in all performance programming, high-performance parallel data mining algorithms
must be designed to reap the full performance potential of the hardware. This involves maximizing locality for
good cache behavior, utilizing as much of the machine's memory bandwidth as possible, etc.
Achieving all of the above goals in one algorithm is nearly impossible, as there are
tradeoffs between several of the above points. Existing algorithms for parallel data
mining attempt to achieve an optimal balance between these factors.

5.3 Algorithms in parallel and distributed data mining


The major algorithms used for parallel and distributed data mining are:
 Count Distribution: this algorithm achieves parallelism by partitioning data. Each of N workstations gets 1/Nth of
the database and performs an Apriori-like algorithm on its subset. At the end of each iteration, however, there is a
communication phase in which the frequency of item occurrence in the various data partitions is exchanged
between all workstations. Thus, this algorithm trades off I/O and duplication for minimal communication and
good load balance: each workstation must scan its database partition multiple times (causing a huge I/O load)
and maintains a full copy of the (poor-locality) data structures used (causing duplicated data structure
maintenance), but it only requires a small amount of per-iteration communication (an asynchronous broadcast
of frequency counts) and has a good distribution of work. A sketch of one Count Distribution iteration is given
after this list.
 Data Distribution: This algorithm is designed to minimize computational redundancy and maximize use of the
memory bandwidth of each workstation. It works by partitioning the current maximal-frequency itemset candidates
(like those generated by Apriori) amongst workstations. Thus, each workstation examines a disjoint set of possibilities;
however, each workstation must scan the entire database to examine its candidates. Hence this algorithm trades off a
huge amount of communication (to fetch the database partitions stored on other workstations) for better use of
machine resources and to avoid duplicated work.
 Candidate Distribution: This algorithm is similar to data distribution in that it partitions the candidates across
workstations, but it attempts to minimize communication by selectively partitioning the database such that each
workstation has locally the data needed to process its candidate set. It does this after a fixed (small) number of passes
of the standard data distribution algorithm. This trades off duplication (the same data may need to be replicated on
more than one node) and poor load-balancing (after redistributing the data, the workload of each workstation may not
be balanced) in order to minimize communication and synchronization. The effects of poor load balancing are
mitigated somewhat, since global barriers at the end of each pass are not required.
 Eclat: This sophisticated algorithm avoids most of the tradeoffs above by using an
initial clustering step to pre-process the data before partitioning it between
workstations. It thus achieves many of the benefits of candidate distribution without
the costs. Little synchronization or communication is needed, since each node can
process its partitioned dataset independently. A transformation of the data during
partitioning allows the use of simple database intersections (rather than hash trees),
maximizing cache locality and memory bandwidth usage. The transformation also
drastically cuts down the I/O bandwidth requirements by only necessitating three
database scans.
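The sketch below simulates one iteration of the Count Distribution scheme sequentially in Python: each partition's candidate counts are computed locally, as they would be at a single workstation, and the communication phase is modelled by summing the local counts so that every node can derive the same frequent itemsets. Function and variable names are illustrative.

from collections import Counter

def local_candidate_counts(partition, candidates):
    """Counting step at one workstation: support of every candidate itemset
    over the local 1/N-th of the database."""
    counts = Counter()
    for transaction in partition:
        for c in candidates:
            if c <= transaction:
                counts[c] += 1
    return counts

def count_distribution_round(partitions, candidates, min_sup):
    """One iteration: local counts are computed per partition (in parallel in a
    real system) and then exchanged -- modelled here by summing them -- so every
    node obtains the global counts and derives the same frequent itemsets."""
    global_counts = Counter()
    for part in partitions:
        global_counts.update(local_candidate_counts(part, candidates))
    return {c for c, n in global_counts.items() if n >= min_sup}

# Two nodes, each holding half of a toy database; candidates are the 1-itemsets.
partitions = [
    [{"I1", "I2"}, {"I2", "I3"}],
    [{"I1", "I3"}, {"I2", "I4"}],
]
candidates = {frozenset([i]) for part in partitions for t in part for i in t}
frequent = count_distribution_round(partitions, candidates, min_sup=2)
print(sorted(sorted(c) for c in frequent))   # [['I1'], ['I2'], ['I3']]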
6. Applications in parallel and distributed data mining
The technology of parallel and distributed data mining can be applied to different real-time
applications. The major applications are:
 Credit card fraud detection
 Intrusion detection
 Business analysis and prediction
 Financial applications
 Astronomical events
 Anomaly detection

