Unit-4 - Data Ware
Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
Classification predicts categorical (discrete, unordered) labels, whereas prediction models continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as either
safe or risky, or a prediction model to predict the expenditures of potential customers on
computer equipment given their income and occupation.
A predictor is constructed that predicts a continuous-valued function, or ordered value, as
opposed to a categorical label.
Regression analysis is a statistical methodology that is most often used for numeric prediction.
Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.
Most algorithms are memory resident, typically assuming a small data size. Recent data mining
research has built on such work, developing scalable classification and prediction techniques
capable of handling large disk-resident data.
For example, numeric values for the attribute income can be generalized to discrete
ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can
be generalized to higher-level concepts, like city.
Data can also be reduced by applying many other methods, ranging from wavelet
transformation and principal components analysis to discretization techniques, such
as binning, histogram analysis, and clustering.
There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, … ,av}, based on the training data.
1. A is discrete-valued:
In this case, the outcomes of the test at node N correspond directly to the known
values of A.
A branch is created for each known value, aj, of A and labeled with that value.
A need not be considered in any future partitioning of the tuples.
2. A is continuous-valued:
In this case, the test at node N has two possible outcomes, corresponding to the conditions A <= split_point and A > split_point, respectively,
where split_point is the split-point returned by the attribute selection method as part of the splitting criterion.
3. A is discrete-valued and a binary tree must be produced:
In this case, the test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A. A tuple satisfies the test if its value for A is among the values in SA.
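The following is a minimal Python sketch (the attribute names, sample tuples, and the split point are hypothetical, not from the text) illustrating how the tuples at node N would be partitioned for a discrete-valued versus a continuous-valued splitting attribute A.

```python
# Minimal sketch of node partitioning for decision tree induction.
# This is not the full tree-growing algorithm; it only shows the branching.

def partition_discrete(tuples, attr):
    """One branch (partition) per known value a_j of the discrete attribute A."""
    partitions = {}
    for t in tuples:
        partitions.setdefault(t[attr], []).append(t)
    return partitions

def partition_continuous(tuples, attr, split_point):
    """Two branches: A <= split_point and A > split_point."""
    left = [t for t in tuples if t[attr] <= split_point]
    right = [t for t in tuples if t[attr] > split_point]
    return left, right

# Hypothetical training tuples
data = [
    {"income": "high", "age": 25, "class": "risky"},
    {"income": "low", "age": 42, "class": "safe"},
    {"income": "medium", "age": 31, "class": "safe"},
]

print(partition_discrete(data, "income"))      # one partition per income value
print(partition_continuous(data, "age", 30))   # age <= 30 vs. age > 30
```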
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, …,xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict
that X belongs to the class having the highest posterior probability, conditioned on X.
That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are equally
likely, that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) from the training
tuples. For each attribute, we look at whether the attribute is categorical or continuous-valued.
For instance, to compute P(X|Ci), we consider the following:
If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value
xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
If Akis continuous-valued, then we need to do a bit more work, but the calculation is
pretty straightforward.
A continuous-valued attribute is typically assumed to have a Gaussian distribution with a
mean µ and standard deviation σ, defined by
g(x, µ, σ) = (1 / (√(2π) σ)) e^(-(x - µ)² / (2σ²)),
so that P(xk|Ci) = g(xk, µCi, σCi).
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci.
The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 <= j <= m, j ≠ i.
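The following is a minimal Python sketch of the naïve Bayesian prediction steps above: the prior P(Ci) is estimated from class frequencies, P(xk|Ci) from value counts for a categorical attribute and from a Gaussian density for a continuous attribute, and the class maximizing P(X|Ci)P(Ci) is returned. The tiny training set and attribute names are hypothetical.

```python
# Minimal naive Bayesian classifier sketch (hypothetical data).
import math
from collections import defaultdict

train = [  # (age, income_level, class_label)
    (25, "high", "buys"),
    (47, "low", "no"),
    (35, "medium", "buys"),
    (52, "low", "no"),
]

def gaussian(x, mu, sigma):
    # g(x, mu, sigma) as defined above
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def predict(x_age, x_income):
    classes = defaultdict(list)
    for age, income, label in train:
        classes[label].append((age, income))
    best_label, best_score = None, -1.0
    for label, rows in classes.items():
        prior = len(rows) / len(train)                      # P(Ci)
        ages = [r[0] for r in rows]
        mu = sum(ages) / len(ages)
        sigma = math.sqrt(sum((a - mu) ** 2 for a in ages) / len(ages)) or 1e-6
        p_age = gaussian(x_age, mu, sigma)                  # continuous attribute
        p_income = sum(1 for r in rows if r[1] == x_income) / len(rows)  # categorical
        score = prior * p_age * p_income                    # P(X|Ci)P(Ci)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict(30, "medium"))
```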
A neural network iteratively learns a set of weights for the prediction of the class label of tuples.
Example:
The inputs to the network correspond to the attributes measured for each training tuple. The
inputs are fed simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a second layer
known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The
number of hidden layers is arbitrary.
The weighted outputs of the last hidden layer are input to units making up the output layer,
which emits the network’s prediction for the given tuples.
Neural networks involve long training times and are therefore more suitable for applications
where this is feasible.
Backpropagation learns by iteratively processing a data set of training tuples, comparing
the network’s prediction for each tuple with the actual known target value.
The target value may be the known class label of the training tuple (for classification
problems) or a continuous value (for prediction).
For each training tuple, the weights are modified so as to minimize the mean squared error
between the network’s prediction and the actual target value. These modifications are made
in the "backwards" direction, that is, from the output layer, through each hidden layer, down
to the first hidden layer; hence the name backpropagation.
Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.
Advantages:
These include their high tolerance of noisy data as well as their ability to classify patterns on
which they have not been trained.
They can be used when you may have little knowledge of the relationships between
attributes and classes.
They are well-suited for continuous-valued inputs and outputs, unlike most decision tree
algorithms.
They have been successful on a wide array of real-world data, including handwritten
character recognition, pathology and laboratory medicine, and training a computer to
pronounce English text.
Neural network algorithms are inherently parallel; parallelization techniques can be used
to speed up the computation process.
Process:
Initialize the weights: The weights in the network are initialized to small random numbers, as are the biases associated with each unit.
Propagate the inputs forward:
First, the training tuple is fed to the input layer of the network. The inputs pass through the input
units, unchanged. That is, for an input unit j, its output, Oj, is equal to its input value, Ij. Next, the
net input and output of each unit in the hidden and output layers are computed. The net input to a
unit in the hidden or output layers is computed as a linear combination of its inputs.
Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to
it in the previous layer. Each connection has a weight. To compute the net input to the unit, each
input connected to the unit is multiplied by its corresponding weight, and this is summed:
Ij = Σi wi,j Oi + Ɵj
where wi,j is the weight of the connection from unit i in the previous layer to unit j;
Oi is the output of unit i from the previous layer; and
Ɵj is the bias of the unit, which acts as a threshold in that it serves to vary the activity of the unit.
Each unit in the hidden and output layers takes its net input and then applies an activation
function to it.
The error is propagated backward by updating the weights and biases to reflect the error of
the network’s prediction. For a unit j in the output layer, the error Errj is computed by
Errj = Oj (1 - Oj)(Tj - Oj)
where Oj is the actual output of unit j, and Tj is the known target value of the given
training tuple.
The error of a hidden layer unit j is
Errj = Oj (1 - Oj) Σk Errk wjk
where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and
Errk is the error of unit k.
Weights are updated by the following equations, where Δwij is the change in weight wij:
Δwij = (l) Errj Oi
wij = wij + Δwij
Here l is the learning rate, a constant typically having a value between 0.0 and 1.0.
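The following is a minimal NumPy sketch of one backpropagation step for a single training tuple, following the net-input, error, and weight-update equations above. A logistic (sigmoid) activation function is assumed; the layer sizes, learning rate, and random seed are hypothetical.

```python
# One forward + backward pass of backpropagation for a single tuple (sketch).
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 2, 1
W1, b1 = rng.normal(size=(n_in, n_hidden)), np.zeros(n_hidden)   # input -> hidden
W2, b2 = rng.normal(size=(n_hidden, n_out)), np.zeros(n_out)     # hidden -> output
l = 0.1                                                          # learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.2, 0.7, 0.1])   # one training tuple (attribute vector)
t = np.array([1.0])             # known target value

# Forward pass: net input I_j = sum_i w_ij * O_i + theta_j, output O_j = sigmoid(I_j)
O_hidden = sigmoid(x @ W1 + b1)
O_out = sigmoid(O_hidden @ W2 + b2)

# Backward pass: Err_j = O_j(1 - O_j)(T_j - O_j) for the output layer,
# Err_j = O_j(1 - O_j) * sum_k Err_k * w_jk for a hidden layer.
err_out = O_out * (1 - O_out) * (t - O_out)
err_hidden = O_hidden * (1 - O_hidden) * (W2 @ err_out)

# Weight and bias updates: delta w_ij = l * Err_j * O_i
W2 += l * np.outer(O_hidden, err_out)
b2 += l * err_out
W1 += l * np.outer(x, err_hidden)
b1 += l * err_hidden
print(O_out, err_out)
```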
In other words, for each numeric attribute, we take the difference between the corresponding
values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it.
The square root is taken of the total accumulated distance count.
Min-max normalization can be used to transform a value v of a numeric attribute A to v′ in
the range [0, 1] by computing
v′ = (v - minA) / (maxA - minA)
where minA and maxA are the minimum and maximum values of attribute A.
For k-nearest-neighbor classification, the unknown tuple is assigned the most common
class among its k nearest neighbors.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to
it in pattern space.
Nearest neighbor classifiers can also be used for prediction, that is, to return a real-valued
prediction for a given unknown tuple.
In this case, the classifier returns the average value of the real-valued labels associated
with the k nearest neighbors of the unknown tuple.
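The following is a minimal Python sketch combining the ideas above: min-max normalization of the numeric attributes, Euclidean distance, and a majority vote among the k nearest training tuples. The training data and the choice k = 3 are hypothetical.

```python
# Minimal k-nearest-neighbor classification sketch (hypothetical data).
import math
from collections import Counter

train_X = [[25, 40000], [47, 25000], [35, 60000], [52, 30000]]
train_y = ["buys", "no", "buys", "no"]

def min_max_normalize(rows):
    # Scale each attribute to [0, 1] using v' = (v - minA) / (maxA - minA)
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [[(v - mn) / (mx - mn) if mx > mn else 0.0
             for v, mn, mx in zip(r, mins, maxs)] for r in rows], mins, maxs

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(x, k=3):
    norm_train, mins, maxs = min_max_normalize(train_X)
    x_norm = [(v - mn) / (mx - mn) if mx > mn else 0.0
              for v, mn, mx in zip(x, mins, maxs)]
    dists = sorted(zip((euclidean(x_norm, r) for r in norm_train), train_y))
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]   # majority vote

print(knn_predict([30, 50000]))
```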
Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning
starts as follows.
An initial population is created consisting of randomly generated rules. Each rule can be
represented by a string of bits.
Example:
The regression coefficients can be solved for by the method of least squares:
w1 = Σi (xi - x̄)(yi - ȳ) / Σi (xi - x̄)²
w0 = ȳ - w1 x̄
where x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|.
The coefficients w0 and w1 often provide good approximations to otherwise complicated
regression equations.
1.7.2 Multiple Linear Regression:
It is an extension of straight-line regression so as to involve more than one predictor
variable.
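The following is a minimal NumPy sketch of both straight-line regression (using the least-squares formulas above) and multiple linear regression with two predictor variables. The income, second-predictor, and expenditure figures are hypothetical.

```python
# Straight-line and multiple linear regression via least squares (sketch).
import numpy as np

income = np.array([30, 45, 60, 75, 90], dtype=float)   # predictor x1
years = np.array([3, 8, 9, 13, 20], dtype=float)       # predictor x2
spend = np.array([5, 7, 9, 11, 15], dtype=float)       # response y

# Straight-line regression: y = w0 + w1 * x
w1 = np.sum((income - income.mean()) * (spend - spend.mean())) / np.sum((income - income.mean()) ** 2)
w0 = spend.mean() - w1 * income.mean()
print("straight line:", w0, w1)

# Multiple linear regression: y = w0 + w1 * x1 + w2 * x2
X = np.column_stack([np.ones_like(income), income, years])
coeffs, *_ = np.linalg.lstsq(X, spend, rcond=None)
print("multiple:", coeffs)
```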
Suppose we are a market manager and we have a new, tempting product to sell. We are sure that the product would bring enormous profit, as long as it is sold to the right people. So how can we tell who is best suited for the product from our company's huge customer base? For example, if the customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.
It classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.
The partitioning is based on the absolute-error criterion,
E = Σj=1..k Σp∈Cj |p - oj|
where E is the sum of the absolute error for all objects in the data set; p is the point in space
representing a given object in cluster Cj; and oj is the representative object of Cj. In general,
the algorithm iterates until, eventually, each representative object is actually the medoid, or
most centrally located object, of its cluster.
• Case 1: p currently belongs to representative object oj. If oj is replaced by o_random as a
representative object and p is closest to one of the other representative objects oi, i ≠ j,
then p is reassigned to oi.
• Case 2: p currently belongs to representative object oj. If oj is replaced by o_random as a
representative object and p is closest to o_random, then p is reassigned to o_random.
• Case 3: p currently belongs to representative object oi, i ≠ j. If oj is replaced by
o_random as a representative object and p is still closest to oi, then the assignment does
not change.
• Case 4: p currently belongs to representative object oi, i ≠ j. If oj is replaced by
o_random as a representative object and p is closest to o_random, then p is reassigned to
o_random.
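The following is a minimal Python sketch of the ideas above: each object is assigned to its closest representative object, the absolute-error criterion E is computed, and a swap of a representative object with a non-representative object is accepted only if it lowers E (covering the four reassignment cases implicitly). The 2-D points and initial representatives are hypothetical.

```python
# Minimal PAM-style k-medoids sketch (hypothetical 2-D points).
import math

points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]

def absolute_error(medoids):
    # E = sum over clusters Cj, sum over p in Cj of |p - o_j|
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

medoids = [points[0], points[3]]
best_e = absolute_error(medoids)
improved = True
while improved:
    improved = False
    for candidate in points:
        if candidate in medoids:
            continue
        for i in range(len(medoids)):
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            e = absolute_error(trial)
            if e < best_e:           # accept the swap only if the total cost drops
                medoids, best_e = trial, e
                improved = True
                break

print(medoids, best_e)
```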
Classification:
• A hierarchical method can be classified as being either agglomerative or divisive. The
agglomerative approach, also called the bottom-up approach, starts with each object forming
a separate group. It successively merges the objects or groups that are close to one another,
until all of the groups are merged into one (the topmost level of the hierarchy) or until a
termination condition is satisfied.
• The divisive approach, also called the top-down approach, starts with all of the objects in the
same cluster.
• In each successive iteration, it subdivides the cluster into smaller and smaller pieces, until
each object forms a cluster on its own or until certain termination conditions are satisfied,
such as a desired number of clusters being reached or the diameter of each cluster being
within a certain threshold.
• This is a single-linkage approach in that each cluster is represented by all of the objects in
the cluster, and the similarity between two clusters is measured by the similarity of the
closest pair of data points belonging to different clusters.
• The cluster merging process repeats until all of the objects are eventually merged to form
one cluster.
• In DIANA, all of the objects are used to form one initial cluster.
• The cluster is split according to some principle, such as the maximum Euclidean distance
between the closest neighboring objects in the cluster.
• The cluster splitting process repeats until, eventually, each new cluster contains only a
single object.
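The following is a minimal sketch of AGNES-style single-linkage agglomerative clustering using SciPy (assumed to be available): the closest pair of clusters is merged repeatedly, and the resulting dendrogram is cut at a distance threshold to obtain flat clusters. The 2-D points and the threshold are hypothetical.

```python
# Single-linkage agglomerative clustering sketch with SciPy (hypothetical points).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]], dtype=float)

# 'single' linkage: cluster similarity = distance of the closest pair of points
Z = linkage(points, method="single", metric="euclidean")

# Cut the dendrogram at a distance threshold to obtain flat clusters
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)   # e.g. [1 1 1 2 2 2]
```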
(2b) CURE
The CURE algorithm assumes a Euclidean distance. It allows clusters to assume any
shape. It uses a collection of representative points to represent clusters.
We formed two clusters from the dataset of engineers and humanities students. The clusters formed
overlap with each other, which does not give a useful segregation.
We then tried to create three clusters for better segregation, but after cluster formation one
cluster still contains values from both groups.
Pass 1 of 2:
Pass 2 of 2:
Now, rescan the whole dataset and visit each point p in the data set.
Place it in the “closest cluster”:
o Closest: the cluster with the representative point closest to p, among all the
representative points of all the clusters.
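The following is a minimal Python sketch of Pass 2 as described above: every point p in the data set is placed in the cluster that owns the representative point closest to p. The representative points and data points are hypothetical.

```python
# Sketch of CURE Pass 2: assign each point to the cluster with the nearest
# representative point (hypothetical representatives and points).
import math

representatives = {
    "engineers": [(1.0, 1.0), (2.0, 2.0)],
    "humanities": [(8.0, 8.0), (9.0, 7.0)],
}
dataset = [(1.5, 1.2), (8.5, 7.4), (2.2, 1.9)]

def closest_cluster(p):
    # Compare p against all representative points of all clusters
    return min(
        ((math.dist(p, r), name)
         for name, reps in representatives.items() for r in reps)
    )[1]

for p in dataset:
    print(p, "->", closest_cluster(p))
```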
• The edges are weighted to reflect the similarity between objects. Chameleon uses a graph
partitioning algorithm to partition the k-nearest-neighbor graph into a large number of
relatively small subclusters.
• To determine the pairs of most similar subclusters, it takes into account both the
interconnectivity as well as the closeness of the clusters
Figure 11: Chameleon – hierarchical clustering based on k-nearest-neighbor graphs and dynamic modeling
To discover clusters with arbitrary shape, density-based clustering methods have been
developed.
These typically regard clusters as dense regions of objects in the data space that are
separated by regions of low density (representing noise).
OPTICS extends DBSCAN to produce a cluster ordering obtained from a wide range
of parameter settings
• The algorithm grows regions with sufficiently high density into clusters and discovers
clusters of arbitrary shape in spatial databases with noise.
The neighborhood within a radius ɛ of a given object is called the ɛ-neighborhood of the
object.
Example:
Consider the figure, for a given ɛ represented by the radius of the circles, and, say, let
MinPts = 3.
• If the ɛ-neighborhood of a point p contains at least MinPts points, a new cluster with p as a
core object is created.
• DBSCAN then iteratively collects directly density-reachable objects from these core
objects, which may involve the merge of a few density-reachable clusters.
• The process terminates when no new point can be added to any cluster.
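The following is a minimal sketch of the DBSCAN procedure above using scikit-learn (assumed to be available): points whose ɛ-neighborhood contains at least MinPts points become core objects, clusters are grown from them, and remaining points are labeled as noise (-1). The coordinates, ɛ, and MinPts values are hypothetical.

```python
# DBSCAN sketch via scikit-learn (hypothetical points, eps, MinPts).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],    # dense region 1
              [8, 8], [8.1, 7.9], [7.9, 8.2],    # dense region 2
              [4, 5]])                            # likely noise

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # core/border points get a cluster id, noise gets -1
```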
• Rather than produce a data set clustering explicitly, OPTICS computes an augmented
cluster ordering for automatic and interactive cluster analysis.
• This order selects an object that is density-reachable with respect to the lowest ɛ value, so
that clusters with higher density (lower ɛ) will be finished first.
Based on this idea, two values need to be stored for each object—core-distance and
reachability-distance:
• The core-distance of an object p is the smallest ɛ′ value that makes p a core object. If p
is not a core object, the core-distance of p is undefined.
• Suppose that ɛ = 6 mm and MinPts = 5. The core-distance of p is the distance, ɛ′, between p
and the fourth closest data object.
4. Grid-Based Methods
• The grid-based clustering approach uses a multiresolution grid data structure.
• It quantizes the object space into a finite number of cells that form a grid structure on
which all of the operations for clustering are performed.
• The main advantage of the approach is its fast processing time, which is typically
independent of the number of data objects, yet dependent on only the number of cells in
each dimension in the quantized space.
• Some typical examples of the grid-based approach include STING, which explores
statistical information stored in the grid cells; WaveCluster, which clusters objects using
a wavelet transform method; and CLIQUE, which represents a grid- and density-based
approach for clustering in high-dimensional data space.
• STING is a grid-based multiresolution clustering technique in which the spatial area is
divided into rectangular cells.
• There are usually several levels of such rectangular cells corresponding to different levels
of resolution, and these cells form a hierarchical structure:
• each cell at a high level is partitioned to form a number of cells at the next lower level.
• Statistical information regarding the attributes in each grid cell (such as the mean,
maximum, and minimum values) is precomputed and stored
• CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth
subspace clustering in high-dimensional space.
• Because CLIQUE partitions each dimension like a grid structure and determines whether
a cell is dense based on the number of points it contains, it can also be viewed as an
integration of density-based and grid-based clustering methods
• Given a large set of multidimensional data points, the data space is usually not uniformly
occupied by the data points.
• CLIQUE’s clustering identifies the sparse and the “crowded” areas in space (or units),
thereby discovering the overall distribution patterns of the data set.
• A unit is dense if the fraction of total data points contained in it exceeds an input model
parameter
STING (STatistical INformation Grid) and OPTICS (Ordering Points To Identify the Clustering
Structure) are clustering algorithms used in unsupervised learning. They are machine learning
techniques used to group the given input data points into clusters on the basis of their attributes.
STING is a grid-based clustering algorithm, while OPTICS is a density-based clustering algorithm.
STING Algorithm:
Example database with 5 transactions over the items milk, bread, butter, and beer:

transaction ID  milk  bread  butter  beer
1               1     1      0       0
2               0     0      1       0
3               0     0      0       1
4               1     1      1       0
5               0     1      0       0
For example, the rule {butter, bread} => {milk} has a confidence of 0.2/0.2 = 1.0
in the database, which means that for 100% of the transactions containing
butter and bread the rule is correct (100% of the times a customer buys butter and bread,
milk is bought as well). Confidence can be interpreted as an estimate of the conditional
probability P(Y|X), the probability of finding the RHS of the rule in transactions under
the condition that these transactions also contain the LHS.
The conviction of a rule is defined as conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y)),
and can be interpreted as the ratio of the expected frequency that X occurs without Y (that
is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent, divided by the observed frequency of incorrect predictions.
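The following is a minimal Python sketch that computes support, confidence, and conviction for the rule {butter, bread} => {milk} over the small transaction table above.

```python
# Support, confidence, and conviction for {butter, bread} => {milk} (table above).
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

lhs, rhs = {"butter", "bread"}, {"milk"}
conf = support(lhs | rhs) / support(lhs)
conviction = (1 - support(rhs)) / (1 - conf) if conf < 1 else float("inf")
print(support(lhs | rhs), conf, conviction)   # 0.2, 1.0, inf
```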
This process analyzes customer buying habits by finding associations between the different items
that customers place in their shopping baskets. The discovery of such associations can help
retailers develop marketing strategies by gaining insight into which items are frequently purchased
together by customers. For instance, if customers are buying milk, how likely are they to also buy
bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to
increased sales by helping retailers do selective marketing and plan their shelf space.
If customers who purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales of both
items. In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance, after
deciding on an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software and may decide to purchase a
home security system as well. Market basket analysis can also help retailers plan which items to
put on sale at reduced prices. If customers tend to purchase computers and printers together, then
having a sale on printers may encourage the sale of printers as well as computers.
We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the
maximal frequent itemsets, given a minimum support threshold.
We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match
frequent itemsets, top-k frequent itemsets, and so on.
Some methods for association rule mining can find rules at differing levels of abstraction.
For example, suppose that a set of association rules mined includes the following rules,
where X is a variable representing a customer:
buys(X, "computer") => buys(X, "HP printer")            (1)
buys(X, "laptop computer") => buys(X, "HP printer")     (2)
In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g.,
"computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule:
If the items or attributes in an association rule reference only one dimension, then it is a
single-dimensional association rule.
buys(X, "computer") => buys(X, "antivirus software")
If a rule references two or more dimensions, such as the dimensions age, income, and buys,
then it is a multidimensional association rule. The following rule is an example of a
multidimensional rule:
age(X, "30, 31…39") ^ income(X, "42K…48K") => buys(X, "high resolution TV")
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support. The resulting
set is denoted L1.Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used
to find L3, and so on, until no more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the database.
A two-step process is followed in Apriori, consisting of join and prune actions.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate
3-itemsets in C3 having minimum support.
8. The algorithm uses L3 x L3 to generate a candidate set of 4-itemsets, C4.
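The following is a minimal Python sketch of Apriori's level-wise search: frequent 1-itemsets L1 are found first, then each Lk-1 is joined with itself to generate candidates Ck, which are pruned using the Apriori property and checked against the minimum support count. The transactions and minimum support count are hypothetical.

```python
# Minimal Apriori sketch: level-wise join and prune (hypothetical data).
from itertools import combinations

transactions = [
    {"milk", "bread"}, {"butter"}, {"beer"},
    {"milk", "bread", "butter"}, {"bread"},
]
min_support = 2   # minimum support count

def frequent(candidates):
    # Keep candidates whose support count meets the minimum support
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_support}

items = {frozenset([i]) for t in transactions for i in t}
Lk = frequent(items)                      # L1
k, all_frequent = 2, set(Lk)
while Lk:
    # Join step: union pairs of (k-1)-itemsets that yield a k-itemset
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    Lk = frequent(candidates)
    all_frequent |= Lk
    k += 1

print(all_frequent)
```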
For many applications, it is difficult to find strong associations among data items at low
or primitive levels of abstraction due to the sparsity of data at those levels.
Strong associations discovered at high levels of abstraction may represent commonsense
knowledge.
Therefore, data mining systems should provide capabilities for mining association rules
at multiple levels of abstraction, with sufficient flexibility for easy traversal among
different abstraction spaces.
Association rules generated from mining data at multiple levels of abstraction are called
multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the
calculation of frequent itemsets at each concept level, starting at the concept level 1 and
working downward in the hierarchy toward the more specific concept levels, until no more
frequent itemsets can be found.
Here, Level 1 includes computer, software, printer & camera, and computer accessory.
Level 2 includes laptop computer, desktop computer, office software, antivirus software
Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.
Level 4 is the most specific abstraction level of this hierarchy.
5. Technologies
Parallel computing
Single systems with many processors work on same problem.
Distributed computing
Many systems loosely coupled by a scheduler to work on related problems.
Grid Computing (Meta Computing)
Many systems tightly coupled by software, perhaps geographically distributed, are
made to work together on single problems or on related problems.