
UNIT 4

Imagine that you are a sales manager at AllElectronics, and you are talking to a customer who recently bought a PC and a digital camera from the store. What should you recommend to her next? Information about which products are frequently purchased by your customers following their purchases of a PC and a digital camera in sequence would be very helpful in making your recommendation. Frequent patterns and association rules are the knowledge that you want to mine in such a scenario.

Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear
frequently in a data set. For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying
first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping
history database, is a (frequent) sequential pattern. A substructure can refer to different
structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with
itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured
pattern. Finding frequent patterns plays an essential role in mining associations, correlations,
and many other interesting relationships among data. Moreover, it helps in data classification,
clustering, and other data mining tasks. Thus, frequent pattern mining has become an important
data mining task and a focused theme in data mining research.

Market Basket Analysis


Market Basket Analysis is one of the key techniques used by large retailers to uncover
associations between items. It works by looking for combinations of items that occur together
frequently in transactions. To put it another way, it allows retailers to identify relationships
between the items that people buy.
Association rules are widely used to analyze retail basket or transaction data, and are intended to identify strong rules discovered in transaction data using measures of interestingness.

Market Basket Analysis: A Motivating Example


Frequent itemset mining leads to the discovery of associations and correlations among items in
large transactional or relational data sets. With massive amounts of data continuously being
collected and stored, many industries are becoming interested in mining such patterns from
their databases. The discovery of interesting correlation relationships among huge amounts of
business transaction records can help in many business decision-making processes such as
catalog design, cross-marketing, and customer shopping behavior analysis. A typical example of
frequent itemset mining is market basket analysis. This process analyzes customer buying
habits by finding associations between the different items that customers place in their
“shopping baskets” (Figure 6.1). The discovery of these associations can help retailers develop
marketing strategies by gaining insight into which items are frequently purchased together by
customers. For instance, if customers are buying milk, how likely are they to also buy bread (and
what kind of bread) on the same trip
to the supermarket? This information can lead to increased sales by helping retailers do
selective marketing and plan their shelf space.

Let’s look at an example of how market basket analysis can be useful.

Example 6.1 Market basket analysis. Suppose, as manager of an AllElectronics branch, you
would like to learn more about the buying habits of your customers. Specifically, you wonder,
“Which groups or sets of items are customers likely to purchase on a given trip to the store?” To
answer your question, market basket analysis may be performed on the retail data of customer
transactions at your store. You can then use the results to plan marketing or advertising
strategies, or in the design of a new catalog. For instance, market basket analysis may help you
design different store layouts. In one strategy, items that are frequently purchased together can
be placed in proximity to further encourage the combined sale of such items. If customers who
purchase computers also tend to buy antivirus software at the same time, then placing the
hardware display close to the software display may help increase the sales of both items. In an
alternative strategy, placing hardware and software at opposite ends of the store may entice
customers who purchase such items to pick up other items along the way. For instance, after
deciding on an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software, and may decide to purchase
a home security system as well. Market basket analysis can also help retailers plan which items
to put on sale at reduced prices. If customers tend to purchase computers and printers together,
then having a sale on printers may encourage the sale of printers as well as computers. If we
think of the universe as the set of items available at the store, then each item has a Boolean
variable representing the presence or absence of that item. Each basket can then be represented
by a Boolean vector of values assigned to these variables. The Boolean vectors can be analyzed
for buying patterns that reflect items that are frequently associated or purchased together. These
patterns can be represented in the form of association rules. For example, the information that
customers who purchase computers also tend to buy antivirus software at the same time is
represented in the following association rule:

computer ⇒ antivirus_software [support = 2%, confidence = 60%]     (Rule 6.1)
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for Rule (6.1) means
that 2% of all the transactions under analysis show that computer and antivirus software are
purchased together. A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software. Typically, association rules are considered interesting if they
satisfy both a minimum support threshold and a minimum confidence threshold. These
thresholds can be set by users or domain experts. Additional analysis can be performed to
discover interesting statistical correlations between associated items.

Example 1

Example 2
6.2 Frequent Itemset Mining Methods

6.2.1 Apriori Algorithm: Finding Frequent Itemsets by Confined Candidate Generation


Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining
frequent itemsets for Boolean association rules [AS94b]. The name of the algorithm is based on
the fact that the algorithm uses prior knowledge of frequent itemset properties, as we shall see
later. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found by scanning the
database to accumulate the count for each item, and collecting those items that satisfy minimum
support. The resulting set is denoted by L1.

Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on,
until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of
the database. To improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the Apriori property is used to reduce the search space.
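To make the level-wise search concrete, here is a minimal Python sketch of the idea (not the book's pseudocode): transactions are assumed to be sets of items, min_sup is an absolute support count, and the function name and toy baskets are made up for the illustration.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent itemset mining (minimal sketch)."""
    # First scan: count single items and keep those meeting minimum support (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: counts[s] for s in Lk}
    k = 2
    while Lk:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in Lk for sub in combinations(c, k - 1))}
        # One full scan of the database to count the surviving candidates
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "beer"}]
print(apriori(baskets, min_sup=2))  # e.g. {milk}, {bread}, {milk, bread}
```

The prune step is where the Apriori property earns its keep: a candidate is discarded as soon as any of its (k−1)-subsets is missing from the previous level, without touching the database.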
6.3.1 The PCY (Park, Chen, Yu) Algorithm

This algorithm, which we call PCY after its authors, exploits the observation that there may be
much unused space in main memory on the first pass. If there are a million items and gigabytes
of main memory, we do not need more than 10% of the main memory for the two tables
suggested in the figure above.
The PCY Algorithm uses that space for an array of integers that generalizes the idea of a Bloom
filter.
The idea is shown schematically in Fig. 2. Think of this array as a hash table, whose buckets hold
integers rather than sets of keys (as in an ordinary hash table) or bits (as in a Bloom filter).
Pairs of items are hashed to buckets of this hash table. As we examine a basket during the first
pass, we not only add 1 to the count for each item in the basket, but we generate all the pairs,
using a double loop.
We hash each pair, and we add 1 to the bucket into which that pair hashes. Note that the pair
itself doesn’t go into the bucket; the pair only affects the single integer in the bucket.
At the end of the first pass, each bucket has a count, which is the sum of the counts of all the
pairs that hash to that bucket. If the count of a bucket is at least as great as the support threshold
s, it is called a frequent bucket.
We can say nothing about the pairs that hash to a frequent bucket; they could all be frequent
pairs from the information available to us. But if the count of the bucket is less than s (an
infrequent bucket), we know no pair that hashes to this bucket can be frequent, even if the pair
consists of two frequent items.
That fact gives us an advantage on the second pass. We can define the set of candidate pairs C2
to be those pairs {i, j} such that:

1. i and j are frequent items.


2. {i, j} hashes to a frequent bucket.
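As a rough illustration of the first pass and the two candidate conditions above, here is a hedged Python sketch; the bucket count, the use of Python's built-in hash as the pair hash function, and the helper names are assumptions for the example rather than part of the original description.

```python
from itertools import combinations

def pcy_first_pass(baskets, num_buckets, s):
    """PCY pass 1 (sketch): count items and hash each pair's occurrence into a bucket."""
    item_counts = {}
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        # Double loop over the basket's pairs: the pair itself is not stored,
        # it only adds 1 to the integer in the bucket it hashes to.
        for i, j in combinations(sorted(basket), 2):
            bucket_counts[hash((i, j)) % num_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # Summarize the buckets as a bitmap: 1 marks a frequent bucket (count >= s)
    bitmap = [1 if c >= s else 0 for c in bucket_counts]
    return frequent_items, bitmap

def pcy_candidate_pairs(baskets, frequent_items, bitmap, num_buckets):
    """C2 (sketch): both items frequent AND the pair hashes to a frequent bucket."""
    candidates = set()
    for basket in baskets:
        for i, j in combinations(sorted(basket), 2):
            if (i in frequent_items and j in frequent_items
                    and bitmap[hash((i, j)) % num_buckets]):
                candidates.add((i, j))
    return candidates

baskets = [{"milk", "bread"}, {"milk", "bread", "beer"}, {"bread", "beer"}]
items, bitmap = pcy_first_pass(baskets, num_buckets=101, s=2)
print(pcy_candidate_pairs(baskets, items, bitmap, num_buckets=101))
```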

6.3.2 The Multistage Algorithm

The Multistage Algorithm improves upon PCY by using several successive hash tables to reduce
further the number of candidate pairs. The tradeoff is that Multistage takes more than two
passes to find the frequent pairs. An outline of the Multistage Algorithm is shown in Fig. 6.6. The
first pass of Multistage is the same as the first pass of PCY. After that pass, the frequent buckets
are identified and summarized by a bitmap, again
the same as in PCY. But the second pass of Multistage does not count the candidate pairs. Rather,
it uses the available main memory for another hash table, using another hash function. Since the
bitmap from the first hash table takes up 1/32 of the available main memory, the second hash
table has almost as many buckets as the first.
On the second pass of Multistage, we again go through the file of baskets. There is no need to
count the items again, since we have those counts from the first pass. However, we must retain
the information about which items are frequent, since we need it on both the second and third
passes. During the second pass, we hash certain pairs of items to buckets of the second hash
table. A pair is hashed only if it meets the two criteria for being counted in the second pass of
PCY; that is, we hash {i, j} if and only if i and j are both frequent, and the pair hashed to a
frequent bucket on the first pass. As a result, the sum of the counts in the second hash table
should be significantly less than the sum for the first pass. The result is that, even though the
second hash table has only 31/32 of the number of buckets that the first table has, we expect
there to be many fewer frequent buckets in the second hash table than in the first.

After the second pass, the second hash table is also summarized as a bitmap, and that bitmap is
stored in main memory. The two bitmaps together take up slightly less than 1/16th of the
available main memory, so there is still plenty of space to count the candidate pairs on the third
pass. A pair {i, j} is in C2 if
and only if:

1. i and j are both frequent items.


2. {i, j} hashed to a frequent bucket in the first hash table.
3. {i, j} hashed to a frequent bucket in the second hash table.

The third condition is the distinction between Multistage and PCY. It should be obvious that it is
possible to insert any number of passes between the first and last in the Multistage Algorithm.
The limiting factor is that each pass must store the bitmaps from all of the previous passes;
eventually, there is not enough space left in main memory to do the counts.
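The membership test for C2 can be stated compactly in code. A minimal sketch, assuming the bitmaps produced on the first and second passes and the two hash functions h1 and h2 are already available; all names here are illustrative.

```python
def multistage_candidate(i, j, frequent_items, bitmap1, h1, bitmap2, h2):
    """Pair {i, j} is counted on the third pass only if both items are frequent and the
    pair hashed to a frequent bucket in BOTH hash tables (the extra condition over PCY)."""
    return (i in frequent_items and j in frequent_items
            and bool(bitmap1[h1(i, j)]) and bool(bitmap2[h2(i, j)]))

# Illustrative hash functions for tables of n1 and n2 buckets:
# h1 = lambda i, j: hash((i, j)) % n1
# h2 = lambda i, j: hash((j, i, "second")) % n2
```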

6.3.3 The Multihash Algorithm

Sometimes, we can get most of the benefit of the extra passes of the Multistage Algorithm in a
single pass. This variation of PCY is called the Multihash Algorithm. Instead of using two
different hash tables on two successive passes, use two hash functions and two separate hash
tables that share main memory on the first pass, as suggested by Fig. 6.7.

The danger of using two hash tables on one pass is that each hash table has half as many buckets
as the one large hash table of PCY. As long as the average count of a bucket for PCY is much
lower than the support threshold, we can operate two half-sized hash tables and still expect
most of the buckets of both hash tables to be infrequent. Thus, in this situation we might well
choose the multihash approach.

For the second pass of Multihash, each hash table is converted to a bitmap, as usual. Note that
the two bitmaps for the two hash functions in Fig. 6.7 occupy exactly as much space as a single
bitmap would for the second pass of the PCY Algorithm. The conditions for a pair {i, j} to be in
C2, and thus to require a count on the second pass, are the same as for the third pass of
Multistage: i and j must both be frequent, and the pair must have hashed to a frequent bucket
according to both hash tables.
Just as Multistage is not limited to two hash tables, we can divide the available main memory
into as many hash tables as we like on the first pass of Multihash. The risk is that should we use
too many hash tables, the average count for a bucket will exceed the support threshold. At that
point, there may be very few infrequent buckets in any of the hash tables. Even though a pair
must hash to a frequent bucket in every hash table to be counted, we may find that the
probability an infrequent pair will be a candidate rises, rather than
falls, if we add another hash table.

6.4 Limited-Pass Algorithms


The algorithms for frequent itemsets discussed so far use one pass for each size of itemset we
investigate. If main memory is too small to hold the data and the space needed to count frequent
itemsets of one size, there does not seem to be any way to avoid k passes to compute the exact
collection of frequent itemsets. However, there are many applications where it is not essential to
discover every
frequent itemset. For instance, if we are looking for items purchased together at a supermarket,
we are not going to run a sale based on every frequent itemset we find, so it is quite sufficient to
find most but not all of the frequent itemsets.
In this section we explore some algorithms that have been proposed to find all or most frequent
itemsets using at most two passes. We begin with the obvious approach of using a sample of the
data rather than the entire dataset. An algorithm called SON uses two passes, gets the exact
answer, and lends itself to implementation by map-reduce or another parallel computing
regime. Finally, Toivonen's Algorithm uses two passes on average, gets the exact answer, but
may, rarely, fail to terminate in any fixed amount of time.

6.4.1 The Simple, Randomized Algorithm

Instead of using the entire file of baskets, we could pick a random subset of the baskets and
pretend it is the entire dataset. We must adjust the support threshold to reflect the smaller
number of baskets. For instance, if the support threshold for the full dataset is s, and we choose a
sample of 1% of the baskets, then we should examine the sample for itemsets that appear in at
least s/100 of the baskets.
The safest way to pick the sample is to read the entire dataset, and for each basket, select that
basket for the sample with some fixed probability p. Suppose there are m baskets in the entire
file. At the end, we shall have a sample whose size is very close to pm baskets. However, if we
have reason to believe that the baskets appear in random order in the file already, then we do
not even have to read the entire file. We can select the first pm baskets for our sample. Or, if the
file is part of a distributed file system, we can pick some chunks at random to serve as the
sample.

Having selected our sample of the baskets, we use part of main memory to store these baskets.
The balance of the main memory is used to execute one of the algorithms we have discussed,
such as A-Priori, PCY, Multistage, or Multihash. However, the algorithm must run passes over the
main-memory sample for each itemset size, until we find a size with no frequent items. There are
no disk accesses needed to read the sample, since it resides in main memory. As frequent
itemsets of each size are discovered, they can be written out to disk; this operation and the
initial reading of the sample from disk are the only disk I/O’s the algorithm does.

Of course the algorithm will fail if whichever method from Section 6.2 or 6.3 we choose cannot
be run in the amount of main memory left after storing the sample. If we need more main
memory, then an option is to read the sample from disk for each pass. Since the sample is much
smaller than the full dataset, we still avoid most of the disk I/O’s that the algorithms discussed
previously would use.
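A small sketch of the sampling step and the threshold scaling just described, assuming the baskets can be iterated in a single pass; the helper names and toy data are made up.

```python
import random

def sample_baskets(basket_file, p):
    """Select each basket for the sample with fixed probability p (one pass over the file)."""
    return [b for b in basket_file if random.random() < p]

def scaled_threshold(s, p):
    """If the full-dataset support threshold is s, mine the sample at roughly p*s."""
    return max(1, round(p * s))

baskets = [{"milk", "bread"}, {"bread", "beer"}] * 500   # toy stand-in for a large file
sample = sample_baskets(baskets, p=0.01)
print(len(sample), scaled_threshold(s=100, p=0.01))      # about 10 baskets, threshold 1
```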

6.4.3 The Algorithm of Savasere, Omiecinski, and Navathe (SON)

Our next improvement avoids both false negatives and false positives, at the cost of making two
full passes.

The idea is to divide the input file into chunks (which may be “chunks” in the sense of a
distributed file system, or simply a piece of the file). Treat each chunk as a sample, and run the
algorithm of Section 6.4.1 on that chunk. We use ps as the threshold, if each chunk is fraction p of
the whole file, and s is the support threshold. Store on disk all the frequent itemsets found for
each chunk.

Once all the chunks have been processed in that way, take the union of all the itemsets that have
been found frequent for one or more chunks. These are the candidate itemsets. Notice that if an
itemset is not frequent in any chunk, then its support is less than ps in each chunk. Since the
number of chunks is 1/p, we conclude that the total support for that itemset is less than (1/p)ps
= s. Thus, every itemset that is frequent in the whole is frequent in at least one chunk, and we
can be sure that all the truly frequent itemsets are among the candidates; i.e., there are no false
negatives. We have made a total of one pass through the data as we read each chunk and
processed it. In a second pass, we count all the candidate itemsets and
select those that have support at least s as the frequent itemsets.
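The two passes can be sketched as follows. This version keeps everything in memory to stay short (a real implementation would stream the chunks from disk or distribute them with map-reduce), and the mine_chunk callable is an assumption; it could be the Apriori sketch shown earlier, run at the lowered threshold ps.

```python
def son(baskets, s, num_chunks, mine_chunk):
    """SON sketch: pass 1 mines each chunk at threshold p*s, pass 2 counts the candidates."""
    chunk_size = (len(baskets) + num_chunks - 1) // num_chunks
    p = 1.0 / num_chunks
    candidates = set()
    # Pass 1: mine each chunk with the lowered threshold and take the union of the results
    for start in range(0, len(baskets), chunk_size):
        chunk = baskets[start:start + chunk_size]
        candidates |= set(mine_chunk(chunk, max(1, int(p * s))))
    # Pass 2: count every candidate over the full dataset; keep those with support >= s
    counts = {c: 0 for c in candidates}
    for basket in baskets:
        for c in candidates:
            if c <= basket:
                counts[c] += 1
    return {c for c, n in counts.items() if n >= s}

# e.g. son(baskets, s=100, num_chunks=10, mine_chunk=lambda chunk, t: apriori(chunk, t))
```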

6.4.5 Toivonen’s Algorithm

This algorithm will give neither false negatives nor false positives, but there is a small but finite
probability that it will fail to produce any answer at all. In that case it needs to be repeated until
it gives an answer. However, the average number of passes needed before it produces all and
only the frequent itemsets is a small constant.

Toivonen’s algorithm begins by selecting a small sample of the input dataset, and finding from it
the candidate frequent itemsets. The process is exactly that of Section 6.4.1, except that it is
essential the threshold be set to something less than its proportional value. That is, if the
support threshold for the whole dataset is s, and the sample size is fraction p, then when looking
for frequent itemsets in the sample, use a threshold such as 0.9ps or 0.8ps. The smaller we make
the threshold, the more main memory we need for computing all itemsets that are frequent in
the sample, but the more likely we are to avoid the situation where the algorithm fails to provide
an answer. Having constructed the collection of frequent itemsets for the sample, we next
construct the negative border. This is the collection of itemsets that are not frequent in the
sample, but all of their immediate subsets (subsets constructed by deleting exactly one item) are
frequent in the sample.

To complete Toivonen’s algorithm, we make a pass through the entire dataset, counting all the
itemsets that are frequent in the sample or are in the negative border. There are two possible
outcomes.

1. No member of the negative border is frequent in the whole dataset. In this case, the correct set
of frequent itemsets is exactly those itemsets from the sample that were found to be frequent in
the whole.

2. Some member of the negative border is frequent in the whole. Then we cannot be sure that
there are not some even larger sets, in neither the negative border nor the collection of frequent
itemsets for the sample, that are also frequent in the whole. Thus, we can give no answer at this
time and must repeat the algorithm with a new random sample.
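The negative border can be built directly from its definition: extend each frequent itemset of the sample (or the empty set) by one item, and keep the result if it is not frequent but every immediate subset is. A minimal sketch, assuming the sample's frequent itemsets are given as a set of frozensets; the names are illustrative.

```python
def negative_border(frequent_itemsets, items):
    """Itemsets not frequent in the sample whose immediate subsets are all frequent (sketch)."""
    frequent = set(frequent_itemsets)
    border = set()
    seeds = frequent | {frozenset()}          # the empty set lets single items enter the border
    for f in seeds:
        for item in items:
            cand = f | {item}
            if cand in frequent or len(cand) == len(f):
                continue                      # already frequent, or item was already in f
            subsets = [cand - {x} for x in cand]
            if all(s in frequent or len(s) == 0 for s in subsets):
                border.add(cand)
    return border

freq = {frozenset({"A"}), frozenset({"B"}), frozenset({"A", "B"})}
print(negative_border(freq, items={"A", "B", "C"}))   # here only {C} is in the border
```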
CLUSTERING IN DATA MINING
Clustering is an unsupervised machine learning technique that groups data points into clusters
so that objects in the same group are similar to one another. The process of grouping a set of
physical or abstract objects into classes of similar objects is called clustering. Clustering splits
data into several subsets; each of these subsets contains data similar to each other, and these
subsets are called clusters. A cluster is a collection of data objects that are similar to one another
within the same cluster and are dissimilar to the objects in other clusters. Clustering is also
called data segmentation in some applications because clustering partitions large data sets into
groups according to their similarity. Clustering can also be used for outlier detection.

For example, suppose we are a marketing manager with a new, tempting product to sell. We are
sure that the product would bring enormous profit, as long as it is sold to the right people. So,
how can we tell who is best suited for the product from our company's huge customer base? By
dividing the customer base into clusters, we can make an informed decision about which group
we think is best suited for this product.

Figure 1: Application of Clustering Algorithm

• In machine learning, clustering is an example of unsupervised learning. Unlike classification,
clustering and unsupervised learning do not rely on predefined classes and class-labeled training
examples. For this reason, clustering is a form of learning by observation, rather than learning by
examples.

A Categorization of Major Clustering Methods

(1) Partitioning methods


A partitioning method classifies the data into k groups, which together satisfy the following requirements:

 each group must contain at least one object, and


 each object must belong to exactly one group
• It then uses an iterative relocation technique that attempts to improve the partitioning
by moving objects from one group to another.
• The general criterion of a good partitioning is that objects in the same cluster are
“close” or related to each other, whereas objects of different clusters are “far apart” or
very different.
• Popular heuristic methods, such as
(1) the k-means algorithm, where each cluster is represented by the mean value
of the objects in the cluster, and
(2) the k-medoids algorithm, where each cluster is represented by one of the
objects located near the center of the cluster.

1 (a) Centroid-Based Technique: The k-Means Method


• The k-means algorithm takes the input parameter, k, and partitions a set of n objects
into k clusters so that the resulting intra cluster similarity is high but the inter cluster
similarity is low.
• Cluster similarity is measured in regard to the mean value of the objects in a cluster
• First, it randomly selects k of the objects, each of which initially represents a cluster
mean or center.
For each of the remaining objects, an object is assigned to the cluster to which it is
the most similar, based on the distance between the object and the cluster mean.

Figure 2: Clustering of a set of objects based on the k-means method
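A minimal k-means sketch for 2-D points, following the steps just described: pick k random initial means, assign each object to the nearest mean, recompute the means, and stop when nothing moves. The toy points and the fixed iteration cap are assumptions for the example.

```python
import random

def k_means(points, k, iterations=100):
    """Minimal k-means sketch for 2-D points."""
    means = random.sample(points, k)              # randomly select k objects as initial means
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each object to the cluster whose mean is closest (squared distance)
            idx = min(range(k), key=lambda i: (p[0] - means[i][0]) ** 2 + (p[1] - means[i][1]) ** 2)
            clusters[idx].append(p)
        # Recompute each cluster mean from its current members
        new_means = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else means[i]
                     for i, c in enumerate(clusters)]
        if new_means == means:                    # converged: no mean moved
            break
        means = new_means
    return means, clusters

pts = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
print(k_means(pts, k=2)[0])
```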

1(b) Representative Object-Based Technique: The k-Medoids Method


• The k-means algorithm is sensitive to outliers because an object with an extremely
large value may substantially distort the distribution of data.
• Instead of taking the mean value of the objects in a cluster as a reference point, we
can pick actual objects to represent the clusters, using one representative object per
cluster.
• Each remaining object is clustered with the representative object to which it is the
most similar.
• The partitioning method is then performed based on the principle of minimizing the sum of the
dissimilarities between each object and its corresponding reference point. That is, an
absolute-error criterion is used: E = Σ_{j=1..k} Σ_{p ∈ Cj} dist(p, oj).

 Here p is the point in space representing a given object in cluster Cj, and oj is the
representative object of Cj. In general, the algorithm iterates until, eventually, each
representative object is actually the medoid, or most centrally located object, of its cluster.
• Case 1: p currently belongs to representative object oj. If oj is replaced by o_random as a
representative object and p is closest to one of the other representative objects oi, i ≠ j, then p is
reassigned to oi.
• Case 2: p currently belongs to representative object oj. If oj is replaced by o_random as a
representative object and p is closest to o_random, then p is reassigned to o_random.
• Case 3: p currently belongs to representative object oi, i ≠ j. If oj is replaced by o_random as a
representative object and p is still closest to oi, then the assignment does not change.
• Case 4: p currently belongs to representative object oi, i ≠ j. If oj is replaced by o_random as a
representative object and p is closest to o_random, then p is reassigned to o_random.

Figure 3: Four cases of the cost function for k-medoids clustering
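The four cases reduce to one simple rule: after the candidate swap of oj for o_random, each point p simply belongs to whichever representative object is now closest, and the swap is kept only if it lowers the total cost. A hedged sketch (the helper names and 1-D toy data are made up):

```python
def reassign(p, medoids, j, o_random, dist):
    """Where does p go if medoid j is replaced by o_random? It is assigned to whichever
    representative object is closest after the swap (this covers all four cases)."""
    new_medoids = [o_random if idx == j else m for idx, m in enumerate(medoids)]
    return min(new_medoids, key=lambda m: dist(p, m))

def total_cost(points, medoids, dist):
    """Sum of dissimilarities between each object and its closest representative object."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

d = lambda a, b: abs(a - b)
print(reassign(7, [2, 10], j=1, o_random=8, dist=d))    # 7 is now closest to 8
print(total_cost([1, 2, 3, 8, 9, 10], [2, 10], d))      # cost of the current medoid set
```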

(2) Hierarchical methods

A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical clustering method works by grouping data objects into a tree of clusters.

Classification:

o Agglomerative & Divisive Hierarchical Clustering


o CURE
o Chameleon
There are two approaches to improving the quality of hierarchical clustering:

 perform careful analysis of object “linkages” at each hierarchical partitioning, such


as in Chameleon, or
 integrate hierarchical agglomeration and other approaches by first using a
hierarchical agglomerative algorithm to group objects into microclusters, and then
performing macroclustering on the microclusters using another clustering method
such as iterative relocation, as in BIRCH

(2a) Agglomerative Hierarchical Clustering

• The agglomerative approach, also called the bottom-up approach, starts with each object
forming a separate group. It successively merges the objects or groups that are close to one
another, until all of the groups are merged into one (the topmost level of the hierarchy), or until
a termination condition is satisfied.

• The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster.

• In each successive iteration, it subdivides a cluster into smaller and smaller pieces, until each
object forms a cluster on its own or until certain termination conditions are satisfied, such as
each cluster being within a certain diameter threshold.

Figure 4: Agglomerative and Divisive Hierarchical clustering

EXAMPLE:

• The figure shows AGNES (AGglomerative NESting), an agglomerative hierarchical


clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering
method, to a data set of five objects,{a, b, c, d, e }.

• Initially, AGNES places each object into a cluster of its own.


• The clusters are then merged step-by-step according to some criterion.

• For example, clusters C1 and C2 may be merged if an object in C1 and an object in


C2 form the minimum Euclidean distance between any two objects from different
clusters.

• This is a single-linkage approach in that each cluster is represented by all of the


objects in the cluster, and the similarity between two clusters is measured by the
similarity of the closest pair of data points belonging to different clusters.

• The cluster merging process repeats until all of the objects are eventually merged to
form one cluster.

• In DIANA, all of the objects are used to form one initial cluster.

• The cluster is split according to some principle, such as the maximum Euclidean
distance between the closest neighboring objects in the cluster.

• The cluster splitting process repeats until, eventually, each new cluster contains only a
single object
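A short single-linkage sketch of the merging loop just described: start with every object in its own cluster and repeatedly merge the pair of clusters whose closest members are nearest, until the desired number of clusters remains. The function name, the target-cluster parameter, and the toy 2-D points are illustrative; this is not the AGNES implementation itself.

```python
def single_linkage(points, dist, target_clusters=1):
    """Agglomerative clustering sketch with the single-linkage (closest pair) criterion."""
    clusters = [[p] for p in points]              # each object starts as its own cluster
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of points across clusters
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the two closest clusters
        del clusters[j]
    return clusters

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(single_linkage([(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)], euclid, target_clusters=2))
```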

(2b) CURE

The CURE algorithm assumes a Euclidean distance. It allows clusters to assume any shape. It
uses a collection of representative points to represent clusters.

Figure 5: Clusters of different shapes

For example, a dataset of engineers and humanities people is shown with their salary and age.
Figure 6: Dataset representation in terms of salary and age

We first form two clusters from the dataset of engineers and humanities people. The clusters
formed overlap each other, which does not give a useful solution.

Figure 7: Two cluster formation

We then try to create three clusters to achieve the segregation. But even after this cluster
formation, one cluster still contains values from both datasets.

Figure 8: Three cluster formation

Algorithm for cure:

Pass 1 of 2:

 Pick a random sample of points that fit in main memory.


 Cluster sample points hierarchically to create the initial clusters.
 Pick representative points:
o For each cluster, pick k (e.g., 4) representative points, as dispersed as possible.
o Move each representative point a fixed fraction (e.g., 20%) toward the centroid of the cluster.
Figure 9: Representative points or remote points in cluster

Figure 10: Remote points moving 20% toward centroid

Pass 2 of 2:

 Now, rescan the whole dataset and visit each point p in the data set.
 Place it in the “closest cluster”
o Closest: that cluster with the representative point closest to p, among all the representative
points of all the clusters.
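The representative-point step of Pass 1 can be sketched as follows for 2-D points: greedily pick k points that are as dispersed as possible, then move each one a fixed fraction (here 20%) toward the cluster centroid. The greedy farthest-point selection and the helper name are assumptions for the illustration.

```python
def cure_representatives(cluster_points, k=4, shrink=0.2):
    """Pick k dispersed representative points, then shrink them toward the centroid (sketch)."""
    n = len(cluster_points)
    cx = sum(x for x, _ in cluster_points) / n
    cy = sum(y for _, y in cluster_points) / n
    dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    # Start with the point farthest from the centroid, then repeatedly add the point
    # farthest from the representatives chosen so far (as dispersed as possible).
    reps = [max(cluster_points, key=lambda p: dist(p, (cx, cy)))]
    while len(reps) < min(k, n):
        reps.append(max((p for p in cluster_points if p not in reps),
                        key=lambda p: min(dist(p, r) for r in reps)))
    # Move each representative the fixed fraction toward the centroid
    return [(x + shrink * (cx - x), y + shrink * (cy - y)) for x, y in reps]

print(cure_representatives([(0, 0), (0, 4), (4, 0), (4, 4), (2, 2)], k=4, shrink=0.2))
```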

(2c) Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling

• Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to


determine the similarity between pairs of clusters.
• It was derived based on the observed weaknesses of two hierarchical clustering
algorithms: ROCK (ignores cluster nearness) and CURE (ignores cluster
interconnectivity)

How does Chameleon work?

• Chameleon uses a k-nearest-neighbor graph approach to construct a sparse graph,


where each vertex of the graph represents a data object, and there exists an edge
between two vertices (objects) if one object is among the k-most-similar objects of the
other.

• The edges are weighted to reflect the similarity between objects. Chameleon uses a
graph partitioning algorithm to partition the k-nearest-neighbor graph into a large
number of relatively small subclusters.

• It then uses an agglomerative hierarchical clustering algorithm that repeatedly merges


subclusters based on their similarity.

• To determine the pairs of most similar subclusters, it takes into account both the
interconnectivity as well as the closeness of the clusters

Figure 11: Chameleon – Hierarchical clustering based on k-nearest and dynamic modeling
Figure 12: Overall framework of Chameleon
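A minimal sketch of the sparse k-nearest-neighbor graph construction described above (the graph-partitioning and merging phases are not shown); the inverse-distance edge weight is only an illustrative stand-in for a similarity measure.

```python
def knn_graph(points, k, dist):
    """Edge (i, j) exists if j is among the k most similar objects of i (sketch)."""
    edges = {}
    for i, p in enumerate(points):
        neighbours = sorted((j for j in range(len(points)) if j != i),
                            key=lambda j: dist(p, points[j]))[:k]
        for j in neighbours:
            # Weight the edge by similarity; here simply the inverse of the distance
            edges[(min(i, j), max(i, j))] = 1.0 / (1e-9 + dist(p, points[j]))
    return edges

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
print(knn_graph([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)], k=2, dist=euclid))
```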

Clustering high-dimensional data


The clustering methods we have studied so far work well when the dimensionality is not
high, that is, having less than 10 attributes. There are, however, important applications of
high dimensionality. “How can we conduct cluster analysis on high-dimensional data”?
For example, All Electronics keeps track of the products purchased by every customer.
As a customer-relationship manager, you want to cluster customers into groups according
to what they purchased from All Electronics. All Electronics carries tens of thousands of
products.

It is easy to see that

dist(Ada, Bob) = dist(Bob, Cathy) = dist(Ada, Cathy) = √2.

According to Euclidean distance, the three customers are equivalently similar (or dissimilar)
to each other. However, a close look tells us that Ada should be more similar to Cathy than
to Bob because Ada and Cathy share one common purchased item, P1.
The traditional distance measures can be ineffective on high-dimensional data. Such
distance measures may be dominated by the noise in many dimensions. Therefore, clusters
in the full, high-dimensional space can be unreliable, and finding such clusters may not be
meaningful. Clustering high-dimensional data is the search for clusters and the space in
which they exist.

There are two challenges.

First challenge :

A major issue is how to create appropriate models for clusters in high-dimensional data.
Unlike conventional clusters in low-dimensional spaces, clusters hidden in high-dimensional
data are often significantly smaller. For example, when clustering customer-purchase data,
we would not expect many users to have similar purchase patterns. Searching for such
small but meaningful clusters is like finding needles in a haystack. We often have to consider
various more sophisticated techniques that can model correlations and consistency among
objects in subspaces.

Second Challenge:

There are typically an exponential number of possible subspaces or dimensionality


reduction options, and thus the optimal solutions are often computationally prohibitive.

For example, if the original data space has 1000 dimensions, and we want to find clusters
of dimensionality 10, then there are about 2.63 × 10^23 possible subspaces.
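The 2.63 × 10^23 figure is simply the number of ways to choose 10 dimensions out of 1000, which can be checked directly:

```python
import math

# Number of 10-dimensional subspaces of a 1000-dimensional space: C(1000, 10)
print(f"{math.comb(1000, 10):.3e}")   # about 2.63e+23
```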

Two major kinds of methods:

 Subspace clustering approaches search for clusters existing in subspaces of the


given high-dimensional data space, where a subspace is defined using a subset of
attributes in the full space. It is classified as

1. Subspace search methods

2. Correlation-based clustering methods

3. Biclustering methods

 Dimensionality reduction approaches try to construct a much lower-dimensional


space and search for clusters in such a space. Often, a method may construct new
dimensions by combining some dimensions from the original data.

Subspace Search Methods


A subspace search method searches various subspaces for clusters. Here, a cluster is a
subset of objects that are similar to each other in a subspace. The similarity is often
captured by conventional measures such as distance or density. A major challenge that
subspace search methods face is how to search a series of subspaces effectively and
efficiently.

Generally there are two kinds of strategies:

Bottom-up approaches start from low-dimensional subspaces and search higher


dimensional subspaces only when there may be clusters in those higher-dimensional
subspaces. Various pruning techniques are explored to reduce the number of higher
dimensional subspaces that need to be searched. CLIQUE is an example of a bottom-up
approach.

Top-down approaches start from the full space and search smaller and smaller
subspaces recursively. Top-down approaches are effective only if the locality assumption
holds, which requires that the subspace of a cluster can be determined by the local
neighborhood. PROCLUS is an example of a top-down subspace approach.

CLIQUE: A Dimension –Growth subspace clustering Method


• CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth
subspace clustering in high-dimensional space.

• In dimension-growth subspace clustering, the clustering process starts at single-dimensional
subspaces and grows upward to higher-dimensional ones.

• Because CLIQUE partitions each dimension like a grid structure and determines
whether a cell is dense based on the number of points it contains, it can also be
viewed as an integration of density-based and grid-based clustering methods

The ideas of the CLIQUE clustering algorithm are outlined as follows.

• Given a large set of multidimensional data points, the data space is usually not
uniformly occupied by the data points.

• CLIQUE’s clustering identifies the sparse and the “crowded” areas in space (or units),
thereby discovering the overall distribution patterns of the data set.

• A unit is dense if the fraction of total data points contained in it exceeds an input
model parameter
Figure : Density and Grid based clustering

How does CLIQUE work?

I STEP: CLIQUE partitions the d-dimensional data space into non-overlapping
rectangular units, identifying the dense units among these.

II STEP: The subspaces representing these dense units are intersected to form a
candidate search space in which dense units of higher dimensionality may exist.
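A loose sketch of these two steps, restricted to 2-D subspaces for brevity: find the dense units in each single dimension, then keep only those 2-D grid cells whose projections are dense 1-D units and whose own density exceeds the threshold. The grid size, density threshold, and toy points are assumptions, and the real algorithm continues this candidate generation into higher dimensionalities.

```python
from collections import Counter
from itertools import combinations

def clique_2d_dense_units(points, grid, tau):
    """Sketch of CLIQUE's first steps: dense 1-D units, then candidate 2-D dense units."""
    n, d = len(points), len(points[0])
    # Step I: partition each dimension into grid cells and find the dense 1-D units
    dense_1d = {}
    for dim in range(d):
        cells = Counter(int(p[dim] // grid) for p in points)
        dense_1d[dim] = {c for c, cnt in cells.items() if cnt / n > tau}
    # Step II: a 2-D unit is a candidate only if both of its 1-D projections are dense
    dense_2d = {}
    for d1, d2 in combinations(range(d), 2):
        cells = Counter((int(p[d1] // grid), int(p[d2] // grid)) for p in points)
        dense_2d[(d1, d2)] = {(c1, c2) for (c1, c2), cnt in cells.items()
                              if c1 in dense_1d[d1] and c2 in dense_1d[d2] and cnt / n > tau}
    return dense_1d, dense_2d

pts = [(1.1, 1.0), (1.2, 1.4), (1.3, 1.2), (8.0, 7.9), (1.0, 7.5)]
print(clique_2d_dense_units(pts, grid=1.0, tau=0.3))
```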

How effective is CLIQUE?


CLIQUE automatically finds subspaces of the highest dimensionality such that high-density
clusters exist in those subspaces. It is insensitive to the order of input objects. It scales linearly
with the size of the input and has good scalability as the number of dimensions in the data is
increased. However, the clustering results depend on proper tuning of the grid size and the
density threshold.

Graphical definition:
A clique is a group of nodes in a graph such that all nodes in the clique are connected to
each other; K = the number of nodes in the clique.
A community is a group of cliques such that all the cliques have K−1 nodes in common.

CLIQUE- Example 1
CLIQUE- Example 2

CLIQUE ( K =3)

a) {1,2,3}

b) {1,2,8}
c) {2,6,5}

d) {2,6,4}

e) {2,5,4}

f) {4,5,6}

Community 1= {a, b}

Community 2 = { c,d,e,f}

PROCLUS (PROjected CLUStering)


 PROCLUS (PROjected CLUStering) is the first top-down, partition-based projected
clustering algorithm. It is based on the concepts of k-medoid clustering and was proposed
by Aggarwal et al. (1999). It computes medoids for each cluster iteratively on a sample of
the data using a greedy hill-climbing technique and then improves the results iteratively.

 Cluster quality in PROCLUS is a function of the average distance between data points
and the nearest medoid. Also, the subspace dimensionality is an input parameter, which
generates clusters of similar sizes. PROCLUS is a typical dimension-reduction subspace
clustering method. That is, instead of starting from single-dimensional spaces, it starts by
finding an initial approximation of the clusters in the high-dimensional attribute space.

 Each dimension is then assigned a weight for each cluster and the updated weights
are used in the next iteration to regenerate the clusters. This leads to the
exploration of dense regions in all subspaces of some desired dimensionality. It
avoids the generation of a large number of overlapped clusters in lower
dimensionality.

 PROCLUS finds the best set of medoids by a hill-climbing process, generalized to deal
with projected clustering. It adopts a distance measure called the Manhattan segmental
distance (sketched after this list). The PROCLUS algorithm consists of three phases:
initialization, iteration, and cluster refinement.

 However, PROCLUS is faster than CLIQUE due to the sampling of large datasets, though
the use of a small number of representative points can cause PROCLUS to miss some
clusters entirely. Experiments on PROCLUS show that the method is efficient and scalable
at finding high-dimensional clusters. PROCLUS finds non-overlapping partitions of points.
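The Manhattan segmental distance mentioned above is commonly defined as the L1 distance averaged over the cluster's relevant set of dimensions; a small sketch (the dimension set and the points are made up):

```python
def manhattan_segmental_distance(x, y, dims):
    """Average L1 distance over the relevant dimensions dims (sketch)."""
    return sum(abs(x[d] - y[d]) for d in dims) / len(dims)

# Distance computed only over dimensions 0 and 1; dimension 2 is ignored
print(manhattan_segmental_distance((1.0, 5.0, 9.0), (2.0, 5.5, 0.0), dims=[0, 1]))   # 0.75
```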

INPUT AND OUTPUT FOR PROCLUS


Input:

The set of data points

Number of clusters, denoted by k

Average number of dimensions for each cluster, denoted by L

Output:

The clusters found, and the dimensions respected to such clusters

Three Phase for PROCLUS:

 Initialization Phase

1. Choose a sample set of data point randomly.

2. Choose a set of data point which is probably the medoids of the cluster

 Iterative Phase

1. From the Initialization Phase, we get a set of data points that should contain the medoids
(denoted by M). In this phase, we find the best medoids from M.

2. Randomly choose a set of points Mcurrent, and replace "bad" medoids with other points from
M whenever this improves cluster quality. The best medoid set formed this way is denoted
Mbest.

3. For the medoids, the following will be done:

 Find Dimensions related to the medoids

 Assign Data Points to the medoids


 Evaluate the Clusters formed

 Find the bad medoid, and try the result of replacing bad medoid

 The above procedure is repeated until we get a satisfactory result.

 Refinement Phase

The final step of this algorithm is the refinement phase. This phase is included to improve the
quality of the clusters formed. The clusters C1, C2, C3, …, Ck formed during the iterative
phase are the inputs to this phase. The original data set is passed over one or more times
to improve the quality of the clusters. The dimension sets Di found during the iterative
phase are discarded and new dimension sets are computed for each cluster Ci.

Once the new dimensions are computed for the clusters, the points are reassigned to the
medoids relative to these new sets of dimensions. Outliers are determined in the last pass
over the data.

Drawback:

 The algorithm requires the average number of dimensions per cluster as an input parameter.

 The performance of PROCLUS is highly sensitive to the value of its input parameter.

 If the average number of dimensions is erroneously estimated, the performance of


PROCLUS significantly worsens.
FREQUENT PATTERN BASED CLUSTERING
Frequent pattern mining can be applied to clustering, resulting in frequent pattern-based cluster
analysis. Frequent pattern mining can lead to the discovery of interesting associations and
correlations among data objects. The idea behind frequent pattern-based cluster analysis is that
the frequent patterns discovered may also indicate clusters. Frequent pattern-based cluster
analysis is well suited to high-dimensional data.

Rather than growing the clusters dimension by dimension, we grow sets of frequent itemsets,
which eventually lead to cluster descriptions.

Examples of frequent pattern based cluster analysis : Clustering of text documents that
contain thousands of distinct keywords.

Example: Text Clustering

 Text clustering is the application of cluster analysis to text-based documents.

Working:

 Descriptors (sets of words that describe topic matter) are extracted from the
document first.

 Then they are analyzed for the frequency in which they are found in the document
compared to other terms.

 After which, clusters of descriptors can be identified and then auto-tagged.

 From there, the information can be used in any number of ways

 Google’s search engine is probably the best and most widely known example.

 When you search for a term on Google, it pulls up pages that apply to that term.

 How can Google analyze billions of web pages to deliver accurate and fast results?

 It’s because of text clustering! Google’s algorithm breaks down unstructured data
from web pages and turns it into a matrix model, tagging pages with keywords that
are then used in search results!

There are two forms of frequent pattern based cluster analysis

1. Frequent term based text clustering

2. Clustering by pattern similarity in microarray data analysis.

1. Frequent term based text clustering


In frequent term-based text clustering, text documents are clustered based on the frequent
terms they contain. Examples include processing word documents, HTML pages, and so on. A
stemming algorithm is applied to reduce each term to its basic stem; in this way, each
document can be represented as a set of terms. The document space can then be mapped to a
vector space in which each document is represented by a term vector. A well-selected subset
of the set of all frequent term sets can be taken as the clustering. An advantage of frequent
term-based text clustering is that it automatically generates a description for each cluster in
terms of its frequent term set.

A stemming algorithm is a process of linguistic normalization, in which the variant


forms of a word are reduced to a common form, for example,

Connection, Connections, Connective, Connected, Connecting --------------------------> connect

It is important to appreciate that we use stemming with the intention of improving the
performance of IR systems.
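A toy sketch tying the two ideas together: a very crude suffix-stripping stemmer (not Porter's algorithm) and a frequent-term-based grouping in which each frequent stem serves as a cluster description. All names, suffix rules, and documents here are illustrative assumptions.

```python
def crude_stem(term):
    """Strip a few common suffixes to approximate a stem (illustrative only)."""
    for suffix in ("ions", "ion", "ive", "ed", "ing", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def frequent_term_clusters(documents, min_sup):
    """Each frequent stem doubles as a cluster description; a document joins the
    cluster of every frequent stem it contains (sketch)."""
    docs = [{crude_stem(w.lower()) for w in doc.split()} for doc in documents]
    counts = {}
    for terms in docs:
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
    frequent_terms = {t for t, c in counts.items() if c >= min_sup}
    return {t: [i for i, terms in enumerate(docs) if t in terms] for t in frequent_terms}

docs = ["connected cars and connections", "connecting devices", "quarterly sales report"]
print(frequent_term_clusters(docs, min_sup=2))   # e.g. {'connect': [0, 1]}
```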

2. Clustering by pattern similarity in microarray data analysis

Pcluster:

Another approach for clustering high dimensional data is based on pattern similarity
among the objects on a subset of dimensions. pCluster method performs clustering by
pattern similarity in microarray data analysis. An example is DNA microarray analysis.

DNA microarray analysis: A microarray is a laboratory tool used to detect the expression
of thousands of genes at the same time. DNA microarrays are microscope slides that
are printed with thousands of tiny spots in defined positions, with each spot containing a
known DNA sequence or gene.

Under the pCluster model, two objects are similar if they exhibit a coherent pattern on a
subset of dimensions. Although the magnitude of their expression levels may not be
close, the pattern they exhibit can be very much alike. The pCluster model though
developed in the study of microarray data cluster analysis can be applied to many other
applications that require finding similar or coherent patterns involving a subset of
numerical dimensions in large high dimensional data sets.

Clustering in non-euclidean space


When the space is non-Euclidean, we need to use some distance measure that is computed from
points, such as Jaccard, cosine, or edit distance. That is, we cannot base distances on the
"location" of points. A problem arises when we need to represent a cluster, because we cannot
replace a collection of points by their centroid.

 Suppose we are using edit distance, and we decide to merge the strings abcd and
aecdb.

 These have edit distance 3 and might well be merged.

 However, there is no string that represents their average, or that could be thought of
as lying naturally between them.

 We could take one of the strings that we might pass through when transforming one
string to the other by single insertions or deletions, such as aebcd, but there are many
such options.

 Moreover, when clusters are formed from more than two strings, the notion of “on the
path between” stops making sense.

 Given that we cannot combine points in a cluster when the space is non-Euclidean, our
only choice is to pick one of the points of the cluster itself to represent the cluster.

 Ideally, this point is close to all the points of the cluster, so it in some sense lies in the
“center.”

 We call the representative point the clustroid.

 We can select the clustroid in various ways, each designed to, in some sense,
minimize the distances between the clustroid and the other points in the cluster.

Common choices include selecting as the clustroid the point that minimizes:

 1. The sum of the distances to the other points in the cluster.

 2. The maximum distance to another point in the cluster.

 3. The sum of the squares of the distances to the other points in the cluster.

EXAMPLE:

 We are using edit distance (insertions and deletions only), and a cluster consists of the four
points abcd, aecdb, abecb, and ecdab. Their pairwise distances are:

          aecdb   abecb   ecdab
  abcd      3       3       5
  aecdb             2       2
  abecb                     4

 If we apply the three criteria for being the clustroid to each of the four points of the cluster,
we find the clustroid; the computation is sketched below.

 In general, different criteria could yield different clustroids.
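The selection can be computed directly. The sketch below uses the insert/delete edit distance of the running example, d(x, y) = |x| + |y| − 2·LCS(x, y), and applies each of the three criteria to the four strings; the helper names are made up.

```python
def lcs(x, y):
    """Length of the longest common subsequence (dynamic programming)."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if a == b else max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

def edit_distance(x, y):
    """Insert/delete edit distance: |x| + |y| - 2 * LCS(x, y)."""
    return len(x) + len(y) - 2 * lcs(x, y)

cluster = ["abcd", "aecdb", "abecb", "ecdab"]
criteria = [("sum of distances", lambda ds: sum(ds)),
            ("maximum distance", lambda ds: max(ds)),
            ("sum of squares", lambda ds: sum(d * d for d in ds))]
for name, criterion in criteria:
    clustroid = min(cluster, key=lambda p: criterion([edit_distance(p, q) for q in cluster if q != p]))
    print(name, "->", clustroid)
```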

GRGPF Algorithm
 Consider an algorithm that handles non-main-memory data, but does not require a
Euclidean space. The algorithm, which we shall refer to as GRGPF for its authors
(V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French), takes ideas from
both hierarchical and point-assignment approaches.

 Like CURE, it represents clusters by sample points in main memory. However, it also tries
to organize the clusters hierarchically, in a tree, so a new point can be assigned to the
appropriate cluster by passing it down the tree. Leaves of the tree hold summaries of some
clusters, and interior nodes hold subsets of the information describing the clusters reachable
through that node.

Representing Clusters in the GRGPF Algorithm

 As we assign points to clusters, the clusters can grow large. Most of the points in a
cluster are stored on disk, and are not used in guiding the assignment of points,
although they can be retrieved. If p is any point in a cluster, let ROWSUM(p) be the
sum of the squares of the distances from p to each of the other points in the cluster.

 The following features form the representation of a cluster.


1. N, the number of points in the cluster.

2. The clustroid of the cluster, which is defined specifically to be the point in the cluster
that minimizes the sum of the squares of the distances to the other points.

3. The rowsum of the clustroid of the cluster

4. For some chosen constant k, the k points of the cluster that are closest to the clustroid,
and their rowsums. These points are part of the representation in case the addition of
points to the cluster causes the clustroid to change. The assumption is made that the new
clustroid would be one of these k points near the old clustroid

5. The k points of the cluster that are furthest from the clustroid and their rowsums. These
points are part of the representation so that we can consider whether two clusters are
close enough to merge. The assumption is made that if two clusters are close, then a pair
of points distant from their respective clustroids would be close.
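A small sketch of ROWSUM(p) and the five-part summary listed above, for an arbitrary distance function. The dictionary layout, the toy Jaccard-style distance, and k = 1 in the usage line are illustrative assumptions, not the paper's data structures.

```python
def rowsum(p, cluster, dist):
    """ROWSUM(p): sum of squared distances from p to every other point in the cluster."""
    return sum(dist(p, q) ** 2 for q in cluster if q != p)

def cluster_representation(cluster, dist, k=2):
    """N, the clustroid and its rowsum, plus the k nearest and k furthest points (sketch)."""
    rowsums = {p: rowsum(p, cluster, dist) for p in cluster}
    clustroid = min(cluster, key=rowsums.get)     # minimizes the sum of squared distances
    others = sorted((q for q in cluster if q != clustroid), key=lambda q: dist(q, clustroid))
    return {"N": len(cluster),
            "clustroid": clustroid,
            "clustroid_rowsum": rowsums[clustroid],
            "nearest": [(q, rowsums[q]) for q in others[:k]],
            "furthest": [(q, rowsums[q]) for q in others[-k:]]}

jac = lambda a, b: 1 - len(set(a) & set(b)) / len(set(a) | set(b))   # toy distance on strings
print(cluster_representation(["abcd", "aecdb", "abecb", "ecdab"], jac, k=1))
```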

Initializing the Cluster Tree:

1. Cluster hierarchically into main memory a sample of the dataset.

We now have a tree T.

2. Select from T some nodes with some desired size n (or close to n).

These become the initial clusters for GRGPF. Place them in the leaves of the
cluster-representing tree (CRT).

3. Group clusters with a common ancestor in T into interior nodes of the CRT

Clustering for Streams:

 Use a sliding window and keep the last N points of data, where N is large (N ≫ 0).

 For any m ≤ N, do clustering on the m points. The clustering algorithm depends on the
data space.

 A simple way of keeping track of the data is to store it in buckets containing 2^k points.
For each fixed k, allow up to two buckets of size 2^k.

The bucket contents are

1. Size of the bucket.

2. Timestamp of the most recent point added to the bucket.

3. Cluster representations with each having

a. Size of cluster.

b. Centroid or clustroid.
c. Any feature needed to merge clusters

When a new point arrives it must go into a bucket. This causes bucket management
issues. Buckets also time out and can be deleted.
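The bucket contents described above can be captured in a small data structure; a sketch, with field names chosen for the example rather than taken from any particular implementation.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class ClusterSummary:
    size: int                 # number of points in the cluster
    center: Any               # centroid (Euclidean space) or clustroid (non-Euclidean space)
    merge_info: Any = None    # any extra feature needed to decide whether to merge clusters

@dataclass
class Bucket:
    size: int                 # number of points in the bucket, a power of 2
    timestamp: int            # timestamp of the most recent point added to the bucket
    clusters: List[ClusterSummary] = field(default_factory=list)

# Example: a bucket of 4 points summarized by one cluster
b = Bucket(size=4, timestamp=1027, clusters=[ClusterSummary(size=4, center=(2.5, 3.0))])
print(b)
```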
