full UNIT 4 notes
Imagine that you are a sales manager at AllElectronics, and you are talking to a customer who
recently bought a PC and a digital camera from the store. What should you recommend to her
next? Information about which products are frequently purchased by your customers following
their purchases of a PC and a digital camera in sequence would be very helpful in making your
recommendation. Frequent patterns and association rules
are the knowledge that you want to mine in such a scenario.
Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear
frequently in a data set. For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent itemset. A subsequence, such as buying
first a PC, then a digital camera, and then a memory card, if it occurs frequently in a shopping
history database, is a (frequent) sequential pattern. A substructure can refer to different
structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with
itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured
pattern. Finding frequent patterns plays an essential role in mining associations, correlations,
and many other interesting relationships among data. Moreover, it helps in data classification,
clustering, and other data mining tasks. Thus, frequent pattern mining has become an important
data mining task and a focused theme in data mining research.
Example 6.1 Market basket analysis. Suppose, as manager of an AllElectronics branch, you
would like to learn more about the buying habits of your customers. Specifically, you wonder,
“Which groups or sets of items are customers likely to purchase on a given trip to the store?” To
answer your question, market basket analysis may be performed on the retail data of customer
transactions at your store. You can then use the results to plan marketing or advertising
strategies, or in the design of a new catalog. For instance, market basket analysis may help you
design different store layouts. In one strategy, items that are frequently purchased together can
be placed in proximity to further encourage the combined sale of such items. If customers who
purchase computers also tend to buy antivirus software at the same time, then placing the
hardware display close to the software display may help increase the sales of both items. In an
alternative strategy, placing hardware and software at opposite ends of the store may entice
customers who purchase such items to pick up other items along the way. For instance, after
deciding on an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software, and may decide to purchase
a home security system as well. Market basket analysis can also help retailers plan which items
to put on sale at reduced prices. If customers tend to purchase computers and printers together,
then having a sale on printers may encourage the sale of printers as well as computers. If we
think of the universe as the set of items available at the store, then each item has a Boolean
variable representing the presence or absence of that item. Each basket can then be represented
by a Boolean vector of values assigned to these variables. The Boolean vectors can be analyzed
for buying patterns that reflect items that are frequently associated or purchased together. These
patterns can be represented in the form of association rules. For example, the information that
customers who purchase computers also tend to buy antivirus software at the same time is
represented in the following association rule:
computer ⇒ antivirus_software [support = 2%, confidence = 60%] (Rule 6.1)
Rule support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. A support of 2% for Rule (6.1) means
that 2% of all the transactions under analysis show that computer and antivirus software are
purchased together. A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software. Typically, association rules are considered interesting if they
satisfy both a minimum support threshold and a minimum confidence threshold. These
thresholds can be set by users or domain experts. Additional analysis can be performed to
discover interesting statistical correlations between associated items.
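To make the two measures concrete, here is a hedged sketch of computing support and confidence over a toy transaction list; the item names and transactions below are invented for illustration and are not from the notes.

# Minimal sketch: rule support and confidence over a toy transaction list.

transactions = [
    {"computer", "antivirus_software", "printer"},
    {"computer", "antivirus_software"},
    {"computer", "memory_card"},
    {"digital_camera", "memory_card"},
    {"computer", "printer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A union B) / support(A) for the rule A => B."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

A = {"computer"}
B = {"antivirus_software"}
print("support =", support(A | B, transactions))      # 2 of 5 transactions -> 0.4
print("confidence =", confidence(A, B, transactions)) # 2 of 4 computer buyers -> 0.5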
6.2 Frequent Itemset Mining Methods
Apriori is a level-wise algorithm for mining frequent itemsets. It first scans the database to find L1, the set of frequent 1-itemsets. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property (every nonempty subset of a frequent itemset must also be frequent) is used to reduce the search space.
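The level-wise search and the Apriori-property pruning can be sketched as follows; this is illustrative pseudocode made runnable (the names and data layout are assumptions), not code from the notes. Transactions are sets of items and min_support is an absolute count.

from itertools import combinations

def apriori(transactions, min_support):
    # L1: frequent 1-itemsets, found with the first full scan.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support}
    all_frequent = set(Lk)
    k = 2
    while Lk:
        # Candidate generation: join L(k-1) with itself, then prune using
        # the Apriori property (every (k-1)-subset must be frequent).
        candidates = set()
        for a in Lk:
            for b in Lk:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in Lk for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One full scan of the database to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, cnt in counts.items() if cnt >= min_support}
        all_frequent |= Lk
        k += 1
    return all_frequent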
6.3.1 PCY (Park, Chen, Yu) Algorithm
This algorithm, which we call PCY after its authors, exploits the observation that there may be
much unused space in main memory on the first pass. If there are a million items and gigabytes
of main memory, we do not need more than 10% of the main memory for the two tables
suggested in the figure above.
The PCY Algorithm uses that space for an array of integers that generalizes the idea of a Bloom
filter.
The idea is shown schematically in Fig. 2. Think of this array as a hash table, whose buckets hold
integers rather than sets of keys (as in an ordinary hash table) or bits (as in a Bloom filter).
Pairs of items are hashed to buckets of this hash table. As we examine a basket during the first
pass, we not only add 1 to the count for each item in the basket, but we generate all the pairs,
using a double loop.
We hash each pair, and we add 1 to the bucket into which that pair hashes. Note that the pair
itself doesn’t go into the bucket; the pair only affects the single integer in the bucket.
At the end of the first pass, each bucket has a count, which is the sum of the counts of all the
pairs that hash to that bucket. If the count of a bucket is at least as great as the support threshold
s, it is called a frequent bucket.
We can say nothing about the pairs that hash to a frequent bucket; they could all be frequent
pairs from the information available to us. But if the count of the bucket is less than s (an
infrequent bucket), we know no pair that hashes to this bucket can be frequent, even if the pair
consists of two frequent items.
That fact gives us an advantage on the second pass. We can define the set of candidate pairs C2
to be those pairs {i, j} such that:
1. i and j are both frequent items.
2. {i, j} hashed to a frequent bucket on the first pass.
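To make the two passes concrete, here is a hedged sketch of the PCY first pass and the candidate test above; NUM_BUCKETS, bucket_of, and the data layout are illustrative assumptions, not the notes' own code.

from itertools import combinations

NUM_BUCKETS = 1_000_003          # assumed size of the in-memory bucket array

def bucket_of(i, j):
    """Hash an unordered pair of items to a bucket index."""
    return hash((min(i, j), max(i, j))) % NUM_BUCKETS

def pcy_first_pass(baskets):
    item_counts = {}
    bucket_counts = [0] * NUM_BUCKETS
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        # Generate all pairs in the basket with a double loop and add 1 to
        # the bucket each pair hashes to. The pair itself is not stored;
        # only the single integer count in the bucket is affected.
        for i, j in combinations(sorted(basket), 2):
            bucket_counts[bucket_of(i, j)] += 1
    return item_counts, bucket_counts

def candidate_pairs(basket, item_counts, bitmap, s):
    """Pairs counted on the second pass: both items frequent and the
    pair hashes to a frequent bucket (bitmap entry is 1)."""
    return [
        (i, j)
        for i, j in combinations(sorted(basket), 2)
        if item_counts.get(i, 0) >= s
        and item_counts.get(j, 0) >= s
        and bitmap[bucket_of(i, j)]
    ]

Between the passes, bucket_counts would be condensed into a bitmap with one bit per bucket (set when the bucket count is at least s), which is what frees most of main memory for counting the candidate pairs on the second pass.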
The Multistage Algorithm improves upon PCY by using several successive hash tables to reduce
further the number of candidate pairs. The tradeoff is that Multistage takes more than two
passes to find the frequent pairs. An outline of the Multistage Algorithm is shown in Fig. 6.6. The
first pass of Multistage is the same as the first pass of PCY. After that pass, the frequent buckets
are identified and summarized by a bitmap, again
the same as in PCY. But the second pass of Multistage does not count the candidate pairs. Rather,
it uses the available main memory for another hash table, using another hash function. Since the
bitmap from the first hash table takes up 1/32 of the available main memory, the second hash
table has almost as many buckets as the first.
On the second pass of Multistage, we again go through the file of baskets. There is no need to
count the items again, since we have those counts from the first pass. However, we must retain
the information about which items are frequent, since we need it on both the second and third
passes. During the second pass, we hash certain pairs of items to buckets of the second hash
table. A pair is hashed only if it meets the two criteria for being counted in the second pass of
PCY; that is, we hash {i, j} if and only if i and j are both frequent, and the pair hashed to a
frequent bucket on the first pass. As a result, the sum of the counts in the second hash table
should be significantly less than the sum for the first pass. The result is that, even though the
second hash table has only 31/32 of the number of buckets that the first table has, we expect
there to be many fewer frequent buckets in the second hash table than in the first.
After the second pass, the second hash table is also summarized as a bitmap, and that bitmap is
stored in main memory. The two bitmaps together take up slightly less than 1/16th of the
available main memory, so there is still plenty of space to count the candidate pairs on the third
pass. A pair {i, j} is in C2 if and only if:
1. i and j are both frequent items.
2. {i, j} hashed to a frequent bucket in the first hash table.
3. {i, j} hashed to a frequent bucket in the second hash table.
The third condition is the distinction between Multistage and PCY. It should be obvious that it is
possible to insert any number of passes between the first and last in the Multistage Algorithm.
The limiting factor is that each pass must store the bitmaps from all of the previous hash tables;
eventually there is not enough main memory left to do the counts of the candidate pairs.
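As a compact illustration, here is a hedged sketch of that three-condition candidate test; the names bitmap1, bitmap2, h1, and h2 are assumptions.

def is_candidate_multistage(i, j, frequent_items, bitmap1, h1, bitmap2, h2):
    return (
        i in frequent_items
        and j in frequent_items
        and bitmap1[h1(i, j)]      # frequent bucket in the first hash table
        and bitmap2[h2(i, j)]      # frequent bucket in the second hash table
    )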
Sometimes, we can get most of the benefit of the extra passes of the Multistage Algorithm in a
single pass. This variation of PCY is called the Multihash Algorithm. Instead of using two
different hash tables on two successive passes, use two hash functions and two separate hash
tables that share main memory on the first pass, as suggested by Fig. 6.7.
The danger of using two hash tables on one pass is that each hash table has half as many buckets
as the one large hash table of PCY. As long as the average count of a bucket for PCY is much
lower than the support threshold, we can operate two half-sized hash tables and still expect
most of the buckets of both hash tables to be infrequent. Thus, in this situation we might well
choose the multihash approach.
For the second pass of Multihash, each hash table is converted to a bitmap, as usual. Note that
the two bitmaps for the two hash functions in Fig. 6.7 occupy exactly as much space as a single
bitmap would for the second pass of the PCY Algorithm. The conditions for a pair {i, j} to be in
C2, and thus to require a count on the second pass, are the same as for the third pass of
Multistage: i and j must both be frequent, and the pair must have hashed to a frequent bucket
according to both hash tables.
Just as Multistage is not limited to two hash tables, we can divide the available main memory
into as many hash tables as we like on the first pass of Multihash. The risk is that should we use
too many hash tables, the average count for a bucket will exceed the support threshold. At that
point, there may be very few infrequent buckets in any of the hash tables. Even though a pair
must hash to a frequent bucket in every hash table to be counted, we may find that the
probability an infrequent pair will be a candidate rises, rather than
falls, if we add another hash table.
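The following is a minimal sketch, under assumed names, of a Multihash-style first pass in which two independent hash functions fill two half-sized bucket arrays that share the memory one PCY table would have used.

from itertools import combinations

HALF = 500_000
h1 = lambda i, j: hash(("a", min(i, j), max(i, j))) % HALF
h2 = lambda i, j: hash(("b", min(i, j), max(i, j))) % HALF

def multihash_first_pass(baskets):
    item_counts, buckets1, buckets2 = {}, [0] * HALF, [0] * HALF
    for basket in baskets:
        for item in basket:
            item_counts[item] = item_counts.get(item, 0) + 1
        for i, j in combinations(sorted(basket), 2):
            buckets1[h1(i, j)] += 1    # each pair is hashed by both functions
            buckets2[h2(i, j)] += 1
    return item_counts, buckets1, buckets2

On the second pass, a pair is counted only if both items are frequent and the pair hashes to a frequent bucket according to both bitmaps, exactly as stated above.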
Instead of using the entire file of baskets, we could pick a random subset of the baskets and
pretend it is the entire dataset. We must adjust the support threshold to reflect the smaller
number of baskets. For instance, if the support threshold for the full dataset is s, and we choose a
sample of 1% of the baskets, then we should examine the sample for itemsets that appear in at
least s/100 of the baskets.
The safest way to pick the sample is to read the entire dataset, and for each basket, select that
basket for the sample with some fixed probability p. Suppose there are m baskets in the entire
file. At the end, we shall have a sample whose size is very close to pm baskets. However, if we
have reason to believe that the baskets appear in random order in the file already, then we do
not even have to read the entire file. We can select the first pm baskets for our sample. Or, if the
file is part of a distributed file system, we can pick some chunks at random to serve as the
sample.
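A hedged sketch of drawing such a sample: read every basket and keep it with probability p, then scale the support threshold by p. The random module is standard; everything else is illustrative.

import random

def sample_baskets(baskets, p, seed=0):
    rng = random.Random(seed)
    return [b for b in baskets if rng.random() < p]

# If the support threshold for the full file is s (an absolute count),
# the sample should be mined with a threshold of roughly p * s.
def scaled_threshold(s, p):
    return p * s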
Having selected our sample of the baskets, we use part of main memory to store these baskets.
The balance of the main memory is used to execute one of the algorithms we have discussed,
such as A-Priori, PCY, Multistage, or Multihash. However, the algorithm must run passes over the
main-memory sample for each itemset size, until we find a size with no frequent itemsets. There are
no disk accesses needed to read the sample, since it resides in main memory. As frequent
itemsets of each size are discovered, they can be written out to disk; this operation and the
initial reading of the sample from disk are the only disk I/O’s the algorithm does.
Of course the algorithm will fail if whichever method from Section 6.2 or 6.3 we choose cannot
be run in the amount of main memory left after storing the sample. If we need more main
memory, then an option is to read the sample from disk for each pass. Since the sample is much
smaller than the full dataset, we still avoid most of the disk I/O’s that the algorithms discussed
previously would use.
Our next improvement, the SON Algorithm (named for Savasere, Omiecinski, and Navathe), avoids both false negatives and false positives, at the cost of making two
full passes.
The idea is to divide the input file into chunks (which may be “chunks” in the sense of a
distributed file system, or simply a piece of the file). Treat each chunk as a sample, and run the
algorithm of Section 6.4.1 on that chunk. We use ps as the threshold, if each chunk is fraction p of
the whole file, and s is the support threshold. Store on disk all the frequent itemsets found for
each chunk.
Once all the chunks have been processed in that way, take the union of all the itemsets that have
been found frequent for one or more chunks. These are the candidate itemsets. Notice that if an
itemset is not frequent in any chunk, then its support is less than ps in each chunk. Since the
number of chunks is 1/p, we conclude that the total support for that itemset is less than (1/p)ps
= s. Thus, every itemset that is frequent in the whole is frequent in at least one chunk, and we
can be sure that all the truly frequent itemsets are among the candidates; i.e., there are no false
negatives. We have made a total of one pass through the data as we read each chunk and
processed it. In a second pass, we count all the candidate itemsets and
select those that have support at least s as the frequent itemsets.
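A hedged sketch of this two-pass scheme, assuming chunks yields lists of basket sets, mine_frequent is any in-memory miner (for example, the Apriori sketch earlier), s is an absolute support threshold, and p is the fraction of the file in each chunk.

def son(chunks, mine_frequent, s, p):
    # Pass 1: mine each chunk with the lowered threshold p*s and take the
    # union of the locally frequent itemsets as the candidates.
    candidates = set()
    for chunk in chunks:
        candidates |= mine_frequent(chunk, p * s)
    # Pass 2: count every candidate over the whole file and keep those
    # with support at least s (no false negatives are possible).
    counts = {c: 0 for c in candidates}
    for chunk in chunks:
        for basket in chunk:
            for c in candidates:
                if c <= basket:
                    counts[c] += 1
    return {c for c, cnt in counts.items() if cnt >= s}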
Toivonen's Algorithm, described next, also gives neither false negatives nor false positives, but there is a small but finite
probability that it will fail to produce any answer at all. In that case it needs to be repeated until
it gives an answer. However, the average number of passes needed before it produces all and
only the frequent itemsets is a small constant.
Toivonen’s algorithm begins by selecting a small sample of the input dataset, and finding from it
the candidate frequent itemsets. The process is exactly that of Section 6.4.1, except that it is
essential the threshold be set to something less than its proportional value. That is, if the
support threshold for the whole dataset is s, and the sample size is fraction p, then when looking
for frequent itemsets in the sample, use a threshold such as 0.9ps or 0.8ps. The smaller we make
the threshold, the more main memory we need for computing all itemsets that are frequent in
the sample, but the more likely we are to avoid the situation where the algorithm fails to provide
an answer. Having constructed the collection of frequent itemsets for the sample, we next
construct the negative border. This is the collection of itemsets that are not frequent in the
sample, but all of their immediate subsets (subsets constructed by deleting exactly one item) are
frequent in the sample.
To complete Toivonen’s algorithm, we make a pass through the entire dataset, counting all the
itemsets that are frequent in the sample or are in the negative border. There are two possible
outcomes.
1. No member of the negative border is frequent in the whole dataset. In this case, the correct set
of frequent itemsets is exactly those itemsets from the sample that were found to be frequent in
the whole.
2. Some member of the negative border is frequent in the whole. Then we cannot be sure that
there are not some even larger sets, in neither the negative border nor the collection of frequent
itemsets for the sample, that are also frequent in the whole. Thus, we can give no answer at this
time and must repeat the algorithm with a new random sample.
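A hedged sketch of constructing the negative border from the sample's frequent itemsets; sample_frequent (a set of frozensets) and items (the item universe) are assumed names.

from itertools import combinations

def negative_border(sample_frequent, items):
    border = set()
    # Single items that are not frequent belong to the border, since their
    # only immediate subset is the empty set, which is always frequent.
    for item in items:
        if frozenset([item]) not in sample_frequent:
            border.add(frozenset([item]))
    # Larger itemsets: one item added to a frequent itemset, not itself
    # frequent, but with every immediate subset frequent.
    for fs in sample_frequent:
        for item in items:
            if item in fs:
                continue
            bigger = fs | {item}
            if bigger in sample_frequent:
                continue
            if all(frozenset(sub) in sample_frequent
                   for sub in combinations(bigger, len(bigger) - 1)):
                border.add(bigger)
    return border

On the full pass, every itemset that is frequent in the sample or in this border is counted; if no border itemset turns out to be frequent in the whole, the sample's frequent itemsets that are frequent in the whole are the answer.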
CLUSTERING IN DATA MINING
Clustering is an unsupervised machine learning task that groups data points into clusters so that
objects in the same group are similar to one another. The process of
grouping a set of physical or abstract objects into classes of similar objects is called
clustering. Clustering splits data into several subsets; each of these subsets
contains data similar to each other, and these subsets are called clusters. A cluster is a
collection of data objects that are similar to one another within the same cluster and are
dissimilar to the objects in other clusters. Clustering is also called data segmentation in
some applications because clustering partitions large data sets into groups according to
their similarity. Clustering can also be used for outlier detection.
For example, suppose we are a marketing manager with a new, tempting product to sell. We are
sure that the product would bring enormous profit, as long as it is sold to the right people. So
how can we tell who is best suited for the product from our company's huge customer base? By
dividing the customer base into clusters, we can make an informed decision about which
customers we think are best suited for this product.
Here p is the point in space representing a given object in cluster Cj, and oj is the
representative object of Cj. In general, the algorithm iterates until, eventually, each
representative object is actually the medoid, or most centrally located object, of its
cluster.
• Case 1: p currently belongs to representative object oj. If oj is replaced by o_random
as a representative object and p is closest to one of the other representative objects
oi, i ≠ j, then p is reassigned to oi.
• Case 2: p currently belongs to representative object oj. If oj is replaced by o_random
as a representative object and p is closest to o_random, then p is reassigned to o_random.
• Case 3: p currently belongs to representative object oi, i ≠ j. If oj is replaced
by o_random as a representative object and p is still closest to oi, then the assignment
does not change.
• Case 4: p currently belongs to representative object oi, i ≠ j. If oj is replaced
by o_random as a representative object and p is closest to o_random, then p is
reassigned to o_random.
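As a quick illustration of this reassignment logic, here is a hedged sketch of the swap test in a k-medoids (PAM-style) iteration; dist and the data layout are assumptions, not the notes' own code.

def total_cost(points, medoids, dist):
    """Sum of distances from each point to its closest medoid
    (every point is assigned to its closest representative)."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def swap_improves(points, medoids, o_j, o_random, dist):
    """True if replacing medoid o_j with non-medoid o_random lowers the cost."""
    trial = [o_random if m == o_j else m for m in medoids]
    return total_cost(points, trial, dist) < total_cost(points, medoids, dist)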
Classification:
• A hierarchical method can be classified as either agglomerative or divisive. The agglomerative approach, also called the
bottom-up approach, starts with each object forming a separate group. It successively
merges the objects or groups that are close to one another, until all of the groups are
merged into one (the topmost level of the hierarchy) or until a termination condition is
satisfied.
• The divisive approach, also called the top-down approach, starts with all of the objects in
the same cluster.
• In each successive iteration, it subdivides a cluster into smaller and smaller pieces, until
each object forms a cluster on its own or until certain termination conditions are satisfied,
such as each cluster being within a certain threshold.
EXAMPLE:
• In AGNES (AGglomerative NESting), the cluster merging process repeats until all of the objects are eventually merged to
form one cluster.
• In DIANA, all of the objects are used to form one initial cluster.
• The cluster is split according to some principle, such as the maximum Euclidean
distance between the closest neighboring objects in the cluster.
• The cluster splitting process repeats until, eventually, each new cluster contains only a
single object
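The sketch below illustrates the bottom-up (AGNES-style) idea with single-link merging; it is illustrative pseudocode made runnable under assumed names, not the algorithm from the notes.

def single_link_agnes(points, dist):
    clusters = [[p] for p in points]           # each object starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link distance
        # (distance between their closest members).
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(x, y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges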
(2b) CURE
The CURE algorithm assumes a Euclidean distance. It allows clusters to assume any
shape. It uses a collection of representative points to represent clusters.
For example, a dataset of engineers and humanities people is shown below in terms of their
salary and age.
Figure 6: Dataset representation in terms of salary and age
Suppose we form two clusters from the dataset of engineers and humanities people. The clusters
formed overlap with each other, which does not give a useful solution.
If we instead try to create three clusters for better segregation, one cluster still ends up
containing values from both groups.
Pass 1 of 2:
• Take a small sample of the data and cluster it in main memory (for example, hierarchically).
• Select a small set of representative points from each cluster, chosen to be as far from one another as possible.
• Move each representative point a fixed fraction of the distance (e.g., 20%) toward the centroid of its cluster.
Pass 2 of 2:
Now, rescan the whole dataset and visit each point p in the data set.
Place it in the “closest cluster.”
o Closest: the cluster with the representative point closest to p, among all the representative points
of all the clusters.
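A hedged sketch of how the representative points in Pass 1 could be chosen and shrunk toward the centroid; plain coordinate tuples and the names below are assumptions for illustration.

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def scattered_points(cluster, k, dist):
    """Greedily pick k points, each as far as possible from those already picked."""
    c = centroid(cluster)
    reps = [max(cluster, key=lambda p: dist(p, c))]
    while len(reps) < k and len(reps) < len(cluster):
        reps.append(max(cluster, key=lambda p: min(dist(p, r) for r in reps)))
    return reps

def shrink_toward_centroid(reps, c, alpha=0.2):
    """Move each representative a fraction alpha of the way toward centroid c."""
    return [tuple(r[d] + alpha * (c[d] - r[d]) for d in range(len(r))) for r in reps]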
• The edges are weighted to reflect the similarity between objects. Chameleon uses a
graph partitioning algorithm to partition the k-nearest-neighbor graph into a large
number of relatively small subclusters.
• To determine the pairs of most similar subclusters, it takes into account both the
interconnectivity and the closeness of the clusters.
Figure 11: Chameleon – Hierarchical clustering based on k-nearest and dynamic modeling
Figure 12: Overall framework of Chameleon
It is easy to see that, according to Euclidean distance, the three customers are equivalently
similar (or dissimilar) to each other. However, a closer look tells us that Ada should be more
similar to Cathy than to Bob, because Ada and Cathy share one common purchased item, P1.
The traditional distance measures can be ineffective on high-dimensional data. Such
distance measures may be dominated by the noise in many dimensions. Therefore, clusters
in the full, high-dimensional space can be unreliable, and finding such clusters may not be
meaningful. Clustering high-dimensional data is the search for clusters and the space in
which they exist.
First challenge :
A major issue is how to create appropriate models for clusters in high-dimensional data.
Unlike conventional clusters in low-dimensional spaces, clusters hidden in high-dimensional
data are often significantly smaller. For example, when clustering customer-purchase data,
we would not expect many users to have similar purchase patterns. Searching for such
small but meaningful clusters is like finding needles in a haystack. we often have to consider
various more sophisticated techniques that can model correlations and consistency among
objects in subspaces.
Second Challenge:
A second major challenge is the huge number of possible subspaces that must be searched. For
example, if the original data space has 1,000 dimensions, and we want to find clusters of
dimensionality 10, then there are about 2.63 × 10^23 possible subspaces.
3. Biclustering methods
Top-down approaches start from the full space and search smaller and smaller
subspaces recursively. Top-down approaches are effective only if the locality assumption
holds, which requires that the subspace of a cluster can be determined by the local
neighborhood. PROCLUS is an example of a top-down subspace approach.
• Because CLIQUE partitions each dimension like a grid structure and determines
whether a cell is dense based on the number of points it contains, it can also be
viewed as an integration of density-based and grid-based clustering methods
• Given a large set of multidimensional data points, the data space is usually not
uniformly occupied by the data points.
• CLIQUE’s clustering identifies the sparse and the “crowded” areas in space (or units),
thereby discovering the overall distribution patterns of the data set.
• A unit is dense if the fraction of total data points contained in it exceeds an input
model parameter
Figure : Density and Grid based clustering
I STEP: CLIQUE partitions the d-dimensional data space into non overlapping
rectangular units, identifying the dense units among these.
II STEP: The subspaces representing these dense units are intersected to form a
candidate search space in which dense units of higher dimensionality may exist.
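A hedged sketch of STEP I (identifying dense grid units); xi (intervals per dimension) and tau (density threshold) are assumed parameter names, not the notes' notation.

from collections import Counter

def dense_units_1d(points, xi, tau):
    """Dense 1-dimensional units, as (dimension, interval index) cells."""
    n, d = len(points), len(points[0])
    lows = [min(p[dim] for p in points) for dim in range(d)]
    highs = [max(p[dim] for p in points) for dim in range(d)]
    counts = Counter()
    for p in points:
        for dim in range(d):
            width = (highs[dim] - lows[dim]) / xi or 1.0
            cell = min(int((p[dim] - lows[dim]) / width), xi - 1)
            counts[(dim, cell)] += 1
    # A unit is dense if the fraction of total points it contains exceeds tau.
    return {unit for unit, c in counts.items() if c / n > tau}

STEP II would then combine dense units that share all but one dimension, Apriori-style, to form candidate dense units in higher-dimensional subspaces.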
Graphical definition:
In graph terms, a clique is a group of nodes such that all nodes in the clique are connected to
each other; K denotes the number of nodes in the clique.
A community is a group of cliques such that all the cliques in the group have
K−1 nodes in common.
CLIQUE- Example 1
CLIQUE- Example 2
CLIQUE ( K =3)
a) {1,2,3}
b) {1,2,8}
c) {2,6,5}
d) {2,6,4}
e) {2,5,4}
f) {4,5,6}
Community 1 = {a, b}
Community 2 = {c, d, e, f}
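The grouping above can be checked with a short sketch that merges cliques sharing K−1 nodes; the clique sets are copied from the example, while the merging code itself is illustrative.

from itertools import combinations

K = 3
cliques = {
    "a": {1, 2, 3}, "b": {1, 2, 8}, "c": {2, 6, 5},
    "d": {2, 6, 4}, "e": {2, 5, 4}, "f": {4, 5, 6},
}

def share_k_minus_1(x, y):
    return len(cliques[x] & cliques[y]) == K - 1

# Naive union-style grouping: merge cliques that share K-1 nodes.
groups = {name: {name} for name in cliques}
for x, y in combinations(cliques, 2):
    if share_k_minus_1(x, y):
        merged = groups[x] | groups[y]
        for member in merged:
            groups[member] = merged

communities = {frozenset(g) for g in groups.values()}
print(communities)   # expected: {a, b} and {c, d, e, f}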
Each dimension is then assigned a weight for each cluster and the updated weights
are used in the next iteration to regenerate the clusters. This leads to the
exploration of dense regions in all subspaces of some desired dimensionality. It
avoids the generation of a large number of overlapped clusters in lower
dimensionality.
PROCLUS finds the best set of medoids by a hill-climbing process, generalized
to deal with projected clustering. It adopts a distance measure called the Manhattan
segmental distance. The PROCLUS algorithm consists of three phases:
initialization, iteration, and cluster refinement.
However, PROCLUS is faster than CLIQUE due to the sampling of large data sets,
though the use of a small number of representative points can cause PROCLUS to
miss some clusters entirely. Experiments on PROCLUS show that the method is
efficient and scalable at finding high-dimensional clusters. PROCLUS finds
non-overlapping partitions of points.
Input: the data set, the number of clusters k, and the average number of dimensions per cluster.
Output: a partition of the points into k clusters (plus outliers), together with the set of dimensions associated with each cluster.
Initialization Phase
1. Take a random sample of the data points.
2. From the sample, choose a set of data points that is likely to contain the medoids of the clusters.
Iterative Phase
1. From the initialization phase, we have a set of data points that should contain the
medoids (denoted by M). In this phase, we find the best medoids from M.
2. Randomly choose a current medoid set Mcurrent from M, and replace the “bad” medoids with other
points from M whenever doing so improves cluster quality. The best medoid set
formed in this way is denoted Mbest.
Find the bad medoid, and evaluate the result of replacing it.
Refinement Phase
The final step of this algorithm is refinement phase. This phase is included to improve the
quality of the clusters formed. The clusters C1,C2, C3,…,Ck formed during the iterative
phase are the inputs to this phase. The original data set is passed over one or more times
to improve the quality of the clusters. The dimension sets Di found during the iterative
phase are discarded and new dimension sets are computed for each of the cluster set Ci.
Once the new dimension sets are computed for the clusters, the points are
reassigned to the medoids relative to these new sets of dimensions. Outliers are
determined in the last pass over the data.
Drawback:
The algorithm requires the average number of dimensions per cluster as an input
parameter.
The performance of PROCLUS is highly sensitive to the value of this input parameter.
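For reference, a hedged sketch of the Manhattan segmental distance mentioned above: the Manhattan distance restricted to a cluster's dimension set D and averaged over |D|. The function and variable names are illustrative.

def manhattan_segmental_distance(x, y, D):
    """x, y: points as sequences of coordinates; D: iterable of dimension indices."""
    D = list(D)
    return sum(abs(x[d] - y[d]) for d in D) / len(D)

# Example: distance between two points over the projected dimensions {0, 2}.
print(manhattan_segmental_distance((1.0, 9.0, 4.0), (2.0, -3.0, 6.0), {0, 2}))  # (1 + 2) / 2 = 1.5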
Rather than growing the clusters dimension by dimension, we grow sets of frequent
itemsets, which eventually lead to cluster descriptions.
Examples of frequent pattern-based cluster analysis: clustering of text documents that
contain thousands of distinct keywords.
Working:
Descriptors (sets of words that describe topic matter) are extracted from the
document first.
Then they are analyzed for the frequency in which they are found in the document
compared to other terms.
Google’s search engine is probably the best and most widely known example.
When you search for a term on Google, it pulls up pages that apply to that term.
How Google can analyze billions of web pages to deliver an accurate and fast result?
It’s because of text clustering! Google’s algorithm breaks down unstructured data
from web pages and turns it into a matrix model, tagging pages with keywords that
are then used in search results!
For example, the words connection, connections, connected, and connecting are all reduced by
stemming to the common stem connect.
It is important to appreciate that we use stemming with the intention of improving the
performance of IR systems.
Pcluster:
Another approach for clustering high dimensional data is based on pattern similarity
among the objects on a subset of dimensions. pCluster method performs clustering by
pattern similarity in microarray data analysis. Example is DNA microarray analysis
DNA microarray analysis: A microarray is a laboratory tool used to detect the expression
of thousands of genes at the same time. DNA microarrays are microscope slides that
are printed with thousands of tiny spots in defined positions, with each spot containing a
known DNA sequence or gene.
Under the pCluster model, two objects are similar if they exhibit a coherent pattern on a
subset of dimensions. Although the magnitude of their expression levels may not be
close, the patterns they exhibit can be very much alike. The pCluster model, though
developed in the study of microarray data cluster analysis, can be applied to many other
applications that require finding similar or coherent patterns involving a subset of
numerical dimensions in large, high-dimensional data sets.
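Below is a hedged sketch of the pScore test commonly used to define pCluster coherence; the 2x2 formulation and the threshold delta are assumptions based on the standard pCluster definition, not text from these notes.

from itertools import combinations

def pscore(x, y, a, b):
    """pScore of the 2x2 submatrix formed by objects x, y and dimensions a, b."""
    return abs((x[a] - x[b]) - (y[a] - y[b]))

def coherent_on(x, y, dims, delta):
    """True if every pair of dimensions in `dims` has pScore <= delta."""
    return all(pscore(x, y, a, b) <= delta for a, b in combinations(dims, 2))

# Shifted expression profiles are coherent even if their magnitudes differ.
x = [10.0, 14.0, 9.0]
y = [2.0, 6.0, 1.0]      # y = x - 8 on every dimension
print(coherent_on(x, y, range(3), delta=0.5))   # True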
Suppose we are using edit distance, and we decide to merge the strings abcd and
aecdb.
However, there is no string that represents their average, or that could be thought of
as lying naturally between them.
We could take one of the strings that we might pass through when transforming one
string to the other by single insertions or deletions, such as aebcd, but there are many
such options.
Moreover, when clusters are formed from more than two strings, the notion of “on the
path between” stops making sense.
Given that we cannot combine points in a cluster when the space is non-Euclidean, our
only choice is to pick one of the points of the cluster itself to represent the cluster.
Ideally, this point is close to all the points of the cluster, so it in some sense lies in the
“center.”
We can select the clustroid in various ways, each designed to, in some sense,
minimize the distances between the clustroid and the other points in the cluster.
Common choices include selecting as the clustroid the point that minimizes:
1. The sum of the distances to the other points in the cluster.
2. The maximum distance to another point in the cluster.
3. The sum of the squares of the distances to the other points in the cluster.
EXAMPLE:
We are using edit distance (insertions and deletions only), and a cluster consists of the four
points abcd, aecdb, abecb, and ecdab. Their pairwise distances are:
d(abcd, aecdb) = 3, d(abcd, abecb) = 3, d(abcd, ecdab) = 5,
d(aecdb, abecb) = 2, d(aecdb, ecdab) = 2, d(abecb, ecdab) = 4.
If we apply the three criteria for being the clustroid to each of the four points of the cluster,
we find that aecdb minimizes all three (sum of distances 7, maximum distance 3, sum of
squares 17), so aecdb is the clustroid.
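The numbers above can be reproduced with a short sketch: edit distance via the longest common subsequence (since only insertions and deletions are allowed), followed by the three clustroid criteria.

from functools import lru_cache

def edit_distance(x, y):
    """Insert/delete edit distance: |x| + |y| - 2 * LCS(x, y)."""
    @lru_cache(maxsize=None)
    def lcs(i, j):
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return 1 + lcs(i - 1, j - 1)
        return max(lcs(i - 1, j), lcs(i, j - 1))
    return len(x) + len(y) - 2 * lcs(len(x), len(y))

points = ["abcd", "aecdb", "abecb", "ecdab"]
for p in points:
    d = [edit_distance(p, q) for q in points if q != p]
    print(p, "sum =", sum(d), "max =", max(d), "sum of squares =", sum(v * v for v in d))
# aecdb minimizes all three criteria, so it is the clustroid.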
GRGPF Algorithm
Consider an algorithm that handles non-main-memory data, but does not require a
Euclidean space. The algorithm, which we shall refer to as GRGPF for its authors
(V. Ganti, R. Ramakrishnan, J. Gehrke, A. Powell, and J. French), takes ideas from
both hierarchical and point-assignment approaches
As we assign points to clusters, the clusters can grow large. Most of the points in a
cluster are stored on disk, and are not used in guiding the assignment of points,
although they can be retrieved. If p is any point in a cluster, let ROWSUM(p) be the
sum of the squares of the distances from p to each of the other points in the cluster.
The representation of a cluster consists of the following features:
1. N, the number of points in the cluster.
2. The clustroid of the cluster, which is defined specifically to be the point in the cluster
that minimizes the sum of the squares of the distances to the other points (that is, the
point with the smallest rowsum).
3. The rowsum of the clustroid of the cluster.
4. For some chosen constant k, the k points of the cluster that are closest to the clustroid,
and their rowsums. These points are part of the representation in case the addition of
points to the cluster causes the clustroid to change. The assumption is made that the new
clustroid would be one of these k points near the old clustroid.
5. The k points of the cluster that are furthest from the clustroid and their rowsums. These
points are part of the representation so that we can consider whether two clusters are
close enough to merge. The assumption is made that if two clusters are close, then a pair
of points distant from their respective clustroids would also be close.
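When a point is added to a cluster, the stored rowsums can be maintained incrementally. The sketch below assumes the estimate ROWSUM(p) = ROWSUM(c) + N * d(p, c)^2, a standard high-dimensional approximation that is not stated in these notes, where c is the clustroid and N the current cluster size; the data layout is illustrative.

def add_point(p, cluster, dist):
    """cluster: dict with 'N', 'clustroid', and 'rowsum' (a dict holding the
    rowsums of the stored points: the clustroid plus the k nearest and
    k furthest points)."""
    c = cluster["clustroid"]
    N = cluster["N"]
    # Estimate the new point's rowsum from the clustroid's rowsum.
    new_rowsum_p = cluster["rowsum"][c] + N * dist(p, c) ** 2
    # Every stored point's rowsum grows by its squared distance to p.
    for q in cluster["rowsum"]:
        cluster["rowsum"][q] += dist(p, q) ** 2
    cluster["rowsum"][p] = new_rowsum_p
    cluster["N"] = N + 1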
1. Take a main-memory sample of the data set and cluster it hierarchically; the result is a tree T.
2. Select from T some nodes that represent clusters of some desired size n (or close to n).
These become the initial clusters for GRGPF. Place them in the leaves of the cluster-representing
tree (or CRT).
3. Group clusters with a common ancestor in T into interior nodes of the CRT.
In the stream setting, for any m ≤ N we may be asked to cluster the most recent m points. The
clustering algorithm used depends on the data space.
A simple way of keeping track of the data is to store the data in buckets containing 2^k points,
allowing up to two buckets of each size 2^k.
For each cluster kept in a bucket, we record:
a. The size of the cluster.
b. The centroid or clustroid.
c. Any other feature needed to merge clusters.
When a new point arrives it must go into a bucket. This causes bucket management
issues. Buckets also time out and can be deleted.
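A rough sketch of this bucket bookkeeping, under the stated assumptions (buckets of 2^k points, at most two buckets per size, clusters summarized per bucket); the merge rule and all names are illustrative, not the notes' exact algorithm.

from collections import deque

# Each bucket is a tuple: (size, timestamp_of_newest_point, cluster_summaries).
buckets = deque()   # oldest buckets at the left, newest at the right

def expire_old(now, window_N):
    """Drop buckets whose newest point has fallen out of the window of N time units."""
    while buckets and buckets[0][1] <= now - window_N:
        buckets.popleft()

def new_point(point, now, window_N, max_per_size=2):
    expire_old(now, window_N)
    buckets.append((1, now, [(1, point)]))       # a size-1 bucket for the new point
    size = 1
    while True:
        same = [i for i, b in enumerate(buckets) if b[0] == size]
        if len(same) <= max_per_size:
            break
        i, j = same[0], same[1]                  # the two oldest buckets of this size
        merged = (2 * size, buckets[j][1], buckets[i][2] + buckets[j][2])
        del buckets[j]
        del buckets[i]
        buckets.insert(i, merged)
        size *= 2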