
Cluster Analysis

• Clustering: the process of grouping a set of physical or abstract objects into classes of similar objects.
• In other words, it is the task of grouping a set of objects in such
a way that objects in the same group (called a cluster) are more
similar (in some sense) to each other than to those in other
groups (clusters).
• A cluster is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the objects
in other clusters.
• In cluster analysis, we first partition the set of data into groups based on data similarity, and then assign labels to the relatively small number of resulting groups.
• Unlike classification, clustering and unsupervised learning do not
rely on predefined classes and class-labeled training examples.
• Clustering is a form of learning by observation, rather than learning by examples.
Fig: A sample of 3 clusters in n-dimensional space
• Applications: market research, pattern recognition, data analysis, image processing, machine learning, information retrieval, bioinformatics, data compression, and computer graphics.
• In business: helps marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns.
• In biology: used to derive plant and animal taxonomies, categorize
genes with similar functionality, and gain insight into structures
inherent in populations
• Helps to classify documents on the Web for information discovery.
• Helps in the identification of areas of similar land use in an earth observation database, etc.
• Outlier detection: can detect values that are far away from any
cluster
• Typical requirements of clustering algorithms in data mining:
• Scalability:
Algorithms must scale to large databases that may contain millions of data objects.
• Ability to deal with different types of attributes: Should support not only
the interval-based (numerical) data, but also other types of data, such as
binary, categorical (nominal), and ordinal data, or mixtures of these data
types.
• Discovery of clusters with arbitrary shape:
Algorithms based on distance measures such as Euclidean or Manhattan distance tend to find spherical clusters with similar size and density.
• It is important to develop algorithms that can detect clusters of arbitrary
shape.
• Minimal requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to input certain parameters (such as the number of desired clusters), which are often difficult to determine. Algorithms should place minimal such demands on users.
• Ability to deal with noisy data:
Some clustering algorithms are sensitive to outliers or to missing, unknown, or erroneous data, which may lead to clusters of poor quality. Algorithms should therefore have a mechanism for dealing with such data in order to produce better-quality clusters.
• Incremental clustering and insensitivity to the order of input records:
Algorithms should be able to incorporate newly inserted data (i.e., database updates) into existing clustering structures, and should be insensitive to the order of the input.
• High dimensionality:
A database or a data warehouse can contain several dimensions or attributes. Algorithms should support data objects in high-dimensional space.
• Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints. Algorithms must be capable of satisfying user-specified constraints.
• Interpretability and usability:
Clustering results should be interpretable, comprehensible, and usable.
Categorization of Clustering Methods

• Partitioning methods:
A partitioning method constructs k partitions from a given database of n objects or data tuples, where each partition represents a cluster and k <= n.
• i.e., it classifies the data into k groups, which together satisfy the following requirements:
• (1) each group must contain at least one object
• (2) each object must belong to exactly one group
• Given k, the number of partitions to construct, a partitioning method
creates an initial partitioning
• Then uses an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another.
• General criterion of a good partitioning: objects in the same cluster are “close” or related to each other, whereas objects of different clusters are “far apart” or very different.
• There are various kinds of other criteria for judging the quality of
partitions.
• There are two types of clustering methods based on partitioning
• 1. k-means algorithm: where each cluster is represented by the mean
value of the objects in the cluster
• 2. k-medoids algorithm: where each cluster is represented by one of the
objects located near the center of the cluster.
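As an illustration of the iterative relocation idea behind k-means, here is a minimal sketch in Python with NumPy; the data, seed, and iteration cap are illustrative assumptions, not part of the note.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # Minimal k-means sketch: alternate assignment and mean-update steps.
    rng = np.random.default_rng(seed)
    # Pick k distinct objects as the initial cluster means.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: attach each object to its nearest mean (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Relocation: recompute each cluster's mean from its current members
        # (assumes no cluster becomes empty, fine for this small example).
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: means stopped moving
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
labels, centers = k_means(X, k=2)
print(labels)    # cluster index per object
print(centers)   # each cluster represented by the mean of its objects
```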
• Hierarchical methods:
• A hierarchical method creates a hierarchical decomposition of the given
set of data objects
• A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed.
• Agglomerative Methods:
• Also called the bottom-up approach: starts with each object forming a separate group.
• It successively merges the objects or groups that are close to one another,
until all of the groups are merged into one (the topmost level of the
hierarchy), or until a termination condition holds.
• Divisive Methods:
• also called the top-down approach, starts with all of the objects in the same
cluster.
• In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in one cluster, or until a termination condition holds.
• Hierarchical methods suffer from the fact that once a step (merge or split) is done,
it can never be undone.
• Hierarchical clustering methods: BIRCH
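A minimal bottom-up sketch, assuming single-linkage distance and toy data (both my choices, not the note's): start with singleton clusters and repeatedly merge the closest pair; note that a merge, once made, is never undone.

```python
import numpy as np

def agglomerative(X, k):
    # Minimal single-linkage agglomerative sketch: merge until k clusters remain.
    clusters = [[i] for i in range(len(X))]   # each object starts as its own group
    while len(clusters) > k:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)   # merge the closest pair; never undone
    return clusters

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
print(agglomerative(X, k=3))   # e.g. [[0, 1], [2, 3], [4]]
```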
• Density-based methods:
• These methods are based on the notion of density (the number of objects or data points in a region).
• The general idea is to continue growing a given cluster as long as the density in the “neighborhood” exceeds some threshold.
• i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
• Such a method can be used to filter out noise (outliers) and discover clusters of arbitrary shape.
• Density-based clustering methods: DBSCAN and its extension, OPTICS
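As a sketch of how this looks in practice, here is DBSCAN via scikit-learn, where eps is the neighborhood radius and min_samples the minimum number of points required in it; the data and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],   # a dense region
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],   # another dense region
              [9.0, 0.0]])                           # an isolated point
# Grow clusters while each point's radius-eps neighborhood holds >= min_samples points.
model = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(model.labels_)   # label -1 marks noise (the isolated point)
```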
• Grid-based methods
• Quantize the object space into a finite number of cells that form a
grid structure.
• All of the clustering operations are performed on the grid structure
(i.e., on the quantized space).
• The main advantage of this approach is its fast processing time,
which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the
quantized space.
• Example: STING
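A minimal sketch of the quantization step, assuming 2-D data already scaled to [0, 1) (my assumption, for brevity): each point maps to a cell index, and subsequent clustering work operates on cell counts rather than on individual objects.

```python
import numpy as np
from collections import Counter

X = np.array([[0.10, 0.20], [0.15, 0.25], [0.90, 0.90],
              [0.85, 0.95], [0.50, 0.10]])
n_cells = 4                                  # cells per dimension
cells = np.floor(X * n_cells).astype(int)    # quantize each point to a grid cell
counts = Counter(map(tuple, cells))          # objects per cell
print(counts)   # dense cells are candidate cluster regions
```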
• Model-based methods
• Hypothesize a model for each of the clusters and
find the best fit of the data to the given model
• A model-based algorithm may locate clusters by
constructing a density function that reflects the
spatial distribution of the data points
• It also leads to a way of automatically
determining the number of clusters based on
standard statistics
• Takes “noise” or outliers into account, thus yielding robust clustering methods.
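As one concrete instance of the model-based idea, a mixture of Gaussians can be fitted and cluster membership read from the fitted model; the sketch below uses scikit-learn's GaussianMixture with illustrative data of my choosing.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [6.0, 6.0], [6.2, 5.9], [5.8, 6.1]])
# Hypothesize a 2-component Gaussian model and find the best fit to the data.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict(X))   # cluster assignment per object
print(gm.means_)       # fitted component centers (the cluster models)
```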
K-Medoid Clustering

• In k-medoid clustering, each cluster is represented by one of the objects located near the center of the cluster.
• Steps:
1. Choose k, the number of clusters.
2. Select k objects in D at random as the initial representative objects or seeds.
3. Assign each data point (object) to the closest representative object, which forms k clusters.
4. Randomly select a nonrepresentative object, O'.
5. Compute the cost change, S, of swapping a representative object Oj with O'.
6. If S < 0, swap Oj with O' to form the new set of k representative objects.
7. Repeat steps 3 to 6 until no change occurs.
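A minimal Python sketch of these steps, assuming Manhattan distance (as in the worked example below) and illustrative data; this shows the PAM-style swap test, not a tuned implementation.

```python
import numpy as np

def manhattan(p, q):
    return np.abs(p - q).sum()   # |a - c| + |b - d|

def total_cost(X, medoids):
    # Step 3's cost: each object's distance to its nearest medoid, summed.
    return sum(min(manhattan(x, X[m]) for m in medoids) for x in X)

def k_medoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # step 2
    cost = total_cost(X, medoids)
    improved = True
    while improved:                                   # steps 3-7
        improved = False
        for m in range(k):                            # each representative Oj
            for o in range(len(X)):                   # each nonrepresentative O'
                if o in medoids:
                    continue
                candidate = medoids[:m] + [o] + medoids[m + 1:]
                new_cost = total_cost(X, candidate)
                if new_cost - cost < 0:               # step 6: S < 0, accept swap
                    medoids, cost, improved = candidate, new_cost, True
    return medoids, cost

X = np.array([[1, 2], [2, 2], [2, 3], [8, 8], [8, 9], [9, 8]])
print(k_medoids(X, k=2))   # medoid indices and the final total cost
```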
• Use the k-medoid clustering algorithm to divide the given data into two clusters, and compute the representative data points for the clusters.
(1) We have k = 2, the number of clusters.
(2) Initialize random medoids c1 = (3, 4) and c2 = (7, 4).
(3) Calculate the distance (cost) to associate each data object with its nearest medoid.
• The distance (cost) between a medoid ci = (a, b) and an object Xi = (c, d) is calculated as cost = |a − c| + |b − d|, which is the Manhattan distance.
• Calculate the total cost S1 as the sum of each data object's cost to the medoid of its cluster, which gives S1 = 20.
• This divides the data into two clusters: Cluster 1: {X1, X2, X3, X4} and Cluster 2: {X5, X6, X7, X8, X9, X10}.
(4) Select another random medoid O' = (7, 3). The medoids are now c1 = (3, 4) and O' = (7, 3); calculate the new total cost S'.
• We have S' = 22, so the cost of swapping the medoid from the old (c2) to the new (O') is S = S' − S1 = 22 − 20 = 2 > 0.
• The positive value indicates a higher cost if we swap to the new medoid, so moving to O' would be a bad idea: the previous choice was good, and the algorithm terminates here.
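The data table for this example is not reproduced in the note; the ten points below are an assumption (the data commonly paired with these medoids in textbook treatments), chosen because they reproduce the quoted costs exactly. The snippet checks the swap decision.

```python
import numpy as np

# Assumed data points X1..X10 (the note's data table is not shown);
# they are consistent with the quoted costs S1 = 20 and S' = 22.
X = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
              [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])

def total_cost(X, medoids):
    # Manhattan distance |a - c| + |b - d| to the nearest medoid, summed.
    return sum(min(np.abs(x - np.array(m)).sum() for m in medoids) for x in X)

S1 = total_cost(X, [(3, 4), (7, 4)])   # initial medoids c1, c2
S2 = total_cost(X, [(3, 4), (7, 3)])   # candidate swap: c2 -> O' = (7, 3)
print(S1, S2, S2 - S1)                 # 20 22 2 -> S > 0, keep the old medoids
```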
Assignment
For the given data, partition it into two clusters using the k-medoid algorithm.
