Cluster Analysis
• Partitioning methods:
A partitioning method constructs k partitions of a given database of n
objects or data tuples, where each partition represents a cluster and k <= n.
• i.e., it classifies the data into k groups that together satisfy the following
requirements:
• (1) each group must contain at least one object
• (2) each object must belong to exactly one group
• Given k, the number of partitions to construct, a partitioning method
creates an initial partitioning
• Then uses an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another.
• General criterion of a good partitioning: objects in the same cluster are
“close" or related to each other, whereas objects of different clusters are
“far apart" or very different
• There are various other criteria for judging the quality of
partitions.
• There are two types of clustering methods based on partitioning
• 1. k-means algorithm: where each cluster is represented by the mean
value of the objects in the cluster
• 2. k-medoids algorithm: where each cluster is represented by one of the
objects located near the center of the cluster.
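The k-means idea above (initial partitioning, then iterative relocation toward each cluster's mean) can be sketched in a few lines of plain Python. This is a toy illustration with made-up sample points, not a full production implementation:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means: create an initial partitioning from k random seed
    points, then iteratively relocate objects to the nearest cluster mean."""
    rng = random.Random(seed)
    means = rng.sample(points, k)              # initial cluster representatives
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # relocation step: nearest mean wins
            j = min(range(k),
                    key=lambda j: (p[0] - means[j][0]) ** 2
                                + (p[1] - means[j][1]) ** 2)
            clusters[j].append(p)
        # each cluster is represented by the mean value of its objects
        means = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                 if c else means[j]
                 for j, c in enumerate(clusters)]
    return means, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(pts, 2)
```

On this data the relocation loop separates the two obvious groups of three points each; k-medoids (PAM) differs only in that the representative must be one of the actual objects rather than a computed mean.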
• Hierarchical methods:
• A hierarchical method creates a hierarchical decomposition of the given
set of data objects
• A hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is formed.
• Agglomerative Methods:
also called the bottom-up approach, starts with each object forming a
separate group
• It successively merges the objects or groups that are close to one another,
until all of the groups are merged into one (the topmost level of the
hierarchy), or until a termination condition holds.
• Divisive Methods:
• also called the top-down approach, starts with all of the objects in the same
cluster.
• In each successive iteration, a cluster is split into smaller clusters, until
eventually each object forms its own cluster, or until a termination condition holds.
• Hierarchical methods suffer from the fact that once a step (merge or split) is done,
it can never be undone.
• Hierarchical clustering methods: BIRCH
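The agglomerative (bottom-up) process can be sketched as a toy single-linkage implementation: every object starts as its own cluster, and the two closest clusters are merged until a termination condition (here, a target number of clusters) holds. The helper name and sample data are illustrative, not from the source:

```python
def agglomerative(points, k):
    """Bottom-up hierarchical clustering: each point starts as its own
    cluster; repeatedly merge the two closest clusters (single linkage,
    i.e. minimum pairwise distance) until only k clusters remain."""
    clusters = [[p] for p in points]           # every object is its own group

    def dist(a, b):                            # single-linkage cluster distance
        return min((x1 - x2) ** 2 + (y1 - y2) ** 2
                   for x1, y1 in a for x2, y2 in b)

    while len(clusters) > k:                   # merge step can never be undone
        i, j = min(((i, j)
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
groups = agglomerative(pts, 2)
```

Stopping at k = 1 instead would produce the topmost level of the hierarchy; a divisive method would run the same idea in reverse, splitting from one all-inclusive cluster downward.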
• Density-based methods:
Clustering methods have been developed based on the notion of density (the
number of objects or data points in a region)
• The general idea is to continue growing a given cluster as long as the density
in its “neighborhood" exceeds some threshold
• i.e., for each data point within a given cluster, the neighborhood of a given radius
has to contain at least a minimum number of points
• Such a method can be used to filter out noise (outliers) and discover
clusters of arbitrary shape.
• Density based clustering methods: DBSCAN and its extension,
OPTICS
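The neighborhood rule above is exactly what DBSCAN implements. The following is a minimal sketch (parameter names `eps` and `min_pts` follow common DBSCAN convention; the sample data is made up): a point whose radius-`eps` neighborhood holds at least `min_pts` points is a "core" point, clusters grow outward from core points, and points in no dense neighborhood are labeled noise:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: grow clusters from core points whose
    eps-neighborhood contains at least min_pts points; points reachable
    from no core point are labeled noise (-1)."""
    def neighbors(i):                      # indices within radius eps of point i
        return [j for j in range(len(points))
                if (points[i][0] - points[j][0]) ** 2
                 + (points[i][1] - points[j][1]) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cid = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # noise (may later become a border point)
            continue
        cid += 1
        labels[i] = cid
        queue = list(seeds)
        while queue:                       # grow the cluster through core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid            # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cid
            nj = neighbors(j)
            if len(nj) >= min_pts:         # j is itself core: keep expanding
                queue.extend(nj)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Because clusters grow by density rather than toward a center, the method can find arbitrarily shaped clusters, and the isolated point at (10, 10) is correctly reported as noise.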
• Grid-based methods
• Quantize the object space into a finite number of cells that form a
grid structure.
• All of the clustering operations are performed on the grid structure
(i.e., on the quantized space).
• The main advantage of this approach is its fast processing time,
which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the
quantized space.
• Example: STING
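The quantization step that gives grid-based methods their speed can be sketched as follows (a simplified illustration in the spirit of STING, not its actual multi-resolution algorithm; the helper name and threshold are assumptions):

```python
from collections import Counter

def grid_cells(points, cell=5.0):
    """Quantize the 2-D object space into square grid cells and keep only
    per-cell counts; later clustering steps touch the fixed-size grid,
    not the raw points, which is why runtime depends on the number of
    cells rather than the number of objects."""
    return Counter((int(x // cell), int(y // cell)) for x, y in points)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cells = grid_cells(pts, cell=5.0)
dense = {c for c, n in cells.items() if n >= 3}   # cells above a density threshold
```

Here the six points collapse into just two occupied cells, and identifying "dense" cells (and merging adjacent ones) is then a pass over the grid whose cost is independent of how many points fell into each cell.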
• Model-based methods
• Hypothesize a model for each of the clusters and
find the best fit of the data to the given model
• A model-based algorithm may locate clusters by
constructing a density function that reflects the
spatial distribution of the data points
• It also leads to a way of automatically
determining the number of clusters based on
standard statistics
• Takes “noise" or outliers into account and thus
yields robust clustering methods.
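A concrete instance of the model-based idea is fitting a Gaussian mixture with EM: the hypothesized model is a small set of Gaussians, and the algorithm alternates between soft assignments and re-fitting the model. The toy 1-D, two-component sketch below uses a deliberately crude deterministic initialization (the data's min and max); all names and data are illustrative:

```python
import math

def em_gmm_1d(xs, iters=50):
    """Toy EM for a two-component 1-D Gaussian mixture: the E-step computes
    each component's responsibility for each point from the current model,
    and the M-step re-fits the model (weight, mean, spread) to those
    responsibilities."""
    mu = [min(xs), max(xs)]                # crude deterministic initialization
    sigma, pi = [1.0, 1.0], [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            w = [pi[k] / sigma[k] * math.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2))
                 for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate mixture weight, mean, and spread per component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)   # floor to avoid collapse
    return mu, sigma, pi

xs = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
mu, sigma, pi = em_gmm_1d(xs)
```

The fitted weights and spreads are exactly the "standard statistics" the notes mention: comparing fits with different component counts (e.g. via an information criterion) is one way such methods estimate the number of clusters automatically.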
K-Medoid Clustering