Unit 2 - Introduction to Cluster Analysis
Unit 2 : Chapter 2
Contents
• Classification v/s Clustering
• Clustering
• Types of data in cluster analysis
• Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
Classification v/s Clustering
• Classification is supervised: objects are assigned to a set of predefined classes using labeled training data.
• Clustering is unsupervised: objects are grouped by similarity without any predefined class labels.
Clustering
• What Is Cluster Analysis?
• Cluster analysis or simply clustering is the process of partitioning a set of data
objects (or observations) into subsets. Each subset is a cluster, such that objects
in a cluster are similar to one another, yet dissimilar to objects in other
clusters.
• The set of clusters resulting from a cluster analysis can be referred to as a
clustering.
Requirements for Cluster Analysis
Scalability:
• Many clustering algorithms work well on small data sets containing fewer
than several hundred data objects; however, a large database may contain
millions or even billions of objects, particularly in Web search scenarios.
• Clustering on only a sample of a given large data set may lead to biased results.
Therefore, highly scalable clustering algorithms are needed.
Data Matrix
• An n-by-p matrix (n objects × p variables):

  x_11 ... x_1f ... x_1p
  ...
  x_i1 ... x_if ... x_ip
  ...
  x_n1 ... x_nf ... x_np

• This represents n objects, such as persons, with p variables (measurements or attributes), such as age, height, weight, gender, and so on.
• The structure is in the form of a relational table, or n-by-p matrix (n objects × p variables).
Dissimilarity Matrix
• Stores the proximities that are available for all pairs of the n objects. It is often represented by an n-by-n table in which entry d(i, j) is the measured dissimilarity between objects i and j.
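As an illustration, the sketch below builds a small data matrix and derives the corresponding dissimilarity matrix using Euclidean distance. The variable names and the use of SciPy's pdist/squareform are illustrative choices, not part of the original notes.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 4 objects (rows) described by p = 2 variables (columns),
# e.g. age and height. The values are made up for illustration.
X = np.array([
    [25, 160.0],
    [30, 175.0],
    [22, 158.0],
    [40, 180.0],
])

# Dissimilarity matrix: an n-by-n table in which entry (i, j) holds the
# Euclidean distance d(i, j) between objects i and j.
D = squareform(pdist(X, metric="euclidean"))
print(D.round(2))
```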
What Is the Problem of the K-Means Method?
Table 2. Advantages and Limitations of the k-means algorithm

Advantages:
• Relatively efficient and easy to implement.
• Terminates at a local optimum.
• Applies even to large data sets.
• The clusters are non-hierarchical and they do not overlap.
• With a large number of variables, k-means may be computationally faster than hierarchical clustering.
• K-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.

Limitations:
• Sensitive to initialization.
• Limiting case of fixed data.
• Difficult to compare results obtained with different numbers of clusters.
• Needs the number of clusters to be specified in advance.
• Unable to handle noisy data or outliers.
• Not suitable for discovering clusters with non-convex shapes.
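A brief sketch of some of these points using scikit-learn's KMeans (the synthetic data, the value of k, and the random seeds are assumptions for illustration): k must be specified up front, each run only reaches a local optimum, and a single outlier can distort the result.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two globular clusters plus one extreme outlier.
cluster_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
outlier = np.array([[50.0, 50.0]])
X = np.vstack([cluster_a, cluster_b, outlier])

# The number of clusters k must be fixed in advance; n_init controls how many
# random initializations are tried, since each single run only reaches a
# local optimum of the within-cluster sum of squares.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("inertia:", round(km.inertia_, 2))
# The lone outlier distorts the solution: it either drags a centroid far away
# or captures a centroid of its own, merging the two real clusters.
print("centroids:\n", km.cluster_centers_.round(2))
```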
The K-Medoids Clustering Method: A
Representative Object-Based
Technique
1. Initialize: select k random points out of the n data points as the
medoids.
2. Associate each data point to the closest medoid using any common
distance metric.
3. While the cost decreases: for each medoid m and for each data point o
which is not a medoid:
4. Swap m and o, associate each data point to the closest medoid, and
recompute the cost.
5. If the total cost is more than that in the previous step, undo the swap.
Algorithm: k-medoids. PAM, a k-medoids algorithm for partitioning based on
medoid or central objects.
Input: k, the number of clusters; D, a data set containing n objects.
Output: a set of k clusters.
Method:
(1) arbitrarily choose k objects in D as the initial representative objects (medoids);
(2) repeat
(3) assign each remaining object to the cluster with the nearest representative object;
(4) randomly select a nonrepresentative object, o_random;
(5) compute the total cost, S, of swapping representative object o_j with o_random;
(6) if S < 0 then swap o_j with o_random to form the new set of k representative objects;
(7) until no change;
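The pseudocode above maps fairly directly onto the following minimal Python sketch of PAM. The function name pam, the cost definition (sum of distances to the nearest medoid), and the exhaustive swap loop are assumptions made for illustration, not an optimized implementation.

```python
import numpy as np

def pam(X, k, rng=None):
    """Minimal PAM sketch: returns the indices of the k medoids and the labels."""
    rng = np.random.default_rng(rng)
    n = len(X)
    # Pairwise Euclidean distances between all objects.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    def total_cost(medoids):
        # Cost = sum of each object's distance to its nearest medoid.
        return dist[:, medoids].min(axis=1).sum()

    # (1) arbitrarily choose k objects as the initial medoids.
    medoids = list(rng.choice(n, size=k, replace=False))
    cost = total_cost(medoids)

    improved = True
    while improved:                              # (2) repeat ... (7) until no change
        improved = False
        for i in range(k):                       # each current medoid
            for o in range(n):                   # each non-medoid object o
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = o                 # tentatively swap the medoid with o
                new_cost = total_cost(candidate) # (5) cost S of the swap
                if new_cost < cost:              # (6) keep the swap only if it helps
                    medoids, cost, improved = candidate, new_cost, True

    # (3) assign each object to the cluster of its nearest medoid.
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

# Example usage on a tiny, made-up data set:
# X = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [50, 50]])
# medoids, labels = pam(X, k=2, rng=0)
```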
Which method is more robust: k-means or k-medoids?
• The k-medoids method is more robust than k-means in the presence of noise
and outliers because a medoid is less influenced by outliers or other extreme
values than a mean.
• For large values of n and k, such computation becomes very costly, and much
more costly than the k-means method.
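A one-dimensional example (the numbers are arbitrary) makes the robustness argument concrete: a single extreme value shifts the mean far more than the medoid.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an outlier

mean = values.mean()                             # pulled strongly toward the outlier
# The medoid is the actual data point that minimizes the total distance
# to all other points (computed here by brute force).
medoid = values[np.argmin([np.abs(values - v).sum() for v in values])]

print(mean)    # 22.0
print(medoid)  # 3.0
```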
Problems
• Refer class notes
How can we scale up the k-medoids method?
• To deal with larger data sets, a sampling-based method called CLARA
(Clustering LARge Applications) can be used.
• Instead of taking the whole data set into consideration, CLARA uses a
random sample of the data set. The PAM algorithm is then applied to compute
the best medoids from the sample.
• Ideally, the sample should closely represent the original data set. In many cases, a
large sample works well if it is created so that each object has equal probability of
being selected into the sample.
• The representative objects (medoids) chosen will likely be similar to those that
would have been chosen from the whole data set. CLARA builds
multiple random samples and returns the best clustering as the output.
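CLARA's sampling idea can be sketched on top of the pam function given earlier. The sample size, the number of samples, and the helper names are assumptions; the key point is that each candidate set of medoids found on a sample is evaluated against the whole data set, and the best one is kept.

```python
import numpy as np

def clara(X, k, n_samples=5, sample_size=40, rng=0):
    """CLARA-style sketch: run PAM on random samples, keep the best medoids."""
    rng = np.random.default_rng(rng)
    n = len(X)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        # Draw a random sample (each object equally likely to be selected).
        sample_idx = rng.choice(n, size=min(sample_size, n), replace=False)
        sample_medoids, _ = pam(X[sample_idx], k, rng=rng.integers(1_000_000))
        medoids = sample_idx[sample_medoids]          # map back to the full data set
        # Evaluate the candidate medoids on the WHOLE data set, not just the sample.
        dist = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=-1)
        cost = dist.min(axis=1).sum()
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```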
Hierarchical methods:
• A hierarchical method creates a hierarchical decomposition of the given set of
data objects.
• A hierarchical method can be classified as being either agglomerative or divisive,
based on how the hierarchical decomposition is formed.
• The divisive approach, also called the top-down approach, starts with all
the objects in the same cluster. In each successive iteration, a cluster is split
into smaller clusters, until eventually each object is in its own cluster, or a
termination condition holds.
Agglomerative Hierarchical Clustering
• Bottom-up strategy
• Each cluster starts with only one object
• Clusters are merged into larger and larger clusters until:
  • All the objects are in a single cluster
  • Certain termination conditions are satisfied

Divisive Hierarchical Clustering
• Top-down strategy
• Start with all objects in one cluster
• Clusters are subdivided into smaller and smaller clusters until:
  • Each object forms a cluster on its own
  • Certain termination conditions are satisfied
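A minimal bottom-up example of the strategy listed above, using scikit-learn's AgglomerativeClustering; the toy data and the choice of linkage="single" are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Five 2-D objects; two tight pairs plus one isolated point.
X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [8, 1]])

# Bottom-up merging with single linkage: at each step, the two clusters whose
# closest members are at the minimum Euclidean distance are merged, until the
# requested number of clusters remains (a termination condition).
agg = AgglomerativeClustering(n_clusters=3, linkage="single").fit(X)
print(agg.labels_)  # the two pairs and the isolated point end up in separate clusters
```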
• Hierarchical clustering methods can be distance-based or density- and continuity-based.
• Hierarchical methods suffer from the fact that once a step (merge or split) is
done, it can never be undone. This rigidity is useful in that it leads to
smaller computation costs by not having to worry about a combinatorial
number of different choices.
• Such techniques cannot correct erroneous decisions; however, methods for
improving the quality of hierarchical clustering have been proposed.
Example
• Agglomerative (AGNES) and divisive (DIANA) algorithms on a data set of five objects {a, b, c, d, e}.
• [Figure: AGNES proceeds bottom-up from Step 0 to Step 4, merging a with b and d with e, then forming {c, d, e}, until all five objects are in a single cluster; DIANA proceeds top-down over the same steps in reverse.]
• In AGNES, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters.
• A tree structure called a dendrogram is commonly used to represent the process
of hierarchical clustering.
• It shows how objects are grouped together (in an agglomerative method)
or partitioned (in a divisive method).
• A dendrogram for the five objects is presented in the figure, where level l = 0 shows the
five objects as singleton clusters. At l = 1, objects a and b are grouped
together to form the first cluster, and they stay together at all subsequent levels.
[Figure: Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}.]
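The dendrogram in the figure can be reproduced in spirit with SciPy. The coordinates assigned to a-e below are invented for illustration; single linkage mirrors the minimum-distance merging rule used by AGNES above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Invented 2-D coordinates for the five objects a, b, c, d, e.
labels = ["a", "b", "c", "d", "e"]
X = np.array([[1, 1], [1.3, 1], [4, 4], [4.3, 4], [4.6, 4.6]])

# Single-linkage agglomerative clustering; Z records every merge step,
# which is exactly what the dendrogram visualizes.
Z = linkage(X, method="single")
dendrogram(Z, labels=labels)
plt.ylabel("merge distance")
plt.show()
```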
Advantages and Limitations of Hierarchical Clustering

Advantages:
• Fast computation, and there is no need to pre-define the number of clusters (k).

Limitations:
• Hard to define levels for the clusters.
• Sensitive to noise and outliers.
• No backtracking: once a merge or split is performed, it cannot be undone.

Density-Based Methods
Advantages
• Density-based clustering algorithms can effectively handle noise and outliers in
the dataset, making them robust in such scenarios.
• These algorithms can identify clusters of arbitrary shapes and sizes, unlike
other clustering algorithms that may assume specific forms.
• They don't require prior knowledge of the number of clusters, making them more
flexible and versatile.
• They can efficiently process large datasets and handle high-dimensional data.
Disadvantages
• The performance of density-based clustering algorithms is highly
dependent on the choice of parameters, such as ε (eps) and MinPts, which can be
challenging to tune.
• These algorithms may not be suitable for datasets with low-density regions or
evenly distributed data points.
• They can be computationally expensive and time-consuming,
especially for large datasets with complex structures.
• Density-based clustering can struggle to identify clusters of varying
densities or scales.
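For concreteness, here is a minimal DBSCAN run with scikit-learn, where eps and min_samples play the roles of ε and MinPts discussed above; the data set and parameter values are assumptions that would need tuning in practice.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-convex shape that k-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps (ε) is the neighbourhood radius and min_samples is MinPts; points in
# sparse regions are labelled -1, i.e. treated as noise/outliers.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print("clusters found:", len(set(db.labels_)) - (1 if -1 in db.labels_ else 0))
print("noise points:", int((db.labels_ == -1).sum()))
```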
Summary of methods