Notes
In single linkage clustering, the distance between two clusters is defined as the minimum distance
between a point in one cluster and a point in the other. Because two clusters merge as soon as any
single pair of their points is close, it tends to create elongated, "chained" clusters.
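As a minimal sketch of this definition, the snippet below computes the single linkage distance
between two small clusters. The function name single_linkage_distance, the sample points, and the
use of Euclidean distance are illustrative assumptions, not something these notes prescribe.

```python
import math

def single_linkage_distance(cluster_a, cluster_b):
    """Smallest Euclidean distance between a point in cluster_a and a point in cluster_b."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical sample clusters lying on the x-axis.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (10.0, 0.0)]

# The closest cross-cluster pair is (1, 0) and (3, 0).
print(single_linkage_distance(cluster_a, cluster_b))  # 2.0
```

Note that the far-away point (10, 0) has no effect on the result, which is why one close pair is
enough to chain two clusters together.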
Single linkage is used in ecological studies where organisms need to be grouped based on their
proximity or similarity in a geographical region. For example, when clustering species found in
different locations, single linkage joins two groups as soon as any pair of their members lies close
together, which can reflect actual biological processes like species dispersal or migration.
Why single linkage? Single linkage preserves long, stretched-out clusters, which can
represent continuous habitats or migration corridors.
In complete linkage clustering, the distance between two clusters is defined as the maximum
distance between a point in one cluster and a point in the other. Because a merge requires every
cross-cluster pair to be close, this method tends to produce compact, roughly spherical clusters.
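A minimal sketch of the complete linkage distance, for illustration only. The function name
complete_linkage_distance, the sample points, and the use of Euclidean distance are assumptions.

```python
import math

def complete_linkage_distance(cluster_a, cluster_b):
    """Largest Euclidean distance between a point in cluster_a and a point in cluster_b."""
    return max(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Hypothetical sample clusters lying on the x-axis.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (10.0, 0.0)]

# The farthest cross-cluster pair is (0, 0) and (10, 0).
print(complete_linkage_distance(cluster_a, cluster_b))  # 10.0
```

Here a single distant point, (10, 0), determines the whole cluster distance, so complete linkage
is reluctant to merge clusters that contain far-apart members.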
In text mining or document clustering, you often want clusters of documents where the documents
in the same cluster are very similar to each other, and there is minimal overlap with other clusters.
Complete linkage works well for applications like clustering research papers, customer reviews, or
news articles.
Why complete linkage? It ensures that all documents within a cluster are close together in
terms of content, making it less likely that dissimilar documents are grouped together.
In average linkage clustering, the distance between two clusters is defined as the average of all
pairwise distances between a point in one cluster and a point in the other. It is a compromise
between the single and complete linkage methods.
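A minimal sketch of the average linkage distance, for illustration only. The function name
average_linkage_distance, the sample points, and the use of Euclidean distance are assumptions.

```python
import math

def average_linkage_distance(cluster_a, cluster_b):
    """Mean Euclidean distance over all cross-cluster pairs of points."""
    distances = [math.dist(p, q) for p in cluster_a for q in cluster_b]
    return sum(distances) / len(distances)

# Hypothetical sample clusters lying on the x-axis.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (10.0, 0.0)]

# Cross-cluster distances are 3, 10, 2 and 9, so the mean is 24 / 4.
print(average_linkage_distance(cluster_a, cluster_b))  # 6.0
```

For these sample points the result lies between the smallest cross-cluster distance (2) and the
largest (10), which is the sense in which average linkage sits between the other two methods.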
Average linkage is widely used in bioinformatics to cluster gene expression data, where the similarity
or dissimilarity of gene expression patterns needs to be identified. Since genes often exhibit gradual
changes in expression rather than sharp distinctions, average linkage provides a balanced approach
by considering the overall similarity.
Why average linkage? It produces balanced clusters because no single pair of points decides
a merge: neither the closest pair (as in single linkage) nor the farthest pair (as in complete
linkage). This is well suited to identifying patterns in biological data where there is natural
variability in gene expression.
Each of these linkage methods is suited to different types of data and clustering goals, depending on
the structure of the data and the desired cluster properties.
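To make the difference concrete, the sketch below runs a naive agglomerative clustering with a
pluggable linkage rule on four one-dimensional points. The names cluster_distance and agglomerate
and the sample data are hypothetical. The loop recomputes every cluster distance at each step, so
it is meant only as an illustration; real work would normally rely on a library implementation.

```python
import math
from itertools import combinations

def cluster_distance(a, b, linkage):
    """Distance between clusters a and b under the given linkage rule."""
    distances = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(distances)
    if linkage == "complete":
        return max(distances)
    if linkage == "average":
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage: {linkage!r}")

def agglomerate(points, n_clusters, linkage):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > n_clusters:
        # Pick the pair of cluster indices (i < j) with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return clusters

# Hypothetical 1-D points with gaps of 2, 2.5 and 3 between neighbours.
points = [(0.0,), (2.0,), (4.5,), (7.5,)]

print(agglomerate(points, 2, "single"))    # [[(0.0,), (2.0,), (4.5,)], [(7.5,)]]
print(agglomerate(points, 2, "complete"))  # [[(0.0,), (2.0,)], [(4.5,), (7.5,)]]
print(agglomerate(points, 2, "average"))   # [[(0.0,), (2.0,)], [(4.5,), (7.5,)]]
```

Single linkage chains the first three points together because each is close to its neighbour,
while complete and average linkage split the data into two pairs of similar extent.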