SciPy - Agglomerative Clustering
Last Updated: 17 Sep, 2024
Agglomerative clustering, a bottom-up form of hierarchical clustering, is one of the most popular clustering techniques in data analysis and machine learning. It builds a hierarchy of clusters from the bottom up: each data point starts as its own cluster, and the closest pair of clusters is merged at each iteration until a desired cluster structure is formed.
In this article, we will cover the relevant theoretical concepts and provide practical examples to ensure a deep understanding of this topic.
What Is Agglomerative Clustering?
Agglomerative clustering is a type of hierarchical clustering method, where the algorithm starts with each data point as its own individual cluster. The clusters are then merged iteratively based on a specific criterion, such as distance or linkage method, until a certain stopping criterion (e.g., number of clusters) is reached.
Key Features of Agglomerative Clustering:
- Hierarchical structure: It generates a hierarchy of clusters, typically visualized using a dendrogram.
- Distance metric: Determines how similar two clusters or data points are.
- Linkage criterion: Determines how the distance between clusters is measured.
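To make the distance-metric idea concrete, here is a small sketch (not part of the original article) using SciPy's pdist, which computes all pairwise distances, and squareform, which expands them into a full distance matrix. The three example points are chosen arbitrarily for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three 2-D points; pdist returns the condensed pairwise-distance vector
points = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
condensed = pdist(points, metric='euclidean')  # pairs (0,1), (0,2), (1,2)
matrix = squareform(condensed)                 # full symmetric 3x3 matrix

print(condensed)     # distances 5.0, 1.0, and sqrt(18)
print(matrix.shape)  # (3, 3)
```

SciPy's hierarchical-clustering routines accept either the raw observations or a condensed distance vector like this one.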
Steps Involved in Agglomerative Clustering
The agglomerative clustering process generally follows these steps:
- Initialization: Each data point starts as its own cluster.
- Distance Matrix Calculation: A distance matrix is computed to determine the similarity between data points or clusters.
- Cluster Merging: The closest pair of clusters is merged based on a chosen linkage criterion.
- Update Distance Matrix: After merging, the distance matrix is updated to reflect the new distances between clusters.
- Repeat: The process of merging and updating the distance matrix is repeated until the stopping criterion is met, such as the desired number of clusters.
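The steps above can be sketched directly in Python. The toy loop below (written for clarity, not efficiency, and not how SciPy implements it internally) uses single linkage, i.e., the minimum distance between any two members, and stops when two clusters remain:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy illustration of the bottom-up merge loop (single linkage)
np.random.seed(0)
points = np.random.randn(6, 2)
clusters = [[i] for i in range(len(points))]  # each point starts as its own cluster

while len(clusters) > 2:  # stopping criterion: two clusters remain
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            # single linkage: smallest distance between members of the two clusters
            d = cdist(points[clusters[a]], points[clusters[b]]).min()
            if best is None or d < best[0]:
                best = (d, a, b)
    _, a, b = best
    clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
    del clusters[b]

print(clusters)  # two lists of point indices
```

In practice you would never write this loop yourself; SciPy's linkage function, shown next, performs the same process far more efficiently.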
Implementing Agglomerative Clustering Using SciPy
We will use the scipy.cluster.hierarchy module to implement agglomerative clustering. This module provides various functions for hierarchical clustering and allows for the visualization of the dendrogram, a tree-like diagram representing the merging of clusters.
Step 1: Import Required Libraries
Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
Step 2: Generate Sample Data
Python
# Generate random data points for clustering
np.random.seed(42)
data = np.random.randn(50, 2)
Step 3: Compute the Linkage Matrix
The linkage function is used to compute the hierarchical clustering based on the data. You can specify the linkage method (e.g., 'single', 'complete', 'average', or 'ward'); 'ward' merges the pair of clusters that least increases the total within-cluster variance.
Python
# Compute the linkage matrix
Z = linkage(data, method='ward')
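It helps to know what the linkage matrix Z actually contains. Each of its n-1 rows records one merge: the indices of the two clusters merged, the distance between them, and the size of the new cluster. A short sketch (added for illustration) using the same seeded data as above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

np.random.seed(42)
data = np.random.randn(50, 2)
Z = linkage(data, method='ward')

# Each row of Z is one merge: [cluster_i, cluster_j, distance, new cluster size]
print(Z.shape)  # (49, 4): n-1 merges for n = 50 points
print(Z[0])     # the first (closest) pair merged
```

Original observations are numbered 0 to n-1; the cluster formed at row i is referred to by index n + i in later rows.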
Step 4: Visualize the Dendrogram
A dendrogram is useful to visualize the hierarchical relationships between clusters. You can use the dendrogram function from SciPy to create the plot.
Python
# Create a dendrogram to visualize the hierarchical clustering
plt.figure(figsize=(10, 6))
dendrogram(Z)
plt.title('Dendrogram for Agglomerative Clustering')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
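For larger datasets the full dendrogram becomes crowded. The dendrogram function supports truncation; the sketch below (an addition for illustration, using the same seeded data) shows only the last 10 merges, with contracted subtrees indicated:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

np.random.seed(42)
data = np.random.randn(50, 2)
Z = linkage(data, method='ward')

plt.figure(figsize=(10, 6))
# Show only the last 10 merges; leaves in parentheses are collapsed subtrees
dendrogram(Z, truncate_mode='lastp', p=10, show_contracted=True)
plt.title('Truncated Dendrogram (last 10 merges)')
plt.xlabel('Cluster')
plt.ylabel('Distance')
plt.show()
```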
Output:
[Dendrogram of the 50 sample points]
The dendrogram represents the hierarchical relationships between clusters. Each leaf corresponds to a single data point, and each merge of two clusters is drawn as a U-shaped link joining them; the height of the link is the distance at which the two clusters are merged.
Important Concepts in Dendrograms:
- Cutting the Dendrogram: You can cut the dendrogram at a specific distance to form flat clusters.
- Inconsistency: The inconsistency coefficient compares the height of a merge with the average height of the merges directly below it. Larger values indicate that the merge joins comparatively dissimilar clusters.
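SciPy exposes this measure through the inconsistent function. A brief sketch (added for illustration, reusing the seeded data from above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, inconsistent

np.random.seed(42)
data = np.random.randn(50, 2)
Z = linkage(data, method='ward')

# Each row: [mean height, std of heights, link count, inconsistency coefficient]
# computed over a depth-2 neighbourhood below each merge
R = inconsistent(Z, d=2)
print(R.shape)  # (49, 4): one row per merge in Z
```

The inconsistency matrix R can also be passed to fcluster with criterion='inconsistent' to form flat clusters.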
Step 5: Form Clusters Based on a Distance Threshold
You can cut the dendrogram at a certain distance to form clusters. The fcluster function can be used to achieve this.
Python
from scipy.cluster.hierarchy import fcluster
# Form flat clusters by cutting the dendrogram at a specified distance
max_distance = 1.5
clusters = fcluster(Z, max_distance, criterion='distance')
# Plot the clustered data
plt.scatter(data[:, 0], data[:, 1], c=clusters, cmap='rainbow')
plt.title('Data Points Clustered Using Agglomerative Clustering')
plt.show()
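Instead of picking a distance threshold, you can ask fcluster for a fixed number of clusters with criterion='maxclust'. A short sketch (an addition, using the same seeded data as the article):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

np.random.seed(42)
data = np.random.randn(50, 2)
Z = linkage(data, method='ward')

# Form exactly (at most) 4 flat clusters, however tall the cut must be
clusters = fcluster(Z, t=4, criterion='maxclust')
print(np.unique(clusters))  # cluster labels start at 1
```

This is often more convenient than a distance threshold when you know how many groups you expect.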
Output:
[Scatter plot of the data points coloured by their flat-cluster assignment]
Conclusion
Agglomerative clustering is a powerful and flexible method for hierarchical clustering that builds a hierarchy of clusters in a bottom-up approach. Using the SciPy library, we can easily implement and visualize this clustering method through the use of functions like linkage, dendrogram, and fcluster. Although the algorithm can be computationally expensive for large datasets, its interpretability and flexibility make it an excellent choice for many real-world applications.