Optimization and Dimension Reduction Techniques

Module - I: Introduction to Unsupervised Learning



Learning Objectives
At the end of this module, you will be able to:
Ɣ Discuss the concept of clustering and its purpose in unsupervised learning
Ɣ Discuss the K-means clustering algorithm and its step-by-step procedure
Ɣ Learn methods to determine the optimal number of clusters in K-means clustering
Ɣ Interpret and analyse the results obtained from K-means clustering
Ɣ Discuss DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
method
Ɣ Infer the concepts of core points, density reachability and noise points in DBSCAN

Introduction
Unsupervised machine learning, also known as unsupervised learning, utilises
machine learning algorithms to analyse and cluster datasets without any labelled
information. These algorithms autonomously identify hidden patterns or data clusters
without human intervention, making them valuable for tasks like exploratory data
analysis, cross-selling strategies, consumer segmentation and image recognition. These
methods excel at identifying similarities and differences within data.
Unsupervised learning is considered one of the most crucial machine learning
frameworks because it allows the data to be “observed” in a systematic, comprehensive,
objective and frequently creative way, uncovering the subtleties of the underlying process
that produced the data, the grammar in the data and insights that we were not even aware
were present in the data in the first place.
The goal of supervised learning is to discover a mapping from an input to an output
whose proper values are supplied by a supervisor. In unsupervised learning, input data is
the only available information because there is no such supervisor.
Five unsupervised learning paradigms are explained below:
1. Projections: Projections refer to the process of reducing high-dimensional data to
lower-dimensional representations for data visualisation and exploring whether the
data can be represented on lower-dimensional “manifolds” or if it remains inherently
high-dimensional. This analysis will cover different projection algorithms, including:
(a) Principal Components Analysis (PCA), which minimises variance loss.
(b) Self-Organizing Maps, which distribute data on a predefined grid.
(c) Multidimensional Scaling (MDS), which preserves pairwise distances between
data points after projection.
2. Clustering: Clustering is the procedure of categorising entities (customers, movies,
stars, gene sequences, LinkedIn profiles, etc.) into groups or hierarchies based on
their similarities. This process resembles a compression mechanism similar to how
our brain categorises and maps observations. Clustering proves to be beneficial
for organising large datasets into meaningful clusters, facilitating interpretation and
actions such as segment-based marketing. It also allows us to filter out noise or
unimportant factors, such as accents in speech recognition. Various types of clustering
methods include:


(a) Partitional Clustering,


(b) Hierarchical Clustering,
(c) Spectral Clustering.
3. Density Estimation: Density Estimation is a technique used to determine the likelihood
of a specific observation occurring based on the available data. This technique is
commonly employed in fraud detection scenarios to identify normal patterns in the
data that have a high probability, as well as to identify abnormal or outlier patterns that
have a low probability. There are two distinct methodologies for acquiring the ability to
calculate the probability density of a data record:
(a) parametric and
(b) nonparametric
4. Pattern Recognition: Pattern Recognition is a process that entails identifying the most
prevalent or significant recurring patterns within the dataset. One instance involves the
identification of patterns, such as the correlation between individuals purchasing milk
and their tendency to also purchase bread. Another scenario entails the determination
of commonly occurring words that follow a specific sequence of words. The grammar
of the data can be discerned by analysing the relative frequency of these patterns.
High-frequency patterns are regarded as significant signals, whereas low-frequency
patterns are regarded as noise. Market-basket analysis is a technique used to discover
patterns from sets, while n-grams refer to instances of patterns from sequences that
can be studied.
5. Network Analysis: Network Analysis focuses on identifying structures in network or
graph data, such as detecting communities in social networks like terrorist cells or
fraud syndicates, determining node importance using link structures (e.g., PageRank)
and uncovering relevant structures like gene pathways, money laundering schemes,
or bridge structures. By applying graph theory and network analysis algorithms to real-
world networks, valuable insights can be gained that may be challenging to perceive
otherwise.
A further, cross-cutting observation will help you become a better “formulator” of
business problems as machine learning problems - a key quality of a data scientist.
Most machine learning algorithms, whether supervised or unsupervised, boil down
to some form of optimisation problem. Here, the view that machine learning is essentially
a four-stage optimisation process will be developed, as sketched after the list below.

Optimization Process:
1. Problem Approach: Gain a conceptual understanding of the problem by framing it as
an optimisation problem.
2. Formulation: Express the mathematical objective function precisely by utilising data-
driven intuition.
3. Objective Function Modification: The objective function is to be modified in order to
simplify it or make it more solvable.
4. Optimisation: Employ conventional optimisation methods to solve the altered objective
function.
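
To make these four stages concrete, the following minimal Python sketch (the data values
and names are illustrative, not taken from the text) frames a toy problem - finding a single
representative centre for some one-dimensional observations - as minimising a sum of
squared distances, and then hands the objective to a conventional optimiser:

import numpy as np
from scipy.optimize import minimize

data = np.array([2.0, 3.0, 6.0, 8.0, 9.0])   # toy observations (Stage 1: conceptual framing)

def objective(c):
    # Stage 2: the mathematical objective - sum of squared distances to a candidate centre c
    return np.sum((data - c) ** 2)

# Stage 3: the objective is already smooth and convex, so no modification is needed.
# Stage 4: solve it with a conventional optimiser; the solution converges to the mean (5.6).
result = minimize(objective, x0=np.array([0.0]))
print(result.x)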

1.1 Introduction to Clustering


Clustering is an unsupervised learning technique, as stated earlier. Unsupervised
learning involves extracting patterns from input data without labelled responses. This
method is commonly used to identify groups, generative qualities and significant
structures within a set of instances.
The primary goal of clustering is to partition a given population or dataset into
multiple groups, such that the data points within each group exhibit higher similarity to
one another and greater dissimilarity to the data points in other groups. The concept
involves the classification of objects according to their degree of similarity and
dissimilarity.
In the image below, the clustered data points can be categorised into groups: three
distinct clusters can be distinguished and counted.

Clusters need not be spherical, as shown in the illustration below:

(Image Source: https://www.geeksforgeeks.org/clustering-in-machine-learning/)

1.1.1 Purpose and Applications of Clustering


Purpose
The core assumption in the pursuit of identifying patterns in data is rooted in the
concept that although datasets can be vast, the underlying mechanisms responsible for
generating the data possess limited variability. There are a finite number of latent sources
of variations that give rise to the observed data.
1. Quantization of Customer Behaviours:
In retail point-of-sale data, despite observing considerable variation from customer to
customer, there are only finite types of customer behaviours based on various factors
such as lifestyle, life-stage, purchase behaviour and purchase intents. Discovering
such quantization allows us to identify distinct customer segments and their behaviours.


2. Categorization of Videos, Web Pages and People:


Similarly, in vast collections like YouTube videos, web pages and social media profiles,
the different types of content or individuals can be grouped into a reasonably finite set.
Despite the seemingly endless variety, there are specific categories for videos, web
pages and people, allowing for meaningful categorization.
3. Grouping of Words in a Language:
In languages, the vocabulary may consist of numerous words, but they can be grouped
based on characteristics such as parts of speech, roots, tenses and meanings,
resulting in a relatively small number of word types.
4. Telematics Data and Driving Behaviour:
Even in telematics data related to driving behaviour, where the data variation might be
extensive across all cars, the number of distinct driving actions combined with various
driving scenarios remains finite. This insight allows us to analyse and understand
driving patterns more effectively.
5. The Key: Letting the Data Speak for Itself
The key to discovering structure in data lies in letting the data speak for itself. By
applying appropriate techniques and algorithms, the underlying patterns and variations
that exist within the dataset can be identified, even when dealing with vast amounts of
information.

Classification vs. Clustering:


Classification:
Ɣ Known Variations: Dealing with data variations already known in advance.
Ɣ Mapping to Known Types: Mapping data variations into a specific set of known
types.
Ɣ Supervised Learning: Covered in detail in the supervised learning topic.

Clustering:
Ɣ Discovering Variations: Identifying unknown variations in the data.
Ɣ Grouping Similar Data: Grouping similar data points based on defined criteria.
Ɣ Unsupervised Learning: Involves exploring and discovering patterns in unlabelled
data.

Applications of Clustering
Applications of Clustering in Various Fields:
Ɣ Marketing: Characterising and discovering customer segments for targeted
marketing strategies.
Ɣ Biology: Classifying different species of plants and animals based on genetic
patterns.
Ɣ Libraries: Clustering books based on topics and information for better organisation.
Ɣ Insurance: Identifying customer policies and detecting fraudulent activities.
Ɣ City Planning: Grouping houses based on geographical locations and studying
property values.
Ɣ Earthquake Studies: Identifying earthquake-affected areas to determine high-risk
zones.

Ɣ Image Processing: Grouping similar images, classifying images by content and
identifying patterns in image data.
Ɣ Genetics: Grouping genes with similar expression patterns and identifying gene
networks.
Ɣ Finance: Identifying market segments, analysing stock market data and assessing
risk in investment portfolios.
Ɣ Customer Service: Grouping customer inquiries and complaints, identifying common
issues and developing targeted solutions.
Ɣ Manufacturing: Grouping similar products, optimising production processes and
identifying defects.
Ɣ Medical Diagnosis: Grouping patients with similar symptoms or diseases for accurate
diagnoses and treatments.
Ɣ Fraud Detection: Identifying suspicious patterns in financial transactions to detect
fraud.
Ɣ Traffic Analysis: Grouping patterns of traffic data for transportation planning and
infrastructure improvements.
Clustering techniques find diverse applications across industries, helping
professionals uncover valuable insights from large volumes of data.

1.1.2 Types of Clustering Technique


Three broad clustering approaches are:
1. Partitional
2. Hierarchical/ agglomerative
3. Spectral
1. Partitional Clustering: The clustering process categorises the information into different
groups based on their shared qualities and similarities. The determination of the
number of clusters required for clustering algorithms is a crucial task for data analysts.
In the context of a database (D) with multiple items (N), the partitioning method is
employed to generate user-specified partitions of the data. Each partition corresponds
to a distinct cluster and represents a specific region. There exist multiple algorithms
that fall within the partitioning approach, with a selection of notable ones being:
™ K-Mean,
™ PAM(K-Medoids),
™ CLARA algorithm (Clustering Large Applications), etc.

(ImageSource:https://www.google.com/url?sa=i&url=https%3A%2F%2Fptop.only.wip.la%3A443%2Fhttp%2Fwww.sthda.
com%2Fenglish%2Fwiki%2Fthe-ultimate-guide-to-partitioning-clustering&psig=AOvVaw3feeaF0ueAo9
Zu_oTdIbgH&ust=1692940199507000&source=images&cd=vfe&opi=89978449&ved=0CBAQjRxqFwoTC
Mi3k-zD9IADFQAAAAAdAAAAABAE)


2. Hierarchical Clustering: Hierarchical clustering can be used as an alternative to


partitional clustering since it does not require pre-specifying the number of clusters
to be produced. The dataset is partitioned into clusters using a method that generates
a tree-like structure, commonly referred to as a dendrogram. The selection of
observations or clusters can be achieved by appropriately removing a specific portion
of the tree. The Agglomerative Hierarchical algorithm is widely recognised as a
prominent example of this technique.

Fig: Showing Hierarchical Clustering


https://www.google.com/url?sa=i&url=https%3A%2F%2Fptop.only.wip.la%3A443%2Fhttps%2Fwww.kdnuggets.
com%2F2019%2F09%2Fhierarchical-clustering.html&psig=AOvVaw0W5N4LtMxzJl2ETGAMP1DL&ust=1
692940371010000&source =images&cd=vfe&opi=89978449&ved=0CBAQjRxqFwoTCICD2r7E9IADFQAA
AAAdAAAAABAK)

Types of Hierarchical Clustering:


Hierarchical clustering can be done in two ways: top-down and bottom-up.
1. Bottom-up hierarchical clustering (Agglomerative Clustering): Agglomerative
clustering constructs the hierarchy of clusters
by starting with the raw data points at the base of the hierarchy. It computes the
distances between all pairs of points and then joins the two points that are closest
to each other, effectively “merging” them. The merged point replaces the two original
points, reducing the total number of data points from N to N-1. This process is repeated
iteratively until all the data has been merged into a single root node, combining two
data points or clusters at each step.
Agglomerative Hierarchical Clustering uses the following algorithm:
Let’s say there are six data points A, B, C, D, E and F.


Figure – Agglomerative Hierarchical clustering


(Imagesource:https://www.google.com/url?sa=i&url=https%3A%2F%2Fptop.only.wip.la%3A443%2Fhttps%2Fwww.geeksforgeeks.
org%2Fhierarchical-clustering-in-data-mining%2F&psig=AOvVaw0W5N4LtMxzJl2ETG
AMP1DL&ust=1692940371010000&source=images&cd=vfe&opi=89978449&ved=0CBA-
QjRxqFwoTCICD2r7E9IADFQAAAAAdAAAAABAE)

In Step 1, each alphabet is treated as a single cluster and the distance of one cluster
from all other clusters is calculated.
In Step 2, comparable clusters (B) and (C) are merged together to form a single
cluster, as well as clusters (D) and (E). After this step, the clusters are [(A), (BC), (DE),
(F)].
In Step 3, the proximity is recalculated and clusters (DE) and (F) are merged
together to form new clusters, resulting in [(A), (BC), (DEF)].
In Step 4, the process is repeated and clusters (DEF) and (BC) are found to be
comparable and are merged together to form a new cluster, leading to [(A), (BCDEF)].
In Step 5, the two remaining clusters, (A) and (BCDEF), are merged together to form
a single cluster, resulting in [(ABCDEF)].
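
A small, hedged SciPy sketch of this merge sequence is shown below; the 2-D coordinates
are invented for illustration (the text only shows the points in a figure), chosen so that the
merges happen in the same order as Steps 1-5 above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([[0.0, 8.0],    # A - far from the rest, so it merges last
                   [4.0, 4.0],    # B
                   [4.2, 4.3],    # C - close to B, so (B, C) merge early
                   [8.0, 1.0],    # D
                   [8.2, 1.2],    # E - close to D, so (D, E) merge early
                   [9.0, 2.0]])   # F - joins (D, E) next

Z = linkage(points, method="single")    # agglomerative clustering, single linkage
dendrogram(Z, labels=labels)            # merge order: (DE), (BC), (DEF), (BCDEF), (ABCDEF)
plt.ylabel("merge distance")
plt.show()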

Benefits and Drawbacks of Hierarchical Agglomerative Clustering:


Benefits
Ɣ Deterministic Clusters: Hierarchical agglomerative clustering consistently produces
the same clustering for the same dataset and definition of distance, providing
reliable and repeatable results.
Ɣ Feature Representation vs. Distance Function: Agglomerative clustering can
work with datasets where only pairwise distances are given, making it suitable for
scenarios without a predefined feature representation.

Drawbacks
Ɣ Scale: Agglomerative clustering has a quadratic time complexity in the number of
data points, making it computationally expensive for large datasets.
Ɣ Number of Clusters: While hierarchical agglomerative clustering allows flexibility in
obtaining the desired number of clusters, determining the optimal number of clusters
can still be challenging and subjective.


2. Top-down hierarchical clustering (Divisive Clustering): The process commonly known


as divisive clustering is a technique that applies partitional clustering iteratively. It
starts by identifying K clusters at the base of the hierarchy and then proceeds to find
K2 clusters within each of those, continuing this process for subsequent levels. Once
the appropriate number of levels and clusters at each level has been established, a
top-down methodology can be employed to ascertain the comprehensive organisation
of the data. Nonetheless, there are still concerns regarding initialization, the number of
clusters at each level and other issues typically associated with a partitional clustering
algorithm. The issue of non-determinism persists, resulting in the potential for a variety
of different answers to be generated.

Figure – Divisive Hierarchical clustering


(Imagesource:https://www.google.com/url?sa=i&url=https%3A%2F%2Fptop.only.wip.la%3A443%2Fhttps%2Fwww.geeksforgeeks.
org%2Fhierarchical-clustering-in-data-mining%2F&psig=AOvVaw0W5N4LtMxzJl2ETG
AMP1DL&ust=1692940371010000&source=images&cd=vfe&opi=89978449&ved=0CBA-
QjRxqFwoTCICD2r7E9IADFQAAAAAdAAAAABAE)

As the name implies, Divisive Hierarchical Clustering is the complete opposite of


Agglomerative Hierarchical Clustering. Divisive Hierarchical clustering entails treating all
of the data points as a single cluster, then at the conclusion of each cycle, removing the
data points from the clusters that are incomparable. The final result is N clusters.

1.2 K-means Clustering


K-Means Clustering, an Unsupervised Machine Learning algorithm, clusters
unlabelled datasets into distinct groups.
Unsupervised Machine Learning involves training a computer to handle unclassified
data without any prior guidance, organising unsorted data based on similarities, patterns
and variations.
The primary objective of clustering is to partition a population or a set of data points
into multiple groups, ensuring that the data points within each group are more alike to
one another and distinguishable from those in other groups, grouping items based on
their levels of similarity and dissimilarity.
The k-means clustering algorithm serves two primary functions:


™ The iterative technique is employed to select the optimal value for K centre
points or centroids.
™ Each individual data point is paired with the closest k-centroid. A cluster is
created when data points are in close proximity to a designated k-centre.
As a result, each cluster is distinct from the others and contains data points with
some commonality.
Fig: The below diagram explains the working of the K-means Clustering:

(Image source : https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning)

The K-means clustering algorithm is a widely used unsupervised machine learning


technique that aims to divide a given dataset into separate clusters or groups. This
division is based on the similarity observed among the data points. The objective is to
reduce the variability within each cluster and increase the variability between clusters.
K-means clustering is a data analysis technique that can be comprehended by following
a systematic procedure, involving mathematical iterations, the formation of clusters, and
the evaluation of the results.

Step 1: Initialisation process


Specify the desired number of clusters, denoted as ‘k’, that you wish to create. This
step is of utmost importance and can be determined by leveraging domain expertise or
employing methodologies such as the Elbow Method.
Randomly assign initial values to the ‘k’ cluster centroids. The centroids serve as the
initial estimation for the cluster centres.

Step 2: Allocation
For every individual data point within your dataset, compute its distance, usually
using the Euclidean distance formula, to each of the ‘k’ centroids.
Allocate the data point to the cluster that has the closest centroid, determined by the
minimum distance.

Step 3: Perform centroid updates.


™ Compute the arithmetic mean (average) of all data points within each cluster.
™ Assign the calculated mean as the centroid for each cluster.


Step 4: Check for Convergence


Perform a comparison to determine if there has been a significant change in the
centroids since the previous iteration. The algorithm terminates if there has been minimal
change or if the maximum number of iterations has been reached.

Step 5: Iterate
If there is a significant change in the centroids, iterate through steps 2 and 3 until
convergence is achieved.

Example:
Here is a concise illustration of K-means clustering:
Consider a given dataset consisting of 8 data points in a two-dimensional space: [(2,
3), (3, 3), (6, 5), (8, 8), (9, 7), (10, 8), (12, 9), (14, 12)]. Our objective is to generate two
distinct clusters by employing the k-means clustering algorithm with k=2.

Step 1: Initialisation process


Randomly assign initial values to two centroids. Assuming we initialise centroids
C1(3, 3) and C2(8, 8).

Step 2: Task Allocation


Perform a distance calculation between points and centroids, and subsequently
assign each point to its closest centroid.
Cluster 1: [(2, 3), (3, 3)]
Cluster 2: [(6, 5), (8, 8), (9, 7), (10, 8), (12, 9), (14, 12)]

Step 3: Perform centroid updates.


Recalculate the centroids for both clusters:
New C1(2.5, 3)
New C2(9.83, 8.17)

Step 4: Verification of Convergence


Perform a comparative analysis to determine if there have been significant changes
in the centroids. If the answer is affirmative, proceed to iterate steps 2 and 3.

Step 5: Iterate
Iterate through steps 2 and 3 until reaching convergence.
Clustering Formation refers to the process of grouping or organising data points into
distinct clusters based on their similarities or proximity to each other.
After multiple iterations, the centroids will converge to stable positions.
Cluster 1: [(2, 3), (3, 3)], Centroid: (2.5, 3)
Cluster 2: [(6, 5), (8, 8), (9, 7), (10, 8), (12, 9), (14, 12)], Centroid: (9.83, 8.17)
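
The same worked example can be reproduced with scikit-learn (assuming it is installed);
the initial centroids are fixed to (3, 3) and (8, 8) here to mirror Step 1:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([(2, 3), (3, 3), (6, 5), (8, 8), (9, 7), (10, 8), (12, 9), (14, 12)])
init_centroids = np.array([[3.0, 3.0], [8.0, 8.0]])     # the initial guesses from Step 1

km = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(X)
print(km.labels_)              # cluster index assigned to each of the 8 points
print(km.cluster_centers_)     # converges to roughly (2.5, 3) and (9.83, 8.17)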

Evaluation:
™ The commonly used evaluation metrics for K-means clustering are the Within-
Cluster Sum of Squares (WCSS) and the Silhouette Score. These metrics
evaluate the effectiveness of the clustering algorithm in determining the quality
of the clusters.
™ The Within-Cluster Sum of Squares (WCSS) is a metric that quantifies the sum
of squared distances between data points and their respective cluster centroids.
Lower within-cluster sum of squares (WCSS) values are indicative of more
compact and closely grouped clusters.
™ The Silhouette Score is a metric used to quantify the similarity of an object to
its assigned cluster in relation to other clusters. A higher Silhouette Score is
indicative of improved cluster separation.
The K-means clustering algorithm is an iterative procedure that relies on the
selection of initial centroids and the determination of the number of clusters, denoted
as ‘k’, to achieve optimal performance. Running the algorithm iteratively with various
initializations is crucial in order to mitigate the risk of converging to local minima and to
select the optimal clustering solution.

1.2.1 Algorithm and Procedure


The K-means clustering algorithm is widely used to partition a given dataset into k
distinct groups or clusters, where k represents the desired number of groups. The main
objective of this algorithm is to categorise objects into multiple clusters, ensuring that
objects within the same cluster have high similarity (high intra-class similarity), while
objects belonging to different clusters exhibit significant dissimilarity (low inter-class
similarity).
In k-means clustering, each cluster is represented by its centroid, which is
calculated as the mean of all the points assigned to that cluster. The core idea behind
k-means clustering is to form clusters in a way that minimises the total variation within
each cluster, known as the total intra-cluster variation or total within-cluster variation.
This implies that objects within a cluster should be as close to the centroid as possible,
resulting in tightly packed and well-separated clusters.

Fig: Process for K-means Algorithm


(Image source : https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning)


The process of K-means clustering involves the following steps:


™ Specify the number of clusters (k) that you want to create in the final solution.
™ Randomly select k objects from the dataset to serve as the initial cluster centres
or centroids.
™ Assign each remaining object in the dataset to its nearest centroid based on
Euclidean distance. This step is called the “cluster assignment step.”
™ Calculate the new mean value of each cluster based on the objects assigned to
it. This step is known as a “centroid update.”
™ Reassign all the objects in the dataset based on the updated cluster means.
™ The cluster assignment and centroid update steps are repeated iteratively until
convergence is achieved. Convergence occurs when the cluster assignments in
the current iteration remain the same as those from the previous iteration.
The pseudocode of above algorithm is as follows:
Initialize k means with random values
For a given number of iterations:
    For each item in the dataset:
        Find the mean closest to the item by calculating the
        Euclidean distance of the item to each of the means
        Assign the item to that closest mean
    Update each mean by shifting it to the average of the items in its cluster
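
A minimal NumPy implementation of this pseudocode is sketched below (an illustrative
version, not the textbook's reference code):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialise the k means by picking k distinct random points from the data
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # assignment step: give every item to the closest mean (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: shift each mean to the average of the items in its cluster
        for j in range(k):
            if np.any(labels == j):
                means[j] = X[labels == j].mean(axis=0)
    return means, labels

X = np.array([[2, 3], [3, 3], [6, 5], [8, 8], [9, 7], [10, 8], [12, 9], [14, 12]], dtype=float)
centres, labels = kmeans(X, k=2)
print(centres)
print(labels)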
Let’s analyse the visual plots in order to comprehend the aforementioned steps:

(Image source : https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning)

™ Initial centroids randomly chosen to form the clusters, even if not part of the
dataset.
™ Distance computed between data points and centroids for assignment to the
nearest cluster.


™ Median line drawn between centroids to separate the clusters.


™ New centroids calculated as the centre of gravity of the current clusters.
™ Iterative process of reassigning data points to new centroids based on proximity
and updating centroids.
™ Process continues until data points no longer change assignments significantly.
™ Stable clusters obtained and assumed centroids removed.
™ Final result: scatter plot with distinct clusters of data points.

1.2.2 Determining Optimal Number of Clusters


The effectiveness of the K-means clustering algorithm rests on the quality of the
clusters it creates. However, determining the ideal number of clusters is a difficult
task. There are several approaches to determining the optimal number of clusters;
three of them are described below:
1. Elbow method
2. Silhouette method
3. Gap statistic

1. Elbow Method
One of the most widely used techniques for determining the ideal number of clusters
is the Elbow method. The WCSS value idea is used in this technique. The term “total
variations within a cluster” is abbreviated as “WCSS,” which stands for inside Cluster
Sum of Squares. The following formula can be used to determine the value of WCSS (for
3 clusters):
:&66 ™Pi in Cluster1 distance(Pi C1)2™Pi in Cluster2distance(Pi C2)2™Pi in CLuster3 distance(Pi
C3)2
™Pi in Cluster1 distance(Pi C1)2: It is the sum of the square of the distances between each
data point and its centroid within a cluster1 and the same for the other two terms.
To measure the distance between data points and centroid, any method such as
Euclidean distance or Manhattan distance can be used.
The elbow method is a technique used to determine the optimal number of clusters
in K-means clustering. The steps involved are as follows:
Ɣ Perform K-means clustering on the dataset for different K values (typically ranging
from 1 to 10).
Ɣ Calculate the Within-Cluster Sum of Squares (WCSS) value for each K value.
Ɣ Plot a curve showing the relationship between the calculated WCSS values and the
number of clusters K.
Ɣ Identify the point on the plot where the curve forms a sharp bend, resembling an
“elbow.”
Ɣ The K value corresponding to the “elbow point” is considered the best or optimal
number of clusters for the dataset.
The graph obtained using the elbow method shows a steep bend that resembles an
elbow, helping to identify the optimal number of clusters.
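
A hedged scikit-learn sketch of the elbow method follows; the inertia_ attribute of a
fitted KMeans model is exactly the WCSS value described above, and the toy data are
generated here purely for illustration:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)        # within-cluster sum of squares for this K

plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("WCSS")
plt.show()                          # for this toy data the bend (elbow) appears near K = 3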


(Image source : https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning)

2. Silhouette method
The quality of a clustering algorithm can be evaluated using the average silhouette
method. It assesses how well each data point fits within its assigned cluster and a higher
average silhouette width indicates better clustering. The average silhouette of data
points is calculated for various numbers of clusters (k values) and the optimal number of
clusters is determined by selecting the k value that maximises the average silhouette.
The average silhouette width can be computed using the silhouette function from the
cluster package. By applying this method for k values ranging from 2 to 15 clusters, the
optimal number of clusters can be identified. The results show that 2 clusters achieve
the highest average silhouette values, making it the most optimal number of clusters.
Additionally, 4 clusters are considered the second-best choice based on the average
silhouette method.
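
A minimal sketch of the average-silhouette approach with scikit-learn is shown below
(illustrative data; note that the silhouette is only defined for k >= 2):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(40, 2)) for c in [(0, 0), (6, 6)]])

for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# the k with the highest average silhouette width is chosen as the optimal number of clusters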

3. Gap Statistic Method


The gap statistic can be utilised with a variety of clustering techniques, including
K-means clustering and hierarchical clustering, according to research from Stanford
University (R. Tibshirani, G. Walther and T. Hastie, 2001). It contrasts the total intracluster
variation for various values of k (number of clusters) with what would be predicted under
a scenario in which there is no discernible clustering. Monte Carlo simulations of the
sampling procedure are used to build the reference dataset.

1.2.3 Interpreting K-means Clustering Results


Upon conducting K-means clustering on the dataset, it is imperative to analyse
and interpret the outcomes in order to extract valuable insights and comprehend the
underlying patterns present in the data. Below are several essential steps for interpreting
the outcomes of K-means clustering:
Ɣ Cluster centres are the resultant values obtained from the K-means algorithm.
These centres represent the average values of the features within each cluster. The
cluster centres can be analysed to gain insights into the distinctive characteristics
exhibited by each cluster.


Ɣ Cluster Sizes: Analyse the cardinality of each cluster, which refers to the count of
data points assigned to each cluster. Gaining insights into the distribution of data
points within clusters can provide valuable information about the equilibrium and
dispersion of your dataset.
Ɣ Data Point Assignments: Evaluate the cluster assignment for each individual data
point. One can perform data clustering analysis to examine the grouping of data
points and determine the similarity or dissimilarity of data points within specific
clusters.
Ɣ Data Visualisation: Generate visual representations, such as scatter plots or
heatmaps, to visually depict the clusters within the data space. This tool facilitates
the visual evaluation of the degree of separation and cohesion among the clusters.
Ɣ Cluster interpretation involves assigning labels or descriptions to clusters based
on the distinctive characteristics exhibited by the data points within each cluster.
This analysis will facilitate comprehension of the unique patterns and behaviours
exhibited by each cluster.
Ɣ Cluster validation involves assessing the accuracy and reliability of clustering
outcomes through the utilisation of validation metrics such as the silhouette score
or Davies-Bouldin index. These metrics can offer a numerical assessment of the
degree to which the data points are grouped together.
Ɣ Business Insights: Establish a correlation between the clusters and the specific
problem or context within your business domain. Gaining a comprehensive
understanding of the business implications associated with each cluster can result
in the identification of actionable insights and facilitate informed decision-making
processes.

Example:
Consider a dataset comprising customer information sourced from an e-commerce
website. The dataset includes variables such as “Age,” “Total Spending,” and “Frequency
of Purchases.” The aim is to
employ customer segmentation techniques to categorise the customer base into distinct
groups according to their spending patterns.

Step 1: Execute the K-means Clustering algorithm.


The K-means clustering algorithm is used to partition the dataset into a
predetermined number of clusters, in this case, K=3. Upon executing the algorithm,
three distinct clusters are successfully generated.

Step 2: Calculation of Cluster Centres


The K-means algorithm yields the centroid points, which correspond to the mean
values of each feature within each cluster. As an illustration:
Cluster 1 refers to a group or collection of data points that are similar or related to
each other based on certain criteria or characteristics
Age: 35 years old
Total spending: $500.
Purchase Frequency: 10 occurrences
Cluster 2 refers to a specific group or category within a larger set of data or objects
that have been organised based on their similarities

Age: 28 years old


Total spending: $300.
Purchase Frequency: 5 occurrences

Cluster 3 refers to a specific group or category within a larger dataset or


system.
Age: 45 years old
Total spending: $800.
Purchase Frequency: 15 occurrences

Step 3: Determining the sizes of clusters.


Let's evaluate the size of each cluster.
Cluster 1 consists of a total of 200 customers.
Cluster 2 consists of a total of 150 customers.
Cluster 3 consists of a total of 180 customers.

Step 4: Allocation of Data Points


Analyse the allocation of individual customers to clusters. For example, Customer
A is assigned to Cluster 1, Customer B is assigned to Cluster 2 and Customer C is
assigned to Cluster 3.

Step 5: Data Visualisation


A scatter plot is generated to visually represent the clusters within the feature space.
Each data point corresponds to an individual customer and their position on the plot is
determined by their respective values for “Total Spending” and “Frequency of Purchases.”
This visualisation facilitates the assessment of the degree of separation between the
clusters.

Step 6: Analysis and interpretation of cluster patterns


Based on the centroid values and the assignment of data points to each cluster, the
clusters are interpreted as follows:
Cluster 1 consists of customers who are in the middle-aged demographic, exhibit
moderate levels of spending and make frequent purchases.
Cluster 2 consists of customers who are relatively young, exhibit lower levels of
spending and make purchases less frequently.
Cluster 3 consists of a demographic segment consisting of mature customers who
exhibit a propensity for substantial expenditure and engage in frequent purchasing
activities.

Step 7: Evaluation of Cluster Validity


Validation metrics, such as the silhouette score, are employed to assess the
effectiveness of cluster grouping by evaluating the proximity of data points within each
cluster. A higher silhouette score is indicative of clusters that are more well-defined.

Step 8: Data-driven Analysis and Decision-making


Ultimately, a correlation is established between the clusters and our specific
business issue. For example, marketing strategies can be customised based on distinct
customer segments, loyalty programs can be implemented to incentivize the high-
spending cluster and promotional campaigns can be executed to encourage more
frequent purchases from the lower-spending cluster.
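
A hedged end-to-end sketch of this workflow is given below; the column names mirror the
example (“Age”, “Total Spending”, “Frequency of Purchases”), but the table itself is
randomly generated here for illustration:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Age": rng.integers(20, 60, size=300),
    "Total Spending": rng.integers(100, 1000, size=300),
    "Frequency of Purchases": rng.integers(1, 20, size=300),
})

X = StandardScaler().fit_transform(df)         # scale the features before K-means (Step 1)
df["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(df.groupby("Cluster").mean().round(1))   # cluster centres in original units (Step 2)
print(df["Cluster"].value_counts())            # cluster sizes (Step 3)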

1.3 DBSCAN (Density-Based Spatial Clustering of Applications


with Noise)
Clusters can be defined as regions in the data space that exhibit high point densities,
which are separated by regions of relatively lower point densities. The fundamental basis
of the DBSCAN algorithm lies in the widely accepted definitions of “clusters” and “noise.”
The fundamental principle states that a minimum number of points must exist in close
proximity to each point within a cluster, within a specified radius.

DBSCAN: Why?
Convex or spherical groups can be identified through the utilisation of hierarchical
clustering or partitioning techniques, such as K-means or PAM clustering. In essence,
these solutions are suitable only for clusters that are compact and well separated.
Moreover, these methods are strongly affected by the presence of noise and outliers
in the data.
K-Means clustering may group loosely related observations together, as every data
point becomes part of some cluster, even if they are scattered far apart in the vector space.
This sensitivity to individual data points can lead to slight changes significantly affecting the
clustering outcome. However, this issue is largely reduced in DBSCAN due to its cluster
formation approach, making it more robust to outliers and irregularly shaped data.
The DBSCAN algorithm requires two parameters:
1. eps: It defines the neighbourhood around a data point. If the distance between two
points is lower or equal to ‘eps,’ they are considered neighbours. Selecting a small eps
value may classify a large portion of the data as outliers, while choosing a very large
value may merge clusters, resulting in most data points being in the same cluster. The
optimal eps value can be determined based on the k-distance graph (see the sketch after this list).
2. MinPts: It represents the minimum number of neighbours (data points) within the eps
radius for a point to be considered a core point. The appropriate value of MinPts
depends on the dataset’s size. A larger dataset requires a larger MinPts value. As a
rule of thumb, MinPts should be at least D+1, where D is the number of dimensions in
the dataset. The minimum value of MinPts should be set to at least 3.
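
The k-distance graph mentioned under 'eps' can be sketched as follows (illustrative data
and parameter values): each point's distance to its MinPts-th closest point is plotted in
sorted order, and a pronounced knee in the curve is a common heuristic for choosing eps.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(60, 2)) for c in [(0, 0), (5, 5)]])

min_pts = 4
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = nn.kneighbors(X)          # column 0 is each point itself (distance 0)
plt.plot(np.sort(distances[:, -1]))      # sorted distances to the MinPts-th closest point
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to the MinPts-th closest point")
plt.show()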

ImageSource:https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/

Amity Directorate of Distance & Online Education


18 Optimization and Dimension Reduction Techniques

The DBSCAN algorithm follows the steps outlined below:


Notes Ɣ The task at hand involves identifying and locating all neighbouring points that
fall within a specified distance, referred to as “eps.” Additionally, the objective
is to determine whether these neighbouring points qualify as “core points,” which
are defined as points that have a number of neighbours greater than a specified
threshold called “MinPts.” If a point meets the criteria of being a core point, it should
be marked as visited.
Ɣ To ensure that each core point is properly assigned to a cluster, a new cluster will be
created if the core point is not already assigned to one.
Ɣ The process involves recursively identifying and assigning all density-connected
points to the core point within the same cluster. In the context of density-based
clustering algorithms, the relationship between two points, denoted as a and b, is
defined as density-connected if certain conditions are met. Specifically, there must
exist another point, denoted as c, which has a significant number of neighbouring
points and both a and b are located within a distance of eps from c. The described
procedure results in the formation of a sequence of points that are connected based
on their density.
Ɣ Perform an iteration process to traverse the unvisited points that are still present
in the dataset. Data points that do not fall within any specific cluster are commonly
referred to as noise or outliers.

Pseudocode For DBSCAN Clustering Algorithm


DBSCAN(dataset, eps, MinPts){
    C = 0                                   # cluster index
    for each unvisited point p in dataset {
        mark p as visited
        N = neighbouring points of p within eps
        if |N| < MinPts:
            mark p as noise
        else {
            C = C + 1
            add p to cluster C
            for each point p' in N {        # expand the cluster
                if p' is unvisited {
                    mark p' as visited
                    N' = neighbouring points of p' within eps
                    if |N'| >= MinPts:
                        N = N U N'
                }
                if p' is not a member of any cluster:
                    add p' to cluster C
            }
        }
    }
}
The complexity of DBSCAN can vary based on the data and the implementation of
the algorithm.
The runtime complexity of a neighbourhood query algorithm depends on the indexing
system used to store the dataset.
Ɣ Best Case: When an efficient indexing system is utilised, such as one that enables
logarithmic time neighbourhood queries, the average runtime complexity is
O(nlogn).
Ɣ Worst Case: However, in scenarios where there is no index structure or when
dealing with degenerated data (e.g., all points within a distance less than ε), the
worst-case runtime complexity remains O(n²).


Ɣ Average Case: The average runtime complexity can be the same as the best
or worst case, depending on the characteristics of the data and the specific
implementation of the algorithm.

Let’s explain DBSCAN using an example:


Example:
Imagine you have a dataset of 12 data points in 2D space:
Dataset: [(1, 2), (2, 2), (2, 3), (3, 3), (8, 7), (9, 7), (9, 8), (10, 8), (11, 8), (8, 9), (9, 9),
(10, 9)]

Step 1: Choosing Parameters


Choose two parameters: epsilon (ε) and min_samples. Epsilon defines the radius
within which we look for neighbouring points and min_samples sets the minimum
number of data points required to form a dense region (cluster).
For this example, let's choose suitably small values of ε and min_samples for this 12-point dataset.

Step 2: Identifying Core Points


Start with a random data point, e.g., (1, 2).
Calculate the Euclidean distance between this point and all other points.
If the distance is less than or equal to ε, count it as a neighbour.
If the number of neighbours is greater than or equal to min_samples, mark this point
as a core point.
For (1, 2), its neighbours within ε are (2, 2), (2, 3) and (3, 3). Since there are
three neighbours (≥ min_samples), (1, 2) is a core point.
Repeat this process for all data points. Mark other core points.

Step 3: Forming Clusters


Start with any core point, let’s say (1, 2). Create a new cluster and add (1, 2) to it.
For each neighbour of (1, 2), if it’s also a core point, add it to the cluster and
recursively check its neighbours.
In our example, add (2, 2), (2, 3), and (3, 3) to the cluster.
The cluster formed so far: Cluster 1: [(1, 2), (2, 2), (2, 3), (3, 3)]
Continue this process for other core points, creating new clusters if necessary.
Any data point that is not a core point or a neighbour of a core point becomes an
outlier.

Step 4: Identifying Border Points


For each data point that is not a core point, check if it’s a neighbour of any core
point.
If yes, classify it as a border point.
Border points may belong to a cluster but are not counted as core points.
In our example, the remaining data points are (8, 7), (9, 7), (9, 8), (10, 8), (11, 8), (8,
9), (9, 9), and (10, 9).


(8, 7) is a neighbour of (9, 7), so it becomes a border point.


Step 5: Result
You have clusters formed by core points and their neighbours, as well as border
points.
Outliers are the data points that are neither core points nor border points.

Resulting Clusters:
Cluster 1: [(1, 2), (2, 2), (2, 3), (3, 3)]
Cluster 2: [(8, 7), (9, 7), (9, 8), (10, 8), (11, 8), (8, 9), (9, 9), (10, 9)]
Outliers: None in this example.
DBSCAN is effective in identifying clusters of varying shapes and handling noise/
outliers well. The algorithm’s performance depends on the choice of parameters
(ε and min_samples). It can discover dense regions and is less sensitive to the initial
configuration than some other clustering techniques.
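
The same 12-point example can be run with scikit-learn's DBSCAN; eps = 2.5 and
min_samples = 3 are illustrative choices made here that reproduce the two clusters listed
above (scikit-learn counts a point as its own neighbour):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 2), (2, 2), (2, 3), (3, 3),
              (8, 7), (9, 7), (9, 8), (10, 8), (11, 8), (8, 9), (9, 9), (10, 9)])

db = DBSCAN(eps=2.5, min_samples=3).fit(X)
print(db.labels_)                  # 0 for the first four points, 1 for the rest; -1 would mark noise
print(db.core_sample_indices_)     # indices of the core points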

1.3.1 Core Points, Density Reachability and Noise Points


In the DBSCAN algorithm, the data points can be categorised into three types:
1. Core Point: A point is considered a core point if it has more than MinPts points within
the eps distance.
2. Border Point: A point is classified as a border point if it has fewer than MinPts points
within the eps distance but is in the neighbourhood of a core point.
3. Noise or Outlier: A point is labelled as noise or outlier if it is neither a core point nor a
border point.

Source:https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html

Density Reachability
When applying DBSCAN clustering to a set of points in a space, we define the
following terms:

Amity Directorate of Distance & Online Education


Optimization and Dimension Reduction Techniques 21

ε: It represents the radius of a neighbourhood around a point. In other words, all
points within this radius are considered neighbours of that point.
Core Objects: These are the points that have at least MinPts number of objects
within their ε-neighbourhood. In other words, core objects have a sufficient number of
neighbours to form a dense region.

Directly density reachable:


If object q is in object p's ε-neighbourhood and object p is a core object, then object q is
directly density-reachable from object p.

ImageSource: https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/

Here Direct density reachability is not symmetric. Object p is not considered directly
density-reachable from object q because q does not meet the criteria of being a core
object.

Density reachable:
An object q is density-reachable from p w.r.t. ε and MinPts if there is a chain of
objects q1, q2, …, qn, with q1 = p, qn = q, such that qi+1 is directly density-reachable from qi
w.r.t. ε and MinPts for all
1 <= i < n

ImageSource: https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/

Here Density reachability is not symmetric. As q is not a core point, qn-1 is not
directly density-reachable from q and as a result, object p is not density-reachable from
object q.

Density Connectivity:
Object q is considered density-connected to object p with respect to ε and MinPts
if there exists another object o such that both p and q are density-reachable from o with
respect to ε and MinPts.
It is important to note that density connectivity is symmetric. If object q is density-
connected to object p, then object p is also density-connected to object q.
Based on the concepts of reachability and connectivity, clusters and noise points can
be defined as follows :
Cluster: A cluster C with respect to ε and MinPts is a non-empty subset of the entire
set of objects or instances (D) that satisfies the following conditions:
Maximality: For all objects p and q, if p belongs to C and q is density-reachable from
p with respect to ε and MinPts, then q also belongs to C.
Connectivity: For all objects p and q that belong to C, p is density-connected to q and
vice versa, with respect to ε and MinPts.
Noise: Noise points are objects that are not density-reachable from any core
object.
Hence, density connectivity and reachability help define clusters and noise points
in density-based clustering. Clusters are subsets of objects that are maximally connected
and density-reachable within a specified distance threshold (ε) and minimum points
MinPts. Noise points, on the other hand, are objects that do not meet these criteria and
are not part of any cluster.
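
A minimal sketch that labels each point as core, border or noise using the attributes
exposed by scikit-learn's DBSCAN is shown below (the points and parameter values are
illustrative):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(1, 2), (2, 2), (2, 3), (3, 3), (5, 3), (25, 80)])
db = DBSCAN(eps=2.5, min_samples=3).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True

for point, label, is_core in zip(X, db.labels_, core_mask):
    if label == -1:
        kind = "noise"       # not density-reachable from any core object
    elif is_core:
        kind = "core"        # at least MinPts points (itself included) within eps
    else:
        kind = "border"      # inside a cluster but not itself a core point
    print(point, kind)
# prints core for the first four points, border for (5, 3) and noise for (25, 80)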

1.3.2 Strengths and Limitations


Strength:
Ɣ Robust to outliers, making it effective in handling noisy data.
Ɣ No requirement for specifying the number of clusters in advance,
providing flexibility in clustering.
Ɣ Capability to identify clusters with uneven shapes, accommodating complex data
distributions.
Ɣ Ease of parameter tuning for those familiar with the dataset, allowing for customised
and optimal clustering results.

Limitations:
Ɣ Confusion when dealing with border points that may belong to multiple clusters,
leading to ambiguous assignments.
Ɣ Limited ability to handle clusters with significant differences in densities; variable
density clusters pose challenges for the algorithm.
Ɣ High dependency on the distance metric used, impacting the quality and accuracy of
clustering results.
Ɣ Difficulty in guessing the correct parameters, such as epsilon (eps) and MinPts, for
an unknown dataset, making parameter selection a challenging task.

1.4 Hierarchical Clustering


Hierarchical clustering is a method used to group data into a tree-like structure of
clusters. It begins by considering each data point as an individual cluster. The process
iteratively performs the following steps:

Amity Directorate of Distance & Online Education


Optimization and Dimension Reduction Techniques 23

Ɣ Identify the two clusters that are closest to each other.


Ɣ Merge the two most similar clusters. Notes
Ɣ Continue these steps until all the clusters are merged together.
The goal of hierarchical clustering is to produce a dendrogram, which is a tree-like
diagram that visually represents the hierarchical relationships between the clusters. The
dendrogram shows the order of merges (bottom-up view) or splits (top-down view) of the
clusters.
This method is widely used in data mining and provides a hierarchical representation
of the clusters in a dataset. It starts with individual data points as clusters and gradually
combines the closest clusters until a stopping criterion is met. The resulting dendrogram
visually illustrates the hierarchical relationships among the clusters.
Hierarchical clustering offers several advantages over other clustering methods:
™ It can handle non-convex clusters, clusters of varying sizes and densities,
making it suitable for complex data patterns.
™ It is robust against missing data and noisy data, allowing for more flexible data
processing.
™ It reveals the hierarchical structure of the data, providing insights into the
relationships among the clusters.
However, there are also some drawbacks to hierarchical clustering:
™ Determining the optimal number of clusters requires a stopping criterion, which
can be challenging to determine in some cases.
™ The method can be computationally expensive and memory-intensive,
especially for large datasets.
™ Results can be sensitive to the initial conditions, linkage criterion and distance
metric used, which may impact the clustering outcome.
In summary, hierarchical clustering is a versatile data mining method that
can effectively group similar data points into clusters while uncovering the
hierarchical relationships among them. Nevertheless, careful consideration
should be given to the stopping criterion and other parameters to achieve
meaningful results efficiently.

Hierarchical Clustering: Dendrogram


The HC algorithm records each merge step in a tree-like structure called a
dendrogram. In the dendrogram plot, the Y-axis displays the Euclidean distances at
which data points (or clusters) are merged and the X-axis displays every
data point in the given dataset.
Dendrogram working can be explained using the below diagram:


(Image Source: https://www.javatpoint.com/hierarchical-clustering-in-machine-learning)

In the above diagram, the process of agglomerative clustering can be observed on the
left side, while the corresponding dendrogram is displayed on the right side.
Ɣ Initially, data points P2 and P3 are combined to form a cluster and a dendrogram
is constructed, connecting P2 and P3 with a rectangular shape. The height of this
dendrogram is determined by the Euclidean distance between P2 and P3.
Ɣ In the subsequent step, P5 and P6 form a cluster, resulting in another dendrogram.
This new dendrogram is higher than the previous one, as the Euclidean distance
between P5 and P6 is slightly greater than that between P2 and P3.
Ɣ Further iterations create two new dendrograms. One combines P1, P2 and P3, while
the other combines P4, P5 and P6.
Ɣ Finally, all data points are merged into a single dendrogram, representing the entire
dataset.
The dendrogram tree structure can be cut at any level, depending on our specific
clustering requirements.

Example:
Suppose we have a dataset of cities with their respective distances in kilometres:
Cities: A, B, C, D, E, F, G
Distances (in km):
A-B: 10
A-C: 15
A-D: 25
B-E: 35
B-F: 45
C-G: 55

Step 1: Initial Clusters


Start by treating each city as a single-point cluster. So, we have seven initial clusters:
A, B, C, D, E, F, and G.
Initial Clusters: A, B, C, D, E, F, G


Step 2: Distance Calculation


Calculate the pairwise distance between clusters. In this example, we’ll use the
single-linkage method, where the distance between two clusters is defined as the
minimum distance between any two points in the clusters.

Distance Matrix:
     A    B    C    D    E    F    G
A    -    10   15   25   -    -    -
B    10   -    -    -    35   45   -
C    15   -    -    -    -    -    55
D    25   -    -    -    -    -    -
E    -    35   -    -    -    -    -
F    -    45   -    -    -    -    -
G    -    -    55   -    -    -    -

Step 3: Merge Closest Clusters


Find the two clusters that are closest to each other based on the distance matrix. In
this case, clusters A and B are the closest.

Step 4: Update Distance Matrix


Update the distance matrix to include the newly formed cluster (A, B). Calculate the
distances from the new cluster to all other clusters.

Updated Distance Matrix:


     AB   C    D    E    F    G
AB   -    15   25   35   45   -
C    15   -    -    -    -    55
D    25   -    -    -    -    -
E    35   -    -    -    -    -
F    45   -    -    -    -    -
G    -    55   -    -    -    -

Step 5: Repeat
Repeat steps 3 and 4 until all data points are in a single cluster or the desired
number of clusters is reached.

Step 6: Dendrogram Formation


Visualise the hierarchical structure using a dendrogram. A dendrogram is a tree-like
diagram that shows the sequence of cluster merges.

Step 7: Optimal Clusters


To determine the optimal number of clusters, we inspect the dendrogram.
We look at the heights at which clusters merge: the height of each merge corresponds to the dissimilarity between the clusters being joined.

The idea is to cut the tree just before the merge heights show a very large jump in dissimilarity; this jump is often referred to as an "elbow point."
In our dendrogram, the merge heights are 10, 15, 25, 35, 45 and 55. Cutting the tree just below height 35 keeps A, B, C and D together as one sensible cluster, while E, F and G have not yet been absorbed into it. This cut therefore yields four clusters: {A, B, C, D}, {E}, {F} and {G}.
Hierarchical clustering offers advantages like visual interpretability through
dendrograms and flexibility in choosing the number of clusters. However, it can be
computationally expensive for large datasets. The choice of distance metric and linkage
criteria (how distances between clusters are computed) can also affect the results.

1.4.1 Agglomerative vs. Divisive Methods


Difference between agglomerative clustering and divisive clustering:

1. Category
   Agglomerative Clustering: Bottom-up approach.
   Divisive Clustering: Top-down approach.
2. Approach
   Agglomerative Clustering: Each data point is initially assigned to its own cluster and the algorithm iteratively combines the closest pairs of clusters until a final cluster is formed that includes all the data points.
   Divisive Clustering: The data points start as a single cluster and the algorithm iteratively partitions the cluster into smaller sub-clusters until each data point is assigned to its own individual cluster.
3. Complexity level
   Agglomerative Clustering: Exhibits higher computational complexity, particularly for datasets of significant size, because pairwise distances between all data points must be computed, which incurs substantial computational cost.
   Divisive Clustering: A more cost-effective approach, as it only involves computing distances between sub-clusters, thereby reducing the computational load.
4. Outliers
   Agglomerative Clustering: Handles outliers better than divisive clustering because outliers can be absorbed into larger clusters.
   Divisive Clustering: Divisive algorithms can generate sub-clusters that form around outliers, which can result in clustering outcomes that are less than optimal.
5. Interpretability
   Agglomerative Clustering: Produces results that are easier to interpret because the dendrogram visually shows the merging process, allowing the user to choose the number of clusters at the desired level of granularity.
   Divisive Clustering: Harder to interpret from the dendrogram, which illustrates the cluster-splitting process; the user must define a stopping criterion to determine the optimal number of clusters.
6. Implementation
   Agglomerative Clustering: Scikit-learn offers several linkage methods for agglomerative clustering, including "ward", "complete", "average" and "single" (see the sketch after this comparison).
   Divisive Clustering: The Scikit-learn library does not currently have an implementation for divisive clustering.
7. Example
   Agglomerative Clustering: Utilised in applications such as image segmentation, customer segmentation, social network analysis, document clustering, genetics and genomics.
   Divisive Clustering: Typical uses include biological categorisation, anomaly detection, market segmentation, natural language processing, etc.
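
Since the comparison above mentions scikit-learn's linkage options, the following is a brief sketch of how agglomerative clustering is typically invoked with different linkage criteria. The synthetic data and the choice of n_clusters=3 are assumptions made purely for illustration.

# Sketch: scikit-learn agglomerative clustering with different linkage criteria.
# The synthetic data and n_clusters value are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (20, 2)),
               rng.normal(4, 0.4, (20, 2)),
               rng.normal(8, 0.4, (20, 2))])

for linkage in ("ward", "complete", "average", "single"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    print(linkage, "->", np.bincount(labels))  # cluster sizes under each linkage criterion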


1.4.2 Linkage Methods and Dendrogram Visualisation


As part of the hierarchical clustering process, smaller sub-clusters of data points are grouped into bigger clusters in a bottom-up fashion or, conversely, a larger cluster is divided into smaller sub-clusters in a top-down fashion. For both types of hierarchical clustering, the distance between two sub-clusters must be calculated. The different ways of measuring the separation between two sub-clusters of data points are described by different kinds of linkage.
The idea of linkage arises when a cluster contains multiple points and it is necessary to calculate the distance between this cluster and other points or clusters to determine where it belongs. Linkage is therefore a measure of how dissimilar clusters of observations are from one another.
The types of linkage that are typically used are:
™ Single Linkage
™ Complete Linkage
™ Average Linkage
The kind of linkage employed determines the kind of clusters created as well as the
dendrogram’s shape.

1. Single Linkage
Single Linkage is a method for determining the distance between two clusters by
considering the minimum distance between any pair of data points from the two clusters.
In other words, it calculates the pairwise distance between all points in cluster one and all
points in cluster two and then selects the smallest distance as the distance between the
two clusters.
However, this approach tends to produce loose and widely spread clusters, resulting
in high intra-cluster variance. Despite this drawback, Single Linkage is still commonly
used in certain applications.
For two clusters R and S, single linkage yields the shortest distance between any two points i and j such that i belongs to R and j belongs to S.

(ImageSource:https://aitskadapa.ac.in/e-books/AI&ML/MACHINE%20LEARNING/Machine%20
Learning%20(%20etc.)%20(z-lib.org).pdf)
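In symbols, a standard way to write this (stated here because the figure itself cannot be reproduced) is:
d(R, S) = \min_{i \in R,\, j \in S} d(i, j)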

For instance:
If you take an example dataset and plot the dendrogram using single linkage, it often does not give a clear picture of the clusters.

Graph: Dendrogram with Single Linkage

From the plot, it can be observed that the clusters are not well-defined, though
some clusters can still be formed. The orange cluster is noticeably distant from the green
cluster, as indicated by the length of the blue line between them. However, within the
green cluster itself, it is challenging to identify distinct subclusters with significant distance
between them. This will be further examined when using other linkages as well.
As a reminder, the greater the height (on the y-axis), the greater the distance
between clusters. The heights between the points in the green cluster are either very high
or very low, suggesting that they are loosely grouped together. There is a possibility that
they do not belong together at all.

2. Complete Linkage
Complete Linkage is a clustering method where the distance between two clusters
is defined by the maximum distance between any pair of members belonging to the two
clusters. This approach leads to the formation of stable and tightly-knit clusters.
In other words, in Complete Linkage, we calculate the greatest distance between two
points, i and j, such that i belongs to cluster R and j belongs to cluster S, for any two
clusters R and S.

(ImageSource:https://aitskadapa.ac.in/e-books/AI&ML/MACHINE%20LEARNING/Machine%20
Learning%20(%20etc.)%20(z-lib.org).pdf)
Fig: Complete Linkage
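In symbols, the corresponding standard formulation (again stated because the figure cannot be reproduced) is:
d(R, S) = \max_{i \in R,\, j \in S} d(i, j)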

With the same data set as above, the dendrogram obtained would be like this:

Graph: Dendrogram with Complete Linkage

In this scenario, the clusters appear to be more coherent and distinct. The orange
and green clusters are well separated and there is the possibility of creating further
clusters within the green cluster if desired. For instance, by cutting the dendrogram
at a height of 5, two clusters within the green cluster can be formed. Additionally, it is
worth noting that the height between points within a cluster is low, indicating low intra-
cluster variance, while the height between two clusters is high, implying high inter-cluster
variance. This is a desirable outcome for effective clustering.

3. Average Linkage
The Average Linkage method computes the distance between two clusters as the average of all pairwise distances between their members. For every data point i in cluster R and every data point j in cluster S, the distance between the pair is computed; the arithmetic mean of these distances is then returned as the distance between R and S.

(ImageSource:https://aitskadapa.ac.in/e-books/AI&ML/MACHINE%20LEARNING/Machine%20
Learning%20(%20etc.)%20(z-lib.org).pdf)
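In symbols, a standard way to write average linkage (stated here because the figure cannot be reproduced) is:
d(R, S) = \frac{1}{|R|\,|S|} \sum_{i \in R} \sum_{j \in S} d(i, j)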
