Optimisation and Dimension Reduction Techniques
Learning Objectives
At the end of this module, you will be able to:
• Discuss the concept of clustering and its purpose in unsupervised learning
• Discuss the K-means clustering algorithm and its step-by-step procedure
• Learn methods to determine the optimal number of clusters in K-means clustering
• Interpret and analyse the results obtained from K-means clustering
• Discuss the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) method
• Infer the concepts of core points, density reachability and noise points in DBSCAN
Introduction
Unsupervised machine learning, also known as unsupervised learning, utilises
machine learning algorithms to analyse and cluster datasets without any labelled
information. These algorithms autonomously identify hidden patterns or data clusters
without human intervention, making them valuable for tasks like exploratory data
analysis, cross-selling strategies, consumer segmentation and image recognition. These techniques excel at identifying similarities and differences within data.
Unsupervised learning is considered one of the most crucial machine learning frameworks because it allows the data to be “observed” in a systematic, comprehensive, objective and frequently creative way, uncovering the subtleties of the underlying process that produced the data, the grammar in the data and insights that we were not even aware were present in the data in the first place.
The goal of supervised learning is to discover a mapping from an input to an output
whose proper values are supplied by a supervisor. In unsupervised learning, input data is
the only available information because there is no such supervisor.
Five unsupervised learning paradigms are explained below:
1. Projections: Projections refer to the process of reducing high-dimensional data to
lower-dimensional representations for data visualisation and exploring whether the
data can be represented on lower-dimensional “manifolds” or if it remains inherently
high-dimensional. This analysis will cover different projection algorithms, including:
(a) Principal Components Analysis (PCA), which minimises variance loss.
(b) Self-Organizing Maps, which distribute data on a predefined grid.
(c) Multidimensional Scaling (MDS), which preserves pairwise distances between
data points after projection.
2. Clustering: Clustering is the procedure of categorising entities (customers, movies, stars, gene sequences, LinkedIn profiles, etc.) into groups or hierarchies based on their similarities. This process resembles a compression mechanism, similar to how our brain categorises and maps observations. Clustering proves to be beneficial for organising large datasets into meaningful clusters, facilitating interpretation and actions such as segment-based marketing. It also allows us to filter out noise or unimportant factors, such as accents in speech recognition. Various types of clustering methods include:
Optimization Process:
1. Problem Approach: Gain a conceptual understanding of the problem by framing it as
an optimisation problem.
2. Formulation: Express the mathematical objective function precisely by utilising data-
driven intuition.
3. Objective Function Modification: The objective function is to be modified in order to simplify it or make it more solvable.
4. Optimisation: Employ conventional optimisation methods to solve the altered objective
function.
Clustering:
• Discovering Variations: Identifying unknown variations in the data.
• Grouping Similar Data: Grouping similar data points based on defined criteria.
• Unsupervised Learning: Involves exploring and discovering patterns in unlabelled data.
Applications of Clustering
Applications of Clustering in Various Fields:
• Marketing: Characterising and discovering customer segments for targeted marketing strategies.
• Biology: Classifying different species of plants and animals based on genetic patterns.
• Libraries: Clustering books based on topics and information for better organisation.
• Insurance: Identifying customer policies and detecting fraudulent activities.
• City Planning: Grouping houses based on geographical locations and studying property values.
• Earthquake Studies: Identifying earthquake-affected areas to determine high-risk zones.
(ImageSource: http://www.sthda.com/english/wiki/the-ultimate-guide-to-partitioning-clustering)
In Step 1, each letter is treated as a single cluster and the distance of each cluster from all other clusters is calculated.
In Step 2, comparable clusters (B) and (C) are merged together to form a single
cluster, as well as clusters (D) and (E). After this step, the clusters are [(A), (BC), (DE),
(F)].
In Step 3, the proximity is recalculated and clusters (DE) and (F) are merged
together to form new clusters, resulting in [(A), (BC), (DEF)].
In Step 4, the process is repeated and clusters (DEF) and (BC) are found to be
comparable and are merged together to form a new cluster, leading to [(A), (BCDEF)].
In Step 5, the two remaining clusters, (A) and (BCDEF), are merged together to form
a single cluster, resulting in [(ABCDEF)].
Drawbacks
• Scale: Agglomerative clustering has a quadratic time complexity in the number of data points, making it computationally expensive for large datasets.
• Number of Clusters: While hierarchical agglomerative clustering allows flexibility in obtaining the desired number of clusters, determining the optimal number of clusters can still be challenging and subjective.
An iterative technique is employed to select the optimal positions for the K centre points, or centroids.
Each individual data point is assigned to the closest of the K centroids, and the points assigned to a given centroid form a cluster.
As a result, each cluster is distinct from the others and contains data points with some commonality.
Fig: The working of the K-means clustering algorithm
Step 2: Allocation
For every individual data point within your dataset, compute its distance, usually
using the Euclidean distance formula, to each of the ‘k’ centroids.
Allocate the data point to the cluster that has the closest centroid, determined by the
minimum distance.
Step 5: Iterate
If there is a significant change in the centroids, iterate through steps 2 and 3 until
convergence is achieved.
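To make the allocation and update steps concrete, here is a minimal NumPy sketch of the K-means loop; the convergence tolerance and the random initialisation scheme are implementation choices for illustration, not prescribed by the text.

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-4, seed=0):
    """Minimal K-means: random initial centroids, then alternate the
    allocation step (nearest centroid) and the update step (recompute means)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Allocation: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to lose all its points)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.linalg.norm(new_centroids - centroids) < tol:  # no significant change
            return labels, new_centroids
        centroids = new_centroids
    return labels, centroids
```

In practice, library implementations such as scikit-learn's KMeans add refinements like k-means++ initialisation and multiple restarts.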
Example:
Here is a concise illustration of K-means clustering:
Consider a given dataset consisting of 8 data points in a two-dimensional space: [(2,
3), (3, 3), (6, 5), (8, 8), (9, 7), (10, 8), (12, 9), (14, 12)]. Our objective is to generate two
distinct clusters by employing the k-means clustering algorithm with k=2.
Step 5: Iterate
Iterate through steps 2 and 3 until reaching convergence.
Clustering Formation refers to the process of grouping or organising data points into
distinct clusters based on their similarities or proximity to each other.
After multiple iterations, the centroids will converge to stable positions.
Cluster 1: [(2, 3), (3, 3)], Centroid: (2.5, 3)
Cluster 2: [(6, 5), (8, 8), (9, 7), (10, 8), (12, 9), (14, 12)], Centroid: (9.83, 8.17)
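The same toy example can be checked with scikit-learn, as sketched below; because K-means converges to a local optimum that depends on the initial centroids, the exact split of boundary points such as (6, 5) may differ slightly from the hand-worked result above.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([(2, 3), (3, 3), (6, 5), (8, 8), (9, 7), (10, 8), (12, 9), (14, 12)])

# n_init=10 restarts K-means from several random initialisations and keeps the best run
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print("Labels:   ", km.labels_)           # cluster index of each of the 8 points
print("Centroids:", km.cluster_centers_)  # final centroid coordinates
print("WCSS:     ", km.inertia_)          # within-cluster sum of squares
```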
Evaluation:
The commonly used evaluation metrics for K-means clustering are the Within-Cluster Sum of Squares (WCSS) and the Silhouette Score. These metrics quantify how compact and how well separated the resulting clusters are. Recall that the initial centroids are chosen randomly to form the clusters, even if they are not part of the dataset, and that the distance between each data point and the centroids is computed to assign the point to the nearest cluster.
1. Elbow Method
One of the most widely used techniques for determining the ideal number of clusters
is the Elbow method. This technique is based on the WCSS value, where WCSS stands for Within-Cluster Sum of Squares and measures the total variation within the clusters. For 3 clusters, the WCSS value can be determined using the following formula:

WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²

Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squares of the distances between each data point in Cluster1 and its centroid C1, and the same holds for the other two terms.
To measure the distance between data points and centroid, any method such as
Euclidean distance or Manhattan distance can be used.
The elbow method is a technique used to determine the optimal number of clusters
in K-means clustering. The steps involved are as follows:
• Perform K-means clustering on the dataset for different K values (typically ranging from 1 to 10).
• Calculate the Within-Cluster Sum of Squares (WCSS) value for each K value.
• Plot a curve showing the relationship between the calculated WCSS values and the number of clusters K.
• Identify the point on the plot where the curve forms a sharp bend, resembling an “elbow.”
• The K value corresponding to the “elbow point” is considered the best or optimal number of clusters for the dataset.
The graph obtained using the elbow method shows a steep bend that resembles an
elbow, helping to identify the optimal number of clusters.
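A sketch of these steps with scikit-learn and matplotlib is shown below; inertia_ is scikit-learn's name for the WCSS of a fitted model, and the range of K from 1 to 10 follows the text.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=10):
    """Fit K-means for K = 1..k_max and plot WCSS (inertia) against K."""
    wcss = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss.append(km.inertia_)  # within-cluster sum of squares for this K
    plt.plot(range(1, k_max + 1), wcss, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("WCSS")
    plt.title("Elbow method")
    plt.show()
    return wcss
```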
2. Silhouette method
The quality of a clustering algorithm can be evaluated using the average silhouette
method. It assesses how well each data point fits within its assigned cluster and a higher
average silhouette width indicates better clustering. The average silhouette of data
points is calculated for various numbers of clusters (k values) and the optimal number of
clusters is determined by selecting the k value that maximises the average silhouette.
The average silhouette width can be computed using the silhouette function from the cluster package. By applying this method for k values ranging from 2 to 15 clusters, the optimal number of clusters can be identified. The results show that 2 clusters achieve the highest average silhouette values, making it the optimal number of clusters.
Additionally, 4 clusters are considered the second-best choice based on the average
silhouette method.
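The silhouette function referred to above belongs to the R cluster package; a rough Python equivalent using scikit-learn's silhouette_score is sketched below, where the range 2 to 15 and the use of K-means follow the text and everything else is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_max=15):
    """Compute the average silhouette width for k = 2..k_max
    and return the k with the highest score."""
    scores = {}
    for k in range(2, k_max + 1):  # the silhouette is undefined for a single cluster
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```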
• Cluster Sizes: Analyse the cardinality of each cluster, which refers to the count of data points assigned to each cluster. Gaining insights into the distribution of data points within clusters can provide valuable information about the equilibrium and dispersion of your dataset.
• Data Point Assignments: Evaluate the cluster assignment for each individual data point. One can examine the grouping of data points and determine the similarity or dissimilarity of data points within specific clusters.
• Data Visualisation: Generate visual representations, such as scatter plots or heatmaps, to visually depict the clusters within the data space. This facilitates the visual evaluation of the degree of separation and cohesion among the clusters.
• Cluster Interpretation: Assign labels or descriptions to clusters based on the distinctive characteristics exhibited by the data points within each cluster. This analysis will facilitate comprehension of the unique patterns and behaviours exhibited by each cluster.
• Cluster Validation: Assess the accuracy and reliability of clustering outcomes through validation metrics such as the silhouette score or Davies-Bouldin index. These metrics offer a numerical assessment of the degree to which the data points are grouped together.
• Business Insights: Establish a correlation between the clusters and the specific problem or context within your business domain. Gaining a comprehensive understanding of the business implications associated with each cluster can result in the identification of actionable insights and facilitate informed decision-making processes.
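The checks above can be scripted. A minimal sketch is shown below, assuming a feature matrix X whose first two columns are plottable and an array of non-negative cluster labels, for example from a fitted K-means model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score, davies_bouldin_score

def inspect_clusters(X, labels):
    """Report cluster sizes, validation metrics and a 2-D scatter plot."""
    # Cluster sizes (cardinality of each cluster)
    values, counts = np.unique(labels, return_counts=True)
    print("Cluster sizes:", dict(zip(values.tolist(), counts.tolist())))

    # Validation metrics: higher silhouette and lower Davies-Bouldin are better
    print("Silhouette score:    ", silhouette_score(X, labels))
    print("Davies-Bouldin index:", davies_bouldin_score(X, labels))

    # Visualisation (assumes the first two columns are meaningful features)
    plt.scatter(X[:, 0], X[:, 1], c=labels)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.title("Cluster assignments")
    plt.show()
```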
Example:
An instance will be examined wherein a dataset comprising customer information
sourced from an electronic commerce website is available. The dataset comprises
variables such as “Age,” “Total Spending,” and “Frequency of Purchases.” The aim is to
employ customer segmentation techniques to categorise the customer base into distinct
groups according to their spending patterns.
The resulting segments can then be related to the specific business issue. For example, marketing strategies can be customised based on distinct customer segments, loyalty programs can be implemented to incentivise the high-spending cluster and promotional campaigns can be executed to encourage more frequent purchases from the lower-spending cluster.
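A minimal sketch of such a segmentation is given below; the customer values and the choice of three segments are invented for illustration, not taken from a real dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer data with the three variables mentioned in the text
customers = pd.DataFrame({
    "Age":                    [22, 25, 34, 41, 45, 52, 23, 38],
    "Total Spending":         [200, 350, 1200, 900, 3000, 2800, 150, 1100],
    "Frequency of Purchases": [2, 3, 10, 7, 20, 18, 1, 9],
})

# Standardise so that spending (large values) does not dominate the distances
X = StandardScaler().fit_transform(customers)

# Assume three segments, e.g. low-, mid- and high-spending customers
customers["Segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(customers.groupby("Segment").mean())  # average profile of each segment
```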
DBSCAN: Why?
Convex or spherical clusters can be identified through hierarchical clustering or partitioning techniques, such as K-means or PAM clustering. In essence, these methods are suitable only for compact, well-separated clusters. Moreover, they are strongly affected by the presence of noise and outliers in the data.
K-Means clustering may group loosely related observations together, as every data
point becomes part of some cluster, even if they are scattered far apart in the vector space.
This sensitivity to individual data points can lead to slight changes significantly affecting the
clustering outcome. However, this issue is largely reduced in DBSCAN due to its cluster
formation approach, making it more robust to outliers and irregularly shaped data.
The DBSCAN algorithm requires two parameters:
1. eps: It defines the neighbourhood around a data point. If the distance between two
points is lower or equal to ‘eps,’ they are considered neighbours. Selecting a small eps
value may classify a large portion of the data as outliers, while choosing a very large
value may merge clusters, resulting in most data points being in the same cluster. The
optimal eps value can be determined based on the k-distance graph.
2. MinPts: It represents the minimum number of neighbours (data points) within the eps
radius for a point to be considered a core point. The appropriate value of MinPts
depends on the dataset’s size. A larger dataset requires a larger MinPts value. As a
rule of thumb, MinPts should be at least D+1, where D is the number of dimensions in
the dataset. The minimum value of MinPts should be set to at least 3.
ImageSource:https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/
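As the text notes, eps can be read off a k-distance graph. Below is a sketch with scikit-learn; the synthetic data, the choice k = 3 and the placeholder eps = 0.5 are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Illustrative 2-D data: two dense blobs plus a few scattered points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(4, 0.3, size=(50, 2)),
               rng.uniform(-2, 6, size=(5, 2))])

k = 3  # MinPts; rule of thumb from the text: at least D + 1 = 3 for 2-D data

# k-distance graph: sorted distance to each point's k-th nearest neighbour.
# The "knee" of this curve is a common choice for eps.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbour
dists, _ = nn.kneighbors(X)
plt.plot(np.sort(dists[:, -1]))
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}-th nearest neighbour")
plt.show()

# DBSCAN itself: MinPts is called min_samples in scikit-learn; eps=0.5 is a placeholder
labels = DBSCAN(eps=0.5, min_samples=k).fit_predict(X)
print("Clusters found:", set(labels) - {-1}, "| noise points:", int(np.sum(labels == -1)))
```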
• Average Case: The average runtime complexity can be the same as the best or worst case, depending on the characteristics of the data and the specific implementation of the algorithm.
Resulting Clusters:
Cluster 1: [(1, 2), (2, 2), (2, 3), (3, 3)]
Cluster 2: [(8, 7), (9, 7), (9, 8), (10, 8), (11, 8), (8, 9), (9, 9), (10, 9)]
Outliers: None in this example.
DBSCAN is effective in identifying clusters of varying shapes and handling noise/
outliers well. The algorithm’s performance depends on the choice of parameters
(ε and min_samples). It can discover dense regions and is less sensitive to the initial
configuration than some other clustering techniques.
Source:https://www.kdnuggets.com/2020/04/dbscan-clustering-algorithm-machine-learning.html
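The clusters listed above can be reproduced with scikit-learn's DBSCAN; eps = 1.5 and min_samples = 3 are assumed values that happen to recover the two groups, since the parameters used in the original example are not shown here.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    (1, 2), (2, 2), (2, 3), (3, 3),               # left-hand group
    (8, 7), (9, 7), (9, 8), (10, 8), (11, 8),     # right-hand group
    (8, 9), (9, 9), (10, 9),
])

labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)
print(labels)  # two cluster labels (0 and 1); -1 would indicate noise, none here
```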
Density Reachability
When applying DBSCAN clustering to a set of points in a space, we define the
following terms:
ε: It represents the radius of a neighbourhood around a point. In other words, all points within this radius are considered neighbours of that point.
Core Objects: These are the points that have at least MinPts number of objects
within their ε-neighbourhood. In other words, core objects have a sufficient number of
neighbours to form a dense region.
ImageSource: https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/
Here, direct density reachability is not symmetric. Object p is not considered directly
density-reachable from object q because q does not meet the criteria of being a core
object.
Density reachable:
An object q is density-reachable from p w.r.t. ε and MinPts if there is a chain of objects q1, q2, …, qn, with q1 = p and qn = q, such that qi+1 is directly density-reachable from qi w.r.t. ε and MinPts for all
1 <= i <= n-1
ImageSource: https://www.geeksforgeeks.org/dbscan-clustering-in-ml-density-based-clustering/
Here, density reachability is not symmetric. As q is not a core point, qn-1 is not
directly density-reachable from q and as a result, object p is not density-reachable from
object q.
Density Connectivity:
Object q is considered density-connected to object p with respect to ε and MinPts if there exists another object o such that both p and q are density-reachable from o with respect to ε and MinPts.
It is important to note that density connectivity is symmetric. If object q is density-
connected to object p, then object p is also density-connected to object q.
Based on the concepts of reachability and connectivity, clusters and noise points can
be defined as follows:
Cluster: A cluster C with respect to ε and MinPts is a non-empty subset of the entire set of objects or instances (D) that satisfies the following conditions:
Maximality: For all objects p and q, if p belongs to C and q is density-reachable from p with respect to ε and MinPts, then q also belongs to C.
Connectivity: For all objects p and q that belong to C, p is density-connected to q and vice versa with respect to ε and MinPts.
Noise: Noise points are objects that are not density-reachable from any core object and therefore do not belong to any cluster.
Hence, density connectivity and reachability help define clusters and noise points in density-based clustering. Clusters are subsets of objects that are maximally connected and density-reachable within a specified distance threshold (ε) and minimum number of points (MinPts). Noise points, on the other hand, are objects that do not meet these criteria and are not part of any cluster.
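In scikit-learn's implementation, these roles can be read directly off a fitted model: core points are listed in core_sample_indices_, noise points receive the label -1 and the remaining points are border points. A sketch, reusing the example data and the assumed parameters from earlier:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same example points and assumed parameters as in the earlier DBSCAN sketch
X = np.array([(1, 2), (2, 2), (2, 3), (3, 3),
              (8, 7), (9, 7), (9, 8), (10, 8), (11, 8), (8, 9), (9, 9), (10, 9)])

db = DBSCAN(eps=1.5, min_samples=3).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # indices of core points

noise_mask = db.labels_ == -1               # noise points get the label -1
border_mask = ~core_mask & ~noise_mask      # in a cluster but not core

print("Core points:  ", X[core_mask])
print("Border points:", X[border_mask])
print("Noise points: ", X[noise_mask])
```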
Limitations:
• Confusion when dealing with border points that may belong to multiple clusters, leading to ambiguous assignments.
• Limited ability to handle clusters with significant differences in densities; variable density clusters pose challenges for the algorithm.
• High dependency on the distance metric used, impacting the quality and accuracy of clustering results.
• Difficulty in guessing the correct parameters, such as epsilon (eps) and MinPts, for an unknown dataset, making parameter selection a challenging task.
In the diagram above, the process of agglomerative clustering can be observed on the left side, while the corresponding dendrogram is displayed on the right side.
• Initially, data points P2 and P3 are combined to form a cluster and a dendrogram is constructed, connecting P2 and P3 with a rectangular shape. The height of this dendrogram is determined by the Euclidean distance between P2 and P3.
• In the subsequent step, P5 and P6 form a cluster, resulting in another dendrogram. This new dendrogram is higher than the previous one, as the Euclidean distance between P5 and P6 is slightly greater than that between P2 and P3.
• Further iterations create two new dendrograms. One combines P1, P2 and P3, while the other combines P4, P5 and P6.
• Finally, all data points are merged into a single dendrogram, representing the entire dataset.
The dendrogram tree structure can be cut at any level, depending on our specific
clustering requirements.
Example:
Suppose we have a dataset of cities with their respective distances in kilometres:
Cities: A, B, C, D, E, F, G
Distances (in km):
A-B: 10
A-C: 15
A-D: 25
B-E: 35
B-F: 45
C-G: 55
Distance Matrix:
A B C D E F G
A - 10 15 25 - - -
B 10 - - - 35 45 -
C 15 - - - - - 55
D 25 - - - - - -
E - 35 - - - - -
F - 45 - - - - -
G - - 55 - - - -
Step 5: Repeat
Repeat steps 3 and 4 until all data points are in a single cluster or the desired
number of clusters is reached.
The idea is to identify a height where clusters merge but don’t have a very large jump in dissimilarity. This jump is known as an “elbow point.”
In our dendrogram, the “elbow point” is around height 35. Below this height, A, B, C and D have already merged into a sensible cluster. Thus, we choose to have three clusters: {A, B, C, D}, {E, F}, and {G}.
Hierarchical clustering offers advantages like visual interpretability through
dendrograms and flexibility in choosing the number of clusters. However, it can be
computationally expensive for large datasets. The choice of distance metric and linkage
criteria (how distances between clusters are computed) can also affect the results.
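A typical hierarchical-clustering workflow with SciPy is sketched below; the six two-dimensional points (standing in for P1 to P6) and the cut height of 3.0 are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Six illustrative points, loosely playing the role of P1..P6 above
points = np.array([(1, 1), (1.5, 1), (1.2, 1.8),
                   (8, 8), (8.5, 8.2), (9, 7.5)])

# Build the merge tree (here with Euclidean distance and average linkage)
Z = linkage(points, method="average", metric="euclidean")

dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5", "P6"])
plt.ylabel("Merge distance")
plt.show()

# Cut the tree at a chosen height to obtain flat cluster labels
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)  # e.g. two clusters: P1-P3 in one, P4-P6 in the other
```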
1. Single Linkage
Single Linkage is a method for determining the distance between two clusters by
considering the minimum distance between any pair of data points from the two clusters.
In other words, it calculates the pairwise distance between all points in cluster one and all
points in cluster two and then selects the smallest distance as the distance between the
two clusters.
However, this approach tends to produce loose and widely spread clusters, resulting
in high intra-cluster variance. Despite this drawback, Single Linkage is still commonly
used in certain applications.
For two clusters R and S, the single linkage yields the shortest distance between two points i and j such that i belongs to R and j belongs to S.
(ImageSource:https://aitskadapa.ac.in/e-books/AI&ML/MACHINE%20LEARNING/Machine%20
Learning%20(%20etc.)%20(z-lib.org).pdf)
For Instance:
If you take an example dataset and plot the single-linkage dendrogram, most of the time it does not give a clear picture of the clusters.
From the plot, it can be observed that the clusters are not well-defined, though
some clusters can still be formed. The orange cluster is noticeably distant from the green
cluster, as indicated by the length of the blue line between them. However, within the
green cluster itself, it is challenging to identify distinct subclusters with significant distance
between them. This will be further examined when using other linkages as well.
As a reminder, the greater the height (on the y-axis), the greater the distance
between clusters. The heights between the points in the green cluster are either very high
or very low, suggesting that they are loosely grouped together. There is a possibility that
they do not belong together at all.
2. Complete Linkage
Complete Linkage is a clustering method where the distance between two clusters
is defined by the maximum distance between any pair of members belonging to the two
clusters. This approach leads to the formation of stable and tightly-knit clusters.
In other words, in Complete Linkage, we calculate the greatest distance between two
points, i and j, such that i belongs to cluster R and j belongs to cluster S, for any two
clusters R and S.
(ImageSource:https://aitskadapa.ac.in/e-books/AI&ML/MACHINE%20LEARNING/Machine%20
Learning%20(%20etc.)%20(z-lib.org).pdf)
Fig: Complete Linkage
With the same data set as above, the dendrogram obtained would be like this:
In this scenario, the clusters appear to be more coherent and distinct. The orange
and green clusters are well separated and there is the possibility of creating further
clusters within the green cluster if desired. For instance, by cutting the dendrogram
at a height of 5, two clusters within the green cluster can be formed. Additionally, it is
worth noting that the height between points within a cluster is low, indicating low intra-
cluster variance, while the height between two clusters is high, implying high inter-cluster
variance. This is a desirable outcome for effective clustering.
3. Average linkage
The Average linkage method computes the distance between two clusters by
taking the average of all the distances between the individual members of the clusters.
The process entails computing the distance between every data point i in cluster R and every data point j in cluster S, and then taking the arithmetic mean of all these pairwise distances. The Average Linkage algorithm returns this mean as the distance between the two clusters.
(ImageSource:https://aitskadapa.ac.in/e-books/AI&ML/MACHINE%20LEARNING/Machine%20
Learning%20(%20etc.)%20(z-lib.org).pdf)
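For comparison, the three linkage criteria discussed above can be tried on the same data by changing a single argument in SciPy's linkage function; the two synthetic blobs below are an assumed toy dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two assumed blobs of points: one tight group and one more spread out
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 1.5, size=(20, 2))])

for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                    # merge tree under the chosen linkage
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
    sizes = np.bincount(labels)[1:]                  # fcluster labels start at 1
    print(f"{method:>8} linkage -> cluster sizes: {sizes.tolist()}")
```

On well-separated data all three linkages tend to agree; the differences show up mainly when clusters are elongated, of unequal density or contaminated with outliers, where single linkage is prone to chaining and complete linkage favours compact groups.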