CE345 - Lecture #9 - Clustering
Machine Learning
Lecture #9
Unsupervised Learning - Clustering
Unsupervised Learning
● Clustering
○ K-Means Method
○ DBSCAN
○ Mixture of Gaussians
○ Agglomerative Clustering
○ Affinity Propagation…
● And many more methods.
Further readings:
● H. U. Dike, Y. Zhou, K. K. Deveerasetty and Q. Wu, "Unsupervised Learning Based On Artificial Neural Network: A Review," 2018 IEEE International Conference on Cyborg and Bionic Systems (CBS), Shenzhen, China, 2018, pp. 322-327, doi: 10.1109/CBS.2018.8612259.
● Naeem, S., Ali, A., Anam, S., & Ahmed, M. M. (2023). An unsupervised machine learning algorithms: Comprehensive review. International Journal of Computing and Digital Systems.
Unsupervised Learning
● It is a type of machine learning where the algorithm is given data without
labeled outcomes or explicit instructions on what to learn.
● It autonomously identifies patterns, structures, and relationships within the
data.
● The primary objective of unsupervised learning is to discover hidden
structures and groupings in the data, allowing us to better understand and
analyze it.
Unsupervised Learning
● Unsupervised learning is especially valuable in scenarios where labeled data
is scarce or difficult to obtain.
● It is mostly used for:
○ Exploratory Data Analysis: To reveal unknown patterns, relationships, or clusters within data
that can guide further analysis or model development.
○ Data Dimensionality Reduction: To reduce the number of features while preserving
meaningful relationships, improving model efficiency and interpretability.
○ Feature Engineering: To generate new, meaningful features based on patterns found in the
data.
○ Anomaly Detection: To detect outliers or unusual data points, which is useful in fields like
fraud detection and quality control.
○ Customer Segmentation: To group customers or users into segments based on behavior or
characteristics, often used in marketing and recommendation systems.
Unsupervised Learning
The main paradigms of unsupervised learning include clustering, dimensionality reduction, association rule learning, and anomaly detection.
Clustering
● It is an unsupervised learning technique that groups a dataset into distinct
clusters, where each cluster contains data points that are similar to each other
according to a specific similarity measure.
● The goal of clustering is to discover hidden patterns, groupings, or structures
within unlabeled data, making it useful for exploratory data analysis,
segmentation, and summarizing large datasets.
Clustering
● Partition-Based (Centroid-Based) Clustering:
○ Example: k-Means, k-Medoids
○ Best for: Spherical clusters and fast segmentation.
● Hierarchical (Connectivity-Based) Clustering:
○ Example: Agglomerative, Divisive, BIRCH
○ Best for: Small to medium datasets, interpretable hierarchy.
● Fuzzy Clustering:
○ Example: Fuzzy c-Means (also known as fuzzy k-Means)
○ Best for: Overlapping clusters where points can belong to multiple clusters.
● Search-Based (Heuristic-Based) Clustering:
○ Example: J-Means, Global k-Means
○ Best for: Complex objective functions and cases where global optimization is needed.
● Graph-Based Clustering:
○ Example: Chameleon, CACTUS
○ Best for: Non-linearly separable clusters, network and social data, and data with connectivity-based structures.
● Grid-Based Clustering:
○ Example: STING, CLIQUE
○ Best for: Spatial or high-dimensional data.
● Density-Based Clustering:
○ Example: DBSCAN, OPTICS
○ Best for: Non-spherical clusters and data with noise or outliers.
● Model-Based Clustering:
○ Example: Gaussian Mixture Models
○ Best for: Elliptical clusters and soft clustering, overlapping clusters.
● Affinity Propagation:
○ Best for: Clustering without predefined k and data with complex structures based on similarity.
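Many of these families are available in scikit-learn. As a sketch, one possible mapping is shown below (the pairing of families to classes and the parameter values are my own illustrative choices, not part of the lecture):

    from sklearn.cluster import (KMeans, AgglomerativeClustering, DBSCAN,
                                 OPTICS, Birch, AffinityPropagation)
    from sklearn.mixture import GaussianMixture

    models = {
        "partition-based": KMeans(n_clusters=3),
        "hierarchical (agglomerative)": AgglomerativeClustering(n_clusters=3),
        "hierarchical (BIRCH)": Birch(n_clusters=3),
        "density-based": DBSCAN(eps=0.5, min_samples=5),
        "density-based (ordering)": OPTICS(min_samples=5),
        "model-based": GaussianMixture(n_components=3),
        "affinity propagation": AffinityPropagation(),
    }
    # Each model exposes fit / fit_predict; GaussianMixture also gives
    # soft cluster probabilities via predict_proba.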
K-Means Method
● K-means is an iterative clustering algorithm.
● It partitions data into k clusters by minimizing the intra-cluster variance.
● It begins by initializing k centroids and assigning each data point to the
nearest centroid. The centroids are iteratively updated until convergence.
● It has two steps in each iteration (a minimal sketch is given below):
○ Cluster assignment step: Assign each sample to the closest cluster centroid.
○ Move centroids step: Recompute cluster centroids using assigned samples.
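To make the two steps concrete, here is a minimal NumPy sketch of the loop described above (the function name, the random initialization from the data, the convergence check, and the empty-cluster handling are illustrative assumptions, not the lecture's own code):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize centroids by picking k samples at random (an assumption).
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Cluster assignment step: assign each sample to the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Move centroids step: recompute each centroid as the mean of its samples.
            # (If a cluster is empty, keep its old centroid here; the slides instead
            # suggest dropping it and continuing with K-1 clusters.)
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):  # converged
                break
            centroids = new_centroids
        return labels, centroids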
K-Means Method
The necessary inputs to the k-Means method are the number of clusters K and the unlabeled training set of samples.
K-Means Method
● If an iteration of the algorithm leaves one of the clusters empty (no sample is assigned to it), you can eliminate that cluster and continue with K-1 clusters.
● If you are sure that there should be K clusters, then you need to randomly re-initialize the centroids and run K-means again.
K-Means Method: Minimization of the Cost Function
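The objective that k-Means minimizes is commonly written as the distortion (the within-cluster sum of squares). The notation below is assumed here, since the slide's own formula is not reproduced: x^{(i)} are the samples, c^{(i)} is the index of the cluster assigned to sample i, and \mu_k is the k-th centroid.

    J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2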
One can see that the cost is minimized in both steps: the cluster assignment step minimizes the cost with respect to the cluster assignments (keeping the centroids fixed), and the move centroids step minimizes it with respect to the centroids (keeping the assignments fixed).
K-Means Method: Random Centroid Initialization
● Centroids are typically initialized by picking K of the training samples at random.
● Different random initializations can converge to different (locally optimal) clusterings, so k-Means is often run several times with different initializations and the solution with the lowest cost is kept.
K-Means Method: The Choice of k value
● For non-well-separated clusters, what is the right value of K?
K-Means Method: The Choice of k value
● One solution is to use the Elbow Method.
K-Means Method: The Choice of k value
● The Elbow Method evaluates the within-cluster sum of squares (WCSS) for different K values. WCSS measures the total within-cluster variation, i.e., the sum of squared distances between each sample and its cluster centroid. As K increases, WCSS decreases, since more clusters lead to a better fit.
● The procedure is as follows (a short code sketch is given after the list):
1. Plot WCSS against different K values.
2. Look for an "elbow" point where the decrease in WCSS begins to slow down.
3. This point suggests an appropriate K, as adding more clusters beyond this value yields diminishing returns in terms of variance reduction.
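A minimal sketch of this procedure with scikit-learn and matplotlib (the function name, the candidate range K = 1..10, and the KMeans settings are illustrative assumptions):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    def plot_elbow(X, k_values=range(1, 11)):
        # inertia_ is scikit-learn's name for the WCSS of a fitted clustering.
        wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in k_values]
        plt.plot(list(k_values), wcss, marker="o")
        plt.xlabel("K (number of clusters)")
        plt.ylabel("WCSS")
        plt.show()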
K-Means Method: The Choice of k value
Other methods for selecting a good K value include:
● Silhouette Score: It measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation). The score ranges from -1 to 1, where a higher value indicates better clustering (see the snippet after this list).
● Gap Statistic: It compares the WCSS of the actual clustering to the expected WCSS if the data were uniformly distributed.
● Cross-Validation for Stability: It assesses the stability of the clusters over multiple runs with different initializations.
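As a sketch, the Silhouette Score can be computed with scikit-learn for several candidate K values (the function name and the candidate range are assumptions; note that the score requires at least 2 clusters):

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def best_k_by_silhouette(X, k_values=range(2, 11)):
        scores = {}
        for k in k_values:
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher is better
        return max(scores, key=scores.get), scores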
K-Means Method: The Choice of k value
● Beyond these K-selection methods, in practice the K value is often chosen manually, considering the purpose of the clustering.
● If you can find a metric for how well the clustering serves that purpose (production cost, customer satisfaction, etc.), then you can use it to choose a better K value.
K-Means Method: Example #1
● Let’s say we have a small dataset of 5 students, each represented by two
features, Study Hours (x-axis) and Exam Grade (y-axis).
● The goal is to use k-Means clustering to group these students based on their study habits and grades, choosing k=2 clusters to identify two groups (e.g., high achievers and low achievers).
● The dataset gives each student's Study Hours and Exam Grade values.
K-Means Method: Example #1
● Identify high achievers & low achievers by choosing k=2 clusters in K-Means.
● Let’s choose Student A (2, 81) and Student D (8, 95) as the initial centroids for
simplicity.
● Calculate the Euclidean distance from each point to each centroid (a worked example is shown below), and assign each point to the cluster with the nearest centroid.
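For example, with the two initial centroids chosen above, the distance between Student A (2, 81) and Student D (8, 95) is

    d(A, D) = \sqrt{(8-2)^2 + (95-81)^2} = \sqrt{36 + 196} = \sqrt{232} \approx 15.23

so each remaining student is assigned to whichever of these two centroids it is closer to.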
K-Means Method: Example #1
● Identify high achievers & low achievers by choosing k=2 clusters in K-Means.
● Now, calculate the new centroids by averaging the coordinates of the points in
each cluster.
● Since Cluster 1 contains only Student A, its centroid remains the same, (2, 81).
● The centroid of Cluster 2 is updated to (7, 94.25) by averaging the coordinates of its four members.
● After this step, repeat the assignment step with the updated centroids.
K-Means Method: Example #1
● Identify high achievers & low achievers by choosing k=2 clusters in K-Means.
● Since the clusters have not changed, the algorithm converges.
K-Medoids
● It is a clustering method similar to k-Means, but it is more robust to outliers.
● In k-Medoids, each cluster is represented by an actual data point (called a
medoid) rather than the mean of the points in the cluster (as in k-Means).
● The medoid is the point in the cluster that minimizes the total distance to all
other points in that cluster, making it less sensitive to extreme values or
outliers.
K-Medoids
k-Medoids attempts to minimize the sum of dissimilarities between points and their assigned cluster medoid. A step-by-step outline of the k-Medoids algorithm (a code sketch follows the list):
1. Choose the Number of Clusters (k): Decide on the number of clusters, k, just like in k-Means.
2. Initialize Medoids: Select k data points randomly from the dataset to serve as the initial medoids.
3. Assign Points to the Nearest Medoid:
○ For each data point, calculate its distance to each medoid.
○ Assign each data point to the nearest medoid to form k initial clusters.
4. Update Medoids:
○ For each cluster, calculate the total distance between each data point in the cluster and every other point in
that cluster.
○ Select the point that minimizes the total distance as the new medoid for the cluster.
5. Reassign Points:
○ Reassign all data points to their nearest medoid, updating the clusters as necessary.
○ Repeat the process of updating medoids and re-assigning points until the medoids no longer change or until a
maximum number of iterations is reached.
6. Convergence: The algorithm converges when the medoids stabilize, meaning they no longer change with further
iterations, or when a predefined number of iterations is reached.
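A minimal NumPy sketch of the alternating procedure above (a simplified k-Medoids, not the full PAM algorithm; the function name, the Euclidean metric, and the convergence check are illustrative assumptions):

    import numpy as np

    def kmedoids(X, k, n_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        # Precompute pairwise distances; any metric could be plugged in here.
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        medoids = rng.choice(n, size=k, replace=False)
        for _ in range(n_iters):
            # Assign each point to its nearest medoid.
            labels = D[:, medoids].argmin(axis=1)
            new_medoids = medoids.copy()
            for j in range(k):
                # New medoid: the cluster member with the smallest total
                # distance to the other members of its cluster.
                members = np.where(labels == j)[0]
                if len(members) > 0:
                    costs = D[np.ix_(members, members)].sum(axis=1)
                    new_medoids[j] = members[costs.argmin()]
            if np.array_equal(np.sort(new_medoids), np.sort(medoids)):  # medoids stabilized
                break
            medoids = new_medoids
        return labels, X[medoids]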
K-Medoids
Advantages:
● Robust to Outliers: Because k-Medoids uses actual data points as cluster centers, it is less
sensitive to outliers than k-Means.
● Flexibility in Distance Metrics: Unlike k-Means, which relies mostly on Euclidean distance,
k-Medoids can use any distance metric (e.g., Manhattan distance), making it adaptable to different
data types.
Limitations:
● Computationally Intensive: k-Medoids is generally slower than k-Means, especially for large
datasets, because it requires computing the pairwise distances between all points in each cluster to
find the medoid.
● Sensitive to Initial Medoid Selection: Like k-Means, the initial choice of medoids can affect the
final clustering outcome, though it’s more stable than k-Means due to the use of medoids.
K-Medoids
As a quick comparison with k-Means: by minimizing the total distance between points and their assigned medoid (rather than the squared distance to a cluster mean), k-Medoids achieves more robust clustering, making it suitable for datasets with outliers or when the mean is not a good representative of a cluster.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) Method
● DBSCAN is a density-based clustering method: it groups points that lie in dense regions into clusters and labels points in sparse, low-density regions as noise (outliers).
DBSCAN relies on two key parameters:
1. Epsilon (ε): The maximum distance between two points for one to be
considered in the neighborhood of the other.
2. Minimum Points (Nmin): The minimum number of points required to form a
dense region (i.e., a cluster).
DBSCAN classifies each data point as one of three types (a small classification sketch follows the list):
● Core Point: A point that has at least Nmin neighbors within a distance of ε.
● Border Point: A point that has fewer than Nmin neighbors within ε, but is within
the ε distance of a core point.
● Noise (Outlier): A point that is neither a core point nor a border point.
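A small helper illustrating these three definitions (the function name is an assumption, and the point itself is not counted among its own neighbors, which is one common convention):

    import numpy as np

    def classify_points(X, eps, n_min):
        # Pairwise Euclidean distances between all points.
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        # Number of neighbors within eps, excluding the point itself.
        neighbor_counts = (D <= eps).sum(axis=1) - 1
        core = neighbor_counts >= n_min
        # Border: not core, but within eps of at least one core point.
        border = ~core & ((D <= eps) & core[None, :]).any(axis=1)
        return np.where(core, "core", np.where(border, "border", "noise"))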
DBSCAN Method
1. Choose an Unvisited Point: Pick a random point in the dataset that has not
yet been visited.
2. Identify Core Points and Form Clusters:
a. If the point has at least Nmin neighbors within a distance of ε, it is marked as a core point and
forms the initial point of a cluster.
b. All points within the ε neighborhood of this core point are added to the cluster. If any of these
points are also core points, their neighbors are also added to the cluster (expanding the
cluster iteratively).
c. This process continues until no more points can be added to the cluster.
3. Mark Border Points and Noise:
a. Any points that are within ε of a core point but do not themselves have Nmin neighbors are
marked as border points.
b. Points that are not within ε of any core point and do not meet the Nmin requirement are labeled
as noise (outliers).
4. Repeat: Select another unvisited point and repeat the process until all points have been visited (a usage sketch with scikit-learn follows).
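As a usage sketch with scikit-learn (the toy data and the parameter values are illustrative; eps and min_samples correspond to ε and Nmin):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Toy data: two dense blobs plus a few scattered points.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (50, 2)),
                   rng.normal(5, 0.3, (50, 2)),
                   rng.uniform(-2, 7, (5, 2))])

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
    # Points labeled -1 are noise (outliers); the other labels are cluster ids.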
DBSCAN Method: Choosing Parameters
● Selecting ε:
○ The k-Distance Graph method is commonly used to select ε.
○ For each point, calculate the distance to its k-th nearest neighbor (often k = Nmin),
and plot these distances in ascending order.
○ The "elbow" point in this plot indicates a suitable value for ε, as distances typically
increase sharply beyond this value.
● Selecting Nmin:
○ A general rule of thumb is to set Nmin to at least the dimensionality of the data plus one (e.g., Nmin = 3 for 2D data).
○ Higher values of Nmin result in stricter density requirements, which can reduce the
number of noise points and create tighter clusters.
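A sketch of the k-Distance Graph using scikit-learn's NearestNeighbors (the function name and plotting details are assumptions; k would typically be set to Nmin):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    def k_distance_plot(X, k):
        # Distance from each point to its k-th nearest neighbor (excluding itself).
        nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
        distances, _ = nbrs.kneighbors(X)
        plt.plot(np.sort(distances[:, -1]))
        plt.xlabel("Points sorted by k-th neighbor distance")
        plt.ylabel("k-th nearest neighbor distance (candidate ε)")
        plt.show()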
DBSCAN Method
Advantages:
● No Need to Specify k: The number of clusters does not have to be chosen in advance; it emerges from the density structure of the data.
● Arbitrary Cluster Shapes: Since clusters are defined by density rather than distance to a centroid, DBSCAN can find non-spherical clusters.
● Robust to Noise: Points in low-density regions are explicitly labeled as outliers rather than being forced into a cluster.
Limitations:
● Parameter Sensitivity: DBSCAN’s results depend heavily on ε and Nmin. Finding optimal values for
these parameters can be challenging, especially for datasets with varying densities.
● Difficulty with Varying Densities: DBSCAN may struggle when clusters have widely varying
densities, as the fixed ε radius may not capture all cluster structures accurately.
● Scalability: DBSCAN can be computationally intensive for very large datasets, as it requires
pairwise distance calculations for each point.
DBSCAN Method: Demo
Summary
● Clustering Methods
● K-means algorithm
● K-medoids
● DBSCAN
End of Lecture #9
References
● Lecture Slides by Ethem Alpaydın, Introduction to Machine Learning, 3rd ed. (MIT Press, 2014).
● Lecture Slides by Yalın Baştanlar, Introduction to Machine Learning course,
IZTECH CS Department, 2012.
● Machine Learning Flashcards, Chris Albon,
https://machinelearningflashcards.com/
● Gan, G., Ma, C., & Wu, J. (2007). Data clustering: theory, algorithms, and
applications. Society for Industrial and Applied Mathematics.
● Giordani, P., Ferraro, M. B., Martella, F., Giordani, P., Ferraro, M. B., &
Martella, F. (2020). Introduction to clustering with R. Springer Singapore.
● Clustering Like a Pro: A Beginner’s Guide to DBSCAN, Medium,
https://medium.com/@sachinsoni600517/clustering-like-a-pro-a-beginners-guide-to-dbscan-6c8274c362c4