
CE 345 - Introduction to Machine Learning
Lecture #9
Unsupervised Learning - Clustering

Dr. M. Çağkan Uludağlı


Unsupervised Learning

2
Unsupervised Learning
● Clustering
○ K-Means Method
○ DBSCAN
○ Mixture of Gaussians
○ Agglomerative Clustering
○ Affinity Propagation…
● And many more methods.

Further readings:
● H. U. Dike, Y. Zhou, K. K. Deveerasetty and Q. Wu, "Unsupervised Learning Based On Artificial Neural Network: A Review," 2018 IEEE International Conference on Cyborg and Bionic Systems (CBS), Shenzhen, China, 2018, pp. 322-327, doi: 10.1109/CBS.2018.8612259.
● Naeem, S., Ali, A., Anam, S., & Ahmed, M. M. (2023). An unsupervised machine learning algorithms: Comprehensive review. International Journal of Computing and Digital Systems.
● Celebi, M. E., & Aydin, K. (Eds.). (2016). Unsupervised learning algorithms (Vol. 9, p. 103). Cham: Springer.

3
Unsupervised Learning
● It is a type of machine learning where the algorithm is given data without
labeled outcomes or explicit instructions on what to learn.
● It autonomously identifies patterns, structures, and relationships within the
data.
● The primary objective of unsupervised learning is to discover hidden
structures and groupings in the data, allowing us to better understand and
analyze it.

4
Unsupervised Learning
● Unsupervised learning is especially valuable in scenarios where labeled data
is scarce or difficult to obtain.
● It is mostly used for:
○ Exploratory Data Analysis: To reveal unknown patterns, relationships, or clusters within data
that can guide further analysis or model development.
○ Data Dimensionality Reduction: To reduce the number of features while preserving
meaningful relationships, improving model efficiency and interpretability.
○ Feature Engineering: To generate new, meaningful features based on patterns found in the
data.
○ Anomaly Detection: To detect outliers or unusual data points, which is useful in fields like
fraud detection and quality control.
○ Customer Segmentation: To group customers or users into segments based on behavior or
characteristics, often used in marketing and recommendation systems.

5
Unsupervised Learning
The main paradigms of unsupervised learning include:

● Clustering – Grouping similar data points.


● Dimensionality Reduction – Reducing feature space complexity.
● Anomaly Detection – Identifying outliers or abnormal data.
● Association Rule Learning – Discovering relationships among items.
● Density Estimation – Estimating the probability distribution of data.
● Representation Learning – Learning meaningful data representations.
● Generative Modeling – Generating new data samples.

6
Clustering

7
Clustering
● It is an unsupervised learning technique that groups a dataset into distinct
clusters, where each cluster contains data points that are similar to each other
according to a specific similarity measure.
● The goal of clustering is to discover hidden patterns, groupings, or structures
within unlabeled data, making it useful for exploratory data analysis,
segmentation, and summarizing large datasets.

8
Clustering
● Partition-Based (Centroid-Based) Clustering:
○ Example: k-Means, k-Medoids
○ Best for: Spherical clusters and fast segmentation.
● Hierarchical (Connectivity-Based) Clustering:
○ Example: Agglomerative, Divisive, BIRCH
○ Best for: Small to medium datasets, interpretable hierarchy.
● Fuzzy Clustering:
○ Example: Fuzzy c-Means (also called fuzzy k-Means)
○ Best for: Overlapping clusters where points can belong to multiple clusters.
● Search-Based (Heuristic-Based) Clustering
○ Example: J-Means, Global k-Means
○ Best for: Complex objective functions and cases where global optimization is needed.
● Graph-Based Clustering
○ Example: Chameleon, CACTUS
○ Best for: Non-linearly separable clusters, network and social data, and data with connectivity-based structures.
● Grid-Based Clustering:
○ Example: STING, CLIQUE
○ Best for: Spatial or high-dimensional data.
● Density-Based Clustering:
○ Example: DBSCAN, OPTICS
○ Best for: Non-spherical clusters and data with noise or outliers.
● Model-Based Clustering:
○ Example: Gaussian Mixture Models
○ Best for: Elliptical clusters and soft clustering, overlapping clusters.
● Affinity Propagation
○ Best for: Clustering without predefined k and data with complex structures based on similarity.
9
Clustering

(Slides 10-11: figures only; not reproduced in this text version.)
K-Means Method

12
K-Means Method

13
K-Means Method
● K-means is an iterative clustering algorithm.
● It partitions data into k clusters by minimizing the intra-cluster variance.
● It begins by initializing k centroids and assigning each data point to the
nearest centroid. The centroids are iteratively updated until convergence.
● It has two steps in each iteration:
○ Cluster assignment step: Assign each sample to the closest cluster centroid.
○ Move centroids step: Recompute cluster centroids using assigned samples.

14
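To make the two-step loop above concrete, here is a minimal from-scratch sketch in NumPy. It is not the course's own code; the function and variable names (kmeans, X, max_iters) are illustrative.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]       # pick k distinct samples
    for _ in range(max_iters):
        # Cluster assignment step: assign each sample to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroids step: recompute each centroid as the mean of its assigned samples.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):                  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids
```

Calling kmeans(X, k=3) on an (m, n) data matrix returns the cluster label of each sample and the final centroids.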
K-Means Method

(Slides 15-22: figures only; not reproduced in this text version.)
K-Means Method
The necessary inputs for the K-means method are:

● k: the number of clusters.
● The training set {x^(1), x^(2), …, x^(m)}, where m is the number of samples.
● n: the number of features.

For example, x_4^(2) denotes the 4th feature of the 2nd sample.

Note that we do not use k = 1 for the K-means method.

23
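As a small illustrative aside (not from the slides), this notation maps naturally onto a NumPy array of shape (m, n); the array contents below are hypothetical.

```python
import numpy as np

# Hypothetical data matrix: m = 5 samples (rows), n = 4 features (columns).
X = np.arange(20.0).reshape(5, 4)

x_2 = X[1]            # the 2nd sample x^(2) (0-based row index 1)
x_4_of_2 = X[1, 3]    # x_4^(2): the 4th feature of the 2nd sample
```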
K-Means Method

24
K-Means Method
● If an iteration of the algorithm produces an empty cluster, i.e. no sample is
assigned to one of the clusters, then you can eliminate that cluster and continue
with K-1 clusters.
● If you are sure that there are K clusters, then you should instead re-initialize the
centroids randomly and run K-means again.

25
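A hedged sketch of how the first option could look inside the move-centroids step of the earlier kmeans() sketch; the helper name update_centroids is illustrative.

```python
import numpy as np

def update_centroids(X, labels, k):
    """Move-centroids step that simply drops empty clusters (option 1 above)."""
    centroids = [X[labels == j].mean(axis=0)   # mean of the samples assigned to cluster j
                 for j in range(k)
                 if np.any(labels == j)]       # skip clusters that received no samples
    return np.array(centroids)                 # may have fewer than k rows
```

If K is known to be correct (option 2 above), one would instead discard this run and restart K-means from a fresh random initialization.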
K-Means Method: Minimization of the Cost Function

26
K-Means Method: Minimization of the Cost Function
One can see that the cost J is minimized in the:

● Cluster assignment step, by changing c^(i) (the index of the cluster assigned to sample x^(i)).
● Move centroids step, by changing μ_k (the centroid of cluster k).

27
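For reference, the cost (distortion) function minimized by K-means, which the figure slide above presumably showed, is the standard one:

```latex
J\left(c^{(1)},\dots,c^{(m)},\,\mu_{1},\dots,\mu_{K}\right)
  = \frac{1}{m}\sum_{i=1}^{m}\left\lVert x^{(i)} - \mu_{c^{(i)}}\right\rVert^{2}
```

Here c^(i) is the index of the cluster currently assigned to sample x^(i) and μ_{c^(i)} is the centroid of that cluster; the assignment step lowers J over the c^(i) with the centroids fixed, and the move-centroids step lowers it over the μ_k with the assignments fixed.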
K-Means Method: Random Centroid Initialization

(Slides 28-30: figures illustrating random centroid initialization; not reproduced in this text version.)
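These slides presumably illustrate that different random initializations can converge to different local optima of the cost J. A common remedy, sketched below with scikit-learn's KMeans (parameter values are illustrative), is to run K-means several times and keep the solution with the lowest cost:

```python
from sklearn.cluster import KMeans

def best_of_n_runs(X, k, n_runs=50):
    best = None
    for seed in range(n_runs):
        # One K-means run from a single random initialization.
        km = KMeans(n_clusters=k, init="random", n_init=1, random_state=seed).fit(X)
        if best is None or km.inertia_ < best.inertia_:   # inertia_ = within-cluster sum of squares
            best = km
    return best   # best.labels_, best.cluster_centers_, best.inertia_
```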
K-Means Method: The Choice of k value
● For clusters that are not well separated, what is the right value of K?

31
K-Means Method: The Choice of k value
● One solution is to use the Elbow Method:

32
K-Means Method: The Choice of k value
● The Elbow Method evaluates the within-cluster sum of squares (WCSS) for
different K values. WCSS represents the total variance within each cluster. As
K increases, WCSS decreases, since more clusters lead to a better fit.
● The procedure for it is as follows:
1. Plot WCSS against different K values.
2. Look for an "elbow" point where the decrease in WCSS begins to slow
down.
3. This point suggests an appropriate K, as adding more clusters beyond
this value yields diminishing returns in terms of variance reduction.

33
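A minimal sketch of this procedure with scikit-learn, whose inertia_ attribute is the WCSS described above; the synthetic dataset and the range of K values are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)   # toy data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()   # pick K at the 'elbow', where the decrease in WCSS slows down
```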
K-Means Method: The Choice of k value
Other methods for selecting a good K value include:

● Silhouette Score: It measures how similar a data point is to its own cluster
(cohesion) compared to other clusters (separation). The score ranges from -1
to 1, where a higher value indicates better clustering.
● Gap Statistic: It compares the WCSS of the actual data clustering to the
expected WCSS if the data were uniformly distributed.
● Cross-Validation for Stability: It can be used to assess the stability of
clusters over multiple runs with different initializations.

34
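As an illustration of the first of these, a short scikit-learn sketch of silhouette-based K selection; note that the silhouette score is only defined for K >= 2, and the toy dataset is illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)   # toy data

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)   # K with the highest average silhouette
print(scores)
print("Best K by silhouette score:", best_k)
```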
K-Means Method: The Choice of k value
● Beyond these K-selection methods, the K value is usually chosen manually,
based on the purpose of the clustering.
● If you can find a metric that measures how well the clustering serves that purpose
(production cost, customer satisfaction, etc.), you can use it to choose a better K value.

35
K-Means Method: Example #1
● Let’s say we have a small dataset of 5 students, each represented by two
features, Study Hours (x-axis) and Exam Grade (y-axis).
● Your goal is to use k-Means clustering to group these students based on their
study habits and grades, choosing k=2 clusters to identify two groups (e.g.,
high achievers and low achievers).
● The dataset is given as a table of (Study Hours, Exam Grade) values (table not reproduced in this text version).
36
K-Means Method: Example #1
● Identify high achievers & low achievers by choosing k=2 clusters in K-Means.
● Let’s choose Student A (2, 81) and Student D (8, 95) as the initial centroids for
simplicity.
● Calculate the Euclidean distance from each point to each centroid. Assign
each point to the cluster with the nearest centroid.

37
K-Means Method: Example #1
● Identify high achievers & low achievers by choosing k=2 clusters in K-Means.
● Now, calculate the new centroids by averaging the coordinates of the points in
each cluster.
● Since Cluster 1 contains only Student A, its centroid remains the same, (2, 81).
● Cluster 2's centroid is updated to (7, 94.25), the mean of its members' coordinates
(calculation shown on the slide; not reproduced here).

● After this step, repeat the assignment step with the updated centroids.

38
K-Means Method: Example #1
● Identify high achievers & low achievers by choosing k=2 clusters in K-Means.
● Since the clusters have not changed, the algorithm converges.

39
K-Medoids

40
K-Medoids
● It is a clustering method similar to k-Means, but it is more robust to outliers.
● In k-Medoids, each cluster is represented by an actual data point (called a
medoid) rather than the mean of the points in the cluster (as in k-Means).
● The medoid is the point in the cluster that minimizes the total distance to all
other points in that cluster, making it less sensitive to extreme values or
outliers.

41
K-Medoids
k-Medoids attempts to minimize the sum of dissimilarities between points and their assigned
cluster medoid. A step-by-step outline of the k-Medoids algorithm:
1. Choose the Number of Clusters (k): Decide on the number of clusters, k, just like in k-Means.
2. Initialize Medoids: Select k data points randomly from the dataset to serve as the initial medoids.
3. Assign Points to the Nearest Medoid:
○ For each data point, calculate its distance to each medoid.
○ Assign each data point to the nearest medoid to form k initial clusters.
4. Update Medoids:
○ For each cluster, calculate the total distance between each data point in the cluster and every other point in
that cluster.
○ Select the point that minimizes the total distance as the new medoid for the cluster.
5. Reassign Points:
○ Reassign all data points to their nearest medoid, updating the clusters as necessary.
○ Repeat the process of updating medoids and re-assigning points until the medoids no longer change or until a
maximum number of iterations is reached.
6. Convergence: The algorithm converges when the medoids stabilize, meaning they no longer change with further
iterations, or when a predefined number of iterations is reached.
42
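A simplified from-scratch sketch of the loop above (a basic alternating scheme rather than the full PAM swap search); function and variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_medoids(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)       # step 2: random initial medoids
    for _ in range(max_iters):
        labels = cdist(X, X[medoids]).argmin(axis=1)          # step 3: assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):                                    # step 4: update each medoid
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue                                      # keep the old medoid for empty clusters
            total_dist = cdist(X[members], X[members]).sum(axis=1)
            new_medoids[j] = members[total_dist.argmin()]     # point with minimal total distance
        if set(new_medoids) == set(medoids):                  # step 6: medoids stable, converged
            break
        medoids = new_medoids                                 # step 5: reassign on the next pass
    labels = cdist(X, X[medoids]).argmin(axis=1)              # final assignment
    return labels, medoids
```

Because step 4 recomputes all pairwise distances within each cluster, each iteration is noticeably more expensive than a K-means iteration, which is the computational cost mentioned on the next slide.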
K-Medoids
Advantages:

● Robust to Outliers: Because k-Medoids uses actual data points as cluster centers, it is less
sensitive to outliers than k-Means.
● Flexibility in Distance Metrics: Unlike k-Means, which relies mostly on Euclidean distance,
k-Medoids can use any distance metric (e.g., Manhattan distance), making it adaptable to different
data types.

Limitations:

● Computationally Intensive: k-Medoids is generally slower than k-Means, especially for large
datasets, because it requires computing the pairwise distances between all points in each cluster to
find the medoid.
● Sensitive to Initial Medoid Selection: Like k-Means, the initial choice of medoids can affect the
final clustering outcome, though it’s more stable than k-Means due to the use of medoids.

43
K-Medoids
As a quick comparison with k-Means:

● k-Means: uses centroids (averages of points) as cluster centers; sensitive to outliers; faster.
● k-Medoids: uses medoids (actual data points) as cluster centers; more robust to outliers;
slower due to the additional distance computations.

By minimizing the total distance between points and their assigned medoid,
k-Medoids achieves robust clustering, making it suitable for datasets with outliers
or when the mean is not a good representation of a cluster.

44
Density-Based Spatial Clustering of
Applications with Noise (DBSCAN) Method

45
DBSCAN Method

46
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) Method

● It is a density-based clustering algorithm that forms clusters based on the


density of data points in a region.
● Unlike K-Means, which assumes clusters are spherical and requires the
number of clusters (K) to be specified in advance, DBSCAN automatically
determines the number of clusters and can find clusters of arbitrary shapes.
● It is also effective at identifying outliers as points that do not belong to any
cluster.

47
DBSCAN Method
DBSCAN relies on two key parameters:
1. Epsilon (ε): The maximum distance between two points for one to be
considered in the neighborhood of the other.
2. Minimum Points (Nmin): The minimum number of points required to form a
dense region (i.e., a cluster).
DBSCAN classifies each data point as one of three types:
● Core Point: A point that has at least Nmin neighbors within a distance of ε.
● Border Point: A point that has fewer than Nmin neighbors within ε, but is within
the ε distance of a core point.
● Noise (Outlier): A point that is neither a core point nor a border point.
48
DBSCAN Method
1. Choose an Unvisited Point: Pick a random point in the dataset that has not
yet been visited.
2. Identify Core Points and Form Clusters:
a. If the point has at least Nmin neighbors within a distance of ε, it is marked as a core point and
forms the initial point of a cluster.
b. All points within the ε neighborhood of this core point are added to the cluster. If any of these
points are also core points, their neighbors are also added to the cluster (expanding the
cluster iteratively).
c. This process continues until no more points can be added to the cluster.
3. Mark Border Points and Noise:
a. Any points that are within ε of a core point but do not themselves have Nmin neighbors are
marked as border points.
b. Points that are not within ε of any core point and do not meet the Nmin requirement are labeled
as noise (outliers).
4. Repeat: Select another unvisited point and repeat the process until all points
have been visited.
49
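A minimal sketch of running this procedure with scikit-learn's DBSCAN (the same estimator used in the demo linked at the end of this section). The parameters eps and min_samples play the roles of ε and Nmin; the dataset and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.07, random_state=0)   # two crescent-shaped clusters

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                        # cluster index per point; -1 marks noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")
```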
DBSCAN Method: Choosing Parameters

● Selecting ε:
○ The k-Distance Graph method is commonly used to select ε.
○ For each point, calculate the distance to its k-th nearest neighbor (often k = Nmin),
and plot these distances in ascending order.
○ The "elbow" point in this plot indicates a suitable value for ε, as distances typically
increase sharply beyond this value.
● Selecting Nmin:
○ A general rule of thumb is to set Nmin to at least the dimensionality of the data
plus one (e.g., Nmin = 3 for 2D data).
○ Higher values of Nmin result in stricter density requirements, which can reduce the
number of noise points and create tighter clusters.

50
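A hedged sketch of the k-Distance Graph heuristic for choosing ε, using scikit-learn's NearestNeighbors; the dataset and the value of k are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.07, random_state=0)   # toy data

k = 5                                          # e.g. k = Nmin
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)                # column 0 is each point itself (distance 0)
k_dist = np.sort(distances[:, -1])             # distance to the k-th nearest neighbour, ascending

plt.plot(k_dist)
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}-th nearest neighbour")
plt.title("k-Distance Graph")
plt.show()                                     # read eps off the 'elbow' of this curve
```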
DBSCAN Method
Advantages:

● No Need to Predefine the Number of Clusters: Unlike k-Means, DBSCAN automatically


determines the number of clusters based on density.
● Detects Arbitrarily Shaped Clusters: DBSCAN is not restricted to spherical clusters and can detect
clusters of varying shapes and sizes.
● Identifies Outliers: Points that do not fit within any dense region are labeled as noise, making
DBSCAN effective at outlier detection.

Limitations:

● Parameter Sensitivity: DBSCAN’s results depend heavily on ε and Nmin. Finding optimal values for
these parameters can be challenging, especially for datasets with varying densities.
● Difficulty with Varying Densities: DBSCAN may struggle when clusters have widely varying
densities, as the fixed ε radius may not capture all cluster structures accurately.
● Scalability: DBSCAN can be computationally intensive for very large datasets, as it requires
pairwise distance calculations for each point.

51
DBSCAN Method: Demo

The demo application with Python can be found here: https://scikit-learn.org/1.5/auto_examples/cluster/plot_dbscan.html

52


K-Means vs DBSCAN

53
Summary
● Clustering Methods
● K-means algorithm
● K-medoids
● DBSCAN

54
End of Lecture #9

55
References
● Lecture Slides by Ethem Alpaydın, Introduction to Machine Learning, 3rd ed.
(MIT Press, 2014).
● Lecture Slides by Yalın Baştanlar, Introduction to Machine Learning course,
IZTECH CS Department, 2012.
● Machine Learning Flashcards, Chris Albon,
https://machinelearningflashcards.com/
● Gan, G., Ma, C., & Wu, J. (2007). Data clustering: theory, algorithms, and
applications. Society for Industrial and Applied Mathematics.
● Giordani, P., Ferraro, M. B., & Martella, F. (2020). Introduction to clustering
with R. Springer Singapore.
● Clustering Like a Pro: A Beginner’s Guide to DBSCAN, Medium,
https://medium.com/@sachinsoni600517/clustering-like-a-pro-a-beginners-guide-to-dbscan-6c8274c362c4

56
