
Unsupervised Learning: Exploration of Well-Known Clustering Techniques in Python
ChatGPT-4¹, Ninaad Das²

¹LLM AI Research & Computational Learning


²BSc Filmmaking, Direction

Abstract

This addendum continues to build upon the previously documented Neural Learning Engine (NLE) experiment by incorporating unsupervised learning methodologies. While supervised learning relies on labelled datasets, unsupervised learning uncovers patterns and structures within unstructured, unlabelled data. By employing autoencoders along with various clustering techniques, including Self-Organizing Maps (SOMs), K-Means, and DBSCAN, a comparative analysis is presented to examine how different methods organize and interpret data. The experiment demonstrates real-time clustering, illustrating the effectiveness of neural networks in discovering hidden patterns. The findings highlight the advantages and trade-offs of different clustering techniques in an unsupervised learning context. Additionally, we introduce real-time cluster evolution visualization and apply the methods to real-world datasets such as customer segmentation and image clustering.

1. Introduction

Unsupervised learning is a field that focuses on extracting meaningful patterns from data without predefined labels. In contrast to supervised learning, where models learn from input-output pairs, unsupervised learning relies on the intrinsic structure and relationships within the dataset. This experiment extends prior research on the Neural Learning Engine (NLE) by investigating how different clustering techniques autonomously classify data using an autoencoder-driven neural network.

This study introduces a user-interactive framework that allows dynamic switching between:

• Self-Organizing Maps (SOMs): A biologically inspired method where neurons self-adjust to input patterns.

• K-Means Clustering: A classical approach based on distance measurement and centroid updates.

• DBSCAN Clustering: A density-based method capable of identifying outliers and clusters of arbitrary shapes.

These methods provide different perspectives on how unsupervised learning enables data classification without human supervision, establishing the fundamental principles that distinguish it from supervised learning.

2. Objective of the Experiment

The objective of this experiment is to observe and compare the performance of various clustering techniques when applied to unlabelled data within an autoencoder-driven neural learning engine. Specifically, this study aims to:

• Demonstrate real-time clustering of unlabelled data.

• Compare the classification patterns produced by SOMs, K-Means, and DBSCAN.

• Analyse how autoencoders contribute to efficient feature extraction for clustering.

• Showcase how neural networks learn underlying structures without explicit guidance.

By achieving these goals, this research confirms that unsupervised learning can be effectively used to categorize data, revealing hidden relationships and structures without predefined labels.

3. Difference Between Supervised and Unsupervised Learning

Supervised Learning (Previous Experiment)

The previous experiment utilized labelled data where each input had a corresponding output. A neural network was trained to classify predefined categories by minimizing classification errors. The decision boundary evolved iteratively based on labelled feedback.

Unsupervised Learning (This Experiment)

No predefined labels are provided; the network autonomously identifies patterns and structures. Clustering algorithms group data points based on inherent similarities without prior knowledge.

4. Experimental Implementation

The experimental framework is built upon an autoencoder neural network that compresses input data into a latent representation, which is then analysed using different clustering techniques. The primary steps in the setup include:

4.1 Data Generation

• The dataset consists of 500 randomly distributed data points generated using the make_blobs() function.

• No predefined labels are assigned, allowing the clustering techniques to infer structure naturally.

4.2 Autoencoder-Based Feature Extraction

• An autoencoder compresses the 2D dataset into a latent space.

• The encoder extracts key features, while the decoder reconstructs the data, minimizing information loss.

4.3 Clustering Techniques for Unsupervised Learning

• Self-Organizing Maps (SOMs): Groups similar data points based on neuron activation.

• K-Means Clustering: Iteratively updates centroids to form clusters.

• DBSCAN: Identifies high-density clusters while isolating outliers.

4.4 Interactive User Interface for Clustering Comparisons

• A radio button interface enables users to switch between different clustering techniques in real time.

• The model dynamically updates the visualization, demonstrating how each method classifies the dataset.
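For concreteness, a minimal sketch of the setup described in Sections 4.1 and 4.2 is shown below, assuming scikit-learn for make_blobs() and TensorFlow/Keras for the autoencoder; the layer sizes, latent dimension, and training settings are illustrative choices rather than the exact NLE configuration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from tensorflow.keras import layers, Model

# 4.1 Data generation: 500 unlabelled 2D points (the labels returned by make_blobs are discarded)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.2, random_state=42)

# 4.2 Autoencoder: compress the 2D data into a latent representation
inputs = layers.Input(shape=(2,))
encoded = layers.Dense(8, activation="relu")(inputs)
latent = layers.Dense(2, activation="linear")(encoded)      # latent space used for clustering
decoded = layers.Dense(8, activation="relu")(latent)
outputs = layers.Dense(2, activation="linear")(decoded)

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, latent)

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)   # reconstruction loss decreases over training

# Encoded features are what the clustering algorithms (Section 4.3) operate on
Z = encoder.predict(X)
```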
5. Observations and Results
The experiment successfully verifies the characteristics of unsupervised learning by demonstrating
how clustering algorithms autonomously group data. The findings are summarized as follows:

5.1 Different Clustering Methods Yield Unique Classifications


• Each clustering algorithm produces a distinct classification pattern.

Autoencoders Improve Cluster Separation

• The encoded feature space enhances clustering techniques, allowing them to identify more
distinct patterns compared to raw data.

Loss Reduction Confirms Model Improvement

• A decrease in reconstruction loss over time indicates optimization of the model and
improved feature extraction.

5.2 Cluster Evolution Visualization


The cluster formation is animated for each algorithm so that:

SOM gradually adapts over time.

K-Means updates centroids at each step.

DBSCAN expands clusters dynamically and detects outliers.

5.3 Observations from a 20-Second Run

(1) Self-Organizing Map (SOM)

SOM gradually adjusts its neurons toward data clusters.
The decision boundaries smoothly evolve over time, not forming clusters immediately.
Initial clustering is random, but with more iterations, groupings become clearer.
Handles non-circular and non-uniform density clusters effectively.

⚠ Limitations:

Requires more training time to stabilize.
Can sometimes merge small clusters if neurons are not finely tuned.

(2) K-Means Clustering

K-Means immediately assigns data points to the nearest centroid.
Initially, clusters jump around as centroids adjust.
By iteration ~10, centroids stabilize, and cluster assignment stops changing.
Performs well for evenly distributed circular clusters.

⚠ Limitations:

Fails with non-uniform clusters (e.g., elongated or varied densities).
Struggles with overlapping clusters, forcing a strict K-partitioning.
Cannot detect outliers; every point is forced into a cluster.

(3) DBSCAN Clustering

Unlike K-Means, DBSCAN doesn't require specifying the number of clusters.
It slowly expands core clusters while detecting noise points.
Handles irregularly shaped clusters better than K-Means.
Can leave some points unclassified (outliers), marking them as noise.

⚠ Limitations:

Highly sensitive to hyperparameters (eps & min_samples).
Fails if clusters overlap too much.
Slow for high-dimensional data (though fine for 2D).

Fig: The three clustering algorithms under discussion, demonstrating their behaviour over a minimum 20-second period.
Limitations of the experimental setup:
• A 'start' button should be implemented in the UI for more controlled reading, as the initial few milliseconds are lost while switching between algorithms in the program (the default algorithm is SOM).
• For a simple demonstration, the run time (and thus the sample space) has been kept to a short duration of 20 seconds. For academic and industrial purposes, the sample space must be far larger to reduce observational bias caused by bad data and errors.
• Academic-standard accuracy can be obtained if the program is refactored to use a single fixed starting configuration for the point distribution, allowing a more controlled and unbiased observation. This has been avoided in this particular demonstration to keep it simple for teaching or explanation purposes.

Why Do These Algorithms Behave Differently?


Each clustering technique uses different principles to organize data. Let's break them down in detail:

Self-Organizing Maps (SOM)

How It Works:

• SOM is a type of artificial neural network that uses competitive learning.

• Each neuron (node) in the map competes to become the closest to a given data point.

• Over time, the map self-adjusts so that similar data points activate the same neuron or
nearby neurons.

Why It Behaves Differently:

• Unlike K-Means and DBSCAN, SOM learns progressively rather than immediately assigning
labels.

• The network smoothly adapts to data instead of making hard assignments.

• Because it is topology-preserving, it maintains relationships between clusters.

Best Use Cases:

• Unstructured data, such as high-dimensional embeddings.

• Time-series clustering (where relationships between clusters matter).

• Dimensionality reduction for visualization.
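A brief usage sketch of this competitive-learning behaviour, assuming the third-party minisom package is available (the grid size and iteration count are arbitrary illustrative choices):

```python
import numpy as np
from minisom import MiniSom

# Toy 2D data standing in for the encoded features
data = np.random.rand(200, 2)

# A 7x7 grid of neurons competing for 2-dimensional inputs
som = MiniSom(7, 7, 2, sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(data)
som.train_random(data, num_iteration=1000)    # gradual, competitive adaptation

# Each point is mapped to its best matching unit (BMU) on the grid
bmus = [som.winner(x) for x in data]
```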


K-Means Clustering

How It Works:

1. Randomly initialize K centroids.

2. Assign each data point to the nearest centroid.

3. Compute the new centroid locations (average of assigned points).

4. Repeat until centroids stop moving.

Why It Behaves Differently:

• Hard Assignments: Every point belongs to exactly one cluster.

• Fast Convergence: K-Means typically converges faster than SOM & DBSCAN.

• Assumes Circular Clusters: Struggles with non-circular or unevenly spaced clusters.

Best Use Cases:

• Market segmentation (grouping customers by similar behaviours).

• Image compression (reducing pixel colour space).

• Simple datasets with well-separated clusters.
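A minimal scikit-learn sketch of these four steps (the value of K and the toy data are illustrative assumptions; K must always be chosen up front):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# init="k-means++" is the standard heuristic initialization of the K centroids
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # hard assignment: every point gets exactly one cluster

print(kmeans.cluster_centers_)        # final centroid locations
print(kmeans.inertia_)                # within-cluster sum of squares (WCSS)
```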

DBSCAN (Density-Based Spatial Clustering)

How It Works:

1. Pick a random unvisited point.

2. If at least min_samples points exist within radius eps, it’s a core point → forms a cluster.

3. Expand cluster by absorbing all reachable core points.

4. Mark outliers that don’t belong to any cluster.

Why It Behaves Differently:

• Unlike K-Means, DBSCAN does not require predefining K clusters.

• Can ignore outliers and focus only on dense regions.

• Adapts well to irregular clusters but fails if clusters are too close together.

Best Use Cases:

• Anomaly detection (e.g., fraud detection in banking).

• Geospatial clustering (grouping locations in maps).

• Text clustering (e.g., topic modelling in NLP).
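A minimal scikit-learn sketch of this behaviour (the make_moons toy data and the eps/min_samples values are illustrative assumptions and must be tuned per dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-circular clusters, where density-based clustering shines
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)            # label -1 marks noise points (outliers)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```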


Summary of Differences

| Algorithm | Learns Over Time?      | Handles Outliers?                | Works on Non-Circular Data?     | Speed  |
|-----------|------------------------|----------------------------------|---------------------------------|--------|
| SOM       | Yes (gradual)          | No (all points assigned)         | Yes (adapts to structure)       | Slow   |
| K-Means   | No (immediate)         | No (forces points into clusters) | No (fails for elongated shapes) | Fast   |
| DBSCAN    | Yes (expands clusters) | Yes (identifies noise)           | Yes (detects arbitrary clusters)| Medium |

Self-Organizing Maps (SOM)

SOM is a type of neural network that uses competitive learning instead of traditional supervised
learning. It maps high-dimensional data to a lower-dimensional space (usually 2D) while preserving
the topological structure.

Algorithm & Equations

1. Initialization:
A grid of neurons (nodes) is randomly initialized in the input space. Each neuron j has a weight
vector 𝝎j of the same dimension as the input data 𝒳:

𝝎j = (𝝎j1, 𝝎j2, …, 𝝎jn)


2. Find the Best Matching Unit (BMU):
For each input 𝒳, find the neuron whose weight vector is closest to 𝒳. The BMU is
determined by minimizing the Euclidean distance:

𝑩𝑴𝑼 = arg min_j || 𝒳 − 𝝎j ||

3. Update Weights (Learning Step):


The BMU and its neighbouring neurons are updated using the following equation:

𝝎j(t+1) = 𝝎j(t) + 𝜂(t) hj,BMU(t) (𝒳 − 𝝎j(t))


where:

o η(t) is the learning rate (decreases over time).

o hj,BMU(t) is the neighbourhood function, usually Gaussian:

hj,BMU(t) = exp( −|| rj − rBMU ||² / (2σ(t)²) )

▪ rj is the position of neuron j in the SOM grid.

▪ σ(t) is the neighbourhood radius, which shrinks over time.

4. Repeat Until Convergence:


The steps are repeated for multiple iterations until the weights stabilize.

Comprehensive Explanation

• The BMU search ensures that the closest neuron is chosen for adaptation.

• The neighbourhood function ensures that nearby neurons are updated together, preserving
topological relationships.

• The learning rate and neighbourhood size decay over time to fine-tune adjustments.
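A from-scratch NumPy sketch of these update equations is given below; the grid size, decay schedules, and iteration count are illustrative assumptions, not the values used in the experiment.

```python
import numpy as np

def train_som(X, grid=(7, 7), n_iter=2000, eta0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # 1. Initialization: random weight vector w_j for each neuron on the grid
    W = rng.random((rows, cols, X.shape[1]))
    # Grid positions r_j of each neuron, used by the neighbourhood function
    coords = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))

    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        # 2. Best Matching Unit: neuron whose weight vector is closest to x
        dists = np.linalg.norm(W - x, axis=2)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Decaying learning rate eta(t) and neighbourhood radius sigma(t)
        eta = eta0 * np.exp(-t / n_iter)
        sigma = sigma0 * np.exp(-t / n_iter)
        # Gaussian neighbourhood h_{j,BMU}(t) based on grid distance to the BMU
        grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))
        # 3. Weight update: w_j(t+1) = w_j(t) + eta(t) * h_{j,BMU}(t) * (x - w_j(t))
        W += eta * h[..., None] * (x - W)
    return W
```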

K-Means Clustering

K-Means is a centroid-based clustering algorithm that partitions data into K clusters.

Algorithm & Equations

1. Initialize K Centroids and assign Data Points:


Choose K initial cluster centers μk, either randomly or using specific heuristics like K-Means++.

Each point Xi is assigned to the cluster whose centroid is closest:

Ci = arg min_k || Xi − μk ||²

where Ci is the cluster assignment for Xi.

2. Update Centroids:
The new centroid for each cluster is computed as the mean of all points assigned to it:

μk = (1 / |Ck|) Σ (Xi ∈ Ck) Xi

where |Ck| is the number of points in cluster k.

3. Repeat Until Convergence:


The assignment and update steps are repeated until the centroids stop changing significantly.

Comprehensive Explanation

• The centroid update step ensures that each cluster's center represents the average of its
points.

• The assignment step forces each data point into exactly one cluster. The algorithm minimizes the Within-Cluster Sum of Squares (WCSS):

J = Σ_k Σ (Xi ∈ Ck) || Xi − μk ||²

where J is the cost function that measures the compactness of clusters.

• The downside is that K-Means struggles with non-circular clusters and is sensitive to outliers.
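The same steps written out as a minimal NumPy sketch (plain random initialization is used instead of K-Means++ for brevity; parameter values are illustrative):

```python
import numpy as np

def kmeans(X, k=4, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize K centroids by sampling K distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: C_i = argmin_k ||X_i - mu_k||^2 (hard assignment)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # 2. Update step: mu_k = mean of the points assigned to cluster k
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # 3. Stop once the centroids no longer move significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and WCSS cost J = sum_k sum_{X_i in C_k} ||X_i - mu_k||^2
    labels = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    wcss = float(np.sum((X - centroids[labels]) ** 2))
    return labels, centroids, wcss
```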

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN groups data based on density rather than pre-defined centroids.

Algorithm & Equations

1. Define Parameters:

o ε (epsilon): The maximum neighbourhood radius.

o min_samples: Minimum number of points required to form a dense region.

2. Classify Points:

o Core Points: Have at least min_samples neighbours within ε.

o Border Points: Have fewer than min_samples neighbours but are close to a core point.

o Noise Points: Do not belong to any cluster.

3. Expand Clusters:

o Select a random unvisited core point and create a new cluster.

o Recursively expand the cluster by adding density-reachable points: a point q is directly density-reachable from a core point p if dist(p, q) ≤ ε, and density-reachability chains such steps through successive core points.

o The cluster grows until no more density-reachable points remain.

Comprehensive Explanation

• DBSCAN is different from K-Means because it does not require the number of clusters as
input.

• It can find arbitrarily shaped clusters, unlike K-Means which assumes circular clusters.

• The density-reachability condition ensures that only high-density areas form clusters.

• The algorithm’s time complexity is O(n log n) with efficient indexing (e.g., KD-trees), but it can
degrade to O(n²) in the worst case.
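A compact from-scratch sketch of the core-point classification and cluster expansion described above (brute-force neighbour search, hence O(n²); the eps and min_samples defaults are illustrative):

```python
import numpy as np

def dbscan(X, eps=0.3, min_samples=5):
    n = len(X)
    # Pairwise distances (brute force; a KD-tree would give ~O(n log n))
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_samples for nb in neighbours])

    labels = np.full(n, -1)              # -1 = noise, until claimed by a cluster
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue                     # already assigned, or not a core point
        # Start a new cluster from this unvisited core point
        labels[i] = cluster_id
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster_id               # core or border point joins the cluster
                if core[j]:
                    frontier.extend(neighbours[j])   # only core points keep expanding
        cluster_id += 1
    return labels                        # points still labelled -1 are noise
```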
When to Use Each Clustering Algorithm

SOM to be used when:

• A neural-network-based approach that preserves data relationships is desired.

• Complex structures are present or dimensionality reduction is needed.

K-Means to be used when:

• Well-separated, circular clusters are present.

• Fast clustering for large datasets is required.

DBSCAN to be used when:

• Automatic cluster count detection is needed.

• Filtering out noise and outliers is necessary.

6. Conclusion and Future Work

This experiment successfully demonstrates the core principles of unsupervised learning by applying autoencoders and multiple clustering techniques to unlabelled data. The findings reinforce the ability of neural networks to autonomously extract hidden structures, providing valuable insights into different clustering approaches.

Key Takeaways:

• Unsupervised learning does not require predefined labels.

• Autoencoders enhance clustering efficiency by extracting essential features.

• Different clustering techniques provide unique perspectives on data organization.

• Interactive visualization improves understanding of AI-driven pattern recognition.

Future Enhancements:

• Implementation of hierarchical clustering for multi-level classification.

• Introducing real-time centroid tracking for K-Means.

• Developing an interactive tool that allows users to draw custom data points and observe clustering responses.

References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Kohonen, T. (1990). The Self-Organizing Map. Proceedings of the IEEE, 78(9), 1464-1480.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.
7. Appendix: Code Excerpt
