
Faculty of Computer Engineering

Seminar for the Master's Degree in
Artificial Intelligence and Robotics
Title
DATA CLUSTERING
Supervisor
Associate Professor Askar Poer
Advisor
Prof. …………
Researcher
Mohammed Ayoub Mamaseeni
Outline
• Introduction
• What is Data Clustering?
• Types of Clustering Algorithms
• K-Means Clustering
• Hierarchical Clustering
• DBSCAN Clustering
• Choosing the Right Clustering Algorithm
• Evaluating Clustering Performance
• Applications of Data Clustering
• Conclusion and Key Takeaways

Introduction to Data Clustering
Data clustering is a powerful technique in machine learning and data analysis
that groups similar data points together, revealing underlying patterns and
structures within complex datasets. This provides valuable insights for a wide
range of applications, from customer segmentation to image recognition.

What is Data Clustering?
Data clustering is the process of grouping similar data points together into distinct clusters or groups. The goal is
to identify natural patterns and structures within complex datasets, enabling deeper insights and better
decision-making. By organizing data into meaningful clusters, analysts can uncover hidden relationships and
trends that may not be immediately apparent.

Types of Clustering Algorithms
1. Partitioning Algorithms: These divide the data into k distinct clusters. K-Means, for example, assigns each data point to the nearest cluster center.
2. Hierarchical Algorithms: These build a hierarchy of clusters, allowing analysis at different levels of granularity; examples include Agglomerative and Divisive clustering.
3. Density-Based Algorithms: These identify clusters based on the density of data points. DBSCAN, for example, finds high-density regions separated by low-density areas.

K-Means Clustering
K-Means is a popular partitioning clustering algorithm that groups data points
into k distinct clusters based on their similarity. It works by iteratively assigning
each data point to the nearest cluster centroid and then recalculating the
centroids until convergence.

The key advantages of K-Means are its simplicity, scalability, and the ability to
handle large datasets effectively. It is widely used in customer segmentation,
image segmentation, and anomaly detection applications.
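
As a concrete illustration, here is a minimal K-Means sketch using scikit-learn; the synthetic blob data, k = 3, and fixed seed are illustrative assumptions, not anything prescribed by the slides.

```python
# Minimal K-Means sketch (assumptions: synthetic blobs, k = 3, fixed seed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)       # iteratively assign points, update centroids
print(kmeans.cluster_centers_)       # centroids after convergence
```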

Hierarchical Clustering
Hierarchical clustering is a powerful technique that builds a hierarchy of
clusters, allowing analysis at different levels of granularity. It can identify
complex, nested structures within data by iteratively merging or splitting
clusters based on their proximity.

This approach is particularly useful when the number of clusters is unknown or the data exhibits a clear hierarchical relationship. Hierarchical methods include Agglomerative and Divisive clustering, each with its own strengths and applications.
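
A minimal agglomerative sketch using SciPy follows; Ward linkage and the synthetic four-blob data are assumptions for illustration. Cutting the resulting tree at different heights yields different granularities.

```python
# Agglomerative (bottom-up) clustering sketch using SciPy's linkage tree.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=4, random_state=0)  # toy data

Z = linkage(X, method="ward")                     # merge the closest clusters step by step
labels = fcluster(Z, t=4, criterion="maxclust")   # cut the hierarchy into 4 flat clusters
```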

DBSCAN Clustering

Density-Based Clustering
DBSCAN is a density-based clustering algorithm that groups together data points that are close to each other based on density, identifying clusters of arbitrary shape and size.

Handling Outliers
One of the key advantages of DBSCAN is its ability to identify and handle outliers, which are data points that do not belong to any well-defined cluster.

Parameters and Considerations
The performance of DBSCAN depends on the selection of its two key parameters, epsilon (eps) and the minimum number of points (minPoints), which determine the density threshold for cluster formation.
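
A minimal DBSCAN sketch with scikit-learn, where minPoints is called min_samples; the two-moons data and eps = 0.3 are illustrative assumptions.

```python
# DBSCAN sketch: density-based clusters of arbitrary shape; outliers get label -1.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # non-spherical toy data

db = DBSCAN(eps=0.3, min_samples=5).fit(X)   # eps and min_samples set the density threshold
labels = db.labels_
print("clusters:", len(set(labels) - {-1}), "| outliers:", list(labels).count(-1))
```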

Choosing the Right Clustering Algorithm

Data Characteristics
Consider the size, dimensionality, and structure of your dataset. Different algorithms excel with specific data types and properties.

Cluster Shapes
K-Means works best for spherical clusters, while DBSCAN can handle arbitrary shapes. Hierarchical methods suit nested structures.

Noise Handling
DBSCAN can identify and isolate outliers, while K-Means is more sensitive to noise. Hierarchical methods have varied noise tolerance.

Computational Efficiency
K-Means is highly scalable, while DBSCAN and hierarchical methods can be more computationally intensive for large datasets.
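
To make these trade-offs tangible, here is a short side-by-side sketch on a non-spherical toy dataset; the dataset and parameter values are assumptions chosen only for illustration.

```python
# Comparing the three algorithms on two interleaved half-moons: K-Means tends to
# split the moons incorrectly, while DBSCAN can recover the arbitrary shapes.
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.06, random_state=1)

for model in (KMeans(n_clusters=2, n_init=10, random_state=1),
              AgglomerativeClustering(n_clusters=2),
              DBSCAN(eps=0.25, min_samples=5)):
    labels = model.fit_predict(X)
    print(type(model).__name__, "->", sorted(set(labels)))
```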

Evaluating Clustering Performance
Assessing the quality and effectiveness of clustering models is crucial to ensure they deliver meaningful
insights. Several evaluation metrics can be used to measure clustering performance, such as intra-cluster
distance, inter-cluster distance, and silhouette score.

For a well-performing model, a low intra-cluster distance and a high inter-cluster distance indicate that the clusters are compact and well separated. The silhouette score, which measures how well each data point fits its assigned cluster, provides a further check on clustering quality.
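
A minimal sketch of computing one of these metrics, the silhouette score, with scikit-learn; the data and model are illustrative assumptions.

```python
# Silhouette score sketch: values near 1 mean compact, well-separated clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))  # in [-1, 1]; higher is better
```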
Applications of Data Clustering

Customer Segmentation
Cluster customers based on their behaviors, preferences, and demographics to personalize marketing and improve user experiences.

Biomedical Research
Identify subgroups of patients with similar genetic profiles or disease characteristics to enable precision medicine.

Image Segmentation
Partition images into meaningful regions or objects, enabling applications like object detection and recognition.

Network Analysis
Cluster nodes in a network to uncover communities, detect anomalies, and understand complex relationships.

Related Studies
1- Two-pronged feature reduction in spectral clustering with optimized
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=qNQSCOoAAAAJ&pagesize=80&citft=3&email_for_op=mahamad97ayoub%40gmail.com&authuser=1&citation_for_view=qNQSCOoAAAAJ:EUQCXRtRnyEC
The paper discusses a novel spectral clustering algorithm called BVA_LSC (Barnes-Hut t-SNE Variational Autoencoder Landmark-based Spectral
Clustering), which aims to improve the performance and efficiency of spectral clustering on high-dimensional datasets. The key contributions and
methods presented in the paper are as follows:
• Two-Pronged Feature Reduction:
- Barnes-Hut t-SNE: This method is used for dimensionality reduction, which optimizes the computational cost by reducing the size of the
similarity matrix used in spectral clustering. Barnes-Hut t-SNE is particularly effective for high-dimensional data.
- Variational Autoencoder (VAE): A deep learning technique used alongside Barnes-Hut t-SNE to capture non-linear relationships in data and
further reduce dimensionality.
• Adaptive Landmark Selection:
- K-harmonic means clustering: This algorithm is used initially to group data points and narrow down potential landmarks (a subset of
representative data points).
- Grey Wolf Optimization (GWO): An optimization algorithm inspired by the social hierarchy of grey wolves, which is used to select the most
effective landmarks based on a novel objective function. This selection process ensures that the landmarks are evenly distributed across the
dataset and represent the data well.

• Optimized Similarity Matrix:
- By reducing the number of features and carefully selecting landmarks, the algorithm decreases the size of the similarity matrix, which reduces
the computational burden during eigen decomposition—a critical step in spectral clustering.
• Dynamic Landmark Count Determination:
- The paper introduces a new equation to dynamically determine the optimal number of landmarks based on the dataset’s features. This allows
the algorithm to adapt to different datasets without requiring manual tuning.
• Experimental Validation:
- The algorithm was tested on several real-world datasets (e.g., MNIST, USPS, Fashion-MNIST) and compared against various state-of-the-art
spectral clustering methods. The results showed that BVA_LSC generally outperforms other methods in terms of clustering accuracy (ACC) and
normalized mutual information (NMI), particularly for complex and high-dimensional datasets.
• Computational Efficiency:
- While BVA_LSC demonstrates superior clustering performance, it does so at the cost of slightly higher computational time compared to some
of the other methods, especially as the number of landmarks increases.
• Overall, the paper introduces a robust and efficient spectral clustering method that leverages advanced feature reduction and optimized
landmark selection to tackle the challenges of high-dimensional data clustering. The approach balances accuracy with computational
efficiency, making it suitable for large-scale data analysis tasks.
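
The paper's exact BVA_LSC pipeline is not reproduced here; the following generic landmark-based spectral clustering sketch only illustrates the core idea of shrinking the similarity matrix via landmarks. The landmark count, RBF affinity, and plain k-means landmark selection are all assumptions standing in for the paper's K-harmonic means, Grey Wolf Optimization, and learned embeddings.

```python
# Generic landmark-based spectral clustering sketch (NOT the paper's BVA_LSC).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_blobs(n_samples=1000, centers=5, random_state=3)

# 1. Select landmarks; plain k-means centroids stand in for the paper's
#    K-harmonic-means + Grey Wolf Optimization selection.
landmarks = KMeans(n_clusters=50, n_init=10, random_state=3).fit(X).cluster_centers_

# 2. Affinity between every point and the landmarks only: an n x 50 matrix
#    instead of the full n x n similarity matrix.
Z = rbf_kernel(X, landmarks, gamma=0.5)
Z = Z / Z.sum(axis=1, keepdims=True)     # row-normalize

# 3. SVD of the thin matrix replaces the costly eigendecomposition of the
#    full similarity matrix; cluster the leading spectral embedding.
U, _, _ = np.linalg.svd(Z, full_matrices=False)
labels = KMeans(n_clusters=5, n_init=10, random_state=3).fit_predict(U[:, :5])
```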

Conclusion and Key Takeaways
Powerful Insights from Data
Clustering algorithms unlock hidden patterns and structures in complex data, enabling organizations to uncover valuable business insights.

Adaptable to Various Domains
From customer segmentation to image analysis, clustering techniques can be applied across a wide range of industries and use cases.

Importance of Algorithm Selection
Carefully choosing the right clustering algorithm based on data characteristics and business objectives is crucial for successful deployment.

Continuous Improvement
Evaluating clustering performance and iterating on models can lead to ongoing refinements and better decision-making support.
