Department of Electrical Engineering
University of Arkansas
Density-Based Spatial Clustering
Md Abul Hayat
mahayat@uark.edu
OUTLINE
• Introduction
• k-means clustering
• DBSCAN
– Definitions
– Advantages
– Limitations
• OPTICS
– Definitions
– Advantages
– Limitations
K-means Clustering
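– An unsupervised approach for partitioning a data set into K distinct, non-overlapping clusters. [Lloyd, 1982]
– We must first specify the desired number of clusters K.
– The K-means algorithm then assigns each observation to exactly one of the K clusters.
– The optimization problem that defines K-means clustering, in its standard form: choose clusters $C_1, \dots, C_K$ that minimize the total within-cluster sum of squared distances, $\min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$, where $\mu_k$ is the mean of cluster $C_k$.
– Finding the global optimum of this problem is NP-hard.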
K-means : Algorithm
• Lloyd’s Algorithm
– Mathematically, this is partitioning the observations according to
the Voronoi diagram generated by the means.
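The two alternating steps of Lloyd's algorithm (assign each observation to its nearest mean, then recompute the means) can be sketched as below. This is a minimal illustration, not a reference implementation: the function name `lloyd_kmeans`, the random initialisation from the data, and the fixed iteration cap are assumptions made for the example.

```python
import math
import random


def lloyd_kmeans(points, k, iterations=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Initialisation (an assumption): use k distinct observations as starting means.
    means = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each observation joins its nearest mean,
        # i.e. the Voronoi cell of the means it falls into.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, means[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means would not move either
        labels = new_labels
        # Update step: each mean moves to the centroid of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old mean if a cluster becomes empty
                means[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, labels
```

Because the result depends on the starting means, Lloyd's algorithm only finds a local optimum; in practice it is usually restarted from several initialisations.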
How Lloyd’s Algorithm Works
Problems with K-means
– K-means partitions the space into Voronoi cells, which are convex.
– Thus k-means does not perform well when the clusters are non-convex.
– We have to provide the number of clusters beforehand.
– Sometimes we want to discover the intrinsic number of clusters within the dataset.
– There is no way of handling noise separately.
Problems with K-means
• Non-convex Clusters
• When we do not know the number of clusters.
• To address these issues, density-based clustering was introduced.
DBSCAN
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Inventors:
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu.
• Paper : “A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise”
• Presented at the International Conference on Knowledge Discovery and Data
Mining (KDD) in 1996. KDD is now organized by SIGKDD, a SIG of ACM.
• Citations: 13,293 (as of 11/04/2018)
• The 2014 SIGKDD Test of Time Award recognized DBSCAN as an influential
contribution that has withstood the test of time.
• This is an unsupervised algorithm.
Definitions
– The shape of a neighborhood is determined by the choice of a distance
measure between two points p and q, denoted by d(p,q).
– For instance, when using the Manhattan distance in 2D space, the
neighborhood is a square rotated by 45° (a diamond).
– For the purpose of proper visualization, all examples will be in 2D space
using the Euclidean distance.
Distance Measures
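– If we have two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$
– Minkowski Distance: $d(p,q) = \left( \sum_{i=1}^{n} |p_i - q_i|^{m} \right)^{1/m}$, which gives the Manhattan distance for $m = 1$ and the Euclidean distance for $m = 2$.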
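As a small illustration, the Minkowski distance of order m can be computed as follows. The function name `minkowski_distance` and the parameter name `m` are chosen for this sketch only.

```python
def minkowski_distance(p, q, m=2):
    """Minkowski distance of order m between equal-length points p and q."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return sum(abs(a - b) ** m for a, b in zip(p, q)) ** (1.0 / m)
```

With `m=1` this is the Manhattan distance and with `m=2` the Euclidean distance, so the same function covers both neighborhood shapes discussed in the definitions.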
Definitions
• [ Link: Funny Visualization ]
DBSCAN: Algorithm
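As a rough illustration of how DBSCAN grows clusters outward from core points, here is a minimal sketch. It assumes Euclidean distance and a brute-force neighborhood search; the names `region_query`, `dbscan` and `NOISE` are illustrative and not taken from any particular library.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                seeds.extend(j_neighbors)
    return labels
```

The brute-force search makes this sketch quadratic in the number of points; practical implementations usually rely on a spatial index for the neighborhood queries.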
DBSCAN : Examples
• Resistant to Noise (unlike k-means)
• Can handle clusters of different shapes and sizes
[Figure: Original Data | After DBSCAN]
DBSCAN Limitations
• Cannot handle varying densities.
• Sensitive to parameter selection.
[Figure: Original Data | (MinPts=4, Eps=9.92) | (MinPts=4, Eps=9.75)]
Heuristics for Choosing DBSCAN Parameters
– Let d be the distance of a point p to its k-th nearest neighbor, then the d-
neighborhood of p contains exactly k+1 points for almost all points p.
– For a given k we define a function k-dist (= d) from the database D to the
real numbers, mapping each point to the distance from its k-th nearest
neighbor.
– When sorting the points of the database in descending order of their k-dist
values, the graph of this function gives some hints concerning the density
distribution in the database.
– If we choose an arbitrary point p, set the parameter Eps to k-dist(p) and set
the parameter MinPts to k, all points with an equal or smaller k-dist value
will be core points.
– All points with a higher k-dist value (left of the threshold) are
considered to be noise; all other points (right of the threshold) are assigned
to some cluster.
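The sorted k-dist graph described above can be computed with a few lines of code. This is a minimal sketch assuming Euclidean distance and a brute-force search; the function names `k_dist` and `sorted_k_dist` are made up for the example.

```python
import math


def k_dist(points, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    result = []
    for i, p in enumerate(points):
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return result


def sorted_k_dist(points, k):
    """k-dist values in descending order, i.e. the sorted k-dist graph."""
    return sorted(k_dist(points, k), reverse=True)
```

Plotting the returned values against their rank gives the curve from which Eps is read off: a common choice is the k-dist value at the first pronounced bend, so that the few large values to its left are treated as noise.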
DBSCAN : Parameter Selection
– The easier-to-set parameter of DBSCAN is the minPts parameter.
– Sander et al. suggest setting it to twice the dataset dimensionality, i.e.,
minPts = 2 · dim.
– Ester et al. provide a heuristic for choosing the ε parameter based on the
distance to the fourth nearest neighbor (for two-dimensional data).
– In Generalized DBSCAN, Sander et al. suggested using the (2 · dim - 1)
nearest neighbors and minPts = 2 · dim.
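The rule of thumb from Sander et al. is simple arithmetic on the data dimensionality. The helper below is only a worked example; its name `suggest_parameters` is hypothetical.

```python
def suggest_parameters(dim):
    """Return (min_pts, k) following the 2 * dim rule of thumb."""
    min_pts = 2 * dim
    # k nearest neighbors plus the point itself add up to min_pts points.
    k = 2 * dim - 1
    return min_pts, k
```

For two-dimensional data this gives minPts = 4 and a k-dist graph built from the 3rd nearest neighbor.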
OPTICS
• Ordering Points To Identify the Clustering Structure (OPTICS)
– Inventors: (1999)
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander
– Paper: “OPTICS: Ordering Points To Identify the Clustering Structure”
– OPTICS takes the same ε and minPts parameters as DBSCAN; however, the ε
parameter is theoretically unnecessary and is used only for the practical
purpose of reducing the runtime complexity of the algorithm.
– While DBSCAN may be thought of as a clustering algorithm that searches
for natural groups in data, OPTICS is an augmented ordering algorithm.
– OPTICS requires two more definitions: core distance and reachability distance.
– Here, we fix only the minPts parameter and gain insight into the
underlying cluster structure from a plot called the ‘reachability plot’.
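To make the idea of an augmented ordering concrete, the sketch below produces the processing order together with each point's reachability and core distance. It is a simplified illustration under stated assumptions: Euclidean distance, brute-force neighbor search, a plain set instead of a priority queue, and a neighborhood that counts the point itself. The names `optics` and `UNDEFINED` are chosen for the example.

```python
import math

UNDEFINED = math.inf


def optics(points, min_pts, eps=math.inf):
    """Return (order, reach, core) for a list of point tuples."""
    n = len(points)
    reach = [UNDEFINED] * n   # reachability distance of each point
    core = [UNDEFINED] * n    # core distance of each point
    processed = [False] * n
    order = []

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def core_distance(i, nbrs):
        # Distance to the min_pts-th closest point, counting i itself.
        if len(nbrs) < min_pts:
            return UNDEFINED
        return sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    def update(i, nbrs, seeds):
        # Lower the reachability of unprocessed neighbors of core point i.
        for j in nbrs:
            if processed[j]:
                continue
            new_reach = max(core[i], math.dist(points[i], points[j]))
            if new_reach < reach[j]:
                reach[j] = new_reach
                seeds.add(j)

    for start in range(n):
        if processed[start]:
            continue
        seeds = {start}
        while seeds:
            # Continue with the unprocessed point that is cheapest to reach.
            i = min(seeds, key=lambda j: (reach[j], j))
            seeds.remove(i)
            processed[i] = True
            order.append(i)
            nbrs = neighbors(i)
            core[i] = core_distance(i, nbrs)
            if core[i] != UNDEFINED:
                update(i, nbrs, seeds)
    return order, reach, core
```

Plotting `reach[i]` for `i` in `order` gives the reachability plot: valleys correspond to dense regions (clusters) and the peaks between them mark the jumps from one cluster to the next.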
OPTICS : Definitions
ε = Generating Distance
ε’ = Core Distance
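For reference, the two OPTICS definitions can be written out as follows. The notation is an assumption of this sketch: $N_{\varepsilon}(p)$ is the ε-neighborhood of $p$ (taken here to include $p$ itself), $d$ is the chosen distance measure, and $\mathit{MinPts}\text{-dist}(p)$ is the distance from $p$ to its MinPts-th nearest neighbor.

```latex
\text{core-dist}_{\varepsilon,\mathit{MinPts}}(p) =
\begin{cases}
\text{UNDEFINED} & \text{if } |N_{\varepsilon}(p)| < \mathit{MinPts} \\
\mathit{MinPts}\text{-dist}(p) & \text{otherwise}
\end{cases}

\text{reach-dist}_{\varepsilon,\mathit{MinPts}}(q, p) =
\begin{cases}
\text{UNDEFINED} & \text{if } |N_{\varepsilon}(p)| < \mathit{MinPts} \\
\max\bigl(\text{core-dist}_{\varepsilon,\mathit{MinPts}}(p),\; d(p, q)\bigr) & \text{otherwise}
\end{cases}
```

In words: the core distance of $p$ is the smallest radius that would make $p$ a core point, and the reachability distance of $q$ from $p$ is the distance between them, but never less than the core distance of $p$.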
Reachability Graph
Reachability Graph : Toy Example
R Package & Examples
• dbscan: Density Based Clustering of Applications with Noise
(DBSCAN) and Related Algorithms
– Published: May 19, 2018
– From the order discovered by OPTICS, two ways to group points into
clusters were discussed.
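One of these two ways is a DBSCAN-style extraction that cuts the reachability values at a threshold ε′; the other is the ξ-cluster method. A minimal sketch of the threshold-based extraction is given below. It assumes the ordering, reachability and core distances are already available as plain Python lists indexed by point; the function name `extract_dbscan` and the argument names are illustrative.

```python
NOISE = -1


def extract_dbscan(order, reach, core, eps_prime):
    """Label points by cutting the reachability ordering at eps_prime."""
    labels = [NOISE] * len(reach)
    cluster = -1
    for i in order:
        if reach[i] > eps_prime:
            # i is not reachable from the points before it at this threshold.
            if core[i] <= eps_prime:
                cluster += 1  # i is a core point, so it starts a new cluster
                labels[i] = cluster
            # otherwise i stays labelled as noise
        else:
            labels[i] = cluster  # i continues the current cluster
    return labels
```

A single OPTICS run can therefore be cut at several values of ε′ to obtain different DBSCAN-like clusterings without recomputing any distances.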
ξ-Cluster
Conclusion
• Reachability plots are helpful for determining the number of clusters.
• Can be applied to find clusters in high-dimensional data (such as images).
• Both DBSCAN and OPTICS are unsupervised techniques.
Questions?
