
DBSCAN

The document discusses the DBSCAN clustering algorithm, which finds clusters of arbitrary shape and handles noise. It works by connecting points that are within a distance ε of each other and have at least MinPts neighbors, and considers points not meeting these criteria to be noise. The key steps are finding core points with many neighbors, growing clusters from them, and labeling remaining points as noise.

Uploaded by

yevedi5237
Copyright
© © All Rights Reserved

Fundamentally, all clustering methods use the same approach: first we calculate
similarities between data points, and then we use them to cluster the data points into
groups or batches. Here we will focus on the density-based spatial clustering of
applications with noise (DBSCAN) clustering method.

Clusters are dense regions in the data space, separated by regions of lower density
of points. The DBSCAN algorithm is based on this intuitive notion of “clusters” and
“noise”. The key idea is that for each point of a cluster, the neighborhood of a given
radius has to contain at least a minimum number of points.

Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work well for
finding spherical-shaped or convex clusters. In other words, they are suitable
only for compact and well-separated clusters. Moreover, they are also severely affected
by the presence of noise and outliers in the data.

Real-life data may contain irregularities, like:

1. Clusters can be of arbitrary shape such as those shown in the figure below.
2. Data may contain noise.

The figure below shows a data set containing non-convex clusters and outliers (noise).
Given such data, the k-means algorithm has difficulties in identifying these clusters with
arbitrary shapes.

The DBSCAN algorithm requires two parameters:

1. eps: It defines the neighborhood around a data point, i.e. if the distance between
two points is less than or equal to ‘eps’ then they are considered neighbors.
- If the eps value is chosen too small, then a large part of the data will be
considered outliers.
- If it is chosen too large, then clusters will merge and the majority of
the data points will end up in the same cluster.
- One way to find the eps value is based on the k-distance graph.

2. MinPts: The minimum number of neighbors (data points) within the eps radius. The
larger the dataset, the larger the value of MinPts that should be chosen. As a general
rule, the minimum MinPts can be derived from the number of dimensions D in the dataset
as MinPts >= D+1, and MinPts should be at least 3.
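
The k-distance idea mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name k_distances, the toy data and the choice k=2 are all hypothetical, and distances are Euclidean. For every point we take the distance to its k-th nearest neighbor and sort the values in descending order; a sharp drop ("knee") in that sorted curve is commonly read as a candidate for eps.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point (p itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

# Hypothetical toy data: two tight groups of four points plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (5.0, 20.0)]

kd = k_distances(points, k=2)
print(kd)  # one large value for the isolated point, then small values
```

In this toy data the isolated point produces a single large k-distance while all grouped points share a small one, so any eps between the two levels separates dense points from the outlier. On real data the curve is smoother and picking the knee is a judgment call.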

In this algorithm, we have 3 types of data points.

1. Core point: A core point is a point that has at least MinPts points
within its ε-neighborhood. It is considered to be the most important type of point
in DBSCAN since it can form the core of a cluster. A core point always belongs to
exactly one cluster.
2. Boundary point: A boundary point is a point that has fewer than MinPts points
within its ε-neighborhood, but it is reachable from a core point. Boundary points
are part of a cluster but they do not contribute to the core of the cluster. A
boundary point that lies within reach of core points from several clusters is
assigned to only one of them.
3. Noise: Noise points are the points that do not belong to any cluster. They are the
points that have fewer than MinPts points within their ε-neighborhood and are not
reachable from any core point. These points are considered to be outliers in the
data set.
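
The three point types can be illustrated with a small plain-Python sketch. The helper names region_query and classify and the toy data are hypothetical, and the sketch assumes Euclidean distance and that a point counts as a member of its own ε-neighborhood; the text above does not fix either choice.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    is_core = [len(n) >= min_pts for n in neighbors]
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not dense itself, but inside the eps-neighborhood of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: a unit square, one point hanging off it, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify(points, eps=1.0, min_pts=3)
print(labels)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```

Here (2.0, 1.0) has only two points in its neighborhood, so it is not a core point, but it lies within eps of the core point (1.0, 1.0) and is therefore a boundary point. The point (8.0, 8.0) is near nothing and is noise.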

The DBSCAN algorithm can be abstracted into the following steps:
1. Find all the neighbor points within eps of every point, and identify the core points,
i.e. those with at least MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all its density connected points and assign them to the same
cluster as the core point.
Two points a and b are said to be density connected if there exists a core point c
(a point with at least MinPts points in its eps-neighborhood) from which both a and b
can be reached through a chain of core points, each within eps distance of the next.
This is a chaining process. So, if b is a neighbor of c, c is a neighbor of d, d is a
neighbor of e, and e is in turn a neighbor of a, where c, d and e are core points,
then b and a are density connected and belong to the same cluster, even though b
need not be a direct neighbor of a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do
not belong to any cluster are noise.
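
The steps above can be put together in a minimal plain-Python sketch. It is not a reference implementation: the names dbscan, region_query and NOISE, the brute-force neighbor search, the Euclidean distance and the convention that a point counts toward its own neighborhood are all assumptions made for illustration. It uses a queue instead of recursion to grow each cluster, which gives the same chaining behavior.

```python
import math
from collections import deque

NOISE = -1  # hypothetical label for noise points

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)      # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may be relabeled later as a boundary point
            continue
        cluster_id += 1                # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id # boundary point reached from a core point
            if labels[j] is not None:
                continue               # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core too: keep chaining
    return labels

# Hypothetical toy data: a chain of five points, one outlier, a second short chain.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0),
          (10.0, 10.0),
          (20.0, 0.0), (21.0, 0.0), (22.0, 0.0)]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)  # [0, 0, 0, 0, 0, -1, 1, 1, 1]
```

The first chain shows the chaining process: its two end points are 4 units apart, far more than eps, yet they land in the same cluster because the core points between them link up. The end points themselves are boundary points; the first one is initially marked as noise and relabeled once the cluster reaches it.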

CHAT GPT
The neighborhood is defined by two parameters: epsilon (ε) and minimum points
(MinPts). Epsilon is the maximum distance between two points for them to be
considered as part of the same neighborhood, while MinPts is the minimum number of
points that a point's ε-neighborhood must contain for that point to count as a core
point, i.e. for the region around it to be dense enough to form a cluster.
