Clustering[1]
What is Cluster Analysis?
◼ Cluster: A collection of data objects
◼ Similar to one another within the same group
◼ Partitioning criteria
◼ Single level vs. hierarchical partitioning (often,
multi-level hierarchical partitioning is desirable)
◼ Separation of clusters
◼ Exclusive (e.g., one customer belongs to only
one region) vs. non-exclusive (e.g., one
document may belong to more than one class)
◼ Clustering space
◼ Full space (often when low dimensional) vs.
subspaces (often in high-dimensional
clustering)
◼ Scalability
◼ Clustering all the data instead of only on
samples
◼ Ability to deal with different types of attributes
◼ Numerical, binary, categorical, ordinal, linked,
parameters
5/11/2025 Dept of CS&SE, AU College of Engg. 6
◼ Interpretability and usability
◼ Others
◼ Discovery of clusters with arbitrary shape
order
◼ High dimensionality
◼ Partitioning approach:
◼ Construct various partitions and then evaluate them
◼ Hierarchical approach:
◼ Create a hierarchical decomposition of the set of
CHAMELEON
◼ Density-based approach:
◼ Based on connectivity and density functions
◼ Grid-based approach:
◼ based on a multiple-level granularity structure
◼ Model-based:
◼ A model is hypothesized for each of the clusters and
$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
◼ Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
◼ Global optimal: exhaustively enumerate all partitions
◼ Heuristic methods: k-means and k-medoids algorithms
◼ k-means: Each cluster is represented by the center of the cluster
◼ k-medoids or PAM: Each cluster is represented by one of the
objects in the cluster
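The k-means heuristic named above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the toy points, and the random seed are assumptions made for the example. It alternates between assigning each point to its nearest center and recomputing each center as the mean of its cluster, and it reports the sum of squared errors E between each point and its cluster center.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initial centers are k random points
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # no center moved, so the partition is stable
            break
        centers = new_centers
    return centers, clusters

def sse(centers, clusters):
    """Sum of squared distances from every point to its cluster center."""
    return sum(squared_distance(p, c)
               for c, cluster in zip(centers, clusters)
               for p in cluster)

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2)
error = sse(centers, clusters)
```

On data this well separated the result does not depend on the initial centers; in general k-means only reaches a local optimum, so several restarts with different seeds are a common precaution.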
[Figure: clustering example with K=2, two scatter plots on axes from 0 to 10]
◼ PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
[Figure: the PAM procedure as a sequence of scatter plots on axes from 0 to 10. Panels: (1) arbitrarily choose k objects as initial medoids; (2) assign each remaining object to the nearest medoid; (3) loop until no change: compute the total cost of swapping a medoid O with O_random, and swap if the quality is improved.]
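The PAM steps above can be sketched as follows. This is a hedged illustration under stated assumptions: instead of drawing a single random non-medoid per iteration, the sketch tries every non-medoid as a swap candidate and keeps a swap only when it lowers the total cost; the function names and toy data are made up for the example.

```python
import math
import random

def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)       # arbitrarily choose k objects as initial medoids
    best = total_cost(points, medoids)    # implicit assignment to the nearest medoid
    improved = True
    while improved:                       # loop until no swap improves the cost
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Cost of replacing medoid i by the candidate object.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:           # keep the swap only if quality improves
                    medoids, best = trial, cost
                    improved = True
    return medoids, best

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
medoids, cost = pam(points, k=2)
```

Every candidate swap re-evaluates the cost over all points, which is why this style of search becomes expensive as the data set grows.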
➔ Sampling-based method: CLARA (Clustering LARge Applications)
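A rough sketch of the sampling idea behind CLARA, assuming its usual formulation: draw several random samples, run a k-medoids search on each sample, and keep the medoid set with the lowest cost measured on the full data set. The helper names, the sample sizes, and the synthetic data are assumptions for illustration only.

```python
import math
import random

def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, rng):
    """Swap-based k-medoids search on a (small) list of points."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, num_samples=5, sample_size=40, seed=0):
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        # Search for medoids on a sample only ...
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = k_medoids(sample, k, rng)
        # ... but judge them by their cost on the whole data set.
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

# Hypothetical data: two Gaussian blobs of 100 points each.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
data += [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(100)]
medoids, cost = clara(data, k=2)
```

The quality of the result depends on how representative the samples are: a medoid that never appears in any sample cannot be selected.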
DIANA (Divisive Analysis)
[Figure: three scatter plots on axes from 0 to 10]
$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{i=1}^{N} (t_{ip} - t_{iq})^2}{N(N-1)}}$
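As a worked example, the sketch below evaluates D_m numerically. It reads D_m as the cluster diameter, that is, the square root of the average squared distance over all ordered pairs of points in the cluster; that reading, the function name, and the choice of the five points from the CF example in this section are assumptions of the illustration.

```python
import math

def diameter(points):
    """Square root of the mean squared distance over all ordered pairs of points."""
    n = len(points)
    total = sum(
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p in points
        for q in points  # pairs with p == q contribute zero
    )
    return math.sqrt(total / (n * (n - 1)))

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
d_m = diameter(points)  # sqrt(128 / 20) = sqrt(6.4), about 2.53
```

The double sum counts each unordered pair twice, which matches the N(N-1) ordered pairs in the denominator.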
CF = (5, (16,30), (54,190))
SS: square sum of N points, $SS = \sum_{i=1}^{N} X_i^2$
[Figure: the five points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on axes from 0 to 10]
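The CF triple above can be checked with a few lines of Python. The sketch assumes the per-dimension form used in the example, CF = (N, LS, SS), with LS the coordinate-wise linear sum and SS the coordinate-wise square sum; the function names are illustrative. It also shows that two CFs can be merged by adding their components, which is what lets a summary be updated without revisiting the raw points.

```python
def clustering_feature(points):
    """Return (N, LS, SS): count, per-dimension sum, per-dimension sum of squares."""
    n = len(points)
    ls = tuple(sum(coords) for coords in zip(*points))
    ss = tuple(sum(x * x for x in coords) for coords in zip(*points))
    return n, ls, ss

def merge(cf_a, cf_b):
    """Combine two clustering features by component-wise addition."""
    n = cf_a[0] + cf_b[0]
    ls = tuple(a + b for a, b in zip(cf_a[1], cf_b[1]))
    ss = tuple(a + b for a, b in zip(cf_a[2], cf_b[2]))
    return n, ls, ss

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)  # (5, (16, 30), (54, 190))

# Summarizing two subclusters separately and merging gives the same triple.
merged = merge(clustering_feature(points[:2]), clustering_feature(points[2:]))
centroid = tuple(s / cf[0] for s in cf[1])  # LS / N = (3.2, 6.0)
```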
CF-Tree in BIRCH
◼ Clustering feature:
◼ Summary of the statistics for a given subcluster: the 0-th, 1st,
nodes
The CF Tree Structure
[Figure: CF tree with a root and non-leaf nodes; a non-leaf node holds the entries CF1, CF2, CF3, CF5, which point to child1, child2, child3, child5]
n( n − 1)
parents
◼ Algorithm is O(n)
◼ Concerns
◼ Sensitive to insertion order of data points
◼ Because the size of leaf nodes is fixed, the resulting clusters may not be natural
measures
Density-Based Clustering Methods
◼ Handle noise
◼ One scan
◼ Density-reachable:
◼ A point p is density-reachable from a point q w.r.t. Eps, MinPts
if there is a chain of points p1, …, pn, p1 = q, pn = p such that
pi+1 is directly density-reachable from pi
◼ Density-connected:
◼ A point p is density-connected to a point q w.r.t. Eps, MinPts
if there is a point o such that both p and q are
density-reachable from o w.r.t. Eps and MinPts
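The two definitions above can be made concrete with a small sketch. It assumes the usual DBSCAN notion of directly density-reachable, which is not spelled out in the text above: p is directly density-reachable from q if p lies within Eps of q and q is a core point, meaning its Eps-neighborhood (counting q itself) holds at least MinPts points. The function names and the toy points are illustrative.

```python
import math
from collections import deque

def neighbors(points, q, eps):
    """All points within distance eps of q (q itself included)."""
    return [p for p in points if math.dist(p, q) <= eps]

def is_core(points, q, eps, min_pts):
    return len(neighbors(points, q, eps)) >= min_pts

def density_reachable(points, p, q, eps, min_pts):
    """True if p is density-reachable from q: a chain exists that starts at q,
    passes only through core points, and ends at p."""
    seen = {q}
    queue = deque([q])
    while queue:
        current = queue.popleft()
        if not is_core(points, current, eps, min_pts):
            continue  # only core points can extend the chain
        for nb in neighbors(points, current, eps):
            if nb == p:
                return True
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return False

def density_connected(points, p, q, eps, min_pts):
    """True if some point o makes both p and q density-reachable from o."""
    return any(
        density_reachable(points, p, o, eps, min_pts)
        and density_reachable(points, q, o, eps, min_pts)
        for o in points
    )

# Hypothetical data: five points on a line plus one far-away point.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
eps, min_pts = 1.1, 3

forward = density_reachable(pts, (4, 0), (1, 0), eps, min_pts)   # True
backward = density_reachable(pts, (1, 0), (4, 0), eps, min_pts)  # False: (4, 0) is not core
connected = density_connected(pts, (0, 0), (4, 0), eps, min_pts) # True, e.g. via o = (2, 0)
```

The example shows that density-reachability is not symmetric, while density-connectivity is: the two end points are both border points, yet they are connected through the core points between them.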
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
◼ Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
◼ Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier points, with Eps = 1cm and MinPts = 5]
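A compact sketch of DBSCAN along the lines described above: each unvisited core point starts a new cluster, which then grows through the Eps-neighborhoods of its core points, and points that end up in no cluster are marked as noise. The function names, the noise label -1, and the toy data are assumptions of this illustration; points are assumed to be distinct, hashable tuples.

```python
import math
from collections import deque

NOISE = -1

def region_query(points, q, eps):
    """All points within distance eps of q (q itself included)."""
    return [p for p in points if math.dist(p, q) <= eps]

def dbscan(points, eps, min_pts):
    labels = {}  # point -> cluster id, or NOISE
    cluster_id = 0
    for p in points:
        if p in labels:
            continue
        nbrs = region_query(points, p, eps)
        if len(nbrs) < min_pts:
            labels[p] = NOISE  # may later be relabeled as a border point
            continue
        labels[p] = cluster_id  # p is a core point: start a new cluster
        queue = deque(nbrs)
        while queue:
            q = queue.popleft()
            if labels.get(q) == NOISE:
                labels[q] = cluster_id  # border point reached from a core point
            if q in labels:
                continue
            labels[q] = cluster_id
            q_nbrs = region_query(points, q, eps)
            if len(q_nbrs) >= min_pts:  # q is core too, so keep expanding
                queue.extend(q_nbrs)
        cluster_id += 1
    return labels

# Hypothetical data: two groups (four corners around a center) and one outlier.
group_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
group_b = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0), (5.5, 5.5)]
points = group_a + group_b + [(10.0, 0.0)]

labels = dbscan(points, eps=1.0, min_pts=5)
```

With Eps = 1 and MinPts = 5, only the two center points are core points; the corners are border points that join the cluster of their center, and the isolated point stays labeled as noise.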
techniques
Grid-Based Clustering Method