08-Data_Mining_Clustering
Clustering
EL Moukhtar ZEMMOURI
ENSAM-Meknès
2023-2024
• Typical applications:
• Customer segmentation
• Data summarization
• Social network analysis
Classification vs Clustering
• Classification : Supervised learning
  • Learn a method for predicting the instance class from pre-labeled instances
• Clustering : Unsupervised learning
  • Find natural grouping of instances given un-labeled data
[Figure: two scatter plots over attributes a1 and a2, labeled data (classification) vs. un-labeled data (clustering)]
Data Clustering
• Let $X = \{ X_i = (x_{i1}, x_{i2}, \dots, x_{id}) \mid i = 1..n \}$ be a multidimensional dataset of $n$ examples, characterized by $d$ attributes $A_1, A_2, \dots, A_d$
• → $X$ is an $n \times d$ data matrix
• The clustering problem consists of partitioning the rows (examples) of $X$ into sets (clusters) $C_1, \dots, C_k$ such that the examples in each cluster are similar to one another.
• An important part of the clustering process is the design of an appropriate similarity function (a small sketch follows below).
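The similarity function is usually instantiated as a distance over the rows of the data matrix. The following is a minimal sketch, not part of the original slides, assuming numeric attributes and the Euclidean distance; the array values and function name are illustrative.

```python
import numpy as np

# n x d data matrix: n = 3 examples, d = 2 attributes
X = np.array([[1.0, 2.0],     # X_1 = (x_11, x_12)
              [1.5, 1.8],     # X_2
              [8.0, 8.0]])    # X_3

def dist(xi, xj):
    # Euclidean distance between two examples (rows of X)
    return np.linalg.norm(xi - xj)

print(dist(X[0], X[1]), dist(X[0], X[2]))  # X_1 is much closer to X_2 than to X_3
```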
Clustering Methods
• Many different methods and algorithms:
• For numeric and/or symbolic data
• Deterministic vs. probabilistic
• Exclusive vs. overlapping
• Hierarchical vs. flat
• Top-down vs. bottom-up
Clustering Evaluation
• Manual inspection
• Using existing labeled examples
• Cluster quality measures
• Distance measures
• High similarity within a cluster, low across clusters
Representative-Based Clustering
Representative-Based Clustering
• Formally :
• Given :
  • a dataset $X = \{ X_i \mid i = 1..n \}$ containing $n$ data points
  • the number of desired clusters $k$
  • a distance function $Dist(\cdot \, , \cdot)$
• Determine $k$ representatives $Y_1, \dots, Y_k$ that minimize the following objective function :
$$ O = \sum_{i=1}^{n} \min_{j} \; Dist(X_i, Y_j) $$
• → The sum of the distances of the different data points to their closest representatives needs to be minimized
Representative-Based Clustering
• Note that :
  • The representatives $Y_1, \dots, Y_k$ and
  • the optimal assignment of data points to representatives are unknown a priori,
  • but they depend on each other in a circular way.
Representative-Based Clustering
• → generic k-representatives approach
  • Start with $k$ initial representatives
  • Then refine representatives and assignments iteratively :
    • Assign each data point to its closest representative using the distance function $Dist$
    • Determine the optimal representative $Y_j$ for each cluster $C_j$ that minimizes its local objective function
$$ O_{local} = \sum_{X_i \in C_j} Dist(X_i, Y_j) $$
  • Typically, the improvement is significant in early iterations, but it slows down in later iterations.
  • → Stop when the improvement in the objective function in an iteration is less than a threshold.
• This k-representative approach defines a family of algorithms (a runnable sketch is given below) :
  • K-means
  • K-medians
  • …
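The following is a minimal Python sketch of the generic k-representatives loop described above. It is not taken from the slides; the modular structure (separate distance and representative functions) and all names are illustrative assumptions.

```python
import numpy as np

def euclidean_sq(x, y):
    # squared Euclidean distance between a data point and a representative
    return np.sum((x - y) ** 2)

def mean_representative(points):
    # optimal representative for the squared Euclidean distance: the cluster mean
    return points.mean(axis=0)

def k_representatives(X, k, dist=euclidean_sq, representative=mean_representative,
                      max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), size=k, replace=False)]        # k initial representatives
    prev_obj = np.inf
    for _ in range(max_iter):
        # assignment step: each point goes to its closest representative
        D = np.array([[dist(x, y) for y in Y] for x in X])
        labels = D.argmin(axis=1)
        obj = D.min(axis=1).sum()                            # global objective function
        # refinement step: recompute the representative of each non-empty cluster
        Y = np.array([representative(X[labels == j]) if np.any(labels == j) else Y[j]
                      for j in range(k)])
        if prev_obj - obj < tol:                             # improvement below threshold
            break
        prev_obj = obj
    return Y, labels
```

Swapping in the Manhattan distance and a coordinate-wise median for the representative function would give a k-medians variant of the same loop.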
k-Means
• A simple and efficient clustering algorithm
• Works only with numeric data
• The objective function to minimize is the sum of the squares of the Euclidean distances of data points to their closest representatives (centroids) :
$$ Dist(X_i, Y_j) = \left\| X_i - Y_j \right\|^2 $$
• In the case of the Euclidean distance function, it can be shown that the optimal centralized representative of each cluster is its mean.
k-Means
• K-means Algorithm
1. Randomly choose $k$ initial centers (at iteration $t = 0$) : $Y_1^0, \dots, Y_k^0$
2. Assign each data point $X_i$ to its nearest cluster center $Y_j^t$ :
$$ j = \arg\min_{l} \; \left\| X_i - Y_l^t \right\|^2 $$
3. Update each cluster center as the mean of the points assigned to it :
$$ Y_j^{t+1} = \arg\min_{Y} \sum_{X_i \in C_j} \left\| X_i - Y \right\|^2 $$
4. Repeat steps 2, 3 until convergence (change in cluster assignments less than a threshold)
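As a quick usage sketch, not part of the original slides: the same algorithm is available as `KMeans` in scikit-learn (assumed installed here); the toy data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centers :", km.cluster_centers_)   # the k centroids Y_1, ..., Y_k
print("labels  :", km.labels_)            # cluster index assigned to each X_i
print("SSQ     :", km.inertia_)           # sum of squared distances to closest centroids
```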
k-Means
• Note that :
  • $Dist(X_i, Y_j)$ can be viewed as the squared error of approximating a data point $X_i$ with the cluster center $Y_j$
  • → The overall objective minimizes the sum of squared errors over the different data points.
k-Means : step 1
✓ Example :
✓ Two attributes X and Y
✓ k = 3
✓ Pick 3 initial cluster centers (randomly)
[Figure: scatter plot of the (X, Y) example with the three initial cluster centers]
k-Means : step 2
[Figure: the (X, Y) example after each data point is assigned to its nearest center]
k-Means : step 3
[Figure: the (X, Y) example after the cluster centers are recomputed as the means of their clusters]
k-Means : step 4
✓ … (assignment and update are repeated until convergence)
[Figure: the (X, Y) example after further iterations]
k-Means Advantages
• Simple to implement
• Efficient : each iteration is linear in the number of data points
k-Means Issues
• Problem : clusters elongated along certain directions, or with varying local density
  • A remedy is the Mahalanobis k-means algorithm, which normalizes local distances with the cluster-specific covariance matrix $\Sigma_j$ (a small sketch of this distance is given below) :
$$ Dist(X_i, Y_j) = (X_i - Y_j) \, \Sigma_j^{-1} \, (X_i - Y_j)^T $$
  • The Mahalanobis distance is generally helpful when clusters are elongated along certain directions, because it performs a local density normalization.
• Problem : clusters of arbitrary (non-spherical) shape
  • k-means is biased toward clusters of spherical shape : it may break a natural cluster into parts, or merge parts of different natural clusters.
  • Even the Mahalanobis k-means algorithm does not work well in this scenario, despite its ability to adjust for the elongation of clusters ; density-based algorithms are designed to handle such data sets.
• Complexity : each iteration requires O(n · k · d) distance computations
• How to deal with outliers ?
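A minimal numpy sketch of the Mahalanobis distance used above, not from the slides; it assumes the covariance matrix $\Sigma_j$ is estimated from the current members of the cluster, and the data are illustrative.

```python
import numpy as np

def mahalanobis_dist(x, y, cluster_points):
    # distance of point x to representative y, normalized by the covariance
    # matrix estimated from the current members of y's cluster
    sigma = np.cov(cluster_points, rowvar=False)
    sigma_inv = np.linalg.pinv(sigma)       # pseudo-inverse guards against singular Sigma
    diff = x - y
    return float(diff @ sigma_inv @ diff)

# a cluster elongated along the first axis
cluster = np.array([[0.0, 0.0], [2.0, 0.1], [4.0, -0.1], [6.0, 0.2], [8.0, 0.0]])
center = cluster.mean(axis=0)
print(mahalanobis_dist(np.array([9.0, 0.0]), center, cluster))  # along the elongation
print(mahalanobis_dist(np.array([4.0, 1.0]), center, cluster))  # across the elongation
```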
Hierarchical Clustering
• Hierarchical algorithms typically cluster the data with the use of distances between clusters
• Why are hierarchical clustering methods useful ?
  • They generate different levels of clustering granularity
• Two families of methods : bottom-up (agglomerative) and top-down (divisive)
• Both methods produce a dendrogram
Bottom-Up Algorithms
• Generic agglomerative merging algorithm (a runnable sketch follows below)

Algorithm BottomUpClustering ( Dataset : X )
begin
  Initialize an n×n distance matrix M using X;
  repeat
    Pick the closest pair of clusters i and j using M (the least distance is selected);
    Merge clusters i and j;
    Delete rows/columns i and j from M and create a new row and column for the newly merged cluster;
    Update the entries of the new row and column of M;
  until termination criterion;
  return current set of clusters;
end

• A maximum threshold can be used on the distance between two merged clusters, or a minimum threshold can be used on the number of clusters.
• The pairwise cluster distances are encoded in the $n_t \times n_t$ distance matrix $M$ and are computed with the use of the merging criterion.
• Merging the clusters corresponding to rows (columns) $i$ and $j$ of $M$ requires a measure of distance between their constituent objects : two clusters containing $m_i$ and $m_j$ objects define $m_i \cdot m_j$ pairs of distances, and the overall distance between the two clusters is computed as a function of these pairs.
• The different choices for the merging criterion are described next.
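The following is a minimal sketch, not part of the original slides, of agglomerative clustering with SciPy (assumed installed); the `method` argument selects the merging criterion discussed on the next slide, and the toy data are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])   # toy 1-D dataset

# method='single' -> best (single) linkage, 'complete' -> worst (complete) linkage,
# 'average' -> group-average linkage, 'centroid' -> closest centroid
Z = linkage(X, method='single', metric='euclidean')
print(Z)          # each row: (cluster a, cluster b, merge distance, size of new cluster)

labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into 2 clusters
print(labels)
```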
Bottom-Up Algorithms
• Distance between clusters
  • Best (single) linkage
    • The distance is equal to the minimum distance between all pairs.
    • → The closest pair of examples between the two groups.
  • Worst (complete) linkage
    • The distance is equal to the maximum distance between all pairs.
    • → The farthest pair of examples between the two groups.
  • Group-average linkage
    • The distance is equal to the average distance between all pairs.
  • Closest centroid
    • The closest centroids are merged in each iteration.
    • Not desirable, because the centroids lose information about the relative spreads of the different clusters.
• Question : how to update the matrix M in each case ? (a small sketch is given below)
[Figure: two clusters, one containing examples A, B, C, D and the other containing E, F, with the pairwise distances between their members]
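As an answer sketch to the question above, and not part of the original slides: for the first three criteria, the new row of M for the merged cluster can be derived from the old rows i and j alone. The function name and signature are illustrative assumptions.

```python
import numpy as np

def merged_row(M, i, j, sizes, criterion="single"):
    # distance from the merged cluster (i U j) to every other cluster,
    # computed from the old rows i and j of the distance matrix M
    if criterion == "single":        # best linkage: minimum pairwise distance
        return np.minimum(M[i], M[j])
    if criterion == "complete":      # worst linkage: maximum pairwise distance
        return np.maximum(M[i], M[j])
    if criterion == "average":       # group-average: size-weighted mean of the two rows
        return (sizes[i] * M[i] + sizes[j] * M[j]) / (sizes[i] + sizes[j])
    raise ValueError(criterion)
```

The entries at positions i and j of the returned row are discarded when M is rebuilt; the closest-centroid criterion instead requires recomputing the merged centroid explicitly.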
Top-Down Algorithms
• Generic top-down algorithm
• Is based on a flat-clustering algorithm like k-Means

Algorithm TopDownClustering ( Dataset : X , flat-clustering algorithm A )
begin
  Initialize tree T to a root containing all examples of X;
  repeat
    Select a leaf node L in T based on a pre-defined criterion;
    Use algorithm A to split L into L1, L2, ..., Lk;
    Add L1, L2, ..., Lk as children of L in T;
  until termination criterion;
  return T
end

• The bisecting k-means algorithm : each node is split into exactly two children with a 2-means algorithm (a minimal sketch follows below).
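A minimal sketch of bisecting k-means, not from the slides; it assumes scikit-learn is available, splits the largest leaf at each step (the selection criterion is an assumption), and the toy data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, random_state=0):
    clusters = [np.arange(len(X))]                   # start with one cluster = all examples
    while len(clusters) < k:
        # selection criterion (assumed): split the largest leaf cluster
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(largest)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=random_state).fit_predict(X[idx])
        clusters.append(idx[labels == 0])            # the two children of the split node
        clusters.append(idx[labels == 1])
    return clusters                                  # list of index arrays, one per leaf

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [24.0], [28.0], [40.0]])
for c in bisecting_kmeans(X, k=3):
    print(X[c].ravel())
```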
Cluster Validation
Cluster Validation
• How to evaluate the quality of a clustering ?
  • Difficult for real datasets, since clustering is an unsupervised learning problem.
• Internal Validation Criteria
  • Sum of squared distances to centroids
  • Intracluster to intercluster distance ratio
  • Silhouette coefficient
• Internal measures are generally used for parameter tuning
  • The number of clusters k
  • Elbow method (a sketch follows the SSQ slide below)
Cluster Validation
• Sum of squared distances to centroids
  • Centroids of clusters are determined
  • Then the sum of squared distances (SSQ) of the data points to their centroids is reported
  • Smaller values indicate better cluster quality
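A short sketch of the elbow method mentioned earlier, not from the slides; it assumes scikit-learn and matplotlib are available and uses random illustrative data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # replace with your own data matrix

ks = range(1, 10)
ssq = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), ssq, marker="o")                   # look for the "elbow" in this curve
plt.xlabel("number of clusters k")
plt.ylabel("SSQ")
plt.show()
```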
Cluster Validation
• Intracluster to intercluster distance ratio
  • More detailed than the SSQ measure
  • Sample $r$ pairs of data points from the underlying data
  • Let $P$ be the set of sampled pairs that belong to the same cluster found by the algorithm
  • Let $Q$ be the remaining (inter-cluster) pairs
$$ Intra = \frac{1}{|P|} \sum_{(X_i, X_j) \in P} Dist(X_i, X_j) $$
$$ Inter = \frac{1}{|Q|} \sum_{(X_i, X_j) \in Q} Dist(X_i, X_j) $$
  • $\frac{Intra}{Inter}$ is the ratio of the average intracluster distance to the average intercluster distance ; small values indicate a better clustering (a small sketch follows below).
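A minimal sketch of this ratio, not from the slides; the function name and the use of the Euclidean distance are illustrative assumptions, and `labels` is the cluster assignment produced by any algorithm.

```python
import numpy as np

def intra_inter_ratio(X, labels, r=1000, seed=0):
    rng = np.random.default_rng(seed)
    intra, inter = [], []
    for _ in range(r):
        i, j = rng.choice(len(X), size=2, replace=False)   # sample a pair of distinct points
        d = np.linalg.norm(X[i] - X[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    # assumes r is large enough for both lists to be non-empty
    return np.mean(intra) / np.mean(inter)                 # small values -> better clustering
```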
Cluster Validation
• Silhouette coefficient
  • For a data point $X_i$ belonging to cluster $C_l$ :
  • $Davg_i^{in}$ : the average distance of $X_i$ to the other data points of its own cluster $C_l$
  • $Dmin_i^{out}$ : the minimum of the average distances of $X_i$ to the data points in each other cluster $C_j \neq C_l$ :
$$ Dmin_i^{out} = \min_{j \neq l} \; \frac{1}{|C_j|} \sum_{X_p \in C_j} Dist(X_i, X_p) $$
  • The silhouette coefficient of $X_i$ is :
$$ S_i = \frac{Dmin_i^{out} - Davg_i^{in}}{\max\{ Davg_i^{in}, Dmin_i^{out} \}} $$
  • Note that : $-1 \le S_i \le 1$, and $S_i = 0$ if $X_i$ is the only point of its cluster (arbitrary choice)
Cluster Validation
• Silhouette coefficient
  • The overall silhouette coefficient $S$ is the average of the data-point-specific coefficients $S_i$.
  • $-1 \le S \le 1$
  • Large negative values of $S$ indicate some “mixing” of data points from different clusters (a usage sketch follows below).
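A short usage sketch, not from the slides: the silhouette coefficient is available in scikit-learn (assumed installed); the random data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(200, 2))    # replace with your own data matrix

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))              # larger overall S -> better clustering
```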
Exercises
Exercises
• Exercise 1
1. Consider the 1-dimensional data set with 10 data points {1, 2, 3, ..., 10}.
Show three iterations of the k-means algorithm when k = 2, and the random centers are initialized to {1, 2}.
2. Repeat question 1 with initial centers {2, 9}.
How did the different choice of the initial centers affect the quality of the results?
Exercises
• Exercise 2
1. Consider a 1-dimensional data set with three natural clusters.
The first cluster contains the consecutive integers {1 . . . 5}.
The second cluster contains the consecutive integers {8...12}.
The third cluster contains the data points {24,28,32,36,40}.
Apply a k-means algorithm with initial centers of 1, 11, and 28.
Does the algorithm determine the correct clusters?
2. If the initial centers are changed to 1, 2, and 3, does the algorithm discover
the correct clusters?
What does this tell you?
Exercises
• Exercise 3
1. Write a (Python) program to implement the k-representative algorithm.
Use a modular program structure, in which the distance function and
centroid determination are separate functions.
Instantiate these functions to the cases of the k-means algorithm.
2. Implement k-means with visualization in each iteration
Exercises
• Exercise 4
1. Consider the 1-dimensional data set {1...10}.
Apply a hierarchical agglomerative approach, with the use of minimum,
maximum, and group average criteria for merging.
Show the first six merges.