08-Data_Mining_Clustering

The document discusses clustering as an unsupervised learning problem that involves partitioning data points into groups based on similarity. It outlines various clustering methods, including representative-based algorithms like k-means, and hierarchical clustering techniques. The document also highlights the advantages and challenges of these methods, such as sensitivity to outliers and the impact of initial conditions on results.


Knowledge Discovery and Data Mining

Clustering

EL Moukhtar ZEMMOURI
ENSAM-Meknès
2023-2024

The Clustering Problem


• Task:
• Given a set of data points, partition them into groups containing very similar data points.

• Data points (instances / examples) are not labeled.


• → Unsupervised learning problem

• Typical applications:
• Customer segmentation
• Data summarization
• Social network analysis
• Application to other data mining problems
• As a preprocessing step for classification, outlier analysis, …
Classification vs Clustering

• Classification: Supervised learning
• Learn a method for predicting the instance class from pre-labeled instances

• Clustering: Unsupervised learning
• Find natural grouping of instances given unlabeled data

[Figure: two scatter plots over attributes a1 and a2 — one with class-labeled points (classification), one with unlabeled points grouped into clusters (clustering)]

Data Clustering
• Let X = {x_i = (x_i1, x_i2, …, x_id) / i = 1..n} be a multidimensional dataset of n examples, characterized by d attributes A_1, A_2, …, A_d
• → X is an n×d data matrix

• The clustering problem consists of partitioning the rows (examples) of X into sets (clusters) C_1 … C_k such that the examples in each cluster are similar to one another.
• An important part of the clustering process is the design of an appropriate similarity function,
• which depends on the underlying data types.

Clustering Methods
• Many different methods and algorithms:
• For numeric and/or symbolic data
• Deterministic vs. probabilistic
• Exclusive vs. overlapping
• Hierarchical vs. flat
• Top-down vs. bottom-up


Clustering Evaluation
• Manual inspection
• Using existing labeled examples
• Cluster quality measures
• Distance measures
• High similarity within a cluster, low across clusters
Representative-Based Clustering

Representative-Based Clustering

• The simplest of all clustering algorithms


• Rely directly on intuitive notions of distance (or similarity).

• Clusters C_1 … C_k are created using a set of k representatives

• Question: How to determine the k representatives?
• Discovering a high-quality set of representatives
• → discovering high-quality clusters.

Representative-Based Clustering

• Formally :
• Given :
• a dataset X = {x_i / i = 1..n} containing n data points
• the number of desired clusters k
• a distance function dist(·,·)
• Determine k representatives y_1, …, y_k that minimize the following objective function:

Obj = Σ_{i=1}^{n} min_{1≤j≤k} dist(x_i, y_j)

• → The sum of the distances of the different data points to their closest representatives needs to be minimized (a small computation sketch follows below).
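To make the objective concrete, here is a minimal Python sketch (not part of the original slides) that evaluates Obj for a set of points and representatives, assuming Euclidean distance and NumPy arrays:

import numpy as np

def clustering_objective(X, Y):
    # X: (n, d) array of data points, Y: (k, d) array of representatives
    # Pairwise Euclidean distances between points and representatives -> shape (n, k)
    dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    # Each point contributes its distance to the closest representative
    return dists.min(axis=1).sum()

# Toy 1-D example: Obj = 1 + 0 + 1 + 0.5 + 0.5 = 3.0
X = np.array([[1.0], [2.0], [3.0], [9.0], [10.0]])
Y = np.array([[2.0], [9.5]])
print(clustering_objective(X, Y))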

Representative-Based Clustering

• Note that :
• The representatives y_1, …, y_k and
• the optimal assignment of data points to representatives are unknown a priori,
• but they depend on each other in a circular way.

• The problem can be solved using an iterative approach :


• candidate representatives and candidate assignments are used to improve each other.
Representative-Based Clustering
• → generic k-representatives approach
• Start with k initial representatives
• Then refine representatives and assignments iteratively:
• Assign each data point to its closest representative using the distance function dist
• Determine the optimal representative y_j for each cluster C_j that minimizes its local objective function:

LocalObj_j = Σ_{x_i ∈ C_j} dist(x_i, y_j)

• Typically, the improvement is significant in early iterations, but it slows down in later iterations.
• → Stop when the improvement in the objective function in an iteration is less than a threshold.
• This k-representative approach defines a family of algorithms:
• k-means
• k-medians
• …

Generic Representative Algorithm

Algorithm GenericRepresentative (Dataset: X, Number of representatives: k)
begin
  Initialize representative set Y
  repeat
    Create clusters C_1 … C_k by assigning each point in X to its closest
    representative in Y using the distance function dist
    Recreate the set Y by determining one representative y_j for each cluster
    C_j that minimizes Σ_{x_i ∈ C_j} dist(x_i, y_j)
  until convergence
  return C_1 … C_k
end

Different distance functions lead to different variations of this broader algorithm (a Python sketch follows below).

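A minimal Python sketch of this generic loop (illustrative, not the course's reference code), written in the modular style suggested later in Exercise 3: the distance function and the representative-determination step are passed as parameters, so plugging in the Euclidean distance with the mean gives k-means, and the L1 distance with the median gives k-medians.

import numpy as np

def euclidean(x, y):
    # squared Euclidean distance: the k-means choice
    return np.sum((x - y) ** 2)

def mean_representative(points):
    # optimal representative for squared Euclidean distance: the centroid (mean)
    return points.mean(axis=0)

def manhattan(x, y):
    # L1 distance: the k-medians choice
    return np.sum(np.abs(x - y))

def median_representative(points):
    # optimal representative for L1 distance: the coordinate-wise median
    return np.median(points, axis=0)

def generic_representative(X, k, dist, representative, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the representative set Y with k randomly chosen data points
    Y = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest representative
        new_labels = np.array([np.argmin([dist(x, y) for y in Y]) for x in X])
        if labels is not None and np.array_equal(new_labels, labels):
            break  # convergence: assignments no longer change
        labels = new_labels
        # Update step: recompute one representative per non-empty cluster
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                Y[j] = representative(members)
    return labels, Y

# k-means and k-medians are two instantiations of the same loop
X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
labels_means, Y_means = generic_representative(X, 2, euclidean, mean_representative)
labels_medians, Y_medians = generic_representative(X, 2, manhattan, median_representative)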
K-Means


k-Means
• A simple and efficient clustering algorithm
• Works only with numeric data
• The objective function to minimize is the sum of the squares of the Euclidean
distances of data points to their closest representatives (centroids).
dist(x_i, y_j) = ||x_i − y_j||²

• In the case of the Euclidean distance function, it can be shown that the optimal centralized representative of each cluster is its mean (centroid).

k-Means
• k-means Algorithm (a NumPy sketch follows below)
1. Randomly choose k initial centers (at iteration t = 0): y_1^0, …, y_k^0
2. Assign each data point x_i to its nearest cluster center:
   j = arg min_j ||x_i − y_j^t||²
3. Move each cluster center to the mean of its assigned examples:
   y_j^{t+1} = arg min_y Σ_{x_i ∈ C_j} ||x_i − y||²
4. Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold)
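For concreteness, a compact vectorized NumPy version of these four steps (an illustrative sketch; the convergence test simply stops when assignments no longer change):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly choose k initial centers among the data points
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (squared Euclidean distance)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # shape (n, k)
        new_labels = d2.argmin(axis=1)
        # Step 4: stop when the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Toy example: the 1-D dataset of Exercise 1 with k = 2
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
labels, centers = kmeans(X, k=2)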

k-Means
• Note that :
• dist(x_i, y_j) = ||x_i − y_j||² can be viewed as the squared error of approximating a data point x_i with the cluster center y_j
• → The overall objective minimizes the sum of squared errors (SSE) over the different data points.

k-Means : step 1

✓ Example:
✓ Two attributes X and Y
✓ k = 3
✓ Pick 3 initial cluster centers (randomly)

[Figure: scatter plot of the data over attributes X and Y with three randomly chosen initial centers]

k-Means : step 2

✓ Assign each data point to its nearest cluster center

[Figure: the same scatter plot with each point assigned to the closest of the three centers]
k-Means : step 3

✓ Move each cluster center to the mean of its assigned items

[Figure: the three centers moved to the means of their assigned points]

k-Means : step 4

✓ Reassign points closest to a different new cluster center
✓ …

[Figure: points reassigned after the centers have moved]
k-Means Advantages

• Simple and understandable


• Simple to implement

• Scales to large datasets


• Examples are automatically assigned to clusters
• → Easily adapts to new examples.


k-Means Issues

• Results can vary significantly depending on the initial choice of k and of the centroids (seeds)
• Can get trapped in a local minimum

[Figure: an example where badly placed initial cluster centers lead the algorithm to a poor local minimum]

• Remedy:
• To increase the chance of finding the global optimum: restart with different random seeds (see the sketch below).
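One common way to apply this remedy (a sketch assuming scikit-learn is available) is to run k-means from several random initializations and keep the run with the lowest objective; scikit-learn's KMeans does exactly this through its n_init parameter:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((200, 2))   # toy 2-D dataset

# n_init=10 runs k-means from 10 different random seeds and keeps
# the run with the lowest sum of squared distances (inertia_)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_, km.cluster_centers_)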

k-Means Issues

• The k-means algorithm does not work well when the clusters are of arbitrary shape
• → non-linearly separable data
• Remedy:
• Data transformation
• Use a kernel to transform the data so that arbitrarily shaped clusters map to Euclidean clusters (linearly separable) in the new space.
• → Kernel k-means as an extension of k-means
• Problem:
• Complexity: the size of the kernel matrix alone is quadratic in the number of data points.
k-Means Issues

• How to deal with outliers?
• k-means is too sensitive to outliers!

• k-medoids / k-medians
• Instead of the mean, use the median (or a medoid, an actual data point) of each cluster
• Example:
• Mean of 1, 3, 5, 7, 9 is 5
• Mean of 1, 3, 5, 7, 1009 is 205
• Median of 1, 3, 5, 7, 1009 is 5
• → Median advantage: not affected by extreme values (a small check follows below).
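A tiny NumPy check of this example (illustrative only):

import numpy as np

values = np.array([1, 3, 5, 7, 1009])
print(np.mean(values))    # 205.0 -> dragged away by the outlier
print(np.median(values))  # 5.0   -> unaffected by the extreme value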
Hierarchical Clustering


Hierarchical Clustering
• Hierarchical algorithms typically cluster the data using distances
• Why are hierarchical clustering methods useful?
• Generate different levels of clustering
• → provide a taxonomy of clusters which may be browsed for different application-specific insights.

[Figure 6.6 (from the textbook): multigranularity insights from hierarchical clustering — e.g. Web pages organized at the highest level into topics such as arts, science and health, and at the next level into subtopics]

Hierarchical Clustering
• Two types of hierarchical algorithms
• Bottom-up (agglomerative) methods
• Start with single-instance clusters
• At each step, merge the two closest clusters
• Design decision: the choice of objective function used to
decide the merging of the clusters.
• Top-down (divisive) methods
• Start with one universal cluster
• Find two clusters
• Proceed recursively on each subset

• Both methods produce a dendrogram


Bottom-Up Algorithms
• Generic agglomerative merging algorithm
Algorithm BottomUpClustering (Dataset: X)
begin
  Initialize an n×n distance matrix M using X;
  repeat
    Pick the closest pair of clusters i and j using M (the least distance is selected);
    Merge clusters i and j;
    Delete rows/columns i and j from M and create a new row and column for the
    newly merged cluster;
    Update the entries of the new row and column of M;
  until termination criterion;
  return current merged cluster set
end

How to compute the distances between clusters? (A SciPy-based sketch follows below.)
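As a practical aside (not part of the original slides), the agglomerative procedure with the usual merging criteria is available in SciPy; a minimal sketch on a toy dataset:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).random((20, 2))   # toy 2-D dataset

# 'single', 'complete' and 'average' correspond to the best-linkage,
# worst-linkage and group-average criteria discussed below
Z = linkage(X, method='single')

# Cut the dendrogram to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')

# dendrogram(Z) can be used to visualize the full merge hierarchy (requires matplotlib)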


Bottom-Up Algorithms
• Example:

[Figure 6.8 (from the textbook): illustration of hierarchical clustering steps]

• For termination:
• A maximum threshold can be used on the distance between two merged clusters
• or a minimum threshold can be used on the number of clusters.

Bottom-Up Algorithms

• Distance between clusters can be computed as follows:
• For two clusters C_i and C_j containing n_i and n_j examples,
• there are n_i × n_j pairs of distances between constituent examples.
• → The overall distance between C_i and C_j can be computed as a function of all these pairs.
• Different ways of aggregating the distances are used:
• Best (single) linkage
• Worst (complete) linkage
• Group-average linkage
• Closest centroid
• …

[Figure: two clusters — one containing points A, B, C, D and the other containing points E, F — with edges showing the pairwise distances between their members]

Bottom-Up Algorithms
• Distance between clusters
• Best (single) linkage
• The distance is equal to the minimum distance between all pairs.
• → The closest pair of examples between the two groups.
• Worst (complete) linkage
• The distance is equal to the maximum distance between all pairs.
• → The farthest pair of examples between the two groups.
• Group-average linkage
• The distance is equal to the average distance between all pairs.
• Closest centroid
• The closest centroids are merged in each iteration.
• Not desirable, because the centroids lose information about the relative spreads of the different clusters.

• Question: how to update the matrix M in each case? (A sketch for single and complete linkage follows below.)
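To make the update question concrete, here is one possible sketch (an assumption about how it could be coded, not the course's code) of updating M after merging clusters i and j: under single linkage the new row is the elementwise minimum of the two old rows, under complete linkage the elementwise maximum; group-average and centroid criteria would additionally need the cluster sizes.

import numpy as np

def merge_clusters(M, i, j, criterion='single'):
    # M is a symmetric (m, m) matrix of pairwise cluster distances.
    # The merged cluster becomes the last row/column of the returned matrix.
    keep = [r for r in range(len(M)) if r not in (i, j)]
    # Distance of every remaining cluster to the merged cluster
    if criterion == 'single':       # best linkage: minimum of the two old distances
        merged = np.minimum(M[i, keep], M[j, keep])
    elif criterion == 'complete':   # worst linkage: maximum of the two old distances
        merged = np.maximum(M[i, keep], M[j, keep])
    else:
        raise ValueError('group-average and centroid updates also need cluster sizes')
    # Rebuild M: kept clusters first, merged cluster as the last row/column
    newM = np.empty((len(keep) + 1, len(keep) + 1))
    newM[:len(keep), :len(keep)] = M[np.ix_(keep, keep)]
    newM[-1, :len(keep)] = merged
    newM[:len(keep), -1] = merged
    newM[-1, -1] = 0.0
    return newM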

Top-Down Algorithms
• Generic top-down algorithm
• Is based on a flat-clustering algorithm like k-Means
Algorithm TopDownClustering (Dataset: X, flat-clustering algorithm A)
begin
  Initialize tree T to a root containing all examples of X;
  repeat
    Select a leaf node L in T based on a pre-defined criterion;
    Use algorithm A to split L into L_1, L_2, …, L_k;
    Add L_1, L_2, …, L_k as children of L in T;
  until termination criterion;
  return T
end

The bisecting k-means algorithm: each node is split into exactly two children with a 2-means algorithm (a sketch follows below).
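A minimal sketch of bisecting k-means (illustrative; it assumes scikit-learn's KMeans and uses "split the largest leaf" as the pre-defined selection criterion, which is only one common choice):

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    # Each leaf is represented by the array of row indices it contains
    leaves = [np.arange(len(X))]
    while len(leaves) < k:
        # Pre-defined criterion (one common choice): split the largest leaf
        leaves.sort(key=len, reverse=True)
        leaf = leaves.pop(0)
        # 2-means split of the selected leaf
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[leaf])
        leaves.append(leaf[labels == 0])
        leaves.append(leaf[labels == 1])
    return leaves

# Usage: indices of the points in each of the k leaf clusters
X = np.random.default_rng(0).random((100, 2))
clusters = bisecting_kmeans(X, k=4)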
Cluster Validation


Cluster Validation
• How to evaluate the quality of a clustering ?
• Difficult for real datasets, since clustering is an unsupervised learning problem.
• Internal Validation Criteria
• Sum of square distances to centroids
• Intracluster to intercluster distance ratio
• Silhouette coefficient
• Internal measures are generally used for parameter tuning
• The number of clusters k
• Elbow method

Cluster Validation
• Sum of square distances to centroids
• Centroids of clusters are determined
• Then the sum of squared distances (SSQ) is reported
• Smaller values indicate better cluster quality (an elbow-method sketch for tuning k follows below)
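Since the previous slide mentions the elbow method for tuning k, here is a minimal sketch of it (illustrative, assuming scikit-learn and matplotlib): compute the SSQ, exposed by KMeans as inertia_, for a range of k values and look for the point where the decrease flattens.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((300, 2))   # toy dataset

ks = range(1, 11)
ssq = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), ssq, marker='o')
plt.xlabel('k'); plt.ylabel('SSQ (inertia)')
plt.show()   # the "elbow" of the curve suggests a reasonable k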

Cluster Validation
• Intracluster to intercluster distance ratio
• More detailed than the SSQ measure
• Sample r pairs of data points from the underlying data
• Let P be the set of pairs that belong to the same cluster found by the algorithm.
• Let Q be the remaining pairs

Intra = ( Σ_{(x_i, x_j) ∈ P} dist(x_i, x_j) ) / |P|

Inter = ( Σ_{(x_i, x_j) ∈ Q} dist(x_i, x_j) ) / |Q|

• Intra / Inter is the ratio of the average intracluster distance to the average intercluster distance.
• Small values of this measure indicate better clustering behavior (a sketch follows below).

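A minimal NumPy sketch of this measure (illustrative; it assumes Euclidean distances and an integer labels array produced by some clustering algorithm):

import numpy as np

def intra_inter_ratio(X, labels, r=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Sample r random pairs of distinct data points
    i = rng.integers(0, len(X), size=r)
    j = rng.integers(0, len(X), size=r)
    valid = i != j
    i, j = i[valid], j[valid]
    d = np.linalg.norm(X[i] - X[j], axis=1)
    same = labels[i] == labels[j]   # P: pairs that fall in the same cluster
    intra = d[same].mean()          # average intracluster distance
    inter = d[~same].mean()         # average intercluster distance
    return intra / inter            # small values -> better clustering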
Cluster Validation
• Silhouette coefficient
• x_i is a data point within a cluster C_r
• Davgin_i : the average distance of x_i to all the other data points in the same cluster C_r

Davgin_i = ( 1 / (n_r − 1) ) Σ_{x_j ∈ C_r} dist(x_i, x_j)

• Dminout_i : the minimum of the average distances of x_i to the data points in each cluster C_s ≠ C_r

Dminout_i = min_{s ≠ r} ( 1 / n_s ) Σ_{x_j ∈ C_s} dist(x_i, x_j)

• The silhouette coefficient S_i specific to x_i is:

S_i = ( Dminout_i − Davgin_i ) / max{ Davgin_i, Dminout_i }

• Note that −1 ≤ S_i ≤ 1, and S_i = 0 if n_r = 1 (arbitrary choice)


Cluster Validation

• Silhouette coefficient

• The overall silhouette coefficient S is the average of the data-point-specific coefficients S_i.

• −1 ≤ S ≤ 1

• Large positive values of S indicate highly separated clustering

• Large negative values of S indicate some “mixing” of data points from different clusters (a sketch follows below)
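In practice the silhouette coefficient is available off the shelf; a minimal sketch assuming scikit-learn:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

X = np.random.default_rng(0).random((300, 2))
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

S = silhouette_score(X, labels)      # overall coefficient (average of the S_i)
S_i = silhouette_samples(X, labels)  # per-point coefficients
print(S)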
Exercises


Exercises
• Exercise 1
1. Consider the 1-dimensional data set with 10 data points {1,2,3,...10}.
Show three iterations of the k-means algorithm when k = 2, and the random
centers are initialized to {1, 2}.
2. Repeat question 1 with initial centers {2,9}.
How did the different choice of the initial centers affect the quality of the
results?
Exercises
• Exercise 2
1. Consider a 1-dimensional data set with three natural clusters.
The first cluster contains the consecutive integers {1 . . . 5}.
The second cluster contains the consecutive integers {8...12}.
The third cluster contains the data points {24,28,32,36,40}.
Apply a k-means algorithm with initial centers 1, 11, and 28.
Does the algorithm determine the correct clusters?
2. If the initial centers are changed to 1, 2, and 3, does the algorithm discover
the correct clusters?
What does this tell you?


Exercises
• Exercise 3
1. Write a (Python) program to implement the k-representative algorithm.
Use a modular program structure, in which the distance function and
centroid determination are separate functions.
Instantiate these functions to the cases of the k-means algorithm.
2. Implement k-means with visualization in each iteration
Exercises
• Exercise 4
1. Consider the 1-dimensional data set {1...10}.
Apply a hierarchical agglomerative approach, with the use of minimum,
maximum, and group average criteria for merging.
Show the first six merges.

