Lec09 Clustering
What is Clustering?
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis: grouping a set of data objects into clusters
Examples of Clustering Applications
• Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
• Land use: Identification of areas of similar land use in an earth
observation database
• Insurance: Identifying groups of motor insurance policy holders with a
high average claim cost
• Urban planning: Identifying groups of houses according to their house
type, value, and geographical location
• Seismology: Observed earthquake epicenters should be clustered along
continent faults
What Is a Good Clustering?
• A good clustering produces clusters with high intra-cluster similarity and
low inter-cluster similarity
Similarity and Dissimilarity Between Objects
• Euclidean distance:

$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
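A minimal NumPy sketch of this distance (the function name euclidean is our own, not from the lecture):

```python
import numpy as np

def euclidean(xi, xj):
    """Euclidean distance between two p-dimensional points."""
    xi, xj = np.asarray(xi, dtype=float), np.asarray(xj, dtype=float)
    return np.sqrt(np.sum(np.abs(xi - xj) ** 2))

print(euclidean([0, 0], [3, 4]))  # 5.0
```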
Major Clustering Approaches
• Partitioning methods (e.g., k-means, k-medoids)
• Hierarchical methods (agglomerative, divisive)
• Density-based methods
• Model-based methods
Partitioning Algorithms
• Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters
• Given k, find a partition into k clusters that optimizes the chosen
partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means (MacQueen, 1967): Each cluster is represented by the
center of the cluster
– k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw, 1987): Each cluster is represented by one of the objects
in the cluster
K-Means Clustering
• Given k, the k-means algorithm consists of four steps:
– Select initial centroids at random.
– Assign each object to the cluster with the nearest centroid.
– Compute each centroid as the mean of the objects assigned
to it.
– Repeat the previous two steps until no change (see the sketch below).
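A minimal NumPy sketch of these four steps (function and parameter names are our own; an empty cluster keeps its old centroid, one common convention among several):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means on an (n, p) data matrix X; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: select initial centroids at random (without replacement).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned objects.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        # Step 4: repeat until no change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```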
Algorithm Definition
• The K-Means algorithm is a method for clustering objects, based on their
attributes, into k partitions.
• It implicitly assumes roughly spherical clusters, such as those produced
by isotropic Gaussian distributions.
• It assumes that the object attributes form a vector space.
• Its objective is to minimize the total intra-cluster variance.
Algorithm Fitness Function
• The K-Means algorithm attempts to minimize the squared error over all
elements in all clusters:

$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2$

• where E is the sum of squared errors over all elements in the data set,
p is a given element, and $m_i$ is the mean of cluster $C_i$.
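The criterion E can be computed directly from the notation above; a small sketch (the function name sse is ours):

```python
import numpy as np

def sse(X, labels, centroids):
    """E: sum of squared distances from each element to its cluster mean."""
    return sum(
        np.sum((X[labels == i] - centroids[i]) ** 2)
        for i in range(len(centroids))
    )
```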
K-Means Algorithm
• Input
– k: the number of clusters
– D: a dataset containing n elements
K-Means Clustering (Example)
(Figure: four scatter plots on a 0–10 grid illustrating successive iterations of k-means on the example data.)
Comments on the K-Means Method
• Strengths
– Relatively efficient: O(tkn), where n is # objects, k is # clusters,
and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may be
found using techniques such as simulated annealing and genetic
algorithms
• Weaknesses
– Applicable only when mean is defined (what about categorical data?)
– Need to specify k, the number of clusters, in advance
– Trouble with noisy data and outliers
– Not suitable for discovering clusters with non-convex shapes
K-medoids Clustering
• K-means is appropriate when we can work with Euclidean distances
• Thus, K-means can work only with numerical, quantitative variable
types
• Euclidean distances do not work well in at least two situations
– Some variables are categorical
– Outliers can severely distort the cluster means
• A general version of K-means algorithm called K-medoids can work
with any distance measure
• K-medoids clustering is computationally more intensive
K-medoids Algorithm
• Step 1: For a given cluster assignment C, find the observation in each
cluster minimizing the total distance to the other points in that cluster:

$i_k^{*} = \underset{\{i \,:\, C(i)=k\}}{\arg\min} \sum_{C(j)=k} d(x_i, x_j), \qquad m_k = x_{i_k^{*}}, \; k = 1, \ldots, K$

• Step 2: Given the current set of cluster centers $\{m_1, \ldots, m_K\}$,
minimize the total error by assigning each observation to the closest
(current) cluster center:

$C(i) = \underset{1 \le k \le K}{\arg\min}\; d(x_i, m_k), \qquad i = 1, \ldots, N$

• Iterate steps 1 and 2 until the assignments do not change
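A minimal sketch of this alternation, assuming a precomputed (n, n) pairwise distance matrix D (any dissimilarity works; the names are our own):

```python
import numpy as np

def kmedoids(D, k, max_iter=100, seed=0):
    """Alternating k-medoids on an (n, n) pairwise distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    labels = D[:, medoids].argmin(axis=1)
    for _ in range(max_iter):
        # Step 1: within each cluster, pick the observation minimizing
        # the total distance to the other points in that cluster.
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members):
                sub = D[np.ix_(members, members)]
                medoids[j] = members[sub.sum(axis=1).argmin()]
        # Step 2: assign each observation to the closest medoid.
        new_labels = D[:, medoids].argmin(axis=1)
        if np.array_equal(new_labels, labels):  # iterate until no change
            break
        labels = new_labels
    return medoids, labels
```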
K-medoids Summary
• Generalized K-means
• Computationally much costlier than K-means
• Apply when dealing with categorical data
• Apply when the data points themselves are not available and only pairwise
distances are
• Converges to a local minimum
Hierarchical Clustering
• Two types: (1) agglomerative (bottom up), (2) divisive (top down)
• Agglomerative: two groups are merged if distance between them is less
than a threshold
• Divisive: one group is split into two if the intergroup distance is more
than a threshold
• When the process is monotonic (the dissimilarity between merged clusters
increases at each step), it can be displayed in an excellent graphical
representation called a "dendrogram". Agglomerative clustering possesses
this property; not all divisive methods do.
• Heights of nodes in a dendrogram are proportional to the threshold
value that produced them.
An Example Hierarchical Clustering
Hierarchical Clustering
• Use distance matrix as clustering criteria.
• This method does not require the number of clusters k as an input, but
needs a termination condition
(Diagram: agglomerative clustering (bottom up) proceeds from singletons {a}, {b}, {c}, {d}, {e} at step 0, merging {a, b} and {d, e}, then {c, d, e}, and finally {a, b, c, d, e} at step 4; divisive clustering (top down) performs the same sequence in reverse.)
Agglomerative Nesting (Bottom Up)
• Produces tree of clusters (nodes)
• Initially: each object is a cluster (leaf)
• Recursively merges nodes that have the least dissimilarity
• Criteria: min distance, max distance, avg distance, center distance
• Eventually all nodes belong to the same cluster (root)
(Figure: three scatter plots on a 0–10 grid illustrating successive agglomerative merging steps.)
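In practice this is typically done with a library; a sketch using SciPy's hierarchical-clustering routines on toy data (the data, linkage criterion, and cluster count are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).random((20, 2))     # toy data
Z = linkage(pdist(X), method="average")          # also: "single", "complete", "centroid"
labels = fcluster(Z, t=3, criterion="maxclust")  # terminate when 3 clusters remain
```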
A Dendrogram Shows How the Clusters Are Merged Hierarchically
• Decompose data objects into several levels of nested partitioning (tree
of clusters), called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram
at the desired level. Then each connected component forms a cluster.
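To visualize the tree and cut it at a desired level, SciPy's dendrogram and fcluster can be used (Z is the linkage matrix from the earlier sketch; the cut height 1.0 is an arbitrary assumption):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster

dendrogram(Z)                                      # draw the merge tree
plt.show()
labels = fcluster(Z, t=1.0, criterion="distance")  # cut the dendrogram at height 1.0
```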
Divisive Analysis (Top Down)
• Inverse order of Agglomerative
• Start with root cluster containing all objects
• Recursively divide into subclusters
• Eventually each cluster contains a single object
(Figure: three scatter plots on a 0–10 grid illustrating successive divisive splitting steps.)
Linkage Functions
• We know how to measure the distance between two objects, but defining the distance
between an object and a cluster, or between two clusters, is not obvious.
– Single linkage (nearest neighbor): the distance between two clusters is
determined by the distance of the two closest objects (nearest neighbors)
in the different clusters:

$d_{SL}(G, H) = \min_{i \in G,\; j \in H} d_{ij}$

– Group average linkage: the distance between two clusters is calculated as
the average distance between all pairs of objects in the two different
clusters:

$d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{j \in H} d_{ij}$
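Both definitions reduce to simple reductions over the cross-cluster distance block; a sketch with SciPy (the function names are ours):

```python
from scipy.spatial.distance import cdist

def single_linkage(G, H):
    """d_SL(G, H): distance between the two closest points across clusters."""
    return cdist(G, H).min()

def group_average(G, H):
    """d_GA(G, H): mean of all N_G * N_H cross-cluster distances."""
    return cdist(G, H).mean()
```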
Linkage Functions
• SL considers only a single pair of data points; if this pair is close
enough then action is taken. So, SL can form a “chain” by combining
relatively far apart data points.
• SL often violates the compactness property of a cluster: it can produce
clusters with large diameters, where the diameter of a cluster G is

$D_G = \max_{i \in G,\; j \in G} d_{ij}$
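The diameter is just the largest pairwise distance within a cluster; a one-line sketch (the function name is ours):

```python
from scipy.spatial.distance import pdist

def diameter(G):
    """D_G: maximum pairwise distance within cluster G."""
    return pdist(G).max() if len(G) > 1 else 0.0
```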
Density-Based Clustering Methods
• Clustering based on density (local cluster criterion), such as density-
connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
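The slide does not name a specific algorithm; DBSCAN is the canonical density-based method, and a scikit-learn sketch looks like this (eps and min_samples are the density parameters; the data and values are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).random((200, 2))            # toy data
labels = DBSCAN(eps=0.08, min_samples=5).fit_predict(X)  # label -1 marks noise
```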
Model-Based Clustering
• Assume the data are generated from K probability distributions
• Typically Gaussian distributions: a soft or probabilistic version of
K-means clustering
• Need to find the distribution parameters
• EM algorithm
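A sketch of this soft version using scikit-learn's GaussianMixture, which fits the distribution parameters with EM (the data and n_components are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).random((200, 2))  # toy data
gm = GaussianMixture(n_components=3).fit(X)    # EM estimates means/covariances
probs = gm.predict_proba(X)                    # soft (probabilistic) assignments
labels = gm.predict(X)                         # hard labels, k-means-style
```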