Week 07 Lecture Material
Week 7: Clustering
Pabitra Mitra
Computer Science and Engineering, IIT Kharagpur
Clustering
• An unsupervised learning method
• Used for exploratory data analysis
What is clustering?
• Organizing data into classes such that there is
• high intra-class similarity
• low inter-class similarity
• Finding the class labels and the number of classes directly
from the data (in contrast to classification).
• More informally, finding natural groupings among objects.
What is a natural grouping among these objects?
Clustering is subjective
[Figure: alternative groupings of the same objects: Simpson's Family, School Employees, Females, Males]
What is similarity?
The quality or state of being similar; likeness; resemblance; as, a similarity of features.
Similarity is hard to define.
Defining distance measures
Definition: Let O1 and O2 be two objects from the universe of
possible objects. The distance (dissimilarity) between O1 and O2 is a
real number denoted by D(O1,O2)
[Figure: a distance function maps a pair of objects, e.g. "Peter" and "Piotr", to a single real number such as 0.23, 3, or 342.7]

What properties should a distance measure have?
• D(A, B) = D(B, A)        (Symmetry)
• D(A, B) = 0 iff A = B    (Reflexivity)

Example: string edit distance, defined by the recurrence
  d('', '') = 0
  d(s, '') = d('', s) = |s|   -- i.e. the length of s
  d(s1+ch1, s2+ch2) = min( d(s1, s2) + if ch1 = ch2 then 0 else 1 fi,
                           d(s1+ch1, s2) + 1,
                           d(s1, s2+ch2) + 1 )
The edit distance between "Peter" and "Piotr", for example, is 3; a Python sketch follows below.
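A minimal Python sketch of this recurrence, computed bottom-up with dynamic programming (the function name edit_distance is our own choice):

def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein edit distance between two strings, via dynamic programming."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                     # d(s, '') = |s|
    for j in range(n + 1):
        dp[0][j] = j                     # d('', s) = |s|
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,   # match or substitute
                           dp[i - 1][j] + 1,          # delete from s1
                           dp[i][j - 1] + 1)          # insert into s1
    return dp[m][n]

print(edit_distance("Peter", "Piotr"))   # prints 3, as on the slide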
Two types of clustering
• Partitional algorithms: Construct various partitions and then evaluate them by some
criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using
some criterion
Desirable Properties of clustering algorithm
• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• Incorporation of user-specified constraints
Summarizing similarity measurements
• The similarity measurements are summarized in a dendrogram.
• The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
[Figure: anatomy of a dendrogram: root, internal nodes, internal branches, terminal branches, and leaves]
Hierarchical clustering using string edit distance
Pedro (Portuguese)
Petros (Greek), Peter (English), Piotr (Polish), Peadar
(Irish), Pierre (French), Peder (Danish), Peka
(Hawaiian), Pietro (Italian), Piero (Italian Alternative),
Petr (Czech), Pyotr (Russian)
Cristovao (Portuguese)
Christoph (German), Christophe (French), Cristobal
(Spanish), Cristoforo (Italian), Kristoffer (Scandinavian),
Krystof (Czech), Christopher (English)
Miguel (Portuguese)
Michalis (Greek), Michael (English), Mick (Irish!)
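As an illustrative sketch only, pairwise edit distances between a few of these names can be fed into SciPy's hierarchical clustering (uses the edit_distance function sketched above; the shortened name list and single linkage are our own choices, and drawing the dendrogram requires matplotlib):

import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

names = ["Pedro", "Petros", "Peter", "Piotr", "Pierre",
         "Cristovao", "Christoph", "Christopher",
         "Miguel", "Michael", "Michalis"]

# symmetric matrix of pairwise string edit distances
n = len(names)
D = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    D[i, j] = D[j, i] = edit_distance(names[i], names[j])

Z = linkage(squareform(D), method="single")   # single-linkage hierarchy
dendrogram(Z, labels=names)                   # plot the dendrogram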
Hierarchical clustering
• Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
• The number of possible dendrograms with n leaves is (2n − 3)! / [2^(n−2) (n − 2)!] (a quick check of this count is sketched below).
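A quick check of this count as code (our own helper; exact integer arithmetic):

from math import factorial

def num_dendrograms(n):
    # (2n - 3)! / (2^(n-2) * (n - 2)!)  for n >= 2 leaves
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

print([num_dendrograms(n) for n in range(2, 8)])   # [1, 3, 15, 105, 945, 10395]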
Distance matrix
We begin with a distance matrix which contains the distances between every pair of objects in our database.
[Figure: example 5 × 5 distance matrix (upper triangle shown), with two highlighted entries D(·,·) = 8 and D(·,·) = 1:
   0  8  8  7  7
      0  2  4  4
         0  3  3
            0  1
               0 ]
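A minimal Python sketch of this bottom-up procedure using single linkage on a distance matrix like the one above (the function and variable names are ours; library routines such as scipy.cluster.hierarchy.linkage are far more efficient):

import itertools

def agglomerative_single_linkage(dist):
    """dist: symmetric n x n distance matrix. Returns the sequence of merges."""
    clusters = {i: {i} for i in range(len(dist))}   # each item starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # closest pair of clusters; single linkage = minimum pairwise object distance
        (a, b), d = min(
            (((a, b), min(dist[i][j] for i in clusters[a] for j in clusters[b]))
             for a, b in itertools.combinations(clusters, 2)),
            key=lambda t: t[1])
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] | clusters[b]      # fuse the best pair
        del clusters[b]
    return merges

# the example distance matrix from above
dist = [[0, 8, 8, 7, 7],
        [8, 0, 2, 4, 4],
        [8, 2, 0, 3, 3],
        [7, 4, 3, 0, 1],
        [7, 4, 3, 1, 0]]
for a, b, d in agglomerative_single_linkage(dist):
    print(sorted(a), "+", sorted(b), "at distance", d)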
Minimal Spanning Tree – Single Linkage
• Build MST (Minimum Spanning Tree)
– Start with a tree that consists of any point
– In successive steps, look for the closest pair of points (p, q) such that
one point (p) is in the current tree but the other (q) is not
– Add q to the tree and put an edge between p and q
MST: Divisive Hierarchical Clustering
• Use MST for constructing hierarchy of clusters
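A small Prim-style sketch of this MST construction from a distance matrix (names are ours); for the divisive strategy, one then repeatedly removes the largest remaining MST edge and takes the resulting connected components as clusters:

def mst_edges(D):
    """Build a minimum spanning tree from an n x n distance matrix D."""
    n = len(D)
    in_tree = {0}                       # start with an arbitrary point
    edges = []
    while len(in_tree) < n:
        # closest pair (p, q) with p in the current tree and q outside it
        p, q = min(((p, q) for p in in_tree for q in range(n) if q not in in_tree),
                   key=lambda e: D[e[0]][e[1]])
        edges.append((p, q, D[p][q]))   # add q to the tree via edge (p, q)
        in_tree.add(q)
    return edges

print(mst_edges([[0, 8, 8, 7, 7],
                 [8, 0, 2, 4, 4],
                 [8, 2, 0, 3, 3],
                 [7, 4, 3, 0, 1],
                 [7, 4, 3, 1, 0]]))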
Summary of hierarchical clustering
• No need to specify the number of clusters in advance.
• The hierarchical structure maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
Partitional clustering
• Nonhierarchical, each instance is placed in
exactly one of K nonoverlapping clusters.
• Since only one set of clusters is output, the user
normally has to input the desired number of
clusters K.
k-means
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the
nearest cluster center.
4. Re-estimate the k cluster centers, by assuming the memberships found
above are correct.
5. If none of the N objects changed membership in the last iteration, exit.
Otherwise goto 3.
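A minimal NumPy sketch of these five steps with random initialization (function and parameter names are ours; empty clusters are not handled in this sketch):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: initialize the k cluster centers with k randomly chosen objects
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # step 3: assign each object to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # step 5: exit if no object changed membership in the last iteration
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # step 4: re-estimate each cluster center as the mean of its members
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

Because the initialization is random, different seeds can converge to different local optima, which matches the local-optimum remark in the evaluation below.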
K-means clustering: steps 1 to 5
Algorithm: k-means; Distance Metric: Euclidean Distance
[Figures: five snapshots of k-means on 2-D points with k = 3 (centers k1, k2, k3): initial placement of the centers, assignment of each point to its nearest center, re-estimation of the centers as cluster means, re-assignment of the points, and the final converged clustering]
Evaluation of K-means
• Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t
is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may be
found using techniques such as: deterministic annealing and genetic
algorithms
• Weakness
– Applicable only when mean is defined, then what about categorical
data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable for clusters with non-convex shapes
DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps)
  – A core point has at least MinPts points within Eps; such points lie in the interior of a cluster
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  – A noise point is any point that is not a core point or a border point.
[Figure: DBSCAN core, border, and noise points]
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
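A hedged usage sketch with scikit-learn's DBSCAN implementation (the two-moons data and the parameter values Eps = 0.2, MinPts = 4 are our own illustrative choices):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two half-moon shaped clusters: non-convex, a good fit for a density-based method
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=4).fit(X)   # eps = Eps, min_samples = MinPts
labels = db.labels_                          # cluster labels; -1 marks noise points
print("clusters:", len(set(labels) - {-1}), "noise points:", int(np.sum(labels == -1)))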
When DBSCAN does not work well:
• Varying densities
• High-dimensional data
[Figure: original points and DBSCAN results with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92)]
DBSCAN: Determining Eps and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor; the knee of this curve suggests a suitable Eps (a sketch follows below)
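A minimal NumPy sketch of this k-distance plot (the function name and the choice k = MinPts = 4 are ours):

import numpy as np

def k_distance_values(X, k=4):
    """Sorted distance of every point to its k-th nearest neighbor."""
    # pairwise Euclidean distances
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)             # ignore each point's distance to itself
    kth = np.sort(d, axis=1)[:, k - 1]      # k-th nearest neighbor distance per point
    return np.sort(kth)                     # plot these values; the knee suggests Eps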
Summary of Clustering Algorithms
• K-Means – fast, works only for data where a mean can be defined, generates spherical clusters, sensitive to noise and outliers
• Single linkage – produces non-convex clusters, slow for large data sets,
sensitive to noise
• Complete linkage – produces non-convex clusters, very sensitive to
noise, very slow for large data sets
• DBSCAN – produces arbitrary shaped clusters – works only for low
dimensional data
Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
  – Accuracy, precision, recall
• For clustering there are no class labels, so we instead use internal measures of cluster quality.
Internal Measures: Cohesion and Separation
• Cluster Cohesion: measures how closely related the objects in a cluster are
  – Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from the other clusters
• Example: Squared Error
  – Cohesion is measured by the within-cluster sum of squares (SSE):
      WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²
  – Separation is measured by the between-cluster sum of squares:
      BSS = Σ_i |C_i| (m − m_i)²
    where m is the overall mean of the data, m_i is the mean of cluster C_i, and |C_i| is its size
  – BSS + WSS = constant (the total sum of squares of the data)
[Figure: five 1-D points with overall mean m and cluster means m_1, m_2, illustrating cohesion (within clusters) and separation (between clusters)]
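A small NumPy sketch that verifies WSS + BSS = constant on 1-D toy data loosely following the figure (the data, labels, and variable names are our own illustration):

import numpy as np

X = np.array([1., 2., 3., 4., 5.])        # five 1-D points
labels = np.array([0, 0, 0, 1, 1])        # two clusters: {1, 2, 3} and {4, 5}

m = X.mean()                              # overall mean of the data
wss = bss = 0.0
for j in np.unique(labels):
    Cj = X[labels == j]
    mj = Cj.mean()                        # cluster mean m_j
    wss += ((Cj - mj) ** 2).sum()         # cohesion: within-cluster sum of squares
    bss += len(Cj) * (m - mj) ** 2        # separation: between-cluster sum of squares
tss = ((X - m) ** 2).sum()                # total sum of squares
print(wss, bss, tss)                      # wss + bss equals tss (2.5 + 7.5 = 10.0)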
Internal Measures: Silhouette Coefficient
• The Silhouette Coefficient combines the ideas of cohesion and separation, but for individual points, as well as clusters and clusterings
• For an individual point i:
  – Calculate a = average distance of i to the points in its cluster
  – Calculate b = min (average distance of i to points in another cluster)
  – The silhouette coefficient for the point is then
      s = 1 − a/b if a < b   (or s = b/a − 1 if a ≥ b, not the usual case)
• The closer s is to 1, the better the point fits its cluster
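A short NumPy sketch of this per-point computation from a distance matrix, following the a/b formula above (assumes at least two clusters; scikit-learn's silhouette_samples provides an equivalent):

import numpy as np

def silhouette_per_point(D, labels):
    """D: n x n pairwise distance matrix; labels: cluster label of each point."""
    labels = np.asarray(labels)
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, same].mean() if same.any() else 0.0    # avg distance within own cluster
        b = min(D[i, labels == c].mean()                # closest other cluster, on average
                for c in set(labels) if c != labels[i])
        s[i] = 1 - a / b if a < b else b / a - 1        # the formula above
    return s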
End of Clustering