[Figure: labeled examples used to train a model/predictor; example application: image compression]
Unsupervised learning: clustering
Applications?
Gene expression data
Data from Garber et al., PNAS 98, 2001.
Face clustering
Search result clustering
Google News
Clustering in search advertising
[Figure: bipartite graph of advertisers and the keywords they bid on, ~10M nodes]
Clustering applications
[Figure: graph with ~100M nodes]
Data visualization
ThemeScapes, Cartia
• Mountain height = cluster size
A data set with clear cluster structure
What clustering algorithms have you seen/used?
Issues for clustering
Representation for clustering
• How do we represent an example? (features, etc.)
• Similarity/distance between examples
Number of clusters
• Fixed a priori
• Data driven?
Clustering Algorithms
Flat algorithms
• Usually start with a random (partial) partitioning
• Refine it iteratively
• K means clustering
• Model based clustering
• Spectral clustering
Hierarchical algorithms
• Bottom-up, agglomerative
• Top-down, divisive
Hard vs. soft clustering
Hard clustering: each example belongs to exactly one cluster
Soft clustering: an example can belong to multiple clusters (e.g., with a probability or weight for each)
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
K-means: an example
• Initialize centers randomly
• Assign points to nearest center
• Readjust centers
• Repeat assign/readjust until no changes: done
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
How do we do this?
K-means
Iterate:
• Assign/cluster each example to closest center:
  iterate over each point:
    - get distance to each cluster center
    - assign to closest center (hard cluster)
• Recalculate centers as the mean of the points in a cluster
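As a concrete illustration, here is a minimal NumPy sketch of this loop. The function and variable names are our own (not from the slides); it assumes dense feature vectors and uses Euclidean distance, which is defined next.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    # Minimal k-means sketch (names are ours, not from the slides).
    # X: (n_points, n_features) array of dense feature vectors.
    rng = np.random.default_rng(seed)
    # Initialize: pick k random examples as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign step: Euclidean distance from every point to every center,
        # then assign each point to its closest center (hard clustering)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Recalculate step: each center becomes the mean of its points
        # (an empty cluster keeps its old center)
        new_centers = np.array([X[assignments == j].mean(axis=0)
                                if np.any(assignments == j) else centers[j]
                                for j in range(k)])
        # Converged once the centers stop changing
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assignments, centers

For example, kmeans(X, k=3) returns a hard cluster assignment for every row of X plus the final centers.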
Euclidean:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
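As a quick sanity check, this is directly computable with NumPy (the arrays below are made-up examples):

import numpy as np

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 4.0])

# square root of the summed squared feature differences
d = np.sqrt(np.sum((x - y) ** 2))   # equivalently: np.linalg.norm(x - y)
print(d)                            # sqrt(1 + 1 + 4) = sqrt(6) ≈ 2.449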
One feature for each word. The value is the number of times that word occurs.
Cosine similarity:

$$\text{sim}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

$$d(x, y) = 1 - \text{sim}(x, y)$$
- good for text data and many other “real world” data sets
- computationally friendly, since we only need to consider features that have non-zero values in both examples
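To make the non-zero-features point concrete, here is a sketch of cosine distance over sparse word-count dictionaries (the {word: count} representation and the function names are our own):

import math

def cosine_sim(x, y):
    # x, y: sparse word-count vectors stored as {word: count} dicts.
    # The dot product only touches words with non-zero counts in BOTH examples.
    dot = sum(count * y[word] for word, count in x.items() if word in y)
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    return dot / (norm_x * norm_y)

def cosine_dist(x, y):
    return 1.0 - cosine_sim(x, y)

doc1 = {"clustering": 2, "kmeans": 1}
doc2 = {"clustering": 1, "loss": 3}
print(cosine_dist(doc1, doc2))   # only "clustering" contributes to the dot product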
K-means
Iterate:
• Assign/cluster each example to closest center
• Recalculate centers as the mean of the points in a cluster
The new center is the mean of the points in the cluster:

$$\mu(C) = \frac{1}{|C|} \sum_{x \in C} x$$

where vector addition and division by $|C|$ are component-wise, i.e. each feature of the center is the average of that feature over the cluster's points.
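In code, this is just a per-feature average (a sketch with made-up points):

import numpy as np

cluster_points = np.array([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 0.0]])

# (1/|C|) * sum of the points, computed feature-wise
center = cluster_points.mean(axis=0)
print(center)   # [3. 2.]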
K-means loss function

$$\text{loss} = \sum_{i=1}^{n} d(x_i, \mu_k)^2 \quad \text{where } \mu_k \text{ is the cluster center for } x_i$$
Does this mean that k-means will always find the minimum-loss clustering?
Minimizing k-means loss
Iterate:
1. Assign/cluster each example to closest center
2. Recalculate centers as the mean of the points in a cluster

$$\text{loss} = \sum_{i=1}^{n} d(x_i, \mu_k)^2 \quad \text{where } \mu_k \text{ is the cluster center for } x_i$$

Each step can only decrease (or leave unchanged) this loss: step 1 chooses the best assignment for the current centers, and step 2 chooses the best centers for the current assignments. The loss therefore converges, but only to a local minimum: k-means is not guaranteed to find the globally minimum-loss clustering.
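The loss is easy to monitor while iterating; a sketch that reuses the arrays from the earlier kmeans sketch (names are our own):

import numpy as np

def kmeans_loss(X, assignments, centers):
    # Sum of squared distances from each point to its assigned center
    diffs = X - centers[assignments]   # each point minus its own cluster center
    return np.sum(diffs ** 2)

Printing this value after every assign/recalculate step should show it never increasing.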
Convergence
• A fixed number of iterations
• Partitions unchanged
• Cluster centers don’t change
K-means: Initialize centers randomly
Common heuristics
• Random centers in the space
• Randomly pick examples
• Points least similar to any existing center (furthest centers heuristic)
• Try out multiple starting points
• Initialize with the results of another clustering method
Furthest centers heuristic
μ1 = pick a random point
for k = 2 to K:
    μk = the point that is furthest from any previous center

i.e., compute each point's smallest distance to the centers chosen so far, then take the point for which that distance is largest:

for k = 2 to K:
    for i = 1 to N:
        si = min d(xi, μ1…μk-1)   // smallest distance from xi to any existing center
    μk = the point xi with the largest si
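A NumPy sketch of this heuristic (the function name and details are our own):

import numpy as np

def furthest_centers(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]   # mu_1 = a random point
    for _ in range(2, k + 1):
        # s_i: smallest distance from each point to any center picked so far
        dists = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
        s = dists.min(axis=1)
        # Next center: the point that is furthest from its nearest center
        centers.append(X[s.argmax()])
    return np.array(centers)

One caveat worth noting: because it always picks the most distant point, this heuristic is sensitive to outliers.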