Spectral Clustering
Aarti Singh
From data clustering to graph clustering
Goal: Given data points X1, …, Xn and similarities W(Xi,Xj), partition the data into
groups so that points in a group are similar and points in different groups are
dissimilar.
Similarity graph
Partition the graph so that edges within a group have large weights and
edges across groups have small weights.
Similarity graph construction
Similarity Graphs: Model local neighborhood relations between data points
E.g. ε-NN graph:
Wij = 1 if ‖xi − xj‖ ≤ ε (ε controls the size of the neighborhood)
Wij = 0 otherwise
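As a concrete sketch, the ε-NN construction above can be written in a few lines of NumPy (the function name epsilon_nn_graph and the toy points are my choices):

```python
import numpy as np

def epsilon_nn_graph(X, eps):
    """Wij = 1 if ||xi - xj|| <= eps, else 0 (no self-loops)."""
    # Pairwise Euclidean distances between all rows of X
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = (d <= eps).astype(float)
    np.fill_diagonal(W, 0.0)  # a point is not its own neighbor
    return W

# Two nearby points and one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
W = epsilon_nn_graph(X, eps=1.0)
```

Larger ε gives a denser graph; too small an ε can leave the graph disconnected even inside a true cluster.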
Partitioning a graph into two clusters
Min-cut: Partition graph into two sets A and B such that weight of edges
connecting vertices in A to vertices in B is minimum.
Normalized cut: weight the cut by the size (volume) of each side, Ncut(A,B) = cut(A,B) (1/vol(A) + 1/vol(B)), so that balanced partitions are preferred.

With the cluster indicator fi = +1 if i ∈ A and fi = −1 if i ∈ B, the cut weight becomes a quadratic form:
cut(A,B) = (1/4) fT(D − W) f
Graph cut and Graph Laplacian
fT(D − W) f = fT L f = (1/2) Σij Wij (fi − fj)²
where L = D − W is the (unnormalized) graph Laplacian and D is the diagonal degree matrix.
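The quadratic-form identity can be checked numerically on a random symmetric weight matrix (a small sketch; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2            # symmetric weights
np.fill_diagonal(W, 0.0)     # no self-loops
D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # unnormalized graph Laplacian

f = rng.choice([-1.0, 1.0], size=n)                         # +/-1 cluster indicator
quad = f @ L @ f                                            # fT L f
sum_form = 0.5 * np.sum(W * (f[:, None] - f[None, :])**2)   # (1/2) sum Wij (fi - fj)^2
cut = W[f > 0][:, f < 0].sum()                              # weight crossing the partition
# quad equals sum_form, and for a +/-1 indicator quad equals 4 * cut
```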
Spectral properties of L:
Balanced min-cut
The above formulation is still NP-hard, so we relax f so that it need not be binary:
min f (fT L f)/(fT f) = λmin(L), the smallest eigenvalue of L (Rayleigh–Ritz theorem)
For an eigenvector f of L with eigenvalue λ:
fT L f / fT f = fT λf / fT f = λ
Recall that the smallest eigenvalue of L is 0, with corresponding eigenvector 1. But f can’t be 1, because of the constraint fT1 = 0.
Therefore the solution f is the eigenvector of L corresponding to the second smallest eigenvalue, aka the second eigenvector.
Approximation of balanced min-cut
Let f be the second eigenvector of the unnormalized graph Laplacian L.
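A minimal sketch of this bipartitioning step, assuming a symmetric weight matrix W (the function name fiedler_partition and the toy graph are mine):

```python
import numpy as np

def fiedler_partition(W):
    """Bipartition a graph by the sign of the second eigenvector of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    f = vecs[:, 1]                   # second eigenvector (the Fiedler vector)
    return f >= 0                    # boolean cluster assignment

# Two triangles joined by a single weak edge
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1              # weak bridge between the triangles
labels = fiedler_partition(W)
# The two triangles land on opposite sides of the cut.
```

The overall sign of an eigenvector is arbitrary, so only the grouping (not which side is True) is meaningful.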
Dimensionality reduction: keeping only k eigenvectors reduces the n×n similarity structure to an n×k embedding of the points.
Spectral Clustering - Intuition
Eigenvectors of the Laplacian matrix provide an embedding of the data
based on similarity.
Disconnected subgraphs: points are easy to cluster in the embedded space, e.g. using k-means.
[Figure: block-diagonal Laplacian L; the embedding of point i is its row in the matrix of eigenvectors]
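The full pipeline (embed via the first k eigenvectors of L, then run k-means on the rows) might look like this; the function names and the Lloyd's-algorithm details are my choices:

```python
import numpy as np

def spectral_embedding(W, k):
    """Rows of the first k eigenvectors of L = D - W embed the n points in R^k."""
    L = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :k]                        # n x k embedding

def kmeans(Z, k, iters=50):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    centers = [Z[0]]
    for _ in range(k - 1):                    # pick well-separated initial centers
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        d = np.linalg.norm(Z[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Demo: two disconnected triangles collapse to two points in the embedding
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
labels = kmeans(spectral_embedding(W, 2), 2)
```

Because the embedding collapses each connected component to a single point, k-means on the rows becomes trivial in this disconnected case.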
Understanding Spectral Clustering
• If the graph is connected, the first Laplacian evec is constant (all 1s)
• If the graph is disconnected (k connected components), the Laplacian is block diagonal and the first k Laplacian evecs are the indicator vectors of the components (1 on one component, 0 elsewhere), or some rotation of them
[Figure: L = diag(L1, L2, L3) and its first three eigenvectors]
Understanding Spectral Clustering
• Is all hope lost if clusters don’t correspond to connected components of the graph? No!
• If clusters are connected loosely (small off-block-diagonal entries), then the 1st Laplacian eigenvector is still all 1s, but:
- for two clusters, the second eigenvector finds a balanced cut
- for k clusters, the eigenvectors are slightly perturbed (and possibly rotated), per the Davis–Kahan theorem
Spectral Clustering - Intuition
Eigenvectors of the Laplacian matrix provide an embedding of the data based on similarity. For loosely connected subgraphs, L is nearly block diagonal (small ε entries off the diagonal blocks), so the embedding of each point is only slightly perturbed and the points remain easy to cluster in the embedded space, e.g. using k-means.
[Figure: nearly block-diagonal L with small ε off-block entries]
k-means vs Spectral clustering
Applying k-means to the Laplacian eigenvectors allows us to find clusters with non-convex boundaries.
Some Issues
Ø Choice of number of clusters k: a common heuristic is the eigengap Δk = λk − λk−1; choose k so that λ1, …, λk are small and the gap to λk+1 is large.
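The eigengap heuristic can be sketched as follows (the function name eigengaps and the three-clique demo graph are mine):

```python
import numpy as np

def eigengaps(W):
    """Consecutive gaps of the Laplacian spectrum.

    With eigenvalues lambda_1 <= lambda_2 <= ... (1-indexed),
    gaps[i] = lambda_{i+2} - lambda_{i+1}.
    """
    L = np.diag(W.sum(axis=1)) - W
    vals = np.linalg.eigh(L)[0]      # ascending eigenvalues
    return np.diff(vals)

# Demo: three 4-cliques joined by weak bridges
W = np.zeros((12, 12))
for b in range(3):
    W[4*b:4*b + 4, 4*b:4*b + 4] = 1.0
np.fill_diagonal(W, 0.0)
W[3, 4] = W[4, 3] = 0.01
W[7, 8] = W[8, 7] = 0.01

gaps = eigengaps(W)
# The first 3 eigenvalues are near zero, then the spectrum jumps,
# so the largest gap points at k = 3 clusters:
k_hat = int(np.argmax(gaps)) + 1
```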