Computational Journalism
Columbia Journalism School
Week 2: Clustering
September 18, 2015
$$
x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{pmatrix}
$$
Distance metric
Intuitively: how (dis)similar are two items?
Formally:
d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)
Distance metric
d(x, y) ≥ 0
- non-negativity: a distance is never negative
d(x, x) = 0
- identity: an item is at distance zero from itself
d(x, y) = d(y, x)
- symmetry: x to y same as y to x
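To make the axioms concrete, here is a minimal sketch (plain NumPy, not from the slides) that checks them for Euclidean distance on a few random vectors:

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))  # three random 5-dimensional points

assert euclidean(x, y) >= 0                                  # non-negativity
assert euclidean(x, x) == 0                                  # identity
assert np.isclose(euclidean(x, y), euclidean(y, x))          # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)  # triangle inequality
```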
Distance matrix
Data matrix for M objects of N dimensions
$$
X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{pmatrix}
  = \begin{pmatrix}
      x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\
      x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\
      \vdots  &         & \ddots & \vdots  \\
      x_{M,1} & x_{M,2} & \cdots & x_{M,N}
    \end{pmatrix}
$$

Distance matrix

$$
D_{ij} = D_{ji} = d(x_i, x_j), \qquad
D = \begin{pmatrix}
      d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\
      d_{2,1} & d_{2,2} &        &         \\
      \vdots  &         & \ddots &         \\
      d_{M,1} &         &        & d_{M,M}
    \end{pmatrix}
$$
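A short sketch of how such a distance matrix might be computed in practice, assuming a data matrix X of shape (M, N) and using SciPy (a library choice of ours, not named in the slides):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(10, 4)          # M = 10 objects, N = 4 dimensions
D = squareform(pdist(X))           # D[i, j] = d(x_i, x_j), Euclidean by default

assert np.allclose(D, D.T)         # symmetric
assert np.allclose(np.diag(D), 0)  # zero diagonal
```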
Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches
Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
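Both linkage strategies are available off the shelf: SciPy's hierarchical clustering calls the MIN approach "single" linkage and the MAX approach "complete" linkage. A hedged sketch (library choice is ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 3)
Z_min = linkage(X, method='single')    # MIN: distance between closest members
Z_max = linkage(X, method='complete')  # MAX: distance between farthest members

# Cut each tree into 4 flat clusters
labels_min = fcluster(Z_min, t=4, criterion='maxclust')
labels_max = fcluster(Z_max, t=4, criterion='maxclust')
```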
K-means demo
http://www.paused21.net/off/kmeans/bin/
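If the demo is unavailable, a minimal k-means sketch with scikit-learn (our library choice, not part of the demo) shows the same idea:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                   # 100 points in 2D
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_)                            # cluster assignment for each point
print(km.cluster_centers_)                   # the 3 centroids
```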
Agglomerative
combining clusters
put each item into a leaf node
while num clusters > 1:
    find two closest clusters
    merge them
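A direct, unoptimized Python rendering of this pseudocode, using centroid distance to decide which clusters are closest (one simple choice among several) and stopping at a requested number of clusters rather than merging all the way down to one:

```python
import numpy as np

def agglomerate(points, num_clusters):
    clusters = [[p] for p in points]              # each item starts as a leaf
    while len(clusters) > num_clusters:
        best = None
        # find the two closest clusters (by distance between centroids)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = np.mean(clusters[i], axis=0)
                cj = np.mean(clusters[j], axis=0)
                d = np.linalg.norm(ci - cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge them
        del clusters[j]
    return clusters

clusters = agglomerate(list(np.random.rand(12, 2)), num_clusters=3)
```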
[Figure: average-linkage dendrogram; leaves annotated with cluster assignments 1-5]
[Figure: members listed by party (Con, Lab, LDem, XB, Bp) alongside their assigned cluster, 1-5]
Clustering Algorithm
Dimensionality reduction
Problem: the vector space is high-dimensional, up to thousands of dimensions, but the screen is two-dimensional. We have to go from

x ∈ R^N

to much lower-dimensional points

y ∈ R^K, K ≪ N

Probably K = 2 or K = 3.
Linear projections
Each point is projected along a straight line to the closest point on the "screen." Mathematically,

y = Px

where P is a K × N matrix.
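As one illustration, P could be the top two principal directions found by PCA. This sketch (our choice of P, for illustration only) projects N-dimensional points onto a 2D screen:

```python
import numpy as np

X = np.random.rand(100, 50)          # 100 points, N = 50 dimensions
Xc = X - X.mean(axis=0)              # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2]                           # K x N projection matrix, K = 2
Y = Xc @ P.T                         # y = Px for every point: shape (100, 2)
```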
Nonlinear projections
Still going from high-dimensional x to low-dimensional y, but now

y = f(x)

for some function f() that is not linear. So it may not preserve relative distances, angles, etc.
Multidimensional scaling
Idea: try to preserve the distances between points "as much as possible."

If we have the distances between all pairs of points in a distance matrix,

D_{ij} = |x_i − x_j| for all i, j

we can recover the original coordinates {x_i} exactly (up to rigid transformations). It is like working out a map of a country when you know how far each city is from every other.
Multidimensional scaling
$$
\mathrm{stress}(x) = \sum_{i,j}\bigl(\lVert x_i - x_j \rVert - d_{ij}\bigr)^2
$$
Multidimensional scaling
Like "flattening" a stretchy structure into 2D, so that distances between points are preserved (as much as possible).
Robustness of results
Regarding these analyses of congressional voting, we could still ask:
o Are we modeling the right thing? (What about other legislative work, e.g. in committee?)
o Are our underlying assumptions correct? (Do representatives really have ideal points in a preference space?)
o What are we trying to argue? What will be the effect of pointing out this result?
Different libraries, different categories