Computational Journalism
Columbia Journalism School
Week 2: Clustering
September 18, 2015
$$
x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{pmatrix}
$$
Distance metric
Intuitively: how (dis)similar are two items?
Formally:
d(x, y) ≥ 0
d(x, x) = 0
d(x, y) = d(y, x)
d(x, z) ≤ d(x, y) + d(y, z)
Distance metric
d(x, y) ≥ 0
- non-negativity: a distance is never negative
d(x, x) = 0
- identity: an item is at distance zero from itself
d(x, y) = d(y, x)
- symmetry: x to y same as y to x
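To make the axioms concrete, here is a minimal sketch (plain NumPy, not from the slides) that checks them for Euclidean distance on a few random vectors:

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y)

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 5))  # three random 5-dimensional points

assert euclidean(x, y) >= 0                                  # non-negativity
assert euclidean(x, x) == 0                                  # identity
assert np.isclose(euclidean(x, y), euclidean(y, x))          # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)  # triangle inequality
```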
Distance matrix
Data matrix for M objects of N dimensions
$$
X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{pmatrix}
  = \begin{pmatrix}
      x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\
      x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\
      \vdots  &         & \ddots & \vdots  \\
      x_{M,1} & x_{M,2} & \cdots & x_{M,N}
    \end{pmatrix}
$$

Distance matrix

$$
D_{ij} = D_{ji} = d(x_i, x_j), \qquad
D = \begin{pmatrix}
      d_{1,1} & d_{1,2} & \cdots & d_{1,M} \\
      d_{2,1} & d_{2,2} &        &         \\
      \vdots  &         & \ddots &         \\
      d_{M,1} &         &        & d_{M,M}
    \end{pmatrix}
$$
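A short sketch of how such a distance matrix might be computed in practice, assuming a data matrix X of shape (M, N) and using SciPy (a library choice of ours, not named in the slides):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(10, 4)          # M = 10 objects, N = 4 dimensions
D = squareform(pdist(X))           # D[i, j] = d(x_i, x_j), Euclidean by default

assert np.allclose(D, D.T)         # symmetric
assert np.allclose(np.diag(D), 0)  # zero diagonal
```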
Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches
Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
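Both linkage strategies are available off the shelf: SciPy's hierarchical clustering calls the MIN approach "single" linkage and the MAX approach "complete" linkage. A hedged sketch (library choice is ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 3)
Z_min = linkage(X, method='single')    # MIN: distance between closest members
Z_max = linkage(X, method='complete')  # MAX: distance between farthest members

# Cut each tree into 4 flat clusters
labels_min = fcluster(Z_min, t=4, criterion='maxclust')
labels_max = fcluster(Z_max, t=4, criterion='maxclust')
```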
K-means demo
http://www.paused21.net/off/kmeans/bin/
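If the demo is unavailable, a minimal k-means sketch with scikit-learn (our library choice, not part of the demo) shows the same idea:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                   # 100 points in 2D
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_)                            # cluster assignment for each point
print(km.cluster_centers_)                   # the 3 centroids
```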
Agglomerative
combining clusters
put each item into a leaf node
while num clusters > 1:
    find two closest clusters
    merge them
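A direct, unoptimized Python rendering of this pseudocode, using centroid distance to decide which clusters are closest (one simple choice among several) and stopping at a requested number of clusters rather than merging all the way down to one:

```python
import numpy as np

def agglomerate(points, num_clusters):
    clusters = [[p] for p in points]              # each item starts as a leaf
    while len(clusters) > num_clusters:
        best = None
        # find the two closest clusters (by distance between centroids)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = np.mean(clusters[i], axis=0)
                cj = np.mean(clusters[j], axis=0)
                d = np.linalg.norm(ci - cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge them
        del clusters[j]
    return clusters

clusters = agglomerate(list(np.random.rand(12, 2)), num_clusters=3)
```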
[Figure: average-linkage dendrogram; leaves annotated with cluster assignments 1-5]
[Figure: members listed by party (Con, Lab, LDem, XB, Bp) alongside their assigned cluster, 1-5]
Clustering Algorithm
Dimensionality reduction
Problem: the vector space is high-dimensional, up to thousands of dimensions, but the screen is two-dimensional. We have to go from

x ∈ R^N

to much lower-dimensional points

y ∈ R^K, K ≪ N

Probably K = 2 or K = 3.
Linear projections
Each point is projected along a straight line to the closest point on the "screen." Mathematically,

y = Px

where P is a K × N matrix.
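As one illustration, P could be the top two principal directions found by PCA. This sketch (our choice of P, for illustration only) projects N-dimensional points onto a 2D screen:

```python
import numpy as np

X = np.random.rand(100, 50)          # 100 points, N = 50 dimensions
Xc = X - X.mean(axis=0)              # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt[:2]                           # K x N projection matrix, K = 2
Y = Xc @ P.T                         # y = Px for every point: shape (100, 2)
```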
Nonlinear projections
Still going from high-dimensional x to low-dimensional y, but now

y = f(x)

for some function f() that is not linear. So it may not preserve relative distances, angles, etc.
Multidimensional scaling
Idea: try to preserve the distances between points "as much as possible."

If we have the distances between all pairs of points in a distance matrix,

D_{ij} = |x_i − x_j| for all i, j

we can recover the original coordinates {x_i} exactly (up to rigid transformations). It is like working out a map of a country when you know how far each city is from every other.
Multidimensional scaling
$$
\mathrm{stress}(x) = \sum_{i,j}\bigl(\lVert x_i - x_j \rVert - d_{ij}\bigr)^2
$$
Multidimensional scaling
Like "flattening" a stretchy structure into 2D, so that distances between points are preserved (as much as possible).
Robustness of results
Regarding these analyses of congressional voting, we could still ask:
o Are we modeling the right thing? (What about other legislative work, e.g. in committee?)
o Are our underlying assumptions correct? (Do representatives really have ideal points in a preference space?)
o What are we trying to argue? What will be the effect of pointing out this result?
Different libraries, different categories