Lecture 17 Clustering
Clustering: Definition
(Document) clustering is the process of grouping a set of
documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised
learning.
Unsupervised = there are no labeled or annotated data.
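[Figure: example topic hierarchy. Top-level categories include biology, physics, CS, and space; lower-level topics include dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, HCI, courses, craft, and missions.]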
Sec. 16.1
Hierarchical algorithms
Create a hierarchy
Bottom-up, agglomerative
Top-down, divisive
Flat algorithms
Flat algorithms compute a partition of N documents into a
set of K clusters.
Given: a set of documents and the number K
Find: a partition into K clusters that optimizes the chosen
partitioning criterion
Global optimization: exhaustively enumerate partitions,
pick optimal one
Not tractable
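To see why exhaustive enumeration is not tractable, it helps to count the candidate partitions. The number of ways to split N documents into exactly K non-empty clusters is the Stirling number of the second kind. The sketch below is only an illustration; the function name `stirling2` is hypothetical and not part of the lecture.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n items into exactly k non-empty clusters."""
    if n == k:
        return 1  # also covers the empty case n == k == 0
    if k == 0 or k > n:
        return 0
    # The n-th item either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


if __name__ == "__main__":
    # The count grows roughly like K^N / K!, so even small collections
    # have far too many partitions to enumerate.
    for n in (5, 10, 20, 50):
        print(n, stirling2(n, 3))
```

For example, 10 documents and K = 3 already give 9330 partitions, and the count grows exponentially in N.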
K-means
Perhaps the best known clustering algorithm
Simple, works well in many cases
Use as default / baseline for clustering documents
K-means
Each cluster in K-means is defined by a centroid.
Objective/partitioning criterion: minimize the average
squared difference from the centroid
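Definition of centroid: $\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$, where $\omega$ denotes a cluster and $\vec{x}$ a document vector in it.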
K-means algorithm
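A minimal sketch of the usual K-means loop, assuming documents are given as equal-length tuples of numbers and that seeds are chosen at random from the documents. The helper names (`sq_dist`, `centroid`, `kmeans`) and the `max_iter` and `seed` parameters are illustrative assumptions, not the lecture's own pseudocode.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(docs, k, max_iter=100, seed=0):
    """Return (assignment, centroids, rss) for a list of document vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(docs, k)  # random seed selection
    assignment = None
    for _ in range(max_iter):
        # Reassignment: each document goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(d, centroids[j])) for d in docs
        ]
        if new_assignment == assignment:
            break  # no document changed cluster: converged
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = centroid(members)
    # Sum of squared distances from each document to its centroid.
    rss = sum(sq_dist(d, centroids[a]) for d, a in zip(docs, assignment))
    return assignment, centroids, rss


if __name__ == "__main__":
    docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(docs, k=2))
```

Stopping when no assignment changes is one common termination criterion; a fixed iteration cap is kept as a safeguard.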
Optimality of K-means
Convergence does not mean that we converge to the optimal
clustering!
This is the great weakness of K-means.
If we start with a bad set of seeds, the resulting clustering
can be horrible.
Initialization of K-means
Random seed selection is just one of many ways K-means
can be initialized.
Random seed selection is not very robust: it's easy to get a
suboptimal clustering.
Better ways of computing initial centroids:
Select seeds not randomly, but using some heuristic (e.g., filter
out outliers or find a set of seeds that has good coverage of
the document space)
Use hierarchical clustering to find good seeds
Select i (e.g., i = 10) different random sets of seeds, run a K-means clustering for each, and select the clustering with the lowest RSS (residual sum of squares)
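The last strategy, several random restarts with the lowest-RSS clustering kept, can be sketched as follows. This is a self-contained illustration under assumptions: documents are tuples of numbers, and the names `kmeans_once` and `best_of_restarts` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(docs, k, rng, max_iter=100):
    """One K-means run from random seeds; returns (rss, assignment)."""
    centroids = rng.sample(docs, k)
    assignment = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: sq_dist(d, centroids[j])) for d in docs]
        if new == assignment:
            break
        assignment = new
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    rss = sum(sq_dist(d, centroids[a]) for d, a in zip(docs, assignment))
    return rss, assignment


def best_of_restarts(docs, k, restarts=10, seed=0):
    """Run K-means `restarts` times and keep the run with the lowest RSS."""
    rng = random.Random(seed)
    runs = [kmeans_once(docs, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])


if __name__ == "__main__":
    docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    rss, assignment = best_of_restarts(docs, k=2)
    print(rss, assignment)
```

Restarts reduce the risk of a bad local optimum but do not guarantee the globally optimal clustering.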
Sec. 16.3
Purity example
Cluster I
Cluster II
Cluster III
Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.
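The purity computation can be sketched in a few lines. The majority counts 5, 4, and 3 and the total of 17 documents come from the example; the class labels (`x`, `o`, `d`) and the minority counts in each cluster are assumptions chosen to be consistent with those numbers.

```python
from collections import Counter


def purity(clusters):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = sum(len(cluster) for cluster in clusters)
    majority = sum(Counter(cluster).most_common(1)[0][1] for cluster in clusters)
    return majority / total


if __name__ == "__main__":
    # Each inner list holds the gold-standard class labels of one cluster.
    # The minority labels below are assumed for illustration.
    clusters = [
        ["x"] * 5 + ["o"],            # Cluster I:   majority count 5
        ["x"] + ["o"] * 4 + ["d"],    # Cluster II:  majority count 4
        ["x"] * 2 + ["d"] * 3,        # Cluster III: majority count 3
    ]
    print(round(purity(clusters), 2))  # 0.71
```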
Rand index
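Definition: $\mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN}$
Based on a 2x2 contingency table of all pairs of documents: each pair is a true positive (same class, same cluster), a false positive (different classes, same cluster), a false negative (same class, different clusters), or a true negative (different classes, different clusters).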
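The Rand index is the fraction of document pairs on which the clustering and the gold-standard classes agree. The sketch below counts the four pair types directly; the function names and the example labels (`x`, `o`, `d`, arranged in three clusters of 17 documents) are illustrative assumptions.

```python
from itertools import combinations


def pair_counts(clusters):
    """Return (tp, fp, fn, tn) over all pairs of documents.

    `clusters` is a list of clusters, each a list of gold-standard class labels.
    """
    items = [(cid, label) for cid, cluster in enumerate(clusters) for label in cluster]
    tp = fp = fn = tn = 0
    for (c1, l1), (c2, l2) in combinations(items, 2):
        same_cluster = c1 == c2
        same_class = l1 == l2
        if same_cluster and same_class:
            tp += 1  # similar documents placed together
        elif same_cluster:
            fp += 1  # dissimilar documents placed together
        elif same_class:
            fn += 1  # similar documents separated
        else:
            tn += 1  # dissimilar documents separated
    return tp, fp, fn, tn


def rand_index(clusters):
    """Fraction of pairwise decisions that are correct."""
    tp, fp, fn, tn = pair_counts(clusters)
    return (tp + tn) / (tp + fp + fn + tn)


if __name__ == "__main__":
    clusters = [
        ["x"] * 5 + ["o"],
        ["x"] + ["o"] * 4 + ["d"],
        ["x"] * 2 + ["d"] * 3,
    ]
    print(pair_counts(clusters))           # (20, 20, 24, 72)
    print(round(rand_index(clusters), 2))  # 0.68
```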
F measure
Like Rand, but precision and recall can be weighted
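A sketch of the pair-based F measure, assuming the standard F_beta formula with precision P = TP / (TP + FP) and recall R = TP / (TP + FN) computed over pairs of documents. The pair counts used in the example are hypothetical illustration values, and the function name `f_measure` is not from the lecture.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F_beta over document pairs; beta > 1 weights recall more than precision.

    Assumes tp + fp > 0 and tp + fn > 0 (otherwise precision or recall
    is undefined and this sketch raises ZeroDivisionError).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical pair counts: 20 true positives, 20 false positives,
    # 24 false negatives.
    tp, fp, fn = 20, 20, 24
    print(round(f_measure(tp, fp, fn, beta=1), 3))  # 0.476
    print(round(f_measure(tp, fp, fn, beta=5), 3))  # 0.456
```

Choosing beta > 1 penalizes false negatives (separating similar documents) more strongly than false positives.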
Exercise
Resources
Chapter 16 of IIR
Resources at http://ifnlp.org/ir
K-means example
Keith van Rijsbergen on the cluster hypothesis (he was one of
the originators)
Bing/Carrot2/Clusty: search result clustering