Clustering Lecture
Cluster Analysis
Outline
● Background
○ Intro
○ Workflow
○ Similarity metrics
● Clustering algorithms
○ Hierarchical
○ K-means
○ Density-based
● Cluster evaluation
○ External
○ Internal
Cluster Analysis
∙ Data mining tool(s) for dividing a multivariate dataset
into (meaningful, useful) groups
∙ Good clustering:
∙ Data points in one cluster are highly similar
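One way to make "highly similar" concrete is to check that the average distance between points sharing a cluster is small. The sketch below is only an illustration: it assumes Euclidean distance and NumPy, and the function name `mean_within_between` and the toy data are hypothetical, not from the lecture. The between-cluster average is included purely as a baseline for comparison.

```python
import numpy as np

def mean_within_between(X, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes Euclidean distance and at least two clusters, one of which
    has two or more points (otherwise a mean is taken over no pairs).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # same[i, j] is True when points i and j share a cluster label
    same = labels[:, None] == labels[None, :]
    # Upper triangle: count each pair once and skip self-pairs
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    within = dist[same & upper].mean()
    between = dist[~same & upper].mean()
    return within, between

# Toy data (hypothetical): two tight, well-separated groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
within, between = mean_within_between(X, labels)
```

For these labels the within-cluster average is far smaller than the between-cluster average, which is what the bullet above describes.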
● Gain understanding
− Groups of genes/proteins with
similar function (from
nucleotide or amino acid
sequence data)
− Groups of cells with similar
expression patterns (from
RNAseq data)
● Summarize
− Reduce the size of a large
dataset
Clustering precipitation
in Australia
Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd Edition.
Eisen, Brown, Botstein (1998) PNAS.
Cluster analysis is not...
Simple segmentation
e.g., dividing students into registration groups alphabetically by last name
However, some work in graph partitioning and more complex segmentation is
related to clustering
Supervised classification
Supervised classification has class label information
Clustering can be called unsupervised classification: labels are derived from the data
Association Analysis
Finding connections between items in datasets
Cluster evaluation has an element of
subjectivity
[Figure: the same six points (1-6) clustered two ways; panels: Hierarchical, Partition]
[Figure: clusterings of six points (1-6); panels: Ward’s Method, Average]
Choose K where
SSE drops
abruptly
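The rule above can be sketched in code: run K-means for several values of K, record the SSE (sum of squared distances from each point to its nearest centroid), and look for the K after which the curve flattens. This is a minimal sketch under stated assumptions, not the lecture's implementation. The toy data, the function names, the deterministic farthest-first seeding, and the "largest second difference" rule used in place of inspecting the plot by eye are all choices made for this example.

```python
import numpy as np

def farthest_first_init(X, k):
    """Deterministic seeding (an assumption of this sketch): start from X[0],
    then repeatedly add the point farthest from all centroids chosen so far."""
    chosen = [0]
    d2 = ((X - X[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(d2.argmax())
        chosen.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return X[chosen].copy()

def kmeans_sse(X, k, n_iter=100):
    """Run basic Lloyd-style K-means and return the final SSE."""
    centroids = farthest_first_init(X, k)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: mean of assigned points (keep old centroid if empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Toy data (hypothetical): three groups of four points on unit squares
offsets = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
centers = np.array([[0, 0], [10, 0], [0, 10]], dtype=float)
X = np.vstack([c + offsets for c in centers])

ks = list(range(1, 7))
sse = [kmeans_sse(X, k) for k in ks]

# Heuristic stand-in for reading the plot: the K with the largest second
# difference of SSE, i.e. a big drop into K followed by a small drop after it
curvature = [sse[i - 1] - 2 * sse[i] + sse[i + 1] for i in range(1, len(sse) - 1)]
elbow_k = ks[1 + int(np.argmax(curvature))]
```

On this toy data the SSE falls sharply up to K=3 and barely changes afterwards, so the heuristic selects K=3. On real data the curve is usually less clear-cut, and the choice of K remains a judgment call.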
Clusters obtained with K=10, some “real” clusters without initial centroids:
[Figure: diagram with labels n=7 and r]
Limitations:
● Struggles to identify clusters with varying densities – clustering is often
For 2 and 3, we can further distinguish whether we want to evaluate the entire
clustering or just individual clusters.