What Is Cluster Analysis?: - Cluster: A Collection of Data Objects
The quality of a clustering result depends on both the similarity measure used by the method and its implementation. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Density-based approach:
Based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue
Model-based:
A model is hypothesized for each of the clusters; the idea is to find the best fit of the data to the given model. Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns. Typical methods: pCluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific constraints. Typical methods: COD (obstacles), constrained clustering
[Figure: the k-means method on a 2-D dataset (axes 0-10): objects are assigned to the nearest cluster mean, the means are recomputed, and objects are reassigned; the loop repeats until the assignments stabilize]
Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weakness:
- Applicable only when a mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
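The assign-update loop described above can be sketched in a few lines. This is a minimal illustration of Lloyd's k-means iteration on 2-D numeric points (pure Python, squared Euclidean distance); a production implementation would use a library such as scikit-learn.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # arbitrary initial means
    for _ in range(iters):
        # Assignment step: each object joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: recompute each center as its cluster's mean
        # (keep the old center if a cluster went empty).
        new_centers = [
            tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # converged (often a local optimum)
            break
        centers = new_centers
    return centers, clusters
```

Note that the result depends on the random initialization, which is exactly the local-optimum weakness noted above.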
K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
CLARA (Kaufmann & Rousseeuw, 1990) CLARANS (Ng & Han, 1994): Randomized sampling Focusing + spatial data structure (Ester et al., 1995)
[Figure: a typical K-medoids run with K = 2 on a 2-D dataset (axes 0-10): arbitrary objects are chosen as initial medoids, remaining objects are assigned to the nearest medoid, and the loop of swapping medoids and reassigning objects repeats until no change; total cost = 26]
PAM Clustering: total swapping cost TCih = Σj Cjih
[Figure: the four cases for computing Cjih when a medoid i is considered for a swap with a non-medoid h, for a non-medoid object j currently served by medoid i or by another medoid t; e.g. Cjih = 0 when j stays assigned to t both before and after the swap]
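Rather than enumerating the four cases separately, the total swapping cost TCih can be evaluated directly as the difference in total clustering cost before and after the swap. A minimal sketch (the function names and the toy data are illustrative, not from PAM's original presentation):

```python
def clustering_cost(points, medoids, dist):
    """Total cost: each object's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def swap_cost(points, medoids, i, h, dist):
    """TC_ih: change in total cost if medoid i is swapped with non-medoid h.
    A negative value means the swap improves the clustering."""
    new_medoids = [m for m in medoids if m != i] + [h]
    return (clustering_cost(points, new_medoids, dist)
            - clustering_cost(points, medoids, dist))

# Example with Manhattan distance: swapping medoid (9, 9) for (5, 5)
# lowers the total cost, so PAM would accept this swap.
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
pts = [(0, 0), (1, 0), (5, 5), (6, 5), (9, 9)]
tc = swap_cost(pts, [(0, 0), (9, 9)], (9, 9), (5, 5), manhattan)
```

PAM tries all (medoid, non-medoid) pairs each round and performs the swap with the most negative TCih, stopping when no swap improves the cost.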
Hierarchical Clustering
Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
[Figure: agglomerative (AGNES) vs. divisive (DIANA) clustering of objects a, b, c, d, e over steps 0-4: AGNES merges bottom-up (a, b into ab; d, e into de; c, de into cde; ab, cde into abcde), while DIANA splits top-down in the reverse order]
Agglomerative, bottom-up approach: merge the nodes that have the least dissimilarity; go on in a non-descending fashion; eventually all nodes belong to the same cluster.
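The bottom-up merge loop can be sketched as follows. This is an illustrative single-link variant (cluster dissimilarity = minimum pairwise distance); AGNES also supports other link criteria, and real implementations avoid the naive O(n^3) pair scan used here.

```python
def agnes(points, dist, target_k=1):
    """Agglomerative (bottom-up) clustering with single-link dissimilarity."""
    clusters = [[p] for p in points]      # start: each object is its own cluster
    while len(clusters) > target_k:
        # Find the pair of clusters with the least dissimilarity.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters
```

Running the loop down to `target_k = 1` reproduces the full merge hierarchy; stopping earlier implements the termination condition mentioned above.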
Methods
Feature transformation: only effective if most dimensions are relevant; PCA & SVD are useful only when features are highly correlated/redundant
Feature selection: wrapper or filter approaches are useful to find a subspace where the data have nice clusters
Subspace clustering: find clusters in all possible subspaces. Typical methods: CLIQUE, ProClus, and frequent pattern-based clustering
Adding a dimension stretches the points across that dimension, moving them further apart. Adding more dimensions makes the points further apart still: high-dimensional data is extremely sparse, and the distance measure becomes less meaningful, since most pairs of points end up nearly equidistant.
Drawbacks:
- Most tests are for a single attribute
- In many cases, the data distribution may not be known
Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O. Algorithms for mining distance-based outliers:
Index-based algorithm Nested-loop algorithm
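A nested-loop version of the definition above can be sketched directly: an object is a DB(p, D)-outlier iff at most a fraction (1 - p) of the other objects lie within distance D of it, so the inner scan can stop early once that budget is exceeded (this early exit is the key trick of the nested-loop algorithm; the code itself is an illustrative sketch).

```python
def db_outliers(data, p, D, dist):
    """DB(p, D)-outliers: objects with at least a fraction p of the other
    objects at distance greater than D. Nested-loop style with early exit."""
    n = len(data)
    max_within = (1 - p) * (n - 1)   # at most this many objects may lie within D
    outliers = []
    for o in data:
        within = 0
        for x in data:
            if x is o:
                continue
            if dist(o, x) <= D:
                within += 1
                if within > max_within:
                    break            # o cannot be an outlier; stop counting
        else:                        # inner loop finished without breaking
            outliers.append(o)
    return outliers
```

An index-based variant uses a spatial index to answer the "how many neighbors within D" query instead of scanning all objects.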
Summary
Cluster analysis groups objects based on their similarity and has wide applications. A measure of similarity can be computed for many types of data. Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Outlier detection and analysis are very useful for fraud detection, etc., and can be performed by statistical, distance-based, or deviation-based approaches.