Cluster Analysis
Learning
By Robert – just started learning!
Notes adapted from Jiawei Han, Data Mining: Concepts and Techniques.
Examples of Clustering Applications
• Pattern Recognition
• Spatial Data Analysis
  • Create thematic maps in GIS by clustering feature spaces
  • Detect spatial clusters for other spatial mining tasks
• Image Processing
• Economic Science (especially market research)
• WWW
  • Document classification
  • Cluster Weblog data to discover groups of similar access patterns
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• Ability to handle high dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods
7. Grid-Based Methods
8. Model-Based Methods
9. Clustering High-Dimensional Data
10. Constraint-Based Clustering
11. Outlier Analysis
12. Summary
Data Structures
• Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
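As a quick illustration (my own addition, not from the original notes), the dissimilarity matrix can be computed directly with SciPy; the data values below are made up:

```python
# Build the dissimilarity matrix for n objects described by p interval-scaled variables.
import numpy as np
from scipy.spatial.distance import pdist, squareform

# toy data: 4 objects, 2 variables (arbitrary values for illustration)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [8.0, 9.0],
              [9.0, 8.0]])

# pdist returns the condensed pairwise distances d(i, j) for i < j;
# squareform expands them into the full symmetric n x n matrix with zeros on the diagonal.
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))
```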
• Interval-scaled variables
• Binary variables
• Standardize the data: calculate the mean absolute deviation of each variable f,

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right),$$

where m_f is the mean of variable f.
• Minkowski distance of order q between objects i and j; with q = 1, d is the Manhattan distance:

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
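A small sketch (my own addition, with made-up numbers) of standardizing one variable by its mean absolute deviation, i.e. computing the standardized measurement z_if = (x_if - m_f) / s_f:

```python
import numpy as np

# values of a single interval-scaled variable f for n objects (made-up numbers)
x = np.array([2.0, 4.0, 4.0, 6.0, 9.0])

m_f = x.mean()                   # mean of variable f
s_f = np.abs(x - m_f).mean()     # mean absolute deviation (less sensitive to outliers than std dev)
z = (x - m_f) / s_f              # standardized measurements
print(np.round(z, 2))
```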
Similarity and Dissimilarity Between Objects (Cont.)
• If q = 2, d is the Euclidean distance:

$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

• Properties
  • d(i, j) ≥ 0
  • d(i, i) = 0
  • d(i, j) = d(j, i)
  • d(i, j) ≤ d(i, k) + d(k, j)
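To make the q parameter concrete, here is a short sketch (my own, with arbitrary points) of the Minkowski distance and its q = 1 (Manhattan) and q = 2 (Euclidean) special cases:

```python
import numpy as np

def minkowski(xi, xj, q):
    """Minkowski distance of order q between two p-dimensional objects."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])

print(minkowski(xi, xj, 1))  # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski(xi, xj, 2))  # Euclidean: sqrt(9 + 4 + 0) ~= 3.61
print(minkowski(xi, xj, 3))  # larger q stresses the largest coordinate difference (~3.27)
```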
Binary Variables
• A contingency table for binary data (rows: object i, columns: object j):

                   Object j
                   1      0      sum
  Object i   1     a      b      a+b
             0     c      d      c+d
             sum   a+c    b+d    p

• Example:

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be set to 1, and the value N be set to 0
• Using the asymmetric (Jaccard-type) dissimilarity d(i, j) = (b + c) / (a + b + c), computed over the asymmetric attributes only:

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
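A brief sketch (my own addition) that reproduces the three dissimilarities above from the asymmetric binary encoding:

```python
import numpy as np

def asym_binary_dissim(x, y):
    """Asymmetric (Jaccard-type) dissimilarity d = (b + c) / (a + b + c);
    negative matches (both 0) are ignored."""
    a = np.sum((x == 1) & (y == 1))   # positive matches
    b = np.sum((x == 1) & (y == 0))
    c = np.sum((x == 0) & (y == 1))
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1..Test-4 encoded with Y/P = 1 and N = 0
jack = np.array([1, 0, 1, 0, 0, 0])
mary = np.array([1, 0, 1, 0, 1, 0])
jim  = np.array([1, 1, 0, 0, 0, 0])

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```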
Nominal Variables
• Simple matching: d(i, j) = (p - m) / p, where m is the number of matches and p the total number of variables.
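An illustrative sketch (my own, with hypothetical color/shape/size values) of the simple-matching distance for nominal attributes:

```python
def simple_matching_distance(obj_i, obj_j):
    """d(i, j) = (p - m) / p, where m counts attributes with equal values."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# two objects described by three nominal attributes (hypothetical values)
obj_i = ("red", "circle", "small")
obj_j = ("red", "square", "large")
print(round(simple_matching_distance(obj_i, obj_j), 2))  # (3 - 1) / 3 = 0.67
```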
Vector Objects
• Vector objects arise, for example, as keyword vectors for documents or gene-feature vectors in micro-arrays, and are commonly compared with the cosine measure (or the Tanimoto coefficient).
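A short sketch (my own, with made-up term-count vectors) of the cosine measure between two vector objects:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: x . y / (||x|| * ||y||)."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# hypothetical term-frequency vectors for two documents
d1 = np.array([5.0, 0.0, 3.0, 0.0, 2.0])
d2 = np.array([3.0, 0.0, 2.0, 0.0, 1.0])
print(round(cosine_similarity(d1, d2), 3))  # ~0.997: nearly identical direction
```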
Major Clustering Approaches (I)
• Partitioning approach:
• Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
• Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
• Create a hierarchical decomposition of the set of data (or objects) using some
criterion
• Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON
• Density-based approach:
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DenClue
• Model-based:
• A model is hypothesized for each of the clusters, and the method finds the best fit of the data to the given model
• Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
• Based on the analysis of frequent patterns
• Typical methods: pCluster
• User-guided or constraint-based:
• Clustering by considering user-specified or application-specific constraints
• Typical methods: COD (obstacles), constrained clustering
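As an illustration (my own, not from the notes), scikit-learn offers representatives of several of these approaches; the toy data and parameters below are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# two obvious blobs of points (made-up data)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]])

# partitioning approach: k-means minimizes the sum of squared errors
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))

# hierarchical approach: bottom-up (AGNES-style) agglomerative merging
print(AgglomerativeClustering(n_clusters=2).fit_predict(X))

# density-based approach: DBSCAN grows clusters from dense neighborhoods
print(DBSCAN(eps=1.0, min_samples=2).fit_predict(X))
```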
Typical Alternatives to Calculate the Distance between
Clusters
• Single link: smallest distance between an element in one cluster and an element
in the other, i.e., dis(Ki, Kj) = min(tip, tjq)
• Average: avg distance between an element in one cluster and an element in the
other, i.e., dis(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
• Radius: square root of average distance from any point of the cluster to its
centroid
$$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$$
• Diameter: square root of average mean squared distance between all pairs
of points in the cluster
$$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$
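A compact sketch (my addition, with made-up points) of a few of these inter-cluster distances:

```python
import numpy as np
from scipy.spatial.distance import cdist

# two small clusters of 2-D points (arbitrary values)
Ki = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])
Kj = np.array([[6.0, 6.0], [7.0, 5.5]])

pairwise = cdist(Ki, Kj)   # all element-to-element distances
print(pairwise.min())      # single link: smallest pairwise distance
print(pairwise.mean())     # average: mean pairwise distance
print(np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)))  # centroid distance
```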
Partitioning Algorithms: Basic Concept
• Example: the K-Means method with K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; then reassign the objects and repeat until no reassignment occurs.
[Figure: three scatter plots illustrating the assign / update / reassign iterations of K-Means.]
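To make the assign/update loop in the figure concrete, here is a minimal K-Means sketch (my own simplification: random toy data, a fixed number of iterations instead of a convergence test, and no handling of empty clusters):

```python
import numpy as np

def k_means(X, k, n_iter=10, seed=0):
    """Minimal K-Means: alternate between assigning each point to the
    nearest cluster mean and recomputing the means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # arbitrarily choose k objects
    for _ in range(n_iter):
        # assign each object to the most similar (closest) center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update the cluster means
        centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, centers

# toy data: two well-separated groups of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(8.0, 1.0, (20, 2))])
labels, centers = k_means(X, k=2)
print(np.round(centers, 2))
```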
References (1)
• M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering
structure, SIGMOD’99.
• P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
• F. Beil, M. Ester, and X. Xu. Frequent Term-Based Text Clustering. KDD'02.
• M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000.
• M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial
databases. KDD'96.
• M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for
efficient class identification. SSD'95.
• D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
• D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
References (2)
• V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. KDD'99.
• S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98.
• S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE'99, pp. 512-
521, Sydney, Australia, March 1999.
• A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD'98.
• A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
• G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling.
COMPUTER, 32(8): 68-75, 1999.
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons,
1990.
• E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB’98.
• G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
• P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
• L. Parsons, E. Haque, and H. Liu. Subspace Clustering for High Dimensional Data: A Review. SIGKDD Explorations, 6(1), June 2004.
• E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition.
• A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles. ICDE'01.
• H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
• W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB'97.
• T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.