Unit- 4 DMA
Syllabus
• Cluster Analysis: Introduction
• Requirements and overview of different categories
• Partitioning method: Introduction
• k-means
• k-medoids
• Hierarchical method: Introduction
• Agglomerative vs. Divisive method
• Distance measures in algorithmic methods
• BIRCH technique
• DBSCAN technique
• STING technique
• CLIQUE technique
• Evaluation of clustering techniques
Session 1
Cluster Analysis: Introduction
Requirements and overview of different categories
• Clustering is the process of grouping a set of data objects into multiple
groups or clusters
• so that objects within a cluster have high similarity, but are very
dissimilar to objects in other clusters.
• Dissimilarities and similarities are assessed based on the attribute
values describing the objects and often involve distance measures.
• Clustering as a data mining tool has its roots in many application
areas such as biology, security, business intelligence, and Web search.
Cluster Analysis
• Cluster: A collection of data objects
• similar (or related) to one another within the same group
• dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
• Finding similarities between data according to the characteristics found in the
data and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by observations vs.
learning by examples: supervised)
• Typical applications
• As a stand-alone tool to get insight into data distribution
• As a preprocessing step for other algorithms
Applications of Cluster Analysis
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth observation database
• Marketing: Help marketers discover distinct groups in their customer bases, and then use this
knowledge to develop targeted marketing programs
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continental faults
• Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
• Economic Science: market research
• Owing to the huge amounts of data collected in databases, cluster analysis has recently become
a highly active topic in data mining research.
Quality: What Is Good Clustering?
• d(i, j): a nonnegative number that is close to 0 when objects i and j are highly similar or “near” each
other, and becomes larger the more they differ.
• For example, changing measurement units from meters to inches for height, or from kilograms
to pounds for weight, may lead to a very different clustering structure.
• To avoid dependence on the choice of measurement units, the data should be standardized.
• Standardizing measurements attempts to give all variables an equal weight.
• Data Transformation by Normalization
• The measurement unit used can affect the data analysis.
To help avoid dependence on the choice of measurement units, the
data should be normalized or standardized. This involves transforming the data to fall
within a smaller or common range
How can the data for a variable be
standardized?
• To standardize measurements, one choice is to convert the original
measurements to unit-less variables.
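As a hedged illustration of both options, the sketch below applies min-max normalization and a z-score based on the mean absolute deviation to one numeric attribute; the sample heights are made-up values and the function names are our own.

```python
# Minimal sketch: standardizing one numeric variable (illustrative values).
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_mad(values):
    # z-score using the mean absolute deviation, which is less affected by outliers
    m = sum(values) / len(values)
    s = sum(abs(v - m) for v in values) / len(values)
    return [(v - m) / s for v in values]

heights_m = [1.52, 1.60, 1.75, 1.83, 1.91]   # hypothetical heights in meters
print(min_max(heights_m))
print(z_score_mad(heights_m))
```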
Distance measures
Distance Measure
• After standardization, or without standardization in certain
applications, the dissimilarity (or similarity) between the objects
described by interval-scaled variables is typically computed based on
the distance between each pair of objects. The most popular distance
measure is Euclidean distance
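A minimal sketch of the usual numeric distance measures (Euclidean, Manhattan, and the general Minkowski form); the two sample points are arbitrary.

```python
# Minimal sketch: Euclidean, Manhattan, and Minkowski distances between two vectors.
def minkowski(x, y, h):
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def euclidean(x, y):
    return minkowski(x, y, 2)      # h = 2

def manhattan(x, y):
    return minkowski(x, y, 1)      # h = 1

x, y = (1, 2), (3, 5)              # arbitrary example points
print(euclidean(x, y))             # 3.605...
print(manhattan(x, y))             # 5.0
```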
• A binary variable is symmetric if both of its states are equally valuable and carry the same weight.
• Dissimilarity that is based on symmetric binary variables is called symmetric binary dissimilarity.
• A binary variable is asymmetric if the outcomes of the states are not equally important, such as
the positive and negative outcomes of a disease test.
• A binary variable contains two possible outcomes: 1 (positive/present) or 0
(negative/absent). If there is no preference for which outcome should be coded as 0
and which as 1, the binary variable is called symmetric.
• For example, the binary variable "is evergreen?" for a plant has the possible states
"loses leaves in winter" and "does not lose leaves in winter." Both are equally
valuable and carry the same weight when a proximity measure is computed.
• If the outcomes of a binary variable are not equally important, the binary variable is
called asymmetric.
• An example of such a variable is the presence or absence of a relatively rare
attribute, such as "is color-blind" for a human being.
• While we can say that two people who are color-blind have something in common, we
cannot say that people who are not color-blind have something in common.
Jaccard Coefficient
• The number of negative matches, t, is considered unimportant and thus is
ignored in the computation.
• Alternatively, we can measure the distance between two binary variables based on the
notion of similarity instead of dissimilarity: the Jaccard coefficient sim(i, j) = q / (q + r + s) = 1 − d(i, j).
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
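The sketch below reproduces these three values, assuming the binary symptom table of the textbook example (attributes fever, cough, test-1 … test-4, with Y/P coded as 1 and N as 0). For asymmetric binary variables, d(i, j) = (r + s) / (q + r + s), where q counts 1–1 matches, r counts 1–0 mismatches, s counts 0–1 mismatches, and the 0–0 matches t are ignored.

```python
# Asymmetric binary dissimilarity d = (r + s) / (q + r + s); 0-0 matches are ignored.
def asym_binary_dissim(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Attributes: fever, cough, test-1, test-2, test-3, test-4 (Y/P -> 1, N -> 0)
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(asym_binary_dissim(jack, mary))  # 0.33
print(asym_binary_dissim(jack, jim))   # 0.67
print(asym_binary_dissim(jim, mary))   # 0.75
```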
How can we compute the dissimilarity between
objects described by categorical, ordinal, and
ratio-scaled variables?
Categorical, Ordinal, and Ratio-Scaled
Variables
• A categorical variable is a generalization of the binary variable in that it can take on more than
two states.
• For example, map color is a categorical variable that may have, say, five states: red, yellow,
green, pink, and blue.
• Let the number of states of a categorical variable be M. The states can be denoted by letters,
symbols, or a set of integers, such as 1, 2, …, M.
• The dissimilarity between two objects i and j can be computed based on the ratio of
mismatches (Eqn 7.3): d(i, j) = (p − m) / p,
• where m is the number of matches (i.e., the number of variables for which i and j are in
the same state), and p is the total number of variables.
Suppose that we have the sample data of Table 7.3, except that only the object-identifier and
the variable (or attribute) test-1 are available, where test-1 is categorical. Let's compute the
dissimilarity matrix
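A small sketch of the mismatch-ratio dissimilarity d(i, j) = (p − m)/p; the test-1 codes A, B, C, A for objects 1 to 4 are assumed from the textbook table, and since p = 1 here the dissimilarity is simply 0 for a match and 1 for a mismatch.

```python
# Categorical dissimilarity: d(i, j) = (p - m) / p, m = number of matching attributes.
def categorical_dissim(obj_i, obj_j):
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

test1 = {1: ("A",), 2: ("B",), 3: ("C",), 4: ("A",)}   # test-1 only, so p = 1

for i in test1:
    row = [categorical_dissim(test1[i], test1[j]) for j in test1 if j <= i]
    print(row)   # lower triangle of the dissimilarity matrix
```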
3. Ordinal Variables
• An ordinal variable can be discrete or continuous. (we need to convert
ordinal into ratio scale)
• Order is important, e.g. rank (junior, senior)
• Can be treated like interval-scaled
• Replace each ordinal variable value by its rank rif ∈ {1, …, Mf}
• The distance can then be calculated by treating the ordinal variable as quantitative
• Normalized rank: map the range of each variable onto [0.0, 1.0] by replacing the rank of the i-th object in the f-th
variable with zif = (rif − 1) / (Mf − 1)
• There are three states for test-2, namely fair, good, and excellent, that
is Mf =3.
• step 1, if we replace each value for test-2 by its rank, the four objects
are assigned the ranks 3, 1, 2, and 3, respectively.
• Step 2 normalizes the ranking by mapping rank 1 to 0.0, rank 2 to 0.5,
and rank 3 to 1.0.
• For step 3, we can use, say, the Euclidean distance, which results in
the following dissimilarity matrix:
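A sketch of the three steps, assuming the test-2 values excellent, fair, good, excellent for objects 1 to 4 (consistent with the ranks 3, 1, 2, 3 above): replace each value by its rank, normalize with zif = (rif − 1)/(Mf − 1), and use a numeric distance (Euclidean in one dimension is the absolute difference).

```python
# Ordinal attribute -> rank -> normalized rank z = (r - 1) / (Mf - 1).
order = {"fair": 1, "good": 2, "excellent": 3}   # Mf = 3 states
Mf = len(order)

test2 = ["excellent", "fair", "good", "excellent"]           # objects 1..4
z = [(order[v] - 1) / (Mf - 1) for v in test2]               # [1.0, 0.0, 0.5, 1.0]

# One-dimensional Euclidean distance = absolute difference of the z values.
dissim = [[abs(z[i] - z[j]) for j in range(len(z))] for i in range(len(z))]
for row in dissim:
    print(row)
```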
Ratio-Scaled Variables
• Dissimilarity between ratio-scaled variables can be computed in three ways: treat them like
interval-scaled variables, apply a logarithmic transformation to their values, or treat them as
continuous ordinal data and use their ranks.
• The latter two methods are the most effective, although the choice of
method used may depend on the given application.
4. Variables of Mixed Types
• how can we compute the dissimilarity between objects of mixed
variable types?”
• One approach is to group each kind of variable together, performing a
separate cluster analysis for each variable type.
• A more preferable approach is to process all variable types together,
performing a single cluster analysis.
• Suppose that the data set contains p variables of mixed type. The
dissimilarity d(i, j) between objects i and j is defined as
d(i, j) = ( Σ(f=1..p) δij(f) dij(f) ) / ( Σ(f=1..p) δij(f) ),
where the indicator δij(f) = 0 if xif or xjf is missing, or if xif = xjf = 0 and variable f is
asymmetric binary; otherwise δij(f) = 1.
Variables of Mixed Types
• Apply a logarithmic transformation to its values; the transformed values for the
objects 1 to 4 are 2.65, 1.34, 2.21, and 3.08, respectively.
• maxh xh = 3.08 and minh xh = 1.34.
• Then normalize the values in the dissimilarity matrix obtained in
Example 7.5 by dividing each one by (3.08 – 1.34) = 1.74.
• We can now use the dissimilarity matrices for the three variables in
our computation.
Vector Objects
• There are several ways to define such a similarity function, s(x, y), to
compare two vectors x and y.
• One popular way is to define the similarity function as a cosine measure: s(x, y) = (x · y) / (||x|| ||y||), where ||x|| is the Euclidean norm of vector x
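A short sketch of the cosine measure s(x, y) = (x · y) / (||x|| ||y||); the two sample vectors stand in for term-frequency vectors of two documents and are only illustrative.

```python
import math

# Cosine similarity between two vectors: s(x, y) = (x . y) / (||x|| * ||y||).
def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)   # e.g., term counts of document 1 (illustrative)
y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)   # term counts of document 2 (illustrative)
print(cosine_similarity(x, y))        # ~0.94
```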
Session 2
Partitioning Method: Introduction
K-Means Algorithm
Partitioning Algorithms: Basic
Concept
• Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the
sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci)
E = Σ(i=1..k) Σ(p ∈ Ci) (p − ci)²
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the
cluster
• k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is
represented by one of the objects in the cluster
The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
• Partition objects into k nonempty subsets
• Compute seed points as the centroids of the clusters of the current
partitioning (the centroid is the center, i.e., mean point, of the cluster)
• Assign each object to the cluster with the nearest seed point
• Go back to Step 2, stop when the assignment does not change
K-Means Clustering-
Step-03:
Calculate the distance between each data point and each cluster center.
•The distance may be calculated either by using given distance function or by using euclidean distance formula.
Step-04:
Assign each data point to some cluster.
•A data point is assigned to that cluster whose center is nearest to that data point.
Step-05:
Re-compute the center of newly formed clusters.
•The center of a cluster is computed by taking mean of all the data points contained in that cluster.
Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria is met-
•Centers of newly formed clusters do not change
•Data points remain in the same cluster
•The maximum number of iterations is reached
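A minimal, self-contained k-means sketch in Python that follows the steps above (nearest-centroid assignment, mean-based centroid update, stop when assignments no longer change or a maximum number of iterations is reached). Taking the first k points as the initial centroids is an arbitrary choice made for the sketch.

```python
import math

def kmeans(points, k, max_iter=100):
    # Step 1-2: pick initial centroids (here: the first k points, an arbitrary choice).
    centroids = [list(p) for p in points[:k]]
    assignment = [None] * len(points)

    for _ in range(max_iter):
        # Step 3-4: assign each point to the cluster of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:      # Step 6: stop when nothing changes
            break
        assignment = new_assignment

        # Step 5: recompute each centroid as the mean of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment

data = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]  # toy data
print(kmeans(data, k=2))
```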
• K-Means Clustering Algorithm offers the following advantages-
• Point-01:
• It is relatively efficient with time complexity O(nkt) where-
• n = number of instances
• k = number of clusters
• t = number of iterations
• Point-02:
• Disadvantages are discussed under “Comments on the K-Means Method” below.
[Figure: one round of k-means with K = 2 — the initial data set is arbitrarily partitioned into k groups, cluster centroids are updated, objects are reassigned, and the loop repeats if needed.]
Partition objects into k nonempty subsets
Repeat
Compute the centroid (i.e., mean point) of each partition
Assign each object to the cluster of its nearest centroid
Update the cluster centroids
Until no change
Comments on the K-Means Method
• Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
• Comparison: PAM: O(k(n − k)²), CLARA: O(ks² + k(n − k)), where s is the sample size
(PAM = Partitioning Around Medoids, CLARA = Clustering LARge Applications)
• Comment: Often terminates at a local optimum.
• Weakness
• Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
• Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k;
see Hastie et al., 2009)
• Sensitive to noisy data and outliers
• Not suitable to discover clusters with non-convex shapes
What Is the Problem of the K-Means
Method?
• The k-means algorithm is sensitive to outliers !
• Since an object with an extremely large value may substantially distort the distribution of the
data
• K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point,
medoids can be used, which is the most centrally located object in a cluster
• Another variant to k-means is the k-modes method, which extends the k-means
paradigm to cluster categorical data by replacing the means of clusters with
modes, using new dissimilarity measures to deal with categorical objects and a
frequency-based method to update modes of clusters. The k-means and the k-
modes methods can be integrated to cluster data with mixed numeric and
categorical values.
• The EM (Expectation-Maximization) algorithm extends the k-means paradigm in
a different way. Whereas the k-means algorithm assigns each object to a cluster,
• In EM, each object is assigned to each cluster according to a weight representing
its probability of membership.
• In other words, there are no strict boundaries between clusters. Therefore, new
means are computed based on weighted measures.
How can we make the k-means algorithm more scalable?
A recent approach to scaling the k-means algorithm is based on the idea of identifying three
kinds of regions in data:
1. regions that are compressible,
2. regions that must be maintained in main memory,
3. and regions that are discardable.
An object is discardable if its membership in a cluster is ascertained.
An object is compressible if it is not discardable but belongs to a tight subcluster.
A data structure known as a clustering feature is used to summarize objects that have
been discarded or compressed.
If an object is neither discardable nor compressible, then it should be retained in main
memory.
To achieve scalability, the iterative clustering algorithm includes only the clustering features of the compressible objects and the objects that
must be retained in main memory,
• thereby turning a secondary-memory-based algorithm into a main-memory-based algorithm.
• An alternative approach to scaling the k-means algorithm explores the microclustering idea: first group nearby objects into “microclusters” and then perform k-means clustering on the microclusters.
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
We calculate the distance of each point from each of the three cluster centers,
using the given distance function.
For example, the distance between A1(2, 10) and C2(5, 8) is |5 − 2| + |8 − 10| = 3 + 2 = 5.
• Calculate new cluster centers
• After second iteration, the center of the three clusters are-
• C1(3, 9.5)
• C2(6.5, 5.25)
• C3(1.5, 3.5)
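The sketch below reproduces this worked example: Manhattan distance ρ(a, b) = |x2 − x1| + |y2 − y1|, the given initial centers, and two assignment/update iterations, after which it prints C1(3, 9.5), C2(6.5, 5.25), C3(1.5, 3.5).

```python
# K-means on the eight points with Manhattan distance and the given initial centers.
points  = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
centers = [(2, 10), (5, 8), (1, 2)]          # A1, A4, A7

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for _ in range(2):                           # two iterations, as asked
    # Assign every point to its nearest center.
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda c: manhattan(p, centers[c]))
        clusters[nearest].append(p)
    # Recompute each center as the mean of its cluster.
    centers = [
        (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
        for cl in clusters
    ]

print(centers)   # [(3.0, 9.5), (6.5, 5.25), (1.5, 3.5)]
```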
Practice Problem
Session 3
K-Medoids
Hierarchical Method: Introduction
Quality of clustering: measured by the within-cluster variation,
Error E = Σ(i=1..k) Σ(p ∈ Ci) dist(p, ci)²
“How can we modify the k-means algorithm to diminish such sensitivity to
outliers?”
• Instead of taking the mean value of the objects in a cluster as a reference point,
• pick actual objects to represent the clusters, using one representative object
per cluster.
• Each remaining object is assigned to the cluster of which the representative
object is the most similar.
• The partitioning method is then performed based on the principle of
minimizing the sum of the dissimilarities between each object p and its
corresponding representative object.
• That is, an absolute-error criterion is used, defined as
E = Σ(i=1..k) Σ(p ∈ Ci) dist(p, oi), where oi is the representative object (medoid) of cluster Ci.
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• Starts from an initial set of medoids and iteratively replaces one of the medoids by one of
the non-medoids if it improves the total distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well for large data sets (due to
the computational complexity)
[Figure: PAM, a typical k-medoids run with K = 2 and total cost = 26]
1. Arbitrarily choose k objects as the initial medoids.
2. Assign each remaining object to the cluster of its nearest medoid.
3. Randomly select a nonmedoid object, O_random.
4. Compute the total cost of swapping a medoid O with O_random.
5. If the quality is improved, perform the swap.
6. Repeat the loop until no change.
• More robust than k-means in the presence of noise and outliers
• Time complexity: O(k(n − k)²) per iteration, so PAM does not scale well to large data sets
• “How can we scale up the k-medoids method?”
• CLARA (Clustering LARge Applications): apply PAM to random samples of the data
• In some situations we may want to partition our data into groups at
different levels such as in a hierarchy.
• A hierarchical clustering method works by grouping data objects into
a hierarchy or “tree” of clusters.
• Representing data objects in the form of a hierarchy is useful for data
summarization and visualization.
• Examples: handwriting recognition, hierarchy of species (animals, birds, etc.), employee
hierarchies, gaming (chess).
• Agglomerative versus divisive hierarchical clustering,
• which organize objects into a hierarchy using a bottom-up or top-
down strategy, respectively.
• Agglomerative methods start with individual objects as clusters,
which are iteratively merged to form larger clusters.
• Conversely, divisive methods initially let all the given objects form one
cluster, which they iteratively split into smaller clusters.
• Hierarchical clustering methods can encounter difficulties regarding
the selection of merge or split points. Such a decision is critical:
merge or split decisions, if not well chosen, may lead to low-quality
clusters.
Moreover, the methods do not scale well because each decision of
merge or split needs to examine and evaluate many objects or clusters.
Solution: can be combined with Multiphase clustering
Hierarchical Clustering
DBSCAN technique
STING technique
CLIQUE technique
• Decompose data objects into several levels of nested partitioning (a tree of
clusters), called a dendrogram.
DIANA (Divisive Analysis)
Distance Between Clusters
• Single Link: smallest distance between points
• Complete Link: largest distance between points
• Average Link: average distance between points
• Centroid: distance between centroids
Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element
in the other, i.e., dist(Ki, Kj) = min(tip, tjq) // used when updating the distance matrix
• Complete link: largest distance between an element in one cluster and an element in the
other, i.e., dist(Ki, Kj) = max(tip, tjq)
• Average: avg distance between an element in one cluster and an element in the
other, i.e., dist(Ki, Kj) = avg(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
• Medoid: a chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a Cluster (for
numerical data sets)
• Centroid: the “middle” of a cluster: Cm = ( Σ(i=1..N) tip ) / N
• Radius: square root of the average distance from any point of the cluster to its centroid:
Rm = sqrt( Σ(i=1..N) (tip − Cm)² / N )
• Diameter: square root of the average mean squared distance between all pairs of points
in the cluster: Dm = sqrt( Σ(i=1..N) Σ(j=1..N) (tip − tjq)² / (N(N − 1)) )
[Figure: a dendrogram over objects A, B, C, D, E with levels 1–5; cutting it at a threshold level yields a clustering.]
Problem: For the one-dimensional data set {7, 10, 20, 28, 35}, perform
hierarchical clustering and plot the dendrogram to visualize it.
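A hedged sketch of agglomerative (AGNES-style) clustering with single link on this data set; it prints each merge and the distance at which it happens, which is exactly the information needed to draw the dendrogram by hand.

```python
# Agglomerative clustering with single link (minimum distance between members).
def single_link(c1, c2):
    return min(abs(a - b) for a in c1 for b in c2)

clusters = [[7], [10], [20], [28], [35]]
while len(clusters) > 1:
    # Find the pair of clusters with the smallest single-link distance.
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    d = single_link(clusters[i], clusters[j])
    print(f"merge {clusters[i]} + {clusters[j]} at distance {d}")
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]
```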
Extensions to Hierarchical Clustering
Clustering Feature Vector in BIRCH
• Clustering Feature: CF = (N, LS, SS), where N is the number of data points,
LS = Σ(i=1..N) Xi is the linear sum of the points, and SS = Σ(i=1..N) Xi² is the square sum of the points.
• Example: for the five points (3, 4), (2, 6), (4, 5), (4, 7), (3, 8),
CF = (5, (16, 30), (54, 190))
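A small sketch that computes CF = (N, LS, SS) for these five points and demonstrates CF additivity (merging two subclusters by adding their CFs); the function names are our own, not BIRCH's.

```python
# Clustering feature CF = (N, LS, SS): count, linear sum, and square sum per dimension.
def clustering_feature(points):
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(len(points[0])))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(len(points[0])))
    return n, ls, ss

def merge_cf(cf1, cf2):
    # CF additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2).
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))            # (5, (16, 30), (54, 190))
print(merge_cf(clustering_feature(pts[:2]), clustering_feature(pts[2:])))
```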
CF-Tree in BIRCH
• Clustering feature:
• Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd moments of the
subcluster from the statistical point of view
• Registers crucial measurements for computing clusters and utilizes storage efficiently
• A CF tree is a height-balanced tree that stores the clustering features for a hierarchical
clustering
• A nonleaf node in a tree has descendants or “children”
• The nonleaf nodes store sums of the CFs of their children
• A CF tree has two parameters
• Branching factor: max # of children
• Threshold: max diameter of sub-clusters stored at the leaf nodes
• The branching factor specifies
• the maximum number of children per nonleaf node.
• The threshold parameter specifies
the maximum diameter of subclusters stored at the leaf nodes of the tree.
These two parameters implicitly control the resulting tree’s size.
The CF Tree Structure
[Figure: CF-tree structure — the root and nonleaf nodes hold entries CF1, CF2, CF3, …, each with a pointer to a child node (child1, child2, child3, …).]
The Birch Algorithm
• Cluster diameter: D = sqrt( Σi Σj (xi − xj)² / (n(n − 1)) )
Partitioning and hierarchical
methods are designed to find
spherical-shaped clusters.
Session 6
DBSCAN
Density-based clusters are dense areas in the data
space separated from each other by sparser areas.
Given such data, partitioning and hierarchical methods would
likely identify convex regions inaccurately, with
noise or outliers included in the clusters.
• Density-reachable:
• A point p is density-reachable from a
point q w.r.t. Eps, MinPts if there is a
chain of points p1, …, pn, with p1 = q and pn = p,
such that pi+1 is directly density-
reachable from pi
• Density-connected
• A point p is density-connected to a
point q w.r.t. Eps, MinPts if there is a
point o such that both p and q are
density-reachable from o w.r.t. Eps
and MinPts
[Figure: density-reachability along a chain q → p1 → p, and density-connectivity of p and q through a common point o.]
DBSCAN: Density-Based Spatial
Clustering of Applications with Noise
• Relies on a density-based notion of cluster: A cluster is defined as
a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with
noise
[Figure: core, border, and outlier (noise) points of a cluster, for Eps = 1 cm and MinPts = 5.]
DBSCAN: The Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
• Continue the process until all of the points have been
processed
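A compact, self-contained DBSCAN sketch that follows the algorithm above: an Eps-neighborhood query, the core-point test against MinPts, and expansion over density-reachable points. It is a teaching sketch (brute-force neighborhood search), not an optimized implementation; the toy data set is arbitrary.

```python
import math

def dbscan(points, eps, min_pts):
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def region_query(i):
        # All points within Eps of point i (including i itself).
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:          # not a core point
            labels[i] = NOISE
            continue
        labels[i] = cluster_id                # start a new cluster from core point i
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:            # border point previously marked as noise
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:   # j is also a core point: expand further
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

data = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.1, 8.2), (7.9, 8.1), (4, 5)]
print(dbscan(data, eps=0.5, min_pts=3))   # two clusters and one noise point
```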
DBSCAN: Sensitive to Parameters
Session 7
STING
Grid-Based Clustering Method
• Using multi-resolution grid data structure
• Several interesting methods
• STING (a STatistical INformation Grid approach) by Wang,
Yang and Muntz (1997)
• WaveCluster by Sheikholeslami, Chatterjee, and Zhang
(VLDB’98)
• A multi-resolution clustering approach using wavelet method
• CLIQUE: Agrawal, et al. (SIGMOD’98)
• Both grid-based and subspace clustering
STING: A Statistical Information Grid
Approach
• Wang, Yang and Muntz (VLDB’97)
• The spatial area is divided into rectangular cells
• There are several levels of cells corresponding to different levels of
resolution
[Figure: STING's hierarchical grid structure — the 1st layer at the top, down through the (i−1)-st layer to the i-th layer; each cell at one level is partitioned into smaller cells at the next level.]
The STING Clustering Method
• Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
• Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
• Parameters of higher level cells can be easily calculated from
parameters of lower level cells
• count, mean, s, min, max
• type of distribution—normal, uniform, etc.
• Use a top-down approach to answer spatial data queries
• Start from a pre-selected layer—typically with a small number of
cells
• For each cell in the current level compute the confidence
interval
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie
inside each cell of the partition.
• Identify the subspaces that contain clusters using the Apriori
principle
• Identify clusters
• Determine dense units in all subspaces of interests
• Determine connected dense units in all subspaces of interests.
• Generate minimal description for the clusters
• Determine maximal regions that cover a cluster of connected
dense units for each cluster
• Determination of minimal cover for each cluster
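A rough sketch of the counting step under simplifying assumptions: every dimension is split into xi equal-width intervals, 1-D units with at least tau points are kept as dense, and candidate 2-D units are formed only from dense 1-D units (the Apriori principle). The parameters xi and tau and the toy data are illustrative, and the sketch stops before connecting dense units into clusters.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, xi=4, tau=3):
    """Count grid cells per dimension and per pair of dimensions (CLIQUE-style)."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    width = [(highs[d] - lows[d]) / xi or 1.0 for d in range(dims)]

    def unit(p, d):                     # index of the interval of p in dimension d
        return min(int((p[d] - lows[d]) / width[d]), xi - 1)

    # Dense 1-D units: at least tau points fall into the interval.
    dense_1d = {
        d: {u for u, c in Counter(unit(p, d) for p in points).items() if c >= tau}
        for d in range(dims)
    }
    # Apriori: a 2-D unit can only be dense if both of its 1-D projections are dense.
    dense_2d = {}
    for d1, d2 in combinations(range(dims), 2):
        counts = Counter((unit(p, d1), unit(p, d2)) for p in points
                         if unit(p, d1) in dense_1d[d1] and unit(p, d2) in dense_1d[d2])
        dense_2d[(d1, d2)] = {u for u, c in counts.items() if c >= tau}
    return dense_1d, dense_2d

# Toy data: (age, salary in $1000s, vacation weeks) -- arbitrary values.
data = [(25, 30, 2), (28, 32, 2), (27, 31, 3),
        (45, 60, 5), (47, 62, 5), (46, 61, 4), (60, 40, 1)]
print(dense_units(data, xi=4, tau=3))
```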
Vacation
(10,000)
(week)
Salary
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
age age
20 30 40 50 60 20 30 40 50 60
=3
Vacation
y
l ar 30 50
Sa age
Source :
http://ccs1.hnue.edu.vn/hungtd/DM2012/DataMining_BOOK.p
Strength and Weakness of
CLIQUE
• Strength
• automatically finds subspaces of the highest dimensionality
such that high density clusters exist in those subspaces
• insensitive to the order of records in input and does not
presume some canonical data distribution
• scales linearly with the size of input and has good scalability
as the number of dimensions in the data increases
• Weakness
• The accuracy of the clustering result may be degraded as a
trade-off for the simplicity of the method
Session 9
Evaluation of Clustering Techniques
Assessing Clustering Tendency
• Assess if non-random structure exists in the data by measuring the probability that the data is
generated by a uniform data distribution
• Test spatial randomness by a statistical test: the Hopkins statistic
• Given a dataset D regarded as a sample of a random variable o, determine how far away o is
from being uniformly distributed in the data space
• Sample n points, p1, …, pn, uniformly from D. For each pi, find its nearest neighbor in D: xi =
min{dist (pi, v)} where v in D
• Sample n points, q1, …, qn, uniformly from D. For each qi, find its nearest neighbor in D – {qi}:
yi = min{dist (qi, v)} where v in D and v ≠ qi
• Calculate the Hopkins statistic: H = Σ(i=1..n) yi / ( Σ(i=1..n) xi + Σ(i=1..n) yi )
• If D is uniformly distributed, ∑ xi and ∑ yi will be close to each other and H is close to 0.5. If D
is highly skewed (clustered), H is close to 0
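A hedged sketch of the Hopkins statistic, using one common formulation that matches the behavior described above (H ≈ 0.5 for uniform data, close to 0 for clustered data): xi are nearest-neighbor distances of points generated uniformly over the bounding box of the data space, and yi are nearest-neighbor distances of points sampled from D itself.

```python
import math
import random

def hopkins(data, n=None, seed=0):
    """Hopkins statistic H = sum(y) / (sum(x) + sum(y))."""
    rng = random.Random(seed)
    n = n or max(1, len(data) // 10)
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]

    # x_i: points generated uniformly over the data space, distance to nearest point of D.
    uniform = [tuple(rng.uniform(lows[d], highs[d]) for d in range(dims)) for _ in range(n)]
    x = [min(math.dist(u, q) for q in data) for u in uniform]

    # y_i: real points sampled from D, distance to their nearest other point in D.
    sample = rng.sample(data, n)
    y = [min(math.dist(p, q) for q in data if q != p) for p in sample]

    return sum(y) / (sum(x) + sum(y))

clustered = [(1, 1), (1.1, 0.9), (0.9, 1.2), (5, 5), (5.1, 5.2), (4.9, 5.1)]
print(hopkins(clustered, n=3))   # typically well below 0.5 for clustered data
```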
Determine the Number of Clusters
• Empirical method
• # of clusters k ≈ sqrt(n/2) for a dataset of n points
• Elbow method
• Use the turning point in the curve of sum of within cluster variance w.r.t
the # of clusters
• Cross validation method
• Divide a given data set into m parts
• Use m – 1 parts to obtain a clustering model
• Use the remaining part to test the quality of the clustering
• E.g., For each point in the test set, find the closest centroid, and use
the sum of squared distance between all points in the test set and the
closest centroids to measure how well the model fits the test set
• For any k > 0, repeat it m times, compare the overall quality measure w.r.t.
different k’s, and find # of clusters that fits the data the best
Measuring Clustering Quality
• Two methods: extrinsic vs. intrinsic
• Extrinsic: supervised, i.e., the ground truth is available
• Compare a clustering against the ground truth using certain
clustering quality measure
• Ex. BCubed precision and recall metrics
• Intrinsic: unsupervised, i.e., the ground truth is unavailable
• Evaluate the goodness of a clustering by considering how well
the clusters are separated, and how compact the clusters are
• Ex. Silhouette coefficient
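A small sketch of the silhouette coefficient: for an object o, a(o) is the mean distance to the other objects of its own cluster, b(o) is the smallest mean distance to the objects of any other cluster, and s(o) = (b(o) − a(o)) / max(a(o), b(o)); the overall score averages s(o) over all objects. The toy data and labels are illustrative; treating singleton clusters as 0 is a convention adopted for the sketch.

```python
import math

def silhouette(points, labels):
    """Average silhouette coefficient s(o) = (b - a) / max(a, b) over all objects."""
    clusters = {c: [p for p, l in zip(points, labels) if l == c] for c in set(labels)}
    scores = []
    for p, c in zip(points, labels):
        own = [q for q in clusters[c] if q != p]
        if not own:                       # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for q in clusters[c2]) / len(clusters[c2])
            for c2 in clusters if c2 != c
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.1, 8.2), (7.9, 8.1)]
print(silhouette(pts, [0, 0, 0, 1, 1, 1]))   # close to 1: compact, well-separated clusters
```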
Measuring Clustering Quality: Extrinsic
Methods
• Clustering quality measure: Q(C, Cg), for a clustering C given the ground truth Cg.
• Q is good if it satisfies the following 4 essential criteria
• Cluster homogeneity: the purer, the better
• Cluster completeness: should assign objects belonging to the same category in
the ground truth to the same cluster
• Rag bag: putting a heterogeneous object into a pure cluster should be
penalized more than putting it into a rag bag (i.e., “miscellaneous” or “other”
category)
• Small cluster preservation: splitting a small category into pieces is more
harmful than splitting a large category into pieces
References
• Jiawei Han, Micheline Kamber, and Jian Pei, “Data Mining: Concepts and
Techniques”, 3rd Edition, Morgan Kaufmann Publishers, 2011.
• http://ccs1.hnue.edu.vn/hungtd/DM2012/DataMining_BOOK.pdf