
ADVANCED CLUSTER ANALYSIS
Clustering high-dimensional data
SYLLABUS
Clustering techniques:
 Hierarchical
 K-means
 Clustering high-dimensional data
 CLIQUE and PROCLUS
 Frequent-pattern-based clustering methods
 Clustering in non-Euclidean space
 Clustering for streams and parallelism
 Probabilistic model-based clustering
 Clustering high-dimensional data
 Clustering graph and network data
 Clustering with constraints

CLUSTERING HIGH-DIMENSIONAL DATA

 The clustering methods we have studied so far work well when the dimensionality is not high, that is, when objects have fewer than 10 attributes.
 There are, however, important applications of high-dimensional data.
 “How can we conduct cluster analysis on high-dimensional data?”
EXAMPLE
 All Electronics keeps track of the products purchased by every customer.
 As a customer-relationship manager, you want to cluster customers into groups according to what they purchased from All Electronics.
 All Electronics carries tens of thousands of products.
[The original slide shows a purchase table for three customers: Ada, Bob, and Cathy.]
 From that table it is easy to see that

dist(Ada, Bob) = dist(Bob, Cathy) = dist(Ada, Cathy) = √2

 According to Euclidean distance, the three customers are equivalently similar (or dissimilar) to each other.
 However, a closer look tells us that Ada should be more similar to Cathy than to Bob, because Ada and Cathy share one common purchased item, P1.
 Traditional distance measures can thus be ineffective on high-dimensional data: such measures may be dominated by the noise in the many other dimensions.
 Therefore, clusters in the full, high-dimensional space can be unreliable, and finding such clusters may not be meaningful.
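To see how noise in many dimensions dominates a distance measure, here is a minimal illustrative sketch in Python with NumPy (random data standing in for the purchase table): as the number of noisy attributes grows, the smallest and largest pairwise Euclidean distances become nearly indistinguishable.

import numpy as np

rng = np.random.default_rng(0)

# 50 hypothetical objects described by d noisy attributes each.
for d in (2, 10, 100, 1000):
    X = rng.random((50, d))
    # All pairwise Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    pairs = dists[np.triu_indices(50, k=1)]
    # As d grows this ratio approaches 1: every pair of objects looks
    # roughly equally far apart, so distance carries little information.
    print(f"d={d:5d}  min/max distance ratio = {pairs.min() / pairs.max():.3f}")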

 Clustering high-dimensional data is the search for clusters and the space in which they exist.
FIRST CHALLENGE
 A major issue is how to create appropriate models for clusters
in high-dimensional data.
 Unlike conventional clusters in low-dimensional spaces,
clusters hidden in high-dimensional data are often
significantly smaller.
 For example, when clustering customer-purchase data, we
would not expect many users to have similar purchase
patterns.
 Searching for such small but meaningful clusters is like
finding needles in a haystack.
 We often have to consider more sophisticated techniques that can model correlations and consistency among objects in subspaces.
SECOND CHALLENGE

 There are typically an exponential number of possible subspaces or dimensionality reduction options, and thus the optimal solutions are often computationally prohibitive.
 For example, if the original data space has 1,000 dimensions and we want to find clusters of dimensionality 10, then there are C(1000, 10) ≈ 2.63 × 10^23 possible subspaces.
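That count is simply the number of ways to choose 10 attributes out of 1,000; a quick check in Python:

import math

# Number of 10-dimensional axis-parallel subspaces of a
# 1000-dimensional space: "1000 choose 10".
print(f"{math.comb(1000, 10):.2e}")  # prints 2.63e+23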
TWO MAJOR KINDS OF METHODS

 Subspace clustering approaches search for clusters existing in subspaces of the given high-dimensional data space, where a subspace is defined using a subset of attributes in the full space.

 Dimensionality reduction approaches try to construct a much lower-dimensional space and search for clusters in such a space. Often, a method may construct new dimensions by combining some dimensions from the original data, as the sketch below illustrates.
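A minimal sketch of the dimensionality reduction approach, assuming scikit-learn is available and that the data matrix, the component count, and the cluster count are illustrative placeholders:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical data: 500 objects with 1,000 attributes.
X = np.random.default_rng(1).random((500, 1000))

# Construct a much lower-dimensional space; each new dimension is a
# linear combination of the original dimensions.
X_low = PCA(n_components=10).fit_transform(X)

# Search for clusters in the reduced space.
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_low)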
SUBSPACE CLUSTERING METHODS

 Subspace search methods
 Correlation-based clustering methods
 Biclustering methods
SUBSPACE SEARCH METHODS
 A subspace search method searches various subspaces for clusters.
 Here, a cluster is a subset of objects that are similar to each other in a subspace.
 The similarity is often captured by conventional measures such as distance or density.
 A major challenge that subspace search methods face is how to search a series of subspaces effectively and efficiently.
GENERALLY, THERE ARE TWO KINDS OF STRATEGIES
 Bottom-up approaches start from low-dimensional subspaces and search higher-dimensional subspaces only when there may be clusters in those higher-dimensional subspaces.
 Various pruning techniques are explored to reduce the number of higher-dimensional subspaces that need to be searched.
 CLIQUE is an example of a bottom-up approach.
TOP-DOWN APPROACHES
 Top-down approaches start from the full space and search smaller and smaller subspaces recursively.
 Top-down approaches are effective only if the locality assumption holds, which requires that the subspace of a cluster can be determined by the local neighborhood.
 PROCLUS is an example of a top-down subspace approach.
CLIQUE: A DIMENSION-GROWTH SUBSPACE CLUSTERING METHOD
 CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space.
 In dimension-growth subspace clustering, the clustering process starts from a single-dimensional subspace and grows upward to higher-dimensional ones, using a grid structure.
 It can also be viewed as an integration of density-based and grid-based clustering methods.
 Its overall approach is typical of subspace clustering for high-dimensional spaces.
EXAMPLE
 The ideas of the CLIQUE clustering algorithm are outlined as follows:
 Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points.
 CLIQUE identifies the sparse and the crowded areas in space, thereby discovering the overall distribution patterns of the data set.
 A unit is dense if the fraction of the total data points contained in it exceeds an input model parameter.
 In CLIQUE, a cluster is defined as a maximal set of connected dense units.
HOW DOES CLIQUE WORK?
 STEP I: CLIQUE partitions the d-dimensional data space into non-overlapping rectangular units, identifying the dense units among them.
 STEP II: The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist, as the sketch below illustrates.
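A minimal sketch of these two steps, assuming a numeric data matrix X with no constant attribute; xi (intervals per dimension) and tau (density threshold) stand in for CLIQUE's input parameters. A full implementation would keep growing the dimensionality Apriori-style and then merge connected dense units into clusters.

from collections import Counter
from itertools import combinations
import numpy as np

def dense_units(X, xi=10, tau=0.05):
    n, d = X.shape
    # STEP I: partition each dimension into xi equal-width intervals and
    # map every point to its grid cell along each dimension.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cells = np.clip(((X - mins) / (maxs - mins) * xi).astype(int), 0, xi - 1)

    # A 1-D unit (dim, cell) is dense if it holds more than tau * n points.
    dense_1d = {(dim, c) for dim in range(d)
                for c, cnt in Counter(cells[:, dim]).items()
                if cnt > tau * n}

    # STEP II: intersect dense 1-D units from different dimensions to form
    # candidate 2-D units, and keep only those that are themselves dense.
    dense_2d = set()
    for (d1, c1), (d2, c2) in combinations(sorted(dense_1d), 2):
        if d1 < d2:
            cnt = np.count_nonzero((cells[:, d1] == c1) & (cells[:, d2] == c2))
            if cnt > tau * n:
                dense_2d.add(((d1, c1), (d2, c2)))
    return dense_1d, dense_2d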
HOW EFFECTIVE IS CLIQUE?
 CLIQUE automatically finds the subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
 It is insensitive to the order of input objects.
 It scales linearly with the size of the input and has good scalability as the number of dimensions in the data is increased.
 Clustering results are, however, dependent on proper tuning of the grid size and the density threshold.
GRAPHICAL DEFINITION
 A clique is a group of nodes in a graph such that all nodes in the clique are connected to each other.
 K is the number of nodes in a clique.
The clique percolation method is as follows (see the sketch after this list):
1) All K-cliques present in graph G are extracted.
2) A new clique graph GC is created:
   a) Each extracted K-clique is compressed into one vertex.
   b) Two vertices are connected by an edge in GC if their cliques have K − 1 common vertices.
3) The connected components in GC are identified.
4) Each connected component in GC represents a community.
5) The set C of communities formed for G is returned.
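The five steps translate almost directly into code; a minimal sketch using the networkx library (enumerating all cliques and testing every pair of K-cliques is fine for small graphs but does not scale):

from itertools import combinations
import networkx as nx

def clique_percolation(G, k):
    # Step 1: extract all k-cliques present in G.
    k_cliques = [frozenset(c) for c in nx.enumerate_all_cliques(G)
                 if len(c) == k]
    # Step 2: build the clique graph GC: one vertex per k-clique, with an
    # edge wherever two k-cliques share k - 1 vertices.
    GC = nx.Graph()
    GC.add_nodes_from(k_cliques)
    for c1, c2 in combinations(k_cliques, 2):
        if len(c1 & c2) == k - 1:
            GC.add_edge(c1, c2)
    # Steps 3-5: each connected component of GC is one community; report
    # the union of the member cliques' nodes.
    return [frozenset.union(*comp) for comp in nx.connected_components(GC)]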
[Figure: example cliques for K = 2 (nodes N1, N2), K = 3 (N1, N2, N3), and K = 4 (N1, N2, N3, N4).]
COMMUNITY
 A community is a group of cliques in which each clique can be reached from the others through a chain of cliques sharing K − 1 nodes.
CLIQUE PERCOLATION METHOD (CPM)
[The original slides illustrate cliques, communities, and a worked CPM example as a sequence of figures.]
CLIQUE & COMMUNITY
 Here, for K = 3:
CLIQUE 1 = {N1, N2, N3}
CLIQUE 2 = {N1, N2, N4}
COMMUNITY = {CLIQUE 1, CLIQUE 2}, since the two cliques share K − 1 = 2 nodes (N1 and N2).
EXAMPLE
Cliques for K = 3:
a) {1, 2, 3}
b) {1, 2, 8}
c) {2, 6, 5}
d) {2, 6, 4}
e) {2, 5, 4}
f) {4, 5, 6}
Community 1 = {a, b}
Community 2 = {c, d, e, f}
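The same answer can be checked with networkx's built-in clique percolation implementation; the edges below are the ones implied by the six triangles above. Note that node 2 ends up in both communities, since CPM allows communities to overlap.

import networkx as nx
from networkx.algorithms.community import k_clique_communities

G = nx.Graph([(1, 2), (1, 3), (2, 3), (1, 8), (2, 8),
              (2, 5), (2, 6), (5, 6), (2, 4), (4, 5), (4, 6)])

for community in k_clique_communities(G, 3):
    print(sorted(community))  # [1, 2, 3, 8] and [2, 4, 5, 6]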
EXAMPLE
Identify the cliques for K = 5 and K = 4. [The original slide shows the graph, on nodes 1, 2, 3, 5, 6, 7, 9, and 10, as a figure.]
PROCLUS
 PROCLUS (PROjected CLUStering) is a top-down projected clustering method.
 It first chooses a random sample of the data points, and from it a set of points that are likely to be the medoids of the clusters.
INPUT AND OUTPUT FOR PROCLUS
 Input:
 The set of data points
 The number of clusters, denoted by k
 The average number of dimensions per cluster, denoted by L
 Output:
 The clusters found, and the dimensions associated with each cluster
Three phases of PROCLUS:
 Initialization Phase
 Iterative Phase
 Refinement Phase
INITIALIZATION PHASE
 Choose a sample set of data points randomly.
 From the sample, choose a set of points that are likely to be the medoids of the clusters.
ITERATIVE PHASE
 From the Initialization Phase, we get a set of data points, denoted by M, which should contain the medoids.
 In this phase, we find the best medoids from M.
 Randomly pick a current set of medoids Mcurrent, and replace “bad” medoids with other points from M if necessary.
For the current medoids, the following is done:
 Find the dimensions related to each medoid
 Assign data points to the medoids
 Evaluate the clusters formed
 Find the bad medoid, and try the result of replacing it
 The above procedure is repeated until a satisfactory result is obtained.

REFINEMENT PHASE: HANDLING OUTLIERS
 For each medoid mi with dimension set Di, find the smallest Manhattan segmental distance δi from mi to any of the other medoids with respect to Di.
 δi defines the sphere of influence of the medoid mi.
 A data point is an outlier if it is not under any medoid's sphere of influence, as in the sketch below.
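A minimal sketch of this outlier test, assuming the medoids, their dimension sets, and the δ radii were computed in the earlier phases (the names here are illustrative, not part of the original algorithm's pseudocode):

def manhattan_segmental(x, y, dims):
    # Manhattan segmental distance between points x and y relative to the
    # dimension set dims: the Manhattan distance over those dimensions,
    # averaged over their number.
    dims = list(dims)
    return sum(abs(x[i] - y[i]) for i in dims) / len(dims)

def is_outlier(point, medoids, dim_sets, deltas):
    # A point is an outlier if it lies inside no medoid's sphere of
    # influence (radius delta_i, measured in medoid m_i's dimension set D_i).
    return all(manhattan_segmental(point, m, D) > delta
               for m, D, delta in zip(medoids, dim_sets, deltas))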
