CLIQUE and PROCLUS

Clustering high-dimensional data presents major challenges: many dimensions may be irrelevant, and distance measures become less meaningful. Methods for addressing this include feature transformation, feature selection, and subspace clustering. CLIQUE is a subspace clustering algorithm that identifies dense clusters in subspaces of the data: it partitions each dimension into intervals, finds dense units, and then connects dense units to form clusters. Frequent pattern-based clustering is also covered; it uses inherent frequent patterns as features to cluster high-dimensional data such as text or microarray data.


Clustering High-Dimensional Data: CLIQUE and PROCLUS

Clustering High-Dimensional Data
• Clustering high-dimensional data
– Many applications: text documents, DNA micro-array data
– Major challenges:
• Many irrelevant dimensions may mask clusters
• Distance measure becomes meaningless—due to equi-distance
• Clusters may exist only in some subspaces
• Methods
– Feature transformation: only effective if most dimensions are relevant
• PCA & SVD useful only when features are highly correlated/redundant
– Feature selection: wrapper or filter approaches
• useful to find a subspace where the data have nice clusters
– Subspace clustering: find clusters in all possible subspaces
• CLIQUE, ProClus, and frequent pattern-based clustering

The Curse of Dimensionality
(graphs adapted from Parsons et al., KDD Explorations 2004)
• Data in only one dimension is relatively densely packed
• Adding a dimension "stretches" the points across that dimension, moving them further apart
• Adding more dimensions moves the points further apart still; high-dimensional data is extremely sparse
• Distance measures become meaningless due to equi-distance (illustrated in the sketch below)
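The equi-distance effect is easy to demonstrate empirically. The following is a small illustrative Python sketch (my addition, not from the slides): for uniform random data, the ratio of the farthest to the nearest distance from a query point approaches 1 as dimensions are added.

import numpy as np

rng = np.random.default_rng(0)
for dim in (1, 2, 10, 100, 1000):
    pts = rng.random((500, dim))            # 500 uniform random points
    q = rng.random(dim)                     # one query point
    d = np.linalg.norm(pts - q, axis=1)     # Euclidean distances to all points
    # As dim grows, max/min -> 1: every point becomes almost equi-distant
    print(f"dim={dim:5d}  max/min distance = {d.max() / d.min():.2f}")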

CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)
• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
• CLIQUE can be considered both density-based and grid-based
  – It partitions each dimension into the same number of equal-length intervals
  – It partitions an m-dimensional data space into non-overlapping rectangular units
  – A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
  – A cluster is a maximal set of connected dense units within a subspace

CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the partition
• Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters (a code sketch of this search follows the list)
  – Determine dense units in all subspaces of interest
  – Determine connected dense units in all subspaces of interest
• Generate a minimal description for the clusters
  – Determine the maximal regions that cover a cluster of connected dense units, for each cluster
  – Determine the minimal cover for each cluster
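To make the partitioning and Apriori-based steps concrete, here is a minimal Python sketch of the bottom-up dense-unit search. It is an illustration under assumptions, not the authors' implementation: xi (intervals per dimension) and tau (density threshold) follow the paper's notation, the function name dense_units is mine, and the later steps (connecting dense units, generating minimal descriptions) are omitted.

import numpy as np
from itertools import combinations

def dense_units(data, xi=10, tau=0.05, max_dim=3):
    # A unit is a set of (dimension, interval) pairs; it is dense when the
    # fraction of points falling inside it exceeds tau.
    n, d = data.shape
    min_count = tau * n
    # Partition every dimension into xi equal-length intervals
    width = (data.max(0) - data.min(0)) / xi + 1e-12
    cells = np.minimum(((data - data.min(0)) / width).astype(int), xi - 1)

    # 1-dimensional dense units
    level = {}
    for dim in range(d):
        for interval in range(xi):
            members = frozenset(np.flatnonzero(cells[:, dim] == interval))
            if len(members) > min_count:
                level[frozenset({(dim, interval)})] = members
    all_dense = dict(level)

    # Join (k-1)-dimensional dense units into k-dimensional candidates;
    # Apriori: a candidate can only be dense if its subsets are dense
    for k in range(2, max_dim + 1):
        nxt = {}
        for u1, u2 in combinations(level, 2):
            cand = u1 | u2
            if len(cand) == k and len({dim for dim, _ in cand}) == k:
                members = level[u1] & level[u2]
                if len(members) > min_count:
                    nxt[cand] = members
        level = nxt
        all_dense.update(level)
    return all_dense

# Example: units = dense_units(np.random.default_rng(0).random((500, 4)))

Connected dense units within the same subspace would then be merged (a connected-components pass over adjacent intervals) to form the clusters.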

Strengths and Weaknesses of CLIQUE
• Strengths
  – Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  – Insensitive to the order of records in the input and does not presume any canonical data distribution
  – Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
• Weakness
  – The accuracy of the clustering result may be degraded as the price of the method's simplicity (the fixed grid can miss or blur cluster boundaries)

Frequent Pattern-Based Approach

• Clustering high-dimensional spaces (e.g., clustering text documents, microarray data)
  – Projected subspace clustering: which dimensions to project on?
    • CLIQUE, ProClus
  – Feature extraction: costly and may not be effective?
  – Using frequent patterns as "features"
    • Frequent patterns are inherent features
    • Mining frequent patterns may not be so expensive
• Typical methods
  – Frequent-term-based document clustering
  – Clustering by pattern similarity in micro-array data (pClustering)

Clustering by Pattern Similarity (p-Clustering)
• Right: the micro-array "raw" data shows 3 genes and their values in a multi-dimensional space
  – Difficult to find their patterns
• Bottom: some subsets of dimensions form nice shift and scaling patterns
Why p-Clustering?
• Microarray data analysis may need
  – Clustering on thousands of dimensions (attributes)
  – Discovery of both shift and scaling patterns
• Clustering with a Euclidean distance measure cannot find shift patterns
• Clustering on derived attributes A_{ij} = a_i − a_j introduces N(N−1) dimensions
• Bi-cluster using the mean-squared residue score of a submatrix (I, J) (computed in the sketch below):

  H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} \left( d_{ij} - d_{iJ} - d_{Ij} + d_{IJ} \right)^2

  where d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \quad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \quad d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} d_{ij}

  – A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
• Problems with bi-clusters
  – No downward closure property
  – Due to averaging, a bi-cluster may contain outliers yet still stay within the δ-threshold
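As a concrete reading of the residue definition, here is a small illustrative Python sketch (my addition, not from the slides) that computes H(I, J) for row set I and column set J of a data matrix D:

import numpy as np

def mean_squared_residue(D, I, J):
    sub = D[np.ix_(I, J)]                   # submatrix restricted to (I, J)
    d_iJ = sub.mean(axis=1, keepdims=True)  # row means d_iJ
    d_Ij = sub.mean(axis=0, keepdims=True)  # column means d_Ij
    d_IJ = sub.mean()                       # overall mean d_IJ
    return ((sub - d_iJ - d_Ij + d_IJ) ** 2).mean()

# A perfect shift pattern has residue 0:
D = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(mean_squared_residue(D, [0, 1], [0, 1, 2]))   # 0.0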

p-Clustering
• Given objects x, y in O and features a, b in T, a pCluster is a 2 × 2 matrix
• A pair (O, T) forms a δ-pCluster (δ > 0) if, for any 2 × 2 submatrix X in (O, T), pScore(X) ≤ δ, where

  pScore\left(\begin{pmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{pmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|

• Properties of a δ-pCluster (checked in the sketch below)
  – Downward closure
  – Clusters are more homogeneous than bi-clusters (thus the name: pair-wise Cluster)
• A pattern-growth algorithm has been developed for efficient mining
• For scaling patterns, the condition becomes \frac{d_{xa}/d_{ya}}{d_{xb}/d_{yb}} \le \delta; taking logarithms of the data reduces it to the pScore (shift) form
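The definition translates directly into code. Below is a minimal illustrative sketch (function names are mine, not the paper's) that tests whether a pair (O, T) forms a δ-pCluster by checking every 2 × 2 submatrix. Because the condition is stated on 2 × 2 submatrices, every subset of a δ-pCluster is again a δ-pCluster, which is exactly the downward-closure property that pattern-growth mining exploits.

import numpy as np
from itertools import combinations

def p_score(X):
    # pScore of a 2x2 submatrix [[d_xa, d_xb], [d_ya, d_yb]]
    return abs((X[0, 0] - X[0, 1]) - (X[1, 0] - X[1, 1]))

def is_delta_pcluster(D, objs, feats, delta):
    # (O, T) is a delta-pCluster iff every 2x2 submatrix has pScore <= delta
    return all(
        p_score(D[np.ix_([x, y], [a, b])]) <= delta
        for x, y in combinations(objs, 2)
        for a, b in combinations(feats, 2)
    )

# A pure shift pattern passes even with delta = 0:
D = np.array([[1.0, 3.0, 5.0],
              [2.0, 4.0, 6.0]])
print(is_delta_pcluster(D, [0, 1], [0, 1, 2], delta=0.0))   # True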

Cluster Analysis

• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
Model-Based Clustering

• Assume the data are generated from K probability distributions
• Typically Gaussian distributions
• A soft, or probabilistic, version of k-means clustering
• Need to find the distribution parameters
• EM Algorithm
EM Algorithm

• Initialize K cluster centers
• Iterate between two steps (sketched in code below)
  – Expectation step: assign points to clusters

    P(d_i \in c_k) = \frac{w_k \Pr(d_i \mid c_k)}{\sum_j w_j \Pr(d_i \mid c_j)}, \qquad w_k = \frac{\sum_i P(d_i \in c_k)}{N}

  – Maximization step: estimate the model parameters

    \mu_k = \frac{1}{m} \sum_{i=1}^{m} \frac{d_i \, P(d_i \in c_k)}{\sum_j P(d_i \in c_j)}
