CLIQUE and PROCLUS

Clustering high-dimensional data presents major challenges as many dimensions may be irrelevant and distance measures become less meaningful. Methods to address this include feature transformation, feature selection, and subspace clustering. CLIQUE is a subspace clustering algorithm that identifies dense clusters in subspaces of the data. It partitions each dimension and finds dense units, then identifies connected dense units to form clusters. Frequent pattern-based clustering is also proposed to use inherent frequent patterns as features to cluster high-dimensional data like text or microarray data.

Uploaded by

Tanya Sharma


Clustering High-Dimensional Data: CLIQUE and PROCLUS

Clustering High-Dimensional Data
• Clustering high-dimensional data
– Many applications: text documents, DNA micro-array data
– Major challenges:
• Many irrelevant dimensions may mask clusters
• Distance measures become meaningless because points become nearly equidistant
• Clusters may exist only in some subspaces
• Methods
– Feature transformation: only effective if most dimensions are relevant
• PCA & SVD useful only when features are highly correlated/redundant
– Feature selection: wrapper or filter approaches
• useful to find a subspace where the data have nice clusters
– Subspace-clustering: find clusters in all the possible subspaces
• CLIQUE, ProClus, and frequent pattern-based clustering

The Curse of Dimensionality
(graphs adapted from Parsons et al., KDD Explorations 2004)
• Data in only one dimension is relatively packed
• Adding a dimension “stretches” the points across that dimension, making them further apart
• Adding more dimensions makes the points even further apart: high-dimensional data is extremely sparse
• Distance measures become meaningless because all points become nearly equidistant
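
The equidistance effect can be checked numerically. The sketch below is an illustration only: it assumes points drawn uniformly from the unit hypercube and Euclidean distance, and the function name `relative_contrast` is chosen here, not taken from the slides. It reports how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random

def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for distances from one random query
    point to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as dimensions are added: the nearest and the
# farthest neighbour end up at almost the same distance.
for dims in (2, 10, 100, 1000):
    print(dims, round(relative_contrast(500, dims), 3))
```

A small contrast means that "nearest" carries little information, which is why full-space distance-based clustering degrades in high dimensions.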

CLIQUE (Clustering In QUEst)
• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
• CLIQUE can be considered both density-based and grid-based
– It partitions each dimension into the same number of equal-length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular units
– A unit is dense if the fraction of total data points contained in the unit exceeds an input density threshold (a model parameter)
– A cluster is a maximal set of connected dense units within a subspace
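
A minimal sketch of the grid and density definitions above, under stated assumptions: the parameter names `xi` (intervals per dimension) and `tau` (density threshold) and both function names are chosen for this example, and the grid is fitted to the observed minimum and maximum of each dimension.

```python
from collections import Counter

def unit_of(point, lows, highs, xi):
    """Map a point to its grid unit: one interval index per dimension."""
    idx = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) or 1.0           # guard against a constant dimension
        k = int((x - lo) / width * xi)
        idx.append(min(k, xi - 1))         # x == hi belongs to the last interval
    return tuple(idx)

def dense_units(points, xi, tau):
    """Return {unit: fraction} for units holding more than a fraction tau
    of all points."""
    dims = range(len(points[0]))
    lows = [min(p[d] for p in points) for d in dims]
    highs = [max(p[d] for p in points) for d in dims]
    counts = Counter(unit_of(p, lows, highs, xi) for p in points)
    n = len(points)
    return {u: c / n for u, c in counts.items() if c / n > tau}

# Two tight groups plus two stray points that pin the grid to [0, 1] x [0, 1].
POINTS_2D = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.12),
             (0.9, 0.9), (0.95, 0.85), (0.8, 0.9),
             (0.0, 1.0), (1.0, 0.0)]
print(dense_units(POINTS_2D, xi=4, tau=0.25))
```

Only the two corner units that hold three of the eight points pass the threshold; the units with a single stray point do not.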

CLIQUE: The Major Steps
• Partition the data space and count the points that lie inside each cell of the partition
• Identify the subspaces that contain clusters using the Apriori principle (if a k-dimensional unit is dense, so are its projections in (k-1)-dimensional space)
• Identify clusters
– Determine dense units in all subspaces of interest
– Determine connected dense units in all subspaces of interest
• Generate a minimal description for the clusters
– Determine the maximal regions that cover each cluster of connected dense units
– Determine a minimal cover for each cluster
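
The first three steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: it recounts points for every candidate, omits the minimal-description step, and all names (`clique_dense_units`, `clusters_by_subspace`, `xi`, `tau`) are assumptions made for this example. A unit is represented as a frozenset of (dimension, interval) pairs.

```python
from itertools import combinations

def grid_index(x, lo, hi, xi):
    """Index of the equal-length interval (0..xi-1) that contains x."""
    width = (hi - lo) or 1.0
    return min(int((x - lo) / width * xi), xi - 1)

def clique_dense_units(points, xi, tau):
    """Bottom-up search. Returns {k: set of dense k-dimensional units}."""
    n, d = len(points), len(points[0])
    lows = [min(p[j] for p in points) for j in range(d)]
    highs = [max(p[j] for p in points) for j in range(d)]
    cells = [[grid_index(p[j], lows[j], highs[j], xi) for j in range(d)]
             for p in points]

    def fraction(unit):
        # Fraction of all points that fall inside the unit.
        return sum(all(c[j] == i for j, i in unit) for c in cells) / n

    level = {frozenset([(j, i)]) for j in range(d) for i in range(xi)}
    level = {u for u in level if fraction(u) > tau}
    result, k = {}, 1
    while level:
        result[k] = level
        candidates = set()
        for a, b in combinations(level, 2):
            u = a | b
            # Join: keep unions with k+1 pairs on k+1 distinct dimensions.
            if len(u) == k + 1 and len({j for j, _ in u}) == k + 1:
                # Prune (Apriori): every k-dimensional projection is dense.
                if all((u - {pair}) in level for pair in u):
                    candidates.add(u)
        level = {u for u in candidates if fraction(u) > tau}
        k += 1
    return result

def adjacent(a, b):
    """Units of one subspace share a face if exactly one interval differs by 1."""
    da, db = dict(a), dict(b)
    return da.keys() == db.keys() and sum(abs(da[j] - db[j]) for j in da) == 1

def clusters_by_subspace(units):
    """Group dense units into connected components, subspace by subspace."""
    by_subspace = {}
    for u in units:
        by_subspace.setdefault(tuple(sorted(j for j, _ in u)), set()).add(u)
    clusters = {}
    for subspace, remaining in by_subspace.items():
        clusters[subspace] = []
        while remaining:
            component = {remaining.pop()}
            frontier = list(component)
            while frontier:
                cur = frontier.pop()
                near = {v for v in remaining if adjacent(cur, v)}
                remaining -= near
                component |= near
                frontier.extend(near)
            clusters[subspace].append(component)
    return clusters

# Six points are close together in dimensions 0 and 1; dimension 2 is spread out.
POINTS_3D = [(0.0, 0.1, 0.0), (0.1, 0.2, 0.3), (0.2, 0.1, 0.6),
             (0.3, 0.15, 1.0), (0.35, 0.2, 0.1), (0.4, 0.1, 0.8),
             (1.0, 1.0, 0.5), (0.9, 0.6, 0.2), (0.6, 0.9, 0.9), (0.7, 0.0, 0.4)]
dense = clique_dense_units(POINTS_3D, xi=4, tau=0.25)
print(clusters_by_subspace(dense[2]))
```

On this toy data the search stops at two dimensions: the only dense 2-dimensional units lie in the subspace of dimensions 0 and 1, and they are adjacent, so they form a single cluster there.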

Strengths and Weaknesses of CLIQUE
• Strengths
– Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
– Insensitive to the order of input records and does not presume any canonical data distribution
– Scales linearly with the size of the input and scales well as the number of dimensions increases
• Weakness
– The accuracy of the clustering result may be degraded for the sake of the method's simplicity

Frequent Pattern-Based Approach

• Clustering high-dimensional space (e.g., clustering text documents, microarray data)
– Projected subspace-clustering: which dimensions to be projected on?
• CLIQUE, ProClus
– Feature extraction: costly and may not be effective?
– Using frequent patterns as “features”
• Frequent patterns are inherent features
• Mining frequent patterns may not be so expensive
• Typical methods
– Frequent-term-based document clustering
– Clustering by pattern similarity in microarray data (pClustering)

Clustering by Pattern Similarity (p-Clustering)
• Right: The microarray “raw” data shows 3 genes and their values in a multi-dimensional space
– Difficult to find their patterns
• Bottom: Some subsets of dimensions form nice shift and scaling patterns

Why p-Clustering?
• Microarray data analysis may need to
– Cluster on thousands of dimensions (attributes)
– Discover both shift and scaling patterns
• Clustering with a Euclidean distance measure? Cannot find shift patterns
• Clustering on derived attributes A_ij = a_i − a_j? Introduces N(N−1) dimensions
• Bi-cluster using the transformed mean-squared residue score of a submatrix (I, J):
$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} \left(d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}\right)^2$$
– Where
$$d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}, \qquad d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}, \qquad d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} d_{ij}$$
– A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
• Problems with bi-cluster
– No downward closure property
– Due to averaging, it may contain outliers but still be within the δ-threshold
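
A minimal sketch of the mean squared residue score, assuming the usual definition: each entry of the submatrix has its row mean and column mean subtracted and the overall mean added back, and H(I, J) is the average of the squared results. The function name and the toy matrices are chosen for this example.

```python
def mean_squared_residue(matrix, rows, cols):
    """H(I, J) for the submatrix selected by row indices I and column indices J."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    total = sum((matrix[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))

# Every row is the first row plus a constant: a pure shift pattern.
SHIFT = [[1, 2, 4],
         [3, 4, 6],
         [11, 12, 14]]
# The same matrix with one entry replaced by an outlier.
WITH_OUTLIER = [[1, 2, 4],
                [3, 4, 6],
                [11, 12, 40]]

print(mean_squared_residue(SHIFT, [0, 1, 2], [0, 1, 2]))
print(mean_squared_residue(WITH_OUTLIER, [0, 1, 2], [0, 1, 2]))
```

A pure shift pattern scores (numerically) zero. Because the score is an average, a single outlier raises it only in proportion to the submatrix size, so a large submatrix can hide outliers while staying under δ.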

p-Clustering
• Given objects x, y in O and features a, b in T, a pCluster is defined on the 2 by 2 matrix
$$X = \begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}, \qquad \mathrm{pScore}(X) = \left| (d_{xa} - d_{xb}) - (d_{ya} - d_{yb}) \right|$$
• A pair (O, T) is a δ-pCluster if for any 2 by 2 matrix X in (O, T), pScore(X) ≤ δ for some δ > 0

• Properties of δ-pCluster
– Downward closure
– Clusters are more homogeneous than bi-cluster (thus the name: pair-wise Cluster)

• Pattern-growth algorithm has been developed for efficient mining

• For scaling patterns, taking the logarithm of
$$\frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}} < \delta$$
leads to the pScore form

(IT6006 Data Analytics, 9/28/2019)
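
The pScore test can be sketched directly from its definition: compare the difference between two features for one object with the same difference for another object, over every pair of objects and every pair of features. The brute-force check below is an illustration only (it is not the pattern-growth miner), and the function names and toy matrices are chosen for this example.

```python
import math
from itertools import combinations

def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2 by 2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(matrix, objects, features, delta):
    """True if every 2 by 2 submatrix of (objects, features) has pScore <= delta."""
    for x, y in combinations(objects, 2):
        for a, b in combinations(features, 2):
            score = p_score(matrix[x][a], matrix[x][b],
                            matrix[y][a], matrix[y][b])
            if score > delta:
                return False
    return True

# Shift pattern: rows differ by a constant offset.
SHIFT = [[1, 2, 4], [3, 4, 6], [11, 12, 14]]
# Scaling pattern: the second row is three times the first.
SCALING = [[1, 2, 4], [3, 6, 12]]
LOG_SCALING = [[math.log(v) for v in row] for row in SCALING]

print(is_delta_pcluster(SHIFT, [0, 1, 2], [0, 1, 2], delta=0))
print(is_delta_pcluster(SCALING, [0, 1], [0, 1, 2], delta=1))
print(is_delta_pcluster(LOG_SCALING, [0, 1], [0, 1, 2], delta=1e-9))
```

The scaling pattern fails the test on raw values but passes after a log transform, since a constant ratio becomes a constant offset.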

Cluster Analysis

• What is Cluster Analysis?

• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods

Model based clustering

• Assume the data are generated from K probability distributions
• Typically Gaussian distributions
• Soft or probabilistic version of K-means clustering
• Need to find the distribution parameters
• EM algorithm

EM Algorithm

• Initialize K cluster centers
• Iterate between two steps
– Expectation step: assign points to clusters
$$P(d_i \in c_k) = \frac{w_k \Pr(d_i \mid c_k)}{\sum_j w_j \Pr(d_i \mid c_j)}, \qquad w_k = \frac{\sum_i P(d_i \in c_k)}{N}$$
– Maximization step: estimate model parameters
$$\mu_k = \frac{\sum_{i=1}^{N} d_i \, P(d_i \in c_k)}{\sum_{i=1}^{N} P(d_i \in c_k)}$$