
ADVANCED CLUSTER ANALYSIS
Clustering high-dimensional data
SYLLABUS
Clustering techniques:
 Hierarchical
 K-means
 Clustering high-dimensional data
 CLIQUE and PROCLUS
 Frequent-pattern-based clustering methods
 Clustering in non-Euclidean space
 Clustering for streams and parallelism
 Probabilistic model-based clustering
 Clustering high-dimensional data
 Clustering graph and network data
 Clustering with constraints

CLUSTERING HIGH-DIMENSIONAL DATA

 The clustering methods we have studied so far work well when the dimensionality is not high, that is, when objects have fewer than 10 attributes.
 There are, however, important applications of high-dimensional data.
 “How can we conduct cluster analysis on high-dimensional data?”
EXAMPLE
 All Electronics keeps track of the products purchased by every customer.
 As a customer-relationship manager, you want to cluster customers into groups according to what they purchased from All Electronics.
 All Electronics carries tens of thousands of products.
[The original slide shows a purchase table for three customers: Ada, Bob, and Cathy.]
 From that table it is easy to see that

dist(Ada, Bob) = dist(Bob, Cathy) = dist(Ada, Cathy) = √2

 According to Euclidean distance, the three customers are equivalently similar (or dissimilar) to each other.
 However, a closer look tells us that Ada should be more similar to Cathy than to Bob, because Ada and Cathy share one common purchased item, P1.
 Traditional distance measures can thus be ineffective on high-dimensional data: such measures may be dominated by the noise in the many other dimensions.
 Therefore, clusters in the full, high-dimensional space can be unreliable, and finding such clusters may not be meaningful.
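To see how noise in many dimensions dominates a distance measure, here is a minimal illustrative sketch in Python with NumPy (random data standing in for the purchase table): as the number of noisy attributes grows, the smallest and largest pairwise Euclidean distances become nearly indistinguishable.

import numpy as np

rng = np.random.default_rng(0)

# 50 hypothetical objects described by d noisy attributes each.
for d in (2, 10, 100, 1000):
    X = rng.random((50, d))
    # All pairwise Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    pairs = dists[np.triu_indices(50, k=1)]
    # As d grows this ratio approaches 1: every pair of objects looks
    # roughly equally far apart, so distance carries little information.
    print(f"d={d:5d}  min/max distance ratio = {pairs.min() / pairs.max():.3f}")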

 Clustering high-dimensional data is the search for clusters and the space in which they exist.
FIRST CHALLENGE
 A major issue is how to create appropriate models for clusters
in high-dimensional data.
 Unlike conventional clusters in low-dimensional spaces,
clusters hidden in high-dimensional data are often
significantly smaller.
 For example, when clustering customer-purchase data, we
would not expect many users to have similar purchase
patterns.
 Searching for such small but meaningful clusters is like
finding needles in a haystack.
 We often have to consider more sophisticated techniques that can model correlations and consistency among objects in subspaces.
SECOND CHALLENGE

 There are typically an exponential number of possible subspaces or dimensionality reduction options, and thus the optimal solutions are often computationally prohibitive.
 For example, if the original data space has 1,000 dimensions and we want to find clusters of dimensionality 10, then there are C(1000, 10) ≈ 2.63 × 10^23 possible subspaces.
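That count is simply the number of ways to choose 10 attributes out of 1,000; a quick check in Python:

import math

# Number of 10-dimensional axis-parallel subspaces of a
# 1000-dimensional space: "1000 choose 10".
print(f"{math.comb(1000, 10):.2e}")  # prints 2.63e+23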
TWO MAJOR KINDS OF METHODS

 Subspace clustering approaches search for clusters existing in subspaces of the given high-dimensional data space, where a subspace is defined using a subset of attributes in the full space.

 Dimensionality reduction approaches try to construct a much lower-dimensional space and search for clusters in such a space. Often, a method may construct new dimensions by combining some dimensions from the original data, as the sketch below illustrates.
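A minimal sketch of the dimensionality reduction approach, assuming scikit-learn is available and that the data matrix, the component count, and the cluster count are illustrative placeholders:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Hypothetical data: 500 objects with 1,000 attributes.
X = np.random.default_rng(1).random((500, 1000))

# Construct a much lower-dimensional space; each new dimension is a
# linear combination of the original dimensions.
X_low = PCA(n_components=10).fit_transform(X)

# Search for clusters in the reduced space.
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X_low)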
SUBSPACE CLUSTERING METHODS

 Subspace search methods
 Correlation-based clustering methods
 Biclustering methods
SUBSPACE SEARCH METHODS
 A subspace search method searches various subspaces for clusters.
 Here, a cluster is a subset of objects that are similar to each other in a subspace.
 The similarity is often captured by conventional measures such as distance or density.
 A major challenge that subspace search methods face is how to search a series of subspaces effectively and efficiently.
GENERALLY, THERE ARE TWO KINDS OF STRATEGIES
 Bottom-up approaches start from low-dimensional subspaces and search higher-dimensional subspaces only when there may be clusters in those higher-dimensional subspaces.
 Various pruning techniques are explored to reduce the number of higher-dimensional subspaces that need to be searched.
 CLIQUE is an example of a bottom-up approach.
TOP-DOWN APPROACHES
 Top-down approaches start from the full space and search smaller and smaller subspaces recursively.
 Top-down approaches are effective only if the locality assumption holds, which requires that the subspace of a cluster can be determined by the local neighborhood.
 PROCLUS is an example of a top-down subspace approach.
CLIQUE: A DIMENSION-GROWTH SUBSPACE CLUSTERING METHOD
 CLIQUE (CLustering In QUEst) was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space.
 In dimension-growth subspace clustering, the clustering process starts from a single-dimensional subspace and grows upward to higher-dimensional ones, using a grid structure.
 It can also be viewed as an integration of density-based and grid-based clustering methods.
 Its overall approach is typical of subspace clustering for high-dimensional spaces.
EXAMPLE
 The ideas of the CLIQUE clustering algorithm are outlined as follows:
 Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points.
 CLIQUE identifies the sparse and the crowded areas in space, thereby discovering the overall distribution patterns of the data set.
 A unit is dense if the fraction of the total data points contained in it exceeds an input model parameter.
 In CLIQUE, a cluster is defined as a maximal set of connected dense units.
HOW DOES CLIQUE WORK?
 STEP I: CLIQUE partitions the d-dimensional data space into non-overlapping rectangular units, identifying the dense units among them.
 STEP II: The subspaces representing these dense units are intersected to form a candidate search space in which dense units of higher dimensionality may exist, as the sketch below illustrates.
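A minimal sketch of these two steps, assuming a numeric data matrix X with no constant attribute; xi (intervals per dimension) and tau (density threshold) stand in for CLIQUE's input parameters. A full implementation would keep growing the dimensionality Apriori-style and then merge connected dense units into clusters.

from collections import Counter
from itertools import combinations
import numpy as np

def dense_units(X, xi=10, tau=0.05):
    n, d = X.shape
    # STEP I: partition each dimension into xi equal-width intervals and
    # map every point to its grid cell along each dimension.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    cells = np.clip(((X - mins) / (maxs - mins) * xi).astype(int), 0, xi - 1)

    # A 1-D unit (dim, cell) is dense if it holds more than tau * n points.
    dense_1d = {(dim, c) for dim in range(d)
                for c, cnt in Counter(cells[:, dim]).items()
                if cnt > tau * n}

    # STEP II: intersect dense 1-D units from different dimensions to form
    # candidate 2-D units, and keep only those that are themselves dense.
    dense_2d = set()
    for (d1, c1), (d2, c2) in combinations(sorted(dense_1d), 2):
        if d1 < d2:
            cnt = np.count_nonzero((cells[:, d1] == c1) & (cells[:, d2] == c2))
            if cnt > tau * n:
                dense_2d.add(((d1, c1), (d2, c2)))
    return dense_1d, dense_2d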
HOW EFFECTIVE IS CLIQUE?
 CLIQUE automatically finds the subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
 It is insensitive to the order of input objects.
 It scales linearly with the size of the input and has good scalability as the number of dimensions in the data is increased.
 Clustering results are, however, dependent on proper tuning of the grid size and the density threshold.
GRAPHICAL DEFINITION
 A clique is a group of nodes in a graph such that all nodes in the clique are connected to each other.
 K is the number of nodes in a clique.
The clique percolation method is as follows (see the sketch after this list):
1) All K-cliques present in graph G are extracted.
2) A new clique graph GC is created:
   a) Each extracted K-clique is compressed into one vertex.
   b) Two vertices are connected by an edge in GC if their cliques have K − 1 common vertices.
3) The connected components in GC are identified.
4) Each connected component in GC represents a community.
5) The set C of communities formed for G is returned.
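The five steps translate almost directly into code; a minimal sketch using the networkx library (enumerating all cliques and testing every pair of K-cliques is fine for small graphs but does not scale):

from itertools import combinations
import networkx as nx

def clique_percolation(G, k):
    # Step 1: extract all k-cliques present in G.
    k_cliques = [frozenset(c) for c in nx.enumerate_all_cliques(G)
                 if len(c) == k]
    # Step 2: build the clique graph GC: one vertex per k-clique, with an
    # edge wherever two k-cliques share k - 1 vertices.
    GC = nx.Graph()
    GC.add_nodes_from(k_cliques)
    for c1, c2 in combinations(k_cliques, 2):
        if len(c1 & c2) == k - 1:
            GC.add_edge(c1, c2)
    # Steps 3-5: each connected component of GC is one community; report
    # the union of the member cliques' nodes.
    return [frozenset.union(*comp) for comp in nx.connected_components(GC)]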
[Figure: example cliques for K = 2 (nodes N1, N2), K = 3 (N1, N2, N3), and K = 4 (N1, N2, N3, N4).]
COMMUNITY
 A community is a group of cliques in which each clique can be reached from the others through a chain of cliques sharing K − 1 nodes.
CLIQUE PERCOLATION METHOD (CPM)
[The original slides illustrate cliques, communities, and a worked CPM example as a sequence of figures.]
CLIQUE & COMMUNITY
 Here, for K = 3:
CLIQUE 1 = {N1, N2, N3}
CLIQUE 2 = {N1, N2, N4}
COMMUNITY = {CLIQUE 1, CLIQUE 2}, since the two cliques share K − 1 = 2 nodes (N1 and N2).
EXAMPLE
Cliques for K = 3:
a) {1, 2, 3}
b) {1, 2, 8}
c) {2, 6, 5}
d) {2, 6, 4}
e) {2, 5, 4}
f) {4, 5, 6}
Community 1 = {a, b}
Community 2 = {c, d, e, f}
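The same answer can be checked with networkx's built-in clique percolation implementation; the edges below are the ones implied by the six triangles above. Note that node 2 ends up in both communities, since CPM allows communities to overlap.

import networkx as nx
from networkx.algorithms.community import k_clique_communities

G = nx.Graph([(1, 2), (1, 3), (2, 3), (1, 8), (2, 8),
              (2, 5), (2, 6), (5, 6), (2, 4), (4, 5), (4, 6)])

for community in k_clique_communities(G, 3):
    print(sorted(community))  # [1, 2, 3, 8] and [2, 4, 5, 6]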
EXAMPLE
Identify the cliques for K = 5 and K = 4. [The original slide shows the graph, on nodes 1, 2, 3, 5, 6, 7, 9, and 10, as a figure.]
PROCLUS
 PROCLUS (PROjected CLUStering) is a top-down projected clustering method.
 It first chooses a random sample of the data points, and from it a set of points that are likely to be the medoids of the clusters.
INPUT AND OUTPUT FOR PROCLUS
 Input:
 The set of data points
 The number of clusters, denoted by k
 The average number of dimensions per cluster, denoted by L
 Output:
 The clusters found, and the dimensions associated with each cluster
Three phases of PROCLUS:
 Initialization Phase
 Iterative Phase
 Refinement Phase
INITIALIZATION PHASE
 Choose a sample set of data points randomly.
 From the sample, choose a set of points that are likely to be the medoids of the clusters.
ITERATIVE PHASE
 From the Initialization Phase, we get a set of data points, denoted by M, which should contain the medoids.
 In this phase, we find the best medoids from M.
 Randomly pick a current set of medoids Mcurrent, and replace “bad” medoids with other points from M if necessary.
For the current medoids, the following is done:
 Find the dimensions related to each medoid
 Assign data points to the medoids
 Evaluate the clusters formed
 Find the bad medoid, and try the result of replacing it
 The above procedure is repeated until a satisfactory result is obtained.

REFINEMENT PHASE: HANDLING OUTLIERS
 For each medoid mi with dimension set Di, find the smallest Manhattan segmental distance δi from mi to any of the other medoids with respect to Di.
 δi defines the sphere of influence of the medoid mi.
 A data point is an outlier if it is not under any medoid's sphere of influence, as in the sketch below.
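A minimal sketch of this outlier test, assuming the medoids, their dimension sets, and the δ radii were computed in the earlier phases (the names here are illustrative, not part of the original algorithm's pseudocode):

def manhattan_segmental(x, y, dims):
    # Manhattan segmental distance between points x and y relative to the
    # dimension set dims: the Manhattan distance over those dimensions,
    # averaged over their number.
    dims = list(dims)
    return sum(abs(x[i] - y[i]) for i in dims) / len(dims)

def is_outlier(point, medoids, dim_sets, deltas):
    # A point is an outlier if it lies inside no medoid's sphere of
    # influence (radius delta_i, measured in medoid m_i's dimension set D_i).
    return all(manhattan_segmental(point, m, D) > delta
               for m, D, delta in zip(medoids, dim_sets, deltas))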
