
CLUSTER ANALYSIS

Cluster analysis, also known as clustering, is a method of data mining that groups similar data
points together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such
that the data points within each group are more similar to each other than to data points in other
groups. This process is often used for exploratory data analysis and can help identify patterns or
relationships within the data that may not be immediately obvious. There are many different
algorithms used for cluster analysis, such as k-means, hierarchical clustering, and density-based
clustering. The choice of algorithm will depend on the specific requirements of the analysis and
the nature of the data being analyzed.
Cluster analysis finds groups of similar objects in order to form clusters. It is an unsupervised machine learning technique that operates on unlabelled data: data points that are similar to one another are grouped together, and each such group of similar data points is called a cluster.
For example, consider a dataset of vehicles containing information about cars, buses, bicycles, and so on. Because the learning is unsupervised, there are no class labels such as "Car" or "Bike"; all the records are mixed together with no structure. The task is to turn this unlabelled data into labelled data, and this can be done using clusters: the data points are arranged into clusters such as a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on. In short, clustering is the partitioning of unlabelled data into groups of similar objects.
Properties of Clustering:
1. Scalability: Modern applications involve vast amounts of data and very large databases, so a clustering algorithm should be scalable. If it is not, it may produce inappropriate or misleading results on large data.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data, even when the number of data points is small.
3. Usability with multiple data kinds: The algorithm should be capable of dealing with different types of data, such as interval-based (numerical), discrete, categorical, and binary data.
4. Dealing with unstructured data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may produce poor-quality clusters. A good algorithm should handle such data and give it structure by organising it into groups of similar data objects, which makes it easier for the data expert to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable; interpretability reflects how easily the results can be understood.
CLUSTERING METHODS
The clustering methods can be classified into the following categories:
 Partitioning Method
 Hierarchical Method
 Density-based Method
 Grid-Based Method
 Model-Based Method
 Constraint-based Method
Partitioning Method: This method partitions the data in order to form clusters. If "k" partitions are constructed from the "n" objects of the database, then each partition represents a cluster and a particular region, with k ≤ n. The two conditions that must be satisfied by this partitioning clustering method are:
 Each object must belong to exactly one group.
 Each group must contain at least one object.
The partitioning method also uses a technique called iterative relocation, in which an object is moved from one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is created. Hierarchical methods are classified on the basis of how the hierarchical decomposition is formed. There are two approaches for creating the hierarchical decomposition:
 Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach. Initially, each object forms its own separate group. The method then keeps merging the objects or groups that are close to one another, i.e., that exhibit similar properties. This merging process continues until the termination condition holds.
 Divisive Approach: The divisive approach is also known as the top-down approach. It starts with all data objects in the same cluster. This cluster is then divided into smaller clusters by continuous iteration, and the iteration continues until the termination condition is met or each cluster contains a single object.
Once a group is split or merged, the decision can never be undone, which makes hierarchical clustering a rigid, inflexible method. Two approaches that can be used to improve the quality of hierarchical clustering in data mining are:
 Carefully analyze the linkages between objects at every partitioning of the hierarchical clustering.
 Integrate hierarchical agglomeration with other clustering techniques: first group the objects into micro-clusters, and then perform macro-clustering on the micro-clusters.
Density-Based Method: The density-based method mainly focuses on density. A given cluster keeps growing as long as the density in its neighbourhood exceeds some threshold, i.e., for each data point within a given cluster, the neighbourhood of a given radius must contain at least a minimum number of points.
Grid-Based Method: In the grid-based method, the object space is quantized into a finite number of cells that form a grid structure. A major advantage of the grid-based method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space and not on the number of data objects.
Model-Based Method: In the model-based method, a model is hypothesized for each cluster and the best fit of the data to the given model is found. A density function is used to locate the clusters. This approach reflects the spatial distribution of the data points and also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account; it therefore yields robust clustering methods.
Constraint-Based Method: Constraint-based clustering is performed by incorporating application- or user-oriented constraints. A constraint refers to the user's expectations or the properties of the desired clustering results. Constraints provide an interactive way of communicating with the clustering process, and they can be specified by the user or by the application requirements.
Applications Of Cluster Analysis:
 It is widely used in image processing, data analysis, and pattern recognition.
 It helps marketers to find the distinct groups in their customer base and they can
characterize their customer groups by using purchasing patterns.
 It can be used in the field of biology, by deriving animal and plant taxonomies and
identifying genes with the same capabilities.
 It also helps in information discovery by classifying documents on the web.

Advantages of Cluster Analysis:


1. It can help identify patterns and relationships within a dataset that may not be
immediately obvious.
2. It can be used for exploratory data analysis and can help with feature selection.
3. It can be used to reduce the dimensionality of the data.
4. It can be used for anomaly detection and outlier identification.
5. It can be used for market segmentation and customer profiling.
Disadvantages of Cluster Analysis:
1. It can be sensitive to the choice of initial conditions and the number of clusters.
2. It can be sensitive to the presence of noise or outliers in the data.
3. It can be difficult to interpret the results of the analysis if the clusters are not well-
defined.
4. It can be computationally expensive for large datasets.
5. The results of the analysis can be affected by the choice of clustering algorithm used.
Note: the success of cluster analysis depends on the data, the goals of the analysis, and the ability of the analyst to interpret the results.
Cluster Analysis in Data Mining
Cluster analysis, also known as clustering, is a fundamental technique in data mining used to
group a set of objects into clusters such that objects in the same cluster are more similar to each
other than to those in other clusters. It is an unsupervised learning method, meaning it does not
rely on predefined labels or categories. Clustering is widely used in various domains, such as
market segmentation, image processing, bioinformatics, and social network analysis.

Key Concepts in Cluster Analysis


1. Cluster:
o A group of objects that are similar to each other and dissimilar to objects in other
groups.
2. Similarity Measure:
o A metric used to quantify how similar two objects are. Common measures
include:
 Euclidean distance (for numerical data).
 Cosine similarity (for text or high-dimensional data).
 Jaccard similarity (for binary data).
3. Centroid:
o The central point of a cluster, often used in algorithms like k-means.

4. Dendrogram:
o A tree-like diagram used in hierarchical clustering to represent the arrangement of
clusters.

Types of Clustering Algorithms


Clustering algorithms can be broadly categorized into the following types:

1. Partitioning Clustering
 Divides data into non-overlapping clusters.
 Example Algorithms:
o k-Means: Groups data into k clusters by minimizing the sum of squared
distances between data points and their cluster centroids.
o k-Medoids: Similar to k-means but uses actual data points (medoids) as cluster
centers, making it more robust to outliers.

2. Hierarchical Clustering
 Builds a hierarchy of clusters, either by merging smaller clusters into larger ones
(agglomerative) or splitting larger clusters into smaller ones (divisive).
 Example Algorithms:
o Agglomerative Hierarchical Clustering: Starts with each object as a separate
cluster and merges the closest pairs of clusters iteratively.
o Divisive Hierarchical Clustering: Starts with all objects in one cluster and splits
them into smaller clusters.

3. Density-Based Clustering
 Groups data based on density, identifying clusters as dense regions separated by sparse
regions.
 Example Algorithms:
o DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Identifies clusters of varying shapes and handles noise effectively.
o OPTICS (Ordering Points To Identify the Clustering Structure): Similar to
DBSCAN but creates a reachability plot to identify clusters at different density
levels.

4. Model-Based Clustering
 Assumes that the data is generated from a mixture of probability distributions and tries to
fit the data to these models.
 Example Algorithms:
o Gaussian Mixture Models (GMM): Assumes data points are generated from a
mixture of Gaussian distributions.

5. Grid-Based Clustering
 Divides the data space into a finite number of cells (grids) and performs clustering on the
grid structure.
 Example Algorithms:
o STING (Statistical Information Grid): Divides the data space into rectangular
cells and computes statistical information for each cell.

Steps in Cluster Analysis


1. Data Preparation:
o Clean and preprocess the data (e.g., handle missing values, normalize data).

2. Feature Selection:
o Choose relevant features for clustering.

3. Similarity Measure Selection:


o Choose an appropriate similarity or distance measure based on the data type.

4. Clustering Algorithm Selection:


o Choose a clustering algorithm based on the problem and data characteristics.

5. Cluster Validation:
o Evaluate the quality of clusters using internal or external validation metrics.

6. Interpretation:
o Analyze and interpret the clusters to derive meaningful insights.

Applications of Cluster Analysis


1. Market Segmentation:
o Group customers based on purchasing behavior, demographics, or preferences.

2. Image Segmentation:
o Divide an image into regions for object detection or pattern recognition.

3. Bioinformatics:
o Cluster genes or proteins with similar functions.

4. Social Network Analysis:


o Identify communities or groups within social networks.

5. Anomaly Detection:
o Detect outliers or unusual patterns in data.

Challenges in Cluster Analysis


1. Choosing the Right Algorithm:
o The choice of algorithm depends on the data and the problem.

2. Determining the Number of Clusters:


o In algorithms like k-means, the number of clusters (k) must be specified in
advance.
3. Handling High-Dimensional Data:
o Clustering becomes challenging as the number of dimensions increases (curse of
dimensionality).
4. Scalability:
o Some algorithms may not scale well with large datasets.

5. Noise and Outliers:


o Noise and outliers can negatively impact clustering results.

Cluster Validation Metrics


To evaluate the quality of clusters, several metrics are used:
1. Internal Validation:
o Measures the compactness and separation of clusters.

o Examples:

 Silhouette Coefficient: Measures how similar an object is to its own cluster compared to other clusters.
 Davies-Bouldin Index: Evaluates the average similarity ratio of clusters.
2. External Validation:
o Compares clustering results with ground truth labels.

o Examples:

 Adjusted Rand Index (ARI): Measures the similarity between two clusterings.
 Fowlkes-Mallows Index: Evaluates the similarity between two clusterings based on pairs of points.
Basic Clustering Methods
Clustering is a fundamental technique in data mining and machine learning used to group similar
objects into clusters. Below are some of the basic clustering methods widely used in practice:
PARTITIONING CLUSTERING
Partitioning methods divide the data into a predefined number of non-overlapping clusters. Each
object belongs to exactly one cluster.
a) k-Means Clustering
 Objective: Partition the data into k clusters by minimizing the sum of squared distances
between data points and their cluster centroids.
 Steps:
1. Choose k initial centroids (randomly or using a heuristic).
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids as the mean of all points in the cluster.
4. Repeat steps 2 and 3 until convergence (no change in centroids).
 Advantages:
o Simple and computationally efficient.

o Works well for spherical clusters.

 Limitations:
o Requires the number of clusters (k) to be specified in advance.

o Sensitive to initial centroid selection and outliers.

b) k-Medoids Clustering
 Objective: Similar to k-means, but uses actual data points (medoids) as cluster centers
instead of means.
 Algorithm: PAM (Partitioning Around Medoids).
 Advantages:
o More robust to noise and outliers compared to k-means.

 Limitations:
o Computationally more expensive than k-means.

2. Hierarchical Clustering
Hierarchical methods build a hierarchy of clusters, either by merging smaller clusters into larger
ones (agglomerative) or splitting larger clusters into smaller ones (divisive).
a) Agglomerative Hierarchical Clustering
 Steps:
1. Treat each data point as a single cluster.
2. Merge the two closest clusters iteratively.
3. Repeat until all points are in a single cluster.
 Linkage Criteria:
o Single Linkage: Distance between the closest pair of points in two clusters.

o Complete Linkage: Distance between the farthest pair of points in two clusters.

o Average Linkage: Average distance between all pairs of points in two clusters.

 Advantages:
o Does not require the number of clusters to be specified in advance.

o Produces a dendrogram for visualization.

 Limitations:
o Computationally expensive for large datasets.

b) Divisive Hierarchical Clustering


 Steps:
1. Start with all data points in a single cluster.
2. Split the cluster into smaller clusters iteratively.
3. Repeat until each point is in its own cluster.
 Advantages:
o Provides a hierarchical structure of clusters.

 Limitations:
o Computationally expensive and less commonly used than agglomerative
clustering.
3. Density-Based Clustering
Density-based methods identify clusters as dense regions of data points separated by sparse
regions.
a) DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
 Key Concepts:
o Core Point: A point with at least minPts points within a radius ϵ.

o Border Point: A point within ϵ of a core point but does not have enough
neighbors.
o Noise Point: A point that is neither a core nor a border point.

 Steps:
1. Identify core points and form clusters around them.
2. Assign border points to the nearest core point's cluster.
3. Treat noise points as outliers.
 Advantages:
o Does not require the number of clusters to be specified.

o Can identify clusters of arbitrary shapes and handle noise.

 Limitations:
o Sensitive to the choice of ϵ and minPts.

b) OPTICS (Ordering Points To Identify the Clustering Structure)


 Key Idea: Creates a reachability plot to identify clusters at different density levels.
 Advantages:
o Can handle varying densities within the dataset.

 Limitations:
o More complex and computationally expensive than DBSCAN.

4. Model-Based Clustering
Model-based methods assume that the data is generated from a mixture of probability
distributions and try to fit the data to these models.
a) Gaussian Mixture Models (GMM)
 Objective: Model the data as a mixture of Gaussian distributions.
 Steps:
1. Assume the data is generated from k Gaussian distributions.
2. Use the Expectation-Maximization (EM) algorithm to estimate the parameters
of the distributions.
3. Assign each point to the cluster with the highest probability.
 Advantages:
o Can model clusters of different shapes and sizes.

 Limitations:
o Requires the number of clusters to be specified.

o Sensitive to initialization.
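To make the GMM approach above concrete, here is a minimal Python sketch assuming NumPy and scikit-learn are available; the toy 2D data and the choice of two components are illustrative and not taken from the source.

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 2D data: two well-separated groups (illustrative only).
X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]], dtype=float)

# EM estimates the mixture parameters; each point is then assigned to the
# component with the highest posterior probability. Soft (probabilistic)
# assignments are available through predict_proba.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))                 # hard cluster assignments
print(gmm.predict_proba(X).round(3))  # soft assignments per component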

5. Grid-Based Clustering
Grid-based methods divide the data space into a finite number of cells (grids) and perform
clustering on the grid structure.
a) STING (Statistical Information Grid)
 Steps:
1. Divide the data space into rectangular cells.
2. Compute statistical information (e.g., mean, count) for each cell.
3. Merge neighboring cells with similar statistics to form clusters.
 Advantages:
o Computationally efficient for large datasets.

 Limitations:
o Limited to low-dimensional data.

Comparison of Basic Clustering Methods


Method | Type | Advantages | Limitations
k-Means | Partitioning | Simple, efficient, works for spherical clusters. | Requires k, sensitive to outliers.
k-Medoids | Partitioning | Robust to outliers. | Computationally expensive.
Agglomerative | Hierarchical | No need for k, produces dendrogram. | Computationally expensive.
DBSCAN | Density-Based | Handles noise, arbitrary shapes. | Sensitive to parameters.
GMM | Model-Based | Models clusters of different shapes. | Requires k, sensitive to initialization.
STING | Grid-Based | Efficient for large datasets. | Limited to low-dimensional data.

PARTITIONING METHOD: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. The data analyst specifies the number of clusters to be generated. Given a database D containing N objects, the partitioning method constructs K user-specified partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications). Below, the working of the K-Means algorithm is described in detail.
K-Means (a centroid-based technique): The K-Means algorithm takes an input parameter K from the user and partitions the dataset containing N objects into K clusters so that the similarity among data objects inside the same cluster (intra-cluster similarity) is high, while the similarity between data objects in different clusters (inter-cluster similarity) is low. The similarity of a cluster is determined with respect to the mean value of the cluster, so K-Means is a type of squared-error algorithm. At the start, K objects are chosen at random from the dataset, each representing a cluster mean (centre). Each of the remaining data objects is assigned to the nearest cluster based on its distance from the cluster mean, and the new mean of each cluster is then recalculated from the objects assigned to it.
The K-Means clustering algorithm is a popular unsupervised machine learning technique used to partition a dataset into k distinct, non-overlapping clusters. The goal is to group similar data points together while keeping dissimilar points in different clusters. Below is a detailed explanation of the K-Means algorithm.
Steps in the K-Means Algorithm
1. Initialization:
o Choose the number of clusters k.

o Randomly select k data points as initial cluster centroids.

2. Assignment Step:
o Assign each data point to the nearest centroid based on a distance metric (usually
Euclidean distance).
3. Update Step:
o Recalculate the centroids as the mean of all data points assigned to each cluster.

4. Repeat:
o Repeat the assignment and update steps until the centroids no longer change
significantly (convergence).
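The following is a minimal Python sketch of these steps, assuming NumPy and scikit-learn are available; the toy data and the choice k=2 are illustrative and not from the source text.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]], dtype=float)

# k-means++ initialization, then alternate the assignment and update steps
# until the centroids stop changing.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index of each point
print(kmeans.cluster_centers_)  # final centroids (cluster means)
print(kmeans.inertia_)          # within-cluster sum of squared distances (WCSS)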
K-Medoids ALGORITHM
The K-Medoids algorithm, also known as Partitioning Around Medoids (PAM), is a
clustering algorithm similar to K-Means but more robust to noise and outliers. Instead of using
the mean of data points as the cluster center (centroid), K-Medoids uses actual data points
called medoids as the representative of each cluster. This makes it more suitable for datasets
with noise or outliers.

Steps in the K-Medoids Algorithm


1. Initialization:
o Choose the number of clusters k.
o Randomly select k data points as initial medoids.

2. Assignment Step:
o Assign each data point to the nearest medoid based on a distance metric (e.g.,
Euclidean distance).
3. Update Step:
o For each cluster, find the data point that minimizes the total dissimilarity (cost)
within the cluster. This point becomes the new medoid.
4. Repeat:
o Repeat the assignment and update steps until the medoids no longer change
(convergence).
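Below is a minimal from-scratch sketch of the assignment/update loop described above, using only NumPy (assumed available). The helper name k_medoids, the toy data, and k=2 are illustrative choices, not part of the source; a full PAM implementation would additionally consider swap moves.

import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Alternate the assignment and medoid-update steps described above."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)         # initial medoids
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)               # assignment step
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            # update step: the member with the smallest total dissimilarity (cost)
            new_medoids[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):                 # convergence
            break
        medoids = new_medoids
    return medoids, labels

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [25, 25]], dtype=float)
medoids, labels = k_medoids(X, k=2)
print(X[medoids])  # medoids are actual data points; the outlier (25, 25) does not drag the centre away
print(labels)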
HIERARCHICAL CLUSTERING
Hierarchical clustering is a type of clustering algorithm that builds a hierarchy of clusters,
either by merging smaller clusters into larger ones (agglomerative approach) or by splitting
larger clusters into smaller ones (divisive approach). Unlike partitioning methods like K-Means
or K-Medoids, hierarchical clustering does not require the number of clusters k to be specified
in advance. Instead, it produces a dendrogram, which is a tree-like structure that shows the
relationships between clusters at different levels of granularity.
Types of Hierarchical Clustering
1. Agglomerative Hierarchical Clustering:
o This is a bottom-up approach.

o Each data point starts as its own cluster.

o At each step, the two most similar clusters are merged.

o This process continues until all data points are in a single cluster or a stopping
condition is met.
2. Divisive Hierarchical Clustering:
o This is a top-down approach.

o All data points start in one cluster.

o At each step, a cluster is split into smaller clusters.

o This process continues until each data point is in its own cluster or a stopping
condition is met.
AGGLOMERATIVE CLUSTERING
Agglomerative clustering is more commonly used due to its simplicity and computational
efficiency.

Steps in Agglomerative Hierarchical Clustering


1. Initialization:
o Treat each data point as a single cluster.

o Compute the distance matrix between all pairs of clusters.

2. Merge Clusters:
o Find the two closest clusters based on a linkage criterion.

o Merge these two clusters into a single cluster.

o Update the distance matrix to reflect the new cluster.

3. Repeat:
o Repeat the merge step until all data points are in a single cluster or a stopping
condition is met.
4. Dendrogram:
o Visualize the hierarchy of clusters using a dendrogram.

As an example, agglomerative hierarchical clustering with single linkage can be used to cluster a small dataset, as sketched below.
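A minimal sketch with SciPy (assumed available); the small 2D dataset is an illustrative stand-in rather than the dataset of the original example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]], dtype=float)

# Single linkage: the distance between two clusters is the distance between
# their closest pair of points.
Z = linkage(X, method='single', metric='euclidean')
print(Z)  # merge history; this is the data a dendrogram plot is built from

# Cut the hierarchy into 2 flat clusters.
print(fcluster(Z, t=2, criterion='maxclust'))
# scipy.cluster.hierarchy.dendrogram(Z) can be used with matplotlib to draw the tree.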
DIFFERENCE BETWEEN AGGLOMERATIVE CLUSTERING AND DIVISIVE
CLUSTERING :

Parameters | Agglomerative Clustering | Divisive Clustering
Approach | Bottom-up: starts with individual points and merges them. | Top-down: starts with all data in one cluster and splits.
Complexity Level | More computationally expensive due to pairwise distance calculations. | Less computationally expensive but requires careful cluster splitting.
Handling Outliers | Better at handling outliers, as outliers can be absorbed into larger clusters. | Outliers may lead to inefficient splitting and suboptimal results.
Interpretability | More interpretable due to clear cluster merging in the dendrogram. | Can be harder to interpret due to recursive splitting decisions.
Implementation | Scikit-learn provides multiple linkage methods such as "ward," "complete," "average," and "single." | Not widely implemented in major libraries like Scikit-learn and SciPy.
Example Applications | Image segmentation, customer segmentation, document clustering, etc. | Less common but can be used in hierarchical data analysis.

DIVISIVE HIERARCHICAL CLUSTERING


The divisive hierarchical clustering approach is a top-down clustering method where all data
points start in a single cluster, and the algorithm recursively splits the cluster into smaller clusters
until each data point is in its own cluster or a stopping condition is met. Unlike agglomerative
clustering, which builds clusters by merging, divisive clustering builds clusters by splitting.
Below is a detailed explanation of the divisive hierarchical clustering approach, along with
a numerical example.
Steps in Divisive Hierarchical Clustering
1. Initialization:
o Start with all data points in a single cluster.

2. Split Clusters:
o Choose a cluster to split based on a criterion (e.g., maximizing the distance
between subclusters).
o Split the cluster into two or more smaller clusters.

3. Repeat:
o Repeat the split step until each data point is in its own cluster or a stopping
condition is met.
4. Dendrogram:
o Visualize the hierarchy of clusters using a dendrogram.

Splitting Criterion
A common splitting criterion is to use a distance metric (e.g., Euclidean distance) and split the
cluster into two subclusters such that the distance between the two subclusters is maximized.
This can be done using algorithms like DIANA (Divisive Analysis).
DISTANCE MEASURES
In clustering algorithms, distance measures are used to quantify the similarity or dissimilarity between data points. The choice of distance measure can significantly impact the results of clustering, as it determines how the algorithm groups data points into clusters. Common measures include Euclidean distance and Manhattan distance for numerical data, cosine similarity for text or high-dimensional data, and Jaccard similarity for binary data.
BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data
mining algorithm that performs hierarchical clustering over large data sets. With modifications, it
can also be used to accelerate k-means clustering and Gaussian mixture modeling with the
expectation-maximization algorithm. An advantage of BIRCH is its ability to incrementally and
dynamically cluster incoming, multi-dimensional metric data points to produce the best quality
clustering for a given set of resources (memory and time constraints). In most cases, BIRCH
only requires a single scan of the database.
Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to
handle 'noise' (data points that are not part of the underlying pattern) effectively", beating
DBSCAN by two months. The BIRCH algorithm received the SIGMOD 10 year test of time
award in 2006.
Basic clustering algorithms like K means and agglomerative clustering are the most commonly
used clustering algorithms. But when performing clustering on very large datasets, BIRCH and
DBSCAN are the advanced clustering algorithms useful for performing precise clustering on
large datasets. Moreover, BIRCH is very easy to implement. BIRCH first condenses the dataset into small summaries and then clusters those summaries; it does not cluster the dataset directly. That is why BIRCH is often used together with other clustering algorithms: once the summary is built, it can be clustered by another clustering algorithm.
In scikit-learn, BIRCH is provided as an alternative to MiniBatchKMeans. It converts the data into a tree data structure from which centroids are read off the leaves. These centroids can be used as the final cluster centroids or as input to another clustering algorithm such as agglomerative clustering.
Problem with Previous Clustering Algorithm
Previous clustering algorithms performed less effectively over very large databases and did not
adequately consider the case wherein a dataset was too large to fit in main memory. Furthermore,
most of BIRCH's predecessors inspect all data points (or all currently existing clusters) equally
for each clustering decision. They do not perform heuristic weighting based on the distance
between these data points. As a result, there was a lot of overhead maintaining high clustering
quality while minimizing the cost of additional IO (input/output) operations.
Stages of BIRCH
BIRCH is often used to complement other clustering algorithms by creating a summary of the
dataset that the other clustering algorithm can then use. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is an attribute whose values can
be represented in Euclidean space, i.e., no categorical attributes should be present. The BIRCH
clustering algorithm consists of two stages:
1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions
called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as
an ordered triple (N, LS, SS) where 'N' is the number of data points in the cluster, 'LS' is
the linear sum of the data points, and 'SS' is the squared sum of the data points in the
cluster. A CF entry can be composed of other CF entries. Optionally, we can condense
this initial CF tree into a smaller CF.
2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree.
A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree
contains a pointer to a child node, and a CF entry made up of the sum of CF entries in the
child nodes. Optionally, we can refine these clusters.
Due to this two-step process, BIRCH is also called Two-Step Clustering.
Algorithm
The BIRCH algorithm builds a tree structure over the given data called the Clustering Feature tree (CF tree). The algorithm is based on this CF tree and uses the tree-structured summary to create the clusters.

In the context of the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes that contain several sub-clusters are called CF subclusters, and these CF subclusters are situated in non-terminal (internal) CF nodes.
The CF tree is a height-balanced tree that gathers and manages clustering features and holds
necessary information of given data for further hierarchical clustering. This prevents the need to
work with the whole data given as input. Each cluster of data points in the tree is represented as a CF by three numbers (N, LS, SS).
o N = number of items in subclusters
o LS = vector sum of the data points

o SS = sum of the squared data points

There are mainly four phases which are followed by the algorithm of BIRCH.
o Scanning data into memory.

o Condense data (resize data).

o Global clustering.

o Refining clusters.

Two of them (resize data and refining clusters) are optional in these four phases. They come in
the process when more clarity is required. But scanning data is just like loading data into a
model. After loading the data, the algorithm scans the whole data and fits them into the CF trees.
In condensing, it resets and resizes the data for better fitting into the CF tree. In global clustering,
it sends CF trees for clustering using existing clustering algorithms. Finally, refining fixes the
problem of CF trees where the same valued points are assigned to different leaf nodes.
Cluster Features
BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics
to represent a larger set of data points. These summary statistics constitute a CF and represent a
sufficient substitute for the actual data for clustering purposes.
A CF is a set of three summary statistics representing a set of data points in a single cluster.
These statistics are as follows:
o Count [The number of data values in the cluster]

o Linear Sum [The sum of the individual coordinates. This is a measure of the location of
the cluster]
o Squared Sum [The sum of the squared coordinates. This is a measure of the spread of the
cluster]
NOTE: Together with the count N, the linear sum and the squared sum are sufficient to compute the mean and variance of the cluster (mean = LS/N, variance = SS/N − (LS/N)²).
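For instance, a small NumPy sketch (illustrative data only) showing how a CF is formed and how the mean and a radius-type spread measure are derived from it:

import numpy as np

points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 2.0]])  # one sub-cluster

N = len(points)                  # count
LS = points.sum(axis=0)          # linear sum
SS = (points ** 2).sum(axis=0)   # squared sum (per coordinate)

mean = LS / N                        # cluster centroid
variance = SS / N - (LS / N) ** 2    # per-coordinate variance
radius = np.sqrt(variance.sum())     # spread of the sub-cluster from (N, LS, SS) alone
print(N, LS, SS, mean, variance, radius)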
CF Tree
The building process of the CF Tree can be summarized in the following steps, such as:
Step 1: For each given record, BIRCH compares the location of that record with the location of
each CF in the root node, using either the linear sum or the mean of the CF. BIRCH passes the
incoming record to the root node CF closest to the incoming record.
Step 2: The record then descends down to the non-leaf child nodes of the root node CF selected
in step 1. BIRCH compares the location of the record with the location of each non-leaf CF.
BIRCH passes the incoming record to the non-leaf node CF closest to the incoming record.
Step 3: The record then descends down to the leaf child nodes of the non-leaf node CF selected
in step 2. BIRCH compares the location of the record with the location of each leaf. BIRCH
tentatively passes the incoming record to the leaf closest to the incoming record.
Step 4: Perform one of the below points (i) or (ii):
1. If the radius of the chosen leaf, including the new record, does not exceed the threshold T,
then the incoming record is assigned to that leaf. The leaf and its parent CFs are updated
to account for the new data point.
2. If the radius of the chosen leaf, including the new record, exceeds the Threshold T, then a
new leaf is formed, consisting of the incoming record only. The parent CFs are updated to
account for the new data point.
If step 4(ii) is executed and the leaf node already contains the maximum number L of leaf entries, the leaf node is split into two leaf nodes. If the parent node is full, split the parent node, and so on. The most
distant leaf node CFs are used as leaf node seeds, with the remaining CFs being assigned to
whichever leaf node is closer. Note that the radius of a cluster may be calculated even without
knowing the data points, as long as we have the count n, the linear sum LS, and the squared sum
SS. This allows BIRCH to evaluate whether a given data point belongs to a particular sub-cluster
without scanning the original data set.
Clustering the Sub-Clusters
Once the CF tree is built, any existing clustering algorithm may be applied to the sub-clusters
(the CF leaf nodes) to combine these sub-clusters into clusters. The task of clustering becomes
much easier as the number of sub-clusters is much less than the number of data points. When a
new data value is added, these statistics may be easily updated, thus making the computation
more efficient.
Parameters of BIRCH
There are three parameters in this algorithm that need to be tuned. Unlike K-Means, the optimal number of clusters (k) need not be supplied by the user, as the algorithm can determine it.
o Threshold: The maximum radius a sub-cluster in a leaf node of the CF tree may have; a new point is absorbed into a sub-cluster only if the sub-cluster stays within this limit.
o branching_factor: This parameter specifies the maximum number of CF sub-clusters in
each node (internal node).
o n_clusters: The number of clusters to be returned after the entire BIRCH algorithm is
complete, i.e., the number of clusters after the final clustering step. The final clustering
step is not performed if set to none, and intermediate clusters are returned.
Advantages of BIRCH
It is local in that each clustering decision is made without scanning all data points and existing
clusters. It exploits the observation that the data space is not usually uniformly occupied, and not
every data point is equally important.
It uses available memory to derive the finest possible sub-clusters while minimizing I/O costs. It
is also an incremental method that does not require the whole data set in advance.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a clustering
algorithm designed for large datasets. It is particularly effective for handling numerical data and
is widely used in data mining and machine learning applications. BIRCH builds a Clustering
Feature Tree (CF Tree) to summarize the dataset efficiently, enabling it to handle large datasets
with limited memory.

Key Features of BIRCH


1. Scalability: BIRCH is designed to handle large datasets efficiently.
2. Incremental Clustering: It processes data incrementally, making it suitable for
streaming data.
3. Memory Efficiency: It uses a compact data structure called the Clustering Feature Tree
(CF Tree) to summarize the dataset.
4. Two-Phase Clustering:
o Phase 1: Build the CF Tree to summarize the data.

o Phase 2: Apply a global clustering algorithm (e.g., hierarchical clustering or K-Means) to the CF Tree.

Clustering Feature (CF)


The core concept in BIRCH is the Clustering Feature (CF), which is a compact representation of a cluster. A CF is a triplet defined as CF = (N, LS, SS), where N is the number of data points in the cluster, LS is their linear sum, and SS is their squared sum.
CF Tree
The CF Tree is a height-balanced tree that stores CFs. It has two parameters:
1. Branching Factor (B): Maximum number of children per non-leaf node.
2. Threshold (T): Maximum diameter of a subcluster (controls the granularity of
clustering).
Structure of CF Tree:
 Leaf Nodes: Contain CFs of subclusters.
 Non-Leaf Nodes: Contain CFs of their child nodes.

Steps in BIRCH
Phase 1: Build the CF Tree
1. Initialize: Start with an empty CF Tree.
2. Insert Data Points:
o For each data point, traverse the CF Tree to find the closest leaf node.
o If the point fits within the threshold T of the closest subcluster, update the CF of that subcluster.
o If not, create a new subcluster.
o If the leaf node exceeds the branching factor B, split it and propagate the change upward.
3. Rebuild the Tree: If the CF Tree grows too large, rebuild it with a larger threshold T.
Phase 2: Global Clustering
1. Extract Subclusters: Use the CFs from the leaf nodes of the CF Tree as input.
2. Apply Clustering Algorithm: Use a global clustering algorithm (e.g., hierarchical
clustering or K-Means) to cluster the subclusters.

Advantages of BIRCH
 Efficiency: Handles large datasets with limited memory.
 Scalability: Suitable for incremental and streaming data.
 Flexibility: Can be combined with other clustering algorithms for global clustering.

Disadvantages of BIRCH
 Sensitivity to Parameters: The performance depends on the branching factor B and
threshold T.
 Limited to Numerical Data: Works best with numerical data; not suitable for categorical
data.
 Outlier Sensitivity: Outliers can affect the structure of the CF Tree.

Example
Dataset:
Consider the following 2D dataset with 6 data points:
X = {(1,1), (1,2), (2,2), (4,4), (5,5), (6,6)}

Step 1: Build the CF Tree
 The CF Tree construction summarizes the three points near (1,1) into one leaf sub-cluster, CF1, and the three points near (5,5) into another, CF2.

Step 2: Global Clustering
 Apply a clustering algorithm (e.g., K-Means) to the CFs:
o Cluster 1: CF1
o Cluster 2: CF2
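A minimal sketch of the same idea with scikit-learn's Birch implementation (assumed available); the threshold and branching-factor values are illustrative choices, not prescribed by the source.

import numpy as np
from sklearn.cluster import Birch

X = np.array([[1, 1], [1, 2], [2, 2], [4, 4], [5, 5], [6, 6]], dtype=float)

# Phase 1 builds the CF Tree (controlled by threshold and branching_factor);
# Phase 2 clusters the leaf sub-clusters into n_clusters final clusters.
birch = Birch(threshold=1.5, branching_factor=50, n_clusters=2).fit(X)
print(birch.labels_)              # final cluster label of each point
print(birch.subcluster_centers_)  # centroids of the CF-tree leaf sub-clusters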
CHAMELEON
CHAMELEON is a hierarchical clustering algorithm that uses dynamic modeling to determine
the similarity between clusters. Unlike traditional hierarchical clustering methods,
CHAMELEON considers both the interconnectivity and closeness of clusters when merging
them. This makes it particularly effective for clustering complex datasets with varying densities
and shapes.

Key Features of CHAMELEON


1. Dynamic Modeling: Uses a graph-based approach to model the dataset and measure
cluster similarity.
2. Two-Phase Clustering:
o Phase 1: Partition the dataset into smaller subclusters using a graph partitioning
algorithm.
o Phase 2: Merge subclusters hierarchically based on relative
interconnectivity and relative closeness.
3. Adaptability: Can handle datasets with arbitrary shapes, sizes, and densities.

Steps in CHAMELEON
Phase 1: Partitioning the Dataset
1. Construct a k-Nearest Neighbor (k-NN) Graph:
o Represent the dataset as a graph where each data point is a node.

o Connect each node to its k nearest neighbors based on a distance metric (e.g.,
Euclidean distance).
2. Partition the Graph:
o Use a graph partitioning algorithm (e.g., METIS) to divide the graph into smaller
subclusters.
o Each subcluster should contain highly interconnected nodes.

Phase 2: Merging Subclusters


1. Compute Cluster Similarity:
o Measure the similarity between subclusters using two criteria:

 Relative Interconnectivity: The connectivity between two clusters relative to their internal connectivity.
 Relative Closeness: The closeness of two clusters relative to their internal
closeness.
2. Merge Subclusters:
o Merge the most similar subclusters iteratively until the desired number of clusters
is obtained or a stopping condition is met.

Cluster Similarity Measures


1. Relative Interconnectivity
 Measures how well two clusters are connected relative to their internal connectivity.
 Formula: RI(Ci, Cj) = EC(Ci, Cj) / ((EC(Ci) + EC(Cj)) / 2), where EC(Ci, Cj) is the sum of the weights of the edges connecting Ci and Cj, and EC(Ci) is the weighted min-cut bisector of cluster Ci.
Disadvantages of CHAMELEON
 Computational Complexity: High computational cost due to graph construction and
partitioning.
 Parameter Sensitivity: Performance depends on the choice of k (number of nearest
neighbors) and other parameters.
 Memory Usage: Requires significant memory for large datasets.
PROBABILISTIC HIERARCHICAL CLUSTERING
Probabilistic Hierarchical Clustering is a clustering approach that combines hierarchical
clustering with probabilistic models to group data points. Unlike traditional hierarchical
clustering, which uses deterministic methods to merge or split clusters, probabilistic hierarchical
clustering assigns probabilities to data points belonging to different clusters. This approach is
particularly useful when dealing with uncertain or noisy data.
Key Features of Probabilistic Hierarchical Clustering
1. Probabilistic Models: Uses probability distributions to model clusters.
2. Hierarchical Structure: Builds a hierarchy of clusters, either agglomeratively (bottom-
up) or divisively (top-down).
3. Soft Clustering: Assigns probabilities to data points for belonging to different clusters,
allowing for overlapping clusters.
4. Uncertainty Handling: Suitable for datasets with noise or uncertainty.

Steps in Probabilistic Hierarchical Clustering


Agglomerative Approach (Bottom-Up):
1. Initialization:
o Treat each data point as a separate cluster.

o Assign a probability distribution (e.g., Gaussian) to each cluster.

2. Merge Clusters:
o Compute the similarity between clusters using a probabilistic measure (e.g.,
likelihood or divergence).
o Merge the two most similar clusters into a new cluster.

o Update the probability distribution of the new cluster.

3. Repeat:
o Repeat the merge step until all data points are in a single cluster or a stopping
condition is met.
4. Hierarchy:
o The result is a hierarchy of clusters represented as a dendrogram.

Divisive Approach (Top-Down):


1. Initialization:
o Start with all data points in a single cluster.

o Assign a probability distribution to the cluster.

2. Split Clusters:
o Use a probabilistic criterion (e.g., likelihood or divergence) to split the cluster into
smaller clusters.
o Update the probability distributions of the new clusters.

3. Repeat:
o Repeat the split step until each data point is in its own cluster or a stopping
condition is met.
4. Hierarchy:
o The result is a hierarchy of clusters represented as a dendrogram.

Probabilistic Measures
1. Likelihood-Based Similarity
 Measures how well a cluster explains the data points.
 Formula: the likelihood of a cluster C under its model is P(C) = ∏x∈C P(x | θC); the distance between two clusters can then be measured as dist(Ci, Cj) = −log [ P(Ci ∪ Cj) / (P(Ci) · P(Cj)) ].

Advantages of Probabilistic Hierarchical Clustering


 Uncertainty Handling: Assigns probabilities to data points, making it robust to noise.
 Soft Clustering: Allows for overlapping clusters.
 Flexibility: Can model complex data using various probability distributions.

Disadvantages of Probabilistic Hierarchical Clustering


 Computational Complexity: More computationally expensive than deterministic
methods.
 Parameter Sensitivity: Performance depends on the choice of probability distributions
and parameters.
 Interpretability: Probabilistic results may be harder to interpret than deterministic
results.
DENSITY-BASED CLUSTERING
Density-Based Clustering is a clustering approach that groups data points based on their density
in the feature space. Unlike partitioning or hierarchical methods, density-based clustering
identifies clusters as areas of high density separated by areas of low density. This makes it
particularly effective for discovering clusters of arbitrary shapes and handling noise and outliers.
The most popular density-based clustering algorithm is DBSCAN (Density-Based Spatial
Clustering of Applications with Noise). Below is a detailed explanation of density-based
clustering, focusing on DBSCAN.
Key Features of Density-Based Clustering
1. Arbitrary Cluster Shapes: Can identify clusters of any shape (e.g., spherical, elongated,
or irregular).
2. Noise Handling: Explicitly identifies and handles noise and outliers.
3. No Predefined Number of Clusters: Does not require the number of clusters k to be
specified in advance.
4. Density-Based: Clusters are defined as areas of high density separated by areas of low
density.
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based
clustering algorithm that groups data points into clusters based on their density in the feature
space. It is particularly effective for discovering clusters of arbitrary shapes and handling noise
and outliers.

Steps in DBSCAN
1. Initialization:
o Start with an arbitrary unvisited point.

o Retrieve all points within its ϵ-neighborhood.

2. Core Point Check:


o If the point has at least MinPts points in its neighborhood, it is a core point, and a
new cluster is formed.
o If not, the point is marked as noise (but it may later be reassigned to a cluster if it
is density-reachable from another core point).
3. Expand Cluster:
o For each core point, find all density-reachable points and add them to the cluster.

o Repeat this process for all newly added points.

4. Repeat:
o Repeat the process for all unvisited points in the dataset.

5. Termination:
o All points are either assigned to clusters or marked as noise.
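A minimal Python sketch of DBSCAN with scikit-learn (assumed available); the data, eps, and min_samples values are illustrative.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9], [25, 25]], dtype=float)

# eps is the neighbourhood radius (ϵ) and min_samples plays the role of MinPts.
db = DBSCAN(eps=2.0, min_samples=3).fit(X)
print(db.labels_)  # cluster index per point; -1 marks noise (here the point (25, 25))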
Advantages of DBSCAN
1. Arbitrary Cluster Shapes: Can identify clusters of any shape.
2. Noise Handling: Explicitly identifies and handles noise and outliers.
3. No Predefined Number of Clusters: Does not require the number of clusters k to be
specified in advance.

Disadvantages of DBSCAN
1. Parameter Sensitivity: Performance depends on the choice of ϵ and MinPts.
2. Difficulty with Varying Densities: Struggles with datasets where clusters have
significantly different densities.
3. Computational Complexity: High for large datasets due to neighborhood computations.

OPTICS (ORDERING POINTS TO IDENTIFY THE CLUSTERING STRUCTURE)


OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering
algorithm that extends DBSCAN. Unlike DBSCAN, which requires a fixed ϵ (radius)
parameter, OPTICS creates a reachability plot that allows for the analysis of clusters at varying
densities. This makes it more flexible and suitable for datasets with clusters of varying densities.

Steps in OPTICS
1. Initialization:
o Compute the core distance for each point.

o Initialize an ordered list to store the output.

2. Processing:
o For each unprocessed point:

 Retrieve its ϵ-neighborhood.


 If it is a core point, compute the reachability distances for its neighbors.
 Add the point to the ordered list.
 Expand the cluster by processing its neighbors.
3. Output:
o The ordered list of points and their reachability distances.
o Use the reachability plot to identify clusters.
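A minimal sketch with scikit-learn's OPTICS implementation (assumed available); the data and parameter values are illustrative.

import numpy as np
from sklearn.cluster import OPTICS

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9], [25, 25]], dtype=float)

# min_samples plays the role of MinPts; clusters are extracted from the
# reachability plot (xi method) rather than from a single fixed ϵ.
optics = OPTICS(min_samples=3, xi=0.05).fit(X)
print(optics.labels_)                          # cluster index per point; -1 = noise
print(optics.reachability_[optics.ordering_])  # reachability distances in processing order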
Advantages of OPTICS
1. Flexibility: Can identify clusters at varying densities.
2. Noise Handling: Explicitly identifies noise and outliers.
3. No Predefined ϵ: Does not require a fixed ϵ parameter.
Disadvantages of OPTICS
1. Computational Complexity: More computationally expensive than DBSCAN.
2. Interpretation: Requires interpretation of the reachability plot to identify clusters.
DENCLUE (DENSITY-BASED CLUSTERING)
DENCLUE (Density-Based Clustering) is a clustering algorithm that uses density functions to
model the distribution of data points in the feature space. It is particularly effective for
identifying clusters of arbitrary shapes and handling noise. DENCLUE is based on the idea that
clusters are located in regions of high density, separated by regions of low density.

Key Concepts in DENCLUE


1. Density Function:
o DENCLUE uses a kernel density estimation (KDE) function to model the
density of data points.
o Common kernels include Gaussian, Epanechnikov, and Uniform kernels.
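DENCLUE itself has no standard scikit-learn implementation, but the density-function idea can be sketched with a Gaussian kernel density estimate using SciPy (assumed available); in the full algorithm, a hill-climbing step would move points toward density attractors.

import numpy as np
from scipy.stats import gaussian_kde

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]], dtype=float)

# Gaussian kernel density estimate over the 2D points
# (gaussian_kde expects an array of shape (n_dims, n_points)).
kde = gaussian_kde(X.T)
density = kde(X.T)   # estimated density at each data point
print(density.round(4))
# In DENCLUE, points sharing a density attractor above a threshold form one cluster.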
GRID-BASED METHOD IN DATA MINING
The grid-based clustering method uses a multi-resolution grid data structure. It quantizes the object space into a finite number of cells stored in a grid, on which all clustering operations are performed. The main attraction of this method is its fast processing time, which is generally independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.
Examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform approach; and CLIQUE, which defines a grid- and density-based approach for clustering in high-dimensional data space.
Grid-Based Clustering is a clustering approach that partitions the data space into a finite
number of cells (or grids) and then performs clustering on these cells rather than on individual
data points. This method is particularly efficient for large datasets because it reduces the
computational complexity by working with summarized information in each cell.

Key Features of Grid-Based Clustering


1. Efficiency: Works with summarized data in cells, making it faster than point-based
methods.
2. Scalability: Suitable for large datasets due to reduced computational complexity.
3. Arbitrary Cluster Shapes: Can identify clusters of arbitrary shapes by merging adjacent
dense cells.
4. Noise Handling: Can easily identify and handle noise by ignoring sparse cells.
Steps in Grid-Based Clustering
1. Grid Construction:
o Divide the data space into a finite number of cells (grids) of equal size.

o Each cell is defined by its boundaries in each dimension.

2. Density Calculation:
o Calculate the density of each cell by counting the number of data points it
contains.
3. Cluster Formation:
o Merge adjacent dense cells to form clusters.

o Cells with density below a threshold are considered noise.

4. Cluster Refinement:
o Optionally, refine the clusters by reassigning points on the boundaries of cells.

Advantages of Grid-Based Clustering


1. Efficiency: Reduces computational complexity by working with cells instead of
individual points.
2. Scalability: Suitable for large datasets.
3. Noise Handling: Easily identifies and handles noise.
4. Arbitrary Cluster Shapes: Can identify clusters of arbitrary shapes by merging adjacent
dense cells.

Disadvantages of Grid-Based Clustering


1. Parameter Sensitivity: Performance depends on the choice of grid size and density
threshold.
2. Curse of Dimensionality: Performance degrades in high-dimensional spaces due to the
exponential increase in the number of cells.
3. Loss of Precision: Working with cells instead of individual points can lead to a loss of
precision.
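A minimal NumPy/SciPy sketch of the steps above (illustrative only): bin the points into a grid, keep cells whose count exceeds a density threshold, and merge adjacent dense cells into clusters with connected-component labelling.

import numpy as np
from scipy import ndimage

X = np.array([[1.0, 1.0], [1.2, 1.1], [1.1, 1.3], [8.0, 8.0], [8.2, 8.1], [4.0, 6.0]])

# Steps 1-2: grid construction and per-cell density (point counts).
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=5)

# Step 3: cells with at least 2 points are "dense"; adjacent dense cells are
# merged into clusters via connected-component labelling; sparse cells are noise.
dense = counts >= 2
cell_labels, n_clusters = ndimage.label(dense)
print(n_clusters)   # number of clusters of adjacent dense cells
print(cell_labels)  # cluster id per grid cell (0 = sparse / noise)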

STING
STING is a grid-based clustering technique in which the dataset (the spatial area) is recursively divided in a hierarchical manner: each higher-level cell is divided into a number of lower-level cells, and statistical measures of each cell are collected so that queries can be answered as quickly as possible.
Statistical Information Grid(STING):
A STING is a grid-based clustering technique. It uses a multidimensional grid data structure that
quantifies space into a finite number of cells. Instead of focusing on data points, it focuses on the
value space surrounding the data points.
In STING, the spatial area is divided into rectangular cells and several levels of cells at different
resolution levels. High-level cells are divided into several low-level cells.
In STING, statistical information about the attributes in each cell, such as the mean, maximum, and minimum values, is precomputed and stored as statistical parameters. These statistical
parameters are useful for query processing and other data analysis tasks.
The statistical parameter of higher-level cells can easily be computed from the parameters of the
lower-level cells.
How STING Works:
Step 1: Determine a layer to begin with.
Step 2: For each cell of this layer, calculate the confidence interval or estimated range of probability that the cell is relevant to the query.
Step 3: From the interval calculated above, label the cell as relevant or not relevant.
Step 4: If this layer is the bottom layer, go to point 6, otherwise, go to point 5.
Step 5: It goes down the hierarchy structure by one level. Go to point 2 for those cells that form
the relevant cell of the high-level layer.
Step 6: If the specification of the query is met, go to point 8, otherwise go to point 7.
Step 7: Retrieve those data that fall into the relevant cells and do further processing. Return the
result that meets the requirement of the query. Go to point 9.
Step 8: Find the regions of relevant cells. Return those regions that meet the requirement of the
query. Go to point 9.
Step 9: Stop or terminate.
Advantages:
 Grid-based computing is query-independent because the statistics stored in each cell
represent a summary of the data in the grid cells and are query-independent.
 The grid structure facilitates parallel processing and incremental updates.
Disadvantage:
 The main disadvantage of STING (Statistical Information Grid) is that all cluster boundaries are either horizontal or vertical, so no diagonal boundaries are detected.
CLIQUE ALGORITHM IN DATA MINING
CLIQUE is a density-based and grid-based subspace clustering algorithm. So let's first take a look at what grid-based and density-based clustering techniques are.
 Grid-Based Clustering Technique: In Grid-Based Methods, the space of instance is
divided into a grid structure. Clustering techniques are then applied using the Cells of the
grid, instead of individual data points, as the base units.
 Density-Based Clustering Technique: In Density-Based Methods, A cluster is a
maximal set of connected dense units in a subspace.
CLIQUE Algorithm:
The CLIQUE algorithm combines density-based and grid-based techniques into a subspace clustering algorithm: it finds clusters by taking a density threshold and a number of grid intervals as input parameters. It is specially designed to handle datasets with a large number of dimensions. CLIQUE is very scalable with respect to both the number of records and the number of dimensions in the dataset because it is grid-based and uses the Apriori property effectively.
Apriori property: if an X-dimensional unit is dense, then all of its projections in (X−1)-dimensional space are also dense.
This means that dense regions in a given subspace must produce dense regions when projected onto a lower-dimensional subspace. Because CLIQUE uses the Apriori property, it restricts its search for high-dimensional dense cells to the intersection of dense cells in lower-dimensional subspaces.
Working of CLIQUE Algorithm:
The CLIQUE algorithm first divides the data space into grids by dividing each dimension into equal intervals called units. It then identifies dense units: a unit is dense if the number of data points in it exceeds the threshold value. Once the algorithm finds dense cells along one dimension, it tries to find dense cells along two dimensions, and so on until all dense cells across all dimensions are found. After finding all dense cells, the algorithm finds the largest sets ("clusters") of connected dense cells. Finally, CLIQUE generates a minimal description of each cluster. Clusters are generated from all dense subspaces using the Apriori approach.
Advantage:
 CLIQUE is a subspace clustering algorithm that outperforms K-means, DBSCAN, and
Farthest First in both execution time and accuracy.
 CLIQUE can find clusters of any shape and is able to find any number of clusters in any
number of dimensions, where the number is not predetermined by a parameter.
 It is one of the simplest clustering methods, and its results are interpretable.
Disadvantage:
 The main disadvantage of the CLIQUE algorithm is that its results depend heavily on the grid (cell) size and the density threshold: if the cell size is unsuitable for the data, the density estimates become too coarse and the correct clusters cannot be found.
EVALUATION OF CLUSTERING
Evaluation of clustering is the process of assessing the quality of the clusters generated by a
clustering algorithm. Unlike supervised learning, where evaluation metrics like accuracy or F1-
score are used, clustering evaluation is more challenging because there are no ground truth
labels. Clustering evaluation methods can be broadly categorized into internal
evaluation, external evaluation, and relative evaluation.
1. Internal Evaluation
Internal evaluation measures the quality of clusters based on the intrinsic properties of the data,
such as compactness (how close the points in a cluster are) and separation (how well-separated
the clusters are). These metrics do not require ground truth labels.
Common Internal Evaluation Metrics:
1. Silhouette Score:
o Measures how similar a point is to its own cluster compared to other clusters.

o Formula: s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the average distance from point i to the other points in its own cluster and b(i) is the smallest average distance from i to the points of any other cluster; values close to 1 indicate well-clustered points.
Common Relative Evaluation Methods:
1. Elbow Method:
o Used to determine the optimal number of clusters k.

o Plot the within-cluster sum of squares (WCSS) against the number of clusters.

o The "elbow" point (where the rate of decrease sharply changes) is chosen as the
optimal kk.
2. Gap Statistic:
o Compares the WCSS of the clustering result to the WCSS of a reference dataset
(e.g., uniform random data).
o The optimal k is the one that maximizes the gap statistic.

3. Stability Analysis:
o Measures the consistency of clustering results across different runs or subsets of
the data.
o Higher stability indicates more reliable clustering.
 Normalized Mutual Information (NMI) (an external evaluation metric):
o Computes the mutual information between the clustering result and the ground-truth labels, normalized to lie between 0 and 1.

o An NMI of 1.0 indicates perfect agreement with the ground truth.
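A minimal sketch of internal evaluation, the elbow method, and an external check with scikit-learn (assumed available); the data, the range of k values, and the ground-truth labels are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, normalized_mutual_info_score

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 9]], dtype=float)

# Elbow method: inspect the within-cluster sum of squares (WCSS) for several k.
for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # WCSS drops sharply until the "elbow" at a suitable k

# Internal validation for a chosen k.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))      # closer to 1 is better
print(davies_bouldin_score(X, labels))  # lower is better

# External validation when ground-truth labels are available.
truth = np.array([0, 0, 0, 1, 1, 1])
print(normalized_mutual_info_score(truth, labels))  # 1.0 = perfect agreement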
