CLUSTER ANALYSIS (Unit 3: Data Mining)
Cluster analysis, also known as clustering, is a method of data mining that groups similar data
points together. The goal of cluster analysis is to divide a dataset into groups (or clusters) such
that the data points within each group are more similar to each other than to data points in other
groups. This process is often used for exploratory data analysis and can help identify patterns or
relationships within the data that may not be immediately obvious. There are many different
algorithms used for cluster analysis, such as k-means, hierarchical clustering, and density-based
clustering. The choice of algorithm will depend on the specific requirements of the analysis and
the nature of the data being analyzed.
Cluster analysis is the process of finding groups of similar objects in order to form clusters. It is an unsupervised machine learning technique that operates on unlabelled data: a set of data points that belong together forms a cluster, and all the objects in a cluster belong to the same group.
The given data is divided into different groups by collecting similar objects into one group. That group is a cluster: a collection of similar data objects grouped together.
For example, consider a dataset of vehicles that contains information about different vehicles such as cars, buses, and bicycles. Because this is unsupervised learning, there are no class labels such as Car or Bike; all the data is mixed together and not organised in any structured manner.
Our task is to turn this unlabelled data into labelled data, and this can be done using clusters. The main idea of cluster analysis is to arrange all the data points into clusters, for example a cars cluster containing all the cars, a bikes cluster containing all the bikes, and so on. In short, it is the partitioning of similar objects, applied to unlabelled data.
Properties of Clustering :
1. Clustering Scalability: Nowadays we deal with vast amounts of data stored in huge databases. To handle such extensive databases, the clustering algorithm must be scalable; if it is not, clustering only a sample of the data may give biased and therefore wrong results.
2. High Dimensionality: The algorithm should be able to handle high-dimensional data, not only data with a small number of dimensions.
3. Algorithm Usability with multiple data kinds: Clustering algorithms should work with different kinds of data. They should be capable of dealing with discrete, categorical, interval-based, and binary data.
4. Dealing with unstructured data: There would be some databases that contain missing
values, and noisy or erroneous data. If the algorithms are sensitive to such data then it may lead
to poor quality clusters. So it should be able to handle unstructured data and give some structure
to the data by organising it into groups of similar data objects. This makes the job of the data
expert easier in order to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and
usable. The interpretability reflects how easily the data is understood.
CLUSTERING METHODS
The clustering methods can be classified into the following categories:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method: It is used to partition the data in order to form clusters. If "n" partitions are made of the "p" objects in the database, then each partition is represented by a cluster and n ≤ p. The two conditions that need to be satisfied by this partitioning clustering method are:
Each object must belong to exactly one group.
Each group must contain at least one object.
In the partitioning method there is a technique called iterative relocation, which means an object can be moved from one group to another to improve the partitioning.
Hierarchical Method: In this method, a hierarchical decomposition of the given set of data objects is created. Hierarchical methods can be classified on the basis of how the hierarchical decomposition is formed. There are two approaches for creating the hierarchical decomposition:
Agglomerative Approach: The agglomerative approach is also known as the bottom-up approach. Initially, each object forms its own separate group. The method then keeps merging the objects or groups that are close to one another, i.e., that exhibit similar properties. This merging continues until the termination condition holds.
Divisive Approach: The divisive approach is also known as the top-down approach. In this approach, we start with all the data objects in a single cluster. This cluster is then divided into smaller clusters by continuous iteration. The iteration continues until the termination condition is met or until each cluster contains a single object.
Once a group is split or merged, the decision can never be undone; hierarchical clustering is therefore a rigid and not very flexible method. Two approaches that can be used to improve the quality of hierarchical clustering in data mining are:
Carefully analyse the linkages between objects at every partitioning of the hierarchical clustering.
Integrate hierarchical agglomeration with other clustering techniques: first group the objects into micro-clusters, and then perform macro-clustering on the micro-clusters.
Density-Based Method: The density-based method focuses on density. In this method, a cluster keeps growing as long as the density in its neighbourhood exceeds some threshold, i.e., for each data point within a given cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Grid-Based Method: In the grid-based method, a grid is formed over the object space, i.e., the object space is quantized into a finite number of cells that form a grid structure. A major advantage of the grid-based method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space rather than on the number of data objects.
Model-Based Method: In the model-based method, a model is hypothesized for each cluster and the method finds the best fit of the data to that model. A density function is used to locate the clusters for a given model. This reflects the spatial distribution of the data points and also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.
Constraint-Based Method: The constraint-based clustering method is performed by the
incorporation of application or user-oriented constraints. A constraint refers to the user
expectation or the properties of the desired clustering results. Constraints provide us with an
interactive way of communication with the clustering process. The user or the application
requirement can specify constraints.
Applications Of Cluster Analysis:
It is widely used in image processing, data analysis, and pattern recognition.
It helps marketers to find the distinct groups in their customer base and they can
characterize their customer groups by using purchasing patterns.
It can be used in the field of biology, by deriving animal and plant taxonomies and
identifying genes with the same capabilities.
It also helps in information discovery by classifying documents on the web.
Dendrogram: A tree-like diagram used in hierarchical clustering to represent the arrangement of clusters.
1. Partitioning Clustering
Divides data into non-overlapping clusters.
Example Algorithms:
o k-Means: Groups data into k clusters by minimizing the sum of squared
distances between data points and their cluster centroids.
o k-Medoids: Similar to k-means but uses actual data points (medoids) as cluster
centers, making it more robust to outliers.
2. Hierarchical Clustering
Builds a hierarchy of clusters, either by merging smaller clusters into larger ones
(agglomerative) or splitting larger clusters into smaller ones (divisive).
Example Algorithms:
o Agglomerative Hierarchical Clustering: Starts with each object as a separate
cluster and merges the closest pairs of clusters iteratively.
o Divisive Hierarchical Clustering: Starts with all objects in one cluster and splits
them into smaller clusters.
3. Density-Based Clustering
Groups data based on density, identifying clusters as dense regions separated by sparse
regions.
Example Algorithms:
o DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Identifies clusters of varying shapes and handles noise effectively.
o OPTICS (Ordering Points To Identify the Clustering Structure): Similar to
DBSCAN but creates a reachability plot to identify clusters at different density
levels.
4. Model-Based Clustering
Assumes that the data is generated from a mixture of probability distributions and tries to
fit the data to these models.
Example Algorithms:
o Gaussian Mixture Models (GMM): Assumes data points are generated from a
mixture of Gaussian distributions.
5. Grid-Based Clustering
Divides the data space into a finite number of cells (grids) and performs clustering on the
grid structure.
Example Algorithms:
o STING (Statistical Information Grid): Divides the data space into rectangular
cells and computes statistical information for each cell.
Steps in a cluster analysis include:
Feature Selection: Choose relevant features for clustering.
Cluster Validation: Evaluate the quality of clusters using internal or external validation metrics.
Interpretation: Analyze and interpret the clusters to derive meaningful insights.
Further applications include:
Image Segmentation: Divide an image into regions for object detection or pattern recognition.
Bioinformatics: Cluster genes or proteins with similar functions.
Anomaly Detection: Detect outliers or unusual patterns in data.
a) k-Means Clustering
Limitations:
o Requires the number of clusters (k) to be specified in advance.
b) k-Medoids Clustering
Objective: Similar to k-means, but uses actual data points (medoids) as cluster centers
instead of means.
Algorithm: PAM (Partitioning Around Medoids).
Advantages:
o More robust to noise and outliers compared to k-means.
Limitations:
o Computationally more expensive than k-means.
2. Hierarchical Clustering
Hierarchical methods build a hierarchy of clusters, either by merging smaller clusters into larger
ones (agglomerative) or splitting larger clusters into smaller ones (divisive).
a) Agglomerative Hierarchical Clustering
Steps:
1. Treat each data point as a single cluster.
2. Merge the two closest clusters iteratively.
3. Repeat until all points are in a single cluster.
Linkage Criteria:
o Single Linkage: Distance between the closest pair of points in two clusters.
o Complete Linkage: Distance between the farthest pair of points in two clusters.
o Average Linkage: Average distance between all pairs of points in two clusters.
Advantages:
o Does not require the number of clusters to be specified in advance.
Limitations:
o Computationally expensive for large datasets.
b) Divisive Hierarchical Clustering
Limitations:
o Computationally expensive and less commonly used than agglomerative clustering.
3. Density-Based Clustering
Density-based methods identify clusters as dense regions of data points separated by sparse
regions.
a) DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Key Concepts:
o Core Point: A point with at least MinPts points within a radius ε.
o Border Point: A point within ε of a core point but which does not itself have enough neighbors.
o Noise Point: A point that is neither a core point nor a border point.
Steps:
1. Identify core points and form clusters around them.
2. Assign border points to the nearest core point's cluster.
3. Treat noise points as outliers.
Advantages:
o Does not require the number of clusters to be specified.
Limitations:
o Sensitive to the choice of ε and MinPts.
b) OPTICS (Ordering Points To Identify the Clustering Structure)
Limitations:
o More complex and computationally expensive than DBSCAN.
4. Model-Based Clustering
Model-based methods assume that the data is generated from a mixture of probability
distributions and try to fit the data to these models.
a) Gaussian Mixture Models (GMM)
Objective: Model the data as a mixture of Gaussian distributions.
Steps:
1. Assume the data is generated from k Gaussian distributions.
2. Use the Expectation-Maximization (EM) algorithm to estimate the parameters
of the distributions.
3. Assign each point to the cluster with the highest probability.
Advantages:
o Can model clusters of different shapes and sizes.
Limitations:
o Requires the number of clusters to be specified.
o Sensitive to initialization.
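To make the GMM procedure concrete, the following minimal Python sketch fits a two-component Gaussian mixture with scikit-learn's GaussianMixture (an assumed dependency); the six 2-D points and k = 2 are illustrative choices, not taken from the text.

import numpy as np
from sklearn.mixture import GaussianMixture

# Six illustrative 2-D points forming two loose groups.
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Fit a mixture of k = 2 Gaussians; parameters are estimated with the EM algorithm.
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
gmm.fit(X)

print("Cluster assignments:", gmm.predict(X))        # hard assignment to the most probable component
print("Posterior probabilities:\n", gmm.predict_proba(X))
print("Component means:\n", gmm.means_)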
5. Grid-Based Clustering
Grid-based methods divide the data space into a finite number of cells (grids) and perform
clustering on the grid structure.
a) STING (Statistical Information Grid)
Steps:
1. Divide the data space into rectangular cells.
2. Compute statistical information (e.g., mean, count) for each cell.
3. Merge neighboring cells with similar statistics to form clusters.
Advantages:
o Computationally efficient for large datasets.
Limitations:
o Limited to low-dimensional data.
Summary of selected methods:
DBSCAN (Density-Based): Handles noise and arbitrary cluster shapes; sensitive to parameters.
STING (Grid-Based): Efficient for large datasets; limited to low-dimensional data.
PARTITIONING METHOD: This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data. It requires the data analyst to specify the number of clusters to be generated. In the partitioning method, given a database (D) that contains multiple (N) objects, the method constructs a user-specified number (K) of partitions of the data, in which each partition represents a cluster and a particular region. Many algorithms come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and CLARA (Clustering Large Applications). Below we look at the working of the K-Means algorithm in detail.
K-Means (A Centroid-Based Technique): The K-Means algorithm takes an input parameter K from the user and partitions a dataset containing N objects into K clusters so that the resulting similarity among the data objects inside a group (intra-cluster similarity) is high, while the similarity with data objects outside the cluster (inter-cluster similarity) is low. The similarity of a cluster is measured with respect to the mean value of the cluster; K-Means is therefore a squared-error-based algorithm. At the start, K objects are chosen at random from the dataset, each representing a cluster mean (centre). Every remaining data object is assigned to the nearest cluster based on its distance from the cluster mean, and the new mean of each cluster is then recalculated from the data objects assigned to it.
The K-Means clustering algorithm is a popular unsupervised machine learning technique used
to partition a dataset into k distinct, non-overlapping clusters. The goal is to group similar data
points together while keeping dissimilar points in different clusters. Below is a detailed
explanation of the K-Means algorithm, followed by a numerical example.
Steps in the K-Means Algorithm
1. Initialization:
o Choose the number of clusters k and select k initial centroids (e.g., k randomly chosen data points).
2. Assignment Step:
o Assign each data point to the nearest centroid based on a distance metric (usually
Euclidean distance).
3. Update Step:
o Recalculate the centroids as the mean of all data points assigned to each cluster.
4. Repeat:
o Repeat the assignment and update steps until the centroids no longer change
significantly (convergence).
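As an illustration of these steps, here is a minimal sketch using scikit-learn's KMeans (assumed to be installed); the six 2-D points and k = 2 are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

# Six illustrative 2-D points forming two well-separated groups.
X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9]])

# k = 2 clusters; n_init controls how many random initializations are tried.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)     # assignment and update steps run until convergence

print("Cluster labels:", labels)
print("Centroids:\n", kmeans.cluster_centers_)
print("Within-cluster sum of squares (inertia):", kmeans.inertia_)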
K-Medoids ALGORITHM
The K-Medoids algorithm, also known as Partitioning Around Medoids (PAM), is a
clustering algorithm similar to K-Means but more robust to noise and outliers. Instead of using
the mean of data points as the cluster center (centroid), K-Medoids uses actual data points
called medoids as the representative of each cluster. This makes it more suitable for datasets
with noise or outliers.
Steps in the K-Medoids Algorithm
1. Initialization:
o Select k data points as the initial medoids.
2. Assignment Step:
o Assign each data point to the nearest medoid based on a distance metric (e.g.,
Euclidean distance).
3. Update Step:
o For each cluster, find the data point that minimizes the total dissimilarity (cost)
within the cluster. This point becomes the new medoid.
4. Repeat:
o Repeat the assignment and update steps until the medoids no longer change
(convergence).
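The small numpy sketch below illustrates only the assignment and medoid-update loop described above (it is not the full PAM swap procedure); the data points, k = 2, and the starting medoids are assumptions for illustration.

import numpy as np

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [9, 8], [8, 9]], dtype=float)
k = 2
medoid_idx = np.array([0, 3])      # start with two arbitrary points as medoids

# Pairwise Euclidean distance matrix between all points.
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

for _ in range(10):                # alternate assignment and update steps
    labels = np.argmin(dist[:, medoid_idx], axis=1)   # assign each point to its nearest medoid
    new_medoids = medoid_idx.copy()
    for c in range(k):
        members = np.where(labels == c)[0]
        # The new medoid is the member minimizing total distance to the other members.
        costs = dist[np.ix_(members, members)].sum(axis=1)
        new_medoids[c] = members[np.argmin(costs)]
    if np.array_equal(new_medoids, medoid_idx):        # medoids unchanged: converged
        break
    medoid_idx = new_medoids

print("Medoids:", X[medoid_idx])
print("Labels:", labels)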
HIERARCHICAL CLUSTERING
Hierarchical clustering is a type of clustering algorithm that builds a hierarchy of clusters,
either by merging smaller clusters into larger ones (agglomerative approach) or by splitting
larger clusters into smaller ones (divisive approach). Unlike partitioning methods like K-Means
or K-Medoids, hierarchical clustering does not require the number of clusters k to be specified
in advance. Instead, it produces a dendrogram, which is a tree-like structure that shows the
relationships between clusters at different levels of granularity.
Types of Hierarchical Clustering
1. Agglomerative Hierarchical Clustering:
o This is a bottom-up approach.
o This process continues until all data points are in a single cluster or a stopping
condition is met.
2. Divisive Hierarchical Clustering:
o This is a top-down approach.
o This process continues until each data point is in its own cluster or a stopping
condition is met.
AGGLOMERATIVE CLUSTERING
Agglomerative clustering is more commonly used due to its simplicity and computational
efficiency.
Steps:
1. Start with each data point as its own cluster.
2. Merge Clusters:
o Find the two closest clusters based on a linkage criterion.
3. Repeat:
o Repeat the merge step until all data points are in a single cluster or a stopping
condition is met.
4. Dendrogram:
o Visualize the hierarchy of clusters using a dendrogram.
We will use agglomerative hierarchical clustering with single linkage to cluster this dataset
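The dataset referred to above is not reproduced here, so the sketch below assumes a small hypothetical 2-D dataset and runs single-linkage agglomerative clustering with SciPy (an assumed dependency).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical 2-D points: two visually separated groups.
X = np.array([[1, 1], [1.5, 1.5], [1, 2], [5, 5], [5.5, 5], [6, 6]])

Z = linkage(X, method='single', metric='euclidean')   # merge history (the hierarchy)
labels = fcluster(Z, t=2, criterion='maxclust')       # cut the dendrogram into 2 clusters

print("Linkage matrix (each row: merged clusters, distance, new size):")
print(Z)
print("Cluster labels:", labels)
# dendrogram(Z) can be used with matplotlib to draw the dendrogram of the hierarchy.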
DIFFERENCE BETWEEN AGGLOMERATIVE CLUSTERING AND DIVISIVE
CLUSTERING :
Complexity Level:
o Agglomerative: More computationally expensive due to pairwise distance calculations.
o Divisive: Less computationally expensive but requires careful cluster splitting.
DIVISIVE CLUSTERING
1. Start with all data points in a single cluster.
2. Split Clusters:
o Choose a cluster to split based on a criterion (e.g., maximizing the distance
between subclusters).
o Split the cluster into two or more smaller clusters.
3. Repeat:
o Repeat the split step until each data point is in its own cluster or a stopping
condition is met.
4. Dendrogram:
o Visualize the hierarchy of clusters using a dendrogram.
Splitting Criterion
A common splitting criterion is to use a distance metric (e.g., Euclidean distance) and split the
cluster into two subclusters such that the distance between the two subclusters is maximized.
This can be done using algorithms like DIANA (Divisive Analysis).
DISTANCE MEASURES
In clustering algorithms, distance measures are used to quantify the similarity or dissimilarity
between data points. The choice of distance measure can significantly impact the results of
clustering, as it determines how the algorithm groups data points into clusters. Below is a
detailed explanation of common distance measures used in algorithmic methods, along with their
mathematical formulations and use cases.
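Since the individual formulations are not reproduced here, the short sketch below illustrates three of the most common measures (Euclidean, Manhattan, and cosine distance) on two assumed sample vectors.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))        # straight-line (L2) distance
manhattan = np.sum(np.abs(x - y))                # city-block (L1) distance
cosine_distance = 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print("Euclidean:", round(euclidean, 3))         # sqrt(9 + 4 + 0) = 3.606
print("Manhattan:", manhattan)                   # 3 + 2 + 0 = 5.0
print("Cosine distance:", round(cosine_distance, 3))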
BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data
mining algorithm that performs hierarchical clustering over large data sets. With modifications, it
can also be used to accelerate k-means clustering and Gaussian mixture modeling with the
expectation-maximization algorithm. An advantage of BIRCH is its ability to incrementally and
dynamically cluster incoming, multi-dimensional metric data points to produce the best quality
clustering for a given set of resources (memory and time constraints). In most cases, BIRCH
only requires a single scan of the database.
Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to
handle 'noise' (data points that are not part of the underlying pattern) effectively", beating
DBSCAN by two months. The BIRCH algorithm received the SIGMOD 10 year test of time
award in 2006.
Basic clustering algorithms like K means and agglomerative clustering are the most commonly
used clustering algorithms. But when performing clustering on very large datasets, BIRCH and
DBSCAN are the advanced clustering algorithms useful for performing precise clustering on
large datasets. Moreover, BIRCH is popular because of its easy implementation. BIRCH first summarizes the dataset into small, dense summaries and then clusters those summaries; it does not cluster the dataset directly. That is why BIRCH is often used together with other clustering algorithms: after the summary is built, it can be clustered by another clustering algorithm.
It is provided (for example, in scikit-learn) as an alternative to MiniBatchKMeans. It converts the data into a tree data structure from which centroids are read off the leaves, and these centroids can either be the final cluster centroids or the input to another clustering algorithm such as agglomerative clustering.
Problem with Previous Clustering Algorithm
Previous clustering algorithms performed less effectively over very large databases and did not
adequately consider the case wherein a dataset was too large to fit in main memory. Furthermore,
most of BIRCH's predecessors inspect all data points (or all currently existing clusters) equally
for each clustering decision. They do not perform heuristic weighting based on the distance
between these data points. As a result, there was a lot of overhead maintaining high clustering
quality while minimizing the cost of additional IO (input/output) operations.
Stages of BIRCH
BIRCH is often used to complement other clustering algorithms by creating a summary of the
dataset that the other clustering algorithm can then use. However, BIRCH has one major drawback: it can only process metric attributes. A metric attribute is an attribute whose values can
be represented in Euclidean space, i.e., no categorical attributes should be present. The BIRCH
clustering algorithm consists of two stages:
1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions
called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as
an ordered triple (N, LS, SS) where 'N' is the number of data points in the cluster, 'LS' is
the linear sum of the data points, and 'SS' is the squared sum of the data points in the
cluster. A CF entry can be composed of other CF entries. Optionally, we can condense
this initial CF tree into a smaller CF tree.
2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree.
A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree
contains a pointer to a child node, and a CF entry made up of the sum of CF entries in the
child nodes. Optionally, we can refine these clusters.
Due to this two-step process, BIRCH is also called Two-Step Clustering.
Algorithm
The BIRCH algorithm builds a tree structure over the given data called the Clustering Feature tree (CF tree). The algorithm is based on this CF (clustering feature) tree and uses the tree-structured summary to create clusters.
In the context of the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes that have several sub-clusters are called CF subclusters; these CF subclusters are situated in non-terminal (internal) CF nodes.
The CF tree is a height-balanced tree that gathers and manages clustering features and holds the necessary information about the given data for further hierarchical clustering. This avoids the need to work with the whole input data. Each cluster of data points is represented in the tree as a CF given by three numbers (N, LS, SS):
o N = number of items in the subcluster
o LS = linear (vector) sum of the data points
o SS = squared sum of the data points
There are mainly four phases followed by the BIRCH algorithm:
o Scanning data into memory.
o Condensing data (resizing the data).
o Global clustering.
o Refining clusters.
Two of these phases (condensing the data and refining the clusters) are optional; they come into play when more accuracy is required. Scanning the data is essentially loading the data into the model: after loading, the algorithm scans the whole dataset and fits it into the CF tree. In condensing, the data summary is resized for a better fit into the CF tree. In global clustering, the CF tree is sent for clustering using an existing clustering algorithm. Finally, refining fixes the problem that points with the same value may have been assigned to different leaf nodes.
Cluster Features
BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics
to represent a larger set of data points. These summary statistics constitute a CF and represent a
sufficient substitute for the actual data for clustering purposes.
A CF is a set of three summary statistics representing a set of data points in a single cluster.
These statistics are as follows:
o Count [The number of data values in the cluster]
o Linear Sum [The sum of the individual coordinates. This is a measure of the location of
the cluster]
o Squared Sum [The sum of the squared coordinates. This is a measure of the spread of the
cluster]
NOTE: The mean and variance of the cluster can be computed directly from the count, the linear sum, and the squared sum, so the CF is a sufficient summary for these statistics.
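A small numpy sketch of this point, using an assumed three-point cluster: the mean and per-dimension variance are recovered from the CF triple (N, LS, SS) alone, without revisiting the individual points.

import numpy as np

points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 2.0]])   # assumed sample cluster

N = len(points)                  # count
LS = points.sum(axis=0)          # linear sum of the coordinates
SS = (points ** 2).sum(axis=0)   # squared sum of the coordinates

mean = LS / N
variance = SS / N - (LS / N) ** 2    # E[x^2] - (E[x])^2, per dimension

print("CF = (N, LS, SS) =", (N, LS, SS))
print("Mean:", mean, " Variance:", variance)
print("Check against raw data:", points.mean(axis=0), points.var(axis=0))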
CF Tree
The building process of the CF Tree can be summarized in the following steps, such as:
Step 1: For each given record, BIRCH compares the location of that record with the location of
each CF in the root node, using either the linear sum or the mean of the CF. BIRCH passes the
incoming record to the root node CF closest to the incoming record.
Step 2: The record then descends down to the non-leaf child nodes of the root node CF selected
in step 1. BIRCH compares the location of the record with the location of each non-leaf CF.
BIRCH passes the incoming record to the non-leaf node CF closest to the incoming record.
Step 3: The record then descends down to the leaf child nodes of the non-leaf node CF selected
in step 2. BIRCH compares the location of the record with the location of each leaf. BIRCH
tentatively passes the incoming record to the leaf closest to the incoming record.
Step 4: Perform one of the below points (i) or (ii):
1. If the radius of the chosen leaf, including the new record, does not exceed the threshold T, then the incoming record is assigned to that leaf. The leaf and its parent CFs are updated to account for the new data point.
2. If the radius of the chosen leaf, including the new record, exceeds the threshold T, then a new leaf is formed, consisting of the incoming record only. The parent CFs are updated to account for the new data point.
If step 4(ii) is executed and the leaf node already contains the maximum number L of leaf entries, the leaf node is split into two leaf nodes. If the parent node is full, the parent node is split as well, and so on. The two most distant leaf-node CFs are used as seeds for the new leaf nodes, and the remaining CFs are assigned to whichever seed is closer. Note that the radius of a cluster can be calculated without knowing the individual data points, as long as we have the count N, the linear sum LS, and the squared sum SS. This allows BIRCH to evaluate whether a given data point belongs to a particular sub-cluster without scanning the original dataset.
Clustering the Sub-Clusters
Once the CF tree is built, any existing clustering algorithm may be applied to the sub-clusters
(the CF leaf nodes) to combine these sub-clusters into clusters. The task of clustering becomes
much easier as the number of sub-clusters is much less than the number of data points. When a
new data value is added, these statistics may be easily updated, thus making the computation
more efficient.
Parameters of BIRCH
There are three parameters in this algorithm that need to be tuned. Unlike K-Means, the number of clusters does not have to be fixed by the user if only the CF-tree subclusters are required (see n_clusters below).
o Threshold: The maximum radius of a subcluster in a leaf node of the CF tree; a new point is absorbed into the closest subcluster only if the merged subcluster still satisfies this threshold.
o branching_factor: This parameter specifies the maximum number of CF sub-clusters in
each node (internal node).
o n_clusters: The number of clusters to be returned after the entire BIRCH algorithm is
complete, i.e., the number of clusters after the final clustering step. The final clustering
step is not performed if set to none, and intermediate clusters are returned.
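These three parameters correspond to scikit-learn's Birch implementation, so a minimal usage sketch looks as follows (the random data, the threshold value, and n_clusters=3 are illustrative assumptions).

import numpy as np
from sklearn.cluster import Birch

X = np.random.RandomState(0).rand(200, 2)     # 200 random 2-D points for illustration

# Build the CF tree and then run the final (global) clustering step with 3 clusters.
birch = Birch(threshold=0.1, branching_factor=50, n_clusters=3)
labels = birch.fit_predict(X)

print("Subclusters in the CF tree:", len(birch.subcluster_centers_))
print("First ten cluster labels:", labels[:10])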
Advantages of BIRCH
It is local in that each clustering decision is made without scanning all data points and existing
clusters. It exploits the observation that the data space is not usually uniformly occupied, and not
every data point is equally important.
It uses available memory to derive the finest possible sub-clusters while minimizing I/O costs. It
is also an incremental method that does not require the whole data set in advance.
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a clustering
algorithm designed for large datasets. It is particularly effective for handling numerical data and
is widely used in data mining and machine learning applications. BIRCH builds a Clustering
Feature Tree (CF Tree) to summarize the dataset efficiently, enabling it to handle large datasets
with limited memory.
Steps in BIRCH
Phase 1: Build the CF Tree
1. Initialize: Start with an empty CF Tree.
2. Insert Data Points:
o For each data point, traverse the CF Tree to find the closest leaf node.
o If the point fits within the threshold T of the closest subcluster, update the CF of that subcluster.
o If not, create a new subcluster.
o If the leaf node exceeds the branching factor B, split it and propagate the change upward.
3. Rebuild the Tree: If the CF Tree grows too large, rebuild it with a larger threshold T.
Phase 2: Global Clustering
1. Extract Subclusters: Use the CFs from the leaf nodes of the CF Tree as input.
2. Apply Clustering Algorithm: Use a global clustering algorithm (e.g., hierarchical
clustering or K-Means) to cluster the subclusters.
Advantages of BIRCH
Efficiency: Handles large datasets with limited memory.
Scalability: Suitable for incremental and streaming data.
Flexibility: Can be combined with other clustering algorithms for global clustering.
Disadvantages of BIRCH
Sensitivity to Parameters: The performance depends on the branching factor B and the threshold T.
Limited to Numerical Data: Works best with numerical data; not suitable for categorical
data.
Outlier Sensitivity: Outliers can affect the structure of the CF Tree.
Example
Dataset:
Consider the following 2D dataset with 6 data points:
X = {(1,1), (1,2), (2,2), (4,4), (5,5), (6,6)}
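The worked computation for this example is not reproduced above, so the sketch below assumes, purely for illustration, that the first three and the last three points form the two clusters and computes their CF entries.

import numpy as np

X = np.array([[1, 1], [1, 2], [2, 2], [4, 4], [5, 5], [6, 6]], dtype=float)
cluster1, cluster2 = X[:3], X[3:]        # assumed split into two clusters

def cf(points):
    # Return the Clustering Feature triple (N, LS, SS) for a set of points.
    return len(points), points.sum(axis=0), (points ** 2).sum(axis=0)

CF1, CF2 = cf(cluster1), cf(cluster2)
print("CF1 =", CF1)    # (3, array([4., 5.]), array([6., 9.]))
print("CF2 =", CF2)    # (3, array([15., 15.]), array([77., 77.]))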
CHAMELEON
CHAMELEON is a hierarchical clustering algorithm that uses dynamic modeling to determine
the similarity between clusters. Unlike traditional hierarchical clustering methods,
CHAMELEON considers both the interconnectivity and closeness of clusters when merging
them. This makes it particularly effective for clustering complex datasets with varying densities
and shapes.
Steps in CHAMELEON
Phase 1: Partitioning the Dataset
1. Construct a k-Nearest Neighbor (k-NN) Graph:
o Represent the dataset as a graph where each data point is a node.
o Connect each node to its k nearest neighbors based on a distance metric (e.g.,
Euclidean distance).
2. Partition the Graph:
o Use a graph partitioning algorithm (e.g., METIS) to divide the graph into smaller
subclusters.
o Each subcluster should contain highly interconnected nodes.
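Only the k-nearest-neighbour graph of Phase 1 is easy to sketch with standard libraries; the graph partitioning itself would need a tool such as METIS and is not shown. The data and k = 5 below are assumptions.

import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(50, 2)      # 50 illustrative 2-D points

# Sparse adjacency matrix: entry (i, j) stores the distance from point i to one of its 5 nearest neighbours.
knn_graph = kneighbors_graph(X, n_neighbors=5, mode='distance', include_self=False)

print("Graph shape:", knn_graph.shape)        # (50, 50)
print("Edges stored:", knn_graph.nnz)         # 50 * 5 = 250 directed edges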
PROBABILISTIC HIERARCHICAL CLUSTERING (Agglomerative)
1. Start with each data point as its own cluster.
2. Merge Clusters:
o Compute the similarity between clusters using a probabilistic measure (e.g., likelihood or divergence).
o Merge the two most similar clusters into a new cluster.
3. Repeat:
o Repeat the merge step until all data points are in a single cluster or a stopping
condition is met.
4. Hierarchy:
o The result is a hierarchy of clusters represented as a dendrogram.
The divisive variant works top-down:
1. Start with all data points in a single cluster.
2. Split Clusters:
o Use a probabilistic criterion (e.g., likelihood or divergence) to split the cluster into
smaller clusters.
o Update the probability distributions of the new clusters.
3. Repeat:
o Repeat the split step until each data point is in its own cluster or a stopping
condition is met.
4. Hierarchy:
o The result is a hierarchy of clusters represented as a dendrogram.
Probabilistic Measures
1. Likelihood-Based Similarity
Measures how well a cluster explains the data points.
Formula:
Steps in DBSCAN
1. Initialization:
o Start with an arbitrary unvisited point.
2. Core Point Check:
o If the point has at least MinPts neighbours within radius ε, it is a core point and a new cluster is started; otherwise it is provisionally marked as noise.
3. Cluster Expansion:
o Add all points that are density-reachable from the core point to its cluster; border points are assigned to the cluster of a nearby core point.
4. Repeat:
o Repeat the process for all unvisited points in the dataset.
5. Termination:
o All points are either assigned to clusters or marked as noise.
Advantages of DBSCAN
1. Arbitrary Cluster Shapes: Can identify clusters of any shape.
2. Noise Handling: Explicitly identifies and handles noise and outliers.
3. No Predefined Number of Clusters: Does not require the number of clusters k to be specified in advance.
Disadvantages of DBSCAN
1. Parameter Sensitivity: Performance depends on the choice of ε and MinPts.
2. Difficulty with Varying Densities: Struggles with datasets where clusters have
significantly different densities.
3. Computational Complexity: High for large datasets due to neighborhood computations.
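A minimal DBSCAN sketch using scikit-learn is shown below; eps and min_samples play the roles of ε and MinPts above, and the two-moons data is an illustrative assumption.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)   # two crescent-shaped clusters

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                        # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points:", int(np.sum(labels == -1)))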
Steps in OPTICS
1. Initialization:
o Compute the core distance for each point.
2. Processing:
o Process each unprocessed point in turn, updating the reachability distances of its neighbours and appending the point to the cluster ordering.
3. Cluster Extraction:
o Extract clusters from the resulting reachability plot at the desired density level.
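A minimal OPTICS sketch with scikit-learn follows; the blob data, min_samples=10, and xi=0.05 are illustrative choices.

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Three blobs with different spreads, i.e., different densities.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[0.3, 0.8, 1.5], random_state=0)

optics = OPTICS(min_samples=10, xi=0.05).fit(X)

print("Cluster labels found (-1 = noise):", np.unique(optics.labels_))
# Clusters are read off the reachability plot (reachability distances in processing order).
print("First ten reachability values:", optics.reachability_[optics.ordering_][:10])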
Steps in Grid-Based Clustering
1. Grid Construction:
o Divide the data space into a finite number of cells.
2. Density Calculation:
o Calculate the density of each cell by counting the number of data points it
contains.
3. Cluster Formation:
o Merge adjacent dense cells to form clusters.
4. Cluster Refinement:
o Optionally, refine the clusters by reassigning points on the boundaries of cells.
STING
STING is a grid-based clustering technique. In STING, the spatial area is recursively divided in a hierarchical manner: each cell at a higher level is divided into a number of smaller cells at the next level, and the statistical measures of each cell are precomputed and stored, which helps answer queries as quickly as possible.
Statistical Information Grid(STING):
A STING is a grid-based clustering technique. It uses a multidimensional grid data structure that
quantifies space into a finite number of cells. Instead of focusing on data points, it focuses on the
value space surrounding the data points.
In STING, the spatial area is divided into rectangular cells and several levels of cells at different
resolution levels. High-level cells are divided into several low-level cells.
In STING, statistical information about the attributes in each cell, such as the mean, maximum, and minimum values, is precomputed and stored as statistical parameters. These statistical parameters are useful for query processing and other data analysis tasks.
The statistical parameter of higher-level cells can easily be computed from the parameters of the
lower-level cells.
How STING Works:
Step 1: Determine a layer to begin with.
Step 2: For each cell of this layer, calculate the confidence interval (or estimated range of probability) that the cell is relevant to the query.
Step 3: From the interval calculated above, label the cell as relevant or not relevant.
Step 4: If this layer is the bottom layer, go to point 6; otherwise, go to point 5.
Step 5: Go down the hierarchy structure by one level. Go to point 2 for those cells that form the relevant cells of the higher-level layer.
Step 6: If the specification of the query is met, go to point 8, otherwise go to point 7.
Step 7: Retrieve those data that fall into the relevant cells and do further processing. Return the
result that meets the requirement of the query. Go to point 9.
Step 8: Find the regions of relevant cells. Return those regions that meet the requirement of the
query. Go to point 9.
Step 9: Stop or terminate.
Advantages:
Grid-based computing is query-independent because the statistics stored in each cell are summaries of the data in that cell that do not depend on any particular query.
The grid structure facilitates parallel processing and incremental updates.
Disadvantage:
The main disadvantage of STING (Statistical Information Grid) is that, because cluster boundaries follow the grid cells, all cluster boundaries are either horizontal or vertical; no diagonal boundaries are detected.
CLIQUE ALGORITHM IN DATA MINING
CLIQUE is a density-based and grid-based subspace clustering algorithm. So let's first take a look at what grid-based and density-based clustering techniques are.
Grid-Based Clustering Technique: In Grid-Based Methods, the space of instance is
divided into a grid structure. Clustering techniques are then applied using the Cells of the
grid, instead of individual data points, as the base units.
Density-Based Clustering Technique: In Density-Based Methods, A cluster is a
maximal set of connected dense units in a subspace.
CLIQUE Algorithm:
The CLIQUE algorithm combines density-based and grid-based techniques into a subspace clustering algorithm. It finds clusters by taking a density threshold and the number of grid intervals as input parameters, and it is specially designed to handle datasets with a large number of dimensions. CLIQUE is very scalable with respect to both the number of records and the number of dimensions in the dataset because it is grid-based and uses the Apriori property effectively.
Apriori property: if an X-dimensional unit is dense, then all of its projections in (X-1)-dimensional space are also dense.
This means that dense regions in a given subspace must also appear as dense regions when projected onto a lower-dimensional subspace. Because it uses the Apriori property, CLIQUE restricts its search for higher-dimensional dense cells to the intersections of dense cells already found in lower-dimensional subspaces.
Working of CLIQUE Algorithm:
The CLIQUE algorithm first divides the data space into a grid by dividing each dimension into equal intervals called units. It then identifies dense units: a unit is dense if the number of data points it contains exceeds the density threshold.
Once the algorithm has found the dense cells along one dimension, it tries to find dense cells along two dimensions, and it continues in this way until the dense cells in all relevant subspaces are found.
After finding all dense cells, the algorithm finds the largest sets ("clusters") of connected dense cells and finally generates a minimal description of each cluster. Clusters are thus generated from all dense subspaces using the Apriori approach.
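CLIQUE itself is not available in the common Python libraries, so the sketch below only illustrates its first stage on assumed data: dividing each dimension into equal intervals, finding dense 1-D units, and forming candidate 2-D units in the Apriori style. The number of intervals (4 per dimension) and the density threshold (10 points) are assumptions.

import numpy as np
from itertools import product

rng = np.random.RandomState(0)
X = np.vstack([rng.normal([2, 2], 0.3, size=(50, 2)),
               rng.normal([7, 7], 0.3, size=(50, 2))])   # two well-separated groups

xi, tau = 4, 10                   # intervals per dimension, density threshold (points per unit)
edges = [np.linspace(X[:, d].min(), X[:, d].max(), xi + 1) for d in range(2)]
bins = np.column_stack([np.clip(np.digitize(X[:, d], edges[d]) - 1, 0, xi - 1)
                        for d in range(2)])

# Dense 1-D units in each dimension.
dense_1d = [{u for u in range(xi) if np.sum(bins[:, d] == u) >= tau} for d in range(2)]

# Apriori step: a 2-D unit is a candidate only if both of its 1-D projections are dense.
dense_2d = [(u, v) for u, v in product(dense_1d[0], dense_1d[1])
            if np.sum((bins[:, 0] == u) & (bins[:, 1] == v)) >= tau]

print("Dense 1-D units per dimension:", dense_1d)
print("Dense 2-D units:", dense_2d)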
Advantage:
CLIQUE is a subspace clustering algorithm that outperforms K-means, DBSCAN, and
Farthest First in both execution time and accuracy.
CLIQUE can find clusters of any shape and is able to find any number of clusters in any
number of dimensions, where the number is not predetermined by a parameter.
It is one of the simplest methods and gives interpretable results.
Disadvantage:
The main disadvantage of the CLIQUE algorithm is that if the grid cell size is unsuitable for the data, the density estimates become inaccurate and the correct clusters cannot be found.
EVALUATION OF CLUSTERING
Evaluation of clustering is the process of assessing the quality of the clusters generated by a
clustering algorithm. Unlike supervised learning, where evaluation metrics like accuracy or F1-
score are used, clustering evaluation is more challenging because there are no ground truth
labels. Clustering evaluation methods can be broadly categorized into internal
evaluation, external evaluation, and relative evaluation.
1. Internal Evaluation
Internal evaluation measures the quality of clusters based on the intrinsic properties of the data,
such as compactness (how close the points in a cluster are) and separation (how well-separated
the clusters are). These metrics do not require ground truth labels.
Common Internal Evaluation Metrics:
1. Silhouette Score:
o Measures how similar a point is to its own cluster compared to other clusters.
o Formula: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average distance from point i to the other points in its own cluster and b(i) is the smallest average distance from i to the points of any other cluster. Values near 1 indicate well-clustered points; values near 0 or below indicate overlapping or misassigned points.
Common Relative Evaluation Methods:
1. Elbow Method:
o Used to determine the optimal number of clusters k.
o Plot the within-cluster sum of squares (WCSS) against the number of clusters.
o The "elbow" point (where the rate of decrease sharply changes) is chosen as the optimal k.
2. Gap Statistic:
o Compares the WCSS of the clustering result to the WCSS of a reference dataset
(e.g., uniform random data).
o The optimal k is the one that maximizes the gap statistic.
3. Stability Analysis:
o Measures the consistency of clustering results across different runs or subsets of
the data.
o Higher stability indicates more reliable clustering.
External Evaluation: a common external metric is Normalized Mutual Information (NMI):
o Compute the mutual information between the clustering result and the ground truth labels, normalized to the range [0, 1].
o An NMI of 1.0 indicates perfect agreement between the clustering and the ground truth labels.
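The sketch below ties these evaluation ideas together with scikit-learn: the silhouette score as an internal metric, WCSS (inertia) over a range of k for the elbow method, and NMI against known labels as an external metric. The blob data and the range of k are assumptions.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, normalized_mutual_info_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Silhouette score:", round(silhouette_score(X, km.labels_), 3))
print("NMI vs. ground truth:", round(normalized_mutual_info_score(y_true, km.labels_), 3))

# Elbow method: inspect (or plot) the within-cluster sum of squares for a range of k values.
for k in range(2, 7):
    wcss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print("k =", k, " WCSS =", round(wcss, 1))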