DWDS Unit 6: Cluster Analysis

What Is Cluster Analysis?

Definition

● Cluster analysis is a technique used to group a set of objects into clusters,
where objects within a cluster are more similar to each other than to
objects in other clusters.
● It is unsupervised learning, meaning it finds hidden patterns or groupings
in data without predefined labels.

Purpose

● To discover the inherent grouping or structure in a dataset.
● Useful for exploratory data analysis, pattern recognition, and data
preprocessing.

Applications

1. Market Segmentation: Identifying customer groups with similar purchasing
behavior.
2. Image Processing: Grouping similar pixels or regions in an image.
3. Social Network Analysis: Detecting communities within a network.
4. Biology: Classifying species based on genetic traits.
5. Document Clustering: Grouping articles or web pages with similar topics.

Characteristics

● Similarity/Dissimilarity: The core idea is to measure how alike (or
different) objects are, typically using distance measures such as the
following (a small worked example appears after this list):
○ Euclidean distance: For numerical data.
○ Jaccard similarity: For binary or categorical data.
● Output: A set of clusters, each represented by a centroid or a structure
like a tree in hierarchical clustering.
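
As a quick illustration of the two measures named above, here is a minimal Python sketch (assuming NumPy is installed; the sample vectors are made up for illustration) that computes the Euclidean distance between two numeric objects and the Jaccard similarity between two binary objects.

```python
import numpy as np

# Two numeric objects (e.g., [age, income in $1000s])
x = np.array([25.0, 40.0])
y = np.array([30.0, 55.0])

# Euclidean distance: square root of the sum of squared coordinate differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Two binary objects (e.g., presence/absence of 5 product categories)
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])

# Jaccard similarity: |intersection| / |union| of the attributes set to 1
intersection = np.sum((a == 1) & (b == 1))
union = np.sum((a == 1) | (b == 1))
jaccard = intersection / union

print(f"Euclidean distance: {euclidean:.2f}")   # 15.81
print(f"Jaccard similarity: {jaccard:.2f}")     # 0.50
```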

Key Considerations

1. No predefined labels: Unlike classification, cluster analysis does not
require labeled training data.
2. Number of clusters: In most methods (like k-Means), the number of
clusters must be specified beforehand.
3. Cluster validity: Assessing the quality of clusters is crucial. Common
measures include:
○ Compactness: Objects within a cluster are close to each other.
○ Separation: Clusters are distinct and well-separated.

Challenges

● High-dimensional data: It becomes difficult to measure similarity
effectively as dimensions increase.
● Mixed data types: Handling both numerical and categorical data in the
same analysis.
● Scalability: Ensuring the clustering algorithm works efficiently with large
datasets.

Common Techniques
● Partitioning Methods: Divides the data into non-overlapping subsets.
● Hierarchical Methods: Creates a tree-like structure of nested clusters.
● Density-Based Methods: Groups dense regions and separates sparse
regions as noise.
● Grid-Based Methods: Divides the data space into a grid structure.

Advantages

● Helps in simplifying and understanding complex data.
● Can reveal hidden patterns and insights.

Limitations

● Sensitive to noise and outliers.
● Requires careful parameter selection, such as the number of clusters or
similarity threshold.

Types of Data in Cluster Analysis

Clustering algorithms can work with various types of data. Understanding these
types helps in selecting appropriate clustering techniques and distance
measures.
1. Interval-Scaled Variables

● Definition: These are numeric variables where the difference between
values is meaningful. Examples include age, temperature, or income.
● Properties:
○ Measured on a continuous scale.
○ The difference between values is interpretable, but ratios are not
(e.g., 20°C is not "twice as hot" as 10°C).

Key Features of Interval-Scaled Variables:

1. Ordered Measurements:
○ The values are arranged in a specific order. For example, 30°C is hotter than
20°C.
2. Equal Intervals:
○ The difference between any two consecutive points on the scale is the same
throughout. For instance, the difference between 20°C and 30°C is the same as
between 30°C and 40°C.
3. No True Zero:
○ The scale's zero point is arbitrary and does not mean "nothing" or "absence." For
example, 0°C does not mean there is no temperature; it is simply a point on the
scale.
4. Arithmetic Operations:
○ Addition and subtraction are meaningful, but multiplication and division are not.
For example, while you can calculate the difference between two temperatures
(e.g., 40°C - 20°C = 20°C), you cannot say 40°C is "twice as hot" as 20°C.

Examples of Interval-Scaled Variables:

1. Temperature (in Celsius or Fahrenheit):
○ The difference between 20°C and 30°C (10°C) represents the same thermal
difference as between 50°C and 60°C.
○ However, 0°C does not mean the absence of temperature, as it is a reference
point based on a freezing/melting threshold.
2. Calendar Years:
○ The difference between 2000 and 1900 is 100 years, the same as between 1800
and 1700.
○ However, the year 0 AD does not signify the absence of time.
3. IQ Scores:
○ A difference of 10 points (e.g., between 100 and 110) reflects the same
intellectual difference as between 120 and 130.
○ But an IQ of 0 does not mean a complete absence of intelligence—it is a
theoretical lower limit.

2. Binary Variables

● Definition: Binary variables have two states, such as "Yes/No," "Male/Female," or
"True/False."
● Types:
1. Symmetric Binary: Both outcomes (e.g., 0 and 1) are equally important.
2. Asymmetric Binary: One outcome is more important, like the presence (1) or
absence (0) of a disease.

3. Categorical, Ordinal, and Ratio-Scaled Variables

a. Categorical Variables

● Definition: These represent categories or labels without any inherent order, such as
"red," "blue," "green," or "dog," "cat," "bird."

b. Ordinal Variables

● Definition: These variables have a meaningful order but not a meaningful difference
between consecutive values. Examples include "low," "medium," "high" or rankings like
1st, 2nd, 3rd.
● Preprocessing:
○ Convert ordinal data into ranks or scale them to a uniform range (e.g., 1 to 10);
a small sketch follows this subsection.
● Distance Measure:
○ Treat ranks as interval-scaled variables after conversion.
● Example: Clustering hotels based on guest satisfaction levels ("poor," "average,"
"excellent").
c. Ratio-Scaled Variables

● Definition: Similar to interval-scaled variables, but ratios are meaningful. These have an
absolute zero point. Examples include weight, height, or salary.
● Properties:
○ Zero means "none" of the quantity.
○ Both differences and ratios are meaningful (e.g., a person earning $50,000 earns
twice as much as someone earning $25,000).
● Distance Measure:
○ Euclidean distance or other numerical measures.
● Example: Clustering employees by their yearly salaries and years of experience.
4. Variables of Mixed Types

● Definition: Real-world datasets often contain several of the above variable types at
once (e.g., numeric, binary, ordinal, and categorical attributes). Clustering such data
requires a combined dissimilarity measure that weights each variable's contribution
according to its type, so that all attributes can be compared on a common scale.

A Categorization of Major Clustering Methods

Clustering methods are categorized based on their approach to grouping data. Each method
has its strengths and weaknesses, making it suitable for specific types of data and clustering
tasks.

1. Partitioning Methods
● Definition: These methods divide the dataset into k clusters, where each cluster
contains at least one object, and all clusters are non-overlapping.
● Approach:
○ Start with an initial partition of the data into k groups.
○ Iteratively refine the clusters by moving objects between groups to optimize a
criterion (e.g., minimizing intra-cluster distances or maximizing inter-cluster
distances).
● Common Algorithms:
○ k-Means: Uses the mean (centroid) of points in a cluster as its center. Updates
cluster assignments to minimize the sum of squared distances.
○ k-Medoids: Similar to k-Means but uses actual data points (medoids) as cluster
centers to handle noise and outliers better.
● Example: Grouping customers based on their purchase behavior.
● Strengths:
○ Simple and efficient for large datasets.
● Limitations:
○ Requires the user to specify k in advance.
○ Sensitive to initial cluster assignments.

2. Hierarchical Methods

● Definition: These methods create a tree-like structure (dendrogram) of nested clusters,
where clusters are formed step-by-step.
● Approach:
○ Agglomerative (Bottom-Up):
■ Start with each object as its own cluster.
■ Gradually merge clusters until only one cluster remains.
○ Divisive (Top-Down):
■ Start with all objects in one cluster.
■ Recursively split clusters until each object is in its own cluster.
● Common Algorithms:
○ BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies):
Efficiently handles large datasets by constructing a compact summary tree.
○ ROCK (Robust Clustering using Links): Designed for categorical data.
○ Chameleon: Dynamically adjusts the similarity measure during clustering.
● Example: Hierarchical grouping of species based on genetic traits.
● Strengths:
○ Does not require the number of clusters in advance.
○ Produces a hierarchy for better interpretation.
● Limitations:
○ Computationally expensive for large datasets.
3. Density-Based Methods

● Definition: Clusters are formed based on dense regions of data separated by
low-density regions.
● Approach:
○ Define clusters as areas with a high density of points.
○ Points in sparse regions are considered noise or outliers.
● Common Algorithms:
○ DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
■ Forms clusters by identifying core points with a minimum number of
neighbors within a radius (ε).
■ Can detect arbitrary-shaped clusters and noise.
○ OPTICS (Ordering Points to Identify the Clustering Structure):
■ Similar to DBSCAN but provides a more detailed cluster structure.
○ DENCLUE (Density-Based Clustering Using Density Functions):
■ Uses mathematical density functions to group data.
● Example: Clustering geographic data to identify urban regions.
● Strengths:
○ Detects clusters of arbitrary shape.
○ Handles noise effectively.
● Limitations:
○ Sensitive to parameter selection (e.g., ε, minimum points).

4. Grid-Based Methods

● Definition: These methods partition the data space into a finite number of cells (grid
structure) and form clusters from dense cells.
● Approach:
○ The data space is divided into a grid of cells.
○ Dense cells (with a significant number of data points) are merged to form
clusters.
● Common Algorithms:
○ STING (Statistical Information Grid):
■ Divides the space into hierarchical grids and analyzes statistical
information within each grid.
○ WaveCluster:
■ Applies wavelet transformation to find clusters in the transformed data.
● Example: Clustering sensor network data for monitoring.
● Strengths:
○ Computationally efficient.
○ Suitable for spatial data.
● Limitations:
○ May lose accuracy if the grid size is not chosen properly.
Partitioning Methods in Clustering

Partitioning methods are clustering approaches that divide a dataset into k non-overlapping
clusters, where each object belongs to exactly one cluster. These methods iteratively optimize
cluster assignments to minimize a specific criterion, such as intra-cluster variance or distance to
the cluster center.

1. Classical Partitioning Methods

a. k-Means

● Definition: A simple and widely used partitioning algorithm that minimizes the sum of
squared distances between data points and the centroids of their clusters.
● How It Works:
1. Initialization: Choose k random points as initial centroids.
2. Assignment: Assign each data point to the nearest centroid based on Euclidean
distance.
3. Update: Recalculate the centroids as the mean of all points assigned to each
cluster.
4. Repeat: Steps 2 and 3 until the centroids stabilize (i.e., no change in cluster
assignments) or a maximum number of iterations is reached.
● Distance Measure:
○ Euclidean distance between a point x = (x1, ..., xn) and a centroid c = (c1, ..., cn):
d(x, c) = sqrt((x1 − c1)² + ... + (xn − cn)²).
○ k-Means minimizes the total sum of squared distances (SSE) of all points to their
assigned centroids. A worked sketch of the full algorithm appears at the end of
this subsection.

● Example:
○ Grouping customers based on purchase behavior (e.g., annual spending and
frequency of purchases).
● Strengths:
○ Simple and fast for large datasets.
○ Works well for spherical-shaped clusters.
● Limitations:
○ Requires the number of clusters k to be specified beforehand.
○ Sensitive to initial centroid selection and outliers.
○ Assumes clusters are convex and roughly equal in size.
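
The steps above can be expressed directly in a few lines of NumPy. This is a minimal sketch (the random toy data, seed, and k = 3 are arbitrary choices for illustration), not a production implementation; libraries such as scikit-learn provide a tuned KMeans class.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: each point goes to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its assigned points
        #    (empty clusters are not handled; this is only a sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Repeat steps 2 and 3 until the centroids stabilize
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy example: 2-D points (e.g., annual spending vs. purchase frequency)
X = np.vstack([np.random.randn(50, 2) + [0, 0],
               np.random.randn(50, 2) + [5, 5],
               np.random.randn(50, 2) + [0, 5]])
labels, centroids = kmeans(X, k=3)
print(centroids)
```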

b. k-Medoids

● Definition: Similar to k-Means but uses actual data points (medoids) as cluster centers,
making it more robust to noise and outliers.
● How It Works:
1. Initialization: Choose k random data points as medoids.
2. Assignment: Assign each data point to the nearest medoid based on a distance
measure.
3. Update: Replace a medoid with a non-medoid point if it reduces the overall cost
(sum of distances within the cluster); the swap-cost check is sketched at the
end of this subsection.
4. Repeat: Steps 2 and 3 until no changes in medoids occur.
● Distance Measure:
○ Uses the same distance measures as k-Means but does not require calculating
the mean, as medoids are existing data points.
● Example:
○ Clustering geographic locations to find central points (e.g., clustering city
locations to identify central hubs).
● Strengths:
○ Robust to noise and outliers because it uses actual data points as cluster
centers.
○ Suitable for non-spherical clusters.
● Limitations:
○ Slower than k-Means for large datasets due to higher computational cost in
selecting medoids.
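
The swap-based update can be sketched as follows. This is a minimal illustration of the cost comparison only (the toy data, medoid indices, and candidate point are arbitrary), not a complete PAM implementation.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    medoids = X[medoid_idx]
    dists = np.linalg.norm(X[:, None, :] - medoids[None, :, :], axis=2)
    return dists.min(axis=1).sum()

X = np.random.rand(100, 2)          # toy 2-D data
medoids = [3, 40, 77]               # current medoid indices (arbitrary)
candidate = 55                      # non-medoid point to try as a replacement

current = total_cost(X, medoids)
proposal = medoids.copy()
proposal[0] = candidate             # swap one medoid for the candidate

# Accept the swap only if it lowers the overall cost
if total_cost(X, proposal) < current:
    medoids = proposal
```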

2. Partitioning Methods in Large Databases: From k-Medoids to CLARANS

As datasets grow, traditional methods like k-Medoids become computationally expensive.
Advanced techniques like CLARANS (Clustering Large Applications based on Randomized
Search) are designed for efficiency with large databases.
CLARANS (Clustering Large Applications Randomized Search)

● Definition: An optimized version of k-Medoids that uses a randomized search approach
to handle large datasets efficiently.
● How It Works:
○ Initialization: Start with an initial set of k medoids chosen randomly.
○ Local Search:
■ Randomly select a medoid and replace it with a non-medoid point.
■ Evaluate the total cost (sum of distances within clusters) for the new set
of medoids.
■ Accept the new set if the cost improves; otherwise, keep the original set.
○ Global Optimization:
■ Repeat the local search for multiple iterations.
■ Track the best medoid set with the minimum cost over all iterations.
○ Output: The medoid set with the lowest cost is chosen as the final clustering.
● Advantages over k-Medoids:
○ Avoids exhaustive pairwise comparisons by randomizing the medoid selection
process.
○ Reduces computational complexity, making it scalable for large databases.
● Example:
○ Clustering large customer transaction datasets in retail.
● Strengths:
○ Suitable for large databases.
○ Balances computational efficiency with clustering quality.
● Limitations:
○ Requires careful tuning of parameters (e.g., the number of iterations for local and
global searches).

Hierarchical Clustering Methods

Hierarchical clustering builds a tree-like structure (called a dendrogram) to
represent the nested grouping of objects in a dataset. It does not require
the number of clusters to be specified in advance and produces a hierarchy
of clusters that can be analyzed at different levels of granularity.

There are two main approaches: Agglomerative and Divisive clustering.

1. Agglomerative and Divisive Hierarchical Clustering

a. Agglomerative Hierarchical Clustering

● Definition:
A "bottom-up" approach where each object starts in its
own cluster, and clusters are iteratively merged until all objects
belong to a single cluster or a stopping condition is met.
● Steps:
1. Initialize each object as a cluster.
2. Compute a proximity matrix to measure distances between
clusters.
3. Merge the two closest clusters.
4. Update the proximity matrix to reflect the new cluster.
5. Repeat steps 3 and 4 until only one cluster remains or a
desired number of clusters is achieved (a SciPy-based sketch
follows this subsection).
● Distance Measures:
○ Single Linkage: Distance between the closest points in
two clusters.
○ Complete Linkage: Distance between the farthest points
in two clusters.
○ Average Linkage: Average distance between all points in
two clusters.
● Strengths:
○ Produces a comprehensive hierarchy of clusters.
○ Does not require specifying the number of clusters in
advance.
● Limitations:
○ Computationally expensive for large datasets (O(n³)).
○ Sensitive to noise and outliers.
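
A short sketch of agglomerative clustering using SciPy (assuming SciPy is installed; the random data and the choice of average linkage are illustrative).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.rand(20, 2)                 # 20 toy points in 2-D

# Build the merge hierarchy with average linkage
# (alternatives: 'single' or 'complete', matching the measures above)
Z = linkage(X, method='average', metric='euclidean')

# Cut the dendrogram to obtain a desired number of clusters (here 3)
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)

# dendrogram(Z) would plot the full hierarchy (requires matplotlib)
```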

b. Divisive Hierarchical Clustering

● Definition: A "top-down" approach where all objects start in a single
cluster, and clusters are iteratively split until each object is in its own
cluster or a stopping condition is met.
● Steps:
○ Start with all objects in one cluster.
○ Split the cluster into smaller clusters based on a criterion
(e.g., maximizing dissimilarity between clusters).
○ Repeat until a desired number of clusters is achieved.
● Strengths:
○ Can produce clusters by identifying large-scale structures
first.
● Limitations:
○ Computationally expensive.
○ Requires a splitting criterion, which can be complex to
define.

2. BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies

● Definition: BIRCH is an efficient hierarchical clustering method
designed for very large datasets. It incrementally builds a
clustering feature tree (CF tree) and refines the clustering in
multiple phases.
● Key Concepts:
○ Clustering Feature (CF): A compact summary of a cluster,
including the number of points (N), the linear sum of
points (LS), and the square sum of points (SS).
■ CF = (N, LS, SS)
○ CF Tree: A tree structure that stores clustering features,
with a fixed number of entries at each node.
● Steps:
○ Building the CF Tree:
■ Insert data points into the tree.
■ Each node summarizes the points it contains.
■ If a node exceeds its threshold, it is split.
○ Refinement:
■ Reorganize the CF tree and optionally apply a
clustering algorithm like k-Means on the leaf entries
(a usage sketch follows this subsection).
● Advantages:
○ Handles large datasets efficiently.
○ Works well for numeric data.
○ Incremental and dynamic clustering.
● Limitations:
○ Sensitive to the order of data input.
○ Less effective for categorical data or non-spherical clusters.
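
A brief usage sketch with scikit-learn's Birch implementation (assuming scikit-learn is installed; the threshold, branching factor, and toy data are arbitrary illustrative values).

```python
import numpy as np
from sklearn.cluster import Birch

X = np.random.rand(10000, 2)   # a reasonably large toy dataset

# threshold limits the radius of each CF subcluster;
# n_clusters applies a final global clustering over the CF-tree leaf entries
model = Birch(threshold=0.05, branching_factor=50, n_clusters=5)
labels = model.fit_predict(X)

print(len(model.subcluster_centers_), "leaf subclusters summarized in the CF tree")
```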

3. ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes

● Definition: ROCK (Robust Clustering using Links) is a
hierarchical clustering algorithm specifically designed for
categorical data.
● Key Idea:
○ Instead of using traditional distance measures, ROCK uses
the concept of links between data points. Links represent
shared neighbors between two points.
● Advantages:
○ Handles categorical data efficiently.
○ Identifies clusters with non-spherical shapes.
● Limitations:
○ Computationally intensive for large datasets.
○ Requires defining the number of clusters in advance.

4. Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling

● Definition: Chameleon is a hierarchical clustering algorithm that
adapts dynamically to the internal structure of the data by using
a two-phase approach to merge clusters.
● Key Features:
○ Uses two properties for merging:
■ Relative Interconnectivity: How strongly two clusters
are connected to each other, relative to the internal
connectivity within each cluster.
■ Relative Closeness: How close the two clusters are to
each other, relative to the closeness of points within
each cluster.
● Steps:
○ Initial Partitioning:
■ Use a graph-based approach to partition the data into
many small clusters.
○ Cluster Merging:
■ Iteratively merge clusters based on interconnectivity
and similarity metrics.
■ Ensure that the merged cluster maintains the internal
characteristics of the smaller clusters.
● Advantages:
○ Handles clusters of varying shapes, densities, and sizes.
○ Effectively balances local and global clustering criteria.
● Limitations:
○ Requires advanced preprocessing (e.g., graph
construction).
○ Computationally expensive for very large datasets.

Density-Based Methods in Clustering

Density-based clustering methods group data points into clusters
based on regions of high density, separated by regions of low density.
These methods are effective for identifying clusters of arbitrary shapes
and handling noise (outliers).

1. DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Overview

● DBSCAN forms clusters based on the density of points in a region.
● A cluster is defined as a maximal set of density-connected
points.
Key Concepts

1. Epsilon (ε): The radius of a neighborhood around a point.
2. MinPts: The minimum number of points required to form a dense
region.
3. Core Point: A point with at least MinPts points within its
ε-neighborhood.
4. Border Point: A point that is not a core point but lies within the
ε-neighborhood of a core point.
5. Noise Point: A point that is neither a core point nor a border
point.
Algorithm Steps

1. Select an unvisited point as a candidate for a new cluster.
2. Check if it is a core point:
○ If yes, expand a cluster by including all points within its
ε-neighborhood.
○ Recursively include all density-reachable points.
○ Mark all visited points.
3. If it is not a core point, label it as noise (temporarily).
4. Repeat until all points are visited (a minimal usage sketch follows this list).
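
The whole procedure is available off the shelf; here is a minimal sketch using scikit-learn's DBSCAN (the eps and min_samples values below are illustrative and should be tuned to the data).

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D data: two dense blobs plus a few scattered points
X = np.vstack([np.random.randn(100, 2) * 0.3,
               np.random.randn(100, 2) * 0.3 + [4, 4],
               np.random.uniform(-2, 6, size=(10, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points labelled -1 are noise; the rest are cluster ids 0, 1, ...
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", np.sum(labels == -1))
```
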
Advantages
● Handles clusters of arbitrary shapes.
● Robust to noise (outliers).
● Does not require specifying the number of clusters beforehand.
Limitations

● Sensitive to the choice of ε and MinPts.


● Struggles with datasets with varying densities.
Example

● Clustering geographic data where clusters represent cities, and
noise points represent isolated locations.

2. OPTICS: Ordering Points to Identify the Clustering Structure

Overview

● OPTICS extends DBSCAN by addressing its sensitivity to the ε parameter.
● It creates an ordering of points based on their density
reachability, which can reveal the hierarchical structure of
clusters.
Key Concepts

1. Reachability Distance:
○ Measures how easily a point can be reached from another
point, considering ε and MinPts.
2. Core Distance:
○ The minimum distance needed to include MinPts points in
a cluster.
Algorithm Steps

1. Start with an unvisited point.
2. Compute its core distance.
3. Expand the cluster by adding points in increasing order of
reachability distance.
4. Repeat for all points to produce an ordered list of reachability
distances.
Visualization

● The output is a reachability plot, where valleys in the plot
correspond to clusters, and peaks indicate noise or separation
between clusters (the sketch below produces such a plot).
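
A short sketch using scikit-learn's OPTICS to produce the reachability ordering described above (min_samples and the toy data are illustrative; the plot assumes matplotlib is installed).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

# Toy data with two densities: a tight blob and a looser one
X = np.vstack([np.random.randn(150, 2) * 0.2,
               np.random.randn(150, 2) * 0.8 + [5, 5]])

optics = OPTICS(min_samples=10).fit(X)

# Reachability distances in cluster order: valleys = clusters, peaks = separation
reachability = optics.reachability_[optics.ordering_]
plt.plot(reachability)
plt.xlabel("Points (cluster order)")
plt.ylabel("Reachability distance")
plt.show()
```
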
Advantages

● Handles datasets with varying densities by dynamically adjusting
the cluster structure.
● Does not require a fixed ε value.
Limitations

● Computationally expensive compared to DBSCAN.
● Requires interpretation of the reachability plot.
Example

● Clustering customer locations with varying density patterns (e.g.,
urban vs. rural areas).

3. DENCLUE: Clustering Based on Density Distribution Functions

Overview

● DENCLUE models the data distribution using mathematical
density functions and identifies clusters as regions of high
density.
Algorithm Steps

1. Apply a kernel density function to estimate the density
distribution.
2. Identify attractor points by finding local maxima in the density
function.
3. Assign points to clusters based on their convergence to the
nearest attractor point.
Advantages

● Can handle clusters of arbitrary shapes and sizes.
● Provides a theoretical foundation through density estimation.
Limitations

● Sensitive to the choice of kernel function and bandwidth.
● Computationally expensive for large datasets.
Example

● Clustering biological data (e.g., gene expression) where the
density of data points varies significantly.

Grid-Based Clustering Methods

Grid-based clustering methods divide the data space into a finite
number of non-overlapping grid cells. These methods summarize data
into grids and focus computations on the grid structure, rather than
individual data points, making them computationally efficient and
scalable for large datasets.

1. STING: Statistical Information Grid

Overview

● STING (Statistical Information Grid) is a hierarchical grid-based
clustering method.
● It organizes the data space into rectangular cells at multiple
resolutions, storing statistical summaries in each cell.
● Instead of analyzing individual data points, it uses these
summaries to identify clusters.

Key Concepts

1. Hierarchical Grids:
○ The data space is partitioned into cells at different levels of
granularity.
○ At the highest level, there are fewer, larger cells covering
the entire data space.
○ Each cell is subdivided into smaller cells at lower levels.
2. Statistical Summaries:
○ Each cell contains statistical properties of the data points it
covers, such as:
■ Mean
■ Standard deviation
■ Minimum and maximum values
■ Number of data points
○ These summaries help evaluate density and significance.

Algorithm Steps

1. Grid Construction:
○ Divide the data space into hierarchical grids.
○ Compute statistical summaries for each cell.
2. Query Execution:
○ Start from the topmost level of the hierarchy.
○ Select cells that meet a density or significance threshold.
○ Drill down into finer-resolution cells only in selected
regions.
3. Cluster Formation:
○ Adjacent cells meeting the criteria are merged to form
clusters.
○ Cells with low density are considered noise.

Advantages
● Efficiency: Reduces computation by using precomputed
summaries.
● Scalability: Handles large datasets efficiently.
● Simplicity: Works well for numerical data and spatial
applications.

Limitations

● Grid Resolution: The quality of clusters depends on the
resolution of the grid. A grid that is too coarse may miss
details, while one that is too fine increases computation.
● Cluster Shape: Limited to detecting grid-aligned clusters.
Example Use Case

● Geospatial Data Analysis: Grouping regions of a city based on
population density or income levels.

2. WaveCluster: Clustering Using Wavelet Transformation

Overview

● WaveCluster uses wavelet transformation to reduce the
dimensionality of data and identify dense regions that form
clusters.
● Wavelet transformation is a mathematical process that
compresses and smoothens data, making it easier to detect
clusters and eliminate noise.

Key Concepts

1. Wavelet Transformation:
○ A process that transforms the data space into a grid,
applying mathematical filters to highlight dense areas and
suppress noise.
○ Dense regions become more prominent, while sparse
regions and noise are diminished.
2. Feature Space Representation:
○ After the wavelet transformation, the data is represented as
a compact grid with transformed values that are easier to
analyze.
3. Cluster Identification:
○ Clusters are formed by identifying connected dense regions
in the transformed grid.

Algorithm Steps
1. Grid Construction:
○ Divide the data space into a grid with cells.
2. Wavelet Transformation:
○ Apply wavelet filters to the grid to highlight dense regions
and reduce noise.
○ This transformation creates a hierarchical representation of
the data.
3. Cluster Detection:
○ Identify clusters by locating connected regions in the
transformed grid with high density values.
○ Sparse regions are treated as noise.

Advantages

● Noise Handling: Effectively reduces noise through wavelet
transformation.
● Cluster Shapes: Capable of detecting clusters of arbitrary
shapes.
● Efficiency: Reduces the size of the data, making clustering
faster.

Limitations

● Parameter Sensitivity: Results depend on the choice of
wavelet filters and grid resolution.
● Complexity: Requires knowledge of wavelet transformation
techniques.
Example Use Case

● Astronomical Data Clustering: Grouping celestial objects based on
their positions, where clusters may have varying densities and
irregular shapes.

Evaluation of Clustering Solutions

Evaluating clustering solutions is crucial to assess the quality and effectiveness
of the clusters produced by a clustering algorithm. Unlike supervised learning,
clustering lacks ground truth labels, making the evaluation inherently challenging.
The evaluation methods can be categorized into internal, external, and relative
measures.

1. Internal Evaluation

Internal evaluation methods assess the clustering quality based on the inherent properties of the
data and the resulting clusters without reference to external information.

Criteria for Good Clustering

● Compactness: Data points within a cluster should be as close to each other as possible
(low intra-cluster distance).
● Separation: Clusters should be well-separated from one another (high inter-cluster
distance).

Common Internal Metrics

1. Sum of Squared Errors (SSE):
● Sum of squared distances between each point and the centroid of its cluster.
● Lower values indicate more compact clusters.
2. Silhouette Coefficient:
● Compares how close each point is to its own cluster versus the nearest other
cluster; values range from −1 to 1, and higher values indicate better clustering.
3. Dunn Index:
● Ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.
● Higher values indicate better clustering (a small computation sketch follows this list).
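
A minimal sketch of the Dunn Index computation (the labelled toy data is arbitrary, and the brute-force distance computation is only suitable for small datasets).

```python
import numpy as np

def dunn_index(X, labels):
    """Minimum inter-cluster distance divided by maximum intra-cluster distance."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Maximum intra-cluster distance (largest cluster "diameter")
    max_intra = max(
        np.linalg.norm(c[:, None] - c[None, :], axis=2).max() for c in clusters
    )
    # Minimum distance between points belonging to different clusters
    min_inter = min(
        np.linalg.norm(a[:, None] - b[None, :], axis=2).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_inter / max_intra

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + [6, 6]])
labels = np.array([0] * 30 + [1] * 30)
print(dunn_index(X, labels))   # higher values indicate better clustering
```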

2. External Evaluation

External evaluation compares the clustering result to a predefined ground truth, which is often
not available in real-world clustering tasks.

Common External Metrics

1. Rand Index:
● Fraction of point pairs on which the clustering and the ground truth agree
(placed together in both, or separated in both).
2. Adjusted Rand Index (ARI):
● Adjusts the Rand Index for chance, providing a more reliable evaluation
(see the sketch after this list).
3. F-Measure:
● Combines precision and recall for clustering evaluation.
● Precision: Fraction of correctly grouped points within a predicted cluster.
● Recall: Fraction of ground truth points correctly identified.
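
When ground-truth labels happen to be available, recent versions of scikit-learn provide these scores directly; a brief sketch (the label vectors below are made up for illustration).

```python
from sklearn.metrics import adjusted_rand_score, rand_score

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]      # hypothetical ground truth
pred_labels = [1, 1, 0, 0, 0, 0, 2, 2, 2]      # hypothetical clustering result

print("Rand Index:         ", rand_score(true_labels, pred_labels))
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, pred_labels))
```
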
3. Relative Evaluation

Relative evaluation compares the quality of different clustering solutions produced by the same
algorithm with varying parameters (e.g., different numbers of clusters or initializations).

Common Approaches

● Elbow Method:
○ Plots SSE or distortion vs. the number of clusters k.
○ The "elbow point" indicates an optimal k where adding more clusters yields
diminishing returns in reducing SSE (see the sketch after this list).
● Silhouette Analysis:
○ Uses the average silhouette score for different values of k to determine the
optimal number of clusters.
● Gap Statistic:
○ Compares the total within-cluster variation of the data to that of random,
uniformly distributed data.
○ A large gap indicates better clustering.
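
A short sketch of the Elbow Method and Silhouette Analysis using scikit-learn (the toy data and the range of k values are illustrative).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.vstack([np.random.randn(100, 2) + offset
               for offset in ([0, 0], [6, 0], [3, 6])])   # three toy blobs

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse = km.inertia_                       # within-cluster sum of squared errors
    sil = silhouette_score(X, km.labels_)   # average silhouette for this k
    print(f"k={k}  SSE={sse:8.1f}  silhouette={sil:.3f}")
# Look for the 'elbow' in SSE and the peak in the silhouette score.
```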

4. Challenges in Evaluation

● Cluster Shape and Density:
○ Metrics like SSE and Silhouette Coefficient assume spherical clusters, which may
not be suitable for non-convex shapes.
● No Ground Truth:
○ External evaluation is often not possible in real-world scenarios.
● Parameter Sensitivity:
○ Clustering results depend on parameters like the number of clusters k, which
must be chosen carefully.
