DWDS Unit 6 Cluster Analysis (1)
Definition
Purpose
Applications
Characteristics
Key Considerations
Challenges
Common Techniques
● Partitioning Methods: Divide the data into non-overlapping subsets.
● Hierarchical Methods: Create a tree-like structure of nested clusters.
● Density-Based Methods: Group dense regions and separate sparse regions as noise.
● Grid-Based Methods: Divide the data space into a grid structure.
Advantages
Limitations
Clustering algorithms can work with various types of data. Understanding these
types helps in selecting appropriate clustering techniques and distance
measures.
1. Interval-Scaled Variables
1. Ordered Measurements:
○ The values are arranged in a specific order. For example, 30°C is hotter than
20°C.
2. Equal Intervals:
○ The difference between any two consecutive points on the scale is the same
throughout. For instance, the difference between 20°C and 30°C is the same as
between 30°C and 40°C.
3. No True Zero:
○ The scale's zero point is arbitrary and does not mean "nothing" or "absence." For
example, 0°C does not mean there is no temperature; it is simply a point on the
scale.
4. Arithmetic Operations:
○ Addition and subtraction are meaningful, but multiplication and division are not.
For example, while you can calculate the difference between two temperatures
(e.g., 40°C - 20°C = 20°C), you cannot say 40°C is "twice as hot" as 20°C.
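Illustrative sketch (Python): one common way to prepare interval-scaled variables for distance-based clustering is to standardize them before measuring distances; the data and the z-score step below are assumptions for illustration, not part of the definitions above.

import numpy as np

# Hypothetical interval-scaled measurements: temperature (°C) and a comfort index.
X = np.array([
    [20.0, 55.0],
    [30.0, 60.0],
    [40.0, 45.0],
])

# Standardize each column (z-score) so variables on different scales
# contribute comparably to the distance computation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Euclidean distance between the first two standardized objects;
# differences are meaningful for interval data, ratios are not.
d = np.linalg.norm(X_std[0] - X_std[1])
print(round(d, 3))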
2. Binary Variables
3. Categorical, Ordinal, and Ratio-Scaled Variables
a. Categorical Variables
● Definition: These represent categories or labels without any inherent order, such as
"red," "blue," "green," or "dog," "cat," "bird."
b. Ordinal Variables
● Definition: These variables have a meaningful order but not a meaningful difference
between consecutive values. Examples include "low," "medium," "high" or rankings like
1st, 2nd, 3rd.
● Preprocessing:
○ Convert ordinal data into ranks or scale them to a uniform range (e.g., 1 to 10).
● Distance Measure:
○ Treat ranks as interval-scaled variables after conversion.
● Example: Clustering hotels based on guest satisfaction levels ("poor," "average,"
"excellent").
c. Ratio-Scaled Variables
● Definition: Similar to interval-scaled variables, but ratios are meaningful. These have an
absolute zero point. Examples include weight, height, or salary.
● Properties:
○ Zero means "none" of the quantity.
○ Both differences and ratios are meaningful (e.g., a person earning $50,000 earns
twice as much as someone earning $25,000).
● Distance Measure:
○ Euclidean distance or other numerical measures.
● Example: Clustering employees by their yearly salaries and years of experience.
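Illustrative sketch (Python) of the employee example: Euclidean distance on [salary, years of experience]. Because the two variables have very different magnitudes, the sketch also rescales them first (an assumed, but common, extra step).

import numpy as np

# Hypothetical employees: [yearly salary in $, years of experience].
employees = np.array([
    [50000.0, 5.0],
    [25000.0, 2.0],
    [52000.0, 6.0],
])

# Raw Euclidean distance is dominated by salary because of its scale.
raw = np.linalg.norm(employees[0] - employees[1])

# Min-max scaling puts both ratio-scaled variables on [0, 1] first.
mins, maxs = employees.min(axis=0), employees.max(axis=0)
scaled = (employees - mins) / (maxs - mins)
balanced = np.linalg.norm(scaled[0] - scaled[1])

print(round(raw, 1), round(balanced, 3))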
4. Variables of Mixed Types
Clustering methods are categorized based on their approach to grouping data. Each method
has its strengths and weaknesses, making it suitable for specific types of data and clustering
tasks.
1. Partitioning Methods
● Definition: These methods divide the dataset into k clusters, where each cluster
contains at least one object, and all clusters are non-overlapping.
● Approach:
○ Start with an initial partition of the data into k groups.
○ Iteratively refine the clusters by moving objects between groups to optimize a
criterion (e.g., minimizing intra-cluster distances or maximizing inter-cluster
distances).
● Common Algorithms:
○ k-Means: Uses the mean (centroid) of points in a cluster as its center. Updates
cluster assignments to minimize the sum of squared distances.
○ k-Medoids: Similar to k-Means but uses actual data points (medoids) as cluster
centers to handle noise and outliers better.
● Example: Grouping customers based on their purchase behavior.
● Strengths:
○ Simple and efficient for large datasets.
● Limitations:
○ Requires the user to specify k in advance.
○ Sensitive to initial cluster assignments.
2. Hierarchical Methods
4. Grid-Based Methods
● Definition: These methods partition the data space into a finite number of cells (grid
structure) and form clusters from dense cells.
● Approach:
○ The data space is divided into a grid of cells.
○ Dense cells (with a significant number of data points) are merged to form
clusters.
● Common Algorithms:
○ STING (Statistical Information Grid):
■ Divides the space into hierarchical grids and analyzes statistical
information within each grid.
○ WaveCluster:
■ Applies wavelet transformation to find clusters in the transformed data.
● Example: Clustering sensor network data for monitoring.
● Strengths:
○ Computationally efficient.
○ Suitable for spatial data.
● Limitations:
○ May lose accuracy if the grid size is not chosen properly.
Partitioning Methods in Clustering
Partitioning methods are clustering approaches that divide a dataset into k non-overlapping
clusters, where each object belongs to exactly one cluster. These methods iteratively optimize
cluster assignments to minimize a specific criterion, such as intra-cluster variance or distance to
the cluster center.
a. k-Means
● Definition: A simple and widely used partitioning algorithm that minimizes the sum of
squared distances between data points and the centroids of their clusters.
● How It Works:
○ Initialization: Choose k random points as initial centroids.
○ Assignment: Assign each data point to the nearest centroid based on Euclidean
distance.
○ Update: Recalculate the centroids as the mean of all points assigned to each
cluster.
○ Repeat: Steps 2 and 3 until the centroids stabilize (i.e., no change in cluster
assignments) or a maximum number of iterations is reached.
● Distance Measure:
○ Typically Euclidean distance; the algorithm minimizes the within-cluster sum of squared Euclidean distances.
● Example:
○ Grouping customers based on purchase behavior (e.g., annual spending and
frequency of purchases).
● Strengths:
○ Simple and fast for large datasets.
○ Works well for spherical-shaped clusters.
● Limitations:
○ Requires the number of clusters k to be specified beforehand.
○ Sensitive to initial centroid selection and outliers.
○ Assumes clusters are convex and roughly equal in size.
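The four steps above translate directly into a short NumPy sketch. This is a minimal illustration (random initialization, Euclidean distance, fixed iteration cap), not a production implementation.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat until the centroids stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment with the settled centroids.
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    return labels, centroids

# Toy customer data: (annual spending, purchase frequency) with two obvious groups.
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)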
b. k-Medoids
● Definition: Similar to k-Means but uses actual data points (medoids) as cluster centers,
making it more robust to noise and outliers.
● How It Works:
○ Initialization: Choose k random data points as medoids.
○ Assignment: Assign each data point to the nearest medoid based on a distance
measure.
○ Update: Replace a medoid with a non-medoid point if it reduces the overall cost
(sum of distances within the cluster).
○ Repeat: Steps 2 and 3 until no changes in medoids occur.
● Distance Measure:
○ Uses the same distance measures as k-Means but does not require calculating
the mean, as medoids are existing data points.
● Example:
○ Clustering geographic locations to find central points (e.g., clustering city
locations to identify central hubs).
● Strengths:
○ Robust to noise and outliers because it uses actual data points as cluster
centers.
○ Suitable for non-spherical clusters.
● Limitations:
○ Slower than k-Means for large datasets due to higher computational cost in
selecting medoids.
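A minimal PAM-style sketch of the swap procedure described above, using a precomputed distance matrix. It searches swaps exhaustively and is slow by design; it is meant only to mirror the listed steps, not to be an optimized k-Medoids implementation.

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Precompute all pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Initialization: pick k random data points as medoids.
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assignment: each point joins the cluster of its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        cost = D[np.arange(n), medoids[labels]].sum()
        # Update: try swapping each medoid with each non-medoid point and
        # keep the swap that reduces the total cost the most.
        best = (cost, medoids)
        for i in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand
                trial_labels = D[:, trial].argmin(axis=1)
                trial_cost = D[np.arange(n), trial[trial_labels]].sum()
                if trial_cost < best[0]:
                    best = (trial_cost, trial)
        # Repeat until no swap lowers the cost.
        if best[0] >= cost - 1e-12:
            break
        medoids = best[1]
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids

# Same toy customer data as in the k-Means sketch.
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5]])
labels, medoids = k_medoids(X, k=2)
print(labels, medoids)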
Hierarchical Methods: Agglomerative Clustering
● Definition: A "bottom-up" approach where each object starts in its own cluster, and clusters are iteratively merged until all objects belong to a single cluster or a stopping condition is met.
● Steps:
○ Initialize each object as its own cluster.
○ Compute a proximity matrix to measure distances between clusters.
○ Merge the two closest clusters.
○ Update the proximity matrix to reflect the new cluster.
○ Repeat the merge and update steps until only one cluster remains or the desired number of clusters is reached.
● Distance Measures:
○ Single Linkage: Distance between the closest points in two clusters.
○ Complete Linkage: Distance between the farthest points in two clusters.
○ Average Linkage: Average distance between all pairs of points in two clusters.
● Strengths:
○ Produces a comprehensive hierarchy of clusters.
○ Does not require specifying the number of clusters in advance.
● Limitations:
○ Computationally expensive for large datasets (O(n³) in general).
○ Sensitive to noise and outliers.
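SciPy implements the three linkage criteria above. A minimal sketch (data and parameters are illustrative): build the linkage matrix with average linkage, then cut the hierarchy into a chosen number of clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data with two visible groups.
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5]], dtype=float)

# Agglomerative clustering: "single", "complete", or "average" correspond
# to the linkage criteria described above.
Z = linkage(X, method="average", metric="euclidean")

# Cut the hierarchy to obtain a flat partition into 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)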
Overview
1. Reachability Distance:
○ The reachability distance of a point p from a core point o is the larger of o's core distance and the actual distance dist(o, p); it therefore depends on ε and MinPts through the core distance.
2. Core Distance:
○ The smallest radius for which a point's ε-neighborhood contains at least MinPts points, i.e., the distance to its MinPts-th nearest neighbor (undefined if the point is not a core point).
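Both quantities are exposed by scikit-learn's OPTICS implementation. The sketch below is illustrative; the data and the parameter choices (min_samples as MinPts, max_eps as ε) are assumptions.

import numpy as np
from sklearn.cluster import OPTICS

# Toy data: two dense groups plus one isolated point.
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5],
              [25, 80]], dtype=float)

# min_samples plays the role of MinPts; max_eps bounds the neighborhood radius ε.
optics = OPTICS(min_samples=2, max_eps=5.0).fit(X)

print(optics.core_distances_)  # core distance of each point
print(optics.reachability_)    # reachability distances (inf where unreachable)
print(optics.labels_)          # cluster ids; -1 marks points treated as noise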
Algorithm Steps
Overview
Key Concepts
1. Hierarchical Grids:
○ The data space is partitioned into cells at different levels of
granularity.
○ At the highest level, there are fewer, larger cells covering
the entire data space.
○ Each cell is subdivided into smaller cells at lower levels.
2. Statistical Summaries:
○ Each cell contains statistical properties of the data points it
covers, such as:
■ Mean
■ Standard deviation
■ Minimum and maximum values
■ Number of data points
○ These summaries help evaluate density and significance.
Algorithm Steps
1. Grid Construction:
○ Divide the data space into hierarchical grids.
○ Compute statistical summaries for each cell.
2. Query Execution:
○ Start from the topmost level of the hierarchy.
○ Select cells that meet a density or significance threshold.
○ Drill down into finer-resolution cells only in selected
regions.
3. Cluster Formation:
○ Adjacent cells meeting the criteria are merged to form
clusters.
○ Cells with low density are considered noise.
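A simplified, single-resolution sketch of this loop in Python: bin points into a grid, keep cells above a density threshold, and merge adjacent dense cells with a connected-components pass. Grid size and threshold are illustrative assumptions (the actual STING algorithm works over a hierarchy of levels).

import numpy as np
from scipy.ndimage import label

def grid_clusters(X, bins=10, min_count=3):
    # Grid construction: histogram the 2-D points into bins x bins cells,
    # storing the count (a simple statistical summary) per cell.
    counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=bins)

    # Query/selection: keep only cells that meet the density threshold.
    dense = counts >= min_count

    # Cluster formation: merge adjacent dense cells into connected regions.
    regions, n_regions = label(dense)

    # Map each point back to its cell and read off the region id
    # (0 means the point fell in a sparse cell and is treated as noise).
    ix = np.clip(np.digitize(X[:, 0], xedges) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(X[:, 1], yedges) - 1, 0, bins - 1)
    return regions[ix, iy], n_regions

# Example call on synthetic points.
labels, n = grid_clusters(np.random.default_rng(0).normal(size=(200, 2)), bins=8, min_count=5)
print(n)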
Advantages
● Efficiency: Reduces computation by using precomputed
summaries.
● Scalability: Handles large datasets efficiently.
● Simplicity: Works well for numerical data and spatial
applications.
Limitations
Overview
Key Concepts
1. Wavelet Transformation:
○ A process that transforms the data space into a grid,
applying mathematical filters to highlight dense areas and
suppress noise.
○ Dense regions become more prominent, while sparse
regions and noise are diminished.
2. Feature Space Representation:
○ After the wavelet transformation, the data is represented as
a compact grid with transformed values that are easier to
analyze.
3. Cluster Identification:
○ Clusters are formed by identifying connected dense regions
in the transformed grid.
Algorithm Steps
1. Grid Construction:
○ Divide the data space into a grid with cells.
2. Wavelet Transformation:
○ Apply wavelet filters to the grid to highlight dense regions
and reduce noise.
○ This transformation creates a hierarchical representation of
the data.
3. Cluster Detection:
○ Identify clusters by locating connected regions in the
transformed grid with high density values.
○ Sparse regions are treated as noise.
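A heavily simplified sketch of these three steps, assuming the PyWavelets package is available: quantize points onto a grid, take the approximation band of a 2-D Haar transform to smooth the density surface, then threshold and connect dense regions. This only mirrors the idea, not the actual WaveCluster algorithm.

import numpy as np
import pywt
from scipy.ndimage import label

def wavecluster_sketch(X, bins=32, threshold=1.0):
    # 1. Grid construction: count points per cell.
    counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=bins)

    # 2. Wavelet transformation: the approximation band (cA) of a 2-D Haar
    #    transform emphasizes dense regions and damps noise.
    cA, _details = pywt.dwt2(counts, "haar")

    # 3. Cluster detection: threshold the transformed grid and merge
    #    connected dense regions; everything below threshold is noise.
    regions, n_regions = label(cA > threshold)
    return regions, n_regions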
Advantages
Limitations
1. Internal Evaluation
Internal evaluation methods assess the clustering quality based on the inherent properties of the
data and the resulting clusters without reference to external information.
● Compactness: Data points within a cluster should be as close to each other as possible
(low intra-cluster distance).
● Separation: Clusters should be well-separated from one another (high inter-cluster
distance).
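The silhouette coefficient combines exactly these two notions (compactness via the mean intra-cluster distance, separation via the distance to the nearest other cluster). A quick check with scikit-learn; the data and number of clusters are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5]], dtype=float)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette ranges from -1 to 1; values near 1 mean compact,
# well-separated clusters.
print(silhouette_score(X, labels))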
2. External Evaluation
External evaluation compares the clustering result to a predefined ground truth, which is often
not available in real-world clustering tasks.
● Adjusted Rand Index (ARI): Adjusts the Rand Index for chance, providing a more reliable evaluation.
● F-Measure: Combines the precision and recall of matching clusters to ground-truth classes into a single score.
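Where ground-truth labels are available, scikit-learn computes the chance-corrected index directly; the labels below are illustrative.

from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]   # known ground-truth classes
labels_pred = [1, 1, 0, 0, 0, 0]   # clustering output (cluster ids are arbitrary)

# ARI = 1 for a perfect match, ~0 for random labelings, and it can be negative.
print(adjusted_rand_score(labels_true, labels_pred))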
3. Relative Evaluation
Relative evaluation compares the quality of different clustering solutions produced by the same
algorithm with varying parameters (e.g., different numbers of clusters or initializations).
Common Approaches
● Elbow Method:
○ Plots SSE or distortion vs. the number of clusters k.
○ The "elbow point" indicates an optimal k where adding more clusters yields
diminishing returns in reducing SSE.
● Silhouette Analysis:
○ Uses the average silhouette score for different values of k to determine the
optimal number of clusters.
● Gap Statistic:
○ Compares the total within-cluster variation of the data to that of random,
uniformly distributed data.
○ A large gap indicates better clustering.
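A minimal sketch of the elbow method with scikit-learn's KMeans (the dataset and range of k are placeholders): compute the within-cluster SSE (inertia) for increasing k and look for the bend.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5],
              [20, 1], [21, 2], [19, 0.5]], dtype=float)

sse = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)   # sum of squared distances to the centroids

# The "elbow" is where adding another cluster stops reducing SSE much;
# for this toy data the drop flattens after k = 3.
for k, e in zip(range(1, 7), sse):
    print(k, round(e, 1))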
4. Challenges in Evaluation