DWDS Unit 6 Cluster Analysis (1)
Definition
Purpose
Applications
Characteristics
Key Considerations
Challenges
Common Techniques
● Partitioning Methods: Divide the data into non-overlapping subsets.
● Hierarchical Methods: Create a tree-like structure of nested clusters.
● Density-Based Methods: Group dense regions and separate sparse regions as noise.
● Grid-Based Methods: Divide the data space into a grid structure.
Advantages
Limitations
Clustering algorithms can work with various types of data. Understanding these
types helps in selecting appropriate clustering techniques and distance
measures.
1. Interval-Scaled Variables
1. Ordered Measurements:
○ The values are arranged in a specific order. For example, 30°C is hotter than
20°C.
2. Equal Intervals:
○ The difference between any two consecutive points on the scale is the same
throughout. For instance, the difference between 20°C and 30°C is the same as
between 30°C and 40°C.
3. No True Zero:
○ The scale's zero point is arbitrary and does not mean "nothing" or "absence." For
example, 0°C does not mean there is no temperature; it is simply a point on the
scale.
4. Arithmetic Operations:
○ Addition and subtraction are meaningful, but multiplication and division are not.
For example, while you can calculate the difference between two temperatures
(e.g., 40°C - 20°C = 20°C), you cannot say 40°C is "twice as hot" as 20°C.
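Illustrative sketch (Python): one common way to prepare interval-scaled variables for distance-based clustering is to standardize them before measuring distances; the data and the z-score step below are assumptions for illustration, not part of the definitions above.

import numpy as np

# Hypothetical interval-scaled measurements: temperature (°C) and a comfort index.
X = np.array([
    [20.0, 55.0],
    [30.0, 60.0],
    [40.0, 45.0],
])

# Standardize each column (z-score) so variables on different scales
# contribute comparably to the distance computation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Euclidean distance between the first two standardized objects;
# differences are meaningful for interval data, ratios are not.
d = np.linalg.norm(X_std[0] - X_std[1])
print(round(d, 3))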
2. Binary Variables
3. Categorical, Ordinal, and Ratio-Scaled Variables
a. Categorical Variables
● Definition: These represent categories or labels without any inherent order, such as
"red," "blue," "green," or "dog," "cat," "bird."
b. Ordinal Variables
● Definition: These variables have a meaningful order but not a meaningful difference
between consecutive values. Examples include "low," "medium," "high" or rankings like
1st, 2nd, 3rd.
● Preprocessing:
○ Convert ordinal data into ranks or scale them to a uniform range (e.g., 1 to 10).
● Distance Measure:
○ Treat ranks as interval-scaled variables after conversion.
● Example: Clustering hotels based on guest satisfaction levels ("poor," "average,"
"excellent").
c. Ratio-Scaled Variables
● Definition: Similar to interval-scaled variables, but ratios are meaningful. These have an
absolute zero point. Examples include weight, height, or salary.
● Properties:
○ Zero means "none" of the quantity.
○ Both differences and ratios are meaningful (e.g., a person earning $50,000 earns
twice as much as someone earning $25,000).
● Distance Measure:
○ Euclidean distance or other numerical measures.
● Example: Clustering employees by their yearly salaries and years of experience.
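Illustrative sketch (Python) of the employee example: Euclidean distance on [salary, years of experience]. Because the two variables have very different magnitudes, the sketch also rescales them first (an assumed, but common, extra step).

import numpy as np

# Hypothetical employees: [yearly salary in $, years of experience].
employees = np.array([
    [50000.0, 5.0],
    [25000.0, 2.0],
    [52000.0, 6.0],
])

# Raw Euclidean distance is dominated by salary because of its scale.
raw = np.linalg.norm(employees[0] - employees[1])

# Min-max scaling puts both ratio-scaled variables on [0, 1] first.
mins, maxs = employees.min(axis=0), employees.max(axis=0)
scaled = (employees - mins) / (maxs - mins)
balanced = np.linalg.norm(scaled[0] - scaled[1])

print(round(raw, 1), round(balanced, 3))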
4. Variables of Mixed Types
Clustering methods are categorized based on their approach to grouping data. Each method
has its strengths and weaknesses, making it suitable for specific types of data and clustering
tasks.
1. Partitioning Methods
● Definition: These methods divide the dataset into k clusters, where each cluster
contains at least one object, and all clusters are non-overlapping.
● Approach:
○ Start with an initial partition of the data into k groups.
○ Iteratively refine the clusters by moving objects between groups to optimize a
criterion (e.g., minimizing intra-cluster distances or maximizing inter-cluster
distances).
● Common Algorithms:
○ k-Means: Uses the mean (centroid) of points in a cluster as its center. Updates
cluster assignments to minimize the sum of squared distances.
○ k-Medoids: Similar to k-Means but uses actual data points (medoids) as cluster
centers to handle noise and outliers better.
● Example: Grouping customers based on their purchase behavior.
● Strengths:
○ Simple and efficient for large datasets.
● Limitations:
○ Requires the user to specify k in advance.
○ Sensitive to initial cluster assignments.
2. Hierarchical Methods
4. Grid-Based Methods
● Definition: These methods partition the data space into a finite number of cells (grid
structure) and form clusters from dense cells.
● Approach:
○ The data space is divided into a grid of cells.
○ Dense cells (with a significant number of data points) are merged to form
clusters.
● Common Algorithms:
○ STING (Statistical Information Grid):
■ Divides the space into hierarchical grids and analyzes statistical
information within each grid.
○ WaveCluster:
■ Applies wavelet transformation to find clusters in the transformed data.
● Example: Clustering sensor network data for monitoring.
● Strengths:
○ Computationally efficient.
○ Suitable for spatial data.
● Limitations:
○ May lose accuracy if the grid size is not chosen properly.
Partitioning Methods in Clustering
Partitioning methods are clustering approaches that divide a dataset into k non-overlapping
clusters, where each object belongs to exactly one cluster. These methods iteratively optimize
cluster assignments to minimize a specific criterion, such as intra-cluster variance or distance to
the cluster center.
a. k-Means
● Definition: A simple and widely used partitioning algorithm that minimizes the sum of
squared distances between data points and the centroids of their clusters.
● How It Works:
○ Initialization: Choose k random points as initial centroids.
○ Assignment: Assign each data point to the nearest centroid based on Euclidean
distance.
○ Update: Recalculate the centroids as the mean of all points assigned to each
cluster.
○ Repeat: Steps 2 and 3 until the centroids stabilize (i.e., no change in cluster
assignments) or a maximum number of iterations is reached.
● Distance Measure:
○ Typically Euclidean distance; the algorithm minimizes the within-cluster sum of squared Euclidean distances.
● Example:
○ Grouping customers based on purchase behavior (e.g., annual spending and
frequency of purchases).
● Strengths:
○ Simple and fast for large datasets.
○ Works well for spherical-shaped clusters.
● Limitations:
○ Requires the number of clusters k to be specified beforehand.
○ Sensitive to initial centroid selection and outliers.
○ Assumes clusters are convex and roughly equal in size.
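The four steps above translate directly into a short NumPy sketch. This is a minimal illustration (random initialization, Euclidean distance, fixed iteration cap), not a production implementation.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment: each point goes to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat until the centroids stabilize.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment with the settled centroids.
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    return labels, centroids

# Toy customer data: (annual spending, purchase frequency) with two obvious groups.
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)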
b. k-Medoids
● Definition: Similar to k-Means but uses actual data points (medoids) as cluster centers,
making it more robust to noise and outliers.
● How It Works:
○ Initialization: Choose k random data points as medoids.
○ Assignment: Assign each data point to the nearest medoid based on a distance
measure.
○ Update: Replace a medoid with a non-medoid point if it reduces the overall cost
(sum of distances within the cluster).
○ Repeat: Steps 2 and 3 until no changes in medoids occur.
● Distance Measure:
○ Uses the same distance measures as k-Means but does not require calculating
the mean, as medoids are existing data points.
● Example:
○ Clustering geographic locations to find central points (e.g., clustering city
locations to identify central hubs).
● Strengths:
○ Robust to noise and outliers because it uses actual data points as cluster
centers.
○ Suitable for non-spherical clusters.
● Limitations:
○ Slower than k-Means for large datasets due to higher computational cost in
selecting medoids.
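A minimal PAM-style sketch of the swap procedure described above, using a precomputed distance matrix. It searches swaps exhaustively and is slow by design; it is meant only to mirror the listed steps, not to be an optimized k-Medoids implementation.

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Precompute all pairwise Euclidean distances.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Initialization: pick k random data points as medoids.
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        # Assignment: each point joins the cluster of its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        cost = D[np.arange(n), medoids[labels]].sum()
        # Update: try swapping each medoid with each non-medoid point and
        # keep the swap that reduces the total cost the most.
        best = (cost, medoids)
        for i in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = cand
                trial_labels = D[:, trial].argmin(axis=1)
                trial_cost = D[np.arange(n), trial[trial_labels]].sum()
                if trial_cost < best[0]:
                    best = (trial_cost, trial)
        # Repeat until no swap lowers the cost.
        if best[0] >= cost - 1e-12:
            break
        medoids = best[1]
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids

# Same toy customer data as in the k-Means sketch.
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5]])
labels, medoids = k_medoids(X, k=2)
print(labels, medoids)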
Hierarchical Methods: Agglomerative Clustering
● Definition: A "bottom-up" approach where each object starts in its own cluster, and clusters are iteratively merged until all objects belong to a single cluster or a stopping condition is met.
● Steps:
○ Initialize each object as its own cluster.
○ Compute a proximity matrix to measure distances between clusters.
○ Merge the two closest clusters.
○ Update the proximity matrix to reflect the new cluster.
○ Repeat the merge and update steps until only one cluster remains or the desired number of clusters is reached.
● Distance Measures:
○ Single Linkage: Distance between the closest points in two clusters.
○ Complete Linkage: Distance between the farthest points in two clusters.
○ Average Linkage: Average distance between all pairs of points in two clusters.
● Strengths:
○ Produces a comprehensive hierarchy of clusters.
○ Does not require specifying the number of clusters in advance.
● Limitations:
○ Computationally expensive for large datasets (O(n³) in general).
○ Sensitive to noise and outliers.
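SciPy implements the three linkage criteria above. A minimal sketch (data and parameters are illustrative): build the linkage matrix with average linkage, then cut the hierarchy into a chosen number of clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data with two visible groups.
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5]], dtype=float)

# Agglomerative clustering: "single", "complete", or "average" correspond
# to the linkage criteria described above.
Z = linkage(X, method="average", metric="euclidean")

# Cut the hierarchy to obtain a flat partition into 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)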
Overview
1. Reachability Distance:
○ The reachability distance of a point p from a core point o is the larger of o's core distance and the actual distance dist(o, p); it therefore depends on ε and MinPts through the core distance.
2. Core Distance:
○ The smallest radius for which a point's ε-neighborhood contains at least MinPts points, i.e., the distance to its MinPts-th nearest neighbor (undefined if the point is not a core point).
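Both quantities are exposed by scikit-learn's OPTICS implementation. The sketch below is illustrative; the data and the parameter choices (min_samples as MinPts, max_eps as ε) are assumptions.

import numpy as np
from sklearn.cluster import OPTICS

# Toy data: two dense groups plus one isolated point.
X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5],
              [25, 80]], dtype=float)

# min_samples plays the role of MinPts; max_eps bounds the neighborhood radius ε.
optics = OPTICS(min_samples=2, max_eps=5.0).fit(X)

print(optics.core_distances_)  # core distance of each point
print(optics.reachability_)    # reachability distances (inf where unreachable)
print(optics.labels_)          # cluster ids; -1 marks points treated as noise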
Algorithm Steps
Overview
Key Concepts
1. Hierarchical Grids:
○ The data space is partitioned into cells at different levels of
granularity.
○ At the highest level, there are fewer, larger cells covering
the entire data space.
○ Each cell is subdivided into smaller cells at lower levels.
2. Statistical Summaries:
○ Each cell contains statistical properties of the data points it
covers, such as:
■ Mean
■ Standard deviation
■ Minimum and maximum values
■ Number of data points
○ These summaries help evaluate density and significance.
Algorithm Steps
1. Grid Construction:
○ Divide the data space into hierarchical grids.
○ Compute statistical summaries for each cell.
2. Query Execution:
○ Start from the topmost level of the hierarchy.
○ Select cells that meet a density or significance threshold.
○ Drill down into finer-resolution cells only in selected
regions.
3. Cluster Formation:
○ Adjacent cells meeting the criteria are merged to form
clusters.
○ Cells with low density are considered noise.
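A simplified, single-resolution sketch of this loop in Python: bin points into a grid, keep cells above a density threshold, and merge adjacent dense cells with a connected-components pass. Grid size and threshold are illustrative assumptions (the actual STING algorithm works over a hierarchy of levels).

import numpy as np
from scipy.ndimage import label

def grid_clusters(X, bins=10, min_count=3):
    # Grid construction: histogram the 2-D points into bins x bins cells,
    # storing the count (a simple statistical summary) per cell.
    counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1], bins=bins)

    # Query/selection: keep only cells that meet the density threshold.
    dense = counts >= min_count

    # Cluster formation: merge adjacent dense cells into connected regions.
    regions, n_regions = label(dense)

    # Map each point back to its cell and read off the region id
    # (0 means the point fell in a sparse cell and is treated as noise).
    ix = np.clip(np.digitize(X[:, 0], xedges) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(X[:, 1], yedges) - 1, 0, bins - 1)
    return regions[ix, iy], n_regions

# Example call on synthetic points.
labels, n = grid_clusters(np.random.default_rng(0).normal(size=(200, 2)), bins=8, min_count=5)
print(n)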
Advantages
● Efficiency: Reduces computation by using precomputed
summaries.
● Scalability: Handles large datasets efficiently.
● Simplicity: Works well for numerical data and spatial
applications.
Limitations
Overview
Key Concepts
1. Wavelet Transformation:
○ A process that transforms the data space into a grid,
applying mathematical filters to highlight dense areas and
suppress noise.
○ Dense regions become more prominent, while sparse
regions and noise are diminished.
2. Feature Space Representation:
○ After the wavelet transformation, the data is represented as
a compact grid with transformed values that are easier to
analyze.
3. Cluster Identification:
○ Clusters are formed by identifying connected dense regions
in the transformed grid.
Algorithm Steps
1. Grid Construction:
○ Divide the data space into a grid with cells.
2. Wavelet Transformation:
○ Apply wavelet filters to the grid to highlight dense regions
and reduce noise.
○ This transformation creates a hierarchical representation of
the data.
3. Cluster Detection:
○ Identify clusters by locating connected regions in the
transformed grid with high density values.
○ Sparse regions are treated as noise.
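A heavily simplified sketch of these three steps, assuming the PyWavelets package is available: quantize points onto a grid, take the approximation band of a 2-D Haar transform to smooth the density surface, then threshold and connect dense regions. This only mirrors the idea, not the actual WaveCluster algorithm.

import numpy as np
import pywt
from scipy.ndimage import label

def wavecluster_sketch(X, bins=32, threshold=1.0):
    # 1. Grid construction: count points per cell.
    counts, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=bins)

    # 2. Wavelet transformation: the approximation band (cA) of a 2-D Haar
    #    transform emphasizes dense regions and damps noise.
    cA, _details = pywt.dwt2(counts, "haar")

    # 3. Cluster detection: threshold the transformed grid and merge
    #    connected dense regions; everything below threshold is noise.
    regions, n_regions = label(cA > threshold)
    return regions, n_regions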
Advantages
Limitations
1. Internal Evaluation
Internal evaluation methods assess the clustering quality based on the inherent properties of the
data and the resulting clusters without reference to external information.
● Compactness: Data points within a cluster should be as close to each other as possible
(low intra-cluster distance).
● Separation: Clusters should be well-separated from one another (high inter-cluster
distance).
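The silhouette coefficient combines exactly these two notions (compactness via the mean intra-cluster distance, separation via the distance to the nearest other cluster). A quick check with scikit-learn; the data and number of clusters are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5]], dtype=float)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette ranges from -1 to 1; values near 1 mean compact,
# well-separated clusters.
print(silhouette_score(X, labels))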
2. External Evaluation
External evaluation compares the clustering result to a predefined ground truth, which is often
not available in real-world clustering tasks.
● Adjusted Rand Index (ARI): Adjusts the Rand Index for chance, providing a more reliable evaluation.
● F-Measure: Combines the precision and recall of matching clusters to ground-truth classes into a single score.
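Where ground-truth labels are available, scikit-learn computes the chance-corrected index directly; the labels below are illustrative.

from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 1]   # known ground-truth classes
labels_pred = [1, 1, 0, 0, 0, 0]   # clustering output (cluster ids are arbitrary)

# ARI = 1 for a perfect match, ~0 for random labelings, and it can be negative.
print(adjusted_rand_score(labels_true, labels_pred))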
3. Relative Evaluation
Relative evaluation compares the quality of different clustering solutions produced by the same
algorithm with varying parameters (e.g., different numbers of clusters or initializations).
Common Approaches
● Elbow Method:
○ Plots SSE or distortion vs. the number of clusters k.
○ The "elbow point" indicates an optimal k where adding more clusters yields
diminishing returns in reducing SSE.
● Silhouette Analysis:
○ Uses the average silhouette score for different values of k to determine the
optimal number of clusters.
● Gap Statistic:
○ Compares the total within-cluster variation of the data to that of random,
uniformly distributed data.
○ A large gap indicates better clustering.
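A minimal sketch of the elbow method with scikit-learn's KMeans (the dataset and range of k are placeholders): compute the within-cluster SSE (inertia) for increasing k and look for the bend.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1.5, 1.8], [1, 0.6],
              [8, 8], [9, 11], [8.5, 9.5],
              [20, 1], [21, 2], [19, 0.5]], dtype=float)

sse = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)   # sum of squared distances to the centroids

# The "elbow" is where adding another cluster stops reducing SSE much;
# for this toy data the drop flattens after k = 3.
for k, e in zip(range(1, 7), sse):
    print(k, round(e, 1))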
4. Challenges in Evaluation