MA Unit 5
Exploratory data analysis: Clustering is often used to uncover hidden patterns and relationships in
data, providing insights into the underlying structure of the data set.
Grouping similar data points: The goal of clustering is to group similar data points together, forming
clusters where data points within each cluster share similar characteristics.
Fraud detection: Detecting fraudulent transactions by identifying patterns that deviate from normal
behavior.
Medical diagnosis: Grouping patients with similar symptoms or medical histories to aid in diagnosis
and treatment decisions.
Hierarchical clustering: A method that builds a hierarchy of clusters by merging or splitting data
points based on their similarity.
Density-based spatial clustering of applications with noise (DBSCAN): An algorithm that identifies
clusters based on the density of data points in the neighborhood of each data point.
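Because DBSCAN is listed above, a minimal sketch follows, assuming scikit-learn is available; the synthetic half-moon data and the eps/min_samples values are illustrative choices, not part of these notes.

# DBSCAN groups points that lie in dense regions and labels sparse points as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-convex shape that density-based clustering handles well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples is the density threshold for a core point.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_  # -1 marks points treated as noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))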
Measures of similarity and dissimilarity are fundamental concepts in cluster analysis, as they quantify
the degree of resemblance or difference between data points. These measures play a crucial role in
determining the structure and composition of clusters.
Similarity measures indicate how similar two data points are, with higher values indicating greater
similarity. Common similarity measures include:
Cosine similarity: Suitable for comparing data points represented as vectors, measuring the angle
between the vectors.
Jaccard similarity: Applicable to binary or categorical data, reflecting the proportion of shared
attributes between data points.
Dissimilarity measures, on the other hand, quantify how dissimilar two data points are, with higher
values indicating greater dissimilarity. Common dissimilarity measures include:
Euclidean distance: A general-purpose distance measure, calculating the straight-line distance between
two data points in a multidimensional space.
Manhattan distance: Also known as the city block distance, measuring the sum of absolute differences
between corresponding coordinates of two data points.
Minkowski distance: A generalization of the Euclidean and Manhattan distances, controlled by an
order parameter p (p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance).
Mahalanobis distance: Uses the covariance matrix of the data to adjust for the relative scales of, and
the correlations between, the dimensions.
The choice of similarity or dissimilarity measure depends on the specific characteristics of the data
and the clustering algorithm being used. In general, similarity measures are used for algorithms that
seek to group similar data points together, while dissimilarity measures are used for algorithms that
identify clusters based on the separation between data points.
In cluster analysis, the measure of similarity or dissimilarity between data points is called proximity,
and proximity measures are the mathematical functions that compute it. A short sketch computing
several of the measures above follows.
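A minimal sketch of these proximity measures, assuming NumPy and SciPy are available; the example vectors are arbitrary.

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.0, 3.0])

print("Euclidean distance :", distance.euclidean(a, b))     # straight-line distance
print("Manhattan distance :", distance.cityblock(a, b))     # sum of absolute coordinate differences
print("Minkowski (p = 3)  :", distance.minkowski(a, b, p=3))
print("Cosine similarity  :", 1 - distance.cosine(a, b))    # SciPy returns cosine *distance*

# Jaccard similarity on binary attribute vectors (proportion of shared attributes)
u = np.array([1, 1, 0, 1, 0], dtype=bool)
v = np.array([1, 0, 0, 1, 1], dtype=bool)
print("Jaccard similarity :", 1 - distance.jaccard(u, v))

# Mahalanobis distance needs the inverse covariance matrix of the data set
data = np.random.default_rng(0).normal(size=(100, 4))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print("Mahalanobis distance:", distance.mahalanobis(a, b, VI))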
Hierarchical clustering is a popular unsupervised learning technique that groups similar objects into
clusters. It comes in two forms:
Agglomerative clustering (bottom-up approach): This method starts by treating each observation as a
separate cluster and then repeatedly merges the two closest clusters until all observations belong to a
single cluster or a stopping criterion is reached.
Divisive clustering (top-down approach): This method starts with all data points in a single cluster and
iteratively splits the clusters into smaller ones until a stopping criterion is reached. The stopping
criterion is similar to agglomerative clustering, but instead of merging clusters, they are split based on
their similarity or dissimilarity.
Distance or similarity matrix calculation: A distance or similarity matrix is computed to quantify the
pairwise relationships between data points. The choice of distance or similarity measure depends on
the data type and the desired clustering behavior.
Cluster merging or splitting: The algorithm iteratively merges or splits clusters based on the distance
or similarity matrix. The merging or splitting strategy depends on the chosen hierarchical clustering
approach (agglomerative or divisive).
Dendrogram generation: A dendrogram is constructed to visualize the hierarchical relationships
among clusters. The dendrogram represents the merging or splitting process and can be cut at different
levels to determine the desired number of clusters.
Flexible cluster formation: The dendrogram allows for flexibility in determining the number of
clusters and their composition.
Effective for non-convex clusters: Hierarchical clustering can handle non-convex clusters, which are
clusters that cannot be represented by a single convex shape.
Sensitivity to outliers: Outliers can significantly impact the clustering results, as they can influence
the merging or splitting decisions.
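A minimal sketch of agglomerative hierarchical clustering, assuming SciPy and Matplotlib are available; the three synthetic groups and the choice of average linkage are illustrative only.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Three loose groups of 2-D points (illustrative data only)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0, 3, 6)])

# linkage() computes pairwise distances and performs the iterative merging
Z = linkage(X, method="average", metric="euclidean")

# Cut the hierarchy to obtain a chosen number of flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])

# The dendrogram visualizes the merge heights; cutting it at different heights
# yields different numbers of clusters.
dendrogram(Z)
plt.show()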
Q4) Discuss single linkage, complete linkage and average linkage in cluster analysis.
Single linkage, complete linkage, and average linkage are three different methods for defining the
distance between clusters in hierarchical clustering. These methods determine how the similarity or
dissimilarity between individual data points is aggregated to represent the overall relationship
between clusters.
Single linkage (SL):
In single linkage, the distance between two clusters is defined as the minimum distance between any
two data points, one from each cluster. This method tends to form long, chain-like clusters and is
sensitive to outliers.
Complete linkage (CL):
In complete linkage, the distance between two clusters is defined as the maximum distance between
any two data points, one from each cluster. This method tends to form compact clusters and is less
sensitive to outliers than single linkage.
Average linkage (AL):
In average linkage, the distance between two clusters is defined as the average of all pairwise distances
between data points from different clusters. This method balances single linkage's sensitivity to
outliers against complete linkage's tendency to form compact clusters.
The choice of linkage method depends on the specific characteristics of the data and the desired
clustering behavior. Single linkage is suitable for identifying elongated or irregularly shaped clusters,
but its chaining behavior makes it sensitive to noise and outliers. Complete linkage is useful for
identifying compact, well-separated clusters and is less sensitive to outliers than single linkage.
Average linkage provides a balance between these two extremes and is often a suitable choice for
general-purpose clustering.
In summary: single linkage uses the minimum pairwise distance between two clusters when deciding
whether to merge them, complete linkage uses the maximum pairwise distance, and average linkage
uses the average pairwise distance. A short sketch comparing the three methods follows.
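A minimal sketch comparing the three linkage methods, assuming SciPy is available; the data set (two compact groups plus a few scattered "bridge" points) is an illustrative construction intended to show single linkage's chaining tendency.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (15, 2)),   # compact group
               rng.normal(5, 0.5, (15, 2)),   # compact group
               rng.uniform(0, 5, (5, 2))])    # scattered bridge points

D = pdist(X)  # condensed matrix of pairwise Euclidean distances
for method in ("single", "complete", "average"):
    Z = linkage(D, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, "-> cluster sizes:", np.bincount(labels)[1:])
# Single linkage tends to chain through the bridge points, while complete and
# average linkage usually keep the two compact groups better separated.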
Q5) Discuss Non-Hierarchical method along with K-means cluster method
Non-hierarchical clustering is a method of clustering that does not require the data to be organized in
a hierarchical manner. Instead, non-hierarchical clustering algorithms work by iteratively assigning
data points to clusters until a stopping criterion is met. This type of clustering is often used when the
number of clusters is not known in advance.
K-means clustering is a popular non-hierarchical clustering algorithm that is based on the idea of
partitioning the data into a predefined number of clusters (k). The k-means algorithm works by
iteratively assigning each data point to the nearest cluster centroid, and then updating the cluster
centroids to reflect the new assignments. This process is repeated until the cluster centroids converge.
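A minimal k-means sketch, assuming scikit-learn is available; the blob data and k = 3 are illustrative choices.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init repeats the algorithm from different random centroid seeds and keeps the
# best result, since k-means can converge to a local optimum.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("centroids:\n", km.cluster_centers_)
print("within-cluster sum of squares (inertia):", km.inertia_)
print("first ten assignments:", km.labels_[:10])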
Determining the optimal number of clusters and evaluating cluster validity are crucial aspects of
cluster analysis, as they ensure that the identified clusters accurately represent the underlying structure
of the data.
Elbow method: Plots the within-cluster sum of squares (WCSS) against the number of clusters. The
optimal number of clusters is typically indicated by an "elbow" in the plot, the point beyond which the
WCSS decreases only slowly (see the sketch after this list).
Silhouette analysis: Calculates the silhouette coefficient for each data point, which compares how
close the point is to its own cluster with how close it is to the nearest other cluster. Higher silhouette
values indicate better clustering.
Gap statistic: Compares the WCSS of the actual clustering to the expected WCSS of clusterings of
randomly generated reference data. The optimal number of clusters is the one that maximizes this gap.
Domain knowledge: Utilize prior knowledge about the data or the problem domain to determine a
reasonable number of clusters.
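A minimal sketch of the elbow method and silhouette analysis, assuming scikit-learn is available; the data and the range of k values tried are illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                     # within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)  # mean silhouette coefficient
    print("k =", k, "| WCSS =", round(wcss, 1), "| silhouette =", round(sil, 3))
# Plotting WCSS against k reveals the "elbow"; the k with the highest mean
# silhouette value is another reasonable candidate.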
Dunn index: Measures the ratio of the minimum inter-cluster distance to the maximum intra-cluster
distance. Higher values indicate compact, well-separated clusters.
Calinski-Harabasz index (CH index): Measures the ratio of the sum of between-cluster variance to the
sum of within-cluster variance. Higher values indicate better separation between clusters.
Silhouette score: The average of the silhouette coefficients for all data points. Higher values indicate
better overall clustering.
Visualization: Plotting the clusters in a two-dimensional space can provide visual insights into the
cluster separation and cohesion.
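A minimal sketch computing two of the validity indices above for a fixed clustering, assuming scikit-learn is available; the synthetic data and k = 3 are illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

print("Silhouette score        :", silhouette_score(X, labels))         # higher is better
print("Calinski-Harabasz index :", calinski_harabasz_score(X, labels))  # higher is better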