
Unit-5

Q1) Define cluster analysis in detail


Cluster analysis, also known as clustering, is an unsupervised machine learning technique that groups
similar data points together without predefined labels. It aims to identify patterns and relationships
within the data by partitioning it into clusters, where data points within each cluster share similar
characteristics. Clustering algorithms are often used for exploratory data analysis and can help
researchers understand the underlying structure of their data.

Key characteristics of cluster analysis:


Unsupervised learning: Clustering algorithms do not require labeled data, making them suitable for
situations where the true class labels are unknown or difficult to obtain.

Exploratory data analysis: Clustering is often used to uncover hidden patterns and relationships in
data, providing insights into the underlying structure of the data set.

Grouping similar data points: Data points are assigned to clusters so that points within the same cluster resemble each other more than they resemble points in other clusters.

Common applications of cluster analysis:


Customer segmentation: Identifying customer segments with similar characteristics for targeted
marketing campaigns.

Fraud detection: Detecting fraudulent transactions by identifying patterns that deviate from normal
behavior.

Medical diagnosis: Grouping patients with similar symptoms or medical histories to aid in diagnosis
and treatment decisions.

Image segmentation: Identifying regions of interest in images, such as objects or boundaries.

Examples of clustering algorithms:


K-means clustering: A popular algorithm that partitions the data into a predefined number of clusters.

Hierarchical clustering: A method that builds a hierarchy of clusters by merging or splitting data
points based on their similarity.

Density-based spatial clustering of applications with noise (DBSCAN): An algorithm that identifies
clusters based on the density of data points in the neighborhood of each data point.
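To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the synthetic dataset and all parameter values are made up for illustration) that runs K-means and DBSCAN on the same data:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Synthetic dataset: 300 points in 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# K-means: partitions the data into a predefined number of clusters (k = 3)
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# DBSCAN: grows clusters from dense neighborhoods; points in sparse
# regions receive the label -1 (noise)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print("K-means cluster sizes:", np.bincount(kmeans_labels))
print("DBSCAN labels found:", sorted(set(dbscan_labels)))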

Cluster analysis uses criteria such as:


 Smallest distances
 Density of data points
 Graphs
 Statistical distributions
Cluster analysis is also common in statistics and finance. For example, investors use cluster analysis to develop a cluster trading approach that helps build a diversified portfolio.
Q2) What are the measures of similarity and dissimilarity in cluster analysis?

Measures of similarity and dissimilarity are fundamental concepts in cluster analysis, as they quantify
the degree of resemblance or difference between data points. These measures play a crucial role in
determining the structure and composition of clusters.

Similarity measures indicate how similar two data points are, with higher values indicating greater similarity. Common similarity measures include:

Cosine similarity: Suitable for comparing data points represented as vectors, measuring the cosine of the angle between them.

Jaccard similarity: Applicable to binary or categorical data, reflecting the proportion of shared attributes between data points.

Dissimilarity (distance) measures, on the other hand, quantify how different two data points are, with higher values indicating greater dissimilarity. Common dissimilarity measures include:

Euclidean distance: A general-purpose distance measure, calculating the straight-line distance between two data points in multidimensional space.

Manhattan distance: Also known as the city block distance, measuring the sum of absolute differences between corresponding coordinates of two data points.

Minkowski distance: A generalization of the Euclidean and Manhattan distances, parameterized by an order p (p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance).

Mahalanobis distance: Considers the covariance matrix of the data to adjust for the relative scales and correlations of the dimensions.
The choice of similarity or dissimilarity measure depends on the specific characteristics of the data
and the clustering algorithm being used. In general, similarity measures are used for algorithms that
seek to group similar data points together, while dissimilarity measures are used for algorithms that
identify clusters based on the separation between data points.
In cluster analysis, measures of similarity and dissimilarity are collectively called proximity measures: mathematical techniques that calculate how similar or dissimilar data points are.
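As an illustration, the following sketch (assuming NumPy and SciPy are available; the vectors u, v, a, and b are arbitrary example data) computes each of these measures with scipy.spatial.distance:

import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 5.0])

# Similarity: SciPy returns the cosine *distance*, so similarity = 1 - distance
cos_sim = 1 - distance.cosine(u, v)

# Jaccard similarity operates on binary vectors
a = np.array([1, 0, 1, 1], dtype=bool)
b = np.array([1, 1, 0, 1], dtype=bool)
jac_sim = 1 - distance.jaccard(a, b)

# Dissimilarity (distance) measures
d_euclid = distance.euclidean(u, v)        # straight-line distance
d_manhat = distance.cityblock(u, v)        # sum of absolute coordinate differences
d_minkow = distance.minkowski(u, v, p=3)   # order-p generalization

# Mahalanobis distance needs the inverse covariance matrix of the data
data = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(data.T))
d_mahal = distance.mahalanobis(u, v, VI)

print(cos_sim, jac_sim, d_euclid, d_manhat, d_minkow, d_mahal)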

Q3) Discuss Hierarchical method of clustering in detail.

Hierarchical clustering is a popular unsupervised learning technique that groups similar objects into clusters. It is an iterative process: in its most common (agglomerative) form, it starts by treating each observation as a separate cluster and then repeatedly merges the two closest clusters until only one cluster, or a desired number of clusters, remains.

Two main approaches are used in hierarchical clustering:


Agglomerative clustering (bottom-up approach): This method starts with each data point as a separate
cluster and iteratively merges the closest clusters until a stopping criterion is reached. The stopping
criterion can be based on a predefined number of clusters, a maximum distance threshold, or a desired
level of granularity in the dendrogram.

Divisive clustering (top-down approach): This method starts with all data points in a single cluster and
iteratively splits the clusters into smaller ones until a stopping criterion is reached. The stopping
criterion is similar to agglomerative clustering, but instead of merging clusters, they are split based on
their similarity or dissimilarity.

Key steps in hierarchical clustering:


Data preparation: The data should be preprocessed to handle missing values, normalize numerical
features, and encode categorical features if necessary.

Distance or similarity matrix calculation: A distance or similarity matrix is computed to quantify the
pairwise relationships between data points. The choice of distance or similarity measure depends on
the data type and the desired clustering behavior.

Cluster merging or splitting: The algorithm iteratively merges or splits clusters based on the distance
or similarity matrix. The merging or splitting strategy depends on the chosen hierarchical clustering
approach (agglomerative or divisive).

Dendrogram generation: A dendrogram is constructed to visualize the hierarchical relationships
among clusters. The dendrogram represents the merging or splitting process and can be cut at different
levels to determine the desired number of clusters.
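The steps above can be illustrated with SciPy's hierarchical clustering routines. This is a minimal sketch, assuming SciPy is installed; the random dataset and the choices of average linkage and three clusters are arbitrary:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 2))              # 20 observations, 2 features

# Step 2: condensed matrix of pairwise Euclidean distances
dists = pdist(X, metric='euclidean')

# Step 3: agglomerative merging under a chosen linkage strategy
Z = linkage(dists, method='average')

# Step 4: cut the dendrogram to obtain a flat clustering (here: 3 clusters);
# scipy.cluster.hierarchy.dendrogram(Z) would draw the full merge tree
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)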

Advantages of hierarchical clustering:


No prior knowledge required: Hierarchical clustering does not require prior knowledge of the number
of clusters or the underlying data structure.

Flexible cluster formation: The dendrogram allows for flexibility in determining the number of
clusters and their composition.

Effective for non-convex clusters: Hierarchical clustering can handle non-convex clusters, which are
clusters that cannot be represented by a single convex shape.

Disadvantages of hierarchical clustering:


Computational complexity: Hierarchical clustering algorithms can be computationally expensive for
large datasets.

Sensitivity to outliers: Outliers can significantly impact the clustering results, as they can influence
the merging or splitting decisions.

Some weaknesses of hierarchical methods include:


 It rarely provides the best solution
 It involves lots of arbitrary decisions
 It does not work with missing data
 It works poorly with mixed data types
 It does not work well on very large data sets
 Its main output, the dendrogram, is commonly misinterpreted

Q4) Discuss single linkage, complete linkage and average linkage in cluster analysis.

Single linkage, complete linkage, and average linkage are three different methods for defining the
distance between clusters in hierarchical clustering. These methods determine how the similarity or
dissimilarity between individual data points is aggregated to represent the overall relationship
between clusters.
Single linkage (SL):
In single linkage, the distance between two clusters is defined as the minimum distance between any
two data points, one from each cluster. This method tends to form long, chain-like clusters and is
sensitive to outliers.

Complete linkage (CL):

In complete linkage, the distance between two clusters is defined as the maximum distance between
any two data points, one from each cluster. This method tends to form compact, spherical clusters and
is less sensitive to outliers than single linkage.

Average linkage (AL):

In average linkage, the distance between two clusters is defined as the average of all pairwise
distances between data points from different clusters. This method balances the outlier sensitivity of
single linkage against the tendency of complete linkage to force compact clusters.

The choice of linkage method depends on the specific characteristics of the data and the desired
clustering behavior. Single linkage is suitable for identifying elongated or irregularly shaped clusters,
but its chaining behavior makes it vulnerable to noise and outliers. Complete linkage is useful for
identifying well-separated, compact clusters and is less sensitive to outliers than single linkage.
Average linkage provides a balance between these two extremes and is often a suitable choice for
general-purpose clustering.

In summary:
 Single linkage: computes the minimum distance between clusters before merging them.
 Complete linkage: computes the maximum distance between clusters before merging them.
 Average linkage: computes the average distance between clusters before merging them.
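The practical difference between the three linkages shows up on data with elongated cluster shapes. The sketch below assumes SciPy and scikit-learn are available; the two-moons dataset and its parameters are chosen only for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape where the linkage choice matters
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

for method in ('single', 'complete', 'average'):
    Z = linkage(X, method=method)            # merge history under this linkage
    labels = fcluster(Z, t=2, criterion='maxclust')
    sizes = np.bincount(labels)[1:]          # fcluster labels start at 1
    print(f"{method:>8} linkage -> cluster sizes: {sizes}")

# Single linkage typically recovers the two moons (its chaining follows the
# curves); complete and average linkage tend to cut them into compact blobs.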
Q5) Discuss non-hierarchical methods of clustering along with the K-means clustering method


Non-hierarchical clustering, also known as partition clustering, is a method of clustering that does not
organize the data into a hierarchy. Data points are assigned to clusters without imposing a progressive
structure, so the relationship between clusters is left undetermined. Non-hierarchical algorithms work
by iteratively assigning data points to clusters until a stopping criterion is met, and most of them
require the number of clusters to be specified in advance.

K-means clustering is a popular non-hierarchical clustering algorithm that is based on the idea of
partitioning the data into a predefined number of clusters (k). The k-means algorithm works by
iteratively assigning each data point to the nearest cluster centroid, and then updating the cluster
centroids to reflect the new assignments. This process is repeated until the cluster centroids converge.

Advantages of non-hierarchical clustering:

 Efficient for large datasets
 No need to pre-specify the hierarchy of clusters

Disadvantages of non-hierarchical clustering:

 Sensitive to the choice of the number of clusters
 May not be able to handle non-convex clusters

Advantages of K-means clustering:

 Simple to implement
 Efficient for large datasets
 Guaranteed to converge (though possibly to a local optimum)

Disadvantages of K-means clustering:

 Sensitive to outliers
 Requires the number of clusters to be specified in advance
 May not be able to handle non-convex clusters
Here are the main steps for implementing K-means clustering (a sketch follows this list):
1. Create a custom dataset
2. Initialize random centroids
3. Plot the random centroids with the data points
4. Define the Euclidean distance
5. Create a function to assign points to clusters and update the cluster centers
6. Create a function to predict the cluster for the data points
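A minimal from-scratch sketch of these steps, using only NumPy (the function kmeans, the dataset, and all parameter values are hypothetical choices for illustration; the plotting step is omitted):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal from-scratch K-means; assumes no cluster becomes empty."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 4: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 5: assign each point to its nearest centroid, then move each
        # centroid to the mean of its assigned points
        labels = dists.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels

# Step 1: a small synthetic dataset of three Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ((0, 0), (3, 3), (0, 3))])
centroids, labels = kmeans(X, k=3)   # Step 6: labels give each point's cluster
print(centroids)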

Overall, non-hierarchical clustering and K-means clustering are both versatile and widely used
techniques for exploratory data analysis and cluster identification. The choice between the two
methods often depends on the specific characteristics of the data and the desired clustering behavior.
Q6) Explain the choice of the number of clusters and cluster validity in detail.

Determining the optimal number of clusters and evaluating cluster validity are crucial aspects of
cluster analysis, as they ensure that the identified clusters accurately represent the underlying structure
of the data.

Determining the Number of Clusters


Choosing the appropriate number of clusters is an essential step in cluster analysis, as it influences the
granularity and interpretability of the results. Several methods exist for determining the number of
clusters, including:

Elbow method: Plots the within-cluster sum of squares (WCSS) against the number of clusters. The
optimal number of clusters is typically indicated by an "elbow" in the plot, where the WCSS starts to
decrease slowly.

Silhouette analysis: Calculates the silhouette coefficient for each data point, which compares how
close the point is to its own cluster with how close it is to the nearest neighboring cluster. Higher
silhouette values indicate better clustering.

Gap statistic: Compares the WCSS of the actual clustering with the expected WCSS of clusterings of
randomly generated reference data. The optimal number of clusters is the one that maximizes this gap
between the actual and random WCSS.

Domain knowledge: Utilize prior knowledge about the data or the problem domain to determine a
reasonable number of clusters.
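For example, the elbow method and silhouette analysis can be combined in one loop. This is a sketch assuming scikit-learn is available; the synthetic dataset with four blobs is made up so that the expected answer is known:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    wcss = km.inertia_                     # within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)  # mean silhouette coefficient
    print(f"k={k}: WCSS={wcss:8.1f}  silhouette={sil:.3f}")

# Look for the "elbow" where WCSS stops dropping sharply, and for the k
# with the highest silhouette score (here both should point to k = 4).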

Evaluating Cluster Validity


Cluster validity assesses the quality of the clustering results by measuring the separation between
clusters and the cohesion within clusters. Common cluster validity indices include:

Dunn index: Measures the ratio of the minimum inter-cluster distance to the maximum intra-cluster
distance (cluster diameter). Higher values indicate better separation between clusters.

Calinski-Harabasz index (CH index): Measures the ratio of the sum of between-cluster variance to the
sum of within-cluster variance. Higher values indicate better separation between clusters.

Silhouette score: The average of the silhouette coefficients for all data points. Higher values indicate
better overall clustering.

Visualization: Plotting the clusters in a two-dimensional space can provide visual insights into the
cluster separation and cohesion.
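As an illustration, the Calinski-Harabasz index is available in scikit-learn, while the Dunn index can be computed directly from its definition. The sketch below assumes scikit-learn and SciPy are installed; the helper function dunn_index and the dataset are illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score
from scipy.spatial.distance import cdist

X, _ = make_blobs(n_samples=300, centers=3, random_state=3)
labels = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

# Calinski-Harabasz: between- vs. within-cluster variance (higher is better)
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))

def dunn_index(X, labels):
    """Minimum inter-cluster distance divided by the maximum cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    min_between = min(cdist(a, b).min()
                      for i, a in enumerate(clusters)
                      for b in clusters[i + 1:])
    max_diameter = max(cdist(c, c).max() for c in clusters)
    return min_between / max_diameter

print("Dunn index:", dunn_index(X, labels))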

Choosing the Right Approach


The choice of method for determining the number of clusters and evaluating cluster validity depends
on the specific data and clustering algorithm used. It is often beneficial to employ multiple methods to
obtain a more comprehensive assessment of the clustering results.
