MA Unit 5
Exploratory data analysis: Clustering is often used to uncover hidden patterns and relationships in
data, providing insights into the underlying structure of the data set.
Grouping similar data points: The goal of clustering is to group similar data points together, forming
clusters where data points within each cluster share similar characteristics.
Fraud detection: Detecting fraudulent transactions by identifying patterns that deviate from normal
behavior.
Medical diagnosis: Grouping patients with similar symptoms or medical histories to aid in diagnosis
and treatment decisions.
Hierarchical clustering: A method that builds a hierarchy of clusters by merging or splitting data
points based on their similarity.
Density-based spatial clustering of applications with noise (DBSCAN): An algorithm that identifies
clusters based on the density of data points in the neighborhood of each data point.
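Because DBSCAN is listed above, a minimal sketch follows, assuming scikit-learn is available; the synthetic half-moon data and the eps/min_samples values are illustrative choices, not part of these notes.

# DBSCAN groups points that lie in dense regions and labels sparse points as noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-convex shape that density-based clustering handles well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples is the density threshold for a core point.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_  # -1 marks points treated as noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", int(np.sum(labels == -1)))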
Measures of similarity and dissimilarity are fundamental concepts in cluster analysis, as they quantify
the degree of resemblance or difference between data points. These measures play a crucial role in
determining the structure and composition of clusters.
Similarity measures indicate how similar two data points are, with higher values indicating greater
similarity. Common similarity measures include:
Cosine similarity: Suitable for comparing data points represented as vectors, measuring the angle
between the vectors.
Jaccard similarity: Applicable to binary or categorical data, reflecting the proportion of shared
attributes between data points.
Dissimilarity measures, on the other hand, quantify how dissimilar two data points are, with higher
values indicating greater dissimilarity. Common dissimilarity measures include:
Euclidean distance: A general-purpose distance measure, calculating the straight-line distance between
two data points in a multidimensional space.
Manhattan distance: Also known as the city block distance, measuring the sum of absolute differences
between corresponding coordinates of two data points.
Minkowski distance: A generalization of the Euclidean and Manhattan distances, controlled by an
order parameter p (p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance).
Mahalanobis distance: Uses the covariance matrix of the data to adjust for the relative scales of, and
the correlations between, the dimensions.
The choice of similarity or dissimilarity measure depends on the specific characteristics of the data
and the clustering algorithm being used. In general, similarity measures are used for algorithms that
seek to group similar data points together, while dissimilarity measures are used for algorithms that
identify clusters based on the separation between data points.
In cluster analysis, the measure of similarity or dissimilarity between data points is called proximity,
and proximity measures are the mathematical functions that compute it. A short sketch computing
several of the measures above follows.
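A minimal sketch of these proximity measures, assuming NumPy and SciPy are available; the example vectors are arbitrary.

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.0, 3.0])

print("Euclidean distance :", distance.euclidean(a, b))     # straight-line distance
print("Manhattan distance :", distance.cityblock(a, b))     # sum of absolute coordinate differences
print("Minkowski (p = 3)  :", distance.minkowski(a, b, p=3))
print("Cosine similarity  :", 1 - distance.cosine(a, b))    # SciPy returns cosine *distance*

# Jaccard similarity on binary attribute vectors (proportion of shared attributes)
u = np.array([1, 1, 0, 1, 0], dtype=bool)
v = np.array([1, 0, 0, 1, 1], dtype=bool)
print("Jaccard similarity :", 1 - distance.jaccard(u, v))

# Mahalanobis distance needs the inverse covariance matrix of the data set
data = np.random.default_rng(0).normal(size=(100, 4))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print("Mahalanobis distance:", distance.mahalanobis(a, b, VI))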
Hierarchical clustering is a popular unsupervised learning technique that groups similar objects into
clusters. It comes in two forms:
Agglomerative clustering (bottom-up approach): This method starts by treating each observation as a
separate cluster and then repeatedly merges the two closest clusters until all observations belong to a
single cluster or a stopping criterion is reached.
Divisive clustering (top-down approach): This method starts with all data points in a single cluster and
iteratively splits the clusters into smaller ones until a stopping criterion is reached. The stopping
criterion is similar to agglomerative clustering, but instead of merging clusters, they are split based on
their similarity or dissimilarity.
Distance or similarity matrix calculation: A distance or similarity matrix is computed to quantify the
pairwise relationships between data points. The choice of distance or similarity measure depends on
the data type and the desired clustering behavior.
Cluster merging or splitting: The algorithm iteratively merges or splits clusters based on the distance
or similarity matrix. The merging or splitting strategy depends on the chosen hierarchical clustering
approach (agglomerative or divisive).
Dendrogram generation: A dendrogram is constructed to visualize the hierarchical relationships
among clusters. The dendrogram represents the merging or splitting process and can be cut at different
levels to determine the desired number of clusters.
Flexible cluster formation: The dendrogram allows for flexibility in determining the number of
clusters and their composition.
Effective for non-convex clusters: Hierarchical clustering can handle non-convex clusters, which are
clusters that cannot be represented by a single convex shape.
Sensitivity to outliers: Outliers can significantly impact the clustering results, as they can influence
the merging or splitting decisions.
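A minimal sketch of agglomerative hierarchical clustering, assuming SciPy and Matplotlib are available; the three synthetic groups and the choice of average linkage are illustrative only.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Three loose groups of 2-D points (illustrative data only)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0, 3, 6)])

# linkage() computes pairwise distances and performs the iterative merging
Z = linkage(X, method="average", metric="euclidean")

# Cut the hierarchy to obtain a chosen number of flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])

# The dendrogram visualizes the merge heights; cutting it at different heights
# yields different numbers of clusters.
dendrogram(Z)
plt.show()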
Q4) Discuss single linkage, complete linkage and average linkage in cluster analysis.
Single linkage, complete linkage, and average linkage are three different methods for defining the
distance between clusters in hierarchical clustering. These methods determine how the similarity or
dissimilarity between individual data points is aggregated to represent the overall relationship
between clusters.
Single linkage (SL):
In single linkage, the distance between two clusters is defined as the minimum distance between any
two data points, one from each cluster. This method tends to form long, chain-like clusters and is
sensitive to outliers.
Complete linkage (CL):
In complete linkage, the distance between two clusters is defined as the maximum distance between
any two data points, one from each cluster. This method tends to form compact clusters and is less
sensitive to outliers than single linkage.
Average linkage (AL):
In average linkage, the distance between two clusters is defined as the average of all pairwise distances
between data points from different clusters. This method balances single linkage's sensitivity to
outliers against complete linkage's tendency to form compact clusters.
The choice of linkage method depends on the specific characteristics of the data and the desired
clustering behavior. Single linkage is suitable for identifying elongated or irregularly shaped clusters,
but its chaining behavior makes it sensitive to noise and outliers. Complete linkage is useful for
identifying compact, well-separated clusters and is less sensitive to outliers than single linkage.
Average linkage provides a balance between these two extremes and is often a suitable choice for
general-purpose clustering.
In summary: single linkage uses the minimum pairwise distance between two clusters when deciding
whether to merge them, complete linkage uses the maximum pairwise distance, and average linkage
uses the average pairwise distance. A short sketch comparing the three methods follows.
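A minimal sketch comparing the three linkage methods, assuming SciPy is available; the data set (two compact groups plus a few scattered "bridge" points) is an illustrative construction intended to show single linkage's chaining tendency.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (15, 2)),   # compact group
               rng.normal(5, 0.5, (15, 2)),   # compact group
               rng.uniform(0, 5, (5, 2))])    # scattered bridge points

D = pdist(X)  # condensed matrix of pairwise Euclidean distances
for method in ("single", "complete", "average"):
    Z = linkage(D, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, "-> cluster sizes:", np.bincount(labels)[1:])
# Single linkage tends to chain through the bridge points, while complete and
# average linkage usually keep the two compact groups better separated.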
Q5) Discuss Non-Hierarchical method along with K-means cluster method
Non-hierarchical clustering is a method of clustering that does not require the data to be organized in
a hierarchical manner. Instead, non-hierarchical clustering algorithms work by iteratively assigning
data points to clusters until a stopping criterion is met. This type of clustering is often used when the
number of clusters is not known in advance.
K-means clustering is a popular non-hierarchical clustering algorithm that is based on the idea of
partitioning the data into a predefined number of clusters (k). The k-means algorithm works by
iteratively assigning each data point to the nearest cluster centroid, and then updating the cluster
centroids to reflect the new assignments. This process is repeated until the cluster centroids converge.
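A minimal k-means sketch, assuming scikit-learn is available; the blob data and k = 3 are illustrative choices.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init repeats the algorithm from different random centroid seeds and keeps the
# best result, since k-means can converge to a local optimum.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("centroids:\n", km.cluster_centers_)
print("within-cluster sum of squares (inertia):", km.inertia_)
print("first ten assignments:", km.labels_[:10])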
Determining the optimal number of clusters and evaluating cluster validity are crucial aspects of
cluster analysis, as they ensure that the identified clusters accurately represent the underlying structure
of the data.
Elbow method: Plots the within-cluster sum of squares (WCSS) against the number of clusters. The
optimal number of clusters is typically indicated by an "elbow" in the plot, the point beyond which the
WCSS decreases only slowly (see the sketch after this list).
Silhouette analysis: Calculates the silhouette coefficient for each data point, which compares how
close the point is to its own cluster with how close it is to the nearest other cluster. Higher silhouette
values indicate better clustering.
Gap statistic: Compares the WCSS of the actual clustering to the expected WCSS of clusterings of
randomly generated reference data. The optimal number of clusters is the one that maximizes this gap.
Domain knowledge: Utilize prior knowledge about the data or the problem domain to determine a
reasonable number of clusters.
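A minimal sketch of the elbow method and silhouette analysis, assuming scikit-learn is available; the data and the range of k values tried are illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss = km.inertia_                     # within-cluster sum of squares
    sil = silhouette_score(X, km.labels_)  # mean silhouette coefficient
    print("k =", k, "| WCSS =", round(wcss, 1), "| silhouette =", round(sil, 3))
# Plotting WCSS against k reveals the "elbow"; the k with the highest mean
# silhouette value is another reasonable candidate.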
Dunn index: Measures the ratio of the minimum inter-cluster distance to the maximum intra-cluster
distance. Higher values indicate compact, well-separated clusters.
Calinski-Harabasz index (CH index): Measures the ratio of the sum of between-cluster variance to the
sum of within-cluster variance. Higher values indicate better separation between clusters.
Silhouette score: The average of the silhouette coefficients for all data points. Higher values indicate
better overall clustering.
Visualization: Plotting the clusters in a two-dimensional space can provide visual insights into the
cluster separation and cohesion.
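A minimal sketch computing two of the validity indices above for a fixed clustering, assuming scikit-learn is available; the synthetic data and k = 3 are illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

print("Silhouette score        :", silhouette_score(X, labels))         # higher is better
print("Calinski-Harabasz index :", calinski_harabasz_score(X, labels))  # higher is better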