
Cluster Analysis

Hierarchical Cluster Analysis


• Hierarchical cluster analysis is an algorithm that groups similar
objects into groups called clusters.
• The endpoint is a set of clusters, where each cluster is distinct from
every other cluster, and the objects within each cluster are broadly
similar to each other.
• Hierarchical clustering creates clusters that have a predetermined
ordering from top to bottom. For example, all files and folders on a
hard disk are organized in a hierarchy.
Agglomerative vs Divisive method
• Agglomerative (bottom-up) clustering starts with each observation in its own cluster and repeatedly merges clusters; divisive (top-down) clustering starts with all observations in one cluster and repeatedly splits it.
How hierarchical clustering works
Hierarchical clustering starts by treating each observation as a separate
cluster. Then, it repeatedly executes the following two steps:
(1) identify the two clusters that are closest together, and
(2) merge the two most similar clusters. This iterative process continues
until all the clusters are merged together.
The main output of hierarchical clustering is a dendrogram, which shows the
hierarchical relationship between the clusters. The record of which clusters are
combined at each step, and at what distance, is known as the agglomeration schedule.
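As a minimal sketch of this bottom-up process (assuming SciPy and Matplotlib are available; the data points are made up for illustration), the snippet below builds the merge table for a few 2-D points and plots the resulting dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# A small, made-up set of 2-D observations
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.2],
              [5.2, 4.8], [9.0, 9.1], [8.8, 9.4]])

# Each row of Z records one merge: the two clusters joined, the distance
# at which they were joined, and the size of the new cluster. This table
# plays the role of the agglomeration schedule.
Z = linkage(X, method="average", metric="euclidean")

dendrogram(Z)                      # visualize the hierarchy of merges
plt.xlabel("observation index")
plt.ylabel("merge distance")
plt.show()
```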
Usage of Hierarchical clustering
• Hierarchical clustering is a popular and widely used method for
analyzing social network data. In this method, nodes are compared with
one another based on their similarity, and larger groups are built by
joining groups of nodes according to that similarity. A criterion is
introduced to compare nodes based on their relationship.
Measures of distance
Methods:
1. Distance between two points
2. Distance between a point and a cluster
3. Distance between two clusters
Distance methods (k-means)
• Euclidean distance
• Manhattan distance
• Minkowski distance
• Hamming distance
Euclidean distance (L2 norm)
Manhattan distance (L1 norm)
Minkowski distance
• Minkowski distance is the generalized form of the Euclidean and
Manhattan distances.
• When the order p is 1, it reduces to the Manhattan distance, and
when the order is 2, it reduces to the Euclidean distance.
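For reference, the standard Minkowski distance of order p between two n-dimensional points x and y can be written as:

```latex
D(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}
```

Setting p = 1 recovers the Manhattan (L1) distance and p = 2 recovers the Euclidean (L2) distance.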
Hamming distance
• The Hamming distance measures the similarity between two strings of
the same length: it is the number of positions at which the
corresponding characters are different.
• Example: "euclidean" and "manhattan" (both nine characters long)
differ at seven positions, so their Hamming distance is 7.
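As a short illustration (the vectors and strings below are made-up examples), all four metrics can be computed with NumPy and plain Python:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# Euclidean (L2): square root of the sum of squared differences
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan (L1): sum of absolute differences
manhattan = np.sum(np.abs(x - y))

# Minkowski of order p: generalizes both (p=1 -> Manhattan, p=2 -> Euclidean)
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)

# Hamming: number of positions where two equal-length strings differ
s1, s2 = "euclidean", "manhattan"
hamming = sum(a != b for a, b in zip(s1, s2))

print(euclidean, manhattan, minkowski, hamming)  # ~3.61  5.0  ~3.27  7
```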
Hierarchical linkage criteria
1. Single linkage: shortest distance between clusters.
2. Complete linkage: farthest distance between clusters.
3. Mean or average linkage: the centers of the clusters or some other
averaging criterion.
4. Ward linkage: a minimum-variance criterion that minimizes the total
within-cluster variance.
Single linkage
• The distance between two clusters is defined as the shortest distance
between two points, one in each cluster.
Complete linkage
• The distance between two clusters is defined as the longest distance
between two points, one in each cluster.
Average linkage
• The distance between two clusters is defined as the average distance
between each point in one cluster and every point in the other cluster
(the average of all pairwise distances).
Ward’s method
• In Ward's minimum-variance method, the distance between two
clusters is the ANOVA sum of squares between the two clusters added
up over all the variables.
• At each generation, the within-cluster sum of squares is minimized
over all partitions obtainable by merging two clusters from the
previous generation.
• The sums of squares are easier to interpret when they are divided by
the total sum of squares to give proportions of variance (squared semi-
partial correlations).
Important:
• The choice of distance metric should be made based on theoretical concerns from
the domain of study. For example, if clustering crime sites in a city, city-block
distance may be appropriate, or, better yet, the time taken to travel between each
location. Where there is no theoretical justification for an alternative, Euclidean
distance should generally be preferred, as it is usually the appropriate measure of
distance in the physical world.
• The choice of linkage criterion should also be made based on theoretical
considerations from the domain of application. A key theoretical issue is what causes
variation. For example, in archaeology we expect variation to occur through
innovation and natural resources, so deciding whether two groups of artifacts are
similar may reasonably be based on identifying the most similar members of the
clusters. Where there is no clear theoretical justification for the choice of linkage
criterion, Ward's method is a sensible default: it decides which observations to group
by minimizing the sum of squared distances of each observation from the average
observation in its cluster.
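As a rough sketch of how these criteria are tried out in practice (scikit-learn is assumed to be available, and the data is synthetic), the same dataset can be clustered under each linkage rule and the resulting cluster sizes compared:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Synthetic data: three loose blobs in 2-D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# Fit the same data under each linkage criterion
for linkage in ("single", "complete", "average", "ward"):
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)
    sizes = np.bincount(labels)
    print(f"{linkage:>8} linkage -> cluster sizes: {sizes}")
```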
K-means clustering
• Partitional, non-deterministic
• Works on centroids
K-means clustering
The idea behind k-means is that we want to add k new points to the data we have.

Each of those points, called a centroid, moves around trying to center itself in the middle of one of the k clusters.

Once those centroids stop moving, the clustering algorithm stops.

The value of k is of great importance. This k is called a hyperparameter: a variable whose value we set before training. It specifies the number of clusters we want the algorithm to yield, which is the number of centroids moving around in the data.

Each data point is assigned to the cluster whose mean (centroid) is nearest.
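A minimal sketch of this procedure with scikit-learn (the data and the choice k = 3 are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic 2-D data drawn around three made-up centers
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(50, 2))
               for c in ([0, 0], [6, 6], [0, 6])])

# k is a hyperparameter: the number of clusters is fixed before training
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # each point goes to its nearest centroid

print(kmeans.cluster_centers_)       # final centroid positions
print(kmeans.inertia_)               # WCSS: within-cluster sum of squared distances
```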
Identifying k: the elbow method
• For each value of k, we calculate the WCSS (Within-Cluster Sum of
Squares): the sum of the squared distances between each point and the
centroid of its cluster. As k increases, WCSS decreases; the point
after which the decrease slows down sharply defines the optimal number
of clusters and is known as the "elbow point".
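A short sketch of the elbow computation (reusing the same kind of synthetic data; the range of candidate k values is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(50, 2))
               for c in ([0, 0], [6, 6], [0, 6])])

# WCSS (inertia) for each candidate k; look for the "elbow" where the
# curve stops dropping sharply
for k in range(1, 9):
    wcss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k}: WCSS={wcss:.1f}")
```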
Identifying k: the silhouette coefficient
• This coefficient is a measure of cluster cohesion and separation.
• It ranges between -1 and 1; values close to 1 indicate dense,
well-separated clusters.
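A brief sketch using scikit-learn's silhouette_score (again with synthetic data; the candidate k values are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(50, 2))
               for c in ([0, 0], [6, 6], [0, 6])])

# The k with the highest average silhouette coefficient is preferred
for k in range(2, 7):                      # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```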
Advantages of K-means
• Convergence is guaranteed.
• Specialized to clusters of different sizes and shapes.
• Can handle big data.

Advantages of Hierarchical clustering
• Ease of handling any form of similarity or distance measure.
• Consequently, applicable to any type of attribute.

Disadvantages of K-means
• The value of k is difficult to predict.
• K-means produces clusters with uniform sizes (in terms of density and
number of observations), even though the underlying data might behave
very differently.
• K-means is very sensitive to outliers, since centroids can be dragged
by noisy data.

Disadvantages of Hierarchical clustering
• Cannot handle big data: it requires the computation and storage of an
n×n distance matrix, which can be expensive and slow for very large
datasets.
