Cluster Analysis
• Works on centroids (one representative point per cluster)
K-means clustering
The idea behind k-means is that we add k new points to the data we have. Each of those points, called a centroid, moves around the data trying to settle in the middle of one of the k clusters.
This k is called a hyper-parameter: a variable whose value we set before training. It specifies the number of clusters we want the algorithm to yield, which is also the number of centroids placed in the data.
Each data point is assigned to the cluster whose mean (centroid) is nearest.
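A minimal sketch of this in code, assuming scikit-learn is available (the toy data X, the choice k = 2, and all variable names below are illustrative, not from the source):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two rough blobs.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.8, 8.1], [8.3, 7.9]])

# k is the hyper-parameter: the number of centroids (and hence clusters).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)  # where the two centroids settled
print(kmeans.labels_)           # each point mapped to its nearest centroid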
Identifying k: Elbow method
• For each value of k, we calculate the WCSS (Within-Cluster Sum of Squares): the sum of the squared distances between each point and the centroid of its cluster.
• WCSS always decreases as k grows, but the improvement shrinks. Plotting WCSS against k, the point where the curve bends and flattens out is known as the “elbow point”, and it indicates the optimal number of clusters (sketched in code below).
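A sketch of this procedure, again assuming scikit-learn; its inertia_ attribute is exactly the WCSS of a fitted model, and the three synthetic blobs are made up for illustration:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs, so the elbow should appear near k = 3.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

wcss = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # WCSS: sum of squared point-to-centroid distances

# Print (or plot) WCSS against k and look for the bend in the curve.
for k, w in zip(range(1, 9), wcss):
    print(k, round(w, 1))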
Identifying k: Silhouette coefficient
• This coefficient measures both cluster cohesion (how close each point is to the rest of its own cluster) and separation (how far it is from the nearest other cluster).
• It ranges between -1 and 1: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster (see the sketch below).
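A sketch of scoring candidate values of k this way, assuming scikit-learn (silhouette_score returns the mean coefficient over all samples; the synthetic data is illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# For each point i: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the
# mean distance to points in its own cluster (cohesion) and b(i) is the mean
# distance to points in the nearest other cluster (separation).
for k in range(2, 7):  # the score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))

The value of k with the highest mean silhouette is the preferred choice.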
Advantages
K-means:
• Convergence is guaranteed.
• Adapts to clusters of different sizes and shapes.
Hierarchical:
• Ease of handling any form of similarity or distance measure.
• Consequently, applicable to any attribute type.

Disadvantages
K-means:
• The value of k is difficult to predict in advance.
• K-means produces clusters of uniform size (in terms of density and number of observations), even when the underlying data behave very differently.
• K-means is very sensitive to outliers, since centroids can be dragged by noisy data (see the sketch below).
Hierarchical:
• Hierarchical clustering requires computing and storing an n×n distance matrix; for very large datasets this is expensive and slow.
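A small sketch of this outlier sensitivity on made-up toy data, assuming scikit-learn; one extreme point is enough to change what the centroids capture:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
X_noisy = np.vstack([X, [[50.0, 50.0]]])  # one extreme outlier

for name, data in (("clean", X), ("with outlier", X_noisy)):
    centers = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data).cluster_centers_
    print(name, np.round(centers, 2))

In a run like this the outlier typically captures a centroid of its own, so the two genuine blobs collapse under the remaining centroid, illustrating how noisy data distorts the solution.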