DA_EXP_10
UID 2021600022
2021600033
Date 10-11-2024
Lab 10
Objective Clustering groups similar data points together, revealing hidden patterns or structures. It
helps in tasks like customer segmentation, anomaly detection, and image recognition by
organizing data into meaningful clusters for better insights and decision-making.
Theory
K-Means Clustering Theory:
K-Means is a popular unsupervised clustering algorithm that divides a dataset into K
distinct clusters based on feature similarity. The objective is to minimize the variance (or
sum of squared distances) within each cluster, ensuring that data points in the same cluster
are as similar as possible.
Distance Metric:
BHARATIYA VIDYA BHAVAN’S
SARDAR PATEL INSTITUTE OF TECHNOLOGY
(Empowered Autonomous Institute Affiliated to University of Mumbai)
[Knowledge is Nectar]
● The Euclidean distance is typically used to measure the distance between data
points and centroids:

d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}

where x and y are data points, and x_1, x_2, \dots, x_n are their respective
feature values.
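As a quick illustration of this metric, the snippet below computes the Euclidean distance between two hypothetical 3-dimensional points using NumPy (the values are made up for demonstration):

```python
import numpy as np

# Two sample points in a 3-dimensional feature space (illustrative values)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the sum of squared coordinate differences
d = np.sqrt(np.sum((x - y) ** 2))
print(d)  # 5.0, since sqrt((-3)^2 + (-4)^2 + 0^2) = sqrt(25)
```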
Key Concepts:
1. Centroids: The central point of each cluster, typically the mean of all data points in
the cluster.
2. K: The number of clusters you want to divide your dataset into. Selecting the
optimal K is important and can be done using methods like the Elbow Method,
where the sum of squared distances (within-cluster variance) is plotted for different
values of K to find the "elbow" point, indicating the optimal number of clusters.
3. Convergence: K-means converges when either the centroids do not change
significantly between iterations or a predefined number of iterations is reached.
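The Elbow Method mentioned above can be sketched as follows: fit K-Means for a range of K values and record the within-cluster sum of squared distances (scikit-learn exposes this as `inertia_`). The dataset here is synthetic, generated with `make_blobs` purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Record within-cluster variance (inertia) for K = 1..6;
# the "elbow" where the curve flattens suggests a good K.
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

print(inertias)
```

Inertia always decreases as K grows, so the point to look for is where the decrease slows sharply (here, around K = 3), not the minimum itself.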
Advantages of K-Means:
● Scalability: K-means is computationally efficient and works well with large
datasets.
● Simplicity: The algorithm is easy to implement and understand.
● Efficiency: Converges quickly, especially when the data is well-separated.
Limitations of K-Means:
● Choosing K: The value of K must be specified in advance, and choosing the
correct K can be challenging.
● Sensitive to Initialization: Poor initialization of centroids can lead to suboptimal
clustering. This is addressed with techniques like K-means++.
● Assumes Spherical Clusters: K-means assumes that clusters are spherical and
equally sized, which may not always be the case.
● Outliers: K-means is sensitive to outliers, as they can distort the mean of the
cluster.
Applications:
● Market segmentation (grouping customers with similar buying behaviors).
● Document clustering (grouping similar texts or articles).
● Image compression (grouping similar pixel values).
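A minimal end-to-end run of the algorithm described above, using scikit-learn's `KMeans` on synthetic blob data (the dataset and parameter values are assumptions for demonstration; `init="k-means++"` is the smarter initialization noted in the limitations section and is scikit-learn's default):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D dataset with 3 groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Fit K-Means with k-means++ initialization and K = 3
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)        # cluster index assigned to each point
centroids = km.cluster_centers_   # mean of each cluster's points

print(centroids.shape)  # (3, 2): one 2-D centroid per cluster
```

After fitting, `labels` gives each point's cluster, and `km.predict` can assign new points to the nearest learned centroid.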
Output
Conclusion K-Means is an efficient and widely used clustering algorithm that groups data into K
clusters based on similarity. It works by iteratively assigning data points to the nearest
centroid and updating centroids until convergence. While it is computationally efficient and
simple, it requires selecting the optimal K and can be sensitive to initialization and outliers.
Despite these limitations, K-Means is widely applied in areas like market segmentation,
image compression, and pattern recognition.
References https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/