
BHARATIYA VIDYA BHAVAN’S

SARDAR PATEL INSTITUTE OF TECHNOLOGY


(Empowered Autonomous Institute Affiliated to University of Mumbai)
[Knowledge is Nectar]

Department of Computer Science Engineering


Course - Data Analytics

UID 2021600022, 2021600033

Name Mahek Gupta, Shruti Kedari

Class and Batch BE AIML, Batch B

Date 10-11-2024

Lab 10

Aim To perform clustering on a dataset.

Objective Clustering groups similar data points together, revealing hidden patterns or structures. It
helps in tasks like customer segmentation, anomaly detection, and image recognition by
organizing data into meaningful clusters for better insights and decision-making.

Theory
K-Means Clustering Theory:
K-Means is a popular unsupervised clustering algorithm that divides a dataset into K
distinct clusters based on feature similarity. The objective is to minimize the variance (or
sum of squared distances) within each cluster, ensuring that data points in the same cluster
are as similar as possible.
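
Formally, the quantity K-Means minimizes is the within-cluster sum of squared distances (the value scikit-learn reports as inertia_):

J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k is the set of points assigned to cluster k and \mu_k is that cluster's centroid.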

Steps in K-Means Algorithm:


1. Initialization:
○ Select K initial cluster centroids randomly or using some heuristic (like
K-means++ for better initial centroids).
2. Assignment Step:
○ Assign each data point to the nearest centroid based on a distance metric
(typically Euclidean distance).
3. Update Step:
○ Recalculate the centroids of the clusters by taking the mean of all the data
points assigned to each cluster.
4. Repeat:
○ Repeat steps 2 and 3 until convergence (i.e., when the assignments no
longer change, or the centroids stabilize); a minimal code sketch of these
four steps follows this list.
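
As a rough illustration of the four steps above (not the implementation used in this experiment, which relies on scikit-learn), a minimal NumPy sketch with random initialization instead of k-means++ could look like this:

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialization - pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assignment - label each point with its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: update - recompute each centroid as the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat until convergence - stop once the centroids stabilize
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

Calling kmeans_sketch(data_scaled, k=5), for example, would mirror the scikit-learn fit performed later, up to differences in initialization.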

Distance Metric:
● The Euclidean distance is typically used to measure the distance between data
points and centroids:
d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}

where x and y are data points, and x_1, x_2, \dots, x_n are their respective feature values.
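
As a small illustrative check in code (the vectors below are arbitrary, not drawn from the dataset used later in this experiment):

import numpy as np

x = np.array([2.0, 4.0, 6.0])
y = np.array([1.0, 1.0, 2.0])

# sqrt of the sum of squared feature-wise differences: sqrt(1 + 9 + 16) ≈ 5.10
d = np.sqrt(np.sum((x - y) ** 2))

# equivalently, using the built-in vector norm
d_alt = np.linalg.norm(x - y)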

Key Concepts:
1. Centroids: The central point of each cluster, typically the mean of all data points in
the cluster.
2. K: The number of clusters you want to divide your dataset into. Selecting the
optimal K is important and can be done using methods like the Elbow Method,
where the sum of squared distances (within-cluster variance) is plotted for different
values of K to find the "elbow" point, indicating the optimal number of clusters.
3. Convergence: K-means converges when either the centroids no longer change
significantly between iterations or a predefined maximum number of iterations is
reached; the scikit-learn call sketched after this list exposes both as parameters.
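
These concepts map directly onto parameters of scikit-learn's KMeans. The values below are illustrative rather than prescribed by this experiment, except n_clusters, which matches the K chosen later:

from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=5,        # K: the number of clusters
    init='k-means++',    # centroid initialization (mitigates poor random starts)
    n_init=10,           # number of different initializations to try
    max_iter=300,        # upper bound on assignment/update iterations
    tol=1e-4,            # centroid-movement threshold used to declare convergence
)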

Advantages of K-Means:
● Scalability: K-means is computationally efficient and works well with large
datasets.
● Simplicity: The algorithm is easy to implement and understand.
● Efficiency: Converges quickly, especially when the data is well-separated.

Limitations of K-Means:
● Choosing K: The value of K must be specified in advance, and choosing the
correct K can be challenging.
● Sensitive to Initialization: Poor initialization of centroids can lead to suboptimal
clustering. This is addressed with techniques like K-means++ (compared against
random initialization in the sketch after this list).
● Assumes Spherical Clusters: K-means assumes that clusters are spherical and
equally sized, which may not always be the case.
● Outliers: K-means is sensitive to outliers, as they can distort the mean of the
cluster.
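
To see the initialization sensitivity in practice, one could compare a single random initialization against k-means++ for the same K. This is only a sketch on synthetic placeholder data, not part of the recorded experiment:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 4)   # placeholder data standing in for data_scaled

# With n_init=1 each run uses exactly one initialization, so the effect of the
# starting centroids is visible; lower inertia for the same K means a tighter fit.
for init in ('random', 'k-means++'):
    km = KMeans(n_clusters=5, init=init, n_init=1, random_state=0).fit(X)
    print(init, km.inertia_)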

Applications:
● Market segmentation (grouping customers with similar buying behaviors).
● Document clustering (grouping similar texts or articles).
● Image compression (grouping similar pixel values).

Implementation / Code


# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans

# Load the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv'
data = pd.read_csv(url)

# Display the first few rows
print(data.head())

# statistics of the data
data.describe()

# standardizing the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# statistics of scaled data
pd.DataFrame(data_scaled).describe()

# defining and fitting an initial k-means model so its inertia can be inspected
# (k = 2 is an arbitrary starting choice; the elbow curve below selects the final K)
kmeans = KMeans(n_clusters=2, init='k-means++')
kmeans.fit(data_scaled)

# inertia on the fitted data
kmeans.inertia_

# fitting multiple k-means algorithms and storing the inertia values in an empty list
SSE = []
for cluster in range(1, 20):
    kmeans = KMeans(n_clusters=cluster, init='k-means++')
    kmeans.fit(data_scaled)
    SSE.append(kmeans.inertia_)

# converting the results into a dataframe and plotting them
frame = pd.DataFrame({'Cluster':range(1,20), 'SSE':SSE})
plt.figure(figsize=(12,6))
plt.plot(frame['Cluster'], frame['SSE'], marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
# k means using 5 clusters and k-means++ initialization
kmeans = KMeans( n_clusters = 5, init='k-means++')
kmeans.fit(data_scaled)
pred = kmeans.predict(data_scaled)
frame = pd.DataFrame(data_scaled)
frame['cluster'] = pred
frame['cluster'].value_counts()
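
A possible follow-up (not part of the recorded output): attach the predicted labels to the original, unscaled dataframe and profile each cluster, using the data and pred objects defined above.

# attach the cluster labels to the original (unscaled) data
data['cluster'] = pred

# per-cluster mean of each column, useful for interpreting the segments
print(data.groupby('cluster').mean())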

Output

Conclusion K-Means is an efficient and widely used clustering algorithm that groups data into K
clusters based on similarity. It works by iteratively assigning data points to the nearest
centroid and updating centroids until convergence. While it is computationally efficient and
simple, it requires selecting the optimal K and can be sensitive to initialization and outliers.
Despite these limitations, K-Means is widely applied in areas like market segmentation,
image compression, and pattern recognition.

References https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
