K-Means Clustering
What is K-Means Clustering?
K-Means divides objects into clusters such that objects within the same cluster are "similar" to each other and "dissimilar" to objects belonging to other clusters.
Can you explain this with an example?
Sure. To understand K-Means better, let's take an example from cricket.
Task: Identify bowlers and batsmen
• The data contains runs scored and wickets taken in the last 10 matches
• So, bowlers will have more wickets, and batsmen will have higher runs

[Chart: player scores]
Assign data points
Here, we have our dataset with x and y coordinates. Now, we want to cluster this data using K-Means.

[Scatter plot: Runs vs. Wickets]
Cluster 1
We can see that this cluster has players with high runs and low wickets.

[Scatter plot: Runs vs. Wickets, with Cluster 1 highlighted]
Cluster 2
And here, we can see that this cluster has players with high wickets and low runs.

[Scatter plot: Runs vs. Wickets, with Cluster 2 highlighted]
Consider the same cricket data set and solve the problem using K-Means.
Initially, two centroids are assigned randomly. The Euclidean distance is used to find out which centroid is closest to each data point, and the data points are assigned to the corresponding centroids.
Reposition the two centroids for optimization.
The process is repeated iteratively until the centroids become static.
What’s in it for you?
Types of Clustering
What is K-Means Clustering?
Applications of K-Means clustering
Common distance measures
How does K-Means clustering work?
K-Means Clustering Algorithm
Demo: K-Means Clustering
Use Case: Color Compression
Types of Clustering
Clustering
• Hierarchical Clustering: Agglomerative, Divisive
• Partitional Clustering: K-Means, Fuzzy C-Means
Hierarchical Clustering
Clusters have a tree-like structure or a parent-child relationship.
Agglomerative ("bottom-up") approach: Begin with each element as a separate cluster and merge them into successively larger clusters.

[Dendrogram: a, b, c, d, e, f merge into bc, de, def, bcdef, and finally abcdef]
Types of Clustering
“Top down“ approach begin with the
whole set and proceed to divide it into
successively smaller clusters.
a b c fd e
de
def
bcdef
abcdef
bc
Clustering
Hierarchical
Clustering
Agglomerative Divisive
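For illustration, here is a minimal Python sketch (not from the original deck) of the agglomerative approach on six toy points standing in for the elements a-f; the coordinates are made up, and SciPy's hierarchy module is used because the bottom-up approach is the one with a standard library routine:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six toy points standing in for elements a-f (coordinates are illustrative)
points = np.array([[0.0, 0.0], [0.2, 0.1], [0.3, 0.0],   # a, b, c
                   [5.0, 5.0], [5.1, 5.2], [5.3, 4.9]])  # d, e, f

# "Bottom-up" merging: each point starts as its own cluster, and the two
# closest clusters are merged repeatedly, producing a tree (dendrogram)
Z = linkage(points, method='ward')

# Cut the tree into 2 flat clusters: {a, b, c} and {d, e, f}
labels = fcluster(Z, t=2, criterion='maxclust')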
Partitional Clustering: K-Means
Division of objects into clusters such that each object is in exactly one cluster, not several.
Partitional Clustering: Fuzzy C-Means
Division of objects into clusters such that each object can belong to multiple clusters.
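As an illustration of the contrast (a sketch of the standard Fuzzy C-Means membership formula, not code from the deck; the point and centroid values are made up): K-Means gives each point exactly one label, while Fuzzy C-Means gives it a degree of membership in every cluster:

import numpy as np

# One toy point and two fixed centroids (values made up for illustration)
x = np.array([2.0, 0.0])
centroids = np.array([[0.0, 0.0], [5.0, 0.0]])
m = 2.0  # fuzzifier; m > 1 controls how soft the memberships are

# Fuzzy C-Means membership of x in cluster i:
# u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)), where d_i = ||x - c_i||
d = np.linalg.norm(x - centroids, axis=1)                            # [2.0, 3.0]
u = 1.0 / np.sum((d[:, None] / d[None, :]) ** (2.0 / (m - 1.0)), axis=1)
print(u)  # ~[0.69, 0.31]: x belongs mostly, but not only, to cluster 1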
Applications of K-Means Clustering
• Academic performance
• Wireless sensor networks
• Diagnostic systems
• Search engines
Distance Measure
• Euclidean distance measure
• Squared Euclidean distance measure
• Manhattan distance measure
• Cosine distance measure

The distance measure determines the similarity between two elements and influences the shape of the clusters.
Euclidean Distance Measure
• The Euclidean distance is the "ordinary" straight-line distance
• It is the distance between two points in Euclidean space

$d = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
Squared Euclidean Distance Measure
The Euclidean squared distance metric uses the same equation as the
Euclidean distance metric, but does not take the square root.
$d = \sum_{i=1}^{n} (q_i - p_i)^2$
Manhattan Distance Measure
The Manhattan distance is the simple sum of the horizontal and vertical components, i.e., the distance between two points measured along axes at right angles:

$d = |q_x - p_x| + |q_y - p_y|$
Cosine Distance Measure
The cosine distance measures the angle between the two vectors:

$d = \frac{\sum_{i=0}^{n-1} q_i p_i}{\sqrt{\sum_{i=0}^{n-1} (q_i)^2} \times \sqrt{\sum_{i=0}^{n-1} (p_i)^2}}$
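For illustration, here is a minimal NumPy sketch (not from the original deck; the point values are made up) computing all four measures for one pair of points:

import numpy as np

# Two illustrative points
p = np.array([3.0, 4.0])
q = np.array([1.0, 2.0])

euclidean = np.sqrt(np.sum((q - p) ** 2))   # ~2.83
squared_euclidean = np.sum((q - p) ** 2)    # 8.0
manhattan = np.sum(np.abs(q - p))           # 4.0

# Cosine similarity measures the angle between the two vectors;
# the cosine distance is commonly taken as 1 - similarity
cos_sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
cosine_distance = 1.0 - cos_sim             # ~0.016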
How does K-Means clustering work?
[Flowchart] Start → Elbow point (k) → Measure the distance → Grouping based on minimum distance → Reposition the centroids → Convergence: if clusters are stable (+), stop; if clusters are unstable (−), repeat.
Elbow point
• Let's say you have a dataset for a grocery shop
• Now, the important question is: "how would you choose the optimum number of clusters?"
• The best way to do this is the elbow method
• The idea of the elbow method is to run K-Means clustering on the dataset, where 'k' is the number of clusters
• Within sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid

$WSS = \sum_{i=1}^{m} (x_i - c_i)^2$

where $x_i$ is a data point and $c_i$ is the centroid closest to $x_i$
• Now, we draw a curve between WSS (within sum of squares) and the number of clusters
• Here, we can see a very slow change in the value of WSS after k=2, so you should take that elbow-point value as the final number of clusters

[Plot: WSS vs. number of clusters, with the elbow point at k=2]
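Here is a sketch of how such an elbow curve could be produced (illustrative code, not from the deck); scikit-learn's KMeans exposes WSS as its inertia_ attribute:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for the grocery-shop dataset
X, _ = make_blobs(n_samples=300, centers=2, random_state=0)

# WSS (inertia_) for k = 1..9; the "elbow" in the curve marks the optimum k
wss = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_
       for k in range(1, 10)]

plt.plot(range(1, 10), wss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WSS')
plt.show()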
Step 1: The given data points below are assumed to be delivery points

[Scatter plot of delivery points]
Step 2: We can randomly initialize two points, called the cluster centroids. The Euclidean distance is used to find out which centroid each data point is closest to.

[Plot: delivery points with randomly initialized centroids c1 and c2]
Step 3: Based upon the distance from the c1 and c2 centroids, the data points group themselves into clusters

[Plot: points grouped into two clusters around c1 and c2]
Step 4: Compute the centroid of the data points inside the blue cluster
Step 5: Reposition the centroid of the blue cluster to the new centroid

[Plot: blue cluster centroid repositioned]
Step 6: Now, compute the centroid of the data points inside the orange cluster
Step 7: Reposition the centroid of the orange cluster to the new centroid

[Plot: orange cluster centroid repositioned]
Step 8: Once the clusters become static, the K-Means clustering algorithm is said to have converged

[Plot: final converged clusters with centroids c1 and c2]
K-Means Clustering Algorithm
Assuming we have inputs x1, x2, x3, …, and a value of K:
Step 1: Pick K random points as cluster centers, called centroids
Step 2: Assign each xi to the nearest cluster by calculating its distance to each centroid
Step 3: Find the new cluster center by taking the average of the assigned points
Step 4: Repeat steps 2 and 3 until none of the cluster assignments change
Step 1:
We randomly pick K cluster centers (centroids). Let's assume these are $c_1, c_2, \ldots, c_k$, and we can say that

$C = \{c_1, c_2, \ldots, c_k\}$

is the set of all centroids.

Step 2:
In this step, we assign each data point to its closest center; this is done by calculating the Euclidean distance:

$\underset{c_i \in C}{\arg\min} \; dist(c_i, x)^2$

where dist() is the Euclidean distance.
Step 3:
In this step, we find the new centroid by taking the average of all the points assigned to that cluster:

$c_i = \frac{1}{|S_i|} \sum_{x_i \in S_i} x_i$

where $S_i$ is the set of all points assigned to the i-th cluster.

Step 4:
In this step, we repeat steps 2 and 3 until none of the cluster assignments change. That means we repeat the algorithm until our clusters remain stable.
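To connect these formulas to code before the demo, here is a minimal pure-NumPy sketch of one assignment-and-update iteration (an illustration with made-up points, not the deck's own code):

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])  # toy points
C = np.array([[1.0, 1.0], [9.0, 9.0]])                          # K = 2 centroids

# Step 2: assign each x_i to the nearest centroid (squared Euclidean distance)
dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # shape (n, K)
labels = dists.argmin(axis=1)

# Step 3: new centroid c_i = mean of the points assigned to cluster i
C = np.array([X[labels == i].mean(axis=0) for i in range(len(C))])

# Step 4 would repeat the two steps above until the assignments stop changing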
Demo: K-Means Clustering
Problem Statement
• Walmart wants to open a chain of stores across Florida and wants to find the optimal store locations to maximize revenue
Solution
• Walmart already has a strong e-commerce presence
• Walmart can use its online customer data to analyze the customer locations along with the monthly
sales
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  # for plot styling
import numpy as np
# note: sklearn.datasets.samples_generator was removed in newer scikit-learn;
# make_blobs now lives in sklearn.datasets
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
[Output: scatter plot of the four generated blobs]
# assign four clusters
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# import library
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # 1. randomly choose clusters
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    while True:
        # 2. assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)
        # 3. find new centers from means of points
        new_centers = np.array([X[labels == i].mean(0)
                                for i in range(n_clusters)])
        # 4. check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    return centers, labels

centers, labels = find_clusters(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
[Output: the four clusters colored, with black centroid markers]
Conclusion
Congratulations! We have demonstrated K-Means clustering by choosing Walmart store locations across Florida in an optimized way.
Use Case: K-Means for Color Compression
Problem Statement
To perform color compression on images using the K-Means algorithm
# example 1:
# note: this requires the ``pillow`` package to be installed
from sklearn.datasets import load_sample_image
flower = load_sample_image("flower.jpg")
ax = plt.axes(xticks=[], yticks=[])
ax.imshow(flower);

[Output: the flower sample image]
# returns the dimensions of the array
flower.shape

# reshape the data to [n_samples x n_features], and rescale the colors so that they lie between 0 and 1
data = flower / 255.0  # use 0...1 scale
data = data.reshape(427 * 640, 3)
data.shape

# visualize these pixels in this color space, using a subset of 10,000 pixels for efficiency
def plot_pixels(data, title, colors=None, N=10000):
    if colors is None:
        colors = data
    # choose a random subset
    rng = np.random.RandomState(0)
    i = rng.permutation(data.shape[0])[:N]
    colors = colors[i]
    R, G, B = data[i].T

    fig, ax = plt.subplots(1, 2, figsize=(16, 6))
    ax[0].scatter(R, G, color=colors, marker='.')
    ax[0].set(xlabel='Red', ylabel='Green', xlim=(0, 1), ylim=(0, 1))
    ax[1].scatter(R, B, color=colors, marker='.')
    ax[1].set(xlabel='Red', ylabel='Blue', xlim=(0, 1), ylim=(0, 1))
    fig.suptitle(title, size=20);
plot_pixels(data, title='Input color space: 16 million possible colors')
# suppress warnings (e.g., from NumPy)
import warnings; warnings.simplefilter('ignore')

# reducing these 16 million colors to just 16 colors
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]

plot_pixels(data, colors=new_colors,
            title="Reduced color space: 16 colors")
flower_recolored = new_colors.reshape(flower.shape)
fig, ax = plt.subplots(1, 2, figsize=(16, 6), subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(flower)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(flower_recolored)
ax[1].set_title('16-color Image', size=16);
# the result is a re-coloring of the original pixels, where each pixel is assigned the color of its closest cluster center

[Output: original flower image next to its 16-color version]
# example 2:
from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
ax = plt.axes(xticks=[], yticks=[])
ax.imshow(china);
# returns the dimensions of the array
china.shape

# reshape the data to [n_samples x n_features], and rescale the colors so that they lie between 0 and 1
data = china / 255.0  # use 0...1 scale
data = data.reshape(427 * 640, 3)
data.shape

# visualize these pixels in this color space using the same plot_pixels helper defined in example 1
plot_pixels(data, title='Input color space: 16 million possible colors')
# reducing these 16 million colors to just 16 colors, as in example 1
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]

plot_pixels(data, colors=new_colors,
            title="Reduced color space: 16 colors")
china_recolored = new_colors.reshape(china.shape)
fig, ax = plt.subplots(1, 2, figsize=(16, 6), subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('16-color Image', size=16);
# the result is a re-coloring of the original pixels, where each pixel is assigned the color of its closest cluster center

[Output: original china image next to its 16-color version]
Conclusion
Congratulations! We have demonstrated K-Means for color compression. This hands-on example will help you take on any K-Means project in the future.
Key Takeaways