K-Means Clustering and Naive Bayes
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset
into different clusters. Here, K defines the number of pre-defined clusters to be created in the
process; for example, if K=2 there will be two clusters, for K=3 there will be three clusters, and
so on.
• It is an iterative algorithm that divides the unlabeled dataset into K different clusters in
such a way that each data point belongs to only one group of points with similar properties.
• It allows us to cluster the data into different groups and provides a convenient way to
discover the categories of groups in the unlabeled dataset on its own, without the need for
any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of the algorithm is to minimize the sum of distances between the data points and
their corresponding cluster centroids (written formally after this list).
• The algorithm takes the unlabeled dataset as input, divides it into K clusters, and repeats
the process until the best clusters are found. The value of K must be predetermined.
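The centroid-based objective from the list above can be stated formally. In its standard form, k-means minimizes the within-cluster sum of squared Euclidean distances, where C_i denotes the i-th cluster and μ_i its centroid:

```latex
J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
```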
The algorithm works in the following steps:
Step-1: Select the number K to decide how many clusters are to be formed.
Step-2: Select K random points as the initial centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid using the Euclidean distance formula; this
forms the K predefined clusters.
Step-4: Calculate the variance and place a new centroid for each cluster, i.e. move each centroid
to the mean of the points assigned to it.
Step-5: Repeat the third step: reassign each data point to the new closest centroid of each cluster.
Step-6: If any point changed its cluster, go back to Step-4; otherwise the clusters have stabilized
and the model is ready.
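To make these steps concrete, here is a minimal Python sketch of the procedure using plain NumPy. The function name `kmeans` and its parameters are illustrative; because the initial centroids are chosen at random, different seeds can converge to different clusters.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step-2: pick K random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step-3: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: move each centroid to the mean of the points assigned to it
        # (keeping the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step-6: stop when no centroid moves, i.e. the clusters have stabilized.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Since the initialization is random, the cluster labels may come out in a different order from one run to the next, and an unlucky seed can land in a different local optimum.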
Problem 1:
Now that we have discussed the algorithm, let us solve a numerical problem on k-means
clustering. You are given the following 15 points in the Cartesian coordinate system.
Point Coordinates
A1 (2,10)
A2 (2,6)
A3 (11,11)
A4 (6,9)
A5 (6,4)
A6 (1,2)
A7 (5,10)
A8 (4,9)
A9 (10,12)
A10 (7,5)
A11 (9,11)
A12 (4,6)
A13 (3,10)
A14 (3,8)
A15 (6,11)
Input Dataset
We are also given that we need to make 3 clusters, i.e. K=3. We will solve this problem
using the approach discussed below.
First, we will randomly choose 3 centroids from the given data. Let us consider A2 (2,6), A7 (5,10),
and A15 (6,11) as the centroids of the initial clusters. Hence, we will take centroid 1 = (2,6),
centroid 2 = (5,10), and centroid 3 = (6,11).
Now we will find the Euclidean distance between each point and each centroid, using
d = √((x₁ − x₂)² + (y₁ − y₂)²). Based on the minimum distance of each point from the centroids,
we will assign the points to a cluster. The distances of the given points from the centroids are
tabulated below.
Point        Distance from C1 (2, 6)   Distance from C2 (5, 10)   Distance from C3 (6, 11)   Assigned Cluster
A1 (2,10)    4.000                     3.000                      4.123                      Cluster 2
A2 (2,6)     0.000                     5.000                      6.403                      Cluster 1
A3 (11,11)   10.296                    6.083                      5.000                      Cluster 3
A4 (6,9)     5.000                     1.414                      2.000                      Cluster 2
A5 (6,4)     4.472                     6.083                      7.000                      Cluster 1
A6 (1,2)     4.123                     8.944                      10.296                     Cluster 1
A7 (5,10)    5.000                     0.000                      1.414                      Cluster 2
A8 (4,9)     3.606                     1.414                      2.828                      Cluster 2
A9 (10,12)   10.000                    5.385                      4.123                      Cluster 3
A10 (7,5)    5.099                     5.385                      6.083                      Cluster 1
A11 (9,11)   8.602                     4.123                      3.000                      Cluster 3
A12 (4,6)    2.000                     4.123                      5.385                      Cluster 1
A13 (3,10)   4.123                     2.000                      3.162                      Cluster 2
A14 (3,8)    2.236                     2.828                      4.243                      Cluster 1
A15 (6,11)   6.403                     1.414                      0.000                      Cluster 3
Results from the 1st iteration of k-means clustering
At this point, we have completed the first iteration of the k-means clustering algorithm and assigned
each point to a cluster. In the table above, you can observe that each point is assigned to the cluster
whose centroid is closest to it.
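As a quick sanity check, a few lines of Python reproduce A1's row of the table:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

a1 = (2, 10)
for name, c in [("C1 (2,6)", (2, 6)), ("C2 (5,10)", (5, 10)), ("C3 (6,11)", (6, 11))]:
    print(name, round(euclidean(a1, c), 3))
# Prints 4.0, 3.0, 4.123 -- A1 is closest to C2, so it joins cluster 2.
```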
Next, we calculate a new centroid for each cluster (verified in the sketch after this list):
• In cluster 1, we have 6 points, i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), A12 (4,6), and
A14 (3,8). To calculate the new centroid for cluster 1, we find the mean of the x and y
coordinates of each point in the cluster. Hence, the new centroid for cluster 1 is (3.833,
5.167).
• In cluster 2, we have 5 points, i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), and A13
(3,10). Hence, the new centroid for cluster 2 is (4, 9.6).
• In cluster 3, we have 4 points, i.e. A3 (11,11), A9 (10,12), A11 (9,11), and A15 (6,11).
Hence, the new centroid for cluster 3 is (9, 11.25).
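These centroid means can be verified with a short snippet (cluster memberships taken from the list above):

```python
import numpy as np

cluster1 = np.array([(2, 6), (6, 4), (1, 2), (7, 5), (4, 6), (3, 8)])  # A2, A5, A6, A10, A12, A14
cluster2 = np.array([(2, 10), (6, 9), (5, 10), (4, 9), (3, 10)])       # A1, A4, A7, A8, A13
cluster3 = np.array([(11, 11), (10, 12), (9, 11), (6, 11)])            # A3, A9, A11, A15

# Each new centroid is the mean of the x and y coordinates of its cluster's points.
for i, cluster in enumerate((cluster1, cluster2, cluster3), start=1):
    print(f"New centroid {i}:", cluster.mean(axis=0).round(3))
# New centroid 1: [3.833 5.167], new centroid 2: [4. 9.6], new centroid 3: [9. 11.25]
```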
Now that we have calculated new centroids for each cluster, we will calculate the distance of each
data point from the new centroids. Then, we will assign the points to clusters based on their
distance from the centroids. The results for this process have been given in the following table.
Point        Distance from C1 (3.833, 5.167)   Distance from C2 (4, 9.6)   Distance from C3 (9, 11.25)   Assigned Cluster
A1 (2,10)    5.169                             2.040                       7.111                         Cluster 2
A2 (2,6)     2.014                             4.118                       8.750                         Cluster 1
A3 (11,11)   9.241                             7.139                       2.016                         Cluster 3
A4 (6,9)     4.403                             2.088                       3.750                         Cluster 2
A5 (6,4)     2.461                             5.946                       7.846                         Cluster 1
A6 (1,2)     4.249                             8.171                       12.230                        Cluster 1
A7 (5,10)    4.972                             1.077                       4.191                         Cluster 2
A8 (4,9)     3.837                             0.600                       5.483                         Cluster 2
A9 (10,12)   9.204                             6.462                       1.250                         Cluster 3
A10 (7,5)    3.171                             5.492                       6.562                         Cluster 1
A11 (9,11)   7.792                             5.192                       0.250                         Cluster 3
A12 (4,6)    0.850                             3.600                       7.250                         Cluster 1
A13 (3,10)   4.904                             1.077                       6.129                         Cluster 2
A14 (3,8)    2.953                             1.887                       6.824                         Cluster 2
A15 (6,11)   6.223                             2.441                       3.010                         Cluster 2
Results from the 2nd iteration of k-means clustering
Now, we have completed the second iteration of the k-means clustering algorithm and assigned
each point to an updated cluster. In the table above, you can observe that each point is assigned
to the cluster with the closest new centroid.
Now, we will calculate the new centroid for each cluster for the third iteration.
• In cluster 1, we have 5 points, i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6).
To calculate the new centroid for cluster 1, we find the mean of the x and y
coordinates of each point in the cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
• In cluster 2, we have 7 points, i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), A13 (3,10),
A14 (3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571).
• In cluster 3, we have 3 points, i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new
centroid for cluster 3 is (10, 11.333).
At this point, we have calculated new centroids for each cluster. Now, we will calculate the distance
of each data point from the new centroids. Then, we will assign the points to clusters based on
their distance from the centroids. The results for this process have been given in the following
table.
Point        Distance from C1 (4, 4.6)   Distance from C2 (4.143, 9.571)   Distance from C3 (10, 11.333)   Assigned Cluster
A1 (2,10)    5.758                       2.186                             8.110                           Cluster 2
A2 (2,6)     2.441                       4.165                             9.615                           Cluster 1
A3 (11,11)   9.485                       7.004                             1.054                           Cluster 3
A4 (6,9)     4.833                       1.943                             4.631                           Cluster 2
A5 (6,4)     2.088                       5.873                             8.353                           Cluster 1
A6 (1,2)     3.970                       8.198                             12.966                          Cluster 1
A7 (5,10)    5.492                       0.958                             5.175                           Cluster 2
A8 (4,9)     4.400                       0.589                             6.438                           Cluster 2
A9 (10,12)   9.527                       6.341                             0.667                           Cluster 3
A10 (7,5)    3.027                       5.390                             7.008                           Cluster 1
A11 (9,11)   8.122                       5.063                             1.054                           Cluster 3
A12 (4,6)    1.400                       3.574                             8.028                           Cluster 1
A13 (3,10)   5.492                       1.221                             7.126                           Cluster 2
A14 (3,8)    3.544                       1.943                             7.753                           Cluster 2
A15 (6,11)   6.705                       2.343                             4.014                           Cluster 2
Results from the 3rd iteration of k-means clustering
Now, we have completed the third iteration of the k-means clustering algorithm and assigned each
point to an updated cluster. In the table above, you can observe that each point is assigned to the
cluster with the closest new centroid.
Now, we will once again calculate the new centroid for each cluster.
• In cluster 1, we have 5 points, i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To
calculate the new centroid for cluster 1, we find the mean of the x and y coordinates of
each point in the cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
• In cluster 2, we have 7 points, i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), A13 (3,10), A14
(3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571).
• In cluster 3, we have 3 points, i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new
centroid for cluster 3 is (10, 11.333).
Here, you can observe that no point has changed its cluster compared to the previous iteration.
Because of this, the centroids also remain unchanged. Therefore, we say that the clusters have
stabilized, and the clusters obtained after the third iteration are the final clusters for the given
dataset. If we plot the clusters on a graph, they look as follows.
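A short matplotlib sketch that draws the final clusters and marks the final centroids (cluster memberships taken from the third iteration above):

```python
import matplotlib.pyplot as plt

# Final clusters from the third iteration.
clusters = {
    "Cluster 1": [(2, 6), (6, 4), (1, 2), (7, 5), (4, 6)],
    "Cluster 2": [(2, 10), (6, 9), (5, 10), (4, 9), (3, 10), (3, 8), (6, 11)],
    "Cluster 3": [(11, 11), (10, 12), (9, 11)],
}
centroids = [(4, 4.6), (4.143, 9.571), (10, 11.333)]

for (label, pts), marker in zip(clusters.items(), ("o", "s", "^")):
    xs, ys = zip(*pts)
    plt.scatter(xs, ys, marker=marker, label=label)
cx, cy = zip(*centroids)
plt.scatter(cx, cy, marker="x", s=100, color="black", label="Centroids")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```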
Naive Bayes
Using Bayes' theorem, you can build a learner that predicts the probability that the
response variable belongs to a particular class, given a new set of attributes.
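In symbols, for a class Y and attributes X1, …, Xn, Naive Bayes applies Bayes' theorem under the simplifying assumption that the attributes are conditionally independent given the class:

```latex
P(Y \mid X_1, \dots, X_n)
  = \frac{P(X_1 \mid Y)\, P(X_2 \mid Y) \cdots P(X_n \mid Y)\, P(Y)}
         {P(X_1)\, P(X_2) \cdots P(X_n)}
```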
Example:
Consider a situation where you have 1000 fruits, each of which is either
'banana', 'apple' or 'other'. These are the possible classes of the
variable Y.
The data contains the following X variables, all of which are binary (0
or 1):
• Long
• Sweet
• Yellow
The training data is summarized in the table below.
Type     Long   Not Long   Sweet   Not Sweet   Yellow   Not Yellow   Total
Banana   400    100        350     150         450      50           500
Apple    0      300        150     150         300      0            300
Other    100    100        150     50          50       150          200
Total    500    500        650     350         800      200          1000
Consider a case where you're given that a fruit is long, sweet and yellow,
and you need to predict what type of fruit it is. This amounts to
predicting Y for a new observation whose X attributes are known, and
you can solve it with Naive Bayes.
Step 1:
First, compute the proportion of each fruit class out of all the fruits
in the population; this is the prior probability of each class.
The training dataset contains 1000 records. Of these, 500 are bananas,
300 are apples and 200 are other fruits, so the priors are 0.5, 0.3 and
0.2 respectively.
Step 2:
Next, compute the probability of evidence, i.e. the overall proportion of
each attribute: P(Long) = 500/1000 = 0.5, P(Sweet) = 650/1000 = 0.65 and
P(Yellow) = 800/1000 = 0.8. Hence P(Evidence) = 0.5 × 0.65 × 0.8 = 0.26.
Step 3:
Then, compute the probability of likelihood of the evidences for each
class. For 'Banana': P(Long|Banana) = 400/500 = 0.8, P(Sweet|Banana) =
350/500 = 0.7 and P(Yellow|Banana) = 450/500 = 0.9.
Step 4:
The last step is to substitute the results of the first three steps into
the mathematical expression of Naive Bayes to get the probability.
P(Banana | Long, Sweet, Yellow)
= [P(Long|Banana) × P(Sweet|Banana) × P(Yellow|Banana) × P(Banana)] / [P(Long) × P(Sweet) × P(Yellow)]
= (0.8 × 0.7 × 0.9 × 0.5) / 0.26 = 0.252 / 0.26 ≈ 0.97
In a similar way, you can also compute the probabilities for 'Apple' and
'Other'. The denominator is the same for every class, so it can even be
dropped when you only need to compare classes; the fruit is assigned to
the class with the highest posterior probability.
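Putting the whole calculation in code, here is a small sketch that computes all three posteriors from the counts in the table above (the dictionary layout and variable names are illustrative):

```python
# Counts from the fruit table: class -> (total, long, sweet, yellow).
counts = {
    "Banana": (500, 400, 350, 450),
    "Apple":  (300,   0, 150, 300),
    "Other":  (200, 100, 150,  50),
}
n = 1000

# P(Long) * P(Sweet) * P(Yellow): the evidence, identical for every class.
evidence = (500 / n) * (650 / n) * (800 / n)  # = 0.26

for fruit, (total, n_long, n_sweet, n_yellow) in counts.items():
    prior = total / n
    likelihood = (n_long / total) * (n_sweet / total) * (n_yellow / total)
    posterior = likelihood * prior / evidence
    print(f"P({fruit} | Long, Sweet, Yellow) = {posterior:.3f}")
# Banana: 0.969, Apple: 0.000, Other: 0.072 -> the fruit is classified as a banana.
```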