
K-MEANS CLUSTERING ALGORITHM

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and
so on.

• It is an iterative algorithm that divides the unlabeled dataset into K different clusters in
  such a way that each data point belongs to only one group of points with similar properties.
• It allows us to cluster the data into different groups and provides a convenient way to discover
  the categories of groups in the unlabeled dataset on its own, without the need for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The
  main aim of the algorithm is to minimize the sum of distances between each data point and
  the centroid of its cluster.
• The algorithm takes the unlabeled dataset as input, divides the dataset into K
  clusters, and repeats the process until the best clusters are found. The value of K
  must be predetermined in this algorithm.

The working of the K-Means algorithm is explained in the steps below:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids. (They need not be taken from the input dataset.)

Step-3: Assign each data point to its closest centroid by computing the Euclidean distance,
which forms the predefined K clusters.

Step-4: Compute the mean of the points in each cluster and place the new centroid of each cluster there.

Step-5: Repeat the third step, i.e. reassign each data point to its new closest centroid.

Step-6: If any reassignment occurred, go to Step-4; otherwise, go to FINISH.

Step-7: The model is ready.
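These steps translate almost line-for-line into code. The following is a minimal NumPy sketch (not taken from the original text): it initializes the centroids by sampling from the dataset, whereas the worked example below fixes them by hand; empty clusters are not handled.

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Steps 1-7 above: iterate assignment and centroid updates until stable."""
    rng = np.random.default_rng(seed)
    # Step 2: choose K random points from the dataset as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: if no centroid moved, no reassignment can occur, so stop.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```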

Problem 1:
Now that we have discussed the algorithm, let us solve a numerical problem on k-means
clustering. You are given 15 points in the Cartesian coordinate system, listed below.

Point   Coordinates
A1      (2, 10)
A2      (2, 6)
A3      (11, 11)
A4      (6, 9)
A5      (6, 4)
A6      (1, 2)
A7      (5, 10)
A8      (4, 9)
A9      (10, 12)
A10     (7, 5)
A11     (9, 11)
A12     (4, 6)
A13     (3, 10)
A14     (3, 8)
A15     (6, 11)

Input Dataset

We are also given the information that we need to make 3 clusters, i.e. K=3. We
will solve this numerical problem on k-means clustering using the approach discussed below.

First, we will randomly choose 3 centroids from the given data. Let us consider A2 (2,6), A7 (5,10),
and A15 (6,11) as the centroids of the initial clusters. Hence, we will consider that

• Centroid 1 = (2, 6) is associated with cluster 1.
• Centroid 2 = (5, 10) is associated with cluster 2.
• Centroid 3 = (6, 11) is associated with cluster 3.

Now we will find the Euclidean distance between each point and the centroids.
Based on the minimum distance of each point from the centroids, we will assign each point to a
cluster. The distances of the given points from the centroids are tabulated below.

Point         Distance from       Distance from        Distance from        Assigned
              Centroid 1 (2, 6)   Centroid 2 (5, 10)   Centroid 3 (6, 11)   Cluster
A1  (2, 10)   4.000               3.000                4.123                Cluster 2
A2  (2, 6)    0.000               5.000                6.403                Cluster 1
A3  (11, 11)  10.296              6.083                5.000                Cluster 3
A4  (6, 9)    5.000               1.414                2.000                Cluster 2
A5  (6, 4)    4.472               6.083                7.000                Cluster 1
A6  (1, 2)    4.123               8.944                10.296               Cluster 1
A7  (5, 10)   5.000               0.000                1.414                Cluster 2
A8  (4, 9)    3.606               1.414                2.828                Cluster 2
A9  (10, 12)  10.000              5.385                4.123                Cluster 3
A10 (7, 5)    5.099               5.385                6.083                Cluster 1
A11 (9, 11)   8.602               4.123                3.000                Cluster 3
A12 (4, 6)    2.000               4.123                5.385                Cluster 1
A13 (3, 10)   4.123               2.000                3.162                Cluster 2
A14 (3, 8)    2.236               2.828                4.243                Cluster 1
A15 (6, 11)   6.403               1.414                0.000                Cluster 3

Results from 1st iteration of K-means clustering

At this point, we have completed the first iteration of the k-means clustering algorithm and assigned
each point to a cluster. In the above table, you can observe that each point is assigned to the
cluster whose centroid is closest to it. This assignment step can be reproduced with the short
sketch below.
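A quick check of the table, assuming NumPy; the point and centroid arrays mirror the data above:

```python
import numpy as np

# The 15 input points A1..A15 and the initial centroids A2, A7, A15.
points = np.array([(2, 10), (2, 6), (11, 11), (6, 9), (6, 4),
                   (1, 2), (5, 10), (4, 9), (10, 12), (7, 5),
                   (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)], dtype=float)
centroids = np.array([(2, 6), (5, 10), (6, 11)], dtype=float)

# Euclidean distance from every point to every centroid (15 x 3 matrix),
# then assign each point to its nearest centroid.
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

for i, (row, c) in enumerate(zip(dists.round(3), labels), start=1):
    print(f"A{i}: distances={row}, assigned to Cluster {c + 1}")
```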

Now, we will calculate the new centroid for each cluster.

• In cluster 1, we have 6 points, i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), A12 (4,6),
  and A14 (3,8). To calculate the new centroid for cluster 1, we find the mean of the x and y
  coordinates of each point in the cluster (see the sketch after this list). Hence, the new
  centroid for cluster 1 is (3.833, 5.167).
• In cluster 2, we have 5 points, i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), and A13
  (3,10). Hence, the new centroid for cluster 2 is (4, 9.6).
• In cluster 3, we have 4 points, i.e. A3 (11,11), A9 (10,12), A11 (9,11), and A15 (6,11).
  Hence, the new centroid for cluster 3 is (9, 11.25).
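Continuing the sketch above, the centroid update is just a per-cluster mean:

```python
# Move each centroid to the mean of the points assigned to it.
new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(3)])
print(new_centroids.round(3))
# [[ 3.833  5.167]
#  [ 4.     9.6  ]
#  [ 9.    11.25 ]]
```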

Now that we have calculated new centroids for each cluster, we will calculate the distance of each
data point from the new centroids. Then, we will assign the points to clusters based on their
distance from the centroids. The results for this process have been given in the following table.

Point         Distance from               Distance from         Distance from           Assigned
              Centroid 1 (3.833, 5.167)   Centroid 2 (4, 9.6)   Centroid 3 (9, 11.25)   Cluster
A1  (2, 10)   5.169                       2.040                 7.111                   Cluster 2
A2  (2, 6)    2.013                       4.118                 8.750                   Cluster 1
A3  (11, 11)  9.241                       7.139                 2.016                   Cluster 3
A4  (6, 9)    4.403                       2.088                 3.750                   Cluster 2
A5  (6, 4)    2.461                       5.946                 7.846                   Cluster 1
A6  (1, 2)    4.249                       8.171                 12.230                  Cluster 1
A7  (5, 10)   4.972                       1.077                 4.191                   Cluster 2
A8  (4, 9)    3.837                       0.600                 5.483                   Cluster 2
A9  (10, 12)  9.204                       6.462                 1.250                   Cluster 3
A10 (7, 5)    3.171                       5.492                 6.562                   Cluster 1
A11 (9, 11)   7.792                       5.192                 0.250                   Cluster 3
A12 (4, 6)    0.850                       3.600                 7.250                   Cluster 1
A13 (3, 10)   4.904                       1.077                 6.129                   Cluster 2
A14 (3, 8)    2.953                       1.887                 6.824                   Cluster 2
A15 (6, 11)   6.223                       2.441                 3.010                   Cluster 2

Results from 2nd iteration of K-means clustering

Now, we have completed the second iteration of the k-means clustering algorithm and assigned
each point to its updated cluster. In the above table, you can observe that each point is
assigned to the cluster with the closest new centroid.

Now, we will calculate the new centroid for each cluster for the third iteration.

• In cluster 1, we have 5 points, i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6).
  To calculate the new centroid for cluster 1, we find the mean of the x and y
  coordinates of each point in the cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
• In cluster 2, we have 7 points, i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), A13 (3,10),
  A14 (3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571).
• In cluster 3, we have 3 points, i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new
  centroid for cluster 3 is (10, 11.333).

At this point, we have calculated new centroids for each cluster. Now, we will calculate the distance
of each data point from the new centroids. Then, we will assign the points to clusters based on
their distance from the centroids. The results for this process have been given in the following
table.

Point         Distance from         Distance from               Distance from             Assigned
              Centroid 1 (4, 4.6)   Centroid 2 (4.143, 9.571)   Centroid 3 (10, 11.333)   Cluster
A1  (2, 10)   5.758                 2.186                       8.110                     Cluster 2
A2  (2, 6)    2.441                 4.165                       9.615                     Cluster 1
A3  (11, 11)  9.485                 7.004                       1.054                     Cluster 3
A4  (6, 9)    4.833                 1.943                       4.631                     Cluster 2
A5  (6, 4)    2.088                 5.872                       8.353                     Cluster 1
A6  (1, 2)    3.970                 8.197                       12.966                    Cluster 1
A7  (5, 10)   5.492                 0.958                       5.175                     Cluster 2
A8  (4, 9)    4.400                 0.589                       6.438                     Cluster 2
A9  (10, 12)  9.527                 6.341                       0.667                     Cluster 3
A10 (7, 5)    3.027                 5.390                       7.008                     Cluster 1
A11 (9, 11)   8.122                 5.063                       1.054                     Cluster 3
A12 (4, 6)    1.400                 3.574                       8.028                     Cluster 1
A13 (3, 10)   5.492                 1.221                       7.126                     Cluster 2
A14 (3, 8)    3.544                 1.943                       7.753                     Cluster 2
A15 (6, 11)   6.705                 2.343                       4.014                     Cluster 2

Results from 3rd iteration of K-means clustering

Now, we have completed the third iteration of the k-means clustering algorithm and assigned each
point to its updated cluster. In the above table, you can observe that each point is assigned to
the cluster with the closest new centroid.

Now, we will once again calculate the new centroid for each cluster.

• In cluster 1, we have 5 points, i.e. A2 (2,6), A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To
  calculate the new centroid for cluster 1, we find the mean of the x and y coordinates of
  each point in the cluster. Hence, the new centroid for cluster 1 is (4, 4.6).
• In cluster 2, we have 7 points, i.e. A1 (2,10), A4 (6,9), A7 (5,10), A8 (4,9), A13 (3,10), A14
  (3,8), and A15 (6,11). Hence, the new centroid for cluster 2 is (4.143, 9.571).
• In cluster 3, we have 3 points, i.e. A3 (11,11), A9 (10,12), and A11 (9,11). Hence, the new
  centroid for cluster 3 is (10, 11.333).

Here, you can observe that no point has changed its cluster compared to the previous iteration.
Because of this, the centroids also remain unchanged. Therefore, we say that the clusters have
stabilized, and the clusters obtained after the third iteration are the final clusters for the
given dataset. If we plot the clusters on a graph, the result looks as follows.

Plot for K-Means Clustering


In the above plot, points in the clusters have been plotted using red, blue, and black markers, and
the centroids of the clusters have been marked using green circles. A sketch that reproduces such
a plot follows.
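This is an illustrative sketch, not the original figure's code; it assumes matplotlib, the points array and k_means function from the earlier sketches, and with random initialization the cluster numbering may differ from the worked example:

```python
import matplotlib.pyplot as plt

# Run k-means (the k_means sketch above) to get converged labels and centroids.
centroids, labels = k_means(points, k=3)

# Colors follow the description above: red/blue/black clusters, green centroids.
for j, color in enumerate(['red', 'blue', 'black']):
    cluster = points[labels == j]
    plt.scatter(cluster[:, 0], cluster[:, 1], c=color, label=f'Cluster {j + 1}')
plt.scatter(centroids[:, 0], centroids[:, 1], c='green', s=150, label='Centroids')
plt.title('K-Means Clustering')
plt.legend()
plt.show()
```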

Naive Bayes Theorem in Machine Learning

Consider a simple problem where you need to learn a machine learning
model from a given set of attributes. You first describe a hypothesis,
i.e. a relation between the attributes and a response variable, and then
use this relation to predict a response for a given set of attributes.

Using Bayes' Theorem, you can create a learner that predicts the
probability that the response variable belongs to a particular class,
given a new set of attributes.

Assume that A is the response variable and B is the given attribute. Then,
according to the equation of Bayes' Theorem, we have:

P(A|B) = [P(B|A) × P(A)] / P(B)

P(A|B): The conditional probability that the response variable takes a
particular value, given the input attributes; also known as the posterior
probability.

P(A): The prior probability of the response variable.

P(B): The probability of the training data (input attributes), or the
evidence.

P(B|A): The likelihood of the training data given the response variable.

Bayes' Theorem can be reformulated in correspondence with the machine
learning algorithm as:

posterior = (prior × likelihood) / evidence
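When a record has several attributes, Naive Bayes additionally assumes they are conditionally independent given the class. The original text does not state this explicitly, but the example below relies on it: the rule expands to the following standard form, with the denominator approximated as a product of the attribute marginals, as the example does.

```latex
P(Y \mid x_1, \dots, x_n)
  = \frac{P(Y)\,\prod_{i=1}^{n} P(x_i \mid Y)}{\prod_{i=1}^{n} P(x_i)}
```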

Example:

Consider a situation where you have 1000 fruits, each of which is either
'banana', 'apple' or 'other'. These are the possible classes of the
variable Y.

The data contains the following X variables, all of which are binary (0
or 1):

• Long
• Sweet
• Yellow

Type     Long   Not Long   Sweet   Not Sweet   Yellow   Not Yellow   Total
Banana   400    100        350     150         450      50           500
Apple    0      300        150     150         300      0            300
Other    100    100        150     50          50       150          200
Total    500    500        650     350         800      200          1000

The main goal of the classifier is to predict whether a given fruit is a
'Banana', an 'Apple' or 'Other' when the three attributes (long, sweet
and yellow) are known.

Consider a case where you are told that a fruit is long, sweet and yellow,
and you need to predict what type of fruit it is. This is the same as
predicting Y when only the X attributes are known, and it can be solved
easily using Naive Bayes.

All you need to do is compute the 3 posterior probabilities, i.e. the
probability of the fruit being a banana, an apple or other. The class with
the highest probability will be your answer.

Step 1:

First of all, you need to compute the proportion of each fruit class
among all the fruits in the population; this is the prior probability of
each fruit class.

The Prior probability can be calculated from the training dataset:

P(Y=Banana) = 500 / 1000 = 0.50
P(Y=Apple) = 300 / 1000 = 0.30
P(Y=Other) = 200 / 1000 = 0.20

The training dataset contains 1000 records. Out of which, you have 500
bananas, 300 apples and 200 others. So the priors are 0.5, 0.3 and 0.2
respectively.
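A minimal sketch of this step in Python (names such as counts and priors are illustrative, not from the original):

```python
# Class counts taken from the fruit table above.
total = 1000
counts = {"Banana": 500, "Apple": 300, "Other": 200}

# Step 1: prior probability of each class.
priors = {y: n / total for y, n in counts.items()}
print(priors)  # {'Banana': 0.5, 'Apple': 0.3, 'Other': 0.2}
```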

Step 2:

Secondly, you need to calculate the probability of evidence, which goes
into the denominator. Here it is computed as the product of the marginal
probabilities P(x) of the individual attributes:

P(x1=Long) = 500 / 1000 = 0.50

P(x2=Sweet) = 650 / 1000 = 0.65

P(x3=Yellow) = 800 / 1000 = 0.80
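The same step as a short sketch:

```python
# Step 2: probability of evidence, the product of the attribute marginals.
p_long, p_sweet, p_yellow = 500 / 1000, 650 / 1000, 800 / 1000
evidence = p_long * p_sweet * p_yellow
print(round(evidence, 4))  # 0.26
```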

Step 3:

The third step is to compute the likelihood, which is simply the product
of the conditional probabilities of the 3 attributes given the class.

The likelihood for Banana:

P(x1=Long | Y=Banana) = 400 / 500 = 0.80

P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70

P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90

Therefore, the overall likelihood for banana is the product of the above
three, i.e. 0.8 × 0.7 × 0.9 = 0.504.
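And this step as a sketch, using the class counts from the table:

```python
# Step 3: likelihood of (Long, Sweet, Yellow) given Banana.
likelihood_banana = (400 / 500) * (350 / 500) * (450 / 500)
print(round(likelihood_banana, 4))  # 0.8 * 0.7 * 0.9 = 0.504
```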

Step 4:
The last step is to substitute all three quantities into the Naive Bayes
expression to get the posterior probability:

P(Banana | Long, Sweet, Yellow)
= [P(Long|Banana) × P(Sweet|Banana) × P(Yellow|Banana) × P(Banana)] / [P(Long) × P(Sweet) × P(Yellow)]
= (0.8 × 0.7 × 0.9 × 0.5) / 0.26
= 0.252 / 0.26 ≈ 0.97

P(Apple | Long, Sweet, Yellow) = 0, because P(Long|Apple) = 0.

P(Other | Long, Sweet, Yellow) = 0.01875 / 0.26 ≈ 0.072.

The denominator (the probability of evidence) is the same for all classes,
so it can be dropped when only comparing them.

Banana gets the highest posterior probability, so Banana will be the
predicted class. The sketch below pulls the whole computation together.
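A self-contained sketch of all four steps; note the three posteriors need not sum exactly to 1 because the evidence term is only a product-of-marginals approximation of the true joint probability, which does not affect the comparison:

```python
# Counts from the fruit table: total per class and (Long, Sweet, Yellow) per class.
total = 1000
counts = {"Banana": 500, "Apple": 300, "Other": 200}
cond_counts = {
    "Banana": (400, 350, 450),
    "Apple":  (0, 150, 300),
    "Other":  (100, 150, 50),
}
# Step 2: evidence as the product of the attribute marginals (Long, Sweet, Yellow).
evidence = (500 / total) * (650 / total) * (800 / total)  # = 0.26

posteriors = {}
for y, attrs in cond_counts.items():
    prior = counts[y] / total                        # Step 1
    likelihood = 1.0
    for c in attrs:                                  # Step 3
        likelihood *= c / counts[y]                  # P(attribute | class)
    posteriors[y] = prior * likelihood / evidence    # Step 4

print({y: round(p, 3) for y, p in posteriors.items()})
# {'Banana': 0.969, 'Apple': 0.0, 'Other': 0.072}
print("Predicted:", max(posteriors, key=posteriors.get))  # Banana
```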
