Lecture 25 K Means Clustering
[Figure: example clusters of books by genre (non-fiction, children's); the label of the closest cluster is used to classify a new sample.]
Repeat:
• Assign each sample to the closest centroid
• Calculate the mean of each cluster to determine its new centroid
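To make this loop concrete, here is a minimal sketch of k-means in Python with NumPy. It is illustrative only, not the course's reference implementation; the function name kmeans, the choice of random samples as initial centroids, and the fixed iteration count are assumptions for this example.

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        # X is an (n_samples, n_features) array; k is the number of clusters.
        rng = np.random.default_rng(seed)
        # Initialize centroids by picking k distinct random samples (one common choice).
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # Assign each sample to its closest centroid (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recalculate each centroid as the mean of its cluster;
            # keep the old centroid if a cluster ends up empty.
            centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
        return labels, centroids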
Solution:
Run k-means multiple times with different random initial centroids, and choose the best results (for example, the run with the lowest error).
Caveats:
Choosing the run with the lower error does not mean that cluster set 1 is more 'correct' than cluster set 2.
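As a rough sketch of this restart strategy, the following snippet reuses the kmeans() function sketched earlier and keeps the run with the lowest WSSE; the helper names wsse and best_of_n_runs and the default of 10 runs are assumptions for illustration.

    import numpy as np

    def wsse(X, labels, centroids):
        # Within-cluster sum of squared error: squared distance of each
        # sample to its assigned centroid, summed over all samples.
        return ((X - centroids[labels]) ** 2).sum()

    def best_of_n_runs(X, k, n_runs=10):
        best = None
        for seed in range(n_runs):
            labels, centroids = kmeans(X, k, seed=seed)  # kmeans() as sketched above
            err = wsse(X, labels, centroids)
            if best is None or err < best[0]:
                best = (err, labels, centroids)
        return best  # (lowest WSSE, its labels, its centroids)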
• Data-Driven: There are also data-driven methods for determining the value
of k. These methods calculate a metric for different values of k to determine the
best selection of k. One such method is the elbow method.
Elbow Method for Choosing k
[Plot: WSSE versus k; the "elbow" in the curve suggests the value for k should be 3.]
The elbow method for determining the value of k is shown on this plot. As we saw on the
previous slide, WSSE, or within-cluster sum of squared error, measures how much the data
samples deviate from their respective centroids in a set of clustering results. If we plot WSSE
for different values of k, we can see how this error measure changes as the value of k changes,
as seen in the plot. The bend in the error curve indicates diminishing gains from adding more
clusters, so this elbow in the curve provides a suggestion for a good value of k.
Note that the elbow cannot always be unambiguously determined, especially for complex
data. In many cases, the error curve will not clearly suggest a single value but several;
this can then be used as a guideline for the range of values to try for k.
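A hedged sketch of the elbow plot using scikit-learn, whose inertia_ attribute is exactly the WSSE described here (the sum over all samples of the squared distance from each sample to its assigned centroid). X is assumed to be the (n_samples, n_features) data matrix, and the range k = 1..10 is an arbitrary choice for illustration.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    ks = range(1, 11)
    wsse_values = []
    for k in ks:
        # n_init restarts with different random centroids, keeping the best run
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wsse_values.append(km.inertia_)  # inertia_ is the WSSE for this k

    plt.plot(ks, wsse_values, marker="o")
    plt.xlabel("k (number of clusters)")
    plt.ylabel("WSSE")
    plt.show()  # look for the 'elbow' where the curve bends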
Stopping Criteria
When to stop iterating?
• No changes to centroids: How do you know when to stop
iterating when using k-means? One obvious stopping criterion is when
there are no changes to the centroids. This means that no samples would
change cluster assignments, and recalculating the centroids would not result
in any changes, so additional iterations will not bring about any more
changes to the cluster results.
• Number of samples changing clusters is below
threshold: The stopping criterion can be relaxed to a second
stopping criterion: stop when the number of samples changing clusters
is below a certain threshold, say 1% for example. At this point, the
clusters are changing by only a few samples, resulting in only minimal
changes to the final cluster results, so the algorithm can be stopped here.
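Both stopping criteria can be folded into the iteration loop, as in this hedged Python/NumPy sketch; the 1% default threshold, the max_iters safety cap, and the function name are assumptions for illustration.

    import numpy as np

    def kmeans_with_stopping(X, k, threshold=0.01, max_iters=300, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        labels = np.full(len(X), -1)  # no assignments yet
        for _ in range(max_iters):
            # Assign each sample to its closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            changed = np.mean(new_labels != labels)  # fraction that switched clusters
            labels = new_labels
            # Recalculate centroids; keep the old one if a cluster is empty.
            centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # Stop when the fraction of samples changing clusters falls to or
            # below the threshold (use threshold=0 for the strict first criterion).
            if changed <= threshold:
                break
        return labels, centroids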