Unsupervised ML Clustering
LEARNING
CLUSTERING AND THE k-MEANS ALGORITHM
Clustering
• The information in a dataset is analyzed to identify clusters of data points
that share similar attributes.
k-means Algorithm
• k-means clustering attempts to divide data into k discrete groups and is
effective at uncovering basic data patterns.
• The k-means clustering algorithm begins by splitting the data into k
clusters, where k is the number of clusters you wish to create. If you choose
to split your dataset into three clusters, for example, then k is set to 3.
How does k-means clustering separate
the data points?
• The first step is to examine the unclustered data on the scatterplot and
manually select a centroid for each of the k clusters. Each centroid then
forms the epicenter of an individual cluster. Centroids can be chosen at
random, which means you can nominate any data point on the scatterplot to
act as a centroid.
• The remaining data points on the scatterplot are then assigned to the closest
centroid by measuring the Euclidean distance. Each data point can be
assigned to only one cluster and each cluster is discrete. This means that
there is no overlap between clusters and no case of nesting a cluster inside
another cluster.
• After all data points have been allocated to a centroid, the next step is to
compute the mean of the data points in each cluster, which is found by
averaging the x values and the y values of all data points in that cluster.
• Next, use those mean x and y values as the updated coordinates of each
cluster's centroid. The assignment and update steps are then repeated until
the centroids stop moving.
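The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: it assumes 2-D points stored as (x, y) tuples, the function names (euclidean, assign, update, k_means) are made up for this example, and the rule that an empty cluster keeps its old centroid is an assumption the slides do not address.

```python
import math
import random


def euclidean(p, q):
    """Straight-line (Euclidean) distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def assign(points, centroids):
    """For each point, return the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean x and mean y of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: a cluster with no points keeps its old centroid.
            new_centroids.append(old)
            continue
        mean_x = sum(p[0] for p in members) / len(members)
        mean_y = sum(p[1] for p in members) / len(members)
        new_centroids.append((mean_x, mean_y))
    return new_centroids


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, labels) after repeating assign/update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # nominate k data points at random
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return centroids, labels


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
            (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    centroids, labels = k_means(data, k=2)
    print(centroids)
    print(labels)
```

Because each point receives exactly one label, the clusters produced here are discrete and non-overlapping, as described above. Different seeds can pick different starting centroids, so on less cleanly separated data the final clusters may vary from run to run.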
k-means Clustering Algorithm
• Challenges in ML:
• Underfitting
• Overfitting
• Bias refers to the gap between your predicted value and the actual value.
• Variance describes how scattered your predicted values are.
• Imagine that the center of the target, or the bull's-eye, represents a
perfect prediction of the correct value.
• Each dot marked on the target then represents an individual realization of
our model based on our training data.
• The more the dots deviate from the bull's-eye, the higher the bias and the
less accurate the model will be in its overall predictive ability.
• Ideally, we want a situation where there is low variance and low bias. More
often, there is a trade-off between optimal bias and variance.
• Bias and variance both contribute to error, but it is the prediction error
that we want to minimize, not bias or variance specifically.
Figure 8: Shooting targets used to represent bias and variance
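The shooting-target picture can be made concrete with a small numeric sketch. The function name bias_variance and the sample numbers are invented for illustration, and the code makes two assumptions the slides do not spell out: bias is measured as the mean prediction minus the actual value, and variance is the spread of the predictions around their own mean. Under those definitions, the mean squared prediction error equals bias squared plus variance.

```python
from statistics import fmean, pvariance


def bias_variance(predictions, actual):
    """Summarize repeated predictions of one known target value.

    Returns (bias, variance, mse), where each prediction plays the role
    of one dot on the target and `actual` is the bull's-eye.
    """
    mean_prediction = fmean(predictions)
    bias = mean_prediction - actual                    # offset from the bull's-eye
    variance = pvariance(predictions, mu=mean_prediction)  # scatter of the dots
    mse = fmean([(p - actual) ** 2 for p in predictions])  # prediction error
    return bias, variance, mse


if __name__ == "__main__":
    actual = 10.0
    # Hypothetical model A: centered on the target but widely scattered.
    print(bias_variance([8.0, 12.0, 9.0, 11.0], actual))
    # Hypothetical model B: tightly grouped but off-center.
    print(bias_variance([12.9, 13.0, 13.1, 13.0], actual))
```

Model A has low bias and high variance, while model B has high bias and low variance. Comparing the third value returned for each shows why the prediction error, not bias or variance alone, is the quantity to minimize.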
Model Complexity