
UNSUPERVISED LEARNING
CLUSTERING AND THE k-MEANS ALGORITHM
Clustering
• The data in the dataset is analyzed to identify clusters of data points that
share similar attributes.
k-means Algorithm
• k-means clustering attempts to divide data into k discrete groups and is
effective at uncovering basic data patterns.
• The algorithm works by first splitting the data into k clusters, where k is the
number of clusters you wish to create. For example, if you choose to split your
dataset into three clusters, then k is set to 3.
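As a starting point, the sketch below runs k-means with k = 3 using scikit-learn; the synthetic blob data and the parameter values are illustrative assumptions, not part of the original material.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy 2-D data with three natural groupings (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit k-means with k = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # final centroid coordinates
print(labels[:10])               # cluster assignments of the first 10 points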
How does k-means clustering separate the data points?

• The first step is to examine the unclustered data on the scatterplot and
manually select a centroid for each of the k clusters. That centroid then forms the
epicenter of an individual cluster. Centroids can be chosen at random, which
means you can nominate any data point on the scatterplot to act as a
centroid.
• The remaining data points on the scatterplot are then assigned to the closest
centroid by measuring the Euclidean distance. Each data point can be
assigned to only one cluster and each cluster is discrete. This means that
there is no overlap between clusters and no case of nesting a cluster inside
another cluster.
• After all data points have been allocated to a centroid, the next step is to
calculate the mean value of all data points in each cluster, which can be
found by averaging the x and y values of all data points in that cluster.
• Next, take the mean value of the data points in each cluster and plug in those
x and y values to update your centroid coordinates (these two steps then repeat
until the centroids stop moving; a from-scratch sketch follows below).
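A minimal sketch of the assignment and update steps using only NumPy; the random 2-D data, k = 3, and the fixed iteration count are illustrative assumptions rather than part of the original material.

import numpy as np

def kmeans(X, k, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialisation: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# Example usage on random 2-D points
X = np.random.default_rng(1).normal(size=(200, 2))
centroids, labels = kmeans(X, k=3)
print(centroids)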
k-means Clustering Algorithm

• The algorithm repeats two steps until the cluster assignments stop changing:
• "Assigning" each training example x^(i) to the closest cluster centroid μ_j
• Moving each cluster centroid μ_j to the mean of the points assigned to it

• k (a parameter of the algorithm) is the number of clusters we want to find; and
• μ_1, ..., μ_k are the cluster centroids, which represent our current guesses for the positions of the centers of the clusters
• The original dataset (Figure 3) is clustered/divided into k clusters
• Cluster centroids are randomly chosen or initialised (Figure 4)
• "Assigning" each training example x^(i) to the closest cluster centroid μ_j
(Figure 5), painting the training examples the same color as the cluster
centroid to which each is assigned

Figure 3: Original dataset
Figure 4: k = 2, centroids are marked with '×'
Figure 5: Assigning data points to the closest cluster
• After the training examples are assigned to the closest cluster as seen in Figure 5, in
Figure 6(a) the centroid of each cluster is moved to the mean of the data points
assigned to it
• In Figure 6(b), the data points are reassigned to the closest cluster
• Again, in Figure 6(c), the centroid of each cluster is moved to the mean of the data points
assigned to it

Figure 6: Illustration of running two iterations of k-means


Setting k
• As k increases, clusters become smaller and variance falls, but neighboring
clusters also become less distinct from one another.
• A scree plot charts the degree of scattering (variance) inside a cluster as the total
number of clusters increases.
• SSE is measured as the sum of the squared distances between the centroid and the
other data points inside the cluster. In a nutshell, SSE drops as more clusters are
formed. (In Figure 7, 2 or 3 clusters appear to be the ideal solution; a sketch of
computing SSE for a range of k values follows Figure 7.)
• A simpler, non-mathematical approach to setting k is applying domain
knowledge.

Figure 7: Scree plot (y-axis: sum of squared error; x-axis: number of clusters)
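A sketch of producing the scree (elbow) plot: k-means is fit for a range of k values and the within-cluster SSE (scikit-learn's inertia_) is recorded for each; the toy blob data is an illustrative assumption.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 11)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)  # within-cluster sum of squared distances

# Plot SSE against the number of clusters and look for the "elbow"
plt.plot(list(ks), sse, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Sum of Squared Error")
plt.show()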
BIAS & VARIANCE
• Algorithm selection is an important step in forming an accurate prediction
model.
• Each algorithm can produce vastly different models depending on the
hyperparameters provided, and these differences can lead to dramatically different
results. Hyperparameters are the algorithm's settings, specified as lines of code.

• Challenges in ML:
• Underfitting
• Overfitting

• Bias refers to the gap between your predicted value and the actual value
• Variance describes how scattered your predicted values are.
• Imagine that the center of the target,
or the bull’s-eye, perfectly predicts the
correct value of your model.
• The dots marked on the target then
represent an individual realization of
our model based on our training data.
• The more the dots deviate from the
bull’s-eye, the higher the bias and the
less accurate the model will be in its
overall predictive ability.
• Ideally, we want a situation where
there is low variance and low bias.
More often, there is a trade-off
between optimal bias and variance
• Bias and variance both contribute to
error, but it is the prediction error that
we want to minimize, not bias or
variance specifically.

Figure 8: Shooting targets used to represent bias and variance
Model Complexity

Figure 9: Model complexity based on prediction error


• Mismanaging the bias-variance trade-off can lead to poor results, as shown in
Figure 10.
• Underfitting (low variance, high bias), overfitting (high variance, low bias)
• A natural temptation is to add complexity to the model (as shown on the right) in
order to improve accuracy, which can, in turn, lead to overfitting (a sketch comparing
training and test accuracy for models of increasing complexity follows Figure 10).
• An overfitted model will yield accurate predictions from the training data but prove
less accurate at formulating predictions from the test data.

Figure 10: Underfitting and overfitting in relation to model complexity
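A sketch of that pattern: polynomial regression models of increasing degree are scored on training data and on held-out test data, so an overfitted model's drop in test accuracy becomes visible. The synthetic sine data and the chosen degrees are illustrative assumptions.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Noisy samples from a sine curve (illustrative data)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in [1, 4, 15]:  # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree}: train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")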


• Overfitting can also occur if the training and test data aren’t randomized
before they are split and patterns in the data aren’t distributed across the
two segments of data.
• Underfitting occurs when your model is overly simple and has not captured the
underlying patterns in the dataset.
• Underfitting can lead to inaccurate predictions for both the training data and test
data.
• Common causes of underfitting include insufficient training data to adequately
cover all possible combinations, and situations where the training and test data
were not properly randomized.
To eradicate both underfitting and overfitting
• There may be a need to modify the model's hyperparameters to ensure that they fit
patterns in both the training and test data, and not just one half of the data.
• A suitable fit should acknowledge major trends in the data and play down or even omit
minor variations.
• This may also mean re-randomizing the training and test data or adding new data
points so as to better detect underlying patterns.
• There may be a need to consider switching algorithms or modifying your
hyperparameters based on trial and error to minimize and manage the bias-variance
trade-off.
• We may introduce regularization to combat overfitting and underfitting.
Regularization artificially amplifies bias error by penalizing an increase in a
model’s complexity. In effect, this add-on parameter provides a warning alert to
keep high variance in check while the original parameters are being optimized.
• Another effective technique to contain overfitting and underfitting in your model is
to perform cross-validation, which minimizes any discrepancies between the training
data and the test data (a sketch combining regularization and cross-validation follows below).
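A minimal sketch combining the two ideas above, assuming scikit-learn: L2 regularization (Ridge) as one concrete form of regularization, evaluated with 5-fold cross-validation. The dataset and the alpha values are illustrative assumptions.

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

# Synthetic regression data (illustrative assumption)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Larger alpha = stronger penalty on model complexity (more bias, less variance)
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha)
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"alpha={alpha}: mean cross-validated R^2 = {scores.mean():.3f}")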
