
UNSUPERVISED LEARNING
CLUSTERING AND THE k-MEANS ALGORITHM
Clustering
• The data in the dataset is analyzed to identify clusters of data points that
share similar attributes.
k-means Algorithm
• k-means clustering attempts to divide data into k discrete groups and is
effective at uncovering basic data patterns.
• The algorithm works by first splitting the data into k clusters, where k is the
number of clusters you wish to create. For example, if you choose to split your
dataset into three clusters, then k is set to 3.
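As a starting point, the sketch below runs k-means with k = 3 using scikit-learn; the synthetic blob data and the parameter values are illustrative assumptions, not part of the original material.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy 2-D data with three natural groupings (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit k-means with k = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)   # final centroid coordinates
print(labels[:10])               # cluster assignments of the first 10 points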
How does k-means clustering separate the data points?

• The first step is to examine the unclustered data on the scatterplot and
manually select a centroid for each of the k clusters. That centroid then forms the
epicenter of an individual cluster. Centroids can be chosen at random, which
means you can nominate any data point on the scatterplot to act as a
centroid.
• The remaining data points on the scatterplot are then assigned to the closest
centroid by measuring the Euclidean distance. Each data point can be
assigned to only one cluster and each cluster is discrete. This means that
there is no overlap between clusters and no case of nesting a cluster inside
another cluster.
• After all data points have been allocated to a centroid, the next step is to
calculate the mean value of all data points in each cluster, which can be
found by averaging the x and y values of all data points in that cluster.
• Next, take the mean value of the data points in each cluster and plug in those
x and y values to update your centroid coordinates (these two steps then repeat
until the centroids stop moving; a from-scratch sketch follows below).
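A minimal sketch of the assignment and update steps using only NumPy; the random 2-D data, k = 3, and the fixed iteration count are illustrative assumptions rather than part of the original material.

import numpy as np

def kmeans(X, k, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialisation: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# Example usage on random 2-D points
X = np.random.default_rng(1).normal(size=(200, 2))
centroids, labels = kmeans(X, k=3)
print(centroids)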
k-means Clustering Algorithm

• The algorithm repeats two steps until the cluster assignments stop changing:
• "Assigning" each training example x^(i) to the closest cluster centroid μ_j
• Moving each cluster centroid μ_j to the mean of the points assigned to it

• k (a parameter of the algorithm) is the number of clusters we want to find; and
• μ_1, ..., μ_k are the cluster centroids, which represent our current guesses for the positions of the centers of the clusters
• The original dataset (Figure 3) is clustered/divided into k clusters
• Cluster centroids are randomly chosen or initialised (Figure 4)
• "Assigning" each training example x^(i) to the closest cluster centroid μ_j
(Figure 5), painting the training examples the same color as the cluster
centroid to which each is assigned

Figure 3: Original dataset
Figure 4: k = 2, centroids are marked with '×'
Figure 5: Assigning data points to the closest cluster
• After the training examples are assigned to the closest cluster as seen in Figure 5, in
Figure 6(a) the centroid of each cluster is moved to the mean of the data points
assigned to it
• In Figure 6(b), the data points are reassigned to the closest cluster
• Again, in Figure 6(c), the centroid of each cluster is moved to the mean of the data points
assigned to it

Figure 6: Illustration of running two iterations of k-means


Setting k
• As k increases, clusters become smaller and variance falls, but neighboring
clusters also become less distinct from one another.
• A scree plot charts the degree of scattering (variance) inside a cluster as the total
number of clusters increases.
• SSE is measured as the sum of the squared distances between the centroid and the
other data points inside the cluster. In a nutshell, SSE drops as more clusters are
formed. (In Figure 7, 2 or 3 clusters appear to be the ideal solution; a sketch of
computing SSE for a range of k values follows Figure 7.)
• A simpler, non-mathematical approach to setting k is applying domain
knowledge.

Figure 7: Scree plot (y-axis: sum of squared error; x-axis: number of clusters)
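A sketch of producing the scree (elbow) plot: k-means is fit for a range of k values and the within-cluster SSE (scikit-learn's inertia_) is recorded for each; the toy blob data is an illustrative assumption.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 11)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse.append(km.inertia_)  # within-cluster sum of squared distances

# Plot SSE against the number of clusters and look for the "elbow"
plt.plot(list(ks), sse, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Sum of Squared Error")
plt.show()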
BIAS & VARIANCE
• Algorithm selection is an important step in forming an accurate prediction
model.
• Each algorithm can produce vastly different models depending on the
hyperparameters provided, and these differences can lead to dramatically different
results. Hyperparameters are the algorithm's settings, specified as lines of code.

• Challenges in ML:
• Underfitting
• Overfitting

• Bias refers to the gap between your predicted value and the actual value
• Variance describes how scattered your predicted values are.
• Imagine that the center of the target,
or the bull’s-eye, perfectly predicts the
correct value of your model.
• The dots marked on the target then
represent an individual realization of
our model based on our training data.
• The more the dots deviate from the
bull’s-eye, the higher the bias and the
less accurate the model will be in its
overall predictive ability.
• Ideally, we want a situation where
there is low variance and low bias.
More often, there is a trade-off
between optimal bias and variance
• Bias and variance both contribute to
error, but it is the prediction error that
we want to minimize, not bias or
variance specifically.

Figure 8: Shooting targets used to represent bias and variance
Model Complexity

Figure 9: Model complexity based on prediction error


• Mismanaging the bias-variance trade-off can lead to poor results, as shown in
Figure 10.
• Underfitting (low variance, high bias), overfitting (high variance, low bias)
• A natural temptation is to add complexity to the model (as shown on the right) in
order to improve accuracy, which can, in turn, lead to overfitting (a sketch comparing
training and test accuracy for models of increasing complexity follows Figure 10).
• An overfitted model will yield accurate predictions from the training data but prove
less accurate at formulating predictions from the test data.

Figure 10: Underfitting and overfitting in relation to model complexity
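A sketch of that pattern: polynomial regression models of increasing degree are scored on training data and on held-out test data, so an overfitted model's drop in test accuracy becomes visible. The synthetic sine data and the chosen degrees are illustrative assumptions.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Noisy samples from a sine curve (illustrative data)
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in [1, 4, 15]:  # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree {degree}: train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")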


• Overfitting can also occur if the training and test data aren’t randomized
before they are split and patterns in the data aren’t distributed across the
two segments of data.
• Underfitting occurs when your model is overly simple and has not captured the
underlying patterns in the dataset.
• Underfitting can lead to inaccurate predictions for both the training data and test
data.
• Common causes of underfitting include insufficient training data to adequately
cover all possible combinations, and situations where the training and test data
were not properly randomized.
To eradicate both underfitting and overfitting
• There may be a need to modify the model's hyperparameters to ensure that they fit
patterns in both the training and test data, and not just one half of the data.
• A suitable fit should acknowledge major trends in the data and play down or even omit
minor variations.
• This may also mean re-randomizing the training and test data or adding new data
points so as to better detect underlying patterns.
• There may be a need to consider switching algorithms or modifying your
hyperparameters based on trial and error to minimize and manage the bias-variance
trade-off.
• We may introduce regularization to combat overfitting and underfitting.
Regularization artificially amplifies bias error by penalizing an increase in a
model’s complexity. In effect, this add-on parameter provides a warning alert to
keep high variance in check while the original parameters are being optimized.
• Another effective technique to contain overfitting and underfitting in your model is
to perform cross-validation, which minimizes any discrepancies between the training
data and the test data (a sketch combining regularization and cross-validation follows below).
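A minimal sketch combining the two ideas above, assuming scikit-learn: L2 regularization (Ridge) as one concrete form of regularization, evaluated with 5-fold cross-validation. The dataset and the alpha values are illustrative assumptions.

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

# Synthetic regression data (illustrative assumption)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Larger alpha = stronger penalty on model complexity (more bias, less variance)
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha)
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"alpha={alpha}: mean cross-validated R^2 = {scores.mean():.3f}")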
