SUMERA - Kmeans Clustering - Jupyter Notebook

This document discusses k-means clustering, an unsupervised machine learning technique used to identify clusters of data objects in a dataset without labels. It uses the Iris dataset to demonstrate k-means clustering in Python. The algorithm initializes k cluster centers and assigns each data point to the nearest cluster center, iteratively updating the cluster centers until convergence. The results from k-means clustering on the Iris data identify 3 clusters that are similar to the true classes, showing it can discover internal data patterns without labels.

Uploaded by

Alleah Laylo

2/8/24, 10:37 PM SUMERA_kMeans Clustering - Jupyter Notebook

MATH 144 - INTRODUCTION TO DATA SCIENCE

2Q SY2324
Instructor: EDGAR M. ADINA

k-Means Clustering

Clustering is a family of unsupervised learning algorithms. They are useful when we don’t have
any labels for the data: the algorithms try to find patterns in the internal structure or
similarities of the data and use them to put the data points into different groups. Since there
are no labels (true answers) associated with the data points, we cannot use that extra
information to constrain the problem. There are other ways to solve it, however, and in this
section we take a look at a very popular clustering algorithm, K-means.

K-means clustering is a method of vector quantization, originally from signal processing,
that aims to partition n observations into k groups or clusters (the usual notation), in which
each observation belongs to the cluster with the closest mean (the cluster center or cluster
centroid), which serves as a prototype of the cluster.
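
The goal described above is commonly written as minimizing the within-cluster sum of squared distances. The symbols used here (S_i for the i-th cluster, mu_i for its mean, x for an observation) are notation introduced for this sketch and do not appear in the original notebook.

```latex
\underset{S_1,\dots,S_k}{\arg\min}\; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```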

The k-means clustering method is also used as an unsupervised machine learning technique to
identify clusters of data objects in a dataset. There are many kinds of clustering methods,
but k-means is one of the oldest and most approachable. In this lesson, we need the Python
libraries “NumPy” and “Scikit-learn” to implement a K-means clustering algorithm. The data
used in this lesson has three classes, which the clustering algorithm will try to identify.
Finding the optimal partition is computationally difficult (NP-hard), so practical
implementations rely on iterative heuristics. The unsupervised k-means algorithm has only a
loose relationship to the k-nearest neighbor classifier, a popular supervised machine learning
technique for classification that is often confused with k-means because of the name.
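
The usual heuristic alternates between assigning each point to its nearest center and moving each center to the mean of its points. The following is a minimal NumPy sketch of that idea, not the scikit-learn implementation. The function name lloyd_kmeans, its parameters, and the synthetic blobs are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd-style k-means sketch: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # converged: the centers stopped moving
        centers = new_centers
    return labels, centers

# Three synthetic blobs (made-up data, used only for illustration).
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in blob_centers])

labels_demo, centers_demo = lloyd_kmeans(X_demo, k=3)
print(centers_demo.round(1))
```

Because the result depends on the random starting centers, such a heuristic can stop at a local optimum; running it from several starting points and keeping the best result is a common remedy.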

Sample Case - Iris Dataset

The Iris dataset (iris.csv) is one of the earliest datasets used in the literature on classification
methods and is widely used in statistics and machine learning. Each instance (row) is a plant.

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are not linearly separable from
each other.

Predicted attribute: class of iris plant.
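
As a quick sanity check of the description above, the sketch below counts the instances per class. It assumes the copy of the Iris data bundled with scikit-learn rather than iris.csv; the variable name iris_check is chosen here for illustration.

```python
import numpy as np
from sklearn import datasets

# Load the copy of the Iris data that ships with scikit-learn.
iris_check = datasets.load_iris()

print(iris_check.data.shape)           # (number of plants, number of features)
print(iris_check.target_names)         # names of the classes
print(np.bincount(iris_check.target))  # number of instances in each class
```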

Let us first import the needed tools.

localhost:8888/notebooks/Python/SUMERA_kMeans Clustering.ipynb 1/7

In [1]: import numpy as np

import matplotlib.pyplot as plt
from sklearn import datasets
plt.style.use("seaborn-v0_8")
%matplotlib inline

We import the data. Be sure you have downloaded the dataset (iris.csv) from your Blackboard
and uploaded it to your Jupyter notebook. Note that the cell below loads the copy of the Iris
data bundled with scikit-learn.

In [2]: iris = datasets.load_iris()

Let us just use two features, so that we can easily visualize them.

In [3]: X = iris.data[:, [0, 2]]

y = iris.target
target_names = iris.target_names
feature_names = iris.feature_names

Now, we count the number of classes.

In [4]: n_class = len(set(y))

Let us visualize the data first.

In [5]: plt.figure(figsize = (10,8))

plt.scatter(X[:, 0], X[:, 1],
            color = 'b', marker = 'o', s = 60)

plt.xlabel('Feature 1 - ' + feature_names[0])
plt.ylabel('Feature 2 - ' + feature_names[2])
plt.show()

Now we can use K-means by initializing the model and training it on the data.

In [8]: from sklearn.cluster import KMeans

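import os
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# OMP_NUM_THREADS is typically read when the numerical libraries are first
# loaded, so set it at the top of the notebook, before any other imports.
os.environ["OMP_NUM_THREADS"] = "1"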

kmeans = KMeans(n_clusters=n_class, random_state=42)
kmeans.fit(X)

colors = ['b', 'g', 'r']
symbols = ['o', '^', '*']
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

for i, (c, s) in enumerate(zip(colors, symbols)):
    ix = kmeans.labels_ == i
    ax.scatter(X[:, 0][ix], X[:, 1][ix], color=c, marker=s, s=60, label=target
    loc = kmeans.cluster_centers_[i]
    ax.scatter(loc[0], loc[1], color='k', marker=s, linewidth=5)

    ix = y == i
    ax2.scatter(X[:, 0][ix], X[:, 1][ix], color=c, marker=s, s=60, label=targe

plt.legend(loc=4, scatterpoints=1)
ax.set_xlabel('Feature 1 - ' + feature_names[0])
ax.set_ylabel('Feature 2 - ' + feature_names[2])
ax2.set_xlabel('Feature 1 - ' + feature_names[0])
ax2.set_ylabel('Feature 2 - ' + feature_names[2])
plt.tight_layout()
plt.show()

D:\Anaconda\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1.
  warnings.warn(

The cluster assignments found by the algorithm are saved in the labels_ attribute and the
centroids are in cluster_centers_. The figure plots the clustering results (left) next to the
real species (right). In the left panel, the bigger black symbols mark the centroids of the
clusters.
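
Cluster indices are arbitrary, so the values in labels_ need not line up with the class indices in y. One way to compare the two is a contingency table. The sketch below refits the model on the same two features; the variable names and the explicit n_init=10 are choices made here, not taken from the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris_ct = datasets.load_iris()
X_ct = iris_ct.data[:, [0, 2]]  # same two features as in the notebook
y_ct = iris_ct.target

km_ct = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_ct)

# Rows: true class index; columns: cluster index found by k-means.
table = np.zeros((3, 3), dtype=int)
np.add.at(table, (y_ct, km_ct.labels_), 1)

print(table)
print(km_ct.cluster_centers_)  # one row per centroid, one column per feature
```

A row with most of its count in a single column means that class was recovered almost entirely as one cluster.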

We can see from the above figure that the results are not too bad; they are actually quite
similar to the true classes. But remember, we got these results without the labels, based only
on the similarities between data points. We can also assign new data points to the clusters
using the predict function. The following predicts the cluster label for two new points.

In [7]: new_points = np.array([[5, 2], [6, 5]])

kmeans.predict(new_points)

Out[7]: array([1, 2])
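
For a fitted k-means model, predicting amounts to finding the nearest centroid. The sketch below checks this by hand with Euclidean distances on a refitted model. The variable names are illustrative, and the cluster indices it prints depend on the fit, so they may differ from the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X_p = datasets.load_iris().data[:, [0, 2]]  # same two features as in the notebook
km_p = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_p)

new_pts = np.array([[5, 2], [6, 5]], dtype=float)

# Euclidean distance from each new point to each centroid.
dists = np.linalg.norm(new_pts[:, None, :] - km_p.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)

print(nearest)                # index of the closest centroid, computed by hand
print(km_p.predict(new_pts))  # the same assignment from scikit-learn
```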

SUMMARY

Machine learning algorithms have the capability to learn from data and generalize to
new data.

Machine learning has two main categories: supervised learning and unsupervised learning.
Supervised learning includes classification and regression, while unsupervised learning
includes clustering and dimensionality reduction.

The output of a classification task is categorical data.

The output of a regression task is quantitative (continuous) data.

Reflections

1. Discuss the significance of choosing the appropriate number of clusters and the impact it
had on the results.

Answer:

Selecting the appropriate number of clusters in k-means clustering is pivotal, as it directly
impacts the quality and interpretability of the outcomes. Opting for too few clusters might
oversimplify the data, resulting in the loss of crucial patterns and nuances. Conversely,
choosing too many clusters can lead to overfitting, where the model captures noise instead of
meaningful information, potentially leading to inaccurate conclusions. By exploring various
cluster numbers using methods like the elbow technique or silhouette analysis, data scientists
can look for the number that maximizes within-cluster similarity while minimizing
between-cluster similarity. This ensures that the clusters accurately represent the inherent
structure of the data, enabling insightful analysis and informed decision-making based on the
findings.
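
The elbow technique and silhouette analysis mentioned above can be sketched as follows. This is an illustration under assumptions: the range of k values, the two Iris features, and the variable names are choices made here, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_k = datasets.load_iris().data[:, [0, 2]]  # same two features as in the notebook

inertias = {}
silhouettes = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_k)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances
    silhouettes[k] = silhouette_score(X_k, km.labels_)
    print(f"k={k}  inertia={inertias[k]:.2f}  silhouette={silhouettes[k]:.3f}")
```

Inertia generally keeps falling as k grows, so the elbow technique looks for the k after which the drop levels off. The silhouette score lies between -1 and 1, and higher values indicate tighter, better separated clusters.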

2. Share your insights into how K-means clustering helped in uncovering patterns or
relationships within the data. How might the choice of features influence the clustering
outcome?

Answer:

K-means clustering is a valuable tool in revealing patterns and relationships within diverse data
sets, aiding in identifying inherent structures and groupings. This process hinges heavily on
selecting features, as selecting relevant ones that accurately capture the data's essential
characteristics is pivotal for precise clustering outcomes. Including irrelevant or noisy features
can introduce unnecessary variability, diminishing the clustering algorithm's effectiveness.
Additionally, the scale and distribution of features significantly influence the clustering results,
with disparate scales or distributions potentially biasing the outcome towards features with
larger scales or variances. Hence, preprocessing techniques such as feature scaling are often
employed to standardize features and ensure equal weighting in the clustering process.
Ultimately, by carefully selecting and preprocessing features, data scientists can accurately
uncover meaningful insights and relationships within complex data sets using K-means
clustering.
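
The feature scaling step described above can be sketched with scikit-learn's StandardScaler placed in front of KMeans. Using all four Iris features and chaining the steps in a pipeline are assumptions made for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_s = datasets.load_iris().data  # all four features, on their original scales

# Standardize each feature to zero mean and unit standard deviation.
X_scaled = StandardScaler().fit_transform(X_s)
print(X_s.std(axis=0).round(2))       # the raw features have different spreads
print(X_scaled.std(axis=0).round(2))  # the scaled features share the same spread

# Chain scaling and clustering so both steps are always applied together.
pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=42))
labels_s = pipe.fit_predict(X_s)
print(np.bincount(labels_s))          # number of points in each cluster
```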

3. Elaborate on any practical applications or decision-making scenarios where K-means
clustering can be effectively employed.

Answer:

K-means clustering offers practical utility across diverse domains owing to its adaptability and
effectiveness in identifying inherent groupings within datasets. In marketing, it assists in customer
segmentation by categorizing customers based on purchasing habits, demographics, or
preferences, enabling targeted marketing campaigns and personalized product
recommendations. Healthcare professionals utilize K-means clustering to segment patients
and predict outcomes, facilitating tailored treatment plans and resource allocation. Financial
analysts employ it for market segmentation and fraud detection, aiding portfolio optimization
and safeguarding against financial losses. K-means clustering facilitates image segmentation
and object recognition in image processing and computer vision, vital for medical imaging,
satellite analysis, and surveillance. Overall, K-means clustering's ability to reveal natural
groupings within data makes it indispensable for decision-making scenarios requiring
segmentation, pattern recognition, and data-driven insights.
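
As one small illustration of the image-processing use case, k-means can reduce an image to a handful of representative colors, which is vector quantization applied to pixels. The sketch below uses a synthetic random image; the image size, the number of colors, and the variable names are assumptions made here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 20x20 RGB image with values in [0, 1] (made-up data).
image = rng.random((20, 20, 3))
pixels = image.reshape(-1, 3)  # one row per pixel, one column per color channel

k_colors = 4
km_img = KMeans(n_clusters=k_colors, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with the centroid (representative color) of its cluster.
quantized = km_img.cluster_centers_[km_img.labels_].reshape(image.shape)
n_unique = np.unique(quantized.reshape(-1, 3), axis=0).shape[0]
print(n_unique)  # number of distinct colors left in the quantized image
```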
