
SUMERA - Kmeans Clustering - Jupyter Notebook

This document discusses k-means clustering, an unsupervised machine learning technique used to identify clusters of data objects in a dataset without labels. It uses the Iris dataset to demonstrate k-means clustering in Python. The algorithm initializes k cluster centers and assigns each data point to the nearest cluster center, iteratively updating the cluster centers until convergence. The results from k-means clustering on the Iris data identify 3 clusters that are similar to the true classes, showing it can discover internal data patterns without labels.

Uploaded by

Alleah Laylo
Copyright
© © All Rights Reserved

2/8/24, 10:37 PM SUMERA_kMeans Clustering - Jupyter Notebook

MATH 144 - INTRODUCTION TO DATA SCIENCE

2Q SY2324
Instructor: EDGAR M. ADINA

k-Means Clustering

Clustering is a family of unsupervised learning algorithms. These algorithms are useful when we
don't have any labels for the data: they try to find patterns in the internal structure or
similarities of the data and use them to put the data points into different groups. Since there
are no labels (true answers) associated with the data points, we cannot use this extra
information to constrain the problem. Instead, there are other ways to solve it. In this section,
we will take a look at a very popular clustering algorithm, K-means, and understand how it works.

K-means clustering is a method of vector quantization, originally from signal processing, that
aims to partition n observations into k groups or clusters (the usual notation), in which each
observation belongs to the cluster with the closest mean (the cluster center or cluster centroid),
which serves as a prototype of the cluster.
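
The assign-to-the-closest-mean idea described above can be sketched in plain NumPy. This is only a minimal illustration of the standard iteration (often called Lloyd's algorithm), not the implementation scikit-learn uses. The helper name `lloyd_kmeans`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Hypothetical helper: plain Lloyd iteration for k-means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: index of the closest center for every observation.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # converged: the centers no longer move
        centers = new_centers
    # Final assignment against the final centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

# Toy data (made up for illustration): two well-separated blobs of 20 points.
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                   rng.normal(5.0, 0.3, size=(20, 2))])
labels, centers = lloyd_kmeans(X_toy, k=2)
print(centers)
```

Because the starting centers are random, a run like this can end in a local optimum on harder data, which is one reason library implementations usually restart from several initializations.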

The k-means clustering method is also used as an unsupervised machine learning technique to
identify clusters of data objects in a dataset. There are many kinds of clustering methods, but
k-means is one of the oldest and most approachable. In this lesson, we need the Python libraries
“NumPy” and “Scikit-learn” to implement a K-means clustering algorithm. The data used here has
three classes, which the clustering algorithm will try to identify. Finding the optimal
clustering is computationally difficult (NP-hard) in general, so practical implementations use
an iterative heuristic that converges to a local optimum. The unsupervised k-means algorithm has
only a loose relationship to the k-nearest neighbor classifier, a popular supervised machine
learning technique for classification that is often confused with k-means because of the name.

Sample Case - Iris Dataset

The Iris dataset (iris.csv) is one of the earliest datasets used in the literature on classification
methods and widely used in statistics and machine learning. Each instance (row) is a plant.

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are not linearly separable from
each other.

Predicted attribute: class of iris plant.

Let us first import the needed tools.

localhost:8888/notebooks/Python/SUMERA_kMeans Clustering.ipynb 1/7



In [1]: import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
plt.style.use("seaborn-v0_8")
%matplotlib inline

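We import the data. The cell below loads the copy of the Iris data bundled with scikit-learn. If
you prefer to work from the file (iris.csv), be sure you have downloaded it from your Blackboard
and uploaded it to your Jupyter notebook.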

In [2]: iris = datasets.load_iris()

Let us just use two features, so that we can easily visualize them.

In [3]: X = iris.data[:, [0, 2]]
y = iris.target
target_names = iris.target_names
feature_names = iris.feature_names
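
As a quick check of the dataset description and of the two columns selected above, the short sketch below counts the instances per class and prints the names of the chosen features. It reloads the data so that it can run on its own. The names `counts` and `selected` are introduced only for this check.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Number of instances for each class label (0, 1, 2).
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(name, count)

# Names of the two columns kept for plotting (indices 0 and 2).
selected = [iris.feature_names[i] for i in (0, 2)]
print(selected)
```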

Now, we count the number of classes.

In [4]: n_class = len(set(y))

Let us visualize the data first.


In [5]: plt.figure(figsize=(10, 8))
plt.scatter(X[:, 0], X[:, 1], color='b', marker='o', s=60)
plt.xlabel('Feature 1 - ' + feature_names[0])
plt.ylabel('Feature 2 - ' + feature_names[2])
plt.show()

Now we can apply K-means by initializing the model and training it on the data.


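In [8]: from sklearn.cluster import KMeans
import os
import warnings

# Silence scikit-learn FutureWarnings (e.g. about changing defaults).
warnings.simplefilter(action='ignore', category=FutureWarning)
# Meant to avoid the MKL memory-leak warning on Windows; it may only take
# effect if it is set before NumPy/scikit-learn are first imported.
os.environ["OMP_NUM_THREADS"] = "1"

# Fit K-means with one cluster per class.
kmeans = KMeans(n_clusters=n_class, random_state=42)
kmeans.fit(X)

colors = ['b', 'g', 'r']
symbols = ['o', '^', '*']
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(121)   # left panel: K-means clusters
ax2 = fig.add_subplot(122)  # right panel: true species

for i, (c, s) in enumerate(zip(colors, symbols)):
    # Points assigned to cluster i, plus that cluster's centroid in black.
    ix = kmeans.labels_ == i
    # The label argument is cut off in the export; target_names[i] is an assumed completion.
    ax.scatter(X[:, 0][ix], X[:, 1][ix], color=c, marker=s, s=60, label=target_names[i])
    loc = kmeans.cluster_centers_[i]
    ax.scatter(loc[0], loc[1], color='k', marker=s, linewidth=5)

    # Points whose true class is i.
    ix = y == i
    # Same truncation here; target_names[i] is again an assumed completion.
    ax2.scatter(X[:, 0][ix], X[:, 1][ix], color=c, marker=s, s=60, label=target_names[i])

plt.legend(loc=4, scatterpoints=1)
ax.set_xlabel('Feature 1 - ' + feature_names[0])
ax.set_ylabel('Feature 2 - ' + feature_names[2])
ax2.set_xlabel('Feature 1 - ' + feature_names[0])
ax2.set_ylabel('Feature 2 - ' + feature_names[2])
plt.tight_layout()
plt.show()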

D:\Anaconda\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning:
KMeans is known to have a memory leak on Windows with MKL, when there are less
chunks than available threads. You can avoid it by setting the environment
variable OMP_NUM_THREADS=1.
  warnings.warn(


The cluster assignments found by the algorithm are saved in the labels_ attribute, and the
centroids are in cluster_centers_. The figure plots the clustering results (left) next to the
real species (right). In the left panel, the bigger black symbols mark the centroids of the
clusters.
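
To complement the visual comparison, the agreement between the clusters and the species can also be put into numbers. The sketch below is one possible way to do that, not part of the original notebook. It refits the model so that it can run on its own, sets `n_init=10` explicitly (an assumption, so the result does not depend on the scikit-learn version's default), and uses `contingency_matrix` and `adjusted_rand_score` from scikit-learn.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

# n_init is fixed here as an assumption; the notebook relies on the default.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Rows are the true species, columns are the cluster labels.
table = contingency_matrix(y, kmeans.labels_)
print(table)

# Adjusted Rand index: 1.0 means identical partitions, values near 0 mean
# agreement no better than chance. It ignores how the clusters are numbered.
ari = adjusted_rand_score(y, kmeans.labels_)
print(ari)
```

The cluster numbers are arbitrary, so a good result shows up as one dominant entry per row of the table rather than as large values on the diagonal.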

We can see from the figure that the results are not bad; they are actually quite similar to the
true classes. But remember, we got these results without the labels, based only on the
similarities between data points. We can also assign new data points to the clusters using the
predict function. The following cell predicts the cluster labels for two new points.

In [7]: new_points = np.array([[5, 2], [6, 5]])
kmeans.predict(new_points)

Out[7]: array([1, 2])
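
For a fitted K-means model, predicting the cluster of a new point amounts to finding its nearest centroid. The sketch below checks this by hand with NumPy and compares the result with `predict`. It refits the model so that it can run on its own, with `n_init=10` set explicitly as an assumption. The names `dists` and `nearest` are introduced only for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

new_points = np.array([[5, 2], [6, 5]], dtype=float)

# Euclidean distance from each new point to each centroid, shape (2, 3).
dists = np.linalg.norm(
    new_points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
nearest = dists.argmin(axis=1)

print(nearest)
print(kmeans.predict(new_points))
```

The integer labels are arbitrary cluster indices, so they may differ between runs with different settings even when the grouping itself is the same.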

SUMMARY

Machine learning algorithms are able to learn from data and generalize to new data.


Machine learning has two main categories: supervised learning and unsupervised learning.
Supervised learning includes classification and regression, while unsupervised learning
includes clustering and dimensionality reduction.

The output of a classification task is categorical data.

The output of a regression task is quantitative (numerical) data.

Reflections

1. Discuss the significance of choosing the appropriate number of clusters and the impact it
had on the results.

Answer:

Selecting the appropriate number of clusters in k-means clustering is pivotal, as it directly
impacts the quality and interpretability of the outcomes. Opting for too few clusters might
oversimplify the data, resulting in the loss of crucial patterns and nuances. Conversely,
choosing too many clusters can lead to overfitting, where the model captures noise instead of
meaningful information, potentially leading to inaccurate conclusions. By exploring various
cluster numbers using methods like the elbow technique or silhouette analysis, data scientists
can determine a number of clusters that keeps within-cluster similarity high while keeping the
clusters well separated from one another. This helps ensure that the clusters represent the
inherent structure of the data, enabling insightful analysis and informed decision-making based
on the findings.
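
The elbow technique and silhouette analysis mentioned in the answer can be sketched as follows. This is an illustrative sketch rather than part of the original notebook: it fits K-means for several values of k on the same two Iris features and records the inertia (within-cluster sum of squares) and the mean silhouette score. The range of k and `n_init=10` are assumptions chosen for the example.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]

inertias = {}
silhouettes = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Inertia: sum of squared distances of points to their closest centroid.
    inertias[k] = model.inertia_
    # The silhouette score needs at least two clusters.
    if k >= 2:
        silhouettes[k] = silhouette_score(X, model.labels_)

for k in inertias:
    print(k, round(inertias[k], 2), silhouettes.get(k))
```

With the elbow technique, one looks for the k after which the inertia stops dropping sharply. With silhouette analysis, higher scores indicate tighter, better separated clusters.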

2. Share your insights into how K-means clustering helped in uncovering patterns or
relationships within the data. How might the choice of features influence the clustering
outcome?

Answer:

K-means clustering is a valuable tool in revealing patterns and relationships within diverse data
sets, aiding in identifying inherent structures and groupings. This process hinges heavily on
selecting features, as selecting relevant ones that accurately capture the data's essential
characteristics is pivotal for precise clustering outcomes. Including irrelevant or noisy features
can introduce unnecessary variability, diminishing the clustering algorithm's effectiveness.
Additionally, the scale and distribution of features significantly influence the clustering results,
with disparate scales or distributions potentially biasing the outcome towards features with
larger scales or variances. Hence, preprocessing techniques such as feature scaling are often
employed to standardize features and ensure equal weighting in the clustering process.
Ultimately, by carefully selecting and preprocessing features, data scientists can accurately
uncover meaningful insights and relationships within complex data sets using K-means
clustering.
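
The feature scaling mentioned in the answer can be illustrated with a short sketch. It standardizes the two Iris features with scikit-learn's `StandardScaler` and compares the clustering obtained before and after scaling. The comparison through `adjusted_rand_score` and the choice of `n_init=10` are assumptions made for this example, not steps from the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data[:, [0, 2]]

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X.std(axis=0))         # spread of the raw features
print(X_scaled.std(axis=0))  # spread after scaling

labels_raw = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Agreement between the two partitions (1.0 would mean identical groupings).
agreement = adjusted_rand_score(labels_raw, labels_scaled)
print(agreement)
```

When the features are measured in very different units, the unscaled distances are dominated by the feature with the larger spread, so the two partitions can differ noticeably.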

3. Elaborate on any practical applications or decision-making scenarios where K-means
clustering can be effectively employed.

Answer:

K-means clustering offers practical utility across diverse domains owing to its adaptability and
effectiveness in identifying inherent groupings within datasets. In marketing, it assists in
customer segmentation by categorizing customers based on purchasing habits, demographics, or
preferences, enabling targeted marketing campaigns and personalized product
recommendations. Healthcare professionals utilize K-means clustering to segment patients
and predict outcomes, facilitating tailored treatment plans and resource allocation. Financial
analysts employ it for market segmentation and fraud detection, aiding portfolio optimization
and safeguarding against financial losses. K-means clustering facilitates image segmentation
and object recognition in image processing and computer vision, vital for medical imaging,
satellite analysis, and surveillance. Overall, K-means clustering's ability to reveal natural
groupings within data makes it indispensable for decision-making scenarios requiring
segmentation, pattern recognition, and data-driven insights.
