SUMERA - Kmeans Clustering - Jupyter Notebook
k-Means Clustering
Clustering is a family of unsupervised learning algorithms. These algorithms are useful when we do not have labels for the data: they try to find patterns in the internal structure or similarities of the data in order to put the data points into different groups. Since there are no labels (true answers) associated with the data points, we cannot use that extra information to constrain the problem; instead, we solve it in other ways. In this section, we will take a look at a very popular clustering algorithm, K-means, and understand how it works.
K-means clustering is a method for vector quantization, originally from signal processing, that aims to partition n observations into k clusters (the usual notation) in which each observation belongs to the cluster with the closest mean (cluster center or centroid), which serves as a prototype of the cluster.
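For reference, the standard textbook objective (added here for clarity; it is not part of the original notebook text) is to choose clusters S_1, ..., S_k with means mu_1, ..., mu_k that minimize the within-cluster sum of squared distances:

$$
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
$$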
The Iris dataset (iris.csv) is one of the earliest datasets used in the literature on classification
methods and widely used in statistics and machine learning. Each instance (row) is a plant.
The data set contains 3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are not linearly separable from
each other.
We import the data. Be sure you have downloaded the dataset (iris.csv) from Blackboard and uploaded it to your Jupyter notebook environment.
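A minimal loading cell might look like the following; the column names are an assumption, since the original code cell is not shown, so adjust them to match your CSV header.

```python
import pandas as pd

# Load the Iris dataset (assumes iris.csv sits in the same folder as the notebook)
iris = pd.read_csv('iris.csv')
print(iris.head())
```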
Let us just use two features, so that we can easily visualize them.
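For example, we might keep only two of the measurement columns (the specific names below are an assumption):

```python
# Keep two numeric columns so the clusters can be visualized in 2D
X = iris[['sepal_length', 'petal_length']].values
```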
Now we can apply K-means by initializing the model and training the algorithm.
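A typical sketch of this step with scikit-learn, assuming the two-column feature matrix X from above and three clusters to match the three species:

```python
from sklearn.cluster import KMeans

# Three clusters, one per iris species; a fixed random_state makes the run reproducible
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
```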
D:\Anaconda\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning:
KMeans is known to have a memory leak on Windows with MKL, when there are less
chunks than available threads. You can avoid it by setting the environment
variable OMP_NUM_THREADS=1.
  warnings.warn(
The cluster assignments found by the algorithm are saved in the labels_ attribute and the centroids in cluster_centers_. Let's plot the clustering results and the real species in the following figure. The left panel shows the clustering results, with the larger symbols marking the centroids of the clusters.
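One way to produce such a side-by-side plot, assuming the X, iris, and kmeans objects from the earlier cells (and a 'species' column in the CSV, which is an assumption):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: clusters found by k-means, with centroids drawn as large stars
axes[0].scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis', s=20)
axes[0].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
                c='red', marker='*', s=200)
axes[0].set_title('K-means clusters')

# Right panel: the true species labels for comparison
species_codes = iris['species'].astype('category').cat.codes
axes[1].scatter(X[:, 0], X[:, 1], c=species_codes, cmap='viridis', s=20)
axes[1].set_title('True species')

plt.show()
```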
We can see from the figure above that the results are not bad; they are actually quite similar to the true classes. But remember, we obtained this result without the labels, based only on the similarities between data points. We can also assign new data points to clusters using the predict function. The following predicts the cluster label for two new points.
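For instance, with two made-up plants measured in the same feature order as X:

```python
import numpy as np

# Two hypothetical new observations: (sepal length, petal length) in cm
new_points = np.array([[5.0, 1.5],
                       [6.5, 5.0]])
print(kmeans.predict(new_points))  # cluster index assigned to each new point
```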
SUMMARY
Machine learning algorithms have the capability to learn from data and generalize to new data.
Machine learning has two main categories: supervised learning and unsupervised learning. Supervised learning includes classification and regression, while unsupervised learning includes clustering and dimensionality reduction.
Reflections
1. Discuss the significance of choosing the appropriate number of clusters and the impact it
had on the results.
Answer:
2. Share your insights into how K-means clustering helped in uncovering patterns or
relationships within the data. How might the choice of features influence the clustering
outcome?
Answer:
K-means clustering is a valuable tool for revealing patterns and relationships within diverse data sets, aiding in identifying inherent structures and groupings. This process hinges heavily on feature selection: choosing relevant features that accurately capture the data's essential characteristics is pivotal for precise clustering outcomes. Including irrelevant or noisy features
can introduce unnecessary variability, diminishing the clustering algorithm's effectiveness.
Additionally, the scale and distribution of features significantly influence the clustering results,
with disparate scales or distributions potentially biasing the outcome towards features with
larger scales or variances. Hence, preprocessing techniques such as feature scaling are often
employed to standardize features and ensure equal weighting in the clustering process.
Ultimately, by carefully selecting and preprocessing features, data scientists can accurately
uncover meaningful insights and relationships within complex data sets using K-means
clustering.
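As a concrete illustration of the scaling point above, one common approach (a sketch, not part of the original notebook) is to standardize each feature before clustering:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Standardize each feature to zero mean and unit variance so that no single
# feature dominates the distance computation, then re-run k-means
X_scaled = StandardScaler().fit_transform(X)
kmeans_scaled = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
```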
Answer:
K-means clustering offers practical utility across diverse domains owing to its adaptability and
effectiveness in identifying inherent groupings within datasets. In marketing, it assists with customer segmentation by categorizing customers based on purchasing habits, demographics, or preferences, enabling targeted marketing campaigns and personalized product recommendations. Healthcare professionals utilize K-means clustering to segment patients
and predict outcomes, facilitating tailored treatment plans and resource allocation. Financial
analysts employ it for market segmentation and fraud detection, aiding portfolio optimization
and safeguarding against financial losses. K-means clustering facilitates image segmentation
and object recognition in image processing and computer vision, vital for medical imaging,
satellite analysis, and surveillance. Overall, K-means clustering's ability to reveal natural
groupings within data makes it indispensable for decision-making scenarios requiring
segmentation, pattern recognition, and data-driven insights.