AAM 7th prac

The document provides a Python implementation of the K-Means clustering algorithm, with options to run on synthetic data or on a CSV file. It includes steps for feature scaling, determining the optimal number of clusters using the Elbow Method, and visualizing the clusters with distinct colors and centroids. It also offers guidance on adapting the code to personal datasets and analyzing the clustering results.

7. Implement an unsupervised machine learning algorithm (K-Means) in Python on a dataset to cluster data.

(Assume a suitable dataset.)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler  # For feature scaling (often important for K-Means)

# 1. Load Data (Example using a synthetic dataset - replace with your data)

# Option 1: Using a synthetic dataset (for demonstration)
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4, random_state=42)  # 4 clusters

# Option 2: Loading from a CSV file (replace 'your_dataset.csv' with your file)
# dataset = pd.read_csv('your_dataset.csv')
# X = dataset.iloc[:, [0, 1]].values  # Select the columns to use for clustering (e.g., columns 0 and 1)

# 2. Feature Scaling (Important for K-Means)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Determine Optimal Number of Clusters (k) using the Elbow Method
wcss = []  # Within-cluster sum of squares
for i in range(1, 11):  # Try k from 1 to 10
    # n_init is set explicitly because its default changed in recent scikit-learn versions
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
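
# Optional cross-check (not part of the original practical): the silhouette
# score ranks candidate k values numerically (higher is better) and can
# complement visual inspection of the elbow. A minimal commented-out sketch:
# from sklearn.metrics import silhouette_score
# for i in range(2, 11):  # silhouette scores need at least 2 clusters
#     labels = KMeans(n_clusters=i, init='k-means++', n_init=10,
#                     random_state=42).fit_predict(X_scaled)
#     print(f'k={i}: silhouette={silhouette_score(X_scaled, labels):.3f}')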

# Based on the Elbow Method plot, choose the optimal k (number of clusters)
optimal_k = 4  # Replace with the k value you determined from the elbow plot

# 4. Apply K-Means Clustering with the optimal k
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)  # Fit and predict cluster assignments

# 5. Visualize the Clusters
plt.figure(figsize=(8, 6))
colors = ['red', 'blue', 'green', 'cyan', 'magenta', 'orange', 'purple', 'pink', 'gray', 'brown']  # Add more colors if needed
for i in range(optimal_k):
    plt.scatter(X_scaled[y_kmeans == i, 0], X_scaled[y_kmeans == i, 1],
                s=100, c=colors[i], label=f'Cluster {i+1}')

# Plot the centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')

plt.title('K-Means Clustering')
plt.xlabel('Feature 1 (scaled)')  # Note that both axes show scaled values
plt.ylabel('Feature 2 (scaled)')
plt.legend()
plt.show()

# If you used a CSV, you can add the cluster labels back to the DataFrame:
# dataset['Cluster'] = y_kmeans
# print(dataset.head())
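
A fitted model can also label new, unseen observations. The following is a minimal sketch (the new_points values are made up for illustration); it assumes the scaler and kmeans objects from the script above are still in scope, and note that new points must go through the same fitted scaler via transform (not fit_transform) before calling kmeans.predict:

# Assign clusters to hypothetical new observations (values are illustrative)
new_points = np.array([[1.5, 2.0], [-3.0, 4.5]])   # on the original, unscaled feature scale
new_points_scaled = scaler.transform(new_points)   # reuse the already-fitted scaler
print(kmeans.predict(new_points_scaled))           # cluster index for each new point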

1. Synthetic Data (or CSV): The code provides two options:
   - It shows how to create a synthetic dataset using make_blobs for demonstration purposes, which is useful for testing and understanding the algorithm.
   - It includes code to load data from a CSV file (commented out), which you can uncomment and adapt to your own data.
2. Feature Scaling: Feature scaling (using StandardScaler) is crucial because K-Means relies on distances and is therefore sensitive to the scales of the features. Scaling is applied to the data before clustering.
3. Elbow Method: The code implements the Elbow Method to help you determine the optimal number of clusters (k). It plots the within-cluster sum of squares (WCSS) for different values of k, and you visually inspect the "elbow" point to choose the best k.
4. Clear Visualization: The visualization:
   - uses a list of colors so you can easily distinguish clusters;
   - plots the cluster centers (centroids) in yellow, making them stand out;
   - includes axis labels, a legend, and a title;
   - labels the axes as "scaled" to indicate that feature scaling has been applied.
5. Adding Cluster Labels to the DataFrame (Optional): The commented-out code shows how to add the cluster assignments (y_kmeans) back to your original Pandas DataFrame if you loaded from a CSV. This is useful for further analysis (see the sketch after this list).
6. Comments and Explanations: The code includes comments explaining each step.
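
For item 5, once the labels are attached, per-cluster summaries come from ordinary Pandas operations. A minimal sketch, assuming you loaded a CSV into dataset and ran the clustering above:

dataset['Cluster'] = y_kmeans                               # attach labels to the original rows
print(dataset['Cluster'].value_counts())                    # how many points landed in each cluster
print(dataset.groupby('Cluster').mean(numeric_only=True))   # per-cluster feature averages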

How to Use:

1. Choose Data Loading Method: Decide whether to use the synthetic data or load from a CSV. If using a CSV, uncomment and adapt the relevant lines, making sure the column indices in X = dataset.iloc[:, [0, 1]].values are correct.
2. Run the Code: Run the Python script. The Elbow Method plot appears first; examine it to choose the optimal k value.
3. Set optimal_k: Replace the placeholder optimal_k = 4 with the value you determined from the Elbow Method plot.
4. Run Again: Run the code again. This time it performs K-Means clustering with your chosen k and displays the cluster visualization.
5. Analyze Results: If you loaded from a CSV, uncomment the lines that add the cluster labels back to your DataFrame and analyze the clusters (a column-name-based sketch follows below).
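
As a supplement to step 1: selecting columns by position (iloc) breaks silently if the CSV layout changes, so selecting by name is usually safer. A minimal sketch, where 'feature_a' and 'feature_b' are hypothetical column names to replace with your own:

dataset = pd.read_csv('your_dataset.csv')
X = dataset[['feature_a', 'feature_b']].values   # hypothetical column names - replace with yours
X_scaled = StandardScaler().fit_transform(X)     # then proceed with the Elbow Method as above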