AAM 7th prac
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler # For feature scaling (often important for K-Means)

# 1. Load Data
# Option 1: Synthetic dataset for demonstration (replace with your data)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Option 2: Loading from a CSV file (replace 'your_dataset.csv' with your file)
# dataset = pd.read_csv('your_dataset.csv')
# X = dataset.iloc[:, [0, 1]].values # Select the columns you want to use for clustering (e.g., columns 0 and 1)

# 2. Feature Scaling (K-Means is sensitive to feature scales)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Elbow Method: compute WCSS for k = 1..10
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
# Based on the Elbow Method plot, choose the optimal k (number of clusters)
optimal_k = 4 # Replace with the k value you determined from the elbow plot

# Fit K-Means with the chosen k and get a cluster label for each sample
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)

plt.figure(figsize=(8, 6))
colors = ['red', 'blue', 'green', 'cyan', 'magenta', 'orange', 'purple', 'pink', 'gray', 'brown'] # Add more colors if needed
for i in range(optimal_k):
    plt.scatter(X_scaled[y_kmeans == i, 0], X_scaled[y_kmeans == i, 1],
                c=colors[i], label=f'Cluster {i + 1}')
# Plot the cluster centers (centroids) in yellow so they stand out
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='yellow', edgecolors='black', label='Centroids')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1 (scaled)') # Important: the x and y axes are in scaled units
plt.ylabel('Feature 2 (scaled)')
plt.legend()
plt.show()
# If you used a CSV, you can add the cluster labels back to the DataFrame:
# dataset['Cluster'] = y_kmeans
# print(dataset.head())
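As a quick aside (not part of the script above), the effect of StandardScaler can be checked directly: after scaling, each feature has roughly zero mean and unit variance, so no single feature dominates the Euclidean distances K-Means uses. A minimal sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: e.g. income (~tens of thousands) and age (~tens)
X_demo = np.array([[50000.0, 25.0],
                   [52000.0, 60.0],
                   [51000.0, 30.0]])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_demo_scaled.mean(axis=0))  # close to [0, 0]
print(X_demo_scaled.std(axis=0))   # close to [1, 1]
```

Without this step, distances between samples would be driven almost entirely by the large-scale feature.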
1. Synthetic Data (or CSV): The code provides two options:
   o It creates a synthetic dataset with make_blobs for demonstration purposes, which is useful for testing and understanding the algorithm.
   o It includes commented-out code to load data from a CSV file, which you can uncomment and adapt to your own data.
2. Feature Scaling: K-Means is sensitive to the scales of the features, so feature scaling (using StandardScaler) is applied to the data before clustering.
3. Elbow Method: The code implements the Elbow Method to help you determine the optimal number of clusters (k). It plots the within-cluster sum of squares (WCSS) for different values of k, and you visually inspect the "elbow" point to choose the best k.
4. Clearer Visualization: The visualization is improved:
   o It uses a list of colors so you can easily distinguish clusters.
   o It plots the cluster centers (centroids) in yellow, making them stand out.
   o The plot includes axis labels, a title, and a legend.
   o Important: the axes are labeled as "scaled" to indicate that feature scaling has been applied.
5. Adding Cluster Labels to DataFrame (Optional): The commented-out code shows how to add the cluster assignments (y_kmeans) back to your original Pandas DataFrame if you loaded from a CSV. This is useful for further analysis.
6. Comments and Explanations: The code has comments explaining each step.
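As a cross-check on the Elbow Method (not part of the script above), the silhouette score from sklearn.metrics can also guide the choice of k: it ranges from -1 to 1, and higher values indicate better-separated clusters. A sketch on synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Silhouette score needs at least 2 clusters, so start k at 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")
```

If the elbow is ambiguous, picking the k with the highest silhouette score is a reasonable tiebreaker.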
How to Use:
1. Choose Data Loading Method: Decide whether you'll use the synthetic data or load from a CSV. If using a CSV, uncomment and adapt the relevant lines, making sure the column indices in X = dataset.iloc[:, [0, 1]].values are correct.
2. Run the Code: Run the Python script. The Elbow Method plot appears first; examine it to choose the optimal k value.
3. Set optimal_k: Replace the placeholder optimal_k = 4 with the value you determined from the Elbow Method plot.
4. Run Again: Run the code again. This time it performs K-Means clustering with your chosen k and displays the cluster visualization.
5. Analyze Results: If you loaded from a CSV, uncomment the lines that add the cluster labels back to your DataFrame and analyze the clusters.
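For step 5, once the labels are attached to the DataFrame, a simple groupby summarizes each cluster. A sketch using synthetic data as a stand-in for a loaded CSV, with hypothetical column names 'f1' and 'f2' (adapt to your own file):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in for a CSV with two numeric feature columns
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
dataset = pd.DataFrame(X, columns=['f1', 'f2'])

X_scaled = StandardScaler().fit_transform(dataset[['f1', 'f2']].values)
y_kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_scaled)

# Attach labels and summarize each cluster (mean of each feature, cluster size)
dataset['Cluster'] = y_kmeans
summary = dataset.groupby('Cluster')[['f1', 'f2']].agg(['mean', 'count'])
print(summary)
```

Per-cluster means in the original (unscaled) units are usually easier to interpret than the scaled coordinates used for clustering.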