Experiment :- 4
THEORY:-
Objective:
This experiment aims to provide a thorough understanding of different clustering algorithms and their practical applications.
Master the core concepts of K-Means, Hierarchical Clustering, and DBSCAN.
Implement these algorithms using Python and the scikit-learn library.
Conduct rigorous model evaluation using appropriate metrics.
Gain practical experience in data preprocessing, feature scaling, and model selection.
Analyze and interpret clustering results to draw meaningful insights.
Materials:
Software: Python with the following libraries:
pandas: For data manipulation and analysis.
numpy: For numerical computing.
scikit-learn: For implementing machine learning algorithms, including clustering.
matplotlib and seaborn: For data visualization.
IDE: Jupyter Notebook or any preferred Python development environment.
Datasets:
Iris Dataset: A classic, built-in dataset in scikit-learn, ideal for introductory clustering.
Customer Segmentation Dataset: A real-world dataset (e.g., from the UCI Machine
Learning Repository) to apply clustering in a practical scenario.
Procedure:
1. Data Loading and Preprocessing:
Load Data: Import the Iris and Customer Segmentation datasets into your chosen environment using pandas (a code sketch of these preprocessing steps follows this list).
Handle Missing Values:
Deletion: Remove rows or columns with missing values (if the proportion is
small).
Imputation: Replace missing values with estimated values (e.g., mean, median,
KNN imputation).
Feature Scaling: Standardize or normalize the data to ensure all features have
comparable scales:
Standardization (Z-score normalization): Transforms features to have zero
mean and unit variance.
Normalization (Min-Max scaling): Scales features to a specific range (e.g.,
between 0 and 1).
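A minimal sketch of this preprocessing step, assuming the Iris dataset (mean imputation is shown for completeness even though Iris has no missing values; the variable names are illustrative):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Load the Iris data into a DataFrame.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Mean imputation (no effect on Iris, which has no missing values).
X_imputed = SimpleImputer(strategy="mean").fit_transform(df)

# Z-score standardization: zero mean, unit variance per feature.
X_scaled = StandardScaler().fit_transform(X_imputed)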
2. Implementation of Clustering Algorithms:
K-Means Clustering:
Concept: Partitions data into 'k' clusters by minimizing the within-cluster sum of
squares (WCSS).
Implementation:
Use scikit-learn's KMeans class.
Experiment with different values of 'k' (number of clusters).
Elbow Method: Plot the WCSS against different 'k' values. The "elbow"
point (where the curve starts to bend) often indicates an optimal 'k'.
Silhouette Score: Calculate the Silhouette Score for each data point to
evaluate cluster cohesion and separation. Higher scores generally indicate
better-defined clusters.
Visualization: Create scatter plots to visualize the clusters in two-dimensional space (if applicable), color-coding data points by their assigned cluster (see the sketch below).
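A sketch of the K-Means workflow described above, assuming the scaled matrix X_scaled from the preprocessing sketch (the range of k and the final choice k=3 are illustrative):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Fit K-Means for several k, recording WCSS (inertia_) and silhouette scores.
k_values = range(2, 11)
wcss, silhouettes = [], []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    wcss.append(km.inertia_)
    silhouettes.append(silhouette_score(X_scaled, km.labels_))
print("silhouette by k:", [round(s, 3) for s in silhouettes])

# Elbow plot: look for the k where the WCSS curve bends.
plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()

# Refit with a chosen k and visualize the first two scaled features.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=km.labels_, cmap="viridis")
plt.title("K-Means clusters")
plt.show()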
Hierarchical Clustering:
Concept: Creates a hierarchical tree-like structure (dendrogram) representing the
relationships between data points.
Implementation:
Use scikit-learn's AgglomerativeClustering class.
Experiment with different linkage methods (single, complete, average) to
determine the most suitable approach for your data.
Dendrogram: Visualize the dendrogram to identify natural clusters by
cutting the tree at appropriate heights.
Visualization: Similar to K-Means, create scatter plots to visualize the clusters (see the sketch below).
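A sketch of the hierarchical step, again assuming X_scaled; the dendrogram is drawn with SciPy, since scikit-learn's AgglomerativeClustering does not plot one directly:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Agglomerative clustering with average linkage (try 'single'/'complete' too).
agg = AgglomerativeClustering(n_clusters=3, linkage="average")
agg_labels = agg.fit_predict(X_scaled)

# SciPy computes the full merge tree needed for the dendrogram.
Z = linkage(X_scaled, method="average")
dendrogram(Z)
plt.title("Dendrogram (average linkage)")
plt.ylabel("Merge distance")
plt.show()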
DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Concept: Groups data points that are closely packed (dense regions) and labels isolated points in low-density regions as noise (outliers).
Implementation:
Use scikit-learn's DBSCAN class.
Tune the hyperparameters:
Epsilon (ε, the eps parameter): The maximum distance between two samples for one to be considered in the neighborhood of the other.
min_samples: The minimum number of samples in a neighborhood for a point to qualify as a core point.
Visualization: Create scatter plots to visualize the clusters, highlighting core points, border points, and noise points (see the sketch below).
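A DBSCAN sketch, assuming X_scaled; the eps and min_samples values below are starting points, not tuned choices, and core points are emphasized with larger markers:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
labels = db.labels_  # label -1 marks noise points

# core_sample_indices_ lists the core points found by DBSCAN.
core_mask = np.zeros(len(labels), dtype=bool)
core_mask[db.core_sample_indices_] = True

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters: {n_clusters}, noise points: {(labels == -1).sum()}")

# Core points drawn large; border and noise points drawn small.
sizes = np.where(core_mask, 60, 15)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap="viridis", s=sizes)
plt.title("DBSCAN (large = core, small = border/noise)")
plt.show()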
3. Model Evaluation:
Within-Cluster Sum of Squares (WCSS): For K-Means, examine the total WCSS (exposed as inertia_ in scikit-learn). Lower WCSS indicates more compact clusters, but it always decreases as k grows, which is why the elbow method is needed.
Silhouette Score: Calculate the Silhouette Score for all three algorithms to assess the overall quality of clustering (a comparison sketch follows this list).
Visual Inspection: Carefully examine the scatter plots and dendrograms to assess the
interpretability and separation of clusters.
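A sketch of the comparison, assuming the fitted objects from the sketches above (km, agg_labels, and the DBSCAN labels); DBSCAN noise points are excluded, since a silhouette requires valid cluster labels:

from sklearn.metrics import silhouette_score

print("K-Means silhouette:      ", silhouette_score(X_scaled, km.labels_))
print("Agglomerative silhouette:", silhouette_score(X_scaled, agg_labels))

# Exclude noise (-1) and compute only if at least two clusters remain.
mask = labels != -1
if len(set(labels[mask])) > 1:
    print("DBSCAN silhouette:       ", silhouette_score(X_scaled[mask], labels[mask]))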
4. Comparison and Discussion:
Tabulate the performance of each algorithm based on the evaluation metrics (WCSS,
Silhouette Score).
Discuss the advantages and disadvantages of each algorithm in terms of:
Assumptions: K-Means assumes roughly spherical, similarly sized clusters, while DBSCAN can handle irregularly shaped clusters.
Scalability: K-Means scales well to large datasets, whereas agglomerative (hierarchical) clustering becomes computationally expensive as the number of samples grows.
Sensitivity to outliers: DBSCAN is more robust to outliers than K-Means, since it explicitly labels them as noise.
Analyze the results and draw meaningful insights from the clustering. For example:
In customer segmentation, identify distinct customer groups based on their
purchasing behavior.
In biological data analysis, discover natural groupings of species or cell types.
As a starting point for such analyses, the Iris data can be loaded and restricted to two features for easy visualization:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, :2]  # use only the first two features for 2-D visualization
Extensions:
Explore other clustering algorithms, such as Gaussian Mixture Models (GMMs).
Implement more advanced data preprocessing techniques, such as dimensionality reduction with PCA (a combined PCA + GMM sketch follows this list).
Conduct a more in-depth analysis and visualization of the results, including 3D scatter plots
and interactive visualizations.
Apply clustering to different domains, such as image segmentation, anomaly detection, and
social network analysis.
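A brief sketch of the first two extensions, assuming X_scaled; PCA reduces the data to two components, after which a Gaussian Mixture Model provides soft (probabilistic) cluster assignments:

from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Project onto the first two principal components.
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# GMM with 3 components (an illustrative choice for Iris).
gmm = GaussianMixture(n_components=3, random_state=42).fit(X_pca)
probs = gmm.predict_proba(X_pca)  # per-point cluster membership probabilities
print(probs[:5].round(3))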
RESULT:- Thus, the various clustering algorithms were implemented and evaluated successfully on the given datasets.
PRACTICAL ASSIGNMENT:
1. Explain the concept of hierarchical clustering. How does it differ from partitional clustering
methods like K-Means?
2. Describe the K-Means clustering algorithm. What are its key steps and how does it determine the
optimal number of clusters?