
LAB MANUAL

OF
DATA MINING
IV B.TECH II Semester

(HITS – R22)

DEPARTMENT OF CSE
(ARTIFICIAL INTELLIGENCE & MACHINE LEARNING)

HOLY MARY INSTITUTE OF TECHNOLOGY & SCIENCE


(UGC AUTONOMOUS)
Bogaram (V), Keesara (M), Medchal (D), T.S - 501301
B.Tech - CSE (Artificial Intelligence & Machine Learning) - HITS R22

DATA MINING LAB

IV-B.Tech II-Semester                                              L T P C
Course Code: AM712PE                                               -  -  2  1

COURSE OBJECTIVES:

The course should enable the students to:

1. Obtain hands-on experience using data mining software.

2. Gain practical exposure to the concepts behind data mining algorithms.

COURSE OUTCOMES:

At the end of the course, students will be able to:

1. Apply statistical preprocessing methods to any given raw data.

2. Gain practical experience of constructing a data warehouse.

3. Implement various algorithms for data mining in order to discover interesting patterns from
large amounts of data.

4. Construct data cubes and apply OLAP operations on them.

LIST OF EXPERIMENTS

WEEK – 1

1. Data Preprocessing Techniques: (i) Data cleaning (ii) Data transformation – Normalization (iii) Data integration

WEEK – 2

2. Partitioning - Horizontal, Vertical, Round Robin, Hash based

WEEK – 3

3. Data Warehouse schemas – star, snowflake, fact constellation

WEEK – 4

4. Data cube construction – OLAP operations

WEEK – 5

5. Data Extraction, Transformations & Loading operations

WEEK – 6

6. Implementation of Attribute oriented induction algorithm

WEEK – 7

7. Implementation of Apriori algorithm

WEEK – 8

8. Implementation of FP-Growth algorithm

WEEK – 9

9. Implementation of Decision Tree Induction

WEEK – 10

10. Calculating Information gain measures

WEEK – 11

11. Classification of data using Bayesian approach

WEEK – 12

12. Classification of data using K-nearest neighbor approach

WEEK – 13

13. Implementation of K-means algorithm

WEEK – 14

14. Implementation of BIRCH algorithm

WEEK – 15

15. Implementation of PAM algorithm

WEEK – 16

16. Implementation of DBSCAN algorithm


DATA MINING PROGRAMS
Experiment 1: Data Preprocessing Techniques

Aim: To perform data cleaning, normalization, and integration using Python programming.

Tools Required
 Python 3.7 or above
 Libraries:
o pandas – for data manipulation
o numpy – for numerical operations
o scikit-learn – for normalization
Install with:
pip install pandas numpy scikit-learn

Theory
🔹 Data Cleaning:
Removing or correcting noisy, inconsistent, or incomplete data (e.g., filling missing values).
🔹 Data Normalization:
Scaling numeric attributes to a common range, typically [0, 1], to improve the performance of algorithms.
🔹 Data Integration:
Combining data from multiple sources to create a unified dataset.

Code Snippet

Step 1: Data Cleaning and Normalization


import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = {'ID': [1, 2, 3], 'Age': [25, np.nan, 35], 'Salary': [50000, 60000, np.nan]}
df = pd.DataFrame(data)

# Clean: Fill missing values with column mean


df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Normalize Age and Salary to [0, 1] range


scaler = MinMaxScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

print("Cleaned and Normalized Data:")


print(df)
Step 2: Data Integration
# Dataset 1
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']

})
# Dataset 2
df2 = pd.DataFrame({
'ID': [1, 2, 3],
'Department': ['HR', 'IT', 'Finance']
})

# Merge datasets on 'ID'


merged_df = pd.merge(df1, df2, on='ID')
print("Integrated Data:")
print(merged_df)

Sample Output
🔹 Output after Cleaning and Normalization:
ID Age Salary
0 1 0.000000 0.000000
1 2 0.500000 1.000000
2 3 1.000000 0.500000
🔹 Output after Data Integration:
ID Name Department
0 1 Alice HR
1 2 Bob IT
2 3 Charlie Finance
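
As a small extension of Step 1 (not part of the outputs shown above), z-score standardization is another common data transformation besides min-max scaling. A minimal sketch using scikit-learn's StandardScaler on the same kind of DataFrame:

from sklearn.preprocessing import StandardScaler
import pandas as pd

# Small illustrative dataset (already cleaned)
df_z = pd.DataFrame({'Age': [25.0, 30.0, 35.0], 'Salary': [50000.0, 55000.0, 60000.0]})

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
df_z[['Age', 'Salary']] = scaler.fit_transform(df_z[['Age', 'Salary']])
print(df_z)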
Experiment 2: Partitioning - Horizontal, Vertical, Round Robin, Hash based

Aim: To implement horizontal, vertical, round robin, and hash-based partitioning of data using Python programming.

Tools Required
 Python 3.7 or above
 Libraries:
o pandas – for data manipulation
o scikit-learn – for train/test data splitting
Install if not already installed:
pip install pandas scikit-learn

Theory
🔹 Horizontal Partitioning:
Divides the dataset by rows, splitting data into subsets (for example, training and testing sets).
🔹 Vertical Partitioning:
Divides the dataset by columns, separating features (attributes) into different logical groups.
🔹 Round Robin and Hash-based Partitioning:
Distribute rows across a fixed number of partitions, either cyclically (row i goes to partition i mod n) or by hashing a key column (see the sketch after the sample output).
These techniques are foundational for data preparation in Machine Learning and Data Mining.

Code Snippet
from sklearn.model_selection import train_test_split
import pandas as pd

# Sample dataset
df = pd.DataFrame({
'Age': [25, 30, 45, 22],
'Salary': [50000, 60000, 80000, 45000],
'Label': [0, 1, 1, 0]
})

# Horizontal partitioning: Split into training and testing sets


X_train, X_test = train_test_split(df, test_size=0.5, random_state=1)

# Vertical partitioning: Split into different sets of columns


vertical1 = df[['Age']]
vertical2 = df[['Salary', 'Label']]

# Display results
print("Train Partition:")
print(X_train)

print("\nVertical Partitions:")
print("Part 1 (Age):")
print(vertical1)

print("Part 2 (Salary and Label):")


print(vertical2)
Sample Output
Train Partition:
Age Salary Label
3 22 45000 0
1 30 60000 1

Vertical Partitions:
Part 1 (Age):
Age
0 25
1 30
2 45
3 22

Part 2 (Salary and Label):


Salary Label
0 50000 0
1 60000 1
2 80000 1
3 45000 0
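
Round Robin and Hash-based Partitioning (extension). The code above covers only horizontal and vertical partitioning; the sketch below distributes the same rows into 2 partitions, first cyclically and then by hashing the Age column. The choice of 2 partitions and of Age as the hash key are assumptions for illustration only.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 45, 22],
    'Salary': [50000, 60000, 80000, 45000],
    'Label': [0, 1, 1, 0]
})

n_partitions = 2

# Round robin: row i goes to partition i mod n
rr_ids = np.arange(len(df)) % n_partitions
round_robin = [df[rr_ids == p] for p in range(n_partitions)]

# Hash-based: partition chosen by hashing a key column ('Age' here)
hash_ids = df['Age'].apply(lambda v: hash(v) % n_partitions)
hash_based = [df[hash_ids == p] for p in range(n_partitions)]

for p in range(n_partitions):
    print(f"Round Robin partition {p}:\n{round_robin[p]}\n")
    print(f"Hash partition {p}:\n{hash_based[p]}\n")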
Experiment 3: Data Warehouse schemas – star, snowflake, fact constellation

Aim: To design data warehouse schemas such as Star Schema, Snowflake Schema, and Fact Constellation
using Pentaho Data Integration tool.

Tools Required:

- Pentaho Data Integration (Spoon)

- MySQL / PostgreSQL database

- Java JDK (for Pentaho)

Star Schema Design (SQL Snippets):

CREATE TABLE time_dimension (
    time_id INT PRIMARY KEY,
    day VARCHAR(10),
    month VARCHAR(10),
    year INT
);

CREATE TABLE product_dimension (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(50),
    category VARCHAR(50)
);

CREATE TABLE store_dimension (
    store_id INT PRIMARY KEY,
    store_name VARCHAR(50),
    location VARCHAR(50)
);

CREATE TABLE sales_fact (
    sales_id INT PRIMARY KEY,
    product_id INT,
    store_id INT,
    time_id INT,
    sales_amount FLOAT,
    FOREIGN KEY (product_id) REFERENCES product_dimension(product_id),
    FOREIGN KEY (store_id) REFERENCES store_dimension(store_id),
    FOREIGN KEY (time_id) REFERENCES time_dimension(time_id)
);

Sample Output (Query Result from sales_fact):

+----------+------------+----------+---------+--------------+
| sales_id | product_id | store_id | time_id | sales_amount |
+----------+------------+----------+---------+--------------+
|        1 |        101 |      201 |     301 |      2500.00 |
|        2 |        102 |      202 |     302 |      1850.75 |
+----------+------------+----------+---------+--------------+
Experiment 4: Data cube construction – OLAP operations

Aim: To construct a data cube and perform OLAP operations like Roll-up, Drill-down, Slice, Dice, and
Pivot using Python or Excel/Pentaho.

Tools Required
You can perform this experiment using:
 ✅ Python 3.7+ with pandas
 ✅ Microsoft Excel (Pivot Tables)
 ✅ Pentaho Schema Workbench (GUI)
 Optional: MySQL for data source
For Python:
pip install pandas

Theory
🔹 OLAP (Online Analytical Processing):
OLAP enables multidimensional analysis of data and supports operations like:
 Roll-up: Summarizing data (e.g., day → month)
 Drill-down: Getting more details (e.g., month → day)
 Slice: Filtering one dimension (e.g., sales for Jan)
 Dice: Filtering multiple dimensions (e.g., Jan & Electronics)
 Pivot: Rotating the view of the data

Python Code Snippet: OLAP Operations using Pandas


import pandas as pd

# Sample sales data


data = {
'Region': ['North', 'South', 'North', 'South', 'North'],
'Product': ['TV', 'TV', 'Mobile', 'Mobile', 'TV'],
'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Feb'],
'Sales': [2000, 3000, 4000, 3500, 2500]
}
df = pd.DataFrame(data)

# Roll-up: Total sales by Region


print("Roll-up (Sales by Region):")
print(df.groupby('Region')['Sales'].sum())

# Drill-down: Sales by Region and Product


print("\nDrill-down (Sales by Region and Product):")
print(df.groupby(['Region', 'Product'])['Sales'].sum())

# Slice: Filter where Month = 'Feb'


print("\nSlice (Sales for February):")
print(df[df['Month'] == 'Feb'])

# Dice: Filter where Month = 'Feb' and Product = 'TV'


print("\nDice (Sales for TV in February):")
print(df[(df['Month'] == 'Feb') & (df['Product'] == 'TV')])

# Pivot: Region as index, Product as columns


print("\nPivot (Sales by Region and Product):")
pivot = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum')
print(pivot)

Sample Output
Roll-up (Sales by Region):
Region
North 8500
South 6500
Name: Sales, dtype: int64

Drill-down (Sales by Region and Product):


Region Product
North Mobile 4000
TV 4500
South Mobile 3500
TV 3000
Name: Sales, dtype: int64

Slice (Sales for February):


Region Product Month Sales
2 North Mobile Feb 4000
3 South Mobile Feb 3500
4 North TV Feb 2500

Dice (Sales for TV in February):


Region Product Month Sales
4 North TV Feb 2500

Pivot (Sales by Region and Product):


Product Mobile TV
Region
North 4000 4500
South 3500 3000
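
Roll-up can also move up a concept hierarchy (e.g., Month → Quarter) rather than just dropping a dimension. A small extension of the code above; the Month → Quarter mapping is assumed here purely for illustration:

import pandas as pd

df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South', 'North'],
    'Product': ['TV', 'TV', 'Mobile', 'Mobile', 'TV'],
    'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Feb'],
    'Sales': [2000, 3000, 4000, 3500, 2500]
})

# Roll-up along the time hierarchy: Month -> Quarter
month_to_quarter = {'Jan': 'Q1', 'Feb': 'Q1', 'Mar': 'Q1', 'Apr': 'Q2'}
df['Quarter'] = df['Month'].map(month_to_quarter)

print("Roll-up (Sales by Quarter and Region):")
print(df.groupby(['Quarter', 'Region'])['Sales'].sum())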
Experiment 5: Data Extraction, Transformations & Loading operations

Aim: To perform Extract, Transform, and Load (ETL) operations on a dataset using Pentaho Data
Integration (Spoon) or Python.

Tools Required
You can perform this experiment using:
 ✅ Pentaho Data Integration (Spoon)
o Java JDK (required)
o Data Source: CSV / Excel / MySQL
 ✅ OR Python (if GUI not preferred)
o Libraries: pandas
✅ Install (Python):
pip install pandas

Theory
🔹 ETL Process:
 Extract: Load raw data from source (CSV, database, Excel)
 Transform: Clean, filter, convert, normalize data
 Load: Store the processed data into a database or file
Example: Load employee records from CSV → clean missing values → store into MySQL table.

Python Code Snippet: ETL Operation


import pandas as pd

# EXTRACT: Load raw data from CSV


df = pd.read_csv("employee_data.csv") # Ensure this file is present
print("Original Data:")
print(df.head())

# TRANSFORM: Fill missing salary values and drop rows with no department


df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
df = df[df['Department'].notna()]  # empty CSV fields are read in as NaN

# LOAD: Save the cleaned data to new CSV (can be loaded into MySQL)
df.to_csv("cleaned_employee_data.csv", index=False)
print("ETL Process Completed. Cleaned data saved.")

If using Pentaho (Spoon GUI):


Steps:
1. Open Spoon (PDI tool)
2. Create a New Transformation
3. Extract
o Input step: CSV File Input / Table Input (MySQL)
4. Transform
o Select values, filter rows, replace nulls
o Add calculation fields (e.g., Age group from Age)
5. Load
o Output step: Table Output / Excel Output / CSV file
6. Run the Transformation
o View success message & output file/table

Sample Output (Console from Python)


Original Data:
   ID   Name Department   Salary
0   1   John         IT  50000.0
1   2  Alice         HR      NaN
2   3   Emma        NaN  45000.0

Transformed Data Saved:
   ID   Name Department   Salary
0   1   John         IT  50000.0
1   2  Alice         HR  47500.0
(Emma's row is dropped because its Department field is empty.)
Experiment 6: Implementation of Attribute oriented induction algorithm

Aim: To generalize raw data using Attribute-Oriented Induction (AOI) by applying concept
hierarchies in Python.

Tools Required
 Python 3.7+
 Libraries: pandas
Install with:
pip install pandas

Theory
🔹 What is AOI?
Attribute-Oriented Induction is a data generalization method in data mining. It:
 Replaces specific values with higher-level concepts
 Uses concept hierarchies for generalization
 Reduces data volume for better mining efficiency
🔹 Example:
 City → State → Country
 Age → Age Group (Youth, Adult, Senior)

Code Snippet
import pandas as pd

# Sample data
data = {
'Name': ['John', 'Alice', 'David', 'Emma'],
'City': ['Hyderabad', 'Chennai', 'Bangalore', 'Hyderabad'],
'Age': [21, 23, 45, 51]
}

df = pd.DataFrame(data)

# Concept hierarchy: City → State


city_to_state = {
'Hyderabad': 'Telangana',
'Chennai': 'Tamil Nadu',
'Bangalore': 'Karnataka'
}
df['State'] = df['City'].map(city_to_state)

# Age generalization
def age_group(age):
if age <= 25:
return 'Youth'
elif age <= 50:
return 'Adult'
else:
return 'Senior'

df['AgeGroup'] = df['Age'].apply(age_group)

# Display generalized data


print("Generalized Data:")
print(df[['Name', 'State', 'AgeGroup']])

Sample Output
Generalized Data:
Name State AgeGroup
0 John Telangana Youth
1 Alice Tamil Nadu Youth
2 David Karnataka Adult
3 Emma Telangana Senior
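
Attribute-oriented induction usually goes one step further and merges identical generalized tuples, keeping a count of how many original rows each generalized tuple covers. A minimal sketch, restating the generalized table produced above:

import pandas as pd

# Generalized table from the AOI step above
df_gen = pd.DataFrame({
    'Name': ['John', 'Alice', 'David', 'Emma'],
    'State': ['Telangana', 'Tamil Nadu', 'Karnataka', 'Telangana'],
    'AgeGroup': ['Youth', 'Youth', 'Adult', 'Senior']
})

# Merge identical generalized tuples and record how many rows each covers
generalized = df_gen.groupby(['State', 'AgeGroup']).size().reset_index(name='Count')
print(generalized)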
Experiment 7: Implementation of Apriori Algorithm

Aim: To implement the Apriori algorithm for mining frequent itemsets and generating association rules
using Python.

Tools Required
 Python 3.7+
 Libraries:
o pandas – for data manipulation
o mlxtend – for Apriori and association rules
Install required packages:
pip install pandas mlxtend

Theory
🔹 What is Apriori Algorithm?
 An algorithm to mine frequent itemsets and derive association rules from transactional datasets.
 Uses support and confidence thresholds to prune and generate rules.
🔹 Key Terms:
 Support: Frequency of occurrence of an itemset
 Confidence: Strength of implication
 Lift: Measures importance of the rule

Code Snippet
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

# Sample transaction data


transactions = [
['Milk', 'Bread', 'Butter'],
['Bread', 'Diapers'],
['Milk', 'Diapers', 'Bread', 'Eggs'],
['Milk', 'Bread', 'Diapers', 'Butter'],
['Bread', 'Eggs']
]

# Encode the data


te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Apply Apriori algorithm


frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print("Frequent Itemsets:")
print(frequent_itemsets)
# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

Sample Output
Frequent Itemsets:
   support          itemsets
0      1.0           {Bread}
1      0.6         {Diapers}
2      0.6            {Milk}
3      0.6  {Bread, Diapers}
4      0.6     {Bread, Milk}

Association Rules:
  antecedents consequents  support  confidence  lift
0   {Diapers}     {Bread}      0.6         1.0   1.0
1      {Milk}     {Bread}      0.6         1.0   1.0
Experiment 8: Implementation of FP-Growth Algorithm

Aim: To implement the FP-Growth algorithm for mining frequent itemsets from a transactional dataset
using Python.

Tools Required
 Python 3.7 or above
 Libraries:
o pandas – for dataset handling
o mlxtend – for FP-Growth implementation
Install required packages:
pip install pandas mlxtend

Theory
🔹 What is FP-Growth?
 FP-Growth (Frequent Pattern Growth) is a fast algorithm for mining frequent itemsets without
generating candidate sets.
 It uses a prefix tree (FP-tree) to encode transactions and discover patterns.
🔹 Advantages over Apriori:
 No candidate generation.
 Faster and more memory-efficient on large datasets.

Code Snippet
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd

# Transaction dataset
transactions = [
['Milk', 'Bread', 'Butter'],
['Bread', 'Diapers'],
['Milk', 'Diapers', 'Bread', 'Eggs'],
['Milk', 'Bread', 'Diapers', 'Butter'],
['Bread', 'Eggs']
]

# Encode transactions
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Apply FP-Growth algorithm


frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)

# Display frequent itemsets


print("Frequent Itemsets using FP-Growth:")
print(frequent_itemsets)

Sample Output
Frequent Itemsets using FP-Growth:
   support          itemsets
0      1.0           {Bread}
1      0.6            {Milk}
2      0.6         {Diapers}
3      0.6     {Milk, Bread}
4      0.6  {Bread, Diapers}
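
The frequent itemsets produced by FP-Growth can be fed into the same association_rules function used in Experiment 7 to derive rules. A minimal sketch, continuing from the frequent_itemsets DataFrame computed above:

from mlxtend.frequent_patterns import association_rules

# Derive rules from the FP-Growth itemsets (same call as in Experiment 7)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])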
Experiment 9: Implementation of Decision Tree Induction

Aim: To implement Decision Tree Induction for classification using Python and visualize the tree
structure.

Tools Required
 Python 3.7 or higher
 Libraries:
o scikit-learn (for ML model and dataset)
o matplotlib (for visualization)
Install via:
pip install scikit-learn matplotlib

Theory
A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks.
It splits data into branches based on feature values using metrics like entropy (ID3) or Gini index
(CART).
 Entropy measures impurity.
 Information Gain = Entropy(before) - Entropy(after)

Code Snippet

Python Code:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
import matplotlib.pyplot as plt

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Train-test split (optional for testing accuracy)


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Create Decision Tree Classifier with entropy


clf = DecisionTreeClassifier(criterion='entropy', random_state=1)

clf.fit(X_train, y_train)

# Accuracy on test set


accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)
# Export textual tree
tree_rules = export_text(clf, feature_names=iris.feature_names)
print("\nDecision Tree Structure:")
print(tree_rules)

# Plot and save the tree as JPEG


plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Decision Tree for Iris Dataset")
plt.savefig("Decision_Tree_Experiment9.jpeg", format='jpeg')
plt.show()
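
Once trained, the classifier can label new measurements. A quick check with one hypothetical flower measurement (sepal length, sepal width, petal length, petal width in cm), continuing from the clf fitted above; it should report the setosa class for these values:

# Predict the class of a single new sample (values chosen for illustration)
sample = [[5.1, 3.5, 1.4, 0.2]]
pred = clf.predict(sample)
print("Predicted class:", iris.target_names[pred[0]])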
Experiment 10: Calculating Information Gain

Aim: To calculate information gain and entropy for a given dataset using Python.

Tools Required
 Python 3.7+
 Libraries:
o pandas
o math or numpy for logarithmic calculations
Install with:
pip install pandas

Theory
🔹 Entropy:
Entropy(S) = -Σ p_i · log2(p_i), where p_i is the proportion of records in S belonging to class i. Entropy measures impurity: 0 for a pure set, 1 for an even two-class split. For the dataset below (4 Yes, 3 No), Entropy = -(4/7)·log2(4/7) - (3/7)·log2(3/7) ≈ 0.985.
🔹 Information Gain:
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) · Entropy(S_v), where S_v is the subset of S for which attribute A takes value v. It is the reduction in entropy obtained by splitting S on A; the attribute with the highest gain is chosen as the split in decision tree induction.

Python Code
import pandas as pd
import numpy as np
from math import log2

# Sample dataset
data = {
'Weather': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast'],
'Play': ['No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)

# Function to calculate entropy


def entropy(labels):
probabilities = labels.value_counts(normalize=True)
return -sum(p * log2(p) for p in probabilities if p > 0)

# Calculate total entropy of the dataset


total_entropy = entropy(df['Play'])
print("Total Entropy of Play:", round(total_entropy, 3))

# Function to calculate information gain


def info_gain(df, feature, target):
total_entropy = entropy(df[target])
values = df[feature].unique()
weighted_entropy = 0
for v in values:
subset = df[df[feature] == v]
weight = len(subset) / len(df)
weighted_entropy += weight * entropy(subset[target])
return total_entropy - weighted_entropy

# Calculate Information Gain for 'Weather'


gain = info_gain(df, 'Weather', 'Play')
print("Information Gain for splitting on 'Weather':", round(gain, 3))

Sample Output
Total Entropy of Play: 0.985

Information Gain for splitting on 'Weather': 0.592


Experiment 11: Classification using Bayesian Approach

Aim: To classify data using the Naive Bayes algorithm and evaluate its performance on a dataset using
Python.

Tools Required
 Python 3.7+
 Libraries:
o scikit-learn – for the Naive Bayes classifier and datasets
Install with:
pip install scikit-learn

Theory
🔹 What is Naive Bayes?
A probabilistic classifier based on Bayes' theorem: P(C | X) = P(X | C) · P(C) / P(X), with the "naive" assumption that the features in X are conditionally independent given the class C.
🔹 Gaussian Naive Bayes:
The GaussianNB model used below assumes each numeric feature follows a normal distribution within each class, which suits continuous attributes such as the Iris measurements.

Python Code
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = iris.target

# Split into train/test data


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create Naive Bayes model


model = GaussianNB()
model.fit(X_train, y_train)

# Predict on test set


y_pred = model.predict(X_test)
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))

Sample Output
Experiment 12: Classification using K-Nearest Neighbors (KNN)

Aim: To classify data using the K-Nearest Neighbors (KNN) algorithm and evaluate its performance
using Python.

Tools Required
 Python 3.7 or higher
 Libraries:
o scikit-learn – for dataset, model, and evaluation
Install via:
pip install scikit-learn

Theory
🔹 What is K-Nearest Neighbors?
 A non-parametric, lazy learning algorithm.
 Classifies a data point based on how its k nearest neighbors are classified.
 Uses distance metrics like Euclidean to measure closeness.
🔹 Parameters:
 k: Number of neighbors to consider.
 Distance: Usually Euclidean or Manhattan.

Python Code
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into training and testing


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create KNN model (k = 3)


knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict
y_pred = knn.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))
Sample Output
Experiment 13: Clustering using K-Means Algorithm

Aim: To implement K-Means Clustering on a sample dataset using Python and visualize the clusters.

Tools Required
 Python 3.7 or higher
 Libraries:
o scikit-learn – for K-Means algorithm
o matplotlib – for plotting
o numpy – for numerical computations
Install with:
pip install scikit-learn matplotlib numpy

Theory
🔹 What is K-Means?
 K-Means is an unsupervised learning algorithm used for clustering.
 It groups data into K distinct non-overlapping clusters based on feature similarity.
🔹 Steps:
1. Select k initial cluster centroids.
2. Assign each data point to the closest centroid.
3. Recalculate centroids.
4. Repeat until centroids don’t change or max iterations reached.

Python Code
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic dataset


X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply K-Means clustering


kmeans = KMeans(n_clusters=4, random_state=0)
y_kmeans = kmeans.fit_predict(X)

# Plot the clusters


plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Plot the centroids


centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title("K-Means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()

Sample Output
Experiment 14: Clustering using BIRCH Algorithm

Aim: To perform clustering using the BIRCH algorithm in Python and visualize the results.

Tools Required
 Python 3.7+
 Libraries:
o scikit-learn
o matplotlib
Install with:
pip install scikit-learn matplotlib

Theory
🔹 What is BIRCH?
BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies. It is:
 Efficient for large datasets
 Performs incremental clustering
 Uses Clustering Feature (CF) trees
🔹 Advantages:
 Scalable and memory-efficient
 Good for large and streaming data
 Works in online and offline phases

Python Code
from sklearn.datasets import make_blobs
from sklearn.cluster import Birch
import matplotlib.pyplot as plt

# Generate synthetic data


X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Apply BIRCH clustering


birch_model = Birch(n_clusters=4)
birch_model.fit(X)
y_birch = birch_model.predict(X)

# Plotting the clusters


plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y_birch, cmap='viridis', s=50)
plt.title("BIRCH Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
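
The theory above notes that BIRCH clusters incrementally (online phase). scikit-learn exposes this through partial_fit, which can be called on successive chunks of data. A minimal sketch; splitting the data into three batches is only for illustration:

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Same kind of synthetic data as above, pretending it arrives in 3 batches
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

stream_model = Birch(n_clusters=4)
for chunk in np.array_split(X, 3):
    stream_model.partial_fit(chunk)   # update the CF-tree with each batch

labels_stream = stream_model.predict(X)
print("Cluster sizes:", np.bincount(labels_stream))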
Sample Output
Experiment 15: Clustering using PAM (K-Medoids) Algorithm

Aim: To perform clustering using the K-Medoids (PAM) algorithm and visualize the results using
Python.

Tools Required
 Python 3.7 or higher
 Libraries:
o scikit-learn-extra for KMedoids
o scikit-learn for dataset
o matplotlib for visualization
Install required packages (in Colab or local machine):

!pip install scikit-learn-extra matplotlib scikit-learn

Theory
 K-Medoids, also known as PAM (Partitioning Around Medoids), is a clustering algorithm
similar to K-Means, but more robust to outliers and noise.
 Instead of using the mean as the cluster center (centroid), it uses actual data points (medoids) as
centers.
🔸 Differences from K-Means:
Feature        K-Means              K-Medoids (PAM)
Center         Centroid (mean)      Medoid (actual point)
Sensitivity    High (to outliers)   Low (more robust)
Speed          Faster               Slightly slower

Python Code
# Install the library (Only in Colab or Jupyter)
!pip install scikit-learn-extra

# Import libraries
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate synthetic data


X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply K-Medoids (PAM)


kmedoids = KMedoids(n_clusters=4, random_state=42)
kmedoids.fit(X)

# Get labels and medoid points


labels = kmedoids.labels_
medoids = kmedoids.cluster_centers_
# Plotting clusters and medoids
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50)
plt.scatter(medoids[:, 0], medoids[:, 1], c='red', s=200, marker='X', label='Medoids')
plt.title("PAM (K-Medoids) Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(True)
plt.show()

Sample Output
Experiment 16: Clustering using DBSCAN Algorithm

Aim: To perform density-based clustering using the DBSCAN algorithm on a sample dataset using
Python and visualize the clusters.

Tools Required
 Python 3.7 or higher
 Libraries:
o scikit-learn for DBSCAN and dataset
o matplotlib for visualization
Install required packages (if needed):
pip install scikit-learn matplotlib

Theory
🔹 What is DBSCAN?
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It groups points
that are closely packed together while marking outliers that lie alone in low-density regions.
🔹 Key Parameters:
 eps: Radius of neighborhood
 min_samples: Minimum points required to form a dense region
🔹 Types of Points:
 Core points: In dense area
 Border points: Near a core point
 Noise points: Outliers

Python Code
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt

# Generate sample data (non-linear shape)


X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Apply DBSCAN clustering


dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the clusters


plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='plasma', s=50)
plt.title("DBSCAN Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()
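
DBSCAN marks noise points with the label -1. A quick check on the result above, continuing from the labels array:

import numpy as np

# Count clusters (excluding noise) and noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print("Estimated clusters:", n_clusters)
print("Noise points:", n_noise)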
Sample Output
