DATA MINING LAB MANUAL
IV B.TECH II Semester
(HITS – R22)
DEPARTMENT OF CSE
(ARTIFICIAL INTELLIGENCE & MACHINE LEARNING)
COURSE OBJECTIVES:
1. The course is intended to obtain hands-on experience using data mining software.
COURSE OUTCOMES:
3. Implement various algorithms for data mining in order to discover interesting patterns from
large amounts of data.
LIST OF EXPERIMENTS
WEEK – 1: Data Cleaning, Normalization, and Integration
WEEK – 2: Partitioning – Horizontal, Vertical, Round Robin, Hash based
WEEK – 3: Design of Data Warehouse Schemas – Star, Snowflake, Fact Constellation
WEEK – 4: Data Cube Construction – OLAP Operations
WEEK – 5: ETL Process using Pentaho Data Integration (Spoon) / Python
WEEK – 6: Data Generalization using Attribute-Oriented Induction (AOI)
WEEK – 7: Implementation of Apriori Algorithm
WEEK – 8: Implementation of FP-Growth Algorithm
WEEK – 9: Implementation of Decision Tree Induction
WEEK – 10: Calculating Information Gain and Entropy
WEEK – 11: Classification using Naive Bayes
WEEK – 12: Classification using K-Nearest Neighbors (KNN)
WEEK – 13: Clustering using K-Means Algorithm
WEEK – 14: Clustering using BIRCH Algorithm
WEEK – 15: Clustering using K-Medoids (PAM) Algorithm
WEEK – 16: Clustering using DBSCAN Algorithm
Experiment 1: Data Cleaning, Normalization, and Integration
Aim: To perform data cleaning, normalization, and integration using Python programming.
Tools Required
Python 3.7 or above
Libraries:
o pandas – for data manipulation
o numpy – for numerical operations
o scikit-learn – for normalization
Install with:
pip install pandas numpy scikit-learn
Theory
🔹 Data Cleaning:
Removing or correcting noisy, inconsistent, or incomplete data (e.g., filling missing values).
🔹 Data Normalization:
Scaling numeric attributes to a common range, typically [0, 1], to improve the performance of algorithms.
🔹 Data Integration:
Combining data from multiple sources to create a unified dataset.
Code Snippet
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Dataset 1: sample data with missing values
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie'],
                    'Age': [25, np.nan, 35], 'Salary': [50000, 60000, np.nan]})

# Cleaning: fill missing values with the column mean; Normalization: scale to [0, 1]
df1[['Age', 'Salary']] = MinMaxScaler().fit_transform(
    df1[['Age', 'Salary']].fillna(df1[['Age', 'Salary']].mean()))

# Dataset 2
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Department': ['HR', 'IT', 'Finance']})

# Integration: merge the two datasets on the common ID column
merged = pd.merge(df1[['ID', 'Name']], df2, on='ID')
print(df1[['ID', 'Age', 'Salary']])
print(merged)
Sample Output
🔹 Output after Cleaning and Normalization:
ID Age Salary
0 1 0.000000 0.000000
1 2 0.500000 1.000000
2 3 1.000000 0.500000
🔹 Output after Data Integration:
ID Name Department
0 1 Alice HR
1 2 Bob IT
2 3 Charlie Finance
Experiment 2: Partitioning - Horizontal, Vertical, Round Robin, Hash based
Aim: To implement horizontal and vertical partitioning of data using Python programming.
Tools Required
Python 3.7 or above
Libraries:
o pandas – for data manipulation
o scikit-learn – for train/test data splitting
Install if not already installed:
pip install pandas scikit-learn
Theory
🔹 Horizontal Partitioning:
Divides the dataset by rows, splitting data into subsets used for training and testing.
🔹 Vertical Partitioning:
Divides the dataset by columns, separating features (attributes) into different logical
groups.
These techniques are foundational for data preparation in Machine Learning and Data
Mining.
Code Snippet
from sklearn.model_selection import train_test_split
import pandas as pd

# Sample dataset
df = pd.DataFrame({
    'Age': [25, 30, 45, 22],
    'Salary': [50000, 60000, 80000, 45000],
    'Label': [0, 1, 1, 0]
})

# Horizontal partitioning: split the rows into train and test subsets
X = df[['Age', 'Salary']]
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Vertical partitioning: split the columns into separate feature groups
vertical1 = df[['Age']]
vertical2 = df[['Salary', 'Label']]

# Display results
print("Train Partition:")
print(X_train)
print("\nVertical Partitions:")
print("Part 1 (Age):")
print(vertical1)
Sample Output
Vertical Partitions:
Part 1 (Age):
Age
0 25
1 30
2 45
3 22
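The experiment title also lists round-robin and hash-based partitioning, which the snippet above does not cover. The following is a minimal illustrative sketch; the toy dataset, the partition count, and the use of ID % n as a stand-in for a hash function are assumptions, not part of the manual's dataset.
import pandas as pd

# Toy dataset and partition count (assumed for illustration)
df = pd.DataFrame({'ID': range(1, 9), 'Value': [10, 20, 30, 40, 50, 60, 70, 80]})
n = 3

# Round-robin partitioning: row i is sent to partition i mod n
round_robin = [df.iloc[i::n] for i in range(n)]

# Hash-based partitioning: partition chosen by hashing a key column (ID % n as a simple hash)
hash_based = [df[df['ID'] % n == p] for p in range(n)]

for p in range(n):
    print(f"Round-robin partition {p}:\n{round_robin[p]}\n")
    print(f"Hash partition {p}:\n{hash_based[p]}\n")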
Experiment 3: Design of Data Warehouse Schemas – Star, Snowflake, Fact Constellation
Aim: To design data warehouse schemas such as Star Schema, Snowflake Schema, and Fact Constellation
using the Pentaho Data Integration tool.
Tools Required
Pentaho Data Integration (Spoon) / Pentaho Schema Workbench
Java JDK (required to run Pentaho)
MySQL (optional, as a data source for the schema tables)
Sample fact table:
+----------+------------+----------+---------+--------------+
| sales_id | product_id | store_id | time_id | sales_amount |
+----------+------------+----------+---------+--------------+
|        1 |        101 |      201 |     301 |      2500.00 |
|        2 |        102 |      202 |     302 |      1850.75 |
+----------+------------+----------+---------+--------------+
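Pentaho builds these schemas through its GUI, but the star-schema idea can also be sketched in Python: a central fact table joined to dimension tables on their keys. The dimension contents below are hypothetical and only illustrate the structure; the fact rows mirror the sample table above.
import pandas as pd

# Dimension tables (illustrative contents)
product_dim = pd.DataFrame({'product_id': [101, 102], 'product_name': ['Laptop', 'Phone']})
store_dim = pd.DataFrame({'store_id': [201, 202], 'city': ['Hyderabad', 'Chennai']})
time_dim = pd.DataFrame({'time_id': [301, 302], 'month': ['Jan', 'Feb']})

# Central fact table, matching the sample fact table above
sales_fact = pd.DataFrame({
    'sales_id': [1, 2], 'product_id': [101, 102], 'store_id': [201, 202],
    'time_id': [301, 302], 'sales_amount': [2500.00, 1850.75]
})

# A star-schema query: join the fact table to its dimensions, then aggregate
report = (sales_fact.merge(product_dim, on='product_id')
                    .merge(store_dim, on='store_id')
                    .merge(time_dim, on='time_id'))
print(report.groupby(['city', 'month'])['sales_amount'].sum())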
Experiment 4: Data cube construction – OLAP operations
Aim: To construct a data cube and perform OLAP operations like Roll-up, Drill-down, Slice, Dice, and
Pivot using Python or Excel/Pentaho.
Tools Required
You can perform this experiment using:
✅ Python 3.7+ with pandas
✅ Microsoft Excel (Pivot Tables)
✅ Pentaho Schema Workbench (GUI)
Optional: MySQL for data source
For Python:
pip install pandas
Theory
🔹 OLAP (Online Analytical Processing):
OLAP enables multidimensional analysis of data and supports operations like:
Roll-up: Summarizing data (e.g., day → month)
Drill-down: Getting more details (e.g., month → day)
Slice: Filtering one dimension (e.g., sales for Jan)
Dice: Filtering multiple dimensions (e.g., Jan & Electronics)
Pivot: Rotating the view of the data
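No code snippet is included for this experiment in the manual; below is a minimal pandas sketch of the five operations. The sales table (Region, Month, Category, Sales) is hypothetical, with values chosen only so that the roll-up matches the sample output shown after it.
import pandas as pd

# Hypothetical sales data (values chosen for illustration)
sales = pd.DataFrame({
    'Region':   ['North', 'North', 'South', 'South'],
    'Month':    ['Jan', 'Feb', 'Jan', 'Feb'],
    'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
    'Sales':    [5000, 3500, 2500, 4000]
})

# Roll-up: aggregate Sales up to the Region level
print("Roll-up (Sales by Region):")
print(sales.groupby('Region')['Sales'].sum())

# Drill-down: more detail, by Region and Month
print(sales.groupby(['Region', 'Month'])['Sales'].sum())

# Slice: fix one dimension (Month == 'Jan')
print(sales[sales['Month'] == 'Jan'])

# Dice: filter on multiple dimensions (Jan and Electronics)
print(sales[(sales['Month'] == 'Jan') & (sales['Category'] == 'Electronics')])

# Pivot: rotate the view (Region vs Month)
print(sales.pivot_table(index='Region', columns='Month', values='Sales', aggfunc='sum'))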
Sample Output
Roll-up (Sales by Region):
Region
North 8500
South 6500
Name: Sales, dtype: int64
Experiment 5: ETL Process using Pentaho Data Integration (Spoon) / Python
Aim: To perform Extract, Transform, and Load (ETL) operations on a dataset using Pentaho Data
Integration (Spoon) or Python.
Tools Required
You can perform this experiment using:
✅ Pentaho Data Integration (Spoon)
o Java JDK (required)
o Data Source: CSV / Excel / MySQL
✅ OR Python (if GUI not preferred)
o Libraries: pandas
✅ Install (Python):
pip install pandas
Theory
🔹 ETL Process:
Extract: Load raw data from source (CSV, database, Excel)
Transform: Clean, filter, convert, normalize data
Load: Store the processed data into a database or file
Example: Load employee records from CSV → clean missing values → store into MySQL table.
Code Snippet
import pandas as pd
# EXTRACT: read raw employee data (input file name assumed), TRANSFORM: drop rows with missing values
df = pd.read_csv("employee_data.csv").dropna()
# LOAD: Save the cleaned data to a new CSV (can be loaded into MySQL)
df.to_csv("cleaned_employee_data.csv", index=False)
print("ETL Process Completed. Cleaned data saved.")
Experiment 6: Data Generalization using Attribute-Oriented Induction (AOI)
Aim: To generalize raw data using Attribute-Oriented Induction (AOI) by applying concept
hierarchies in Python.
Tools Required
Python 3.7+
Libraries: pandas
Install with:
pip install pandas
Theory
🔹 What is AOI?
Attribute-Oriented Induction is a data generalization method in data mining. It:
Replaces specific values with higher-level concepts
Uses concept hierarchies for generalization
Reduces data volume for better mining efficiency
🔹 Example:
City → State → Country
Age → Age Group (Youth, Adult, Senior)
Code Snippet
import pandas as pd
# Sample data
data = {
'Name': ['John', 'Alice', 'David', 'Emma'],
'City': ['Hyderabad', 'Chennai', 'Bangalore', 'Hyderabad'],
'Age': [21, 23, 45, 51]
}
df = pd.DataFrame(data)
# Age generalization
def age_group(age):
if age <= 25:
return 'Youth'
elif age <= 50:
return 'Adult'
else:
return 'Senior'
df['AgeGroup'] = df['Age'].apply(age_group)

# City generalization using a concept hierarchy: City -> State
city_to_state = {'Hyderabad': 'Telangana', 'Chennai': 'Tamil Nadu', 'Bangalore': 'Karnataka'}
df['State'] = df['City'].map(city_to_state)

print("Generalized Data:")
print(df[['Name', 'State', 'AgeGroup']])
Sample Output
Generalized Data:
Name State AgeGroup
0 John Telangana Youth
1 Alice Tamil Nadu Youth
2 David Karnataka Adult
3 Emma Telangana Senior
Experiment 7: Implementation of Apriori Algorithm
Aim: To implement the Apriori algorithm for mining frequent itemsets and generating association rules
using Python.
Tools Required
Python 3.7+
Libraries:
o pandas – for data manipulation
o mlxtend – for Apriori and association rules
Install required packages:
pip install pandas mlxtend
Theory
🔹 What is Apriori Algorithm?
An algorithm to mine frequent itemsets and derive association rules from transactional datasets.
Uses support and confidence thresholds to prune and generate rules.
🔹 Key Terms:
Support: Frequency of occurrence of an itemset
Confidence: Strength of implication
Lift: Measures importance of the rule
Code Snippet
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd
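The transactions, encoding, and mining steps are missing from the snippet above. The following sketch continues from those imports; the transaction list is an assumption, chosen so that the itemset supports agree with the sample output below (min_support = 0.6, minimum confidence = 0.7).
# Hypothetical transaction dataset (not given in the original snippet)
transactions = [
    ['Milk', 'Bread'],
    ['Milk', 'Bread', 'Diapers'],
    ['Milk', 'Bread', 'Diapers'],
    ['Bread', 'Diapers'],
    ['Eggs', 'Butter']
]

# Encode transactions into a one-hot boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets and derive association rules
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

print("Frequent Itemsets:")
print(frequent_itemsets)
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])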
Sample Output
Frequent Itemsets:
support itemsets
0 0.8 {Bread}
1 0.6 {Milk}
2 0.6 {Diapers}
3 0.6 {Milk, Bread}
4 0.6 {Bread, Diapers}
Association Rules:
antecedents consequents support confidence lift
0 {Milk} {Bread} 0.6 1.0 1.25
1 {Bread} {Milk} 0.6 0.75 1.25
2 {Diapers} {Bread} 0.6 1.0 1.25
3 {Bread} {Diapers} 0.6 0.75 1.25
Experiment 8: Implementation of FP-Growth Algorithm
Aim: To implement the FP-Growth algorithm for mining frequent itemsets from a transactional dataset
using Python.
Tools Required
Python 3.7 or above
Libraries:
o pandas – for dataset handling
o mlxtend – for FP-Growth implementation
Install required packages:
pip install pandas mlxtend
Theory
🔹 What is FP-Growth?
FP-Growth (Frequent Pattern Growth) is a fast algorithm for mining frequent itemsets without
generating candidate sets.
It uses a prefix tree (FP-tree) to encode transactions and discover patterns.
🔹 Advantages over Apriori:
No candidate generation.
Faster and more memory-efficient on large datasets.
Code Snippet
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth
import pandas as pd
# Transaction dataset
transactions = [
['Milk', 'Bread', 'Butter'],
['Bread', 'Diapers'],
['Milk', 'Diapers', 'Bread', 'Eggs'],
['Milk', 'Bread', 'Diapers', 'Butter'],
['Bread', 'Eggs']
]
# Encode transactions
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Mine frequent itemsets with a minimum support threshold of 0.6
frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
print("Frequent Itemsets using FP-Growth:")
print(frequent_itemsets)
Sample Output
Frequent Itemsets using FP-Growth:
support itemsets
0 0.8 {Bread}
1 0.6 {Milk}
2 0.6 {Diapers}
3 0.6 {Milk, Bread}
4 0.6 {Bread, Diapers}
Experiment 9: Implementation of Decision Tree Induction
Aim: To implement Decision Tree Induction for classification using Python and visualize the tree
structure.
Tools Required
Python 3.7 or higher
Libraries:
o scikit-learn (for ML model and dataset)
o matplotlib (for visualization)
Install via:
pip install scikit-learn matplotlib
Theory
A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks.
It splits data into branches based on feature values using metrics like entropy (ID3) or Gini index
(CART).
Entropy measures impurity.
Information Gain = Entropy(before) - Entropy(after)
Code Snippet
Python Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
import matplotlib.pyplot as plt

# Load Iris, split into train/test sets, and fit a decision tree using entropy (information gain)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)
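The aim also asks for visualizing the tree structure; a short continuation of the snippet above, using the fitted clf:
# Text view of the induced tree
print(export_text(clf, feature_names=iris.feature_names))

# Graphical view of the tree
plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()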
Experiment 10: Calculating Information Gain and Entropy
Aim: To calculate information gain and entropy for a given dataset using Python.
Tools Required
Python 3.7+
Libraries:
o pandas
o math or numpy for logarithmic calculations
Install with:
pip install pandas
Theory
🔹 Entropy measures the impurity of a dataset:
Entropy(S) = -Σ pᵢ log₂(pᵢ), where pᵢ is the proportion of class i in S.
🔹 Information Gain is the reduction in entropy obtained by splitting on an attribute A:
Gain(S, A) = Entropy(S) - Σ (|Sᵥ| / |S|) × Entropy(Sᵥ), summed over the values v of A.
Python Code
import pandas as pd
import numpy as np
from math import log2

# Sample dataset
data = {
    'Weather': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast'],
    'Play': ['No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'Yes']
}
df = pd.DataFrame(data)

# Entropy of a label column: -sum(p * log2(p)) over the class proportions p
def entropy(labels):
    probs = labels.value_counts(normalize=True)
    return -sum(p * log2(p) for p in probs)

total_entropy = entropy(df['Play'])
print(f"Total Entropy of Play: {total_entropy:.3f}")
Sample Output
Total Entropy of Play: 0.985
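The snippet above computes only the total entropy. The information gain of the Weather attribute can be obtained by continuing from the same code (df, entropy, and total_entropy defined above); for this toy data it works out to roughly 0.592.
# Information gain of Weather = total entropy - weighted entropy of Play within each Weather value
weighted = sum((len(group) / len(df)) * entropy(group['Play'])
               for _, group in df.groupby('Weather'))
print(f"Information Gain (Weather): {total_entropy - weighted:.3f}")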
Experiment 11: Classification using Naive Bayes
Aim: To classify data using the Naive Bayes algorithm and evaluate its performance on a dataset using
Python.
Tools Required
Python 3.7+
Libraries:
o scikit-learn – for the Naive Bayes classifier and datasets
Install with:
pip install scikit-learn
Theory
🔹 Naive Bayes is a probabilistic classifier based on Bayes' theorem, with the "naive" assumption that
features are conditionally independent given the class:
P(C | x₁, …, xₙ) ∝ P(C) × Π P(xᵢ | C)
🔹 Gaussian Naive Bayes assumes each numeric feature follows a normal distribution within each class.
Python Code
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
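The training and evaluation steps are not included in the snippet above; a minimal sketch continuing from those imports, on the Iris dataset (the split ratio and random_state are arbitrary choices):
# Load the Iris dataset and split into train/test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Train a Gaussian Naive Bayes classifier and evaluate it
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))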
Sample Output
Experiment 12: Classification using K-Nearest Neighbors (KNN)
Aim: To classify data using the K-Nearest Neighbors (KNN) algorithm and evaluate its performance
using Python.
Tools Required
Python 3.7 or higher
Libraries:
o scikit-learn – for dataset, model, and evaluation
Install via:
pip install scikit-learn
Theory
🔹 What is K-Nearest Neighbors?
A non-parametric, lazy learning algorithm.
Classifies a data point based on how its k nearest neighbors are classified.
Uses distance metrics like Euclidean to measure closeness.
🔹 Parameters:
k: Number of neighbors to consider.
Distance: Usually Euclidean or Manhattan.
Python Code
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train KNN with k = 3 neighbors (Euclidean distance by default)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Predict
y_pred = knn.predict(X_test)
# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))
Sample Output
Experiment 13: Clustering using K-Means Algorithm
Aim: To implement K-Means Clustering on a sample dataset using Python and visualize the clusters.
Tools Required
Python 3.7 or higher
Libraries:
o scikit-learn – for K-Means algorithm
o matplotlib – for plotting
o numpy – for numerical computations
Install with:
pip install scikit-learn matplotlib numpy
Theory
🔹 What is K-Means?
K-Means is an unsupervised learning algorithm used for clustering.
It groups data into K distinct non-overlapping clusters based on feature similarity.
🔹 Steps:
1. Select k initial cluster centroids.
2. Assign each data point to the closest centroid.
3. Recalculate centroids.
4. Repeat until centroids don’t change or max iterations reached.
Python Code
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
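The data generation, clustering, and plotting steps are missing from the snippet above; a minimal sketch continuing from those imports (the blob parameters and k = 3 are assumptions):
# Generate synthetic data: 300 points around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-Means with k = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Plot the clusters and their centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='x', s=100)
plt.title("K-Means Clustering")
plt.show()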
Sample Output
Experiment 14: Clustering using BIRCH Algorithm
Aim: To perform clustering using the BIRCH algorithm in Python and visualize the results.
Tools Required
Python 3.7+
Libraries:
o scikit-learn
o matplotlib
Install with:
pip install scikit-learn matplotlib
Theory
🔹 What is BIRCH?
BIRCH stands for Balanced Iterative Reducing and Clustering using Hierarchies. It is:
Efficient for large datasets
Performs incremental clustering
Uses Clustering Feature (CF) trees
🔹 Advantages:
Scalable and memory-efficient
Good for large and streaming data
Works in online and offline phases
Python Code
from sklearn.datasets import make_blobs
from sklearn.cluster import Birch
import matplotlib.pyplot as plt
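The snippet stops at the imports; a minimal sketch continuing from them (the synthetic blob data and the threshold value are assumptions):
# Generate synthetic data: 300 points around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit BIRCH: builds a CF tree, then groups its leaf entries into n_clusters clusters
birch = Birch(threshold=0.5, n_clusters=3)
labels = birch.fit_predict(X)

# Plot the resulting clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.title("BIRCH Clustering")
plt.show()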
Experiment 15: Clustering using K-Medoids (PAM) Algorithm
Aim: To perform clustering using the K-Medoids (PAM) algorithm and visualize the results using
Python.
Tools Required
Python 3.7 or higher
Libraries:
o scikit-learn-extra for KMedoids
o scikit-learn for dataset
o matplotlib for visualization
Install required packages (in Colab or on a local machine):
pip install scikit-learn-extra scikit-learn matplotlib
Theory
K-Medoids, also known as PAM (Partitioning Around Medoids), is a clustering algorithm
similar to K-Means, but more robust to outliers and noise.
Instead of using the mean as the cluster center (centroid), it uses actual data points (medoids) as
centers.
🔸 Differences from K-Means:
Feature     | K-Means            | K-Medoids (PAM)
Center      | Centroid (mean)    | Medoid (actual data point)
Sensitivity | High (to outliers) | Low (more robust)
Speed       | Faster             | Slightly slower
Python Code
# Install the library (Only in Colab or Jupyter)
!pip install scikit-learn-extra
# Import libraries
from sklearn_extra.cluster import KMedoids
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
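The clustering and plotting steps are missing from the snippet above; a minimal sketch continuing from those imports (the blob parameters and k = 3 are assumptions):
# Generate synthetic data: 300 points around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-Medoids (PAM): cluster centers are actual data points (medoids)
kmedoids = KMedoids(n_clusters=3, random_state=42)
labels = kmedoids.fit_predict(X)

# Plot the clusters and mark the medoids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.scatter(kmedoids.cluster_centers_[:, 0], kmedoids.cluster_centers_[:, 1], c='red', marker='x', s=100)
plt.title("K-Medoids (PAM) Clustering")
plt.show()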
Sample Output
Experiment 16: Clustering using DBSCAN Algorithm
Aim: To perform density-based clustering using the DBSCAN algorithm on a sample dataset using
Python and visualize the clusters.
Tools Required
Python 3.7 or higher
Libraries:
o scikit-learn for DBSCAN and dataset
o matplotlib for visualization
Install required packages (if needed):
pip install scikit-learn matplotlib
Theory
🔹 What is DBSCAN?
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It groups points
that are closely packed together while marking outliers that lie alone in low-density regions.
🔹 Key Parameters:
eps: Radius of neighborhood
min_samples: Minimum points required to form a dense region
🔹 Types of Points:
Core points: In dense area
Border points: Near a core point
Noise points: Outliers
Python Code
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
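The snippet stops at the imports; a minimal sketch continuing from them (the moons dataset and the eps / min_samples values are assumptions):
# Generate non-spherical data: two interleaving half-moons with a little noise
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Fit DBSCAN: eps is the neighborhood radius, min_samples the density threshold
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(X)

# Plot the clusters; noise points are labelled -1
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.title("DBSCAN Clustering")
plt.show()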