
Artificial Intelligence & Machine Learning

Machine Learning lab (BAIL606)



Machine Learning lab Semester 6


Course Code: BAIL606                      CIE Marks: 50
Teaching Hours/Week (L:T:P:S): 0:0:2:0    SEE Marks: 50
Credits: 01                               Total Marks: 100
Examination type (SEE): Practical
Course objectives:
• To become familiar with data and visualize univariate, bivariate, and multivariate data using statistical
techniques and dimensionality reduction.
• To understand various machine learning algorithms such as similarity-based learning, regression, decision
trees, and clustering.
• To familiarize with learning theories, probability-based models and developing the skills required for
decision-making in dynamic environments.
Sl. No.  Experiments
1 Develop a program to load a dataset and select one numerical column. Compute mean, median, mode,
standard deviation, variance, and range for a given numerical column in a dataset. Generate a histogram and
boxplot to understand the distribution of the data. Identify any outliers in the data using IQR. Select a
categorical variable from a dataset. Compute the frequency of each category and display it as a bar chart or
pie chart.

2 Develop a program to load a dataset with at least two numerical columns (e.g., Iris, Titanic). Plot a scatter
plot of two variables and calculate their Pearson correlation coefficient. Write a program to compute the
covariance and correlation matrix for a dataset. Visualize the correlation matrix using a heatmap to know
which variables have strong positive/negative correlations.

3 Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of
the Iris dataset from 4 features to 2.

4 Develop a program to load the Iris dataset. Implement the k-Nearest Neighbors (k-NN) algorithm for
classifying flowers based on their features. Split the dataset into training and testing sets and evaluate the
model using metrics like accuracy and F1-score. Test it for different values of 𝑘 (e.g., k=1,3,5) and evaluate
the accuracy. Extend the k-NN algorithm to assign weights based on the distance of neighbors (e.g.,
weight = 1/d²). Compare the performance of weighted k-NN and regular k-NN on a synthetic or real-world
dataset.

5 Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select
appropriate data set for your experiment and draw graphs.

6 Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use
Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction)
for Polynomial Regression.

7 Develop a program to load the Titanic dataset. Split the data into training and test sets. Train a decision tree
classifier. Visualize the tree structure. Evaluate accuracy, precision, recall, and F1-score.

8 Develop a program to implement the Naive Bayesian classifier considering Iris dataset for training. Compute
the accuracy of the classifier, considering the test data.

9 Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set and visualize
the clustering result.

Course outcomes (Course Skill Set):


At the end of the course the student will be able to:
● Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
● Demonstrate similarity-based learning methods and perform regression analysis.
● Develop decision trees for classification and regression problems, and Bayesian models for probabilistic
learning.
● Implement clustering algorithms to group unlabeled data into meaningful clusters.
Assessment Details (both CIE and SEE)
The weightage of Continuous Internal Evaluation (CIE) is 50% and for Semester End Exam (SEE) is 50%.
The minimum passing mark for the CIE is 40% of the maximum marks (20 marks out of 50) and for the
SEE minimum passing mark is 35% of the maximum marks (18 out of 50 marks). A student shall be
deemed to have satisfied the academic requirements and earned the credits allotted to each subject/
course if the student secures a minimum of 40% (40 marks out of 100) in the sum total of the CIE
(Continuous Internal Evaluation) and SEE (Semester End Examination) taken together.

Continuous Internal Evaluation (CIE):


CIE marks for the practical course are 50 Marks.
The split-up of CIE marks for record/journal and test is in the ratio 60:40.
• Each experiment is to be evaluated for conduction with an observation sheet and record
write-up. Rubrics for the evaluation of the journal/write-up for hardware/software
experiments are designed by the faculty who is handling the laboratory session and are
made known to students at the beginning of the practical session.
• Record should contain all the specified experiments in the syllabus and each experiment
write-up will be evaluated for 10 marks.
• Total marks scored by the students are scaled down to 30 marks (60% of maximum
marks).
• Weightage to be given for neatness and submission of record/write-up on time.
• Department shall conduct a test of 100 marks after the completion of all the experiments
listed in the syllabus.
• In a test, test write-up, conduction of experiment, acceptable result, and procedural
knowledge will carry a weightage of 60% and the rest 40% for viva-voce.
• The suitable rubrics can be designed to evaluate each student’s performance and learning
ability.
• The marks scored shall be scaled down to 20 marks (40% of the maximum marks).
The Sum of scaled-down marks scored in the report write-up/journal and marks of a test is the
total CIE marks scored by the student.
Semester End Evaluation (SEE):
• SEE marks for the practical course are 50 Marks.
• SEE shall be conducted jointly by two examiners of the same institute; the examiners are
appointed by the Head of the Institute.

• The examination schedule and names of examiners are informed to the university before
the conduct of the examination. These practical examinations are to be conducted within
the schedule mentioned in the academic calendar of the University.
• All laboratory experiments are to be included for practical examination.
• (Rubrics) Breakup of marks and the instructions printed on the cover page of the answer
script to be strictly adhered to by the examiners. OR based on the course requirement
evaluation rubrics shall be decided jointly by examiners.
• Students can pick one question (experiment) from the questions lot prepared by the
examiners jointly.
• Evaluation of test write-up/ conduction procedure and result/viva will be conducted
jointly by examiners.
• General rubrics suggested for SEE are mentioned here: write-up 20%, conduction procedure
and result 60%, viva-voce 20% of maximum marks. SEE for practical shall be evaluated for
100 marks and the scored marks shall be scaled down to 50 marks (however, based on course
type, rubrics shall be decided by the examiners).
Change of experiment is allowed only once, and 15% of the marks allotted to the procedure part
are to be made zero.
The minimum duration of SEE is 02 hours.
Suggested Learning Resources:
Books:

1. S Sridhar and M Vijayalakshmi, “Machine Learning”, Oxford University Press, 2021.


2. M N Murty and Ananthanarayana V S, “Machine Learning: Theory and Practice”, Universities Press (India)
Pvt. Limited, 2024.

Web links and Video Lectures (e-Resources):

● https://www.drssridhar.com/?page_id=1053
● https://www.universitiespress.com/resources?id=9789393330697
● https://onlinecourses.nptel.ac.in/noc23_cs18/preview
Experiment 1:
Develop a program to load a dataset and select one numerical column. Compute mean, median,
mode, standard deviation, variance, and range for a given numerical column in a dataset. Generate
a histogram and boxplot to understand the distribution of the data. Identify any outliers in the data
using IQR. Select a categorical variable from a dataset. Compute the frequency of each category
and display it as a bar chart or pie chart.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('pgm1.csv')

# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())

# Data cleaning and preprocessing
df = df.dropna()  # Remove rows with missing values

# Numerical analysis on the score columns
score_columns = ['math score', 'reading score', 'writing score']

# Compute statistics for each numerical column
for column in score_columns:
    mean = df[column].mean()
    median = df[column].median()
    mode = df[column].mode()[0]
    std_dev = df[column].std()
    variance = df[column].var()
    data_range = df[column].max() - df[column].min()

    # Display statistics
    print(f'\nStatistics for {column}:')
    print(f'Mean: {mean:.2f}')
    print(f'Median: {median:.2f}')
    print(f'Mode: {mode:.2f}')
    print(f'Standard Deviation: {std_dev:.2f}')
    print(f'Variance: {variance:.2f}')
    print(f'Range: {data_range:.2f}')

    # Generate histogram
    plt.figure(figsize=(10, 5))
    plt.hist(df[column], bins=10, color='blue', alpha=0.7)
    plt.title(f'Histogram of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.grid(axis='y')
    plt.show()

    # Generate boxplot
    plt.figure(figsize=(10, 5))
    sns.boxplot(x=df[column])
    plt.title(f'Boxplot of {column}')
    plt.show()

    # Identify outliers using the IQR rule
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    print(f'Outliers in {column}:')
    print(outliers)

# Categorical analysis on gender
categorical_column = 'gender'

# Compute frequency of each category
frequency = df[categorical_column].value_counts()

# Display frequency as a bar chart
plt.figure(figsize=(10, 5))
frequency.plot(kind='bar', color='orange')
plt.title(f'Frequency of Students by {categorical_column}')
plt.xlabel(categorical_column)
plt.ylabel('Frequency')
plt.show()

# Display frequency as a pie chart
plt.figure(figsize=(8, 8))
frequency.plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title(f'Distribution of Students by {categorical_column}')
plt.ylabel('')
plt.show()
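
Note: the listing above expects a local pgm1.csv file (a student-performance style dataset with score and gender columns). As a minimal self-contained sketch, the same statistics and frequency table can be tried on seaborn's bundled 'tips' dataset; the column names 'total_bill' and 'day' are assumptions tied to that substitute dataset, not part of the prescribed experiment.

import seaborn as sns

# Minimal sketch on seaborn's bundled 'tips' dataset (assumed substitute for pgm1.csv)
tips = sns.load_dataset('tips')
col = 'total_bill'                                   # assumed numerical column
print(tips[col].describe())                          # count, mean, std, min, quartiles, max
print('Mode:', tips[col].mode()[0])
print('Range:', tips[col].max() - tips[col].min())
print(tips['day'].value_counts())                    # frequency table for an assumed categorical column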

Experiment 2:
Develop a program to load a dataset with at least two numerical columns (e.g., Iris, Titanic). Plot
a scatter plot of two variables and calculate their Pearson correlation coefficient. Write a program
to compute the covariance and correlation matrix for a dataset. Visualize the correlation matrix
using a heatmap to know which variables have strong positive/negative correlations.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Load dataset from CSV file
df = pd.read_csv('pgm2.csv')  # Replace with the actual dataset filename

# Ensure only numerical columns are selected
df = df.select_dtypes(include=[np.number]).dropna()

# Select two numerical columns for the scatter plot
if df.shape[1] < 2:
    raise ValueError("Dataset must have at least two numerical columns.")
x_col = df.columns[0]  # First numerical column
y_col = df.columns[1]  # Second numerical column

# Scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df[x_col], y=df[y_col])
plt.xlabel(x_col)
plt.ylabel(y_col)
plt.title(f'Scatter Plot of {x_col} vs {y_col}')
plt.show()

# Compute Pearson correlation coefficient
if df[x_col].nunique() > 1 and df[y_col].nunique() > 1:
    pearson_corr = np.corrcoef(df[x_col], df[y_col])[0, 1]
    print(f'Pearson Correlation Coefficient between {x_col} and {y_col}: {pearson_corr:.2f}')
else:
    print('Cannot compute Pearson correlation coefficient because one of the columns has only one unique value.')

# Compute covariance matrix
cov_matrix = df.cov()
print("Covariance Matrix:")
print(cov_matrix)

# Compute correlation matrix
corr_matrix = df.corr()
print("Correlation Matrix:")
print(corr_matrix)

# Visualize correlation matrix using a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()
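
Note: as a cross-check of the np.corrcoef value, SciPy's pearsonr also reports a p-value for the correlation. This is an optional sketch, assuming SciPy is installed and the df, x_col and y_col variables from the listing above are still in scope.

from scipy.stats import pearsonr

# Pearson correlation with a two-sided p-value (null hypothesis: zero correlation)
r, p_value = pearsonr(df[x_col], df[y_col])
print(f'Pearson r = {r:.2f}, p-value = {p_value:.4f}')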

Experiment 3
Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load dataset from CSV file
df = pd.read_csv('pgm3.csv')  # Replace with the actual dataset filename

# Ensure only numerical columns are selected
df_numeric = df.select_dtypes(include=[np.number]).dropna()

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_numeric)

# Apply PCA to reduce dimensions from 4 to 2
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Convert to DataFrame for easier handling
df_pca = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])

# Plot the PCA-transformed data
plt.figure(figsize=(8, 6))
sns.scatterplot(x=df_pca['PC1'], y=df_pca['PC2'])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Dataset')
plt.show()
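
Note: the quality of the 4-to-2 reduction is usually reported as the proportion of variance retained by the two principal components (for the standardized Iris features this is typically around 95%). A short sketch, assuming the fitted pca object from the listing above:

# Proportion of the total variance captured by PC1 and PC2
print('Explained variance ratio:', pca.explained_variance_ratio_)
print('Total variance retained: {:.1f}%'.format(100 * pca.explained_variance_ratio_.sum()))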

Experiment 4:
Develop a program to load the Iris dataset. Implement the k-Nearest Neighbours (k-NN) algorithm
for classifying flowers based on their features. Split the dataset into training and testing sets and
evaluate the model using metrics like accuracy and F1-score. Test it for different values of 𝑘 (e.g.,
k=1,3,5) and evaluate the accuracy. Extend the k-NN algorithm to assign weights based on the
distance of neighbours (e.g., weight = 1/d²). Compare the performance of weighted k-NN and
regular k-NN on a synthetic or real-world dataset.

import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def load_data(csv_file):
    if not os.path.exists(csv_file):
        raise FileNotFoundError(f"Error: The file '{csv_file}' was not found. "
                                "Please check the filename and path.")
    data = pd.read_csv(csv_file)
    X = data.iloc[:, :-1].values  # Features
    y = data.iloc[:, -1].values   # Target
    return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

def train_knn(X_train, y_train, X_test, y_test, k, weighted=False):
    knn = KNeighborsClassifier(n_neighbors=k, weights='distance' if weighted else 'uniform')
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    return accuracy_score(y_test, y_pred)

def main():
    csv_file = "iris_data.csv"  # Ensure the file is in the same directory
    X_train, X_test, y_train, y_test = load_data(csv_file)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    for k in [1, 3, 5, 7, 9]:
        print(f"k={k}, Accuracy (Regular): {train_knn(X_train, y_train, X_test, y_test, k):.4f}")
        print(f"k={k}, Accuracy (Weighted): {train_knn(X_train, y_train, X_test, y_test, k, weighted=True):.4f}")

if __name__ == "__main__":
    main()
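
Note: scikit-learn's weights='distance' option uses 1/d weighting, whereas the experiment statement names weight = 1/d²; the listing above also reports only accuracy, not F1-score. A sketch of both extensions is given below, assuming the same scaled split produced in main(); the helper name and the epsilon value are illustrative choices, not part of the prescribed program.

from sklearn.metrics import f1_score

# Hypothetical helper: k-NN with explicit 1/d^2 weighting plus macro-averaged F1-score.
def train_knn_inverse_square(X_train, y_train, X_test, y_test, k):
    # A callable passed as `weights` receives an array of neighbour distances and
    # must return an array of the same shape containing the weights.
    def inv_sq(dists):
        return 1.0 / (dists ** 2 + 1e-9)   # small epsilon avoids division by zero
    knn = KNeighborsClassifier(n_neighbors=k, weights=inv_sq)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    return accuracy_score(y_test, y_pred), f1_score(y_test, y_pred, average='macro')

# Example usage inside main(), after the features are scaled:
# acc, f1 = train_knn_inverse_square(X_train, y_train, X_test, y_test, k=3)
# print(f"k=3, Accuracy (1/d^2): {acc:.4f}, F1 (macro): {f1:.4f}")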

Experiment 5:
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points.
Select an appropriate data set for your experiment and draw graphs.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

def gaussian_kernel(X, x_query, tau):
    """Compute Gaussian weights for all training points relative to x_query."""
    weights = np.exp(-np.square(X[:, 1] - x_query[1]) / (2 * tau ** 2))
    return np.diag(weights)  # Convert to a diagonal weight matrix

def locally_weighted_regression(X_train, y_train, x_query, tau):
    """Compute the LWR prediction for a single query point x_query."""
    W = gaussian_kernel(X_train, x_query, tau)  # Weights for this query point
    theta = np.linalg.pinv(X_train.T @ W @ X_train) @ (X_train.T @ W @ y_train)
    return x_query @ theta  # Return prediction

def predict_lwr(X_train, y_train, X_test, tau):
    """Compute LWR predictions for multiple query points."""
    return np.array([locally_weighted_regression(X_train, y_train, x, tau) for x in X_test])

# Load dataset from CSV
data = pd.read_csv("data.csv")
X = data["X"].values.reshape(-1, 1)
y = data["y"].values.reshape(-1, 1)

# Add bias term to X
X_bias = np.hstack([np.ones_like(X), X])

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_bias, y, test_size=0.2, random_state=42)

# Define tau (bandwidth parameter)
tau = 0.5

# Compute predictions
y_pred = predict_lwr(X_train, y_train, X_test, tau)

# Plot results
plt.scatter(X, y, label="Data", color="blue", alpha=0.5)
sort_idx = X_test[:, 1].argsort()
plt.plot(X_test[:, 1][sort_idx], y_pred[sort_idx], label="LWR Fit", color="red")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.title("Locally Weighted Regression (LWR)")
plt.show()
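
Note: the bandwidth tau controls how local the fit is; a small tau follows the data closely (and can overfit noise), while a large tau approaches ordinary linear regression. The sketch below overlays fits for a few candidate values, assuming the functions and arrays defined in the listing above; the tau values are chosen only for illustration.

# Compare several bandwidths on the same test points
plt.scatter(X[:, 0], y, color="blue", alpha=0.3, label="Data")
order = X_test[:, 1].argsort()
for tau_value in [0.1, 0.5, 2.0]:
    preds = predict_lwr(X_train, y_train, X_test, tau_value)
    plt.plot(X_test[:, 1][order], preds[order], label=f"tau = {tau_value}")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.title("Effect of the bandwidth tau on the LWR fit")
plt.show()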

Experiment 6:
Develop a program to demonstrate the working of Linear Regression and Polynomial Regression.
Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel
efficiency prediction) for Polynomial Regression.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Function to clean the Auto MPG dataset (handling missing or non-numeric values)
def clean_auto_mpg_data(df):
    # Convert 'horsepower' column to numeric, coercing errors to NaN
    df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
    # Fill missing values in 'horsepower' column with the mean of the column
    df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())
    return df

# --- Boston Housing Dataset ---
# Load Boston Housing Dataset (assuming the file is in the same directory)
boston_df = pd.read_csv("boston_housing.csv")

# Select the average number of rooms (RM) as the feature and price (PRICE) as the target
X_boston = boston_df[['RM']].values
y_boston = boston_df['PRICE'].values

# Split the dataset into training and testing sets
X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(
    X_boston, y_boston, test_size=0.2, random_state=42)

# Linear Regression for the Boston Housing Dataset
linear_model = LinearRegression()
linear_model.fit(X_train_boston, y_train_boston)

# Predictions for the test set
y_pred_boston = linear_model.predict(X_test_boston)

# Plot Linear Regression results
plt.scatter(X_test_boston, y_test_boston, color='blue', label='Actual')
plt.plot(X_test_boston, y_pred_boston, color='red', label='Predicted')
plt.xlabel("Average number of rooms (RM)")
plt.ylabel("House Price")
plt.title("Linear Regression - Boston Housing Dataset")
plt.legend()
plt.show()

# Print Mean Squared Error for Linear Regression on the Boston dataset
print("Boston Housing Linear Regression MSE:", mean_squared_error(y_test_boston, y_pred_boston))

# --- Auto MPG Dataset ---
# Load Auto MPG Dataset
auto_mpg_df = pd.read_csv("auto_mpg.csv")

# Clean the dataset
auto_mpg_df = clean_auto_mpg_data(auto_mpg_df)

# Select 'horsepower' as the feature and 'mpg' as the target
X_auto = auto_mpg_df[['horsepower']].values
y_auto = auto_mpg_df['mpg'].values

# Split the dataset into training and testing sets
X_train_auto, X_test_auto, y_train_auto, y_test_auto = train_test_split(
    X_auto, y_auto, test_size=0.2, random_state=42)

# Polynomial Regression for the Auto MPG Dataset
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_auto)
X_test_poly = poly.transform(X_test_auto)
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train_auto)

# Predictions for the test set
y_poly_pred = poly_model.predict(X_test_poly)

# Plot Polynomial Regression results
plt.scatter(X_test_auto, y_test_auto, color='blue', label='Actual')
plt.scatter(X_test_auto, y_poly_pred, color='red', label='Predicted')
plt.xlabel("Horsepower")
plt.ylabel("MPG")
plt.title("Polynomial Regression - Auto MPG Dataset")
plt.legend()
plt.show()

# Print Mean Squared Error for Polynomial Regression on the Auto MPG dataset
print("Auto MPG Polynomial Regression MSE:", mean_squared_error(y_test_auto, y_poly_pred))

Experiment 7:
Develop a program to load the Titanic dataset. Split the data into training and test sets. Train a
decision tree classifier. Visualize the tree structure. Evaluate accuracy, precision, recall, and F1-
score.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder
# Load dataset
data = pd.read_csv("pgm7.csv")
# Selecting relevant features and handling missing values
data = data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
data.dropna(inplace=True)
# Encoding categorical variables
le_sex = LabelEncoder()
le_embarked = LabelEncoder()
data['Sex'] = le_sex.fit_transform(data['Sex'])
data['Embarked'] = le_embarked.fit_transform(data['Embarked'])
# Splitting data into features and target variable
X = data.drop(columns=['Survived'])
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# Visualizing the tree structure
plt.figure(figsize=(15, 10))
plot_tree(clf, feature_names=X.columns, class_names=['Not Survived', 'Survived'], filled=True)
plt.show()
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

Experiment 8:
Develop a program to implement the Naive Bayesian classifier considering Iris dataset for
training. Compute the accuracy of the classifier, considering the test data.

# Import necessary libraries
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Load training and test datasets (assuming they are CSV files)
train_data = pd.read_csv('train_data.csv') # Training data file
test_data = pd.read_csv('test_data.csv') # Test data file
# Separate features (X) and labels (y) for training and test datasets
X_train = train_data.drop(columns=['target']) # Features for training data
y_train = train_data['target'] # Labels for training data
X_test = test_data.drop(columns=['target']) # Features for test data
y_test = test_data['target'] # Labels for test data
# Initialize the Naive Bayes classifier
nb_classifier = GaussianNB()
# Train the classifier with the training data
nb_classifier.fit(X_train, y_train)
# Predict the labels on the test set
y_pred = nb_classifier.predict(X_test)
# Compute the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
# Print the accuracy
print(f"Accuracy of Naive Bayes classifier on test data: {accuracy * 100:.2f}%")

Experiment 9:
Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set
and visualize the clustering result.

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Set environment variable to avoid memory leak warning on Windows
os.environ["OMP_NUM_THREADS"] = "3"
# Load the dataset from a CSV file
df = pd.read_csv('breast_cancer_data.csv')
# Encode categorical columns if present
for col in df.select_dtypes(include=['object']).columns:
    df[col] = LabelEncoder().fit_transform(df[col])
# Assume the last column is the target (drop it for clustering)
df_features = df.iloc[:, :-1]
# Standardize the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_features)
# Apply K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(df_scaled)
labels = kmeans.labels_
# Reduce dimensions using PCA for visualization
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)
# Scatter plot of clusters
plt.figure(figsize=(8, 6))
plt.scatter(df_pca[:, 0], df_pca[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clustering on Breast Cancer Dataset')
plt.colorbar(label='Cluster')
plt.show()
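
Note: the Wisconsin Breast Cancer data also carries a diagnosis label, so the unsupervised clusters can be compared against it to see how well k-means recovers the two classes. A short sketch, assuming (as the listing above does) that the last column of df is that label:

from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Compare cluster assignments with the held-out diagnosis column
true_labels = df.iloc[:, -1]
print("Adjusted Rand Index:", adjusted_rand_score(true_labels, labels))
print("Cluster vs. label counts:\n", confusion_matrix(true_labels, labels))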
