aiml manual 6th sem
2 Develop a program to load a dataset with at least two numerical columns (e.g., Iris, Titanic). Plot a scatter
plot of two variables and calculate their Pearson correlation coefficient. Write a program to compute the
covariance and correlation matrix for a dataset. Visualize the correlation matrix using a heatmap to know
which variables have strong positive/negative correlations.
3 Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of
the Iris dataset from 4 features to 2.
4 Develop a program to load the Iris dataset. Implement the k-Nearest Neighbors (k-NN) algorithm for
classifying flowers based on their features. Split the dataset into training and testing sets and evaluate the
model using metrics like accuracy and F1-score. Test it for different values of k (e.g., k=1,3,5) and evaluate
the accuracy. Extend the k-NN algorithm to assign weights based on the distance of neighbors (e.g.,
weight = 1/d²). Compare the performance of weighted k-NN and regular k-NN on a synthetic or real-world
dataset.
6 Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select
appropriate data set for your experiment and draw graphs.
7 Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use
Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction)
for Polynomial Regression.
8 Develop a program to load the Titanic dataset. Split the data into training and test sets. Train a decision tree
classifier. Visualize the tree structure. Evaluate accuracy, precision, recall, and F1-score.
9 Develop a program to implement the Naive Bayesian classifier considering Iris dataset for training. Compute
the accuracy of the classifier, considering the test data.
10 Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set and visualize
the clustering result.
Template for Practical Course and if AEC is a practical Course - Annexure-V
• The examination schedule and the names of the examiners are communicated to the university before
the conduct of the examination. The practical examinations are to be conducted within the schedule
specified in the academic calendar of the University.
• All laboratory experiments are to be included for practical examination.
• (Rubrics) The breakup of marks and the instructions printed on the cover page of the answer
script are to be strictly adhered to by the examiners, OR, based on the course requirements,
the evaluation rubrics shall be decided jointly by the examiners.
• Students can pick one question (experiment) from the question lot prepared jointly by the
examiners.
• Evaluation of the test write-up / conduction procedure and result / viva will be conducted
jointly by the examiners.
• General rubrics suggested for SEE are: write-up - 20%, conduction procedure and result - 60%,
viva-voce - 20% of the maximum marks. SEE for practicals shall be evaluated for 100 marks, and
the scored marks shall be scaled down to 50 marks (however, based on the course type, the
rubrics shall be decided by the examiners).
Change of experiment is allowed only once, and 15% of the marks allotted to the procedure part
are to be made zero.
The minimum duration of SEE is 02 hours.
Suggested Learning Resources:
Books:
● https://www.drssridhar.com/?page_id=1053
● https://www.universitiespress.com/resources?id=9789393330697
● https://onlinecourses.nptel.ac.in/noc23_cs18/preview
Experiment 1:
Develop a program to load a dataset and select one numerical column. Compute mean, median,
mode, standard deviation, variance, and range for a given numerical column in a dataset. Generate
a histogram and boxplot to understand the distribution of the data. Identify any outliers in the data
using IQR. Select a categorical variable from a dataset. Compute the frequency of each category
and display it as a bar chart or pie chart.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset
df = pd.read_csv('pgm1.csv')
# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())
# Data Cleaning and Preprocessing
df = df.dropna() # Remove rows with missing values
# Numerical Analysis on Scores
score_columns = ['math score', 'reading score', 'writing score']
# Compute statistics
for column in score_columns:
    mean = df[column].mean()
    median = df[column].median()
    mode = df[column].mode()[0]
    std_dev = df[column].std()
    variance = df[column].var()
    data_range = df[column].max() - df[column].min()
    # Display statistics
    print(f'\nStatistics for {column}:')
    print(f'Mean: {mean:.2f}')
    print(f'Median: {median:.2f}')
    print(f'Mode: {mode:.2f}')
    print(f'Standard Deviation: {std_dev:.2f}')
    print(f'Variance: {variance:.2f}')
    print(f'Range: {data_range:.2f}')
    # Generate histogram
    plt.figure(figsize=(10, 5))
    plt.hist(df[column], bins=10, color='blue', alpha=0.7)
    plt.title(f'Histogram of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.grid(axis='y')
    plt.show()
    # Generate boxplot
    plt.figure(figsize=(10, 5))
    sns.boxplot(x=df[column])
    plt.title(f'Boxplot of {column}')
    plt.show()
    # Identify outliers using IQR
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    print(f'Outliers in {column}:')
    print(outliers)
# Categorical Analysis on Gender
categorical_column = 'gender'
# Compute frequency of each category
frequency = df[categorical_column].value_counts()
# Display frequency as bar chart
plt.figure(figsize=(10, 5))
frequency.plot(kind='bar', color='orange')
plt.title(f'Frequency of Students by {categorical_column}')
plt.xlabel(categorical_column)
plt.ylabel('Frequency')
plt.show()
# Display frequency as pie chart
plt.figure(figsize=(8, 8))
frequency.plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title(f'Distribution of Students by {categorical_column}')
plt.ylabel('')
plt.show()
Experiment 2:
Develop a program to load a dataset with at least two numerical columns (e.g., Iris, Titanic). Plot
a scatter plot of two variables and calculate their Pearson correlation coefficient. Write a program
to compute the covariance and correlation matrix for a dataset. Visualize the correlation matrix
using a heatmap to know which variables have strong positive/negative correlations.
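The manual provides no reference program for this experiment. Below is a minimal sketch, assuming the Iris dataset bundled with scikit-learn (the column names such as 'sepal length (cm)' come from that loader; any dataset with at least two numerical columns can be substituted).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
# Load the Iris dataset as a DataFrame
iris = load_iris(as_frame=True)
df = iris.data
# Scatter plot of two numerical variables
x_col, y_col = 'sepal length (cm)', 'petal length (cm)'
plt.scatter(df[x_col], df[y_col], alpha=0.6)
plt.xlabel(x_col)
plt.ylabel(y_col)
plt.title(f'Scatter plot of {x_col} vs {y_col}')
plt.show()
# Pearson correlation coefficient between the two variables
pearson_r = df[x_col].corr(df[y_col])
print(f'Pearson correlation between {x_col} and {y_col}: {pearson_r:.3f}')
# Covariance and correlation matrices for the whole dataset
print('Covariance matrix:')
print(df.cov())
corr_matrix = df.corr()
print('Correlation matrix:')
print(corr_matrix)
# Heatmap of the correlation matrix to highlight strong positive/negative correlations
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()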
Experiment 3
Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
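The manual provides no reference program for this experiment either. A minimal sketch, assuming the Iris dataset is loaded directly from scikit-learn rather than from a local CSV file:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset (4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target
# Standardize the features so each contributes equally to the principal components
X_scaled = StandardScaler().fit_transform(X)
# Reduce the dimensionality from 4 features to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print('Explained variance ratio:', pca.explained_variance_ratio_)
# Scatter plot of the data in the reduced 2-D space, coloured by class
plt.figure(figsize=(8, 6))
for label, name in enumerate(iris.target_names):
    plt.scatter(X_pca[y == label, 0], X_pca[y == label, 1], label=name, alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of the Iris Dataset (4 features to 2 components)')
plt.legend()
plt.show()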
Experiment 4:
Develop a program to load the Iris dataset. Implement the k-Nearest Neighbours (k-NN) algorithm
for classifying flowers based on their features. Split the dataset into training and testing sets and
evaluate the model using metrics like accuracy and F1-score. Test it for different values of k (e.g.,
k=1,3,5) and evaluate the accuracy. Extend the k-NN algorithm to assign weights based on the
distance of neighbours (e.g., weight = 1/d²). Compare the performance of weighted k-NN and
regular k-NN on a synthetic or real-world dataset.
import pandas as pd
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
def load_data(csv_file):
    if not os.path.exists(csv_file):
        raise FileNotFoundError(f"Error: The file '{csv_file}' was not found. Please check the filename and path.")
    data = pd.read_csv(csv_file)
    X = data.iloc[:, :-1].values  # Features
    y = data.iloc[:, -1].values   # Target
    return train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

def train_knn(X_train, y_train, X_test, y_test, k, weighted=False):
    knn = KNeighborsClassifier(n_neighbors=k, weights='distance' if weighted else 'uniform')
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    return accuracy_score(y_test, y_pred)

def main():
    csv_file = "iris_data.csv"  # Ensure the file is in the same directory
    X_train, X_test, y_train, y_test = load_data(csv_file)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    for k in [1, 3, 5, 7, 9]:
        print(f"k={k}, Accuracy (Regular): {train_knn(X_train, y_train, X_test, y_test, k):.4f}")
        print(f"k={k}, Accuracy (Weighted): {train_knn(X_train, y_train, X_test, y_test, k, weighted=True):.4f}")

if __name__ == "__main__":
    main()
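The reference program above reports only accuracy and uses scikit-learn's built-in 'distance' weighting, which weights neighbours by 1/d rather than by 1/d² as the experiment statement asks. A minimal sketch of the two missing pieces is given below; it reuses the variable names from the program above, and the custom weight function and the epsilon term are assumptions made for illustration.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

def inverse_square_weights(distances):
    # weight = 1 / d^2; a small epsilon avoids division by zero when a test point
    # coincides exactly with a training point
    return 1.0 / (distances ** 2 + 1e-9)

def evaluate_knn(X_train, y_train, X_test, y_test, k, weights='uniform'):
    # weights may be 'uniform', 'distance', or a callable such as inverse_square_weights
    knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    return accuracy_score(y_test, y_pred), f1_score(y_test, y_pred, average='macro')

# Example usage inside main(), after the features have been scaled:
# acc, f1 = evaluate_knn(X_train, y_train, X_test, y_test, k=5, weights=inverse_square_weights)
# print(f"k=5, Accuracy (1/d^2 weighted): {acc:.4f}, F1-score (macro): {f1:.4f}")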
Experiment 5:
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points.
Select appropriate data set for your experiment and draw graphs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
def gaussian_kernel(X, x_query, tau):
    """Compute Gaussian weights for all training points relative to x_query."""
    weights = np.exp(-np.square(X[:, 1] - x_query[1]) / (2 * tau ** 2))
    return np.diag(weights)  # Convert to diagonal weight matrix

def locally_weighted_regression(X_train, y_train, x_query, tau):
    """Compute LWR prediction for a single query point x_query."""
    W = gaussian_kernel(X_train, x_query, tau)  # Weight each training point by its proximity to x_query
    theta = np.linalg.pinv(X_train.T @ W @ X_train) @ (X_train.T @ W @ y_train)
    return x_query @ theta  # Return prediction

def predict_lwr(X_train, y_train, X_test, tau):
    """Compute LWR predictions for multiple query points."""
    return np.array([locally_weighted_regression(X_train, y_train, x, tau) for x in X_test])
# Load dataset from CSV
data = pd.read_csv("data.csv")
X = data["X"].values.reshape(-1, 1)
y = data["y"].values.reshape(-1, 1)
# Add bias term to X
X_bias = np.hstack([np.ones_like(X), X])
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_bias, y, test_size=0.2, random_state=42)
# Define tau (bandwidth parameter)
tau = 0.5
# Compute predictions
y_pred = predict_lwr(X_train, y_train, X_test, tau)
# Plot results
plt.scatter(X, y, label="Data", color="blue", alpha=0.5)
X_test_sorted = X_test[:, 1].argsort()
plt.plot(X_test[:, 1][X_test_sorted], y_pred[X_test_sorted], label="LWR Fit", color="red")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.title("Locally Weighted Regression (LWR)")
plt.show()
Experiment 6:
Develop a program to demonstrate the working of Linear Regression and Polynomial Regression.
Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel
efficiency prediction) for Polynomial Regression.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
# Function to clean Auto MPG dataset (handling missing or non-numeric values)
def clean_auto_mpg_data(df):
    # Convert 'horsepower' column to numeric, coercing errors to NaN
    df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
    # Fill missing values in 'horsepower' column with the mean of the column
    df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())
    return df
# --- Boston Housing Dataset ---
# Load Boston Housing Dataset (assuming the file is in the same directory)
boston_df = pd.read_csv("boston_housing.csv")
# Selecting average number of rooms (RM) as the feature and price (PRICE) as the target
X_boston = boston_df[['RM']].values
y_boston = boston_df['PRICE'].values
# Split the dataset into training and testing sets
X_train_boston, X_test_boston, y_train_boston, y_test_boston = train_test_split(X_boston,
y_boston, test_size=0.2, random_state=42)
# Linear Regression for Boston Housing Dataset
linear_model = LinearRegression()
linear_model.fit(X_train_boston, y_train_boston)
# Predictions for the test set
y_pred_boston = linear_model.predict(X_test_boston)
# Plotting Linear Regression results
plt.scatter(X_test_boston, y_test_boston, color='blue', label='Actual')
plt.plot(X_test_boston, y_pred_boston, color='red', label='Predicted')
plt.xlabel("Average number of rooms (RM)")
plt.ylabel("House Price")
plt.title("Linear Regression - Boston Housing Dataset")
plt.legend()
plt.show()
# Print Mean Squared Error for Linear Regression on Boston dataset
print("Boston Housing Linear Regression MSE:", mean_squared_error(y_test_boston,
y_pred_boston))
# --- Auto MPG Dataset ---
# Load Auto MPG Dataset
auto_mpg_df = pd.read_csv("auto_mpg.csv")
# Clean the dataset
auto_mpg_df = clean_auto_mpg_data(auto_mpg_df)
# Selecting 'horsepower' as the feature and 'mpg' as the target
X_auto = auto_mpg_df[['horsepower']].values
y_auto = auto_mpg_df['mpg'].values
# Split the dataset into training and testing sets
X_train_auto, X_test_auto, y_train_auto, y_test_auto = train_test_split(X_auto, y_auto,
test_size=0.2, random_state=42)
# Polynomial Regression for Auto MPG Dataset
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train_auto)
X_test_poly = poly.transform(X_test_auto)
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train_auto)
# Predictions for the test set
y_poly_pred = poly_model.predict(X_test_poly)
# Plotting Polynomial Regression results
plt.scatter(X_test_auto, y_test_auto, color='blue', label='Actual')
plt.scatter(X_test_auto, y_poly_pred, color='red', label='Predicted')
plt.xlabel("Horsepower")
plt.ylabel("MPG")
plt.title("Polynomial Regression - Auto MPG Dataset")
plt.legend()
plt.show()
# Print Mean Squared Error for Polynomial Regression on Auto MPG dataset
print("Auto MPG Polynomial Regression MSE:", mean_squared_error(y_test_auto, y_poly_pred))
Experiment 7:
Develop a program to load the Titanic dataset. Split the data into training and test sets. Train a
decision tree classifier. Visualize the tree structure. Evaluate accuracy, precision, recall, and F1-
score.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder
# Load dataset
data=pd.read_csv("pgm7.csv")
# Selecting relevant features and handling missing values
data = data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
data.dropna(inplace=True)
# Encoding categorical variables
le_sex = LabelEncoder()
le_embarked = LabelEncoder()
data['Sex'] = le_sex.fit_transform(data['Sex'])
data['Embarked'] = le_embarked.fit_transform(data['Embarked'])
# Splitting data into features and target variable
X = data.drop(columns=['Survived'])
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# Visualizing the tree structure
plt.figure(figsize=(15, 10))
plot_tree(clf, feature_names=X.columns, class_names=['Not Survived', 'Survived'], filled=True)
plt.show()
# Make predictions
y_pred = clf.predict(X_test)
# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
Experiment 8:
Develop a program to implement the Naive Bayesian classifier considering Iris dataset for
training. Compute the accuracy of the classifier, considering the test data.
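The manual provides no reference program for this experiment. A minimal sketch using Gaussian Naive Bayes, assuming the Iris dataset is loaded from scikit-learn (loading it from a local CSV, as in the other experiments, works equally well):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
# Load the Iris dataset and split it into training and test sets
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Train the Gaussian Naive Bayes classifier on the training data
model = GaussianNB()
model.fit(X_train, y_train)
# Predict on the test data and report the accuracy
y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred, target_names=iris.target_names))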
Experiment 9:
Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set
and visualize the clustering result.
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Set environment variable to avoid memory leak warning on Windows
os.environ["OMP_NUM_THREADS"] = "3"
# Load the dataset from a CSV file
df = pd.read_csv('breast_cancer_data.csv')
# Encode categorical columns if present
for col in df.select_dtypes(include=['object']).columns:
    df[col] = LabelEncoder().fit_transform(df[col])
# Assume the last column is the target (drop it for clustering)
df_features = df.iloc[:, :-1]
# Standardize the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_features)
# Apply K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(df_scaled)
labels = kmeans.labels_
# Reduce dimensions using PCA for visualization
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_scaled)
# Scatter plot of clusters
plt.figure(figsize=(8, 6))
plt.scatter(df_pca[:, 0], df_pca[:, 1], c=labels, cmap='viridis', alpha=0.6)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clustering on Breast Cancer Dataset')
plt.colorbar(label='Cluster')
plt.show()