ml lab
Program 1
Develop a program to load a dataset and select one numerical column. Compute the mean,
median, mode, standard deviation, variance, and range for that column. Generate a histogram
and a boxplot to understand the distribution of the data. Identify any outliers using the
interquartile range (IQR). Select a categorical variable from the dataset, compute the
frequency of each category, and display it as a bar chart or pie chart.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Compute statistics
# (df is the loaded DataFrame; num_col is the name of the chosen numerical column)
mean_value = df[num_col].mean()
median_value = df[num_col].median()
mode_value = df[num_col].mode()[0] # Mode might return multiple values
std_dev = df[num_col].std()
variance = df[num_col].var()
data_range = df[num_col].max() - df[num_col].min()
# Print statistics
print("\nStatistical Measures for", num_col)
print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")
print(f"Standard Deviation: {std_dev}")
print(f"Variance: {variance}")
print(f"Range: {data_range}")
# Plot Histogram
plt.figure(figsize=(6, 4))
sns.histplot(df[num_col], bins=20, kde=True)
plt.title(f"Histogram of {num_col}")
plt.xlabel(num_col)
plt.ylabel("Frequency")
plt.show()
# Plot Boxplot
plt.figure(figsize=(6, 4))
sns.boxplot(x=df[num_col])
plt.title(f"Boxplot of {num_col}")
plt.show()
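The task statement also asks for IQR-based outlier detection and category frequencies, which the listing above does not show. A minimal sketch, assuming a small hand-made DataFrame (the `age` and `city` columns are hypothetical stand-ins for the chosen numerical and categorical columns) and the common 1.5 x IQR fence rule:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data; replace with the dataset actually loaded into df.
df = pd.DataFrame({
    "age": [22, 25, 27, 29, 30, 31, 33, 35, 36, 95],
    "city": ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"],
})
num_col, cat_col = "age", "city"


def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


outliers = iqr_outliers(df[num_col])
print("Outliers:", outliers.tolist())

# Frequency of each category, most common first
category_counts = df[cat_col].value_counts()
print(category_counts)

# Bar chart of the category frequencies
category_counts.plot(kind="bar", title=f"Frequency of {cat_col}")
plt.xlabel(cat_col)
plt.ylabel("Count")
plt.tight_layout()
plt.show()
```

Passing `kind="pie"` instead of `kind="bar"` would give the pie-chart variant the task mentions.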
Program 2
Develop a program to load a dataset with at least two numerical columns (e.g., Iris, Titanic).
Plot a scatter plot of two variables and calculate their Pearson correlation coefficient. Write a
program to compute the covariance and correlation matrices for the dataset. Visualize the
correlation matrix as a heatmap to see which variables have strong positive or negative
correlations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Scatter plot (num_col1 and num_col2 are the names of two numerical columns of df)
plt.figure(figsize=(6, 4))
sns.scatterplot(x=df[num_col1], y=df[num_col2])
plt.title(f"Scatter Plot: {num_col1} vs {num_col2}")
plt.xlabel(num_col1)
plt.ylabel(num_col2)
plt.show()
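The scatter plot covers only part of the task; the Pearson coefficient, the covariance and correlation matrices, and the heatmap could be computed along the following lines. This is a sketch that assumes the Iris data from scikit-learn and picks the two petal columns as an example; any DataFrame with numerical columns would work the same way.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Iris is used here only as an example dataset (an assumption of this sketch).
df = load_iris(as_frame=True).data
num_col1, num_col2 = "petal length (cm)", "petal width (cm)"

# Pearson correlation coefficient between the two selected columns
pearson_r = df[num_col1].corr(df[num_col2], method="pearson")
print(f"Pearson r ({num_col1} vs {num_col2}): {pearson_r:.4f}")

# Covariance and correlation matrices over all numerical columns
cov_matrix = df.cov()
corr_matrix = df.corr()
print(cov_matrix)
print(corr_matrix)

# Heatmap: values near +1 or -1 mark strong positive or negative correlation
plt.figure(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()
```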
Program 3
Develop a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
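Only the imports are shown for this program. A minimal sketch of the reduction itself, assuming the features are standardized first (a common but optional choice) and using scikit-learn's `PCA` with `n_components=2`:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# Standardize so each feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 standardized features onto the first 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print("Reduced shape:", X_pca.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Scatter plot of the projected data, one color per species
plt.figure(figsize=(6, 4))
for label, name in enumerate(iris.target_names):
    plt.scatter(X_pca[y == label, 0], X_pca[y == label, 1], label=name)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of the Iris Dataset")
plt.legend()
plt.show()
```

The explained variance ratio indicates how much of the original variance the two retained components keep.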
Program 4
Develop a program to load the Iris dataset. Implement the k-Nearest Neighbors (k-NN)
algorithm for classifying flowers based on their features. Split the dataset into training and
testing sets and evaluate the model using metrics like accuracy and F1-score. Test it for
different values of k (e.g., k = 1, 3, 5) and evaluate the accuracy. Extend the k-NN algorithm to
assign weights based on the distance of neighbors (e.g., weight = 1/d^2). Compare the
performance of weighted k-NN and regular k-NN on a synthetic or real-world dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
# knn is a KNeighborsClassifier created for the current value of k
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# Print results
print("\nRegular k-NN Performance:")
print(knn_df)
print("\nWeighted k-NN Performance:")
print(weighted_knn_df)
# Plot comparison
plt.figure(figsize=(8, 5))
plt.plot(knn_df['k'], knn_df['Accuracy'], marker='o', label='Regular k-NN')
plt.plot(weighted_knn_df['k'], weighted_knn_df['Accuracy'], marker='s', linestyle='dashed',
label='Weighted k-NN')
plt.xlabel("k (Number of Neighbors)")
plt.ylabel("Accuracy")
plt.title("k-NN vs. Weighted k-NN Performance")
plt.legend()
plt.show()
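The listing above prints and plots `knn_df` and `weighted_knn_df` without showing how they are built. One way they might be produced is sketched below. The 70/30 split, the macro-averaged F1, and the helper names `inverse_square_weights` and `evaluate` are assumptions of this sketch, not part of the original program. Note that scikit-learn's built-in `weights="distance"` uses 1/d, so the 1/d^2 rule from the task is passed as a callable instead.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit the scaler on the training split only to avoid leaking test statistics
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)


def inverse_square_weights(distances):
    """weight = 1/d^2; the small epsilon guards against division by zero."""
    return 1.0 / (distances ** 2 + 1e-9)


def evaluate(weights, k_values=(1, 3, 5)):
    """Return one row of accuracy and macro F1 per value of k."""
    rows = []
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        rows.append({
            "k": k,
            "Accuracy": accuracy_score(y_test, y_pred),
            "F1-Score": f1_score(y_test, y_pred, average="macro"),
        })
    return pd.DataFrame(rows)


knn_df = evaluate("uniform")                       # regular k-NN
weighted_knn_df = evaluate(inverse_square_weights)  # distance-weighted k-NN
print(knn_df)
print(weighted_knn_df)
```

With k = 1 both variants consult the same single neighbor, so any difference between them can only appear for k > 1.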
Program 5
Implement the non-parametric Locally Weighted Regression algorithm to fit data points.
Select an appropriate dataset for your experiment and draw graphs.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
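Only the imports for this program are shown. A minimal NumPy-only sketch of locally weighted regression follows, assuming a Gaussian kernel with bandwidth `tau`, one input feature, and synthetic noisy sine data; all three are choices made for illustration, not taken from the original program.

```python
import numpy as np
import matplotlib.pyplot as plt


def locally_weighted_regression(x_query, X, y, tau):
    """Predict y at x_query with a Gaussian-weighted linear fit around it."""
    X_b = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    xq_b = np.array([1.0, x_query])
    # Gaussian kernel: points near x_query get weights close to 1
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^+ X^T W y
    theta = np.linalg.pinv(X_b.T @ W @ X_b) @ X_b.T @ W @ y
    return xq_b @ theta


# Synthetic data (an assumption of this sketch): a noisy sine curve
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + rng.normal(scale=0.1, size=X.shape)

tau = 0.3  # bandwidth: smaller values follow the data more closely
X_query = np.linspace(0, 2 * np.pi, 200)
y_fit = np.array([locally_weighted_regression(xq, X, y, tau) for xq in X_query])

plt.figure(figsize=(6, 4))
plt.scatter(X, y, s=10, alpha=0.6, label="Data")
plt.plot(X_query, y_fit, color="red", label=f"LWR fit (tau={tau})")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Locally Weighted Regression")
plt.legend()
plt.show()
```

Because a separate weighted fit is solved for every query point, the method keeps the whole training set around at prediction time, which is what makes it non-parametric.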
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Program 7
Develop a program to load the Titanic dataset. Split the data into training and test sets. Train
a decision tree classifier. Visualize the tree structure. Evaluate accuracy, precision, recall, and
F1-score.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder
# Predictions (dt_model is the fitted DecisionTreeClassifier)
y_pred = dt_model.predict(X_test)
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
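The evaluation code above relies on `dt_model`, `X_test`, and `y_test`, whose construction is not shown. A hedged sketch of the missing preparation and training steps follows. It assumes the column names of seaborn's copy of the Titanic data (`survived`, `pclass`, `sex`, `age`, `sibsp`, `parch`, `fare`, `embarked`); the feature selection, the imputation rules, `max_depth=4`, and the helper names are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Assumed feature set, following seaborn's Titanic column names
FEATURES = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]


def prepare_titanic(df):
    """Select features, fill missing values, and label-encode text columns."""
    data = df[FEATURES + ["survived"]].copy()
    data["age"] = data["age"].fillna(data["age"].median())
    data["embarked"] = data["embarked"].fillna(data["embarked"].mode()[0])
    for col in ("sex", "embarked"):
        data[col] = LabelEncoder().fit_transform(data[col])
    return data[FEATURES], data["survived"]


def train_decision_tree(df, max_depth=4, random_state=42):
    """Split the data, fit a depth-limited tree, and return it with the test split."""
    X, y = prepare_titanic(df)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state, stratify=y
    )
    dt_model = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
    dt_model.fit(X_train, y_train)
    return dt_model, X_test, y_test
```

With seaborn available, `dt_model, X_test, y_test = train_decision_tree(sns.load_dataset("titanic"))` would supply the names used above (the dataset is downloaded on first use). The tree structure can then be drawn with `plot_tree(dt_model, feature_names=FEATURES, filled=True)` followed by `plt.show()`.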
Program 8
Develop a program to implement the Naive Bayes classifier using the Iris dataset for
training. Compute the accuracy of the classifier on the test data.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Compute accuracy (y_pred holds the classifier's predictions for the test set)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, cmap='Blues', xticklabels=iris.target_names,
yticklabels=iris.target_names)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
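The accuracy and confusion-matrix code above uses `iris`, `y_test`, and `y_pred` without defining them. A minimal sketch of the missing load, split, and fit steps, assuming a stratified 70/30 split and the variable name `nb_model` (both are choices of this sketch):

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Gaussian Naive Bayes models each feature as normally distributed within a class
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
y_pred = nb_model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
# Per-class precision, recall, and F1 on the test split
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```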
Program 9
Develop a program to implement k-means clustering on the Wisconsin Breast Cancer dataset
and visualize the clustering result.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Plot the clusters in the 2-D PCA projection
# (df holds the PCA1, PCA2, and Cluster columns)
plt.figure(figsize=(10, 6))
sns.scatterplot(x=df['PCA1'], y=df['PCA2'], hue=df['Cluster'], palette='coolwarm', alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clustering on Breast Cancer Dataset')
plt.legend(title="Cluster")
plt.show()
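The plot above expects a DataFrame `df` with `PCA1`, `PCA2`, and `Cluster` columns. One way to build it is sketched here, assuming two clusters, standardized features, and PCA used only for the 2-D visualization; the extra `Diagnosis` column is an addition of this sketch for a quick sanity check.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

# Two clusters, matching the two diagnosis classes (a choice, not a requirement)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Project to 2 components purely for plotting; clustering used all 30 features
X_pca = PCA(n_components=2).fit_transform(X_scaled)

df = pd.DataFrame(X_pca, columns=["PCA1", "PCA2"])
df["Cluster"] = clusters
df["Diagnosis"] = data.target  # 0 = malignant, 1 = benign in scikit-learn's copy

# Cross-tabulate clusters against the true labels as a rough sanity check
print(pd.crosstab(df["Cluster"], df["Diagnosis"]))
```

Cluster indices are arbitrary, so cluster 0 need not correspond to diagnosis 0; the cross-tabulation shows which way round they fall.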