
Machine Learning Laboratory BCSL606

Program 1: Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use the California Housing dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
housing_data = fetch_california_housing(as_frame=True)
data = housing_data['data']
print(data)
data['MedHouseVal'] = housing_data['target']  # Adding target variable for completeness

# Histograms for all numerical features
print("Creating histograms for all numerical features...")
for column in data.columns:
    plt.figure(figsize=(8, 5))
    plt.hist(data[column], bins=50, edgecolor='k', alpha=0.7)
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()

# Box plots for all numerical features
print("Creating box plots for all numerical features to identify outliers...")
for column in data.columns:
    plt.figure(figsize=(8, 5))
    plt.boxplot(data[column], vert=False, patch_artist=True,
                boxprops=dict(facecolor='skyblue', color='blue'))
    plt.title(f'Box Plot of {column}')
    plt.xlabel(column)
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.show()

# Identify outliers using IQR
print("Identifying potential outliers using the IQR method...")
outliers = {}
for column in data.columns:
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers[column] = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    print(f"{column}:")
    print(f"Lower Bound: {lower_bound}, Upper Bound: {upper_bound}")
    print(f"Number of outliers: {len(outliers[column])}")
    print("---")

Output:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
... ... ... ... ... ... ... ...
20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37

Longitude
0 -122.23
1 -122.22
2 -122.24
3 -122.25
4 -122.25
... ...
20635 -121.09
20636 -121.21
20637 -121.22
20638 -121.32
20639 -121.24

[20640 rows x 8 columns]


Creating histograms for all numerical features...

Identifying potential outliers using the IQR method...


MedInc:
Lower Bound: -0.7063750000000004, Upper Bound: 8.013024999999999
Number of outliers: 681
---
HouseAge:
Lower Bound: -10.5, Upper Bound: 65.5
Number of outliers: 0
---
AveRooms:
Lower Bound: 2.023219161170969, Upper Bound: 8.469878027106942
Number of outliers: 511
---


AveBedrms:
Lower Bound: 0.8659085155701288, Upper Bound: 1.2396965968190603
Number of outliers: 1424
---
Population:
Lower Bound: -620.0, Upper Bound: 3132.0
Number of outliers: 1196
---
AveOccup:
Lower Bound: 1.1509614824735064, Upper Bound: 4.5610405893536905
Number of outliers: 711
---
Latitude:
Lower Bound: 28.259999999999998, Upper Bound: 43.38
Number of outliers: 0
---
Longitude:
Lower Bound: -127.48499999999999, Upper Bound: -112.32500000000002
Number of outliers: 0
---
MedHouseVal:
Lower Bound: -0.9808749999999995, Upper Bound: 4.824124999999999
Number of outliers: 1071
---


Program 2: Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use the California Housing dataset.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
housing_data = fetch_california_housing(as_frame=True)
data = housing_data['data']
data['MedHouseVal'] = housing_data['target']  # Adding target variable for completeness

# Compute the correlation matrix


print("Computing the correlation matrix...")
correlation_matrix = data.corr()
print(correlation_matrix)

# Visualize the correlation matrix using a heatmap
print("Visualizing the correlation matrix using a heatmap...")
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            cbar=True, square=True)
plt.title("Correlation Matrix Heatmap")
plt.show()


# Create a pair plot to visualize pairwise relationships between features
print("Creating a pair plot to visualize pairwise relationships between features...")
sns.pairplot(data, diag_kind='kde', corner=True)
plt.show()
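To read the strongest relationships off the matrix programmatically, a minimal sketch (reusing the correlation_matrix computed above; the 0.7 cutoff is an arbitrary choice) could be:

# List feature pairs whose absolute correlation exceeds a chosen threshold
corr_pairs = correlation_matrix.unstack()
strong = corr_pairs[(corr_pairs.abs() > 0.7) & (corr_pairs.abs() < 1.0)]
print(strong.drop_duplicates().sort_values(key=abs, ascending=False))

With the matrix shown in the output below, this would flag AveRooms/AveBedrms (0.85) and Latitude/Longitude (-0.92).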
Output:
Computing the correlation matrix...
MedInc HouseAge AveRooms AveBedrms Population AveOccup \
MedInc 1.000000 -0.119034 0.326895 -0.062040 0.004834 0.018766
HouseAge -0.119034 1.000000 -0.153277 -0.077747 -0.296244 0.013191
AveRooms 0.326895 -0.153277 1.000000 0.847621 -0.072213 -0.004852
AveBedrms -0.062040 -0.077747 0.847621 1.000000 -0.066197 -0.006181
Population 0.004834 -0.296244 -0.072213 -0.066197 1.000000 0.069863
AveOccup 0.018766 0.013191 -0.004852 -0.006181 0.069863 1.000000
Latitude -0.079809 0.011173 0.106389 0.069721 -0.108785 0.002366
Longitude -0.015176 -0.108197 -0.027540 0.013344 0.099773 0.002476
MedHouseVal 0.688075 0.105623 0.151948 -0.046701 -0.024650 -0.023737

Latitude Longitude MedHouseVal


MedInc -0.079809 -0.015176 0.688075
HouseAge 0.011173 -0.108197 0.105623
AveRooms 0.106389 -0.027540 0.151948
AveBedrms 0.069721 0.013344 -0.046701
Population -0.108785 0.099773 -0.024650
AveOccup 0.002366 0.002476 -0.023737
Latitude 1.000000 -0.924664 -0.144160
Longitude -0.924664 1.000000 -0.045967
MedHouseVal -0.144160 -0.045967 1.000000
Visualizing the correlation matrix using a heatmap...


Program 3: Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from numpy.linalg import eig

# Load the Iris dataset


iris = load_iris()
iris_data = iris.data
iris_target = iris.target
iris_feature_names = iris.feature_names

# Convert to DataFrame
df = pd.DataFrame(iris_data, columns=iris_feature_names)
df['Target'] = iris_target

# Example Data (First 5 Samples for Explanation)


example_data = iris_data[:5]
print("Example Data (First 5 Samples):")
print(example_data)

# Step 1: Standardize the Data


scaler = StandardScaler()
iris_data_scaled = scaler.fit_transform(iris_data)
example_data_scaled = scaler.transform(example_data)
print("\nStandardized Example Data:")
print(example_data_scaled)


# Step 2: Compute Covariance Matrix Manually
n_samples = iris_data_scaled.shape[0]
mean_vector = np.mean(iris_data_scaled, axis=0)
X_centered = iris_data_scaled - mean_vector
cov_matrix_manual = (1 / (n_samples - 1)) * np.dot(X_centered.T, X_centered)
print("\nManually Computed Covariance Matrix:")
print(cov_matrix_manual)

# Step 3: Compute Eigenvalues and Eigenvectors Manually


eigenvalues_manual, eigenvectors_manual = eig(cov_matrix_manual)
print("\nManually Computed Eigenvalues:")
print(eigenvalues_manual)
print("\nManually Computed Eigenvectors:")
print(eigenvectors_manual)

# Step 4: Select Top 2 Principal Components


sorted_indices = np.argsort(eigenvalues_manual)[::-1]
top_2_indices = sorted_indices[:2]
top_2_eigenvectors = eigenvectors_manual[:, top_2_indices]
print("\nTop 2 Eigenvectors:")
print(top_2_eigenvectors)

# Step 5: Transform Data to 2D


iris_pca = np.dot(iris_data_scaled, top_2_eigenvectors)
example_pca = np.dot(example_data_scaled, top_2_eigenvectors)
print("\nReduced 2D Example Data:")
print(example_pca)

# Step 6: Visualize PCA Results
iris_pca_df = pd.DataFrame(data=iris_pca, columns=["Principal Component 1", "Principal Component 2"])
iris_pca_df['Target'] = iris_target


plt.figure(figsize=(8, 6))
sns.scatterplot(
    x="Principal Component 1", y="Principal Component 2", hue="Target",
    data=iris_pca_df, palette="viridis", s=100, alpha=0.8
)
plt.title("PCA of Iris Dataset")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.legend(title="Target", labels=iris.target_names)
plt.grid(alpha=0.5)
plt.show()
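As a cross-check on the manual eigendecomposition, an equivalent transform with scikit-learn's built-in PCA can be run on the same standardized data (a minimal sketch; component signs may be flipped relative to the manual result):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
iris_pca_sklearn = pca.fit_transform(iris_data_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print(iris_pca_sklearn[:5])  # compare with example_pca above, up to sign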
Output:


Program 4: For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Implement Find-S algorithm
print("Implementing Find-S algorithm...")

def find_s_algorithm(csv_file):
    # Load the dataset
    dataset = pd.read_csv(csv_file)
    attributes = dataset.iloc[:, :-1].values
    labels = dataset.iloc[:, -1].values

    # Initialize the hypothesis with the first positive example
    for i, label in enumerate(labels):
        if label == 'Yes':
            hypothesis = list(attributes[i])
            break  # Stop after finding the first "Yes"

    # Generalize the hypothesis over the remaining positive examples
    for i in range(len(labels)):
        if labels[i] == 'Yes':  # Only process positive examples
            for j in range(len(hypothesis)):
                if hypothesis[j] != attributes[i][j]:
                    hypothesis[j] = '?'  # Generalize

    return hypothesis

csv_file = "/content/find_s_example.csv"  # Provide the path to your CSV file
final_hypothesis = find_s_algorithm(csv_file)
print("Final Hypothesis:", final_hypothesis)

Output:
Implementing Find-S algorithm...
Final Hypothesis: ['Sunny', 'Warm', '?', '?', '?', '?']


Program 5: Develop a program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0,1]. Perform the following based on the dataset generated.
1. Label the first 50 points {x1, ..., x50} as follows: if (xi <= 0.5), then xi belongs to Class1, else xi belongs to Class2.
2. Classify the remaining points, x51, ..., x100, using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Step 1: Generate 100 random values in the range [0,1]


np.random.seed(42) # For reproducibility
x = np.random.rand(100).reshape(-1, 1) # Reshape for sklearn compatibility

print(x[:5])

# Step 2: Label the first 50 points
labels = np.array([1 if xi <= 0.5 else 2 for xi in x[:50]])  # Class 1 if xi <= 0.5, else Class 2

# Step 3: Train KNN classifier for several values of k
k_values = [1, 2, 3, 4, 5, 20, 30]
classified_labels = {}

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x[:50], labels)  # Train using first 50 points
    classified_labels[k] = knn.predict(x[50:])  # Classify remaining 50 points

# Step 4: Visualize the results


plt.figure(figsize=(10, 6))
plt.scatter(x[:50], labels, color='blue', label='Training Data')

plt.scatter(x[50:], classified_labels[1], color='red', marker='x', label='Classified Data (k=1)')
plt.xlabel('X values')
plt.ylabel('Class')
plt.title('KNN Classification of Random Values')
plt.legend()
plt.show()

# Print classification results for different k values
for k in k_values:
    print(f"Classification results for k={k}: {classified_labels[k]}")

Output:

Classification results for k=1: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 1 2 1 1 1]
Classification results for k=2: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 1 2 1 1 1]
Classification results for k=3: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 2 2 1 1 1]
Classification results for k=4: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 2 2 1 1 1]
Classification results for k=5: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 2 2 1 1 1]
Classification results for k=20: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 2 2 1 1 1]
Classification results for k=30: [2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 2 2 2 1 2 1 1 1]


Program 6: Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.
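For reference, the quantity the code below computes at each query point x_q is the standard weighted least-squares fit:

    w_i = \exp\!\left(-\frac{(x_i - x_q)^2}{2\tau^2}\right), \qquad
    \hat{\theta}(x_q) = (X^\top W X)^{-1} X^\top W y, \qquad
    \hat{y}(x_q) = [1 \;\; x_q]\,\hat{\theta}(x_q)

where W = diag(w_1, ..., w_m) and X carries a bias column of ones; the code uses a pseudo-inverse in place of the plain inverse for numerical stability.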

import numpy as np
import matplotlib.pyplot as plt

def gaussian_kernel(x, x_query, tau):
    """Compute the Gaussian weight for each training sample."""
    return np.exp(-np.square(x - x_query) / (2 * tau**2))

def locally_weighted_regression(x_train, y_train, x_query, tau):
    """Perform Locally Weighted Regression (LWR) for a given query point."""
    m = len(x_train)
    W = np.diag(gaussian_kernel(x_train, x_query, tau))  # Compute weights

    X_bias = np.c_[np.ones(m), x_train]  # Add bias term
    theta = np.linalg.pinv(X_bias.T @ W @ X_bias) @ (X_bias.T @ W @ y_train)

    return np.array([1, x_query]) @ theta  # Predict output for x_query

# Generate synthetic dataset
np.random.seed(42)
x_train = np.linspace(0, 10, 100)
y_train = np.sin(x_train) + np.random.normal(0, 0.2, 100)  # Sinusoidal data with noise

# Define tau (bandwidth parameter)
tau_values = [0.1, 0.5, 1, 5]
x_test = np.linspace(0, 10, 100)  # Test data

plt.figure(figsize=(12, 8))
for tau in tau_values:
    y_pred = np.array([locally_weighted_regression(x_train, y_train, xq, tau) for xq in x_test])
    plt.plot(x_test, y_pred, label=f'tau={tau}')

# Plot training data


plt.scatter(x_train, y_train, color='black', label='Training Data', alpha=0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Locally Weighted Regression (LWR) with Different Tau Values')
plt.legend()
plt.show()


Output:


Program 7: Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use the Boston Housing Dataset for Linear Regression and the Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline

# Load Boston Housing Dataset from CSV


boston_df = pd.read_csv('/content/Boston.csv')
print("Boston CSV Columns:", boston_df.columns)
X_boston = boston_df[['rm']].values
y_boston = boston_df['medv'].values

X_train, X_test, y_train, y_test = train_test_split(X_boston, y_boston, test_size=0.2, random_state=42)

linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
y_pred = linear_reg.predict(X_test)

# Plot results
plt.figure(figsize=(10, 5))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted')
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('Housing Price')


plt.title('Linear Regression on Boston Housing Dataset')


plt.legend()
plt.show()

print(f"Mean Squared Error (Linear Regression): {mean_squared_error(y_test,


y_pred)}")

# Polynomial Regression on Auto MPG Dataset
auto_mpg_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
                'acceleration', 'model_year', 'origin']

auto_df = pd.read_csv(auto_mpg_url, delim_whitespace=True, names=column_names, na_values='?')
auto_df = auto_df.dropna()  # Remove rows with missing values

X_auto = auto_df[['horsepower']].astype(float).values  # Using 'horsepower' as feature
y_auto = auto_df['mpg'].values

X_train, X_test, y_train, y_test = train_test_split(X_auto, y_auto, test_size=0.2, random_state=42)

# Polynomial Regression (degree=3)
poly_model = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(), LinearRegression())
poly_model.fit(X_train, y_train)
y_poly_pred = poly_model.predict(X_test)

# Plot results
X_test_sorted, y_poly_pred_sorted = zip(*sorted(zip(X_test.flatten(), y_poly_pred)))


plt.figure(figsize=(10, 5))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test_sorted, y_poly_pred_sorted, color='red', linewidth=2, label='Predicted')
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('Polynomial Regression on Auto MPG Dataset')
plt.legend()
plt.show()

print(f"Mean Squared Error (Polynomial Regression):


{mean_squared_error(y_test, y_poly_pred)}")
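To see how the polynomial degree affects the fit, a minimal sketch (reusing the Auto MPG train/test split above; the degree values are arbitrary choices) could be:

for degree in [1, 2, 3, 5]:
    model = make_pipeline(PolynomialFeatures(degree=degree), StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"Degree {degree}: MSE = {mse:.3f}")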

Output:

Mean Squared Error (Linear Regression): 46.144775347317264


Program 8: Develop a program to demonstrate the working of the decision tree algorithm. Use the Breast Cancer Data set for building the decision tree and apply this knowledge to classify a new sample.
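For reference, the entropy and information-gain values printed by the code below follow the standard definitions (here the subsets S_v come from splitting each feature at its median):

    H(S) = -\sum_{c} p_c \log_2 p_c, \qquad
    IG(S, A) = H(S) - \sum_{v \in \{left,\, right\}} \frac{|S_v|}{|S|}\, H(S_v)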

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt
from collections import Counter

data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names
target_names = data.target_names

print("Feature names:", feature_names)


print("Target names:", target_names)

def calculate_entropy(labels):
    total = len(labels)
    counts = Counter(labels)
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * np.log2(p)
    return entropy

entropy_dataset = calculate_entropy(y)
print(f"\nOverall Entropy of Target (Malignant vs Benign): {entropy_dataset:.4f}")

print("\nInformation Gain for Each Feature (using median split):")


for i, feature in enumerate(feature_names):
feature_values = X[:, i]
median_value = np.median(feature_values)

# Split dataset
left_mask = feature_values <= median_value
right_mask = feature_values > median_value

y_left = y[left_mask]

Dept. of ISE, JSSATEB 2024-25 21


Machine Learning Laboratory BCSL606

y_right = y[right_mask]

entropy_left = calculate_entropy(y_left)
entropy_right = calculate_entropy(y_right)

weighted_entropy = (len(y_left) / len(y)) * entropy_left + (len(y_right) /


len(y)) * entropy_right
info_gain = entropy_dataset - weighted_entropy

print(f"{feature}: IG = {info_gain:.4f}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

plt.figure(figsize=(20, 10))
plot_tree(clf, feature_names=feature_names, class_names=target_names, filled=True, rounded=True)
plt.title("Decision Tree Visualization for Breast Cancer Dataset")
plt.show()

new_sample = np.array([[17.99, 10.38, 122.8, 1001.0, 0.1184,
                        0.2776, 0.3001, 0.1471, 0.2419, 0.07871,
                        1.095, 0.9053, 8.589, 153.4, 0.006399,
                        0.04904, 0.05373, 0.01587, 0.03003, 0.006193,
                        25.38, 17.33, 184.6, 2019.0, 0.1622,
                        0.6656, 0.7119, 0.2654, 0.4601, 0.1189]])

prediction = clf.predict(new_sample)
print("\nPrediction for new sample:")
print("Class:", target_names[prediction[0]])


Output:
Feature names: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Target names: ['malignant' 'benign']

Overall Entropy of Target (Malignant vs Benign): 0.9526

Information Gain for Each Feature (using median split):


mean radius: IG = 0.3416
mean texture: IG = 0.1445
mean perimeter: IG = 0.3507
mean area: IG = 0.3416
mean smoothness: IG = 0.0660
mean compactness: IG = 0.2325
mean concavity: IG = 0.3695
mean concave points: IG = 0.3995
mean symmetry: IG = 0.0627
mean fractal dimension: IG = 0.0000
radius error: IG = 0.1824
texture error: IG = 0.0000
perimeter error: IG = 0.2192
area error: IG = 0.2910
smoothness error: IG = 0.0023
compactness error: IG = 0.0990
concavity error: IG = 0.1601
concave points error: IG = 0.1445
symmetry error: IG = 0.0037
fractal dimension error: IG = 0.0284
worst radius: IG = 0.4588
worst texture: IG = 0.1298
worst perimeter: IG = 0.4436
worst area: IG = 0.4556
worst smoothness: IG = 0.0990
worst compactness: IG = 0.1882
worst concavity: IG = 0.3792
worst concave points: IG = 0.4209
worst symmetry: IG = 0.0762
worst fractal dimension: IG = 0.0452


Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.91      0.94        43
           1       0.95      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

Accuracy: 0.956140350877193

Prediction for new sample:


Class: malignant


Program 9: Develop a program to implement the Naive Bayesian classifier considering the Olivetti Face Data set for training. Compute the accuracy of the classifier, considering a few test data sets.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

faces = fetch_olivetti_faces()
X = faces.data          # Flattened images: 400 x 4096
y = faces.target        # Labels: 0 to 39 (40 classes)
images = faces.images   # Original image shapes: 64 x 64

print(f"Total samples: {X.shape[0]}")


print(f"Image shape: {images[0].shape}")
print(f"Total classes: {len(np.unique(y))}")

X_train, X_test, y_train, y_test, img_train, img_test = train_test_split(
    X, y, images, test_size=0.3, random_state=42, stratify=y)

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy)

def show_predictions(images, true_labels, predicted_labels, n=8):
    plt.figure(figsize=(15, 5))
    for i in range(n):
        plt.subplot(1, n, i + 1)
        plt.imshow(images[i], cmap='gray')
        plt.title(f"True: {true_labels[i]}\nPred: {predicted_labels[i]}")
        plt.axis('off')
    plt.tight_layout()
    plt.suptitle("Sample Test Predictions", fontsize=16)
    plt.subplots_adjust(top=0.75)
    plt.show()

show_predictions(img_test, y_test, y_pred, n=8)
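The confusion_matrix import above is otherwise unused; a minimal sketch for inspecting per-class confusions (a 40 x 40 matrix here, so a simple image view is used) could be:

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 8))
plt.imshow(cm, cmap='Blues')          # darker cells mark more frequent (true, predicted) pairs
plt.title("Confusion Matrix (40 classes)")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.colorbar()
plt.show()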


Output:
Classification Report:
precision recall f1-score support

0 1.00 0.67 0.80 3
1 1.00 0.67 0.80 3
2 0.43 1.00 0.60 3
3 1.00 0.33 0.50 3
4 1.00 0.33 0.50 3
5 1.00 1.00 1.00 3
6 1.00 0.67 0.80 3
7 0.60 1.00 0.75 3
8 1.00 1.00 1.00 3
9 1.00 0.33 0.50 3
10 1.00 0.67 0.80 3
11 1.00 1.00 1.00 3
12 1.00 1.00 1.00 3
13 1.00 0.67 0.80 3
14 1.00 1.00 1.00 3
15 0.50 1.00 0.67 3
16 1.00 0.33 0.50 3
17 0.00 0.00 0.00 3
18 1.00 1.00 1.00 3
19 1.00 1.00 1.00 3
20 1.00 1.00 1.00 3
21 1.00 1.00 1.00 3
22 1.00 1.00 1.00 3
23 1.00 1.00 1.00 3
24 1.00 0.67 0.80 3
25 0.75 1.00 0.86 3
26 1.00 0.67 0.80 3
27 1.00 1.00 1.00 3
28 1.00 1.00 1.00 3
29 1.00 1.00 1.00 3
30 0.75 1.00 0.86 3
31 1.00 0.67 0.80 3
32 1.00 1.00 1.00 3
33 1.00 0.67 0.80 3
34 0.43 1.00 0.60 3
35 0.75 1.00 0.86 3
36 1.00 1.00 1.00 3
37 1.00 0.33 0.50 3
38 1.00 1.00 1.00 3
39 0.33 1.00 0.50 3

accuracy 0.82 120
macro avg 0.89 0.82 0.81 120
weighted avg 0.89 0.82 0.81 120

Accuracy: 0.8166666666666667


Program 10: Develop a program to implement k-means clustering using the Wisconsin Breast Cancer data set and visualize the clustering result.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names
target_names = data.target_names

print("Data Shape:", X.shape)


print("Classes:", target_names)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)


clusters = kmeans.fit_predict(X_scaled)

labels_mapped = np.where(clusters == 1, 0, 1)

print("\nConfusion Matrix:")
print(confusion_matrix(y, labels_mapped))
print("Accuracy:", accuracy_score(y, labels_mapped))

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', alpha=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
s=250, marker='X', c='red', label='Centroids')
plt.title("K-Means Clustering of Breast Cancer Dataset (PCA-2D)")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend()
plt.grid(True)
plt.show()
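The fixed mapping np.where(clusters == 1, 0, 1) above assumes that cluster 1 corresponds to the malignant class. A minimal sketch that derives the mapping from the data instead (majority vote of true labels within each cluster, reusing the variables defined above) could be:

labels_mapped_mv = np.zeros_like(clusters)
for c in np.unique(clusters):
    # Assign each cluster the most common true label among its members
    labels_mapped_mv[clusters == c] = np.bincount(y[clusters == c]).argmax()
print("Accuracy (majority-vote mapping):", accuracy_score(y, labels_mapped_mv))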


Output:

Data Shape: (569, 30)


Classes: ['malignant' 'benign']

Confusion Matrix:
[[176 36]
[ 18 339]]
Accuracy: 0.9050966608084359
