
INDIRA GANDHI DELHI TECHNICAL

UNIVERSITY FOR WOMEN

MACHINE LEARNING LAB


Tulika Arun
MTech. CSE-AI
(Batch of 2026)
Experiment 1
Aim :
Write a program to implement various feature selection techniques and compare the
performance with a classifier.

Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.

Algorithm :
Feature selection (also known as variable selection, attribute selection or subset selection) is
the process of selecting a subset of relevant features to use in building a machine learning
model. It is one of the core concepts in machine learning and has a large impact on a model's
performance. Given a pool of features, the process selects the subset of attributes that are
most important and contribute most to the prediction. For this experiment, we consider the
following feature selection methods :
1. Filter Methods: Rely on the features’ characteristics without using any machine
learning algorithm. They are very well suited for a quick screening and removal of
irrelevant features.
a. Spearman Correlation
b. Pearson Correlation
c. Kendall Correlation
d. Chi Squared Test
e. Information Gain

2. Wrapper methods: Treat the selection of a set of features as a search problem, then
use a predictive machine learning algorithm to select the best feature subset. In
essence, these methods train a new model on each candidate feature subset, which
makes them very computationally expensive. However, they provide the best-performing
feature subset for a given machine learning algorithm (a minimal sketch of greedy
forward selection is shown after this list).
a. Forward Feature Selection
b. Backward Feature Elimination
c. Step Wise Feature Selection
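The program below approximates forward selection with RFE and backward elimination with SelectKBest. For comparison, a true greedy forward search can be written with scikit-learn's SequentialFeatureSelector; the snippet below is a minimal sketch of that idea, where the estimator, the 5-fold cross-validation and k = 5 are illustrative assumptions rather than part of the lab setup.

# Minimal sketch of greedy forward feature selection (illustrative only).
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

def forward_select(X_scaled, y, k=5):
    sfs = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),  # model used to score each candidate subset
        n_features_to_select=k,
        direction='forward',                # start empty and add one feature at a time
        cv=5                                # score each candidate subset with 5-fold CV
    )
    sfs.fit(X_scaled, y)
    return sfs.get_support(indices=True)    # indices of the selected features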

Program :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.feature_selection import mutual_info_classif, chi2, SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'  # Update this with your file path
data = pd.read_csv(file_path)

X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Encode the target labels if they are strings (M/B)
if y.dtype == 'O':
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

min_max_scaler = MinMaxScaler()
X_normalized = min_max_scaler.fit_transform(X)

selected_features = {}

# Pearson Correlation
pearson_corr = X.corr(method='pearson').iloc[:, -1].abs()
selected_features['Pearson'] = np.where(pearson_corr > 0.4)[0]

# Spearman Correlation
spearman_corr = X.corr(method='spearman').iloc[:, -1].abs()
selected_features['Spearman'] = np.where(spearman_corr > 0.4)[0]

# Kendall Correlation
kendall_corr = X.corr(method='kendall').iloc[:, -1].abs()
selected_features['Kendall'] = np.where(kendall_corr > 0.4)[0]

# Mutual Information Gain
mutual_info = mutual_info_classif(X_scaled, y)
selected_features['Mutual_Info'] = np.where(mutual_info > 0.3)[0]

# Chi-Squared Test (use non-negative normalized features)
chi_scores, _ = chi2(X_normalized, y)
selected_features['Chi_Squared'] = np.where(chi_scores > 3)[0]

# Forward Feature Selection (using RFE)
logreg = LogisticRegression()
rfe_selector = RFE(logreg, n_features_to_select=5)
rfe_selector = rfe_selector.fit(X_scaled, y)
selected_features['Forward_Selection'] = np.where(rfe_selector.support_)[0]

# Backward Feature Elimination (using SelectKBest)
k_best_selector = SelectKBest(f_classif, k=5)
k_best_selector.fit(X_scaled, y)
selected_features['Backward_Elimination'] = np.where(k_best_selector.get_support())[0]

# Stepwise Selection (combination of forward and backward)
stepwise_features = list(set(selected_features['Forward_Selection']).union(selected_features['Backward_Elimination']))
selected_features['Stepwise_Selection'] = np.array(stepwise_features)

# Compare Naive Bayes Performance
def evaluate_model(X_train, X_test, y_train, y_test):
    model = GaussianNB()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='weighted'),
        'recall': recall_score(y_test, y_pred, average='weighted'),
        'f1_score': f1_score(y_test, y_pred, average='weighted')
    }

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

all_features_results = evaluate_model(X_train, X_test, y_train, y_test)

results = {'All_Features': all_features_results}

for method, indices in selected_features.items():
    X_train_sel, X_test_sel = X_train[:, indices], X_test[:, indices]
    results[method] = evaluate_model(X_train_sel, X_test_sel, y_train, y_test)

results_df = pd.DataFrame(results).T
print(results_df)

accuracy_results = {method: result['accuracy'] for method, result in results.items()}

accuracy_df = pd.DataFrame(list(accuracy_results.items()), columns=['Feature Selection Method', 'Accuracy'])

plt.figure(figsize=(10, 6))
sns.barplot(x='Feature Selection Method', y='Accuracy', data=accuracy_df, palette='viridis')
plt.title('Accuracy of Naive Bayes Classifier with Different Feature Selection Methods')
plt.xlabel('Feature Selection Method')
plt.ylabel('Accuracy')
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 1)
plt.tight_layout()
plt.show()
Output :
Selected Features via Pearson, Spearman and Kendall Correlation (Thresholds : 0.6, 0.6, 0.4):

Selected Features via Mutual Information Gain :


Selected Features via Chi Squared Test :

Selected Features via Forward Feature Selection, Backward Feature Elimination, Step Wise
Selection :
A comparison of different evaluation metrics for the Naïve Bayes classifier :

Conclusion :
We can see that Forward Feature Selection performs the best out of the lot.
Experiment 2
Aim :
Write a program to implement linear regression without using Python libraries.

Dataset :
Student Performance Dataset contains two columns: Hours and Scores.
• Hours: Represents the number of hours spent studying or practicing (independent
variable or feature).
• Scores: Represents the scores achieved (dependent variable or target).

Algorithm :
Linear regression is used for finding a linear relationship between the target and one or more
predictors. There are two types of linear regression :

• Simple (one input variable or predictor)
• Multiple (more than one input variable)

Here, we perform simple linear regression on the above dataset.
Simple linear regression is used to estimate the relationship between two quantitative
variables. You can use simple linear regression when you want to know:
1. How strong the relationship is between two variables (e.g. the relationship
between rainfall and soil erosion).
2. The value of the dependent variable at a certain value of the independent variable
(e.g. the amount of soil erosion at a certain level of rainfall).
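The program below fits the line y = m·x + b using the closed-form least-squares estimates: m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b = ȳ − m·x̄, where x̄ and ȳ are the mean hours and mean score. Goodness of fit is then reported as R² = 1 − SS_residual / SS_total.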

Program :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

student_scores = pd.read_csv(r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\student_scores.csv')

student_scores.isnull().sum()
print(student_scores)

corr_matrix = student_scores.corr()
print(corr_matrix)

x = student_scores.Hours.values.reshape(-1, 1)
y = student_scores.Scores.values.reshape(-1, 1)

plt.scatter(x, y, color='blue')
plt.xlabel('Number of Hours studied')
plt.ylabel('Scores Obtained')
plt.title('Simple Linear graph')
plt.show()

hours = student_scores['Hours']
scores = student_scores['Scores']
hours = np.array(hours)
scores = np.array(scores)

mean_x = np.mean(hours)
mean_y = np.mean(scores)

numerator = np.sum((hours - mean_x) * (scores - mean_y))
denominator = np.sum((hours - mean_x) ** 2)
slope = numerator / denominator
intercept = mean_y - slope * mean_x

print(f"Slope (m): {slope}")
print(f"Intercept (b): {intercept}")

def predict(x):
    return slope * x + intercept

predictions = predict(hours)

print("\nPredictions:")
for h, p in zip(hours, predictions):
    print(f"Hours: {h}, Predicted Score: {p}")

ss_total = np.sum((scores - mean_y) ** 2)
ss_residual = np.sum((scores - predictions) ** 2)
r2 = 1 - (ss_residual / ss_total)
print(f"\nR^2 score: {r2}")

plt.figure(figsize=(10, 6))
plt.scatter(hours, scores, color='blue', label='Actual Scores')
plt.plot(hours, predictions, color='red', linewidth=2, label='Fitted Line')
plt.xlabel('Hours')
plt.ylabel('Scores')
plt.title('Actual vs Predicted Scores')
plt.legend()
plt.grid(True)
plt.show()
Output :

The dataset looks like this :


Upon performing linear regression, the best fit line looks like :

Where the predictions and the R^2 score are :

Conclusion :
The linear regression algorithm built manually gives predictions close to the actual scores,
with an R² score of 95.2%.
Experiment 3
Aim :
Write a program to implement logistic regression without using Python libraries.

Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.

Algorithm :
The logistic function, also called the sigmoid function, is an S-shaped curve that can take any
real-valued number and map it into a value between 0 and 1, but never exactly at those
limits: 1 / (1 + e^-value). The key difference from linear regression is that the output value
being modeled is a binary value (0 or 1) rather than a numeric value. Logistic regression is
named for the function used at the core of the method, the logistic function.
Based on the number of categories, Logistic regression can be classified as:
1. binomial: the target variable can have only 2 possible types, “0” or “1”, which may
represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
2. multinomial: the target variable can have 3 or more possible types which are not
ordered (i.e. the types have no quantitative significance), like “disease A” vs “disease B” vs
“disease C”.
3. ordinal: it deals with target variables with ordered categories. For example, a test
score can be categorized as “very poor”, “poor”, “good”, “very good”. Here, each
category can be given a score like 0, 1, 2, 3.

Logistic regression uses an equation as the representation, very much like linear regression.
Input values (x) are combined linearly using weights or coefficient values (referred to as the
Greek capital letter Beta) to predict an output value (y). Below is an example logistic
regression equation: y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) where y is the predicted output,
b0 is the bias or intercept term and b1 is the coefficient for the single input value (x).
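In the implementation below, the coefficients θ are fitted with batch gradient descent: the model output is h = sigmoid(Xθ), the cross-entropy cost is J(θ) = −(1/m) Σ [ y·log(h) + (1 − y)·log(1 − h) ], and each iteration updates θ ← θ − α·(1/m)·Xᵀ(h − y), where α is the learning rate and m is the number of training samples.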

Program :

import numpy as np
import pandas as pd

file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'  # Update this with your file path
data = pd.read_csv(file_path)

data['diagnosis'] = data['diagnosis'].map({'M': 1, 'B': 0})

# Features and target
X = data.drop(['diagnosis', 'id'], axis=1)  # Drop the target and any non-feature columns
y = data['diagnosis']

# Normalize features
X = (X - X.mean()) / X.std()
# Add a column of ones to X for the intercept term
X = np.c_[np.ones(X.shape[0]), X]  # Adding the intercept term

# Convert to numpy arrays
X = np.array(X)
y = np.array(y)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(X, y, theta):
    m = len(y)
    h = sigmoid(X.dot(theta))
    cost = (1/m) * (-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h)))
    return cost

def gradient_descent(X, y, theta, alpha, num_iters):
    m = len(y)
    cost_history = np.zeros(num_iters)

    for i in range(num_iters):
        h = sigmoid(X.dot(theta))
        gradient = (1/m) * X.T.dot(h - y)
        theta -= alpha * gradient
        cost_history[i] = compute_cost(X, y, theta)

    return theta, cost_history

# Initialize parameters
theta = np.zeros(X.shape[1])
alpha = 0.01  # Learning rate
num_iters = 1000  # Number of iterations

# Perform gradient descent
theta, cost_history = gradient_descent(X, y, theta, alpha, num_iters)

# Output results
print("Optimized theta:", theta)
print("Final cost:", cost_history[-1])

def predict(X, theta):
    return sigmoid(X.dot(theta)) >= 0.5

# Predict on the training data
predictions = predict(X, theta)

# Print predictions alongside actual values
for i in range(len(predictions)):
    print(f"Prediction: {int(predictions[i])}, Actual: {int(y[i])}")

# Calculate accuracy
accuracy = np.mean(predictions == y)
print("Training accuracy:", accuracy)
Output :
The optimized theta and the cost at each iteration of gradient descent are as follows :

The predictions made from our manual logistic regression model look something like :

Where 1 is for malignant and 0 is for benign.


The accuracy of our model during training is:

Conclusion :
Our manual logistic regression algorithm classifies the tumors as malignant or benign and
gives a training accuracy of 98.2%.
Experiment 4
Aim :
Write a program to implement the Support Vector Machine (SVM) algorithm.

Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.

Algorithm :
Support Vector Machine (SVM) is a supervised machine learning algorithm commonly used
for classification tasks. It works by finding an optimal hyperplane that best separates the
data into different classes. For this experiment, SVM will attempt to classify tumors as either
malignant (cancerous) or benign (non-cancerous). The objective of the SVM is to maximize
the margin between data points of different classes, thus creating a robust classifier.
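Concretely, a linear SVM learns a hyperplane w·x + b = 0 and classifies a sample by the sign of w·x + b. Maximizing the margin corresponds to minimizing ‖w‖ subject to yᵢ(w·xᵢ + b) ≥ 1 for every training point (with slack variables when the classes are not perfectly separable). The program below uses scikit-learn's SVC with a linear kernel for this purpose.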

Program :

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'  # Update this with your file path
data = pd.read_csv(file_path)
data.head()
data.shape
data = data.dropna()

label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

X = data.iloc[:, :-1]
y = data.iloc[:, -1]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

min_max_scaler = MinMaxScaler()
X_normalized = min_max_scaler.fit_transform(X)

X.corr()
plt.figure(figsize=(18, 12))
sns.heatmap(X.corr(), vmin=0.85, vmax=1, annot=True, cmap='YlGnBu', linewidths=.5)

correlation_matrix = X.corr()
correlation_with_target = correlation_matrix.iloc[:-1, -1].abs()  # Correlation of features with target
selected_features = correlation_with_target[correlation_with_target > 0.5].index  # Select features with correlation > 0.5
selected_features

features = ['radius_mean', 'perimeter_mean', 'area_mean', 'compactness_mean',
            'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se',
            'area_se', 'radius_worst', 'perimeter_worst', 'area_worst',
            'compactness_worst', 'concavity_worst', 'concave points_worst']
X = X[features]
print(X.columns)

correlation_matrix = X.corr()
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if correlation_matrix.iloc[i, j] > 0.9:
            high_corr_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j]))

for high_corr_pair in high_corr_pairs:
    print("Highly correlated pairs:", high_corr_pair)

# Drop one feature from each pair of highly correlated features
features_to_drop = set()
for pair in high_corr_pairs:
    features_to_drop.add(pair[1])  # Arbitrarily drop the second feature in each pair
X = X.drop(columns=features_to_drop)
X.columns

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the SVM classifier
svm_model = SVC(kernel='linear')  # Using a linear kernel for binary classification

# Train the model
svm_model.fit(X_train, y_train)

# Predict on the test data
y_pred = svm_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", confusion_mat)
Output :
The correlation map of all the features looks like the one below. We took the features which
had a correlation of 0.5 or more with the target variable and for those selected features, we
dropped the ones which had higher correlation between each other (above 0.9).

The final output consists of the binary predictions made on the test data, the accuracy, the
confusion matrix, and the precision, recall, F1-score and support for each class :

Conclusion :
The SVM model on the Breast Cancer Wisconsin dataset performs with high accuracy (97%),
demonstrating strong capability in distinguishing between malignant and benign tumors.
The high precision and recall scores reflect the model's reliability in clinical contexts, where
accurate classification is essential.

Experiment 5
Aim :
Write a program to implement back propagation neural network for classification of sample data
without using Python libraries

Dataset :
The Iris dataset is a well-known dataset in machine learning and statistics, often used for
classification tasks. It consists of 150 samples of iris flowers from three different species:
Setosa, Versicolor, and Virginica. The dataset includes 4 features, which are the
measurements of the flowers' sepals and petals:
1. Sepal Length: The length of the sepal in centimeters.
2. Sepal Width: The width of the sepal in centimeters.
3. Petal Length: The length of the petal in centimeters.
4. Petal Width: The width of the petal in centimeters.
The dataset also includes the Species label, which identifies the species of each iris flower as
either Setosa, Versicolor, or Virginica. The dataset is balanced, with 50 samples from each
species.
• Number of instances: 150
• Number of attributes: 4 features (sepal length, sepal width, petal length, petal
width)
• Classes: 3 species (Setosa, Versicolor, Virginica)

Algorithm :
Backpropagation Network
• It is a multilayer feed-forward network
• BPN has two phases:
o Forward pass phase: computes the ‘functional signal’, i.e. the feed-forward propagation
of input pattern signals through the network
o Backward pass phase: computes the ‘error signal’, propagating the error backwards
through the network starting at the output units (where the error is the difference between
the actual and desired output values)
In this implementation, the neural network consists of:
• Input layer: 4 neurons (corresponding to the 4 input features).
• Hidden layers: 2 hidden layers (with 10 and 8 neurons respectively).
• Output layer: 3 neurons (corresponding to the 3 possible species of the iris).
The network is trained for a specified number of epochs (3000 in the updated version),
during which the weights are updated using the backpropagation algorithm to reduce the
classification error.
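In the implementation below, each epoch performs one forward pass and one backward pass over the whole training set: with sigmoid activation σ, the output-layer delta is δ_out = (y − ŷ)·σ′(ŷ), each hidden-layer delta is δ_hidden = (δ_next · Wᵀ_next)·σ′(a_hidden), every weight matrix is updated as W ← W + lr·(a_prevᵀ · δ), and each bias vector is updated by the column sums of its layer's deltas.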

Program :

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def load_data():
    data = pd.read_csv(r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\Iris.csv')
    print(data.columns)
    data.drop('Id', axis=1, inplace=True)
    print(data.columns)

    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values

    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    y = OneHotEncoder(sparse_output=False).fit_transform(y.reshape(-1, 1))

    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    return X, y

# Define the neural network class with two hidden layers
class NeuralNetwork:
    def __init__(self, input_size, hidden_size1, hidden_size2, output_size, lr=0.1):
        # Initialize weights and biases
        self.lr = lr
        self.weights_input_hidden1 = np.random.rand(input_size, hidden_size1)
        self.weights_hidden1_hidden2 = np.random.rand(hidden_size1, hidden_size2)
        self.weights_hidden2_output = np.random.rand(hidden_size2, output_size)

        self.bias_hidden1 = np.random.rand(hidden_size1)
        self.bias_hidden2 = np.random.rand(hidden_size2)
        self.bias_output = np.random.rand(output_size)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, x):
        return x * (1 - x)

    def forward(self, X):
        # Forward pass with two hidden layers
        self.hidden_input1 = np.dot(X, self.weights_input_hidden1) + self.bias_hidden1
        self.hidden_output1 = self.sigmoid(self.hidden_input1)

        self.hidden_input2 = np.dot(self.hidden_output1, self.weights_hidden1_hidden2) + self.bias_hidden2
        self.hidden_output2 = self.sigmoid(self.hidden_input2)

        self.final_input = np.dot(self.hidden_output2, self.weights_hidden2_output) + self.bias_output
        self.final_output = self.sigmoid(self.final_input)
        return self.final_output

    def backward(self, X, y, output):
        # Calculate the error and its derivative
        error = y - output
        d_output = error * self.sigmoid_derivative(output)

        # Backpropagation through the second hidden layer
        hidden2_error = d_output.dot(self.weights_hidden2_output.T)
        d_hidden2 = hidden2_error * self.sigmoid_derivative(self.hidden_output2)

        # Backpropagation through the first hidden layer
        hidden1_error = d_hidden2.dot(self.weights_hidden1_hidden2.T)
        d_hidden1 = hidden1_error * self.sigmoid_derivative(self.hidden_output1)

        # Update weights and biases
        self.weights_hidden2_output += self.hidden_output2.T.dot(d_output) * self.lr
        self.weights_hidden1_hidden2 += self.hidden_output1.T.dot(d_hidden2) * self.lr
        self.weights_input_hidden1 += X.T.dot(d_hidden1) * self.lr

        self.bias_output += np.sum(d_output, axis=0) * self.lr
        self.bias_hidden2 += np.sum(d_hidden2, axis=0) * self.lr
        self.bias_hidden1 += np.sum(d_hidden1, axis=0) * self.lr

    def train(self, X, y, epochs=2000):
        for epoch in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)

    def predict(self, X):
        output = self.forward(X)
        return np.argmax(output, axis=1)

X, y = load_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

input_size = X_train.shape[1]
hidden_size1 = 10  # First hidden layer neurons
hidden_size2 = 8   # Second hidden layer neurons
output_size = y_train.shape[1]
learning_rate = 0.1
epochs = 3000

nn = NeuralNetwork(input_size, hidden_size1, hidden_size2, output_size, lr=learning_rate)
nn.train(X_train, y_train, epochs=epochs)

y_pred = nn.predict(X_test)

print(y_pred)
y_test_labels = np.argmax(y_test, axis=1)
accuracy = accuracy_score(y_test_labels, y_pred)

print(f"Test Accuracy: {accuracy * 100:.2f}%")

Output :
After removing the ‘Id’ feature, the test predictions and the accuracy are printed as output.
• Upon using just one hidden layer and the following attribute values for the neural
network, the accuracy came out to be 71% :
hidden_size = 5
learning_rate = 0.1
epochs = 1000

• When we increase the hidden layers to 2 and use the following attribute values for
the neural network, the accuracy came out to be 91% :
hidden_size1 = 10 # First hidden layer neurons
hidden_size2 = 8 # Second hidden layer neurons
epochs = 3000 # Increase epochs for better training
Conclusion :
In this implementation, we used a backpropagation neural network to classify iris flowers
from the Iris dataset. The neural network was able to achieve an accuracy of around 71%
initially, and after improving the model by:
• Adding more neurons in the hidden layers,
• Introducing a second hidden layer, and
• Scaling the features using StandardScaler,
the accuracy was improved to 91%. With the improved model (more neurons, more epochs,
and better feature scaling), we observed an increase in performance, which demonstrates
the importance of tuning the model architecture and training parameters.
Experiment 6
Aim :
Write a program to implement the Principal Component Analysis (PCA) technique of
dimensionality reduction and evaluate the performance with a classifier.

Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.

Algorithm :
Principal Component Analysis :
• The main idea of principal component analysis (PCA) is to reduce the dimensionality
of a data set consisting of many variables correlated with each other, either heavily
or lightly, while retaining the variation present in the dataset, up to the maximum
extent.
• The same is done by transforming the variables to a new set of variables, which are
known as the principal components (or simply, the PCs) and are orthogonal, ordered
such that the retention of variation present in the original variables decreases as we
move down in the order.
• So, in this way, the 1st principal component retains maximum variation that was
present in the original components.
• The principal components are the eigenvectors of a covariance matrix, and hence
they are orthogonal.
• A principal component can be defined as a linear combination of optimally weighted
observed variables.
• The outputs of PCA are these principal components, whose number is equal to the
number of original variables.
• The PCs possess some useful properties which are listed below:
o The PCs are essentially the linear combinations of the original variables, the
weights vector in this combination is actually the eigenvector found which in
turn satisfies the principle of least squares.
o The PCs are orthogonal.
o The variation present in the PCs decrease as we move from the 1st PC to the
last one, hence the importance.
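As a minimal illustration of these points (separate from the scikit-learn PCA used in the program below), PCA can be sketched directly from the covariance matrix and its eigenvectors. The function name and the random demo data here are assumptions for illustration only.

# Minimal NumPy-only sketch of PCA: eigen-decomposition of the covariance matrix.
import numpy as np

def pca_manual(X, n_components):
    # Center the data so the covariance matrix is meaningful
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (features are columns)
    cov = np.cov(X_centered, rowvar=False)
    # Eigen-decomposition; the eigenvectors are the principal components
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort components by decreasing eigenvalue (variance retained)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    # Project the centered data onto the top components
    return X_centered @ components

# Example with random data (rows = samples, columns = features)
X_demo = np.random.rand(100, 5)
X_reduced = pca_manual(X_demo, n_components=2)
print(X_reduced.shape)  # (100, 2)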

Program :

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import make_pipeline

# Load the Breast Cancer Wisconsin (Diagnostic) dataset
file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'  # Update this with your file path
data = pd.read_csv(file_path)

data = data.dropna()

label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the data before applying PCA
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 1. Train the classifier without PCA
clf_without_pca = make_pipeline(StandardScaler(), SVC())
clf_without_pca.fit(X_train, y_train)
y_pred_without_pca = clf_without_pca.predict(X_test)

# Evaluate performance without PCA
print("Performance without PCA:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_without_pca):.4f}")
print(classification_report(y_test, y_pred_without_pca))

# 2. Apply PCA for dimensionality reduction
pca = PCA(n_components=0.95)  # Retain 95% of the variance
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train the classifier with PCA
clf_with_pca = make_pipeline(StandardScaler(), SVC())
clf_with_pca.fit(X_train_pca, y_train)
y_pred_with_pca = clf_with_pca.predict(X_test_pca)

# Evaluate performance with PCA
print("\nPerformance with PCA:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_with_pca):.4f}")
print(classification_report(y_test, y_pred_with_pca))

# Optionally, print the number of components retained
print(f"\nNumber of components after PCA: {pca.n_components_}")
Output :
The accuracy with the SVC classifier on all features is 97.6%, while the accuracy with 11 principal components is 91.8%.

Conclusion :
• PCA helps reduce the feature space, which can speed up model training and help
reduce overfitting.
• SVM works well with the dataset both with and without PCA. The performance
might remain similar, but with reduced features, the model might become more
generalizable, especially when dealing with noisy or redundant features.
Experiment 7
Aim :
Write a program to implement the ID3 algorithm.

Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.

Algorithm :
ID3 Algorithm :
• ID3 stands for Iterative Dichotomiser 3.
• It uses the notion of information gain, which is defined in terms of entropy
• It is a classification algorithm that follows a greedy approach to building a decision
tree, selecting at each step the attribute that yields the maximum Information Gain (IG)
• It attempts to create the shortest possible decision tree
Input Data: The dataset consists of instances with attributes (features) and labels (class).
Steps:
• Calculate Entropy: Entropy is a measure of impurity in the dataset. The goal is to
reduce this impurity with each split.
• Information Gain: For each attribute, calculate the information gain by splitting the
data based on that attribute. The attribute with the highest information gain will be
chosen for the split.
• Build Tree Recursively: The ID3 algorithm recursively builds a decision tree by
selecting the best attribute at each level until all data is classified or a stopping
criterion is met (such as when all data is pure or no more attributes are left).
Stopping Criteria:
• All instances in the dataset belong to the same class.
• There are no remaining attributes to split on.
• All instances have the same attribute values.
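The two quantities the program below computes are: Entropy(S) = −Σ p_c·log₂(p_c), summed over the classes c present in S, and Information Gain IG(S, A) = Entropy(S) − Σ (|S_v| / |S|)·Entropy(S_v), where S_v is the subset of S in which attribute A takes the value v. The attribute with the largest gain becomes the split at the current node.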

Program :
import pandas as pd
import numpy as np
from collections import Counter

# Step 1: Load the Breast Cancer dataset
def load_data(file_path):
    df = pd.read_csv(file_path)
    # Convert categorical columns to numeric if necessary
    df = df.apply(pd.to_numeric, errors='ignore')
    return df

# Step 2: Calculate the entropy of a dataset
def entropy(data):
    class_counts = Counter(data)
    total = len(data)
    ent = 0.0
    for count in class_counts.values():
        prob = count / total
        ent -= prob * np.log2(prob)
    return ent

# Step 3: Calculate the information gain for each feature
# (note: every distinct value of a feature becomes its own branch, so
# continuous-valued features are effectively treated as categorical here)
def information_gain(data, feature, target):
    # Get the unique values of the feature
    feature_values = data[feature].unique()
    # Calculate the entropy of the entire dataset
    total_entropy = entropy(data[target])
    weighted_entropy = 0.0

    for value in feature_values:
        # Split the data based on feature value
        subset = data[data[feature] == value]
        weighted_entropy += (len(subset) / len(data)) * entropy(subset[target])

    # Information gain is the reduction in entropy
    return total_entropy - weighted_entropy

# Step 4: Select the best feature to split on
def best_feature_to_split(data, features, target):
    best_gain = -1
    best_feature = None
    for feature in features:
        gain = information_gain(data, feature, target)
        if gain > best_gain:
            best_gain = gain
            best_feature = feature
    return best_feature

# Step 5: Build the decision tree recursively
def build_tree(data, features, target):
    # Base Case 1: If all data points have the same target value, return a leaf node
    if len(data[target].unique()) == 1:
        return data[target].iloc[0]

    # Base Case 2: If there are no more features to split on, return the majority class
    if len(features) == 0:
        return data[target].mode()[0]

    # Get the best feature to split on
    best_feature = best_feature_to_split(data, features, target)

    # Create a new tree node
    tree = {best_feature: {}}
    # Split the data by the best feature
    feature_values = data[best_feature].unique()
    for value in feature_values:
        # Subset the data for each feature value
        subset = data[data[best_feature] == value]
        # Recursively build the tree for this subset
        subtree = build_tree(subset, [f for f in features if f != best_feature], target)
        tree[best_feature][value] = subtree

    return tree

def tree_depth(tree):
    # If the tree is a leaf node (i.e., a class label), the depth is 0
    if not isinstance(tree, dict):
        return 0

    # Recursively find the maximum depth of the tree
    depths = []
    for key, value in tree.items():
        for subtree in value.values():
            depths.append(tree_depth(subtree))  # Calculate the depth of each subtree

    return 1 + max(depths)  # Add 1 for the current level

# Step 6: Classify a new instance using the decision tree
def classify(tree, instance):
    if not isinstance(tree, dict):
        return tree
    feature = list(tree.keys())[0]
    value = instance[feature]
    subtree = tree[feature].get(value, None)
    if subtree is None:
        return None  # Unknown value
    return classify(subtree, instance)

# Example Usage:
# 1. Load data
file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'
data = load_data(file_path)

# 2. Define the target and feature columns
target = 'diagnosis'  # Assuming 'diagnosis' is the class label
features = [col for col in data.columns if col != target]

# 3. Build the decision tree
tree = build_tree(data, features, target)
print("Decision Tree:", tree)
print(" ")
print(" ")
print(" ")
print(" ")
print(" ")

# 4. Make a prediction (Example instance)
for i in range(100):
    example_instance = data.iloc[i]  # Example instance for prediction
    prediction = classify(tree, example_instance)
    print(f"Prediction for {i}th sample:", prediction)

Output :

The ID3 classifier is successfully able to classify the samples into malignant (M) or benign (B)
cancer.

Conclusion :

• The ID3 algorithm is a straightforward approach to decision tree classification. It
works by choosing the feature that best separates the data at each step.
• In practice, ID3 can be used for small to medium-sized datasets with categorical
features.
Experiment 8
Aim :
Write a program to implement the Random Forest algorithm.

Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.

Algorithm :
Random Forest is an ensemble learning method that operates by building multiple decision
trees during training and outputting the mode of the classes (for classification) or mean
prediction (for regression) of the individual trees. This technique is used primarily for
classification and regression tasks.
Key steps in the Random Forest algorithm:
1. Data Bootstrapping: Randomly select subsets of the training data (with replacement)
for each tree.
2. Feature Randomization: At each node of a tree, a random subset of features is
considered for splitting, which helps reduce the correlation between trees and
increases the diversity of the model.
3. Training Trees: Build multiple decision trees using different subsets of the data and
features.
4. Voting/Averaging: After all trees are built, combine their predictions through voting
(for classification) or averaging (for regression) to obtain the final output.
The main advantages of Random Forest are its robustness to overfitting and its ability to
handle large datasets with higher dimensionality.
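The program below relies on scikit-learn's RandomForestClassifier, which implements all of these steps. Purely as an illustration of the bootstrap-and-vote idea (steps 1, 2 and 4), the sketch below builds similar behaviour from plain decision trees; the function name, tree count and the assumption of NumPy arrays with integer-encoded labels are mine, not part of the lab code.

# Illustrative sketch only: bag decision trees and combine them by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees_predict(X_train, y_train, X_test, n_trees=25, random_state=0):
    # X_train, y_train, X_test are assumed to be NumPy arrays; labels are integer-encoded
    rng = np.random.default_rng(random_state)
    all_votes = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample - draw rows with replacement
        idx = rng.integers(0, len(X_train), size=len(X_train))
        # Step 2: consider a random feature subset at each split
        tree = DecisionTreeClassifier(max_features='sqrt', random_state=int(rng.integers(1_000_000)))
        tree.fit(X_train[idx], y_train[idx])
        all_votes.append(tree.predict(X_test))
    # Step 4: majority vote across trees for each test sample
    votes = np.asarray(all_votes, dtype=int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)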

Program :
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'
data = pd.read_csv(file_path)

data = data.dropna()

label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the training data
rf_classifier.fit(X_train, y_train)

# Predict on the test data
y_pred = rf_classifier.predict(X_test)

# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Feature importance plot
feature_importance = rf_classifier.feature_importances_
features = X.columns

# Sort feature importance
sorted_idx = feature_importance.argsort()
plt.barh(features[sorted_idx], feature_importance[sorted_idx])
plt.xlabel("Feature Importance")
plt.title("Feature Importance from Random Forest")
plt.show()

Output :
The accuracy comes out to be 96.4% via the random forest classifier. The feature importance
derived by this model is also shown below.
Conclusion :
Accuracy: The model provides a good accuracy on the test set. Random Forest generally
works well even with a relatively small number of trees and performs better than individual
decision trees due to its ensemble nature.
Feature Importance: By analyzing the feature importance, you can gain insights into which
features are most influential in the classification decision. This is helpful in identifying
important variables for further analysis or feature engineering.
Experiment 9
Aim :
Write a program to implement the K-Nearest Neighbor algorithm.

Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.

Algorithm :
The K-Nearest Neighbors (K-NN) algorithm is a supervised learning technique that classifies
data based on the closest training examples in the feature space. The core idea behind K-NN
is simple: when a new data point is given, it is classified based on the majority class of its K
nearest neighbors from the training dataset. The distance between data points is usually
measured using metrics such as Euclidean distance.
Key Steps of K-NN Algorithm:
1. Choose the number K of neighbors. This is a hyperparameter that can be optimized
through cross-validation.
2. Compute the distance between the new data point and all points in the training
dataset.
3. Sort the distances in ascending order and select the K nearest points.
4. Determine the majority class from the K neighbors and assign this class to the new
data point.
5. Handle ties with an additional heuristic, such as preferring the class of the closest
neighbour among the tied classes or breaking the tie at random (a from-scratch sketch of
these steps is shown below).
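The program below uses scikit-learn's KNeighborsClassifier, but the steps above are simple enough to sketch directly. The helper below is an illustrative from-scratch version for a single query point; the function name and the assumption of integer-encoded labels are mine, not part of the lab code.

# Illustrative from-scratch K-NN prediction for a single query point.
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels (labels assumed integer-encoded)
    return np.bincount(np.asarray(y_train)[nearest]).argmax()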

Program :
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Breast Cancer Wisconsin dataset (CSV file)
file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'
data = pd.read_csv(file_path)

data = data.dropna()

label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Split dataset into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the feature data (important for K-NN)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize the K-NN classifier with K=5 (you can experiment with different K values)
knn = KNeighborsClassifier(n_neighbors=5)

# Train the model on the training data
knn.fit(X_train, y_train)

# Predict on the test data
y_pred = knn.predict(X_test)
print('Prediction')
print(y_pred)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:.2f}%")

# Classification report for more details on precision, recall, and F1 score
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Output :
The accuracy upon using K-NN as a classifier is 95.91% for k = 5 neighbours. We can see the
binary predictions and the classification report below.
If we increase the number of neighbours to 50, then the accuracy drops :

Conclusion :
K-NN: The K-NN algorithm works well for this task because it is intuitive and performs
reasonably well on this dataset. However, it is sensitive to the value of K, so we tune the
value of K to be optimal (low in this case).
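As a follow-up to the tuning point above, one common way to pick K is a cross-validated grid search. The snippet below is a sketch of that idea, reusing the scaled X_train and y_train from the program above; the K range of 1 to 30 and the 5-fold cross-validation are illustrative choices, not part of the original experiment.

# Illustrative sketch: choose K by 5-fold cross-validation on the training split.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {'n_neighbors': list(range(1, 31))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)  # X_train, y_train: the standardized training split from above
print("Best K:", grid.best_params_['n_neighbors'])
print("Best cross-validated accuracy:", grid.best_score_)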
