ML Lab File
Experiment 1
Aim :
Write a program to implement and compare different feature selection methods.
Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.
Algorithm :
Feature selection (also known as variable selection, attribute selection or subset selection) is
the process of selecting a subset of relevant features to use in machine learning model
building. It is one of the core concepts in machine learning which has a huge impact on
models’ performance. Given a pool of features, the process selects the subset of attributes
that are most important and contribute the most at prediction time. For this experiment, we
consider the following feature selection methods :
1. Filter Methods: Rely on the features’ characteristics without using any machine
learning algorithm. Very well suited for a quick screening and removal of irrelevant
features (see the sketch after this list).
a. Spearman Correlation
b. Pearson Correlation
c. Kendall Correlation
d. Chi Squared Test
e. Information Gain
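For example, the chi-squared test and information gain (estimated via mutual information) can be applied with scikit-learn's SelectKBest. A minimal sketch using the X_scaled, X_normalized and y variables prepared in the program below; k=10 is an illustrative choice, and the chi-squared test requires non-negative inputs such as min-max scaled data:
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
# Chi-squared test on non-negative (min-max scaled) features
chi2_selector = SelectKBest(score_func=chi2, k=10).fit(X_normalized, y)
chi2_features = np.where(chi2_selector.get_support())[0]
# Information gain, estimated via mutual information between each feature and the target
mi_selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X_scaled, y)
mi_features = np.where(mi_selector.get_support())[0]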
Program :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.feature_selection import mutual_info_classif, chi2, SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Load the Breast Cancer Wisconsin dataset (same CSV as in the later experiments)
file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'
data = pd.read_csv(file_path)
data = data.dropna()
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
if y.dtype == 'O':
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
min_max_scaler = MinMaxScaler()
X_normalized = min_max_scaler.fit_transform(X)
selected_features = {}
# Correlate each feature with the target (rather than with another feature)
y_series = pd.Series(y, index=X.index)
# Pearson Correlation
pearson_corr = X.corrwith(y_series, method='pearson').abs()
selected_features['Pearson'] = np.where(pearson_corr > 0.4)[0]
# Spearman Correlation
spearman_corr = X.corrwith(y_series, method='spearman').abs()
selected_features['Spearman'] = np.where(spearman_corr > 0.4)[0]
# Kendall Correlation
kendall_corr = X.corrwith(y_series, method='kendall').abs()
selected_features['Kendall'] = np.where(kendall_corr > 0.4)[0]
# Wrapper methods (the estimator and number of selected features are illustrative choices)
rfe_selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe_selector = rfe_selector.fit(X_scaled, y)
selected_features['Forward_Selection'] = np.where(rfe_selector.support_)[0]
k_best_selector = SelectKBest(score_func=f_classif, k=10)
k_best_selector.fit(X_scaled, y)
selected_features['Backward_Elimination'] = np.where(k_best_selector.get_support())[0]
# Train a Naive Bayes classifier on each selected feature subset and record its test accuracy
accuracy_rows = []
for method, feature_idx in selected_features.items():
    X_train, X_test, y_train, y_test = train_test_split(X_scaled[:, feature_idx], y, test_size=0.3, random_state=42)
    nb_model = GaussianNB().fit(X_train, y_train)
    accuracy_rows.append({'Feature Selection Method': method,
                          'Accuracy': accuracy_score(y_test, nb_model.predict(X_test))})
accuracy_df = pd.DataFrame(accuracy_rows)
plt.figure(figsize=(10, 6))
sns.barplot(x='Feature Selection Method', y='Accuracy', data=accuracy_df, palette='viridis')
plt.title('Accuracy of Naive Bayes Classifier with Different Feature Selection Methods')
plt.xlabel('Feature Selection Method')
plt.ylabel('Accuracy')
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 1)
plt.tight_layout()
plt.show()
Output :
Selected Features via Pearson, Spearman and Kendall Correlation (Thresholds : 0.6, 0.6, 0.4):
Selected Features via Forward Feature Selection, Backward Feature Elimination, Step Wise
Selection :
A comparison of different evaluation metrics for the Naïve Bayes classifier :
Conclusion :
We can see that forward feature selection performs the best of all the methods considered.
Experiment 2
Aim :
Write a program to implement linear regression without using python libraries.
Dataset :
Student Performance Dataset contains two columns: Hours and Scores.
• Hours: Represents the number of hours spent studying or practicing (independent
variable or feature).
• Scores: Represents the scores achieved (dependent variable or target).
Algorithm :
Linear regression is used for finding a linear relationship between the target and one or more
predictors. There are two types of linear regression: simple linear regression, which uses a
single predictor, and multiple linear regression, which uses two or more predictors. In simple
linear regression, the best-fit line y = b0 + b1*x is obtained from the least-squares estimates
b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b0 = ȳ − b1·x̄.
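A minimal sketch of these closed-form least-squares estimates, which the program below also computes on the hours and scores arrays:
import numpy as np

def least_squares_fit(x, y):
    # Closed-form least-squares estimates for simple linear regression
    mean_x, mean_y = np.mean(x), np.mean(y)
    slope = np.sum((x - mean_x) * (y - mean_y)) / np.sum((x - mean_x) ** 2)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Example: points on the line y = 2x + 1 recover slope 2 and intercept 1
print(least_squares_fit(np.array([1, 2, 3, 4]), np.array([3, 5, 7, 9])))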
Program :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
student_scores = pd.read_csv(r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\student_scores.csv')
student_scores.isnull().sum()
print(student_scores)
corr_matrix = student_scores.corr()
print(corr_matrix)
x=student_scores.Hours.values.reshape(-1,1)
y=student_scores.Scores.values.reshape(-1,1)
plt.scatter(x,y,color='blue')
plt.xlabel('Number of Hours studied')
plt.ylabel('Scores Obtained')
plt.title('Simple Linear graph')
plt.show()
hours = student_scores['Hours']
scores = student_scores['Scores']
hours = np.array(hours)
scores = np.array(scores)
mean_x = np.mean(hours)
mean_y = np.mean(scores)
# Least-squares estimates of the slope and intercept
slope = np.sum((hours - mean_x) * (scores - mean_y)) / np.sum((hours - mean_x) ** 2)
intercept = mean_y - slope * mean_x
def predict(x):
    return slope * x + intercept
predictions = predict(hours)
print("\nPredictions:")
for h, p in zip(hours, predictions):
    print(f"Hours: {h}, Predicted Score: {p}")
plt.figure(figsize=(10, 6))
plt.scatter(hours, scores, color='blue', label='Actual Scores')
plt.plot(hours, predictions, color='red', linewidth=2, label='Fitted Line')
plt.xlabel('Hours')
plt.ylabel('Scores')
plt.title('Actual vs Predicted Scores')
plt.legend()
plt.grid(True)
plt.show()
Output :
Conclusion :
The manually implemented linear regression model produces predictions close to the actual
scores, which corresponds to an accuracy of 95.2%.
Experiment 3
Aim :
Write a program to implement logistic regression without using python libraries.
Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.
Algorithm :
The logistic function, also called the sigmoid function, is an S-shaped curve that can take any
real-valued number and map it into a value between 0 and 1, but never exactly at those
limits: sigmoid(value) = 1 / (1 + e^(-value)). The key difference from linear regression is that
the output value being modeled is a binary value (0 or 1) rather than a numeric value.
Logistic regression is named for the function used at the core of the method, the logistic function.
Based on the number of categories, Logistic regression can be classified as:
1. binomial: target variable can have only 2 possible types: “0” or “1” which may
represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
2. multinomial: target variable can have 3 or more possible types which are not
ordered (i.e. the types have no quantitative significance), like “disease A” vs “disease B” vs
“disease C”.
3. ordinal: it deals with target variables with ordered categories. For example, a test
score can be categorized as: “very poor”, “poor”, “good”, “very good”. Here, each
category can be given a score like 0, 1, 2, 3.
Logistic regression uses an equation as the representation, very much like linear regression.
Input values (x) are combined linearly using weights or coefficient values (referred to as the
Greek capital letter Beta) to predict an output value (y). Below is an example logistic
regression equation: y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) where y is the predicted output,
b0 is the bias or intercept term and b1 is the coefficient for the single input value (x).
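As a small illustration with hypothetical (not fitted) coefficients, b0 = -1 and b1 = 0.5 map the input x = 4 to a probability of about 0.73, which a 0.5 threshold would classify as class 1:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -1.0, 0.5             # illustrative coefficients, not fitted values
x = 4.0
y_hat = sigmoid(b0 + b1 * x)   # ≈ 0.731
label = int(y_hat >= 0.5)      # classified as 1 with a 0.5 threshold
print(y_hat, label)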
Program :
import numpy as np
import pandas as pd

# Load the dataset; as in the other experiments in this file, the last column is taken as the target
data = pd.read_csv(r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv').dropna()
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
if y.dtype == 'O':
    y = (y == 'M').astype(int)  # encode diagnosis as 1 = malignant, 0 = benign
y = np.asarray(y, dtype=float)
m = len(y)  # number of training examples
# Normalize features
X = (X - X.mean()) / X.std()
# Add a column of ones to X for the intercept term
X = np.c_[np.ones(X.shape[0]), X]  # Adding the intercept term
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost(X, y, theta):
    # Binary cross-entropy cost
    h = sigmoid(X.dot(theta))
    return -(1 / m) * np.sum(y * np.log(h + 1e-15) + (1 - y) * np.log(1 - h + 1e-15))

# Initialize parameters
theta = np.zeros(X.shape[1])
alpha = 0.01  # Learning rate
num_iters = 1000  # Number of iterations
cost_history = np.zeros(num_iters)

# Gradient descent
for i in range(num_iters):
    h = sigmoid(X.dot(theta))
    gradient = (1 / m) * X.T.dot(h - y)
    theta -= alpha * gradient
    cost_history[i] = compute_cost(X, y, theta)
# Output results
print("Optimized theta:", theta)
print("Final cost:", cost_history[-1])
# Calculate training accuracy with a 0.5 decision threshold
predictions = (sigmoid(X.dot(theta)) >= 0.5).astype(int)
accuracy = np.mean(predictions == y)
print("Training accuracy:", accuracy)
Output :
The optimized cost or theta via gradient descent at every iteration is as follows :
The predictions made from our manual logistic regression model look something like :
Conclusion :
Our manual logistic regression algorithm classifies the tumors as malignant or benign and
achieves an accuracy of 98.2%.
Experiment 4
Aim :
Write a program to implement the support Vector machine algorithm.
Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.
Algorithm :
Support Vector Machine (SVM) is a supervised machine learning algorithm commonly used
for classification tasks. It works by finding an optimal hyperplane that best separates the
data into different classes. For this experiment, SVM will attempt to classify tumors as either
malignant (cancerous) or benign (non-cancerous). The objective of the SVM is to maximize
the margin between data points of different classes, thus creating a robust classifier.
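As an aside on the margin, scikit-learn's SVC exposes the regularization parameter C, which trades margin width against training errors; the values and kernel below are illustrative:
from sklearn.svm import SVC

# Smaller C -> wider (softer) margin, more tolerant of misclassified training points
soft_margin_svm = SVC(kernel='linear', C=0.1)
# Larger C -> narrower margin, fits the training data more tightly
hard_margin_svm = SVC(kernel='linear', C=100.0)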
Program :
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'  # Update this with your file path
data = pd.read_csv(file_path)
data.head()
data.shape
data = data.dropna()
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
min_max_scaler = MinMaxScaler()
X_normalized = min_max_scaler.fit_transform(X)
X.corr()
plt.figure(figsize=(18, 12))
sns.heatmap(X.corr(), vmin=0.85, vmax=1, annot=True, cmap='YlGnBu', linewidths=.5)
correlation_matrix = X.corr()
# Correlation of each feature with the target variable
correlation_with_target = X.corrwith(y).abs()
# Select features with correlation > 0.5
selected_features = correlation_with_target[correlation_with_target > 0.5].index
selected_features
features = ['radius_mean', 'perimeter_mean', 'area_mean', 'compactness_mean',
            'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se',
            'area_se', 'radius_worst', 'perimeter_worst', 'area_worst',
            'compactness_worst', 'concavity_worst', 'concave points_worst']
X = X[features]
print(X.columns)
correlation_matrix = X.corr()
high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if correlation_matrix.iloc[i, j] > 0.9:
            high_corr_pairs.append((correlation_matrix.columns[i], correlation_matrix.columns[j]))
# Drop one feature from each highly correlated pair
X = X.drop(columns={pair[0] for pair in high_corr_pairs})

# Train/test split, scaling and SVM training (kernel and split size are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
svm_model = SVC(kernel='rbf', random_state=42)
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
confusion_mat = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", confusion_mat)
Output :
The correlation map of all the features looks like the one below. We took the features which
had a correlation of 0.5 or more with the target variable and for those selected features, we
dropped the ones which had higher correlation between each other (above 0.9).
The final output consists of the binary predictions made on the test data, the accuracy, the
confusion matrix, precision, recall, F1 score and support :
Conclusion :
The SVM model on the Breast Cancer Wisconsin dataset performs with high accuracy (97%),
demonstrating strong capability in distinguishing between malignant and benign tumors.
The high precision and recall scores reflect the model's reliability in clinical contexts, where
accurate classification is essential.
Experiment 5
Aim :
Write a program to implement back propagation neural network for classification of sample data
without using Python libraries
Dataset :
The Iris dataset is a well-known dataset in machine learning and statistics, often used for
classification tasks. It consists of 150 samples of iris flowers from three different species:
Setosa, Versicolor, and Virginica. The dataset includes 4 features, which are the
measurements of the flowers' sepals and petals:
1. Sepal Length: The length of the sepal in centimeters.
2. Sepal Width: The width of the sepal in centimeters.
3. Petal Length: The length of the petal in centimeters.
4. Petal Width: The width of the petal in centimeters.
The dataset also includes the Species label, which identifies the species of each iris flower as
either Setosa, Versicolor, or Virginica. The dataset is balanced, with 50 samples from each
species.
• Number of instances: 150
• Number of attributes: 4 features (sepal length, sepal width, petal length, petal
width)
• Classes: 3 species (Setosa, Versicolor, Virginica)
Algorithm :
Backpropagation Network
• It is a multilayer feed forward network
• BPN has two phases:
• Forward pass phase: computes ‘functional signal’, feed forward propagation of input
pattern signals through network
• Backward pass phase: computes ‘error signal’, propagates the error backwards
through network starting at output units (where the error is the difference between
actual and desired output values)
In this implementation, the neural network consists of:
• Input layer: 4 neurons (corresponding to the 4 input features).
• Hidden layers: 2 hidden layers (with 10 and 8 neurons respectively).
• Output layer: 3 neurons (corresponding to the 3 possible species of the iris).
The network is trained for a specified number of epochs (3000 in the updated version),
during which the weights are updated using the backpropagation algorithm to reduce the
classification error.
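The backward pass is not reproduced in the listing below; a minimal sketch of the output-layer update under sigmoid activations, where the function and variable names are illustrative and mirror the network described above:
import numpy as np

def sigmoid_derivative(a):
    # Derivative of the sigmoid written in terms of its output a = sigmoid(z)
    return a * (1 - a)

def output_layer_update(hidden_output, final_output, y_true, weights, bias, learning_rate):
    # Error signal at the output layer: (prediction - target) scaled by the activation slope
    delta = (final_output - y_true) * sigmoid_derivative(final_output)
    # Move weights and biases against the gradient of the error
    weights -= learning_rate * hidden_output.T.dot(delta)
    bias -= learning_rate * delta.sum(axis=0)
    return weights, bias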
Program :
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def load_data():
    # Read the Iris dataset (the file path is illustrative; adjust it to your local copy)
    data = pd.read_csv('Iris.csv')
    data = data.drop(columns=['Id'], errors='ignore')  # drop the Id column, as noted in the Output section
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y)
    y = OneHotEncoder(sparse_output=False).fit_transform(y.reshape(-1, 1))
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    return X, y
self.bias_hidden1 = np.random.rand(hidden_size1)
self.bias_hidden2 = np.random.rand(hidden_size2)
self.bias_output = np.random.rand(output_size)
self.hidden_input2 = np.dot(self.hidden_output1, self.weights_hidden1_hidden2) + self.bias_hidden2
self.hidden_output2 = self.sigmoid(self.hidden_input2)
self.final_input = np.dot(self.hidden_output2, self.weights_hidden2_output) + self.bias_output
self.final_output = self.sigmoid(self.final_input)
return self.final_output
X, y = load_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
input_size = X_train.shape[1]
hidden_size1 = 10 # First hidden layer neurons
hidden_size2 = 8 # Second hidden layer neurons
output_size = y_train.shape[1]
learning_rate = 0.1
epochs = 3000
y_pred = nn.predict(X_test)
print(y_pred)
y_test_labels = np.argmax(y_test, axis=1)
accuracy = accuracy_score(y_test_labels, y_pred)
Output :
After removing the ‘Id’ feature, the test predictions and the accuracy are produced as output.
• Upon using just one hidden layer and the following attribute values for the neural
network, the accuracy came out to be 71% :
hidden_size = 5
learning_rate = 0.1
epochs = 1000
• When we increase the hidden layers to 2 and use the following attribute values for
the neural network, the accuracy came out to be 91% :
hidden_size1 = 10 # First hidden layer neurons
hidden_size2 = 8 # Second hidden layer neurons
epochs = 3000 # Increase epochs for better training
Conclusion :
In this implementation, we used a backpropagation neural network to classify iris flowers
from the Iris dataset. The neural network was able to achieve an accuracy of around 71%
initially, and after improving the model by:
• Adding more neurons in the hidden layers,
• Introducing a second hidden layer, and
• Scaling the features using StandardScaler,
the accuracy was improved to 91%. With the improved model (more neurons, more epochs,
and better feature scaling), we observed an increase in performance, which demonstrates
the importance of tuning the model architecture and training parameters.
Experiment 6
Aim :
Write a program to implement the Principal Component Analysis technique of dimensionality
reduction and evaluate the performance with a classifier.
Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.
Algorithm :
Principal Component Analysis :
• The main idea of principal component analysis (PCA) is to reduce the dimensionality
of a data set consisting of many variables correlated with each other, either heavily
or lightly, while retaining the variation present in the dataset, up to the maximum
extent.
• The same is done by transforming the variables to a new set of variables, which are
known as the principal components (or simply, the PCs) and are orthogonal, ordered
such that the retention of variation present in the original variables decreases as we
move down in the order.
• So, in this way, the 1st principal component retains maximum variation that was
present in the original components.
• The principal components are the eigenvectors of a covariance matrix, and hence
they are orthogonal.
• A principal component can be defined as a linear combination of optimally weighted
observed variables.
• The outputs of PCA are these principal components, whose number is equal to the
number of original variables.
• The PCs possess some useful properties which are listed below:
o The PCs are essentially linear combinations of the original variables; the
weight vector in each combination is the corresponding eigenvector, which in
turn satisfies the principle of least squares.
o The PCs are orthogonal.
o The variation present in the PCs decreases as we move from the 1st PC to the
last one, hence the ordering by importance.
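A minimal sketch of this idea with NumPy, treating the principal components as the eigenvectors of the covariance matrix of a standardized feature matrix X_std (an assumed input, e.g. the scaled breast cancer features):
import numpy as np

def pca_transform(X_std, n_components=2):
    # Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # Its eigenvectors are the principal components
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort by decreasing eigenvalue so the 1st PC retains the most variance
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    explained_ratio = eigvals[order] / eigvals.sum()
    return X_std @ components, explained_ratio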
Program :
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import make_pipeline
# Load the Breast Cancer Wisconsin (Diagnostic) dataset
file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'  # Update this with your file path
data = pd.read_csv(file_path)
data = data.dropna()
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
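The listing above ends before the PCA and classification steps; a minimal sketch of one possible completion, where the number of components, kernel and split size are illustrative choices:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Pipeline: standardize -> project onto 10 principal components -> SVM classifier
pca_svm = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel='rbf', random_state=42))
pca_svm.fit(X_train, y_train)
y_pred = pca_svm.predict(X_test)
print("Accuracy with PCA:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))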
Conclusion :
• PCA helps reduce the feature space, which can speed up the model training time
and help with overfitting.
• SVM works well with the dataset both with and without PCA. The performance
might remain similar, but with reduced features, the model might become more
generalizable, especially when dealing with noisy or redundant features.
Experiment 7
Aim :
Write a program to implement the ID3 algorithm.
Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.
Algorithm :
ID3 Algorithm :
• ID3 stands for Iterative Dichotomiser 3.
• It uses the notion of information gain, which is defined in terms of entropy.
• It is a classification algorithm that follows a greedy approach to building a decision
tree, at each step selecting the attribute that yields the maximum Information Gain (IG).
• It attempts to create the shortest possible decision tree.
Input Data: The dataset consists of instances with attributes (features) and labels (class).
Steps:
• Calculate Entropy: Entropy is a measure of impurity in the dataset. The goal is to
reduce this impurity with each split.
• Information Gain: For each attribute, calculate the information gain by splitting the
data based on that attribute. The attribute with the highest information gain will be
chosen for the split.
• Build Tree Recursively: The ID3 algorithm recursively builds a decision tree by
selecting the best attribute at each level until all data is classified or a stopping
criterion is met (such as when all data is pure or no more attributes are left).
Stopping Criteria:
• All instances in the dataset belong to the same class.
• There are no remaining attributes to split on.
• All instances have the same attribute values.
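A minimal sketch of the entropy and information-gain computations the algorithm relies on (data is assumed to be a pandas DataFrame and target the name of the label column, mirroring the program below):
import numpy as np
import pandas as pd

def entropy(labels):
    # Entropy of a label column: -sum(p * log2(p)) over the class proportions
    proportions = labels.value_counts(normalize=True)
    return -np.sum(proportions * np.log2(proportions))

def information_gain(data, feature, target):
    # Reduction in entropy obtained by splitting the data on `feature`
    total_entropy = entropy(data[target])
    weighted_entropy = sum((len(subset) / len(data)) * entropy(subset[target])
                           for _, subset in data.groupby(feature))
    return total_entropy - weighted_entropy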
Program :
import pandas as pd
import numpy as np
from collections import Counter
# Base Case 2: If there are no more features to split on, return the majority class
if len(features) == 0:
    return data[target].mode()[0]
return tree
def tree_depth(tree):
    # If the tree is a leaf node (i.e., a class label), the depth is 0
    if not isinstance(tree, dict):
        return 0
# Example Usage:
# 1. Load data
file_path = r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv'
data = load_data(file_path)
Output :
The ID3 classifier is successfully able to classify the samples into malignant (M) or benign (B)
tumors.
Conclusion :
The ID3 decision tree built using information gain is able to classify the tumor samples as
malignant or benign.
Experiment 8
Aim :
Write a program to implement the Random Forest algorithm.
Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.
Algorithm :
Random Forest is an ensemble learning method that operates by building multiple decision
trees during training and outputting the mode of the classes (for classification) or mean
prediction (for regression) of the individual trees. This technique is used primarily for
classification and regression tasks.
Key steps in the Random Forest algorithm:
1. Data Bootstrapping: Randomly select subsets of the training data (with replacement)
for each tree.
2. Feature Randomization: At each node of a tree, a random subset of features is
considered for splitting, which helps reduce the correlation between trees and
increases the diversity of the model.
3. Training Trees: Build multiple decision trees using different subsets of the data and
features.
4. Voting/Averaging: After all trees are built, combine their predictions through voting
(for classification) or averaging (for regression) to obtain the final output.
The main advantages of Random Forest are its robustness to overfitting and its ability to
handle large datasets with higher dimensionality.
Program :
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
# Load the dataset (same CSV as in the earlier experiments)
data = pd.read_csv(r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv')
data = data.dropna()
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
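The listing above stops before the model is trained; a minimal sketch of one possible completion, where the number of trees and the 70/30 split are illustrative choices:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Feature importances derived from the trained forest
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values()
importances.plot(kind='barh', figsize=(8, 10))
plt.title('Random Forest Feature Importances')
plt.show()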
Output :
The accuracy comes out to be 96.4% with the random forest classifier. The feature importances
derived by this model are also shown below.
Conclusion :
Accuracy: The model provides a good accuracy on the test set. Random Forest generally
works well even with a relatively small number of trees and performs better than individual
decision trees due to its ensemble nature.
Feature Importance: By analyzing the feature importance, you can gain insights into which
features are most influential in the classification decision. This is helpful in identifying
important variables for further analysis or feature engineering.
Experiment 9
Aim :
Write a program to implement the K- nearest neighbor algorithm.
Dataset :
Breast Cancer Wisconsin Dataset
The Breast Cancer Wisconsin dataset is a popular dataset used for binary classification tasks
in machine learning. It is commonly used for research and educational purposes, particularly
in the context of medical diagnostics, as it involves determining whether a breast cancer
tumor is benign or malignant based on a set of features computed from a digitized image of
a fine needle aspirate (FNA) of a breast mass.
Dataset Composition
• Number of Instances: 569
• Number of Attributes (Features): 30 numeric features (floating-point) and 1 target
attribute.
• Target Attribute:
o 0 for benign tumors.
o 1 for malignant tumors.
Here is a detailed list of the features:
1. Radius: Mean of distances from the center to points on the perimeter.
2. Texture: Standard deviation of gray-scale values.
3. Perimeter: The total distance around the boundary of the nucleus.
4. Area: The area within the boundary of the nucleus.
5. Smoothness: Local variation in radius lengths.
6. Compactness: Perimeter² / area - 1.0.
7. Concavity: Severity of concave portions of the contour.
8. Concave Points: Number of concave portions of the contour.
9. Symmetry: Symmetry of the nucleus.
10. Fractal Dimension: "Coastline approximation" - 1 (roughness).
Each of these 10 basic features is computed in three different ways (mean, standard error,
and "worst" or largest). Thus, there are 30 features in total.
Dataset Format
The dataset is usually presented in a tabular format with the following columns:
• ID Number: Unique identifier for each instance.
• Diagnosis: Target class (M = malignant, B = benign).
• 30 Feature Columns: As described above.
Algorithm :
The K-Nearest Neighbors (K-NN) algorithm is a supervised learning technique that classifies
data based on the closest training examples in the feature space. The core idea behind K-NN
is simple: when a new data point is given, it is classified based on the majority class of its K
nearest neighbors from the training dataset. The distance between data points is usually
measured using metrics such as Euclidean distance.
Key Steps of K-NN Algorithm:
1. Choose the number K of neighbors. This is a hyperparameter that can be optimized
through cross-validation.
2. Compute the distance between the new data point and all points in the training
dataset.
3. Sort the distances in ascending order and select the K nearest points.
4. Determine the majority class from the K neighbors and assign this class to the new
data point.
5. Handle ties with additional heuristics, such as preferring the class of the single nearest
neighbour or breaking the tie at random. A from-scratch sketch of these steps follows this list.
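A from-scratch sketch of these steps (Euclidean distance plus a majority vote), assuming NumPy arrays for X_train and y_train; this is purely illustrative, since the program below uses scikit-learn's implementation:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the K nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Step 4: majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]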
Program :
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load the dataset (same CSV as in the earlier experiments)
data = pd.read_csv(r'C:\Users\tulik\Desktop\IGDTUW\ML\ML Lab\Lab Datasets\breast_cancer_wisconsin.csv')
data = data.dropna()
label_encoders = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
# Split dataset into training and testing sets (70% training, 30% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the K-NN classifier with K=5 (you can experiment with different K values)
knn = KNeighborsClassifier(n_neighbors=5)
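The listing ends before the classifier is fitted; a minimal sketch of one possible completion, standardizing the features first since K-NN is distance based:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))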
Output :
The accuracy upon using K-NN as the classifier is 95.91% for K = 5 neighbours. We can see the
binary predictions and the confusion matrix below.
If we increase the number of neighbours to 50, the accuracy drops :
Conclusion :
K-NN: The K-NN algorithm works well for this task because it is intuitive and performs
reasonably well on this dataset. However, it can be sensitive to the value of K, so we tune K
to its optimal value (low in this case).