
EXPERIMENT - 1

AIM:
Exploring and demonstrating Python.

THEORY:
Python is a high-level, interpreted programming language known for its simplicity
and readability. It is widely used in various fields such as web development, data analysis,
machine learning, automation, and more. Python's syntax is designed to be easy to read and
write, making it an excellent choice for beginners and experienced programmers alike.

1. Classes and Objects: Python is an object-oriented programming (OOP) language, which means it supports the concepts of classes and objects. A class is a blueprint for creating objects, which are instances of the class. Classes encapsulate data and functions that operate on that data. Objects are instances that hold the actual data and can use the class's methods.

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def greet(self):
        return f'Hello, my name is {self.name} and I am {self.age} years old.'

# Create an instance of the class
person = Person('Adi', 19)
print(person.greet())

2. Functions: Functions are blocks of reusable code that perform a specific task. They allow
for modular and organized code, making it easier to manage and debug. Python functions
are defined using the def keyword followed by the function name and parameters.
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

# Using the functions
print(add(5, 3))       # Output: 8
print(subtract(5, 3))  # Output: 2

3. SciPy: SciPy stands for "Scientific Python" and is an open-source Python library used for scientific and technical computing. It builds on NumPy and provides a large collection of mathematical algorithms and convenience functions, making it easier to perform scientific and engineering tasks. Here are a few key components of SciPy:
1. Linear Algebra: Provides functions for matrix operations, solving linear systems,
eigenvalue problems, and more.
2. Optimization: Contains functions for finding the minimum or maximum of functions
(optimization), including linear programming and curve fitting.
3. Integration: Offers methods for calculating integrals, including numerical integration and
ordinary differential equations (ODE) solvers.
4. Statistics: Includes functions for statistical distributions, hypothesis testing, and
descriptive statistics.
5. Signal Processing: Provides tools for filtering, signal analysis, and Fourier transforms.

Linear Algebra
import numpy as np
from scipy import linalg

# Creating a matrix
A = np.array([[1, 2], [3, 4]])

# Computing the determinant
det = linalg.det(A)
print("Determinant:", det)

# Solving a linear system of equations
b = np.array([5, 6])
x = linalg.solve(A, b)
print("Solution:", x)

This code demonstrates how to compute the determinant of a matrix and solve a linear system of equations using SciPy's linear algebra module.

OUTPUT

Statistics
import numpy as np
from scipy import stats

# Creating a dataset
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5])

# Computing descriptive statistics
mean = np.mean(data)
std_dev = np.std(data)
median = np.median(data)
print("Mean:", mean)
print("Standard Deviation:", std_dev)
print("Median:", median)

# Performing a one-sample t-test against a hypothesized mean of 3
t_stat, p_value = stats.ttest_1samp(data, 3)
print("T-statistic:", t_stat)
print("P-value:", p_value)

This code demonstrates how to compute descriptive statistics and perform a t-test using SciPy's statistics module.

OUTPUT

Scikit-learn is an open-source Python library for machine learning. It is built on NumPy, SciPy, and Matplotlib and provides simple and efficient tools for data analysis and modeling. Here are a few key components of scikit-learn:


1. Supervised Learning: Involves training a model on a labeled dataset, meaning the input
data is paired with the correct output. Examples include classification and regression.
2. Unsupervised Learning: Involves training a model on data without labeled responses.
Examples include clustering and dimensionality reduction.
3. Model Selection and Evaluation: Tools for evaluating and comparing different models,
including cross-validation and various metrics.
4. Preprocessing: Functions for feature extraction, normalization, and data transformation to
prepare data for modeling.

Supervised Learning: Classification
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

This code demonstrates how to load the Iris dataset, preprocess the data, train a Support Vector Machine (SVM) classifier, make predictions, and evaluate the model.

OUTPUT

Unsupervised Learning: Clustering
from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)

# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('KMeans Clustering')
plt.show()

This code demonstrates how to load the Iris dataset, train a KMeans clustering model, and
visualize the clusters.

OUTPUT

Model Selection and Evaluation: Cross-Validation
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Load the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Train a Random Forest classifier with cross-validation
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)

# Print the cross-validation scores
print("Cross-Validation Scores:", scores)
print("Mean Cross-Validation Score:", scores.mean())

This code demonstrates how to use cross-validation to evaluate the performance of a Random
Forest classifier on the Iris dataset.

OUTPUT

Preprocessing: Feature Scaling
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# Create a sample dataset
data = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])

# Apply Min-Max scaling
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

# Print the scaled data
print("Scaled Data:\n", scaled_data)


This code demonstrates how to apply Min-Max scaling to a sample dataset to normalize the
features.

OUTPUT:

EXPERIMENT - 2

AIM:
Perform data preprocessing tasks such as outlier detection, handling missing values, analyzing redundancy, and normalization on different datasets.

THEORY:
Data preprocessing is a crucial step in the machine learning pipeline. It ensures that the data
fed into models is clean, consistent, and formatted appropriately. Poor data quality can
significantly degrade the performance of machine learning algorithms.

1. Handling Missing Values
Missing values occur when no data value is stored for a variable in an observation.
2. Outlier Detection
Outliers are data points that differ significantly from others in the dataset. They can skew the performance of models and affect accuracy.
3. Analyzing Redundancy
Redundancy occurs when two or more features provide the same information.
4. Normalization
Normalization scales the data to a standard range, especially useful when features have different units or ranges.
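
A common rule of thumb for outlier detection, and the one used in the code below, is the interquartile range (IQR) method: with Q1 and Q3 the 25th and 75th percentiles, IQR = Q3 - Q1, and any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is flagged as an outlier. Similarly, min-max normalization rescales each value as x' = (x - min) / (max - min), mapping the feature into the range [0, 1]; this is the formula applied in the normalization step below.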

CODE
# missing values
import pandas as pd

df = pd.read_csv('students.csv')

print("Original Data:\n", df)


df.fillna(df.mean(numeric_only=True), inplace=True)
print("\nAfter Filling Missing Values (with mean):\n", df)

# outlier detection
import pandas as pd

df = pd.read_csv('salaries.csv')
salaries = df['Salary']

Q1 = salaries.quantile(0.25)
Q3 = salaries.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(salaries < lower_bound) | (salaries > upper_bound)]

print("Outliers Detected:\n", outliers)

# redundancy analysis
import pandas as pd

df = pd.read_csv('products.csv')

redundant_columns = []
for col1 in df.columns:
for col2 in df.columns:
if col1 != col2 and df[col1].equals(df[col2]):
redundant_columns.append((col1, col2))

print("Redundant Columns:", redundant_columns)

# normalization
import pandas as pd

df = pd.read_csv('athletes.csv')

for col in ['Height', 'Weight']:
    min_val = df[col].min()
    max_val = df[col].max()
    df[col + '_Normalized'] = (df[col] - min_val) / (max_val - min_val)

print("After Normalization:\n", df)

OUTPUT

EXPERIMENT - 3

AIM:
Write a program to implement Linear Regression using any appropriate dataset.

THEORY:
What is Linear Regression?
Linear Regression is a supervised machine learning algorithm used to model the relationship
between a dependent variable (target) and one or more independent variables (features). It
assumes a linear relationship between the variables — that is, the change in the target
variable is proportional to the change in the feature variable(s).

The goal of Linear Regression is to find the best-fitting straight line (regression line) that
minimizes the difference between the actual data points and the predicted values from the
model.

It is commonly used for predictive analysis, such as estimating sales, prices, or salaries based
on certain inputs.
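
For a single feature, the fitted line has the form y = b0 + b1*x, where the least-squares slope is b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and the intercept is b0 = ȳ - b1*x̄. The short sketch below computes these quantities directly with NumPy on a small made-up sample (the values are illustrative, not taken from the salary_dataset.csv file used in the main code) and should agree with what scikit-learn's LinearRegression fits on the same points.

import numpy as np

# Illustrative sample (hypothetical values, for demonstration only)
x = np.array([1.1, 2.0, 3.2, 4.5, 5.1])            # years of experience
y = np.array([39000, 46000, 60000, 65000, 70000])  # salary

# Closed-form least-squares estimates
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print("Slope:", slope)
print("Intercept:", intercept)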

CODE
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the dataset
data = pd.read_csv("salary_dataset.csv")

# Extract features and target
X = data[["YearsExperience"]]
y = data["Salary"]

# Create and train the model
model = LinearRegression()
model.fit(X, y)

# Display the coefficient and intercept
print(f"Model coefficient (slope): {model.coef_[0]}")
print(f"Model intercept: {model.intercept_}")

# Predict salary for a specific experience (e.g., 1.3 years)
experience = 1.3
predicted_salary = model.predict([[experience]])
print(f"Predicted salary for {experience} years of experience: {predicted_salary[0]:.2f}")

OUTPUT

EXPERIMENT - 4

AIM:
Write a program to exhibit the working of the decision tree based ID3 algorithm. With the
help of an appropriate data set, build the decision tree and classify a new sample.

THEORY:
ID3 is a decision tree algorithm developed by Ross Quinlan. It builds a decision tree from a
dataset by using a top-down, greedy approach to select the attribute that maximizes
Information Gain.

1. Entropy - Entropy measures the impurity or uncertainty in the dataset.
2. Information Gain (IG) - It tells us how much "information" a feature gives us about the class.

At each step, ID3 selects the feature that maximizes Information Gain. This means it chooses
the attribute that best separates the data into classes.
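
For reference, the standard definitions (these are what the entropy and info_gain functions in the code below compute) are:

Entropy(S) = - Σ p_i * log2(p_i), where p_i is the proportion of class i in S
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) * Entropy(S_v), where S_v is the subset of S for which attribute A has value v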

CODE
import pandas as pd
import numpy as np
import math
from collections import Counter

data = [
['Sunny', 'Hot', 'High', 'Weak', 'No'],
['Sunny', 'Hot', 'High', 'Strong', 'No'],
['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Strong', 'No'],
['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
['Sunny', 'Mild', 'High', 'Weak', 'No'],
['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Strong', 'No']
]

columns = ['Outlook', 'Temperature', 'Humidity', 'Wind', 'PlayTennis']

# Build the DataFrame from the data above (or load it from a CSV: df = pd.read_csv('play_tennis.csv'))
df = pd.DataFrame(data, columns=columns)

def entropy(target_col):
    values, counts = np.unique(target_col, return_counts=True)
    return -np.sum([(counts[i] / np.sum(counts)) * math.log2(counts[i] / np.sum(counts))
                    for i in range(len(values))])

def info_gain(data, split_attribute_name, target_name="PlayTennis"):
    total_entropy = entropy(data[target_name])
    vals, counts = np.unique(data[split_attribute_name], return_counts=True)

    weighted_entropy = np.sum([
        (counts[i] / np.sum(counts)) *
        entropy(data.where(data[split_attribute_name] == vals[i]).dropna()[target_name])
        for i in range(len(vals))
    ])

    return total_entropy - weighted_entropy

def ID3(data, original_data, features, target_attribute_name="PlayTennis", parent_node_class=None):

    if len(np.unique(data[target_attribute_name])) <= 1:
        return np.unique(data[target_attribute_name])[0]

    elif len(data) == 0:
        return np.unique(original_data[target_attribute_name])[
            np.argmax(np.unique(original_data[target_attribute_name], return_counts=True)[1])
        ]

    elif len(features) == 0:
        return parent_node_class

    else:
        parent_node_class = np.unique(data[target_attribute_name])[
            np.argmax(np.unique(data[target_attribute_name], return_counts=True)[1])
        ]

        item_values = [info_gain(data, feature, target_attribute_name) for feature in features]
        best_feature_index = np.argmax(item_values)
        best_feature = features[best_feature_index]

        tree = {best_feature: {}}
        features = [i for i in features if i != best_feature]

        for value in np.unique(data[best_feature]):
            sub_data = data.where(data[best_feature] == value).dropna()
            subtree = ID3(sub_data, original_data, features, target_attribute_name, parent_node_class)
            tree[best_feature][value] = subtree

        return tree

features = list(df.columns)
features.remove('PlayTennis')
tree = ID3(df, df, features)
print("Decision Tree:", tree)

def classify(sample, tree):
    for attr in tree:
        if sample[attr] in tree[attr]:
            subtree = tree[attr][sample[attr]]
            if isinstance(subtree, dict):
                return classify(sample, subtree)
            else:
                return subtree
        else:
            return "Unknown"

new_sample = {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High', 'Wind': 'Strong'}


prediction = classify(new_sample, tree)
print("Prediction for new sample:", prediction)
OUTPUT

EXPERIMENT - 5

AIM:
Write a program to demonstrate the working of the decision tree based C4.5 algorithm.
With the help of the data set used in the above experiment, build the decision tree and classify a new sample.

THEORY:
C4.5 is a decision tree algorithm developed by Ross Quinlan as an extension of ID3. It
addresses many of ID3’s limitations, especially around continuous data, pruning, and
overfitting.
It is widely used for classification problems and forms the basis for more advanced
algorithms like C5.0 and Random Forest.

C4.5 improves over ID3’s Information Gain by using Gain Ratio, which penalizes attributes
with many values.
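
In the standard formulation, the gain ratio of an attribute A on a dataset S is

SplitInfo(S, A) = - Σ_v (|S_v| / |S|) * log2(|S_v| / |S|)
GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)

which is exactly what the split_info and gain_ratio functions in the code below compute.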

Advantages of C4.5
●​ Can handle both categorical and numerical data.
●​ Deals with missing values.
●​ Uses pruning to improve generalization.
●​ Uses Gain Ratio to prevent bias toward many-valued attributes.
●​ Widely used and robust for practical classification problems.

CODE
import pandas as pd
import numpy as np
import math

df = pd.read_csv('play_tennis.csv')

def entropy(target_col):
    values, counts = np.unique(target_col, return_counts=True)
    return -np.sum([(counts[i] / np.sum(counts)) * math.log2(counts[i] / np.sum(counts))
                    for i in range(len(values))])

def split_info(data, split_attribute_name):
    vals, counts = np.unique(data[split_attribute_name], return_counts=True)
    return -np.sum([(counts[i] / np.sum(counts)) * math.log2(counts[i] / np.sum(counts))
                    for i in range(len(vals))])

def gain_ratio(data, split_attribute_name, target_name="PlayTennis"):
    ig = info_gain(data, split_attribute_name, target_name)
    si = split_info(data, split_attribute_name)
    return ig / si if si != 0 else 0

def info_gain(data, split_attribute_name, target_name="PlayTennis"):
    total_entropy = entropy(data[target_name])
    vals, counts = np.unique(data[split_attribute_name], return_counts=True)

    weighted_entropy = np.sum([
        (counts[i] / np.sum(counts)) *
        entropy(data.where(data[split_attribute_name] == vals[i]).dropna()[target_name])
        for i in range(len(vals))
    ])

    return total_entropy - weighted_entropy

def C45(data, original_data, features, target_attribute_name="PlayTennis", parent_node_class=None):
    if len(np.unique(data[target_attribute_name])) <= 1:
        return np.unique(data[target_attribute_name])[0]

    elif len(data) == 0:
        return np.unique(original_data[target_attribute_name])[
            np.argmax(np.unique(original_data[target_attribute_name], return_counts=True)[1])
        ]

    elif len(features) == 0:
        return parent_node_class

    else:
        parent_node_class = np.unique(data[target_attribute_name])[
            np.argmax(np.unique(data[target_attribute_name], return_counts=True)[1])
        ]

        item_values = [gain_ratio(data, feature, target_attribute_name) for feature in features]
        best_feature_index = np.argmax(item_values)
        best_feature = features[best_feature_index]

        tree = {best_feature: {}}
        features = [i for i in features if i != best_feature]

        for value in np.unique(data[best_feature]):
            sub_data = data.where(data[best_feature] == value).dropna()
            subtree = C45(sub_data, original_data, features, target_attribute_name, parent_node_class)
            tree[best_feature][value] = subtree

        return tree

features = list(df.columns)
features.remove('PlayTennis')
tree = C45(df, df, features)
print("C4.5 Decision Tree:", tree)

def classify(sample, tree):
    for attr in tree:
        if sample[attr] in tree[attr]:
            subtree = tree[attr][sample[attr]]
            if isinstance(subtree, dict):
                return classify(sample, subtree)
            else:
                return subtree
        else:
            return "Unknown"

# Example prediction
new_sample = {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High', 'Wind': 'Strong'}
prediction = classify(new_sample, tree)
print("Prediction for new sample:", prediction)

OUTPUT

EXPERIMENT - 6

AIM:
Write a program to demonstrate the working of the decision tree based CART algorithm.
Build the decision tree and classify a new sample using a suitable dataset. Compare the
performance of ID3, C4.5, and CART in terms of accuracy, recall, precision and sensitivity.

THEORY:
ID3 Algorithm
ID3 (Iterative Dichotomiser 3) is one of the earliest decision tree algorithms. It uses
Information Gain as the splitting criterion, which tends to favor attributes with many distinct
values. ID3 works only with categorical data and does not handle missing values. It also lacks
pruning, which makes it prone to overfitting, especially on noisy datasets.
●​ Splitting criterion: Information Gain
●​ Data types supported: Categorical only
●​ Missing value handling: Not supported
●​ Pruning: Not performed
●​ Output tree: Multi-way
●​ Performance (general):
●​ Accuracy: Around 85–90%
●​ Precision: Approximately 0.82
●​ Recall/Sensitivity: Approximately 0.84
●​ F1-Score: Around 0.83

C4.5 Algorithm
C4.5 is an improvement over ID3, also developed by Ross Quinlan. It uses Gain Ratio as the
splitting criterion, which corrects the bias seen in Information Gain. C4.5 supports both
categorical and continuous features and can handle missing values effectively. It performs
post-pruning, which helps prevent overfitting and improves generalization.
●​ Splitting criterion: Gain Ratio
●​ Data types supported: Categorical and continuous
●​ Missing value handling: Supported
●​ Pruning: Post-pruning is applied
●​ Output tree: Multi-way
●​ Performance (general):
●​ Accuracy: Around 88–93%
●​ Precision: Approximately 0.86
●​ Recall/Sensitivity: Approximately 0.89
●​ F1-Score: Around 0.87

CART Algorithm
CART (Classification and Regression Trees) is a binary decision tree algorithm that uses the
Gini Index to determine the best splits. It supports both categorical and continuous features and handles missing values well. CART constructs strictly binary trees and is capable of both
classification and regression, making it more versatile. It also includes a cost-complexity
pruning mechanism to avoid overfitting.
●​ Splitting criterion: Gini Index
●​ Data types supported: Categorical and continuous
●​ Missing value handling: Supported
●​ Pruning: Cost-complexity pruning is applied
●​ Output tree: Binary only
●​ Performance (general):
●​ Accuracy: Around 87–92%
●​ Precision: Approximately 0.85
●​ Recall/Sensitivity: Approximately 0.87
●​ F1-Score: Around 0.86
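
For reference, the Gini index of a set S with class proportions p_i is Gini(S) = 1 - Σ p_i², and a candidate split is scored by the weighted average of the Gini index of the groups it produces; this is what the gini_index function in the code below computes.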

CODE
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

df = pd.read_csv('play_tennis.csv')
data = df.values.tolist()
headers = df.columns.tolist()

def gini_index(groups, classes):
    n_instances = float(sum([len(group) for group in groups]))
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:
            continue
        score = 0.0
        for class_val in classes:
            proportion = [row[-1] for row in group].count(class_val) / size
            score += proportion ** 2
        gini += (1 - score) * (size / n_instances)
    return gini

def test_split(index, value, dataset):
    left, right = [], []
    for row in dataset:
        if row[index] == value:
            left.append(row)
        else:
            right.append(row)
    return left, right

def get_split(dataset):
    class_values = list(set(row[-1] for row in dataset))
    best_index, best_value, best_score, best_groups = 999, None, 999, None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < best_score:
                best_index, best_value, best_score, best_groups = index, row[index], gini, groups
    return {'index': best_index, 'value': best_value, 'groups': best_groups}

def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])

    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return

    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return

    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth + 1)

    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth + 1)

def build_tree(train, max_depth, min_size):
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root

def predict(node, row):
    if row[node['index']] == node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']

def entropy(data):
    labels = [row[-1] for row in data]
    counter = Counter(labels)
    total = len(data)
    return -sum((count / total) * np.log2(count / total) for count in counter.values())

def info_gain(data, attr_index):
    total_entropy = entropy(data)
    values = set(row[attr_index] for row in data)
    subsets = [[row for row in data if row[attr_index] == val] for val in values]
    weighted_entropy = sum((len(subset) / len(data)) * entropy(subset) for subset in subsets)
    return total_entropy - weighted_entropy

def gain_ratio(data, attr_index):
    gain = info_gain(data, attr_index)
    values = [row[attr_index] for row in data]
    split_info = entropy([[v] for v in values])
    return gain / split_info if split_info != 0 else 0

def majority_class(data):
    return Counter([row[-1] for row in data]).most_common(1)[0][0]

def id3(data, features):
    labels = [row[-1] for row in data]
    if labels.count(labels[0]) == len(labels):
        return labels[0]
    if not features:
        return majority_class(data)

    gains = [info_gain(data, i) for i in features]
    best_attr = features[gains.index(max(gains))]
    tree = {headers[best_attr]: {}}
    values = set(row[best_attr] for row in data)

    for value in values:
        subset = [row for row in data if row[best_attr] == value]
        if not subset:
            tree[headers[best_attr]][value] = majority_class(data)
        else:
            subtree = id3(subset, [i for i in features if i != best_attr])
            tree[headers[best_attr]][value] = subtree
    return tree

def c45(data, features):
    labels = [row[-1] for row in data]
    if labels.count(labels[0]) == len(labels):
        return labels[0]
    if not features:
        return majority_class(data)

    ratios = [gain_ratio(data, i) for i in features]
    best_attr = features[ratios.index(max(ratios))]
    tree = {headers[best_attr]: {}}
    values = set(row[best_attr] for row in data)

    for value in values:
        subset = [row for row in data if row[best_attr] == value]
        if not subset:
            tree[headers[best_attr]][value] = majority_class(data)
        else:
            subtree = c45(subset, [i for i in features if i != best_attr])
            tree[headers[best_attr]][value] = subtree
    return tree

def predict_tree(tree, row):
    if not isinstance(tree, dict):
        return tree
    attr = next(iter(tree))
    index = headers.index(attr)
    value = row[index]
    if value in tree[attr]:
        return predict_tree(tree[attr][value], row)
    else:
        return None

# Evaluate models
def evaluate_model(model_type):
    true_labels = [row[-1] for row in data]
    predictions = []

    if model_type == 'ID3':
        tree = id3(data, list(range(len(data[0]) - 1)))
        for row in data:
            pred = predict_tree(tree, row)
            predictions.append(pred if pred else majority_class(data))
    elif model_type == 'C4.5':
        tree = c45(data, list(range(len(data[0]) - 1)))
        for row in data:
            pred = predict_tree(tree, row)
            predictions.append(pred if pred else majority_class(data))
    elif model_type == 'CART':
        tree = build_tree(data, max_depth=5, min_size=1)
        for row in data:
            predictions.append(predict(tree, row))

    accuracy = accuracy_score(true_labels, predictions)
    precision = precision_score(true_labels, predictions, pos_label='Yes', zero_division=0)
    recall = recall_score(true_labels, predictions, pos_label='Yes', zero_division=0)
    f1 = f1_score(true_labels, predictions, pos_label='Yes', zero_division=0)

    print(f"{model_type} Results:")
    print("Accuracy:", round(accuracy, 3))
    print("Precision:", round(precision, 3))
    print("Recall / Sensitivity:", round(recall, 3))
    print("F1-Score:", round(f1, 3))
    print()

evaluate_model('ID3')
evaluate_model('C4.5')
evaluate_model('CART')

OUTPUT

EXPERIMENT - 7

AIM:
Build an Artificial Neural Network by implementing the Backpropagation algorithm and test
the same using appropriate data sets.

THEORY:
An Artificial Neural Network is a computational model inspired by the structure and
functioning of the biological brain. It is a key technique in the field of machine learning and
deep learning, used for recognizing complex patterns and solving problems like
classification, regression, and prediction.

Backpropagation:
Forward Pass: Input is passed through the network to generate output.
Loss Calculation: Difference between predicted and actual output is calculated using a loss
function (e.g., Mean Squared Error).
Backward Pass: Gradients are calculated using the chain rule to update weights and biases
(Backpropagation).
Weight Update: Weights are updated using gradient descent.
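
In the sketch below, the activation is the sigmoid σ(x) = 1 / (1 + e^(-x)), whose derivative can be written in terms of its output as σ(x) * (1 - σ(x)); this is why sigmoid_derivative takes the already-activated value as its argument. Each weight is moved against the gradient of the error, w ← w - η * ∂E/∂w with learning rate η (0.1 in the code); because the code defines the error as (y - predicted), the update appears there with a plus sign.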

CODE
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    return x * (1 - x)

# XOR Dataset
X = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])

y = np.array([[0],
[1],
[1],
[0]])

input_layer_neurons = 2
hidden_layer_neurons = 2
output_neurons = 1

np.random.seed(1)
wh = np.random.uniform(size=(input_layer_neurons, hidden_layer_neurons))
bh = np.random.uniform(size=(1, hidden_layer_neurons))
wo = np.random.uniform(size=(hidden_layer_neurons, output_neurons))
bo = np.random.uniform(size=(1, output_neurons))

epochs = 10000
learning_rate = 0.1

for i in range(epochs):
    # Forward Propagation
    hidden_input = np.dot(X, wh) + bh
    hidden_output = sigmoid(hidden_input)

    final_input = np.dot(hidden_output, wo) + bo
    predicted_output = sigmoid(final_input)

    # Backward Propagation
    error = y - predicted_output
    d_predicted_output = error * sigmoid_derivative(predicted_output)

    error_hidden_layer = d_predicted_output.dot(wo.T)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_output)

    # Update weights and biases
    wo += hidden_output.T.dot(d_predicted_output) * learning_rate
    bo += np.sum(d_predicted_output, axis=0, keepdims=True) * learning_rate
    wh += X.T.dot(d_hidden_layer) * learning_rate
    bh += np.sum(d_hidden_layer, axis=0, keepdims=True) * learning_rate

print("Final Output after Training:")
print(np.round(predicted_output, 3))

OUTPUT

EXPERIMENT - 8

AIM:
Write a program to implement the Naïve Bayesian classifier for appropriate dataset and
compute the performance measures of the model.

THEORY:
Naïve Bayes is a probabilistic machine learning algorithm based on Bayes’ Theorem,
particularly useful for classification tasks. It assumes that all features are independent of each
other, which is often not true in practice, but still gives good results—hence the name
"naïve."
Bayes’ Theorem:
P(H | X) = [P(X | H) * P(H)] / P(X)

Where:

●​ P(H∣X) = Posterior probability (Probability of hypothesis H given data X)


●​ P(X∣H) = Likelihood (Probability of data X given that hypothesis H is true)
●​ P(H) = Prior probability (Initial probability of hypothesis H)
●​ P(X) = Marginal probability (Total probability of data X)
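
Under the naïve independence assumption, the likelihood factorizes over the features, so classification reduces to choosing the label H that maximizes P(H) * Π P(x_i | H). The code below works with the logarithm of this product (a sum of log-probabilities) to avoid numerical underflow, and substitutes a small constant (1e-6) for feature values never seen with a given label.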

CODE
from collections import defaultdict
import math

dataset = [
['Sunny', 'Hot', 'High', 'Weak', 'No'],
['Sunny', 'Hot', 'High', 'Strong', 'No'],
['Overcast', 'Hot', 'High', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Cool', 'Normal', 'Strong', 'No'],
['Overcast', 'Cool', 'Normal', 'Strong', 'Yes'],
['Sunny', 'Mild', 'High', 'Weak', 'No'],
['Sunny', 'Cool', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'Normal', 'Weak', 'Yes'],
['Sunny', 'Mild', 'Normal', 'Strong', 'Yes'],
['Overcast', 'Mild', 'High', 'Strong', 'Yes'],
['Overcast', 'Hot', 'Normal', 'Weak', 'Yes'],
['Rain', 'Mild', 'High', 'Strong', 'No']
]

X = [row[:-1] for row in dataset]
y = [row[-1] for row in dataset]

classes = set(y)

def train_naive_bayes(X, y):
    total_samples = len(y)
    label_probs = defaultdict(float)
    feature_probs = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

    for i in range(total_samples):
        label = y[i]
        label_probs[label] += 1
        for j in range(len(X[i])):
            feature_value = X[i][j]
            feature_probs[j][feature_value][label] += 1

    for label in label_probs:
        label_probs[label] /= total_samples

    for feature_idx in feature_probs:
        for value in feature_probs[feature_idx]:
            for label in feature_probs[feature_idx][value]:
                feature_probs[feature_idx][value][label] /= label_probs[label] * total_samples

    return label_probs, feature_probs

# Prediction
def predict_naive_bayes(sample, label_probs, feature_probs):
    scores = {}
    for label in label_probs:
        log_prob = math.log(label_probs[label])
        for i in range(len(sample)):
            value = sample[i]
            if value in feature_probs[i] and label in feature_probs[i][value]:
                log_prob += math.log(feature_probs[i][value][label])
            else:
                log_prob += math.log(1e-6)
        scores[label] = log_prob
    return max(scores, key=scores.get)

# Train model
label_probs, feature_probs = train_naive_bayes(X, y)

# Test the model on a sample
test_sample = ['Sunny', 'Cool', 'High', 'Strong']
prediction = predict_naive_bayes(test_sample, label_probs, feature_probs)
print("Prediction for sample", test_sample, "=>", prediction)

OUTPUT

EXPERIMENT - 9

AIM:
Write a program to implement k-Nearest Neighbor algorithm to classify any dataset of your
choice. Print both correct and wrong predictions.

THEORY:
k-Nearest Neighbor is a supervised machine learning algorithm used for classification and
regression tasks. It is instance-based or lazy learning, meaning it doesn't learn a model during
training, but rather stores the training data and makes decisions during prediction. It classifies
a data point based on how its neighbors are classified.

How it Works (Steps):


●​ Choose a value of k (number of neighbors).
●​ Calculate the distance between the new input and all training points.
●​ Select the k closest points (neighbors).
●​ Perform majority voting among neighbors (for classification).
●​ Return the class with the most votes.
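
The distance measure used in the code below is the Euclidean distance d(a, b) = sqrt(Σ (a_i - b_i)²) between the feature vectors a and b.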

CODE
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# k-NN algorithm
def knn_predict(X_train, y_train, test_row, k):
    distances = []
    for i in range(len(X_train)):
        dist = euclidean_distance(test_row, X_train[i])
        distances.append((dist, y_train[i]))
    distances.sort(key=lambda x: x[0])
    k_nearest_labels = [label for (_, label) in distances[:k]]
    most_common = Counter(k_nearest_labels).most_common(1)
    return most_common[0][0]

iris = load_iris()
X = iris.data
y = iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Predict and track results
k = 3
correct, wrong = 0, 0
print("Predictions:\n")
for i in range(len(X_test)):
    prediction = knn_predict(X_train, y_train, X_test[i], k)
    actual = y_test[i]
    if prediction == actual:
        correct += 1
        print(f"Correct: Predicted={prediction}, Actual={actual}")
    else:
        wrong += 1
        print(f"Wrong  : Predicted={prediction}, Actual={actual}")

accuracy = correct / len(X_test) * 100

print(f"\nTotal Correct Predictions: {correct}")
print(f"Total Wrong Predictions  : {wrong}")
print(f"Accuracy: {accuracy:.2f}%")

OUTPUT

EXPERIMENT - 10

AIM:
Apply k-Means clustering algorithm on suitable datasets and comment on the quality of
clustering.

THEORY:
What is K-Means?
K-Means is an unsupervised learning algorithm used for clustering data into groups (called
clusters). It groups data points such that those in the same cluster are more similar to each
other than to those in other clusters.
It is widely used in market segmentation, pattern recognition, image compression, and other
applications where labeled data is not available.

Key Concepts
Unsupervised: No labeled output is required; the algorithm tries to discover natural
groupings.
K: The number of clusters you want to divide your data into.
Centroid: The center of a cluster. It’s the average of all points in the cluster.
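
One simple way to judge clustering quality, used in the code below, is the within-cluster sum of squares (WCSS): the sum, over all clusters and all points in each cluster, of the squared distance from the point to its cluster centroid. Lower WCSS indicates tighter, more compact clusters, and plotting WCSS against different values of k is the basis of the elbow method for choosing the number of clusters.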

CODE
import csv
import random
import math

def load_dataset(filename):
    with open(filename, 'r') as file:
        reader = csv.reader(file)
        next(reader)  # skip header
        dataset = []
        for row in reader:
            income = float(row[2])
            score = float(row[3])
            dataset.append([income, score])
    return dataset

def euclidean_distance(a, b):
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(len(a))))

def initialize_centroids(dataset, k):
    return random.sample(dataset, k)

def assign_clusters(dataset, centroids):
    clusters = [[] for _ in centroids]
    for point in dataset:
        distances = [euclidean_distance(point, centroid) for centroid in centroids]
        cluster_idx = distances.index(min(distances))
        clusters[cluster_idx].append(point)
    return clusters

def update_centroids(clusters):
    new_centroids = []
    for cluster in clusters:
        if cluster:
            mean = [sum(col) / len(col) for col in zip(*cluster)]
            new_centroids.append(mean)
        else:
            new_centroids.append([0] * len(clusters[0][0]))  # placeholder
    return new_centroids

def compute_wcss(clusters, centroids):
    wcss = 0
    for idx, cluster in enumerate(clusters):
        for point in cluster:
            wcss += euclidean_distance(point, centroids[idx]) ** 2
    return wcss

def k_means(dataset, k=3, max_iters=100):
    centroids = initialize_centroids(dataset, k)
    for _ in range(max_iters):
        clusters = assign_clusters(dataset, centroids)
        new_centroids = update_centroids(clusters)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    wcss = compute_wcss(clusters, centroids)
    return clusters, centroids, wcss

if __name__ == "__main__":
    dataset = load_dataset("customers.csv")
    clusters, centroids, wcss = k_means(dataset, k=3)

    for i, cluster in enumerate(clusters):
        print(f"Cluster {i+1}: {len(cluster)} customers")

    print(f"\n📉 WCSS (lower is better): {wcss:.2f}")

OUTPUT

EXPERIMENT - 11

AIM:
Write a program to implement ensemble algorithms - AdaBoost and Bagging using the
appropriate dataset and evaluate their performance on that dataset.

THEORY:
What are Ensemble Methods?
Ensemble methods are machine learning techniques that combine the predictions of multiple
models to produce a more accurate and robust prediction than any single model could
achieve. By aggregating several learners, ensemble methods help reduce overfitting, variance,
and bias.

Two popular ensemble techniques are:

AdaBoost (Adaptive Boosting):


AdaBoost is a boosting algorithm that builds a strong classifier by combining multiple weak
classifiers, typically decision trees with one level (decision stumps). It assigns weights to
each instance in the dataset, increasing the weights of incorrectly classified instances in each
iteration to focus more on them in the next model.
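
In the standard binary formulation, a weak learner with weighted error ε_t is given the vote α_t = 0.5 * ln((1 - ε_t) / ε_t), and the weights of the instances it misclassified are multiplied by e^(α_t) (then renormalized) so that the next learner concentrates on them; the final prediction is the weighted vote of all weak learners. The scikit-learn AdaBoostClassifier used below performs these updates internally.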

Bagging (Bootstrap Aggregating):


Bagging is an ensemble method that trains multiple models (usually of the same type) on
different random subsets of the training data and then averages their predictions (for
regression) or uses majority voting (for classification). Bagging helps reduce variance and
prevents overfitting.

CODE
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# AdaBoost
ada_model = AdaBoostClassifier(n_estimators=50)
ada_model.fit(X_train, y_train)
ada_preds = ada_model.predict(X_test)
print(f"AdaBoost Accuracy: {accuracy_score(y_test, ada_preds):.2f}")

# Bagging (use 'estimator' instead of 'base_estimator')


bag_model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50)
bag_model.fit(X_train, y_train)
bag_preds = bag_model.predict(X_test)
print(f"Bagging Accuracy: {accuracy_score(y_test, bag_preds):.2f}")

OUTPUT

EXPERIMENT - 12

AIM:
Select any two datasets based on their statistics and perform comparison among all the
implemented algorithms using them.

THEORY:

Model comparison involves evaluating the performance of multiple machine learning algorithms on specific datasets to determine which performs best for a given task. This is crucial because different models may yield varying results depending on the nature of the dataset (classification or regression) and its features.

In this implementation, we compare:

●​ Classification algorithms on the Iris dataset​

●​ Regression algorithms on the California Housing dataset

These comparisons help in understanding:

●​ The strengths and weaknesses of different models​

●​ How well a model generalizes to unseen data​

●​ The trade-offs between accuracy, interpretability, training time, etc.​
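
The regression metrics reported below are the mean squared error MSE = (1/n) * Σ (y_i - ŷ_i)², its square root RMSE = sqrt(MSE), and the coefficient of determination R² = 1 - Σ(y_i - ŷ_i)² / Σ(y_i - ȳ)², where values closer to 1 indicate a better fit. The classification metrics (accuracy, precision, recall, F1-score) are computed as weighted averages over the three Iris classes, matching the average='weighted' option used in the code.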

CODE
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             confusion_matrix, mean_squared_error, r2_score)

# Load Iris Dataset (for classification)


iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Load California Housing Dataset (for regression)
california = fetch_california_housing()
X_california = california.data
y_california = california.target

# Split datasets into training and testing sets
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42)
X_train_california, X_test_california, y_train_california, y_test_california = train_test_split(
    X_california, y_california, test_size=0.3, random_state=42)

# Classification Models (for Iris Dataset)


classifiers = {
'Logistic Regression': LogisticRegression(max_iter=200),
'Decision Tree': DecisionTreeClassifier(),
'Random Forest': RandomForestClassifier(),
'SVM': SVC(),
'K-Nearest Neighbors': KNeighborsClassifier(),
'Naive Bayes': GaussianNB()
}

# Regression Models (for California Housing Dataset)


regressors = {
'Linear Regression': LinearRegression(),
'Decision Tree Regressor': DecisionTreeRegressor(),
'Random Forest Regressor': RandomForestRegressor(),
'SVR': SVR(),
'K-Nearest Neighbors Regressor': KNeighborsRegressor(),
'Ridge Regression': Ridge()
}

# Function to evaluate classification models
def evaluate_classification_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
    cm = confusion_matrix(y_test, y_pred)
    return accuracy, precision, recall, f1, cm

# Function to evaluate regression models
def evaluate_regression_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    return mse, rmse, r2

# Evaluate Classification Models on Iris Dataset
classification_results = {}
for model_name, model in classifiers.items():
    accuracy, precision, recall, f1, cm = evaluate_classification_model(
        model, X_train_iris, X_test_iris, y_train_iris, y_test_iris)
    classification_results[model_name] = {'Accuracy': accuracy, 'Precision': precision,
                                          'Recall': recall, 'F1-Score': f1, 'Confusion Matrix': cm}

# Evaluate Regression Models on California Housing Dataset
regression_results = {}
for model_name, model in regressors.items():
    mse, rmse, r2 = evaluate_regression_model(
        model, X_train_california, X_test_california, y_train_california, y_test_california)
    regression_results[model_name] = {'MSE': mse, 'RMSE': rmse, 'R2': r2}

# Print Results for Classification
print("Classification Model Results (Iris Dataset):")
for model_name, result in classification_results.items():
    print(f"\n{model_name}:")
    print(f"Accuracy: {result['Accuracy']:.4f}")
    print(f"Precision: {result['Precision']:.4f}")
    print(f"Recall: {result['Recall']:.4f}")
    print(f"F1-Score: {result['F1-Score']:.4f}")
    print(f"Confusion Matrix:\n{result['Confusion Matrix']}\n")

# Print Results for Regression
print("\nRegression Model Results (California Housing Dataset):")
for model_name, result in regression_results.items():
    print(f"\n{model_name}:")
    print(f"MSE: {result['MSE']:.4f}")
    print(f"RMSE: {result['RMSE']:.4f}")
    print(f"R2: {result['R2']:.4f}")

OUTPUT

EXPERIMENT - 13

AIM:
Conduct a survey of at least five different machine learning tools available.

THEORY:
Machine learning tools are software platforms or libraries that provide functionalities to
build, train, evaluate, and deploy machine learning models. These tools can range from
programming libraries to complete no-code platforms and are essential for data scientists and
ML engineers.

They differ based on:

●​ User Interface (Code-based vs. No-code)​

●​ Level of abstraction (Low-level like TensorFlow vs. high-level like Scikit-learn)​

●​ Use case (General-purpose vs. specialized like AutoML or computer vision)​

●​ Integration with other tools (e.g., deployment on cloud or mobile)​

SURVEY

Each machine learning tool surveyed in this experiment has unique strengths and is suited for
specific use cases:

●​ Scikit-learn is highly suitable for beginners and practitioners working with classical
machine learning algorithms such as regression, classification, and clustering. Its
simplicity and extensive documentation make it ideal for educational purposes and
prototyping.​

●​ PyTorch is preferred in academic and research environments due to its flexibility and
dynamic computation graph, making it easier to debug and experiment with custom
deep learning models.​

●​ TensorFlow is a powerful tool for building and deploying deep learning models at
scale. It is especially well-suited for production environments due to its robust
deployment features, including TensorFlow Serving and TensorFlow Lite.​

●​ Google AutoML is designed for non-technical users or those looking for rapid model
development without the need for in-depth knowledge of machine learning. It
automates most of the model-building pipeline, including preprocessing, training, and
deployment.​

●​ Weka is a GUI-based tool that is easy to use and valuable for educational purposes
and initial data analysis. However, it lacks the capabilities needed for modern deep
learning tasks.​

In summary, the selection of a machine learning tool should be based on the specific
requirements of the task, including the complexity of the model, the level of user expertise,
and the intended deployment environment.
