ML LAB MANUAL R22 CSE(DS)
(Academic Year: 2024-25)
By
3 PROGRAM OUTCOMES
7 CO – PO MAPPING
8 LIST OF EXPERIMENTS
10 MINI PROJECT
COLLEGE MISSION
DEPARTMENT VISION
DEPARTMENT MISSION
PO12. Life-long learning: Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in the broadest context
of technological change.
PEO3. Have extensive knowledge in state-of-the-art frameworks in Design and Analysis of
Algorithms, and design industry-accepted AI solutions using modern tools.
Course Outcomes :
CO1. Understand modern notions in predictive data analysis
CO2. Select data and models, assess model complexity, and identify the trends
CO3. Understand a range of machine learning algorithms along with their strengths and
weaknesses
CO4. Build predictive models from data and analyze their performance
CO-PO Mapping:
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
CO1 3 3 3 3 3 1 2 1 2 2
CO2 3 3 3 3 3 1 2 1 1 1
CO3 3 3 3 3 3 1 2 1 1 1
CO4 3 3 3 3 3 1 2 1 1 1
TEXT BOOK :
1. Machine Learning, Tom M. Mitchell, MGH.
REFERENCE BOOK :
1. Machine Learning: An Algorithmic Perspective, Stephen Marsland, Taylor & Francis.
Description :
Measures of Central Tendency : Measures of central tendency are statistical metrics that
describe the center or middle of a dataset and are used to summarize the entire dataset.
There are 3 measures of central tendency: (i) Mean (ii) Median (iii) Mode.
(i) Mean : The sample mean $\bar{X}$ is computed as the sum of all n observed outcomes $X_i$
from the sample divided by the total number of observations n.
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
(ii) Mode : The number with the highest frequency is called the mode.
Example data set : 5, 2, 5, 5, 5, 5, 4, 1, 8, 5
Here 5 is the mode, since it occurs 6 times
and each of the other outcomes occurs only once.
So Mode = 5
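(iii) Median : The median is the middle value of a set of numbers arranged in ascending order.
The median is the same as the 50th percentile for the set of numbers.
Case 1 : If number of observations (n) is odd
$$\text{Median} = \text{value of the observation at the } \left[\frac{n+1}{2}\right]^{\text{th}} \text{ position}$$
Case 2 : If number of observations (n) is even
$$\text{Median} = \text{arithmetic mean of the values of the observations at the } \left[\frac{n}{2}\right]^{\text{th}} \text{ and } \left[\frac{n}{2}+1\right]^{\text{th}} \text{ positions}$$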
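The two median cases can be sketched in plain Python as follows. This is only an illustrative sketch: the function name `median` and the two sample lists (taken from the examples in this manual) are choices made here, not part of the lab program.

```python
def median(values):
    """Return the median using the odd/even position rules."""
    ordered = sorted(values)  # positions refer to the sorted data
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        # Case 1 (n odd): the ((n + 1) / 2)-th observation (1-based position)
        return ordered[mid]
    # Case 2 (n even): mean of the (n/2)-th and (n/2 + 1)-th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([10, 20, 30, 40, 40, 50, 50, 50, 60])   # n = 9
even_result = median([5, 2, 5, 5, 5, 5, 4, 1, 8, 5])        # n = 10
print(odd_result, even_result)
```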
Measure of Dispersion : A statistic that tells us how the data values are dispersed or spread
out is called the Measure of Dispersion. It is used to determine the spread of data in a set.
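(i) Variance : If $\bar{X}$ is the mean of the n observed outcomes $X_i$, then the variance is
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
(ii) Standard Deviation : The standard deviation is the square root of the variance:
$$\sigma = \sqrt{\text{Variance}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2}$$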
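As a cross-check of the two formulas, the sketch below computes the variance and standard deviation directly from their definitions (dividing by n, i.e. the population form) using only the standard library. The variable names are illustrative; the data is the example list used in the program for this experiment.

```python
import math
import statistics

data = [10, 20, 30, 40, 40, 50, 50, 50, 60]
n = len(data)

# Mean: sum of the observations divided by n
mean = sum(data) / n

# Variance: average squared deviation from the mean (divide by n)
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance
sd = math.sqrt(variance)

print(f"Variance: {variance}")
print(f"Standard Deviation: {sd}")

# The standard library's population variance uses the same definition
library_variance = statistics.pvariance(data)
```

Note that dividing by n - 1 instead of n gives the sample variance, which is a different quantity from the one defined above.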
Use the numpy library to compute Mean, Median, Variance, S.D. of given data set
Use the stats from scipy library to compute Mode of given data set
Program:
# Python program for Mean, mode, median, variance and Standard Deviation
import numpy as np
from scipy import stats
# Example data
data = [10, 20, 30, 40, 40, 50, 50, 50, 60]
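# Calculate mean
mean = np.mean(data)
# Calculate median
median = np.median(data)
# Calculate mode
mode = stats.mode(data)
# Calculate variance (population variance: divides by n)
variance = np.var(data)
# Calculate standard deviation (square root of the variance above)
sd = np.std(data)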
# Printing results
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {sd}")
Output :
Mean: 38.888888888888886
Median: 40.0
Mode: ModeResult(mode=50, count=3)
Variance: 232.09876543209873
Standard Deviation: 15.234788000891209
Aim: To implement Python Libraries such as Statistics, Math, Numpy and Scipy
Description :
1. statistics library : Python's built-in statistics module provides functions for descriptive
statistics of numeric data, such as mean, median, mode, geometric mean, harmonic mean,
variance and standard deviation.
2. math library : Provides access to mathematical functions, including basic math operations,
trigonometry, and logarithms. It works on single (scalar) numbers rather than on arrays.
3. Numpy library : NumPy is a Python library used for working with arrays.
It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
4. Scipy library : SciPy builds on NumPy and provides routines for scientific computing,
including special functions (scipy.special) and statistics (scipy.stats).
Program:
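# Python program for study of libraries statistics, math, numpy and scipy
import statistics
import math
import numpy as np
from scipy.special import comb
data = [10,20,30,40,50]
# statistics library: measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)  # all values occur once, so the first value is returned
gm = statistics.geometric_mean(data)
hm = statistics.harmonic_mean(data)
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Geometric Mean : {gm}")
print(f"Harmonic Mean : {hm}")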
Output :
Mean: 30
Median: 30
Mode: 10
Geometric Mean : 26.051710846973528
Harmonic Mean : 21.8978102189781
Square Root of 49 is : 7.0
Cube Root of 64 is : 4.0
ceil value : 2
Floor value : 1
Product of array elements is 12000000
Sum of array elements is 150
Sine of array elements are [-0.54402111 0.91294525 -0.98803162 0.74511316 -0.26237485]
Cosine of array elements are [-0.83907153 0.40808206 0.15425145 -0.66693806 0.96496603]
Combinations : 36.0
Permutations : 120.0
Log Sum Exponential of given array is 50.00004540096037
Aim: To study Python libraries for ML applications such as Pandas and Matplotlib
Description :
1. Pandas library : The Pandas library is used for data manipulation and analysis. The name
"Pandas" refers to both "Panel Data" and "Python Data Analysis". It was created by
Wes McKinney in 2008.
The three data structures of pandas
1. Series (1-dimensional)
2. DataFrame (2-dimensional)
3. Panel (3-dimensional; removed from recent pandas releases)
character color
‘b’ blue
‘g’ green
‘r’ red
‘c’ cyan
‘m’ magenta
‘y’ yellow
‘k’ black
‘w’ white
import pandas as pd
import matplotlib.pyplot as plt
# Create a DataFrame
data = {
'A': [1, 2, 3, 4, 5,6,7],
'B': [7, 8, 6, 11, 7,10,2],
'C': [12, 10, 6, 7,6,5,10],
'D': [4, 4, 9, 9,3,6,5]
}
df = pd.DataFrame(data)
print("A indicates Day")
print("B indicates No of Study Hours")
print("C indicates No of Playing Hours")
print("D indicates No of Sleeping Hours")
print("DataFrame:")
print(df)
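# Plot using Matplotlib
plt.figure(figsize=(8, 6))
# Line plot of each activity against the day (plot type and labels are assumed here)
plt.plot(df['A'], df['B'], 'b', label='Study Hours')
plt.plot(df['A'], df['C'], 'g', label='Playing Hours')
plt.plot(df['A'], df['D'], 'r', label='Sleeping Hours')
plt.xlabel('Day')
plt.ylabel('Hours')
# Adding a legend
plt.legend()
plt.show()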
Output :
A indicates Day
B indicates No of Study Hours
C indicates No of Playing Hours
D indicates No of Sleeping Hours
DataFrame:
A B C D
0 1 7 12 4
1 2 8 10 4
2 3 6 6 9
3 4 11 7 9
4 5 7 6 3
5 6 10 5 6
6 7 2 10 5
Description :
Linear Regression : Linear regression is a type of supervised machine learning algorithm that
computes the linear relationship between the dependent variable and one or more
independent features by fitting a linear equation to observed data.
It predicts the continuous output variables based on the independent input variable.
Simple Linear Regression : If there is only one independent feature, then it is known as
Simple Linear Regression
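A minimal sketch of how the intercept b0 and slope b1 of a simple linear regression can be estimated, assuming the usual least-squares formulas b1 = Sxy / Sxx and b0 = mean(y) - b1 * mean(x). The function name and the small data set are hypothetical and are not the data used in the lab program.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations of x and y about their means
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sum of squared deviations of x about its mean
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]
b0, b1 = fit_simple_linear_regression(x, y)
print("Estimated coefficients are :")
print(f"b0 = {b0}")
print(f"b1 = {b1}")
```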
Program :
n1 = np.size(x)
meanx = np.mean(x)
meany = np.mean(y)
plt.xlabel('x')
plt.ylabel('y')
plt.title("Simple Linear Regression", fontsize=30,
color="magenta")
plt.legend()
plt.show()
Output :
Estimated coefficients are :
b0 = 3.799999999999999
b1=2.0545454545454547
Aim : To implement Multiple Linear Regression for house price prediction using sklearn
Description :
Linear Regression : Linear regression is a supervised machine learning algorithm that
computes the linear relationship between the dependent variable and one or more
independent features by fitting a linear equation to observed data.
It predicts the continuous output variables based on the independent input variable.
Multiple Linear Regression : If there is more than one independent feature in a linear
regression, then it is known as Multiple Linear Regression.
The equation for multiple linear regression is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n$$
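To make the equation concrete, the sketch below evaluates it for one observation: the prediction is the intercept plus the sum of each coefficient times its feature value. The function name, coefficients and feature values are hypothetical and chosen only for easy arithmetic.

```python
def predict(intercept, coefficients, features):
    """Evaluate y = b0 + b1*x1 + ... + bn*xn for one observation."""
    if len(coefficients) != len(features):
        raise ValueError("need one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical model with three features
intercept = 50.0
coefficients = [2.0, 10.0, 5.0]
features = [100.0, 3.0, 2.0]

# 50 + 2*100 + 10*3 + 5*2
price = predict(intercept, coefficients, features)
print(price)
```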
sklearn : Scikit-learn, also known as sklearn, is a machine learning and data modeling library for
Python.
pip install scikit-learn
Machine Learning Lab | VIGNAN INSTITUTE OF TECHNOLOGY AND SCIENCE
Program :
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset from a CSV file (replace with your CSV file path)
file_path = r"C:\Users\DELL\Downloads\DATASET\house.csv"
data = pd.read_csv(file_path)
# Show the first five rows
print(data.head())
# Define the independent variables (features) and the dependent variable (target)
X = data[['area', 'bedrooms', 'bathrooms']]
y= data['price']
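# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model on the training set
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on the test set and evaluate
y_pred = model.predict(X_test)
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred)}")
print(f"R-squared: {r2_score(y_test, y_pred)}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")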
Output :
[5 rows x 13 columns]
Mean Squared Error: 2750040479309.052
R-squared: 0.45592991188724463
Coefficients: [3.45466570e+02 3.60197650e+05 1.42231966e+06]
Intercept: 59485.379208717495
Aim : Write a Python program to implement a Decision Tree using sklearn and tune its
parameters
Description :
Decision Tree : A decision tree is a flowchart-like structure used to make decisions or
predictions. It consists of nodes representing decisions or tests on attributes, branches
representing the outcome of these decisions, and leaf nodes representing final outcomes or
predictions.
Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is split on
an attribute.
$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_{f_i}|}{|S|}\,\text{Entropy}(S_{f_i})$$
where $S_{f_i}$ is the subset of S for which attribute A has value i, and the entropy of
partitioning the data is calculated by weighing the entropy of each partition by its size
relative to the original set.
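The information gain formula can be sketched in plain Python as below. The sketch assumes the usual definition of entropy, Entropy(S) = -sum(p * log2(p)) over the class proportions p, which this section does not spell out. The function names and the toy attribute and label lists are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(attribute_values, labels):
    """Entropy(S) minus the size-weighted entropy of the subsets split by the attribute."""
    total = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: 'outlook' separates the classes perfectly, 'windy' not at all
play = ["no", "no", "yes", "yes"]
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["true", "false", "true", "false"]

gain_outlook = information_gain(outlook, play)
gain_windy = information_gain(windy, play)
print(gain_outlook, gain_windy)
```

An attribute selection measure would pick `outlook` here, since it has the larger gain.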
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
further; such a final node is called a leaf node.
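Since the aim of this experiment includes parameter tuning, one possible way to tune a decision tree is a cross-validated grid search, sketched below. The use of the Iris dataset, the parameter grid and the choice of GridSearchCV are assumptions made for illustration; the lab program may tune the tree differently.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumed dataset for illustration
X, y = load_iris(return_X_y=True)

# Hypothetical grid of tree parameters to try
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4, None],
    "min_samples_split": [2, 5, 10],
}

# 5-fold cross-validation over every combination in the grid
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```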
Program :
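from sklearn.metrics import accuracy_score
# Calculate accuracy (y_test holds the true labels, y_pred the tree's predictions on the test set)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
Output :
[5 rows x 5 columns]
Accuracy: 100.00%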
EXPERIMENT 7 Date :
---------------------------------------------------------------------------------------------------------------
EXPERIMENT 7 : Write a Python program to implement KNN using sklearn
Description :
KNN (K Nearest Neighbor) : KNN is one of the most essential classification algorithms in
machine learning. It belongs to the supervised learning domain. The principle behind K-nearest
neighbor methods is to find a predefined number (K) of training samples closest in distance to
the new point and to predict its label from them.
Euclidean Distance : The Cartesian distance between two points x and y in the
plane/hyperplane:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
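The idea can be sketched without any library: compute the Euclidean distance from the new point to every training sample, keep the K closest, and take a majority vote of their labels. The function names, the 2-D points and the labels below are hypothetical; the lab program itself uses sklearn.

```python
import math
from collections import Counter


def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D training data with two well-separated classes
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 8.0)]
labels = ["A", "A", "A", "B", "B", "B"]

distance_3_4 = euclidean_distance((0, 0), (3, 4))
pred_near_a = knn_predict(points, labels, (1.2, 1.4))
pred_near_b = knn_predict(points, labels, (6.4, 6.9))
print(distance_3_4, pred_near_a, pred_near_b)
```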
Program :
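# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load the Iris dataset (features X, class labels y)
X, y = load_iris(return_X_y=True)
# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)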
Output :
Accuracy: 100.00%
Classification Report:
precision recall f1-score support
setosa 1.00 1.00 1.00 10
versicolor 1.00 1.00 1.00 9
virginica 1.00 1.00 1.00 11
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
EXPERIMENT 8 Date:
---------------------------------------------------------------------------------------------------------------
EXPERIMENT 8 : Write a Python program to implement Logistic Regression using sklearn
Aim : To implement Logistic Regression using sklearn
Description :
Logistic Regression : Logistic regression is a supervised machine learning algorithm used for
classification tasks to predict the probability that an instance belongs to a given class or not.
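Logistic regression is used for binary classification, where it applies the sigmoid function to
a linear combination of the independent variables and produces a probability value between 0 and 1.
The linear combination has the same form as the multiple linear regression equation:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n$$
The sigmoid function maps this value to a probability:
$$f(x_i) = h(y) = \frac{1}{1 + e^{-y}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n)}}$$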
The logistic model (or logit model) is a statistical model that models the log-odds of an event as
a linear combination of one or more independent variables.
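A small sketch of the two ideas above: the sigmoid turns the linear combination into a probability, and taking the log-odds of that probability recovers the linear combination. The function names, coefficients and feature values are hypothetical.

```python
import math


def sigmoid(y):
    """Logistic function 1 / (1 + e^(-y)), arranged to avoid overflow for large |y|."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    exp_y = math.exp(y)
    return exp_y / (1.0 + exp_y)


def predict_probability(intercept, coefficients, features):
    """Probability of the positive class for one observation."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return sigmoid(linear)


def log_odds(p):
    """Logit: log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical model: linear part = -1 + 0.5*2 + 0.25*4 = 1
p = predict_probability(-1.0, [0.5, 0.25], [2.0, 4.0])
print(f"Probability: {p:.4f}")
print(f"Log-odds: {log_odds(p):.4f}")
```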
The program loads the Iris dataset, splits it into training and testing sets, trains a
Logistic Regression model, and evaluates its performance.
Program :
# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Output :
Accuracy: 100.00%
Classification Report:
precision recall f1-score support
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
Confusion Matrix:
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
EXPERIMENT 9 Date :
--------------------------------------------------------------------------------------------------------------
EXPERIMENT 9 : Write a Python program to implement K-Means Clustering
K-Means is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data points and the
centroids of their corresponding clusters.
Algorithm :
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the predefined K
clusters.
$$d_i = \min_{j} d(x_i, \mu_j)$$
where the mean of cluster j, which contains $N_j$ points, is
$$\mu_j = \frac{1}{N_j}\sum_{i=1}^{N_j} x_i$$
and the Euclidean distance between two points x and y is
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
(in this last formula the subscript i indexes the n features of a point).
Step-4: Recompute the centroid of each cluster as the mean $\mu_j$ of the points assigned to it.
Step-5: Repeat the third step, that is, reassign each data point to the new closest
centroid.
Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.
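The steps above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the fixed random seed, the iteration cap and the six 2-D points are assumptions, and the initial centroids are drawn from the data points for simplicity.

```python
import math
import random


def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 2: pick K initial centroids
    assignments = None
    for _ in range(max_iterations):
        # Steps 3 and 5: assign every point to its closest centroid
        new_assignments = [
            min(range(k), key=lambda j: euclidean_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:  # Step 6: no reassignment, finish
            break
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, assignments


# Hypothetical data: two well-separated groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print("Centroids:", centroids)
print("Assignments:", assignments)
```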
Flowchart :
Program :
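# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # random_state is an assumed value
df['Cluster'] = kmeans.fit_predict(iris.data)
print("Cluster Centers:", kmeans.cluster_centers_, sep="\n")
# Plot the clusters (using only the first two features for visualization)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='sepal length (cm)', y='sepal width (cm)', hue='Cluster', palette='viridis')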
plt.title("K-Means Clustering on Iris Dataset", color="red",size=40)
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.show()
Output :
Cluster Centers:
[[5.9016129 2.7483871 4.39354839 1.43387097]
[5.006 3.428 1.462 0.246 ]
[6.85 3.07368421 5.74210526 2.07105263]]
EXPERIMENT 10 Date :
( Mini Project )
--------------------------------------------------------------------------------------------------------------
EXPERIMENT 10 : Performance analysis of Classification Algorithms on a specific dataset
(Mini Project)
Objective :
Analyze and compare the performance of multiple classification algorithms (e.g.,
Logistic Regression, Decision Tree, Random Forest, SVM, and KNN) on the
popular *Iris dataset* to predict the species of a flower based on its features.
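```python
# Load dataset
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
# Add the species name of each sample as a column
data['species'] = iris.target_names[iris.target]
# Check class distribution
print(data['species'].value_counts())
```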
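4. Data Preprocessing:
1. Split into training and test sets.
2. Scale the features.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Train-test split (test_size and random_state are assumed values)
X = data.drop(columns='species')
y = data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```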
5. Train and Evaluate Models:
**Logistic Regression**
```python
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)
print("Logistic Regression:")
print(classification_report(y_test, y_pred_lr))
```
**Decision Tree**
```python
model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
y_pred_dt = model_dt.predict(X_test)
print("Decision Tree:")
print(classification_report(y_test, y_pred_dt))
```
**Random Forest**
```python
model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)
y_pred_rf = model_rf.predict(X_test)
print("Random Forest:")
print(classification_report(y_test, y_pred_rf))
```
**K-Nearest Neighbors**
```python
model_knn = KNeighborsClassifier()
model_knn.fit(X_train, y_train)
y_pred_knn = model_knn.predict(X_test)
print("K-Nearest Neighbors:")
print(classification_report(y_test, y_pred_knn))
```
**Support Vector Machine**
```python
model_svm = SVC(probability=True)
model_svm.fit(X_train, y_train)
y_pred_svm = model_svm.predict(X_test)
print("Support Vector Machine:")
print(classification_report(y_test, y_pred_svm))
```
6. Compare Performance:
Create a table or plot comparing metrics such as accuracy, precision, recall, and
F1-score.
```python
# Accuracy scores
accuracy_scores = {
    'Logistic Regression': accuracy_score(y_test, y_pred_lr),
    'Decision Tree': accuracy_score(y_test, y_pred_dt),
    'Random Forest': accuracy_score(y_test, y_pred_rf),
    'K-Nearest Neighbors': accuracy_score(y_test, y_pred_knn),
    'Support Vector Machine': accuracy_score(y_test, y_pred_svm)
}
```
8. Deliverables:
1. **Python Notebook**: Full implementation of the project.
2. **Report**:
- Dataset summary and EDA findings.
- Algorithms and their performance (accuracy, F1-score, etc.).
- Recommendations for the best classi
Output :