
12. B Lab Manual Machine Learning SEM-7 CSE 2024


New L J Institute of Engineering and Technology <Enrolment No>

PRACTICAL - 1
Aim: (1 a) Find and analyse mean, median and mode of given data.

(1 b) Find and analyse mean, median and mode of given data in a CSV file.
Theory:

Mean, Median, and Mode are statistical measures used to describe the central tendency of a
dataset. In machine learning, these measures are used to understand the distribution of data
and identify outliers. Here, we will explore the concepts of Mean, Median, and Mode and
their implementation in Python.

 Mean

The "mean" is the average value of a dataset. It is calculated by adding up all the values in the
dataset and dividing by the number of observations. The mean is a useful measure of central
tendency because it is sensitive to outliers, meaning that extreme values can significantly
affect the value of the mean.

In Python, we can calculate the mean using the NumPy library, which provides a function
called mean().
 Median

The "median" is the middle value in a dataset. It is calculated by arranging the values in the
dataset in order and finding the value that lies in the middle. If there are an even number of
values in the dataset, the median is the average of the two middle values.

The median is a useful measure of central tendency because it is not affected by outliers,
meaning that extreme values do not significantly affect the value of the median.

In Python, we can calculate the median using the NumPy library, which provides a function
called median().
 Mode

The "mode" is the most common value in a dataset. It is calculated by finding the value that
occurs most frequently in the dataset. If there are multiple values that occur with the same
frequency, the dataset is said to be bimodal, trimodal, or multimodal.


The mode is a useful measure of central tendency because it can identify the most common
value in a dataset. However, it is not a good measure of central tendency for datasets with a
wide range of values or datasets with no repeating values.

In Python, we can calculate the mode using the SciPy library, which provides a function
called mode().
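The theory above refers to NumPy's mean()/median() and SciPy's mode(), while the code below uses the statistics module. A minimal sketch of the NumPy/SciPy approach (the sample list is arbitrary illustration data) is given here:

import numpy as np
from scipy import stats

data = [21, 89, 34, 67, 96, 34]          # arbitrary sample values

print("Mean:", np.mean(data))            # arithmetic average
print("Median:", np.median(data))        # middle value of the sorted data
mode_result = stats.mode(data)           # ModeResult(mode=..., count=...)
print("Mode:", mode_result.mode)         # most frequent value (scalar or array depending on SciPy version)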
Code:
(1 a) Find and analyse mean, median and mode of given data.
#Import necessary modules

#statistics module contains various pre-defined data handling functions

#To find the mean, the method is:

import statistics

statistics.mean([21, 89, 34, 67, 96])

Output:

61.4

#To find the median, the method is:

import statistics

statistics.median([21, 89, 34, 67, 96])

Output:

67

#To find the mode, the method is:

import statistics

statistics.mode([21, 89, 34, 67, 96])

Output:

21


(1 b) Aim: Find and analyse mean, median and mode of given data in a CSV file.
#Import necessary modules

import pandas as pd

df=pd.read_csv("auto-mpg.csv")

df

df.info()

Output:

# ## Converting Horsepower attribute to numeric and finding its missing values

df['horsepower']=pd.to_numeric(df['horsepower'],errors='coerce')

# ## Finding Null Values in HorsePower attribute

df[df.horsepower.isnull()]

df.info()

Output:


# ## Dropping Null Values From Dataset

df.dropna(subset=['horsepower'],inplace=True)

# ## No null values in Dataset

df.isnull()

df.info()

# ## Finding Mean,Median and Mode

mean=df['mpg'].mean()

print('Mean of mpg : ',mean)

median=df['mpg'].median()

print('Median of mpg : ',median)

mode=df['mpg'].mode()

print('Mode of mpg : ',mode)

Output:

# ## Details of all the attributes

df.describe()


df.mode()

Output:

# ## Fetching mean and median from describe()

df.describe().loc[['mean','50%']]

Output:

Signature with date: _______________

PRACTICAL - 2


Aim: 2(a) Understand structure of data using various visualization methods like scatter plot, histogram and boxplot.

2(b) Understand structure of data using various visualization methods like scatter plot, histogram and boxplot for given CSV.
Theory:

 Data visualization is a crucial aspect of machine learning that enables analysts to
understand and make sense of data patterns, relationships, and trends. Through data
visualization, insights and patterns in data can be easily interpreted and communicated to
a wider audience, making it a critical component of machine learning.

 Data visualization helps machine learning analysts to better understand and analyze
complex data sets by presenting them in an easily understandable format. Data
visualization is an essential step in data preparation and analysis as it helps to identify
outliers, trends, and patterns in the data that may be missed by other forms of analysis.

 With the increasing availability of big data, it has become more important than ever to use
data visualization techniques to explore and understand the data. Machine learning
algorithms work best when they have high-quality and clean data, and data visualization
can help to identify and remove any inconsistencies or anomalies in the data.

 Scatter Plot

Scatter plot is one of the most important data visualization techniques and it is considered
one of the Seven Basic Tools of Quality. A scatter plot is used to plot the relationship
between two variables on a two-dimensional graph known mathematically as the Cartesian plane.

It is generally used to plot the relationship between one independent variable and one
dependent variable, where an independent variable is plotted on the x-axis and a dependent
variable is plotted on the y-axis so that you can visualize the effect of the independent
variable on the dependent variable. These plots are known as Scatter Plot Graph or Scatter
Diagram.

Applications of Scatter Plot


As already mentioned, a scatter plot is a very useful data visualization technique. A few
applications of Scatter Plots are listed below.


 Correlation Analysis: Scatter plot is useful in the investigation of the correlation
between two different variables. It can be used to find out whether two variables have a
positive correlation, negative correlation or no correlation.
 Outlier Detection: Outliers are data points, which are different from the rest of the data
set. A Scatter Plot is used to bring out these outliers on the surface.
 Cluster Identification: In some cases, scatter plots can help identify clusters or groups
within the data.

Scatter Plot Graph


Scatter Plot is known by several other names, a few of them are scatter chart, scattergram,
scatter diagram, and XY graph. A scatter plot is used to visualize a data pair, such that each
element gets its own axis; generally the independent one gets the x-axis and the dependent one
gets the y-axis.

This layout makes it easier to visualize the kind of relationship that the plotted
pair of variables holds. So a scatter plot is useful in situations when we have to find out the
relationship between two sets of data, or in cases when we suspect that there may be some
relationship between two variables and this relationship may be the root cause of some
problem.

How to Construct a Scatter Plot?


To construct a scatter plot, we have to follow the given steps.
Step 1: Identify the independent and dependent variables
Step 2: Plot the independent variable on x-axis
Step 3: Plot the dependent variable on y-axis
Step 4: Extract the meaningful relationship between the given variables.
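A minimal matplotlib sketch of these steps is shown below; the variable names and sample values are purely illustrative (the graded program for this practical appears in 2a (1) later).

import matplotlib.pyplot as plt

# Step 1: hours_studied is the independent variable, exam_score the dependent one (illustrative data)
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_score = [35, 42, 50, 55, 61, 70, 74, 82]

# Step 2 and Step 3: independent variable on the x-axis, dependent variable on the y-axis
plt.scatter(hours_studied, exam_score)
plt.xlabel("Hours studied (independent variable)")
plt.ylabel("Exam score (dependent variable)")
plt.title("Scatter plot of exam score vs hours studied")
plt.show()

# Step 4: inspect the plot; here the relationship is roughly linear and positive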


 Histogram
Histograms help in visualizing and comprehending the data distribution. This section gives a
brief overview of histograms and their interpretation.

Histograms are graphical representations of data distributions. They consist of bars, each
representing the frequency or count of observations falling within specific intervals, known
as bins. We can also say a histogram is a variation of a bar chart in which data values are
grouped together and put into different classes. This grouping enables you to see how
frequently data in each class occur in the dataset.

The histogram graphically shows the following:


 Frequency of different data points in the dataset.
 Location of the center of data.
 The spread of dataset.
 Skewness/variance of dataset.
 Presence of outliers in the dataset.

The features provide a strong indication of the proper distributional model in the data. The
probability plot or a goodness-of-fit test can be used to verify the distributional model.

The histogram contains the following axes:


 Vertical Axis: Frequency/count of each bin.
 Horizontal Axis: List of bins/categories.


 Box Plot

Box Plot is a graphical method to visualize data distribution for gaining insights and making
informed decisions. Box plot is a type of chart that depicts a group of numerical data through
their quartiles.

The idea of the box plot was presented by John Tukey in 1970. He wrote about it in his book
“Exploratory Data Analysis” in 1977. Box plot is also known as a whisker plot, box-and-whisker
plot, or simply a box-and-whisker diagram. Box plot is a graphical representation of
the distribution of a dataset. It displays key summary statistics such as
the median, quartiles, and potential outliers in a concise and visual manner.
By using a box plot you can provide a summary of the distribution, identify potential outliers,
and compare different datasets in a compact and visual manner.

Elements of Box Plot


A box plot gives a five-number summary of a set of data which is-
 Minimum – It is the minimum value in the dataset excluding the outliers.
 First Quartile (Q1) – 25% of the data lies below the First (lower) Quartile.
 Median (Q2) – It is the mid-point of the dataset. Half of the values lie below it and half
above.
 Third Quartile (Q3) – 75% of the data lies below the Third (Upper) Quartile.
 Maximum – It is the maximum value in the dataset excluding the outliers.


The area inside the box (50% of the data) is known as the Inter Quartile Range. The IQR is
calculated as –
IQR = Q3-Q1

Outliers are the data points below and above the lower and upper limits. The lower and
upper limits are calculated as –

Lower Limit = Q1 - 1.5*IQR


Upper Limit = Q3 + 1.5*IQR

The values below and above these limits are considered outliers and the minimum and
maximum values are calculated from the points which lie under the lower and upper limit.
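A small sketch of how these limits can be computed with pandas is given below. It assumes the same auto-mpg.csv file and 'mpg' column used in Practical 2(b); any numeric column would work the same way.

import pandas as pd

df = pd.read_csv("auto-mpg.csv")

q1 = df['mpg'].quantile(0.25)            # First Quartile (Q1)
q3 = df['mpg'].quantile(0.75)            # Third Quartile (Q3)
iqr = q3 - q1                            # Inter Quartile Range

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = df[(df['mpg'] < lower_limit) | (df['mpg'] > upper_limit)]
print("IQR:", iqr)
print("Lower limit:", lower_limit, "Upper limit:", upper_limit)
print("Number of outliers in mpg:", len(outliers))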

2a (1) Python program to draw a scatter plot


#The x array represents the age of each car.

#The y array represents the speed of each car.

import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()


Output:

#2a (2) Python program to draw a histogram plot


#We use NumPy to randomly generate an array with 250 values, where the values will concentrate around 170, and the standard deviation is 10.

import numpy as np

x = np.random.normal(170, 10, 250)

print(x)
#Output: values of x is printed

import matplotlib.pyplot as plt


import numpy as np

x = np.random.normal(170, 10, 250)

plt.hist(x)
plt.show()


Output:

#2a (3) Python program to draw a box plot


import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box(grid=True)

Output:


2(b) Aim: Understand structure of data using various visualization methods like scatter plot, histogram and boxplot for the given CSV file.
import pandas as pd

df=pd.read_csv("auto-mpg.csv")

df['horsepower']=pd.to_numeric(df['horsepower'],errors='coerce')

df.dropna(subset=['horsepower'],inplace=True)

import matplotlib.pyplot as p

import seaborn as sn

%matplotlib inline

# ## Boxplot of mpg column using matplotlib

p.boxplot(df['mpg'],patch_artist=True,notch=True)

Output:

# ## Boxplot of mpg column using seaborn

sn.boxplot(df['mpg'],color='yellow')


Output:

sn.boxplot(x='origin',y='horsepower',data=df)

Output:


p.hist(df['mpg'],color='green')

Output:

sn.histplot(df['weight'],bins=20)

Output:

p.scatter(x=df.displacement,y=df.mpg)

sn.pairplot(df)


Output:

Signature with date: _______________


PRACTICAL - 3
Aim: Perform Simple Linear Regression on Salary_data.
Theory:

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

Linear regression algorithm shows a linear relationship between a dependent (y) and one or
more independent (x) variables, hence it is called linear regression. Since linear regression
shows a linear relationship, it describes how the value of the dependent variable
changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the image below:

Mathematically, we can represent a linear regression as:


y = a0 + a1x + ε
Here,
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)


a1 = Linear regression coefficient (scale factor to each input value).


ε = random error

The values for x and y variables are training datasets for Linear Regression model
representation.
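In scikit-learn, the fitted model exposes a0 as intercept_ and a1 as coef_. A minimal, self-contained sketch (the small x/y arrays are illustrative only; the practical below uses Salary_Data.csv instead):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])                 # years of experience (illustrative)
y = np.array([30000, 35000, 41000, 45000, 52000])       # salary (illustrative)

model = LinearRegression().fit(x, y)
a0 = model.intercept_     # intercept of the line
a1 = model.coef_[0]       # linear regression coefficient (slope)
print("Estimated line: y =", a0, "+", a1, "* x")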

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:

o Simple Linear Regression:


If a single independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Multiple Linear
Regression.

Code:
#Import necessary libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

data_set= pd.read_csv('Salary_Data.csv')

#take dependent and independent Variables


x= data_set.iloc[:, :-1].values
y= data_set.iloc[:, 1].values

# Splitting the dataset into training and test set.


from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 1/3, random_state=0)

#Fitting the Simple Linear Regression model to the training dataset


from sklearn.linear_model import LinearRegression
regressor= LinearRegression()
regressor.fit(x_train, y_train)

#Prediction of Test and Training set result


y_pred= regressor.predict(x_test)
x_pred= regressor.predict(x_train)

mtp.scatter(x_train, y_train, color="green")


mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Training Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()

#visualizing the Test set results


mtp.scatter(x_test, y_test, color="blue")
mtp.plot(x_train, x_pred, color="red")
mtp.title("Salary vs Experience (Test Dataset)")
mtp.xlabel("Years of Experience")
mtp.ylabel("Salary(In Rupees)")
mtp.show()


Output:

Signature with date: _______________


PRACTICAL - 4
Aim: Implement Logistic Regression on Iris dataset and evaluate
its performance.
Theory:
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
true or False, etc. but instead of giving the exact value as 0 and 1, it gives the
probabilistic values which lie between 0 and 1.
o Logistic Regression is quite similar to Linear Regression except in how they are
used. Linear Regression is used for solving regression problems, whereas logistic
regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The image below shows the logistic function:


Logistic Function (Sigmoid Function):


o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must be between 0 and 1 and cannot go beyond
this limit, so it forms a curve like the "S" form. The S-shaped curve is called the
sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides between
the classes 0 and 1: values above the threshold tend to 1, and values
below the threshold tend to 0.
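
A minimal sketch of the sigmoid function and the 0.5 threshold rule described above (the input values are arbitrary):

import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])    # arbitrary real-valued inputs
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)          # threshold of 0.5: above -> 1, below -> 0
print(probs)
print(labels)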

Code:
import pandas as pd # used to read the data set

import numpy as np # used to do some operations with the arrays

df = pd.read_csv("iris.csv")

df.head(5)

df = df.drop(columns = ['Id']) #drop unnecessary columns

df.head(5)

#transform string labels to integer


from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['Species'] = le.fit_transform(df['Species'])

df.head(100)

#drop target variable from X and Keep only target variable in Y

X = df.drop(columns = ['Species'])

Y = df['Species']

#convert dataset into 4 parts X_train, X_test, Y_train, Y_test

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.30)

#Fit Logistic regression on Training data

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, Y_train)

#predict labels of X_test

y_pred = model.predict(X_test)

#Creating the Confusion matrix

from sklearn.metrics import confusion_matrix

cm= confusion_matrix(Y_test,y_pred)

cm

#Print Classification report


from sklearn.metrics import classification_report

print(classification_report(Y_test, y_pred))

#print only Accuracy

print("Accuracy: ", model.score(X_test, Y_test) * 100)

Output:

Signature with date: _______________


PRACTICAL - 5
Aim: Implement Decision Tree on Iris dataset and evaluate its
performance.
Theory:
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:


Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.

Steps for Algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where you cannot further
classify the nodes; the final nodes are then called leaf nodes.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:


Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best attribute
for the root node and for sub-nodes. To solve such problems there is a technique which is
called the Attribute Selection Measure or ASM. By this measurement, we can easily select
the best attribute for the nodes of the tree. There are two popular techniques for ASM, which
are:

o Information Gain
o Gini Index
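
As a rough sketch of these two measures, the helper functions below compute entropy (the quantity behind Information Gain) and the Gini index for a list of class labels; the function names and sample labels are illustrative only.

import numpy as np

def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the class proportions p
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini index = 1 - sum(p^2) over the class proportions p
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

sample = ['yes', 'yes', 'no', 'yes', 'no']   # illustrative class labels at a node
print("Entropy:", entropy(sample))
print("Gini index:", gini(sample))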

Advantages of the Decision Tree

o It is simple to understand as it follows the same process which a human follows while
making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree

o The decision tree contains lots of layers, which makes it complex.


o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

Code:
#Import necessary libraries

import pandas as pd

import numpy as np

from sklearn.datasets import load_iris

from sklearn.metrics import accuracy_score

# Loading the built-in Iris dataset

data = load_iris()

# Extracting Attributes / Features

X = data.data


# Extracting Target / Class Labels

y = data.target

# Import Library for splitting data

from sklearn.model_selection import train_test_split

# Creating Train and Test datasets

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 50, test_size = 0.25)

# Creating Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

clf.fit(X_train,y_train)

# Predict Accuracy Score

y_pred = clf.predict(X_test)

print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred=clf.predict(X_train)))

print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred=y_pred))

#Creating the Confusion matrix

from sklearn.metrics import confusion_matrix

cm= confusion_matrix(y_test,y_pred)

#Print confusion matrix

cm

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))


Output:

Signature with date: _______________


PRACTICAL - 6
Aim: Implement K-NN on Iris dataset and evaluate its
performance.
Theory:

 K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on


Supervised Learning technique.
 K-NN algorithm assumes the similarity between the new case/data and available cases
and put the new case into the category that is most similar to the available categories.
 K-NN algorithm stores all the available data and classifies a new data point based on
the similarity. This means when new data appears, it can be easily classified into
a well-suited category by using the K-NN algorithm.
 K-NN algorithm can be used for Regression as well as for Classification but mostly it
is used for the Classification problems.
 K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
 It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs
an action on the dataset.
 KNN algorithm at the training phase just stores the dataset, and when it gets new data,
it classifies that data into the category that is most similar to the new data.
Steps for kNN:

1. Load the data


2. Initialise the value of k
3. For getting the predicted class, iterate from 1 to total number of training data points
1. Calculate the distance between the test data and each row of the training dataset. Here
we will use Euclidean distance as our distance metric since it is the most
popular method. Other distance functions or metrics that can be used are
Manhattan distance, Minkowski distance, Chebyshev distance, cosine distance, etc. If there are
categorical variables, Hamming distance can be used.
4. Sort the calculated distances in ascending order based on distance values
5. Get top k rows from the sorted array
6. Get the most frequent class of these rows
7. Return the predicted class
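
A compact sketch of these steps using Euclidean distance is given below; the data points are illustrative, and the graded code for this practical uses scikit-learn's KNeighborsClassifier instead.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, test_point, k=3):
    # Step 3: Euclidean distance between the test point and every training row
    distances = np.sqrt(np.sum((X_train - test_point) ** 2, axis=1))
    # Steps 4-5: sort by distance and take the k nearest rows
    nearest = np.argsort(distances)[:k]
    # Steps 6-7: return the most frequent class among those rows
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Illustrative 2-D training data with two classes
X_train = np.array([[1, 1], [1, 2], [2, 2], [6, 6], [7, 7], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 1]), k=3))    # expected output: 0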

How to select the value of K in the K-NN Algorithm?


 There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 3 or 5.


 The K value is usually taken as an odd number to help avoid ties.
 A very low value for K, such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
 Large values for K are good, but they may lead to some difficulties.
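
One simple way to "try some values" of K, as suggested above, is to loop over candidate odd values and compare the test accuracy; a minimal sketch on the Iris data used in this practical:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=50, test_size=0.25)

# Try a few odd values of K and report the test accuracy for each
for k in [1, 3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print("K =", k, "test accuracy =", knn.score(X_test, y_test))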

Advantages of KNN Algorithm:

 It is simple to implement.
 It is robust to the noisy training data
 It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

 Always needs to determine the value of K, which may be complex at times.
 The computation cost is high because of calculating the distance between the data
points for all the training samples.

Code:
import pandas as pd

import numpy as np

from sklearn.datasets import load_iris

from sklearn.metrics import accuracy_score

# Loading the built-in Iris dataset

data = load_iris()

# Extracting Attributes / Features

X = data.data

# Extracting Target / Class Labels

y = data.target

# Import Library for splitting data

from sklearn.model_selection import train_test_split


# Creating Train and Test datasets

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 50, test_size = 0.25)

# Creating KNN Classifier

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train,y_train)

# Predict Accuracy Score

y_pred = knn.predict(X_test)

print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred=knn.predict(X_train)))

print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred=y_pred))

#Creating the Confusion matrix

from sklearn.metrics import confusion_matrix

cm= confusion_matrix(y_test,y_pred)

cm

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

Output:

Signature with date: _______________


PRACTICAL - 7
Aim: Implement SVM on Iris dataset and evaluate its
performance.
Theory:

An SVM model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as possible. In
addition to performing linear classification, SVMs can efficiently perform a non-linear
classification, implicitly mapping their inputs into high-dimensional feature spaces.

What support vector machines do is not only draw a line between the two classes, but also
consider a margin of some given width around that line.


Code:
from sklearn import svm, datasets

import matplotlib.pyplot as plt

import numpy as np

from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split

#Add datasets, insert the desired number of features and train the model

iris = datasets.load_iris()

X = iris.data[:, :2] # we only take the first two features

y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.25)

clf = svm.SVC(kernel="linear", C=1).fit(X_train, y_train)

# Predicting the output and printing the accuracy of the model

classifier_predictions = clf.predict(X_test)

print(accuracy_score(y_test, classifier_predictions)*100)


#Creating the Confusion matrix

from sklearn.metrics import confusion_matrix

cm= confusion_matrix(y_test,classifier_predictions)

cm

Output:

Signature with date: _______________


PRACTICAL - 8
Aim: Perform K-means clustering on Iris dataset and evaluate its
performance.
Theory:
 K-Means Clustering is an unsupervised learning algorithm that is used to solve
clustering problems in machine learning or data science.
 It is an iterative algorithm that divides the unlabelled dataset into k different clusters
in such a way that each data point belongs to only one group of points with similar properties.
 It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data points
and their corresponding clusters.

Steps for K-Means Algorithm:


The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
( K can be chosen by either intuitively, Silhouette method or Elbow method)
Step-2: Select random K points or centroids. (It can be other from the input dataset).
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.
Step-4: Calculate a new centroid of each cluster (by calculating mean).
Step-5: Repeat the third step, which means reassign each data point to the new closest
centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.
Step-7: The model is ready.
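
Step-1 above mentions the Elbow method; a minimal sketch of it (assuming the same iris.csv file used in the code below) plots the within-cluster sum of squares (inertia) against K and looks for the "elbow" point:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

dataset = pd.read_csv('iris.csv')
x = dataset.drop(columns=['Id', 'Species']).values

inertia = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(x)
    inertia.append(km.inertia_)      # within-cluster sum of squared distances

plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow method for choosing K')
plt.show()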

Advantages:
 Relatively simple to implement
 Scales to large data sets


 Guaranteed to converge.
 Easily adapts to new examples.
Disadvantage:
 Need to choose K manually.
 Can run into problems when clustering varying sizes and density.
 Sensitive to outliers.
 Doesn’t scale well with a large number of dimensions.
 Only works for numeric values.

Code:
#importing the libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

#importing the Iris dataset with pandas

dataset = pd.read_csv('iris.csv')

x = dataset.drop(columns = ['Id','Species']).values #drop unnecessary columns

#Applying kmeans to the dataset / Creating the kmeans classifier

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 3, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)

y_kmeans = kmeans.fit_predict(x)

#Visualising the clusters

plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Iris-setosa')

plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Iris-versicolour')

plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Iris-virginica')

#Plotting the centroids of the clusters


plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 100, c = 'yellow', label = 'Centroids')

plt.legend()

Output:

Signature with date: _______________

PRACTICAL - 9
Aim: Write a program to implement Naïve Bayes Classifier.


Theory:
Naive Bayes is a statistical classification technique based on Bayes' Theorem. It is one of the
simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate and
reliable algorithm. Naive Bayes classifiers have high accuracy and speed on large datasets.
Naive Bayes classifier assumes that the effect of a particular feature in a class is independent
of other features. For example, a loan applicant is desirable or not depending on his/her
income, previous loan and transaction history, age, and location. Even if these features are
interdependent, these features are still considered independently. This assumption simplifies
computation, and that's why it is considered as naive. This assumption is called class
conditional independence.

 P(h): the probability of hypothesis h being true (regardless of the data). This is known
as the prior probability of h.

 P(D): the probability of the data (regardless of the hypothesis). This is known as the
prior probability.

 P(h|D): the probability of hypothesis h given the data D. This is known as posterior
probability.

 P(D|h): the probability of data D given that the hypothesis h was true. This is known
as the likelihood.
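
Combining the quantities defined above, Bayes' theorem, on which the classifier is based, can be written as:

P(h|D) = P(D|h) * P(h) / P(D)

The class-conditional independence assumption lets P(D|h) be computed as the product of the individual feature probabilities.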

Code:
Generating the Dataset
from sklearn.datasets import make_classification
X, y = make_classification(
n_features=6,
n_classes=3,
n_samples=800,
n_informative=2,
random_state=1,
n_clusters_per_class=1,
)


Train-Test Split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=125
)

Build The Model


from sklearn.naive_bayes import GaussianNB
# Build a Gaussian Classifier
model = GaussianNB()
# Model training
model.fit(X_train, y_train)
# Predict Output
predicted = model.predict([X_test[6]])
#print("Actual Value:", y_test[6])
#print("Predicted Value:", predicted[0])

Calculate the Accuracy:


from sklearn.metrics import (
accuracy_score,
confusion_matrix,
ConfusionMatrixDisplay,
f1_score,
)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_pred, y_test)
f1 = f1_score(y_pred, y_test, average="weighted")

print("Accuracy:", accuracy)
print("F1 Score:", f1)

Output:


Signature with date: _______________

PRACTICAL - 10
Aim: Write a program to implement ANN.
Theory:


Artificial Neural Networks are modeled after the neurons in the human brain.

 Artificial Neural Networks

Artificial Neural Networks contain artificial neurons which are called units. These units are
arranged in a series of layers that together constitute the whole Artificial Neural Network in
a system.

A layer can have only a dozen units or millions of units, as this depends on how
complex the neural network needs to be to learn the hidden patterns in the dataset.
Commonly, an Artificial Neural Network has an input layer, an output layer as well as hidden
layers.

The input layer receives data from the outside world which the neural network needs to
analyze or learn about. Then this data passes through one or multiple hidden layers that
transform the input into data that is valuable for the output layer.

Finally, the output layer provides an output in the form of a response of the Artificial
Neural Networks to input data provided.

In the majority of neural networks, units are interconnected from one layer to another. Each
of these connections has weights that determine the influence of one unit on another unit.
As the data transfers from one unit to another, the neural network learns more and more
about the data which eventually results in an output from the output layer.


Neural Networks Architecture


The structures and operations of human neurons serve as the basis for artificial neural
networks. It is also known as neural networks or neural nets. The input layer of an artificial
neural network is the first layer, and it receives input from external sources and releases it
to the hidden layer, which is the second layer.
In the hidden layer, each neuron receives input from the previous layer neurons, computes
the weighted sum, and sends it to the neurons in the next layer.
These connections are weighted, meaning the effect of each input from the previous layer is
scaled more or less by the weight assigned to it, and these weights are
adjusted during the training process to improve model
performance.
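A tiny sketch of this weighted-sum-plus-activation computation for a single layer is given below; the weights, bias and input values are arbitrary illustration, not a trained network.

import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.array([0.5, -1.2, 3.0])             # inputs coming from the previous layer (arbitrary)
W = np.array([[0.2, -0.4, 0.1],            # one row of weights per neuron in this layer
              [0.7,  0.3, -0.5]])
b = np.array([0.1, -0.2])                  # one bias per neuron

z = W @ x + b                              # weighted sum computed by each neuron
a = relu(z)                                # activation sent on to the next layer
print(z, a)
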
Artificial neurons vs Biological neurons
The concept of artificial neural networks comes from biological neurons found in animal
brains So they share a lot of similarities in structure and function wise.
 Structure: The structure of artificial neural networks is inspired by biological neurons.
A biological neuron has a cell body or soma to process the impulses, dendrites to
receive them, and an axon that transfers them to other neurons. The input nodes of
artificial neural networks receive input signals, the hidden layer nodes compute these
input signals, and the output layer nodes compute the final output by processing the
hidden layer’s results using activation functions.

Biological Neuron        Artificial Neuron
Dendrite                 Inputs
Cell nucleus or Soma     Nodes
Synapses                 Weights
Axon                     Output

 Synapses: Synapses are the links between biological neurons that enable the
transmission of impulses from dendrites to the cell body. Synapses are the weights that
join the one-layer nodes to the next-layer nodes in artificial neurons. The strength of the
links is determined by the weight value.
 Learning: In biological neurons, learning happens in the cell body nucleus or soma,
which has a nucleus that helps to process the impulses. An action potential is produced
and travels through the axons if the impulses are powerful enough to reach the
threshold. This becomes possible by synaptic plasticity, which represents the ability of
synapses to become stronger or weaker over time in reaction to changes in their activity.
In artificial neural networks, backpropagation is a technique used for learning,
which adjusts the weights between nodes according to the error or differences between
predicted and actual outcomes.
 Activation: In biological neurons, activation is the firing rate of the neuron, which
happens when the impulses are strong enough to reach the threshold. In artificial neural
networks, a mathematical function known as an activation function maps the input to
the output and executes activations.

How do Artificial Neural Networks learn?

Artificial neural networks are trained using a training set. For example, suppose you want
to teach an ANN to recognize a cat. Then it is shown thousands of different images of cats
so that the network can learn to identify a cat. Once the neural network has been trained
enough using images of cats, then you need to check if it can identify cat images correctly.
This is done by making the ANN classify the images it is provided by deciding whether
they are cat images or not. The output obtained by the ANN is corroborated by a human-
provided description of whether the image is a cat image or not. If the ANN identifies
incorrectly then back-propagation is used to adjust whatever it has learned during
training. Backpropagation is done by fine-tuning the weights of the connections in ANN


units based on the error rate obtained. This process continues until the artificial neural
network can correctly recognize a cat in an image with minimal possible error rates.

Code:

Loading the data

# Reading the cleaned numeric titanic survival data


import pandas as pd
import numpy as np

from keras.models import Sequential

from keras.layers import Activation, Dense

# To remove the scientific notation from numpy arrays


np.set_printoptions(suppress=True)

TitanicSurvivalDataNumeric=pd.read_pickle('TitanicSurvivalDataNumeric.pkl')
TitanicSurvivalDataNumeric.head()

Splitting the Data into Training and Testing

# Separate Target Variable and Predictor Variables


TargetVariable=['Survived']
Predictors=['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
'Embarked_C', 'Embarked_Q', 'Embarked_S']

X=TitanicSurvivalDataNumeric[Predictors].values
y=TitanicSurvivalDataNumeric[TargetVariable].values

### Sandardization of data ###


### We do not standardize the Target variable for classification
from sklearn.preprocessing import StandardScaler
PredictorScaler=StandardScaler()

# Storing the fit object for later reference


PredictorScalerFit=PredictorScaler.fit(X)


# Generating the standardized values of X


X=PredictorScalerFit.transform(X)

# Split the data into training and testing set


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Quick sanity check with the shapes of Training and Testing datasets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
Output:

Take a look at some of the important hyper parameters of ANN below

units=10: This means we are creating a layer with ten neurons in it. Each of these ten
neurons will be receiving the values of the inputs; for example, the values of ‘Age’ will be passed
to all ten neurons, and similarly all other columns.

input_dim=9: This means there are nine predictors in the input data which is expected by the
first layer. If you see the second dense layer, we don’t specify this value, because the
Sequential model passes this information further to the next layers.

kernel_initializer=’uniform’: When the Neurons start their computation, some algorithm has
to decide the value for each weight. This parameter specifies that. You can choose different
values for it like ‘normal’ or ‘glorot_uniform’.

activation=’relu’: This specifies the activation function for the calculations inside each
neuron. You can choose values like ‘relu’, ‘tanh’, ‘sigmoid’, etc.

optimizer=’adam’: This parameter helps to find the optimum values of each weight in the
neural network. ‘adam’ is one of the most useful optimizers, another one is ‘rmsprop’

batch_size=10: This specifies how many rows will be passed to the Network in one go after
which the SSE calculation will begin and the neural network will start adjusting its weights
based on the errors.


When all the rows are passed in the batches of 10 rows each as specified in this parameter,
then we call that 1-epoch. Or one full data cycle. This is also known as mini-batch gradient
descent. A small value of batch_size will make the ANN look at the data slowly, like 2 rows
at a time or 4 rows at a time which could lead to overfitting, as compared to a large value like
20 or 50 rows at a time, which will make the ANN look at the data fast which could lead to
underfitting. Hence a proper value must be chosen using hyperparameter tuning.

Epochs=10: The same activity of adjusting weights continues for 10 times, as specified by
this parameter. In simple terms, the ANN looks at the full training data 10 times and adjusts
its weights.

classifier = Sequential()
# Defining the Input layer and FIRST hidden layer,both are same!
# relu means Rectifier linear unit function
classifier.add(Dense(units=10, input_dim=9, kernel_initializer='uniform', activation='relu'))

#Defining the SECOND hidden layer, here we have not defined input because it is
# second layer and it will get input as the output of first hidden layer
classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))

# Defining the Output layer


# sigmoid means sigmoid activation function
# for Multiclass classification the activation ='softmax'
# And output_dim will be equal to the number of factor levels
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))

# Optimizer== the algorithm of SGD to keep updating weights


# loss== the loss function to measure the accuracy
# metrics== the way we will compare the accuracy after each step of SGD
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# fitting the Neural Network on the training data


survivalANN_Model=classifier.fit(X_train,y_train, batch_size=10 , epochs=10, verbose=1)

Output:


Check The Accuracy On Test Data


# Predictions on testing data
Predictions=classifier.predict(X_test)

# Scaling the test data back to original scale


Test_Data=PredictorScalerFit.inverse_transform(X_test)

# Generating a data frame for analyzing the test data


TestingData=pd.DataFrame(data=Test_Data, columns=Predictors)
TestingData['Survival']=y_test
TestingData['PredictedSurvivalProb']=Predictions

# Defining the probability threshold


def probThreshold(inpProb):
    if inpProb > 0.5:
        return(1)
    else:
        return(0)

# Generating predictions on the testing data by applying probability threshold


TestingData['PredictedSurvival']=TestingData['PredictedSurvivalProb'].apply(probThreshold)
print(TestingData.head())

###############################################
from sklearn import metrics
print('\n######### Testing Accuracy Results #########')
print(metrics.classification_report(TestingData['Survival'], TestingData['PredictedSurvival']))
print(metrics.confusion_matrix(TestingData['Survival'], TestingData['PredictedSurvival']))
Output:

Signature with date: _______________
