12. B Lab Manual Machine Learning SEM-7 CSE 2024
PRACTICAL - 1
Aim: (1 a) Find and analyse mean, median and mode of given
data.
Mean, Median, and Mode are statistical measures used to describe the central tendency of a
dataset. In machine learning, these measures are used to understand the distribution of data
and identify outliers. Here, we will explore the concepts of Mean, Median, and Mode and
their implementation in Python.
Mean
The "mean" is the average value of a dataset. It is calculated by adding up all the values in the
dataset and dividing by the number of observations. The mean is a useful measure of central
tendency because it takes every value into account. However, it is sensitive to outliers,
meaning that extreme values can significantly affect the value of the mean.
In Python, we can calculate the mean using the NumPy library, which provides a function
called mean().
Median
The "median" is the middle value in a dataset. It is calculated by arranging the values in the
dataset in order and finding the value that lies in the middle. If there are an even number of
values in the dataset, the median is the average of the two middle values.
The median is a useful measure of central tendency because it is robust to outliers,
meaning that a few extreme values do not significantly affect the value of the median.
In Python, we can calculate the median using the NumPy library, which provides a function
called median().
Mode
The "mode" is the most common value in a dataset. It is calculated by finding the value that
occurs most frequently in the dataset. If two or more values share the highest
frequency, the dataset is said to be bimodal, trimodal, or multimodal.
The mode is a useful measure of central tendency because it can identify the most common
value in a dataset. However, it is not a good measure of central tendency for datasets with a
wide range of values or datasets with no repeating values.
In Python, we can calculate the mode using the SciPy library, which provides a function
called mode() in scipy.stats. The standard-library statistics module also provides mode().
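As a quick illustration of the three measures, the following sketch uses Python's standard-library statistics module on a small made-up list. The values are hypothetical and were chosen so that one extreme value (100) is present.

```python
import statistics

# Hypothetical sample with one extreme value (100)
data = [2, 3, 3, 5, 7, 100]

mean_value = statistics.mean(data)      # sum of the values divided by their count
median_value = statistics.median(data)  # average of the two middle values (3 and 5)
mode_value = statistics.mode(data)      # most frequently occurring value

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```

Here the single extreme value pulls the mean up to 20, while the median stays at 4 and the mode at 3.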
Code:
(1 a) Find and analyse mean, median and mode of given data.
#Import necessary modules
import statistics
Output:
61.4
import statistics
Output:
67
import statistics
Output:
21
import pandas as pd
df=pd.read_csv("auto-mpg.csv")
df
df.info()
Output:
df['horsepower']=pd.to_numeric(df['horsepower'],errors='coerce')
df[df.horsepower.isnull()]
df.info()
Output:
df.dropna(subset=['horsepower'],inplace=True)
df.isnull()
df.info()
mean=df['mpg'].mean()
median=df['mpg'].median()
mode=df['mpg'].mode()
print(mean, median, mode.tolist())
Output:
df.describe()
df.mode()
Output:
df.describe().loc[['mean','50%']]
Output:
PRACTICAL - 2
Data visualization helps machine learning analysts to better understand and analyze
complex data sets by presenting them in an easily understandable format. Data
visualization is an essential step in data preparation and analysis as it helps to identify
outliers, trends, and patterns in the data that may be missed by other forms of analysis.
With the increasing availability of big data, it has become more important than ever to use
data visualization techniques to explore and understand the data. Machine learning
algorithms work best when they have high-quality and clean data, and data visualization
can help to identify and remove any inconsistencies or anomalies in the data.
Scatter Plot
A scatter plot is one of the most important data visualization techniques and is considered
one of the Seven Basic Tools of Quality. It plots the relationship between two variables
on a two-dimensional graph, known mathematically as the Cartesian plane.
It is generally used to plot the relationship between one independent variable and one
dependent variable, where an independent variable is plotted on the x-axis and a dependent
variable is plotted on the y-axis so that you can visualize the effect of the independent
variable on the dependent variable. These plots are known as Scatter Plot Graph or Scatter
Diagram.
Plotting the data this way makes it easier to see what kind of relationship the paired
values have. A scatter plot is therefore useful when we need to find the relationship
between two sets of data, or when we suspect that a relationship between two variables
may be the root cause of some problem.
Histogram
Histograms help in visualizing and comprehending the distribution of data.
Histograms are graphical representations of data distributions. They consist of bars, each
representing the frequency or count of observations falling within specific intervals, known
as bins. We can also say a histogram is a variation of a bar chart in which data values are
grouped together and put into different classes. This grouping enables you to see how
frequently data in each class occur in the dataset.
The features of a histogram, such as its centre, spread, skewness, and number of peaks,
provide a strong indication of the proper distributional model for the data. A
probability plot or a goodness-of-fit test can be used to verify the distributional model.
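The grouping of values into bins can be sketched in a few lines of plain Python. The data, the range (0 to 9) and the number of bins (3) are hypothetical choices made only for this illustration.

```python
# Hypothetical data and bin range, chosen only for illustration
data = [1, 2, 2, 3, 5, 7, 8, 9]
low, high, n_bins = 0, 9, 3
width = (high - low) / n_bins  # each bin spans 3 units

counts = [0] * n_bins
for value in data:
    # Index of the bin the value falls into; a value equal to the
    # top edge of the range is counted in the last bin
    index = min(int((value - low) / width), n_bins - 1)
    counts[index] += 1

# Print a simple text histogram: one '#' per observation
for i, count in enumerate(counts):
    left = low + i * width
    print(f"{left:.0f} to {left + width:.0f}: {'#' * count} ({count})")
```

Each bar length corresponds to the frequency of observations in that bin, which is what plt.hist() draws graphically.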
Box Plot
A box plot is a graphical method to visualize data distribution for gaining insights and making
informed decisions. It is a type of chart that depicts a group of numerical data through
their quartiles.
The idea of the box plot was presented by John Tukey in 1970. He wrote about it in his book
“Exploratory Data Analysis” in 1977. A box plot is also known as a whisker plot,
box-and-whisker plot, or simply a box-and-whisker diagram. A box plot is a graphical
representation of the distribution of a dataset. It displays key summary statistics such as
the median, quartiles, and potential outliers in a concise and visual manner.
Using a box plot, you can summarize the distribution, identify potential outliers, and
compare different datasets in a compact and visual manner.
The area inside the box (50% of the data) is known as the Inter Quartile Range. The IQR is
calculated as –
IQR = Q3-Q1
Outliers are the data points that lie below the lower limit or above the upper limit. The
lower and upper limits are calculated as:
Lower Limit = Q1 - 1.5*IQR
Upper Limit = Q3 + 1.5*IQR
The values below and above these limits are considered outliers, and the minimum and
maximum values (the whisker ends) are taken from the points which lie within the lower and
upper limits.
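The IQR and outlier limits can be computed directly. The sketch below is a minimal example on hypothetical data, assuming the commonly used convention of placing the limits 1.5 × IQR beyond the quartiles. It uses statistics.quantiles from the standard library; other tools may use slightly different quartile definitions and so give slightly different values.

```python
import statistics

# Hypothetical sample containing one unusually large value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

# Q1, Q2 (median) and Q3; the 'inclusive' method interpolates between
# the sample's own minimum and maximum
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: limits at 1.5 * IQR beyond the quartiles
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = [v for v in data if v < lower_limit or v > upper_limit]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Limits:", lower_limit, upper_limit)
print("Outliers:", outliers)
```

For this sample the IQR is 4.5, the limits are -3.5 and 14.5, and only the value 50 is flagged as an outlier.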
import matplotlib.pyplot as plt
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()
Output:
import numpy as np
print(x)
#Output: values of x is printed
plt.hist(x)
plt.show()
Output:
import pandas as pd
import matplotlib.pyplot as p
import seaborn as sn
%matplotlib inline
df=pd.read_csv("auto-mpg.csv")
df['horsepower']=pd.to_numeric(df['horsepower'],errors='coerce')
df.dropna(subset=['horsepower'],inplace=True)
p.boxplot(df['mpg'],patch_artist=True,notch=True)
Output:
sn.boxplot(df['mpg'],color='yellow')
Output:
sn.boxplot(x='origin',y='horsepower',data=df)
Output:
p.hist(df['mpg'],color='green')
Output:
sn.histplot(df['weight'],bins=20)
Output:
p.scatter(x=df.displacement,y=df.mpg)
sn.pairplot(df)
Output:
PRACTICAL - 3
Aim: Perform Simple Linear Regression on Salary_data.
Theory:
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
A linear regression algorithm models a linear relationship between a dependent variable (y)
and one or more independent variables (x), hence the name linear regression. Because the
relationship is linear, the model shows how the value of the dependent variable changes
as the value of the independent variable changes.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
The values for x and y variables are training datasets for Linear Regression model
representation.
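Linear regression can be further divided into two types: simple linear regression, which
uses a single independent variable, and multiple linear regression, which uses more than
one independent variable.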
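For a single independent variable, the sloped straight line y = slope * x + intercept can be fitted by ordinary least squares. The following is a minimal sketch on hypothetical data (not the Salary_Data.csv file), written in plain Python so that each step of the calculation is visible.

```python
# Hypothetical training data: y grows by 2 for every unit increase in x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares estimates for the line y = slope * x + intercept
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Predict y for a new, unseen value of x
prediction = slope * 6 + intercept

print("slope:", slope)
print("intercept:", intercept)
print("prediction for x = 6:", prediction)
```

Because the hypothetical points lie exactly on a line, the fit recovers a slope of 2 and an intercept of 1. With real data the points scatter around the fitted line.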
Code:
#Import necessary libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
data_set= pd.read_csv('Salary_Data.csv')
Output:
PRACTICAL - 4
Aim: Implement Logistic Regression on Iris dataset and evaluate
its performance.
Theory:
o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore
the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1,
True or False, etc., but instead of giving the exact value as 0 or 1, it gives
probabilistic values which lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how they are
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, whose output is bounded by the two extreme values 0 and 1.
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic Regression can be used to classify the observations using different types of
data and can easily determine the most effective variables used for the classification.
The below image is showing the logistic function:
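The S-shaped logistic function can also be evaluated numerically. The sketch below computes it for a few inputs and converts each probability into a class label, assuming a decision threshold of 0.5 (an assumption made for this example).

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate the curve at a few points and apply an assumed 0.5 threshold
for z in (-6, -2, 0, 2, 6):
    probability = sigmoid(z)
    label = 1 if probability >= 0.5 else 0
    print(f"z = {z:>2}  probability = {probability:.4f}  class = {label}")
```

Large negative inputs give probabilities close to 0, large positive inputs give probabilities close to 1, and an input of 0 gives exactly 0.5.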
Code:
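import pandas as pd # used to read the data set
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
df = pd.read_csv("iris.csv")
df.head(5)
# Encode the species names as integer class labels
le = LabelEncoder()
df['Species'] = le.fit_transform(df['Species'])
df.head(100)
# Separate the features from the target column
X = df.drop(columns = ['Species'])
Y = df['Species']
# Hold out part of the data for testing (test_size and random_state are assumed values)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
# Train the model and predict the test set
model = LogisticRegression()
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
# Evaluate with a confusion matrix and a per-class report
cm= confusion_matrix(Y_test,y_pred)
cm
print(classification_report(Y_test, y_pred))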
Output:
PRACTICAL - 5
Aim: Implement Decision Tree on Iris dataset and evaluate its
performance.
Theory:
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision
rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make a decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; such a final node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and Declined offer).
Consider the below diagram:
While implementing a Decision tree, the main issue is how to select the best attribute
for the root node and for the sub-nodes. To solve this problem, there is a technique
called Attribute Selection Measure or ASM. Using this measure, we can easily select
the best attribute for the nodes of the tree. There are two popular techniques for ASM, which
are:
o Information Gain
o Gini Index
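Both measures can be computed from the class labels at a node. The sketch below is a minimal illustration on hypothetical labels: entropy and the Gini index measure how mixed a node is, and information gain is the reduction in entropy achieved by a split. The function names are chosen for this example only.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Hypothetical node with two classes, and a split that separates them perfectly
parent = ["yes", "yes", "no", "no"]
split = [["yes", "yes"], ["no", "no"]]

print("Entropy of parent:", entropy(parent))
print("Gini index of parent:", gini(parent))
print("Information gain of split:", information_gain(parent, split))
```

An evenly mixed two-class node has entropy 1 and Gini index 0.5. A split that produces pure subsets has the maximum possible information gain, so the attribute producing it would be selected.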
Advantages:
o It is simple to understand as it follows the same process that a human follows while
making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
Code:
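#Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report
# Load the Iris features and class labels
data = load_iris()
X = data.data
y = data.target
# Hold out part of the data for testing (test_size and random_state are assumed values)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the decision tree and predict the test set
clf = DecisionTreeClassifier()
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
# Evaluate with a confusion matrix and a per-class report
cm= confusion_matrix(y_test,y_pred)
cm
print(classification_report(y_test, y_pred))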
Output:
PRACTICAL - 6
Aim: Implement K-NN on Iris dataset and evaluate its
performance.
Theory:
Advantages:
It is simple to implement.
It is robust to noisy training data.
It can be more effective if the training data is large.
Disadvantage:
The value of K always needs to be determined, which can sometimes be difficult.
The computation cost is high because the distance between the query point and every
training sample must be calculated.
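The role of K and of the distance calculation can be seen in a small plain-Python sketch. The points, labels and the helper name knn_predict are hypothetical. Euclidean distance and majority voting are assumed here, which are common choices for K-NN.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    # Distance from the query to every training sample (the costly step)
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-class training data
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["A", "A", "A", "B", "B", "B"]

prediction_near_a = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
prediction_near_b = knn_predict(train_points, train_labels, (5.0, 5.2), k=3)
print(prediction_near_a, prediction_near_b)
```

Every prediction scans the whole training set, which is why the cost grows with the number of training samples.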
Code:
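import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
# Load the Iris features and class labels
data = load_iris()
X = data.data
y = data.target
# Hold out part of the data for testing (test_size and random_state are assumed values)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train K-NN with K = 5 and predict the test set
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)
# Evaluate with a confusion matrix and a per-class report
cm= confusion_matrix(y_test,y_pred)
cm
print(classification_report(y_test, y_pred))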
Output:
PRACTICAL - 7
Aim: Implement SVM on Iris dataset and evaluate its
performance.
Theory:
An SVM model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as possible. In
addition to performing linear classification, SVMs can efficiently perform a non-linear
classification, implicitly mapping their inputs into high-dimensional feature spaces.
A support vector machine does not only draw a line between two classes; it also
considers a region of some given width around that line (the margin).
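For a linear SVM, the separating line can be written as w · x + b = 0, with the edges of the region at w · x + b = +1 and w · x + b = -1, so the width of the region is 2 / ||w||. The sketch below illustrates this with hypothetical values of w and b. It does not train an SVM; it only shows how a trained linear model classifies points and how wide its margin is.

```python
import math

# Hypothetical parameters of an already-trained linear SVM
w = (3.0, 4.0)
b = -12.0

def decision_value(point):
    """Signed score w . x + b; its sign gives the predicted class."""
    return w[0] * point[0] + w[1] * point[1] + b

def classify(point):
    """Return +1 or -1 depending on the side of the separating line."""
    return 1 if decision_value(point) >= 0 else -1

# Width of the region between the lines w . x + b = +1 and w . x + b = -1
margin_width = 2.0 / math.hypot(*w)

print("Margin width:", margin_width)
print("Class of (4, 4):", classify((4.0, 4.0)))
print("Class of (0, 0):", classify((0.0, 0.0)))
```

Training an SVM amounts to choosing w and b so that this width is as large as possible while the classes stay on their own sides.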
Code:
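from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import numpy as np
#Add datasets, insert the desired number of features and train the model
iris = datasets.load_iris()
X = iris.data  # assumed: all four features are used
y = iris.target
# Assumed split and kernel; the original values are not shown
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
classifier_predictions = clf.predict(X_test)
print(accuracy_score(y_test, classifier_predictions)*100)
cm= confusion_matrix(y_test,classifier_predictions)
cm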
Output:
PRACTICAL - 8
Aim: Perform K-means clustering on Iris dataset and evaluate its
performance.
Theory:
K-Means Clustering is an unsupervised learning algorithm that is used to solve the
Clustering problems in machine learning or data science.
It is an iterative algorithm that divides the unlabelled dataset into k different clusters
in such a way that each data point belongs to only one group of points with similar
properties.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of squared distances between the data
points and the centroids of their corresponding clusters.
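The iterative procedure alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The following is a minimal plain-Python sketch on hypothetical 2-D points with hand-picked starting centroids. The function name k_means and the fixed iteration cap are assumptions made for this example.

```python
import math

def k_means(points, centroids, iterations=10):
    """Run a basic k-means loop and return (labels, centroids)."""
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated groups, with k = 2
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, centroids=[(1, 1), (8, 8)])

print("Labels:", labels)
print("Centroids:", centroids)
```

The loop stops once an update leaves the centroids unchanged. With real data the result depends on the starting centroids, which is one reason library implementations try several initializations.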
Advantages:
Relatively simple to implement
Scales to large data sets
Guarantees convergence.
Easily adapts to new examples.
Disadvantage:
Need to choose K manually.
Can run into problems when clustering varying sizes and density.
Sensitive to outliers.
Doesn’t scale well with a large number of dimensions.
Only works for numeric values.
Code:
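#importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
dataset = pd.read_csv('iris.csv')
x = dataset.drop(columns=['Species']).values  # assumed: every column except 'Species' is a numeric feature
kmeans = KMeans(n_clusters=3, random_state=0)  # assumed: 3 clusters, one per species
y_kmeans = kmeans.fit_predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y_kmeans)  # first two features, coloured by cluster
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', label='Centroids')
plt.legend()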
Output:
PRACTICAL - 9
Aim: Write a program to implement Naïve Bayes Classifier.
Theory:
Naive Bayes is a statistical classification technique based on Bayes' Theorem. It is one of the
simplest supervised learning algorithms. The Naive Bayes classifier is a fast, accurate and
reliable algorithm, with high accuracy and speed on large datasets.
The Naive Bayes classifier assumes that the effect of a particular feature in a class is
independent of other features. For example, whether a loan applicant is desirable or not
depends on his/her income, previous loan and transaction history, age, and location. Even if
these features are interdependent, they are still considered independently. This assumption
simplifies computation, and that is why it is considered naive. This assumption is called
class conditional independence.
Bayes' Theorem is stated as P(h|D) = P(D|h) P(h) / P(D), where:
P(h): the probability of hypothesis h being true (regardless of the data). This is known
as the prior probability of h.
P(D): the probability of the data (regardless of the hypothesis). This is known as the
prior probability of the data, or the evidence.
P(h|D): the probability of hypothesis h given the data D. This is known as the posterior
probability.
P(D|h): the probability of data D given that the hypothesis h was true. This is known
as the likelihood.
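These quantities are tied together by Bayes' Theorem, P(h|D) = P(D|h) P(h) / P(D). The sketch below works through the arithmetic with hypothetical probabilities for a single hypothesis h and its complement. The numbers are invented purely for illustration.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic
p_h = 0.3              # P(h): prior probability of the hypothesis
p_d_given_h = 0.8      # P(D|h): probability of the data if h is true
p_d_given_not_h = 0.2  # P(D|not h): probability of the data if h is false

# P(D): total probability of the data over both cases
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes' Theorem: P(h|D) = P(D|h) * P(h) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d

print("P(D)   =", round(p_d, 4))
print("P(h|D) =", round(p_h_given_d, 4))
```

Observing the data raises the probability of h from the prior of 0.3 to a posterior of about 0.63. A Naive Bayes classifier performs this calculation for each class and picks the class with the highest posterior.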
Code:
Generating the Dataset
from sklearn.datasets import make_classification
X, y = make_classification(
n_features=6,
n_classes=3,
n_samples=800,
n_informative=2,
random_state=1,
n_clusters_per_class=1,
)
Train-Test Split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=125
)
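from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score
# Gaussian Naive Bayes is assumed here; the original model definition is not shown
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")
print("Accuracy:", accuracy)
print("F1 Score:", f1)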
Output:
PRACTICAL - 10
Aim: Write a program to implement ANN.
Theory:
Artificial Neural Networks are modeled after the neurons in the human brain.
Artificial Neural Networks contain artificial neurons which are called units. These units are
arranged in a series of layers that together constitute the whole Artificial Neural Network in
a system.
A layer can have only a dozen units or millions of units, depending on how complex the
neural network needs to be to learn the hidden patterns in the dataset.
Commonly, an Artificial Neural Network has an input layer, an output layer, and hidden
layers.
The input layer receives data from the outside world which the neural network needs to
analyze or learn about. Then this data passes through one or multiple hidden layers that
transform the input into data that is valuable for the output layer.
Finally, the output layer provides an output in the form of a response of the Artificial
Neural Networks to input data provided.
In the majority of neural networks, units are interconnected from one layer to another. Each
of these connections has weights that determine the influence of one unit on another unit.
As the data transfers from one unit to another, the neural network learns more and more
about the data which eventually results in an output from the output layer.
Biological Neuron    Artificial Neuron
Dendrite             Inputs
Synapses             Weights
Axon                 Output
Synapses: Synapses are the links between biological neurons that enable the
transmission of impulses from dendrites to the cell body. Synapses are the weights that
join the one-layer nodes to the next-layer nodes in artificial neurons. The strength of the
links is determined by the weight value.
Learning: In biological neurons, learning happens in the cell body nucleus or soma,
which has a nucleus that helps to process the impulses. An action potential is produced
and travels through the axons if the impulses are powerful enough to reach the
threshold. This becomes possible by synaptic plasticity, which represents the ability of
synapses to become stronger or weaker over time in reaction to changes in their activity.
In artificial neural networks, backpropagation is a technique used for learning,
which adjusts the weights between nodes according to the error or differences between
predicted and actual outcomes.
Activation: In biological neurons, activation is the firing rate of the neuron which
happens when the impulses are strong enough to reach the threshold. In artificial neural
networks, a mathematical function known as an activation function maps the input to
the output, and executes activations.
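The roles of inputs, weights and the activation function can be made concrete with a single artificial neuron. In the sketch below the input values, weights and bias are hypothetical. The neuron forms a weighted sum of its inputs and passes it through an activation function (ReLU and sigmoid are shown as two common choices).

```python
import math

def relu(z):
    """Rectified linear unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical inputs and parameters
inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.1

relu_out = neuron_output(inputs, weights, bias, relu)
sigmoid_out = neuron_output(inputs, weights, bias, sigmoid)
print("ReLU output:", relu_out)
print("Sigmoid output:", round(sigmoid_out, 4))
```

Learning changes only the weights and the bias. The activation function itself stays fixed.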
Artificial neural networks are trained using a training set. For example, suppose you want
to teach an ANN to recognize a cat. Then it is shown thousands of different images of cats
so that the network can learn to identify a cat. Once the neural network has been trained
enough using images of cats, then you need to check if it can identify cat images correctly.
This is done by making the ANN classify the images it is provided by deciding whether
they are cat images or not. The output obtained by the ANN is corroborated by a human-
provided description of whether the image is a cat image or not. If the ANN identifies
incorrectly then back-propagation is used to adjust whatever it has learned during
training. Backpropagation is done by fine-tuning the weights of the connections in ANN
units based on the error rate obtained. This process continues until the artificial neural
network can correctly recognize a cat in an image with minimal possible error rates.
Code:
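import pandas as pd
from sklearn.model_selection import train_test_split
TitanicSurvivalDataNumeric=pd.read_pickle('TitanicSurvivalDataNumeric.pkl')
TitanicSurvivalDataNumeric.head()
# Predictors and TargetVariable are the lists of input and target column names (defined elsewhere)
X=TitanicSurvivalDataNumeric[Predictors].values
y=TitanicSurvivalDataNumeric[TargetVariable].values
# Assumed split; the original split parameters are not shown
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Quick sanity check with the shapes of Training and Testing datasets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)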
Output:
units=10: This means we are creating a layer with ten neurons in it. Each of these ten
neurons will receive the values of the inputs; for example, the values of ‘Age’ will be passed
to all ten neurons, and similarly for all other columns.
input_dim=9: This means there are nine predictors in the input data which is expected by the
first layer. If you see the second dense layer, we don’t specify this value, because the
Sequential model passes this information further to the next layers.
kernel_initializer=’uniform’: When the Neurons start their computation, some algorithm has
to decide the value for each weight. This parameter specifies that. You can choose different
values for it like ‘normal’ or ‘glorot_uniform’.
activation=’relu’: This specifies the activation function for the calculations inside each
neuron. You can choose values like ‘relu’, ‘tanh’, ‘sigmoid’, etc.
optimizer=’adam’: This parameter helps to find the optimum values of each weight in the
neural network. ‘adam’ is one of the most useful optimizers, another one is ‘rmsprop’
batch_size=10: This specifies how many rows will be passed to the Network in one go after
which the SSE calculation will begin and the neural network will start adjusting its weights
based on the errors.
When all the rows are passed in the batches of 10 rows each as specified in this parameter,
then we call that 1-epoch. Or one full data cycle. This is also known as mini-batch gradient
descent. A small value of batch_size will make the ANN look at the data slowly, like 2 rows
at a time or 4 rows at a time which could lead to overfitting, as compared to a large value like
20 or 50 rows at a time, which will make the ANN look at the data fast which could lead to
underfitting. Hence a proper value must be chosen using hyperparameter tuning.
epochs=10: The same activity of adjusting weights is repeated 10 times, as specified by
this parameter. In simple terms, the ANN looks at the full training data 10 times and adjusts
its weights.
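The relationship between batch_size, epochs and the number of weight updates is simple arithmetic. The sketch below assumes a hypothetical training set of 100 rows; the row count is an illustration and not the size of the Titanic data.

```python
import math

n_rows = 100      # hypothetical number of training rows
batch_size = 10   # rows passed to the network before each weight update
epochs = 10       # number of full passes over the training data

# One weight update per batch; the last batch may be smaller than batch_size
batches_per_epoch = math.ceil(n_rows / batch_size)
total_updates = batches_per_epoch * epochs

print("Batches per epoch:", batches_per_epoch)
print("Total weight updates:", total_updates)
```

With these values the network adjusts its weights 10 times per epoch and 100 times over the whole training run. Halving batch_size would double both numbers.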
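from keras.models import Sequential
from keras.layers import Dense
classifier = Sequential()
# Defining the Input layer and FIRST hidden layer, both are same!
# relu means Rectified linear unit function
classifier.add(Dense(units=10, input_dim=9, kernel_initializer='uniform', activation='relu'))
#Defining the SECOND hidden layer, here we have not defined input because it is
# second layer and it will get input as the output of first hidden layer
classifier.add(Dense(units=6, kernel_initializer='uniform', activation='relu'))
# The output layer, compile and fit steps below are assumed (not shown in the original listing)
classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
classifier.fit(X_train, y_train, batch_size=10, epochs=10, verbose=1)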
TestingData['PredictedSurvival']=TestingData['PredictedSurvivalProb'].apply(probThreshold)
print(TestingData.head())
###############################################
from sklearn import metrics
print('\n######### Testing Accuracy Results #########')
print(metrics.classification_report(TestingData['Survival'], TestingData['PredictedSurvival']))
print(metrics.confusion_matrix(TestingData['Survival'], TestingData['PredictedSurvival']))
Output: