ML Hota Assign4
Equation (1)
In Gaussian Naive Bayes, the likelihood P(x_i | y) is often modelled using a Gaussian (normal)
distribution. The probability density function (PDF) of a Gaussian distribution is:
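$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
Equation (2)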
In practice, the parameters μ and σ² are estimated from the training data for each feature x_i and each
class y. Then, during classification, these parameters are used to compute the likelihood P(x_i | y) for
each feature given each class.
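The estimation step described above can be sketched in a few lines of NumPy. This is only an illustration, not the scikit-learn implementation: the helper names fit_gaussian_params and gaussian_likelihood and the toy data are hypothetical, and the variance is the plain population variance (libraries may add a small smoothing term).

```python
import numpy as np

def fit_gaussian_params(X, y):
    """Estimate the mean and variance of every feature, separately per class."""
    params = {}
    for label in np.unique(y).tolist():
        rows = X[y == label]                      # training rows of this class
        params[label] = (rows.mean(axis=0), rows.var(axis=0))
    return params

def gaussian_likelihood(x, mean, var):
    """Evaluate the Gaussian PDF feature by feature for one sample x."""
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Hypothetical toy data: two features, two classes (not from Data-NB.xlsx).
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0],
              [7.0, 1.0], [8.0, 3.0], [9.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_params(X, y)
mean0, var0 = params[0]

# Likelihood of each feature of a new sample under class 0.
likelihoods = gaussian_likelihood(np.array([2.0, 4.0]), mean0, var0)
print(mean0, var0, likelihoods)
```

Under the Naive Bayes independence assumption, the per-feature likelihoods would then be multiplied together (or their logarithms summed) and combined with the class prior.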
Your task in this assignment is to experiment with the Gaussian Naive Bayes algorithm on the grading
file attached here (Data-NB.xlsx). Grading is based on the test scores. Below are the code snippets in
scikit-learn to import the classifier and other required libraries:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
To get the first few records of the pandas DataFrame, use the following:
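A minimal sketch of such a preview is given here. Since the layout of Data-NB.xlsx is not reproduced in this handout, the block builds a small hypothetical stand-in DataFrame: the Gender, Attendance and Grade columns are named in the assignment, while TestScore and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in rows. With the real file you would instead load it,
# for example: df = pd.read_excel("Data-NB.xlsx")  (needs the openpyxl package)
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F", "M"],
    "Attendance": ["High", "Low", "High", "Medium", "High", "Low"],
    "TestScore": [78, 55, 91, 64, 83, 49],   # assumed column name
    "Grade": ["B", "D", "A", "C", "A", "D"],
})

# head() returns the first 5 rows by default; pass n to change the count.
first_rows = df.head()
print(first_rows)
```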
The output in your notebook should be as below:
The data in the given xlsx file (Grading) follow a Gaussian distribution, as discussed in class.
Reproduce the pattern shown below (Fig.1) in your Python code to visualize it. The performance metric of
the Gaussian Naïve Bayes classifier for an 80-20 train-test split is shown below (Fig.2).
(Fig.1)
(Fig.2)
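The 80-20 workflow behind a metric like Fig.2 can be sketched as follows. The data here is a synthetic, hypothetical stand-in for Data-NB.xlsx (the TestScore column, the grade cut-offs and the random seed are all assumptions), and one-hot encoding is just one possible choice for the nominal columns, so the numbers it prints will not match the figure.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in data (hypothetical, not the real grading file).
rng = np.random.default_rng(0)
n = 200
score = rng.normal(loc=65, scale=12, size=n)
df = pd.DataFrame({
    "Gender": rng.choice(["M", "F"], size=n),
    "Attendance": rng.choice(["Low", "Medium", "High"], size=n),
    "TestScore": score,
    # Assumed grade cut-offs, used only to create a label column.
    "Grade": np.select([score < 55, score < 70], ["C", "B"], default="A"),
})

# One-hot encode the nominal feature columns; the numeric column is kept as is.
X = pd.get_dummies(df[["Gender", "Attendance", "TestScore"]],
                   columns=["Gender", "Attendance"]).astype(float)
y = df["Grade"]

# 80-20 split, stratified so every grade appears in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
print("accuracy:", acc)
print(cm)
```

Changing test_size to 0.3 gives the 70-30 variant of the same experiment.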
The second part of this assignment is to classify flowers with GaussianNB using the iris.csv data file, which is
also attached to this assignment. There are 150 records in this file. Plot the flowers using the
matplotlib.pyplot.imshow method to view them as shown below:
Each record has the features Sepal length, Sepal width, Petal length, and Petal width, plus Species
(a categorical feature: Setosa, Versicolor, or Virginica). You may also import the built-in iris dataset
from scikit-learn as below:
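One way to do this is with load_iris from sklearn.datasets. The sketch below assumes a scikit-learn version that supports the as_frame option; the species column name is added here only for readability and is not part of the loader's output.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of plain NumPy arrays.
iris = load_iris(as_frame=True)

# iris.frame holds the four feature columns plus an integer 'target' column.
iris_df = iris.frame.copy()

# Map the integer targets (0, 1, 2) to their species names.
iris_df["species"] = iris.target_names[iris.target.to_numpy()]

print(iris_df.shape)
print(iris_df.head())
```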
The classification report is as given below with a prediction accuracy of 97% and other related metrics.
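A report of this kind can be produced along the following lines. This is a sketch using the built-in dataset rather than iris.csv, and the 80-20 split and random_state are assumptions; the exact figures depend on the split, so they need not match the 97% quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

iris = load_iris()

# Assumed 80-20 split; stratify keeps the three species balanced.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42,
    stratify=iris.target)

clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Per-class precision, recall and F1, plus overall accuracy.
report = classification_report(y_test, y_pred,
                               target_names=iris.target_names)
print(report)
print("accuracy:", accuracy_score(y_test, y_pred))
```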
o In the Data-NB.xlsx file, a few attributes, such as the Gender, Attendance and Grade columns, are nominal
variables and require encoding before model training. Use an appropriate encoding.
o Split the dataset into training and testing subsets (80-20 or 70-30). Train a Gaussian Naive Bayes
classifier on the training data and predict the grades in the test data. Calculate the accuracy and the
confusion matrix to assess the classifier's performance.
o Split the Iris dataset (iris.csv) into training and testing subsets, followed by training a Gaussian
Naive Bayes classifier on the training data. Evaluate the classifier's performance by plotting the
classification report as shown above.
o Compare and contrast the performance of the Naïve Bayes classifier built in this assignment with that of
Random Forest and Gradient Boosted Trees (developed in Assignment 2) on the identical datasets. Analyse
the reasons behind any observed differences in their performances.
o For both datasets, visualize the correlation matrix to check the assumptions of the Naïve
Bayes algorithm.
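For the correlation check in the last item, a minimal sketch on the built-in iris data is shown here. The 0.8 threshold is a hypothetical cut-off chosen for illustration, and the matrix is computed over all classes together; Naive Bayes assumes independence of features within each class, so a per-class matrix would be the stricter check.

```python
from sklearn.datasets import load_iris

# Four numeric feature columns as a pandas DataFrame.
features = load_iris(as_frame=True).data

# Pairwise Pearson correlation between features (pandas default).
corr = features.corr()
print(corr.round(2))

# List feature pairs whose absolute correlation exceeds the cut-off.
threshold = 0.8  # hypothetical cut-off, not prescribed by the assignment
cols = list(corr.columns)
strong_pairs = [
    (a, b, float(corr.loc[a, b]))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(strong_pairs)
```

The resulting matrix can be drawn as a heatmap with any plotting library; strongly correlated pairs indicate where the independence assumption is likely to be violated.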
Submission Instructions: Same as that of earlier assignments. Any clarification on this coding
assignment may be emailed to I/C or Paryetri Banerjee ([email protected]) or
Anish Shandilya ([email protected]).
References: 1. https://scikit-learn.org/stable/modules/naive_bayes.html
2. https://towardsdatascience.com/the-naive-bayes-classifier-how-it-works-e229e7970b8