
Labsheet - 7

Naive Bayes
Machine Learning
BITS F464

I Semester 2023-24

The Naive Bayes (NB) algorithm is a simple application of Bayes' theorem to classification. The algorithm is called "naive" because it makes a couple of simplifying assumptions about the data:
1. All of the features in the dataset are equally important and independent.
2. It assumes class-conditional independence, meaning that features are independent of one another so long as they are conditioned on the same class value (see the formula below).
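Formally, class-conditional independence means that for a class value C and features F1, …, Fn:

P(F1, F2, …, Fn | C) = P(F1 | C) * P(F2 | C) * … * P(Fn | C)

which is what allows the per-feature probabilities to be multiplied together in the calculations below.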
Consider the sample dataset below, comprising the target concept, PlayTennis, and four features: Outlook, Temperature, Humidity, and Windy.

Outlook    Temperature  Humidity  Windy   PlayTennis
sunny      hot          high      weak    no
sunny      hot          high      strong  no
overcast   hot          high      weak    yes
rain       mild         high      weak    yes
rain       cool         normal    weak    yes
rain       cool         normal    strong  no
overcast   cool         normal    strong  yes
sunny      mild         high      weak    yes
sunny      cool         normal    weak    yes
rain       mild         normal    weak    yes
sunny      mild         normal    strong  yes
overcast   mild         high      strong  no
overcast   hot          normal    weak    yes
rain       mild         high      strong  no

Modeling the dataset using the Bayesian concept:

P(YES | X) = P(X | YES) P(YES) / P(X)
P(NO | X) = P(X | NO) P(NO) / P(X)

where X = (Outlook = Rain, Temperature = Mild, Humidity = High, Windy = Strong).
Because the denominator P(X) is the same in both cases, it can be ignored for now. Then,

The overall likelihood of YES:

P(PlayTennis=Yes) * P(Outlook=Rain | PlayTennis=Yes) * P(Temperature=Mild | PlayTennis=Yes) *
P(Humidity=High | PlayTennis=Yes) * P(Windy=Strong | PlayTennis=Yes)

The overall likelihood of NO:

P(PlayTennis=No) * P(Outlook=Rain | PlayTennis=No) * P(Temperature=Mild | PlayTennis=No) *
P(Humidity=High | PlayTennis=No) * P(Windy=Strong | PlayTennis=No)
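As a minimal sketch of this hand calculation (assuming the table above has been saved as Lab3/WeatherData.csv, the same file loaded in the R code later in this lab; cond_prob is a helper written here only for illustration):

# Minimal sketch of the hand calculation above, assuming the 14-row table
# is available as "Lab3/WeatherData.csv"
weather <- read.csv("Lab3/WeatherData.csv", stringsAsFactors = FALSE)

# P(feature = value | PlayTennis = class), estimated from frequency counts
cond_prob <- function(df, feature, value, class) {
  subset_class <- df[df$PlayTennis == class, ]
  sum(subset_class[[feature]] == value) / nrow(subset_class)
}

# Prior probabilities P(Yes) and P(No)
prior_yes <- mean(weather$PlayTennis == "yes")
prior_no  <- mean(weather$PlayTennis == "no")

# Unnormalized likelihoods for X = (rain, mild, high, strong)
like_yes <- prior_yes *
  cond_prob(weather, "Outlook", "rain", "yes") *
  cond_prob(weather, "Temperature", "mild", "yes") *
  cond_prob(weather, "Humidity", "high", "yes") *
  cond_prob(weather, "Windy", "strong", "yes")

like_no <- prior_no *
  cond_prob(weather, "Outlook", "rain", "no") *
  cond_prob(weather, "Temperature", "mild", "no") *
  cond_prob(weather, "Humidity", "high", "no") *
  cond_prob(weather, "Windy", "strong", "no")

# Normalize the two likelihoods to obtain posterior probabilities
c(yes = like_yes, no = like_no) / (like_yes + like_no)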

The Naive Bayes classification algorithm we used in the preceding example can be summarized by the following formula. Essentially, the probability of level L of class C, given the evidence provided by features F1 through Fn, is proportional to the prior probability of the class level multiplied by the product of the probabilities of each piece of evidence conditioned on the class level:

P(CL | F1, …, Fn) ∝ P(CL) ∏i=1..n P(Fi | CL)
# Install (if needed) and load the e1071 package
# install.packages("e1071")
library("e1071")
# Load the dataset
data = read.csv("Lab3/WeatherData.csv", stringsAsFactors = FALSE)
data
# Display the structure of the dataset
str(data)
# Encode the target vector as a factor
data$PlayTennis <- factor(data$PlayTennis)
# Display the frequency of each target class
table(data$PlayTennis)
# Display the proportion of each target class
prop.table(table(data$PlayTennis))
# Build the model (features as predictors, PlayTennis as the target)
classifier <- naiveBayes(data[ , names(data) != "PlayTennis"], data$PlayTennis)
classifier
# Generate test sample (X)
Outlook <- "rain"
Temperature <- "mild"
Humidity <- "high"
Windy <- "strong"
# Load the instance X as a data frame
test_data <- data.frame(Outlook, Temperature, Humidity, Windy)
# Predict the target class for the given instance X
test_pred <- predict(classifier, test_data)
test_pred
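To inspect the posterior probabilities behind the predicted label (rather than only the class), predict() for naiveBayes models also accepts type = "raw"; a brief sketch:

# Posterior probabilities P(Yes | X) and P(No | X) for the same test instance
test_prob <- predict(classifier, test_data, type = "raw")
test_prob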

The Laplace Estimator


An additional issue to be aware of: since naive Bayes uses the product of feature probabilities conditioned on each class, we run into a serious problem when new data includes a feature value that never occurs for one or more levels of a response class. The result is P(xi | Ck) = 0 for that individual feature, and this zero ripples through the entire multiplication over all features, forcing the posterior probability for that class to zero.
A solution to this problem is the Laplace smoother. The Laplace smoother adds a small number to each of the counts in the frequency table for each feature, which ensures that each feature has a nonzero probability of occurring for each class.

Li = (Ci + S) / (N + K)
where
Ci: count of tuples satisfying the test condition
S: Laplace parameter (a small number, usually 1)
N: total number of tuples belonging to that class value
K: number of distinct values of the particular feature
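As a minimal sketch of applying this formula (the function name laplace_estimate and the example counts below are hypothetical, not taken from the lab data):

# Laplace-smoothed estimate: (count + S) / (class total + no. of distinct values)
laplace_estimate <- function(Ci, N, K, S = 1) {
  (Ci + S) / (N + K)
}

# Example: a feature value that never occurs for a class (Ci = 0) now gets a
# small nonzero probability instead of 0
laplace_estimate(Ci = 0, N = 5, K = 3)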

library("e1071")
data1 = read.csv("Lab3/WeatherData.csv",stringsAsFactors = FALSE)
data1
str(data1)
data1$Salary <- factor(data1$Humidity)
table(data1$Humidity)
prop.table(table(data1$Humidity))

classifier1 <- naiveBayes(data1,data1$Humidity,laplace=1)


classifier1
Outlook <- "sunny"
temperature <- "hot"
Windy <- "weak"
test_data1 <- data.frame(Outlook,temperature,Windy)
test_pred1 <- predict(classifier1,test_data1)
test_pred1

Dataset Preparation –Training and Test datasets


We split the dataset into two portions: a training dataset used to build the Naive Bayes model, and a test dataset used to evaluate the model's performance on new data. We will use 80 percent of the data for training and 20 percent for testing; the held-out 20 percent of the records simulates new data for which the Salary class must be predicted.
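Note that sample() draws a random subset, so the split below changes between runs; an optional one-line sketch for making it reproducible (the seed value 123 is arbitrary):

# Fix the random number generator so the 80/20 split is reproducible
set.seed(123)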
library("e1071")
data1 = read.csv("Lab 3/Sales.csv",stringsAsFactors = FALSE)
data1
str(data1)
df=as.data.frame(data1)
df
repseq_len(nrow(df))
repeating_sequence = rep.int(seq_len(nrow(df)),df$Count)
repeating_sequence
dataset = df[repeating_sequence,]
dataset
#We no longer need the frequency, drop the feature
dataset$Count = NULL
dataset
sample_set <- sample(nrow(dataset), round(nrow(dataset)*.80), replace= FALSE)
train_data <-dataset[sample_set, ]
test_data <- dataset[-sample_set, ]
prop.table(table(dataset$Salary))
Naive_classifier = naiveBayes(Salary ~ .,data=train_data,laplace=6)
NB_Prediction = predict(Naive_classifier,test_data)
NB_Prediction
tab2 = table(NB_Prediction,test_data$Salary)
Accuracy = sum(diag(tab2)) / sum(tab2)
Accuracy
Exercise
1. Apply the Naïve Bayes algorithm to the Titanic dataset, doing the preprocessing as required. Fit the model on the complete dataset and predict survival for each instance, i.e. yes or no. Also build a confusion matrix and find the accuracy of the Naïve Bayes model designed for the Titanic survival class.
2. Download the Nursery Data Set (from this link: https://archive.ics.uci.edu/ml/datasets/nursery). Build a Naive Bayes classifier to predict finance. You may only use the categorical variables and ignore the continuous variables. Check the accuracy of the model with and without Laplace smoothing and report which model performed better.
3. Which data preprocessing technique is most suited for NBC and why?
4. What are the pros and cons of doing away with the assumptions of NBC?
5. How many probability calculations are involved in NBC for a classification problem involving attributes Xi with cardinality Ci (i = 1, 2, …, M) and class labels Yj (j = 1, 2, …, P)?
