
Fundamentals of Machine Learning (DSE 2222)

by
Shavantrevva S. Bilakeri
Dept. of Data Science and Computer Applications
Manipal Institute of Technology, Manipal

March 8, 2024



Overview



What is conditional probability?
What is Bayes theorem?
Why is it used in Machine Learning?
Examples of Bayes theorem in Machine Learning and much more.



Prerequisites for Bayes Theorem

Sample Space: The results obtained from an experiment are called outcomes, and the set of all possible outcomes of an experiment is known as the sample space. For example, if we are rolling a die, the sample space is:
S1 = {1, 2, 3, 4, 5, 6}
Similarly, if our experiment is tossing a coin and recording its outcome, the sample space is:
S2 = {Head, Tail}
Event: An event is defined as a subset of the sample space of an experiment; in other words, it is a set of outcomes.
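For example, when rolling a die, the event "an even number is rolled" is the subset E = {2, 4, 6} of the sample space S1.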





Prerequisites

Complement: Aᶜ (the complement of A) represents all outcomes that are outside of A.

Disjoint: Two events A and B are disjoint (mutually exclusive) if they have no outcomes in common; if A occurs, then B cannot occur, and vice versa.
Prerequisites
Axioms are statements or propositions that are accepted as true without requiring proof within the particular mathematical system or theory under consideration.
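The axioms themselves appear only in the slide figure and are not recoverable from the extracted text; they presumably are the standard Kolmogorov axioms of probability:
1. Non-negativity: P(A) ≥ 0 for every event A.
2. Normalisation: P(S) = 1, where S is the sample space.
3. Additivity: P(A ∪ B) = P(A) + P(B) whenever A and B are disjoint.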





Bayes Theorem
Bayes' theorem is due to the English statistician, philosopher, and Presbyterian minister Thomas Bayes (18th century).
Bayes' theorem is one of the most popular machine learning concepts; it helps to calculate the probability of one event occurring, under uncertain knowledge, given that another event has already occurred.
Bayes' theorem can be derived using the product rule and the conditional probability of event X with known event Y.
According to the product rule, we can express the probability of event X with known event Y as follows:

P(X ∩ Y) = P(X|Y) P(Y)   (1)

Further, the probability of event Y with known event X:

P(X ∩ Y) = P(Y|X) P(X)   (2)



Mathematically, Bayes' theorem is obtained by equating the right-hand sides of equations (1) and (2) and dividing by P(Y):

P(X|Y) = P(Y|X) P(X) / P(Y)

Here, X plays the role of the hypothesis and Y the role of the evidence (the observed event).
The above equation is called Bayes' Rule or Bayes' Theorem.
P(X|Y) is called the posterior, which is what we need to calculate. It is the updated probability of the hypothesis after considering the evidence.
P(Y|X) is called the likelihood. It is the probability of the evidence when the hypothesis is true.
P(X) is called the prior probability, the probability of the hypothesis before considering the evidence.
P(Y) is called the marginal probability. It is the probability of the evidence, regardless of the hypothesis.
Hence, Bayes' Theorem can be summarised as:
posterior = likelihood * prior / evidence
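As a quick illustration of posterior = likelihood * prior / evidence, here is a minimal Python sketch; the prior, likelihood, and evidence values below are made-up numbers for illustration only.

```python
def posterior(prior: float, likelihood: float, evidence: float) -> float:
    """Bayes' rule: P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    return likelihood * prior / evidence

# Assumed values: P(X) = 0.01, P(Y|X) = 0.9, P(Y) = 0.05.
print(posterior(prior=0.01, likelihood=0.9, evidence=0.05))  # -> 0.18
```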
Naive Bayes classifiers

Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem.
Each algorithm is based on the same principle: every pair of features being classified is independent of each other.
The "Naive" part of the name indicates the simplifying assumption made by the Naïve Bayes classifier.
The classifier assumes that the features used to describe an observation are conditionally independent, given the class label.
The dataset is divided into two parts, namely the feature matrix and the response vector.
The feature matrix contains all the vectors (rows) of the dataset, where each vector holds the values of the input features.
The response vector contains the value of the class variable (prediction or output) for each row of the feature matrix.



The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is made up of two words, Naïve and Bayes, which can be described as:
Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
Bayes: It is called Bayes because it relies on the principle of Bayes' Theorem.
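Concretely, the conditional independence assumption lets the likelihood of the fruit's features factor into per-feature terms:

P(red, spherical, sweet | apple) = P(red | apple) * P(spherical | apple) * P(sweet | apple)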



Naive Bayes Classifier
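The figures on these slides are not recoverable from the extraction; for reference, the decision rule they illustrate combines the class prior with the per-feature likelihoods:

P(y | x1, ..., xn) ∝ P(y) * P(x1 | y) * P(x2 | y) * ... * P(xn | y)

and the predicted class is the y that maximises this product.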



Working of Naı̈ve Bayes’ Classifier

The working of the Naïve Bayes classifier can be understood with the help of the example below.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether or not we should play on a particular day, according to the weather conditions. To solve this problem, we follow the steps below (a small code sketch of these steps follows the list):
Convert the given dataset into frequency tables.
Generate the likelihood table by finding the probabilities of the given features.
Now, use Bayes' theorem to calculate the posterior probability.
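A minimal sketch of these three steps in Python, using an assumed toy weather/play dataset (the records below are illustrative and are not the table from the slides):

```python
from collections import Counter

# Assumed toy dataset: (weather, play) records, not the slide's table.
data = [("Sunny", "Yes"), ("Sunny", "Yes"), ("Sunny", "Yes"),
        ("Sunny", "No"), ("Sunny", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "Yes"), ("Rainy", "Yes"), ("Rainy", "No")]

# Step 1: frequency tables.
class_counts = Counter(play for _, play in data)   # {"Yes": 7, "No": 3}
joint_counts = Counter(data)                       # {("Sunny", "Yes"): 3, ...}

# Step 2: likelihoods P(weather | play) and priors P(play).
def likelihood(weather, play):
    return joint_counts[(weather, play)] / class_counts[play]

def prior(play):
    return class_counts[play] / len(data)

# Step 3: posterior P(play | Sunny) via Bayes' theorem.
scores = {play: likelihood("Sunny", play) * prior(play) for play in class_counts}
evidence = sum(scores.values())                    # P(Sunny)
for play, score in scores.items():
    print(play, round(score / evidence, 2))        # Yes 0.6, No 0.4
```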



Example: Naive Bayes
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the dataset given on the slide, then apply the three steps above.

Exercise



Gaussian Naive Bayes- for continuous data
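The slides in this section are figures; as a hedged sketch of what Gaussian Naive Bayes does with continuous features, each feature x is modelled per class y with a normal density

P(x | y) = 1 / sqrt(2 * pi * sigma_y^2) * exp(-(x - mu_y)^2 / (2 * sigma_y^2))

where mu_y and sigma_y^2 are that feature's mean and variance estimated from the training rows of class y. A minimal illustration with scikit-learn (the data below is made up):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Assumed toy data: each row is [height_cm, weight_kg]; classes 0 and 1 are illustrative.
X = np.array([[180.0, 80.0], [175.0, 77.0], [178.0, 82.0],
              [160.0, 55.0], [158.0, 52.0], [163.0, 57.0]])
y = np.array([1, 1, 1, 0, 0, 0])

model = GaussianNB()      # estimates a per-class mean and variance for every feature
model.fit(X, y)

print(model.predict([[170.0, 65.0]]))        # predicted class for a new sample
print(model.predict_proba([[170.0, 65.0]]))  # class posteriors from the Gaussian likelihoods
```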



Zero Probability in Naïve Bayes algorithm
Example: text classification, where the task is to classify whether a review is positive or negative.
We build a likelihood table based on the training data. While querying a review, we use the likelihood table values; but what if a word in the review was not present in the training dataset?
Let us assume only w1, w2, and w3 are present in the training data.
To calculate whether the review is positive or negative, we compare P(positive|review) and P(negative|review).

Example query review: w1 w2 w3 w'

In the likelihood table, we have P(w1|positive), P(w2|positive), P(w3|positive) and P(positive), but where is P(w'|positive)?
If the word is absent from the training dataset, then we do not have its likelihood. This is called the zero-probability condition.
Zero Probability in Naïve Bayes algorithm

Approach 1 - Ignore the term P(w'|positive)

This amounts to assigning it a value of 1, i.e. treating the probability of w' occurring in a positive review, P(w'|positive), and in a negative review, P(w'|negative), as 1. This approach is logically incorrect.

Approach 2 - In a bag-of-words model

We count the occurrences of words. The number of occurrences of the word w' in training is 0, so P(w'|positive) = 0 and P(w'|negative) = 0.
Because we multiply all the likelihoods, this makes both P(positive|review) and P(negative|review) equal to 0. This is the problem of zero probability. So, how do we deal with this problem?



Laplace Smoothing
Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naïve Bayes. Using Laplace smoothing, we can replace the zero estimate of P(w'|positive) with a smoothed one:




Bayesian Belief Network
Bayesian networks are a widely used class of probabilistic graphical models.
They consist of two parts: a structure and parameters.
The structure is a directed acyclic graph (DAG) that expresses conditional independencies and dependencies among random variables associated with nodes.
The parameters consist of conditional probability distributions associated with each node.
A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph.
It can also be used in various tasks including prediction, anomaly detection, diagnostics, automated insight, reasoning, time series prediction, and decision making under uncertainty.


Bayesian Belief Network: Example

In the figure on this slide, we have an alarm A (a node), say installed in the house of a person X, which rings upon two events, burglary B and fire F, whose nodes are the parents of the alarm node. The alarm node is in turn the parent of the two person nodes "P1 calls" and "P2 calls". Upon the instance of burglary or fire, the alarm rings and P1 and P2 call person X.
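The conditional probability tables on the following slides are figures and are not recoverable here; the sketch below uses assumed values to show how the network's joint probability factorises as P(B, F, A, P1, P2) = P(B) P(F) P(A | B, F) P(P1 | A) P(P2 | A).

```python
# All probability values below are assumptions for illustration, not the slide's CPTs.
p_burglary = 0.001
p_fire = 0.002

# P(Alarm rings | Burglary, Fire), indexed by (burglary, fire).
p_alarm = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}

p_p1_calls = {True: 0.90, False: 0.05}   # P(P1 calls | Alarm)
p_p2_calls = {True: 0.70, False: 0.01}   # P(P2 calls | Alarm)

# Joint probability that both P1 and P2 call and the alarm rings,
# while there is neither a burglary nor a fire:
b, f, a = False, False, True
joint = ((1 - p_burglary) * (1 - p_fire) * p_alarm[(b, f)]
         * p_p1_calls[a] * p_p2_calls[a])
print(joint)  # roughly 0.000628
```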
