Notes On ML

Notes on machine learning concepts such as naïve Bayes, Bayes' theorem, linear regression, and so on


UNIT II: BAYESIAN DECISION THEORY


Bayes rule – Independence and conditional independence – Minimum error rate classification – Normal density and discriminant functions – Bayesian concept learning – MAP estimation – Bayes classifier – Maximum likelihood and Bayesian parameter estimation for common loss functions – Naïve Bayes model.

PROBABILITY THEORY

Probability is a measure of the possibility that a random event occurs. Its value is expressed on a scale from zero to one.

For example, when we toss a single coin, there are only two possible outcomes: Head or Tail (H, T). If we toss two coins, there are four equally likely ordered outcomes, (H, H), (H, T), (T, H), (T, T), which correspond to three distinct unordered events: both coins show heads, both show tails, or one shows heads and the other tails.


PROBABILITY FORMULA

P(A) = n(A)/n(S)

P(A) is the probability of an event "A"
n(A) is the number of favorable outcomes
n(S) is the total number of outcomes in the sample space

PROBABILITY FORMULA

List of basic probability formulas:

Probability range: 0 ≤ P(A) ≤ 1
Rule of addition: P(A∪B) = P(A) + P(B) – P(A∩B)
Rule of complementary events: P(A′) + P(A) = 1
Disjoint events: P(A∩B) = 0
Independent events: P(A∩B) = P(A) ⋅ P(B)
Conditional probability: P(A|B) = P(A∩B) / P(B)
Bayes' formula: P(A|B) = P(B|A) ⋅ P(A) / P(B)
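As a quick numeric check of these formulas (an addition, not part of the original slides), the short Python sketch below applies them to a single roll of a fair six-sided die, with A = "the roll is greater than 4" and B = "the roll is at least 4":

```python
from fractions import Fraction

outcomes = set(range(1, 7))          # sample space for one die roll
A = {5, 6}                           # roll is greater than 4
B = {4, 5, 6}                        # roll is at least 4

p = lambda event: Fraction(len(event), len(outcomes))

print(p(A | B) == p(A) + p(B) - p(A & B))   # rule of addition: True
print(p(outcomes - A) + p(A) == 1)          # complementary events: True
p_A_given_B = p(A & B) / p(B)               # conditional probability: 2/3
p_B_given_A = p(A & B) / p(A)               # = 1
print(p_A_given_B == p_B_given_A * p(A) / p(B))  # Bayes' formula: True
```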

BAYES RULE

Bayes' Theorem is a way of finding a probability when we know certain other probabilities.

The formula is:

P(A|B) = P(B|A) ⋅ P(A) / P(B)

which tells us how often A happens given that B happens, written P(A|B), when we know:

how often B happens given that A happens, written P(B|A)
how likely A is on its own, written P(A)
how likely B is on its own, written P(B)
BAYES RULE

Let us say P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke. Then:

P(Fire|Smoke) means how often there is fire when we can see smoke
P(Smoke|Fire) means how often we can see smoke when there is fire

So the formula tells us the "forwards" probability P(Fire|Smoke) when we know the "backwards" probability P(Smoke|Fire).

BAYES RULE

Example 1:

Imagine 100 people at a party. You tally how many wear pink or not, and whether each person is a man or not, and get these numbers:

              Pink    Not Pink
Man             5        35
Not a Man      20        40
BAYES RULE

Bayes' Theorem is based on just those four numbers.

Let us do some totals: 40 people are men, 25 people wear pink, and there are 100 people in all.
BAYES RULE

And calculate some probabilities:

The probability of being a man is P(Man) = 40/100 = 0.4

The probability of wearing pink is P(Pink) = 25/100 = 0.25

The probability that a man wears pink is P(Pink|Man) = 5/40 = 0.125

The probability that a person wearing pink is a man P(Man|Pink) = ?

BAYES RULE

P(Man) = 0.4, P(Pink) = 0.25 and P(Pink|Man) = 0.125

Applying Bayes' rule:

P(Man|Pink) = P(Pink|Man) ⋅ P(Man) / P(Pink) = 0.125 × 0.4 / 0.25 = 0.2

If we still had the raw data, we could calculate it directly: 5/25 = 0.2.
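The same example can be checked with a few lines of Python. This is a minimal sketch (not part of the slides) using the party counts from the table above:

```python
# Counts from the party example above.
men_pink, men_not_pink = 5, 35
other_pink, other_not_pink = 20, 40
total = men_pink + men_not_pink + other_pink + other_not_pink   # 100 people

p_man = (men_pink + men_not_pink) / total                 # 0.4
p_pink = (men_pink + other_pink) / total                  # 0.25
p_pink_given_man = men_pink / (men_pink + men_not_pink)   # 0.125

# Bayes' rule: P(Man|Pink) = P(Pink|Man) * P(Man) / P(Pink)
p_man_given_pink = p_pink_given_man * p_man / p_pink
print(p_man_given_pink)                    # 0.2
print(men_pink / (men_pink + other_pink))  # direct count check: 0.2
```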

INDEPENDENCE

Let A be the event that it rains tomorrow, and suppose that P(A)=1/3. Also suppose I toss a fair coin; let B be the event that it lands heads up. We have P(B)=1/2.

What is P(A|B)?

INDEPENDENCE

The result of my coin toss does not have anything to do with tomorrow's weather. Thus, no matter whether B happens or not, the probability of A should not change. This is an example of two independent events. Two events are independent if one does not convey any information about the other.

Two events A and B are independent if and only if P(A∩B) = P(A)P(B).

INDEPENDENCE

Now, let us reconcile this definition with what we mentioned earlier, P(A|B)=P(A). If two events are independent, then P(A∩B)=P(A)P(B), so

P(A|B) = P(A∩B)/P(B)
       = P(A)P(B)/P(B)
       = P(A)

Thus, if two events A and B are independent and P(B)≠0, then P(A|B)=P(A). To summarize, we can say "independence means we can multiply the probabilities of events to obtain the probability of their intersection", or equivalently, "independence means that the conditional probability of one event given another is the same as the original (prior) probability".
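As a small numeric illustration (an addition, not from the slides), the sketch below checks independence for one roll of a fair six-sided die, with A = "the roll is even" and B = "the roll is at most 4":

```python
from fractions import Fraction

outcomes = set(range(1, 7))
A = {o for o in outcomes if o % 2 == 0}   # even roll: {2, 4, 6}
B = {o for o in outcomes if o <= 4}       # roll at most 4: {1, 2, 3, 4}

p = lambda event: Fraction(len(event), len(outcomes))

print(p(A), p(B), p(A & B))       # 1/2, 2/3, 1/3
print(p(A & B) == p(A) * p(B))    # True  -> A and B are independent
print(p(A & B) / p(B) == p(A))    # True  -> P(A|B) = P(A)
```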
CONDITIONAL INDEPENDENCE

As we mentioned earlier, almost any concept that is defined for probability can also be extended to conditional probability. Remember that two events A and B are independent if

P(A∩B) = P(A)P(B), or equivalently, P(A|B) = P(A).

We can extend this concept to conditionally independent events. In particular:
CONDITIONAL INDEPENDENCE

Two events A and B are conditionally independent given an event C with P(C) > 0 if

P(A∩B|C) = P(A|C) P(B|C)

Recall that, from the definition of conditional probability,

P(A|B) = P(A∩B) / P(B)

if P(B) > 0. By conditioning on C, we obtain

P(A|B,C) = P(A∩B|C) / P(B|C)
CONDITIONAL INDEPENDENCE

P(A|B,C) = P(A∩B|C) / P(B|C)
         = P(A|C) P(B|C) / P(B|C)
         = P(A|C)

Thus, if A and B are conditionally independent given C, then

P(A|B,C) = P(A|C)

CONDITIONAL INDEPENDENCE

A box contains two coins: a regular coin and one fake two-headed
coin (P(H)=1). I choose a coin at random and toss it twice. Define the
following events.

A= First coin toss results in an H.

B= Second coin toss results in an H.

C= Coin 1 (regular) has been selected.

Find P(A), P(B), P(C), P(A|C), P(B|C), P(A∩B|C), and P(A∩B).
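The slides leave this as an exercise. The sketch below (an addition, not part of the slides) works it out by enumerating the equally likely (coin, first toss, second toss) outcomes, so every probability can be read off by counting:

```python
from fractions import Fraction
from itertools import product

# Build the sample space: choose a coin, then toss it twice.
# The regular coin shows H or T with equal probability; the fake coin always shows H.
sample_space = []                       # list of (outcome, probability) pairs
for coin in ("regular", "two-headed"):
    faces = ["H", "T"] if coin == "regular" else ["H"]
    for t1, t2 in product(faces, repeat=2):
        prob_outcome = Fraction(1, 2) * Fraction(1, len(faces)) ** 2
        sample_space.append(((coin, t1, t2), prob_outcome))

def prob(event):
    """Total probability of all outcomes satisfying the predicate."""
    return sum(p for outcome, p in sample_space if event(outcome))

A = lambda o: o[1] == "H"            # first toss is H
B = lambda o: o[2] == "H"            # second toss is H
C = lambda o: o[0] == "regular"      # regular coin selected

p_A, p_B, p_C = prob(A), prob(B), prob(C)
p_A_given_C = prob(lambda o: A(o) and C(o)) / p_C
p_B_given_C = prob(lambda o: B(o) and C(o)) / p_C
p_AB_given_C = prob(lambda o: A(o) and B(o) and C(o)) / p_C
p_AB = prob(lambda o: A(o) and B(o))

print(p_A, p_B, p_C)                           # 3/4 3/4 1/2
print(p_A_given_C, p_B_given_C, p_AB_given_C)  # 1/2 1/2 1/4
print(p_AB, p_A * p_B)                         # 5/8 vs 9/16
```

Note that P(A∩B|C) = P(A|C)P(B|C), so A and B are conditionally independent given C, yet P(A∩B) ≠ P(A)P(B), so they are not independent unconditionally.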



BAYESIAN CONCEPT LEARNING

Random variable (stochastic variable)

In statistics, a random variable is a variable whose possible values are the result of a random event. Therefore, each possible value of a random variable has some probability attached to it that represents the likelihood of that value.

Probability distribution

The function that defines the probability of the different outcomes/values of a random variable. Continuous probability distributions are described using probability density functions, whereas discrete probability distributions can be represented using probability mass functions.

BAYESIAN CONCEPT LEARNING

Conditional probability

This is a measure of the probability P(A|B) of an event A given that another event B has occurred.

Joint probability

Given two random variables defined on the same probability space, the joint probability is the probability that they simultaneously take particular values, e.g. P(X = x, Y = y).

BAYESIAN CONCEPT LEARNING

Suppose that you are allowed to flip a coin of unknown fairness 10 times in order to determine whether it is fair. Your observations from the experiment will fall under one of the following cases:

Case 1: observing 5 heads and 5 tails.

Case 2: observing h heads and 10−h tails, where h ≠ 10−h.

BAYESIAN CONCEPT LEARNING

If case 1 is observed, you are now more certain that the coin is a fair coin, and you will decide that the probability of observing heads is 0.5 with more confidence.

If case 2 is observed, you can either:

1. Neglect your prior beliefs, since you now have new data, and decide that the probability of observing heads is h/10, relying solely on the recent observations.

2. Adjust your beliefs according to the value of h that you have just observed, and decide the probability of observing heads by combining your prior beliefs with the recent observations.

BAYESIAN CONCEPT LEARNING

The first method is the frequentist approach, where we omit our prior beliefs when making decisions. However, the second method seems more reasonable, because tossing a coin 10 times is insufficient to determine its fairness. Therefore, we can make better decisions by combining our recent observations with the beliefs we have gained through past experience.

It is this thinking model, which uses our most recent observations together with our prior beliefs, that is known as Bayesian thinking.

BAYESIAN CONCEPT LEARNING

Moreover, assume that we are allowed to conduct another 10 coin flips. Then we can use these new observations to further update our beliefs. As we gain more data, we can incrementally update our beliefs, increasing the certainty of our conclusions. This is known as incremental learning, where you update your knowledge incrementally with new evidence; a small sketch of such an update follows.
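The sketch below illustrates incremental updating for the coin example. It assumes a Beta prior over the probability of heads; the Beta distribution is not introduced in the slides and is used here only because it makes the update a one-line rule (the specific counts are made up for illustration):

```python
# Incremental Bayesian updating of a coin's bias with an assumed Beta(a, b) prior.
def update(a, b, heads, tails):
    """Posterior Beta parameters after observing additional coin flips."""
    return a + heads, b + tails

a, b = 2, 2                               # prior belief: the coin is roughly fair
a, b = update(a, b, heads=6, tails=4)     # first batch of 10 flips
print(a / (a + b))                        # posterior mean ~ 0.571
a, b = update(a, b, heads=5, tails=5)     # another batch of 10 flips
print(a / (a + b))                        # posterior mean ~ 0.542, belief refined again
```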

BAYESIAN CONCEPT LEARNING

Bayesian learning comes into play on such occasions, where we are unable to use frequentist statistics due to the drawbacks discussed above. Bayesian learning addresses these drawbacks and provides additional capabilities (such as incremental updates of the posterior) when testing a hypothesis to estimate the unknown parameters of a machine learning model. Bayesian learning uses Bayes' theorem to determine the conditional probability of a hypothesis given some evidence or observations.

MAXIMUM A POSTERIORI (MAP)

We can use MAP to determine the most probable hypothesis from a set of hypotheses. According to MAP, the hypothesis that has the maximum posterior probability is considered the valid hypothesis. Therefore, we can express the hypothesis θMAP concluded using MAP as follows:

θMAP = argmaxθi P(θi | X)

The argmaxθ operator selects the hypothesis θi that maximizes the posterior probability P(θi|X). Let us apply MAP to the above example in order to determine the true hypothesis:

MAXIMUM A POSTERIORI (MAP)

The figure illustrates how the posterior probabilities of the possible hypotheses change with the value of the prior probability. Unlike frequentist statistics, where our belief or past experience has no influence on the concluded hypothesis, Bayesian learning is capable of incorporating our belief to improve the accuracy of predictions.

MAXIMUM A POSTERIORI (MAP)

Assuming that we have fairly good programmers, and therefore that the prior probability of observing a bug is P(θ) = 0.4, we then find θMAP:
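The slide's full worked numbers are not reproduced here; the sketch below only illustrates the mechanics of the MAP computation. The prior P(bug) = 0.4 comes from the slide, but the likelihood values P(X|θ) are hypothetical placeholders:

```python
# Hypothetical MAP sketch for two hypotheses about a piece of code.
prior = {"bug": 0.4, "no bug": 0.6}            # prior from the slide
likelihood = {"bug": 0.2, "no bug": 0.9}       # assumed P(X = test passes | θ)

# Unnormalized posterior P(θ|X) ∝ P(X|θ) P(θ)
unnorm = {h: likelihood[h] * prior[h] for h in prior}
evidence = sum(unnorm.values())
posterior = {h: unnorm[h] / evidence for h in unnorm}

theta_map = max(posterior, key=posterior.get)
print(posterior)     # {'bug': 0.129..., 'no bug': 0.870...}
print(theta_map)     # 'no bug' is the MAP hypothesis under these assumed numbers
```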

NAIVE BAYES CLASSIFIER

The Naive Bayes classifier separates data into different classes according to Bayes' Theorem, along with the assumption that all the predictors are independent of one another. It assumes that a particular feature in a class is not related to the presence of other features.

NAIVE BAYES CLASSIFIER - EXAMPLE

You can consider a fruit to be a watermelon if it is green, round and has a 10-inch diameter. These features could depend on each other for their existence, but each one of them independently contributes to the probability that the fruit under consideration is a watermelon. That is why this classifier has the term 'Naive' in its name.

NAIVE BAYES CLASSIFIER

This algorithm is quite popular because it can even outperform highly advanced classification techniques. Moreover, it is quite simple, and you can build it quickly.

Here is the Bayes theorem, which is the basis for this algorithm:

P(A|B) = P(B|A) ⋅ P(A) / P(B)

In this equation, 'A' stands for the class and 'B' stands for the attributes. P(A|B) is the posterior probability of the class given the predictor. P(B) is the prior probability of the predictor, and P(A) is the prior probability of the class. P(B|A) is the probability of the predictor given the class.

NAIVE BAYES CLASSIFIER

To understand how Naive Bayes works, we should discuss an example. Suppose we want to find stolen cars and have the following dataset:

Serial No.   Color    Type     Origin     Was it Stolen?
1            Red      Sports   Domestic   Yes
2            Red      Sports   Domestic   No
3            Red      Sports   Domestic   Yes
4            Yellow   Sports   Domestic   No
5            Yellow   Sports   Imported   Yes
6            Yellow   SUV      Imported   No
7            Yellow   SUV      Imported   Yes
8            Yellow   SUV      Domestic   No
9            Red      SUV      Imported   No
10           Red      Sports   Imported   Yes

NAIVE BAYES CLASSIFIER

It assumes that every feature is independent. For example, the color 'Yellow' of a car has nothing to do with its Origin or Type.

It gives every feature the same level of importance. For example, knowing only the Color and Origin would not be enough to predict the outcome correctly; every feature is equally important and contributes equally to the result.

NAIVE BAYES CLASSIFIER

Now, with our dataset, we have to classify whether thieves will steal a car based on its features. Each row is an individual entry, and the columns represent the features of each car. In the first row, we have a stolen Red Sports car of Domestic origin. We will find out whether thieves would steal a Red Domestic SUV or not (our dataset does not have an entry for a Red Domestic SUV).

NAIVE BAYES CLASSIFIER

We can rewrite Bayes' Theorem for our example as:

P(y | X) = [P(X | y) P(y)] / P(X)

Here, y stands for the class variable (Was it Stolen?), which indicates whether the car was stolen given the conditions, and X stands for the features:

X = (x1, x2, x3, …, xn)

Now we substitute X and expand using the chain rule together with the naive independence assumption to get the following:

P(y | x1, …, xn) = [P(x1 | y) P(x2 | y) … P(xn | y) P(y)] / [P(x1) P(x2) … P(xn)]

NAIVE BAYES CLASSIFIER

You can get the value of each term from the dataset and substitute it into the equation. The denominator remains the same for every entry in the dataset, so we can drop it and work with proportionality:

P(y | x1, …, xn) ∝ P(y) Π(i=1 to n) P(xi | y)

In our example, y has only two outcomes, yes or no, and we pick the class with the larger score:

y = argmax_y P(y) Π(i=1 to n) P(xi | y)

NAIVE BAYES CLASSIFIER

We can create a frequency table for each feature to help calculate the posterior probability P(y|x). Then we convert the frequency tables into likelihood tables and use the Naive Bayes equation to find each class's posterior probability. The prediction is the class with the highest posterior probability. Here are the frequency and likelihood tables:

NAIVE BAYES CLASSIFIER

Frequency Table of Color:

Color     Was it Stolen (Yes)   Was it Stolen (No)
Red               3                     2
Yellow            2                     3

Likelihood Table of Color:

Color     Was it Stolen P(Yes)   Was it Stolen P(No)
Red               3/5                    2/5
Yellow            2/5                    3/5

NAIVE BAYES CLASSIFIER

Frequency Table of Type:

Type      Was it Stolen (Yes)   Was it Stolen (No)
Sports            4                     2
SUV               1                     3

Likelihood Table of Type:

Type      Was it Stolen P(Yes)   Was it Stolen P(No)
Sports            4/5                    2/5
SUV               1/5                    3/5

NAIVE BAYES CLASSIFIER

Frequency Table of Origin:

Origin     Was it Stolen (Yes)   Was it Stolen (No)
Domestic           2                     3
Imported           3                     2

Likelihood Table of Origin:

Origin     Was it Stolen P(Yes)   Was it Stolen P(No)
Domestic           2/5                    3/5
Imported           3/5                    2/5

NAIVE BAYES CLASSIFIER

Our problem has 3 predictors in X, so according to the equations above, the posterior probability P(Yes | X) is proportional to:

P(Yes | X) ∝ P(Red | Yes) × P(SUV | Yes) × P(Domestic | Yes) × P(Yes)

Since P(Yes) = P(No) = 5/10, the prior is the same for both classes and cancels in the comparison, so we compare the likelihood products directly:

P(Red | Yes) × P(SUV | Yes) × P(Domestic | Yes) = 3/5 × 1/5 × 2/5 = 0.048

NAIVE BAYES CLASSIFIER

Similarly, for P(No | X):

P(Red | No) × P(SUV | No) × P(Domestic | No) = 2/5 × 3/5 × 3/5 = 0.144

Since the score for 'No' (0.144) is higher than the score for 'Yes' (0.048), our Red Domestic SUV gets 'No' in the 'Was it Stolen?' column: the model predicts it would not be stolen.
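To make the arithmetic easy to reproduce, here is a minimal sketch (an addition, not part of the slides) that computes the same likelihood products directly from the ten-row dataset:

```python
from fractions import Fraction

# (Color, Type, Origin, Stolen) rows copied from the dataset above
data = [
    ("Red", "Sports", "Domestic", "Yes"), ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"), ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"), ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"), ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"), ("Red", "Sports", "Imported", "Yes"),
]

def likelihood(feature_index, value, label):
    """P(feature = value | class = label), estimated by counting rows."""
    rows = [r for r in data if r[3] == label]
    return Fraction(sum(r[feature_index] == value for r in rows), len(rows))

query = ("Red", "SUV", "Domestic")          # the car we want to classify
scores = {}
for label in ("Yes", "No"):
    score = Fraction(1)
    for i, value in enumerate(query):
        score *= likelihood(i, value, label)
    scores[label] = score                   # the equal priors P(Yes)=P(No) cancel

print({k: float(v) for k, v in scores.items()})  # {'Yes': 0.048, 'No': 0.144}
print(max(scores, key=scores.get))               # 'No'
```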
