6.1 Bayesian Learning
Instructor
Dr. Sanjay Chatterji
Importance
● Among the most practical approaches to certain
types of problems.
● Provides useful perspective for understanding
many learning algorithms.
− Find-S
− Candidate Elimination Algorithm
− Neural Network
● Provides a Bayesian perspective on Ockham’s Razor
Features of Bayesian learning methods
● Each observed training example can incrementally
decrease or increase the estimated probability that a
hypothesis is correct.
● Prior knowledge can be combined with observed data
to determine the final probability of a hypothesis.
● Bayesian methods can accommodate hypotheses that
make probabilistic predictions.
● New instances can be classified by combining the
predictions of multiple hypotheses (a sketch of these ideas follows this list).
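A minimal Python sketch of the first and last points above; the hypothesis space, priors, and toy data are illustrative assumptions, not part of the slides. The posterior over a small hypothesis space is updated incrementally as each training example arrives, and a new instance is classified by combining the weighted predictions of all hypotheses.

# Sketch: incremental Bayesian updating over a tiny, made-up hypothesis space.
# Each hypothesis assigns P(label = 1 | x); priors and data are illustrative.
hypotheses = {
    "h1": lambda x: 0.9,   # predicts label 1 with high probability
    "h2": lambda x: 0.2,   # predicts label 1 with low probability
}
posterior = {"h1": 0.5, "h2": 0.5}   # prior P(h), before any data is seen

def update(posterior, x, label):
    # One incremental step: multiply each P(h) by P(example | h), then renormalize.
    unnorm = {}
    for h, predict in hypotheses.items():
        p1 = predict(x)
        unnorm[h] = posterior[h] * (p1 if label == 1 else 1.0 - p1)
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}

def classify(posterior, x):
    # Combine the probabilistic predictions of all hypotheses, weighted by P(h | D).
    return sum(posterior[h] * hypotheses[h](x) for h in hypotheses)

for x, label in [(None, 1), (None, 1), (None, 0)]:   # toy training examples
    posterior = update(posterior, x, label)          # each example shifts P(h)
    print(posterior)

print(classify(posterior, None))   # combined P(label = 1) for a new instance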
Bayes Theorem
● In machine learning, we try to determine the best
hypothesis from some hypothesis space H, given the
observed training data D.
● The best hypothesis means the most probable
hypothesis, given the data D plus any initial knowledge
about the prior probabilities of the various hypotheses
in H.
● Bayes theorem, stated below, calculates the probability
of a hypothesis from its prior probability, the probability
of observing the data given the hypothesis, and the
observed data itself.
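For reference, the theorem itself in the P(·) notation used on these slides, where P(h) is the prior probability of hypothesis h, P(D|h) the probability of observing data D given h, and P(h|D) the posterior probability of h after seeing D:

P(h|D) = P(D|h) * P(h) / P(D)

The most probable ("best") hypothesis is then h_MAP = argmax over h in H of P(D|h) * P(h), since P(D) is the same for every hypothesis.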
Notations
P(Norman late) = P(Norman late | train strike) * P(train strike)
               + P(Norman late | ¬train strike) * P(¬train strike)
               = (0.8 * 0.1) + (0.1 * 0.9)
               = 0.08 + 0.09 = 0.17
Bayesian Belief Network – Example
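A minimal Python sketch of such a network, assuming two nodes (Train strike → Norman late) and the probabilities from the slide above; the variable names and code structure are illustrative, not from the original slides.

# Two-node belief network: Train strike -> Norman late.
# Numbers come from the worked example above; the code itself is an
# illustrative sketch.
p_strike = 0.1                             # P(train strike)
p_late_given = {True: 0.8, False: 0.1}     # P(Norman late | train strike)

# Marginalise out the parent node to get P(Norman late).
p_late = sum(p_late_given[s] * (p_strike if s else 1 - p_strike)
             for s in (True, False))
print(p_late)                              # 0.17

# Diagnostic reasoning with Bayes theorem: P(train strike | Norman late).
print(round(p_late_given[True] * p_strike / p_late, 3))   # 0.471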
Thank You