
Module 4

Bayesian learning is a probabilistic approach to machine learning that uses Bayes' Theorem to
update the probability of hypotheses as new data is observed, combining prior knowledge and
evidence to make informed predictions and decisions.

It treats learning as a process of probabilistic inference: the goal is to estimate the posterior distribution over hypotheses given the observed data, combining prior beliefs with the likelihood of the data under those hypotheses.

 Bayes' Theorem: Central to Bayesian learning, it mathematically relates the prior probability of a hypothesis P(h), the likelihood of the observed data given the hypothesis P(D|h), and the posterior probability P(h|D), the updated belief about the hypothesis after seeing the data:

P(h|D) = P(D|h) · P(h) / P(D)

 Prior Probability: Represents initial beliefs about hypotheses before seeing data.

 Likelihood: Probability of observing the data assuming a particular hypothesis is true.

 Posterior Probability: Updated probability of the hypothesis after considering the data.

 Hypothesis Space: The set of all possible models or explanations considered.

How Bayesian Learning Works:

 Starts with prior beliefs about hypotheses.

 Observes data and calculates how likely this data is under each hypothesis.

 Updates beliefs to form posterior probabilities.

 Uses posterior to make predictions or decisions.

 Allows combining prior knowledge with new evidence flexibly.
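
A minimal sketch of these steps in Python, assuming a toy discrete hypothesis space (the candidate hypotheses and observations below are hypothetical, purely for illustration):

```python
# Bayesian updating over a small discrete hypothesis space.
# Hypotheses: three candidate biases of a coin (hypothetical example).
hypotheses = [0.3, 0.5, 0.8]                              # P(heads) under each hypothesis
prior = {h: 1 / len(hypotheses) for h in hypotheses}      # uniform prior beliefs

def likelihood(h, outcome):
    """Probability of one observed flip under hypothesis h."""
    return h if outcome == "H" else 1 - h

data = ["H", "H", "T", "H"]                               # observed evidence

posterior = dict(prior)
for outcome in data:
    # multiply the current belief by the likelihood of the new observation
    posterior = {h: posterior[h] * likelihood(h, outcome) for h in hypotheses}
    # normalize so the probabilities sum to 1
    total = sum(posterior.values())
    posterior = {h: p / total for h, p in posterior.items()}

print(posterior)   # updated beliefs; predictions can use the full posterior
```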

Advantages:

 Incorporates prior knowledge explicitly.

 Provides a probabilistic framework that quantifies uncertainty.

 Can combine predictions from multiple hypotheses weighted by their probabilities.

 Offers a principled approach to decision-making under uncertainty.

 Robust to overfitting compared to some other methods because it considers distributions over parameters rather than point estimates.

Example

 Suppose a drug test is 98% accurate:

 If a person uses the drug, the test correctly detects it 98% of the time (true positive
rate).

 If a person does not use the drug, the test correctly shows negative 98% of the time
(true negative rate).

 The prevalence of drug use in the population is 0.5% (0.005 probability).

 A person tests positive. What is the probability that this person actually uses the drug?

Solution

Step 1: Define the events

 A: Person uses the drug.

 B: Test result is positive.

Step 2: Identify the known probabilities

 P(A) = 0.005 (0.5% of people use the drug)

 P(¬A) = 1 - P(A) = 0.995 (person does not use the drug)

 P(B|A) = 0.98 (test is positive if person uses the drug)

 P(B|¬A) = 0.02 (false positive rate: test is positive even if person does not use the drug)

Step 3: Calculate the total probability of testing positive, P(B)

This includes both true positives and false positives:

P(B) = P(B|A) · P(A) + P(B|¬A) · P(¬A) = 0.98 × 0.005 + 0.02 × 0.995 = 0.0049 + 0.0199 = 0.0248

Step 4: Apply Bayes' Theorem to find P(A|B)

P(A|B) = P(B|A) · P(A) / P(B) = 0.0049 / 0.0248 ≈ 0.198

So despite the test being 98% accurate, a person who tests positive has only about a 19.8% chance of actually using the drug, because drug use is rare in the population.
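
The same calculation as a quick Python check:

```python
# Bayes' Theorem for the drug-test example
p_use = 0.005               # P(A): prevalence of drug use
p_pos_given_use = 0.98      # P(B|A): true positive rate
p_pos_given_clean = 0.02    # P(B|¬A): false positive rate

# Total probability of a positive test (true positives + false positives)
p_pos = p_pos_given_use * p_use + p_pos_given_clean * (1 - p_use)

# Posterior probability of drug use given a positive test
p_use_given_pos = p_pos_given_use * p_use / p_pos

print(f"P(B)   = {p_pos:.4f}")            # 0.0248
print(f"P(A|B) = {p_use_given_pos:.4f}")  # ≈ 0.1976
```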


Reasons for learning Bayesian algorithms

1. To calculate explicit probabilities for hypotheses
2. To provide a useful perspective for understanding learning algorithms that do not explicitly manipulate probabilities
3. To understand why algorithms that minimize the mean squared error (such as neural network training) can be shown to output maximum likelihood hypotheses

Key features

 Incremental updating of hypothesis probabilities: Each observed training example can increase or decrease the estimated probability that a hypothesis is correct, allowing more flexible learning than methods that discard hypotheses after a single inconsistency.

 Incorporation of prior knowledge: Bayesian learning combines prior probabilities (previous knowledge or beliefs about hypotheses) with observed data to compute the posterior probability of hypotheses.

 Probabilistic predictions: It can handle hypotheses that make probabilistic, rather than deterministic, predictions about data.

 Combining multiple hypotheses: New instances can be classified by aggregating predictions from multiple hypotheses weighted by their posterior probabilities, rather than relying on a single best hypothesis.

 Optimal decision-making benchmark: Even when computationally expensive or intractable, Bayesian methods provide a theoretical standard of optimal decision-making against which other practical algorithms can be measured.

 Handling uncertainty and sparse data: Bayesian learning explicitly models uncertainty and is particularly effective when data is limited or noisy, as it integrates prior knowledge and evidence probabilistically.

 Computational considerations: Bayesian methods may require significant computational resources and prior probability estimates, which can be challenging but sometimes reducible in special cases.

Limitations

1. Typically requires initial knowledge of many probabilities

2. Significant computational cost is required to determine the Bayes optimal hypothesis

The Brute Force MAP (Maximum A Posteriori) Learning algorithm is a straightforward Bayesian
concept learning method that finds the most probable hypothesis given training data by exhaustively
evaluating all hypotheses in a finite hypothesis space.

Key assumptions:

 Training data is noise-free (no mislabelling).

 Target concept is contained in the hypothesis space.

 Prior probabilities are uniform (no hypothesis favoured a priori).
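
Under these assumptions, the posterior takes a particularly simple form (the standard result for this setting): every hypothesis consistent with the training data receives the same posterior probability, and every inconsistent hypothesis receives zero:

P(h|D) = 1 / |VS_H,D| if h is consistent with D, and P(h|D) = 0 otherwise

where VS_H,D, the version space, is the subset of hypotheses in H that are consistent with D. With uniform priors and noise-free data, every consistent hypothesis is therefore a MAP hypothesis.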


The MAP hypothesis (Maximum A Posteriori hypothesis) is the hypothesis that maximizes
the posterior probability given the observed data. In other words, it is the most probable hypothesis
after combining both the likelihood of the data under the hypothesis and the prior belief about the
hypothesis.

Mathematically, it is defined as:

h_MAP = argmax over h in H of P(h|D)
      = argmax over h in H of P(D|h) · P(h) / P(D)
      = argmax over h in H of P(D|h) · P(h)    (P(D) is constant across hypotheses)

where:

 h is a hypothesis,

 D is the observed data,

 P(D∣h) is the likelihood of the data given hypothesis h

 P(h) is the prior probability of hypothesis h

 P(h∣D) is the posterior probability of h given the data.


The MAP hypothesis is the best guess for the true hypothesis after considering both the observed
data and prior beliefs.
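
A minimal brute-force MAP sketch in Python; the hypothesis space, prior, and likelihood used here are hypothetical placeholders, purely to illustrate the exhaustive evaluation:

```python
import math

def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the MAP hypothesis by evaluating every h in a finite hypothesis space.

    hypotheses : iterable of candidate hypotheses
    prior      : function h -> P(h)
    likelihood : function (data, h) -> P(D|h)
    data       : the observed training data D
    """
    best_h, best_score = None, -math.inf
    for h in hypotheses:
        # P(h|D) is proportional to P(D|h) * P(h); P(D) is the same for every h
        score = likelihood(data, h) * prior(h)
        if score > best_score:
            best_h, best_score = h, score
    return best_h

# Hypothetical usage: which coin bias best explains 8 heads out of 10 flips?
hypotheses = [0.2, 0.5, 0.8]
prior = lambda h: 1 / 3                                          # uniform prior
likelihood = lambda d, h: h ** d["heads"] * (1 - h) ** d["tails"]
print(brute_force_map(hypotheses, prior, likelihood, {"heads": 8, "tails": 2}))
```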

NAÏVE-BAYES CLASSIFIER

The Naive Bayes classifier is a simple and popular supervised machine learning algorithm used for classification tasks, such as text classification or spam detection. It is based on Bayes' Theorem and assumes that all features (predictors) are conditionally independent given the class label; this is called the naive independence assumption.

The predicted class is the one with the highest posterior probability:

ŷ = argmax over C_k of P(C_k) · P(x_1|C_k) · P(x_2|C_k) · ... · P(x_n|C_k)

where:

 P(C_k) is the prior probability of class C_k.

 P(x_i|C_k) is the likelihood of feature x_i given class C_k.

 The denominator P(x_1, x_2, ..., x_n) is constant for all classes and is therefore omitted in classification.

Numerical example below
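
As an illustration, a minimal sketch in Python on a hypothetical two-feature dataset (the data and counts are invented for demonstration):

```python
# Naive Bayes on a tiny hypothetical dataset: predict Play from Outlook and Wind.
data = [
    ("Sunny", "Weak",   "No"),
    ("Sunny", "Strong", "No"),
    ("Rain",  "Weak",   "Yes"),
    ("Rain",  "Strong", "No"),
    ("Sunny", "Weak",   "Yes"),
    ("Rain",  "Weak",   "Yes"),
]

def naive_bayes_predict(outlook, wind):
    scores = {}
    for c in ("Yes", "No"):
        rows = [r for r in data if r[2] == c]
        prior = len(rows) / len(data)                                 # P(C_k)
        p_outlook = sum(r[0] == outlook for r in rows) / len(rows)   # P(x1|C_k)
        p_wind = sum(r[1] == wind for r in rows) / len(rows)         # P(x2|C_k)
        scores[c] = prior * p_outlook * p_wind                       # naive independence
    return max(scores, key=scores.get), scores

# For Outlook=Sunny, Wind=Weak: score(Yes)=0.5*1/3*1≈0.167, score(No)=0.5*2/3*1/3≈0.111
print(naive_bayes_predict("Sunny", "Weak"))
```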


BBN(Bayesian belief network)

A Bayesian Belief Network (BBN) is like a smart map that shows how different things (variables) are
connected and influence each other, using probabilities. Imagine you want to understand how
weather, traffic, and being late to work are related. A BBN draws arrows from one factor to another
to show which causes which, and uses numbers (probabilities) to express how likely things are given
other things.

In simple terms:

 It’s a diagram made of nodes and arrows; each node is a variable (like "Rain" or "Traffic
Jam").

 The arrows show cause-and-effect relationships (e.g., rain can cause traffic jams).

 Each node has a table that tells you the chance of that variable happening given its causes.

 When you get new information (like it’s raining), the network updates the chances of related
events (like traffic jams or being late).

This helps in making decisions or predictions when things are uncertain by combining what you
know with new evidence.

Mathematical definition

The joint probability distribution over all variables X_1, X_2, ..., X_n in a Bayesian Belief Network (BBN) can be expressed as the product of the conditional probabilities of each variable given its parents in the network:

P(X_1, X_2, ..., X_n) = P(X_1|Parents(X_1)) · P(X_2|Parents(X_2)) · ... · P(X_n|Parents(X_n))

Why this split helps:

 Instead of calculating the probability of every possible combination of all variables together
(which grows exponentially and becomes infeasible for many variables),

 The Bayesian network breaks down the joint probability into smaller, manageable pieces.

 Each variable depends only on its parent variables (the nodes with arrows pointing to it).

 By multiplying these conditional probabilities, you get the overall joint probability.
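
A minimal sketch of this factorization in Python, using the Rain → Traffic Jam → Late-to-work chain from the earlier description (the network structure and CPT values are assumed here purely for illustration):

```python
# Joint probability via the BBN factorization:
# P(Rain, Traffic, Late) = P(Rain) * P(Traffic | Rain) * P(Late | Traffic)
# Assumed structure: Rain -> Traffic Jam -> Late to Work; CPT values are made up.
p_rain = {True: 0.2, False: 0.8}                        # P(Rain)
p_traffic_given_rain = {True: 0.7, False: 0.1}          # P(Traffic=True | Rain)
p_late_given_traffic = {True: 0.6, False: 0.05}         # P(Late=True | Traffic)

def joint(rain, traffic, late):
    pt = p_traffic_given_rain[rain] if traffic else 1 - p_traffic_given_rain[rain]
    pl = p_late_given_traffic[traffic] if late else 1 - p_late_given_traffic[traffic]
    return p_rain[rain] * pt * pl

# Probability that it rains, there is a traffic jam, and you are late:
print(joint(rain=True, traffic=True, late=True))        # 0.2 * 0.7 * 0.6 = 0.084

# Updating with evidence by enumeration: P(Traffic=True | Rain=True)
num = sum(joint(True, True, late) for late in (True, False))
den = sum(joint(True, t, late) for t in (True, False) for late in (True, False))
print(num / den)                                        # 0.7
```
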
Gradient ascent training of Bayesian networks is an optimization method that updates the
conditional probabilities in the network by following the gradient of the log-likelihood of observed
data, improving the network’s fit to data iteratively.

 Bayesian network with a fixed structure (nodes and edges).

 Each node has a conditional probability table (CPT), where each entry w_ijk represents the probability that variable Y_i takes its j-th value y_ij given that its parents U_i take the k-th combination of values u_ik.

 The goal is to find the set of w_ijk values that maximizes the probability of the observed training data D, i.e., maximize P(D|h), where h represents the hypothesis defined by the CPTs.
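
As a sketch (following the usual derivation for this setting; the notation may differ slightly from the course notes), gradient ascent moves each CPT entry in the direction of the gradient of ln P(D|h) and then renormalizes:

w_ijk ← w_ijk + η · Σ over d in D of [ P(Y_i = y_ij, U_i = u_ik | d) / w_ijk ]

where η is a small learning rate and P(Y_i = y_ij, U_i = u_ik | d) is computed by inference in the current network for each training example d. After each step, the entries w_ijk for a fixed i and k are renormalized so that they sum to 1 and remain valid probabilities.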
Derivation
Example
