
National Technical University of Ukraine

“Igor Sikorsky Kyiv Polytechnic Institute”


Institute of Physics and Technology

Lecture 3
Generative Models for Real-World Systems.
Bayesian Concept

Dmytro Progonov,
PhD, Associate Professor
Content
• Bayesian concept learning;
• Prior and posterior processing;
• Naïve Bayes classifier;
• Bayesian approach applications.

Bayesian concept learning
[Diagram: a concept C (e.g. prime numbers) generates an initial dataset 𝒟 = {x₁, x₂, …, x_N};
the observer guesses (True / False) whether the next element x_{N+1} belongs to 𝒟, i.e. to the
same concept.]

Prior → Guessing → Posterior:

    p(y = c | 𝐱, 𝛉) ∝ p(𝐱 | y = c, 𝛉) · p(y = c | 𝛉),

where p(y = c | 𝐱, 𝛉) is the observer's answer, y = c is the supposed class, 𝐱 are the
observed features, and 𝛉 are the unknown parameters.
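As an illustration (not part of the original slide), below is a minimal Python sketch of this
Bayes' rule for a two-class problem; the likelihood and prior values are assumed toy numbers.

```python
# Minimal sketch of p(y = c | x, θ) ∝ p(x | y = c, θ) · p(y = c | θ) for two classes.
# The likelihood and prior values below are illustrative assumptions.
import numpy as np

prior = np.array([0.7, 0.3])          # p(y = c | θ) for classes c = 0, 1
likelihood = np.array([0.05, 0.20])   # p(x | y = c, θ) for the same observed x

unnormalized = likelihood * prior               # numerator of Bayes' rule
posterior = unnormalized / unnormalized.sum()   # normalize over the classes

print(posterior)   # ≈ [0.368, 0.632]
```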

Prior and posterior processing (1/4)
Likelihood (the size principle):

    p(𝒟 | h) = (1 / size(h))^N = (1 / |h|)^N,

where 𝒟 is the observed data, size(h) = |h| is the number of items in the extension of the
hypothesis h, and N is the number of items sampled (with replacement) from that extension.
The prior p(h) encodes which hypotheses the observer considers plausible before seeing 𝒟.
Occam’s razor:
The model favors the simplest (smallest) hypothesis consistent with the data.
Jeffreys–Lindley paradox:
Bayes-oriented decision systems will always favor the simpler model, since the
probability of the observed data under a complex model with a very diffuse prior will
be very small.
Posterior:

    p(h | 𝒟) = p(𝒟 | h) p(h) / Σ_{h′ ∈ ℋ} p(𝒟, h′)
             = [ p(h) 𝕀(𝒟 ∈ h) / |h|^N ] / [ Σ_{h′ ∈ ℋ} p(h′) 𝕀(𝒟 ∈ h′) / |h′|^N ],

where 𝕀(𝒟 ∈ h) is 1 if and only if all the data points lie in the extension of the hypothesis h.
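A small numeric sketch of this posterior, in the spirit of the "number game": the hypothesis
extensions, the prior weights, and the observed data below are assumptions chosen only to
illustrate the size principle.

```python
# Sketch of p(h | D) ∝ p(h) * I(D ⊆ h) / |h|^N for a tiny hypothesis space over 1..100.
# Hypotheses, prior and data are illustrative assumptions.
hypotheses = {
    "even":             set(range(2, 101, 2)),
    "powers of two":    {2, 4, 8, 16, 32, 64},
    "multiples of ten": set(range(10, 101, 10)),
}
prior = {"even": 0.5, "powers of two": 0.25, "multiples of ten": 0.25}

D = {16, 8, 2, 64}     # observed data, sampled (with replacement) from the true concept
N = len(D)

unnorm = {}
for name, ext in hypotheses.items():
    consistent = D <= ext                                       # indicator I(D in extension of h)
    unnorm[name] = prior[name] * (1.0 / len(ext)) ** N if consistent else 0.0

Z = sum(unnorm.values())
posterior = {name: p / Z for name, p in unnorm.items()}
print(posterior)   # "powers of two" dominates: its small extension gives a much larger likelihood
```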
Prior and posterior processing (2/4)
If we have enough data, the posterior p(h | 𝒟) becomes peaked at the maximum a posteriori
(MAP) estimate:

    p(h | 𝒟) → δ_{ĥ_MAP}(h),

where ĥ_MAP = argmax_h p(h | 𝒟) is the posterior mode, and

    δ_x(A) = { 1 if x ∈ A;  0 if x ∉ A }  is the Dirac measure.

Note that the MAP estimate can be written as

    ĥ_MAP = argmax_h p(𝒟 | h) p(h) = argmax_h [ log p(𝒟 | h) + log p(h) ].

As we get more and more data, the MAP estimate converges towards the maximum likelihood
estimate (MLE):

    ĥ_MLE = argmax_h p(𝒟 | h) = argmax_h log p(𝒟 | h).
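As an illustration of this convergence (not from the original slide), the sketch below uses the
beta-Bernoulli setting covered later in the lecture; the Beta(5, 5) prior and the true bias 0.7
are assumptions.

```python
# Sketch: with a Beta(a, b) prior on the heads probability, the estimates are
#   θ_MAP = (N1 + a - 1) / (N + a + b - 2)   and   θ_MLE = N1 / N,
# so the prior's influence (and the gap between them) shrinks as N grows.
# The prior a = b = 5 and the true bias 0.7 are illustrative assumptions.
import random

random.seed(0)
a, b = 5.0, 5.0
true_theta = 0.7

for N in (10, 100, 10_000):
    flips = [1 if random.random() < true_theta else 0 for _ in range(N)]
    N1 = sum(flips)
    theta_mle = N1 / N
    theta_map = (N1 + a - 1) / (N + a + b - 2)
    print(f"N={N:6d}  MLE={theta_mle:.3f}  MAP={theta_map:.3f}")
```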

Prior and posterior processing (3/4)
The way to test whether our beliefs are justified is to use them to predict objectively
observable quantities, using the posterior predictive distribution:

    p(x̃ ∈ C | 𝒟) = Σ_h p(y = 1 | x̃, h) p(h | 𝒟).

When we have a small and/or ambiguous dataset, the posterior p(h | 𝒟) is vague, which
induces a broad predictive distribution. However, once we have "figured things out", the
posterior becomes a delta function centered at the MAP estimate. In this case, we can use
the plug-in approximation:

    p(x̃ ∈ C | 𝒟) = Σ_h p(x̃ | h) δ_{ĥ}(h) = p(x̃ | ĥ).
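A minimal sketch contrasting the two predictions, reusing a tiny assumed hypothesis space as in
the earlier number-game example; all numbers are illustrative assumptions.

```python
# Sketch: posterior predictive p(x̃ ∈ C | D) = Σ_h p(x̃ ∈ h) p(h | D)  versus the
# plug-in approximation p(x̃ ∈ ĥ_MAP).  Hypotheses, prior and data are assumptions.
hypotheses = {
    "even":          set(range(2, 101, 2)),
    "powers of two": {2, 4, 8, 16, 32, 64},
}
prior = {"even": 0.5, "powers of two": 0.5}
D = {2, 8}
N = len(D)

unnorm = {h: prior[h] * (1 / len(ext)) ** N if D <= ext else 0.0
          for h, ext in hypotheses.items()}
Z = sum(unnorm.values())
post = {h: p / Z for h, p in unnorm.items()}

x_new = 12                                                        # even, but not a power of two
bma = sum(post[h] * (x_new in hypotheses[h]) for h in hypotheses)   # Bayesian model averaging
h_map = max(post, key=post.get)
plug_in = float(x_new in hypotheses[h_map])                          # delta at the MAP hypothesis

print(post)
print("model averaging:", bma, " plug-in:", plug_in)   # averaging hedges, plug-in says 0
```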


Prior and posterior processing (4/4)
In general, computing the marginal likelihood p(𝒟) needed for Bayesian model selection can be
quite difficult. One simple but popular approximation is known as the Bayesian information
criterion (BIC):

    BIC ≜ log p(𝒟 | θ̂) − (dof(θ̂) / 2) · log N ≈ log p(𝒟),

where θ̂ is the maximum likelihood estimate of the model parameters and dof(θ̂) is the number
of degrees of freedom (free parameters) in the model.

The BIC method is very closely related to the Minimum Description Length or MDL
principle, which characterizes the score of a model in terms of how well it fits the data,
minus how complex the model is to define.

A very similar expression to BIC / MDL is the Akaike information criterion (AIC):

    AIC(m, 𝒟) ≜ log p(𝒟 | θ̂_MLE) − dof(m).
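As a small illustration (not from the slide), the sketch below scores two models of coin-flip
data with BIC; the observed counts and the choice of models are assumptions.

```python
# Sketch: BIC ≜ log p(D | θ̂) - dof(θ̂)/2 · log N for two competing models of coin-flip data:
#   m0: fair coin, θ fixed at 0.5 (0 free parameters);
#   m1: biased coin, θ fitted by maximum likelihood (1 free parameter).
# The observed counts below are an illustrative assumption.
import math

N1, N0 = 62, 38            # heads, tails
N = N1 + N0

def log_lik(theta):
    return N1 * math.log(theta) + N0 * math.log(1 - theta)

bic_fair   = log_lik(0.5)       - 0 / 2 * math.log(N)
theta_hat  = N1 / N
bic_biased = log_lik(theta_hat) - 1 / 2 * math.log(N)

print(f"BIC(fair)   = {bic_fair:.2f}")
print(f"BIC(biased) = {bic_biased:.2f}")   # the higher BIC approximates the higher log evidence
```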

Examples. Beta-binomial model (1/3)
Suppose Xᵢ ~ Ber(θ), where Xᵢ = 1 represents "heads", Xᵢ = 0 represents "tails", and
θ ∈ [0, 1] is the rate parameter (probability of heads). If the data are iid, the likelihood
has the form

    p(𝒟 | θ) = θ^N₁ (1 − θ)^N₀,

where N₁ = Σ_{i=1}^N 𝕀(xᵢ = 1) is the number of heads, N₀ = Σ_{i=1}^N 𝕀(xᵢ = 0) is the number
of tails, and N = N₀ + N₁ is the number of observed trials. In this case N₁ ~ Bin(N, θ), which
has the pmf

    Bin(k | n, θ) = C(n, k) θ^k (1 − θ)^(n−k).

Since the binomial coefficient C(n, k) is a constant independent of θ, the likelihood of the
binomial sampling model is (up to this constant) the same as the likelihood of the Bernoulli
model: any inference we make about θ will be the same whether we observe the counts
𝒟 = (N₁, N) or the sequence of trials 𝒟 = {x₁, …, x_N}.
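A quick numerical check of this equivalence (an added illustration; the toy flip sequence is an
assumption): the two likelihoods differ only by the θ-independent factor C(N, N₁).

```python
# Sketch: the Bernoulli-sequence likelihood θ^N1 (1-θ)^N0 and the binomial count
# likelihood C(N, N1) θ^N1 (1-θ)^N0 differ only by a θ-independent constant,
# so their ratio is the same for every θ.  The toy sequence is an assumption.
from math import comb

x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]        # observed coin flips
N1, N0 = sum(x), len(x) - sum(x)
N = N1 + N0

for theta in (0.3, 0.5, 0.7):
    seq_lik = theta ** N1 * (1 - theta) ** N0
    bin_lik = comb(N, N1) * seq_lik
    print(f"theta={theta}: ratio binomial/sequence = {bin_lik / seq_lik}")   # always C(N, N1) = 120
```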

Examples. Beta-binomial model (2/3)
To make the math easier, it would be convenient if the prior had the same form as the
likelihood:

    p(θ) ∝ θ^γ₁ (1 − θ)^γ₀

for some prior parameters γ₁ and γ₀. Then we could easily evaluate the posterior by simply
adding the exponents:

    p(θ | 𝒟) ∝ p(𝒟 | θ) p(θ) = θ^N₁ (1 − θ)^N₀ · θ^γ₁ (1 − θ)^γ₀ = θ^(N₁+γ₁) (1 − θ)^(N₀+γ₀).

When the prior and the posterior have the same form, we say that the prior is a conjugate
prior for the corresponding likelihood. In the case of the Bernoulli likelihood, the conjugate
prior is the beta distribution:

    Beta(θ | a, b) ∝ θ^(a−1) (1 − θ)^(b−1).

The posterior is then

    p(θ | 𝒟) ∝ Bin(N₁ | θ, N₀ + N₁) · Beta(θ | a, b) ∝ Beta(θ | N₁ + a, N₀ + b).
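A minimal sketch of this conjugate update, checked against a grid-normalized posterior; the
hyperparameters a = b = 2 and the counts are assumptions, and SciPy is assumed to be available.

```python
# Sketch of the conjugate update for the Bernoulli likelihood:
#   Beta(θ | a, b) prior + (N1 heads, N0 tails)  ->  Beta(θ | N1 + a, N0 + b) posterior.
# The hyperparameters a = 2, b = 2 and the counts are illustrative assumptions.
import numpy as np
from scipy.stats import beta   # assumes SciPy is available

a, b = 2.0, 2.0
N1, N0 = 7, 3

post = beta(N1 + a, N0 + b)    # closed-form conjugate posterior Beta(θ | N1+a, N0+b)

# Numerical check: normalize likelihood * prior on a grid of θ values
theta = np.linspace(1e-6, 1 - 1e-6, 2001)
dtheta = theta[1] - theta[0]
unnorm = theta**N1 * (1 - theta)**N0 * beta(a, b).pdf(theta)
grid_post = unnorm / (unnorm.sum() * dtheta)

print("posterior mean (closed form):", post.mean())                        # (N1+a)/(N+a+b) ≈ 0.643
print("posterior mean (grid):       ", (theta * grid_post).sum() * dtheta)
```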

Examples. Beta-binomial model (3/3)
Consider predicting the probability of heads in a single future trial under a Beta(a, b)
posterior:

    p(x̃ = 1 | 𝒟) = ∫₀¹ p(x = 1 | θ) p(θ | 𝒟) dθ = ∫₀¹ θ · Beta(θ | a, b) dθ = 𝔼[θ | 𝒟] = a / (a + b).

Zero-count (sparse data) problem – occurs when estimating counts from a small amount of data.

Black swan paradox – the problem of how to draw general conclusions about the future from
specific observations of the past.

Suppose now we are interested in predicting the number of heads, x, in M future trials:

    p(x | 𝒟, M) = ∫₀¹ Bin(x | θ, M) Beta(θ | a, b) dθ
                = C(M, x) · (1 / B(a, b)) ∫₀¹ θ^x (1 − θ)^(M−x) θ^(a−1) (1 − θ)^(b−1) dθ
                = C(M, x) · B(x + a, M − x + b) / B(a, b).
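The sketch below evaluates this (beta-binomial) posterior predictive directly from the formula
above; the posterior parameters a = 9, b = 5 (e.g. 7 heads and 3 tails under a Beta(2, 2) prior)
and M = 5 are assumptions.

```python
# Sketch of the (beta-)binomial posterior predictive
#   p(x | D, M) = C(M, x) * B(x + a, M - x + b) / B(a, b),
# with a, b the parameters of the Beta posterior.  The values a = 9, b = 5 and M = 5 are
# illustrative assumptions.
from math import comb, lgamma, exp

def log_beta_fn(a, b):
    """log B(a, b) = log Γ(a) + log Γ(b) - log Γ(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binom_pmf(x, M, a, b):
    return comb(M, x) * exp(log_beta_fn(x + a, M - x + b) - log_beta_fn(a, b))

a, b, M = 9.0, 5.0, 5
pmf = [beta_binom_pmf(x, M, a, b) for x in range(M + 1)]
print(pmf, sum(pmf))                       # probabilities of 0..M heads; they sum to 1

# Single future trial: p(x̃ = 1 | D) = E[θ | D] = a / (a + b)
print("p(heads next) =", a / (a + b))
```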

Naïve Bayes classifier
Let us classify a vector of discrete-valued features 𝐱 ∈ {1, …, K}^D, where K is the number of
values for each feature and D is the number of features. The simplest approach is to assume the
features are conditionally independent given the class label:

    p(𝐱 | y = c, 𝛉) = Π_{j=1}^D p(x_j | y = c, 𝛉_jc).

The model is called "naïve" since we do not expect the features to be independent, even
conditionally on the class label. One reason for the successful application of the naïve Bayes
classifier is that the model is quite simple, and hence it is relatively immune to overfitting.

A worked sketch for the binary-feature case is given after the table below.

Type of feature   Recommended class-conditional density
Real-valued       p(𝐱 | y = c, 𝛉) = Π_{j=1}^D 𝒩(x_j | μ_jc, σ²_jc)
Binary            p(𝐱 | y = c, 𝛉) = Π_{j=1}^D Ber(x_j | μ_jc)
Categorical       p(𝐱 | y = c, 𝛉) = Π_{j=1}^D Multinoulli(x_j | 𝛍_jc)
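A minimal from-scratch sketch of naïve Bayes for binary features (the Ber row above); the tiny
dataset and the add-one (Laplace) smoothing are assumptions made for the example.

```python
# Minimal naïve Bayes sketch for binary features:
#   p(x | y = c, θ) = Π_j Ber(x_j | μ_jc),  prediction via argmax_c p(y=c) Π_j p(x_j | y=c).
# Laplace (add-one) smoothing of the per-feature Bernoulli means and the toy data are assumptions.
import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0])
classes = np.unique(y)

# Fit: class priors and smoothed Bernoulli means μ_jc = p(x_j = 1 | y = c)
priors = np.array([(y == c).mean() for c in classes])
mu = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

def predict_log_proba(x):
    # log p(y=c) + Σ_j log Ber(x_j | μ_jc), then normalize over classes
    log_joint = np.log(priors) + (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1)
    return log_joint - np.logaddexp.reduce(log_joint)

x_new = np.array([1, 0, 0])
print(np.exp(predict_log_proba(x_new)))   # posterior over classes 0 and 1
```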
Bayesian approach applications
Hierarchical Bayes (a multi-level model) is based on putting a prior on the parameters of the
prior:

    η → θ → 𝒟.

Empirical Bayes violates the principle that the prior should be chosen independently of the
data:

    η̂ = argmax_η p(𝒟 | η) = argmax_η ∫ p(𝒟 | θ) p(θ | η) dθ.

Method               Definition
Maximum likelihood   θ̂ = argmax_θ p(𝒟 | θ)
MAP estimation       θ̂ = argmax_θ p(𝒟 | θ) p(θ | η)
Empirical Bayes      η̂ = argmax_η p(𝒟 | η)
Full Bayes           p(θ, η | 𝒟) ∝ p(𝒟 | θ) p(θ | η) p(η)
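As an added illustration, the sketch below performs empirical Bayes for several coins sharing a
Beta(a, b) prior, where the marginal likelihood of each coin has a closed beta-binomial form and
η = (a, b) is chosen by a grid search. The per-coin counts and the grid are assumptions.

```python
# Sketch of empirical Bayes η̂ = argmax_η p(D | η) = argmax_η ∫ p(D | θ) p(θ | η) dθ
# for several coins sharing a Beta(a, b) prior (η = (a, b)).  For each coin the integral
# has the closed form C(N_i, k_i) B(k_i + a, N_i - k_i + b) / B(a, b).
# The per-coin counts and the (a, b) search grid are illustrative assumptions.
from math import lgamma
import itertools

counts = [(7, 10), (6, 10), (8, 10), (3, 10), (7, 10)]   # (heads k_i, trials N_i) per coin

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(a, b):
    # log p(D | a, b) = Σ_i [ log C(N_i, k_i) + log B(k_i + a, N_i - k_i + b) - log B(a, b) ]
    return sum(
        lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
        + log_beta_fn(k + a, N - k + b) - log_beta_fn(a, b)
        for k, N in counts
    )

grid = [0.5 * i for i in range(1, 41)]                    # candidate values 0.5 .. 20.0
a_hat, b_hat = max(itertools.product(grid, grid), key=lambda ab: log_marginal(*ab))
print("empirical-Bayes prior: a =", a_hat, " b =", b_hat)
```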


Conclusion
• The concept of Bayesian learning was considered;
• Methods for processing prior and posterior
information were presented;
• Practical applications of the Bayesian approach
were shown.

