Lecture 3
Generative Models for Real-World Systems.
Bayesian Concept
Dmytro Progonov,
PhD, Associate Professor
Content
• Bayesian concept learning;
• Prior and posterior processing;
• Naïve Bayes classifier;
• Bayesian approach applications.
Observer’s false guessing
The posterior over hypotheses is obtained by restricting the prior to the hypotheses consistent with the data:

p(h|\mathcal{D}) = \frac{\mathbb{I}(\mathcal{D} \in h)\, p(h)}{\sum_{h' \in \mathcal{H}} \mathbb{I}(\mathcal{D} \in h')\, p(h')},

where 𝕀(𝒟 ∈ h) is 1 if and only if all the data are in the extension of the hypothesis h.
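A minimal sketch of this computation, assuming a toy hypothesis space of number concepts and a uniform prior (both are illustrative choices, not taken from the lecture):

# Bayesian concept learning with an indicator likelihood: p(h|D) ∝ I(D in h) * p(h).
# The hypothesis space, extensions, and prior below are illustrative assumptions.
hypotheses = {
    "even numbers":    {n for n in range(1, 101) if n % 2 == 0},
    "odd numbers":     {n for n in range(1, 101) if n % 2 == 1},
    "multiples of 10": set(range(10, 101, 10)),
    "powers of two":   {2, 4, 8, 16, 32, 64},
}
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}   # uniform p(h)

def posterior(data, hypotheses, prior):
    """p(h|D): zero out hypotheses inconsistent with the data, then renormalize."""
    unnorm = {h: (prior[h] if set(data) <= ext else 0.0) for h, ext in hypotheses.items()}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

print(posterior([10, 20, 60], hypotheses, prior))
# Only "even numbers" and "multiples of 10" remain consistent and share the posterior mass.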
Prior and posterior processing (2/4)
If we have enough data, the posterior p(h|𝒟) becomes peaked around the Maximum A Posteriori (MAP) estimate:

p(h|\mathcal{D}) \to \delta_{\hat{h}_{MAP}}(h).

As we get more and more data, the MAP estimate converges towards the Maximum Likelihood Estimate (MLE):

\hat{h}_{MLE} = \arg\max_h p(\mathcal{D}|h) = \arg\max_h \log p(\mathcal{D}|h).
When we have a small and/or ambiguous dataset, the posterior p(h|𝒟) is vague, which
induces a broad predictive distribution. However, once we have “figured things out”, the
posterior becomes a delta function centered at the MAP estimate. In this case we can use the
plug-in approximation:

p(\tilde{x}|\mathcal{D}) = \sum_h p(\tilde{x}|h)\, \delta_{\hat{h}_{MAP}}(h) = p(\tilde{x}|\hat{h}_{MAP}).
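The sketch below illustrates both effects on a discrete grid of coin-bias hypotheses; the grid, the true bias, and the sample sizes are illustrative assumptions:

import numpy as np

# Posterior over a grid of hypotheses h = theta for a coin, under a uniform prior.
rng = np.random.default_rng(0)
thetas = np.linspace(0.05, 0.95, 19)
prior = np.full(len(thetas), 1.0 / len(thetas))
true_theta = 0.7                                   # illustrative "true" coin bias

for n in (5, 50, 500):
    x = rng.random(n) < true_theta                 # n coin flips
    n1 = int(x.sum())
    log_post = n1 * np.log(thetas) + (n - n1) * np.log(1 - thetas) + np.log(prior)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()

    theta_map = thetas[np.argmax(post)]            # MAP hypothesis
    bayes_pred = float(np.sum(post * thetas))      # full posterior predictive p(x=1|D)
    plugin_pred = theta_map                        # plug-in approximation p(x=1|h_MAP)
    print(f"n={n:4d}  MAP={theta_map:.2f}  Bayes={bayes_pred:.3f}  plug-in={plugin_pred:.3f}")

With little data the posterior is spread out and the two predictions differ; as n grows the posterior collapses onto the MAP hypothesis and the plug-in approximation becomes accurate.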
A common way to approximate the log marginal likelihood log p(𝒟) is the Bayesian Information Criterion (BIC):

\mathrm{BIC} \triangleq \log p(\mathcal{D}|\hat{\boldsymbol{\theta}}) - \frac{\mathrm{dof}(\hat{\boldsymbol{\theta}})}{2} \log N \approx \log p(\mathcal{D}),

where θ̂ is the maximum likelihood estimate of the model parameters and dof(θ̂) is the
number of degrees of freedom in the model.
The BIC method is very closely related to the Minimum Description Length or MDL
principle, which characterizes the score of a model in terms of how well it fits the data,
minus how complex the model is to define.
A very similar expression to BIC/MDL is the Akaike Information Criterion (AIC):

\mathrm{AIC}(m, \mathcal{D}) \triangleq \log p(\mathcal{D}|\hat{\boldsymbol{\theta}}_{MLE}) - \mathrm{dof}(m).
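As an illustration, both scores can be computed for a simple Bernoulli (coin) model fit by maximum likelihood; the data below are made up for the example, and the sign convention follows the definitions above (higher is better):

import numpy as np

def bernoulli_scores(x):
    """Log-likelihood, BIC and AIC for a single-parameter Bernoulli model."""
    x = np.asarray(x)
    n, n1 = len(x), int(x.sum())
    theta_mle = n1 / n                         # MLE of the heads probability
    eps = 1e-12                                # guard against log(0) at the boundary
    loglik = n1 * np.log(theta_mle + eps) + (n - n1) * np.log(1 - theta_mle + eps)
    dof = 1                                    # one free parameter, theta
    bic = loglik - 0.5 * dof * np.log(n)
    aic = loglik - dof
    return loglik, bic, aic

x = np.array([1] * 17 + [0] * 3)               # illustrative data: 17 heads, 3 tails
print(bernoulli_scores(x))

Comparing such scores across candidate models (for example, a fixed fair-coin model with dof = 0 against the free-θ model) selects the model that best trades off fit against complexity.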
Suppose we toss a coin with heads probability θ a total of N times. The likelihood of the observed sequence is

p(\mathcal{D}|\theta) = \theta^{N_1} (1 - \theta)^{N_0},

where N_1 = Σ_{i=1}^N 𝕀(x_i = 1) is the number of heads, N_0 = Σ_{i=1}^N 𝕀(x_i = 0) is the number of
tails, and N = N_0 + N_1 is the total number of observed trials. In this case we have N_1 ~ Bin(N, θ), which has the following pmf:
\mathrm{Bin}(k|n, \theta) = \binom{n}{k} \theta^{k} (1 - \theta)^{n - k}.

Since the binomial coefficient \binom{n}{k} is a constant independent of θ, the likelihood of the
binomial sampling model is the same as the likelihood for the Bernoulli model: any
inference we make about θ will be the same whether we observe the counts 𝒟 = (N_1, N) or the
sequence of trials 𝒟 = {x_1, ⋯, x_N}.
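A quick numerical check of this equivalence (the counts are illustrative): the two log-likelihoods differ only by the constant log C(N, N_1), so they peak at the same θ.

import numpy as np
from math import comb, log

N1, N0 = 7, 3                                   # illustrative counts
N = N1 + N0
thetas = np.linspace(0.1, 0.9, 81)

loglik_bernoulli = N1 * np.log(thetas) + N0 * np.log(1 - thetas)
loglik_binomial = log(comb(N, N1)) + loglik_bernoulli   # adds a theta-independent constant

print(np.allclose(loglik_binomial - loglik_bernoulli, log(comb(N, N1))))         # True
print(thetas[np.argmax(loglik_bernoulli)], thetas[np.argmax(loglik_binomial)])   # same argmax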
To make the posterior easy to compute, we need a prior with the same functional form as the likelihood,

p(\theta) \propto \theta^{\gamma_1} (1 - \theta)^{\gamma_0},

for some prior parameters γ_1 and γ_0. Then we can easily evaluate the posterior by simply
adding the exponents:

p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\, p(\theta) = \theta^{N_1} (1 - \theta)^{N_0}\, \theta^{\gamma_1} (1 - \theta)^{\gamma_0} = \theta^{N_1 + \gamma_1} (1 - \theta)^{N_0 + \gamma_0}.
When the prior and the posterior have the same form, we say that the prior is a conjugate
prior for the corresponding likelihood. In the case of the Bernoulli, the conjugate prior is the
beta distribution:

\mathrm{Beta}(\theta|a, b) \propto \theta^{a - 1} (1 - \theta)^{b - 1}.

The posterior is then

p(\theta|\mathcal{D}) \propto \mathrm{Bin}(N_1|\theta, N_0 + N_1)\, \mathrm{Beta}(\theta|a, b) \propto \mathrm{Beta}(\theta|N_1 + a, N_0 + b).
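A minimal sketch of this conjugate update using scipy; the hyperparameters and counts below are illustrative:

from scipy.stats import beta

a, b = 2, 2                                # illustrative Beta(a, b) prior
N1, N0 = 7, 3                              # illustrative data: 7 heads, 3 tails

posterior = beta(a + N1, b + N0)           # Beta(theta | N1 + a, N0 + b)
print("posterior mean:", posterior.mean())                           # (N1 + a) / (N + a + b)
print("posterior mode (MAP):", (N1 + a - 1) / (N1 + N0 + a + b - 2))
print("95% credible interval:", posterior.interval(0.95))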
• Zero count (sparse data) problem – occurs when estimating counts from a small amount of data (illustrated in the sketch below);
• Black swan paradox – the problem of how to draw general conclusions about the future from specific observations of the past.
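A short illustration of the zero-count problem and how the beta prior avoids it (the numbers are made up): with no heads observed, the MLE says heads are impossible, while the posterior mean under a uniform Beta(1, 1) prior (add-one smoothing) does not.

N1, N0 = 0, 3                              # illustrative data: 3 tails, no heads yet
a, b = 1, 1                                # uniform Beta(1, 1) prior

theta_mle = N1 / (N1 + N0)                         # 0.0 -> "black swan": heads deemed impossible
theta_post_mean = (N1 + a) / (N1 + N0 + a + b)     # 0.2 -> heads remain possible
print(theta_mle, theta_post_mean)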
Suppose now we were interested in predicting the number of heads, 𝑥, in 𝑀 future trials:
p(x|\mathcal{D}, M) = \int_0^1 \mathrm{Bin}(x|\theta, M)\, \mathrm{Beta}(\theta|a, b)\, d\theta
= \binom{M}{x} \frac{1}{B(a, b)} \int_0^1 \theta^{x} (1 - \theta)^{M - x}\, \theta^{a - 1} (1 - \theta)^{b - 1}\, d\theta
= \binom{M}{x} \frac{B(x + a, M - x + b)}{B(a, b)}.
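This integral is the (compound) beta-binomial distribution. A small sketch evaluating it directly with the log-Beta function from the standard library; the posterior parameters a, b and the horizon M below are illustrative:

from math import comb, lgamma, exp

def log_beta(a, b):
    """log B(a, b) computed via log-gamma functions."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binom_pmf(x, M, a, b):
    """p(x | D, M) = C(M, x) * B(x + a, M - x + b) / B(a, b)."""
    return comb(M, x) * exp(log_beta(x + a, M - x + b) - log_beta(a, b))

a, b, M = 9.0, 5.0, 10                     # illustrative posterior Beta(a, b), 10 future flips
pmf = [beta_binom_pmf(x, M, a, b) for x in range(M + 1)]
print(sum(pmf))                                   # ≈ 1.0: a valid distribution over x = 0..M
print(max(range(M + 1), key=lambda x: pmf[x]))    # most probable number of future heads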
The naïve Bayes classifier assumes that the features are conditionally independent given the class label:

p(\mathbf{x}|y = c, \boldsymbol{\theta}) = \prod_{j=1}^{D} p(x_j|y = c, \boldsymbol{\theta}_{jc}).

The model is called “naïve” since we do not expect the features to really be independent, even
conditional on the class label. One reason for the successful application of the naïve Bayes classifier
is that the model is quite simple, and hence it is relatively immune to overfitting.
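A minimal Bernoulli naïve Bayes sketch in numpy, with add-one smoothing of the per-feature counts; the tiny binary dataset is invented for the example and the code is a sketch rather than a production implementation:

import numpy as np

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])                  # illustrative N x D binary features
y = np.array([0, 0, 1, 1])                 # illustrative class labels

classes = np.unique(y)
log_prior = np.log(np.array([(y == c).mean() for c in classes]))        # log p(y = c)
# theta[c, j] = p(x_j = 1 | y = c), with add-one smoothing to avoid zero counts
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

def predict(x):
    # log p(y = c | x) ∝ log p(y = c) + sum_j log p(x_j | y = c)
    log_lik = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return classes[np.argmax(log_prior + log_lik)]

print(predict(np.array([1, 0, 1])))        # favours class 0 on this toy data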
Empirical Bayes violates the principle that the prior should be chosen independently of the
data:

Method                      Definition
Maximum likelihood          θ̂ = argmax_θ p(𝒟|θ)
MAP estimation              θ̂ = argmax_θ p(𝒟|θ) p(θ|η)
ML-II (Empirical Bayes)     η̂ = argmax_η ∫ p(𝒟|θ) p(θ|η) dθ
MAP-II                      η̂ = argmax_η ∫ p(𝒟|θ) p(θ|η) p(η) dθ
Full Bayes                  p(θ, η|𝒟) ∝ p(𝒟|θ) p(θ|η) p(η)
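A hedged sketch of the ML-II (Empirical Bayes) idea for a beta-binomial model: the hyperparameters (a, b) are chosen by maximizing the marginal likelihood of several groups of counts, here by a simple grid search; the groups and the grid are illustrative assumptions.

import numpy as np
from math import comb, lgamma, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(groups, a, b):
    """log p(D | a, b) = sum_i [ log C(N_i, k_i) + log B(k_i + a, N_i - k_i + b) - log B(a, b) ]."""
    return sum(log(comb(n, k)) + log_beta(k + a, n - k + b) - log_beta(a, b)
               for k, n in groups)

groups = [(3, 10), (5, 10), (7, 10), (4, 10)]        # illustrative (heads, trials) per group

grid = np.linspace(0.5, 20.0, 40)
a_hat, b_hat = max(((a, b) for a in grid for b in grid),
                   key=lambda ab: log_marginal(groups, *ab))
print(a_hat, b_hat)                                  # hyperparameters estimated from the data

The prior Beta(â, b̂) is thus itself estimated from the data, which is exactly why Empirical Bayes violates the principle stated above.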