Lectures 14-15: Generative Models for Discrete Data
Email: [email protected]
URL: https://www.zabaras.com/
J. M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online resource).
D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press, 2006.
The key to using such models is specifying a suitable form for the class-
conditional density 𝑝(𝒙|𝑦 = 𝑐, 𝜽), which defines what kind of data we expect to
see in each class.
In this lecture, we focus on the case where the observed data are discrete.
The goal is to learn the indicator function 𝑓 which defines which elements are
in the set 𝐶.
Tenenbaum, J. (1999). A Bayesian framework for concept learning. Ph.D. thesis, MIT.
We now ask whether some new test case 𝑥 belongs to 𝐶 (i.e., we ask you to classify 𝑥).
The subset of 𝐻 that is consistent with the data 𝒟 is called the version space.
After 4 examples, the likelihood of ℎ = ℎ_two ("powers of two") is (1/6)⁴ ≈ 7.7 × 10⁻⁴, whereas the likelihood of ℎ_even ("even numbers") is (1/50)⁴ ≈ 1.6 × 10⁻⁷ (there are 6 powers of two but 50 even numbers up to 100). This is a likelihood ratio of almost 5000:1 in favor of ℎ_two.
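To make this concrete, here is a minimal Python sketch of the size-principle likelihood (the hypothesis sets and function names are illustrative, not taken from the lecture):

```python
# Illustrative hypothesis extensions over the integers 1..100.
hypotheses = {
    "powers of two": {2 ** k for k in range(1, 7)},   # {2, 4, 8, 16, 32, 64}: 6 elements
    "even numbers":  set(range(2, 101, 2)),           # 50 elements
}

def likelihood(data, h):
    """Size principle: p(D | h) = (1/|h|)^N if every example lies in h's extension, else 0."""
    return (1.0 / len(h)) ** len(data) if all(x in h for x in data) else 0.0

D = [16, 8, 2, 64]
lik = {name: likelihood(D, h) for name, h in hypotheses.items()}
print(lik)                                             # ~7.7e-4 vs ~1.6e-7
print(lik["powers of two"] / lik["even numbers"])      # likelihood ratio ~ 4800
```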
If, for example, you are told the numbers are generated by some arithmetic rule, then given 1200, 1500, 900, and 1400, you may think 400 is likely but 1183 is unlikely.
But if you are told that the numbers are examples of healthy cholesterol
levels, you would probably think 400 is unlikely and 1183 is likely.
The posterior is proportional to the likelihood times the prior, $p(h\mid\mathcal{D}) \propto p(\mathcal{D}\mid h)\,p(h) \propto \mathbb{I}(\mathcal{D}\in h)\,p(h)/|h|^{N}$, where 𝕀(𝒟 ∈ ℎ) is 1 iff all the data are in the extension of the hypothesis ℎ.
In general, when we have enough data, the posterior 𝑝(ℎ|𝒟) becomes peaked on a single concept, namely the MAP estimate, i.e., $p(h\mid\mathcal{D}) \to \delta_{\hat h^{MAP}}(h)$, where
$$\hat h^{MAP} = \arg\max_h\, p(h\mid\mathcal{D})$$
is the posterior mode, and the Dirac measure is defined by
$$\delta_x(A) = \begin{cases}1 & \text{if } x\in A\\ 0 & \text{if } x\notin A\end{cases}$$
Since the likelihood term depends exponentially on 𝑁 while the prior stays constant, as we get more and more data the MAP estimate converges towards the maximum likelihood estimate (MLE):
$$\hat h^{MLE} = \arg\max_h\, p(\mathcal{D}\mid h) = \arg\max_h\, \log p(\mathcal{D}\mid h)$$
In other words, if we have enough data, the data overwhelms the prior and the MAP estimate converges towards the MLE.
[Figure: prior, likelihood, and posterior over the 32 hypotheses (even, odd, squares, multiples of 3-10, ends in 1-9, powers of 2-10, all, powers of 2 + {37}, powers of 2 - {32}) after seeing 𝒟 = {16}.]
Prior, Likelihood and Posterior
[Figure: prior, likelihood, and posterior over the same 32 hypotheses after seeing 𝒟 = {16, 8, 2, 64}.]
Posterior Predictive Distribution
The posterior predictive distribution in this context is given by
$$p(\tilde x\in C\mid\mathcal{D}) = \sum_h p(y=1\mid\tilde x, h)\,p(h\mid\mathcal{D}) = \sum_h p(\tilde x\mid h)\,p(h\mid\mathcal{D})$$
This is a weighted average of the predictions of each individual hypothesis (Bayes model averaging).
[Figure: posterior over hypotheses 𝑝(ℎ | 16) and the corresponding predictive distribution after seeing 𝒟 = {16}.]
The graph of 𝑝(ℎ|𝒟) gives the weight assigned to each hypothesis ℎ; taking a weighted sum of the individual predictions yields 𝑝(𝑥̃ ∈ 𝐶|𝒟).
[Figure: posterior 𝑝(ℎ | 16) over the consistent hypotheses (powers of 4, powers of 2, ends in 6, squares, even, multiples of 8, multiples of 4, all, powers of 2 + {37}, powers of 2 - {32}) and the resulting predictive distribution.]
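As an illustration, the following sketch computes the Bayes-model-averaged predictive 𝑝(𝑥̃ ∈ 𝐶|𝒟) = Σ_ℎ 𝑝(𝑥̃|ℎ)𝑝(ℎ|𝒟) for a toy hypothesis space; the three hypotheses and the uniform prior are assumptions for the example, not the 32-hypothesis prior used in the lecture:

```python
# Minimal Bayes model averaging sketch for the number game (toy hypotheses and prior).
hypotheses = {
    "powers of two": {2 ** k for k in range(1, 7)},
    "even numbers":  set(range(2, 101, 2)),
    "ends in 6":     {n for n in range(1, 101) if n % 10 == 6},
}
prior = {name: 1.0 / len(hypotheses) for name in hypotheses}      # uniform prior (assumption)

def likelihood(data, h):
    return (1.0 / len(h)) ** len(data) if all(x in h for x in data) else 0.0

def posterior(data):
    unnorm = {name: prior[name] * likelihood(data, h) for name, h in hypotheses.items()}
    Z = sum(unnorm.values())
    return {name: w / Z for name, w in unnorm.items()}

def predictive(x, data):
    """p(x in C | D) = sum_h p(x | h) p(h | D), with p(x | h) = I(x in h)."""
    post = posterior(data)
    return sum(post[name] for name, h in hypotheses.items() if x in h)

print(predictive(32, [16]))   # 32 is in "powers of two" and "even numbers" but not "ends in 6"
print(predictive(87, [16]))   # 0.0: no hypothesis consistent with 16 contains 87
```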
Plug-in Approximation to the Predictive Distribution
When we have a small dataset, the posterior 𝑝(ℎ|𝒟) is vague, which induces
a broad predictive distribution.
However, with lots of data, the posterior becomes a delta function centered at the MAP estimate. In this case, the predictive distribution is
$$p(\tilde x\in C\mid\mathcal{D}) = \sum_h p(\tilde x\mid h)\,\delta_{\hat h}(h) = p(\tilde x\mid\hat h)$$
Hoeting, J., D. Madigan, A. Raftery, and C. Volinsky (1999). Bayesian model averaging: A tutorial. Statistical Science 14(4).
Thus the prior is a mixture of two priors, one over arithmetical rules, and one
over intervals:
$$p(h) = \pi_0\, p_{\text{rules}}(h) + (1-\pi_0)\, p_{\text{interval}}(h)$$
The results are not that sensitive to 𝜋0 assuming that 𝜋0 > 0.5.
Tenenbaum, J. (1999). A Bayesian framework for concept learning. Ph.D. thesis, MIT.
[Figure panels: "diffuse similarity" and "powers of two" (Tenenbaum, 1999).]
Problem of interest: inferring the probability that a coin shows up heads, given
a series of observed coin tosses.
The coin model forms the basis of many methods including Naive Bayes
classifiers, Markov models, etc.
We specify first the likelihood and prior, and then derive the posterior and
predictive distributions.
The likelihood of a sequence of 𝑁 coin tosses has the form
$$p(\mathcal{D}\mid\theta) = \theta^{N_1}(1-\theta)^{N_0}, \qquad N_1=\sum_{i=1}^{N}\mathbb{I}(x_i=1),\quad N_0=\sum_{i=1}^{N}\mathbb{I}(x_i=0)$$
These two counts are the sufficient statistics of the data. This is all we need to know about 𝒟 to infer 𝜃.
Now suppose the data consists of the count 𝑁₁ of the number of heads observed in a fixed number 𝑁 of trials. In this case we have $N_1\sim\mathrm{Bin}(N,\theta)$, where Bin denotes the binomial distribution:
$$\mathrm{Bin}(k\mid n,\theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}$$
The Beta-Binomial Model: Likelihood
Since $\binom{n}{k}$ is a constant independent of 𝜃, the likelihood for the binomial sampling model is the same as the likelihood for the Bernoulli model:
$$p(\mathcal{D}\mid\theta) = \theta^{N_1}(1-\theta)^{N_0}, \qquad N_1=\sum_{i=1}^{N}\mathbb{I}(x_i=1),\quad N_0=\sum_{i=1}^{N}\mathbb{I}(x_i=0)$$
So any inference we make about 𝜃 will be the same whether we observe the
counts, 𝒟 = (𝑁1, 𝑁), or a sequence of trials, 𝒟 = (𝑥1, … , 𝑥𝑁).
When the prior and the posterior have the same form, we say that the prior is
a conjugate prior for the corresponding likelihood. Conjugate priors are widely
used since they simplify computation and are easy to interpret.
In the case of the Bernoulli, the conjugate prior is the Beta distribution.
$$\mathrm{Beta}(\theta\mid a,b) \propto \theta^{a-1}(1-\theta)^{b-1}$$
The parameters of the prior are called hyper-parameters.
Posterior
If we multiply the likelihood by the Beta prior, we get the following posterior:
$$p(\theta\mid\mathcal{D}) \propto \theta^{N_1}(1-\theta)^{N_0}\,\theta^{a-1}(1-\theta)^{b-1} \propto \mathrm{Beta}(\theta\mid N_1+a,\; N_0+b)$$
Note that the posterior is obtained by adding the prior hyper-parameters to the
empirical counts.
The strength of the prior, also known as the effective sample size of the prior,
is the sum of the pseudo counts, 𝑎 + 𝑏; this plays a role analogous to the data
set size 𝑁.
[Figure: prior, likelihood, and posterior densities on 𝜃 ∈ [0, 1] for the beta-Bernoulli model. Run binomialBetaPosteriorDemo from PMTK.]
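A minimal sketch of this conjugate update using SciPy (the hyper-parameters and data below are arbitrary choices for illustration):

```python
from scipy.stats import beta

a, b = 2.0, 2.0        # Beta(a, b) prior hyper-parameters (illustrative)
N1, N0 = 3, 17         # observed counts of heads and tails

# Conjugacy: the posterior is Beta(N1 + a, N0 + b); the pseudo-counts a + b act like extra data.
posterior = beta(N1 + a, N0 + b)
print("posterior mean:", posterior.mean())                          # (N1 + a) / (N + a + b)
print("posterior mode:", (N1 + a - 1) / (N1 + N0 + a + b - 2))      # the MAP estimate
```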
Posterior Mean and Mode
$$p(\theta\mid\mathcal{D}) = \mathrm{Beta}(\theta\mid N_1+a,\; N_0+b)$$
The MAP estimate is given by
$$\hat\theta_{MAP} = \frac{N_1+a-1}{N+a+b-2}$$
The posterior mean is a convex combination of the prior mean $m_1 = a/(a+b)$ and the MLE:
$$\mathbb{E}[\theta\mid\mathcal{D}] = \frac{N_1+a}{N+a+b} = \lambda\,m_1 + (1-\lambda)\,\hat\theta_{MLE}, \qquad \lambda = \frac{a+b}{N+a+b}$$
So the weaker the prior, the smaller 𝜆 is, and hence the closer the posterior mean is to the MLE.
One can also show that the posterior mode is a convex combination of the prior mode and the MLE, and that it too converges to the MLE.
The posterior variance is as follows:
$$\mathrm{var}[\theta\mid\mathcal{D}] = \frac{(N_1+a)(N_0+b)}{(N_1+a+N_0+b)^2\,(N_1+a+N_0+b+1)}$$
For $N\gg a,b$ this is approximately $\hat\theta_{MLE}(1-\hat\theta_{MLE})/N$, which is largest when $\hat\theta_{MLE}=0.5$ and smallest when $\hat\theta_{MLE}$ is close to 0 or 1.
This means it is easier to be sure that a coin is biased than to be sure that it is fair!
Using the full posterior, the predictive probability of heads is the posterior mean, $p(\tilde x=1\mid\mathcal{D}) = \mathbb{E}[\theta\mid\mathcal{D}] = (N_1+a)/(N+a+b)$; with a uniform prior ($a=b=1$) this smooths the empirical counts. How about if we plug in the MAP estimate? We can see that we do not then get this smoothing effect:
$$p(\tilde x=1\mid\mathcal{D}) \approx \mathrm{mode}\big[\mathrm{Beta}(\theta\mid N_1+a,\,N_0+b)\big] = \frac{N_1+a-1}{N+\alpha_0-2} = \frac{N_1}{N} = \hat\theta_{MLE} \qquad (a=b=1)$$
Posterior Predictive Distribution
Consider predicting the probability of heads in a single future trial by plugging in the MLE:
$$p(\tilde x=1\mid\mathcal{D}) \approx \mathrm{Ber}(\tilde x=1\mid\hat\theta_{MLE}) = \hat\theta_{MLE} = \frac{N_1}{N}$$
This can be a poor estimate, especially when few data are available.
For example, if 𝑁₁ = 0 heads are observed in 𝑁 = 3 trials, then this predicts that the chance of getting heads on the next trial is zero:
$$p(\tilde x=1\mid\mathcal{D}) \approx \hat\theta_{MLE} = \frac{0}{3} = 0$$
This is called the zero count problem or the sparse data problem, and
frequently occurs when estimating counts from small data sets.
Taleb, N. (2007). The Black Swan: The Impact of the Highly Improbable, 2nd Ed. Random House.
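A short sketch contrasting the MLE plug-in with the posterior-mean prediction under a uniform Beta(1, 1) prior (add-one smoothing); the function names are ours:

```python
def mle_predict_heads(N1, N):
    """Plug-in with the MLE: predicts zero if no heads have been observed."""
    return N1 / N

def posterior_mean_predict_heads(N1, N, a=1.0, b=1.0):
    """Posterior-mean prediction under a Beta(a, b) prior: (N1 + a) / (N + a + b)."""
    return (N1 + a) / (N + a + b)

print(mle_predict_heads(0, 3))              # 0.0 -> the zero-count problem
print(posterior_mean_predict_heads(0, 3))   # 0.2 -> smoothed away from zero
```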
[Figure: posterior predictive distribution and plug-in approximation for the number of heads 𝑥 in 𝑀 = 10 future trials, with prior Beta(2, 2) and data 𝑁₁ = 3, 𝑁₀ = 17. Run betaBinomPostPredDemo from PMTK.]
Posterior predictive plug-in: $p(x\mid\mathcal{D},M) = \mathrm{Bin}(x\mid\hat\theta_{MAP},M) = \binom{M}{x}\,\hat\theta_{MAP}^{\,x}\,(1-\hat\theta_{MAP})^{M-x}$
The Dirichlet-Multinomial Model
Suppose we observe 𝑁 dice rolls $x_1, x_2, \ldots, x_N$, where $x_i\in\{1,2,\ldots,K\}$. The likelihood is
$$p(\mathcal{D}\mid\boldsymbol\theta) = \prod_{k=1}^{K}\theta_k^{\,N_k}$$
where 𝑁_k is the number of times event 𝑘 occurred (these counts are the sufficient statistics for this model).
The likelihood for the multinomial model has the same form, up to an
irrelevant constant factor.
In deriving the mode of this posterior (i.e., the MAP estimate), we must enforce the constraint
$$\sum_{k=1}^{K}\theta_k = 1,$$
which can be done with a Lagrange multiplier. The resulting denominator is $N + \alpha_0 - K$, where $\alpha_0 = \sum_{k=1}^{K}\alpha_k$ is the equivalent sample size of the prior.
The Dirichlet-Multinomial Model Posterior
The posterior is again Dirichlet, $p(\boldsymbol\theta\mid\mathcal{D}) = \mathrm{Dir}(\boldsymbol\theta\mid \alpha_1+N_1,\ldots,\alpha_K+N_K)$. Using $\alpha_0 = \sum_k\alpha_k$, the MAP estimate is given by
$$\hat\theta_k = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K}$$
Compare this with the mean, mode, and variance of the Dirichlet distribution $\mathrm{Dir}(\boldsymbol{x}\mid\alpha_1,\ldots,\alpha_K)$:
$$\mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}, \qquad \mathrm{mode}[x_k] = \frac{\alpha_k-1}{\alpha_0-K}, \qquad \mathrm{var}[x_k] = \frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}, \qquad \alpha_0 = \sum_{k=1}^{K}\alpha_k$$
The posterior predictive distribution for a single categorical variable is
$$p(\tilde X=j\mid\mathcal{D}) = \int p(\tilde X=j\mid\theta_j)\,p(\boldsymbol\theta\mid\mathcal{D})\,d\boldsymbol\theta = \mathbb{E}[\theta_j\mid\mathcal{D}] = \frac{N_j+\alpha_j}{\sum_k(N_k+\alpha_k)} = \frac{N_j+\alpha_j}{N+\alpha_0}$$
If we set 𝛼_j = 1 (in the bag-of-words example with 𝑁 = 17 observed words over a vocabulary of 10 words), we get $p(\tilde X=j\mid\mathcal{D}) = \mathbb{E}(\theta_j\mid\mathcal{D}) = \frac{N_j+1}{17+10}$, from which:
$$p(\tilde X=j\mid\mathcal{D}) = \left(\tfrac{3}{27},\tfrac{5}{27},\tfrac{5}{27},\tfrac{1}{27},\tfrac{2}{27},\tfrac{2}{27},\tfrac{1}{27},\tfrac{2}{27},\tfrac{1}{27},\tfrac{5}{27}\right)$$
Note that the words "big", "black", and "rain" are predicted to occur with non-zero probability in the future, even though they have never been seen before!
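The bag-of-words prediction above can be reproduced in a few lines; the vocabulary itself is not given in the slide, so the arrangement of counts below is one choice consistent with the predictive vector, with the three zero-count entries standing for the unseen words ("big", "black", "rain"):

```python
import numpy as np

counts = np.array([2, 4, 4, 0, 1, 1, 0, 1, 0, 4])    # word counts N_j, summing to N = 17
alpha = np.ones_like(counts)                          # Dir(1, ..., 1) prior

# Posterior predictive: p(X = j | D) = (N_j + alpha_j) / (N + alpha_0)
pred = (counts + alpha) / (counts.sum() + alpha.sum())
print(pred * 27)       # [3, 5, 5, 1, 2, 2, 1, 2, 1, 5]
print(pred.sum())      # 1.0
```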
Bayesian Analysis of the Uniform Distribution
Consider 𝒰𝓃𝒾𝒻(0, 𝜃). The MLE is $\hat\theta = \max(\mathcal{D})$. This is unsuitable for predicting future data since it puts zero probability mass outside the training data.
We will perform a Bayesian analysis of the uniform distribution. The conjugate
prior is the Pareto distribution,
$$p(\theta) = \mathrm{Pareto}(\theta\mid b,K) = K b^{K}\,\theta^{-(K+1)}\,\mathbb{I}(\theta\geq b), \qquad \mathrm{mode} = b, \qquad \mathrm{mean} = \begin{cases}\infty & \text{if } K\leq 1\\[4pt] \dfrac{Kb}{K-1} & \text{if } K>1\end{cases}$$
Let 𝑚 = max(𝒟). The evidence (the probability that all 𝑁 samples came from 𝒰𝓃𝒾𝒻(0, 𝜃)) is
$$p(\mathcal{D}) = \int_{\max(m,b)}^{\infty}\frac{K b^{K}}{\theta^{N+K+1}}\,d\theta = \begin{cases}\dfrac{K}{(N+K)\,b^{N}} & \text{if } m\leq b\\[8pt] \dfrac{K b^{K}}{(N+K)\,m^{N+K}} & \text{if } m>b\end{cases}$$
Bayesian Analysis of the Uniform Distribution
$$p(\mathcal{D},\theta) = K b^{K}\,\theta^{-(N+K+1)}\,\mathbb{I}\big(\theta\geq\max(\mathcal{D},b)\big)$$
$$p(\mathcal{D}) = \begin{cases}\dfrac{K}{(N+K)\,b^{N}} & \text{if } m\leq b\\[8pt] \dfrac{K b^{K}}{(N+K)\,m^{N+K}} & \text{if } m>b\end{cases}$$
Hence the posterior is $p(\theta\mid\mathcal{D}) = p(\mathcal{D},\theta)/p(\mathcal{D}) = \mathrm{Pareto}\big(\theta\mid\max(m,b),\,N+K\big)$.
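A small sketch of the resulting Pareto posterior update (the function and variable names are ours):

```python
def pareto_posterior(data, b, K):
    """Posterior for Unif(0, theta) data under a Pareto(b, K) prior on theta.

    p(theta | D) = Pareto(theta | max(m, b), N + K): the scale parameter becomes the larger
    of the prior scale b and the sample maximum m, and the shape grows by the sample size N.
    """
    m, N = max(data), len(data)
    return max(m, b), N + K

print(pareto_posterior([2.1, 0.4, 7.7], b=1.0, K=2.0))   # (7.7, 5.0)
```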
We assume the observed features are discrete-valued, so that 𝒙 ∈ {1, …, 𝐾}^𝐷, where 𝐾 is the number of values for each feature, and 𝐷 is the number of features. The naive Bayes assumption is that the features are conditionally independent given the class label:
$$p(\boldsymbol{x}\mid y=c,\boldsymbol\theta) = \prod_{j=1}^{D} p(x_j\mid y=c,\boldsymbol\theta_{jc})$$
It is called "naive" since in practice the features 𝑥_j are not independent, even conditional on the class label 𝑐.
Naive Bayes Classifiers
$$p(\boldsymbol{x}\mid y=c,\boldsymbol\theta) = \prod_{j=1}^{D} p(x_j\mid y=c,\boldsymbol\theta_{jc})$$
One reason naive Bayes often works well in practice, despite the naive assumption, is that the model is simple (it only has 𝒪(𝐶𝐷) parameters, for 𝐶 classes and 𝐷 features), and hence it is relatively immune to overfitting.
Domingos, P. and M. Pazzani (1997). On the optimality of the simple Bayesian classifier under zero-one loss.
Machine Learning 29, 103– 130.
For binary features, 𝑥_j ∈ {0, 1}, we can use the Bernoulli distribution, $p(\boldsymbol{x}\mid y=c,\boldsymbol\theta) = \prod_{j=1}^{D}\mathrm{Ber}(x_j\mid\mu_{jc})$, where 𝜇_jc is the probability that feature 𝑗 occurs in class 𝑐. This is called the multivariate Bernoulli naive Bayes model.
For categorical features, 𝑥_j ∈ {1, …, 𝐾}, we can use the categorical distribution, $p(\boldsymbol{x}\mid y=c,\boldsymbol\theta) = \prod_{j=1}^{D}\mathrm{Cat}(x_j\mid\boldsymbol\mu_{jc})$, where 𝝁_jc is a vector of the probabilities over the 𝐾 possible values for 𝑥_j in class 𝑐.
"Training" a naive Bayes classifier usually refers to computing the MLE or the MAP estimates for the model parameters.
MLE for Naïve Bayes Classifier
The probability for a single data case (known features and class label) is
$$p(\boldsymbol{x}_i, y_i\mid\boldsymbol\theta) = p(y_i\mid\boldsymbol\pi)\prod_{j} p(x_{ij}\mid y_i,\boldsymbol\theta_j) = \prod_{c}\pi_c^{\,\mathbb{I}(y_i=c)}\;\prod_{j=1}^{D}\prod_{c} p(x_{ij}\mid\boldsymbol\theta_{jc})^{\mathbb{I}(y_i=c)}$$
Hence the joint log-likelihood is given by
$$\log p(\mathcal{D}\mid\boldsymbol\theta) = \sum_{c=1}^{C} N_c\log\pi_c + \sum_{j=1}^{D}\sum_{c=1}^{C}\sum_{i:\,y_i=c}\log p(x_{ij}\mid\boldsymbol\theta_{jc})$$
The MLE for the class prior is given (proof as given earlier) by
$$\hat\pi_c = \frac{N_c}{N}, \qquad N_c = \sum_{i}\mathbb{I}(y_i=c)$$
where 𝑁_c is the number of examples in class 𝑐.
The MLE for 𝜽_jc depends on the type of distribution we use for each feature. For simplicity, let us suppose all features are binary, so $x_j\mid y=c \sim \mathrm{Ber}(\theta_{jc})$. The MLE then becomes
$$\hat\theta_{jc} = \frac{N_{jc}}{N_c}, \qquad N_c = \sum_i\mathbb{I}(y_i=c), \qquad N_{jc} = \sum_{i:\,y_i=c}\mathbb{I}(x_{ij}=1)$$
[Figure: bar plots of the fitted class-conditional feature probabilities 𝜃̂_jc, one panel per class.]
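These counting formulas are straightforward to implement; a minimal NumPy sketch (the array layout and function name are ours):

```python
import numpy as np

def fit_bernoulli_nb_mle(X, y, C):
    """MLE for a Bernoulli naive Bayes model.

    X: (N, D) binary feature matrix, y: (N,) class labels in {0, ..., C-1}.
    Returns pi_hat[c] = N_c / N and theta_hat[c, j] = N_jc / N_c.
    """
    pi_hat = np.array([(y == c).mean() for c in range(C)])
    theta_hat = np.vstack([X[y == c].mean(axis=0) for c in range(C)])
    return pi_hat, theta_hat

# Tiny synthetic example: 6 documents, 3 binary features, 2 classes.
X = np.array([[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_bernoulli_nb_mle(X, y, C=2))
```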
We use a 𝒟𝒾𝓇(𝜶) prior for 𝝅 and a ℬℯ𝓉𝒶(𝛽₀, 𝛽₁) prior for each 𝜃_jc. Often we take 𝜶 = 𝟏 and 𝜷 = 𝟏, corresponding to add-one or Laplace smoothing.
Combining the factored likelihood (for binary features)
$$p(\mathcal{D}\mid\boldsymbol\theta) = \prod_{c}\pi_c^{\,N_c}\prod_{j}\prod_{c}\theta_{jc}^{\,N_{jc}}\,(1-\theta_{jc})^{\,N_c-N_{jc}}$$
with the factored prior above gives the following factored posterior:
$$p(\boldsymbol\theta\mid\mathcal{D}) = p(\boldsymbol\pi\mid\mathcal{D})\prod_{j=1}^{D}\prod_{c=1}^{C} p(\theta_{jc}\mid\mathcal{D})$$
$$p(\boldsymbol\pi\mid\mathcal{D}) = \mathrm{Dir}(\boldsymbol\pi\mid N_1+\alpha_1,\ldots,N_C+\alpha_C), \qquad p(\theta_{jc}\mid\mathcal{D}) = \mathrm{Beta}\big(\theta_{jc}\mid N_{jc}+\beta_0,\;(N_c-N_{jc})+\beta_1\big)$$
We thus simply update the prior counts with the empirical counts from the likelihood.
Predictive Distribution
At test time, the goal is to compute the predictive distribution
$$p(y=c\mid\boldsymbol{x},\mathcal{D}) \propto p(y=c\mid\mathcal{D})\prod_{j=1}^{D} p(x_j\mid y=c,\mathcal{D})$$
The correct Bayesian procedure is to integrate out the unknown parameters:
$$p(y=c\mid\boldsymbol{x},\mathcal{D}) \propto \left[\int \mathrm{Cat}(y=c\mid\boldsymbol\pi)\,p(\boldsymbol\pi\mid\mathcal{D})\,d\boldsymbol\pi\right]\prod_{j=1}^{D}\int \mathrm{Ber}(x_j\mid y=c,\theta_{jc})\,p(\theta_{jc}\mid\mathcal{D})\,d\theta_{jc}$$
Carrying out these integrals gives the same functional form as the plug-in rule below, but with the posterior-mean parameters $\bar\pi_c$ and $\bar\theta_{jc}$. If we instead plug in point estimates (MAP or MLE), we get
$$p(y=c\mid\boldsymbol{x},\mathcal{D}) \propto \hat\pi_c\prod_{j=1}^{D}\hat\theta_{jc}^{\;\mathbb{I}(x_j=1)}\big(1-\hat\theta_{jc}\big)^{\mathbb{I}(x_j=0)}$$
The only difference is that we replaced the posterior means $\bar\pi_c, \bar\theta_{jc}$ with the MAP or MLE estimates $\hat\pi_c, \hat\theta_{jc}$.
This small difference can matter in practice, since using the posterior mean results in less overfitting.
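A sketch of the smoothed plug-in prediction, computed in log space to avoid the underflow problem discussed next (function and variable names are ours; the parameter values are made up):

```python
import numpy as np
from scipy.special import logsumexp

def predict_log_proba(x, pi_bar, theta_bar):
    """log p(y = c | x) for a Bernoulli naive Bayes model.

    x: (D,) binary vector; pi_bar: (C,) class priors; theta_bar: (C, D) feature probabilities,
    e.g. the posterior means theta_bar[c, j] = (N_jc + 1) / (N_c + 2) under a Beta(1, 1) prior.
    """
    # b_c = log p(y=c) + sum_j [ x_j log theta_jc + (1 - x_j) log(1 - theta_jc) ]
    b = np.log(pi_bar) + x @ np.log(theta_bar).T + (1 - x) @ np.log(1 - theta_bar).T
    return b - logsumexp(b)      # normalize by log sum_c exp(b_c)

pi_bar = np.array([0.5, 0.5])
theta_bar = np.array([[0.8, 0.2, 0.4], [0.2, 0.8, 0.6]])
print(np.exp(predict_log_proba(np.array([1, 0, 1]), pi_bar, theta_bar)))
```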
A naive implementation of this computation can fail due to numerical underflow. The problem is that 𝑝(𝒙|𝑦 = 𝑐) is often a very small number, especially if 𝒙 is a high-dimensional vector.
The obvious solution is to take logs when applying Bayes' rule, as follows:
$$\log p(y=c\mid\boldsymbol{x}) = b_c - \log\sum_{c'=1}^{C} e^{b_{c'}}, \qquad b_c \triangleq \log p(\boldsymbol{x}\mid y=c) + \log p(y=c)$$
The Log-Sum-Exp Trick
However, this requires evaluating the following expression:
$$\log\sum_{c'=1}^{C} e^{b_{c'}} = \log\sum_{c'} p(y=c',\boldsymbol{x}) = \log p(\boldsymbol{x})$$
One can factor out the largest term and represent the remaining numbers relative to it, e.g.
$$\log\big(e^{-120}+e^{-121}\big) = \log\big(e^{0}+e^{-1}\big) - 120$$
In general, with $B = \max_c b_c$, we have
$$\log\sum_{c} e^{b_c} = \log\left[\left(\sum_{c} e^{b_c-B}\right)e^{B}\right] = \left[\log\sum_{c} e^{b_c-B}\right] + B$$
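A direct translation of this identity (SciPy provides the same functionality as scipy.special.logsumexp):

```python
import numpy as np

def log_sum_exp(b):
    """Compute log(sum_c exp(b_c)) stably by factoring out B = max_c b_c."""
    B = np.max(b)
    return B + np.log(np.sum(np.exp(b - B)))

b = np.array([-1200.0, -1201.0])
print(log_sum_exp(b))               # ~ -1199.69, computed without underflow
print(np.log(np.sum(np.exp(b))))    # -inf: exp(-1200) underflows to 0 in double precision
```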
Recall that $b_c \triangleq \log p(\boldsymbol{x}\mid y=c) + \log p(y=c)$ (cf. the PMTK implementation).
All of these quantities are computed when fitting the naive Bayes classifier.
Feature Selection Using Mutual Information
The words with highest MI are much more discriminative than the words which
are most probable.
Most Probable Words: In the earlier example, the most probable word in both classes is "subject"; it always occurs because this is newsgroup data, where every message has a subject line. Obviously this is not very discriminative.
Most Discriminative Words: The words with highest MI with the class label are
(in decreasing order) “windows”, “microsoft”, “DOS” and “motif”. This makes
sense, since the two classes correspond to Microsoft Windows and 𝑋
Windows.
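The MI criterion itself is not reproduced in the extracted slide; for reference, a standard form for a binary feature 𝑥_j and the class label 𝑦 (written in the notation of the Bernoulli naive Bayes model above, so the exact expression here is our addition) is
$$I(X_j, Y) = \sum_{c}\pi_c\left[\theta_{jc}\log\frac{\theta_{jc}}{\theta_j} + (1-\theta_{jc})\log\frac{1-\theta_{jc}}{1-\theta_j}\right], \qquad \theta_j \triangleq \sum_{c}\pi_c\,\theta_{jc},$$
where $\theta_{jc} = p(x_j=1\mid y=c)$, $\pi_c = p(y=c)$, and $\theta_j = p(x_j=1)$.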
In the multinomial document classifier, 𝒙_i is a vector of counts (e.g., word counts in a bag-of-words representation), with class-conditional density
$$p(\boldsymbol{x}_i\mid y_i=c,\boldsymbol\theta) = \mathrm{Mu}(\boldsymbol{x}_i\mid N_i,\boldsymbol\theta_c) = \frac{N_i!}{\prod_{j=1}^{D}x_{ij}!}\prod_{j=1}^{D}\theta_{jc}^{\,x_{ij}}$$
where $\sum_{j=1}^{D} x_{ij} = N_i$. Because of this constraint, the features are not independent. The parameters satisfy $\sum_{j=1}^{D}\theta_{jc} = 1$ for each class 𝑐.
Burstiness of Words
The multinomial classifier is easy to train and use for predictions; however, it does not take into account the burstiness of word usage.
Words occur in bursts: most words never appear in any given document, but if
they do appear once, they are likely to appear more than once.
The multinomial model cannot capture this: the likelihood depends on each count through a term of the form $\theta_{jc}^{\,N_{ij}}$, and since $\theta_{jc}\ll 1$ for rare words, it becomes increasingly unlikely to generate many of them.
Multinomial Document Classifiers
For more frequent words, the decay rate of $\theta_{jc}^{\,N_{ij}}$ is not as fast. To see why intuitively, note that the most frequent words are function words, which are not specific to the class, such as "and", "the", and "but".
The independence assumption is more reasonable for common words: e.g.
the chance of the word “and” occurring is pretty much the same no matter
how many times it has previously occurred.
Since rare words are the ones that matter most for classification purposes,
these are the ones we want to model the most carefully.
Various ad hoc heuristics have been proposed to improve the performance of
the multinomial document classifier.
Rennie, J., L. Shih, J. Teevan, and D. Karger (2003). Tackling the poor assumptions of naive Bayes text classifiers.
In Intl. Conf. on Machine Learning.
Madsen, R., D. Kauchak, and C. Elkan (2005). Modeling word burstiness using the Dirichlet distribution. In Intl.
Conf. on Machine Learning.
Dirichlet Compound Multinomial (DCM) Density
Suppose we simply replace the multinomial class-conditional density with the Dirichlet Compound Multinomial (DCM) density, defined as
$$p(\boldsymbol{x}_i\mid y_i=c,\boldsymbol\alpha) = \int \mathrm{Mu}(\boldsymbol{x}_i\mid N_i,\boldsymbol\theta_c)\,\mathrm{Dir}(\boldsymbol\theta_c\mid\boldsymbol\alpha_c)\,d\boldsymbol\theta_c = \frac{N_i!}{\prod_{j=1}^{D}x_{ij}!}\;\frac{B(\boldsymbol{x}_i+\boldsymbol\alpha_c)}{B(\boldsymbol\alpha_c)}$$
where $B(\cdot)$ denotes the multivariate beta function.
After seeing one occurrence of a word, say word 𝑗, the posterior counts on 𝜃_j get updated, making another occurrence of word 𝑗 more likely. By contrast, if 𝜃_j is fixed, then the occurrences of each word are independent. The DCM therefore captures the burstiness phenomenon.
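A sketch of evaluating the DCM log-density in closed form using log-gamma functions (function and variable names are ours):

```python
import numpy as np
from scipy.special import gammaln

def log_dcm(x, alpha):
    """log p(x | alpha) for the Dirichlet compound multinomial; x is a vector of counts."""
    N = x.sum()
    log_coeff = gammaln(N + 1) - gammaln(x + 1).sum()                  # log N! / prod_j x_ij!
    log_beta_ratio = (gammaln(alpha.sum()) - gammaln(alpha.sum() + N)
                      + (gammaln(x + alpha) - gammaln(alpha)).sum())   # log B(x + alpha) / B(alpha)
    return log_coeff + log_beta_ratio

x = np.array([3, 0, 1, 5])
alpha = np.ones(4)
print(log_dcm(x, alpha))
```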