Lectures 14-15: Generative Models for Discrete Data
Email: [email protected]
URL: https://www.zabaras.com/
J. M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online resource).
D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press, 2006.
The key to using such models is specifying a suitable form for the class-
conditional density 𝑝(𝒙|𝑦 = 𝑐, 𝜽), which defines what kind of data we expect to
see in each class.
In this lecture, we focus on the case where the observed data are discrete.
The goal is to learn the indicator function 𝑓 which defines which elements are
in the set 𝐶.
Tenenbaum, J. (1999). A Bayesian framework for concept learning. Ph.D. thesis, MIT.
We now ask whether some new test case 𝑥 belongs to 𝐶 (i.e., we ask you to classify 𝑥).
The subset of 𝐻 that is consistent with the data 𝒟 is called the version space.
After 4 examples, the likelihood of ℎ = ℎ_two ("powers of two") is (1/6)⁴ ≈ 7.7 × 10⁻⁴, whereas the likelihood of ℎ_even ("even numbers") is (1/50)⁴ ≈ 1.6 × 10⁻⁷ (there are 6 powers of two but 50 even numbers up to 100). This is a likelihood ratio of almost 5000:1 in favor of ℎ_two.
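To make this concrete, here is a minimal Python sketch of the size-principle likelihood (the hypothesis sets and function names are illustrative, not taken from the lecture):

```python
# Illustrative hypothesis extensions over the integers 1..100.
hypotheses = {
    "powers of two": {2 ** k for k in range(1, 7)},   # {2, 4, 8, 16, 32, 64}: 6 elements
    "even numbers":  set(range(2, 101, 2)),           # 50 elements
}

def likelihood(data, h):
    """Size principle: p(D | h) = (1/|h|)^N if every example lies in h's extension, else 0."""
    return (1.0 / len(h)) ** len(data) if all(x in h for x in data) else 0.0

D = [16, 8, 2, 64]
lik = {name: likelihood(D, h) for name, h in hypotheses.items()}
print(lik)                                             # ~7.7e-4 vs ~1.6e-7
print(lik["powers of two"] / lik["even numbers"])      # likelihood ratio ~ 4800
```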
If, for example, you are told the numbers are generated by some arithmetic rule, then given 1200, 1500, 900, and 1400, you may think 400 is likely but 1183 is unlikely.
But if you are told that the numbers are examples of healthy cholesterol
levels, you would probably think 400 is unlikely and 1183 is likely.
The posterior is proportional to the likelihood times the prior, $p(h\mid\mathcal{D}) \propto p(\mathcal{D}\mid h)\,p(h) \propto \mathbb{I}(\mathcal{D}\in h)\,p(h)/|h|^{N}$, where 𝕀(𝒟 ∈ ℎ) is 1 iff all the data are in the extension of the hypothesis ℎ.
In general, when we have enough data, the posterior 𝑝(ℎ|𝒟) becomes peaked on a single concept, namely the MAP estimate, i.e., $p(h\mid\mathcal{D}) \to \delta_{\hat h^{MAP}}(h)$, where
$$\hat h^{MAP} = \arg\max_h\, p(h\mid\mathcal{D})$$
is the posterior mode, and the Dirac measure is defined by
$$\delta_x(A) = \begin{cases}1 & \text{if } x\in A\\ 0 & \text{if } x\notin A\end{cases}$$
Since the likelihood term depends exponentially on 𝑁 while the prior stays constant, as we get more and more data the MAP estimate converges towards the maximum likelihood estimate (MLE):
$$\hat h^{MLE} = \arg\max_h\, p(\mathcal{D}\mid h) = \arg\max_h\, \log p(\mathcal{D}\mid h)$$
In other words, if we have enough data, the data overwhelms the prior and the MAP estimate converges towards the MLE.
[Figure: prior, likelihood, and posterior over the 32 hypotheses (even, odd, squares, multiples of 3-10, ends in 1-9, powers of 2-10, all, powers of 2 + {37}, powers of 2 - {32}) after seeing 𝒟 = {16}.]
Prior, Likelihood and Posterior
[Figure: prior, likelihood, and posterior over the same 32 hypotheses after seeing 𝒟 = {16, 8, 2, 64}.]
Posterior Predictive Distribution
The posterior predictive distribution in this context is given by
$$p(\tilde x\in C\mid\mathcal{D}) = \sum_h p(y=1\mid\tilde x, h)\,p(h\mid\mathcal{D}) = \sum_h p(\tilde x\mid h)\,p(h\mid\mathcal{D})$$
This is a weighted average of the predictions of each individual hypothesis (Bayes model averaging).
[Figure: posterior over hypotheses 𝑝(ℎ | 16) and the corresponding predictive distribution after seeing 𝒟 = {16}.]
The graph of 𝑝(ℎ|𝒟) gives the weight assigned to each hypothesis ℎ; taking a weighted sum of the individual predictions yields 𝑝(𝑥̃ ∈ 𝐶|𝒟).
[Figure: posterior 𝑝(ℎ | 16) over the consistent hypotheses (powers of 4, powers of 2, ends in 6, squares, even, multiples of 8, multiples of 4, all, powers of 2 + {37}, powers of 2 - {32}) and the resulting predictive distribution.]
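As an illustration, the following sketch computes the Bayes-model-averaged predictive 𝑝(𝑥̃ ∈ 𝐶|𝒟) = Σ_ℎ 𝑝(𝑥̃|ℎ)𝑝(ℎ|𝒟) for a toy hypothesis space; the three hypotheses and the uniform prior are assumptions for the example, not the 32-hypothesis prior used in the lecture:

```python
# Minimal Bayes model averaging sketch for the number game (toy hypotheses and prior).
hypotheses = {
    "powers of two": {2 ** k for k in range(1, 7)},
    "even numbers":  set(range(2, 101, 2)),
    "ends in 6":     {n for n in range(1, 101) if n % 10 == 6},
}
prior = {name: 1.0 / len(hypotheses) for name in hypotheses}      # uniform prior (assumption)

def likelihood(data, h):
    return (1.0 / len(h)) ** len(data) if all(x in h for x in data) else 0.0

def posterior(data):
    unnorm = {name: prior[name] * likelihood(data, h) for name, h in hypotheses.items()}
    Z = sum(unnorm.values())
    return {name: w / Z for name, w in unnorm.items()}

def predictive(x, data):
    """p(x in C | D) = sum_h p(x | h) p(h | D), with p(x | h) = I(x in h)."""
    post = posterior(data)
    return sum(post[name] for name, h in hypotheses.items() if x in h)

print(predictive(32, [16]))   # 32 is in "powers of two" and "even numbers" but not "ends in 6"
print(predictive(87, [16]))   # 0.0: no hypothesis consistent with 16 contains 87
```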
Plug-in Approximation to the Predictive Distribution
When we have a small dataset, the posterior 𝑝(ℎ|𝒟) is vague, which induces
a broad predictive distribution.
However, with lots of data, the posterior becomes a delta function centered at the MAP estimate. In this case, the predictive distribution is
$$p(\tilde x\in C\mid\mathcal{D}) = \sum_h p(\tilde x\mid h)\,\delta_{\hat h}(h) = p(\tilde x\mid\hat h)$$
Hoeting, J., D. Madigan, A. Raftery, and C. Volinsky (1999). Bayesian model averaging: A tutorial. Statistical Science 14(4).
Thus the prior is a mixture of two priors, one over arithmetical rules, and one
over intervals:
$$p(h) = \pi_0\, p_{\text{rules}}(h) + (1-\pi_0)\, p_{\text{interval}}(h)$$
The results are not that sensitive to 𝜋0 assuming that 𝜋0 > 0.5.
Tenenbaum, J. (1999). A Bayesian framework for concept learning. Ph.D. thesis, MIT.
[Figure panels: "diffuse similarity" and "powers of two" (Tenenbaum, 1999).]
Problem of interest: inferring the probability that a coin shows up heads, given
a series of observed coin tosses.
The coin model forms the basis of many methods including Naive Bayes
classifiers, Markov models, etc.
We specify first the likelihood and prior, and then derive the posterior and
predictive distributions.
The likelihood of a sequence of 𝑁 coin tosses has the form
$$p(\mathcal{D}\mid\theta) = \theta^{N_1}(1-\theta)^{N_0}, \qquad N_1=\sum_{i=1}^{N}\mathbb{I}(x_i=1),\quad N_0=\sum_{i=1}^{N}\mathbb{I}(x_i=0)$$
These two counts are the sufficient statistics of the data. This is all we need to know about 𝒟 to infer 𝜃.
Now suppose the data consists of the count 𝑁₁ of the number of heads observed in a fixed number 𝑁 of trials. In this case we have $N_1\sim\mathrm{Bin}(N,\theta)$, where Bin denotes the binomial distribution:
$$\mathrm{Bin}(k\mid n,\theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k}$$
The Beta-Binomial Model: Likelihood
Since $\binom{n}{k}$ is a constant independent of 𝜃, the likelihood for the binomial sampling model is the same as the likelihood for the Bernoulli model:
$$p(\mathcal{D}\mid\theta) = \theta^{N_1}(1-\theta)^{N_0}, \qquad N_1=\sum_{i=1}^{N}\mathbb{I}(x_i=1),\quad N_0=\sum_{i=1}^{N}\mathbb{I}(x_i=0)$$
So any inference we make about 𝜃 will be the same whether we observe the
counts, 𝒟 = (𝑁1, 𝑁), or a sequence of trials, 𝒟 = (𝑥1, … , 𝑥𝑁).
When the prior and the posterior have the same form, we say that the prior is
a conjugate prior for the corresponding likelihood. Conjugate priors are widely
used since they simplify computation and are easy to interpret.
In the case of the Bernoulli, the conjugate prior is the Beta distribution.
$$\mathrm{Beta}(\theta\mid a,b) \propto \theta^{a-1}(1-\theta)^{b-1}$$
The parameters of the prior are called hyper-parameters.
Posterior
If we multiply the likelihood by the Beta prior, we get the following posterior:
$$p(\theta\mid\mathcal{D}) \propto \theta^{N_1}(1-\theta)^{N_0}\,\theta^{a-1}(1-\theta)^{b-1} \propto \mathrm{Beta}(\theta\mid N_1+a,\; N_0+b)$$
Note that the posterior is obtained by adding the prior hyper-parameters to the
empirical counts.
The strength of the prior, also known as the effective sample size of the prior,
is the sum of the pseudo counts, 𝑎 + 𝑏; this plays a role analogous to the data
set size 𝑁.
[Figure: prior, likelihood, and posterior densities on 𝜃 ∈ [0, 1] for the beta-Bernoulli model. Run binomialBetaPosteriorDemo from PMTK.]
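A minimal sketch of this conjugate update using SciPy (the hyper-parameters and data below are arbitrary choices for illustration):

```python
from scipy.stats import beta

a, b = 2.0, 2.0        # Beta(a, b) prior hyper-parameters (illustrative)
N1, N0 = 3, 17         # observed counts of heads and tails

# Conjugacy: the posterior is Beta(N1 + a, N0 + b); the pseudo-counts a + b act like extra data.
posterior = beta(N1 + a, N0 + b)
print("posterior mean:", posterior.mean())                          # (N1 + a) / (N + a + b)
print("posterior mode:", (N1 + a - 1) / (N1 + N0 + a + b - 2))      # the MAP estimate
```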
Posterior Mean and Mode
$$p(\theta\mid\mathcal{D}) = \mathrm{Beta}(\theta\mid N_1+a,\; N_0+b)$$
The MAP estimate is given by
$$\hat\theta_{MAP} = \frac{N_1+a-1}{N+a+b-2}$$
The posterior mean is a convex combination of the prior mean $m_1 = a/(a+b)$ and the MLE:
$$\mathbb{E}[\theta\mid\mathcal{D}] = \frac{N_1+a}{N+a+b} = \lambda\,m_1 + (1-\lambda)\,\hat\theta_{MLE}, \qquad \lambda = \frac{a+b}{N+a+b}$$
So the weaker the prior, the smaller 𝜆 is, and hence the closer the posterior mean is to the MLE.
One can also show that the posterior mode is a convex combination of the prior mode and the MLE, and that it too converges to the MLE.
The posterior variance is as follows:
$$\mathrm{var}[\theta\mid\mathcal{D}] = \frac{(N_1+a)(N_0+b)}{(N_1+a+N_0+b)^2\,(N_1+a+N_0+b+1)}$$
For $N\gg a,b$ this is approximately $\hat\theta_{MLE}(1-\hat\theta_{MLE})/N$, which is largest when $\hat\theta_{MLE}=0.5$ and smallest when $\hat\theta_{MLE}$ is close to 0 or 1.
This means it is easier to be sure that a coin is biased than to be sure that it is fair!
Using the full posterior, the predictive probability of heads is the posterior mean, $p(\tilde x=1\mid\mathcal{D}) = \mathbb{E}[\theta\mid\mathcal{D}] = (N_1+a)/(N+a+b)$; with a uniform prior ($a=b=1$) this smooths the empirical counts. How about if we plug in the MAP estimate? We can see that we do not then get this smoothing effect:
$$p(\tilde x=1\mid\mathcal{D}) \approx \mathrm{mode}\big[\mathrm{Beta}(\theta\mid N_1+a,\,N_0+b)\big] = \frac{N_1+a-1}{N+\alpha_0-2} = \frac{N_1}{N} = \hat\theta_{MLE} \qquad (a=b=1)$$
Posterior Predictive Distribution
Consider predicting the probability of heads in a single future trial by plugging in the MLE:
$$p(\tilde x=1\mid\mathcal{D}) \approx \mathrm{Ber}(\tilde x=1\mid\hat\theta_{MLE}) = \hat\theta_{MLE} = \frac{N_1}{N}$$
This can be a poor estimate, especially when few data are available.
For example, if 𝑁₁ = 0 heads are observed in 𝑁 = 3 trials, then this predicts that the chance of getting heads on the next trial is zero:
$$p(\tilde x=1\mid\mathcal{D}) \approx \hat\theta_{MLE} = \frac{0}{3} = 0$$
This is called the zero count problem or the sparse data problem, and
frequently occurs when estimating counts from small data sets.
Taleb, N. (2007). The Black Swan: The Impact of the Highly Improbable, 2nd Ed. Random House.
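A short sketch contrasting the MLE plug-in with the posterior-mean prediction under a uniform Beta(1, 1) prior (add-one smoothing); the function names are ours:

```python
def mle_predict_heads(N1, N):
    """Plug-in with the MLE: predicts zero if no heads have been observed."""
    return N1 / N

def posterior_mean_predict_heads(N1, N, a=1.0, b=1.0):
    """Posterior-mean prediction under a Beta(a, b) prior: (N1 + a) / (N + a + b)."""
    return (N1 + a) / (N + a + b)

print(mle_predict_heads(0, 3))              # 0.0 -> the zero-count problem
print(posterior_mean_predict_heads(0, 3))   # 0.2 -> smoothed away from zero
```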
[Figure: posterior predictive distribution and plug-in approximation for the number of heads 𝑥 in 𝑀 = 10 future trials, with prior Beta(2, 2) and data 𝑁₁ = 3, 𝑁₀ = 17. Run betaBinomPostPredDemo from PMTK.]
Posterior predictive plug-in: $p(x\mid\mathcal{D},M) = \mathrm{Bin}(x\mid\hat\theta_{MAP},M) = \binom{M}{x}\,\hat\theta_{MAP}^{\,x}\,(1-\hat\theta_{MAP})^{M-x}$
The Dirichlet-Multinomial Model
Suppose we observe 𝑁 dice rolls $x_1, x_2, \ldots, x_N$, where $x_i\in\{1,2,\ldots,K\}$. The likelihood is
$$p(\mathcal{D}\mid\boldsymbol\theta) = \prod_{k=1}^{K}\theta_k^{\,N_k}$$
where 𝑁_k is the number of times event 𝑘 occurred (these counts are the sufficient statistics for this model).
The likelihood for the multinomial model has the same form, up to an
irrelevant constant factor.
In deriving the mode of this posterior (i.e., the MAP estimate), we must enforce the constraint
$$\sum_{k=1}^{K}\theta_k = 1,$$
which can be done with a Lagrange multiplier. The resulting denominator is $N + \alpha_0 - K$, where $\alpha_0 = \sum_{k=1}^{K}\alpha_k$ is the equivalent sample size of the prior.
The Dirichlet-Multinomial Model Posterior
The posterior is again Dirichlet, $p(\boldsymbol\theta\mid\mathcal{D}) = \mathrm{Dir}(\boldsymbol\theta\mid \alpha_1+N_1,\ldots,\alpha_K+N_K)$. Using $\alpha_0 = \sum_k\alpha_k$, the MAP estimate is given by
$$\hat\theta_k = \frac{N_k + \alpha_k - 1}{N + \alpha_0 - K}$$
Compare this with the mean, mode, and variance of the Dirichlet distribution $\mathrm{Dir}(\boldsymbol{x}\mid\alpha_1,\ldots,\alpha_K)$:
$$\mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}, \qquad \mathrm{mode}[x_k] = \frac{\alpha_k-1}{\alpha_0-K}, \qquad \mathrm{var}[x_k] = \frac{\alpha_k(\alpha_0-\alpha_k)}{\alpha_0^2(\alpha_0+1)}, \qquad \alpha_0 = \sum_{k=1}^{K}\alpha_k$$
The posterior predictive distribution for a single categorical variable is
$$p(\tilde X=j\mid\mathcal{D}) = \int p(\tilde X=j\mid\theta_j)\,p(\boldsymbol\theta\mid\mathcal{D})\,d\boldsymbol\theta = \mathbb{E}[\theta_j\mid\mathcal{D}] = \frac{N_j+\alpha_j}{\sum_k(N_k+\alpha_k)} = \frac{N_j+\alpha_j}{N+\alpha_0}$$
If we set 𝛼_j = 1 (in the bag-of-words example with 𝑁 = 17 observed words over a vocabulary of 10 words), we get $p(\tilde X=j\mid\mathcal{D}) = \mathbb{E}(\theta_j\mid\mathcal{D}) = \frac{N_j+1}{17+10}$, from which:
$$p(\tilde X=j\mid\mathcal{D}) = \left(\tfrac{3}{27},\tfrac{5}{27},\tfrac{5}{27},\tfrac{1}{27},\tfrac{2}{27},\tfrac{2}{27},\tfrac{1}{27},\tfrac{2}{27},\tfrac{1}{27},\tfrac{5}{27}\right)$$
Note that the words "big", "black", and "rain" are predicted to occur with non-zero probability in the future, even though they have never been seen before!
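The bag-of-words prediction above can be reproduced in a few lines; the vocabulary itself is not given in the slide, so the arrangement of counts below is one choice consistent with the predictive vector, with the three zero-count entries standing for the unseen words ("big", "black", "rain"):

```python
import numpy as np

counts = np.array([2, 4, 4, 0, 1, 1, 0, 1, 0, 4])    # word counts N_j, summing to N = 17
alpha = np.ones_like(counts)                          # Dir(1, ..., 1) prior

# Posterior predictive: p(X = j | D) = (N_j + alpha_j) / (N + alpha_0)
pred = (counts + alpha) / (counts.sum() + alpha.sum())
print(pred * 27)       # [3, 5, 5, 1, 2, 2, 1, 2, 1, 5]
print(pred.sum())      # 1.0
```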
Bayesian Analysis of the Uniform Distribution
Consider 𝒰𝓃𝒾𝒻(0, 𝜃). The MLE is $\hat\theta = \max(\mathcal{D})$. This is unsuitable for predicting future data since it puts zero probability mass outside the training data.
We will perform a Bayesian analysis of the uniform distribution. The conjugate
prior is the Pareto distribution,
$$p(\theta) = \mathrm{Pareto}(\theta\mid b,K) = K b^{K}\,\theta^{-(K+1)}\,\mathbb{I}(\theta\geq b), \qquad \mathrm{mode} = b, \qquad \mathrm{mean} = \begin{cases}\infty & \text{if } K\leq 1\\[4pt] \dfrac{Kb}{K-1} & \text{if } K>1\end{cases}$$
Let 𝑚 = max(𝒟). The evidence (the probability that all 𝑁 samples came from 𝒰𝓃𝒾𝒻(0, 𝜃)) is
$$p(\mathcal{D}) = \int_{\max(m,b)}^{\infty}\frac{K b^{K}}{\theta^{N+K+1}}\,d\theta = \begin{cases}\dfrac{K}{(N+K)\,b^{N}} & \text{if } m\leq b\\[8pt] \dfrac{K b^{K}}{(N+K)\,m^{N+K}} & \text{if } m>b\end{cases}$$
Bayesian Analysis of the Uniform Distribution
$$p(\mathcal{D},\theta) = K b^{K}\,\theta^{-(N+K+1)}\,\mathbb{I}\big(\theta\geq\max(\mathcal{D},b)\big)$$
$$p(\mathcal{D}) = \begin{cases}\dfrac{K}{(N+K)\,b^{N}} & \text{if } m\leq b\\[8pt] \dfrac{K b^{K}}{(N+K)\,m^{N+K}} & \text{if } m>b\end{cases}$$
Hence the posterior is $p(\theta\mid\mathcal{D}) = p(\mathcal{D},\theta)/p(\mathcal{D}) = \mathrm{Pareto}\big(\theta\mid\max(m,b),\,N+K\big)$.
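A small sketch of the resulting Pareto posterior update (the function and variable names are ours):

```python
def pareto_posterior(data, b, K):
    """Posterior for Unif(0, theta) data under a Pareto(b, K) prior on theta.

    p(theta | D) = Pareto(theta | max(m, b), N + K): the scale parameter becomes the larger
    of the prior scale b and the sample maximum m, and the shape grows by the sample size N.
    """
    m, N = max(data), len(data)
    return max(m, b), N + K

print(pareto_posterior([2.1, 0.4, 7.7], b=1.0, K=2.0))   # (7.7, 5.0)
```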
We assume the observed features are discrete-valued, so that 𝒙 ∈ {1, …, 𝐾}^𝐷, where 𝐾 is the number of values for each feature, and 𝐷 is the number of features. The naive Bayes assumption is that the features are conditionally independent given the class label:
$$p(\boldsymbol{x}\mid y=c,\boldsymbol\theta) = \prod_{j=1}^{D} p(x_j\mid y=c,\boldsymbol\theta_{jc})$$
It is called "naive" since in practice the features 𝑥_j are not independent, even conditional on the class label 𝑐.
Naive Bayes Classifiers
$$p(\boldsymbol{x}\mid y=c,\boldsymbol\theta) = \prod_{j=1}^{D} p(x_j\mid y=c,\boldsymbol\theta_{jc})$$
One reason naive Bayes often works well in practice, despite the naive assumption, is that the model is simple (it only has 𝒪(𝐶𝐷) parameters, for 𝐶 classes and 𝐷 features), and hence it is relatively immune to overfitting.
Domingos, P. and M. Pazzani (1997). On the optimality of the simple Bayesian classifier under zero-one loss.
Machine Learning 29, 103– 130.
For binary features, 𝑥_j ∈ {0, 1}, we can use the Bernoulli distribution, $p(\boldsymbol{x}\mid y=c,\boldsymbol\theta) = \prod_{j=1}^{D}\mathrm{Ber}(x_j\mid\mu_{jc})$, where 𝜇_jc is the probability that feature 𝑗 occurs in class 𝑐. This is called the multivariate Bernoulli naive Bayes model.
For categorical features, 𝑥_j ∈ {1, …, 𝐾}, we can use the categorical distribution, $p(\boldsymbol{x}\mid y=c,\boldsymbol\theta) = \prod_{j=1}^{D}\mathrm{Cat}(x_j\mid\boldsymbol\mu_{jc})$, where 𝝁_jc is a vector of the probabilities over the 𝐾 possible values for 𝑥_j in class 𝑐.
"Training" a naive Bayes classifier usually refers to computing the MLE or the MAP estimates for the model parameters.
MLE for Naïve Bayes Classifier
The probability for a single data case (known features and class label) is
$$p(\boldsymbol{x}_i, y_i\mid\boldsymbol\theta) = p(y_i\mid\boldsymbol\pi)\prod_{j} p(x_{ij}\mid y_i,\boldsymbol\theta_j) = \prod_{c}\pi_c^{\,\mathbb{I}(y_i=c)}\;\prod_{j=1}^{D}\prod_{c} p(x_{ij}\mid\boldsymbol\theta_{jc})^{\mathbb{I}(y_i=c)}$$
Hence the joint log-likelihood is given by
$$\log p(\mathcal{D}\mid\boldsymbol\theta) = \sum_{c=1}^{C} N_c\log\pi_c + \sum_{j=1}^{D}\sum_{c=1}^{C}\sum_{i:\,y_i=c}\log p(x_{ij}\mid\boldsymbol\theta_{jc})$$
The MLE for the class prior is given (proof as given earlier) by
$$\hat\pi_c = \frac{N_c}{N}, \qquad N_c = \sum_{i}\mathbb{I}(y_i=c)$$
where 𝑁_c is the number of examples in class 𝑐.
The MLE for 𝜽_jc depends on the type of distribution we use for each feature. For simplicity, let us suppose all features are binary, so $x_j\mid y=c \sim \mathrm{Ber}(\theta_{jc})$. The MLE then becomes
$$\hat\theta_{jc} = \frac{N_{jc}}{N_c}, \qquad N_c = \sum_i\mathbb{I}(y_i=c), \qquad N_{jc} = \sum_{i:\,y_i=c}\mathbb{I}(x_{ij}=1)$$
[Figure: bar plots of the fitted class-conditional feature probabilities 𝜃̂_jc, one panel per class.]
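These counting formulas are straightforward to implement; a minimal NumPy sketch (the array layout and function name are ours):

```python
import numpy as np

def fit_bernoulli_nb_mle(X, y, C):
    """MLE for a Bernoulli naive Bayes model.

    X: (N, D) binary feature matrix, y: (N,) class labels in {0, ..., C-1}.
    Returns pi_hat[c] = N_c / N and theta_hat[c, j] = N_jc / N_c.
    """
    pi_hat = np.array([(y == c).mean() for c in range(C)])
    theta_hat = np.vstack([X[y == c].mean(axis=0) for c in range(C)])
    return pi_hat, theta_hat

# Tiny synthetic example: 6 documents, 3 binary features, 2 classes.
X = np.array([[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_bernoulli_nb_mle(X, y, C=2))
```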
We use a 𝒟𝒾𝓇(𝜶) prior for 𝝅 and a ℬℯ𝓉𝒶(𝛽₀, 𝛽₁) prior for each 𝜃_jc. Often we take 𝜶 = 𝟏 and 𝜷 = 𝟏, corresponding to add-one or Laplace smoothing.
Combining the factored likelihood (for binary features)
$$p(\mathcal{D}\mid\boldsymbol\theta) = \prod_{c}\pi_c^{\,N_c}\prod_{j}\prod_{c}\theta_{jc}^{\,N_{jc}}\,(1-\theta_{jc})^{\,N_c-N_{jc}}$$
with the factored prior above gives the following factored posterior:
$$p(\boldsymbol\theta\mid\mathcal{D}) = p(\boldsymbol\pi\mid\mathcal{D})\prod_{j=1}^{D}\prod_{c=1}^{C} p(\theta_{jc}\mid\mathcal{D})$$
$$p(\boldsymbol\pi\mid\mathcal{D}) = \mathrm{Dir}(\boldsymbol\pi\mid N_1+\alpha_1,\ldots,N_C+\alpha_C), \qquad p(\theta_{jc}\mid\mathcal{D}) = \mathrm{Beta}\big(\theta_{jc}\mid N_{jc}+\beta_0,\;(N_c-N_{jc})+\beta_1\big)$$
We thus simply update the prior counts with the empirical counts from the likelihood.
Predictive Distribution
At test time, the goal is to compute the predictive distribution
$$p(y=c\mid\boldsymbol{x},\mathcal{D}) \propto p(y=c\mid\mathcal{D})\prod_{j=1}^{D} p(x_j\mid y=c,\mathcal{D})$$
The correct Bayesian procedure is to integrate out the unknown parameters:
$$p(y=c\mid\boldsymbol{x},\mathcal{D}) \propto \left[\int \mathrm{Cat}(y=c\mid\boldsymbol\pi)\,p(\boldsymbol\pi\mid\mathcal{D})\,d\boldsymbol\pi\right]\prod_{j=1}^{D}\int \mathrm{Ber}(x_j\mid y=c,\theta_{jc})\,p(\theta_{jc}\mid\mathcal{D})\,d\theta_{jc}$$
Carrying out these integrals gives the same functional form as the plug-in rule below, but with the posterior-mean parameters $\bar\pi_c$ and $\bar\theta_{jc}$. If we instead plug in point estimates (MAP or MLE), we get
$$p(y=c\mid\boldsymbol{x},\mathcal{D}) \propto \hat\pi_c\prod_{j=1}^{D}\hat\theta_{jc}^{\;\mathbb{I}(x_j=1)}\big(1-\hat\theta_{jc}\big)^{\mathbb{I}(x_j=0)}$$
The only difference is that we replaced the posterior means $\bar\pi_c, \bar\theta_{jc}$ with the MAP or MLE estimates $\hat\pi_c, \hat\theta_{jc}$.
This small difference can matter in practice, since using the posterior mean results in less overfitting.
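A sketch of the smoothed plug-in prediction, computed in log space to avoid the underflow problem discussed next (function and variable names are ours; the parameter values are made up):

```python
import numpy as np
from scipy.special import logsumexp

def predict_log_proba(x, pi_bar, theta_bar):
    """log p(y = c | x) for a Bernoulli naive Bayes model.

    x: (D,) binary vector; pi_bar: (C,) class priors; theta_bar: (C, D) feature probabilities,
    e.g. the posterior means theta_bar[c, j] = (N_jc + 1) / (N_c + 2) under a Beta(1, 1) prior.
    """
    # b_c = log p(y=c) + sum_j [ x_j log theta_jc + (1 - x_j) log(1 - theta_jc) ]
    b = np.log(pi_bar) + x @ np.log(theta_bar).T + (1 - x) @ np.log(1 - theta_bar).T
    return b - logsumexp(b)      # normalize by log sum_c exp(b_c)

pi_bar = np.array([0.5, 0.5])
theta_bar = np.array([[0.8, 0.2, 0.4], [0.2, 0.8, 0.6]])
print(np.exp(predict_log_proba(np.array([1, 0, 1]), pi_bar, theta_bar)))
```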
A naive implementation of this computation can fail due to numerical underflow. The problem is that 𝑝(𝒙|𝑦 = 𝑐) is often a very small number, especially if 𝒙 is a high-dimensional vector.
The obvious solution is to take logs when applying Bayes' rule, as follows:
$$\log p(y=c\mid\boldsymbol{x}) = b_c - \log\sum_{c'=1}^{C} e^{b_{c'}}, \qquad b_c \triangleq \log p(\boldsymbol{x}\mid y=c) + \log p(y=c)$$
The Log-Sum-Exp Trick
However, this requires evaluating the following expression:
$$\log\sum_{c'=1}^{C} e^{b_{c'}} = \log\sum_{c'} p(y=c',\boldsymbol{x}) = \log p(\boldsymbol{x})$$
One can factor out the largest term and represent the remaining numbers relative to it, e.g.
$$\log\big(e^{-120}+e^{-121}\big) = \log\big(e^{0}+e^{-1}\big) - 120$$
In general, with $B = \max_c b_c$, we have
$$\log\sum_{c} e^{b_c} = \log\left[\left(\sum_{c} e^{b_c-B}\right)e^{B}\right] = \left[\log\sum_{c} e^{b_c-B}\right] + B$$
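A direct translation of this identity (SciPy provides the same functionality as scipy.special.logsumexp):

```python
import numpy as np

def log_sum_exp(b):
    """Compute log(sum_c exp(b_c)) stably by factoring out B = max_c b_c."""
    B = np.max(b)
    return B + np.log(np.sum(np.exp(b - B)))

b = np.array([-1200.0, -1201.0])
print(log_sum_exp(b))               # ~ -1199.69, computed without underflow
print(np.log(np.sum(np.exp(b))))    # -inf: exp(-1200) underflows to 0 in double precision
```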
Recall that $b_c \triangleq \log p(\boldsymbol{x}\mid y=c) + \log p(y=c)$ (cf. the PMTK implementation).
All of these quantities are computed when fitting the naive Bayes classifier.
Feature Selection Using Mutual Information
The words with highest MI are much more discriminative than the words which
are most probable.
Most Probable Words: In the earlier example, the most probable word in both classes is "subject"; it always occurs because this is newsgroup data, where every message has a subject line. Obviously this is not very discriminative.
Most Discriminative Words: The words with highest MI with the class label are
(in decreasing order) “windows”, “microsoft”, “DOS” and “motif”. This makes
sense, since the two classes correspond to Microsoft Windows and 𝑋
Windows.
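The MI criterion itself is not reproduced in the extracted slide; for reference, a standard form for a binary feature 𝑥_j and the class label 𝑦 (written in the notation of the Bernoulli naive Bayes model above, so the exact expression here is our addition) is
$$I(X_j, Y) = \sum_{c}\pi_c\left[\theta_{jc}\log\frac{\theta_{jc}}{\theta_j} + (1-\theta_{jc})\log\frac{1-\theta_{jc}}{1-\theta_j}\right], \qquad \theta_j \triangleq \sum_{c}\pi_c\,\theta_{jc},$$
where $\theta_{jc} = p(x_j=1\mid y=c)$, $\pi_c = p(y=c)$, and $\theta_j = p(x_j=1)$.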
In the multinomial document classifier, 𝒙_i is a vector of counts (e.g., word counts in a bag-of-words representation), with class-conditional density
$$p(\boldsymbol{x}_i\mid y_i=c,\boldsymbol\theta) = \mathrm{Mu}(\boldsymbol{x}_i\mid N_i,\boldsymbol\theta_c) = \frac{N_i!}{\prod_{j=1}^{D}x_{ij}!}\prod_{j=1}^{D}\theta_{jc}^{\,x_{ij}}$$
where $\sum_{j=1}^{D} x_{ij} = N_i$. Because of this constraint, the features are not independent. The parameters satisfy $\sum_{j=1}^{D}\theta_{jc} = 1$ for each class 𝑐.
Burstiness of Words
The multinomial classifier is easy to train and use for predictions; however, it does not take into account the burstiness of word usage.
Words occur in bursts: most words never appear in any given document, but if
they do appear once, they are likely to appear more than once.
The multinomial model cannot capture this: the likelihood depends on each count through a term of the form $\theta_{jc}^{\,N_{ij}}$, and since $\theta_{jc}\ll 1$ for rare words, it becomes increasingly unlikely to generate many of them.
Multinomial Document Classifiers
For more frequent words, the decay rate of $\theta_{jc}^{\,N_{ij}}$ is not as fast. To see why intuitively, note that the most frequent words are function words, which are not specific to the class, such as "and", "the", and "but".
The independence assumption is more reasonable for common words: e.g.
the chance of the word “and” occurring is pretty much the same no matter
how many times it has previously occurred.
Since rare words are the ones that matter most for classification purposes,
these are the ones we want to model the most carefully.
Various ad hoc heuristics have been proposed to improve the performance of
the multinomial document classifier.
Rennie, J., L. Shih, J. Teevan, and D. Karger (2003). Tackling the poor assumptions of naive Bayes text classifiers.
In Intl. Conf. on Machine Learning.
Madsen, R., D. Kauchak, and C. Elkan (2005). Modeling word burstiness using the Dirichlet distribution. In Intl.
Conf. on Machine Learning.
Dirichlet Compound Multinomial (DCM) Density
Suppose we simply replace the multinomial class-conditional density with the Dirichlet Compound Multinomial (DCM) density, defined as
$$p(\boldsymbol{x}_i\mid y_i=c,\boldsymbol\alpha) = \int \mathrm{Mu}(\boldsymbol{x}_i\mid N_i,\boldsymbol\theta_c)\,\mathrm{Dir}(\boldsymbol\theta_c\mid\boldsymbol\alpha_c)\,d\boldsymbol\theta_c = \frac{N_i!}{\prod_{j=1}^{D}x_{ij}!}\;\frac{B(\boldsymbol{x}_i+\boldsymbol\alpha_c)}{B(\boldsymbol\alpha_c)}$$
where $B(\cdot)$ denotes the multivariate beta function.
After seeing one occurrence of a word, say word 𝑗, the posterior counts on 𝜃_j get updated, making another occurrence of word 𝑗 more likely. By contrast, if 𝜃_j is fixed, then the occurrences of each word are independent. The DCM therefore captures the burstiness phenomenon.
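A sketch of evaluating the DCM log-density in closed form using log-gamma functions (function and variable names are ours):

```python
import numpy as np
from scipy.special import gammaln

def log_dcm(x, alpha):
    """log p(x | alpha) for the Dirichlet compound multinomial; x is a vector of counts."""
    N = x.sum()
    log_coeff = gammaln(N + 1) - gammaln(x + 1).sum()                  # log N! / prod_j x_ij!
    log_beta_ratio = (gammaln(alpha.sum()) - gammaln(alpha.sum() + N)
                      + (gammaln(x + alpha) - gammaln(alpha)).sum())   # log B(x + alpha) / B(alpha)
    return log_coeff + log_beta_ratio

x = np.array([3, 0, 1, 5])
alpha = np.ones(4)
print(log_dcm(x, alpha))
```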