
Lecture 2: Bayesian inference in practice

Ben Lambert
[email protected]

Imperial College London

Tuesday 5th March, 2019


Outline

1 Model testing through posterior predictive checks

2 Why is exact Bayesian statistics hard?

3 Attempts to deal with the difficulty

4 Sampling
Example: modelling rainfall in Oxford

Measure the average rainfall by month in Oxford.

Scenario: modelling Oxford rainfall for farmers

The government needs a model for rainfall to help plan the
budget for farmers' subsidies over the next 5 years.
Crop yields depend on rainfall following typical seasonal
patterns.
If rainfall is persistently above normal for a number of
months =⇒ yields fall.
Assume crops are more tolerant to drier spells.
=⇒ create a binary variable equal to 1 if rainfall is above
average; 0 otherwise.
Scenario: modelling Oxford rainfall for farmers

[Figure: monthly rainfall in Oxford (mm) over time]

[Figure: monthly rainfall with the long-term average rainfall by month overlaid]

[Figure: binary rainfall indicator (1 if rainfall is above the long-term monthly average, 0 otherwise) over time]
Choosing a likelihood

We are building a model to explain Xt ∈ {0, 1}: whether the rainfall in
month t exceeds the long-term monthly average.
Independence: the value of Xt in month t is independent
of its values in previous months.
Identical distribution: all months in our sample have the
same probability θ of rainfall exceeding the long-term
average.

Choosing a likelihood

Conditions:
Xt ∈ {0, 1} is a discrete random variable.
Assume independence among the Xt.
Assume an identical distribution for each Xt: the probability of
rainfall exceeding the monthly average is θ.
=⇒ a Bernoulli likelihood for each individual Xt.
The Bernoulli likelihood

Xt measures whether or not the rainfall in month t is above the
long-term average. A Bernoulli likelihood for a single Xt has
the form:

p(Xt | θ) = θ^Xt (1 − θ)^(1−Xt)                                 (1)

But what does this mean? Work out the probabilities given θ:

p(Xt = 1 | θ) = θ^1 (1 − θ)^0 = θ
p(Xt = 0 | θ) = θ^0 (1 − θ)^1 = 1 − θ
Likelihood vs sampling distribution

Question: what is the difference between a likelihood and a
sampling/probability distribution?
Answer: they are given by the same object, but under different
conditions ("the equivalence relation"). Consider a single Xt:

L(θ | Xt) = p(Xt | θ)                                           (2)

If we hold θ constant and vary Xt =⇒ sampling distribution,
Xt ∼ p(Xt | θ).
If we hold Xt constant and vary θ =⇒ likelihood function,
L(θ | Xt).
In Bayes' rule we vary θ =⇒ we use the likelihood
interpretation.
Likelihood vs sampling distribution

Sampling distribution: hold the parameter constant, for example
θ = 0.75:

Pr(Xt = 1 | θ = 0.75) = 0.75^1 (1 − 0.75)^0 = 0.75
Pr(Xt = 0 | θ = 0.75) = 0.75^0 (1 − 0.75)^1 = 0.25

Likelihood: hold the data constant, for example consider Xt = 1:

L(θ | Xt = 1) = θ^1 (1 − θ)^0 = θ                               (3)

Therefore here the sampling distribution is discrete, whereas
the likelihood is a continuous function of θ.
Likelihood vs sampling distribution

Sampling distribution: hold θ constant and vary the data Xt
=⇒ a valid probability distribution. For example, for θ = 0.75:

[Figure: probability mass function over Xt (rainfall above monthly average): Pr(Xt = 0) = 0.25, Pr(Xt = 1) = 0.75]
Likelihood vs sampling distribution

Likelihood: hold Xt = 1 and vary θ
=⇒ L(θ | Xt = 1) = θ^1 (1 − θ)^0 = θ:

[Figure: likelihood L(θ | Xt = 1) = θ plotted against θ, the probability that monthly rain is above average]
Likelihood vs sampling distribution

Likelihood: hold Xt = 1 and vary θ. This is not a valid probability
distribution!

[Figure: likelihood L(θ | Xt = 1) = θ plotted against θ; the area under the curve is 0.5, not 1]
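A one-line check of the area shown in the figure: ∫₀¹ L(θ | Xt = 1) dθ = ∫₀¹ θ dθ = 1/2 ≠ 1, so the likelihood does not integrate to one and is therefore not a probability density over θ.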
The overall likelihood

Now assume that we have a series X = (X1, X2, ..., XT).
Question: how do we obtain the full likelihood? By
independence:

p(X1, X2, ..., XT | θ) = θ^X1 (1 − θ)^(1−X1) × θ^X2 (1 − θ)^(1−X2) × ... × θ^XT (1 − θ)^(1−XT)
                       = θ^(Σt Xt) (1 − θ)^(T − Σt Xt)

So if we suppose rain exceeded the average in 4 out of 12 months =⇒

L(θ | X) = θ^4 (1 − θ)^8                                        (4)
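A minimal Python sketch, assuming the 4-out-of-12 example above, that evaluates this likelihood on a grid of θ values and finds where it peaks:

import numpy as np

theta = np.linspace(0, 1, 1001)          # grid of candidate theta values
likelihood = theta**4 * (1 - theta)**8   # L(theta | X) for 4 wet months out of 12

print("theta maximising the likelihood:", theta[np.argmax(likelihood)])   # ~ 4/12 = 0.333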


Posterior predictive distribution

Defined:
"The probability distribution for a new data sample X̃ given
our current data X."
We obtain this by the following recipe:

1 Sample a value θi from the posterior, θi ∼ p(θ | X),          (5)
  where X is the current data.
2 Sample a value X̃i from the sampling distribution conditional
  on θi, X̃i ∼ p(X̃ | θi).                                        (6)
3 Plot a histogram of the X̃i values =⇒ the posterior predictive
  distribution (a code sketch follows).
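A minimal sketch of this recipe for the rainfall model, assuming (purely for illustration) a Beta(5, 9) posterior for θ, e.g. a uniform prior updated with 4 wet months out of 12:

import numpy as np

rng = np.random.default_rng(1)
n_draws = 10_000

# Step 1: sample theta_i from the (assumed) Beta(5, 9) posterior.
theta = rng.beta(5, 9, size=n_draws)

# Step 2: for each theta_i, sample new data from the Bernoulli sampling
# distribution; here we record the number of wet months in a new year.
new_wet_months = rng.binomial(n=12, p=theta)

# Step 3: the histogram of these values is the posterior predictive
# distribution of the number of wet months in a year.
counts = np.bincount(new_wet_months, minlength=13)
print(counts / n_draws)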
Scenario 1: key question

Crop yields depend on whether rainfall is persistently
above average.
Key question: does the model allow for sufficient
persistence in the process?
Answer: find the length of the maximum run of consecutive
Xt = 1 in the real data. Then:
- Draw a sample data series 60 months long from the
  posterior predictive distribution.
- Find the maximum run of consecutive Xt = 1 in the
  simulated series.
Repeat the above steps a number of times.
Compare the real maximum run length with the distribution of
simulated run lengths (see the sketch after this list).
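A minimal sketch of the whole check, again assuming a Beta(5, 9) posterior for θ for illustration and a placeholder for the real 60-month series; the test statistic is the maximum run of consecutive wet months:

import numpy as np

rng = np.random.default_rng(2)

def max_run(x):
    # Length of the longest run of consecutive 1s in a 0/1 sequence.
    best = run = 0
    for value in x:
        run = run + 1 if value == 1 else 0
        best = max(best, run)
    return best

real_data = rng.integers(0, 2, size=60)   # placeholder for the real 60-month indicator series
t_actual = max_run(real_data)

n_sims = 10_000
t_sim = np.empty(n_sims, dtype=int)
for i in range(n_sims):
    theta_i = rng.beta(5, 9)                    # step 1: draw theta from the (assumed) posterior
    sim = rng.binomial(1, theta_i, size=60)     # step 2: simulate a 60-month series
    t_sim[i] = max_run(sim)                     # step 3: record the test statistic

# Posterior predictive p value: fraction of simulations at least as extreme as the real data.
print(np.mean(t_sim >= t_actual))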
Scenario: maximum length run of wet months in real data

Start with the real data.
Find the maximum run of Xt = 1 (rainfall above average).

Nmax,real = 7

[Figure: real rainfall indicator series, 2010-2015; the longest run of wet months is 7]

Scenario: posterior predictive checks

Repeat for data simulated from the posterior predictive distribution.

Nmax,sim = 5

[Figure: one simulated rainfall indicator series, 2010-2015; longest run of wet months is 5]

Scenario: posterior predictive checks

Another sample gives Nmax,sim = 4; a further sample gives Nmax,sim = 7.

[Figures: two further simulated rainfall indicator series, with longest runs of 4 and 7]

Scenario: posterior predictive checks

A number of samples.

[Figure: nine simulated rainfall indicator series, 2010-2015, with maximum runs Nmax,sim = 7, 2, 5, 3, 3, 3, 9, 2, 4]
Scenario: p value

Repeat 10,000 times, each time recording the maximum run length.
Find the percentage of times that the simulated maximum run length
is at least as long as the real one:

Pr(Tsim ≥ Tactual | X) = 5.0%

[Figure: histogram of the maximum run of wet months across the 10,000 simulated series]

Therefore we conclude that the model is not fit for purpose: only 5%
of the simulated series contain a run of wet months as long as the
one actually observed.
1 Model testing through posterior predictive checks

2 Why is exact Bayesian statistics hard?

3 Attempts to deal with the difficulty

4 Sampling
Example problem: paternal discrepancy

Paternal discrepancy (PD) is the term given to a child whose
biological father is different from their supposed
biological father.
Question: how common is it?
Answer: a recent meta-analysis of studies of paternal
discrepancy found a rate of ∼ 10%.
Suppose we have data on the presence/absence of PD for a random
sample of 10 children.
Aim: infer the prevalence of PD in the population (θ).
Paternal discrepancy

Assume individual samples are:


Independent.
Identically-distributed.
Since sample size is fixed at 10 =⇒ binomial likelihood.
The denominator revisited

p(θ | X = 2) = p(X = 2 | θ) × p(θ) / p(X = 2)                   (7)

where we suppose we have observed X = 2 cases out of a sample of 10 in
our PD example. We obtain the denominator by averaging out
all θ dependence. This is equivalent to integrating across all θ:

p(X = 2) = ∫₀¹ p(X = 2 | θ) × p(θ) dθ                           (8)

(We approximately determined this using sampling previously.)
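A minimal numerical sketch of this one-dimensional integral, assuming (for illustration only) a uniform prior p(θ) = 1 and the binomial likelihood for X = 2 out of 10:

import numpy as np
from scipy import integrate, stats

# Integrand: binomial likelihood for X = 2 successes out of 10, times a
# uniform prior p(theta) = 1 on (0, 1) (the uniform prior is an assumption
# made here for illustration).
def integrand(theta):
    return stats.binom.pmf(2, 10, theta) * 1.0

p_x, _ = integrate.quad(integrand, 0.0, 1.0)
print(p_x)   # = 1/11 ~ 0.09 under this uniform prior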


The denominator as an area

[Figure: three panels against θ (PD prevalence, %): the prior pdf, the likelihood, and likelihood × prior; the area under the likelihood × prior curve is the denominator, Pr(X = 2) ≃ 0.08]
Calculating the denominator in 1 dimension

For our PD example there is a single parameter θ =⇒

p(X = 2) = ∫₀¹ p(X = 2 | θ) × p(θ) dθ                           (9)

This is equivalent to working out an area under a curve.

[Figure: likelihood × prior plotted against θ (PD prevalence, %); the area under the curve is Pr(X = 2) ≃ 0.08]
Calculating the denominator in 2 dimensions

If we considered a different model with two
parameters θ1 ∈ (0, 1) and θ2 ∈ (0, 1) =⇒

p(X = 2) = ∫₀¹ ∫₀¹ p(X = 2 | θ1, θ2) × p(θ1, θ2) dθ1 dθ2        (10)

This is equivalent to working out a volume contained within a
surface.
Calculating the denominator in d dimensions

If we considered a different model with d
parameters (θ1, ..., θd), all defined to lie between 0 and 1 =⇒

p(X = 2) = ∫₀¹ ... ∫₀¹ p(X = 2 | θ1, ..., θd) × p(θ1, ..., θd) dθ1 ... dθd      (11)

This is equivalent to working out a (d + 1)-dimensional
volume contained within a d-dimensional hyper-surface!

[Meme: "I have no idea what I'm doing."]
The difficult denominator

Calculating the denominator is possible for d ≲ 3 using
computers.
Numerical quadrature and many other approximate
schemes struggle for larger d.
Many models have thousands of parameters.

Arrrghhh!
Other difficult integrals

Assume we can calculate the posterior:

p(θ | X) = p(X | θ) × p(θ) / p(X)                               (12)

Typically we want summary measures of the posterior, for example
the mean of θ1:

E(θ1 | X) = ∫_Θ1 θ1 [ ∫_Θ2 ... ∫_Θd p(θ1, θ2, ..., θd | X) dθd ... dθ2 ] dθ1
          = ∫_Θ1 θ1 p(θ1 | X) dθ1

This is nearly as difficult as the denominator!


1 Model testing through posterior predictive checks

2 Why is exact Bayesian statistics hard?

3 Attempts to deal with the difficulty

4 Sampling
What are conjugate priors?

A judicious choice of prior and likelihood can make the posterior
calculation trivial.
Choose a likelihood L.
Choose a prior θ ∼ f ∈ F, where:
- F is a family of distributions.
- f is a member of that family.
If the posterior θ | X ∼ f′ ∈ F =⇒ conjugate!
In other words, both the prior and the posterior are members
of the same family of distributions.
Conjugate priors: PD example revisited

Sample 10 children and count the number (X) with PD.

For the likelihood (if independent and identically distributed):

X ∼ Binomial(10, θ) =⇒ p(X | θ) ∝ θ^X (1 − θ)^(10−X)            (13)

For the prior assume a Beta distribution (a reasonable choice
since θ ∈ (0, 1)):

θ ∼ Beta(α, β) =⇒ p(θ) ∝ θ^(α−1) (1 − θ)^(β−1)                  (14)

Numerator of Bayes' rule for inference:

p(X | θ) × p(θ) ∝ θ^X (1 − θ)^(10−X) × θ^(α−1) (1 − θ)^(β−1)    (15)

Conjugate priors: PD example revisited

Numerator of Bayes' rule for inference:

p(X | θ) × p(θ) ∝ θ^X (1 − θ)^(10−X) × θ^(α−1) (1 − θ)^(β−1)
               = θ^(X+α−1) (1 − θ)^(10−X+β−1)

This has the same θ-dependence as a Beta(X + α, 10 − X + β)
density =⇒ it must be this distribution!
∴ a Beta prior is conjugate to a binomial likelihood.
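A minimal sketch of this update in Python, assuming (for illustration) a Beta(1, 1) prior and X = 2 children with PD out of 10:

from scipy import stats

alpha, beta = 1, 1      # prior Beta(1, 1), i.e. uniform (an assumption for illustration)
n, x = 10, 2            # sample of 10 children, 2 with PD

# Conjugate update: posterior is Beta(x + alpha, n - x + beta).
posterior = stats.beta(x + alpha, n - x + beta)

print(posterior.mean())              # posterior mean of theta
print(posterior.interval(0.95))      # central 95% credible interval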
Table of common conjugate pairs of likelihoods and priors

No need to do any integrals! Just look up the rules (sums run over
i = 1, ..., n):

Likelihood     Prior             Posterior
Bernoulli      Beta(α, β)        Beta(α + Σ Xi, β + n − Σ Xi)
Binomial       Beta(α, β)        Beta(α + Σ Xi, β + Σ Ni − Σ Xi)
Poisson        Gamma(α, β)       Gamma(α + Σ Xi, β + n)
Multinomial    Dirichlet(α)      Dirichlet(α + Σ Xi)
Normal         Normal-inv-Γ      Normal-inv-Γ
Limits of conjugate modelling

Using conjugate priors is limiting because:
We are often restricted to univariate problems
- =⇒ we could just use numerical quadrature instead.
We are required to use the relevant conjugate prior for a given
likelihood ⇐= this may not be sufficient to capture the pre-data
beliefs of the analyst.
Another solution: discrete Bayes' rule

To calculate the denominator we need to do an integral if the
parameters are continuous.
If instead the parameters are discrete =⇒ the denominator is a
sum over a finite number of possible parameter values:

p(X) = Σ_{i=1}^p p(X | θi) × p(θi)                              (16)

In general this sum is more tractable than an integral.
Question: can we use this to help us with continuous
parameter problems?
Discretised Bayesian inference

Method:
Convert continuous parameter into k discrete values.
Use discrete version of Bayes’ rule.
As k → ∞ discrete posterior → true posterior.
Scenario: discretised Bayesian inference

Xt measures whether rainfall exceeds the long-term monthly
average.
Suppose Xt = 1 and Xt+1 = 0.
Assumed likelihood: p(Xt = 1, Xt+1 = 0 | θ) = θ(1 − θ).
Assumed prior: p(θ) = 1.
Discretise θ ∈ (0, 1) → {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}.
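A minimal sketch of the discretised calculation for this two-observation example; the normalisation is a sum over the grid rather than an integral:

import numpy as np

k = 6                                        # number of grid points (raise k for a finer approximation)
theta_grid = np.linspace(0.0, 1.0, k)        # discretised values of theta

prior = np.ones(k)                           # p(theta) = 1 at every grid point (uniform)
likelihood = theta_grid * (1 - theta_grid)   # p(Xt=1, Xt+1=0 | theta) = theta(1 - theta)

unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()   # discrete Bayes' rule: divide by the sum

for t, p in zip(theta_grid, posterior):
    print(f"theta = {t:.1f}: posterior probability = {p:.3f}")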
Scenario: discretised Bayesian inference

Discretise θ at intervals of 0.2.

[Figure: prior, likelihood and posterior evaluated on a grid of θ values spaced 0.2 apart]

Scenario: discretised Bayesian inference

Discretise θ at intervals of 0.02.

[Figure: prior, likelihood and posterior evaluated on a grid of θ values spaced 0.02 apart; the discrete posterior now closely approximates the continuous posterior]
The problem with discretised Bayes

1 parameter → 10 grid points.

[Figure: 10 grid points along a line]

2 parameters → 10² = 100 grid points.

[Figure: a 10 × 10 grid of points]

The problem with discretised Bayes and numerical quadrature

Question: how many grid points do we need for a
20-parameter model?
Answer: 10²⁰ = 100,000,000,000,000,000,000 grid points ∴
impossible!

The same goes for other methods that make Bayesian inference
discrete, for example numerical quadrature.
The problem with the aforementioned methods: summary

Bayesian inference requires us to compute difficult integrals, both
for the denominator and for posterior summaries.
Conjugate priors are too simple for most real-life examples.
Another approach is to approximate the integrals by discretising
them into sums.
This works reasonably well for models with a few parameters,
but it doesn't scale to models with more than about 3
parameters (the curse of dimensionality).
Question: can we find a method whose complexity is
independent of the number of parameters?
1 Model testing through posterior predictive checks

2 Why is exact Bayesian statistics hard?

3 Attempts to deal with the difficulty

4 Sampling
Black box die

A black box contains a die with an unknown number of
faces and unknown weightings towards its sides.
Shake the box and view the number that lands face up
through a viewing window.
Note: an individual shake represents one sample from the
probability distribution of the die.
Black box die: estimating mean

Question: How can we estimate the die’s mean?


Answer: shake it off! Then calculate the overall mean
across all shakes.
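A minimal simulation of this idea, assuming (for illustration) a six-sided die with made-up weights:

import numpy as np

rng = np.random.default_rng(0)

faces = np.array([1, 2, 3, 4, 5, 6])
weights = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.2])   # made-up weightings (an assumption)

true_mean = np.sum(faces * weights)                  # E(X) = sum over faces of Pr(X = x_j) * x_j

shakes = rng.choice(faces, size=1000, p=weights)     # 1000 "shakes" of the box
print(shakes.mean(), true_mean)                      # the sample mean approximates the true mean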
Computational die in a box: results

[Figure: running mean of the die value against the number of shakes (0 to 100); current value = 26]
Black box die: sampling to estimate a sum

The mean of a sample of size n is:

X̄ = (1/n) Σ_{i=1}^n Xi                                          (17)

whereas the true mean of the die is given by:

E(X) = Σ_{j=1}^{#faces} Pr(Xj = xj) × xj                        (18)

For a sample size of ≲ 1000 we were able to estimate:

X̄ ≈ E(X)                                                        (19)
An infinitely-sided die as a continuous distribution

[Figure: dice with increasing numbers of faces (1, 6, 50, ...), each face labelled with a real number between 0 and 1]

Imagine increasing the number of faces to infinity (a
strange die indeed).
Each face corresponds to one real number between 0 and 1.
All possible numbers between 0 and 1 are covered.
This is essentially a continuous uniform distribution between
0 and 1.
An infinitely-sided die

However, its mean is now given by an integral rather than
a sum:

E(X) = ∫_{all faces} p(X) × X dX                                (20)

Question: can we still estimate its true mean by the sample
mean?
If so, this amounts to estimating the above integral!
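A minimal check for the uniform case, where the integral is E(X) = ∫₀¹ x dx = 0.5:

import numpy as np

rng = np.random.default_rng(3)

samples = rng.uniform(0.0, 1.0, size=1000)   # 1000 rolls of the infinitely-sided die
print(samples.mean())                        # Monte Carlo estimate of E(X); the true value is 0.5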
Continuous distribution sampling

[Figure: running mean against the number of shakes (0 to 100) for the continuous uniform die; current value = 0.725545]
A stranger distribution

The method seems to work for the continuous uniform distribution.
Question: does it work for other distributions?

[Figure: pdf of a stranger distribution over die values ranging from -10 to 10]
A stranger distribution: sampling

[Figure: running mean against the number of shakes (0 to 100) for the stranger distribution; current value = 4.85897]
A stranger distribution: why does sampling work?

Compare the histogram of samples...

[Figure: histogram of sampled die values between -10 and 10]

...with the actual distribution =⇒ same shape!

[Figure: the sample histogram compared with the actual pdf]

Therefore sample properties → actual properties.

Note: nowhere have we explicitly mentioned the parameter
dimension (complexity-free scaling?).
What is an independent sample?

The aforementioned methods require us to generate
independent samples from the distribution.
Question: what is an independent sample?
Answer: a value drawn from the distribution that is
unconnected to the other samples (apart from their joint
reliance on the distribution).
How to generate independent samples?

By definition, using independent sampling to estimate
integrals requires us to be able to generate independent
samples: θi ∼ p(θ).
This is not as simple as it might first appear.
Most statistical software has an inbuilt ability to generate
(pseudo-)independent samples from a few basic
distributions: uniform, normal, Poisson, etc.
However, for more complex distributions it is not trivial to
create an independent sampler.
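One standard route (not covered explicitly on this slide) is inverse-transform sampling: push uniform draws through the inverse CDF of the target distribution. A minimal sketch for an exponential distribution, as an illustration only; the difficulty for complex distributions is that no closed-form inverse CDF is available:

import numpy as np

rng = np.random.default_rng(4)

# Inverse-transform sampling: if U ~ Uniform(0, 1) and F is the target CDF,
# then F^{-1}(U) is an independent sample from the target distribution.
def sample_exponential(rate, size):
    u = rng.uniform(0.0, 1.0, size=size)
    return -np.log(1.0 - u) / rate       # inverse CDF of Exponential(rate)

draws = sample_exponential(rate=2.0, size=10_000)
print(draws.mean())                      # should be close to 1/rate = 0.5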
Summary

The posterior is a weighted average of the prior and the likelihood,
where the weight of the likelihood is determined by the amount of data.
Posterior predictive distributions show the implications of the
posterior for the observable world.
Exact Bayes is hard due to the difficulty of calculating the
posterior and other high-dimensional integrals.
Conjugate priors can make analysis simpler, although they are
highly restrictive.
Discretisation can work for low-dimensional problems but
cannot cope with more complex models.
Independent sampling can help to estimate integrals but
can be hard to do in practice (see the problem set).
