
An intro to ABC – approximate Bayesian computation
PhD course FMS020F–NAMS002 “Statistical inference for partially
observed stochastic processes”, Lund University

http://goo.gl/sX8vU9

Umberto Picchini
Centre for Mathematical Sciences,
Lund University

Umberto Picchini ([email protected])


In this lecture we consider the case where it is not possible to pursue exact inference for the model parameters θ, nor is it possible to approximate the likelihood function of θ within a given computational budget and the available time.
The above is not a rare circumstance.
Since the advent of affordable computers and the introduction of advanced statistical methods, researchers have become increasingly ambitious and try to formulate and fit very complex models.
Example: MCMC (Markov chain Monte Carlo) has provided universal machinery for Bayesian inference since its rediscovery in the statistical community in the early 90's.
Thanks to MCMC (and related methods), scientists' ambitions have been pushed further and further.


However, for complex models (and/or large datasets) MCMC is often impractical: calculating the likelihood, or even an approximation thereof, might be impossible.

For example, in spatial statistics INLA (integrated nested Laplace approximation) is a welcome alternative to the more expensive MCMC.

Also, MCMC is not online: when new observations arrive we have to re-compute the whole likelihood for the total set of observations, i.e. we cannot make use of the likelihood computed at previous observations.


Particle marginal methods (particle MCMC) are a fantastic possibility for exact Bayesian inference in state-space models. But what can we do for non-state-space models?

And what can we do when the dimension of the mathematical system is large and the implementation of particle filters with millions of particles is infeasible?


There is an increasing interest in statistical methods for models that are easy to simulate from, but for which it is impossible to calculate transition densities or likelihoods.
General set-up: we have a complex stochastic process {Xt} with unknown parameters θ. For any θ we can simulate from this process.

We have observations y = f({X0:T}).

We want to estimate θ but we cannot calculate p(y|θ), as this involves integrating over the realisations of {X0:T}.

Notice we are not specifying the probabilistic properties of X0:T nor of Y. We are certainly not restricting ourselves to state-space models.


The likelihood-free idea

The motivating idea of likelihood-free inference:
it is easy to simulate from the model conditionally on the parameters;
so run simulations for many parameter values;
see for which parameter values the simulated data sets match the observed data best.


Different likelihood-free methods

Likelihood-free methods date back to at least Diggle and Gratton (1984) and Rubin (1984, p. 1160).
More recent examples:
Indirect Inference (Gourieroux, Monfort and Renault 1993);
Approximate Bayesian Computation (ABC) (a review is Marin et al. 2011);
the bootstrap filter of Gordon, Salmond and Smith (1993);
the Synthetic Likelihoods method of Wood (2010).


Are approximations worth anything?

Why should we care about approximate methods?

Well, we know the most obvious answer: because this is what we do when exact methods are impractical. No big news...

But I am more interested in the following phenomenon, which I have noticed by direct experience:
Many scientists seem to get intellectual fulfilment from using exact methods, leading to exact inference.
What we might not see is when they fail to communicate that they (consciously or unconsciously) pushed themselves to formulate simpler models, so that exact inference could be achieved.


So the pattern I often notice is:
1 You have a complex scenario, noisy data, unobserved variables, etc.
2 You formulate a pretty realistic model... which you can't fit to data (i.e. exact inference is not possible).
3 You simplify the model (a lot) so that it is now tractable with exact methods.
4 You are happy.

However, you might have simplified the model a wee bit too much for it to be realistic/useful/sound.


John Tukey – 1962
“Far better an approximate answer to the right question, which is
often vague, than an exact answer to the wrong question, which can
always be made precise. ”

If a complex model is the one I want to use to answer the right question, then I prefer to obtain an approximate answer using approximate inference, rather than to fool myself with a simpler model using exact inference.


Gelman and Rubin, 1996
“[...] as emphasized in Rubin (1984), one of the great scientific
advantages of simulation analysis of Bayesian methods is the freedom
it gives the researcher to formulate appropriate models rather than be
overly interested in analytically neat but scientifically inappropriate
models.”

Approximate Bayesian Computation and Synthetic Likelihoods are two approximate methods for inference, with ABC vastly more popular and with older origins.

We will discuss ABC only.


Features of ABC

we only need a generative model, i.e. the model we assume generated the available data y.
we only need to be able to simulate from such a model.
in other words, we do not need to assume anything about the probabilistic features of the model components.
particle marginal methods also assume the ability to simulate from the model, but in addition they assume a specific model structure, usually a state-space model (SSM).
also, particle marginal methods for SSMs require at least knowledge of p(yt |xt ; θ) (to compute the importance weights). What do we do without such knowledge?


For the moment we can denote data with y instead of, say, y1:T as
what we are going to introduce is not specific to dynamical models.



Bayesian setting: the target is π(θ|y) ∝ p(y|θ)π(θ).
What to do when (1) the likelihood p(y|θ) is unknown in closed form and/or (2) it is expensive to approximate?
Notice that if we are able to simulate observations y∗ by running the generative model, then we have

y∗ ∼ p(y|θ)

That is, y∗ is produced by the same statistical model that generated the observed data y.

(i) Therefore, if Y is the space where y takes values, then y∗ ∈ Y.
(ii) y and y∗ have the same dimension.


Loosely speaking...

Example: if we have an SSM and, given a parameter value θ and xt−1, we simulate xt and then plug xt into the observation equation and simulate y∗t, then y∗t ∼ p(yt |θ).
This is because if I have two random variables x and y with joint distribution (conditional on θ) p(y, x|θ), then p(y, x|θ) = p(y|x; θ)p(x|θ).
I first simulate x∗ from p(x|θ); then, conditionally on x∗, I simulate y∗ from p(y|x∗, θ).
What I obtain is a draw (x∗, y∗) from p(y, x|θ), hence y∗ alone must be a draw from the marginal p(y|θ).


Likelihood free rejection sampling

1 simulate from the prior: θ∗ ∼ π(θ)
2 plug θ∗ into your model and simulate a y∗ [this is the same as writing y∗ ∼ p(y|θ∗)]
3 if y∗ = y store θ∗. Go to step 1 and repeat.
The above is a likelihood-free algorithm: it does not require knowledge of the expression of p(y|θ).

Each accepted θ∗ is such that θ∗ ∼ π(θ|y) exactly.

We justify the result in next slide.
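The three steps above can be sketched on a toy discrete model (the Binomial example and all names below are my own illustrative choices, not from the slides):

```python
import random

def lf_rejection(y_obs, prior_sample, simulate, n_accept=500, seed=0):
    """Likelihood-free rejection sampling for discrete data:
    keep theta* whenever the simulated data set equals the observed one."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample(rng)        # step 1: theta* ~ prior
        y_sim = simulate(theta, rng)     # step 2: y* ~ p(y | theta*)
        if y_sim == y_obs:               # step 3: accept iff y* = y
            accepted.append(theta)
    return accepted

# Toy model: y = number of successes in n Bernoulli(theta) trials, theta ~ U(0, 1).
n, y_obs = 10, 7
draws = lf_rejection(
    y_obs,
    prior_sample=lambda rng: rng.random(),
    simulate=lambda th, rng: sum(rng.random() < th for _ in range(n)),
)
# Sanity check against the exact posterior: with a flat prior it is
# Beta(y_obs + 1, n - y_obs + 1), whose mean is (y_obs + 1) / (n + 2).
post_mean = sum(draws) / len(draws)
```

Here the equality check `y_sim == y_obs` succeeds with positive probability only because y is discrete with few states, which is exactly the limitation discussed next.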



Justification
The previous algorithm is exact. Let’s see why.

Denote with f(θ∗, y∗) the joint distribution of the accepted pairs (θ∗, y∗). We have that

f(θ∗, y∗) = p(y∗|θ∗)π(θ∗) I_y(y∗)

with I_y(y∗) = 1 iff y∗ = y and zero otherwise. Marginalizing over y∗ we have

f(θ∗) = ∫_Y p(y∗|θ∗)π(θ∗) I_y(y∗) dy∗ = p(y|θ∗)π(θ∗) ∝ π(θ∗|y),

hence all accepted θ∗ are drawn from the exact posterior.
Curse of dimensionality

Algorithmically, the rejection algorithm could be coded as a while loop that repeats until the equality condition is satisfied.
For y taking discrete values in a “small” set of states, this is manageable.
For y a long sequence of observations from discrete random variables with many states, this is very challenging.
For y a continuous variable, the equality happens with probability zero.


ABC rejection sampling (Tavaré et al.[1])

Attack the curse of dimensionality by introducing an approximation.

Take an arbitrary distance ‖·‖ and a threshold ε > 0.
1 simulate from the prior: θ∗ ∼ π(θ)
2 simulate a y∗ ∼ p(y|θ∗)
3 if ‖y∗ − y‖ < ε store θ∗. Go to step 1 and repeat.

Each accepted θ∗ is such that θ∗ ∼ πε(θ|y), with

πε(θ|y) ∝ ∫_Y p(y∗|θ∗)π(θ∗) I_{Aε,y}(y∗) dy∗,    Aε,y = {y∗ ∈ Y; ‖y∗ − y‖ < ε}.

[1] Tavaré et al. 1997. Genetics 145(2)
[Figure 1: Parameter estimation by Approximate Bayesian Computation, a conceptual overview. doi:10.1371/journal.pcbi.1002803.g001]


It is self-evident that imposing ε = 0 forces y∗ = y, thus implying that draws will be, again, from the true posterior.
However, in practice imposing ε = 0 might require an unbearable computational time to obtain a single acceptance. In practice we have to set ε > 0, so that draws are from the approximate posterior πε(θ|y).

Important ABC result
Convergence “in distribution”:
when ε → 0, πε(θ|y) → π(θ|y)
when ε → ∞, πε(θ|y) → π(θ)
Essentially, for a too large ε we learn nothing.


Toy model

Let's try something really trivial. We show how ABC rejection can easily become inefficient.

n = 5 i.i.d. observations yi ∼ Weibull(2, 5).
We want to estimate the parameters of the Weibull, so θ = (a, b) = (2, 5) are the true values.
Take ‖y − y∗‖ = Σ_{i=1}^n (yi − y∗i)² (you can try a different distance, this is not really crucial).
Let's use different values of ε.
Run 50,000 iterations of the algorithm.


We assume wide priors for the “shape” parameter a ∼ U(0.01, 6) and
for the “scale” b ∼ U(0.01, 10).
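The toy experiment above can be sketched as follows (iteration counts reduced from the slides' 50,000 to keep the run quick; note that Python's stdlib `weibullvariate` takes `(scale, shape)`, so Weibull(2, 5) is `weibullvariate(5, 2)` — these implementation choices are mine):

```python
import random

def abc_rejection(y, eps, n_iter=10_000, seed=1):
    """ABC rejection for the Weibull toy model: draw theta* from the priors,
    simulate y*, accept when the squared-difference distance is < eps."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_iter):
        a = rng.uniform(0.01, 6)                       # "shape" prior U(0.01, 6)
        b = rng.uniform(0.01, 10)                      # "scale" prior U(0.01, 10)
        y_sim = [rng.weibullvariate(b, a) for _ in y]  # stdlib order: (scale, shape)
        dist = sum((yi - ysi) ** 2 for yi, ysi in zip(y, y_sim))
        if dist < eps:
            accepted.append((a, b))
    return accepted

rng = random.Random(123)
y = [rng.weibullvariate(5, 2) for _ in range(5)]   # n = 5, true (a, b) = (2, 5)

# Acceptance rates shrink as eps decreases; at a large eps we mostly sample the prior.
rates = {eps: len(abc_rejection(y, eps)) / 10_000 for eps in (20, 7, 3)}
```

The exact rates depend on the simulated dataset, but the pattern of the slides (huge acceptance at ε = 20, around 1% at ε = 3) should be visible.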

Try ε = 20

[Figure: ABC posterior density estimates for shape (N = 45654, bandwidth = 0.1699) and scale (N = 45654, bandwidth = 0.3038).]

We are evidently sampling from the prior: we must reduce ε. In fact, notice that about 46,000 draws were accepted.
Try ε = 7

[Figure: ABC posterior density estimates for shape (N = 19146, bandwidth = 0.1779) and scale (N = 19146, bandwidth = 0.2186).]

Here about 19,000 draws were accepted (38%).


Try ε = 3

[Figure: ABC posterior density estimates for shape (N = 586, bandwidth = 0.321) and scale (N = 586, bandwidth = 0.2233).]

Here about 1% of the produced simulations have been accepted. Recall that the true values are (a, b) = (2, 5).
Of course n = 5 is a very small sample size, so inference is of limited quality, but you get the idea of the method.


An idea for self-study

Compare the ABC (marginal) posteriors with exact posteriors from some experiment using conjugate priors.

For example see http://www.johndcook.com/CompendiumOfConjugatePriors.pdf


Curse of dimensionality

It becomes immediately evident that results will soon degrade for a larger sample size n:
even for a moderately long dataset y, how likely is it that we produce a y∗ such that Σ_{i=1}^n (yi − y∗i)² < ε, for small ε? Very unlikely.
Inevitably, we will be forced to enlarge ε, thus degrading the quality of the inference.


Here we take n = 200. To compare with our “best” previous result, we use ε = 31 (to obtain again a 1% acceptance rate on 50,000 iterations).

[Figure: ABC posterior density estimates for shape (N = 474, bandwidth = 0.1351) and scale (N = 474, bandwidth = 0.08315).]

Notice that the shape is completely off (the true value is 2).

The approach is just not going to be of any practical use with continuous data.
ABC rejection with summaries (Pritchard et al.[2])

Same as before, but comparing S(y) with S(y∗) for “appropriate” summary statistics S(·).
1 simulate from the prior: θ∗ ∼ π(θ)
2 simulate a y∗ ∼ p(y|θ∗), compute S(y∗)
3 if ‖S(y∗) − S(y)‖ < ε store θ∗. Go to step 1 and repeat.

Samples are from πε(θ|S(y)) with

πε(θ|S(y)) ∝ ∫_Y p(y∗|θ∗)π(θ∗) I_{Aε,y}(y∗) dy∗,    Aε,y = {y∗ ∈ Y; ‖S(y∗) − S(y)‖ < ε}.

[2] Pritchard et al. 1999, Molecular Biology and Evolution, 16:1791-1798.
Using summary statistics clearly introduces a further level of approximation, except when S(·) is sufficient for θ (i.e. it carries the same information about θ as the whole of y).

When S(·) is a set of sufficient statistics for θ,

πε(θ|S(y)) = πε(θ|y)

But then again, when the model is outside the exponential family we basically have no hope of constructing sufficient statistics.

A central topic in ABC is the construction of “informative” statistics, as a replacement for the (unattainable) sufficient ones.
An important paper is Fearnhead and Prangle 2012 (discussed later).
If we have “good summaries” we can bypass the curse-of-dimensionality problem.
Weibull example, reprise
Take n = 200. Set S(y) = (sample mean of y, sample SD of y) and similarly for y∗. Use ε = 0.35.

[Figure: ABC posterior density estimates for shape (N = 453, bandwidth = 0.07102) and scale (N = 453, bandwidth = 0.06776).]

This time we have captured both shape and scale (with a 1% acceptance rate).
Also, enlarging n would not cause problems → robust comparisons thanks to S(·).
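The reprise above can be sketched by swapping the raw-data distance for a distance between summaries (iteration count and helper names are my choices; the slides do not show code):

```python
import random

def summaries(y):
    """S(y) = (sample mean, sample SD)."""
    m = sum(y) / len(y)
    v = sum((x - m) ** 2 for x in y) / (len(y) - 1)
    return (m, v ** 0.5)

def abc_rejection_summaries(y, eps, n_iter=5_000, seed=2):
    """ABC rejection comparing S(y*) with S(y) instead of the raw data."""
    rng = random.Random(seed)
    s_obs = summaries(y)
    accepted = []
    for _ in range(n_iter):
        a = rng.uniform(0.01, 6)                       # shape prior
        b = rng.uniform(0.01, 10)                      # scale prior
        y_sim = [rng.weibullvariate(b, a) for _ in y]  # stdlib order: (scale, shape)
        s_sim = summaries(y_sim)
        if sum((u - v) ** 2 for u, v in zip(s_obs, s_sim)) < eps:
            accepted.append((a, b))
    return accepted

rng = random.Random(7)
y = [rng.weibullvariate(5, 2) for _ in range(200)]     # n = 200, true (a, b) = (2, 5)
acc = abc_rejection_summaries(y, eps=0.35)
a_mean = sum(t[0] for t in acc) / len(acc)
b_mean = sum(t[1] for t in acc) / len(acc)
```

Despite n = 200, both posterior means should now land near the true (2, 5), unlike the raw-data comparison.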
From now on we silently assume working with S(y∗ ) and S(y), and if
we wish not to summarize anything we can always set S(y) := y.

A main issue in ABC research is that when we use an arbitrary S(·) we cannot quantify “how much off” we are from the ideal sufficient statistic.
Important work on constructing “informative” statistics:
Fearnhead and Prangle 2012, JRSS-B 74(3).
the review by Blum et al. 2013, Statistical Science 28(2).

Michael Blum will give a free workshop in Lund on 10 March. Sign up here!


Beyond ABC rejection
ABC rejection is the simplest example of an ABC algorithm.
It generates independent draws and can be coded as an embarrassingly parallel algorithm. However, it can be massively inefficient.
Parameters are proposed from the prior π(θ). The prior does not exploit the information in the already accepted parameters.
Unless π(θ) is somehow similar to πε(θ|y), many proposals will be rejected for a moderately small ε.
This is especially true for a high-dimensional θ.
A natural approach is to consider ABC within an MCMC algorithm.
In an MCMC with random-walk proposals, the proposed parameter explores a neighbourhood of the last accepted parameter.
ABC-MCMC
Consider the approximated augmented posterior:

πε(θ, y∗|y) ∝ Jε(y∗, y) p(y∗|θ)π(θ),   where p(y∗|θ)π(θ) ∝ π(θ|y∗).

Jε(y∗, y) is a function which is a positive constant when y = y∗ (or S(y) = S(y∗)) and takes large positive values when y∗ ≈ y (or S(y∗) ≈ S(y)).
π(θ|y∗) is the (intractable) posterior corresponding to the artificial observations y∗.
When ε = 0, Jε(y∗, y) is constant and πε(θ, y∗|y) = π(θ|y).

Without loss of generality, let's assume that Jε(y∗, y) ∝ I_{y,ε}(y∗), the indicator function.
ABC-MCMC (Marjoram et al.[3])
We wish to simulate from the posterior πε(θ, y∗|y): hence we construct proposals for both θ and y∗.

The present state is θ# (with the corresponding y#). Propose θ∗ ∼ q(θ∗|θ#).
Simulate y∗ from the model given θ∗, hence the proposal is the model itself: y∗ ∼ p(y∗|θ∗).

The acceptance probability is thus:

α = min{ 1, [I_{y,ε}(y∗) p(y∗|θ∗) π(θ∗)] / [1 · p(y#|θ#) π(θ#)] × [q(θ#|θ∗) p(y#|θ#)] / [q(θ∗|θ#) p(y∗|θ∗)] }

The “1” in the denominator is there because of course we must start the algorithm at some admissible (accepted) y#, hence the denominator will always have I_{y,ε}(y#) = 1.

[3] Marjoram et al. 2003, PNAS 100(26).
By considering the simplifications in the previous acceptance probability we obtain the ABC-MCMC algorithm:

1 The last accepted parameter is θ# (with the corresponding y#). Propose θ∗ ∼ q(θ∗|θ#).
2 Generate y∗ conditionally on θ∗ and compute I_{y,ε}(y∗).
3 If I_{y,ε}(y∗) = 1 go to step 4, else stay at θ# and return to step 1.
4 Calculate

α = min{ 1, [π(θ∗)/π(θ#)] × [q(θ#|θ∗)/q(θ∗|θ#)] },

generate u ∼ U(0, 1). If u < α set θ# := θ∗, otherwise stay at θ#. Return to step 1.

During the algorithm there is no need to retain the generated y∗, hence the set of accepted θ forms a Markov chain with stationary distribution πε(θ|y).
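The two-stage scheme above can be sketched on a deliberately simple model (a Gaussian mean with S(y) = the sample mean; the model, the prior and the tuning are my illustrative choices, not from the slides):

```python
import math
import random

def abc_mcmc(y, eps, n_iter=5_000, step=0.3, seed=3):
    """ABC-MCMC sketch: the likelihood-free indicator check comes first,
    then a Metropolis-Hastings ratio involving only prior and proposal."""
    rng = random.Random(seed)
    s_obs = sum(y) / len(y)                              # summary: the sample mean
    log_prior = lambda th: -th * th / (2 * 10.0 ** 2)    # theta ~ N(0, 10), up to a constant
    theta = s_obs                # start from an admissible (accepted) state
    chain = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)              # random-walk proposal (symmetric q)
        y_sim = [rng.gauss(prop, 1.0) for _ in y]        # y* ~ p(y | theta*)
        if abs(sum(y_sim) / len(y_sim) - s_obs) < eps:   # step 3: indicator I_{y,eps}(y*)
            # step 4: q cancels for a symmetric proposal
            if rng.random() < math.exp(min(0.0, log_prior(prop) - log_prior(theta))):
                theta = prop
        chain.append(theta)
    return chain

rng = random.Random(11)
y = [rng.gauss(2.0, 1.0) for _ in range(50)]   # data from N(2, 1)
chain = abc_mcmc(y, eps=0.2)
est = sum(chain[1000:]) / len(chain[1000:])    # ABC posterior mean after burn-in
```

Note that no likelihood evaluation appears anywhere: the model is only ever simulated from.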
The previous ABC-MCMC algorithm is also denoted as
“likelihood-free MCMC”.
Notice that likelihoods do not appear in the algorithm.
Likelihoods are substituted by sampling of artificial observations
from the data-generating model.
The Handbook of MCMC (CRC press) has a very good chapter
on Likelihood-free Markov chain Monte Carlo.



Blackboard: proof that the algorithm targets the correct distribution



A (trivial) generalization of ABC-MCMC

Marjoram et al. used Jε(y∗, y) ≡ I_{y,ε}(y∗). This implies that we consider as equally acceptable all those y∗ such that |y∗ − y| < ε (or such that |S(y∗) − S(y)| < ε).
However, we might also reward the y∗ in different ways depending on their distance to y.
Examples:
Gaussian kernel: Jε(y∗, y) ∝ exp(−Σ_{i=1}^n (yi − y∗i)² / (2ε²)), or...
for a vector-valued S(·): Jε(y∗, y) ∝ exp(−(S(y) − S(y∗))′ W⁻¹ (S(y) − S(y∗)) / (2ε²)).
And of course the ε in the two formulations above are different.


Then the acceptance probability trivially generalizes to[4]

α = min{ 1, [Jε(y∗, y) π(θ∗)] / [Jε(y#, y) π(θ#)] × q(θ#|θ∗)/q(θ∗|θ#) }.

This is still a likelihood-free approach.

[4] Sisson and Fan (2010), chapter in the Handbook of Markov Chain Monte Carlo.
Choice of the threshold ε
We would like to use a “small” ε > 0; however, it turns out that if you start at a bad value of θ, a small ε will cause many rejections.
Start with a fairly large ε, allowing the chain to move in the parameter space.
After some iterations, reduce ε so that the chain explores a (narrower) and more precise approximation to π(θ|y).
Keep reducing ε (slowly). Use the set of θ's accepted under the smallest ε to report inference results.
It is not obvious how to determine the sequence ε1 > ε2 > ... > εk > 0. If the sequence decreases too fast there will be many rejections (the chain suddenly gets trapped in some tail).

It is a problem similar to tuning the “temperature” in optimization via simulated annealing.
Choice of the threshold ε
A possibility:
Say that you have completed a number of iterations via ABC-MCMC or via rejection sampling using ε1, and that you stored the distances d1 = ‖S(y) − S(y∗)‖ obtained under ε1.
Take the x-th percentile of such distances and set the new threshold ε2 as ε2 := x-th percentile of d1.
This way ε2 < ε1. So now you can use ε2 to conduct more simulations, then similarly obtain ε3 := x-th percentile of d2, etc.
Depending on x, the decrease from one ε to the next ε′ will be more or less fast. Setting, say, x = 20 will cause a sharp decrease, while x = 90 will let the threshold decrease more slowly.
A slow decrease of ε is safer but implies longer simulations before reaching acceptable results.
Alternatively, just set the sequence of ε's by trial and error.
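The percentile rule can be sketched as (the helper name is mine):

```python
def next_epsilon(distances, x=50):
    """Next ABC threshold: the x-th percentile of the stored distances."""
    d = sorted(distances)
    k = max(0, min(len(d) - 1, round(len(d) * x / 100) - 1))
    return d[k]

# Pretend these are the stored distances d1 from the run under epsilon_1:
d1 = list(range(1, 101))
eps_sharp = next_epsilon(d1, x=20)   # sharp decrease
eps_slow = next_epsilon(d1, x=90)    # slow decrease
```

With x = 20 the threshold drops quickly; with x = 90 it shrinks much more slowly, as described above.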
When do we stop decreasing ε?

Several studies have shown that, when using ABC-MCMC, obtaining a chain with about a 1% acceptance rate (at the smallest ε) is a good compromise between accuracy and computational needs.
This is also my experience.
However, recall that a “small” ε implies many rejections → you will have to run a longer simulation to obtain enough acceptances to enable inference.
ABC, unlike exact MCMC, does require a small acceptance rate. This is needed by its very nature, as we are not happy to use a large ε.
A high acceptance rate indicates that your ε is way too large and that you are probably sampling from the prior π(θ) (!)


Example from Sunnåker et al. 2013
[Large chunks from the cited article constitute the ABC entry in Wikipedia.]

[Figure 2: A dynamic bistable hidden Markov model. doi:10.1371/journal.pcbi.1002803.g002]

We have a hidden system state, moving between states {A, B} with probability θ, and staying in the current state with probability 1 − θ.
Actual observations are affected by measurement errors: the probability of misreading the system state is 1 − γ, for both A and B.
Example of application: the behavior of the Sonic Hedgehog (Shh)
transcription factor in Drosophila melanogaster can be modeled by
the given model.
Not surprisingly, the example is a hidden Markov model:
p(xt |xt−1) = θ when xt ≠ xt−1, and 1 − θ otherwise.
p(yt |xt) = γ when yt = xt, and 1 − γ otherwise.

In other words, a typical simulation pattern looks like:
A,B,B,B,A,B,A,A,A,B (states x1:T)
A,A,B,B,B,A,A,A,A,A (observations y1:T)
The misrecorded states are the positions where the two sequences disagree.


The example could certainly be solved via exact methods, but just for the sake of illustration assume we are only able to simulate random sequences from our model.
Here is how we simulate a sequence of length T:
1 given θ, draw the switch indicator from Bin(1, θ) and update the state x∗t accordingly
2 conditionally on x∗t, yt is Bernoulli: generate u ∼ U(0, 1); if u < γ set y∗t := x∗t, otherwise take the other value
3 set t := t + 1, go to 1 and repeat until we have collected y1, ..., yT.

So we are all set to generate sequences of A's and B's given the parameter values.
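A minimal sketch of this simulator (function names are mine; the state update is written as "switch with probability θ"):

```python
import random

def simulate_hmm(theta, gamma, T, rng):
    """Simulate the two-state A/B model: the hidden chain switches state with
    probability theta, and each state is recorded correctly with probability gamma."""
    flip = {"A": "B", "B": "A"}
    x = rng.choice("AB")                   # arbitrary initial hidden state
    y = []
    for _ in range(T):
        if rng.random() < theta:           # step 1: switch indicator ~ Bin(1, theta)
            x = flip[x]
        obs = x if rng.random() < gamma else flip[x]   # step 2: misread w.p. 1 - gamma
        y.append(obs)                      # step 3: collect y_1, ..., y_T
    return y

rng = random.Random(4)
y = simulate_hmm(0.25, 0.9, T=150, rng=rng)
```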



We generate a sequence of size T = 150 with θ = 0.25 and γ = 0.9.
The states are discrete and only two (A and B), hence with datasets of moderate size we could do without summary statistics. But not for large T.
Take S(·) = number of switches between observed states.
Example: if y = (A, B, B, A, A, B) we switched 3 times, so S(y) = 3.
We only need to set a metric and then we are done.
Example (you can choose a different metric): Jε(y∗, y) = I_{y,ε}(y∗) with

I_{y,ε}(y∗) = 1 if |S(y∗) − S(y)| < ε, and 0 otherwise.

Plug this setup into an ABC-MCMC and we are essentially using Marjoram et al.'s original algorithm.
Priors: θ ∼ U(0, 1) and γ ∼ Beta(20, 3).
Starting values for the ABC-MCMC: θ = γ = 0.5.

[Figure: trace plots of the chains for θ and γ over 80,000 iterations.]

We used ε = 6 for the first 5,000 iterations, then ε = 2 for a further 25,000 iterations, and ε = 0 for the remaining 50,000 iterations.
When ε = 6 the acceptance rate was 20%, when ε = 2 it was 9%, and when ε = 0 it was 2%.


Results at ε = 0
Dealing with a discrete state-space model allows us the luxury of obtaining results at ε = 0 (impossible with continuous states).

[Figure: ABC posteriors (blue), true parameters (vertical red lines) and the Beta prior for γ (black). For θ we used a uniform prior on [0, 1].]

Remember: when using non-sufficient statistics, results will be biased even with ε = 0.
A price to be paid when using ABC with a small ε is that, because of the many rejections, autocorrelations are very high.

[Figure: autocorrelation functions of the chains for θ and γ, at ε = 0, 2 and 6.]

This implies the need for longer simulations.


An apology

Paradoxically, all the (trivial) examples I have shown do not require ABC.

I considered simple examples because it is easier to illustrate the method, but you will receive a homework assignment having (really) intractable likelihoods :-b


Weighting summary statistics
Consider a vector of summaries S(·) ∈ R^d; not much of the literature discusses how to assign weights to the components of S(·).

For example, consider

Jε(y∗, y) ∝ exp(−‖S(y) − S(y∗)‖ / (2ε²))

with ‖S(y) − S(y∗)‖ = (S(y) − S(y∗))′ · W⁻¹ · (S(y) − S(y∗)).

Prangle[5] notes that if S(y) = (S1(y), ..., Sd(y)) and we give all the Sj the same weight (hence W is the identity matrix), then the distance ‖·‖ is dominated by the most variable summary Sj.

Only the component of θ “explained” by such an Sj will be nicely estimated.

[5] D. Prangle (2015), arXiv:1507.00874
It is useful to have a diagonal W, say W = diag(σ1², ..., σd²).

The σj could be determined from a pilot study. Say that we are using ABC-MCMC; after some appropriate burn-in, say that we have stored R realizations of S(y∗), corresponding to the R parameters θ∗, into an R × d matrix.

For each column j, extract the unique values from (Sj(y∗)^(1), ..., Sj(y∗)^(R))′ and then compute their madj (median absolute deviation).

Set σj² := madj².

(The median absolute deviation is a robust measure of dispersion.)

Rerun ABC-MCMC with the updated W; an adjustment to ε will probably be required.
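A sketch of this recipe (following the slide: unique values per column, their MAD, and W = diag(mad²); helper names are mine):

```python
import statistics

def mad(values):
    """Median absolute deviation, a robust measure of dispersion."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def summary_weights(stored_summaries):
    """Given R stored summary vectors (an R x d matrix as a list of rows),
    return the diagonal of W as sigma_j^2 = mad_j^2, per column j."""
    d = len(stored_summaries[0])
    cols = [[row[j] for row in stored_summaries] for j in range(d)]
    return [mad(sorted(set(col))) ** 2 for col in cols]   # unique values, as in the slide

def weighted_distance(s_obs, s_sim, w):
    # ||S(y) - S(y*)|| with W = diag(sigma_1^2, ..., sigma_d^2)
    return sum((a - b) ** 2 / wj for a, b, wj in zip(s_obs, s_sim, w))

# A summary that is 100x more variable no longer dominates the distance:
S_stored = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0], [5.0, 500.0]]
w = summary_weights(S_stored)
```

With W = I the second summary would swamp the first; after rescaling by the squared MADs, both contribute comparably.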



ABC for dynamical models

It is trickier to select intuitive summaries (i.e. without the Fearnhead-Prangle approach) for dynamical models.

However, we can bypass the need for S(·) if we use an ABC version of sequential Monte Carlo.

A very good review of methods for dynamical models is given in Jasra 2015.


ABC-SMC

A simple ABC-SMC algorithm is in Jasra et al. 2010 and is presented on the next slide (with some minor modifications).

For the sake of brevity, just consider a bootstrap filter approach with N particles.

Recall that in ABC we assume that if the observation yt ∈ Y, then also y∗t ∈ Y.

As usual, we assume t ∈ {1, 2, ..., T}.


Step 0.
Set t = 1. For i = 1, ..., N sample x_1^i ∼ π(x_0) and y_1^{∗i} ∼ p(y_1 |x_1^i), compute the weights w_1^i = J_{1,ε}(y_1, y_1^{∗i}) and normalize them: w̃_1^i := w_1^i / Σ_{i=1}^N w_1^i.

Step 1.
Resample N particles from {x_t^i, w̃_t^i}. Set w_t^i = 1/N.
Set t := t + 1 and if t = T + 1, stop.

Step 2.
For i = 1, ..., N sample x_t^i ∼ p(x_t |x_{t−1}^i) and y_t^{∗i} ∼ p(y_t |x_t^i). Compute

w_t^i := J_{t,ε}(y_t, y_t^{∗i}),

normalize the weights w̃_t^i := w_t^i / Σ_{i=1}^N w_t^i and go to step 1.
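Steps 0-2 can be sketched on a toy state-space model (a Gaussian random-walk state observed with Gaussian noise; the model and the Gaussian-kernel choice for J_{t,ε} are my illustrative assumptions, not from Jasra et al.):

```python
import math
import random

def abc_smc_filter(y, eps, N=500, sigma_x=0.5, sigma_y=0.5, seed=6):
    """ABC bootstrap-filter sketch: propagate particles through the state
    equation, simulate pseudo-observations y*_t, weight by J_eps(y_t, y*_t),
    then resample at every t."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(N)]        # Step 0: x_1^i ~ pi(x_0)
    filter_means = []
    for t, y_t in enumerate(y):
        if t > 0:
            # Step 2: x_t^i ~ p(x_t | x_{t-1}^i), here a Gaussian random walk
            xs = [x + rng.gauss(0.0, sigma_x) for x in xs]
        y_sim = [x + rng.gauss(0.0, sigma_y) for x in xs]   # y*_t^i ~ p(y_t | x_t^i)
        w = [math.exp(-(ys - y_t) ** 2 / (2 * eps ** 2)) for ys in y_sim]  # J_{t,eps}
        tot = sum(w)
        w = [wi / tot for wi in w]                      # normalise the weights
        filter_means.append(sum(wi * x for wi, x in zip(w, xs)))
        xs = rng.choices(xs, weights=w, k=N)            # Step 1: resample every t
    return filter_means

# Toy data: a drifting signal observed with noise
rng = random.Random(8)
x_true, y = 0.0, []
for _ in range(30):
    x_true += rng.gauss(0.0, 0.5)
    y.append(x_true + rng.gauss(0.0, 0.5))
m = abc_smc_filter(y, eps=0.5)
```

Note the comparison is local in time: each particle's pseudo-observation is compared with y_t only, so no summary statistic is needed.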



The previous algorithm is not as general as the one actually given in Jasra et al. 2010.

I assumed that resampling is performed at every t (not strictly necessary). If resampling is not performed at every t, in step 2 we have

w_t^i := w_{t−1}^i J_{t,ε}(y_t, y_t^{∗i}).

Specifically, Jasra et al. use J_{t,ε}(y_t, y_t^{∗i}) ≡ I(‖y_t^{∗i} − y_t‖ < ε), but that is not essential for the method to work.
What is important to realize is that in SMC methods the comparison is “local”: we compare the particles at time t with the observation at time t. So we can avoid summaries and use the data directly.
That is, instead of comparing a length-T vector y∗ with a length-T vector y, we perform T separate comparisons ‖y_t^{∗i} − y_t‖. This is very feasible and clearly does not require an S(·).
So you can form an approximation to the likelihood as we explained
in the particle marginal methods lecture, then plug it into a standard
MCMC (not ABC-MCMC) algorithm for parameter estimation.

This is a topic for a final project.



Construction of S(·)

We have somehow postponed an important issue in ABC practice: the choice/construction of S(·).
This is the most serious open problem in ABC, and one that often determines the success or failure of the simulation.
We are ready to accept non-sufficiency (attainable only within the exponential family) in exchange for an “informative” statistic.
Statistics are somewhat easier to identify for static models. For dynamical models their identification is rather arbitrary, but see Martin et al.[6] for state-space models.

[6] Martin et al. 2014, arXiv:1409.8363.
Semi-automatic summary statistics
To date, the most important study on the construction of summaries in ABC is Fearnhead-Prangle 2012[7], a discussion paper in JRSS-B.
Recall a well-known result: consider the class of quadratic losses

L(θ0, θ̂; A) = (θ0 − θ̂)ᵀ A (θ0 − θ̂)

with θ0 the true value of the parameter, θ̂ an estimator of θ, and A a positive definite matrix.

If we set S(y) = E(θ|y), then the minimal expected quadratic loss E(L(θ0, θ̂; A)|y) is achieved via θ̂ = E_ABC(θ|S(y)) as ε → 0.

That is to say, as ε → 0 we minimize the expected posterior loss by using the ABC posterior expectation (if S(y) = E(θ|y)). However, E(θ|y) is unknown.

[7] Fearnhead and Prangle (2012).
So Fearnhead & Prangle propose a regression-based approach to determine S(·) (run before the ABC-MCMC starts):
for the j-th parameter in θ, fit separately the linear regression models

Sj(y) = Ê(θj |y) = β̂0^(j) + β̂^(j) η(y),   j = 1, 2, ..., dim(θ)

[e.g. Sj(y) = β̂0^(j) + β̂1^(j) y1 + · · · + β̂n^(j) yn, or you can let η(·) contain powers of y, say η(y) = (y, y², y³, ...)]
Repeat the fitting separately for each θj.
Hopefully Sj(y) = β̂0^(j) + β̂^(j) η(y) will be “informative” for θj.
Clearly, in the end we have as many summaries as the number of unknown parameters, dim(θ).


An example (run before ABC-MCMC):
1. Let p = dim(θ). Simulate from the prior: θ∗ ∼ π(θ) (not very efficient...).
2. Using θ∗, generate y∗ from your model.
Repeat (1)-(2) many times, storing the draws as an R × p matrix of simulated parameters, with rows (θ1^(i), ..., θp^(i)), and an R × n matrix of simulated datasets, with rows (y1^(∗i), ..., yn^(∗i)).
Then, for each parameter column j = 1, ..., p, regress the θj^(i) on the design matrix with rows (1, y1^(∗i), ..., yn^(∗i)) via a multivariate linear regression (or the lasso, or...), and obtain the statistic for θj: Sj(·) = β̂0^(j) + β̂^(j) η(·).
Use the same coefficients when calculating the summaries for simulated data and for the actual data, i.e.

Sj(y) = β̂0^(j) + β̂^(j) η(y)
Sj(y∗) = β̂0^(j) + β̂^(j) η(y∗)

In Picchini 2013 I used this approach to select summaries for state-space models defined by stochastic differential equations.
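A runnable sketch of the pilot regression for a scalar θ with η(y) = the raw data (the toy model θ ∼ U(0, 2) with yi ∼ N(θ, 1), and all helper names, are my choices):

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_summary(R=300, n=20, seed=9):
    """Pilot run: theta* ~ prior, y* from the model, then a least-squares
    regression of theta on eta(y*) = (1, y*_1, ..., y*_n)."""
    rng = random.Random(seed)
    X, t = [], []
    for _ in range(R):
        th = rng.uniform(0.0, 2.0)                         # theta* ~ U(0, 2) prior
        X.append([1.0] + [rng.gauss(th, 1.0) for _ in range(n)])
        t.append(th)
    p = n + 1
    # normal equations: (X'X) beta = X' t
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xtt = [sum(row[i] * ti for row, ti in zip(X, t)) for i in range(p)]
    return solve(XtX, Xtt)

beta = fit_summary()
S = lambda y: beta[0] + sum(b * yi for b, yi in zip(beta[1:], y))

# On new data the fitted S(.) approximates E(theta | y), which for this toy
# model is close to the sample mean:
rng = random.Random(10)
y_new = [rng.gauss(1.5, 1.0) for _ in range(20)]
```

The same fitted coefficients are then applied to both the observed and the simulated data inside the ABC run, exactly as stated above.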



Software (coloured links are clickable)

EasyABC, R package. Research article.
abc, R package. Research article.
abctools, R package. Research article. Focuses on tuning.
Lists with more options here and here.
Examples with implemented model simulators (useful to incorporate in your programs).


Reviews

Fairly extensive but accessible reviews:


1 Sisson and Fan 2010
2 (with applications in ecology) Beaumont 2010
3 Marin et al. 2010
Simpler introductions:
1 Sunnåker et al. 2013
2 (with applications in ecology) Hartig et al. 2013
Review specific for dynamical models:
1 Jasra 2015



Non-reviews, specific for dynamical models

1 SMC for parameter estimation and model comparison: Toni et al. 2009
2 Markov models: White et al. 2015
3 SMC: Sisson et al. 2007
4 SMC: Dean et al. 2014
5 SMC: Jasra et al. 2010
6 MCMC: Picchini 2013


More specialized resources

selection of summary statistics: Fearnhead and Prangle 2012.
review on summary-statistics selection: Blum et al. 2013
expectation-propagation ABC: Barthelme and Chopin 2012
Gaussian-process ABC: Meeds and Welling 2014
ABC model choice: Pudlo et al. 2015


Blog posts and slides

1 Christian P. Robert often blogs about ABC (and beyond: it's a fantastic blog!)
2 an intro to ABC by Darren J. Wilkinson
3 two posts by Rasmus Bååth, here and here
4 tons of slides at Slideshare.