arXiv:2505.04603v1
¹ Department of Statistics, Stanford University
² Department of Biomedical Data Science, Stanford University
∗ Corresponding author; Electronic address: [email protected]
May 8, 2025
Abstract
When the likelihood is analytically unavailable and computationally intractable, approximate Bayesian computation (ABC) has emerged as a widely used methodology for approximate posterior inference; however, it suffers from severe computational inefficiency in high-dimensional settings or under diffuse priors. To overcome these limitations, we propose Adaptive Bayesian Inference (ABI), a framework that bypasses traditional data-space discrepancies and instead compares distributions directly in posterior space through nonparametric distribution matching. By leveraging a novel Marginally-augmented Sliced Wasserstein (MSW) distance on posterior measures and exploiting its quantile representation, ABI transforms the challenging problem of measuring divergence between posterior distributions into a tractable sequence of one-dimensional conditional quantile regression tasks. Moreover, we introduce a new adaptive rejection sampling scheme that iteratively refines the posterior approximation by updating the proposal distribution via generative density estimation. Theoretically, we establish parametric convergence rates for the trimmed MSW distance and prove that the ABI posterior converges to the true posterior as the tolerance threshold vanishes. Through extensive empirical evaluation, we demonstrate that ABI significantly outperforms data-based Wasserstein ABC, summary-based ABC, as well as state-of-the-art likelihood-free simulators, especially in high-dimensional or dependent observation regimes.
1 Introduction
Bayesian modeling is widely used across natural science and engineering disciplines. It enables
researchers to easily construct arbitrarily complex probabilistic models through forward sam-
pling techniques (implicit models) while stabilizing ill-posed problems by incorporating prior
knowledge. Yet, the likelihood function x 7→ fθ (x) may be intractable to evaluate or entirely
inaccessible in many scenarios (Zeng et al., 2019; Chiachío-Ruano et al., 2021), thus render-
ing Markov chain-based algorithms—such as Metropolis-Hastings and broader Markov Chain
Monte Carlo methods—unsuitable for posterior inference. Approximate Bayesian Computation
(ABC) emerges as a compelling approach for scenarios where exact posterior inference for model
parameters is infeasible (Tavaré, 2018). Owing to its minimal modeling assumptions and ease
of implementation, ABC has garnered popularity across various Bayesian domains, including
likelihood-free inference (Markram et al., 2015; Alsing et al., 2018), Bayesian inverse problems
(Chatterjee et al., 2021), and posterior estimation for simulator-based stochastic systems (Wood,
2010). ABC generates a set of parameters θ ∈ Ω ⊆ Rd with high posterior density through a
rejection-based process: it simulates fake datasets for different parameter draws and retains only
those parameters that yield data sufficiently similar to the observed values.
However, when the data dimensionality is high or the prior distribution is uninformative
about the observed data, ABC becomes extremely inefficient and often requires excessive rejec-
tions to retain a single sample. Indeed, Lemmas B.1 and B.2 show that the expected number of
simulations needed to retain a single draw grows exponentially in the data dimension. To enhance
computational efficiency, researchers frequently employ low-dimensional summary statistics and
conduct rejection sampling instead in the summary statistic space (Fearnhead and Prangle,
2012). Nevertheless, the Pitman-Koopman-Darmois theorem stipulates that low-dimensional
sufficient statistics exist only for the exponential family. Consequently, practical problems of-
ten require considerable judgment in choosing appropriate summary statistics, typically in a
problem-specific manner (Wood, 2010; Marin et al., 2012). Moreover, the use of potentially
non-sufficient summary statistics to evaluate discrepancies can result in ABC approximations
that, while useful, may lead to a systematic loss of information relative to the original posterior
distribution. For instance, Fearnhead and Prangle (2011) and Jiang et al. (2017) propose a
semi-automatic approach that employs an approximation of the posterior mean as a summary
statistic; however, this method ensures only first-order accuracy.
Another critical consideration is selecting an appropriate measure of discrepancy between
datasets. A large proportion of the ABC literature is devoted to investigating ABC strategies
adopting variants of the ℓp -distance between summaries (Prangle, 2017), which are suscepti-
ble to significant variability in discrepancies across repeated samples from fθ (Bernton et al.,
2019). Such drawbacks have spurred a shift towards summary-free ABC methods that directly
compare the empirical distributions of observed and simulated data via an integral probability
metric (IPM), thereby obviating the need to predefine summary statistics (Legramanti et al.,
2022). Popular examples include ABC versions that utilize the Kullback-Leibler divergence
(Jiang, 2018), 2-Wasserstein distance (Bernton et al., 2019), and Hellinger and Cramer–von
Mises distances (Frazier, 2020). The accuracy of the resulting approximate posteriors relies cru-
cially on the fixed sample size n of the observed data, as the quality of IPM estimation between
data-generating processes from a finite, often small, number of samples is affected by the con-
vergence rate of empirically estimated IPMs to their population counterparts. In particular, a
significant drawback of Wasserstein-based ABC methods stems from the slow convergence rate
of the Wasserstein distance, which scales as O(n−1/s ) when the data dimension s ≥ 3 (Tala-
grand, 1994). As a result, achieving accurate posterior estimates is challenging with limited
samples, particularly for high-dimensional datasets. A further limitation of sample-based IPM
evaluation is the need for additional considerations in the case of dependent data, since ignoring
such dependencies might render certain parameters unidentifiable (Bernton et al., 2019).
Thus, two fundamental questions ensue from this discourse: What constitutes an informa-
tive set of summary statistics, and what serves as an appropriate measure of divergence between
datasets? To address the aforementioned endeavors, we introduce the Adaptive Bayesian In-
ference (ABI) framework, which directly compares posterior distributions through distribution
matching and adaptively refines the estimated posterior via rejection sampling. At its core, ABI
bypasses observation-based comparisons by selecting parameters whose synthetic-data-induced
posteriors align closely with the target posterior, a process we term nonparametric distribution
matching. To achieve this, ABI learns a discrepancy measure in the posterior space, rather than
the observation space, by leveraging the connection between the Wasserstein distance and con-
ditional quantile regression, thereby transforming the task into a tractable supervised learning
problem. Then, ABI simultaneously refines both the posterior estimate and the approximated
posterior discrepancy over successive iterations.
Viewed within the summary statistics framework, our proposed method provides a principled
approach for computing a model-agnostic, one-dimensional kernel statistic. Viewed within the
discrepancy framework, our method approximates an integral probability metric on the space
of posteriors, thus circumventing the limitations of data-based IPM evaluations such as small
sample sizes and dependencies among observations.
Contributions Our work makes three main contributions. First, we introduce a novel integral
probability metric—the Marginally-augmented Sliced Wasserstein (MSW) distance—defined on
the space of posterior probability measures. We then characterize the ABI approximate posterior
as the distribution of parameters obtained by conditioning on those datasets whose induced
posteriors fall within the prescribed MSW tolerance of the target posterior. Whereas conven-
tional approaches rely on integral probability metrics on empirical data distributions, our poste-
rior–based discrepancy remains robust even under small observed sample sizes n, intricate sample
dependency structures, and parameter non-identifiability. We further argue that considering the
axis-aligned marginals can help improve the projection efficiency of uniform slice-based Wasser-
stein distances. Second, we show that the posterior MSW distance can be accurately estimated
through conditional quantile regression by exploiting the equivalence between the univariate
Wasserstein distance and differences in quantiles. This novel insight reduces the traditionally
challenging task of operating in the posterior space into a supervised distributional regression
task, which we solve efficiently using deep neural networks. The same formulation naturally
accommodates multi-dimensional parameters and convenient sequential refinement via rejection
sampling. Third, we propose a sequential version of the rejection–ABC that, to the best of our
knowledge, is the first non-Monte-Carlo-based sequential ABC. Existing sequential refinement
methods in the literature frequently rely on adaptive importance sampling techniques, such as
sequential Monte Carlo (Del Moral et al., 2012; Bonassi and West, 2015) and population Monte
Carlo (Beaumont et al., 2009). These approaches, particularly in their basic implementations,
are often constrained to the support of the empirical distribution derived from prior samples.
While advanced variants can theoretically explore beyond this initial support through rejuve-
nation steps and MCMC moves, they nevertheless require careful selection of transition kernels
and auxiliary backward transition kernels (Del Moral et al., 2012). In contrast, ABI iteratively
refines the posterior distribution via rejection sampling by updating the proposal distribution us-
ing the generative posterior approximation from the previous step—learned through a generative
model (not to be confused with the original simulator in the likelihood-free setup). Generative-
model-based approaches for posterior inference harness the expressive power of neural networks
to capture intricate probabilistic structures without requiring an explicit distributional speci-
fication. This generative learning stage enables ABI to transcend the constrained support of
the empirical parameter distribution and eliminates the need for explicit prior-density evalua-
tion (unlike Papamakarios and Murray (2016)), thereby accommodating cases where the prior
distribution itself may be intractable.
We characterize the topological and statistical behavior of the MSW distance, establishing
both its parametric convergence rate and its continuity on the space of posterior measures. Our
proof employs a novel martingale-based argument appealing to Doob’s theorem, which offers an
alternative technique to existing proofs based on the Lebesgue differentiation theorem (Barber
et al., 2015). This new technique may be of independent theoretical interest for studying the
convergence of other sequential algorithms. We then prove that, as the tolerance threshold
vanishes (with observations held fixed), the ABI posterior converges in distribution to the true
posterior. Finally, we derive a finite-sample bound on the bias induced by the approximate
rejection-sampling procedure. Through comprehensive empirical experiments, we demonstrate
that ABI achieves highly competitive performance compared to data-based Wasserstein ABC,
and several recent, state-of-the-art likelihood-free posterior simulators.
Notation Let the parameter and data (θ, X) be jointly defined on some probability space. The
prior probability measure π on the parameter space Ω ⊆ Rd is assumed absolutely continuous
with respect to Lebesgue measure, with density π(θ) for θ ∈ Ω. For simplicity, we use π(·) to
denote both the density and its corresponding distribution. Let the observation space be X ⊆
R^{d_X} for some d_X ∈ N_+, where N_+ := {1, 2, . . .}. We observe a data vector x∗ = (x∗_1, . . . , x∗_n)^⊤ ∈ X^n ⊂ R^{n d_X}, whose joint distribution on X^n is given by the likelihood P_θ^{(n)}. If the samples are not exchangeable, we simply set n = 1 with a slight abuse of notation and write x∗ for that single observation. We assume x∗ is generated from P_{θ∗}^{(n)} for some true but unknown θ∗ ∈ Ω. Both the prior density π(θ) and the likelihood P_θ^{(n)}(x) may be analytically intractable; however, we assume access to
• a prior simulator that draws θ ∼ π, and
• a data generator that simulates X ∼ P_θ^{(n)} given any θ.
We do not assume parameter identifiability; that is, we allow distinct parameter values θ ≠ θ′ to yield identical probability distributions, P_θ^{(n)} = P_{θ′}^{(n)}. Our inferential goal is to generate samples from the posterior π(θ | x∗) ∝ π(θ) P_θ^{(n)}(x∗), where θ ∈ Ω. For notational convenience, we use D(·, ·) for a generic distance metric, which may act on the data space or on probability measures, depending on the context.
For any function class G and probability measures µ and ν, we define the Integral Probability Metric (IPM) between µ and ν with respect to G as D_G(µ, ν) = sup_{g∈G} | ∫ g dµ − ∫ g dν |.
Let ∥·∥ denote the ℓ2 (Euclidean) distance and let (Ω, ∥·∥) be a Polish space. For p ∈ [1, ∞),
we denote by Pp (Ω) the set of Borel probability measures defined on Ω with finite p-th moment.
For µ, ν ∈ P_p(Ω), the p-Wasserstein distance between µ and ν is defined as the solution of the optimal mass transportation problem

W_p(µ, ν) = ( inf_{γ∈Γ(µ,ν)} ∫_{Ω×Ω} ∥x − y∥^p dγ(x, y) )^{1/p},    (1.1)

where Γ(µ, ν) denotes the set of couplings of µ and ν.
The p-Wasserstein space is defined as (Pp (Ω), Wp ). For a comprehensive treatment of the Wasser-
stein distance and its connections to optimal transport, we refer the reader to Villani et al.
(2009).
For results on convergence rates and the bias–cost trade-off when using sufficient statistics
in ABC, see Barber et al. (2015), who establish consistency of ABC posterior expectations via
the Lebesgue differentiation theorem.
The Sliced Wasserstein (SW) distance averages the one-dimensional Wasserstein distance over all directions on the unit sphere:

SW_p^p(µ, ν) = ∫_{S^{d−1}} W_p^p(φ_# µ, φ_# ν) dσ(φ),    (1.2)

where S^{d−1} is the unit sphere in R^d, σ is the uniform measure on S^{d−1}, and φ_# denotes the pushforward under the projection onto the one-dimensional subspace spanned by φ. By reducing the problem to univariate cases, each of which admits an analytic solution, this approach circumvents the high computational cost of directly evaluating the d-dimensional Wasserstein distance while preserving key topological properties of the classical Wasserstein metric, including its ability to metrize weak convergence (Bonnotte, 2013).
To approximate the integral in (1.2), in practice one draws K directions i.i.d. from the sphere and forms the unbiased Monte Carlo estimator

\widehat{SW}_p(µ, ν) = ( (1/K) Σ_{k=1}^{K} W_p^p(φ^{(k)}_# µ, φ^{(k)}_# ν) )^{1/p},    (1.3)
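For concreteness, a minimal NumPy sketch of the Monte Carlo estimator in (1.3) for two equal-size empirical samples is given below; the function name and interface are illustrative, and the sorted-projection step uses the closed-form one-dimensional optimal transport between equally weighted samples.

```python
import numpy as np

def sliced_wasserstein(x, y, K=50, p=2, rng=None):
    """Monte Carlo estimate of SW_p between two empirical measures.

    x, y: arrays of shape (m, d) with equal sample sizes, so the
    one-dimensional Wasserstein distance along each slice reduces to
    matching sorted projections."""
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    # Draw K directions uniformly on the unit sphere S^{d-1}.
    phi = rng.standard_normal((K, d))
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)
    total = 0.0
    for k in range(K):
        xp = np.sort(x @ phi[k])   # sorted projections of the first sample
        yp = np.sort(y @ phi[k])   # sorted projections of the second sample
        total += np.mean(np.abs(xp - yp) ** p)   # W_p^p along this slice
    return (total / K) ** (1.0 / p)
```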
where ρτ (u) = max{τ u, (τ − 1)u} is the quantile loss function, and Ntrain denotes the number
of training samples. We use Ntrain deliberately to avoid confusion with n, which represents the
(fixed) number of observations in Bayesian inference.
random variable with a simple, known distribution. Here Gβ : Z → Y is the generative (push-
forward) map that transforms latent vectors into observations. The parameters β are optimized
so that the generated samples ỹ ∼ P̂Y are statistically similar to the real data. Learning objec-
tives may be formulated using a variety of generative frameworks, including optimal transport
networks (Lu et al., 2025), generative adversarial networks, and auto-encoder models.
Organization of the Paper The remainder of this manuscript is organized as follows. Sec-
tion 2 introduces the ABI framework and its algorithmic components. Section 3 establishes
the empirical convergence rates of the proposed MSW distance, characterizes its topological
properties, and proves that the ABI posterior converges to the target posterior as the tolerance
threshold vanishes. Section 4 demonstrates the effectiveness of ABI through extensive empirical
evaluations. Finally, Section 5 summarizes the paper and outlines future research directions.
Proofs of technical results and additional simulation details are deferred to the Appendix.
that is, the distribution of θ conditional on the event that the posterior induced by dataset x lies
within an ϵ-neighborhood of the observed posterior under the MSW metric. This formulation
enables direct comparison of candidate posteriors via the posterior MSW distance and supports
efficient inference by exploiting its quantile representation. As such, ABI approximates the
target posterior through nonparametric posterior matching. In practice, for each proposed θ,
we simulate an associated dataset X and evaluate the MSW distance between the conditional
posterior π(θ | X) and the observed-data posterior π(θ | x∗ ). Simulated samples for which
this estimated deviation is small are retained, thereby steering our approximation progressively
closer to the true posterior. For brevity, we denote the target posterior by π ∗ = π(θ | x∗ ). Our
approach proceeds in four steps.
Step 1. Estimate the trimmed MSW distance between posteriors π ∗ and π(θ | x), x ∈ X using
conditional quantile regression with multilayer feedforward neural networks; see Section
2.1.2.
Step 2. Sample from the current proposal distribution by decomposing it into a marginal com-
ponent over θ and an acceptance constraint on X, then employ rejection sampling; see
Section 2.2.1.
Step 3. Refine the posterior approximation via acceptance–rejection sampling: retain only
those synthetic parameter draws whose simulated data yield an estimated MSW dis-
tance to π ∗ below the specified threshold, and discard the rest; see Section 2.2.2.
Step 4. Update the proposal for the next iteration by fitting a generative model to the accepted
parameter draws; see Section 2.2.3.
These objectives are integrated into a unified algorithm with a nested sequential structure.
At each iteration t = 1, 2, . . . , T, the proposal distribution is updated using the previous step's posterior approximation, thus constructing a sequence of partial posteriors π_∗^{(1)}, π_∗^{(2)}, . . . , π_∗^{(T)}
that gradually shift toward the target posterior π ∗ . This iterative approach improves the ac-
curacy of posterior approximation through adaptive concentration on regions of high posterior
alignment, which in turn avoids the unstable variance that can arise from single-round inference.
By contrast, direct Monte Carlo estimation would require an infeasible number of simulations
to observe even a single instance where the generated sample exhibits sufficient similarity to
the observed data. For a concrete illustration of sequential refinement, see the simple Gaus-
sian–Gaussian conjugate example in the Appendix C.1. The complete procedure is presented in
Algorithm 2.
We first discuss the conceptual underpinnings of our proposed posterior-based distance.
Consider the target posterior π ∗ , which conditions on the observed data x∗ . If an alternative
posterior π(θ | X) is close to π ∗ under an appropriate measure of posterior discrepancy, then
π(θ | X) naturally constitutes a viable approximation to the target posterior. Thus, we can
select, from among all candidate posteriors, the ones whose divergence from the true posterior
falls within the prescribed tolerance. We formalize this intuition through the notion of minimal
posterior sufficiency.
Under the classical frequentist paradigm, Fisher showed that the likelihood function L(θ; X) = P_θ^{(n)}(X), viewed as a random function of the data, is a minimal sufficient statistic for θ as it en-
capsulates all available information about the parameter θ (Berger and Wolpert, 1988, Chapter
3). This result is known as the likelihood principle. The likelihood principle naturally extends
to the Bayesian regime since the posterior distribution is proportional to π(θ) P_θ^{(n)}(X). In par-
ticular, Theorem 2.1 shows that, given a prior distribution π(·), the posterior map X 7→ π(· | X)
is minimally Bayes sufficient with respect to the prior π. We refer to this concept as minimal
posterior sufficiency.
Definition 2.1 (Bayes Sufficient). A statistic T (X) is Bayes sufficient with respect to a prior
distribution π(θ) if π(θ | X) = π(θ | T (X)).
Theorem 2.1 (Minimal Bayes Sufficiency of Posterior Distribution). The posterior map X 7→
π(· | X) is minimally Bayes sufficient.
Ideally, inference should be based on minimally sufficient statistics, which suggests that our
posterior inference should utilize such statistics when low-dimensional versions exist. However,
low-dimensional sufficient statistics—let alone minimally sufficient ones—are available for only a
very limited class of distributions; consequently, we need to consider other alternatives to classi-
cal summaries. Observe that the acceptance event formed by matching the infinite-dimensional
statistics π(θ | X) and π(θ | x∗) coincides with the event that the posterior induced by X equals the posterior induced by x∗.
Leveraging this equivalence, our key insight is to collapse the infinite-dimensional posterior maps
into a one-dimensional kernel statistic formed by applying a distributional metric on posterior
measures—thus preserving essential geometric structures, conceptually analogous to the “kernel
Algorithm 2: Adaptive Bayesian Inference (ABI)
Input: Tolerance thresholds: ∞ > ϵ_1 > ϵ_2 > · · · > ϵ_T;
  Generative model: G_β;
  MSW distance parameters:
    Trimming parameter: δ ∈ (0, 1/2);
    Mixing parameter: λ ∈ (0, 1);
    Number of slices: K > 0;
    Number of discretization points: H > 0;
4 Sample K random projections φ^{(k)} ∼ σ(S^{d−1}) and form the projection set Φ = {φ^{(k)}}_{k=1}^{K} ∪ {e_j}_{j=1}^{d}; set K′ ← K + d;
11 Set π̂_ABI^{(ϵ_T)}(θ | x∗) ← π̂_∗^{(T)}(θ);
Output: Generative model that samples from the ABI posterior: π̂_ABI^{(ϵ_T)}(θ | x∗);
trick.” We realize this idea concretely via the novel Marginally-augmented Sliced Wasserstein
(MSW) distance. The MSW distance preserves marginal structure and mitigates the curse
of dimensionality, achieving the parametric convergence rate when p = 1 (see Section 3.4).
Moreover, MSW is topologically equivalent to the classical Wasserstein distance, retaining its
geometric properties such as metrizing weak convergence.
2.1.2 Estimating Trimmed MSW Distance via Deep Conditional Quantile Regression
To mitigate the well-known sensitivity of the Wasserstein and Sliced Wasserstein distances to
heavy tails, we adopt a robust, trimmed variant of the MSW distance, expanding upon the
works of Alvarez-Esteban et al. (2008) and Manole et al. (2022). To set the stage for our
multivariate extension, we first recall the definition of the trimmed Wasserstein distance in one
dimension. For univariate probability measures µ and ν, and a trimming parameter δ ∈ [0, 1/2),
the δ-trimmed Wp distance is defined as:
W_{p,δ}(µ, ν) = ( (1/(1−2δ)) ∫_δ^{1−δ} |F_µ^{−1}(τ) − F_ν^{−1}(τ)|^p dτ )^{1/p},    (2.1)

where F_µ^{−1} and F_ν^{−1} denote the quantile functions of µ and ν, respectively.
We now extend this univariate trimming concept to the multivariate setting and provide a
formal definition of the trimmed MSW distance.
Definition 2.2. Let δ ∈ [0, 1/2) be a trimming constant, and let µ and ν be probability
measures on Rd (with d ≥ 2) that possess finite p-th moments for p ≥ 1. The δ-trimmed
Marginally-augmented Sliced Wasserstein (MSW) distance between µ and ν is defined as
MSW_{p,δ}(µ, ν) = λ · (1/d) Σ_{j=1}^{d} [ W_{p,δ}^p(µ_j, ν_j) ]^{1/p} + (1 − λ) [ E_{φ∼σ} W_{p,δ}^p(φ_# µ, φ_# ν) ]^{1/p},    (2.2)

where the first term is the marginal augmentation and the second term is the (trimmed) Sliced Wasserstein distance;
λ ∈ (0, 1) is a mixing parameter; σ is the uniform probability measure on the unit sphere Sd−1 ;
µj denotes the marginal distribution of the j-th coordinate under the joint measure µ; and
φ# denotes the pushforward by the projection φ. This robustification of the MSW distance
compares distributions after trimming up to a 2δ fraction of their mass along each projection.
Remark 2.1. When d = 1, Definition 2.2 reduces to the marginal term alone. In that case,
the trimmed MSW distance coincides exactly with the standard trimmed Wasserstein distance
between the two one-dimensional distributions.
Remark 2.2. When δ = 0, MSWp,δ reduces to the untrimmed MSWp distance. For complete-
ness, we give the formal definition of MSWp (·, ·) in Appendix A.1.
The trimmed MSW distance comprises two components: the Sliced Wasserstein term, which
captures joint interactions through random projections on the unit sphere, and a marginal aug-
mentation term, which gauges distributional disparities along coordinate axes. The inclusion of
the marginal term enhances the MSW distance’s sensitivity to discrepancies along each coordi-
nate axis, remedying the inefficiency of standard SW projections that arises from uninformative
directions sampled uniformly at random. Furthermore, because the SW distance is approximated
via Monte Carlo, explicitly accounting for coordinate-wise marginals is particularly pivotal as
these marginal distributions directly determine the corresponding posterior credible intervals.
The value of incorporating axis-aligned marginals has also been highlighted in recent works
(Moala and O’Hagan, 2010; Drovandi et al., 2024; Chatterjee et al., 2025; Lu et al., 2025). For
brevity, unless stated otherwise, we refer to the trimmed MSW distance simply as the MSW
distance throughout for the remainder of this section.
In continuation of our earlier discussion on the need for a posterior space metric, the pos-
terior MSW distance quantifies the extent to which posterior distributions shift in response to
perturbations in the observations. In contrast, most existing ABC methods rely on distances
computed directly between datasets, either as D(x, x∗) or as D(µ̂_x, µ̂_{x∗}), where µ̂_· denotes the empirical distribution—serving as indirect proxies for posterior discrepancy due to the fundamental
challenges in estimating posterior-based metrics. Importantly, our approach overcomes this lim-
itation by leveraging the quantile representation of the posterior MSW distance, as formally
established in Definition 2.3.
Definition 2.3 (Quantile Representation of MSW Distance). The trimmed MSW distance de-
fined in Definition 2.2 can be equivalently expressed using the quantile representation as
MSW_{p,δ}(µ, ν) = (λ/d) Σ_{j=1}^{d} ( (1/(1−2δ)) ∫_δ^{1−δ} |F_{µ_j}^{−1}(τ) − F_{ν_j}^{−1}(τ)|^p dτ )^{1/p}
                + (1 − λ) ( ∫_{S^{d−1}} (1/(1−2δ)) ∫_δ^{1−δ} |F_{φ_# µ}^{−1}(τ) − F_{φ_# ν}^{−1}(τ)|^p dτ dσ(φ) )^{1/p}.    (2.3)
In practice, given K random projection directions and H + 1 quantile levels on [δ, 1 − δ], we estimate the trimmed MSW distance between posteriors π_x and π_{x′} as follows:

\widehat{MSW}_{p,δ,K,H}(π_x, π_{x′}) = (λ/d) Σ_{j=1}^{d} [ I_H(F_{π_{x,j}}^{−1}, F_{π_{x′,j}}^{−1}) ]^{1/p}
                                      + (1 − λ) ( (1/K) Σ_{k=1}^{K} I_H(F_{φ^{(k)}_# π_x}^{−1}, F_{φ^{(k)}_# π_{x′}}^{−1}) )^{1/p},    (2.4)

where I_H(q_1, q_2) = (∆ / (2(1 − 2δ))) [ |q_1(δ) − q_2(δ)|^p + 2 Σ_{h=1}^{H−1} |q_1(δ + h∆) − q_2(δ + h∆)|^p + |q_1(1 − δ) − q_2(1 − δ)|^p ].    (2.5)

In the equations above, I_H represents the trapezoidal discretization function with grid spacing ∆, while F_{π_{x,j}}^{−1}(τ) denotes the quantile function associated with the j-th coordinate of π_x, evaluated at quantile level τ.
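To make the estimator in (2.4)–(2.5) concrete, the following sketch evaluates it with empirical quantiles of two equal-size samples standing in for the conditional quantile functions (in ABI itself these come from the trained quantile network); the grid spacing ∆ = (1 − 2δ)/H is our reading of the trapezoidal rule, and the function names are illustrative.

```python
import numpy as np

def trap_IH(q1, q2, delta, p):
    """Trapezoidal discretization I_H of (2.5); q1, q2 hold the H+1
    quantile values on the grid delta, delta+step, ..., 1-delta."""
    H = len(q1) - 1
    step = (1.0 - 2.0 * delta) / H          # assumed grid spacing Delta
    diff = np.abs(np.asarray(q1) - np.asarray(q2)) ** p
    return step / (2.0 * (1.0 - 2.0 * delta)) * (
        diff[0] + 2.0 * diff[1:-1].sum() + diff[-1]
    )

def trimmed_msw(x, y, delta=0.05, lam=0.5, p=1, K=20, H=20, rng=None):
    """Trimmed MSW estimate (2.4) between the empirical measures of x and y."""
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    taus = np.linspace(delta, 1.0 - delta, H + 1)
    # Marginal augmentation: average of per-coordinate trimmed distances.
    marg = 0.0
    for j in range(d):
        qx = np.quantile(x[:, j], taus)
        qy = np.quantile(y[:, j], taus)
        marg += trap_IH(qx, qy, delta, p) ** (1.0 / p)
    marg /= d
    # Sliced term: Monte Carlo average over K random projections.
    phi = rng.standard_normal((K, d))
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)
    sw = np.mean([
        trap_IH(np.quantile(x @ f, taus), np.quantile(y @ f, taus), delta, p)
        for f in phi
    ])
    return lam * marg + (1.0 - lam) * sw ** (1.0 / p)
```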
Remark 2.3. The estimated MSW distance can be viewed both as a measure of discrepancy
between the posterior distributions and as an informative low-dimensional kernel statistic. De-
pending on the context, we will use these interpretations interchangeably to best suit the task
at hand.
Remark 2.4. When only the individual posterior marginals π(θj | x∗ ) for j = 1, . . . , d are
of interest, one can elide the Sliced Wasserstein component entirely and compute only the
univariate marginal terms. Since the marginals often suffice for decision-making without the
full joint posterior (Moala and O’Hagan, 2010), this approach yields substantial computational
savings.
To estimate the quantile functions, we perform nonparametric conditional quantile regression
(CQR) via deep ReLU neural networks. These networks have demonstrated remarkable abilities
to approximate complex nonlinear functions and adapt to unknown low-dimensional structures
while possessing attractive theoretical properties. In particular, Padilla et al. (2022) establish
that, under mild smoothness conditions, the ReLU-network quantile regression estimator attains
minimax-optimal convergence rates.
Definition 2.4 (Deep Neural Networks). Let ϕ(x) = max{x, 0} be the ReLU activation func-
tion. For a network with L hidden layers, let d = (d0 , d1 , . . . , dL+1 )⊤ ∈ RL+2 specify the number
of neurons in each layer, where d0 represents the input dimension and dL+1 the output dimen-
sion. The class of multilayer feedforward ReLU neural networks specified by architecture (L, d)
comprises all functions from R^{d_0} to R^{d_{L+1}} formed by composing affine maps with elementwise ReLU activations,

f(x) = g_{L+1} ∘ ϕ ∘ g_L ∘ ϕ ∘ · · · ∘ ϕ ∘ g_1(x),

where each layer ℓ is represented by an affine transformation g_ℓ(d^{(ℓ−1)}) = W^{(ℓ)} d^{(ℓ−1)} + b^{(ℓ)}, with W^{(ℓ)} ∈ R^{d_ℓ × d_{ℓ−1}} as the weight matrix and b^{(ℓ)} ∈ R^{d_ℓ} as the bias vector.
Building on Definition 2.4, we approximate (2.4) by training a single deep ReLU network to
jointly predict all slice-quantiles. Let K ′ := K + d denote the total number of directions in the
augmented projection set Φ = {φ^{(k)}}_{k=1}^{K} ∪ {e_j}_{j=1}^{d}. For each projection φ^{(k)} ∈ Φ and quantile level τ_h, let

Q∗_{k,h}(x) = F_{φ^{(k)}_# π_x}^{−1}(τ_h)

be the true conditional τ_h-quantile along φ^{(k)}. Given N_train training pairs {(x^{(m)}, θ^{(m)})}_{m=1}^{N_train}, we learn Q̂_{K′,H} : X → R^{K′(H+1)} by solving

Q̂_{K′,H} = arg min_{Q∈Q} Σ_{k=1}^{K′} Σ_{h=0}^{H} Σ_{m=1}^{N_train} ρ_{τ_h,κ}(⟨φ^{(k)}, θ^{(m)}⟩ − Q_{[k,h]}(x^{(m)})),    (2.6)

where Q is the class of ReLU neural network models with architecture (L, d) and output dimension d_{L+1} = K′(H + 1), and Q_{[k,h]} is the ((H + 1)(k − 1) + h + 1)-th entry of the flattened output. This single network thus shares parameters across all K′ slices and H + 1 quantile levels.
In contrast to the conventional pinball quantile loss (Padilla et al., 2022), we employ the Huber quantile regression loss (Huber, 1964), which is less sensitive to extreme outliers. This loss function, parameterized by a threshold κ (Dabney et al., 2018), is defined as

ρ_{τ,κ}(u) = { (1/(2κ)) |τ − 1(u < 0)| u²,        |u| ≤ κ,
             { |τ − 1(u < 0)| (|u| − κ/2),        |u| > κ.    (2.7)
Upon convergence of the training process, we obtain a single quantile network that outputs the
predicted quantile of θ for any given projection φ(k) , quantile level τh , and conditioning variable
x ∈ X.
Remark 2.5. Unlike Padilla et al. (2022), we impose no explicit monotonicity constraints in (2.6); that is, we do not enforce the ordering restriction Q_{[k,0]}(x) ≤ Q_{[k,1]}(x) ≤ · · · ≤ Q_{[k,H]}(x) during the joint estimation stage. Instead, for each projection we simply sort the predicted H + 1 values in ascending order. This post-processing step automatically guarantees the non-crossing restriction without adding any constraints to the optimization.
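As an illustration, a condensed PyTorch sketch of the joint quantile network and the Huber quantile loss in (2.6)–(2.7) follows; the architecture, optimizer interface, and tensor layout are illustrative choices rather than the configuration used in the experiments, and the sorting from Remark 2.5 is applied only at prediction time.

```python
import torch
import torch.nn as nn

def huber_quantile_loss(u, tau, kappa=1.0):
    """rho_{tau,kappa}(u) from (2.7); u and tau broadcast elementwise."""
    abs_u = u.abs()
    weight = (tau - (u < 0).float()).abs()
    quad = 0.5 * u ** 2 / kappa                 # |u| <= kappa branch
    lin = abs_u - 0.5 * kappa                   # |u| >  kappa branch
    return weight * torch.where(abs_u <= kappa, quad, lin)

class QuantileNet(nn.Module):
    """ReLU network jointly predicting all K' x (H+1) slice-quantiles."""
    def __init__(self, x_dim, n_dirs, n_taus, width=256, depth=3):
        super().__init__()
        layers, d_in = [], x_dim
        for _ in range(depth):
            layers += [nn.Linear(d_in, width), nn.ReLU()]
            d_in = width
        layers.append(nn.Linear(d_in, n_dirs * n_taus))
        self.net = nn.Sequential(*layers)
        self.n_dirs, self.n_taus = n_dirs, n_taus

    def forward(self, x):
        return self.net(x).view(-1, self.n_dirs, self.n_taus)

    @torch.no_grad()
    def predict_quantiles(self, x):
        # Post-processing (Remark 2.5): sort the H+1 values per slice so
        # the non-crossing restriction holds at prediction time.
        return torch.sort(self.forward(x), dim=-1).values

def train_step(model, opt, x, theta, Phi, taus, kappa=1.0):
    """One gradient step on the objective (2.6).

    x: (N, x_dim), theta: (N, d), Phi: (K', d), taus: (H+1,)."""
    targets = theta @ Phi.T                    # projections <phi^(k), theta^(m)>
    preds = model(x)                           # (N, K', H+1)
    u = targets.unsqueeze(-1) - preds          # residuals per quantile level
    loss = huber_quantile_loss(u, taus.view(1, 1, -1), kappa).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```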
Collectively, the elements in this section constitute the core of the nonparametric distribution
matching component of our proposed methodology. The corresponding algorithmic procedure
for distribution matching is summarized in Algorithm 3.
Algorithm 3: Trimmed MSW Distance Estimation via CQR
Input: Proposal distribution π (t) (θ, X);
Number of random projections K;
Number of quantile levels H;
Smoothing parameter κ;
Network architecture (L, d);
Training sample size Ntrain ;
1 Sample K directions {φ^{(k)}}_{k=1}^{K} ∼ Unif(S^{d−1});
P(θ | X ∈ A_1) P_θ(X | X ∈ A_1) = P(θ, X | X ∈ A_1)    (conditional refinement).
To refine to At ⊆ At−1 , note that
P(θ | X ∈ A_{t−1}) P_θ(X ∈ A_t | X ∈ A_{t−1}) = P(θ, X ∈ A_t) / P(X ∈ A_{t−1}) ∝ P(θ | X ∈ A_t)    (sequential update).
Thus, by conditioning on At , we obtain samples from the intermediate partial posterior P(θ | X ∈
At ). Iterating this procedure until termination yields the final approximation P(θ | X ∈ AT ),
which converges to the true posterior as ϵT approaches 0 and the acceptance regions become
increasingly precise.
In the following subsections, we present a detailed implementation for each of these three
steps.
In this section, we describe how to generate samples from the refined proposal distribution using
rejection sampling.
Decoupling the Joint Proposal Let ϵ1 > ϵ2 > · · · > ϵT be a user-specified, decreasing
sequence of tolerances. Define the data-space acceptance region and its corresponding event by
A_t = {x ∈ X^n : \widehat{MSW}_{p,δ,K,H}(π_x, π_{x∗}) ≤ ϵ_t} ⊆ X^n,    E_t = {ω : X(ω) ∈ A_t}.
At iteration t ≥ 1, we adopt the joint proposal distribution over (θ, X) given by:
π^{(1)}(θ, X) = π(θ) P_θ^{(n)}(X),    π^{(t)}(θ, X) = π(θ, X | E_{t−1}) for t = 2, . . . , T,
where we condition on the event Et−1 = {X ∈ At−1 } with A0 = X . Since direct sampling from
this conditional distribution is generally infeasible, we recover it via rejection sampling after
decoupling the joint proposal. Let
π^{(t)}(θ) = ∫_{X^n} π^{(t)}(θ, x) dx
be the marginal law of π (t) over θ. This auxiliary distribution matches the correct conditional
marginal while remaining independent of X. Observe that the joint proposal for the t-th iteration
π (t) admits the factorization:
π^{(t)}(θ, X) = π(θ | E_{t−1}) P_θ^{(n)}(X | E_{t−1}) = π^{(t)}(θ) · P_θ^{(n)}(X | X ∈ A_{t−1}),    (2.8)

where the first factor is the marginal in θ and the second factor is the constraint on X.
Note that for a given θ ∈ Ω, the data-conditional term in the equation above satisfies
P_θ^{(n)}(X | X ∈ A_{t−1}) ∝ P_θ^{(n)}(X) · 1{\widehat{MSW}_{p,δ,K,H}(π_X, π_{x∗}) ≤ ϵ_{t−1}},    (2.9)

where the proportionality symbol hides the normalizing constant P_θ^{(n)}(A_{t−1}). The decomposition
in (2.8) cleanly decouples the proposal distribution into a marginal draw over θ and a constraint
on X. In other words, the first component eliminates the coupling while retaining the correct
conditional marginal π (t) (θ), and the second term imposes a data-dependent coupling constraint
to be enforced via a simple rejection step.
To sample from this conditional distribution without computing its normalizing constant P_θ^{(n)}(A_{t−1}), we apply rejection sampling to the unnormalized joint factorization in (2.8):
1. Sample θ ∼ π^{(t)}(θ);
2. Generate X ∼ P_θ^{(n)} repeatedly until 1{\widehat{MSW}_{p,δ,K,H}(π_X, π_{x∗}) ≤ ϵ_{t−1}} = 1.
By construction, the marginal distribution of θ remains π(θ | Et−1 ) since all θ values are uncon-
ditionally accepted, while the acceptance criterion precisely enforces the constraint X ∈ At−1 .
Consequently, the retained pairs (θ, X) follow the desired joint distribution π(θ, X | Et−1 ). How-
ever, it is practically infeasible to perform exact rejection sampling as the expected number of
simulations for Step 2 may be unbounded. To address this limitation, we introduce a budget-
constrained rejection procedure termed Approximate Rejection Sampling (ARS), as outlined in
Algorithm 4. The core idea is as follows: given a fixed computational budget R ∈ N+ , we repeat
Step 2 at most R times. If no simulated data set satisfies the tolerance criterion within this
budget, the current parameter proposal is discarded, and the algorithm proceeds to the next
parameter draw.
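A minimal sketch of this budget-constrained rejection step is shown below; the callables for the proposal sampler, the simulator, and the MSW estimator are placeholders, and Algorithm 4 in the paper specifies the exact procedure.

```python
def approximate_rejection_sampling(sample_theta, simulate, msw_hat, eps_prev,
                                   n_accept, budget_R):
    """Budget-constrained rejection sampling for the joint proposal (2.8).

    sample_theta(): draws theta from the current marginal proposal pi^(t).
    simulate(theta): draws a synthetic dataset X ~ P_theta^(n).
    msw_hat(X): estimated MSW distance between pi(. | X) and pi(. | x*).
    eps_prev: tolerance eps_{t-1} defining the acceptance region A_{t-1}.
    budget_R: maximum number of simulation attempts per theta draw."""
    accepted = []
    while len(accepted) < n_accept:
        theta = sample_theta()
        for _ in range(budget_R):
            X = simulate(theta)
            if msw_hat(X) <= eps_prev:
                accepted.append((theta, X))   # keep the pair (theta, X)
                break
        # If no dataset meets the tolerance within the budget, the current
        # theta draw is discarded and a new one is proposed.
    return accepted
```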
While approximate rejection sampling introduces a small bias, the approximation error be-
comes negligible under appropriate conditions. Theorem 2.2 establishes that, under mild regu-
larity conditions, the resulting error decays exponentially fast in R.
Assumption 2.1 (Local Positivity). There exist constants c > 0 and γ > 0 such that, for every θ with π^{(t)}(θ) > 0, the per-draw acceptance probability P_θ^{(n)}(A_t) is uniformly bounded away from 0, satisfying P_θ^{(n)}(A_t) ≥ c ϵ_t^γ.

Assumption 2.1 is satisfied, for instance, if the kernel statistic D_θ(X) = \widehat{MSW}_{p,δ,K,H}(π_X, π_{x∗}) under X ∼ P_θ^{(n)} admits a continuous density q_θ(u) that is strictly positive in a neighborhood of u = 0. In that case, for small ϵ_t,

P_θ^{(n)}(A_t) = P_θ^{(n)}(D_θ ≤ ϵ_t) = ∫_0^{ϵ_t} q_θ(u) du ≥ q_θ(0) ϵ_t / 2.
Theorem 2.2 (Sample Complexity for ARS). Suppose Assumption 2.1 holds. For any δ̄ ∈ (0, 1) and ϵ_t > 0, if the number of proposal draws R satisfies R = O( log(1/δ̄) / ϵ_t^γ ), then the total-variation distance between the exact and approximate proposal distributions obeys D_TV(π^{(t)}, π^{(t)}_ARS) ≤ δ̄.
The strength of ABI lies in its sequential refinement of partial posteriors through a process guided
by a descending sequence of tolerance thresholds ϵ1 > ϵ2 > · · · > ϵT . This sequence progressively
tightens the admissible deviation from the target posterior π ∗ , yielding increasingly improved
posterior approximations. By iteratively decreasing the tolerances rather than prefixing a single
small threshold, ABI directs partial posteriors to dynamically focus on regions of the parameter
space most compatible with the observed data. This adaptive concentration is particularly
advantageous when the prior is diffuse (i.e., uninformative) or the likelihood is concentrated in
low-prior-mass regions, a setting in which one-pass ABC is notoriously inefficient.
The refinement procedure unfolds as follows. First, we acquire N samples from the proposal
distribution via Algorithm 4, namely
(θ^{(i)}, X^{(i)}) ∼ π^{(t)}(θ, X), i = 1, . . . , N, which form the initial proposal set S_0^{(t)} = {(θ^{(i)}, X^{(i)})}_{i=1}^{N} for the subsequent refinement. Next, we retain only those parameter draws θ^{(i)} that exhibit a sufficiently small estimated posterior MSW distance and discard the remainder. This selection yields the training set for the generative density estimation step,

S_{θ,∗}^{(t)} = { θ^{(i)} : (θ^{(i)}, X^{(i)}) ∈ S_0^{(t)} and \widehat{MSW}_{p,δ,K,H}(π_{X^{(i)}}, π_{x∗}) ≤ ϵ_t },
consisting of parameter samples drawn from the refined conditional distribution at the current
iteration. Let π_∗^{(t)} denote the true (unobserved) marginal distribution of θ underlying the empirical parameter set S_{θ,∗}^{(t)}.¹ Importantly, π_∗^{(t)} depends solely on θ since the X component has
been discarded—thus removing the coupling between θ and X.
By design, our pruning procedure progressively refines the parameter proposals by incor-
porating accumulating partial information garnered from previous iterations. As the tolerance
decays, the retained parameters are incrementally confined to regions that closely align with the
target posterior π^∗, thereby sculpting each partial posterior π_∗^{(t)} toward π^∗.
Determining the Sequence of Tolerance Levels Thus far, our discussion has implicitly
assumed that the choice of ϵ_t yields a non-empty set S_{θ,∗}^{(t)}. To ensure that S_{θ,∗}^{(t)} is non-empty, the tolerance level ϵ_t must be chosen judiciously relative to ϵ_{t−1}. In particular, ϵ_t should neither be substantially smaller than ϵ_{t−1} (which might result in an empty set) nor excessively large (which would lead to inefficient refinement). By construction, the initial proposal samples S_0^{(t)} = {(θ^{(i)}, X^{(i)})} satisfy \widehat{MSW}_{p,δ,K,H}(π_{X^{(i)}}, π_{x∗}) ≤ ϵ_{t−1}; we then set ϵ_t = ϵ_t(α) to the empirical α-quantile of the estimated MSW distances over these samples, where α ∈ (0, 1) is a quantile threshold hyperparameter (Biau et al., 2015). In this manner, our selection procedure yields a monotone decreasing sequence of thresholds, ϵ_0(α) > ϵ_1(α) > · · · > ϵ_T(α), while ensuring that the refined parameter sets S_{θ,∗}^{(t)} remain non-empty.
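One way to realize this quantile-based threshold update in code (a minimal sketch; the array of estimated distances and the hyperparameter α are assumed inputs):

```python
import numpy as np

def next_tolerance(msw_distances, alpha):
    # Set the next tolerance to the empirical alpha-quantile of the estimated
    # MSW distances over the current proposal set.
    return np.quantile(np.asarray(msw_distances), alpha)
```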
To incorporate the accumulated information into subsequent iterations, we update the proposal
distribution using the current partial posterior. Recall that the partial posterior factorizes into a
marginal component over θ and a constraint on X, as described in Section 2.2.1. In this section,
we focus on updating the marginal partial posterior by applying generative density estimation
to the retained parameter draws from the preceding pruning step. This process ensures that
the proposal distribution at each iteration reflects all refined information acquired in earlier
iterations.
¹ The true distribution π_∗^{(t)} is unobserved, as only its empirical counterpart is available.
Marginal Proposal Update with Generative Modeling Our approach utilizes a genera-
tive model Gβ (Z) : Z 7→ θ̂, parameterized by β, which transforms low-dimensional latent noise
Z into synthetic samples θ̂. When properly trained to convergence on the refined set S_{θ,∗}^{(t)}, G_β produces a generative distribution denoted by π̂_∗^{(t)} that closely approximates the target partial posterior π_∗^{(t)}. Since ABI is compatible with any generative model—including generative adver-
sarial networks, variational auto-encoders, and Gaussian mixture models—practitioners enjoy
considerable flexibility in their implementation choices. In this work, we employ POTNet (Lu
et al., 2025) because of its robust performance and resistance to mode collapse, which are cru-
cial attributes for preserving diversity of the target distribution and minimizing potential biases
arising from approximation error that could propagate to subsequent iterations.
At the end of iteration t, we update the proposal distribution for the (t + 1)-th iteration with
the t-th iteration’s approximate marginal partial posterior:
π^{(t+1)}(θ) ← π̂_∗^{(t)}(θ).
Thereafter, we can simply apply the ARS algorithm described in Section 2.2.1 to generate samples
from the joint proposal π (t+1) (θ, X) conditional on the event Et . At the final iteration T, we
take the ABI posterior to be
π̂_ABI^{(ϵ_T)}(θ | x∗) ← π̂_∗^{(T)}(θ),
which approximates the coarsened target distribution π(θ | MSWp (πX , πx∗ ) ≤ ϵT ). We em-
phasize that the core of ABI ’s sequential refinement mechanism hinges on the key novelty of
utilizing generative models, whose inherent generative capability enables approximation and
efficient sampling from the revised proposal distributions.
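A minimal sketch of the proposal-update step follows; the paper trains POTNet for this purpose, and the scikit-learn Gaussian mixture below is only a stand-in generator with a comparable fit/sample interface.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def update_proposal(accepted_thetas, n_components=10, seed=0):
    """Fit a generative density estimate to the accepted parameter draws and
    return a sampler for the next iteration's marginal proposal.

    The paper uses POTNet here; a Gaussian mixture serves purely as a
    placeholder generative model."""
    gm = GaussianMixture(n_components=n_components, random_state=seed)
    gm.fit(np.asarray(accepted_thetas))
    def sample_theta(n=1):
        samples, _ = gm.sample(n)
        return samples if n > 1 else samples[0]
    return sample_theta
```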
Iterative Fine-tuning of the Quantile Network We retrain the quantile network on the
newly acquired samples at each iteration to fully leverage the accumulated information, thereby
adapting the kernel statistics to become more informative about the posterior distribution. This
continual fine-tuning improves our estimation of the posterior MSW discrepancy and thus yields
progressively more accurate refinements of the parameter subset in subsequent iterations.
3 Theoretical Analysis
In this section, we investigate theoretical properties of the proposed MSW distance, its trimmed
version MSWp,δ (·, ·), and the ABI algorithm. In Section 3.1, we first establish some important
topological and statistical properties of the MSW distance between distributions µ, ν under mild
moment assumptions. In particular, we show that the error between the empirical MSW distance
and the true MSW distance decays at the parametric rate (see Remark 3.2). Then, in Section
3.2, we derive asymptotic guarantees on the convergence of the resulting ABI posterior in the
limit of ϵ ↓ 0.
We briefly review the notation as follows. Throughout this section, we assume that p ≥ 1
and d ∈ N+ . We denote by σ(·) the uniform probability measure on Sd−1 , and by Pp (Rd ) the
space of probability measures on Rd with finite p-th moments. Given a probability measure
µ ∈ Pp (Rd ) and the projection mapping fφ : θ 7→ φ⊤ θ (where φ ∈ Sd−1 ), we write φ# µ for its
pushforward under the projection fφ (·). Additionally, for any α ∈ [0, 1] and any one-dimensional
probability measure γ, we denote by F_γ^{−1}(α) the α-th quantile of γ.
Subsequently, we denote by MSWp (·, ·) the untrimmed version of the MSW distance with δ = 0
(as defined in A.1) and omit the subscript δ.
Proposition 3.1 (Metricity). The untrimmed Marginally-augmented Sliced Wasserstein MSWp (·, ·)
distance is a valid metric on Pp (Rd ).
Our first theorem shows that the 1-MSW distance is an IPM and allows a dual formulation.
Theorem 3.1. The 1-MSW distance is an Integral Probability Metric defined by the class,
F_MSW = { f : R^d → R : f(x) = (λ/d) Σ_{j=1}^{d} g_j(e_j^⊤ x) + (1 − λ) ∫_{S^{d−1}} g_φ(φ^⊤ x) dσ(φ),
          g_j, g_φ ∈ Lip_1(R), sup_{φ∈S^{d−1}} |g_φ(0)| < ∞ },    0 < λ < 1,    (3.1)
where for each φ ∈ Sd−1 , gφ : R → R is a 1-Lipschitz function, such that the mapping (φ, t) 7→
gφ (t) is jointly measurable with respect to the product of the Borel σ-algebras on Sd−1 and R.
†
Theorem 3.2 (Topological Equivalence of MSWp and Wp ). There exists a constant Cd,p,λ >0
depending on d, p, λ, such that for µ, ν ∈ Pp (BR (0)),
1 1
† 1− p(d+1)
MSWp (µ, ν) ≤ Cd,p,λ Wp (µ, ν) ≤ Cd,p,λ R MSWpp(d+1) (µ, ν),
1/p
where Cd,p,λ = λ + (1 − λ) d−1 Sd−1 ∥φ∥pp dσ(φ)
R
. Consequently, the p-MSW distance induces
the same topology as the p-Wasserstein distance.
Theorem 3.3 (MSW Metrizes Weak Convergence). The MSW_p distance metrizes weak convergence on P_p(R^d), in the sense of metricity as defined in Definition 6.8 of Villani et al. (2009).
Remark 3.1. This result holds without the requirement of compact domains.
In this section, we establish statistical guarantees for the trimmed MSWp,δ (·, ·) distance as
formalized in Definition 2.2. We focus particularly on how closely the empirical version of this
distance approximates its population counterpart when estimated from finite samples. For any
µ, ν ∈ P_p(R^d) and m, m′ ∈ N_+, we denote by µ̂_m and ν̂_{m′} the empirical measures constructed from m and m′ i.i.d. samples drawn from µ and ν, respectively. Our main result, presented in Theorem 3.4, derives a non-asymptotic bound on |MSW_{p,δ}(µ̂_m, ν̂_{m′}) − MSW_{p,δ}(µ, ν)| that achieves the parametric convergence rate of m^{−1/2} when p = 1 and m = m′, as shown in Eq. (3.7).
Assumptions. We assume that µ, ν ∈ P_{p′}(R^d) where p′ = max{p, 2}. The sample sizes m, m′ ∈ N_+ satisfy min{m, m′} > max{2(p + 2)/δ, log(32d/δ̄)/(2δ²)}, where δ ∈ (0, 1/2) is the trimming parameter and δ̄ ∈ (0, 1) is the confidence level. We define effective radii M_{µ,p}, M_{ν,p} ∈ (0, ∞) such that E_{Z∼µ}[∥Z∥^p] < M_{µ,p} and E_{Z∼ν}[∥Z∥^p] < M_{ν,p}; the existence of these radii follows from the fact that µ, ν ∈ P_p(R^d).

where F_γ is the CDF of γ. Note that from our assumption min{m, m′} > log(32d/δ̄)/(2δ²), we have 2 exp(−2 min{m, m′} δ²) < δ̄/(16d); as lim_{t→∞} ψ_{δ,t}(γ) = δ, there exist ε_{m,d,δ,δ̄}(γ), ε_{m′,d,δ,δ̄}(γ) ∈ (0, ∞) such that

2 exp(−2m ψ_{δ, ε_{m,d,δ,δ̄}(γ)}(γ)²) ≤ δ̄/(16d),    2 exp(−2m′ ψ_{δ, ε_{m′,d,δ,δ̄}(γ)}(γ)²) ≤ δ̄/(16d).    (3.4)

Similarly, let ε_{m,d,1−δ,δ̄}(γ), ε_{m′,d,1−δ,δ̄}(γ) ∈ (0, ∞) be such that

2 exp(−2m ψ_{1−δ, ε_{m,d,1−δ,δ̄}(γ)}(γ)²) ≤ δ̄/(16d),    2 exp(−2m′ ψ_{1−δ, ε_{m′,d,1−δ,δ̄}(γ)}(γ)²) ≤ δ̄/(16d).    (3.5)

For every j ∈ [d], we define

R_{µ_j,δ} := 2( (M_{µ,p}/δ)^{1/p} + ε_{m,d,δ,δ̄}(µ_j) ∨ ε_{m,d,1−δ,δ̄}(µ_j) ),
R_{ν_j,δ} := 2( (M_{ν,p}/δ)^{1/p} + ε_{m′,d,δ,δ̄}(ν_j) ∨ ε_{m′,d,1−δ,δ̄}(ν_j) ).

We further define

R_max := max_{j∈[d]} R_{µ_j,δ} ∨ max_{j∈[d]} R_{ν_j,δ}.    (3.6)
The next theorem quantifies the convergence rate of the empirical trimmed MSW distance.
Theorem 3.4 (Convergence Rate of MSW Distance). Suppose that the assumptions given above
hold. For any δ̄ ∈ (0, 1), with probability at least 1 − δ̄, we have
where
where Rmax is as defined in Eq. (3.6) and Cp > 0 is a constant that depends only on p.
Remark 3.2. Theorem 3.4 states that the empirical trimmed MSWp,δ distance between two
samples converges to the true population trimmed MSWp,δ distance at the rate
In particular, when p = 1 and m = m′ , this recovers the familiar O(m−1/2 ) parametric rate.
Assume that the joint distribution of (Θ, X) (denoted by P_{Θ,X}) admits the density f_{Θ,X}, the marginal distribution of X (denoted by P_X) admits the density f_X, and the likelihood P_Θ^{(n)} admits the density f_{X|Θ}, all with respect to Lebesgue measure. Let x∗ ∈ X^n be such that f_X(x∗) > 0 and f_X is continuous at x∗. Suppose

M := sup_{x∈X^n} ∫_Ω ∥θ∥_p^p f_{Θ,X}(θ, x) dθ < ∞.
Remark 3.3. Contrary to the standard convergence proofs that rely on the Lebesgue differ-
entiation theorem (Barber et al., 2015; Biau et al., 2015; Prangle, 2017), we establish the con-
vergence of the ABC posterior by leveraging martingale techniques. To the best of the authors’
knowledge, this represents the first convergence proof for ABC that employs a martingale-based
method (specifically, leveraging Lévy’s 0–1 law).
4 Empirical Evaluation
In this section, we present extensive empirical evaluations demonstrating the efficacy of ABI
across a broad range of simulation scenarios. We benchmark the performance of ABI against
four widely used alternative methods: ABC with the 2-Wasserstein distance (WABC; see Bern-
ton et al. 2019²), ABC with an automated neural summary statistic (ABC-SS; see Jiang et al.
2017), Sequential Neural Likelihood Approximation (SNLE; see Papamakarios et al. 2019), and
Sequential Neural Posterior Approximation (SNPE; see Greenberg et al. 2019). For SNLE and
SNPE, we employ the implementations provided by the Python SBI package (Tejero-Cantero
et al., 2020). In the Multimodal Gaussian example (Section 4.1), we additionally compare ABI
against the Wasserstein generative adversarial network with gradient penalty (WGAN-GP; see
Gulrajani et al. 2017). We summarize the key characteristics of each method below:
² We use the implementation available at https://github.com/pierrejacob/winference
Compatibility ABI WABC ABC-SS SNLE SNPE
Intractable Prior Yes No Yes No No
Intractable Likelihood Yes Yes Yes Yes Yes
ABC-based Yes Yes Yes No No
Table 1: Compatibility comparison of different inference methods. ABI provides full compatibility with
both intractable priors and intractable likelihood, offering enhanced modeling flexibility.
θ_k ∼ Unif(−3, 3), k = 1, . . . , 5,
µ_θ = (θ_1, θ_2)^⊤,
s_1 = θ_3², s_2 = θ_4², ρ = tanh(θ_5),
Σ_θ = ( s_1²  ρ s_1 s_2 ; ρ s_1 s_2  s_2² ),
X_j | θ ∼ N(µ_θ, Σ_θ), j = 1, . . . , 4.
Despite its structural simplicity, this model yields a complex posterior distribution characterized
by truncated support and four distinct modes that arise from the inherent unidentifiability of
the signs of θ3 and θ4 .
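A short simulator sketch for this model, written directly from the specification above (function names and interface are ours):

```python
import numpy as np

def prior_sample(rng):
    """Draw theta = (theta_1, ..., theta_5) from the uniform prior."""
    return rng.uniform(-3.0, 3.0, size=5)

def simulate(theta, rng, n_obs=4):
    """Simulate X_1, ..., X_4 ~ N(mu_theta, Sigma_theta) given theta."""
    mu = theta[:2]
    s1, s2 = theta[2] ** 2, theta[3] ** 2
    rho = np.tanh(theta[4])
    cov = np.array([[s1 ** 2, rho * s1 * s2],
                    [rho * s1 * s2, s2 ** 2]])
    return rng.multivariate_normal(mu, cov, size=n_obs)

rng = np.random.default_rng(0)
theta = prior_sample(rng)
x = simulate(theta, rng)          # shape (4, 2): four bivariate observations
```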
We implemented ABI with two sequential iterations using adaptively selected thresholds. The
MSW distance was evaluated using 10 quantiles and five SW slices. For SNLE and SNPE, we
similarly conducted two-round sequential inference. To ensure fair comparison, we calibrated
the training budget for ABC-SS3 , WABC, and WGAN to match the total number of samples
utilized for training across both ABI iterations.
Figures 1 and 2 present comparative analyses of posterior marginal distributions generated
by ABI (in red) and alternative inference methods, with the true posterior distribution (shown
in black) obtained via the No-U-Turn Sampler (NUTS) implemented in rstan using 10 MCMC
chains. We illustrate the evolution of the ABI posterior over iterations in Figure 3. Notably,
ABC-SS produces a predominantly unimodal posterior distribution centered around the poste-
³ Since both ABI and ABC-SS are rejection-ABC-based, we applied the same adaptive rejection quantile thresholds for ABI (iteration 1) and ABC-SS.
Figure 1: Comparison of approximate posterior densities obtained from ABI and alternative
benchmark methods under the Multimodal Gaussian model. The true posterior is shown in
the black dashed line. ABI produced posteriors that accurately align with the true posterior
distribution.
rior mean, illustrating its fundamental limitation of yielding only first-order sufficient statistics
(i.e., mean-matching) in the asymptotic regime with vanishing tolerance. WGAN partially cap-
tures the bimodality of θ3 and θ4 , yet produces posterior samples that significantly deviate from
the true distribution. The parameter θ5 poses the greatest challenge for accurate estimation
across methods. Overall, ABI generates samples that closely approximate the true posterior
distribution across all parameters.
Figure 2: Comparison of marginal posteriors generated by ABI and WGAN-GP. The true pos-
terior is shown in the black dashed line.
Table 2 presents a quantitative comparison using multiple metrics: maximum mean discrepancy with a Gaussian kernel, empirical 1-Wasserstein distance⁴, bias in the posterior mean (measured as the absolute difference between the approximate and true posterior means), and bias in the posterior correlation (calculated as the summed absolute deviation between empirical correlation matrices). ABI consistently demonstrates superior performance across the majority of evaluation criteria (with the exception of parameters θ_2 and θ_5).

⁴ The W_1 distance is computed using the Python Optimal Transport (POT) package.
Table 2: Comparative performance evaluation of inference methodologies. Lower values indicate superior
accuracy; best results are highlighted in bold.
Y_i = U_i + max{ 0, Σ_{j=1}^{i} W_j − Σ_{j=1}^{i−1} Y_j } = U_i + max(0, V_i − X_{i−1}).
We assume that the queue is initially empty before the first customer arrives. We assign the
following truncated prior distributions:
For our analysis, we use the dataset from Shestopaloff and Neal (2014)5 , which was simulated
with true parameter values (θ1 , θ2 − θ1 , θ3 ) = (4, 3, 0.15) and consists of n = 50 observations.
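A simulator sketch for this example is given below; it assumes the common M/G/1 ABC parameterization in which service times are U_i ∼ Unif(θ_1, θ_2) and inter-arrival times are W_i ∼ Exp(rate = θ_3), which is consistent with the recursion and the true values quoted above but is not spelled out in the text.

```python
import numpy as np

def simulate_mg1(theta1, theta2, theta3, n=50, rng=None):
    """Simulate n inter-departure times from the M/G/1 queue.

    Assumed parameterization: service times U_i ~ Unif(theta1, theta2) and
    inter-arrival times W_i ~ Exp(rate=theta3); the queue starts empty."""
    rng = np.random.default_rng(rng)
    U = rng.uniform(theta1, theta2, size=n)       # service times
    W = rng.exponential(1.0 / theta3, size=n)     # inter-arrival times
    V = np.cumsum(W)                              # arrival times V_i
    Y = np.empty(n)
    X_prev = 0.0                                  # departure time of customer i-1
    for i in range(n):
        Y[i] = U[i] + max(0.0, V[i] - X_prev)     # recursion above
        X_prev += Y[i]
    return Y

y = simulate_mg1(4.0, 7.0, 0.15)   # true values: theta1=4, theta2-theta1=3, theta3=0.15
```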
Sequential versions of all algorithms were executed using four iterations with 10,000 training
samples per iteration. As in the first example, to ensure fair comparison, we allocated an equal
total number of training samples to WABC and ABC-SS as provided to ABI, SNLE, and SNPE.
Figure 4 presents the posterior distributions for parameters θ1 , θ2 − θ1 , and θ3 , with the true
posterior mean indicated by a black dashed line. The results demonstrate that the approximate
posteriors produced by ABI (in red) not only align most accurately with the true posterior
means, but also concentrate tightly around them.
Figure 4: Comparison of approximate posterior densities under the M/G/1 queuing example.
The dashed black line indicates the true posterior mean at (3.96, 2.99, 0.177). ABI outperforms
alternative approaches and exhibits superior alignment with the true posterior mean.
Figure 5: 30 trajectories simulated from the cosine model with parameter values ω ∗ = 1/80,
ϕ∗ = π/4, log(σ ∗ ) = 0, and log(A∗ ) = log(2).
with prior distributions:
ω ∼ Unif[0, 1/10], ϕ ∼ Unif[0, 2π], log(σ) ∼ N (0, 1), log(A) ∼ N (0, 1).
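The observation equation itself is not reproduced above; the sketch below assumes the form Y_t = A cos(2πωt + ϕ) + σ ε_t with ε_t ∼ N(0, 1), which is consistent with the parameters and the trajectories shown in Figure 5.

```python
import numpy as np

def simulate_cosine(omega, phi, log_sigma, log_A, T=100, rng=None):
    """Assumed observation model: y_t = A*cos(2*pi*omega*t + phi) + sigma*eps_t."""
    rng = np.random.default_rng(rng)
    t = np.arange(1, T + 1)
    A, sigma = np.exp(log_A), np.exp(log_sigma)
    return A * np.cos(2 * np.pi * omega * t + phi) + sigma * rng.standard_normal(T)

y = simulate_cosine(1 / 80, np.pi / 4, 0.0, np.log(2.0))   # true values from the text
```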
Posterior inference for these parameters is challenging because information about ω and ϕ
is substantially obscured in the marginal empirical distribution of observations (Y1 , . . . , Yn ).
The observed data was generated with t = 100 time steps using parameter values ω ∗ = 1/80,
ϕ∗ = π/4, log(σ ∗ ) = 0, and log(A∗ ) = log(2); 30 example trajectories are displayed in Figure 5.
The exact posterior distribution was obtained using the NUTS algorithm. Sequential algorithms
(ABI, SNLE, SNPE) were executed with three iterations, each utilizing 5,000 training samples,
while WABC6 and ABC-SS were trained with a total of 15,000 simulations.
Figure 6: Comparison of approximate posterior densities under the cosine model. The true
posterior density is displayed as a dashed black line. Among all methods, ABI achieves the most
accurate representation of the true posterior.
Figure 7: Approximate posterior densities generated by ABI under the cosine model.
Figure 6 compares the approximate posterior distributions obtained from ABI and alternative
methods, with the true posterior shown in black. Among these parameters, ω indeed proves to
be the most difficult. We observe that ABI again yields the most satisfactory approximation
across all four parameters. For clarity, we additionally provide a direct comparison between the
ABI-generated posterior and the true posterior in Figure 7.
⁶ The Wasserstein distance was computed by treating each dataset as a flattened vector containing 100 independent one-dimensional observations.
4.4 Lotka-Volterra Model
The final simulation example investigates the Lotka-Volterra (LV) model (Din, 2013), which
involves a pair of first-order nonlinear differential equations. This system is frequently used to
describe the dynamics of biological systems involving interactions between two species: a preda-
tor population (represented by yt ) and a prey population (represented by xt ). The populations
evolve deterministically according to the following set of equations:
dx/dt = αx − βxy,    dy/dt = −γy + δxy.
The changes in population states are governed by four parameters (α, β, γ, δ) controlling the
ecological processes: prey growth rate (α), predation rate (β), predator growth rate (δ), and
predator death rate (γ).
Figure 8: 30 trajectories sampled from the Lotka-Volterra model, each corresponding to one of
three distinct parameter configurations (α∗ , β ∗ , γ ∗ , δ ∗ ).
Due to the absence of a closed-form transition density, inference in LV models using tra-
ditional methods presents significant challenges; however, these systems are particularly well-
suited for ABC approaches since they allow for efficient generation of simulated datasets. The
dynamics can be simulated using a discrete-time Markov jump process according to the Gillespie
algorithm (Gillespie, 1976). At time t, we evaluate the following rates,
The algorithm first samples the waiting time until the next event from an exponential distri-
bution with parameter r∆ (t), then selects one of the four possible events (prey death/birth,
predator death/birth) with probability proportional to its own rate r· (t). We choose the trun-
cated uniform prior over the restricted domain [0, 1] × [0, 0.1] × [0, 2] × [0, 0.1]. The true parameter
values are (α∗ , β ∗ , γ ∗ , δ ∗ ) = (0.5, 0.01, 1.0, 0.01) (Papamakarios et al., 2019). Posterior estima-
tion in this case is particularly difficult because the likelihood surface contains concentrated
probability mass in isolated, narrow regions throughout the parameter domain. We initialized
the populations at time t = 0 with (x0 , y0 ) = (50, 100) and recorded system states at 0.1 time
unit intervals over a duration of 10 time units, yielding a total of 101 observations as illustrated
in Figure 8.
We carried out sequential algorithms across two rounds, with each round utilizing 5,000
training samples. For fair comparison, WABC and ABC-SS were provided with a total of 10,000
training samples. The resulting posterior approximations are presented in Figure 9, with black
dashed lines marking the true parameter values, as the true posterior distributions are not
available for this model. The results again demonstrate that ABI consistently provides accurate
approximations across all four parameters of the model.
Figure 9: Comparison of approximate posterior distributions for the Lotka-Volterra model. True
parameter values are indicated by the dashed black lines.
5 Discussion
In this work, we introduce the Adaptive Bayesian Inference (ABI) framework, which shifts the
focus of approximate Bayesian computation from data-space discrepancies to direct compar-
isons in posterior space. Our approach leverages a novel Marginally-augmented Sliced Wasser-
stein (MSW) distance—an integral probability metric defined on posterior measures that com-
bines coordinate-wise marginals with random one-dimensional projections. We then establish
a quantile-representation of MSW that reduces complex posterior comparisons to a tractable
distributional regression task. Moreover, we propose an adaptive rejection-sampling scheme in
which each iteration’s proposal is updated via generative modeling of the accepted parame-
ters in the preceding iteration. The generative modeling–based proposal updates allow ABI to
perform posterior approximation without explicit prior density evaluation, overcoming a key
limitation of sequential Monte Carlo, population Monte Carlo, and neural density estimation
approaches. Our theoretical analysis shows that MSW retains the topological guarantees of the
classical Wasserstein metric, including the ability to metrize weak convergence, while achieving
parametric convergence rates in the trimmed setting when p = 1. The martingale-based proof of
sequential convergence offers an alternative to existing Lebesgue differentiation arguments and
may find broader application in the analysis of other adaptive algorithms.
Empirically, ABI delivers substantially more accurate posterior approximations than Wasser-
stein ABC and summary-based ABC, as well as state-of-the-art likelihood-free simulators and
Wasserstein GAN. Through a variety of simulation experiments, we have shown that the poste-
rior MSW distance remains robust under small observed sample sizes, intricate dependency struc-
tures, and non-identifiability of parameters. Furthermore, our conditional quantile-regression
implementation exhibits stability to network initialization and requires minimal tuning.
Several avenues for future research arise from this work. One promising direction is to inte-
grate our kernel-statistic approach into Sequential Monte Carlo algorithms to improve efficiency
in likelihood-free settings. Future work could also apply ABI to large-scale scientific simulators,
such as those used in systems biology, climate modeling, and cosmology, to spur domain-specific
adaptations of the posterior-matching paradigm.
In summary, ABI offers a new perspective on approximate Bayesian computation as well as
likelihood-free inference by treating the posterior distribution itself as the primary object of
comparison. By combining a novel posterior space metric, quantile-regression–based estimation,
and generative-model–driven sequential refinement, ABI significantly outperforms alternative
ABC and likelihood-free methods. Moreover, this posterior-matching viewpoint may catalyze
further advances in approximate Bayesian computation and open new avenues for inference in
complex, simulator-based models.
Acknowledgements
The authors would like to thank Naoki Awaya, X.Y. Han, Iain Johnstone, Tengyuan Liang,
Art Owen, Robert Tibshirani, John Cherian, Michael Howes, Tim Sudijono, Julie Zhang, and
Chenyang Zhong for their valuable discussions and insightful comments. The authors would
like to especially acknowledge Michael Howes and Chenyang Zhong for their proofreading of
the technical results. W.S.L. gratefully acknowledges support from the Stanford Data Science
Scholarship and the Two Sigma Graduate Fellowship Fund during this research. W.H.W.'s
research was partially supported by NSF grant 2310788.
References
Justin Alsing, Benjamin Wandelt, and Stephen Feeney. Massive optimal data compression and
density estimation for scalable, likelihood-free inference in cosmology. Monthly Notices of the
Royal Astronomical Society, 477(3):2874–2885, 2018.
Pedro César Alvarez-Esteban, Eustasio Del Barrio, Juan Antonio Cuesta-Albertos, and Carlos
Matran. Trimmed comparison of distributions. Journal of the American Statistical Associa-
tion, 103(482):697–704, 2008.
Stuart Barber, Jochen Voss, and Mark Webster. The rate of convergence for approximate
bayesian computation. Electronic Journal of Statistics, 2015.
Mark A Beaumont, Jean-Marie Cornuet, Jean-Michel Marin, and Christian P Robert. Adaptive
approximate bayesian computation. Biometrika, 96(4):983–990, 2009.
James O Berger and Robert L Wolpert. The likelihood principle. IMS, 1988.
Espen Bernton, Pierre E Jacob, Mathieu Gerber, and Christian P Robert. Approximate bayesian
computation with the wasserstein distance. Journal of the Royal Statistical Society Series B:
Statistical Methodology, 81(2):235–269, 2019.
Gérard Biau, Frédéric Cérou, and Arnaud Guyader. New insights into approximate bayesian
computation. In Annales de l’IHP Probabilités et statistiques, volume 51, pages 376–403, 2015.
Fernando V Bonassi and Mike West. Sequential monte carlo with adaptive weights for approxi-
mate bayesian computation. Bayesian Analysis, 2015.
Nicolas Bonnotte. Unidimensional and evolution methods for optimal transportation. PhD thesis,
Université Paris Sud-Paris XI; Scuola normale superiore (Pise, Italie), 2013.
Ewan Cameron and AN Pettitt. Approximate bayesian computation for astronomical model
analysis: a case study in galaxy demographics and morphological transformation at high
redshift. Monthly Notices of the Royal Astronomical Society, 425(1):44–65, 2012.
Neel Chatterjee, Somya Sharma, Sarah Swisher, and Snigdhansu Chatterjee. Approximate
bayesian computation for physical inverse modeling. arXiv preprint arXiv:2111.13296, 2021.
Sourav Chatterjee, Trevor Hastie, and Robert Tibshirani. Univariate-guided sparse regression.
arXiv preprint arXiv:2501.18360, 2025.
Manuel Chiachío-Ruano, Juan Chiachío-Ruano, and María L. Jalón. Solving inverse problems
by approximate bayesian computation. In Bayesian Inverse Problems. 2021.
Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for
distributional reinforcement learning. In International conference on machine learning, pages
1096–1105. PMLR, 2018.
Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. An adaptive sequential monte carlo method
for approximate bayesian computation. Statistics and computing, 22:1009–1020, 2012.
Christopher Drovandi, David J Nott, and David T Frazier. Improving the accuracy of marginal
approximations in likelihood-free inference via localization. Journal of Computational and
Graphical Statistics, 33(1):101–111, 2024.
Paul Fearnhead and Dennis Prangle. Constructing abc summary statistics: semi-automatic abc.
Nature Precedings, pages 1–1, 2011.
Paul Fearnhead and Dennis Prangle. Constructing summary statistics for approximate bayesian
computation: semi-automatic approximate bayesian computation. Journal of the Royal Sta-
tistical Society Series B: Statistical Methodology, 74(3):419–474, 2012.
David T Frazier. Robust and efficient approximate bayesian computation: A minimum distance
approach. arXiv preprint arXiv:2006.14126, 2020.
Malay Ghosh. Exponential tail bounds for chisquared random variables. Journal of Statistical
Theory and Practice, 15(2):35, 2021.
Daniel T Gillespie. A general method for numerically simulating the stochastic time evolution
of coupled chemical reactions. Journal of computational physics, 22(4):403–434, 1976.
David Greenberg, Marcel Nonnenmacher, and Jakob Macke. Automatic posterior transformation
for likelihood-free inference. In International conference on machine learning, pages 2404–
2414. PMLR, 2019.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville.
Improved training of wasserstein gans. Advances in neural information processing systems,
30, 2017.
Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statis-
tics, 35(1):73–101, 1964.
Bai Jiang. Approximate bayesian computation with kullback-leibler divergence as data discrep-
ancy. In International conference on artificial intelligence and statistics, pages 1711–1721.
PMLR, 2018.
Bai Jiang, Tung-yu Wu, Charles Zheng, and Wing H Wong. Learning summary statistic for
approximate bayesian computation via deep neural network. Statistica Sinica, pages 1595–
1618, 2017.
Olav Kallenberg. Foundations of modern probability, volume 2. Springer, 1997.
Jungeum Kim, Percy S Zhai, and Veronika Ročková. Deep generative quantile bayes. arXiv
preprint arXiv:2410.08378, 2024.
Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: journal of the
Econometric Society, pages 33–50, 1978.
Sirio Legramanti, Daniele Durante, and Pierre Alquier. Concentration of discrepancy-based abc
via rademacher complexity. arXiv preprint arXiv:2206.06991, 2022.
Wenhui Sophia Lu, Chenyang Zhong, and Wing Hung Wong. Efficient generative modeling via
penalized optimal transport network. arXiv preprint arXiv:2402.10456v2, 2025.
Tudor Manole, Sivaraman Balakrishnan, and Larry Wasserman. Minimax confidence intervals
for the sliced wasserstein distance. Electronic Journal of Statistics, 16(1):2252–2345, 2022.
Jean-Michel Marin, Pierre Pudlo, Christian P Robert, and Robin J Ryder. Approximate bayesian
computational methods. Statistics and computing, 22(6):1167–1180, 2012.
Henry Markram, Eilif Muller, Srikanth Ramaswamy, Michael W Reimann, Marwan Abdellah,
Carlos Aguado Sanchez, Anastasia Ailamaki, Lidia Alonso-Nanclares, Nicolas Antille, Selim
Arsever, et al. Reconstruction and simulation of neocortical microcircuitry. Cell, 163(2):
456–492, 2015.
Kimia Nadjahi, Alain Durmus, Umut Simsekli, and Roland Badeau. Asymptotic guarantees for
learning generative models with the sliced-wasserstein distance. Advances in Neural Informa-
tion Processing Systems, 32, 2019.
Oscar Hernan Madrid Padilla, Wesley Tansey, and Yanzhen Chen. Quantile regression with relu
networks: Estimators and minimax rates. Journal of Machine Learning Research, 23(247):
1–42, 2022.
George Papamakarios and Iain Murray. Fast ε-free inference of simulation models with bayesian
conditional density estimation. Advances in neural information processing systems, 29, 2016.
George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: Fast
likelihood-free inference with autoregressive flows. In The 22nd international conference on
artificial intelligence and statistics, pages 837–848. PMLR, 2019.
Nicholas G Polson and Vadim Sokolov. Generative ai for bayesian computation. arXiv preprint
arXiv:2305.14972, 2023.
Dennis Prangle. Adapting the abc distance function. Bayesian Analysis, 2017.
Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its
application to texture mixing. In Scale Space and Variational Methods in Computer Vision:
Third International Conference, SSVM 2011, Ein-Gedi, Israel, May 29–June 2, 2011, Revised
Selected Papers 3, pages 435–446. Springer, 2012.
Alexander Y Shestopaloff and Radford M Neal. On bayesian inference for the m/g/1 queue with
efficient mcmc sampling. arXiv preprint arXiv:1401.5548, 2014.
Michel Talagrand. The transportation cost from the uniform measure to the empirical measure
in dimension ≥ 3. The Annals of Probability, pages 919–959, 1994.
Alvaro Tejero-Cantero, Jan Boelts, Michael Deistler, Jan-Matthis Lueckmann, Conor Durkan,
Pedro J. Gonçalves, David S. Greenberg, and Jakob H. Macke. sbi: A toolkit for simulation-
based inference. Journal of Open Source Software, 5(52):2505, 2020. doi: 10.21105/joss.02505.
URL https://doi.org/10.21105/joss.02505.
Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.
Yuexi Wang and Veronika Ročková. Adversarial bayesian simulation. arXiv preprint
arXiv:2208.12113, 2022.
Simon N Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature,
466(7310):1102–1104, 2010.
Yang Zeng, Hu Wang, Shuai Zhang, Yong Cai, and Enying Li. A novel adaptive approximate
bayesian computation method for inverse heat conduction problem. International Journal of
Heat and Mass Transfer, 134:185–197, 2019.
Xingyu Zhou, Yuling Jiao, Jin Liu, and Jian Huang. A deep generative approach to conditional
sampling. Journal of the American Statistical Association, 118(543):1837–1848, 2023.
Appendix
The appendix contains detailed proofs of the theoretical results, supplementary results and
descriptions of the simulation setup, as well as further remarks.
$$\mathrm{MSW}_p(\mu, \nu) = \lambda \underbrace{\frac{1}{d}\sum_{j=1}^{d} W_p\big((e_j)_\#\mu, (e_j)_\#\nu\big)}_{\text{marginal augmentation}} + (1 - \lambda)\underbrace{\Big[\mathbb{E}_{\varphi \sim \sigma}\, W_p^p(\varphi_\#\mu, \varphi_\#\nu)\Big]^{1/p}}_{\text{Sliced Wasserstein distance}}, \tag{A.1}$$
where λ ∈ (0, 1) is a mixing parameter, σ(·) denotes the uniform probability measure on the
unit sphere Sd−1 , and (ej )# represents the pushforward of a measure under projection onto the
j-th coordinate axis.
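To illustrate how Eq. (A.1) can be estimated from samples, the following Python sketch forms a Monte Carlo approximation of MSW_p between two empirical measures, using the quantile representation of the one-dimensional Wasserstein distance. This is a minimal sketch rather than the implementation used in our experiments; the number of random projections, the quantile grid, and the default value of λ are illustrative choices.

import numpy as np

def w_p_1d(u, v, p=1, n_grid=200):
    # One-dimensional p-Wasserstein distance between samples u and v via quantiles.
    taus = (np.arange(n_grid) + 0.5) / n_grid
    return (np.mean(np.abs(np.quantile(u, taus) - np.quantile(v, taus)) ** p)) ** (1.0 / p)

def msw_p(sample_mu, sample_nu, p=1, lam=0.5, n_proj=100, seed=0):
    # Monte Carlo estimate of MSW_p (Eq. A.1) between two samples in R^d.
    rng = np.random.default_rng(seed)
    d = sample_mu.shape[1]
    # Marginal augmentation: average coordinate-wise 1-D Wasserstein distance.
    marginal = np.mean([w_p_1d(sample_mu[:, j], sample_nu[:, j], p) for j in range(d)])
    # Sliced term: average W_p^p over random directions, then take the 1/p power.
    phis = rng.standard_normal((n_proj, d))
    phis /= np.linalg.norm(phis, axis=1, keepdims=True)
    sliced = np.mean([w_p_1d(sample_mu @ phi, sample_nu @ phi, p) ** p for phi in phis]) ** (1.0 / p)
    return lam * marginal + (1.0 - lam) * sliced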
$$\theta \sim \mathrm{Unif}[-1, 1]^d, \qquad X \mid \theta \sim \mathcal{N}(H\theta, \sigma^2 I_n), \qquad \theta \in \mathbb{R}^d,\ X \in \mathbb{R}^n,$$
where H ∈ Rn×d is a matrix that maps from the low-dimensional parameter space to the high-
dimensional observation space.
Lemma B.1 (Curse of dimensionality). The expected number of samples needed to produce a
draw within the ϵ-ball centered at $x^*$ grows as $\Omega(\exp(n))$ for $0 < \epsilon < \tfrac{1}{2}\big(n + \|H\theta - x^*\|_2^2/\sigma^2\big)^{1/2}$.
which scales linearly in $n$ as the dimension $n \to \infty$. Let $Y \sim \chi_n^2(\zeta)$; then for $0 < c < n + \zeta = n + \Delta^2/\sigma^2$, by Theorem 4 of Ghosh (2021) we have
$$\mathbb{P}(Y < n + \zeta - c) \le \exp\left(\frac{n}{2}\left[\frac{c}{n + 2\zeta} + \log\left(1 - \frac{c}{n + 2\zeta}\right)\right]\right) \le \exp\left(-\frac{n c^2}{4(n + 2\zeta)^2}\right).$$
For $0 < \epsilon < \tfrac{1}{2}\sqrt{n + \zeta}$, where $\zeta = \|H\theta - x^*\|_2^2/\sigma^2$, we set $c := n + \zeta - \epsilon^2 > 0$. Applying the inequality above yields
$$\mathbb{P}\big(\sigma^{-2}\|X - x^*\|_2^2 < \epsilon^2 \mid \theta\big) \le \exp\left(-\frac{n(n + \zeta - \epsilon^2)^2}{4(n + 2\zeta)^2}\right) = O\big(\exp(-n)\big).$$
Therefore, the expected number of samples needed to produce a draw within the ϵ-ball centered
at x∗ is Ω(exp(n)); specifically, the number of required samples grows exponentially with the
dimension n.
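The exponential blow-up is easy to observe numerically. In the sketch below (illustrative only; the matrix H, the noise level, the prior range, and the tolerance are assumptions), the prior-predictive acceptance probability of the ϵ-ball event is estimated by Monte Carlo and collapses rapidly as the data dimension n grows.

import numpy as np

rng = np.random.default_rng(0)
sigma, d, n_sim = 1.0, 2, 200_000
for n in (2, 5, 10, 20):
    H = rng.standard_normal((n, d)) / np.sqrt(n)          # assumed forward map
    theta_star = rng.uniform(-1.0, 1.0, size=d)
    x_star = H @ theta_star + sigma * rng.standard_normal(n)
    eps = 0.5 * np.sqrt(n)                                 # within the range allowed by Lemma B.1
    thetas = rng.uniform(-1.0, 1.0, size=(n_sim, d))       # draws from the uniform prior
    X = thetas @ H.T + sigma * rng.standard_normal((n_sim, n))
    accept = np.mean(np.sum((X - x_star) ** 2, axis=1) < (sigma * eps) ** 2)
    print(f"n = {n:2d}: acceptance rate ~ {accept:.2e}")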
$$\mathbb{P}(\|X\| \le \epsilon) \le C\epsilon^n.$$
Proof. We begin by bounding the probability using the volume of an $n$-dimensional ball:
$$\mathbb{P}(\|X\| \le \epsilon) = \int_{\|x\| \le \epsilon} f(x)\, dx \le \int_{\|x\| \le \epsilon} K\, dx = K\,\mathrm{Vol}(B_\epsilon^n) = \frac{K\pi^{n/2}\epsilon^n}{\Gamma(n/2 + 1)},$$
where $B_r^n$ denotes a ball of radius $r$ in $\mathbb{R}^n$. To analyze the behavior of this bound as $n$ increases,
we apply Stirling's approximation:
$$\frac{\pi^{n/2}}{\Gamma(n/2 + 1)} \sim \frac{\pi^{n/2}}{\sqrt{\pi n}\,\big(\tfrac{n}{2e}\big)^{n/2}} = \frac{1}{\sqrt{\pi n}}\left(\frac{2\pi e}{n}\right)^{n/2} \xrightarrow{\; n \to \infty \;} 0,$$
so that
$$C := \sup_{n \ge 1} \frac{K\pi^{n/2}}{\Gamma(n/2 + 1)} < \infty.$$
We then retain those $\theta^{(i)}$ whose associated $X^{(i)}$ falls within $B_{\epsilon_t}(x^*)$:
$$S_{\theta,*}^{(t)} = \Big\{\theta^{(i)} : (\theta^{(i)}, X^{(i)}) \in S_0^{(t)} \ \text{and}\ X^{(i)} \in B_{\epsilon_t}(x^*)\Big\}.$$
We generated a single observed value $x^* = 6.24$ from the model; by conjugacy the exact
posterior is $\pi(\theta \mid x^*) = \mathcal{N}(5.94, 0.95)$. At each iteration $t$, we drew $N = 10^4$ synthetic pairs
$(\theta^{(i)}, X^{(i)})$ from the current Gaussian proposal and retained those $\theta^{(i)}$ for which $X^{(i)} \in B_{\epsilon_t}(x^*)$.
We used a decreasing tolerance schedule $\epsilon_t$.
The retained θ-values were used to fit the next Gaussian proposal by moment matching. We
repeated this procedure for T = 9 iterations. Let the empirical retention rate be defined as
$$\frac{1}{N}\,\Big|\big\{\, i : |X^{(i)} - x^*| \le 0.001 \,\big\}\Big|.$$
Figure 10 displays the smoothed posterior densities across iterations, together with the corre-
sponding retention rates. From our simulations, we make the following observations:
(1). The sequential rejection sampling refinement is self-consistent: repeated application does
not cause the sampler to drift away from the true posterior.
(2). In this univariate example, the sampler converges rapidly, requiring just three iterations.
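The refinement loop behind this illustration can be summarized in a few lines of Python. In the sketch below, the likelihood X | θ ∼ N(θ, 1), the initial Gaussian proposal, and the geometric tolerance schedule are assumptions made for illustration; they are not the exact settings used to produce Figure 10.

import numpy as np

rng = np.random.default_rng(1)
x_star, N, T = 6.24, 10_000, 9
mu, sd = 0.0, 5.0                          # initial Gaussian proposal (assumed)
eps = 2.0 * 0.5 ** np.arange(T)            # decreasing tolerance schedule (assumed)

for t in range(T):
    theta = rng.normal(mu, sd, size=N)             # draw parameters from the current proposal
    x = rng.normal(theta, 1.0)                     # simulate one synthetic observation per draw
    kept = theta[np.abs(x - x_star) <= eps[t]]     # retain draws whose data fall in the ball
    if kept.size < 10:                             # guard against refitting from too few draws
        continue
    mu, sd = kept.mean(), kept.std()               # moment-matched Gaussian for the next round
    print(f"iter {t}: eps = {eps[t]:.3f}, retention = {kept.size / N:.3f}, "
          f"proposal = N({mu:.2f}, {sd:.2f}^2)")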
Figure 10: Adaptive inference on the univariate Gaussian model. Left: Smoothed proposal and
posterior densities over iterations; prior shown in gray, true posterior in black. Right: Empirical
retention rate per iteration. The red line indicates the rate obtained using the true posterior,
while the error bands represent ±1 standard deviation of the retention rate, calculated from 100
independent repetitions.
Let $P_\theta^{(n)}$ denote the conditional probability measure on $(\mathcal{X}^n, \mathcal{A})$ given $\Theta = \theta$.
We provide the definition of Bayes sufficiency presented in Godambe (1968) below.
Definition D.1 (Bayes Sufficiency). Let (Ω, B) be a measurable parameter space, (X , A) be a
sample space, and Cπ be a countable class of prior distributions on (Ω, B). A statistic T : X → T
is said to be Bayes sufficient with respect to Cπ if, for any π ∈ Cπ , the posterior density π(θ | X)
depends on X only through T (X) for almost every x, i.e., there exists a function g : Ω×T → R+
such that
π(θ | x) = g(θ, T (x)) for π-almost every θ ∈ Ω and Pπ -almost every x ∈ X , (D.1)
where Pπ is the marginal distribution of X under the prior π for all π ∈ Cπ . Equivalently, for
any π ∈ Cπ and B ∈ B,
Proof. We will show that T is Bayes sufficient and minimally so in two parts.
Thus, for $P_\pi$-almost every $x$ and for every measurable set $B$ we obtain
$$\mathbb{E}\big[\mathbf{1}_B(\Theta) \mid X = x\big] = \int_B \pi(d\theta \mid x) = \int_B \pi(d\theta \mid T(x)) = \mathbb{E}\big[\mathbf{1}_B(\Theta) \mid T(X) = T(x)\big].$$
Since this equality holds for every measurable set B, we conclude that the conditional distri-
butions π (dθ | X = x) and π (dθ | T (x)) are equal for Pπ -almost every x. By the definition of
Bayes sufficiency, this shows that T is Bayes sufficient.
for all π ∈ Cπ . But T (x) = π(dθ | x), hence T (x) = T (x′ ) almost surely whenever T ′ (x) = T ′ (x′ ).
Thus T is minimal.
$$X = (X_1, \dots, X_n) \in \mathbb{R}^{n d_X},$$
$$A_t = \big\{x : \widehat{\mathrm{MSW}}_{p,\delta}(x, x^*) \le \epsilon_t\big\},$$
$$E_t = \big\{\omega : \widehat{\mathrm{MSW}}_{p,\delta}(X(\omega), x^*) \le \epsilon_t\big\} = \{\omega : X(\omega) \in A_t\}.$$
Theorem E.1 (Error Rate of ARS). Suppose Assumption E.1 holds. Then, the total variation
distance between the marginal distributions over $\theta$, corresponding to the exact rejection sampling
procedure in Algorithm 2 and the ARS procedure in Algorithm 4, converges to zero at a rate of
$O\big(\exp(-R\epsilon_t^\gamma)\big)$ as $R \to \infty$.
Proof. We begin by establishing key notation for our analysis. Let $\pi^{(t)} = \pi(\theta \mid E_t)$ denote
the marginal target distribution of the exact rejection sampling procedure. Define $\pi_{\mathrm{ARS}}^{(t)}$ as the
marginal distribution of $\theta$ under the approximate sampling procedure with $R$ attempts. We
denote the corresponding measures by $\pi^{(t)}$ and $\pi_{\mathrm{ARS}}^{(t)}$, respectively.
For $r = 1, 2, \dots, R$, let $E_{t,r} = \{\widehat{\mathrm{MSW}}_{p,\delta}(X_r, x^*) \le \epsilon_t\}$ be the event that the $r$th replicate
$X_r$ satisfies the threshold condition. Define $E_t = E_{t,1}$. A particular $\theta$ is retained if at least one
of the corresponding data replicates passes the threshold,
$$\bigcup_{r=1}^{R} E_{t,r} = \Big\{\omega : \{X_1(\omega) \in A_t\} \cup \cdots \cup \{X_R(\omega) \in A_t\}\Big\}.$$
Accordingly,
$$\tilde{\pi}_{\mathrm{ARS}}^{(t)}(\theta) \propto \tilde{\pi}^{(t)}(\theta)\int \mathbf{1}\Big(\bigcup_{r=1}^{R} E_{t,r}\Big)\prod_{r=1}^{R} P_\theta^{(n)}(dX_r) = \tilde{\pi}^{(t)}(\theta)\int \Big(1 - \prod_{r=1}^{R} \mathbf{1}_{E_{t,r}^c}\Big)\prod_{r=1}^{R} P_\theta^{(n)}(dX_r) = \tilde{\pi}^{(t)}(\theta)\Big(1 - P_\theta^{(n)}(A_t^c)^R\Big),$$
where the last equality follows from the i.i.d. nature of X1 , . . . , XR conditional on θ. Thus,
under the ARS procedure (Algorithm 4), the marginal distribution of Θ is proportional to:
$$\tilde{\pi}_{\mathrm{ARS}}^{(t)}(\theta) \propto \tilde{\pi}^{(t)}(\theta)\Big(1 - P_\theta^{(n)}(A_t^c)^R\Big). \tag{E.1}$$
Let $Z_t = \mathbb{E}_{\tilde{\pi}^{(t)}}\big[1 - P_\theta^{(n)}(A_t^c)^R\big] = \int_\Omega \tilde{\pi}^{(t)}(\theta)\big(1 - P_\theta^{(n)}(A_t^c)^R\big)\, d\theta$ be the normalizing constant
for Eq. (E.1). We can then bound the total variation distance:
\begin{align*}
D_{\mathrm{TV}}\big(\pi^{(t)}, \tilde{\pi}_{\mathrm{ARS}}^{(t)}\big)
&= \frac{1}{2}\int_\Omega \big|\pi^{(t)}(\theta) - \tilde{\pi}_{\mathrm{ARS}}^{(t)}(\theta)\big|\, d\theta \\
&= \frac{1}{2}\int_\Omega \Big|\pi^{(t)}(\theta) - Z_t^{-1}\tilde{\pi}^{(t)}(\theta)\big(1 - P_\theta^{(n)}(A_t^c)^R\big)\Big|\, d\theta \\
&= \frac{1}{2Z_t}\int_\Omega \Big|Z_t\,\tilde{\pi}^{(t)}(\theta) - \tilde{\pi}^{(t)}(\theta)\big(1 - P_\theta^{(n)}(A_t^c)^R\big)\Big|\, d\theta \tag{E.2}\\
&= \frac{1}{2Z_t}\int_\Omega \tilde{\pi}^{(t)}(\theta)\,\Big|Z_t - 1 + P_\theta^{(n)}(A_t^c)^R\Big|\, d\theta \\
&\le \frac{1}{2Z_t}\left(\int_\Omega \tilde{\pi}^{(t)}(\theta)\Big(Z_t - 1 + P_\theta^{(n)}(A_t^c)^R\Big)^2\, d\theta\right)^{1/2}\left(\int_\Omega \tilde{\pi}^{(t)}(\theta)\, d\theta\right)^{1/2} \tag{E.3}\\
&= \frac{1}{2Z_t}\left(\int_\Omega \tilde{\pi}^{(t)}(\theta)\Big(\mathbb{E}_{\tilde{\pi}^{(t)}}\big[P_\theta^{(n)}(A_t^c)^R\big] - P_\theta^{(n)}(A_t^c)^R\Big)^2\, d\theta\right)^{1/2} \\
&= \frac{1}{2Z_t}\,\mathrm{Var}_{\tilde{\pi}^{(t)}}\Big(P_\theta^{(n)}(A_t^c)^R\Big)^{1/2} \\
&\le \frac{1}{2Z_t}\,\mathbb{E}_{\tilde{\pi}^{(t)}}\Big[P_\theta^{(n)}(A_t^c)^{2R}\Big]^{1/2}, \tag{E.4}
\end{align*}
where Eq. (E.2) uses the fact that $\pi^{(t)}(\theta) \overset{d}{=} \tilde{\pi}^{(t)}(\theta)$, Eq. (E.3) follows from Hölder's inequality,
and we also used the fact that $\int_\Omega \tilde{\pi}^{(t)}(\theta)\, d\theta = 1$.
By Assumption E.1, there exists a constant $c$ such that $\inf_{\theta \in \mathrm{supp}(\pi^{(t)})} P_\theta^{(n)}(A_t) \ge c\epsilon_t^\gamma$. Consequently,
$$\mathbb{E}_{\tilde{\pi}^{(t)}}\Big[P_\theta^{(n)}(A_t^c)^{2R}\Big]^{1/2} \le \Big[(1 - c\epsilon_t^\gamma)_+^{2R}\Big]^{1/2} \le \exp(-Rc\epsilon_t^\gamma),$$
where we use the inequality $(1 - u)_+^R \le \exp(-Ru)$ for $u \in \mathbb{R}$ and integer $R$. Similarly, for $Z_t$,
using $\sup_{\theta \in \mathrm{supp}(\pi^{(t)})} P_\theta^{(n)}(A_t^c) \le 1 - c\epsilon_t^\gamma$:
$$Z_t = \mathbb{E}_{\tilde{\pi}^{(t)}}\Big[1 - P_\theta^{(n)}(A_t^c)^R\Big] \ge 1 - (1 - c\epsilon_t^\gamma)_+^R \ge 1 - \exp(-Rc\epsilon_t^\gamma).$$
Therefore,
$$\frac{1}{2Z_t}\,\mathbb{E}_{\tilde{\pi}^{(t)}}\Big[P_\theta^{(n)}(A_t^c)^{2R}\Big]^{1/2} \le \frac{1}{2}\cdot\frac{\exp(-Rc\epsilon_t^\gamma)}{1 - \exp(-Rc\epsilon_t^\gamma)} \lesssim \exp(-Rc\epsilon_t^\gamma),$$
which establishes the upper bound and the desired convergence rate.
Theorem E.2 (Sample Complexity for ARS). Suppose Assumption E.1 holds. For any $\bar\delta \in (0, 1)$
and $\epsilon_t > 0$, if the number of samples $R$ in the ARS algorithm satisfies $R = O\big(\log(1/\bar\delta)/\epsilon_t^\gamma\big)$, then the
total variation distance between the exact and approximate posterior distributions is bounded by
$\bar\delta$, i.e., $D_{\mathrm{TV}}\big(\pi^{(t)}, \tilde{\pi}_{\mathrm{ARS}}^{(t)}\big) \le \bar\delta$.
Corollary E.2.1 (1-Wasserstein Error Incurred by ARS). Under the conditions of Theorem E.1,
if the parameter space $\Omega$ has a finite diameter $d_\Omega = \mathrm{diam}(\Omega)$, then the 1-Wasserstein distance
between the exact and approximate posterior distributions is bounded by
$$W_1\big(\pi^{(t)}, \tilde{\pi}_{\mathrm{ARS}}^{(t)}\big) \lesssim d_\Omega \exp(-Rc\epsilon_t^\gamma). \tag{E.5}$$
Therefore, to achieve $W_1\big(\pi^{(t)}, \tilde{\pi}_{\mathrm{ARS}}^{(t)}\big) \le \bar\delta$, we need $R = O\big(\log(d_\Omega/\bar\delta)/\epsilon_t^\gamma\big)$.
Nonnegativity, symmetry, and triangle inequality for the MSW distance follow directly from
the corresponding properties of Wp and SWp distances, coupled with the additivity property of
metrics. Note that if $\mu = \nu$, then
$$SW_p(\mu, \nu) = 0, \qquad W_p\big((e_j)_\#\mu, (e_j)_\#\nu\big) = 0, \quad j = 1, \dots, d.$$
For the converse direction, suppose $\mathrm{MSW}_p(\mu, \nu) = 0$. As $0 < \lambda < 1$, both terms in the MSW
distance must vanish:
$$\sum_{j=1}^{d} W_p\big((e_j)_\#\mu, (e_j)_\#\nu\big) = 0, \qquad \int_{S^{d-1}} W_p^p(\varphi_\#\mu, \varphi_\#\nu)\, d\sigma(\varphi) = 0,$$
which implies that the marginal distributions of µ and ν are identical for all coordinates, and
Wp (φ# µ, φ# ν) = 0 for σ-almost all φ ∈ Sd−1 . By the Cramér-Wold theorem, this is sufficient
to conclude that µ = ν.
Lemma E.3. The 1-Sliced Wasserstein distance is an Integral Probability Metric on P1 (Rd ).
Proof. Recall that $SW_1(\mu, \nu) = \int_{S^{d-1}} W_1(\varphi_\#\mu, \varphi_\#\nu)\, d\sigma(\varphi)$, where $S^{d-1}$ is the unit sphere in $\mathbb{R}^d$ and $\sigma$ is the uniform measure on $S^{d-1}$. Define the critic function class
$$\mathcal{F} := \left\{\, f(x) = \int_{S^{d-1}} g_\varphi(\varphi^\top x)\, d\sigma(\varphi) \;\middle|\; g_\varphi \in \mathrm{Lip}_1(\mathbb{R}),\ \sup_{\varphi \in S^{d-1}} |g_\varphi(0)| < \infty \right\},$$
where for each $\varphi \in S^{d-1}$, $g_\varphi : \mathbb{R} \to \mathbb{R}$ is a 1-Lipschitz function, such that the mapping $(\varphi, t) \mapsto g_\varphi(t)$ is jointly measurable with respect to the product of the Borel $\sigma$-algebras on $S^{d-1}$ and $\mathbb{R}$. Note that $\mathcal{F}$ is nonempty, as it includes constant functions: if we take $f_c : x \mapsto \int_{S^{d-1}} k\, d\sigma(\varphi) = k < \infty$, then $f_c \in \mathcal{F}$.
First, we show that $\sup_{f \in \mathcal{F}} \big|\int f\, d(\mu - \nu)\big| \le SW_1(\mu, \nu)$ holds. Fix $f \in \mathcal{F}$. By definition,
$$|g_\varphi(\varphi^\top x)| \le |g_\varphi(0)| + |\varphi^\top x| \le \sup_{\varphi \in S^{d-1}} |g_\varphi(0)| + \|x\|.$$
As $\sup_{\varphi \in S^{d-1}} |g_\varphi(0)| < \infty$ and $\int \|x\|\, d\mu < \infty$, we have $\iint_{S^{d-1}} |g_\varphi(\varphi^\top x)|\, d\sigma(\varphi)\, d\mu < \infty$; similarly, $\iint_{S^{d-1}} |g_\varphi(\varphi^\top x)|\, d\sigma(\varphi)\, d\nu < \infty$. By Fubini–Tonelli,
$$\int f\, d(\mu - \nu) = \int_{S^{d-1}} \left(\int g_\varphi(\varphi^\top x)\, d(\mu - \nu)(x)\right) d\sigma(\varphi).$$
Thus
$$\left|\int f\, d(\mu - \nu)\right| \le \int_{S^{d-1}} \left|\int g_\varphi(\varphi^\top x)\, d(\mu - \nu)(x)\right| d\sigma(\varphi) \le \int_{S^{d-1}} W_1(\varphi_\#\mu, \varphi_\#\nu)\, d\sigma(\varphi) = SW_1(\mu, \nu),$$
where we appeal to the dual representation of W1 for each φ. Taking the supremum over F
establishes the claimed inequality.
We now show that the reverse inequality $SW_1(\mu, \nu) \le \sup_{f \in \mathcal{F}} \big|\int f\, d(\mu - \nu)\big|$ holds. Fix $\varepsilon > 0$.
For each $\varphi \in S^{d-1}$, choose a 1-Lipschitz function $g_\varphi^\varepsilon$ with $g_\varphi^\varepsilon(0) = 0$ and
$$\int g_\varphi^\varepsilon(\varphi^\top x)\, d(\mu - \nu)(x) \ge W_1(\varphi_\#\mu, \varphi_\#\nu) - \varepsilon,$$
and set $f_\varepsilon(x) := \int_{S^{d-1}} g_\varphi^\varepsilon(\varphi^\top x)\, d\sigma(\varphi)$. Note that $f_\varepsilon \in \mathcal{F}$ because each slice is 1-Lipschitz and $\sup_{\varphi \in S^{d-1}} |g_\varphi^\varepsilon(0)| < \infty$ by construction.
Then, exactly as before,
$$\int f_\varepsilon\, d(\mu - \nu) \ge \int_{S^{d-1}} \left(\int g_\varphi^\varepsilon(\varphi^\top x)\, d(\mu - \nu)(x)\right) d\sigma(\varphi) \ge \int_{S^{d-1}} \big(W_1(\varphi_\#\mu, \varphi_\#\nu) - \varepsilon\big)\, d\sigma(\varphi) = SW_1(\mu, \nu) - \varepsilon.$$
Taking the supremum over F and letting ε ↓ 0 yields the reverse inequality.
Combining both steps gives
$$SW_1(\mu, \nu) = \sup_{f \in \mathcal{F}} \left|\int f\, d(\mu - \nu)\right|,$$
which shows that $SW_1$ is an integral probability metric with critic class $\mathcal{F}$.
Lemma E.4 (Closure of IPMs under finite linear combinations). Let D1 , D2 , . . . , DK be IPMs
on the same space of probability measures, and let λ1 , λ2 , . . . , λK ≥ 0 be nonnegative constants.
Define
$$D(\mu, \nu) := \sum_{k=1}^{K} \lambda_k D_k(\mu, \nu). \tag{E.6}$$
Then $D$ is also an IPM.
Proof. Let D1 , . . . , DK be as given. For each k = 1, . . . , K, let Fk be the critic function class
corresponding to Dk , i.e.,
$$D_k(\mu, \nu) = \sup_{f \in \mathcal{F}_k} \left|\int f\, d(\mu - \nu)\right|.$$
Let Fk† := Fk ∪ (−Fk ); note that Dk is unchanged when Fk is replaced by Fk† . For any
λ1 , λ2 , · · · , λK ≥ 0, define the new critic function class
$$\mathcal{F} := \left\{\, f = \sum_{k=1}^{K} \lambda_k f_k \;:\; f_k \in \mathcal{F}_k^\dagger,\ k = 1, \dots, K \right\}.$$
For any such $f = \sum_{k=1}^{K} \lambda_k f_k \in \mathcal{F}$,
$$\left|\int \sum_{k=1}^{K} \lambda_k f_k\, d(\mu - \nu)\right| \le \sum_{k=1}^{K} \lambda_k \left|\int f_k\, d(\mu - \nu)\right| \le \sum_{k=1}^{K} \lambda_k D_k(\mu, \nu).$$
Now we show that $\sup_{f \in \mathcal{F}} \big|\int f\, d(\mu - \nu)\big| \ge D(\mu, \nu)$ holds. Fix $\varepsilon > 0$. For each $k$ choose $f_k^\varepsilon \in \mathcal{F}_k^\dagger$
such that
$$\int f_k^\varepsilon\, d(\mu - \nu) \ge D_k(\mu, \nu) - \frac{\varepsilon}{K(\lambda_k \vee 1)}.$$
Define $f_\varepsilon := \sum_{k=1}^{K} \lambda_k f_k^\varepsilon \in \mathcal{F}$. By linearity,
$$\int f_\varepsilon\, d(\mu - \nu) = \sum_{k=1}^{K} \lambda_k \int f_k^\varepsilon\, d(\mu - \nu) \ge \sum_{k=1}^{K} \lambda_k\left(D_k(\mu, \nu) - \frac{\varepsilon}{K(\lambda_k \vee 1)}\right) \ge D(\mu, \nu) - \varepsilon.$$
Since the bound holds for arbitrary $\varepsilon > 0$, letting $\varepsilon \downarrow 0$ yields
$$\sup_{f \in \mathcal{F}} \left|\int f\, d(\mu - \nu)\right| \ge D(\mu, \nu). \tag{E.8}$$
Proof. For each ej , j = 1, . . . , d, W1 ((ej )# µ, (ej )# ν) is an IPM. By Lemmas E.3 and E.4, we
readily obtain the claimed result.
To prove Theorem 3.2, we need to first establish some useful lemmas and propositions.
Lemma E.5. For any p ≥ 1 and any µ, ν ∈ Pp (Rd ), we have MSW1 (µ, ν) ≤ MSWp (µ, ν).
Proof. For the Sliced Wasserstein distance, fixing any projection $\varphi \in S^{d-1}$, the one-dimensional Wasserstein distance satisfies $W_1(\varphi_\#\mu, \varphi_\#\nu) \le W_p(\varphi_\#\mu, \varphi_\#\nu)$. Hence by Jensen's inequality,
$$\int W_1(\varphi_\#\mu, \varphi_\#\nu)\, d\sigma(\varphi) \le \left(\int W_1(\varphi_\#\mu, \varphi_\#\nu)^p\, d\sigma(\varphi)\right)^{1/p} \le \left(\int W_p(\varphi_\#\mu, \varphi_\#\nu)^p\, d\sigma(\varphi)\right)^{1/p}.$$
Thus SW1 (µ, ν) ≤ SWp (µ, ν). Since MSW is a convex combination of these components, the
desired inequality holds for the MSW distance.
where
$$C_{d,p,\lambda} = \lambda + (1 - \lambda)\, c_{d,p}^{1/p}, \qquad c_{d,p} = \frac{1}{d}\int_{S^{d-1}} \|\varphi\|_p^p\, d\sigma(\varphi) \le 1.$$
Proof. Let γ ∗ ∈ Γ(µ, ν) be an optimal transport plan for the minimization problem (1.1) with
Ω = Rd . Then for any j ∈ [d], (ej ⊗ ej )# γ ∗ is a transport plan between (ej )# µ and (ej )# ν. By
the definition of the Wasserstein distance as an infimum over all couplings, we have:
$$W_p\big((e_j)_\#\mu, (e_j)_\#\nu\big) \le \left(\int |\langle e_j, x - y\rangle|^p\, d\gamma^*(x, y)\right)^{1/p} \le \left(\int \|x - y\|^p\, d\gamma^*(x, y)\right)^{1/p} = W_p(\mu, \nu).$$
Consequently,
$$\frac{1}{d}\sum_{j=1}^{d} W_p\big((e_j)_\#\mu, (e_j)_\#\nu\big) \le W_p(\mu, \nu).$$
By the rotational invariance of $\sigma$, an analogous coupling argument gives $SW_p(\mu, \nu) \le c_{d,p}^{1/p}\, W_p(\mu, \nu)$, where
$$c_{d,p} = \frac{1}{d}\int_{S^{d-1}} \|\varphi\|_p^p\, d\sigma(\varphi) \le 1.$$
Proposition E.2. For all $\mu, \nu$ supported in $B_R(0)$, $R > 0$, there exists a constant $C_{d,\lambda} > 0$ such that
$$W_1(\mu, \nu) \le C_{d,\lambda}\, R^{d/(d+1)}\, \mathrm{MSW}_1(\mu, \nu)^{1/(d+1)}. \tag{E.9}$$
Proof. For brevity of notation, let $M_1(\mu, \nu) := \frac{1}{d}\sum_{j=1}^{d} W_1\big((e_j)_\#\mu, (e_j)_\#\nu\big)$. By the definition of
the MSW distance, $(1 - \lambda)\, SW_1(\mu, \nu) \le \mathrm{MSW}_1(\mu, \nu)$, and for measures supported in $B_R(0)$ we have
$W_1(\mu, \nu) \le \widetilde{C}_d\, R^{d/(d+1)}\, SW_1(\mu, \nu)^{1/(d+1)}$ (Bonnotte, 2013). Setting $C_{d,\lambda} := \widetilde{C}_d\,(1 - \lambda)^{-1/(d+1)}$, we obtain the desired result:
$$W_1(\mu, \nu) \le C_{d,\lambda}\, R^{d/(d+1)}\, \mathrm{MSW}_1(\mu, \nu)^{1/(d+1)}.$$
Proof. The first inequality follows from Proposition E.1. The second inequality follows from
Proposition E.2 and Lemma E.5, on noting that $W_p^p(\mu, \nu) \le (2R)^{p-1} W_1(\mu, \nu)$.
Proof. Let (µℓ )ℓ∈N+ , µ be probability measures in Pp (Rd ) such that limℓ→∞ MSWp (µℓ , µ) = 0.
Since λ ∈ (0, 1), this implies limℓ→∞ SWp (µℓ , µ) = 0. By Theorem 1 of Nadjahi et al. (2019),
we conclude that µℓ ⇒ µ, where we use ⇒ to denote weak convergence in Pp (Rd ).
Conversely, suppose µℓ ⇒ µ. Theorem 6.9 of Villani et al. (2009) gives Wp (µℓ , µ) → 0.
Because SWp (µℓ , µ) ≤ Wp (µℓ , µ), we also obtain SWp (µℓ , µ) → 0. For the marginal component,
by the Cramér-Wold theorem, a sequence of probability measures on Rd converges weakly if and
only if their one-dimensional projections converge weakly for all directions; therefore, for any
$j \in \{1, \cdots, d\}$,
$$(e_j)_\#\mu_\ell \xrightarrow{\; d \;} (e_j)_\#\mu.$$
Now fix j ∈ {1, . . . , d}. Recall that weak convergence in Pp (Rd ) requires that (Villani et al.,
2009, Definition 6.8)
$$\lim_{R \to \infty}\limsup_{\ell \to \infty} \int_{\mathbb{R}^d} \|z\|^p\, \mathbf{1}_{\{\|z\| \ge R\}}\, d\mu_\ell(z) = 0.$$
Hence (ej )# µℓ converges to (ej )# µ in Pp (R) for every j ∈ {1, · · · , d}. Appealing once again to
Theorem 6.9 of Villani et al. (2009) yields each term Wp ((ej )# µℓ , (ej )# µ) → 0. Hence,
$$\mathrm{MSW}_p(\mu_\ell, \mu) = \frac{\lambda}{d}\sum_{j=1}^{d} W_p\big((e_j)_\#\mu_\ell, (e_j)_\#\mu\big) + (1 - \lambda)\, SW_p(\mu_\ell, \mu) \to 0.$$
Therefore MSWp indeed metrizes weak convergence on Pp (Rd ) without any compactness as-
sumptions, as desired.
E.2.5 Proof of Theorem 3.4
By Markov's inequality, for any $t > 0$,
$$\mathbb{P}(|Z_\varphi| \ge t) \le \frac{\mathbb{E}[|Z_\varphi|^p]}{t^p} < \frac{M_{\mu,p}}{t^p}.$$
Setting $t_\delta = \big(M_{\mu,p}/\delta\big)^{1/p} < \infty$, we get
$$\mathbb{P}(|Z_\varphi| \ge t_\delta) < \frac{M_{\mu,p}}{t_\delta^p} = \delta.$$
Hence
$$F_{\mu_\varphi}^{-1}(\delta) \ge -t_\delta, \qquad F_{\mu_\varphi}^{-1}(1 - \delta) \le t_\delta.$$
Therefore,
$$\big|F_{\mu_\varphi}^{-1}(\tau)\big| \le t_\delta = \left(\frac{M_{\mu,p}}{\delta}\right)^{1/p} < \infty \quad \text{for all } \tau \in [\delta, 1 - \delta],\ \varphi \in S^{d-1}.$$
Lemma E.7 (Empirical Quantile Deviation). Let δ ∈ (0, 1/2), φ ∈ Sd−1 , and suppose that
Zφ,1 , . . . , Zφ,m are i.i.d. samples from φ# µ, so that the empirical measure is
m
1 X
φ# µ
bm = δZ ,
m i=1 φ,i
with empirical CDF Fbm,µφ and true CDF Fµφ . Then, for any t ≥ 0, we have
n o
−1
P Fbm,µ φ
(δ) − Fµ−1
φ
(δ) > t ≤ 2 exp −2m ψδ,t (µφ )2 ,
51
where ψδ,t (·) is as defined in Eq. (3.2).
Remark E.2. A similar result holds for ν.
Proof. For simplicity of notation, we omit the dependence on the fixed direction $\varphi$ and on $\mu$ throughout this proof. Let $F_m(z) := \widehat{F}_{m,\mu_\varphi}(z) = \frac{1}{m}\sum_{i=1}^{m}\mathbf{1}\{Z_{\varphi,i} \le z\}$ and $F(z) := F_{\mu_\varphi}(z)$ for every $z \in \mathbb{R}$.
By Hoeffding's inequality, for any $z \in \mathbb{R}$ and $t \ge 0$, we have
$$\mathbb{P}\big(F_m^{-1}(\delta) - F^{-1}(\delta) > t\big) = \mathbb{P}\big(F_m(F^{-1}(\delta) + t) < \delta\big) = \mathbb{P}\Big(F_m(F^{-1}(\delta) + t) - F(F^{-1}(\delta) + t) < \delta - F(F^{-1}(\delta) + t)\Big) \le \exp\Big(-2m\big[F(F^{-1}(\delta) + t) - \delta\big]^2\Big),$$
where we use the fact that $F(F^{-1}(\delta) + t) \ge F(F^{-1}(\delta)) \ge \delta$ in the last line. An analogous
argument applies for upper bounding $\mathbb{P}\big(F_m^{-1}(\delta) - F^{-1}(\delta) < -t\big)$. Applying the union bound to
the two one-sided deviations yields the claimed bound.
Proof of Theorem 3.4. Fix any δ ∈ (0, 1/2) and δ̄ ∈ (0, 1). We let
Analysis of Empirical Convergence Rate We decompose the total error into two compo-
nents and apply concentration techniques to control each term.
Error from the Marginal Component. For each coordinate $j = 1, \dots, d$, the projected
measures $\mu_j$ and $\nu_j$ are one-dimensional. Let $\bar{q}_{\mu,\delta}(e_j) = |q_{\mu,\delta}(e_j)| \vee |q_{\mu,1-\delta}(e_j)|$. Note that
\begin{align*}
W_{p,\delta}^p\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\mu\big)
&= \frac{1}{1 - 2\delta}\int_\delta^{1-\delta} \big|\widehat{F}_{m,\mu_j}^{-1}(\tau) - F_{\mu_j}^{-1}(\tau)\big|^p\, d\tau \\
&\le \frac{1}{1 - 2\delta}\int_0^{2\bar{q}_{\mu,\delta}(e_j)} p\, t^{p-1} \int_\delta^{1-\delta} \mathbf{1}\Big\{\big|\widehat{F}_{m,\mu_j}^{-1}(\tau) - F_{\mu_j}^{-1}(\tau)\big| \ge t\Big\}\, d\tau\, dt \\
&\le \frac{1}{1 - 2\delta}\sup_{t \in [0,\, 2\bar{q}_{\mu,\delta}(e_j)]}\{p\, t^{p-1}\}\int_0^{2\bar{q}_{\mu,\delta}(e_j)}\int_\delta^{1-\delta} \mathbf{1}\Big\{\big|\widehat{F}_{m,\mu_j}^{-1}(\tau) - F_{\mu_j}^{-1}(\tau)\big| \ge t\Big\}\, d\tau\, dt \\
&\le \frac{p}{1 - 2\delta}\big(2\bar{q}_{\mu,\delta}(e_j)\big)^{p-1}\int_\delta^{1-\delta} \big|\widehat{F}_{m,\mu_j}^{-1}(\tau) - F_{\mu_j}^{-1}(\tau)\big|\, d\tau.
\end{align*}
By Lemma E.6, Eqs. (E.10)–(E.11), and the union bound, the following inequalities hold with
probability at least $1 - \frac{\delta}{8}$ for all $j \in \{1, \dots, d\}$:
$$\bar{q}_{\mu,\delta}(e_j) \le \left(\frac{M_{\mu,p}}{\delta}\right)^{1/p} + \varepsilon_{m,d,\delta,\delta}(\mu_j) \vee \varepsilon_{m,d,1-\delta,\delta}(\mu_j) = \frac{1}{2}R_{\mu_j,\delta},$$
$$q_{\mu,1-\delta}(e_j) - q_{\mu,\delta}(e_j) \le 2\left(\left(\frac{M_{\mu,p}}{\delta}\right)^{1/p} + \varepsilon_{m,d,\delta,\delta}(\mu_j) \vee \varepsilon_{m,d,1-\delta,\delta}(\mu_j)\right) = R_{\mu_j,\delta}.$$
Consequently,
$$W_{p,\delta}^p\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\mu\big) \le \frac{p}{1 - 2\delta}\, R_{\mu_j,\delta}^p\, \big\|\widehat{F}_{m,\mu_j} - F_{\mu_j}\big\|_\infty. \tag{E.12}$$
Combining these results, we obtain that
$$\mathbb{P}\left(\bigcup_{j \in [d]}\left\{W_{p,\delta}^p\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\mu\big) > \frac{p}{1 - 2\delta}\, R_{\mu_j,\delta}^p \cdot t\right\}\right) \le \frac{\delta}{8} + 2d\, e^{-2mt^2}.$$
Setting $\frac{\delta}{8} = 2d\, e^{-2mt^2}$, we obtain $t = \sqrt{\frac{\log(16d/\delta)}{2m}}$. Therefore, with probability at least $1 - \frac{\delta}{4}$, for
all $j \in \{1, \cdots, d\}$,
$$W_{p,\delta}^p\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\mu\big) \le \frac{p}{1 - 2\delta}\,\max_{j \in [d]}\{R_{\mu_j,\delta}^p\}\cdot\sqrt{\frac{\log(16d/\delta)}{2m}}.$$
Recall that $R_{\max} = \max_{j \in [d]}\{R_{\mu_j,\delta}\} \vee \max_{j \in [d]}\{R_{\nu_j,\delta}\}$. Define $\widehat{\mu}_{m,j} := (e_j)_\#\widehat{\mu}_m$ and similarly
$\widehat{\nu}_{m',j} := (e_j)_\#\widehat{\nu}_{m'}$. By the union bound and the triangle inequality applied to
$\big|W_{p,\delta}\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\widehat{\nu}_{m'}\big) - W_{p,\delta}\big((e_j)_\#\mu, (e_j)_\#\nu\big)\big|$, we therefore have
$$\mathbb{P}\left(\max_{j \in [d]}\Big|W_{p,\delta}\big(\widehat{\mu}_{m,j}, \widehat{\nu}_{m',j}\big) - W_{p,\delta}\big(\mu_j, \nu_j\big)\Big| \le 2R_{\max}\left(\frac{p}{1 - 2\delta}\right)^{1/p}\sqrt{\log(16d/\delta)}\,\Big(m^{-\frac{1}{2p}} + m'^{-\frac{1}{2p}}\Big)\right) \ge 1 - \frac{\delta}{2}.$$
Thus
$$\mathbb{P}\left(\left|\frac{1}{d}\sum_{j=1}^{d} W_{p,\delta}\big(\widehat{\mu}_{m,j}, \widehat{\nu}_{m',j}\big) - \frac{1}{d}\sum_{j=1}^{d} W_{p,\delta}\big(\mu_j, \nu_j\big)\right| \le 2R_{\max}\left(\frac{p}{1 - 2\delta}\right)^{1/p}\sqrt{\log(16d/\delta)}\,\Big(m^{-\frac{1}{2p}} + m'^{-\frac{1}{2p}}\Big)\right) \ge 1 - \frac{\delta}{2}. \tag{E.13}$$
Error from the Sliced Wasserstein Component. By Proposition 1(ii) of Manole et al.
(2022), there exists a constant $C_p > 0$ depending only on $p$ that bounds
$$\mathbb{E}\,\big|SW_{p,\delta}(\widehat{\mu}_m, \widehat{\nu}_{m'}) - SW_{p,\delta}(\mu, \nu)\big|.$$
Combining the expression above with Markov's inequality, for any $t \ge 0$, we can bound
$$\mathbb{P}\Big(\big|SW_{p,\delta}(\widehat{\mu}_m, \widehat{\nu}_{m'}) - SW_{p,\delta}(\mu, \nu)\big| \ge t\Big).$$
Hence, taking $t = t_{SW}$, we get
$$\mathbb{P}\Big(\big|SW_{p,\delta}(\widehat{\mu}_m, \widehat{\nu}_{m'}) - SW_{p,\delta}(\mu, \nu)\big| \ge t_{SW}\Big) \le \frac{\delta}{2}. \tag{E.14}$$
Combining Eqs. (E.13) and (E.14) via the union bound,
$$\mathbb{P}\Big(\big|\mathrm{MSW}_{p,\delta}(\widehat{\mu}_m, \widehat{\nu}_{m'}) - \mathrm{MSW}_{p,\delta}(\mu, \nu)\big| \ge t_{\mathrm{MSW}}\Big) \le \delta,$$
where $t_{\mathrm{MSW}}$ combines the marginal threshold in Eq. (E.13) with $t_{SW}$.
As $(\Omega, \mathcal{B})$ and $(\mathcal{X}^n, \mathcal{A})$ are standard Borel, the disintegration theorem (Theorem 3.4, Kallenberg, 1997) yields a measurable kernel satisfying
$$P_{(\Theta,X)}(B \times A) = \int_A \pi_{\Theta|X}(B, x)\, P_X(dx).$$
Therefore, as $\epsilon \to 0^+$, $\pi_{\mathrm{ABI}}^{(\epsilon)}(\theta \mid x^*)$ converges weakly in $\mathcal{P}_p(\Omega)$ to $\pi_{\Theta|X}(\theta \mid x^*)$.
As $f_X(x^*) > 0$ and $f_X$ is continuous at $x^*$, the point $x^*$ lies in the support of $P_X$, so all
conditional probabilities below are well-defined.
Consider a deterministic decreasing sequence $\epsilon_t \downarrow 0$ as $t \to \infty$. Let $D_t := \{X \in B_{\epsilon_t}(x^*)\}$ and
$$Z_t := \mathbf{1}_{D_t},$$
which records whether $X$ falls in successively smaller neighborhoods of the observed data $x^*$.
Define the filtration
Ft = σ(Z1 , . . . , Zt ) ⊆ F,
By convention, we let F0 := {∅, Ξ}. By construction, the filtration is increasing, i.e., Ft′ ⊆ Ft
for all t′ ≤ t. Define F∞ as the minimal σ-algebra generated by (Ft )t∈N , i.e.,
F∞ = σ(∪t Ft ).
Since $\epsilon_t \downarrow 0$,
$$\bigcap_{t \ge 1} D_t = \{X = x^*\}.$$
In other words, F∞ reveals whether the random vector X coincides with the realized observation
x∗ .
For any set $B \in \mathcal{B}$, denote $U_B(\omega) := \mathbf{1}\{\Theta(\omega) \in B\}$, $\omega \in \Xi$. Then, by Lévy's 0-1 law, we have, as $t \to \infty$,
$$\mathbb{E}\big[\mathbf{1}_{D_t} U_B \mid \mathcal{F}_t\big] \xrightarrow{\text{a.s., } L^1} \mathbb{E}\big[\mathbf{1}\{X = x^*\}\, U_B \mid \mathcal{F}_\infty\big] = \mathbf{1}\{X = x^*\}\,\mathbb{E}[U_B \mid \mathcal{F}_\infty] = \mathbf{1}\{X = x^*\}\,\mathbb{P}(\Theta \in B \mid X = x^*).$$
Let $V(\theta) = \|\theta\|_p^p$. By assumption, we have $M = \sup_{x \in \mathcal{X}^n}\int_\Omega V(\theta)\, f_{\Theta,X}(\theta, x)\, d\theta < \infty$. Note
that for any $t \ge 1$, $V\mathbf{1}_{D_t} \le V\mathbf{1}_{D_1}$, and
$$\mathbb{E}[V\mathbf{1}_{D_1}] = \int_{\mathcal{X}^n \cap B_{\epsilon_1}(x^*)}\int_\Omega \|\theta\|_p^p\, f_{\Theta,X}(\theta, x)\, d\theta\, dx \le M \cdot \mathrm{Leb}\big(\mathcal{X}^n \cap B_{\epsilon_1}(x^*)\big) < \infty,$$
where $\mathrm{Leb}$ denotes the Lebesgue measure on $\mathbb{R}^{nd_X}$. Hence as $t \to \infty$, by Lévy's 0-1 law,
$$\mathbb{E}[V\mathbf{1}_{D_t} \mid \mathcal{F}_t] \xrightarrow{\text{a.s., } L^1} \mathbf{1}\{X = x^*\}\,\mathbb{E}[V \mid X = x^*].$$
Therefore, we have
$$\int_\Omega V(\theta)\, \pi_{\Theta|X}(d\theta \mid D_t) \xrightarrow{\text{a.s., } L^1} \int_\Omega V(\theta)\, \pi_{\Theta|X}(d\theta \mid X = x^*).$$
Together with weak convergence, this gives
$$\pi_{\Theta|X}(\cdot \mid D_t) \Rightarrow \pi_{\Theta|X}(\cdot \mid X = x^*)$$
in $\mathcal{P}_p(\Omega)$, as desired.
Appendix F Additional Details on Simulation Settings
F.1 Multimodal Gaussian Model
Below, we present pairwise bivariate density plots of the posterior distribution under the multi-
modal Gaussian model. The plots demonstrate that ABI accurately recovers the joint posterior
structure.
$$\frac{dx}{dt} = \alpha x - \beta x y, \qquad \frac{dy}{dt} = -\gamma y + \delta x y.$$
We simulate the Lotka–Volterra process using the Gillespie algorithm (Gillespie, 1976). This is an
event-driven stochastic simulation scheme in which the waiting times between successive events are
drawn from an exponential distribution. Algorithm 5 outlines the Gillespie procedure for the Lotka–Volterra model.
Algorithm 5: Gillespie algorithm for the Lotka-Volterra model
Input: time horizon T (with t initialized to 0); rates α > 0, β > 0, γ > 0, δ > 0; initial populations X0 > 0, Y0 > 0.
while t < T and Xt ≠ 0 and Yt ≠ 0 do
    Compute rates: r1 ← αXt ; r2 ← βXt Yt ; r3 ← γYt ; r4 ← δXt Yt ;
    Draw waiting time τ ∼ Exp(R), where R = r1 + r2 + r3 + r4 ;
    Draw event index i ∼ Discrete([r1 /R, r2 /R, r3 /R, r4 /R]);
    if i = 1 then Xt+1 ← Xt + 1; Yt+1 ← Yt ;            // prey birth
    else if i = 2 then Xt+1 ← Xt − 1; Yt+1 ← Yt ;        // prey death (predation)
    else if i = 3 then Xt+1 ← Xt ; Yt+1 ← Yt − 1;        // predator death
    else if i = 4 then Xt+1 ← Xt ; Yt+1 ← Yt + 1;        // predator birth
    t ← t + τ ;
end while
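A compact Python sketch of Algorithm 5 is given below for reference; it is a minimal illustration rather than the implementation used in Section 4.4. Recording the state on a regular grid of 0.1 time units over 10 time units mirrors the experimental setup, while the random seed and the handling of extinction are assumed choices.

import numpy as np

def simulate_lv(alpha, beta, gamma, delta, x0=50, y0=100,
                t_max=10.0, dt_record=0.1, rng=None):
    rng = np.random.default_rng(rng)
    t, x, y = 0.0, x0, y0
    grid = np.arange(0.0, t_max + 1e-9, dt_record)   # 101 recording times
    states, k = np.zeros((len(grid), 2)), 0
    while k < len(grid):
        rates = np.array([alpha * x, beta * x * y, gamma * y, delta * x * y])
        total = rates.sum()
        tau = rng.exponential(1.0 / total) if total > 0 else np.inf
        # Record the current state at every grid time reached before the next jump.
        while k < len(grid) and grid[k] <= t + tau:
            states[k] = (x, y)
            k += 1
        if not np.isfinite(tau):
            break                                     # no further events possible
        t += tau
        event = rng.choice(4, p=rates / total)
        if event == 0:
            x += 1        # prey birth
        elif event == 1:
            x -= 1        # prey death (predation)
        elif event == 2:
            y -= 1        # predator death
        else:
            y += 1        # predator birth
    return grid, states

grid, states = simulate_lv(0.5, 0.01, 1.0, 0.01, rng=0)  # true parameters of Section 4.4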
process:
$$V_1 \sim \mathrm{Exp}(\theta_3),$$
$$V_i \mid V_{i-1} \sim V_{i-1} + \mathrm{Exp}(\theta_3), \quad i = 2, \dots, n,$$
$$Y_i \mid X_{i-1}, V_i \sim \mathrm{Unif}\big(\theta_1 + \max(0, V_i - X_{i-1}),\ \theta_2 + \max(0, V_i - X_{i-1})\big), \quad i = 1, \dots, n.$$
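A minimal forward simulator for this process is sketched below. Interpreting θ3 as the rate of the exponential inter-arrival increments and taking X_i = Y_1 + · · · + Y_i (with X_0 = 0) as the i-th departure time are assumptions of this sketch (cf. Shestopaloff and Neal, 2014); the parameter values in the example call are illustrative.

import numpy as np

def simulate_mg1(theta1, theta2, theta3, n=50, rng=None):
    rng = np.random.default_rng(rng)
    V = np.cumsum(rng.exponential(1.0 / theta3, size=n))   # arrival times V_1, ..., V_n
    X_prev, Y = 0.0, np.empty(n)
    for i in range(n):
        shift = max(0.0, V[i] - X_prev)                     # idle time before the i-th service, if any
        Y[i] = rng.uniform(theta1 + shift, theta2 + shift)  # i-th inter-departure time
        X_prev += Y[i]                                      # departure time X_i
    return Y

y_obs = simulate_mg1(1.0, 5.0, 0.2, n=50, rng=0)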
Appendix References
Daniel T Gillespie. A general method for numerically simulating the stochastic time evolution
of coupled chemical reactions. Journal of computational physics, 22(4):403–434, 1976.
Alexander Y Shestopaloff and Radford M Neal. On bayesian inference for the m/g/1 queue with
efficient mcmc sampling. arXiv preprint arXiv:1401.5548, 2014.