
Likelihood-Free Adaptive Bayesian Inference via Nonparametric Distribution Matching

Wenhui Sophia Lu*,1 and Wing Hung Wong1,2

1 Department of Statistics, Stanford University
2 Department of Biomedical Data Science, Stanford University

* Corresponding author; Electronic address: [email protected]

arXiv:2505.04603v1 [stat.ME] 7 May 2025

May 8, 2025

Abstract

When the likelihood is analytically unavailable and computationally intractable, approximate Bayesian computation (ABC) has emerged as a widely used methodology for approximate posterior inference; however, it suffers from severe computational inefficiency in high-dimensional settings or under diffuse priors. To overcome these limitations, we propose Adaptive Bayesian Inference (ABI), a framework that bypasses traditional data-space discrepancies and instead compares distributions directly in posterior space through nonparametric distribution matching. By leveraging a novel Marginally-augmented Sliced Wasserstein (MSW) distance on posterior measures and exploiting its quantile representation, ABI transforms the challenging problem of measuring divergence between posterior distributions into a tractable sequence of one-dimensional conditional quantile regression tasks. Moreover, we introduce a new adaptive rejection sampling scheme that iteratively refines the posterior approximation by updating the proposal distribution via generative density estimation. Theoretically, we establish parametric convergence rates for the trimmed MSW distance and prove that the ABI posterior converges to the true posterior as the tolerance threshold vanishes. Through extensive empirical evaluation, we demonstrate that ABI significantly outperforms data-based Wasserstein ABC, summary-based ABC, as well as state-of-the-art likelihood-free simulators, especially in high-dimensional or dependent observation regimes.

Keywords: approximate Bayesian computation; likelihood-free inference; simulator-based inference; conditional quantile regression; nonparametric distribution matching; adaptive rejection sampling; generative modeling; Wasserstein distance

1 Introduction
Bayesian modeling is widely used across natural science and engineering disciplines. It enables researchers to easily construct arbitrarily complex probabilistic models through forward sampling techniques (implicit models) while stabilizing ill-posed problems by incorporating prior knowledge. Yet, the likelihood function $x \mapsto f_\theta(x)$ may be intractable to evaluate or entirely inaccessible in many scenarios (Zeng et al., 2019; Chiachío-Ruano et al., 2021), thus rendering Markov chain-based algorithms—such as Metropolis-Hastings and broader Markov chain Monte Carlo methods—unsuitable for posterior inference. Approximate Bayesian Computation (ABC) has emerged as a compelling approach for scenarios where exact posterior inference for model parameters is infeasible (Tavaré, 2018). Owing to its minimal modeling assumptions and ease of implementation, ABC has garnered popularity across various Bayesian domains, including likelihood-free inference (Markram et al., 2015; Alsing et al., 2018), Bayesian inverse problems (Chatterjee et al., 2021), and posterior estimation for simulator-based stochastic systems (Wood, 2010). ABC generates a set of parameters $\theta \in \Omega \subseteq \mathbb{R}^d$ with high posterior density through a rejection-based process: it simulates synthetic datasets for different parameter draws and retains only those parameters that yield data sufficiently similar to the observed values.
However, when the data dimensionality is high or the prior distribution is uninformative
about the observed data, ABC becomes extremely inefficient and often requires excessive rejec-
tions to retain a single sample. Indeed, Lemmas B.1 and B.2 show that the expected number of
simulations needed to retain a single draw grows exponentially in the data dimension. To enhance
computational efficiency, researchers frequently employ low-dimensional summary statistics and
conduct rejection sampling instead in the summary statistic space (Fearnhead and Prangle,
2012). Nevertheless, the Pitman-Koopman-Darmois theorem stipulates that low-dimensional
sufficient statistics exist only for the exponential family. Consequently, practical problems of-
ten require considerable judgment in choosing appropriate summary statistics, typically in a
problem-specific manner (Wood, 2010; Marin et al., 2012). Moreover, the use of potentially
non-sufficient summary statistics to evaluate discrepancies can result in ABC approximations
that, while useful, may lead to a systematic loss of information relative to the original posterior
distribution. For instance, Fearnhead and Prangle (2011) and Jiang et al. (2017) propose a
semi-automatic approach that employs an approximation of the posterior mean as a summary
statistic; however, this method ensures only first-order accuracy.
Another critical consideration is selecting an appropriate measure of discrepancy between
datasets. A large proportion of the ABC literature is devoted to investigating ABC strategies
adopting variants of the ℓp -distance between summaries (Prangle, 2017), which are suscepti-
ble to significant variability in discrepancies across repeated samples from fθ (Bernton et al.,
2019). Such drawbacks have spurred a shift towards summary-free ABC methods that directly
compare the empirical distributions of observed and simulated data via an integral probability
metric (IPM), thereby obviating the need to predefine summary statistics (Legramanti et al.,
2022). Popular examples include ABC versions that utilize the Kullback-Leibler divergence
(Jiang, 2018), 2-Wasserstein distance (Bernton et al., 2019), and Hellinger and Cramer–von
Mises distances (Frazier, 2020). The accuracy of the resulting approximate posteriors relies cru-
cially on the fixed sample size n of the observed data, as the quality of IPM estimation between
data-generating processes from a finite, often small, number of samples is affected by the con-
vergence rate of empirically estimated IPMs to their population counterparts. In particular, a
significant drawback of Wasserstein-based ABC methods stems from the slow convergence rate
of the Wasserstein distance, which scales as $O(n^{-1/s})$ when the data dimension $s \geq 3$ (Talagrand, 1994). As a result, achieving accurate posterior estimates is challenging with limited samples, particularly for high-dimensional datasets. A further limitation of sample-based IPM
evaluation is the need for additional considerations in the case of dependent data, since ignoring
such dependencies might render certain parameters unidentifiable (Bernton et al., 2019).
Thus, two fundamental questions ensue from this discourse: What constitutes an informa-
tive set of summary statistics, and what serves as an appropriate measure of divergence between
datasets? To address these questions, we introduce the Adaptive Bayesian In-
ference (ABI) framework, which directly compares posterior distributions through distribution
matching and adaptively refines the estimated posterior via rejection sampling. At its core, ABI
bypasses observation-based comparisons by selecting parameters whose synthetic-data-induced
posteriors align closely with the target posterior, a process we term nonparametric distribution
matching. To achieve this, ABI learns a discrepancy measure in the posterior space, rather than
the observation space, by leveraging the connection between the Wasserstein distance and con-
ditional quantile regression, thereby transforming the task into a tractable supervised learning
problem. Then, ABI simultaneously refines both the posterior estimate and the approximated
posterior discrepancy over successive iterations.
Viewed within the summary statistics framework, our proposed method provides a principled
approach for computing a model-agnostic, one-dimensional kernel statistic. Viewed within the
discrepancy framework, our method approximates an integral probability metric on the space
of posteriors, thus circumventing the limitations of data-based IPM evaluations such as small
sample sizes and dependencies among observations.

Contributions Our work makes three main contributions. First, we introduce a novel integral
probability metric—the Marginally-augmented Sliced Wasserstein (MSW) distance—defined on
the space of posterior probability measures. We then characterize the ABI approximate posterior
as the distribution of parameters obtained by conditioning on those datasets whose induced
posteriors fall within the prescribed MSW tolerance of the target posterior. Whereas conven-
tional approaches rely on integral probability metrics on empirical data distributions, our poste-
rior–based discrepancy remains robust even under small observed sample sizes n, intricate sample
dependency structures, and parameter non-identifiability. We further argue that considering the
axis-aligned marginals can help improve the projection efficiency of uniform slice-based Wasser-
stein distances. Second, we show that the posterior MSW distance can be accurately estimated
through conditional quantile regression by exploiting the equivalence between the univariate
Wasserstein distance and differences in quantiles. This novel insight reduces the traditionally
challenging task of operating in the posterior space into a supervised distributional regression
task, which we solve efficiently using deep neural networks. The same formulation naturally
accommodates multi-dimensional parameters and convenient sequential refinement via rejection
sampling. Third, we propose a sequential version of the rejection–ABC that, to the best of our
knowledge, is the first non-Monte-Carlo-based sequential ABC. Existing sequential refinement
methods in the literature frequently rely on adaptive importance sampling techniques, such as
sequential Monte Carlo (Del Moral et al., 2012; Bonassi and West, 2015) and population Monte
Carlo (Beaumont et al., 2009). These approaches, particularly in their basic implementations,
are often constrained to the support of the empirical distribution derived from prior samples.
While advanced variants can theoretically explore beyond this initial support through rejuve-
nation steps and MCMC moves, they nevertheless require careful selection of transition kernels
and auxiliary backward transition kernels (Del Moral et al., 2012). In contrast, ABI iteratively
refines the posterior distribution via rejection sampling by updating the proposal distribution us-
ing the generative posterior approximation from the previous step—learned through a generative
model (not to be confused with the original simulator in the likelihood-free setup). Generative-
model-based approaches for posterior inference harness the expressive power of neural networks
to capture intricate probabilistic structures without requiring an explicit distributional speci-
fication. This generative learning stage enables ABI to transcend the constrained support of
the empirical parameter distribution and eliminates the need for explicit prior-density evalua-
tion (unlike Papamakarios and Murray (2016)), thereby accommodating cases where the prior
distribution itself may be intractable.
We characterize the topological and statistical behavior of the MSW distance, establishing
both its parametric convergence rate and its continuity on the space of posterior measures. Our
proof employs a novel martingale-based argument appealing to Doob’s theorem, which offers an
alternative technique to existing proofs based on the Lebesgue differentiation theorem (Barber
et al., 2015). This new technique may be of independent theoretical interest for studying the
convergence of other sequential algorithms. We then prove that, as the tolerance threshold
vanishes (with observations held fixed), the ABI posterior converges in distribution to the true
posterior. Finally, we derive a finite-sample bound on the bias induced by the approximate
rejection-sampling procedure. Through comprehensive empirical experiments, we demonstrate
that ABI achieves highly competitive performance compared to data-based Wasserstein ABC,
and several recent, state-of-the-art likelihood-free posterior simulators.

Notation  Let the parameter and data $(\theta, X)$ be jointly defined on some probability space. The prior probability measure $\pi$ on the parameter space $\Omega \subseteq \mathbb{R}^d$ is assumed absolutely continuous with respect to Lebesgue measure, with density $\pi(\theta)$ for $\theta \in \Omega$. For simplicity, we use $\pi(\cdot)$ to denote both the density and its corresponding distribution. Let the observation space be $\mathcal{X} \subseteq \mathbb{R}^{d_X}$ for some $d_X \in \mathbb{N}^+$, where $\mathbb{N}^+ := \{1, 2, \ldots\}$. We observe a data vector $x^* = (x_1^*, \ldots, x_n^*)^\top \in \mathcal{X}^n \subset \mathbb{R}^{n d_X}$, whose joint distribution on $\mathcal{X}^n$ is given by the likelihood $P_\theta^{(n)}$. If the samples are not exchangeable, we simply set $n = 1$ with a slight abuse of notation and write $x^*$ for that single observation. We assume $x^*$ is generated from $P_{\theta^*}^{(n)}$ for some true but unknown $\theta^* \in \Omega$. Both the prior density $\pi(\theta)$ and the likelihood $P_\theta^{(n)}(x)$ may be analytically intractable; however, we assume access to

• a prior simulator that draws $\theta \sim \pi$, and

• a data generator that simulates $X \sim P_\theta^{(n)}$ given any $\theta$.

We do not assume parameter identifiability; that is, we allow for the possibility that distinct parameter values $\theta \neq \theta'$ yield identical probability distributions, $P_\theta^{(n)} = P_{\theta'}^{(n)}$. Our inferential goal is to generate samples from the posterior $\pi(\theta \mid x^*) \propto \pi(\theta) P_\theta^{(n)}(x^*)$, where $\theta \in \Omega$. For notational convenience, we use $D(\cdot, \cdot)$ for a generic distance metric, which may act on the data space or probability measures, depending on the context.

For any function class $\mathcal{G}$ and probability measures $\mu$ and $\nu$, we define the Integral Probability Metric (IPM) between $\mu$ and $\nu$ with respect to $\mathcal{G}$ as $D_{\mathcal{G}}(\mu, \nu) = \sup_{g \in \mathcal{G}} \left| \int g \, d\mu - \int g \, d\nu \right|$.

Let $\|\cdot\|$ denote the $\ell_2$ (Euclidean) distance and let $(\Omega, \|\cdot\|)$ be a Polish space. For $p \in [1, \infty)$, we denote by $\mathcal{P}_p(\Omega)$ the set of Borel probability measures defined on $\Omega$ with finite $p$-th moment. For $\mu, \nu \in \mathcal{P}_p(\Omega)$, the $p$-Wasserstein distance between $\mu$ and $\nu$ is defined as the solution of the optimal mass transportation problem

\[
W_p(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \left( \int_{\Omega \times \Omega} \|x - y\|^p \, d\gamma(x, y) \right)^{1/p},
\tag{1.1}
\]

where $\Gamma(\mu, \nu)$ is the set of all couplings $\gamma$ such that

\[
\gamma(B_1 \times \Omega) = \mu(B_1), \quad \gamma(\Omega \times B_2) = \nu(B_2), \quad \text{for all Borel sets } B_1, B_2 \subseteq \Omega.
\]

The $p$-Wasserstein space is defined as $(\mathcal{P}_p(\Omega), W_p)$. For a comprehensive treatment of the Wasserstein distance and its connections to optimal transport, we refer the reader to Villani et al. (2009).

1.1 Approximate Bayesian Computation

We begin with a brief review of classic Approximate Bayesian Computation (ABC). Given a threshold $\epsilon > 0$ and a distance $D(\cdot, \cdot)$ on summary statistics $s(\cdot)$, classic ABC produces samples from the approximate posterior

\[
\pi_{\mathrm{ABC}}^{\epsilon}(\theta \mid x^*) \propto \pi(\theta) \int_{\mathcal{X}^n} \mathbb{1}\left[ D\left( s(x), s(x^*) \right) \leq \epsilon \right] dP_\theta^{(n)}(x),
\]

via the following procedure:

Algorithm 1: Rejection-ABC Algorithm

1  for $i = 1, 2, \ldots, N$ do
2      Simulate $\theta^{(i)} \sim \pi(\theta)$ and $X^{(i)} = (X_1^{(i)}, X_2^{(i)}, \ldots, X_n^{(i)}) \sim P_{\theta^{(i)}}^{(n)}$;
3      Accept $\theta^{(i)}$ if $D\left( s(X^{(i)}), s(x^*) \right) \leq \epsilon$;

For results on convergence rates and the bias–cost trade-off when using sufficient statistics in ABC, see Barber et al. (2015), who establish consistency of ABC posterior expectations via the Lebesgue differentiation theorem.
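To make the procedure concrete, the following minimal Python sketch implements Algorithm 1 for a toy univariate Gaussian model with an unknown mean. The prior, simulator, summary statistic, tolerance, and number of draws are illustrative choices, not part of the original algorithm specification.

import numpy as np

rng = np.random.default_rng(0)

def rejection_abc(x_obs, prior_sampler, simulator, summary, dist, eps, n_draws):
    """Classic rejection-ABC (Algorithm 1): keep theta whose simulated
    summary lies within eps of the observed summary."""
    s_obs = summary(x_obs)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler()
        x_sim = simulator(theta)
        if dist(summary(x_sim), s_obs) <= eps:
            accepted.append(theta)
    return np.array(accepted)

# Toy example (illustrative): x_i ~ N(theta, 1), theta ~ N(0, 10), summary = sample mean.
n = 50
theta_true = 1.5
x_obs = rng.normal(theta_true, 1.0, size=n)
posterior_draws = rejection_abc(
    x_obs,
    prior_sampler=lambda: rng.normal(0.0, np.sqrt(10.0)),
    simulator=lambda th: rng.normal(th, 1.0, size=n),
    summary=np.mean,
    dist=lambda a, b: abs(a - b),
    eps=0.1,
    n_draws=20000,
)
print(len(posterior_draws), posterior_draws.mean())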

1.2 Sliced Wasserstein Distance

The Sliced Wasserstein (SW) distance, introduced by Rabin et al. (2012), provides a computationally efficient approximation to the Wasserstein distance in high dimensions. For measures $\mu$ and $\nu$ on $\mathbb{R}^d$, the $p$-Sliced Wasserstein distance integrates the $p$-th power of the one-dimensional Wasserstein distance over all directions on the unit sphere:

\[
SW_p^p(\mu, \nu) = \int_{\mathbb{S}^{d-1}} W_p^p(\varphi_\# \mu, \varphi_\# \nu) \, d\sigma(\varphi),
\tag{1.2}
\]

where $\mathbb{S}^{d-1}$ is the unit sphere in $\mathbb{R}^d$, $\sigma$ is the uniform measure on $\mathbb{S}^{d-1}$, and $\varphi_\#$ denotes the pushforward under the projection onto the one-dimensional subspace spanned by $\varphi$. By reducing the problem to univariate cases, each of which admits an analytic solution, this approach circumvents the high computational cost of directly evaluating the $d$-dimensional Wasserstein distance while preserving key topological properties of the classical Wasserstein metric, including its ability to metrize weak convergence (Bonnotte, 2013).

To approximate the integral in (1.2), in practice one draws $K$ directions i.i.d. from the sphere and forms the Monte Carlo estimator

\[
\widehat{SW}_p(\mu, \nu) = \left( \frac{1}{K} \sum_{k=1}^{K} W_p^p(\varphi^{(k)}_\# \mu, \varphi^{(k)}_\# \nu) \right)^{1/p},
\tag{1.3}
\]

where each $\varphi^{(k)} \sim \mathrm{Unif}(\mathbb{S}^{d-1})$.
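A minimal implementation of the estimator in (1.3) is sketched below. It uses the closed-form one-dimensional Wasserstein distance between equal-size sorted samples and draws projection directions uniformly from the sphere; the sample sizes, dimension, and $p$ are illustrative.

import numpy as np

def wasserstein_1d_p(u, v, p=2):
    """W_p^p between two equal-size 1-D samples via sorted order statistics
    (the empirical quantile coupling)."""
    return np.mean(np.abs(np.sort(u) - np.sort(v)) ** p)

def sliced_wasserstein(x, y, p=2, n_proj=100, rng=None):
    """Monte Carlo estimator of SW_p, Eq. (1.3): average the projected
    W_p^p over random directions, then take the 1/p root."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[1]
    phis = rng.normal(size=(n_proj, d))
    phis /= np.linalg.norm(phis, axis=1, keepdims=True)   # uniform directions on S^{d-1}
    vals = [wasserstein_1d_p(x @ phi, y @ phi, p) for phi in phis]
    return np.mean(vals) ** (1.0 / p)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(500, 5))
y = rng.normal(0.5, 1.0, size=(500, 5))
print(sliced_wasserstein(x, y, p=2, n_proj=200, rng=rng))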

1.3 Conditional Quantile Regression

Drawing on flexible, distribution-free estimation methods, we briefly review conditional quantile regression. Introduced by Koenker and Bassett Jr (1978), quantile regression offers a robust alternative to mean response modeling by estimating conditional quantiles of the response variable. For response $Y \in \mathbb{R}$ and covariates $X \in \mathbb{R}^d$, the $\tau$-th conditional quantile of $Y$ given $X$ is

\[
Q_\tau(x) = \inf\{ y \in \mathbb{R} : F_{Y \mid X}(y \mid x) \geq \tau \}, \quad x \in \mathbb{R}^d,
\tag{1.4}
\]

where $F_{Y \mid X}(\cdot \mid x)$ is the conditional CDF of $Y$ given $X = x$. We estimate $Q_\tau$ by minimizing the empirical quantile loss over a model class $\mathcal{Q}$:

\[
\widehat{Q}_\tau = \arg\min_{Q \in \mathcal{Q}} \sum_{i=1}^{N_{\mathrm{train}}} \rho_\tau\left( y_i - Q(x_i) \right),
\tag{1.5}
\]

where $\rho_\tau(u) = \max\{\tau u, (\tau - 1)u\}$ is the quantile loss function, and $N_{\mathrm{train}}$ denotes the number of training samples. We use $N_{\mathrm{train}}$ deliberately to avoid confusion with $n$, which represents the (fixed) number of observations in Bayesian inference.
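As a concrete illustration of (1.5), the sketch below fits a linear conditional quantile model by subgradient descent on the pinball loss. The linear model class, learning rate, and synthetic heteroscedastic data are illustrative stand-ins for the flexible neural model classes used later in the paper.

import numpy as np

def pinball_loss(u, tau):
    # rho_tau(u) = max{tau*u, (tau-1)*u}
    return np.maximum(tau * u, (tau - 1.0) * u)

def fit_linear_quantile(x, y, tau, lr=0.05, n_iter=2000):
    """Minimize the empirical pinball loss over linear models Q(x) = a*x + b."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        u = y - (a * x + b)                      # residuals
        g = np.where(u > 0, -tau, 1.0 - tau)     # subgradient of rho_tau w.r.t. Q(x)
        a -= lr * np.mean(g * x)
        b -= lr * np.mean(g)
    return a, b

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=2000)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5 + 0.5 * np.abs(x))   # heteroscedastic noise
a9, b9 = fit_linear_quantile(x, y, tau=0.9)
print("estimated 0.9-quantile line:", a9, b9)
print("mean pinball loss:", pinball_loss(y - (a9 * x + b9), 0.9).mean())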

1.4 Generative Density Estimation

Consider a dataset $\mathcal{Y} = \{Y_1, \ldots, Y_n\}$ of i.i.d. draws from an unknown distribution $P_Y$. Generative models seek to construct a distribution $\widehat{P}_Y$ that closely approximates $P_Y$ and is amenable to efficient generation of new samples. Typically, one assumes an underlying latent variable structure whereby samples are generated as $\widetilde{Y} = G_\beta(Z)$, with $Z \sim P_Z$ being a low-dimensional random variable with a simple, known distribution. Here $G_\beta : \mathcal{Z} \to \mathcal{Y}$ is the generative (pushforward) map that transforms latent vectors into observations. The parameters $\beta$ are optimized so that the generated samples $\widetilde{y} \sim \widehat{P}_Y$ are statistically similar to the real data. Learning objectives may be formulated using a variety of generative frameworks, including optimal transport networks (Lu et al., 2025), generative adversarial networks, and auto-encoder models.

1.5 Article Structure and Related Literature

Related Works  Several prior works have proposed the use of generative adversarial networks (GANs) as conditional density estimators (Zhou et al., 2023; Wang and Ročková, 2022). These approaches aim to learn a generator mapping $(X, \eta) \mapsto \widetilde{\theta}$ with $\eta$ independent of $X$ and $\theta$. Adversarial training then aims to align the model's induced joint distribution $P_{\mathrm{data}}(X)\, P_G(\widetilde{\theta} \mid X)$ with the true joint distribution $P_{\mathrm{data}}(X)\, P_{\mathrm{data}}(\theta \mid X)$. The distinction between ABI and these methods is crucial: ABI operates in the space of posterior distributions on $\Omega$ and measures distances between full posterior distributions, while the latter works in the joint space $\mathcal{X}^n \times \Omega$, using a discriminator loss to match generated pairs $(X, \widetilde{\theta})$ to real data $(X, \theta)$. Due to this fundamental difference, these generative approaches focus on learning conditional mappings uniformly over the entire domain $\mathcal{X}^n$ in a single round. While GAN-based methods can be adapted for sequential refinement using importance re-weighting, as demonstrated by Wang and Ročková (2022), such adaptations typically require significant additional computational resources, including training auxiliary networks (such as the classifier network needed to approximate ratio weights in their two-step process). In contrast, ABI is inherently designed for sequential refinement without necessitating such auxiliary models. In particular, results on simulated data in Figure 2 show that our sequential method significantly outperforms Wasserstein GANs when the prior is uninformative.

Recent work by Polson and Sokolov (2023) and its multivariate extension by Kim et al. (2024) propose simulating posterior samples via inverse transform sampling. Although both methods and ABI employ quantile regression, they differ fundamentally in scope and mechanism. The former approaches apply a one-step procedure that pushes noise through an inverse-CDF or multivariate quantile map to produce posterior draws, inherently precluding direct sequential refinement. In contrast, ABI employs conditional quantile regression to estimate a posterior metric—the posterior MSW distance—which extends naturally to any dimension and any $p \in [1, \infty)$. In particular, the case $p = 1$ is noteworthy, since it renders both the MSW$_1$ and $W_1$ distances integral probability metrics (see Theorem 3.1) and allows a dual formulation (Villani et al., 2009). On the other hand, Kim et al. (2024) focus exclusively on the 2-Wasserstein case. Moreover, they rely on a combination of Long Short-Term Memory and Deep Sets architectures to construct a multivariate summary and approximate the 2-Wasserstein transport map, a step that the authors acknowledge is sensitive to random initialization for obtaining a meaningful quantile mapping. By comparison, ABI uses a simple feed-forward network to learn a one-dimensional kernel statistic corresponding to the estimated posterior MSW distance. Our experiments (Section 4) demonstrate that ABI is robust across different scenarios, insensitive to initialization, and requires minimal tuning.

Organization of the Paper The remainder of this manuscript is organized as follows. Sec-
tion 2 introduces the ABI framework and its algorithmic components. Section 3 establishes
the empirical convergence rates of the proposed MSW distance, characterizes its topological
properties, and proves that the ABI posterior converges to the target posterior as the tolerance
threshold vanishes. Section 4 demonstrates the effectiveness of ABI through extensive empirical
evaluations. Finally, Section 5 summarizes the paper and outlines future research directions.
Proofs of technical results and additional simulation details are deferred to the Appendix.

2 Adaptive Bayesian Inference


In this section, we introduce the proposed Adaptive Bayesian Inference (ABI) methodology. The fundamental idea of ABI is to transcend observation-based comparisons by operating directly in posterior space. Specifically, we approximate the target posterior as

\[
\pi_{\mathrm{ABI}}(\theta) = \pi\left( \theta \;\middle|\; \mathrm{MSW}\left( \pi(\theta \mid x), \pi(\theta \mid x^*) \right) \leq \epsilon \right),
\]

that is, the distribution of $\theta$ conditional on the event that the posterior induced by dataset $x$ lies within an $\epsilon$-neighborhood of the observed posterior under the MSW metric. This formulation enables direct comparison of candidate posteriors via the posterior MSW distance and supports efficient inference by exploiting its quantile representation. As such, ABI approximates the target posterior through nonparametric posterior matching. In practice, for each proposed $\theta$, we simulate an associated dataset $X$ and evaluate the MSW distance between the conditional posterior $\pi(\theta \mid X)$ and the observed-data posterior $\pi(\theta \mid x^*)$. Simulated samples for which this estimated deviation is small are retained, thereby steering our approximation progressively closer to the true posterior. For brevity, we denote the target posterior by $\pi^* = \pi(\theta \mid x^*)$. Our approach proceeds in four steps.
Step 1. Estimate the trimmed MSW distance between the posteriors $\pi^*$ and $\pi(\theta \mid x)$, $x \in \mathcal{X}$, using conditional quantile regression with multilayer feedforward neural networks; see Section 2.1.2.

Step 2. Sample from the current proposal distribution by decomposing it into a marginal component over $\theta$ and an acceptance constraint on $X$, then employ rejection sampling; see Section 2.2.1.

Step 3. Refine the posterior approximation via acceptance–rejection sampling: retain only those synthetic parameter draws whose simulated data yield an estimated MSW distance to $\pi^*$ below the specified threshold, and discard the rest; see Section 2.2.2.

Step 4. Update the proposal for the next iteration by fitting a generative model to the accepted parameter draws; see Section 2.2.3.
These objectives are integrated into a unified algorithm with a nested sequential structure. At each iteration $t = 1, 2, \ldots, T$, the proposal distribution is updated using the previous step's posterior approximation, thus constructing a sequence of partial posteriors $\pi_*^{(1)}, \pi_*^{(2)}, \ldots, \pi_*^{(T)}$ that gradually shift toward the target posterior $\pi^*$. This iterative approach improves the accuracy of posterior approximation through adaptive concentration on regions of high posterior alignment, which in turn avoids the unstable variance that can arise from single-round inference. By contrast, direct Monte Carlo estimation would require an infeasible number of simulations to observe even a single instance in which the generated sample exhibits sufficient similarity to the observed data. For a concrete illustration of sequential refinement, see the simple Gaussian–Gaussian conjugate example in Appendix C.1. The complete procedure is presented in Algorithm 2.

2.1 Nonparametric Distribution Matching


2.1.1 Motivation: Minimal Posterior Sufficiency

We first discuss the conceptual underpinnings of our proposed posterior-based distance. Consider the target posterior $\pi^*$, which conditions on the observed data $x^*$. If an alternative posterior $\pi(\theta \mid X)$ is close to $\pi^*$ under an appropriate measure of posterior discrepancy, then $\pi(\theta \mid X)$ naturally constitutes a viable approximation to the target posterior. Thus, we can select, from among all candidate posteriors, the ones whose divergence from the true posterior falls within the prescribed tolerance. We formalize this intuition through the notion of minimal posterior sufficiency.

Under the classical frequentist paradigm, Fisher showed that the likelihood function $L(\theta; X) = P_\theta^{(n)}(X)$, viewed as a random function of the data, is a minimal sufficient statistic for $\theta$, as it encapsulates all available information about the parameter $\theta$ (Berger and Wolpert, 1988, Chapter 3). This result is known as the likelihood principle. The likelihood principle naturally extends to the Bayesian regime since the posterior distribution is proportional to $\pi(\theta) P_\theta^{(n)}(X)$. In particular, Theorem 2.1 shows that, given a prior distribution $\pi(\cdot)$, the posterior map $X \mapsto \pi(\cdot \mid X)$ is minimally Bayes sufficient with respect to the prior $\pi$. We refer to this concept as minimal posterior sufficiency.
Definition 2.1 (Bayes Sufficient). A statistic T (X) is Bayes sufficient with respect to a prior
distribution π(θ) if π(θ | X) = π(θ | T (X)).
Theorem 2.1 (Minimal Bayes Sufficiency of Posterior Distribution). The posterior map $X \mapsto \pi(\cdot \mid X)$ is minimally Bayes sufficient.
Ideally, inference should be based on minimally sufficient statistics, which suggests that our posterior inference should utilize such statistics when low-dimensional versions exist. However, low-dimensional sufficient statistics—let alone minimally sufficient ones—are available for only a very limited class of distributions; consequently, we need to consider other alternatives to classical summaries. Observe that the acceptance event formed by matching the infinite-dimensional statistics $\pi(\theta \mid X)$ and $\pi(\theta \mid x^*)$ coincides with the event

\[
\mathbb{1}\left\{ D\left( \pi(\theta \mid X),\, \pi(\theta \mid x^*) \right) \leq \epsilon \right\}.
\]

Leveraging this equivalence, our key insight is to collapse the infinite-dimensional posterior maps into a one-dimensional kernel statistic formed by applying a distributional metric on posterior measures—thus preserving essential geometric structures, conceptually analogous to the "kernel trick."
Algorithm 2: Adaptive Bayesian Inference (ABI)

Input: Tolerance thresholds $\infty > \epsilon_1 > \epsilon_2 > \cdots > \epsilon_T$;
       generative model $G_\beta$;
       MSW distance parameters: trimming parameter $\delta \in (0, 1/2)$, mixing parameter $\lambda \in (0, 1)$, number of slices $K > 0$, number of discretization points $H > 0$.

1  Initialize $\pi^{(1)}(\theta) \leftarrow \pi(\theta)$;
2  for iteration $t = 1, 2, \ldots, T$ do
       // Sample the current proposal; see Section 2.2.1
3      Generate $N$ parameter–data pairs $\{(\theta^{(i)}, X^{(i)})\}_{i=1}^{N}$ by rejection sampling from the unnormalized joint proposal using Algorithm 4:
       $\pi^{(t)}(\theta, X) = \pi\big(\theta, X \mid \mathbb{1}\{\widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_X, \pi_{x^*}) \leq \epsilon_{t-1}\}\big)$;
4      Sample $K$ random projections $\varphi^{(k)} \sim \sigma(\mathbb{S}^{d-1})$ and form the projection set $\Phi = \{\varphi^{(k)}\}_{k=1}^{K} \cup \{e_j\}_{j=1}^{d}$; set $K' \leftarrow K + d$;
       // Train the quantile estimator network; see Section 2.1.2
5      Generate $N_{\mathrm{train}}$ parameter–data pairs $\{(X^{(m)}, \theta^{(m)})\}_{m=1}^{N_{\mathrm{train}}}$ using Algorithm 4 to form the training set for the quantile ReLU network;
6      Fine-tune the conditional quantile ReLU network $\widehat{Q}_{K',H}$ using Algorithm 3;
7      For each $i = 1, \ldots, N$, evaluate $\widehat{\mathrm{MSW}}_{p,\delta,K,H}\big(\pi(\theta \mid X^{(i)}), \pi(\theta \mid x^*)\big)$ with $\widehat{Q}_{K',H}$ using Eq. (2.4);
       // Pruning stage: refine the partial posterior distribution; see Section 2.2.2
8      Retain the parameter draws satisfying the tolerance threshold condition:
       $S_{\theta,*}^{(t)} \leftarrow \{\theta^{(i)} : \widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi(\theta \mid X^{(i)}), \pi(\theta \mid x^*)) \leq \epsilon_t\}$;
       // Sequential generative density estimation stage; see Section 2.2.3
9      Train the generative density estimator $G_\beta$ on $S_{\theta,*}^{(t)}$ to obtain the approximate partial posterior $\widehat{\pi}_*^{(t)}$;
10     Update the proposal distribution for iteration $(t+1)$: $\pi^{(t+1)}(\theta) \leftarrow \widehat{\pi}_*^{(t)}(\theta)$;
11 Set $\widehat{\pi}_{\mathrm{ABI}}^{(\epsilon_T)}(\theta \mid x^*) \leftarrow \widehat{\pi}_*^{(T)}(\theta)$;

Output: Generative model that samples from the ABI posterior $\widehat{\pi}_{\mathrm{ABI}}^{(\epsilon_T)}(\theta \mid x^*)$.
We realize this idea concretely via the novel Marginally-augmented Sliced Wasserstein (MSW) distance. The MSW distance preserves marginal structure and mitigates the curse of dimensionality, achieving the parametric convergence rate when $p = 1$ (see Section 3.4). Moreover, MSW is topologically equivalent to the classical Wasserstein distance, retaining its geometric properties such as metrizing weak convergence.

2.1.2 Estimating Trimmed MSW Distance via Deep Conditional Quantile Regression

To mitigate the well-known sensitivity of the Wasserstein and Sliced Wasserstein distances to heavy tails, we adopt a robust, trimmed variant of the MSW distance, expanding upon the works of Alvarez-Esteban et al. (2008) and Manole et al. (2022). To set the stage for our multivariate extension, we first recall the definition of the trimmed Wasserstein distance in one dimension. For univariate probability measures $\mu$ and $\nu$, and a trimming parameter $\delta \in [0, 1/2)$, the $\delta$-trimmed $W_p$ distance is defined as

\[
W_{p,\delta}(\mu, \nu) = \left( \frac{1}{1 - 2\delta} \int_{\delta}^{1-\delta} \left| F_\mu^{-1}(\tau) - F_\nu^{-1}(\tau) \right|^p d\tau \right)^{1/p},
\tag{2.1}
\]

where $F_\mu^{-1}$ and $F_\nu^{-1}$ denote the quantile functions of $\mu$ and $\nu$, respectively.
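The $\delta$-trimmed univariate distance in (2.1) can be approximated from samples by comparing empirical quantile functions on a grid restricted to $[\delta, 1-\delta]$; in the sketch below the grid size and the heavy-tailed test data are illustrative choices.

import numpy as np

def trimmed_wasserstein_1d(u, v, p=2, delta=0.05, n_grid=200):
    """Approximate the delta-trimmed W_p of Eq. (2.1) between two 1-D samples by
    averaging |F_u^{-1}(tau) - F_v^{-1}(tau)|^p over a grid of tau in [delta, 1-delta]."""
    taus = np.linspace(delta, 1.0 - delta, n_grid)
    qu = np.quantile(u, taus)
    qv = np.quantile(v, taus)
    return np.mean(np.abs(qu - qv) ** p) ** (1.0 / p)

rng = np.random.default_rng(3)
u = rng.standard_t(df=2, size=5000)            # heavy-tailed sample
v = rng.standard_t(df=2, size=5000) + 0.3
print(trimmed_wasserstein_1d(u, v, p=1, delta=0.05))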
We now extend this univariate trimming concept to the multivariate setting and provide a formal definition of the trimmed MSW distance.

Definition 2.2. Let $\delta \in [0, 1/2)$ be a trimming constant, and let $\mu$ and $\nu$ be probability measures on $\mathbb{R}^d$ (with $d \geq 2$) that possess finite $p$-th moments for $p \geq 1$. The $\delta$-trimmed Marginally-augmented Sliced Wasserstein (MSW) distance between $\mu$ and $\nu$ is defined as

\[
\mathrm{MSW}_{p,\delta}(\mu, \nu) = \underbrace{\lambda\, \frac{1}{d} \sum_{j=1}^{d} W_{p,\delta}\left( \mu_j, \nu_j \right)}_{\text{marginal augmentation}} + (1 - \lambda) \underbrace{\left[ \mathbb{E}_{\varphi \sim \sigma}\, W_{p,\delta}^{p}\left( \varphi_\# \mu, \varphi_\# \nu \right) \right]^{1/p}}_{\text{Sliced Wasserstein distance}},
\tag{2.2}
\]

where $\lambda \in (0, 1)$ is a mixing parameter; $\sigma$ is the uniform probability measure on the unit sphere $\mathbb{S}^{d-1}$; $\mu_j$ denotes the marginal distribution of the $j$-th coordinate under the joint measure $\mu$; and $\varphi_\#$ denotes the pushforward by the projection $\varphi$. This robustification of the MSW distance compares distributions after trimming up to a $2\delta$ fraction of their mass along each projection.

Remark 2.1. When $d = 1$, Definition 2.2 reduces to the marginal term alone. In that case, the trimmed MSW distance coincides exactly with the standard trimmed Wasserstein distance between the two one-dimensional distributions.

Remark 2.2. When $\delta = 0$, $\mathrm{MSW}_{p,\delta}$ reduces to the untrimmed $\mathrm{MSW}_p$ distance. For completeness, we give the formal definition of $\mathrm{MSW}_p(\cdot, \cdot)$ in Appendix A.1.
The trimmed MSW distance comprises two components: the Sliced Wasserstein term, which captures joint interactions through random projections on the unit sphere, and a marginal augmentation term, which gauges distributional disparities along the coordinate axes. The inclusion of the marginal term enhances the MSW distance's sensitivity to discrepancies along each coordinate axis, remedying the inefficiency of standard SW projections that arises from uninformative directions sampled uniformly at random. Furthermore, because the SW distance is approximated via Monte Carlo, explicitly accounting for coordinate-wise marginals is particularly important, as these marginal distributions directly determine the corresponding posterior credible intervals. The value of incorporating axis-aligned marginals has also been highlighted in recent works (Moala and O'Hagan, 2010; Drovandi et al., 2024; Chatterjee et al., 2025; Lu et al., 2025). For brevity, unless stated otherwise, we refer to the trimmed MSW distance simply as the MSW distance for the remainder of this section.
In continuation of our earlier discussion on the need for a posterior-space metric, the posterior MSW distance quantifies the extent to which posterior distributions shift in response to perturbations in the observations. In contrast, most existing ABC methods rely on distances computed directly between datasets, either as $D(x, x^*)$ or as $D(\widehat{\mu}_x, \widehat{\mu}_{x^*})$, where $\widehat{\mu}_{(\cdot)}$ denotes the empirical distribution—serving as indirect proxies for posterior discrepancy due to the fundamental challenges in estimating posterior-based metrics. Importantly, our approach overcomes this limitation by leveraging the quantile representation of the posterior MSW distance, as formally established in Definition 2.3.
Definition 2.3 (Quantile Representation of MSW Distance). The trimmed MSW distance defined in Definition 2.2 can be equivalently expressed using the quantile representation as

\[
\mathrm{MSW}_{p,\delta}(\mu, \nu) = \frac{\lambda}{d} \sum_{j=1}^{d} \left( \frac{1}{1 - 2\delta} \int_{\delta}^{1-\delta} \left| F_{\mu_j}^{-1}(\tau) - F_{\nu_j}^{-1}(\tau) \right|^p d\tau \right)^{1/p}
+ (1 - \lambda) \left( \frac{1}{1 - 2\delta} \int_{\mathbb{S}^{d-1}} \int_{\delta}^{1-\delta} \left| F_{\varphi_\# \mu}^{-1}(\tau) - F_{\varphi_\# \nu}^{-1}(\tau) \right|^p d\tau \, d\sigma(\varphi) \right)^{1/p}.
\tag{2.3}
\]

Building on Definition 2.3, we reformulate posterior comparison as a conditional quantile regression problem for $\theta$ given $X = x$. Specifically, the MSW distance is constructed in terms of one-dimensional projections of the distributions to leverage the closed-form expression available for univariate Wasserstein evaluation. By approximating the spherical integral with $K$ Monte Carlo–sampled directions, computing MSW therefore reduces to fitting a series of conditional quantile regressions, each corresponding to a distinct one-dimensional projection.

To evaluate the MSW distance, we first draw $K > 0$ projections $\{\varphi^{(1)}, \ldots, \varphi^{(K)}\}$ uniformly at random from the unit sphere $\mathbb{S}^{d-1}$. Subsequently, we discretize the interval $[\delta, 1 - \delta]$ into $H > 0$ equidistant subintervals, each of width $\Delta = (1 - 2\delta)/H$. For $x, x' \in \mathcal{X}$, we write $\pi_x = \pi(d\theta \mid x)$ and $\pi_{x'} = \pi(d\theta \mid x')$ as shorthand for the respective posterior distributions. The posterior $\mathrm{MSW}_{p,\delta}$ distance between $\pi_x$ and $\pi_{x'}$, denoted $\widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_x, \pi_{x'})$, can be approximated as follows:

\[
\widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_x, \pi_{x'}) = \frac{\lambda}{d} \sum_{j=1}^{d} \left[ I_H\left( F_{\pi_x,j}^{-1}, F_{\pi_{x'},j}^{-1} \right) \right]^{1/p}
+ (1 - \lambda) \left( \frac{1}{K} \sum_{k=1}^{K} I_H\left( F_{\varphi^{(k)}_\# \pi_x}^{-1}, F_{\varphi^{(k)}_\# \pi_{x'}}^{-1} \right) \right)^{1/p},
\tag{2.4}
\]

where

\[
I_H(q_1, q_2) = \frac{\Delta}{2(1 - 2\delta)} \left( \left| q_1(\delta) - q_2(\delta) \right|^p + 2 \sum_{h=1}^{H-1} \left| q_1(\delta + h\Delta) - q_2(\delta + h\Delta) \right|^p + \left| q_1(1 - \delta) - q_2(1 - \delta) \right|^p \right).
\tag{2.5}
\]

In the equations above, $I_H$ represents the trapezoidal discretization function, while $F_{\pi_x,j}^{-1}(\tau)$ denotes the quantile function associated with the $j$-th coordinate of $\pi_x$ evaluated at the $\tau$-th quantile level.
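For reference, the sketch below evaluates (2.4)–(2.5) when the conditional quantile functions are replaced by empirical quantiles of two parameter samples; in ABI itself these quantiles come from the trained quantile network, so the sample-based quantiles here are only an illustrative stand-in, as are the default parameter values.

import numpy as np

def I_H(q1, q2, p, delta, H):
    """Trapezoidal discretization of Eq. (2.5); q1, q2 hold the H+1 quantiles
    evaluated at tau_h = delta + h*(1 - 2*delta)/H."""
    diffs = np.abs(q1 - q2) ** p
    weights = np.full(H + 1, 2.0)
    weights[0] = weights[-1] = 1.0
    Delta = (1.0 - 2.0 * delta) / H
    return Delta / (2.0 * (1.0 - 2.0 * delta)) * np.sum(weights * diffs)

def msw_hat(theta_x, theta_xstar, p=1, delta=0.05, K=50, H=20, lam=0.5, rng=None):
    """Estimate MSW_{p,delta,K,H} of Eq. (2.4) between two d-dimensional samples."""
    rng = np.random.default_rng() if rng is None else rng
    d = theta_x.shape[1]
    taus = delta + np.arange(H + 1) * (1.0 - 2.0 * delta) / H
    # marginal augmentation term
    marg = 0.0
    for j in range(d):
        q1 = np.quantile(theta_x[:, j], taus)
        q2 = np.quantile(theta_xstar[:, j], taus)
        marg += I_H(q1, q2, p, delta, H) ** (1.0 / p)
    marg *= lam / d
    # sliced Wasserstein term over K random directions
    phis = rng.normal(size=(K, d))
    phis /= np.linalg.norm(phis, axis=1, keepdims=True)
    sw = 0.0
    for phi in phis:
        q1 = np.quantile(theta_x @ phi, taus)
        q2 = np.quantile(theta_xstar @ phi, taus)
        sw += I_H(q1, q2, p, delta, H)
    sw = (sw / K) ** (1.0 / p)
    return marg + (1.0 - lam) * sw

rng = np.random.default_rng(4)
a = rng.normal(0.0, 1.0, size=(2000, 3))
b = rng.normal(0.2, 1.0, size=(2000, 3))
print(msw_hat(a, b, rng=rng))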
Remark 2.3. The estimated MSW distance can be viewed both as a measure of discrepancy
between the posterior distributions and as an informative low-dimensional kernel statistic. De-
pending on the context, we will use these interpretations interchangeably to best suit the task
at hand.
Remark 2.4. When only the individual posterior marginals π(θj | x∗ ) for j = 1, . . . , d are
of interest, one can elide the Sliced Wasserstein component entirely and compute only the
univariate marginal terms. Since the marginals often suffice for decision-making without the
full joint posterior (Moala and O’Hagan, 2010), this approach yields substantial computational
savings.
To estimate the quantile functions, we perform nonparametric conditional quantile regression
(CQR) via deep ReLU neural networks. These networks have demonstrated remarkable abilities
to approximate complex nonlinear functions and adapt to unknown low-dimensional structures
while possessing attractive theoretical properties. In particular, Padilla et al. (2022) establish
that, under mild smoothness conditions, the ReLU-network quantile regression estimator attains
minimax-optimal convergence rates.
Definition 2.4 (Deep Neural Networks). Let $\phi(x) = \max\{x, 0\}$ be the ReLU activation function. For a network with $L$ hidden layers, let $\mathbf{d} = (d_0, d_1, \ldots, d_{L+1})^\top \in \mathbb{R}^{L+2}$ specify the number of neurons in each layer, where $d_0$ represents the input dimension and $d_{L+1}$ the output dimension. The class of multilayer feedforward ReLU neural networks specified by architecture $(L, \mathbf{d})$ comprises all functions from $\mathbb{R}^{d_0}$ to $\mathbb{R}^{d_{L+1}}$ formed by composing affine maps with elementwise ReLU activations:

\[
f(x) = g_{L+1} \circ \phi \circ g_L \circ \cdots \circ \phi \circ g_1(x),
\]

where each layer $\ell$ is represented by an affine transformation $g_\ell(d^{(\ell-1)}) = W^{(\ell)} d^{(\ell-1)} + b^{(\ell)}$, with $W^{(\ell)} \in \mathbb{R}^{d_\ell \times d_{\ell-1}}$ as the weight matrix and $b^{(\ell)} \in \mathbb{R}^{d_\ell}$ as the bias vector.

Building on Definition 2.4, we approximate (2.4) by training a single deep ReLU network to jointly predict all slice-quantiles. Let $K' := K + d$ denote the total number of directions in the augmented projection set $\Phi = \{\varphi^{(k)}\}_{k=1}^{K} \cup \{e_j\}_{j=1}^{d}$. For each projection $\varphi^{(k)} \in \Phi$ and quantile level $\tau_h = \delta + h\Delta$ with $h = 0, 1, \ldots, H$ and $\Delta = (1 - 2\delta)/H$, let

\[
Q_{k,h}^*(x) = F_{\varphi^{(k)}_\# \pi_x}^{-1}(\tau_h)
\]

be the true conditional $\tau_h$-quantile along $\varphi^{(k)}$. Given $N_{\mathrm{train}}$ training pairs $\{(x^{(m)}, \theta^{(m)})\}_{m=1}^{N_{\mathrm{train}}}$, we learn $\widehat{Q}_{K',H} : \mathcal{X} \to \mathbb{R}^{K'(H+1)}$ by solving

\[
\widehat{Q}_{K',H} = \arg\min_{Q \in \mathcal{Q}} \sum_{k=1}^{K'} \sum_{h=0}^{H} \sum_{m=1}^{N_{\mathrm{train}}} \rho_{\tau_h,\kappa}\left( \langle \varphi^{(k)}, \theta^{(m)} \rangle - Q_{[k,h]}(x^{(m)}) \right),
\tag{2.6}
\]

where $\mathcal{Q}$ is the class of ReLU neural network models with architecture $(L, \mathbf{d})$ and output dimension $d_{L+1} = K'(H + 1)$, and $Q_{[k,h]}$ is the $((H + 1)(k - 1) + h + 1)$-th entry of the flattened output. This single network thus shares parameters across all $K'$ slices and $H + 1$ quantile levels.

In contrast to the conventional pinball quantile loss (Padilla et al., 2022), we employ the Huber quantile regression loss (Huber, 1964), which is less sensitive to extreme outliers. This loss function, parameterized by a threshold $\kappa$ (Dabney et al., 2018), is defined as

\[
\rho_{\tau,\kappa}(u) =
\begin{cases}
\dfrac{1}{2\kappa} \left| \tau - \mathbb{1}(u < 0) \right| u^2, & |u| \leq \kappa, \\[1ex]
\left| \tau - \mathbb{1}(u < 0) \right| \left( |u| - \tfrac{1}{2}\kappa \right), & |u| > \kappa.
\end{cases}
\tag{2.7}
\]

Upon convergence of the training process, we obtain a single quantile network that outputs the predicted quantile of the projection $\langle \varphi^{(k)}, \theta \rangle$ for any given projection $\varphi^{(k)}$, quantile level $\tau_h$, and conditioning variable $x \in \mathcal{X}$.
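A minimal PyTorch sketch of the joint quantile network and the Huber quantile loss (2.7) follows. The two-layer architecture, the optimizer settings, and the toy simulator (theta drawn from a standard normal and x equal to theta plus noise) are illustrative assumptions, not the configuration used in the paper's experiments; sorting of the predicted levels is applied afterwards, as described in Remark 2.5 below.

import torch
import torch.nn as nn

def huber_quantile_loss(u, tau, kappa=1.0):
    """Huber quantile loss rho_{tau,kappa}(u) of Eq. (2.7).
    u: residuals (target - prediction), broadcastable against tau."""
    weight = torch.abs(tau - (u < 0).float())
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u ** 2 / kappa,
                        u.abs() - 0.5 * kappa)
    return (weight * huber).mean()

class JointQuantileNet(nn.Module):
    """Feed-forward ReLU network mapping x to K'(H+1) slice-quantile predictions."""
    def __init__(self, x_dim, n_slices, n_levels, width=128):
        super().__init__()
        self.n_slices, self.n_levels = n_slices, n_levels
        self.net = nn.Sequential(
            nn.Linear(x_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, n_slices * n_levels),
        )
    def forward(self, x):
        # reshape the flattened output to (batch, K', H+1)
        return self.net(x).view(-1, self.n_slices, self.n_levels)

# Illustrative setup: theta ~ N(0, I_d), x = theta + noise (toy simulator).
d, x_dim, K, H, delta = 3, 3, 8, 10, 0.05
taus = torch.tensor([delta + h * (1 - 2 * delta) / H for h in range(H + 1)])
phis = torch.randn(K + d, d)
phis = phis / phis.norm(dim=1, keepdim=True)
phis[K:] = torch.eye(d)                      # append axis-aligned directions e_j

theta = torch.randn(4096, d)
x = theta + 0.5 * torch.randn(4096, x_dim)
proj = theta @ phis.T                        # targets <phi^(k), theta>, shape (N, K+d)

model = JointQuantileNet(x_dim, K + d, H + 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(200):
    pred = model(x)                                          # (N, K+d, H+1)
    u = proj.unsqueeze(-1) - pred                            # residual per slice and level
    loss = huber_quantile_loss(u, taus.view(1, 1, -1), kappa=1.0)
    opt.zero_grad(); loss.backward(); opt.step()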
Remark 2.5. Unlike Padilla et al. (2022), we impose no explicit monotonicity constraints in (2.6); that is, we do not enforce the following ordering restrictions during the joint estimation stage:

\[
\widehat{Q}_{[k,h]}(X^{(m)}) \leq \widehat{Q}_{[k,h']}(X^{(m)}) \quad \text{for all } 0 \leq h < h' \leq H, \; k \in [K'], \; m \in [N_{\mathrm{train}}].
\]

Instead, after obtaining the predictions $\{\widehat{Q}_{[k,h]}(x)\}_{h=0}^{H}$ for each slice $\varphi^{(k)}$, we simply sort these $H + 1$ values in ascending order. This post-processing step automatically guarantees the non-crossing restriction without adding any constraints to the optimization.
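In code, this post-processing amounts to a single sort along the quantile-level axis of the network output; the array shape below follows the $(K', H+1)$ layout described above, and the numeric values are purely illustrative.

import numpy as np

# q_hat: predicted quantiles for one input x, shape (K_prime, H + 1).
q_hat = np.array([[0.10, 0.05, 0.32, 0.30],    # slice 1: levels cross
                  [-1.2, -0.4, -0.5, 0.8]])    # slice 2: levels cross
q_monotone = np.sort(q_hat, axis=1)            # rearrange each slice's H+1 levels
print(q_monotone)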
Collectively, the elements in this section constitute the core of the nonparametric distribution
matching component of our proposed methodology. The corresponding algorithmic procedure
for distribution matching is summarized in Algorithm 3.

Algorithm 3: Trimmed MSW Distance Estimation via CQR

Input: Proposal distribution $\pi^{(t)}(\theta, X)$; number of random projections $K$; number of quantile levels $H$; smoothing parameter $\kappa$; network architecture $(L, \mathbf{d})$; training sample size $N_{\mathrm{train}}$.

1  Sample $K$ directions $\{\varphi^{(k)}\}_{k=1}^{K} \sim \mathrm{Unif}(\mathbb{S}^{d-1})$;
2  Set $\Phi \leftarrow \{\varphi^{(k)}\}_{k=1}^{K} \cup \{e_j\}_{j=1}^{d}$;
3  Compute the quantile grid $\tau_h = \delta + h\,\frac{1-2\delta}{H}$ for $h = 0, \ldots, H$;
4  for $m = 1, \ldots, N_{\mathrm{train}}$ do
5      Generate sample $(\theta^{(m)}, X^{(m)}) \sim \pi^{(t)}(\theta, X)$;
6  Train the quantile network $\widehat{Q}_{K',H}$ with architecture $(L, \mathbf{d})$ by minimizing the empirical risk in (2.6) using the loss $\rho_{\tau_h,\kappa}$;
7  for $k = 1, \ldots, K'$ do
8      For each sample $X^{(m)}$, sort the $H + 1$ outputs $\{\widehat{Q}_{[k,h]}(X^{(m)})\}_{h=0}^{H}$ to enforce monotonicity;
9  return the trained network $\widehat{Q}_{K',H}$ for evaluating $\widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_x, \pi_{x^*})$ via (2.4).

2.2 Adaptive Rejection Sampling

In contrast to the single-round rejection framework of classical rejection-ABC, ABI implements an adaptive rejection sampling approach. For clarity of exposition, we decompose this approach into three distinct stages: proposal sampling, conditional refinement, and sequential updating.

At a high level, this sequential scheme decomposes the target event into a chain of more tractable conditional events. Define a sequence of nested subsets $\mathcal{X}^n \supset A_1 \supset A_2 \supset \cdots \supset A_T$ associated with decreasing thresholds $\epsilon_1 > \epsilon_2 > \cdots > \epsilon_T$, following a structure similar to adaptive multilevel splitting. We proceed by induction. At initialization, draw $\theta \sim \pi(\theta)$ and $X \mid \theta \sim P_\theta^{(n)}(\cdot)$. At iteration one, we condition on the event $X \in A_1$ by selecting among the initial samples $(\theta, X)$ those for which $X \in A_1$, so that the joint law becomes

\[
P\left( \theta \mid X \in A_1 \right) P_\theta\left( X \mid X \in A_1 \right) = P\left( \theta, X \mid X \in A_1 \right) \quad \text{(conditional refinement)}.
\]

At iteration $t$, we first obtain samples from $P(\theta, X \mid X \in A_{t-1})$ generated by

\[
\theta \sim \pi\left( \theta \mid X \in A_{t-1} \right), \qquad X \mid \theta, A_{t-1} \sim P_\theta^{(n)}\left( \cdot \mid X \in A_{t-1} \right) \quad \text{(proposal sampling)}.
\]

To refine to $A_t \subseteq A_{t-1}$, note that

\[
P\left( \theta \mid X \in A_{t-1} \right) P_\theta\left( X \in A_t \mid X \in A_{t-1} \right) = \frac{P(\theta, X \in A_t)}{P(X \in A_{t-1})} \propto P\left( \theta \mid X \in A_t \right) \quad \text{(sequential update)}.
\]

Thus, by conditioning on $A_t$, we obtain samples from the intermediate partial posterior $P(\theta \mid X \in A_t)$. Iterating this procedure until termination yields the final approximation $P(\theta \mid X \in A_T)$, which converges to the true posterior as $\epsilon_T$ approaches 0 and the acceptance regions become increasingly precise.
In the following subsections, we present a detailed implementation for each of these three
steps.

2.2.1 Sampling from the Refined Proposal Distribution

In this section, we describe how to generate samples from the refined proposal distribution using rejection sampling.

Decoupling the Joint Proposal  Let $\epsilon_1 > \epsilon_2 > \cdots > \epsilon_T$ be a user-specified, decreasing sequence of tolerances. Define the data-space acceptance region and its corresponding event by

\[
A_t = \{ x \in \mathcal{X}^n : \widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_x, \pi_{x^*}) \leq \epsilon_t \} \subseteq \mathcal{X}^n, \qquad E_t = \{ \omega : X(\omega) \in A_t \}.
\]

At iteration $t \geq 1$, we adopt the joint proposal distribution over $(\theta, X)$ given by

\[
\pi^{(1)}(\theta, X) = \pi(\theta)\, P_\theta^{(n)}(X), \qquad \pi^{(t)}(\theta, X) = \pi\left( \theta, X \mid E_{t-1} \right) \quad \text{for } t = 2, \ldots, T,
\]

where we condition on the event $E_{t-1} = \{X \in A_{t-1}\}$ with $A_0 = \mathcal{X}^n$. Since direct sampling from this conditional distribution is generally infeasible, we recover it via rejection sampling after decoupling the joint proposal. Let

\[
\pi^{(t)}(\theta) = \int_{\mathcal{X}^n} \pi^{(t)}(\theta, x) \, dx
\]

be the marginal law of $\pi^{(t)}$ over $\theta$. This auxiliary distribution matches the correct conditional marginal while remaining independent of $X$. Observe that the joint proposal for the $t$-th iteration $\pi^{(t)}$ admits the factorization

\[
\pi^{(t)}(\theta, X) = \pi(\theta \mid E_{t-1})\, P_\theta^{(n)}(X \mid E_{t-1})
= \underbrace{\pi^{(t)}(\theta)}_{\text{marginal in } \theta} \; \underbrace{P_\theta^{(n)}(X \mid X \in A_{t-1})}_{\text{constraint on } X}.
\tag{2.8}
\]

Note that for a given $\theta \in \Omega$, the data-conditional term in the equation above satisfies

\[
P_\theta^{(n)}(X \mid X \in A_{t-1}) \propto \underbrace{P_\theta^{(n)}(X)\, \mathbb{1}\left\{ \widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_X, \pi_{x^*}) \leq \epsilon_{t-1} \right\}}_{\text{constraint on } X},
\tag{2.9}
\]

where the proportionality symbol hides the normalizing constant $P_\theta^{(n)}(A_{t-1})$. The decomposition in (2.8) cleanly decouples the proposal distribution into a marginal draw over $\theta$ and a constraint on $X$. In other words, the first component eliminates the coupling while retaining the correct conditional marginal $\pi^{(t)}(\theta)$, and the second term imposes a data-dependent coupling constraint to be enforced via a simple rejection step.

Sampling the Proposal via Rejection  In order to draw

\[
(\theta, X) \sim \pi^{(t)}(\theta, X) = \pi(\theta, X \mid E_{t-1})
\]

without computing its normalizing constant $P_\theta^{(n)}(A_{t-1})$, we apply rejection sampling to the unnormalized joint factorization in (2.8):

1. Sample $\theta \sim \pi^{(t)}(\theta)$;

2. Generate $X \sim P_\theta^{(n)}$ repeatedly until $\mathbb{1}\{\widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_X, \pi_{x^*}) \leq \epsilon_{t-1}\} = 1$.

By construction, the marginal distribution of $\theta$ remains $\pi(\theta \mid E_{t-1})$ since all $\theta$ values are unconditionally accepted, while the acceptance criterion precisely enforces the constraint $X \in A_{t-1}$. Consequently, the retained pairs $(\theta, X)$ follow the desired joint distribution $\pi(\theta, X \mid E_{t-1})$. However, it is practically infeasible to perform exact rejection sampling as the expected number of simulations for Step 2 may be unbounded. To address this limitation, we introduce a budget-constrained rejection procedure termed Approximate Rejection Sampling (ARS), as outlined in Algorithm 4. The core idea is as follows: given a fixed computational budget $R \in \mathbb{N}^+$, we repeat Step 2 at most $R$ times. If no simulated dataset satisfies the tolerance criterion within this budget, the current parameter proposal is discarded, and the algorithm proceeds to the next parameter draw.

Algorithm 4: Approximate Rejection Sampling

1  for $i = 1, \ldots, N$ do
2      Sample $\theta^{(i)} \sim \pi^{(t)}(\theta)$;
3      for $r = 1, \ldots, R$ do
4          Sample $X^{(i,r)} \sim P_{\theta^{(i)}}^{(n)}$;
5          if $\widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_{X^{(i,r)}}, \pi_{x^*}) \leq \epsilon_{t-1}$ then
6              Retain $(\theta^{(i)}, X^{(i,r)})$;
7              break;
8      if $\widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_{X^{(i,r)}}, \pi_{x^*}) > \epsilon_{t-1}$ for all $r = 1, \ldots, R$ then
9          Discard $\theta^{(i)}$;
While approximate rejection sampling introduces a small bias, the approximation error becomes negligible under appropriate conditions. Theorem 2.2 establishes that, under mild regularity conditions, the resulting error decays exponentially fast in $R$.

Assumption 2.1 (Local Positivity). There exist constants $c > 0$ and $\gamma > 0$ such that, for every $\theta$ with $\pi^{(t)}(\theta) > 0$, the per-draw acceptance probability $P_\theta^{(n)}(A_t)$ is uniformly bounded away from 0, satisfying $P_\theta^{(n)}(A_t) \geq c\,\epsilon_t^{\gamma}$.

Assumption 2.1 is satisfied, for instance, if the kernel statistic $D_\theta(X) = \widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_X, \pi_{x^*})$ under $X \sim P_\theta^{(n)}$ admits a continuous density $q_\theta(u)$ that is strictly positive in a neighborhood of $u = 0$. In that case, for small $\epsilon_t$,

\[
P_\theta^{(n)}(A_t) = P_\theta^{(n)}(D_\theta \leq \epsilon_t) = \int_0^{\epsilon_t} q_\theta(u) \, du \geq q_\theta(0)\, \epsilon_t / 2.
\]

Theorem 2.2 (Sample Complexity for ARS). Suppose Assumption 2.1 holds. For any $\bar{\delta} \in (0, 1)$ and $\epsilon_t > 0$, if the number of proposal draws $R$ satisfies $R = O\left( \log(1/\bar{\delta}) / \epsilon_t^{\gamma} \right)$, then the total-variation distance between the exact and approximate proposal distributions obeys $D_{\mathrm{TV}}\left( \pi^{(t)}, \pi_{\mathrm{ARS}}^{(t)} \right) \leq \bar{\delta}$.

2.2.2 Adaptive Refinement of the Partial Posterior

The strength of ABI lies in its sequential refinement of partial posteriors through a process guided
by a descending sequence of tolerance thresholds ϵ1 > ϵ2 > · · · > ϵT . This sequence progressively
tightens the admissible deviation from the target posterior π ∗ , yielding increasingly improved
posterior approximations. By iteratively decreasing the tolerances rather than prefixing a single
small threshold, ABI directs partial posteriors to dynamically focus on regions of the parameter
space most compatible with the observed data. This adaptive concentration is particularly
advantageous when the prior is diffuse (i.e., uninformative) or the likelihood is concentrated in
low-prior-mass regions, a setting in which one-pass ABC is notoriously inefficient.
The refinement procedure unfolds as follows. First, we acquire $N$ samples from the proposal distribution via Algorithm 4, namely

\[
\theta^{(i)} \sim \pi^{(t)}(\theta), \qquad X^{(i)} \mid \theta^{(i)} \sim P_{\theta^{(i)}}^{(n)}\left( \cdot \mid E_{t-1} \right), \qquad i = 1, \ldots, N,
\]

which form the initial proposal set $S_0^{(t)} = \{(\theta^{(i)}, X^{(i)})\}_{i=1}^{N}$ for the ensuing refinement. Next, we retain only those parameter draws $\theta^{(i)}$ that exhibit a sufficiently small estimated posterior discrepancy; that is, parameters whose associated simulated data satisfy

\[
\widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_{X^{(i)}}, \pi_{x^*}) \leq \epsilon_t,
\]

and we discard the remainder. This selection yields the training set for the generative density estimation step,

\[
S_{\theta,*}^{(t)} = \left\{ \theta^{(i)} : (\theta^{(i)}, X^{(i)}) \in S_0^{(t)} \text{ and } \widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_{X^{(i)}}, \pi_{x^*}) \leq \epsilon_t \right\},
\]

consisting of parameter samples drawn from the refined conditional distribution at the current iteration. Let $\pi_*^{(t)}$ denote the true (unobserved) marginal distribution of $\theta$ underlying the empirical parameter set $S_{\theta,*}^{(t)}$.¹ Importantly, $\pi_*^{(t)}$ depends solely on $\theta$ since the $X$ component has been discarded—thus removing the coupling between $\theta$ and $X$.

By design, our pruning procedure progressively refines the parameter proposals by incorporating accumulating partial information garnered from previous iterations. As the tolerance decays, the retained parameters are incrementally confined to regions that closely align with the target posterior $\pi^*$, thereby sculpting each partial posterior $\pi_*^{(t)}$ toward $\pi^*$.

Determining the Sequence of Tolerance Levels  Thus far, our discussion has implicitly assumed that the choice of $\epsilon_t$ yields a non-empty set $S_{\theta,*}^{(t)}$. To ensure that $S_{\theta,*}^{(t)}$ is non-empty, the tolerance level $\epsilon_t$ must be chosen judiciously relative to $\epsilon_{t-1}$. In particular, $\epsilon_t$ should neither be substantially smaller than $\epsilon_{t-1}$ (which might result in an empty set) nor excessively large (which would lead to inefficient refinement). By construction, the initial proposal samples $S_0^{(t)} = \{\theta^{(i)}, X^{(i)}\}$ satisfy

\[
\max_{i=1,\ldots,N} \widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_{X^{(i)}}, \pi_{x^*}) \leq \epsilon_{t-1}.
\]

Consequently, we determine the sequence of tolerance thresholds empirically (analogous to adaptive multilevel splitting) by setting $\epsilon_t(\alpha)$ as the $\alpha$-th quantile of the set

\[
\left\{ \widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_{X^{(i)}}, \pi_{x^*}) : i = 1, \ldots, N, \ \widehat{\mathrm{MSW}}_{p,\delta,K,H}(\pi_{X^{(i)}}, \pi_{x^*}) \leq \epsilon_{t-1} \right\},
\]

where $\alpha \in (0, 1)$ is a quantile threshold hyperparameter (Biau et al., 2015). In this manner, our selection procedure yields a monotone decreasing sequence of thresholds, $\epsilon_0(\alpha) > \epsilon_1(\alpha) > \cdots > \epsilon_T(\alpha)$, while ensuring that the refined parameter sets $S_{\theta,*}^{(t)}$ remain non-empty.
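Concretely, the next tolerance can be set as an empirical quantile of the accepted distances from the current round, as in the snippet below; the value of α and the synthetic stand-in distances are illustrative choices.

import numpy as np

def next_tolerance(distances, eps_prev, alpha=0.5):
    """Set eps_t as the alpha-quantile of the estimated MSW distances that already
    satisfy the previous threshold eps_prev (adaptive-multilevel-splitting style)."""
    admissible = distances[distances <= eps_prev]
    return np.quantile(admissible, alpha)

rng = np.random.default_rng(5)
dists = rng.gamma(shape=2.0, scale=0.5, size=1000)   # stand-in for MSW-hat values
eps1 = next_tolerance(dists, eps_prev=np.inf, alpha=0.5)
eps2 = next_tolerance(dists, eps_prev=eps1, alpha=0.5)
print(eps1, eps2)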

2.2.3 Sequential Density Estimation

To incorporate the accumulated information into subsequent iterations, we update the proposal
distribution using the current partial posterior. Recall that the partial posterior factorizes into a
marginal component over θ and a constraint on X, as described in Section 2.2.1. In this section,
we focus on updating the marginal partial posterior by applying generative density estimation
to the retained parameter draws from the preceding pruning step. This process ensures that
the proposal distribution at each iteration reflects all refined information acquired in earlier
iterations.
¹ The true distribution $\pi_*^{(t)}$ is unobserved, as only its empirical counterpart is available.

Marginal Proposal Update with Generative Modeling  Our approach utilizes a generative model $G_\beta(Z) : \mathcal{Z} \to \widehat{\theta}$, parameterized by $\beta$, which transforms low-dimensional latent noise $Z$ into synthetic samples $\widehat{\theta}$. When properly trained to convergence on the refined set $S_{\theta,*}^{(t)}$, $G_\beta$ produces a generative distribution, denoted $\widehat{\pi}_*^{(t)}$, that closely approximates the target partial posterior $\pi_*^{(t)}$. Since ABI is compatible with any generative model—including generative adversarial networks, variational auto-encoders, and Gaussian mixture models—practitioners enjoy considerable flexibility in their implementation choices. In this work, we employ POTNet (Lu et al., 2025) because of its robust performance and resistance to mode collapse, which are crucial attributes for preserving the diversity of the target distribution and minimizing potential biases arising from approximation error that could propagate to subsequent iterations.

At the end of iteration $t$, we update the proposal distribution for the $(t+1)$-th iteration with the $t$-th iteration's approximate marginal partial posterior:

\[
\pi^{(t+1)}(\theta) \leftarrow \widehat{\pi}_*^{(t)}(\theta).
\]

Thereafter, we can simply apply the ARS algorithm described in Section 2.2.1 to generate samples from the joint proposal $\pi^{(t+1)}(\theta, X)$ conditional on the event $E_t$. At the final iteration $T$, we take the ABI posterior to be

\[
\widehat{\pi}_{\mathrm{ABI}}^{(\epsilon_T)}(\theta \mid x^*) \leftarrow \widehat{\pi}_*^{(T)}(\theta),
\]

which approximates the coarsened target distribution $\pi(\theta \mid \mathrm{MSW}_p(\pi_X, \pi_{x^*}) \leq \epsilon_T)$. We emphasize that the core of ABI's sequential refinement mechanism hinges on the key novelty of utilizing generative models, whose inherent generative capability enables approximation of, and efficient sampling from, the revised proposal distributions.
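This proposal update only requires a generative model that can be fit to the retained draws and then sampled from. The sketch below uses a Gaussian mixture from scikit-learn as a simple stand-in for POTNet, purely to illustrate the fit-then-sample interface; the number of components and the synthetic accepted draws are illustrative assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def update_proposal(accepted_thetas, n_components=5, seed=0):
    """Fit a generative density estimator to the accepted parameter draws and
    return a sampler for the next round's marginal proposal pi^{(t+1)}(theta)."""
    gm = GaussianMixture(n_components=n_components, random_state=seed)
    gm.fit(accepted_thetas)
    def sampler(n):
        samples, _ = gm.sample(n)      # sample() returns (samples, component labels)
        return samples
    return sampler

rng = np.random.default_rng(6)
accepted = rng.normal(loc=[1.0, -0.5], scale=0.3, size=(500, 2))   # stand-in for retained draws
proposal_sampler = update_proposal(accepted)
print(proposal_sampler(3))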

Iterative Fine-tuning of the Quantile Network We retrain the quantile network on the
newly acquired samples at each iteration to fully leverage the accumulated information, thereby
adapting the kernel statistics to become more informative about the posterior distribution. This
continual fine-tuning improves our estimation of the posterior MSW discrepancy and thus yields
progressively more accurate refinements of the parameter subset in subsequent iterations.

3 Theoretical Analysis
In this section, we investigate theoretical properties of the proposed MSW distance, its trimmed
version MSWp,δ (·, ·), and the ABI algorithm. In Section 3.1, we first establish some important
topological and statistical properties of the MSW distance between distributions µ, ν under mild
moment assumptions. In particular, we show that the error between the empirical MSW distance
and the true MSW distance decays at the parametric rate (see Remark 3.2). Then, in Section
3.2, we derive asymptotic guarantees on the convergence of the resulting ABI posterior in the
limit of ϵ ↓ 0.
We briefly review the notation as follows. Throughout this section, we assume that $p \geq 1$ and $d \in \mathbb{N}^+$. We denote by $\sigma(\cdot)$ the uniform probability measure on $\mathbb{S}^{d-1}$, and by $\mathcal{P}_p(\mathbb{R}^d)$ the space of probability measures on $\mathbb{R}^d$ with finite $p$-th moments. Given a probability measure $\mu \in \mathcal{P}_p(\mathbb{R}^d)$ and the projection mapping $f_\varphi : \theta \mapsto \varphi^\top \theta$ (where $\varphi \in \mathbb{S}^{d-1}$), we write $\varphi_\# \mu$ for its pushforward under the projection $f_\varphi(\cdot)$. Additionally, for any $\alpha \in [0, 1]$ and any one-dimensional probability measure $\gamma$, we denote by $F_\gamma^{-1}(\alpha)$ the $\alpha$-th quantile of $\gamma$,

\[
F_\gamma^{-1}(\alpha) = \inf\left\{ x \in \mathbb{R} \mid F_\gamma(x) \geq \alpha \right\},
\]

where $F_\gamma(x) = \gamma((-\infty, x])$ is the cumulative distribution function of $\gamma$.

3.1 Topological and Statistical Properties of the MSW Distance


We first establish important topological and statistical properties of the MSW distance. Specif-
ically, we show that the MSW distance is indeed a metric, functions as an integral probability
metric when p = 1, topologically equivalent to Wp on Pp (Rd ) and metrizes weak convergence on
Pp (Rd ).

3.1.1 Topological Properties

Subsequently, we denote by MSWp (·, ·) the untrimmed version of the MSW distance with δ = 0
(as defined in A.1) and omit the subscript δ.
Proposition 3.1 (Metricity). The untrimmed Marginally-augmented Sliced Wasserstein MSWp (·, ·)
distance is a valid metric on Pp (Rd ).

Our first theorem shows that the 1-MSW distance is an IPM and allows a dual formulation.
Theorem 3.1. The 1-MSW distance is an Integral Probability Metric defined by the class,
FMSW = { f : Rd → R | f(x) = (λ/d) Σ_{j=1}^{d} gj(ej⊤x) + (1 − λ) ∫_{Sd−1} gφ(φ⊤x) dσ(φ) :
         gj, gφ ∈ Lip1(R), sup_{φ∈Sd−1} |gφ(0)| < ∞ },   0 < λ < 1,   (3.1)

where for each φ ∈ Sd−1 , gφ : R → R is a 1-Lipschitz function, such that the mapping (φ, t) 7→
gφ (t) is jointly measurable with respect to the product of the Borel σ-algebras on Sd−1 and R.

Theorem 3.2 (Topological Equivalence of MSWp and Wp). There exist constants Cd,p,λ, C†d,p,λ > 0
depending on d, p, λ, such that for µ, ν ∈ Pp(BR(0)),

MSWp(µ, ν) ≤ Cd,p,λ Wp(µ, ν) ≤ C†d,p,λ R^{1 − 1/(p(d+1))} MSWp(µ, ν)^{1/(p(d+1))},

where Cd,p,λ = λ + (1 − λ)( d−1 ∫_{Sd−1} ∥φ∥p^p dσ(φ) )^{1/p}. Consequently, the p-MSW distance induces
the same topology as the p-Wasserstein distance.
Theorem 3.3 (MSW Metrizes Weak Convergence). The MSWp distance metrizes weak convergence on Pp(Rd), in the sense of convergence in Pp(Rd) as defined in Definition 6.8 of Villani et al. (2009).
Remark 3.1. This result holds without the requirement of compact domains.

3.1.2 Statistical Properties

In this section, we establish statistical guarantees for the trimmed MSWp,δ (·, ·) distance as
formalized in Definition 2.2. We focus particularly on how closely the empirical version of this
distance approximates its population counterpart when estimated from finite samples. For any
µ, ν ∈ Pp(Rd) and m, m′ ∈ N+, we denote by µ̂m and ν̂m′ the empirical measures constructed
from m and m′ i.i.d. samples drawn from µ and ν, respectively. Our main result, presented
in Theorem 3.4, derives a non-asymptotic bound on |MSWp,δ(µ̂m, ν̂m′) − MSWp,δ(µ, ν)| that
achieves the parametric convergence rate of m−1/2 when p = 1 and m = m′, as shown in
Eq. (3.7).

Assumptions. We assume that µ, ν ∈ Pp′(Rd) where p′ = max{p, 2}. The sample sizes
m, m′ ∈ N+ satisfy min{m, m′} > max{2(p + 2)/δ, log(32d/δ̄)/(2δ²)}, where δ ∈ (0, 1/2) is the
trimming parameter and δ̄ ∈ (0, 1) is the confidence level. We define effective radii Mµ,p, Mν,p ∈
(0, ∞) such that EZ∼µ[∥Z∥p] < Mµ,p and EZ∼ν[∥Z∥p] < Mν,p; the existence of these radii follows
from the fact that µ, ν ∈ Pp(Rd).

Notation. For simplicity, we denote one-dimensional projections along coordinate axes as


µj := (ej )# µ and νj := (ej )# ν for all j ∈ [d]. Similarly, for general projections, we write
µφ := φ# µ and νφ := φ# ν for all φ ∈ Sd−1 .
For any one-dimensional probability measure γ and any t ≥ 0, we define
ψδ,t(γ) := min{ Fγ(Fγ−1(δ) + t) − δ,  δ − Fγ(Fγ−1(δ) − t) },   (3.2)
ψ1−δ,t(γ) := min{ Fγ(Fγ−1(1 − δ) + t) − (1 − δ),  (1 − δ) − Fγ(Fγ−1(1 − δ) − t) },   (3.3)

where Fγ is the CDF of γ. Note that from our assumption min{m, m′} > log(32d/δ̄)/(2δ²), we
have 2 exp(−2 min{m, m′}δ²) < δ̄/(16d); as lim_{t→∞} ψδ,t(γ) = δ, there exist ε_{m,d,δ,δ̄}(γ), ε_{m′,d,δ,δ̄}(γ) ∈
(0, ∞) such that

2 exp(−2m ψ_{δ, ε_{m,d,δ,δ̄}(γ)}(γ)²) ≤ δ̄/(16d),   2 exp(−2m′ ψ_{δ, ε_{m′,d,δ,δ̄}(γ)}(γ)²) ≤ δ̄/(16d).   (3.4)

Similarly, let ε_{m,d,1−δ,δ̄}(γ), ε_{m′,d,1−δ,δ̄}(γ) ∈ (0, ∞) be such that

2 exp(−2m ψ_{1−δ, ε_{m,d,1−δ,δ̄}(γ)}(γ)²) ≤ δ̄/(16d),   2 exp(−2m′ ψ_{1−δ, ε_{m′,d,1−δ,δ̄}(γ)}(γ)²) ≤ δ̄/(16d).   (3.5)
For every j ∈ [d], we define
For every j ∈ [d], we define

Rµj,δ := 2( (Mµ,p/δ)^{1/p} + ε_{m,d,δ,δ̄}(µj) ∨ ε_{m,d,1−δ,δ̄}(µj) ),
Rνj,δ := 2( (Mν,p/δ)^{1/p} + ε_{m′,d,δ,δ̄}(νj) ∨ ε_{m′,d,1−δ,δ̄}(νj) ).

We further define

Rmax := max_{j∈[d]}{Rµj,δ} ∨ max_{j∈[d]}{Rνj,δ}.   (3.6)

The next theorem quantifies the convergence rate of the empirical trimmed MSW distance.
Theorem 3.4 (Convergence Rate of MSW Distance). Suppose that the assumptions given above
hold. For any δ̄ ∈ (0, 1), with probability at least 1 − δ̄, we have

|MSWp,δ (µ̂m , ν̂m′ ) − MSWp,δ (µ, ν)| ≤ tMSW ,

where

tMSW := ( 2λ Rmax (p/(1 − 2δ))^{1/p} √(log(16d/δ̄))
          + 2(1 − λ) Cp ( EZ∼µ[∥Z∥²]^{1/2} ∨ EZ∼ν[∥Z∥²]^{1/2} ) / ( (δ(1 − 2δ))^{1/p} √δ̄ ) )
        · ( m^{−1/(2p)} + m′^{−1/(2p)} ),

where Rmax is as defined in Eq. (3.6) and Cp > 0 is a constant that depends only on p.
Remark 3.2. Theorem 3.4 states that the empirical trimmed MSWp,δ distance between two
samples converges to the true population trimmed MSWp,δ distance at the rate

tMSW = O( m^{−1/(2p)} + m′^{−1/(2p)} ).   (3.7)

In particular, when p = 1 and m = m′ , this recovers the familiar O(m−1/2 ) parametric rate.

3.2 Theoretical properties of the ABI posterior


In this section, we investigate theoretical properties of the ABI posterior. First, we prove that
the oracle ABI posterior converges to the true posterior distribution as ϵ ↓ 0. We additionally
establish, via a novel martingale-based technique, that the MSW distance is continuous with
respect to ABC posteriors, in the sense that this distance vanishes in the limit as ϵ ↓ 0.
Theorem 3.5 (Convergence of the ABI Posterior). Let (Θ, X) be defined on a probability space
(Ξ, F, P) with Θ ∈ Ω where Ω ⊆ Rd is a Polish parameter space with Borel σ-algebra B, and
X ∈ X n where X n ⊆ RndX is a Polish observation space with Borel σ-algebra A. Let p ∈ [1, ∞)
and assume
Θ ∼ π,   X | Θ ∼ PΘ(n).

Assume that the joint distribution of (Θ, X) (denoted by PΘ,X) admits the density fΘ,X, the
marginal distribution of X (denoted by PX) admits the density fX, and PΘ(n) admits the density
fX|Θ, all with respect to Lebesgue measure. Let x∗ ∈ X n be such that fX(x∗) > 0 and fX is

continuous at x∗ . Suppose
M := sup_{x∈X n} ∫_Ω ∥θ∥p^p fΘ,X(θ, x) dθ < ∞.

Then as ϵ ↓ 0, the oracle ABI posterior, with density


πABI(ϵ)(θ | x∗) := πΘ|X(θ | MSWp(πΘ|x, πΘ|x∗) ≤ ϵ)
                 = ( π(θ) ∫_{X n} fX|Θ(x | θ) 1{MSWp(πΘ|x, πΘ|x∗) ≤ ϵ} dx )
                   / ( ∫_{Ω×X n} π(θ) fX|Θ(x | θ) 1{MSWp(πΘ|x, πΘ|x∗) ≤ ϵ} dθ dx )

converges weakly in Pp (Ω) to the true posterior distribution πΘ|X (θ | x∗ ).


Theorem 3.6 (Continuity of the ABC Posterior under the MSWp Distance). Let Θ, X, x∗ be
defined similarly as and satisfy all the assumptions in Theorem 3.5. For any ϵ > 0, define
Bϵ (x∗ ) = {x : ∥x − x∗ ∥ ≤ ϵ}. Then for any decreasing sequence ϵt ↓ 0,

πΘ|X( dθ | X ∈ Bϵt(x∗) ) ⇒ πΘ|X( dθ | X = x∗ )   as t → ∞,

where we use ⇒ to denote weak convergence in Pp (Ω). Consequently,


 
lim_{t→∞} MSWp( πΘ|X(dθ | X ∈ Bϵt(x∗)), πΘ|X(dθ | X = x∗) ) = 0.

Remark 3.3. Contrary to the standard convergence proofs that rely on the Lebesgue differ-
entiation theorem (Barber et al., 2015; Biau et al., 2015; Prangle, 2017), we establish the con-
vergence of the ABC posterior by leveraging martingale techniques. To the best of the authors’
knowledge, this represents the first convergence proof for ABC that employs a martingale-based
method (specifically, leveraging Lévy’s 0–1 law).

4 Empirical Evaluation
In this section, we present extensive empirical evaluations demonstrating the efficacy of ABI
across a broad range of simulation scenarios. We benchmark the performance of ABI against
four widely used alternative methods: ABC with the 2-Wasserstein distance (WABC; see Bern-
ton et al. 20192 ), ABC with automated neural summary statistic (ABC-SS; see Jiang et al.
2017), Sequential Neural Likelihood Approximation (SNLE; see Papamakarios et al. 2019), and
Sequential Neural Posterior Approximation (SNPE; see Greenberg et al. 2019). For SNLE and
SNPE, we employ the implementations provided by the Python SBI package (Tejero-Cantero
et al., 2020). In the Multimodal Gaussian example (Section 4.1), we additionally compare ABI
against the Wasserstein generative adversarial network with gradient penalty (WGAN-GP; see
Gulrajani et al. 2017). We summarize the key characteristics of each method below:

2
We use the implementation available at https://github.com/pierrejacob/winference

Compatibility ABI WABC ABC-SS SNLE SNPE
Intractable Prior Yes No Yes No No
Intractable Likelihood Yes Yes Yes Yes Yes
ABC-based Yes Yes Yes No No

Table 1: Compatibility comparison of different inference methods. ABI provides full compatibility with
both intractable priors and intractable likelihood, offering enhanced modeling flexibility.

4.1 Multimodal Gaussian Model with Complex Posterior


For the initial example, we consider a model commonly employed in likelihood-free inference (see
Papamakarios et al., 2019; Wang and Ročková, 2022), which exposes the intrinsic fragility of
traditional ABC methods even in a seemingly simple scenario. In this setup, θ is a 5-dimensional
vector drawn according to Unif(−3, 3); for each θ ∈ R5 , we observe four i.i.d. sets of bivariate
Gaussian samples X, where the mean and covariance of these samples are determined by θ.
For simplicity, we will subsequently treat X as a flattened 8-dimensional vector. The forward
sampling model is defined as follows:

θk ∼ Unif(−3, 3), k = 1, . . . , 5,

µθ = (θ1, θ2)⊤,
s1 = θ3²,  s2 = θ4²,  ρ = tanh(θ5),
Σθ = ( s1²       ρ s1 s2
       ρ s1 s2   s2²     ),
Xj | θ ∼ N(µθ, Σθ),   j = 1, . . . , 4.

Despite its structural simplicity, this model yields a complex posterior distribution characterized
by truncated support and four distinct modes that arise from the inherent unidentifiability of
the signs of θ3 and θ4 .
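
For concreteness, a minimal NumPy sketch of this forward sampler is given below; it is an illustrative reconstruction, and the function and variable names are not from the original implementation.

```python
import numpy as np

def simulate_multimodal_gaussian(theta, rng):
    """Four i.i.d. bivariate Gaussian draws, returned as a flat 8-dimensional vector."""
    mu = np.array([theta[0], theta[1]])
    s1, s2, rho = theta[2] ** 2, theta[3] ** 2, np.tanh(theta[4])
    cov = np.array([[s1 ** 2, rho * s1 * s2],
                    [rho * s1 * s2, s2 ** 2]])
    return rng.multivariate_normal(mu, cov, size=4).reshape(-1)

rng = np.random.default_rng(0)
theta = rng.uniform(-3, 3, size=5)   # draw from the Unif(-3, 3)^5 prior
x = simulate_multimodal_gaussian(theta, rng)
```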
We implemented ABI with two sequential iterations using adaptively selected thresholds. The
MSW distance was evaluated using 10 quantiles and five SW slices. For SNLE and SNPE, we
similarly conducted two-round sequential inference. To ensure fair comparison, we calibrated
the training budget for ABC-SS3 , WABC, and WGAN to match the total number of samples
utilized for training across both ABI iterations.
Figures 1 and 2 present comparative analyses of posterior marginal distributions generated
by ABI (in red) and alternative inference methods, with the true posterior distribution (shown
in black) obtained via the No-U-Turn Sampler (NUTS) implemented in rstan using 10 MCMC
chains. We illustrate the evolution of the ABI posterior over iterations in Figure 3. Notably,
ABC-SS produces a predominantly unimodal posterior distribution centered around the poste-
3
Since both ABI and ABC-SS are rejection-ABC-based, we applied the same adaptive rejection quantile thresh-
olds for ABI (iteration 1) and ABC-SS.

Figure 1: Comparison of approximate posterior densities obtained from ABI and alternative
benchmark methods under the Multimodal Gaussian model. The true posterior is shown in
the black dashed line. ABI produced posteriors that accurately align with the true posterior
distribution.

rior mean, illustrating its fundamental limitation of yielding only first-order sufficient statistics
(i.e., mean-matching) in the asymptotic regime with vanishing tolerance. WGAN partially cap-
tures the bimodality of θ3 and θ4 , yet produces posterior samples that significantly deviate from
the true distribution. The parameter θ5 poses the greatest challenge for accurate estimation
across methods. Overall, ABI generates samples that closely approximate the true posterior
distribution across all parameters.

Figure 2: Comparison of marginal posteriors generated by ABI and WGAN-GP. The true pos-
terior is shown in the black dashed line.

Figure 3: Evolution of the sample path over successive iterations of ABI.

Table 2 presents a quantitative comparison using multiple metrics: maximum mean discrep-
ancy with Gaussian kernel, empirical 1-Wasserstein distance4 , bias in posterior mean (measured
as absolute difference between posterior distributions), and bias in posterior correlation (calcu-
4
The W1 distance is computed using the Python Optimal Transport (POT) package.

lated as summed absolute deviation between empirical correlation matrices). ABI consistently
demonstrates superior performance across the majority of evaluation criteria (with the exception
of parameters θ2 and θ5).

Evaluation Metric (Parameter)       ABI     WABC    ABC-SS   SNLE    SNPE    WGAN

MMD                                 0.466   0.592   0.573    0.511   0.536   0.514
1-Wasserstein                       0.609   2.738   1.663    0.912   1.126   1.079
Bias (Posterior Mean)   θ1          0.001   0.383   0.18     0.033   0.363   0.039
                        θ2          0.030   0.193   0.103    0.001   0.112   0.049
                        θ3          0.016   0.058   0.076    0.24    0.192   0.030
                        θ4          0.006   0.022   0.007    0.084   0.012   0.143
                        θ5          0.137   0.345   0.642    0.078   0.361   0.087
Bias (Posterior Corr.)              0.881   1.776   1.382    1.146   1.094   2.340

Table 2: Comparative performance evaluation of inference methodologies. Lower values indicate superior
accuracy; best results are highlighted in bold.
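
For reference, the two distributional metrics in Table 2 can be computed along the following lines, using the POT package for the empirical 1-Wasserstein distance and a Gaussian-kernel MMD estimator; the bandwidth and estimator details shown here are illustrative choices rather than the exact evaluation script used for the table.

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

def w1_distance(samples_a, samples_b):
    """Empirical 1-Wasserstein distance between two equally weighted samples."""
    a = np.full(len(samples_a), 1.0 / len(samples_a))
    b = np.full(len(samples_b), 1.0 / len(samples_b))
    cost = ot.dist(samples_a, samples_b, metric='euclidean')
    return ot.emd2(a, b, cost)

def mmd2_gaussian(samples_a, samples_b, bandwidth=1.0):
    """Squared MMD with a Gaussian kernel (biased V-statistic estimate)."""
    def kernel(x, y):
        sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * bandwidth ** 2))
    return (kernel(samples_a, samples_a).mean()
            + kernel(samples_b, samples_b).mean()
            - 2 * kernel(samples_a, samples_b).mean())
```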

4.2 M/G/1 Queuing Model


We now turn to the M/G/1 queuing model (Fearnhead and Prangle, 2012; Bernton et al., 2019).
This system illustrates a setting where, despite dependencies among observations, the model
parameters remain identifiable from the marginal distribution of the data.
In this model, customers arrive at a single server with interarrival times Wi ∼ Exp(θ3)
(where θ3 represents the arrival rate), and the service times are modeled as Ui ∼ Unif(θ1, θ2). Rather
than observing Wi and Ui directly, we record only the interdeparture times, defined through the
following relationships:
Vi = Σ_{j=1}^{i} Wj   (arrival time of the i-th customer),
Xi = Σ_{j=1}^{i} Yj,   X0 ≡ 0   (departure time of the i-th customer),
Yi = Ui + max{ 0, Σ_{j=1}^{i} Wj − Σ_{j=1}^{i−1} Yj } = Ui + max(0, Vi − Xi−1).

We assume that the queue is initially empty before the first customer arrives. We assign the
following truncated prior distributions:

θ1, θ2 ∼ Unif(0, 10),   θ3 ∼ Unif(0, 1/3).
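
A minimal NumPy sketch of the forward simulator implied by the recurrences above is given below; it is illustrative only, and the default arguments are ours.

```python
import numpy as np

def simulate_mg1(theta1, theta2, theta3, n=50, rng=None):
    """Interdeparture times Y_1..Y_n of the M/G/1 queue:
    U_i ~ Unif(theta1, theta2) service, W_i ~ Exp(theta3) interarrival,
    Y_i = U_i + max(0, V_i - X_{i-1}) with V_i, X_i the arrival/departure times."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.exponential(scale=1.0 / theta3, size=n)   # interarrival times
    u = rng.uniform(theta1, theta2, size=n)           # service times
    v = np.cumsum(w)                                  # arrival times V_i
    y = np.empty(n)
    x_prev = 0.0                                      # departure time X_{i-1}, with X_0 = 0
    for i in range(n):
        y[i] = u[i] + max(0.0, v[i] - x_prev)
        x_prev += y[i]                                # X_i = X_{i-1} + Y_i
    return y

y_sim = simulate_mg1(4.0, 7.0, 0.15)   # e.g. theta1 = 4, theta2 - theta1 = 3, theta3 = 0.15
```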




For our analysis, we use the dataset from Shestopaloff and Neal (2014)5 , which was simulated
with true parameter values (θ1 , θ2 − θ1 , θ3 ) = (4, 3, 0.15) and consists of n = 50 observations.
Sequential versions of all algorithms were executed using four iterations with 10,000 training
samples per iteration. As in the first example, to ensure fair comparison, we allocated an equal
total number of training samples to WABC and ABC-SS as provided to ABI, SNLE, and SNPE.
Figure 4 presents the posterior distributions for parameters θ1 , θ2 − θ1 , and θ3 , with the true
posterior mean indicated by a black dashed line. The results demonstrate that the approximate
posteriors produced by ABI (in red) not only align most accurately with the true posterior
means, but also concentrate tightly around them.

Figure 4: Comparison of approximate posterior densities under the M/G/1 queuing example.
The dashed black line indicates the true posterior mean at (3.96, 2.99, 0.177). ABI outperforms
alternative approaches and exhibits superior alignment with the true posterior mean.

Figure 5: 30 trajectories simulated from the cosine model with parameter values ω ∗ = 1/80,
ϕ∗ = π/4, log(σ ∗ ) = 0, and log(A∗ ) = log(2).

4.3 Cosine Model


In the third demonstration, we examine the cosine model (Bernton et al., 2019), defined as:

Yt = A cos(2πωt + ϕ) + σϵt , ϵt ∼ N (0, 1) for t ≥ 1,


5
This corresponds to the Intermediate dataset, with true posterior means provided in Table 4 of Shestopaloff
and Neal (2014).

with prior distributions:

ω ∼ Unif[0, 1/10], ϕ ∼ Unif[0, 2π], log(σ) ∼ N (0, 1), log(A) ∼ N (0, 1).

Posterior inference for these parameters is challenging because information about ω and ϕ
is substantially obscured in the marginal empirical distribution of observations (Y1 , . . . , Yn ).
The observed data was generated with t = 100 time steps using parameter values ω ∗ = 1/80,
ϕ∗ = π/4, log(σ ∗ ) = 0, and log(A∗ ) = log(2); 30 example trajectories are displayed in Figure 5.
The exact posterior distribution was obtained using the NUTS algorithm. Sequential algorithms
(ABI, SNLE, SNPE) were executed with three iterations, each utilizing 5,000 training samples,
while WABC6 and ABC-SS were trained with a total of 15,000 simulations.
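
A minimal forward simulator for this model can be sketched as follows (illustrative only):

```python
import numpy as np

def simulate_cosine(omega, phi, log_sigma, log_A, T=100, rng=None):
    """Simulate Y_t = A cos(2*pi*omega*t + phi) + sigma*eps_t for t = 1..T."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(1, T + 1)
    A, sigma = np.exp(log_A), np.exp(log_sigma)
    return A * np.cos(2 * np.pi * omega * t + phi) + sigma * rng.standard_normal(T)

# one trajectory at the data-generating values used in the text
y_sim = simulate_cosine(1 / 80, np.pi / 4, 0.0, np.log(2.0), T=100)
```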

Figure 6: Comparison of approximate posterior densities under the cosine model. The true
posterior density is displayed as a dashed black line. Among all methods, ABI achieves the most
accurate representation of the true posterior.

Figure 7: Approximate posterior densities generated by ABI under the cosine model.

Figure 6 compares the approximate posterior distributions obtained from ABI and alternative
methods, with the true posterior shown in black. Among these parameters, ω indeed proves to
be the most difficult. We observe that ABI again yields the most satisfactory approximation
across all four parameters. For clarity, we additionally provide a direct comparison between the
ABI-generated posterior and the true posterior in Figure 7.
6
Wasserstein distance was computed by treating each dataset as a flattened vector containing 100 independent
one-dimensional observations.

4.4 Lotka-Volterra Model
The final simulation example investigates the Lotka-Volterra (LV) model (Din, 2013), which
involves a pair of first-order nonlinear differential equations. This system is frequently used to
describe the dynamics of biological systems involving interactions between two species: a preda-
tor population (represented by yt ) and a prey population (represented by xt ). The populations
evolve deterministically according to the following set of equations:

dx/dt = αx − βxy,    dy/dt = −γy + δxy.
The changes in population states are governed by four parameters (α, β, γ, δ) controlling the
ecological processes: prey growth rate (α), predation rate (β), predator growth rate (δ), and
predator death rate (γ).

Figure 8: 30 trajectories sampled from the Lotka-Volterra model, each corresponding to one of
three distinct parameter configurations (α∗ , β ∗ , γ ∗ , δ ∗ ).

Due to the absence of a closed-form transition density, inference in LV models using tra-
ditional methods presents significant challenges; however, these systems are particularly well-
suited for ABC approaches since they allow for efficient generation of simulated datasets. The
dynamics can be simulated using a discrete-time Markov jump process according to the Gillespie
algorithm (Gillespie, 1976). At time t, we evaluate the following rates,

rα(t) = αxt,   rβ(t) = βxtyt,   rγ(t) = γyt,   rδ(t) = δxtyt,
r∆(t) := rα(t) + rβ(t) + rγ(t) + rδ(t).

The algorithm first samples the waiting time until the next event from an exponential distri-
bution with parameter r∆ (t), then selects one of the four possible events (prey death/birth,
predator death/birth) with probability proportional to its own rate r· (t). We choose the trun-
cated uniform prior over the restricted domain [0, 1]×[0, 0.1]×[0, 2]×[0, 0.1]. The true parameter
values are (α∗ , β ∗ , γ ∗ , δ ∗ ) = (0.5, 0.01, 1.0, 0.01) (Papamakarios et al., 2019). Posterior estima-
tion in this case is particularly difficult because the likelihood surface contains concentrated
probability mass in isolated, narrow regions throughout the parameter domain. We initialized

the populations at time t = 0 with (x0 , y0 ) = (50, 100) and recorded system states at 0.1 time
unit intervals over a duration of 10 time units, yielding a total of 101 observations as illustrated
in Figure 8.
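
The following NumPy sketch illustrates one way to implement the Gillespie simulation described above; the event bookkeeping follows the standard Lotka-Volterra jump-process convention and is not taken from the original code.

```python
import numpy as np

def gillespie_lv(alpha, beta, gamma, delta, x0=50, y0=100,
                 t_end=10.0, dt_record=0.1, max_events=500_000, rng=None):
    """Gillespie simulation of the stochastic Lotka-Volterra model, recording
    (prey, predator) counts on a regular time grid."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.arange(0.0, t_end + 1e-9, dt_record)
    states = np.zeros((len(grid), 2), dtype=int)
    x, y, t, k = x0, y0, 0.0, 0
    for _ in range(max_events):                 # event cap keeps the sketch bounded
        rates = np.array([alpha * x, beta * x * y, gamma * y, delta * x * y])
        total = rates.sum()
        t_next = t + rng.exponential(1.0 / total) if total > 0 else np.inf
        while k < len(grid) and grid[k] <= t_next:
            states[k] = (x, y)                  # hold the current state until the next event
            k += 1
        if k == len(grid) or not np.isfinite(t_next):
            break
        t = t_next
        event = rng.choice(4, p=rates / total)  # pick an event proportional to its rate
        if event == 0:
            x += 1      # prey birth
        elif event == 1:
            x -= 1      # prey death via predation
        elif event == 2:
            y -= 1      # predator death
        else:
            y += 1      # predator birth
    return grid, states

grid, traj = gillespie_lv(0.5, 0.01, 1.0, 0.01)   # true parameter values from the text
```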
We carried out sequential algorithms across two rounds, with each round utilizing 5,000
training samples. For fair comparison, WABC and ABC-SS were provided with a total of 10,000
training samples. The resulting posterior approximations are presented in Figure 9, with black
dashed lines marking the true parameter values, as the true posterior distributions are not
available for this model. The results again demonstrate that ABI consistently provides accurate
approximations across all four parameters of the model.

Figure 9: Comparison of approximate posterior distributions for the Lotka-Volterra model. True
parameter values are indicated by the dashed black lines.

5 Discussion
In this work, we introduce the Adaptive Bayesian Inference (ABI) framework, which shifts the
focus of approximate Bayesian computation from data-space discrepancies to direct compar-
isons in posterior space. Our approach leverages a novel Marginally-augmented Sliced Wasser-
stein (MSW) distance—an integral probability metric defined on posterior measures that com-
bines coordinate-wise marginals with random one-dimensional projections. We then establish
a quantile-representation of MSW that reduces complex posterior comparisons to a tractable
distributional regression task. Moreover, we propose an adaptive rejection-sampling scheme in
which each iteration’s proposal is updated via generative modeling of the accepted parame-
ters in the preceding iteration. The generative modeling–based proposal updates allow ABI to
perform posterior approximation without explicit prior density evaluation, overcoming a key
limitation of sequential Monte Carlo, population Monte Carlo, and neural density estimation
approaches. Our theoretical analysis shows that MSW retains the topological guarantees of the
classical Wasserstein metric, including the ability to metrize weak convergence, while achieving
parametric convergence rates in the trimmed setting when p = 1. The martingale-based proof of
sequential convergence offers an alternative to existing Lebesgue differentiation arguments and
may find broader application in the analysis of other adaptive algorithms.
Empirically, ABI delivers substantially more accurate posterior approximations than Wasser-
stein ABC and summary-based ABC, as well as state-of-the-art likelihood-free simulators and
Wasserstein GAN. Through a variety of simulation experiments, we have shown that the poste-

rior MSW distance remains robust under small observed sample sizes, intricate dependency struc-
tures, and non-identifiability of parameters. Furthermore, our conditional quantile-regression
implementation exhibits stability to network initialization and requires minimal tuning.
Several avenues for future research arise from this work. One promising direction is to inte-
grate our kernel-statistic approach into Sequential Monte Carlo algorithms to improve efficiency
in likelihood-free settings. Future work could also apply ABI to large-scale scientific simulators,
such as those used in systems biology, climate modeling, and cosmology, to spur domain-specific
adaptations of the posterior-matching paradigm.
In summary, ABI offers a new perspective on approximate Bayesian computation as well as
likelihood-free inference by treating the posterior distribution itself as the primary object of
comparison. By combining a novel posterior space metric, quantile-regression–based estimation,
and generative-model–driven sequential refinement, ABI significantly outperforms alternative
ABC and likelihood-free methods. Moreover, this posterior-matching viewpoint may catalyze
further advances in approximate Bayesian computation and open new avenues for inference in
complex, simulator-based models.

Acknowledgements
The authors would like to thank Naoki Awaya, X.Y. Han, Iain Johnstone, Tengyuan Liang,
Art Owen, Robert Tibshirani, John Cherian, Michael Howes, Tim Sudijono, Julie Zhang, and
Chenyang Zhong for their valuable discussions and insightful comments. The authors would
like to especially acknowledge Michael Howes and Chenyang Zhong for their proofreading of
the technical results. W.S.L. gratefully acknowledges support from the Stanford Data Science
Scholarship and the Two Sigma Graduate Fellowship Fund during this research. W.H.W.'s
research was partially supported by NSF grant 2310788.

References
Justin Alsing, Benjamin Wandelt, and Stephen Feeney. Massive optimal data compression and
density estimation for scalable, likelihood-free inference in cosmology. Monthly Notices of the
Royal Astronomical Society, 477(3):2874–2885, 2018.

Pedro César Alvarez-Esteban, Eustasio Del Barrio, Juan Antonio Cuesta-Albertos, and Carlos
Matran. Trimmed comparison of distributions. Journal of the American Statistical Associa-
tion, 103(482):697–704, 2008.

Stuart Barber, Jochen Voss, and Mark Webster. The rate of convergence for approximate
bayesian computation. Electronic Journal of Statistics, 2015.

Mark A Beaumont, Jean-Marie Cornuet, Jean-Michel Marin, and Christian P Robert. Adaptive
approximate bayesian computation. Biometrika, 96(4):983–990, 2009.

James O Berger and Robert L Wolpert. The likelihood principle. IMS, 1988.

Espen Bernton, Pierre E Jacob, Mathieu Gerber, and Christian P Robert. Approximate bayesian
computation with the wasserstein distance. Journal of the Royal Statistical Society Series B:
Statistical Methodology, 81(2):235–269, 2019.

Gérard Biau, Frédéric Cérou, and Arnaud Guyader. New insights into approximate bayesian
computation. In Annales de l’IHP Probabilités et statistiques, volume 51, pages 376–403, 2015.

Fernando V Bonassi and Mike West. Sequential monte carlo with adaptive weights for approxi-
mate bayesian computation. Bayesian Analysis, 2015.

Nicolas Bonnotte. Unidimensional and evolution methods for optimal transportation. PhD thesis,
Université Paris Sud-Paris XI; Scuola normale superiore (Pise, Italie), 2013.

Ewan Cameron and AN Pettitt. Approximate bayesian computation for astronomical model
analysis: a case study in galaxy demographics and morphological transformation at high
redshift. Monthly Notices of the Royal Astronomical Society, 425(1):44–65, 2012.

Neel Chatterjee, Somya Sharma, Sarah Swisher, and Snigdhansu Chatterjee. Approximate
bayesian computation for physical inverse modeling. arXiv preprint arXiv:2111.13296, 2021.

Sourav Chatterjee, Trevor Hastie, and Robert Tibshirani. Univariate-guided sparse regression.
arXiv preprint arXiv:2501.18360, 2025.

Manuel Chiachío-Ruano, Juan Chiachío-Ruano, and María L. Jalón. Solving inverse problems
by approximate bayesian computation. In Bayesian Inverse Problems. 2021.

Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for
distributional reinforcement learning. In International conference on machine learning, pages
1096–1105. PMLR, 2018.

Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. An adaptive sequential monte carlo method
for approximate bayesian computation. Statistics and computing, 22:1009–1020, 2012.

Qamar Din. Dynamics of a discrete lotka-volterra model. Advances in Difference Equations,


2013:1–13, 2013.

Christopher Drovandi, David J Nott, and David T Frazier. Improving the accuracy of marginal
approximations in likelihood-free inference via localization. Journal of Computational and
Graphical Statistics, 33(1):101–111, 2024.

Paul Fearnhead and Dennis Prangle. Constructing abc summary statistics: semi-automatic abc.
Nature Precedings, pages 1–1, 2011.

Paul Fearnhead and Dennis Prangle. Constructing summary statistics for approximate bayesian
computation: semi-automatic approximate bayesian computation. Journal of the Royal Sta-
tistical Society Series B: Statistical Methodology, 74(3):419–474, 2012.

David T Frazier. Robust and efficient approximate bayesian computation: A minimum distance
approach. arXiv preprint arXiv:2006.14126, 2020.

Malay Ghosh. Exponential tail bounds for chisquared random variables. Journal of Statistical
Theory and Practice, 15(2):35, 2021.

Daniel T Gillespie. A general method for numerically simulating the stochastic time evolution
of coupled chemical reactions. Journal of computational physics, 22(4):403–434, 1976.

VP Godambe. Bayesian sufficiency in survey-sampling. Annals of the Institute of Statistical


Mathematics, 20(1):363–373, 1968.

David Greenberg, Marcel Nonnenmacher, and Jakob Macke. Automatic posterior transformation
for likelihood-free inference. In International conference on machine learning, pages 2404–
2414. PMLR, 2019.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville.
Improved training of wasserstein gans. Advances in neural information processing systems,
30, 2017.

Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statis-
tics, 35(1):73–101, 1964.

Bai Jiang. Approximate bayesian computation with kullback-leibler divergence as data discrep-
ancy. In International conference on artificial intelligence and statistics, pages 1711–1721.
PMLR, 2018.

Bai Jiang, Tung-yu Wu, Charles Zheng, and Wing H Wong. Learning summary statistic for
approximate bayesian computation via deep neural network. Statistica Sinica, pages 1595–
1618, 2017.

Olav Kallenberg. Foundations of modern probability, volume 2. Springer, 1997.

Jungeum Kim, Percy S Zhai, and Veronika Ročková. Deep generative quantile bayes. arXiv
preprint arXiv:2410.08378, 2024.

Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: journal of the
Econometric Society, pages 33–50, 1978.

Sirio Legramanti, Daniele Durante, and Pierre Alquier. Concentration of discrepancy-based abc
via rademacher complexity. arXiv preprint arXiv:2206.06991, 2022.

Wenhui Sophia Lu, Chenyang Zhong, and Wing Hung Wong. Efficient generative modeling via
penalized optimal transport network. arXiv preprint arXiv:2402.10456v2, 2025.

Tudor Manole, Sivaraman Balakrishnan, and Larry Wasserman. Minimax confidence intervals
for the sliced wasserstein distance. Electronic Journal of Statistics, 16(1):2252–2345, 2022.

Jean-Michel Marin, Pierre Pudlo, Christian P Robert, and Robin J Ryder. Approximate bayesian
computational methods. Statistics and computing, 22(6):1167–1180, 2012.

Henry Markram, Eilif Muller, Srikanth Ramaswamy, Michael W Reimann, Marwan Abdellah,
Carlos Aguado Sanchez, Anastasia Ailamaki, Lidia Alonso-Nanclares, Nicolas Antille, Selim
Arsever, et al. Reconstruction and simulation of neocortical microcircuitry. Cell, 163(2):
456–492, 2015.

Fernando A Moala and Anthony O’Hagan. Elicitation of multivariate prior distributions: A


nonparametric bayesian approach. Journal of Statistical Planning and Inference, 140(7):1635–
1655, 2010.

Kimia Nadjahi, Alain Durmus, Umut Simsekli, and Roland Badeau. Asymptotic guarantees for
learning generative models with the sliced-wasserstein distance. Advances in Neural Informa-
tion Processing Systems, 32, 2019.

Oscar Hernan Madrid Padilla, Wesley Tansey, and Yanzhen Chen. Quantile regression with relu
networks: Estimators and minimax rates. Journal of Machine Learning Research, 23(247):
1–42, 2022.

George Papamakarios and Iain Murray. Fast ε-free inference of simulation models with bayesian
conditional density estimation. Advances in neural information processing systems, 29, 2016.

George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: Fast
likelihood-free inference with autoregressive flows. In The 22nd international conference on
artificial intelligence and statistics, pages 837–848. PMLR, 2019.

Nicholas G Polson and Vadim Sokolov. Generative ai for bayesian computation. arXiv preprint
arXiv:2305.14972, 2023.

Dennis Prangle. Adapting the abc distance function. Bayesian Analysis, 2017.

Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its
application to texture mixing. In Scale Space and Variational Methods in Computer Vision:
Third International Conference, SSVM 2011, Ein-Gedi, Israel, May 29–June 2, 2011, Revised
Selected Papers 3, pages 435–446. Springer, 2012.

Alexander Y Shestopaloff and Radford M Neal. On bayesian inference for the m/g/1 queue with
efficient mcmc sampling. arXiv preprint arXiv:1401.5548, 2014.

Michel Talagrand. The transportation cost from the uniform measure to the empirical measure
in dimension ≥ 3. The Annals of Probability, pages 919–959, 1994.

Simon Tavaré. On the history of abc. In Handbook of Approximate Bayesian Computation,


pages 55–69. Chapman and Hall/CRC, 2018.

Alvaro Tejero-Cantero, Jan Boelts, Michael Deistler, Jan-Matthis Lueckmann, Conor Durkan,
Pedro J. Gonçalves, David S. Greenberg, and Jakob H. Macke. sbi: A toolkit for simulation-
based inference. Journal of Open Source Software, 5(52):2505, 2020. doi: 10.21105/joss.02505.
URL https://doi.org/10.21105/joss.02505.

Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.

Yuexi Wang and Veronika Ročková. Adversarial bayesian simulation. arXiv preprint
arXiv:2208.12113, 2022.

Simon N Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature,
466(7310):1102–1104, 2010.

Yang Zeng, Hu Wang, Shuai Zhang, Yong Cai, and Enying Li. A novel adaptive approximate
bayesian computation method for inverse heat conduction problem. International Journal of
Heat and Mass Transfer, 134:185–197, 2019.

Xingyu Zhou, Yuling Jiao, Jin Liu, and Jian Huang. A deep generative approach to conditional
sampling. Journal of the American Statistical Association, 118(543):1837–1848, 2023.

Appendix
The Appendix section contains detailed proofs of theoretical results, supplementary results and
descriptions on the simulation setup, as well as further remarks.

Appendix A Marginally-augmented Sliced Wasserstein (MSW)


Distance
In this section, we further discuss the Marginally-augmented Sliced Wasserstein (MSW) distance.
We begin by presenting the formal definition of the untrimmed MSW distance.
Definition A.1. Let p ≥ 1 and µ, ν ∈ Pp (Rd ) with d ≥ 1. The Marginally-augmented Sliced
Wasserstein (MSW) distance between µ and ν is defined as

MSWp(µ, ν) = λ · (1/d) Σ_{j=1}^{d} Wp((ej)#µ, (ej)#ν) + (1 − λ) Eφ∼σ[ Wp^p(φ#µ, φ#ν) ]^{1/p},   (A.1)

(the first term is the marginal augmentation and the second is the Sliced Wasserstein distance)

where λ ∈ (0, 1) is a mixing parameter, σ(·) denotes the uniform probability measure on the
unit sphere Sd−1 , and (ej )# represents the pushforward of a measure under projection onto the
j-th coordinate axis.
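
For intuition, a plain Monte Carlo estimator of the untrimmed MSW distance between two equal-sized samples can be sketched as follows; this illustrative snippet exploits the closed form of the one-dimensional Wasserstein distance between sorted projections and is not the estimator used elsewhere in the paper.

```python
import numpy as np

def empirical_msw(samples_mu, samples_nu, p=1, lam=0.5, n_proj=50, rng=None):
    """Monte Carlo estimate of the untrimmed MSW distance between two empirical
    measures with the same number of points in R^d."""
    rng = np.random.default_rng() if rng is None else rng
    d = samples_mu.shape[1]

    def w_p_1d(a, b):
        # 1-D p-Wasserstein distance via sorted (quantile) matching, equal sample sizes
        return np.mean(np.abs(np.sort(a) - np.sort(b)) ** p) ** (1.0 / p)

    # marginal augmentation: coordinate-wise 1-D Wasserstein distances
    marginal = np.mean([w_p_1d(samples_mu[:, j], samples_nu[:, j]) for j in range(d)])

    # sliced term: random directions drawn uniformly from the unit sphere
    phis = rng.standard_normal((n_proj, d))
    phis /= np.linalg.norm(phis, axis=1, keepdims=True)
    sliced = np.mean([w_p_1d(samples_mu @ phi, samples_nu @ phi) ** p
                      for phi in phis]) ** (1.0 / p)

    return lam * marginal + (1.0 - lam) * sliced
```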

Appendix B Curse of Dimensionality in Rejection-ABC


In practice, observed data often involve numerous covariates, nonexchangeable samples, or de-
pendencies between samples, resulting in a high-dimensional sample space. Examples of such
scenarios include the Lotka-Volterra model for modeling predator-prey interactions, the M/G/1
queuing model, and astronomical models of high-redshift galaxy morphology (Cameron and Pet-
titt, 2012; Bernton et al., 2019). However, it is well known that distance metrics such as the
Euclidean distance become highly unreliable in high dimensions, as observations concentrate
near the hypersphere—a phenomenon called the curse of dimensionality. In the following exam-
ple, we demonstrate that even when the likelihood has fast decay, the number of samples needed
to obtain an observation within an ϵ-ball around x∗ grows exponentially with the dimensionality
of the observation space.
Example B.1 (High-dimensional Gaussian). For an illustrative example, consider the case
when the prior distribution is uninformative, i.e.,

θ ∈ Rd ∼ Unif[−1, 1]d ,
X|θ ∈ Rn ∼ N (Hθ, σ 2 In ),

where H ∈ Rn×d is a matrix that maps from the low-dimensional parameter space to the high-
dimensional observation space.
Lemma B.1 (Curse of dimensionality). The expected number of samples needed to produce a

draw within the ϵ-ball centered at x∗ grows as Ω(exp(n)) for 0 < ϵ < 1/2(n + ∥Hθ − x∗ ∥22 /σ 2 )1/2 .

Proof. Since X|θ ∼ N (Hθ, σ 2 In ), we have X − x∗ | θ ∼ N (Hθ − x∗ , σ 2 In ). Define ∆ :=


∥Hθ − x∗ ∥2 , so ∥X − x∗ ∥22 | θ ∼ σ 2 χ2n (ζ) with noncentrality parameter ζ = ∆2 /σ 2 . Therefore,
the expected distance is given by:
E[ σ−2∥X − x∗∥₂² ] = n + σ−2( ∥Hθ∥₂² + ∥x∗∥₂² − 2(x∗)⊤Hθ ) = n + ∆²/σ²,

which scales linearly in n as the dimension n → ∞. Let Y ∼ χ2n (ζ), then for 0 < c < n + ζ =
n + ∆2 /σ 2 , by Theorem 4 of Ghosh (2021) we have,
P(Y < n + ζ − c) ≤ exp( (n/2)( c/(n + 2ζ) + log(1 − c/(n + 2ζ)) ) ) ≤ exp( −nc²/(4(n + 2ζ)²) ).

For 0 < ϵ < (1/2)(n + ζ)^{1/2}, where ζ = ∥Hθ − x∗∥₂²/σ², we set c := n + ζ − ϵ² > 0. Applying the
inequality above yields

P( σ−2∥X − x∗∥₂² < ϵ² | θ ) ≤ exp( −n(n + ζ − ϵ²)²/(4(n + 2ζ)²) ) = O(exp(−n)).
Therefore, the expected number of samples needed to produce a draw within the ϵ-ball centered
at x∗ is Ω(exp(n)); specifically, the number of required samples grows exponentially with the
dimension n.
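
A quick numerical illustration of this exponential decay (not part of the formal argument) is given below: even in the most favorable case Hθ = x∗, the probability of landing in a fixed ϵ-ball around x∗ collapses rapidly as the data dimension n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def acceptance_rate(n, eps=1.0, sigma=1.0, n_sim=200_000):
    """Fraction of simulated X ~ N(x_star, sigma^2 I_n) falling in the eps-ball
    around x_star (the best case Htheta = x_star)."""
    x = sigma * rng.standard_normal((n_sim, n))   # X - x_star
    return np.mean(np.sum(x ** 2, axis=1) < eps ** 2)

for n in (2, 5, 10, 20):
    print(n, acceptance_rate(n))   # acceptance probability collapses as n grows
```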

Lemma B.2 (High-dimensional Bounded Density). Let X = (X1 , . . . , Xn ) ∈ Rn be a random


vector with a joint density function f (x1 , . . . , xn ) that is bounded by K > 0. Then, for any ϵ > 0,
there exists a constant C > 0, independent of n, such that

P(∥X∥ ≤ ϵ) ≤ Cϵn .

Consequently, if we use an ABC-acceptance region of radius ϵ around the observed x∗, the
expected number of simulations needed to obtain a single accepted draw is at least 1/(Cϵⁿ), which
grows exponentially in the dimension n.

Proof. We begin by bounding the probability using the volume of an n-dimensional ball:
P(∥X∥ ≤ ϵ) = ∫_{∥x∥≤ϵ} f(x) dx ≤ ∫_{∥x∥≤ϵ} K dx = K Vol(Bϵⁿ) = Kπ^{n/2} ϵⁿ / Γ(n/2 + 1),

where Brn denotes a ball of radius r in Rn . To analyze the behavior of this bound as n increases,

we apply Stirling’s approximation:

π^{n/2} / Γ(n/2 + 1) ∼ π^{n/2} / ( √(πn) (n/(2e))^{n/2} ) = (1/√(πn)) (2πe/n)^{n/2} → 0   as n → ∞.
Therefore, we can define a constant C, independent of n, as

C := sup_{n≥1} Kπ^{n/2} / Γ(n/2 + 1) < ∞.

The inequality then follows as claimed for all n ≥ 1.

Appendix C Adaptive Rejection Sampling


C.1 Warmup: Adaptive Bayesian Inference for the Univariate Gaussian Model
As a simple illustration of our sequential procedure, consider the conjugate Gaussian–Gaussian
model:

θ ∼ N (0, 20), X | θ ∼ N (θ, 1).

The prior on θ is diffuse, while X | θ concentrates around the true parameter.


We run ABI for T iterations with a decreasing tolerance sequence ϵ1 > ϵ2 > · · · > ϵT . In this
univariate setting, we use the Euclidean distance, and write Br (x) = {y : |y − x| ≤ r}. Initialize
the proposal as π (0) (θ) := π(θ). At iteration t, we draw N synthetic parameter–data pairs from
the current proposal π (t) (θ) := π(θ | Et−1 ):
1. Draw θ ∼ π(t)(θ);
2. Draw X | θ ∼ Pθ(n)(· | Et−1) by rejection sampling using Algorithm 2.
This yields the selected set, whose distribution is π(θ, X | Et−1 ), which we denote by π (t) (θ, X):
S0(t) = {(θ(i), X(i))}_{i=1}^{N}.

We then retain those θ(i) whose associated X (i) falls within Bϵt (x∗ ):
Sθ,∗(t) = { θ(i) : (θ(i), X(i)) ∈ S0(t) and X(i) ∈ Bϵt(x∗) }.


Let θ̄∗(t) and (σ̂∗(t))² denote the sample mean and variance of the set of retained samples Sθ,∗(t).
Since a Gaussian distribution is fully determined by its mean and variance, the proposal distribution
for the next iteration is updated to be

θ ∼ N( θ̄∗(t), (σ̂∗(t))² ).
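
A compact NumPy sketch of this warmup procedure is given below; for simplicity it draws X unconditionally at each iteration rather than via the ARS conditioning on Et−1, so it illustrates the moment-matching refinement rather than faithfully reproducing Algorithm 2.

```python
import numpy as np

rng = np.random.default_rng(1)
x_star = 6.24
epsilons = [2, 0.7, 0.3, 0.01, 0.005, 0.003, 0.001, 0.001, 0.001]
N = 10_000

mean, std = 0.0, np.sqrt(20.0)          # initial proposal = prior N(0, 20)
for eps in epsilons:
    theta = rng.normal(mean, std, size=N)
    x = rng.normal(theta, 1.0)          # X | theta ~ N(theta, 1)
    kept = theta[np.abs(x - x_star) <= eps]
    if kept.size > 1:                   # moment-match the next Gaussian proposal
        mean, std = kept.mean(), kept.std()

print(mean, std ** 2)   # should approach the exact posterior mean 5.94 and variance 0.95
```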


We generated a single observed value x∗ = 6.24 from the model; by conjugacy the exact

posterior is π(θ | x∗ ) = N (5.94, 0.95). At each iteration t, we drew N = 104 synthetic pairs
(θ(i) , X (i) ) from the current Gaussian proposal and retained those θ(i) for which X (i) ∈ Bϵt (x∗ ).
We used the decreasing tolerance schedule

ϵ = (2, 0.7, 0.3, 0.01, 0.005, 0.003, 0.001, 0.001, 0.001).

The retained θ-values were used to fit the next Gaussian proposal by moment matching. We
repeated this procedure for T = 9 iterations. Let the empirical retention rate be defined as

(1/N) |{ i : |X(i) − x∗| ≤ 0.001 }|.
Figure 10 displays the smoothed posterior densities across iterations, together with the corre-
sponding retention rates. From our simulations, we make the following observations:
(1). The sequential rejection sampling refinement is self-consistent: repeated application does
not cause the sampler to drift away from the true posterior.
(2). In this univariate example, the sampler converges rapidly, requiring just three iterations.

Figure 10: Adaptive inference on the univariate Gaussian model. Left: Smoothed proposal and
posterior densities over iterations; prior shown in gray, true posterior in black. Right: Empirical
retention rate per iteration. The red line indicates the rate obtained using the true posterior,
while the error bands represent ±1 standard deviation of the retention rate, calculated from 100
independent repetitions.

Appendix D Posterior Sufficiency


First, for completeness and measure-theoretic rigor, we formally define the probability space.
Let the parameter and data (Θ, X) be jointly defined on a nice probability space (Ω × X n , B ⊗
A, P ), where (Ω, B) ⊆ (Rd , B(Rd )) is the measurable Polish parameter space and (X n , A) ⊆
(RdX , B(RdX )) is the measurable observation space, with d, dX ∈ N+ . The prior probability
measure Π on (Ω, B) is assumed to be absolutely continuous with respect to the Lebesgue
(n)
measure, with corresponding density function π(dθ) for θ ∈ Ω. For each θ ∈ Ω, let Pθ denote

the conditional probability measure on (X n , A) given Θ = θ.
We provide the definition of Bayes sufficiency presented in Godambe (1968) below.
Definition D.1 (Bayes Sufficiency). Let (Ω, B) be a measurable parameter space, (X , A) be a
sample space, and Cπ be a countable class of prior distributions on (Ω, B). A statistic T : X → T
is said to be Bayes sufficient with respect to Cπ if, for any π ∈ Cπ , the posterior density π(θ | X)
depends on X only through T (X) for almost every x, i.e., there exists a function g : Ω×T → R+
such that

π(θ | x) = g(θ, T (x)) for π-almost every θ ∈ Ω and Pπ -almost every x ∈ X , (D.1)

where Pπ is the marginal distribution of X under the prior π for all π ∈ Cπ . Equivalently, for
any π ∈ Cπ and B ∈ B,

E[1B (Θ) | X] = E[1B (Θ) | T ] Pπ -almost surely. (D.2)

In other words, conditional on T (X), the observation X provides no additional information


about Θ under the specified prior π.
Theorem D.1 (Minimal Bayes Sufficiency of Posterior Distribution). Let (Ω, B) be a measurable
parameter space and (X , A) be a sample space. For X ∈ X , define T : X → M(Ω) as the function
that maps each observation X to its corresponding posterior distribution under the prior class
Cπ , i.e., T (X) := π(dθ | X), where M(Ω) denotes the space of probability measures on (Ω, B).
Note that this is a random probability measure on Ω, and that M(Ω) is a Polish space (under
the weak topology). Then:
1. T is Bayes sufficient for Θ with respect to π.
2. If T ′ is any other Bayes sufficient statistic with respect to π, then T ′ (x) = T ′ (x′ ) implies
that T (x) = T (x′ ), Pπ -almost every x, x′ .
Consequently, the posterior map X 7→ π(dθ | X) is minimally Bayes sufficient for the specified
prior π.

Proof. We will show that T is Bayes sufficient and minimally so in two parts.

T is Bayes sufficient For any measurable B ⊆ Ω, by the definition of conditional expectation,


for Pπ -almost every x ∈ X we have
E[1B(Θ) | X = x] = ∫_B π(dθ | x).

Thus, for Pπ -almost every x and for every measurable set B we obtain
E[1B(Θ) | X = x] = ∫_B π(dθ | x)
                 = ∫_B π(dθ | T(x))
                 = E[1B(Θ) | T(X) = T(x)].

Since this equality holds for every measurable set B, we conclude that the conditional distri-
butions π (dθ | X = x) and π (dθ | T (x)) are equal for Pπ -almost every x. By the definition of
Bayes sufficiency, this shows that T is Bayes sufficient.

Minimality of T Let T ′ be any other Bayes-sufficient statistic. If T ′ (x) = T ′ (x′ ), Bayes


sufficiency of T ′ gives

π(dθ | x) = π(dθ | T ′ (x)) = π(dθ | T ′ (x′ )) = π(dθ | x′ ) (Pπ -a.e. x, x′ )

for all π ∈ Cπ . But T (x) = π(dθ | x), hence T (x) = T (x′ ) almost surely whenever T ′ (x) = T ′ (x′ ).
Thus T is minimal.

Appendix E Proofs and Technical Results


E.1 Proof of Theorem 2.2
Assumption E.1 (Local Positivity). There exist global constants c > 0 and γ > 0 such that,
for all θ ∈ supp(π(t)), Pθ(n)(At) is uniformly bounded away from zero, i.e., inf_{θ∈supp(π(t))} Pθ(n)(At) ≥
cϵt^γ, where

X = (X1, . . . , Xn) ∈ R^{ndX},
At = {x : MSŴp,δ(x, x∗) ≤ ϵt},
Et = {ω : MSŴp,δ(X(ω), x∗) ≤ ϵt} = {ω : X(ω) ∈ At}.

Theorem E.1 (Error Rate of ARS). Suppose Assumption E.1 holds. Then, the total variation
distance between the marginal distributions over θ, corresponding to the exact rejection sampling
procedure in Algorithm 2 and the ARS procedure in Algorithm 4, converges to zero at a rate of
O(exp(−Rϵγt )) as R → ∞.

Proof. We begin by establishing key notations for our analysis. Let π (t) = π(θ | Et ) denote
(t)
the marginal target distribution of the exact rejection sampling procedure. Define πARS as the
marginal distribution of θ under the approximate sampling procedure with R attempts. We
(t)
denote the corresponding measures by π (t) and πARS , respectively.
For r = 1, 2, . . . , R, let Et,r = {MSŴp,δ(Xr, x∗) ≤ ϵt} be the event where the r-th replicate
Xr satisfies the threshold condition. Define Et = Et,1. A particular θ is retained if at least one
of the corresponding data replicates passes the threshold,

∪_{r=1}^{R} Et,r = { ω : {X1(ω) ∈ At} ∪ · · · ∪ {XR(ω) ∈ At} }.

Taking expectation over Et,1 , . . . , Et,R , we obtain:

π̃ARS(t)(θ) ∝ π̃(t)(θ) ∫ 1( ∪_{r=1}^{R} Et,r ) ∏_{r=1}^{R} Pθ(n)(dX)
           = π̃(t)(θ) ∫ ( 1 − ∏_{r=1}^{R} 1_{Et,r^c}(ω) ) ∏_{r=1}^{R} Pθ(n)(dX)
           = π̃(t)(θ) ( 1 − Pθ(n)(At^c)^R ),

where the last equality follows from the i.i.d. nature of X1 , . . . , XR conditional on θ. Thus,
under the ARS procedure (Algorithm 4), the marginal distribution of Θ is proportional to:
 
π̃ARS(t)(θ) ∝ π̃(t)(θ) ( 1 − Pθ(n)(At^c)^R ).   (E.1)

Let Zt = Eπ̃(t)[ 1 − Pθ(n)(At^c)^R ] = ∫_Ω π̃(t)(θ)( 1 − Pθ(n)(At^c)^R ) dθ be the normalizing constant
for Eq. (E.1). We can then bound the total variation distance:

DTV( π(t), π̃ARS(t) ) = (1/2) ∫_Ω | π(t)(θ) − π̃ARS(t)(θ) | dθ
 = (1/2) ∫_Ω | π(t)(θ) − Zt−1 π̃(t)(θ)( 1 − Pθ(n)(At^c)^R ) | dθ
 = (1/(2Zt)) ∫_Ω | Zt π̃(t)(θ) − π̃(t)(θ)( 1 − Pθ(n)(At^c)^R ) | dθ   (E.2)
 = (1/(2Zt)) ∫_Ω π̃(t)(θ) | Zt − ( 1 − Pθ(n)(At^c)^R ) | dθ
 ≤ (1/(2Zt)) ( ∫_Ω π̃(t)(θ) | Zt − 1 + Pθ(n)(At^c)^R |² dθ )^{1/2} ( ∫_Ω π̃(t)(θ) dθ )^{1/2}   (E.3)
 = (1/(2Zt)) ( ∫_Ω π̃(t)(θ) | Eπ̃(t)[Pθ(n)(At^c)^R] − Pθ(n)(At^c)^R |² dθ )^{1/2}
 = (1/(2Zt)) Varπ̃(t)( Pθ(n)(At^c)^R )^{1/2}
 ≤ (1/(2Zt)) Eπ̃(t)[ Pθ(n)(At^c)^{2R} ]^{1/2}   (E.4)

where Eq. (E.2) uses the fact that π(t)(θ) =d π̃(t)(θ), Eq. (E.3) follows from Hölder's inequality,
and we also used the fact that ∫_Ω π̃(t)(θ) dθ = 1.

By Assumption E.1, there exists a constant c such that inf_{θ∈supp(π(t))} Pθ(n)(At) ≥ cϵt^γ. Consequently,
Eπ̃(t)[ Pθ(n)(At^c)^{2R} ]^{1/2} ≤ ( (1 − cϵt^γ)_+^{2R} )^{1/2} ≤ exp(−Rcϵt^γ),

where we use the inequality (1 − u)_+^R ≤ exp(−Ru) for u ∈ R and integer R. Similarly, for Zt,
using sup_{θ∈supp(π(t))} Pθ(n)(At^c) ≤ 1 − cϵt^γ:

Zt = Eπ̃(t)[ 1 − Pθ(n)(At^c)^R ] ≥ 1 − (1 − cϵt^γ)_+^R ≥ 1 − exp(−Rcϵt^γ).

Substituting these bounds into Eq. (E.4) yields:

(1/(2Zt)) Eπ̃(t)[ Pθ(n)(At^c)^{2R} ]^{1/2} ≤ (1/2) · exp(−Rcϵt^γ) / (1 − exp(−Rcϵt^γ)) ≲ exp(−Rcϵt^γ),

which establishes the upper bound and the desired convergence rate.

Theorem E.2 (Sample Complexity for ARS). Suppose Assumption E.1 holds. For any δ̄ ∈ (0, 1)
and ϵt > 0, if the number of samples R in the ARS algorithm satisfies R = O( log(1/δ̄)/ϵt^γ ), then the
total variation distance between the exact and approximate posterior distributions is bounded by
δ̄, i.e., DTV( π(t), π̃ARS(t) ) ≤ δ̄.

Proof. This follows directly from Theorem E.1.

Corollary E.2.1 (1-Wasserstein Error Incurred by ARS). Under the conditions of Theorem E.1,
if the parameter space Ω has a finite diameter dΩ = diam(Ω), then the 1-Wasserstein distance
between the exact and approximate posterior distributions is bounded by
 
W1( π(t), π̃ARS(t) ) ≲ dΩ exp(−Rcϵt^γ).   (E.5)

Therefore, to achieve W1( π(t), π̃ARS(t) ) ≤ δ̄, we need R = O( log(dΩ/δ̄)/ϵt^γ ).

Proof. By Theorem 6.15 of Villani et al. (2009), we have


   
W1( π(t), π̃ARS(t) ) ≤ dΩ · DTV( π(t), π̃ARS(t) ).

The rest follows from Theorem E.1 and Theorem 2.2.

E.2 Proofs of the Main Theorems from Section 3.1


E.2.1 Proof of Proposition 3.1

Nonnegativity, symmetry, and triangle inequality for the MSW distance follow directly from
the corresponding properties of Wp and SWp distances, coupled with the additivity property of

metrics. Note that if µ = ν, then

SWp (µ, ν) = 0,
Wp ((ej )# µ, (ej )# ν) = 0, j = 1, . . . , d.

For the converse direction, suppose MSWp (µ, ν) = 0. As 0 < λ < 1, both terms in the MSW
distance must vanish:
Σ_{j=1}^{d} Wp((ej)#µ, (ej)#ν) = 0,
∫_{Sd−1} Wp^p(φ#µ, φ#ν) dσ(φ) = 0,

which implies that the marginal distributions of µ and ν are identical for all coordinates, and
Wp (φ# µ, φ# ν) = 0 for σ-almost all φ ∈ Sd−1 . By the Cramér-Wold theorem, this is sufficient
to conclude that µ = ν.

E.2.2 Proof of Theorem 3.1

Lemma E.3. The 1-Sliced Wasserstein distance is an Integral Probability Metric on P1 (Rd ).

Proof. Recall the definition of the SW1 distance,


Z
SW1 (µ, ν) = W1 (φ# µ, φ# ν)dσ(φ),
Sd−1

where Sd−1 is the unit sphere in Rd and σ is the uniform measure on Sd−1 . Define the critic
function class,
( Z )

F := f (x) = gφ (φ x) dσ(φ) gφ ∈ Lip1 (R), sup |gφ (0)| < ∞
Sd−1 φ∈Sd−1

where for each φ ∈ Sd−1 , gφ : R → R is a 1-Lipschitz function, such that the mapping (φ, t) 7→
gφ (t) is jointly measurable with respect to the product of the Borel σ-algebras on Sd−1 and R.
Note that F is nonempty, as it includes constant functions. For instance, if we take fc : x ↦
∫_{Sd−1} k dσ(φ) = k < ∞, then fc ∈ F.

First, we show that sup_{f∈F} |∫ f d(µ − ν)| ≤ SW1(µ, ν) holds. Fix f ∈ F. By definition,
f(x) = ∫_{Sd−1} gφ(φ⊤x) dσ(φ). For any φ ∈ Sd−1,

|gφ(φ⊤x)| ≤ |gφ(0)| + |φ⊤x| ≤ sup_{φ∈Sd−1} |gφ(0)| + ∥x∥.

As sup_{φ∈Sd−1} |gφ(0)| < ∞ and ∫ ∥x∥ dµ < ∞, we have ∫∫_{Sd−1} |gφ(φ⊤x)| dσ(φ) dµ < ∞; similarly,
∫∫_{Sd−1} |gφ(φ⊤x)| dσ(φ) dν < ∞. By Fubini–Tonelli,

∫ f d(µ − ν) = ∫_{Sd−1} ( ∫ gφ(φ⊤x) d(µ − ν)(x) ) dσ(φ).

Thus
| ∫ f d(µ − ν) | ≤ ∫_{Sd−1} | ∫ gφ(φ⊤x) d(µ − ν)(x) | dσ(φ)
                 ≤ ∫_{Sd−1} W1(φ#µ, φ#ν) dσ(φ) = SW1(µ, ν),

where we appeal to the dual representation of W1 for each φ. Taking the supremum over F
establishes the claimed inequality.
We now show that the reverse inequality SW1(µ, ν) ≤ sup_{f∈F} |∫ f d(µ − ν)| holds. Fix ε > 0.
For each φ ∈ Sd−1, choose a 1-Lipschitz function gφ^ε with gφ^ε(0) = 0 and

∫ gφ^ε(φ⊤x) d(µ − ν)(x) ≥ W1(φ#µ, φ#ν) − ε.

By the Kuratowski–Ryll–Nardzewski measurable selection theorem, we can select φ 7→ gφε such


that the mapping (φ, t) 7→ gφε (t) is jointly measurable with respect to the product of the Borel
σ-algebras on Sd−1 and R.
Define
Z
fε (x) = gφε (φ⊤ x) dσ(φ) ∈ F.
Sd−1

Note that fε ∈ F because each slice is 1-Lipschitz and supφ∈Sd−1 |gφε (0)| < ∞, by construction.
Then, exactly as before,
∫ fε d(µ − ν) ≥ ∫_{Sd−1} ( ∫ gφ^ε(φ⊤x) d(µ − ν)(x) ) dσ(φ)
             ≥ ∫_{Sd−1} ( W1(φ#µ, φ#ν) − ε ) dσ(φ) = SW1(µ, ν) − ε.

Taking the supremum over F and letting ε ↓ 0 yields the reverse inequality.
Combining both steps gives
SW1(µ, ν) = sup_{f∈F} | ∫ f d(µ − ν) |,
so SW1 is an IPM with critic class F.

Lemma E.4 (Closure of IPMs under finite linear combinations). Let D1 , D2 , . . . , DK be IPMs
on the same space of probability measures, and let λ1 , λ2 , . . . , λK ≥ 0 be nonnegative constants.

Define
D(µ, ν) := Σ_{k=1}^{K} λk Dk(µ, ν).   (E.6)

Then D(·, ·) is itself an IPM.

Proof. Let D1 , . . . , DK be as given. For each k = 1, . . . , K, let Fk be the critic function class
corresponding to Dk , i.e.,
Dk(µ, ν) = sup_{f∈Fk} | ∫ f d(µ − ν) |.

Let Fk† := Fk ∪ (−Fk ); note that Dk is unchanged when Fk is replaced by Fk† . For any
λ1 , λ2 , · · · , λK ≥ 0, define the new critic function class
F := { f = Σ_{k=1}^{K} λk fk : fk ∈ Fk†, k = 1, . . . , K }.

We prove the desired equality in two steps.


We first show that sup_{f∈F} |∫ f d(µ − ν)| ≤ D(µ, ν) holds. For any choice of fk ∈ Fk† for every
k ∈ [K], we have

| ∫ Σ_{k=1}^{K} λk fk d(µ − ν) | ≤ Σ_{k=1}^{K} λk | ∫ fk d(µ − ν) | ≤ Σ_{k=1}^{K} λk Dk(µ, ν).

Taking the supremum over all f = Σ_{k=1}^{K} λk fk ∈ F yields

sup_{f∈F} | ∫ f d(µ − ν) | ≤ D(µ, ν).   (E.7)

Now we show that sup_{f∈F} |∫ f d(µ − ν)| ≥ D(µ, ν) holds. Fix ε > 0. For each k choose fk^ε ∈ Fk†
such that

∫ fk^ε d(µ − ν) ≥ Dk(µ, ν) − ε/(K(λk ∨ 1)).

Define fε := Σ_{k=1}^{K} λk fk^ε ∈ F. By linearity,

∫ fε d(µ − ν) = Σ_{k=1}^{K} λk ∫ fk^ε d(µ − ν) ≥ Σ_{k=1}^{K} λk ( Dk(µ, ν) − ε/(K(λk ∨ 1)) ) ≥ D(µ, ν) − ε.

Since fε ∈ F, it follows that

sup_{f∈F} | ∫ f d(µ − ν) | ≥ D(µ, ν) − ε.

Since the bound holds for arbitrary ε > 0, letting ε ↓ 0 yields
sup_{f∈F} | ∫ f d(µ − ν) | ≥ D(µ, ν).   (E.8)

Combining (E.7) and (E.8) gives


D(µ, ν) = sup_{f∈F} | ∫ f d(µ − ν) |,

which confirms that D is an IPM, as desired.

Below we provide the proof of Theorem 3.1.

Proof. For each ej , j = 1, . . . , d, W1 ((ej )# µ, (ej )# ν) is an IPM. By Lemmas E.3 and E.4, we
readily obtain the claimed result.

E.2.3 Proof of Theorem 3.2

To prove Theorem 3.2, we need to first establish some useful lemmas and propositions.
Lemma E.5. For any p ≥ 1 and any µ, ν ∈ Pp (Rd ), we have MSW1 (µ, ν) ≤ MSWp (µ, ν).

Proof. By Hölder’s inequality, for each j ∈ [d], we have

W1 ((ej )# µ, (ej )# ν) ≤ Wp ((ej )# µ, (ej )# ν).

For the Sliced Wasserstein distance, fixing any projection φ ∈ Sd−1 , the one-dimensional Wasser-
stein distance satisfies W1 (φ# µ, φ# ν) ≤ Wp (φ# µ, φ# ν). Hence by Jensen’s inequality,
∫_{Sd−1} W1(φ#µ, φ#ν) dσ(φ) ≤ ( ∫_{Sd−1} W1(φ#µ, φ#ν)^p dσ(φ) )^{1/p} ≤ ( ∫_{Sd−1} Wp(φ#µ, φ#ν)^p dσ(φ) )^{1/p}.

Thus SW1 (µ, ν) ≤ SWp (µ, ν). Since MSW is a convex combination of these components, the
desired inequality holds for the MSW distance.

Proposition E.1. For any p ≥ 1 and µ, ν ∈ Pp (Rd ), we have

MSWp (µ, ν) ≤ Cd,p,λ Wp (µ, ν),

where
Cd,p,λ = λ + (1 − λ) cd,p^{1/p},   cd,p = (1/d) ∫_{Sd−1} ∥φ∥p^p dσ(φ) ≤ 1.

Note that Cd,p,λ ≤ 1 for every d ∈ N+ , p ≥ 1, and λ ∈ (0, 1).

Proof. Let γ ∗ ∈ Γ(µ, ν) be an optimal transport plan for the minimization problem (1.1) with
Ω = Rd . Then for any j ∈ [d], (ej ⊗ ej )# γ ∗ is a transport plan between (ej )# µ and (ej )# ν. By
the definition of the Wasserstein distance as an infimum over all couplings, we have:
Wp((ej)#µ, (ej)#ν) ≤ ( ∫ |⟨ej, x − y⟩|^p dγ∗(x, y) )^{1/p} ≤ ( ∫ ∥x − y∥^p dγ∗(x, y) )^{1/p} = Wp(µ, ν).

Consequently,
(1/d) Σ_{j=1}^{d} Wp((ej)#µ, (ej)#ν) ≤ Wp(µ, ν).

Furthermore, from Proposition 5.1.3 of Bonnotte (2013), we have:

SWpp (µ, ν) ≤ cd,p Wpp (µ, ν),

where
cd,p = (1/d) ∫_{Sd−1} ∥φ∥p^p dσ(φ) ≤ 1.

Combining these results yields the desired inequality.

Proposition E.2. For all µ, ν supported in BR (0), R > 0, there exists a constant Cd,λ > 0 such
that
W1(µ, ν) ≤ Cd,λ R^{d/(d+1)} MSW1(µ, ν)^{1/(d+1)}.   (E.9)

Proof. For brevity of notation, let M1(µ, ν) := (1/d) Σ_{j=1}^{d} W1((ej)#µ, (ej)#ν). By the definition of
the MSW distance,

SW1(µ, ν) = ( MSW1(µ, ν) − λ M1(µ, ν) ) / (1 − λ) ≤ (1 − λ)−1 MSW1(µ, ν),

as λM1 (µ, ν) ≥ 0 for λ ∈ (0, 1).


By Lemma 5.1.4 of Bonnotte (2013), there exists a constant C̃d > 0 such that

W1(µ, ν) ≤ C̃d R^{d/(d+1)} SW1(µ, ν)^{1/(d+1)}
         ≤ C̃d R^{d/(d+1)} ( (1 − λ)−1 MSW1(µ, ν) )^{1/(d+1)}
         = C̃d R^{d/(d+1)} (1 − λ)^{−1/(d+1)} MSW1(µ, ν)^{1/(d+1)}

for all probability measures µ, ν supported in BR(0).
Setting Cd,λ := C̃d (1 − λ)^{−1/(d+1)}, we obtain the desired result:

W1(µ, ν) ≤ Cd,λ R^{d/(d+1)} MSW1(µ, ν)^{1/(d+1)}.

Now we can introduce the proof of the Theorem 3.2.

Proof. The first inequality follows from Proposition E.1. The second inequality follows from
Proposition E.2 and Lemma E.5, on noting that Wpp (µ, ν) ≤ (2R)p−1 W1 (µ, ν).

E.2.4 Proof of Theorem 3.3

Proof. Let (µℓ )ℓ∈N+ , µ be probability measures in Pp (Rd ) such that limℓ→∞ MSWp (µℓ , µ) = 0.
Since λ ∈ (0, 1), this implies limℓ→∞ SWp (µℓ , µ) = 0. By Theorem 1 of Nadjahi et al. (2019),
we conclude that µℓ ⇒ µ, where we use ⇒ to denote weak convergence in Pp (Rd ).
Conversely, suppose µℓ ⇒ µ. Theorem 6.9 of Villani et al. (2009) gives Wp (µℓ , µ) → 0.
Because SWp (µℓ , µ) ≤ Wp (µℓ , µ), we also obtain SWp (µℓ , µ) → 0. For the marginal component,
by the Cramér-Wold theorem, a sequence of probability measures on Rd converges weakly if and
only if their one-dimensional projections converge weakly for all directions; therefore, for any
j ∈ {1, · · · , d},

(ej)#µℓ → (ej)#µ in distribution.

Now fix j ∈ {1, . . . , d}. Recall that weak convergence in Pp (Rd ) requires that (Villani et al.,
2009, Definition 6.8)
lim_{R→∞} lim sup_{ℓ→∞} ∫_{Rd} ∥z∥^p 1{∥z∥ ≥ R} dµℓ(z) = 0.

Since |zj |p ≤ ∥z∥p , we have


lim_{R→∞} lim sup_{ℓ→∞} ∫_R |t|^p 1{|t| ≥ R} d(ej)#µℓ(t) = lim_{R→∞} lim sup_{ℓ→∞} ∫_{Rd} |zj|^p 1{|zj| ≥ R} dµℓ(z) = 0.

Hence (ej )# µℓ converges to (ej )# µ in Pp (R) for every j ∈ {1, · · · , d}. Appealing once again to
Theorem 6.9 of Villani et al. (2009) yields each term Wp ((ej )# µℓ , (ej )# µ) → 0. Hence,

MSWp(µℓ, µ) = (λ/d) Σ_{j=1}^{d} Wp((ej)#µℓ, (ej)#µ) + (1 − λ) SWp(µℓ, µ) → 0.

Therefore MSWp indeed metrizes weak convergence on Pp (Rd ) without any compactness as-
sumptions, as desired.

E.2.5 Proof of Theorem 3.4

We start with two lemmas.


Lemma E.6. Fix δ ∈ (0, 1/2). For every φ ∈ Sd−1 and τ ∈ [δ, 1 − δ], we have
| Fµφ−1(τ) | ≤ ( Mµ,p/δ )^{1/p} < ∞.

Remark E.1. A similar result holds for ν.

Proof. Let Z ∼ µ. For any φ ∈ Sd−1 , denote by Zφ := φ⊤ Z the corresponding one-dimensional


random variable with distribution µφ . By the Cauchy-Schwarz inequality, using the fact that
E[∥Z∥p ] < Mµ,p , for every φ ∈ Sd−1 , we have

E[|Zφ |p ] = E[|⟨φ, Z⟩|p ] ≤ E[∥Z∥p ] < Mµ,p < ∞.

Hence by Markov’s inequality, for any t > 0,

P(|Zφ| ≥ t) ≤ E[|Zφ|^p]/t^p < Mµ,p/t^p.

Setting tδ = ( Mµ,p/δ )^{1/p} < ∞, we get

P(|Zφ| ≥ tδ) < Mµ,p/tδ^p = δ.

Together with the fact that Fµφ(Fµφ−1(δ)) ≥ δ, this implies that

Fµφ−1(δ) ≥ −tδ,   Fµφ−1(1 − δ) ≤ tδ.

Therefore,

| Fµφ−1(τ) | ≤ tδ = ( Mµ,p/δ )^{1/p} < ∞   for all τ ∈ [δ, 1 − δ], φ ∈ Sd−1.
δ

Lemma E.7 (Empirical Quantile Deviation). Let δ ∈ (0, 1/2), φ ∈ Sd−1 , and suppose that
Zφ,1 , . . . , Zφ,m are i.i.d. samples from φ# µ, so that the empirical measure is
φ#µ̂m = (1/m) Σ_{i=1}^{m} δ_{Zφ,i},

with empirical CDF F̂m,µφ and true CDF Fµφ. Then, for any t ≥ 0, we have

P( | F̂m,µφ−1(δ) − Fµφ−1(δ) | > t ) ≤ 2 exp( −2m ψδ,t(µφ)² ),

51
where ψδ,t (·) is as defined in Eq. (3.2).
Remark E.2. A similar result holds for ν.

Proof. For simplicity of notation, we omit the dependence on the fixed direction $\varphi$ and $\mu$ throughout this proof. Let $F_m(z) := \widehat{F}_{m,\mu_\varphi}(z) = \frac{1}{m}\sum_{i=1}^{m}\mathbb{1}\{Z_{\varphi,i} \le z\}$ and $F(z) := F_{\mu_\varphi}(z)$ for every $z \in \mathbb{R}$.
By Hoeffding's inequality, for any $z \in \mathbb{R}$ and $t \ge 0$, we have
\[
\mathbb{P}(F_m(z) - F(z) \ge t) \le \exp(-2mt^2), \qquad \mathbb{P}(F_m(z) - F(z) \le -t) \le \exp(-2mt^2).
\]
Observe that the event $F_m^{-1}(\delta) > F^{-1}(\delta) + t$ is equivalent to $F_m(F^{-1}(\delta) + t) < \delta$. Thus, we can write
\begin{align*}
\mathbb{P}\big(F_m^{-1}(\delta) - F^{-1}(\delta) > t\big) &= \mathbb{P}\big(F_m(F^{-1}(\delta) + t) < \delta\big)\\
&= \mathbb{P}\big(F_m(F^{-1}(\delta) + t) - F(F^{-1}(\delta) + t) < \delta - F(F^{-1}(\delta) + t)\big)\\
&\le \exp\big(-2m\,(F(F^{-1}(\delta) + t) - \delta)^2\big),
\end{align*}
where we use the fact that $F(F^{-1}(\delta) + t) \ge F(F^{-1}(\delta)) \ge \delta$ in the last line. An analogous argument applies for upper bounding $\mathbb{P}(F_m^{-1}(\delta) - F^{-1}(\delta) < -t)$. Applying the union bound to combine the two upper bounds yields the desired inequality.
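As a quick numerical companion to Lemma E.7 (a sketch only; the function $\psi_{\delta,t}$ of Eq. (3.2) is not re-implemented here), the snippet below simulates the deviation of the empirical $\delta$-quantile from the true quantile of a one-dimensional projection and shows it shrinking as $m$ grows. The standard normal choice and the Monte Carlo settings are illustrative assumptions.

import numpy as np
from scipy import stats

def quantile_deviation(m, delta=0.1, reps=2000, rng=None):
    # Mean absolute deviation |F_m^{-1}(delta) - F^{-1}(delta)| over Monte Carlo
    # repetitions, for a standard normal one-dimensional projection.
    rng = np.random.default_rng(rng)
    true_q = stats.norm.ppf(delta)
    emp_q = np.quantile(rng.standard_normal((reps, m)), delta, axis=1)
    return np.mean(np.abs(emp_q - true_q))

for m in (100, 400, 1600, 6400):
    print(m, quantile_deviation(m, rng=0))  # shrinks roughly like 1/sqrt(m)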

Proof of Theorem 3.4. Fix any $\delta \in (0, 1/2)$ and $\bar\delta \in (0, 1)$. We let
\begin{align*}
\underline{q}_{\mu,\delta}(\varphi) &:= F_{\mu_\varphi}^{-1}(\delta) \wedge \widehat{F}_{m,\mu_\varphi}^{-1}(\delta),\\
\overline{q}_{\mu,1-\delta}(\varphi) &:= F_{\mu_\varphi}^{-1}(1-\delta) \vee \widehat{F}_{m,\mu_\varphi}^{-1}(1-\delta),\\
\underline{q}_{\nu,\delta}(\varphi) &:= F_{\nu_\varphi}^{-1}(\delta) \wedge \widehat{F}_{m',\nu_\varphi}^{-1}(\delta),\\
\overline{q}_{\nu,1-\delta}(\varphi) &:= F_{\nu_\varphi}^{-1}(1-\delta) \vee \widehat{F}_{m',\nu_\varphi}^{-1}(1-\delta).
\end{align*}

Analysis of Empirical Convergence Rate. We decompose the total error into two components and apply concentration techniques to control each term.

Error from the Marginal Component. For each coordinate $j = 1, \dots, d$, the projected measures $\mu_j$ and $\nu_j$ are one-dimensional. Let $q_{\mu,\delta}(e_j) = |\underline{q}_{\mu,\delta}(e_j)| \vee |\overline{q}_{\mu,1-\delta}(e_j)|$. Note that
\begin{align*}
W_{p,\delta}^p\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\mu\big)
&= \frac{1}{1-2\delta}\int_\delta^{1-\delta} \big|\widehat{F}_{m,\mu_j}^{-1}(\tau) - F_{\mu_j}^{-1}(\tau)\big|^p\, d\tau\\
&\le \frac{1}{1-2\delta}\int_0^{2q_{\mu,\delta}(e_j)}\int_\delta^{1-\delta} p\,t^{p-1}\,\mathbb{1}\Big\{\big|\widehat{F}_{m,\mu_j}^{-1}(\tau) - F_{\mu_j}^{-1}(\tau)\big| \ge t\Big\}\, d\tau\, dt\\
&\le \frac{1}{1-2\delta}\sup_{t\in[0,\,2q_{\mu,\delta}(e_j)]}\{p\,t^{p-1}\}\int_0^{2q_{\mu,\delta}(e_j)}\int_\delta^{1-\delta} \mathbb{1}\Big\{\big|\widehat{F}_{m,\mu_j}^{-1}(\tau) - F_{\mu_j}^{-1}(\tau)\big| \ge t\Big\}\, d\tau\, dt\\
&\le \frac{p}{1-2\delta}\big(2q_{\mu,\delta}(e_j)\big)^{p-1}\int_\delta^{1-\delta} \big|\widehat{F}_{m,\mu_j}^{-1}(\tau) - F_{\mu_j}^{-1}(\tau)\big|\, d\tau\\
&\le \frac{p}{1-2\delta}\big(2q_{\mu,\delta}(e_j)\big)^{p-1}\int_{\underline{q}_{\mu,\delta}(e_j)}^{\overline{q}_{\mu,1-\delta}(e_j)} \big|\widehat{F}_{m,\mu_j}(t) - F_{\mu_j}(t)\big|\, dt\\
&\le \frac{p}{1-2\delta}\big(2q_{\mu,\delta}(e_j)\big)^{p-1}\big(\overline{q}_{\mu,1-\delta}(e_j) - \underline{q}_{\mu,\delta}(e_j)\big)\,\big\|\widehat{F}_{m,\mu_j} - F_{\mu_j}\big\|_\infty,
\end{align*}
where the second, fourth, and fifth lines are derived using the Fubini-Tonelli theorem.
Recall Eq. (3.2)-(3.5). By Lemma E.7, for any $j \in \{1, \dots, d\}$, we have
\[
\mathbb{P}\Big\{\big|\widehat{F}_{m,\mu_j}^{-1}(\delta) - F_{\mu_j}^{-1}(\delta)\big| > \varepsilon_{m,d,\delta,\bar\delta}(\mu_j)\Big\} \le 2\exp\!\big(-2m\,\psi_{\delta,\varepsilon_{m,d,\delta,\bar\delta}(\mu_j)}(\mu_j)^2\big) \le \frac{\bar\delta}{16d}. \tag{E.10}
\]
Similarly,
\[
\mathbb{P}\Big\{\big|\widehat{F}_{m,\mu_j}^{-1}(1-\delta) - F_{\mu_j}^{-1}(1-\delta)\big| > \varepsilon_{m,d,1-\delta,\bar\delta}(\mu_j)\Big\} \le 2\exp\!\big(-2m\,\psi_{1-\delta,\varepsilon_{m,d,1-\delta,\bar\delta}(\mu_j)}(\mu_j)^2\big) \le \frac{\bar\delta}{16d}. \tag{E.11}
\]

By Lemma E.6, Eq. (E.10)-(E.11), and the union bound, the following inequalities hold with probability at least $1 - \bar\delta/8$ for all $j \in \{1, \dots, d\}$:
\begin{align*}
q_{\mu,\delta}(e_j) &\le \Big(\frac{M_{\mu,p}}{\delta}\Big)^{1/p} + \varepsilon_{m,d,\delta,\bar\delta}(\mu_j) \vee \varepsilon_{m,d,1-\delta,\bar\delta}(\mu_j) = \frac{1}{2}R_{\mu_j,\delta},\\
\overline{q}_{\mu,1-\delta}(e_j) - \underline{q}_{\mu,\delta}(e_j) &\le 2\bigg(\Big(\frac{M_{\mu,p}}{\delta}\Big)^{1/p} + \varepsilon_{m,d,\delta,\bar\delta}(\mu_j) \vee \varepsilon_{m,d,1-\delta,\bar\delta}(\mu_j)\bigg) = R_{\mu_j,\delta}.
\end{align*}

Therefore, with probability at least $1 - \bar\delta/8$, we have for all $j \in \{1, \dots, d\}$,
\[
W_{p,\delta}^p\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\mu\big) \le \frac{p}{1-2\delta}\, R_{\mu_j,\delta}^p\, \big\|\widehat{F}_{m,\mu_j} - F_{\mu_j}\big\|_\infty. \tag{E.12}
\]
By the Dvoretzky-Kiefer-Wolfowitz inequality and a union bound over $j = 1, \dots, d$, we have
\[
\mathbb{P}\Big(\max_{j\in[d]}\big\{\|\widehat{F}_{m,\mu_j} - F_{\mu_j}\|_\infty\big\} > t\Big) \le 2d\, e^{-2mt^2}.
\]

Combining these results, we obtain that
\[
\mathbb{P}\bigg(\bigcup_{j\in[d]}\Big\{W_{p,\delta}^p\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\mu\big) > \frac{p}{1-2\delta}\, R_{\mu_j,\delta}^p \cdot t\Big\}\bigg) \le \frac{\bar\delta}{8} + 2d\, e^{-2mt^2}.
\]
Setting $\frac{\bar\delta}{8} = 2d\, e^{-2mt^2}$, we obtain $t = \sqrt{\frac{\log(16d/\bar\delta)}{2m}}$. Therefore, with probability at least $1 - \bar\delta/4$, for all $j \in \{1, \dots, d\}$,
\[
W_{p,\delta}^p\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\mu\big) \le \frac{p}{1-2\delta}\, \max_{j\in[d]}\big\{R_{\mu_j,\delta}^p\big\} \cdot \sqrt{\frac{\log(16d/\bar\delta)}{2m}}.
\]

Applying an analogous argument to $\nu$ yields
\[
\mathbb{P}\bigg(\max_{j\in[d]} W_{p,\delta}^p\big((e_j)_\#\widehat{\nu}_{m'}, (e_j)_\#\nu\big) \le \frac{p}{1-2\delta}\, \max_{j\in[d]}\big\{R_{\nu_j,\delta}^p\big\} \cdot \sqrt{\frac{\log(16d/\bar\delta)}{2m'}}\bigg) \ge 1 - \frac{\bar\delta}{4}.
\]
Recall that $R_{\max} = \max_{j\in[d]}\{R_{\mu_j,\delta}\} \vee \max_{j\in[d]}\{R_{\nu_j,\delta}\}$. By the union bound and the triangle inequality, with probability at least $1 - \bar\delta/2$, we have for all $j \in [d]$,
\[
\big|W_{p,\delta}\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\widehat{\nu}_{m'}\big) - W_{p,\delta}\big((e_j)_\#\mu, (e_j)_\#\nu\big)\big| \le W_{p,\delta}\big((e_j)_\#\widehat{\mu}_m, (e_j)_\#\mu\big) + W_{p,\delta}\big((e_j)_\#\widehat{\nu}_{m'}, (e_j)_\#\nu\big).
\]
Define $\widehat{\mu}_{m,j} := (e_j)_\#\widehat{\mu}_m$ and similarly $\widehat{\nu}_{m',j} := (e_j)_\#\widehat{\nu}_{m'}$; we therefore have

\[
\mathbb{P}\bigg(\max_{j\in[d]}\Big|W_{p,\delta}\big(\widehat{\mu}_{m,j}, \widehat{\nu}_{m',j}\big) - W_{p,\delta}\big(\mu_j, \nu_j\big)\Big| \le 2R_{\max}\Big(\frac{p}{1-2\delta}\Big)^{1/p}\sqrt{\log(16d/\bar\delta)}\cdot\big(m^{-\frac{1}{2p}} + m'^{-\frac{1}{2p}}\big)\bigg) \ge 1 - \frac{\bar\delta}{2}.
\]
Thus
\[
\mathbb{P}\bigg(\bigg|\frac{1}{d}\sum_{j=1}^{d} W_{p,\delta}\big(\widehat{\mu}_{m,j}, \widehat{\nu}_{m',j}\big) - \frac{1}{d}\sum_{j=1}^{d} W_{p,\delta}(\mu_j, \nu_j)\bigg| \le 2R_{\max}\Big(\frac{p}{1-2\delta}\Big)^{1/p}\sqrt{\log(16d/\bar\delta)}\,\big(m^{-\frac{1}{2p}} + m'^{-\frac{1}{2p}}\big)\bigg) \ge 1 - \frac{\bar\delta}{2}. \tag{E.13}
\]

Error from the Sliced Wasserstein Component. By Proposition 1(ii) of Manole et al. (2022), there exists a constant $C_p > 0$ depending only on $p$ such that
\[
\mathbb{E}\,\big|\mathrm{SW}_{p,\delta}(\widehat{\mu}_m, \widehat{\nu}_{m'}) - \mathrm{SW}_{p,\delta}(\mu, \nu)\big| \le \frac{C_p\big(\mathbb{E}_{Z\sim\mu}[\|Z\|^2]^{1/2} \vee \mathbb{E}_{Z\sim\nu}[\|Z\|^2]^{1/2}\big)}{\sqrt{\delta}\,(1-2\delta)^{1/p}}\,\big(m^{-1/(2p)} + m'^{-1/(2p)}\big).
\]
Combining the expression above with Markov's inequality, for any $t \ge 0$, we have
\[
\mathbb{P}\big(\big|\mathrm{SW}_{p,\delta}(\widehat{\mu}_m, \widehat{\nu}_{m'}) - \mathrm{SW}_{p,\delta}(\mu, \nu)\big| \ge t\big) \le \frac{C_p\big(\mathbb{E}_{Z\sim\mu}[\|Z\|^2]^{1/2} \vee \mathbb{E}_{Z\sim\nu}[\|Z\|^2]^{1/2}\big)}{\sqrt{\delta}\,(1-2\delta)^{1/p}\, t}\,\big(m^{-1/(2p)} + m'^{-1/(2p)}\big).
\]

Hence taking
\[
t_{\mathrm{SW}} := \frac{2C_p\big(\mathbb{E}_{Z\sim\mu}[\|Z\|^2]^{1/2} \vee \mathbb{E}_{Z\sim\nu}[\|Z\|^2]^{1/2}\big)}{\sqrt{\delta}\,(1-2\delta)^{1/p}\,\bar\delta}\,\big(m^{-1/(2p)} + m'^{-1/(2p)}\big),
\]
we get
\[
\mathbb{P}\big(\big|\mathrm{SW}_{p,\delta}(\widehat{\mu}_m, \widehat{\nu}_{m'}) - \mathrm{SW}_{p,\delta}(\mu, \nu)\big| \ge t_{\mathrm{SW}}\big) \le \frac{\bar\delta}{2}. \tag{E.14}
\]

Thus, combining both parts, we obtain the desired rate
\[
\mathbb{P}\big(\big|\mathrm{MSW}_{p,\delta}(\widehat{\mu}_m, \widehat{\nu}_{m'}) - \mathrm{MSW}_{p,\delta}(\mu, \nu)\big| \ge t_{\mathrm{MSW}}\big) \le \bar\delta,
\]
where
\[
t_{\mathrm{MSW}} := \bigg\{ 2\lambda R_{\max}\Big(\frac{p}{1-2\delta}\Big)^{1/p}\sqrt{\log(16d/\bar\delta)} + \frac{2(1-\lambda)C_p\big(\mathbb{E}_{Z\sim\mu}[\|Z\|^2]^{1/2} \vee \mathbb{E}_{Z\sim\nu}[\|Z\|^2]^{1/2}\big)}{\sqrt{\delta}\,(1-2\delta)^{1/p}\,\bar\delta} \bigg\} \cdot \big(m^{-1/(2p)} + m'^{-1/(2p)}\big).
\]
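As an illustration of the trimmed quantities controlled above, the following sketch (again an assumption-laden illustration, not the paper's code) estimates the trimmed one-dimensional distance $W_{p,\delta}$ through its quantile representation and lets one check numerically that the sampling error of a single marginal term shrinks as $m$ grows.

import numpy as np

def trimmed_w1d(u, v, p=2, delta=0.05, n_grid=1000):
    # Trimmed W_{p,delta} between two 1-D samples via the quantile integral
    # (1/(1-2*delta)) * int_{delta}^{1-delta} |F_u^{-1} - F_v^{-1}|^p dtau;
    # approximating the integral by a grid average cancels the 1/(1-2*delta) factor.
    taus = np.linspace(delta, 1 - delta, n_grid)
    qu, qv = np.quantile(u, taus), np.quantile(v, taus)
    return np.mean(np.abs(qu - qv) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
p = 2
for m in (200, 800, 3200, 12800):
    errs = [trimmed_w1d(rng.standard_normal(m), rng.standard_normal(m), p=p)
            for _ in range(200)]
    # The true trimmed distance is zero here, so the mean is the estimation error;
    # Theorem 3.4 guarantees decay at least as fast as m**(-1/(2*p)), and in this
    # one-dimensional Gaussian example the observed decay is typically faster.
    print(m, np.mean(errs))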


E.3 Proofs of the Main Theorems from Section 3.2


E.3.1 Proof of Theorem 3.5

As $(\Omega, \mathcal{B})$ and $(\mathcal{X}^n, \mathcal{A})$ are standard Borel, the disintegration theorem (Theorem 3.4, Kallenberg, 1997) yields a measurable kernel
\[
\pi_{\Theta|X} : \mathcal{B}\times\mathcal{X}^n \to [0,1], \qquad (B, x) \mapsto \pi_{\Theta|X}(B, x),
\]

satisfying
\[
P_{(\Theta,X)}(B\times A) = \int_A \pi_{\Theta|X}(B, x)\, P_X(dx)
\]
for all $B\in\mathcal{B}$, $A\in\mathcal{A}$. The random variable
\[
\omega \mapsto \pi_{\Theta|X}\big(B, X(\omega)\big) = \int_B \pi_{\Theta|X}\big(d\theta, X(\omega)\big)
\]
is a version of the conditional probability $\mathbb{P}(\Theta\in B \mid \sigma(X))$.


Recall that $f_X(\cdot)$ is the marginal density of $X$, i.e., for any $x\in\mathcal{X}^n$,
\[
f_X(x) = \int_\Omega \pi(\theta)\, f_{X|\Theta}(x\mid\theta)\, d\theta.
\]

For any $\epsilon > 0$, denote $A_\epsilon := \{x\in\mathcal{X}^n : \mathrm{MSW}_p(\pi_{\Theta|x}, \pi_{\Theta|x^*}) \le \epsilon\}$. Note that
\[
\pi_{\mathrm{ABI}}^{(\epsilon)}(\theta\mid x^*) = \pi_{\Theta|X}(\theta \mid X\in A_\epsilon) = \frac{\int_{A_\epsilon} \pi_{\Theta|X}(\theta\mid x)\, f_X(x)\, dx}{\int_{A_\epsilon} f_X(x)\, dx}. \tag{E.15}
\]

Hence
\begin{align*}
\mathrm{MSW}_p\Big(\pi_{\mathrm{ABI}}^{(\epsilon)}(\theta\mid x^*),\ \pi_{\Theta|X}(\theta\mid x^*)\Big) &= \mathrm{MSW}_p\bigg(\frac{\int_{A_\epsilon} \pi_{\Theta|X}(\theta\mid x)\, f_X(x)\, dx}{\int_{A_\epsilon} f_X(x)\, dx},\ \pi_{\Theta|X}(\theta\mid x^*)\bigg)\\
&\le \frac{\int_{A_\epsilon} f_X(x)\, \mathrm{MSW}_p\big(\pi_{\Theta|X}(\theta\mid x),\ \pi_{\Theta|X}(\theta\mid x^*)\big)\, dx}{\int_{A_\epsilon} f_X(x)\, dx}. \tag{E.16}
\end{align*}

By the definition of $A_\epsilon$, for any $x\in A_\epsilon$, $\mathrm{MSW}_p(\pi_{\Theta|x}, \pi_{\Theta|x^*}) \le \epsilon$. Hence
\[
\mathrm{MSW}_p\Big(\pi_{\mathrm{ABI}}^{(\epsilon)}(\theta\mid x^*),\ \pi_{\Theta|X}(\theta\mid x^*)\Big) \le \epsilon. \tag{E.17}
\]
Therefore, as $\epsilon\to 0^+$, $\pi_{\mathrm{ABI}}^{(\epsilon)}(\theta\mid x^*)$ converges weakly in $\mathcal{P}_p(\Omega)$ to $\pi_{\Theta|X}(\theta\mid x^*)$.
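Eq. (E.15) can be read as a rejection step in posterior space, which the following purely schematic Python sketch illustrates. Here posterior_msw is a hypothetical placeholder for an estimate of $\mathrm{MSW}_p(\pi_{\Theta|x}, \pi_{\Theta|x^*})$ (obtained in the paper via conditional quantile regression, which is not reproduced here), thetas is an array of prior-predictive parameter draws, and eps plays the role of $\epsilon$; all names are illustrative assumptions.

import numpy as np

def abi_accept(thetas, xs, x_star, posterior_msw, eps):
    # Schematic epsilon-acceptance step mirroring Eq. (E.15): keep the draws
    # theta_i whose simulated data x_i lies in the region
    # A_eps = {x : estimated MSW_p(pi(.|x), pi(.|x_star)) <= eps}.
    # `posterior_msw` is a hypothetical placeholder, not the paper's estimator.
    dists = np.array([posterior_msw(x, x_star) for x in xs])
    keep = dists <= eps
    return thetas[keep], keep.mean()  # accepted draws and acceptance rate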

E.3.2 Proof of Theorem 3.6

As $f_X(x^*) > 0$ and $f_X$ is continuous at $x^*$, the point $x^*$ lies in the support of $P_X$, so all conditional probabilities below are well-defined.
Consider a deterministic decreasing sequence $\epsilon_t \downarrow 0$ as $t\to\infty$. Let
\[
D_t = \{X\in B_{\epsilon_t}(x^*)\}.
\]
Define the random variable $Z_t := \mathbb{1}_{D_t}$, which records whether $X$ falls in successively smaller neighborhoods of the observed data $x^*$.

Define the filtration
\[
\mathcal{F}_t = \sigma(Z_1, \dots, Z_t) \subseteq \mathcal{F}.
\]
By convention, we let $\mathcal{F}_0 := \{\emptyset, \Xi\}$. By construction, the filtration is increasing, i.e., $\mathcal{F}_{t'}\subseteq\mathcal{F}_t$ for all $t'\le t$. Define $\mathcal{F}_\infty$ as the minimal $\sigma$-algebra generated by $(\mathcal{F}_t)_{t\in\mathbb{N}}$, i.e.,
\[
\mathcal{F}_\infty = \sigma\big(\cup_t \mathcal{F}_t\big).
\]
A key observation is that
\[
\bigcap_{t\ge 1} D_t = \{X = x^*\}.
\]

Hence, the indicator random variable of the exact match satisfies
\[
\mathbb{1}\{X = x^*\} = \mathbb{1}\{X \in \cap_t D_t\} \in \mathcal{F}_\infty. \tag{E.18}
\]
In other words, $\mathcal{F}_\infty$ reveals whether the random vector $X$ coincides with the realized observation $x^*$.
For any set $B\in\mathcal{B}$, denote $U_B(\omega) := \mathbb{1}\{\Theta(\omega)\in B\}$, $\omega\in\Xi$. Then, by Lévy's 0-1 law, we have, almost surely and in $L^1$ as $t\to\infty$,
\begin{align*}
\mathbb{E}\big[\mathbb{1}_{D_t} U_B \mid \mathcal{F}_t\big] \to \mathbb{E}\big[\mathbb{1}\{X = x^*\}\, U_B \mid \mathcal{F}_\infty\big] &= \mathbb{1}\{X = x^*\}\,\mathbb{E}[U_B \mid \mathcal{F}_\infty]\\
&= \mathbb{1}\{X = x^*\}\,\mathbb{P}(\Theta\in B \mid X = x^*).
\end{align*}

This convergence holds for all $B\in\mathcal{B}$, hence
\[
\pi_{\Theta|X}(\cdot \mid D_t) \xrightarrow{d} \pi_{\Theta|X}(\cdot \mid X = x^*).
\]

Let $V(\theta) = \|\theta\|_p^p$. By assumption, we have $M = \sup_{x\in\mathcal{X}^n}\int_\Omega V(\theta)\, f_{\Theta,X}(\theta, x)\, d\theta < \infty$. Note that for any $t\ge 1$, $V\mathbb{1}_{D_t} \le V\mathbb{1}_{D_1}$, and
\[
\mathbb{E}[V\mathbb{1}_{D_1}] = \int_{\mathcal{X}^n\cap B_{\epsilon_1}(x^*)}\int_\Omega \|\theta\|_p^p\, f_{\Theta,X}(\theta, x)\, d\theta\, dx \le M\cdot \mathrm{Leb}\big(\mathcal{X}^n\cap B_{\epsilon_1}(x^*)\big) < \infty,
\]
where $\mathrm{Leb}$ denotes the Lebesgue measure on $\mathbb{R}^{nd_X}$. Hence as $t\to\infty$, by Lévy's 0-1 law,
\[
\mathbb{E}[V\mathbb{1}_{D_t}\mid\mathcal{F}_t] \to \mathbb{1}_{\{X=x^*\}}\,\mathbb{E}[V\mid X=x^*] \quad \text{almost surely and in } L^1.
\]
Therefore, we have
\[
\int_\Omega V(\theta)\, \pi_{\Theta|X}(d\theta\mid D_t) \to \int_\Omega V(\theta)\, \pi_{\Theta|X}(d\theta\mid X=x^*) \quad \text{almost surely and in } L^1.
\]

Together with weak convergence, this gives
\[
\pi_{\Theta|X}(\cdot\mid D_t) \Rightarrow \pi_{\Theta|X}(\cdot\mid X = x^*).
\]
Since the $\mathrm{MSW}_p$ distance is topologically equivalent to $W_p$ on $\mathcal{P}_p(\Omega)$, we have
\[
\lim_{t\to\infty} \mathrm{MSW}_p\Big(\pi_{\Theta|X}(d\theta\mid X\in B_{\epsilon_t}(x^*)),\ \pi_{\Theta|X}(d\theta\mid X = x^*)\Big) = 0,
\]
as desired.

Appendix F Additional Details on Simulation Settings
F.1 Multimodal Gaussian Model
Below, we present pairwise bivariate density plots of the posterior distribution under the multi-
modal Gaussian model. The plots demonstrate that ABI accurately recovers the joint posterior
structure.

Figure 11: Bivariate density plots of the posterior distribution (panels (a) and (b)).

F.2 Lotka-Volterra model


In the Lotka-Volterra model (Din, 2013), the populations evolve over time based on the following set of equations:
\begin{align*}
\frac{dx}{dt} &= \alpha x - \beta x y,\\
\frac{dy}{dt} &= -\gamma y + \delta x y.
\end{align*}
We simulate the Lotka-Volterra process using the Gillespie algorithm (Gillespie, 1976). This approach is a stochastic Euler scheme in which time steps are drawn from an exponential distribution. Algorithm 5 outlines the Gillespie procedure for the Lotka-Volterra model.

Algorithm 5: Gillespie algorithm for the Lotka-Volterra model
input: time horizon T (with t initialized to 0); rates α > 0, β > 0, γ > 0, δ > 0; initial populations X_0 > 0, Y_0 > 0
while (t < T and X_t ≠ 0 and Y_t ≠ 0) do
    Compute rates: r_1 ← αX_t; r_2 ← βX_tY_t; r_3 ← γY_t; r_4 ← δX_tY_t;
    Draw τ ∼ Exp(R), where R = r_1 + r_2 + r_3 + r_4;
    Draw the event index i ∼ Discrete([r_1/R, r_2/R, r_3/R, r_4/R]);
    if i = 1 then X_{t+1} ← X_t + 1; Y_{t+1} ← Y_t;        // prey birth
    else if i = 2 then X_{t+1} ← X_t − 1; Y_{t+1} ← Y_t;   // prey death
    else if i = 3 then X_{t+1} ← X_t; Y_{t+1} ← Y_t − 1;   // predator death
    else if i = 4 then X_{t+1} ← X_t; Y_{t+1} ← Y_t + 1;   // predator birth
    t ← t + τ;
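A compact Python translation of Algorithm 5 might look as follows; this is a minimal sketch, and the function name, the returned trajectory format, and the use of NumPy's default random generator are our own illustrative choices rather than the paper's code.

import numpy as np

def gillespie_lv(alpha, beta, gamma, delta, x0, y0, t_max, rng=None):
    # Gillespie simulation of the stochastic Lotka-Volterra model.
    # Events: prey birth (rate alpha*X), prey death (rate beta*X*Y),
    # predator death (rate gamma*Y), predator birth (rate delta*X*Y).
    rng = np.random.default_rng(rng)
    t, x, y = 0.0, x0, y0
    times, states = [t], [(x, y)]
    while t < t_max and x > 0 and y > 0:
        rates = np.array([alpha * x, beta * x * y, gamma * y, delta * x * y])
        total = rates.sum()
        t += rng.exponential(1.0 / total)       # waiting time to the next event
        event = rng.choice(4, p=rates / total)  # which reaction fires
        if event == 0:
            x += 1   # prey birth
        elif event == 1:
            x -= 1   # prey death
        elif event == 2:
            y -= 1   # predator death
        else:
            y += 1   # predator birth
        times.append(t)
        states.append((x, y))
    return np.array(times), np.array(states)

Observations at fixed time points can then be obtained by subsampling the returned trajectory, if desired.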

F.3 M/G/1 Queuing Model

Here we provide additional details on the queuing model setup. Following Shestopaloff and Neal (2014), we formulate the model using arrival times $V_i$ as latent variables that evolve as a Markov process:

\begin{align*}
V_1 &\sim \mathrm{Exp}(\theta_3),\\
V_i \mid V_{i-1} &\sim V_{i-1} + \mathrm{Exp}(\theta_3), && i = 2, \dots, n,\\
Y_i \mid X_{i-1}, V_i &\sim \mathrm{Unif}\big(\theta_1 + \max(0, V_i - X_{i-1}),\ \theta_2 + \max(0, V_i - X_{i-1})\big), && i = 1, \dots, n.
\end{align*}
Setting $V = (V_1, \dots, V_n)$, the joint density of $V_i$ and $Y_i$ can be factorized as:
\[
\mathbb{P}(V, Y \mid \theta) = \mathbb{P}(V_1\mid\theta)\prod_{i=2}^{n}\mathbb{P}(V_i\mid V_{i-1},\theta)\prod_{i=1}^{n}\mathbb{P}(Y_i\mid V_i, X_{i-1},\theta).
\]
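For completeness, a minimal simulator consistent with this latent-variable representation is sketched below. It assumes the standard convention for this benchmark (not restated above) that $X_i$ denotes the departure time of the $i$-th customer, with $X_i = X_{i-1} + Y_i$ and $X_0 = 0$; the function name and defaults are illustrative choices, not the paper's code.

import numpy as np

def simulate_mg1(theta1, theta2, theta3, n, rng=None):
    # Simulate n inter-departure times Y_i from the M/G/1 queue.
    # V_i are arrival times with Exp(theta3) inter-arrival gaps; X_i is the
    # departure time of customer i (assumed convention: X_i = X_{i-1} + Y_i, X_0 = 0).
    rng = np.random.default_rng(rng)
    V = np.cumsum(rng.exponential(1.0 / theta3, size=n))   # arrival times
    Y = np.empty(n)
    X_prev = 0.0
    for i in range(n):
        wait = max(0.0, V[i] - X_prev)                     # idle period, if any
        Y[i] = rng.uniform(theta1 + wait, theta2 + wait)   # conditional Unif above
        X_prev += Y[i]                                     # departure time X_i
    return Y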

Appendix References

Daniel T. Gillespie. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics, 22(4):403-434, 1976.

Alexander Y. Shestopaloff and Radford M. Neal. On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548, 2014.

