arXiv:2502.03609v1 [stat.ML] 5 Feb 2025

Abstract

Conformal prediction (CP) quantifies the uncertainty of machine learning models by construct-

R such as the prediction error of a model ŷ, for each observation (x, y) in Dn, and ranking these score values. The conformal prediction set for the new input xn+1 is the collection
Multivariate Conformal Prediction using Optimal Transport
the geometry, i.e., the spatial arrangement of the data and its distribution: points closer to the center get lower ranks.

Contributions. We propose to leverage recent advances in computational optimal transport (Peyré & Cuturi, 2019), using notably differentiable transport map estimators (Pooladian & Niles-Weed, 2021; Cuturi et al., 2019), and apply such map estimators in the definition of multivariate score functions. More precisely:

• OT-CP: We extend conformal prediction techniques to multivariate score functions by leveraging optimal transport ordering, which offers a principled way to define and compute a higher-dimensional quantile and cumulative distribution function. As a result, we obtain distribution-free uncertainty sets that capture the joint behavior of multivariate predictions and enhance the flexibility and scope of conformal prediction.

• We propose a computational approach to this theoretical ansatz using the entropic map (Pooladian & Niles-Weed, 2021) computed from solutions to the Sinkhorn problem (Cuturi, 2013). We prove that our approach preserves the coverage guarantee while being tractable.

• We show the application of OT-CP using a recently released benchmark of regression tasks (Dheur et al., 2025).

We acknowledge the concurrent proposal of Thurin et al. (2025), who adopt a similar approach to ours, with, however, a few important practical differences, discussed in more detail in Section 6.

2. Background

2.1. Univariate Conformal Prediction

We recall the basics of conformal prediction based on a real-valued score function and refer to the recent tutorials (Shafer & Vovk, 2008; Angelopoulos & Bates, 2021). In the following, we denote [n] := {1, . . . , n}.

For a real-valued random variable Z, it is common to construct an interval [a, b], within which it is expected to fall, as

Rα = {z ∈ R : F(z) ∈ [a, b]}.  (1)

This is based on the probability integral transform, which states that the cumulative distribution function F maps variables to the uniform distribution, i.e., P(F(Z) ∈ [a, b]) = U([a, b]). To guarantee a (1 − α) uncertainty region, it suffices to choose a and b such that U([a, b]) ≥ 1 − α, which implies

P(Z ∈ Rα) ≥ 1 − α.  (2)

Applying this to a real-valued score Z = S(X, Y) of the prediction model ŷ, an uncertainty set for the response of a given input X can be expressed as

Rα(X) = {y ∈ Y : F ◦ S(X, y) ∈ [a, b]}.  (3)

However, this result is typically not directly usable, as the ground-truth distribution F is unknown and must be approximated empirically with Fn using finite samples of data. When the sample size goes to infinity, one expects to recover Equation (2). The following result provides the tool to obtain the finite sample version (Shafer & Vovk, 2008).

Lemma 2.1. Let Z1, . . . , Zn, Z be a sequence of real-valued exchangeable random variables. Then it holds that

Fn(Z) ∼ U({0, 1/n, 2/n, . . . , 1}),
P(Fn(Z) ∈ [a, b]) = Un+1([a, b]) = (⌊nb⌋ − ⌈na⌉ + 1) / (n + 1).

By choosing any a, b such that Un+1([a, b]) ≥ 1 − α, Lemma 2.1 guarantees a coverage that is at least equal to the prescribed level of uncertainty,

P(Z ∈ Rα,n) ≥ 1 − α,

where the uncertainty set Rα,n = Rα(Dn) is defined based on observations Dn = {Z1, . . . , Zn} as:

Rα,n = {z ∈ R : Fn(z) ∈ [a, b]}.  (4)

In short, Equation (4) is an empirical version of Equation (1) based on finite data samples that still preserves the coverage probability (1 − α) and does not depend on the ground-truth distribution of the data.

Given data Dn, a prediction model ŷ and a new input Xn+1, one can build an uncertainty set for the unobserved output Yn+1 by applying it to observed score functions.

Proposition 2.2 (Conformal Prediction Coverage). Consider Zi = S(Xi, Yi) for i in [n] and Z = S(Xn+1, Yn+1) in Lemma 2.1. The conformal prediction set is defined as

Rα,n(Xn+1) = {y ∈ Y : Fn ◦ S(Xn+1, y) ∈ [a, b]}

and satisfies a finite sample coverage guarantee

P(Yn+1 ∈ Rα,n(Xn+1)) ≥ 1 − α.

The conformal prediction coverage guarantee in Proposition 2.2 holds for the unknown ground-truth distribution of the data P, does not require quantifying the estimation error |Fn − F|, and is applicable to any prediction model ŷ as long as it treats the data exchangeably, e.g., a pre-trained model independent of Dn.

Leveraging the quantile function Fn⁻¹ = Qn, and setting a = 0 and b = 1 − α, we recover the usual description

Rα,n(Xn+1) = {y ∈ Y : S(Xn+1, y) ≤ Qn(1 − α)},
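The finite-sample recipe above — sort the calibration scores and threshold at the conservative rank ⌈(1 − α)(n + 1)⌉ — fits in a few lines. This is a minimal stdlib sketch, not the paper's code; the absolute-residual scores are synthetic:

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample (1 - alpha) quantile: the k-th smallest calibration
    score with k = ceil((1 - alpha) * (n + 1)), as in Lemma 2.1."""
    n = len(scores)
    k = math.ceil((1 - alpha) * (n + 1))
    if k > n:                     # alpha too small for this sample size
        return float("inf")
    return sorted(scores)[k - 1]

# Calibration scores |y_i - yhat(x_i)| (synthetic values), n = 9:
scores = [0.1, 0.5, 0.2, 0.9, 0.3, 0.7, 0.4, 0.6, 0.8]
q = conformal_quantile(scores, alpha=0.2)   # k = ceil(0.8 * 10) = 8 -> 0.8
# The prediction set {y : |y - yhat(x)| <= q} covers Y_{n+1} w.p. >= 0.8.
```

Note the deliberate use of n + 1 rather than n in the rank: this is what turns the empirical quantile into a valid finite-sample guarantee rather than an asymptotic one.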
namely the set of all possible responses whose score rank is smaller than or equal to ⌈(1 − α)(n + 1)⌉ compared to the rankings of previously observed scores. For the absolute value difference score function, the CP set corresponds to the interval [ŷ(Xn+1) − Qn(1 − α), ŷ(Xn+1) + Qn(1 − α)] centered at the point prediction.

Center-Outward View. Another classical choice is a = α/2 and b = 1 − α/2. In that case, we have the usual confidence set that corresponds to a range of values capturing the central proportion of the data, with α/2 of the data lying below Q(α/2) and α/2 lying above Q(1 − α/2).

Introducing the center-outward distribution of Z as the function T = 2F − 1, the probability integral transform T(Z) is uniform in the unit ball [−1, 1]. This ensures a symmetric description of Rα = T⁻¹(B(0, 1 − α)) around a central point such as the median Q(1/2) = T⁻¹(0), with the radius of the ball corresponding to the desired confidence level. Similarly, we have the empirical center-outward distribution Tn = 2Fn − 1, and the center-outward view of the conformal prediction set follows as

Rα,n(Xn+1) = {y ∈ Y : |Tn ◦ S(Xn+1, y)| ≤ 1 − α}.

If Z follows a probability distribution P, then the transformation z ↦ T(z) maps the source distribution P to the uniform distribution U over a unit ball. In fact, T can be characterized as essentially the unique monotone increasing function such that T(Z) is uniformly distributed.

2.2. Multivariate Conformal Prediction

While many conformal methods exist for univariate prediction, we focus here on those applicable to multivariate outputs. As recalled in (Dheur et al., 2025), several alternative conformal prediction approaches have been proposed to tackle multivariate prediction problems. Some of these methods can directly operate using a simple predictor (e.g., a conditional mean) of the response y, while some may require stronger assumptions, such as an estimator of the joint probability density function between x and y, or access to a generative model that mimics the conditional distribution of y given x (Izbicki et al., 2022; Wang et al., 2022). We restrict our attention to approaches that make no such assumption, reflecting our modeling choices for OT-CP.

M-CP. We will consider the template approach of (Zhou et al., 2024) to use classical CP by aggregating a score function computed on each of the d outputs of the multivariate response. Given a conformity score si (to be defined next) for the i-th dimension, Zhou et al. (2024) define the following aggregation rule:

sM-CP(x, y) = max_{i∈[d]} si(x, yi).  (5)

As in (Dheur et al., 2025), we will use conformalized quantile regression (Romano et al., 2019) to define the score functions above, for each output i ∈ [d], where the conformity score is given by:

si(x, yi) = max( l̂i(x) − yi, yi − ûi(x) ),

with l̂i(x) and ûi(x) representing the lower and upper conditional quantiles of Yi | X = x at levels αl and αu, respectively. In our experiments, we consider equal-tailed prediction intervals, where αl = α/2, αu = 1 − α/2, and α denotes the miscoverage level.

Merge-CP. An alternative approach is simply to use a squared Euclidean aggregation,

s(x, y) := ∥ŷ(x) − y∥²,

where the choice of the norm (e.g., ℓ1, ℓ2, or ℓ∞) depends on the desired sensitivity to errors across tasks. This approach reduces the multidimensional residual to a scalar conformity score, leveraging the natural ordering of real numbers. This simplification not only makes it straightforward to apply univariate conformal prediction methods, but also avoids the complexities of directly managing vector-valued scores in conformal prediction. A variant consists of applying a Mahalanobis norm (Johnstone & Cox, 2021) in lieu of the squared Euclidean norm, using the covariance matrix Σ estimated from the training data (Johnstone & Cox, 2021; Katsios & Papadopulos, 2024; Henderson et al., 2024),

s(x, y) := ∥Σ^{−1/2}(ŷ(x) − y)∥².

2.3. Kantorovich Ranks

A naive way to define ranks in multiple dimensions might be to measure how far each point is from the origin and then rank them by that distance. This breaks down if the distribution of the data is stretched or skewed in certain directions. To correct for this, Hallin et al. (2021) developed a formal framework of center-outward distributions and quantiles, also called Kantorovich ranks (Chernozhukov et al., 2017), extending the familiar univariate concepts of ranks and quantiles into higher dimensions by building on elements of optimal transport theory.

Optimal Transport Map. Let µ and ν be source and target probability measures on Ω ⊂ Rd. One can look for a map T : Ω → Ω that pushes forward µ to ν and minimizes the average transportation cost

T⋆ ∈ arg min_{T#µ=ν} ∫_Ω ∥x − T(x)∥² dµ(x).  (6)

Brenier's theorem states that if the source measure µ has a density, there exists a solution to (6) that is the gradient of a convex function ϕ : Ω → R such that T⋆ = ∇ϕ.
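To make the scalar baselines of Section 2.2 concrete, here is a minimal sketch of the Merge-CP and M-CP scores (Eq. 5). The per-dimension score is written in the standard conformalized-quantile-regression form max(l̂ − y, y − û); that exact form is our assumption, as the corresponding display is truncated in this copy:

```python
import math

def merge_cp_score(y_hat, y):
    """Merge-CP: collapse the d-dimensional residual to a scalar via the
    Euclidean norm (l1 or l-infinity norms are drop-in alternatives)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_hat, y)))

def cqr_score(y_i, lo_i, hi_i):
    """Per-dimension interval score: positive when y_i falls outside
    [lo_i, hi_i], negative (slack) when it lies inside."""
    return max(lo_i - y_i, y_i - hi_i)

def m_cp_score(per_dim_scores):
    """M-CP aggregation (Eq. 5): worst score across the d outputs."""
    return max(per_dim_scores)

# Toy observation with d = 3 outputs:
y_hat, y = (1.0, 2.0, 0.0), (1.5, 1.0, 0.0)
s_merge = merge_cp_score(y_hat, y)              # sqrt(0.25 + 1.0 + 0.0)
s_mcp = m_cp_score([cqr_score(1.5, 0.8, 1.2),   # above its interval
                    cqr_score(1.0, 0.5, 2.5),   # inside
                    cqr_score(0.0, -1.0, 1.0)]) # inside
# The worst-covered dimension drives the M-CP score.
```

Both reductions hand the multivariate problem back to the univariate machinery of Section 2.1, at the price of ignoring the joint geometry of the residuals.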
In the one-dimensional case, the cumulative distribution function of a distribution P is the unique increasing function transporting it to the uniform distribution. This monotonicity property generalizes to higher dimensions through the gradient of a convex function ∇ϕ. Thus, one may view the optimal transport map in higher dimensions as a natural analog of the univariate cumulative distribution function: both represent a unique, monotone way to send one probability distribution onto another.

Definition 2.3. The center-outward distribution of a random variable Z ∼ P is defined as the optimal transport map T = ∇ϕ that pushes P forward to the uniform distribution U on the unit ball B(0, 1). The rank of Z is defined as Rank(Z) = ∥T(Z)∥, the distance from the origin.

A quantile region extends quantiles to multiple dimensions, representing a region of the sample space that contains a given proportion of probability mass. The quantile region at probability level (1 − α) ∈ (0, 1) can be defined as

Rα = {z ∈ Rd : ∥T(z)∥ ≤ 1 − α}.

By definition of the spherical uniform distribution, ∥T(Z)∥ is uniform on (0, 1), which implies

P(Z ∈ Rα) = 1 − α.  (7)

2.4. Entropic Map

A convenient estimator to approximate the Brenier map T⋆ from samples (z1, . . . , zn) and (u1, . . . , um) is the entropic map (Pooladian & Niles-Weed, 2021). Let ε > 0 and write K = [exp(−∥zi − uj∥²/ε)]ij, the kernel matrix. Define

f⋆, g⋆ = argmax_{f∈Rⁿ, g∈Rᵐ} ⟨f, 1n/n⟩ + ⟨g, 1m/m⟩ − ε⟨e^{f/ε}, K e^{g/ε}⟩.  (8)

Equation (8) is an unconstrained concave optimization problem known as the regularized OT problem in dual form (Peyré & Cuturi, 2019, Prop. 4.4) and can be solved numerically with the Sinkhorn algorithm (Cuturi, 2013). Equipped with these optimal vectors, one can define the maps, valid out of sample:

fε(z) = minε([∥z − uj∥² − gj⋆]j),  (9)
gε(u) = minε([∥zi − u∥² − fi⋆]i),  (10)

where for a vector u of arbitrary size s we define the log-sum-exp operator as minε(u) := −ε log((1/s) 1s⊤ e^{−u/ε}). Using the Brenier (1991) theorem, linking potential values to optimal map estimation, one obtains an estimator for T⋆:

Tε(z) := z − ∇fε(z) = Σ_{j=1}^m pj(z) uj,  (11)

where the weights depend on z as:

pj(z) := exp(−(∥z − uj∥² − gj⋆)/ε) / Σ_{k=1}^m exp(−(∥z − uk∥² − gk⋆)/ε).  (12)

Analogously to (12), one can obtain an estimator for the inverse map (T⋆)⁻¹ as Tε^inv(u) := Σ_{i=1}^n qi(u) zi, with weights qi(u) arising for a vector u from the Gibbs distribution of the values [∥zi − u∥² − fi⋆]i.

3. Kantorovich Conformal Prediction

3.1. Multi-Output Conformal Prediction

We suppose that P is only available through a finite sample and consider the discrete transport map

Tn+1 : (Zi)i∈[n+1] → (Ui)i∈[n+1],

which can be obtained by solving the optimal assignment problem, i.e., minimizing the total transport cost between the empirical distributions Pn+1 and Un+1:

Tn+1 ∈ arg min_{T∈T} Σ_{i=1}^{n+1} ∥Zi − T(Zi)∥²,  (13)

where T is the set of bijections mapping the observed sample (Zi)i∈[n+1] to the target grid (Ui)i∈[n+1].

Definition 3.1. Let (Z1, . . . , Zn, Zn+1) be a sequence of exchangeable variables in Rd that follow a common distribution P. The discrete center-outward distribution Tn+1 is the transport map pushing forward Pn+1 to Un+1.

Following (Hallin et al., 2021), we begin by constructing the target distribution Un+1 as a discretized version of a spherical uniform distribution. It is defined such that the total number of points n + 1 = nR nS + no, where no points are at the origin:

• nS unit vectors u1, . . . , unS are uniform on the sphere.
• nR radii are regularly spaced as {1/nR, 2/nR, . . . , 1}.

The grid discretizes the sphere into layers of concentric shells, with each shell containing nS equally spaced points along the directions determined by the unit vectors. The discrete spherical uniform distribution places equal mass over each point of the grid, with no/(n + 1) mass on the origin and 1/(n + 1) on the remaining points. This ensures isotropic sampling at fixed radius onto [0, 1].

By definition of the target distribution Un+1, it holds that

∥Tn+1(Zn+1)∥ ∼ Un+1({0, 1/nR, 2/nR, . . . , 1}).  (14)
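As an illustration of Section 2.4, here is a NumPy sketch of the entropic map of Eqs. (8)–(12): log-domain Sinkhorn iterations produce the dual potential g⋆, and any point is then mapped to the softmax-weighted barycenter of the targets. This is a toy re-implementation under uniform marginals, not the OTT-JAX code used in the paper's experiments:

```python
import numpy as np

def sinkhorn_potentials(z, u, eps=0.1, n_iter=300):
    """Solve the dual problem (8) with uniform marginals by log-domain
    Sinkhorn iterations on the cost C_ij = ||z_i - u_j||^2."""
    n, m = len(z), len(u)
    C = ((z[:, None, :] - u[None, :, :]) ** 2).sum(-1)
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(n_iter):
        f = -eps * (np.logaddexp.reduce((g[None, :] - C) / eps, axis=1)
                    - np.log(m))
        g = -eps * (np.logaddexp.reduce((f[:, None] - C) / eps, axis=0)
                    - np.log(n))
    return f, g

def entropic_map(x, u, g, eps=0.1):
    """Out-of-sample estimator of Eq. (11): T_eps(x) = sum_j p_j(x) u_j,
    with softmax (Gibbs) weights p_j(x) as in Eq. (12)."""
    w = (g - ((x[None, :] - u) ** 2).sum(-1)) / eps
    p = np.exp(w - np.logaddexp.reduce(w))    # weights sum to 1
    return p @ u

# Map 2D Gaussian samples onto a crude grid of the unit ball (4 shells
# of 16 directions); ||T_eps(z)|| then acts as a multivariate rank.
rng = np.random.default_rng(0)
z = rng.normal(size=(64, 2))
angles = np.linspace(0.0, 2 * np.pi, 16, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
u = np.concatenate([r * dirs for r in (0.25, 0.5, 0.75, 1.0)])
f_star, g_star = sinkhorn_potentials(z, u, eps=0.5)
rank = np.linalg.norm(entropic_map(z[0], u, g_star, eps=0.5))
# rank lies in [0, 1]: T_eps(x) is a convex combination of grid points.
```

Because Tε(x) is a convex combination of the targets, the estimated rank can never exceed the largest grid radius, regardless of how far x sits in the tail.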
In order to define an empirical quantile region as in Equation (7), we need an extrapolation T̄n+1 of Tn+1 out of the samples (Zi)i∈[n+1]. By definition of such maps,

∥T̄n+1(Zn+1)∥ = ∥Tn+1(Zn+1)∥

is still uniformly distributed. With an appropriate choice of radius rα,n+1, the empirical quantile region can be defined as

Rα,n+1 = {z ∈ Rd : ∥T̄n+1(z)∥ ≤ rα,n+1}.

When working with such finite samples Z1, . . . , Zn, Zn+1, and considering the asymptotic regime (Chewi et al., 2024; Hallin et al., 2021), the empirical source distribution Pn+1 converges to the true distribution P and the empirical transport map T̄n+1 converges to the true transport map T⋆. As such, with the choice rα,n+1 = 1 − α, one can expect that P(Z ∈ Rα,n+1) ≈ 1 − α when n is large.

However, the core point of conformal prediction methodology is to go beyond asymptotic results or regularity assumptions about the data distribution. The following result shows how to select a radius preserving the coverage with respect to the ground-truth distribution, as in Equation (18).

Proposition 3.2. Given n discrete sample points distributed over a sphere with radii {0, 1/nR, 2/nR, . . . , 1} and directions uniformly sampled on the sphere, the smallest radius to obtain a coverage (1 − α) is determined by

rα,n+1 = jα / nR, where jα = ⌈((n + 1)(1 − α) − no) / nS⌉,

where nS is the number of directions, nR is the number of radii, and no is the number of copies of the origin.

The corresponding conformal prediction set is obtained as:

{y ∈ Y : ∥T̄n+1 ◦ S(Xn+1, y)∥ ≤ rα,n+1}.  (15)

Remark 3.3 (Computational Issues). While appealing, the previous result has notable computational limitations. At every new candidate y ∈ Y, the empirical transport map must be recomputed, which might be intractable. Moreover, the coverage guarantee does not hold if the transport map is computed solely on a hold-out independent dataset, as is usually done in split conformal prediction. In addition, for computational efficiency, the empirical entropic map cannot be directly leveraged, since the target values would no longer follow a uniform distribution, as described in Equation (14). To address these challenges, we propose two simple approaches in the following section.

3.2. Optimal Transport Merging

We introduce optimal transport merging, a procedure that reduces any vector-valued score S(x, y) ∈ Rd to a suitable 1D score using OT. We redefine the non-conformity score function of an observation as

SOT-CP(x, y) = ∥T⋆ ◦ S(x, y)∥,  (16)

where T⋆ is the optimal Brenier (1991) map that pushes the distribution of vector-valued scores onto a uniform ball distribution U of the same dimension. This approach ultimately relies on the natural ordering of the real line, making it possible to directly apply one-dimensional conformal prediction methods to the sequence of transformed scores

Zi = SOT-CP(Xi, Yi) for i ∈ [n + 1].

In practice, T⋆ can be replaced by any approximation T̂ that preserves the permutation invariance of the score function. The resulting conformal prediction set, OT-CP, is

ROT-CP(Xn+1, α) = Rα(T̂, Xn+1)

with respect to a given transport map T̂, and where

Rα(T̂, x) = {y ∈ Y : Fn(SOT-CP(x, y)) ≤ 1 − α}

has coverage (1 − α), where Fn is the empirical (univariate) cumulative distribution function of the observed scores

SOT-CP(X1, Y1), . . . , SOT-CP(Xn, Yn).

Proposition 2.2 directly implies

P(Yn+1 ∈ ROT-CP(Xn+1)) ≥ 1 − α.

Remark 3.4. Our proposed conformal prediction framework OT-CP with an optimal transport merging score function generalizes the Merge-CP approaches. More specifically, under the additional assumption that we are transporting a source Gaussian (resp. uniform) distribution to a target Gaussian (resp. uniform) distribution, the transport map is affine (Gelbrich, 1990; Muzellec & Cuturi, 2018) with a positive definite linear map term. This results in Equation (16) being equivalent to the Mahalanobis distance.

3.3. Coverage Guarantees under Approximations

When dealing with high-dimensional data or complex distributions, it is essential to find computationally feasible methods to approximate the optimal transport map T⋆ with a map T̂. In practical applications, we will rely on empirical approximations of the Brenier (1991) map using finite samples. Note that this approach may encounter a few statistical roadblocks, as such estimators are significantly hindered by the curse of dimensionality (Chewi et al., 2024). However, conformal prediction allows us to maintain a coverage level irrespective of sample size limitations. We defer the presentation of this practical approach to Section 3.4 and focus first on coverage guarantees.
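The optimal transport merging pipeline can be sketched end to end, assuming a transport map T̂ fitted beforehand on a separate split (here a placeholder identity map stands in for an entropic map; `ot_cp_threshold` and the toy numbers are ours, not the paper's):

```python
import math

def ot_cp_threshold(transported_norms, alpha):
    """Calibrate the 1D threshold on Z_i = S_OT-CP(X_i, Y_i): plain
    univariate split CP with rank ceil((1 - alpha)(n + 1))."""
    n = len(transported_norms)
    k = math.ceil((1 - alpha) * (n + 1))
    return sorted(transported_norms)[min(k, n) - 1]

def in_ot_cp_set(score_vec, T_hat, threshold):
    """Membership in R_OT-CP: transport the vector-valued score, take
    the norm of its image (Eq. 16), compare to the threshold."""
    img = T_hat(score_vec)
    return math.sqrt(sum(v * v for v in img)) <= threshold

# Hypothetical pre-fitted map; identity used only to keep the demo short.
T_hat = lambda s: s
calib = [0.2, 0.9, 0.4, 0.6, 0.1, 0.8, 0.3, 0.7, 0.5]  # ||T_hat(S_i)||, n = 9
thr = ot_cp_threshold(calib, alpha=0.2)                 # 8th smallest = 0.8
accept = in_ot_cp_set((0.3, 0.4), T_hat, thr)           # norm 0.5 <= 0.8
```

Since the transported scores are scalars, the coverage argument is exactly the univariate one of Proposition 2.2, regardless of how crude T̂ is; only the size and shape of the resulting region depend on the quality of the map.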
Coverages of Approximated Quantile Region. Let us assume an arbitrary approximation T̂ of the Brenier (1991) map and define the corresponding quantile region as

R(T̂, r) = {z ∈ Rd : ∥T̂(z)∥ ≤ r}.

However, this is only an empirical coverage statement:

(1/(n + 1)) Σ_{i=1}^{n+1} 1{Zi ∈ R(T̂n+1, r̂α,n+1)} ≥ 1 − α.
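For concreteness, the radius of Proposition 3.2 can be computed directly from the grid parameters. The ceiling in jα is our reading of the garbled display: the smallest shell index whose closed ball holds at least (1 − α)(n + 1) grid points:

```python
import math

def coverage_radius(n, n_R, n_S, n_o, alpha):
    """Radius r_{alpha,n+1} = j_alpha / n_R from Proposition 3.2, for a
    grid of n_o origin copies plus n_R shells of n_S directions each."""
    assert n + 1 == n_R * n_S + n_o, "grid must use all n + 1 target points"
    j_alpha = math.ceil(((n + 1) * (1 - alpha) - n_o) / n_S)
    return j_alpha / n_R

# n + 1 = 101 points: 1 at the origin, 10 shells x 10 directions.
r = coverage_radius(n=100, n_R=10, n_S=10, n_o=1, alpha=0.1)
# j_alpha = ceil((90.9 - 1) / 10) = 9, so r = 0.9: the origin plus the
# 9 innermost shells carry 91/101 >= 90% of the discrete mass.
```

The computation makes visible why the asymptotic choice r = 1 − α is almost, but not exactly, right at finite n: the ceiling rounds up to the next discrete shell.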
[Figure: mean region size (log scale, 10⁰ to 10³) per method: M-CP, Merge-CP, Merge-CP (Mah), OT-CP.]
the review provided in (Nguyen et al., 2024) and pick their Gaussian-based mapping approach (Basu, 2016). This consists of mapping a low-discrepancy sequence w1, . . . , wL on [0, 1]d to a potentially low-discrepancy sequence θ1, . . . , θL on Sd−1 through the mapping θ = Φ⁻¹(w)/∥Φ⁻¹(w)∥2, where Φ⁻¹ is the inverse CDF of N(0, 1) applied entry-wise.

4. Experiments

4.1. Setup and Metrics

We borrow the experimental setting provided by Dheur et al. (2025) and benchmark multivariate conformal methods on a total of 24 tabular datasets. Total data size n in these datasets ranges from 10³ to 50,000, with input dimension p ranging from 1 to 348, and output dimension d ranging from 2 to 16. We adopt their approach, which is to rely on a multivariate quantile function forecaster (MQF2, Kan et al., 2022), a normalizing flow that is able to quantify output uncertainty conditioned on input x. However, in accordance with our stance mentioned in the background section, we only assume access to the conditional mean (point-wise) estimator for OT-CP.

As is common in the field, we evaluate the methods using several metrics, including marginal coverage (MC) and mean region size (Size). The latter uses importance sampling, leveraging (when computing test-time metrics only) the generative flexibility provided by the MQF2 as an invertible flow. See (Dheur et al., 2025) and their code for more details on the experimental setup.

4.2. Hyperparameter Choices

We apply default parameters for all three competing methods, M-CP and Merge-CP, using (or not) the Mahalanobis correction. For M-CP using conformalized quantile regression boxes, we follow (Dheur et al., 2025) and leverage the empirical quantiles returned by MQF2 to compute boxes (Zhou et al., 2024).

OT-CP: our implementation requires tuning two important hyperparameters: the entropic regularization ε and the total number of points m used to discretize the sphere, not necessarily equal to the input data sample size n. These two parameters describe a fundamental statistical and computational trade-off. On the one hand, it is known that increasing m will mechanically improve the ability of Tε to recover T⋆ in the limit (or at least solve the semi-discrete (Peyré & Cuturi, 2019) problem of mapping n data points to the sphere). However, large m incurs a heavier computational price when running the Sinkhorn algorithm. On the other hand, increasing ε improves on both computational and statistical aspects, but deviates the estimated map further from the ground truth T⋆ to target instead a blurred map. We have experimented with these aspects and conclude from our experiments that both m and ε should be increased to track increases in dimension. As a side note, we observe that debiasing the outputs of the Sinkhorn algorithm does not result in improved results, which agrees with the findings in (Pooladian et al., 2022). We use the OTT-JAX toolbox (Cuturi et al., 2022) to compute these maps.

4.3. Results

We present results by differentiating datasets with small dimension d ≤ 6 from datasets with higher dimensionality 14 ≤ d ≤ 16, which we expect to be more challenging to handle with OT approaches, owing to the curse of dimensionality that might degrade the quality of multivariate quantiles. Results in Figure 4 indicate an improvement (smaller region for similar coverage) on 15 out of 18 datasets in lower dimensions, this edge vanishing in the higher-dimensional regime. Ablations provided in Figure 2 highlight the role of
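The Gaussian-based sphere mapping recalled above, θ = Φ⁻¹(w)/∥Φ⁻¹(w)∥₂, is straightforward to implement with the standard normal inverse CDF; a stdlib-only sketch (the sample point is arbitrary):

```python
import math
from statistics import NormalDist

def to_sphere(w):
    """Map w in (0,1)^d to the unit sphere S^{d-1}: Gaussianize each
    coordinate with Phi^{-1}, then project radially."""
    g = [NormalDist().inv_cdf(wi) for wi in w]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

theta = to_sphere([0.3, 0.8, 0.55])
# theta has unit Euclidean norm; feeding a low-discrepancy sequence
# w_1, ..., w_L yields a (potentially) low-discrepancy set of directions.
```

Rotational invariance of the standard Gaussian is what makes the projected directions uniform on the sphere when the inputs are uniform on the cube.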
[Figure 2: per-dataset region size (log scale) for #target points m ∈ {8192, 32768} and ε ∈ {0.001, 0.01, 0.1, 1.0}, across the 24 benchmark datasets.]

Figure 2. This plot details the impact of the two important hyperparameters one needs to set in OT-CP: the number of target points m sampled from the uniform ball and the ε regularization level. As can be seen, a larger sample size m improves region size (smaller is better) for roughly all datasets and regularization strengths. On the other hand, one must tune ε to operate at a suitable regime: not too low, which results in the well-documented poor statistical performance of unregularized / linear program OT, nor too high, which would lead to a collapse of the entropic map to the sphere. Using OTT-JAX and its automatic normalizations, we see that ε = 0.1 works best overall.
5. Conclusion

We have proposed OT-CP, a new approach that leverages a recently proposed formulation for multivariate quantiles based on optimal transport theory and optimal transport map estimators. We show the theoretical soundness of this approach, but, most importantly, demonstrate its applicability across a broad range of tasks compiled by (Dheur et al., 2025). Compared to similar baselines that either use a conditional mean regression estimator (Merge-CP) or more involved quantile regression estimators (M-CP), OT-CP shows overall superior performance, while incurring, predictably, a higher train / calibration time cost. The chal-
[Figure: source and target samples with quantile regions at levels 0.05, 0.1, 0.2.]

6. Concurrent Work

Concurrently to our work, Thurin et al. (2025) recently proposed to leverage OT in CP with a similar approach, deriving a similar CP set as in Equation (15) and analyzing a variant with asymptotic conditional coverage under additional regularity assumptions. However, our methods differ in several key aspects. On the computational side, our implementation leverages general entropic maps (Section 3.4) without compromising finite-sample coverage guarantees, an aspect we analyze in detail in Section 3.3. In contrast, their approach requires solving a linear assignment problem, using for instance the Hungarian algorithm, which has cubic complexity O(n³) in the number of target points, and which also requires having a target set on the sphere that is of the same size as the number of input points. With our notations in Section 3.4, they require n = m, whereas we set m to anywhere between 2¹² and 2¹⁵, independently of n. While they mention efficient approximations that reduce complexity to quadratic in (Thurin et al., 2025, Remark 2.3), their theoretical results do not yet cover these cases, since their analysis relies on the fact that ranks are random permutations of {1/n, 2/n, . . . , 1}, which cannot be extended to using Sinkhorn with soft assignments. In contrast, our work establishes formal theoretical coverage guarantees even when approximated (pre-trained) transport maps are used.

References

Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.

Balasubramanian, V., Ho, S.-S., and Vovk, V. Conformal prediction for reliable machine learning: theory, adaptations and applications. Newnes, 2014.

Bates, S., Candès, E., Lei, L., Romano, Y., and Sesia, M. Testing for outliers with conformal p-values. arXiv preprint arXiv:2104.08279, 2021.

Chernozhukov, V., Galichon, A., Hallin, M., and Henry, M. Monge–Kantorovich depth, quantiles, ranks and signs. The Annals of Statistics, 45(1):223–256, 2017. doi: 10.1214/16-AOS1450. URL https://doi.org/10.1214/16-AOS1450.

Chernozhukov, V., Wüthrich, K., and Zhu, Y. Exact and robust conformal inference methods for predictive machine learning with dependent data. Conference on Learning Theory, 2018.

Chernozhukov, V., Wüthrich, K., and Zhu, Y. An exact and robust conformal inference method for counterfactual and synthetic controls. Journal of the American Statistical Association, 116(536):1849–1864, 2021.

Chewi, S., Niles-Weed, J., and Rigollet, P. Statistical optimal transport. arXiv preprint arXiv:2407.18163, 2024.

Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.

Cuturi, M., Teboul, O., and Vert, J.-P. Differentiable ranking and sorting using optimal transport. Advances in Neural Information Processing Systems, 32, 2019.
Multivariate Conformal Prediction using Optimal Transport
Cuturi, M., Meng-Papaxanthos, L., Tian, Y., Bunne, C., Johnstone, C. and Cox, B. Conformal uncertainty sets for
Davis, G., and Teboul, O. Optimal transport tools (ott): robust optimization. In Carlsson, L., Luo, Z., Cheru-
A jax toolbox for all things wasserstein, 2022. URL bin, G., and An Nguyen, K. (eds.), Proceedings of the
https://arxiv.org/abs/2201.12324. Tenth Symposium on Conformal and Probabilistic Pre-
diction and Applications, volume 152 of Proceedings
Dheur, V., Fontana, M., Estievenart, Y., Desobry, N., and of Machine Learning Research, pp. 72–90. PMLR, 08–
Taieb, S. B. Multi-output conformal regression: A unified 10 Sep 2021. URL https://proceedings.mlr.
comparative study with new conformity scores, 2025. press/v152/johnstone21a.html.
URL https://arxiv.org/abs/2501.10533.
Kan, K., Aubet, F.-X., Januschowski, T., Park, Y., Benidis,
Fisch, A., Schuster, T., Jaakkola, T., and Barzilay, R. Few-
K., Ruthotto, L., and Gasthaus, J. Multivariate quantile
shot conformal prediction with auxiliary tasks. ICML,
function forecaster. In International Conference on Artifi-
2021.
cial Intelligence and Statistics, pp. 10603–10621. PMLR,
Gammerman, A., Vovk, V., and Vapnik, V. Learning by 2022.
transduction, 1998.
Katsios, K. and Papadopulos, H. Multi-label conformal
Gelbrich, M. On a formula for the l2 wasserstein metric prediction with a mahalanobis distance nonconformity
between measures on euclidean and hilbert spaces. Math- measure. In Vantini, S., Fontana, M., Solari, A., Boström,
ematische Nachrichten, 147(1), 1990. H., and Carlsson, L. (eds.), Proceedings of the Thirteenth
Symposium on Conformal and Probabilistic Prediction
Guha, E., Natarajan, S., Möllenhoff, T., Khan, M. E., with Applications, volume 230 of Proceedings of Machine
and Ndiaye, E. Conformal prediction via regression- Learning Research, pp. 522–535. PMLR, 09–11 Sep
as-classification. arXiv preprint arXiv:2404.08168, 2024.

Hallin, M., del Barrio, E., Cuesta-Albertos, J., and Matrán, C. Distribution and quantile functions, ranks and signs in dimension d: A measure transportation approach. The Annals of Statistics, 49(2):1139–1165, 2021. doi: 10.1214/20-AOS1996. URL https://doi.org/10.1214/20-AOS1996.

Hallin, M., La Vecchia, D., and Liu, H. Center-outward R-estimation for semiparametric VARMA models. Journal of the American Statistical Association, 117(538):925–938, 2022.

Hallin, M., Hlubinka, D., and Hudecová, Š. Efficient fully distribution-free center-outward rank tests for multiple-output regression and MANOVA. Journal of the American Statistical Association, 118(543):1923–1939, 2023.

Henderson, I., Mazoyer, A., and Gamboa, F. Adaptive inference with random ellipsoids through conformal conditional linear expectation. arXiv preprint arXiv:2409.18508, 2024.

Ho, S.-S. and Wechsler, H. Query by transduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.

Holland, M. J. Making learning more transparent using conformalized performance prediction. arXiv preprint arXiv:2007.04486, 2020.

Izbicki, R., Shimizu, G., and Stern, R. B. CD-split and HPD-split: Efficient conformal regions in high dimensions. Journal of Machine Learning Research, 23(87):1–32, 2022.

2024. URL https://proceedings.mlr.press/v230/katsios24a.html.

Kumar, B., Lu, C., Gupta, G., Palepu, A., Bellamy, D., Raskar, R., and Beam, A. Conformal prediction with large language models for multi-choice question answering. arXiv preprint arXiv:2305.18404, 2023.

Laxhammar, R. and Falkman, G. Inductive conformal anomaly detection for sequential detection of anomalous sub-trajectories. Annals of Mathematics and Artificial Intelligence, 2015.

Lin, Z., Trivedi, S., and Sun, J. Conformal prediction intervals with temporal dependence. Transactions of Machine Learning Research, 2022.

Lu, C., Lemay, A., Chang, K., Höbel, K., and Kalpathy-Cramer, J. Fair conformal predictors for applications in medical imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 12008–12016, 2022.

Messoudi, S., Destercke, S., and Rousseau, S. Copula-based conformal prediction for multi-target regression. Pattern Recognition, 120:108101, 2021.

Muzellec, B. and Cuturi, M. Generalizing point embeddings using the Wasserstein space of elliptical distributions. Advances in Neural Information Processing Systems, 31, 2018.

Nguyen, K., Bariletto, N., and Ho, N. Quasi-Monte Carlo for 3D sliced Wasserstein. In The Twelfth International Conference on Learning Representations, 2024.
Park, J. W., Tibshirani, R., and Cho, K. Semiparametric conformal prediction. arXiv preprint arXiv:2411.02114, 2024.

Peyré, G. and Cuturi, M. Computational optimal transport. Foundations and Trends® in Machine Learning, 11, 2019.

Pooladian, A.-A. and Niles-Weed, J. Entropic estimation of optimal transport maps. arXiv preprint arXiv:2109.12004, 2021.

Quach, V., Fisch, A., Schuster, T., Yala, A., Sohn, J. H., Jaakkola, T. S., and Barzilay, R. Conformal language modeling. arXiv preprint arXiv:2306.10193, 2023.

Wang, Z., Gao, R., Yin, M., Zhou, M., and Blei, D. M. Probabilistic conformal prediction using conditional random samples. arXiv preprint arXiv:2206.06584, 2022.

Zaffran, M., Féron, O., Goude, Y., Josse, J., and Dieuleveut, A. Adaptive conformal predictions for time series. In International Conference on Machine Learning, pp. 25834–25866. PMLR, 2022.

Zhou, Y., Lindemann, L., and Sesia, M. Conformalized adaptive forecasting of heterogeneous trajectories. arXiv preprint arXiv:2402.09623, 2024.
A. Appendix
Figure 6. Coverage for higher dimensional datasets.
Figure 7. Runtimes for higher dimensional datasets, corresponding to the setting displayed in Figure 6.
Figure 8. Ablation: coverage quality as a function of hyperparameters, with the setting corresponding to Figure 2.
Figure 9. Coverage of all baselines on small dimensional datasets, corresponding to the region sizes given in Figure 1.

Figure 10. Ablation: running time as a function of hyperparameters, with the setting corresponding to Figure 2.
B. Proofs
Proposition B.1. Given $n$ discrete sample points distributed over a sphere with radii $\{0, \tfrac{1}{n_R}, \tfrac{2}{n_R}, \dots, 1\}$ and directions uniformly sampled on the sphere, the smallest radius $r_\alpha = \tfrac{j_\alpha}{n_R}$ satisfying $(1-\alpha)$-coverage is determined by
$$j_\alpha = \left\lceil \frac{(n+1)(1-\alpha) - n_o}{n_S} \right\rceil,$$
where $n_S$ is the number of directions, $n_R$ is the number of radii, and $n_o$ is the number of copies of the origin ($\|U\| = 0$).

Proof. The discrete spherical uniform distribution places the same probability mass on all $n+1$ sample points, including the $n_o$ copies of the origin. As such, given a radius $r_j = \tfrac{j}{n_R}$, we have
$$\mathbb{P}(\|U\| = r_j) = n_S \cdot \frac{1}{n+1}.$$
The cumulative probability up to radius $r_j$ is given by:
$$\mathbb{P}(\|U\| \le r_j) = \mathbb{P}(\|U\| = 0) + \sum_{k=1}^{j} \mathbb{P}(\|U\| = r_k) = \frac{n_o}{n+1} + j \cdot \frac{n_S}{n+1}.$$
To find the smallest $r_\alpha = \tfrac{j_\alpha}{n_R}$ such that $\mathbb{P}(\|U\| \le r_{j_\alpha}) \ge 1-\alpha$, it suffices to solve:
$$\frac{n_o}{n+1} + j_\alpha \cdot \frac{n_S}{n+1} \ge 1-\alpha,$$
whose smallest integer solution is $j_\alpha = \left\lceil \frac{(n+1)(1-\alpha) - n_o}{n_S} \right\rceil$.
Lemma B.2 (Coverage of Empirical Quantile Region). Let $Z_1, \dots, Z_n, Z_{n+1}$ be a sequence of exchangeable variables in $\mathbb{R}^d$. Then $\mathbb{P}(Z_{n+1} \in \widehat{R}_{\alpha,n+1}) \ge 1-\alpha$, where, for simplicity, we denote the approximated empirical quantile region as $\widehat{R}_{\alpha,n+1} = R(\widehat{T}_{n+1}, \widehat{r}_{\alpha,n+1})$.

Proof. Since $\widehat{R}_{\alpha,n+1}$ is constructed symmetrically from the exchangeable sample $Z_1, \dots, Z_{n+1}$, every variable is equally likely to fall inside it:
$$\mathbb{P}(Z_{n+1} \in \widehat{R}_{\alpha,n+1}) = \mathbb{P}(Z_i \in \widehat{R}_{\alpha,n+1}) \quad \forall i \in [n+1].$$
Averaging over $i$, and using that by construction the region contains at least $\lceil (n+1)(1-\alpha) \rceil$ of the $n+1$ points, we obtain $\mathbb{P}(Z_{n+1} \in \widehat{R}_{\alpha,n+1}) \ge 1-\alpha$.
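The exchangeability argument behind this coverage guarantee can be exercised with a minimal Monte Carlo sketch. For simplicity, the Euclidean norm serves here as a stand-in score for the paper's transport-based score; the region is the ball whose radius is the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, trials = 99, 0.1, 2000
k = int(np.ceil((n + 1) * (1 - alpha)))      # rank defining the empirical radius

hits = 0
for _ in range(trials):
    Z = rng.standard_normal((n + 1, 2))      # exchangeable (here i.i.d.) sample in R^2
    scores = np.linalg.norm(Z, axis=1)       # stand-in score: ||Z_i||
    # empirical radius: k-th smallest score among Z_1, ..., Z_n (infinite if k > n)
    r_hat = np.sort(scores[:n])[k - 1] if k <= n else np.inf
    hits += scores[n] <= r_hat               # did Z_{n+1} land in the region?

print(hits / trials)                          # empirical coverage, close to 1 - alpha
```

The empirical coverage concentrates around $k/(n+1) = 0.9$, matching the lemma's lower bound.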
ansur2 (2) bio (2) births1 (2) calcofi (2) edm (2) enb (2) house (2) taxi (2) jura (3) scpf (3) sf1 (3) sf2 (3)
epsilon #target
0.001 4096 3.3±0.064 0.46±0.057 78±70 2.6±0.089 1.9±0.3 0.81±0.21 2±0.051 7±0.12 13±2.6 0.78±0.4 14±2.6 0.82±0.32
8192 3.4±0.059 0.45±0.057 78±70 2.6±0.089 1.9±0.29 0.81±0.2 2±0.05 7±0.13 11±2.6 0.73±0.23 16±3.9 0.4±0.16
16384 3.4±0.059 0.46±0.058 78±70 2.6±0.093 1.8±0.28 0.83±0.21 2±0.048 7±0.13 12±2.3 0.87±0.34 21±4.8 0.44±0.2
32768 3.4±0.063 0.46±0.058 78±70 2.6±0.092 1.9±0.3 0.81±0.2 2±0.05 7±0.13 12±2.6 1.2±0.47 16±2.9 0.57±0.18
0.01 4096 3.3±0.055 0.55±0.12 78±70 2.5±0.084 1.9±0.3 0.81±0.21 2±0.05 7.5±0.63 11±2.8 0.43±0.15 12±2.1 0.2±0.086
8192 3.3±0.054 0.56±0.13 78±70 2.5±0.082 1.8±0.3 0.8±0.21 2±0.049 7.5±0.69 10±2.6 0.37±0.15 12±2.8 0.17±0.063
16384 3.3±0.045 0.56±0.12 78±70 2.5±0.082 1.7±0.24 0.8±0.21 2±0.05 7.5±0.71 13±4.3 0.4±0.18 11±2.9 0.19±0.076
32768 3.3±0.064 0.56±0.12 78±70 2.5±0.085 1.7±0.26 0.82±0.22 2±0.049 7.5±0.69 10±2.7 0.41±0.17 12±2.6 0.18±0.071
0.1 4096 3.3±0.058 0.49±0.011 78±70 2.5±0.084 1.6±0.25 0.81±0.21 2.3±0.065 8.3±1.4 9.2±2.8 0.37±0.15 6.6±0.96 0.48±0.1
8192 3.3±0.059 0.49±0.011 78±70 2.5±0.084 1.6±0.26 0.8±0.21 2.3±0.065 8.2±1.5 9.4±2.9 0.4±0.15 6.1±0.89 0.53±0.11
16384 3.3±0.054 0.49±0.012 78±70 2.5±0.081 1.6±0.26 0.8±0.21 2.3±0.058 8.2±1.4 9.4±2.9 0.37±0.12 6.4±0.83 0.45±0.092
32768 3.3±0.051 0.49±0.011 77±70 2.5±0.083 1.5±0.25 0.79±0.2 2.3±0.057 8.2±1.4 8.9±2.9 0.36±0.12 6.5±1.2 0.5±0.1
1 4096 3.6±0.055 0.65±0.019 78±70 2.5±0.1 1.7±0.27 0.92±0.24 3±0.13 6.4±0.14 13±4 0.45±0.16 9.5±1.9 0.84±0.13
8192 3.6±0.067 0.59±0.013 78±70 2.5±0.099 1.7±0.26 0.91±0.24 3±0.14 6.3±0.14 13±4 0.42±0.14 10±1.8 0.93±0.16
16384 3.5±0.072 0.57±0.016 78±70 2.5±0.099 1.7±0.27 0.91±0.24 3±0.13 6.4±0.14 14±4 0.48±0.17 9.8±1.7 0.91±0.17
32768 3.5±0.061 0.6±0.028 78±71 2.5±0.1 1.7±0.27 0.91±0.24 2.9±0.13 6.4±0.15 13±4 0.47±0.17 10±1.7 0.9±0.17
slump (3) households (4) air (6) atp1d (6) atp7d (6)
epsilon #target
0.001 4096 15±7.6 37±1.4 2.6E+03±1.9E+03 81±19 8.5E+02±4.5E+02
8192 7.9±2 36±1.9 7.1E+02±56 99±41 5.9E+02±1.8E+02
16384 11±3.7 34±1.3 6.9E+02±52 65±19 9.4E+02±3E+02
32768 12±4.3 36±2.6 6.8E+02±36 87±28 5.1E+02±2E+02
0.01 4096 20±6.8 37±1.6 8.5E+02±1E+02 85±24 7.9E+02±4.1E+02
8192 12±4.9 34±1.7 1.3E+03±7E+02 82±24 4E+02±1.5E+02
16384 7.1±2.2 33±0.81 5.5E+02±47 1.1E+02±26 3.7E+02±68
32768 10±4 31±0.97 4.8E+02±51 42±9.1 2.8E+02±98
0.1 4096 5.8±1.3 27±1.3 3.2E+02±32 8.1±1.7 33±9.2
8192 5.9±1.3 26±1.3 3.1E+02±33 5.7±1 27±6.9
16384 5.9±1.4 25±1 3.1E+02±34 4±1.4 26±7.7
32768 5.1±1.1 25±1 3.1E+02±34 3.8±0.88 16±5.1
1 4096 14±5.3 29±1.3 4.3E+02±31 6.2±1.7 69±25
8192 15±5.3 30±2.1 3.4E+02±38 5.6±2.2 69±25
16384 16±5.6 28±1.1 4.1E+02±36 6.1±2 76±27
32768 15±5.5 29±1.9 4.3E+02±38 5.6±1.5 73±24
rf1 (8) rf2 (8) wq (14) oes10 (16) oes97 (16) scm1d (16) scm20d (16)
epsilon #target
0.001 4096 2E+13±2E+13 2E+13±2E+13 7.1E+09±3E+09 2.9E+08±8.3E+07 8.7E+08±4E+08 4E+07±3.6E+07 1.7E+07±1.1E+07
8192 2E+13±2E+13 2E+13±2E+13 3.7E+09±1.9E+09 3.7E+08±1.3E+08 1.4E+09±1.2E+09 9.3E+05±5E+05 2.5E+08±1.9E+08
16384 2E+13±2E+13 2E+13±2E+13 6.6E+09±3.2E+09 5.6E+08±4.3E+08 2.5E+08±1.3E+08 3.5E+05±1.3E+05 8.9E+07±5.7E+07
32768 2E+13±2E+13 2E+13±2E+13 3.1E+09±1.2E+09 5.5E+08±3E+08 3.1E+08±9.5E+07 9.7E+05±4.5E+05 1.3E+09±1.3E+09
0.01 4096 2E+13±2E+13 2E+13±2E+13 1.1E+10±7.3E+09 4.3E+09±3.8E+09 3.5E+09±2.5E+09 4.1E+08±3.8E+08 1.3E+11±1.1E+11
8192 2E+13±2E+13 2E+13±2E+13 6.4E+10±6E+10 3E+10±2.8E+10 1E+10±6.1E+09 8.1E+08±5.5E+08 1.1E+11±1.1E+11
16384 2E+13±2E+13 2E+13±2E+13 3.3E+09±7.9E+08 1.1E+09±4.3E+08 1E+10±5.7E+09 4.8E+07±3.7E+07 1.3E+09±8.3E+08
32768 2E+13±2E+13 2E+13±2E+13 5.1E+11±4.9E+11 6.5E+09±5E+09 4E+09±3.2E+09 1.6E+07±9.5E+06 2.7E+08±1.3E+08
0.1 4096 2E+13±2E+13 2E+13±2E+13 8.7E+09±3.7E+09 4.8E+04±3.2E+04 6E+09±6E+09 1.5E+03±6.7E+02 1.3E+06±6.4E+05
8192 2E+13±2E+13 2E+13±2E+13 4.8E+09±1.5E+09 1.7E+05±1.3E+05 6E+09±6E+09 6.2E+02±2.8E+02 1.2E+06±8.7E+05
16384 2E+13±2E+13 2E+13±2E+13 1.3E+10±6.8E+09 5.2E+04±4.7E+04 5.6E+09±5.6E+09 2.2E+02±46 2.9E+05±1E+05
32768 2E+13±2E+13 2E+13±2E+13 7.4E+09±2.9E+09 7.6E+03±5.1E+03 9.2E+07±8.1E+07 1.1E+02±17 1.1E+05±3.1E+04
1 4096 2E+13±2E+13 2E+13±2E+13 8E+08±2E+08 6.6E+02±3.4E+02 8.3E+05±8.1E+05 4.1E+02±76 5.2E+05±6.5E+04
8192 2E+13±2E+13 2E+13±2E+13 6.9E+08±1.7E+08 3.5E+02±1.8E+02 7.7E+05±7.6E+05 8.5E+02±3.1E+02 1.1E+06±3.9E+05
16384 2E+13±2E+13 2E+13±2E+13 5.3E+08±1.2E+08 2.2E+02±1.5E+02 4E+05±4E+05 1.3E+02±14 4.7E+05±1.8E+05
32768 2E+13±2E+13 2E+13±2E+13 5.5E+08±1.5E+08 1.9E+02±1.6E+02 3.1E+05±3.1E+05 1E+02±11 3.4E+05±6.4E+04
Table 1. Mean region size for varying ε and the number of target points in the ball.