
Informativeness of Weighted Conformal Prediction

Mufang Ying†, Wenge Guo⋆, Koulik Khamaru†, Ying Hung†

Department of Statistics, Rutgers University - New Brunswick†


Department of Mathematical Sciences, New Jersey Institute of Technology⋆

May 13, 2024


arXiv:2405.06479v1 [stat.ME] 10 May 2024

Abstract

Weighted conformal prediction (WCP), a recently proposed framework, provides uncertainty quantification with the flexibility to accommodate different covariate distributions between training and test data. However, as this paper points out, the effectiveness of WCP heavily relies on the overlap between covariate distributions; insufficient overlap can lead to uninformative prediction intervals. To enhance the informativeness of WCP, we propose two methods for scenarios involving multiple sources with varied covariate distributions. We establish theoretical guarantees for our proposed methods and demonstrate their efficacy through simulations.

1 Introduction
In recent years, there has been an extraordinary surge in computational power and sophisticated machine learning models, revolutionizing various fields, from artificial intelligence to scientific research and beyond. These machine learning models are trained on vast amounts of data to comprehend and predict complex phenomena such as weather forecasting and disease diagnostics. However, as problems grow in complexity, it is crucial not only to provide accurate predictions but also to quantify the associated uncertainties.

Conformal prediction, a methodology for constructing prediction intervals, has gained significant attention and popularity for its ability to assess uncertainties in machine learning models (Vovk et al., 1999; Papadopoulos et al., 2002; Vovk et al., 2005; Lei et al., 2013; Lei and Wasserman, 2015; Angelopoulos et al., 2023). One reason for the prominence of conformal prediction is its capacity to provide nonasymptotic coverage guarantees for any black-box algorithm, regardless of the underlying distribution. This remarkable feature is achieved by relying on the exchangeability of the data points. In practice, however, the data points are not guaranteed to be exchangeable, one notable example being covariate shift between training and test distributions in supervised learning tasks (Quiñonero-Candela et al., 2022). A recent framework, weighted conformal prediction (Tibshirani et al., 2019), offers a solution in the regression setup by incorporating knowledge of the likelihood ratio between the training and test covariate distributions.
While weighted conformal prediction has demonstrated successful applications in diverse domains such as experimental design, survival analysis, and causal inference (e.g., see Fannjiang et al. (2022); Lei and Candès (2021); Candès et al. (2023)), the effectiveness of this framework heavily depends on the overlap of the covariate distributions between training and test data. In Figure 1, a simple example demonstrates that the constructed WCP intervals can be uninformative in certain cases. We examine a regression example with Q_X = N(0, 9) representing the covariate distribution of the test data and P_X = N(µ, 9) representing the covariate distribution of the training data. Three sample sizes, n = 10, 50, 100, are considered, and the empirical results for the constructed WCP intervals are obtained from 10000 replications.¹ We reduce the overlap of the covariate distributions P_X and Q_X by increasing µ, the mean of P_X. Although WCP intervals provide marginal coverage above the target level 0.9 (left panel of Figure 1), this coverage guarantee is accomplished at the cost of an increasing probability of uninformative prediction intervals, (−∞, ∞), as the overlap decreases (middle panel of Figure 1). Furthermore, conditional on finite prediction intervals, the right panel of Figure 1 shows that the conditional coverage, referred to as the informative coverage probability, decreases and falls below the target level as the overlap decreases. The decrease in coverage appears to be more significant for smaller sample sizes. Motivated by this example, when evaluating the efficacy of WCP intervals, one should assess the probability of obtaining informative prediction intervals and the informative coverage probability as more direct metrics, instead of the marginal coverage probability alone.

Figure 1. An application of weighted conformal prediction to P_{Y|X} = N(sigmoid(X), 0.01), Q_X = N(0, 9), and P_X = N(µ, 9) with mean µ ∈ [0, 6] at level 0.9. Left: empirical marginal coverage probability. Middle: empirical probability of obtaining finite prediction intervals. Right: empirical conditional coverage given that the prediction interval is finite. Sample sizes n ∈ {10, 50, 100} are considered and the results are obtained through 10000 replications. See the Appendix for additional details and another example with varying variance of P_X.

¹The code used in this paper is available on our GitHub page [link].
In addition to the issue of uninformativeness that WCP intervals may present, another practical concern associated with the WCP framework arises when dealing with training data sourced from multiple groups with varied covariate distributions. This scenario is quite common in practice, particularly in medical studies aimed at predicting treatment effects for patients using various covariates such as age, gender, and medical history. Data are often collected from different hospitals or clinics, each possessing its own distinct patient population and covariate distribution. Although in theory we can apply the generalized WCP techniques from Tibshirani et al. (2019), the resulting weight functions are complex and, consequently, not practical, as noted in Lei and Candès (2021). A recent work (Bhattacharyya and Barber, 2024) focuses on achieving a marginal coverage guarantee in a special scenario where both training and test data can be viewed as collected via stratified sampling. Specifically, the covariate X is represented as X = (X⁰, X¹), with X⁰ ∈ [K] encoding the group information, and the test distribution exhibits covariate shift only in X⁰. The aim of this paper is to address a more general scenario where covariates do not contain explicit group information.
Motivated by the preceding discussion, the probability of obtaining informative prediction intervals and the informative coverage probability are important metrics for evaluating the informativeness of WCP-based procedures. When multiple sources (we use the words groups and sources interchangeably) are present in the training data with covariate shifts, it is crucial to adapt the WCP framework to handle multiple varied covariate distributions in order to enhance informativeness.

Contribution of this work. This paper focuses on improving the informativeness of WCP when multiple sources with covariate shifts are available. Two procedures, WCP based on selective Bonferroni and WCP based on data pooling, are proposed to integrate information from different sources to enhance the informativeness. The proposed approaches aim to increase the probability of obtaining a finite prediction interval, thereby ensuring that the informative coverage probability closely approximates the target coverage probability. We establish theoretical guarantees for these methods and provide empirical evidence of their effectiveness in numerical experiments.

2 Multiple data sources with covariate shifts


Suppose data is collected from K groups in a study, where the sample points from the k-th group are denoted as D_cal^(k) = {(X_{k,i}, Y_{k,i}) ∈ R^d × R : i ∈ I_cal^(k)} with I_cal^(k) = [n_k]. We denote the distribution of the observations from the k-th group as

\[
(X_{k,i}, Y_{k,i}) \overset{\text{i.i.d.}}{\sim} P_X^{(k)} \times P_{Y|X} \quad \text{for } i \in \mathcal{I}_{\mathrm{cal}}^{(k)}, \tag{1}
\]

where P_X^(k) denotes the marginal distribution of the covariate for the k-th group and P_{Y|X} denotes the conditional distribution of Y given X. We assume that unlabeled training datasets D_tr^(k) = {X_{k,i} : i ∈ I_tr^(k)} with i.i.d. covariates and I_tr^(k) = {n_k + 1, ..., 2n_k}, a pre-trained model f to predict E[Y|X], and a score function s : R^d × R → R are available. The unlabeled training datasets will be used for group selection and likelihood ratio estimation. Our goal is to utilize data collected from the K observed groups to provide uncertainty quantification for predictions in a test group. We represent the observations from the test group by

\[
(X_{0,i}, Y_{0,i}) \overset{\text{i.i.d.}}{\sim} Q_X \times P_{Y|X} \quad \text{for } i \in [n_0], \tag{2}
\]

where covariates from the test group have marginal distribution Q_X and the outcomes {Y_{0,i} : i ∈ [n_0]} are not observed. We assume Q_X is known (i.e., an unlabeled dataset is available for training purposes, which we denote as D_tr^(0) = {X_{0,i} : i ∈ [n_0]}). Note that Q_X can differ from the covariate distributions of the observed groups. Unless stated otherwise, we assume in the following discussion that the covariate distributions Q_X and {P_X^(k) : k ∈ [K]} are pairwise absolutely continuous with respect to each other.
Lastly, it is worth mentioning that conformal prediction aims to create a prediction interval for Y_{0,i} with the following guarantee:

\[
\mathbb{P}\left\{ Y_{0,i} \in \widehat{C}_n(X_{0,i}) \right\} \ge 1 - \alpha,
\]

where Ĉ_n(x) is a prediction band constructed from the available data sources. Beyond ensuring a theoretical guarantee for the marginal coverage probability, our aim in this study is to leverage multiple sources to increase the probability of obtaining a finite prediction interval and to improve the informative coverage probability.
Remark 1. The two-layer hierarchical model studied in Dunn et al. (2023) assumes exchangeability between the covariate distributions of the observed groups and the covariate distribution of the test group (i.e., these covariate distributions are drawn independently and identically distributed from a certain distribution). In this work, we do not make such an assumption.

Two special cases. Suppose the observed groups have separated supports; in such cases, a combination of Mondrian conformal prediction (Vovk et al., 2005) and weighted conformal prediction (Tibshirani et al., 2019) could be effective. When observed groups have varying levels of overlap among themselves, the problem becomes more challenging. With overlapping supports, we also consider the special scenario where Q_X can be expressed as a mixture of {P_X^(k) : k ∈ [K]}. In this case, the idea of group-weighted conformal prediction (Bhattacharyya and Barber, 2024) can be useful. However, when covariates do not explicitly contain group information, such a mixture structure can impose practical limitations and raise identification issues. More details on these two cases are given in Appendices A and B.

3 Informative prediction interval in WCP


We start the discussion of the informativeness of WCP by recalling the main theorem from Tibshirani et al. (2019), which allows for covariate shift between training and test distributions with K = 1.

Proposition 1 (Theorem 2 in Tibshirani et al. (2019)). Let the dataset D = {(X_i, Y_i) ∈ R^d × R : i ∈ [n]} consist of i.i.d. data points drawn from P_X × P_{Y|X}, and let (X_0, Y_0) be drawn independently from Q_X × P_{Y|X}. Let Q_X be absolutely continuous with respect to P_X with known likelihood ratio w(x) = dQ_X/dP_X(x). Given the dataset D and likelihood ratio w, define the weight functions at x as follows:

\[
p^w_0(x; D) = \frac{w(x)}{w(x) + \sum_{j=1}^{n} w(X_j)} \quad \text{and} \quad p^w_i(x; D) = \frac{w(X_i)}{w(x) + \sum_{j=1}^{n} w(X_j)}, \tag{3}
\]

where i ∈ [n]. Then, it follows that

\[
\mathbb{P}\left\{ Y_0 \in \widehat{C}_n(X_0; 1-\alpha, D) \right\} \ge 1 - \alpha, \tag{4}
\]

where

\[
\widehat{C}_n(x; 1-\alpha, D) = \left\{ y : s(x, y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n} p^w_i(x; D)\,\delta_{s(X_i, Y_i)} + p^w_0(x; D)\,\delta_\infty \right) \right\}
\]

and δ_z denotes a unit point mass at z ∈ R.

By reweighting the scores based on the dataset D, one can attain a finite-sample guarantee when the likelihood ratio is known. While equation (4) ensures a marginal coverage guarantee for Y_0, the prediction interval is notably conservative, as demonstrated in Figure 1, when P_X and Q_X have a large total variation distance, denoted by d_TV(P_X, Q_X) = sup_A |P_X(A) − Q_X(A)|. Specifically, when the event E = {p^w_0(X_0; D) ≤ α} does not happen, the resulting WCP interval is uninformative:

\[
\text{event } E^c \text{ happens} \iff \widehat{C}_n(X_0; 1-\alpha, D) = (-\infty, \infty).
\]
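To make the event E concrete, the following is a minimal sketch of the WCP interval of Proposition 1 for the absolute-residual score s(x, y) = |f(x) − y| used later in the paper. The helper name `wcp_interval`, the vectorized predictor `f`, and the likelihood-ratio function `w` are our illustrative assumptions, not the authors' released code.

```python
import numpy as np

def wcp_interval(X_cal, Y_cal, x0, f, w, alpha=0.1):
    """Weighted conformal interval at x0 (Proposition 1).

    X_cal, Y_cal : calibration data drawn i.i.d. from P_X x P_{Y|X}
    f            : pre-trained mean predictor (assumed vectorized)
    w            : likelihood ratio dQ_X/dP_X (assumed vectorized)
    Returns (lo, hi); (-inf, inf) exactly when the event E fails.
    """
    scores = np.abs(f(X_cal) - Y_cal)       # s(X_i, Y_i) = |f(X_i) - Y_i|
    weights = w(X_cal)
    total = w(x0) + weights.sum()
    p = weights / total                     # p_i^w(x0; D) from equation (3)
    p0 = w(x0) / total                      # p_0^w(x0; D), the mass placed at +infinity
    if p0 > alpha:                          # event E^c: the interval is (-inf, inf)
        return -np.inf, np.inf
    # weighted (1 - alpha)-quantile of the scores, with mass p0 at +infinity
    order = np.argsort(scores)
    cum = np.cumsum(p[order])
    idx = min(np.searchsorted(cum, 1 - alpha), len(scores) - 1)
    q = scores[order][idx]
    m = float(np.asarray(f(x0)).ravel()[0])
    return m - q, m + q
```

For this symmetric score the band is centered at f(x0), and the infinite interval occurs exactly when p^w_0(x0; D) > α, matching the equivalence displayed above.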

For this reason, we decompose the marginal coverage probability in equation (4) as:

\[
\mathbb{P}\left\{ Y_0 \in \widehat{C}_n(X_0; 1-\alpha, D) \right\} = 1 - \mathbb{P}(E) + \mathbb{P}(E) \cdot \mathbb{P}\left\{ Y_0 \in \widehat{C}_n(X_0; 1-\alpha, D) \mid E \right\}. \tag{5}
\]

We refer to the conditional coverage probability in equation (5) as the informative coverage probability and present its properties in Theorem 1.
Theorem 1. Under the same assumptions as in Proposition 1, it holds that

\[
\mathbb{P}\left\{ Y_0 \in \widehat{C}_n(X_0; 1-\alpha, D) \mid E \right\} \ge 1 - \frac{\alpha}{\mathbb{P}(E)}. \tag{6}
\]

When E_{X∼P_X}[w(X)^{1+δ}] < ∞ for some δ ≥ 1, it holds that

\[
\mathbb{P}\left\{ Y_0 \in \widehat{C}_n(X_0; 1-\alpha, D) \mid E \right\} \le 1 - \frac{\alpha}{\mathbb{P}(E)}\left(1 - \frac{C_1(\delta)}{\alpha \cdot n^{\delta/(1+\delta)}}\right), \tag{7}
\]

where C_1(δ) is a universal constant depending on δ.
See Appendix D.1 for a proof of Theorem 1. Note that P(E) depends on two factors: the sample size n and the likelihood ratio w. Moreover, Theorem 1 implies that P(E) governs the informative coverage probability. In the following remark, we provide a lower bound for P(E).
Remark 2. It can be shown that

\[
\mathbb{P}(E) \ge \mathbb{E}_{X\sim P_X}\left[ w(X)\,\mathbf{1}\left\{ w(X) \le \frac{n\alpha}{2-2\alpha} \right\} \right] \cdot \mathbb{P}\left( \left| \frac{1}{n}\sum_{i=1}^{n} w(X_i) - 1 \right| \le \frac{1}{2} \right). \tag{8}
\]

With Q_X fixed, P(E) depends on the likelihood ratio w and the sample size in D. This dependency and the finite-sample performance can be studied through the lower bound in equation (8). When there exists some δ > 0 such that E_{X∼P_X}[w(X)^{1+δ}] < ∞, the second term on the right side of equation (8) can be controlled using concentration inequalities. With n sufficiently large, this lower bound approaches 1. A detailed discussion of equation (8) can be found in Appendix D.2.

4 Enhancing informativeness through integration


A straightforward method to integrate information from multiple groups is to create K prediction intervals based on each group and then select the interval with the shortest length. However, adopting such an idea falls into the trap of post selection (Benjamini, 2010; Taylor and Tibshirani, 2015; Taylor, 2018), which ultimately leads to the breakdown of the theoretical guarantee. In this section, we propose two approaches: the first integrates WCP intervals following group selection, while the second integrates data from the observed groups before applying the WCP framework.

4.1 A conservative approach: selective Bonferroni procedure

With known likelihood ratio w_k = dQ_X/dP_X^(k), the k-th prediction band can be constructed as

\[
\widehat{C}^{(k)}(x; 1-\alpha, D^{(k)}_{\mathrm{cal}}) = \left\{ y : s(x, y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n_k} p^{w_k}_i(x; D^{(k)}_{\mathrm{cal}})\,\delta_{s(X_{k,i}, Y_{k,i})} + p^{w_k}_0(x; D^{(k)}_{\mathrm{cal}})\,\delta_\infty \right) \right\}.
\]

Algorithm 1 Group selection
Input: number of groups K_init to be selected, likelihood ratios {w_k : k ∈ [K]}, training data set D_tr^all
Procedure:
1: Initialize an empty list G and a 0-1 matrix M of dimension K × n_0
   (the (k, i)-th entry estimates whether group k can provide a finite prediction interval for X_{0,i} at level 1 − α/K_init)
2: Compute the (k, i)-th entry of M as

\[
M_{k,i} = \mathbf{1}\left\{ \frac{w_k(X_{0,i})}{w_k(X_{0,i}) + \sum_{j \in \mathcal{I}^{(k)}_{\mathrm{tr}}} w_k(X_{k,j})} \le \alpha / K_{\mathrm{init}} \right\}
\]

3: while |G| ≤ K_init and M is not empty do
4:   Find the group k_j that maximizes the row sum of M and add k_j to G
5:   Update M by deleting the columns that have value 1 in the k_j-th row
6: end while
7: return G
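A direct implementation of Algorithm 1 might look as follows. This is a sketch under the assumption that each w_k is available as a vectorized function, and all variable names are ours rather than the authors':

```python
import numpy as np

def select_groups(K_init, w_list, X_test_tr, X_tr_list, alpha=0.1):
    """Greedy group selection (Algorithm 1).

    w_list    : list of K likelihood-ratio functions w_k = dQ_X/dP_X^(k)
    X_test_tr : unlabeled test covariates D_tr^(0)
    X_tr_list : list of unlabeled training covariates D_tr^(k), one per group
    """
    K, n0 = len(w_list), len(X_test_tr)
    # M[k, i] = 1 if group k can give X_{0,i} a finite interval at level 1 - alpha/K_init
    M = np.zeros((K, n0), dtype=int)
    for k, (wk, Xk) in enumerate(zip(w_list, X_tr_list)):
        p0 = wk(X_test_tr) / (wk(X_test_tr) + wk(Xk).sum())
        M[k] = (p0 <= alpha / K_init).astype(int)
    G = []
    while len(G) < K_init and M.shape[1] > 0:
        sums = M.sum(axis=1)
        if sums.max() == 0:                  # remaining groups cover no test points
            break
        k_best = int(np.argmax(sums))        # group covering the most test points
        G.append(k_best)
        M = M[:, M[k_best] == 0]             # drop test points already covered
    return G
```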

According to Proposition 1, for each k ∈ [K], we have

\[
\mathbb{P}\left\{ Y_0 \in \widehat{C}^{(k)}(X_0; 1-\alpha, D^{(k)}_{\mathrm{cal}}) \right\} \ge 1 - \alpha. \tag{9}
\]

Equipped with prediction intervals obtained from different groups, our goal is to combine them to increase the probability of obtaining finite prediction intervals and to improve the informative coverage probability. First consider the majority vote procedure (Gasparin and Ramdas, 2024) at x:

\[
\left\{ y : \frac{1}{K} \sum_{k=1}^{K} \mathbf{1}\left\{ y \in \widehat{C}^{(k)}(x; 1-\alpha, D^{(k)}_{\mathrm{cal}}) \right\} > 1/2 \right\},
\]

which includes all y voted for by at least half of the WCP intervals. When there are more than two groups (K > 2), such a construction is not effective. Consider a scenario where one group has significant overlap between its P_X^(k) and Q_X, resulting in a finite prediction interval, while the other groups fail to adequately quantify the uncertainty for the new group and produce the prediction interval (−∞, ∞) with high probability. In such cases, the majority vote procedure leads to uninformative prediction intervals with high probability. To remedy this issue, we consider Bonferroni's correction following a group selection step. The following ingredients are required for the proposed method:
• K_init: the initial guess of the number of groups required to encompass Q_X

• Algorithm 1: an algorithm to perform group selection based on D_tr^all = D_tr^(0) ∪ (∪_{k∈[K]} D_tr^(k))

• G: the list of groups selected by Algorithm 1

Then, we define the prediction interval based on the selective Bonferroni procedure as:

\[
\widehat{C}^{B}(x; 1-\alpha, G) = \left\{ y : \sum_{k \in G} \mathbf{1}\left\{ y \in \widehat{C}^{(k)}(x; 1-\alpha/|G|, D^{(k)}_{\mathrm{cal}}) \right\} = |G| \right\}. \tag{10}
\]

Note that G is the list of groups selected by Algorithm 1 based on training data, and Ĉ^B(x; 1−α, G) is the intersection of the prediction bands Ĉ^(k)(x; 1−α/|G|, D_cal^(k)) over the selected groups G.
Theorem 2. Let G be the set of groups selected by Algorithm 1 with inputs K_init, likelihood ratios {w_k : k ∈ [K]}, and training data D_tr^all. Then, the prediction interval defined in equation (10) is a level 1 − α prediction interval:

\[
\mathbb{P}\left\{ Y_0 \in \widehat{C}^{B}(X_0; 1-\alpha, G) \right\} \ge 1 - \alpha. \tag{11}
\]

Moreover, the corresponding informative coverage probability satisfies

\[
\mathbb{P}\left\{ Y_0 \in \widehat{C}^{B}(X_0; 1-\alpha, G) \mid E^B \right\} \ge 1 - \frac{\alpha}{\mathbb{P}(E^B)}, \tag{12}
\]

where

\[
E^B = \left\{ \text{there exists a group } k \in G \text{ such that } p^{w_k}_0(X_0; D^{(k)}_{\mathrm{cal}}) \le \alpha/|G| \right\}.
\]

See Appendix D.3 for a proof of Theorem 2. Note that WCP based on the selective Bonferroni procedure mitigates the occurrence of infinite prediction intervals, albeit at the expense of amplifying the level of the WCP intervals. Hence, Bonferroni's correction may result in a wider prediction interval, making it preferable to have fewer groups capable of encompassing the covariate distribution. Algorithm 1 is designed for this purpose, estimating the probability of obtaining finite prediction intervals through D_tr^all. When computational power is sufficient, one can examine a sequence of K_init values and determine a final group list of smaller size that still ensures D_tr^(0) can be provided with finite prediction intervals based on D_tr^all.
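For illustration, a minimal sketch of the selective Bonferroni interval in equation (10), reusing the hypothetical `wcp_interval` and `select_groups` helpers sketched above:

```python
def wcp_selective_bonferroni(cal_sets, w_list, G, x0, f, alpha=0.1):
    """Intersect per-group WCP intervals at level 1 - alpha/|G| (equation (10)).

    cal_sets : list of (X_cal, Y_cal) pairs, one per group
    w_list   : likelihood ratios w_k = dQ_X/dP_X^(k)
    G        : groups returned by select_groups (Algorithm 1)
    """
    lo, hi = -float("inf"), float("inf")
    for k in G:
        X_cal, Y_cal = cal_sets[k]
        lk, hk = wcp_interval(X_cal, Y_cal, x0, f, w_list[k], alpha=alpha / len(G))
        lo, hi = max(lo, lk), min(hi, hk)   # intersection of the |G| bands
    return lo, hi
```

With the symmetric absolute-residual score, each band is centered at f(x0), so the intersection above is never empty.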

4.2 Pooling method

While integrating WCP intervals based on the selective Bonferroni procedure reduces the probability of getting infinite prediction intervals, it tends to inflate the final interval lengths. In this section, we utilize the data pooling technique, which integrates information from multiple groups with the aim of enhancing sample efficiency and obtaining shorter prediction intervals. As the name suggests, data pooling aggregates data from different groups and treats them as observations from a weighted population. By creating a covariate distribution that exhibits better overlap with Q_X and enables a larger sample size, data pooling tends to increase the probability of obtaining informative prediction intervals, as indicated by Remark 2. The data pooling procedure can be described as follows:
• collect D_cal^(k) from the different groups and form the dataset D_cal^all = ∪_{k∈[K]} D_cal^(k):

D_cal^all = {(X_{1,1}, Y_{1,1}), ..., (X_{1,n_1}, Y_{1,n_1}), ..., (X_{K,1}, Y_{K,1}), ..., (X_{K,n_K}, Y_{K,n_K})}

• permute the elements of D_cal^all uniformly at random to obtain D^pool, where

D^pool = {(X̃_1, Ỹ_1), ..., (X̃_n, Ỹ_n)}.

One can easily verify that for any i ∈ [n],

\[
(\widetilde{X}_i, \widetilde{Y}_i) \sim \widetilde{P}_X \times P_{Y|X}, \quad \text{where } \widetilde{P}_X = \sum_{k \in [K]} \frac{n_k}{n} P_X^{(k)}. \tag{13}
\]
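A minimal sketch of the pooling step follows. The closed form used below for the pooled likelihood ratio, 1/w̄(x) = Σ_k (n_k/n) / w_k(x), follows directly from equation (13); the helper names are our own.

```python
import numpy as np

def pool_and_reweight(cal_sets, w_list, rng=np.random.default_rng(0)):
    """Data pooling (Section 4.2): merge groups, permute, and build the
    pooled likelihood ratio w_bar = dQ_X / dP_tilde_X.

    cal_sets : list of (X_cal, Y_cal) pairs, one per group
    w_list   : per-group likelihood ratios w_k = dQ_X / dP_X^(k)
    """
    X = np.concatenate([Xk for Xk, _ in cal_sets])
    Y = np.concatenate([Yk for _, Yk in cal_sets])
    perm = rng.permutation(len(Y))               # uniform random permutation
    X, Y = X[perm], Y[perm]                      # D^pool; covariates ~ P_tilde_X
    n_k = np.array([len(Yk) for _, Yk in cal_sets])
    frac = n_k / n_k.sum()                       # mixture weights n_k / n

    def w_bar(x):
        # dP_tilde/dQ(x) = sum_k (n_k/n) * dP^(k)/dQ(x) = sum_k frac_k / w_k(x)
        return 1.0 / sum(f / wk(x) for f, wk in zip(frac, w_list))

    return X, Y, w_bar

# (X, Y, w_bar) can then be fed to the earlier wcp_interval sketch with w = w_bar.
```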

Note that, after the data permutation step, observations share the covariate distribution P̃_X, as indicated in equation (13), enabling the utilization of the WCP framework. However, the marginal coverage guarantee of WCP breaks down due to the correlation within the dataset D^pool. Nonetheless, in cases where correlations within D^pool are minimal, the difference between D^pool and its i.i.d. version is insignificant. To summarize, we have the following theorem:

Theorem 3. Suppose the pooled dataset D^pool = {(X̃_i, Ỹ_i) : i ∈ [n]} is obtained and the likelihood ratio w̄ = dQ_X/dP̃_X is known. Then, it holds that

\[
\mathbb{P}\left\{ Y_0 \in \widehat{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \right\} \ge 1 - \alpha - d_{\mathrm{TV}}(\widetilde{\mathbf{X}}, \mathbf{X}'), \tag{14}
\]

where

\[
\widehat{C}^{P}(x; 1-\alpha, D^{\mathrm{pool}}) = \left\{ y \in \mathbb{R} : s(x, y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n} p^{\bar w}_i(x; D^{\mathrm{pool}})\,\delta_{s(\widetilde X_i, \widetilde Y_i)} + p^{\bar w}_0(x; D^{\mathrm{pool}})\,\delta_\infty \right) \right\},
\]

X̃ = (X̃_1^⊤, ..., X̃_n^⊤), X' = (X_1'^⊤, ..., X_n'^⊤), and X_i' ∼ i.i.d. P̃_X.

Moreover, with E^P = {p^{w̄}_0(X_0; D^pool) ≤ α}, the informative coverage probability based on data pooling satisfies

\[
\mathbb{P}\left\{ Y_0 \in \widehat{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \mid E^P \right\} \ge 1 - \frac{\alpha + d_{\mathrm{TV}}(\widetilde{\mathbf{X}}, \mathbf{X}')}{\mathbb{P}(E^P)}. \tag{15}
\]

See Appendix D.4 for a proof of Theorem 3.

The above theorem ensures that weighted conformal prediction can provide almost valid coverage when X̃ is close to its i.i.d. version X' in total variation distance. Note that equation (14) is equivalent to replacing D^pool by D_cal^all. While the total variation distance can be bounded using Pinsker's inequality when P̃_X follows a normal distribution, it is generally challenging to control in other cases.
Alternatively, imposing a two-layer data generating mechanism can remove the coverage gap posed by the total variation distance. Specifically, one can assume the dataset {(g, X_{g,i}, Y_{g,i}) : g ∈ [K], i ∈ [n_g]} consists of i.i.d. random variables generated from the following process:

\[
\begin{cases}
g \sim \mathrm{Multinomial}(q_1, \ldots, q_K), \\
(X_{g,i}, Y_{g,i}) \mid (g = k) \sim P_X^{(k)} \times P_{Y|X}.
\end{cases}
\]

Consequently, {(X_{k,i}, Y_{k,i}) : k ∈ [K], i ∈ [n_k]} can be viewed as a realization of i.i.d. random variables distributed as (Σ_{1≤k≤K} q_k P_X^{(k)}) × P_{Y|X} by ignoring the group information. Without estimating the mixture weights, one can work with the marginal mixture distribution Σ_{1≤k≤K} q_k P_X^{(k)} in the WCP framework. Besides, one can consider designing a weighted population that depends on {P_X^{(k)} : k ∈ [K]} and overlaps well with Q_X, as mentioned in Lei and Candès (2021). However, with the available datasets {D_cal^{(k)} : k ∈ [K]}, sampling enough i.i.d. data from this population may be challenging, especially when the sample sizes {n_k : k ∈ [K]} are imbalanced.
Note that we present Theorem 2 and Theorem 3 with known likelihood ratios. However, in practice, likelihood ratios are often unknown and need to be estimated. In Appendix C, we provide modified versions of Theorem 2 and Theorem 3 with estimated likelihood ratios, using Theorem 3 in Lei and Candès (2021). Additionally, in Appendix E.3, we discuss how we estimate the likelihood ratios in the numerical experiments.

5 Numerical experiments
In this section, simulations are conducted to demonstrate the finite sample performance of the proposed approaches: WCP based on the selective Bonferroni procedure, denoted by WCP-SB, and WCP based on the data pooling technique, denoted by WCP-P. These methods are compared with a naive alternative, denoted by WCP-SS, which selects the shortest WCP interval among the WCP intervals obtained from each single group. The numerical performance is evaluated by four measurements: the marginal coverage probability (MCP), the probability of obtaining informative prediction intervals (IP), the informative coverage probability (ICP), and the average length of finite prediction intervals (AIL). In Section 5.1, we begin with an example of 2 groups with a one-dimensional covariate and known likelihood ratios. In Section 5.2, more complex numerical examples with a higher covariate dimension, more groups, and unknown likelihood ratios are conducted. For simplicity, we consider homoscedastic errors and use the absolute residual as our score function. To address heteroscedastic errors, one can use methods such as conformalized quantile regression (Romano et al., 2019).

Figure 2: Visualization of data and the pre-trained model with σ² = 4.

5.1 Covariate shift: d = 1 with known likelihood ratios

Consider the setting:

\[
P_X^{(1)} = N(-3, \sigma^2), \quad P_X^{(2)} = N(3, \sigma^2), \quad Q_X = N(0, 9), \quad \text{and} \quad P_{Y|X} = N(\mathrm{sigmoid}(X), 0.01),
\]

where sigmoid(x) = exp(x)/(1 + exp(x)). Two settings, σ² ∈ {1, 4}, are considered to demonstrate different scenarios of overlap between the observed groups, with n₁ = n₂ = 100. As shown in Figure 2, where σ² = 4, each single observed group, either Group 1 in blue or Group 2 in red (left panel), has only partial overlap with the new test group (right panel) and therefore does not have sufficient information to provide uncertainty quantification for the test Group 0. Figure 3 demonstrates that WCP based on a single group can lead to uninformative prediction intervals, i.e., (−∞, ∞). The prediction bands of the proposed methods, WCP-SB and WCP-P, are provided in Figure 4; both methods reduce the chance of getting an infinite prediction interval, and the constructed intervals are reasonably close to the oracle 90% confidence intervals.

Figure 3: WCP bands based on single-group data with σ² = 4.

Figure 4: WCP bands based on the selective Bonferroni procedure and data pooling with σ² = 4.
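For concreteness, a sketch of this data-generating setup, including the known Gaussian likelihood ratios (our own helper functions, not the authors' code):

```python
import numpy as np
from scipy.stats import norm

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_group(mean, sigma2, n, rng):
    """Draw (X, Y) with X ~ N(mean, sigma2) and Y | X ~ N(sigmoid(X), 0.01)."""
    X = rng.normal(mean, np.sqrt(sigma2), size=n)
    Y = rng.normal(sigmoid(X), 0.1)                    # sd 0.1 gives variance 0.01
    return X, Y

rng = np.random.default_rng(0)
X1, Y1 = generate_group(-3.0, 4.0, 100, rng)           # Group 1, sigma^2 = 4
X2, Y2 = generate_group(3.0, 4.0, 100, rng)            # Group 2, sigma^2 = 4

# Known likelihood ratios dQ_X/dP_X^(k) as Gaussian density ratios
w1 = lambda x: norm.pdf(x, 0, 3) / norm.pdf(x, -3, 2)  # Q_X = N(0, 9), P^(1) = N(-3, 4)
w2 = lambda x: norm.pdf(x, 0, 3) / norm.pdf(x, 3, 2)   # P^(2) = N(3, 4)
```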
Table 1 summarizes the simulation results based on 5000 replications. We can see that WCP-SB is relatively more conservative, with the lowest IP and the largest AIL. WCP-P has the largest IP when σ² = 1, demonstrating its effectiveness in providing informative prediction intervals when the groups have relatively separated covariates. For the naive approach, WCP-SS, it is observed that when the covariates of the two groups are relatively well separated (σ² = 1), the MCP is close to the target coverage; the coverage gap becomes more significant when the covariates of the two groups have better overlap (σ² = 4). Moreover, we also conduct experiments involving multiple groups with K ∼ Uniform({3, ..., 10}). For each group, we set the size of the calibration data and the mean and variance of P_X^(k) randomly. See the Appendix for detailed implementation of the experiments and additional results.

Table 1: Comparisons for one-dimensional covariate shift with K = 2

                         σ² = 1                         σ² = 4
                 MCP     IP      ICP     AIL     MCP     IP      ICP     AIL
WCP-SB           0.950   0.711   0.929   0.419   0.927   1.000   0.927   0.414
WCP-P            0.910   0.884   0.898   0.372   0.895   1.000   0.895   0.359
WCP-SS           0.896   0.809   0.872   0.356   0.866   1.000   0.866   0.341
WCP-Group 1      0.948   0.412   0.875   0.373   0.915   0.733   0.884   0.369
WCP-Group 2      0.948   0.397   0.869   0.337   0.918   0.725   0.886   0.370

5.2 Covariate shift: higher dimension and unknown likelihood ratios


Inspired by the numerical examples given in Lei and Candès (2021), we consider the covariate vectors for the k-th group as equicorrelated Gaussian vectors, i.e.,

\[
X_{k,i} \sim N(\mu_k, \Sigma_k) \quad \text{with } \mu_k \in \mathbb{R}^d \text{ and } \Sigma_k = \sigma_k^2 \left( \rho_k \mathbf{1}_d \mathbf{1}_d^\top + (1 - \rho_k) I_d \right),
\]

where σ_k² > 0, ρ_k ∈ [0, 1], 1_d is the all-one vector, k = 1, ..., K, and I_d is the d-dimensional identity matrix. We consider the conditional distribution

\[
Y_{k,i} \mid X_{k,i} \sim N\left( 4 \cdot \mathrm{sigmoid}(X_{k,i,1}) \cdot \mathrm{sigmoid}(X_{k,i,2}),\ 0.01 \right),
\]

where X_{k,i,1} and X_{k,i,2} denote the first and second coordinates of X_{k,i}, respectively. Rather than changing the variance of the covariate distributions as in Section 5.1, we specify different covariate shifts by varying the correlation ρ_k. Two scenarios are considered: weakly correlated, with ρ_k ∈ [0, 0.2], and strongly correlated, with ρ_k ∈ [0.7, 0.9]. We also implement two different initial settings for WCP-SB: K_init = 1 and K_init = min{K, 3}.
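A minimal sketch of this covariate generation, with illustrative helper names:

```python
import numpy as np

def equicorrelated_covariates(mu, sigma2, rho, n, rng):
    """Draw n covariate vectors X ~ N(mu, Sigma) with
    Sigma = sigma^2 * (rho * 11^T + (1 - rho) * I)."""
    d = len(mu)
    Sigma = sigma2 * (rho * np.ones((d, d)) + (1.0 - rho) * np.eye(d))
    return rng.multivariate_normal(mu, Sigma, size=n)

rng = np.random.default_rng(0)
X = equicorrelated_covariates(np.zeros(10), 0.9, 0.8, 200, rng)  # strongly correlated case
```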
The simulation results based on 5000 replications are summarized in Table 2 for the case d = 10. The observations are consistent with those in Section 5.1. With stronger correlation, ρ_k ∈ [0.7, 0.9], the data pooling method WCP-P demonstrates the best performance, with MCP and ICP close to 0.9, the largest IP, and the smallest AIL. Additionally, under strong covariate correlation, WCP-SB with K_init = min{K, 3} yields a smaller IP compared to utilizing only one group's information when the covariate dimension is higher. In Appendix E.3, we provide more simulation details and tables summarizing the results for d = 5, 20, 50.

Table 2: Comparisons with d = 10 and K ≥ 2

                               ρk ∼ Uniform(0, 0.2)             ρk ∼ Uniform(0.7, 0.9)
                               MCP     IP      ICP     AIL      MCP     IP      ICP     AIL
WCP-SB (Kinit = 1)             0.905   0.978   0.903   0.400    0.960   0.536   0.926   0.446
WCP-SB (Kinit = min{K, 3})     0.948   0.986   0.947   0.479    0.980   0.444   0.955   0.515
WCP-P                          0.900   0.998   0.900   0.388    0.914   0.921   0.906   0.400
WCP-SS                         0.849   0.998   0.848   0.346    0.926   0.832   0.912   0.417

6 Discussion
In this paper, we demonstrate that constructed WCP intervals can be uninformative. The event linked with obtaining an informative prediction interval is explicitly formulated. When multiple sources are available, two approaches are introduced to enhance the informativeness of WCP. Theoretical results are developed for the proposed methods: WCP based on the selective Bonferroni procedure and WCP based on data pooling. The selective Bonferroni procedure produces relatively conservative prediction intervals; additionally, when the dimension of the covariates increases, IP can decrease with a larger K_init. On the other hand, the data pooling method in general outperforms the other alternatives in our numerical experiments. Note that the lower bound in Theorem 3 is relatively conservative, so an interesting direction for future work is to explore a sharper lower bound for the data pooling method. Additionally, we leave extensions to distribution shift and conformal risk control in scenarios involving multiple groups for future work.

References
Angelopoulos, A. N., Bates, S., et al. (2023). Conformal prediction: A gentle introduction. Foundations
and Trends® in Machine Learning, 16(4):494–591.
Benjamini, Y. (2010). Simultaneous and selective inference: Current successes and future challenges.
Biometrical Journal, 52(6):708–721.
Berrett, T. B., Wang, Y., Barber, R. F., and Samworth, R. J. (2020). The conditional permutation
test for independence while controlling for confounders. Journal of the Royal Statistical Society
Series B: Statistical Methodology, 82(1):175–197.
Bhattacharyya, A. and Barber, R. F. (2024). Group-weighted conformal prediction. arXiv preprint
arXiv:2401.17452.
Candès, E., Lei, L., and Ren, Z. (2023). Conformalized survival analysis. Journal of the Royal Statistical
Society Series B: Statistical Methodology, 85(1):24–45.
Dunn, R., Wasserman, L., and Ramdas, A. (2023). Distribution-free prediction sets for two-layer
hierarchical models. Journal of the American Statistical Association, 118(544):2491–2502.
Fannjiang, C., Bates, S., Angelopoulos, A. N., Listgarten, J., and Jordan, M. I. (2022). Conformal
prediction under feedback covariate shift for biomolecular design. Proceedings of the National
Academy of Sciences, 119(43):e2204569119.
Gasparin, M. and Ramdas, A. (2024). Merging uncertainty sets via majority vote. arXiv preprint
arXiv:2401.09379.
Lei, J., Robins, J., and Wasserman, L. (2013). Distribution-free prediction sets. Journal of the
American Statistical Association, 108(501):278–287.
Lei, J. and Wasserman, L. (2015). Distribution-free prediction bands for nonparametric regression.
Quality control and applied statistics, 60(1):109–110.
Lei, L. and Candès, E. J. (2021). Conformal inference of counterfactuals and individual treatment
effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(5):911–938.
Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. (2002). Inductive confidence machines
for regression. In Machine learning: ECML 2002: 13th European conference on machine learning
Helsinki, Finland, August 19–23, 2002 proceedings 13, pages 345–356. Springer.
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2022). Dataset shift in
machine learning. Mit Press.
Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer school on machine
learning, pages 63–71. Springer.
Romano, Y., Patterson, E., and Candes, E. (2019). Conformalized quantile regression. Advances in
neural information processing systems, 32.
Taylor, J. and Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings of the
National Academy of Sciences, 112(25):7629–7634.
Taylor, J. E. (2018). A selective survey of selective inference. In Proceedings of the International
Congress of Mathematicians: Rio de Janeiro 2018, pages 3019–3038. World Scientific.
Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019). Conformal prediction under
covariate shift. Advances in neural information processing systems, 32.
Vovk, V., Gammerman, A., and Saunders, C. (1999). Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999).
Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic learning in a random world, volume 29.
Springer.

Appendix

Table of Contents

A  Mondrianized weighted conformal prediction
B  Employing group-weighted conformal prediction
C  Weighted conformal prediction with estimated likelihood ratio
D  Proofs
   D.1 Theorem 1
   D.2 Remark 2
   D.3 Theorem 2
   D.4 Theorem 3
   D.5 Corollaries 1 & 2
E  Simulation details
   E.1 Informative WCP intervals
   E.2 Covariate shift: d = 1 with known likelihood ratios
   E.3 Covariate shift: higher dimension with unknown likelihood ratios

A Mondrianized weighted conformal prediction

When the supports of the training covariate distributions {P_X^(k) : k ∈ [K]} are separable, a combination of Mondrian conformal prediction and weighted conformal prediction can be useful. We begin by setting up some notation. Let X_k denote the support of the distribution P_X^(k) for k ∈ [K], which satisfies

\[
\mathcal{X}_i \cap \mathcal{X}_j = \emptyset \quad \text{for } i \ne j.
\]

Let Q_X^(k) denote the conditional distribution of X_0 on the region X_k, where X_0 ∼ Q_X, and assume that Q_X^(k) is absolutely continuous with respect to P_X^(k). Let w̄_k = dQ_X^(k)/dP_X^(k) be the conditional likelihood ratio and set q_k = P(X_0 ∈ X_k).

With the above notation in place, we are now ready to utilize {w̄_k : k ∈ [K]} and {D_cal^(k) : k ∈ [K]} to construct a prediction band at x. For k ∈ [K], define

\[
C^{(k)}(x; 1-\alpha, D^{(k)}_{\mathrm{cal}}) = \left\{ y : s(x, y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n_k} p^{\bar w_k}_i(x; D^{(k)}_{\mathrm{cal}})\,\delta_{s(X_{k,i}, Y_{k,i})} + p^{\bar w_k}_0(x; D^{(k)}_{\mathrm{cal}})\,\delta_\infty \right) \right\}.
\]

Subsequently, we define the Mondrianized weighted conformal prediction interval as:

\[
\widehat{C}_n(x; 1-\alpha, D^{\mathrm{all}}_{\mathrm{cal}}) =
\begin{cases}
C^{(k)}(x; 1-\alpha, D^{(k)}_{\mathrm{cal}}) & \text{if } x \in \mathcal{X}_k \text{ for some } k \in [K], \\
(-\infty, \infty) & \text{otherwise},
\end{cases} \tag{16}
\]

where D_cal^all = ∪_{k∈[K]} D_cal^(k). We point out that when there is no data in the regions outside ∪_{k∈[K]} X_k, there is no way to quantify uncertainty at an X_0 lying in those regions without making distributional assumptions about P_{Y|X}. Therefore, we consider the uninformative interval (−∞, ∞) when X_0 ∉ ∪_{k∈[K]} X_k. With the prediction interval (16) at hand, we can show that

\[
\begin{aligned}
\mathbb{P}\left\{ Y_0 \in \widehat{C}_n(X_0; 1-\alpha, D^{\mathrm{all}}_{\mathrm{cal}}) \right\}
&= \sum_{k=1}^{K} \mathbb{P}(X_0 \in \mathcal{X}_k)\,\mathbb{P}\left( Y_0 \in \widehat{C}_n(X_0; 1-\alpha, D^{\mathrm{all}}_{\mathrm{cal}}) \mid X_0 \in \mathcal{X}_k \right) \\
&\quad + \mathbb{P}\left( X_0 \notin \cup_{k\in[K]} \mathcal{X}_k \right)\mathbb{P}\left( Y_0 \in \widehat{C}_n(X_0; 1-\alpha, D^{\mathrm{all}}_{\mathrm{cal}}) \mid X_0 \notin \cup_{k\in[K]} \mathcal{X}_k \right) \\
&\overset{(i)}{=} \sum_{k=1}^{K} q_k\,\mathbb{P}\left( Y_0 \in C^{(k)}(X_0; 1-\alpha, D^{(k)}_{\mathrm{cal}}) \mid X_0 \in \mathcal{X}_k \right) + 1 - \sum_{k=1}^{K} q_k \\
&\overset{(ii)}{\ge} (1-\alpha)\sum_{k=1}^{K} q_k + 1 - \sum_{k=1}^{K} q_k = 1 - \alpha + \alpha\,\mathbb{P}\left( X_0 \notin \cup_{k\in[K]} \mathcal{X}_k \right).
\end{aligned}
\]

Equality (i) follows from the construction of Ĉ_n(X_0; 1−α, D_cal^all), and an application of weighted conformal prediction for each k ∈ [K] yields inequality (ii).
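A sketch of the routing logic in equation (16); `support_of` and `group_interval` are illustrative placeholders for the support indicators and the per-group WCP bands:

```python
import numpy as np

def mondrian_wcp(x0, K, support_of, group_interval):
    """Mondrianized WCP (equation (16)): route x0 to the group whose support
    contains it, and fall back to (-inf, inf) outside all supports.

    support_of(k, x)    : True if x lies in the support X_k of P_X^(k)
    group_interval(k, x): WCP interval built from group k with ratio w_bar_k
    """
    for k in range(K):
        if support_of(k, x0):                # supports are disjoint: at most one hit
            return group_interval(k, x0)
    return -np.inf, np.inf                   # no data outside the union of supports
```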

B Employing group-weighted conformal prediction

When covariates do not explicitly contain group information, the method introduced as group-weighted conformal prediction (GWCP) in Bhattacharyya and Barber (2024) can be adapted to our problem in cases where Q_X can be represented as a mixture of the {P_X^(k) : k ∈ [K]}. Additionally, no assumptions are required on the overlap between {P_X^(k) : k ∈ [K]}. Consider the following expression for Q_X:

\[
Q_X = \sum_{k=1}^{K} q_k P_X^{(k)} \quad \text{with} \quad \sum_{k=1}^{K} q_k = 1. \tag{17}
\]

Let the empirical distribution of scores for data points in group k be

\[
\widehat{P}^{(k)}_{\mathrm{score}} = \frac{1}{n_k} \sum_{i=1}^{n_k} \delta_{s(X_{k,i}, Y_{k,i})},
\]

where δ_{s(X_{k,i}, Y_{k,i})} denotes the point mass at s(X_{k,i}, Y_{k,i}). Then, the level 1 − α prediction band at x can be calculated by

\[
\widehat{C}_n(x) = \{ y : s(x, y) \le \widehat{q} \} \quad \text{where} \quad \widehat{q} = \mathrm{Quantile}\left(1-\alpha;\ \widehat{P}_{\mathrm{score}}\right), \tag{18}
\]

where P̂_score = Σ_{k=1}^{K} q_k P̂_score^(k). In the following proposition, we provide a modified version of Theorem 4.1 in Bhattacharyya and Barber (2024).

Proposition 2. Suppose {(X_{k,i}, Y_{k,i})}_{k∈[K], i∈[n_k]} are distributed as in (1). Assume assumption (17) holds and let (X_0, Y_0) be drawn independently from the distribution in (2). Then, the prediction interval defined in (18) satisfies

\[
\mathbb{P}\left\{ Y_0 \in \widehat{C}_n(X_0) \right\} \ge 1 - \alpha - \max_k \frac{q_k}{n_k}. \tag{19}
\]
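A sketch of the GWCP threshold in equation (18), assuming the mixture weights q_k are known; the helper is our illustration of the construction, not the code of Bhattacharyya and Barber (2024):

```python
import numpy as np

def gwcp_threshold(score_sets, q, alpha=0.1):
    """Weighted (1 - alpha)-quantile of the mixture score distribution
    sum_k q_k * Phat_score^(k) (equation (18)).

    score_sets : list of score arrays s(X_{k,i}, Y_{k,i}), one per group
    q          : known mixture weights, summing to 1
    """
    scores = np.concatenate(score_sets)
    # each point in group k carries mass q_k / n_k
    mass = np.concatenate([np.full(len(s), qk / len(s))
                           for s, qk in zip(score_sets, q)])
    order = np.argsort(scores)
    cum = np.cumsum(mass[order])
    idx = np.searchsorted(cum, 1 - alpha)
    if idx >= len(scores):              # quantile beyond the observed scores
        return np.inf
    return scores[order][idx]
```

The prediction band at x is then {y : s(x, y) ≤ gwcp_threshold(...)}.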

Some observations are in order regarding Proposition 2. Assumption (17) requires that the covariate distribution Q_X can be represented as a mixture of the covariate distributions of the observed groups, which may limit the practicality of applying Proposition 2. One limitation arises from the fact that the test covariate distribution typically differs from a mixture of the covariate distributions of the observed groups. Furthermore, in cases where there is substantial covariate overlap between the observed groups, identification issues associated with such a mixture structure may arise. To demonstrate, in Figure 5 the idea of GWCP potentially applies only to test group 3 and is not applicable to test groups 1 and 2.

Figure 5: Covariate distributions for different groups, illustrating a practical limitation of GWCP.

C Weighted conformal prediction with estimated likelihood ratio

When an estimated likelihood ratio is used in weighted conformal prediction, a marginal coverage gap is incurred due to the estimation error of the likelihood ratio. To proceed, we state a lemma.

Lemma 1 (Theorem 3 in Lei and Candès (2021)). Assume the same set of assumptions as in Proposition 1 holds. Let ŵ be the estimated likelihood ratio, which is independent of D and satisfies E_{X∼P_X} ŵ(X) = 1. Then

\[
\mathbb{P}\left\{ Y_0 \in \widetilde{C}_n(X_0; 1-\alpha, D) \right\} \ge 1 - \alpha - \frac{1}{2}\,\mathbb{E}_{X\sim P_X}|w(X) - \widehat{w}(X)|,
\]

where

\[
\widetilde{C}_n(x; 1-\alpha, D) = \left\{ y : s(x, y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n} p^{\widehat w}_i(x; D)\,\delta_{s(X_i, Y_i)} + p^{\widehat w}_0(x; D)\,\delta_\infty \right) \right\}.
\]

Analogously, for k ∈ [K], we define

\[
\widetilde{C}^{(k)}(x; 1-\alpha, D^{(k)}_{\mathrm{cal}}) = \left\{ y : s(x, y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n_k} p^{\widehat w_k}_i(x; D^{(k)}_{\mathrm{cal}})\,\delta_{s(X_{k,i}, Y_{k,i})} + p^{\widehat w_k}_0(x; D^{(k)}_{\mathrm{cal}})\,\delta_\infty \right) \right\}.
\]

For the selective Bonferroni procedure and the data pooling method, we define

\[
\widetilde{C}^{B}(x; 1-\alpha, G) = \left\{ y : \sum_{k \in G} \mathbf{1}\left\{ y \in \widetilde{C}^{(k)}(x; 1-\alpha/|G|, D^{(k)}_{\mathrm{cal}}) \right\} = |G| \right\},
\]
\[
\widetilde{C}^{P}(x; 1-\alpha, D^{\mathrm{pool}}) = \left\{ y \in \mathbb{R} : s(x, y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n} p^{\widetilde w}_i(x; D^{\mathrm{pool}})\,\delta_{s(\widetilde X_i, \widetilde Y_i)} + p^{\widetilde w}_0(x; D^{\mathrm{pool}})\,\delta_\infty \right) \right\},
\]

where the estimated likelihood ratio ŵ_k is obtained using D_tr^(k) and D_tr^(0), while w̃ is obtained using D_tr^all and D_tr^(0). With these notations in place, we provide modified versions of Theorem 2 and Theorem 3 with estimated likelihood ratios.
Corollary 1. Let G be the set of groups selected by Algorithm 1 with inputs K_init, training data D_tr^all, and estimated likelihood ratios {ŵ_k : k ∈ [K]}. Assuming the estimated likelihood ratios satisfy E_{X∼P_X^(k)}[ŵ_k(X) | D_tr^(k), D_tr^(0)] = 1, we have

\[
\mathbb{P}\left\{ Y_0 \in \widetilde{C}^{B}(X_0; 1-\alpha, G) \right\} \ge 1 - \alpha - \frac{1}{2}\,\mathbb{E}\sum_{k \in G} \mathrm{Err}_k, \tag{20}
\]

where Err_k = E_{X∼P_X^(k)}[|w_k(X) − ŵ_k(X)| | D_tr^all]. Moreover, the corresponding informative coverage probability satisfies

\[
\mathbb{P}\left\{ Y_0 \in \widetilde{C}^{B}(X_0; 1-\alpha, G) \mid \widetilde{E}^B \right\} \ge 1 - \frac{\alpha + \frac{1}{2}\,\mathbb{E}\sum_{k\in G} \mathrm{Err}_k}{\mathbb{P}(\widetilde{E}^B)}, \tag{21}
\]

where

\[
\widetilde{E}^B = \left\{ \text{there exists a group } k \in G \text{ such that } p^{\widehat w_k}_0(X_0; D^{(k)}_{\mathrm{cal}}) \le \alpha/|G| \right\}.
\]

Corollary 2. Let w̃ be the estimated likelihood ratio for dQ_X/dP̃_X obtained using D_tr^all and D_tr^(0) and satisfying E_{X∼P̃_X}[w̃(X) | D_tr^all, D_tr^(0)] = 1. Then, under the same set of assumptions as in Theorem 3,

\[
\mathbb{P}\left\{ Y_0 \in \widetilde{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \right\} \ge 1 - \alpha - d_{\mathrm{TV}}(\widetilde{\mathbf{X}}, \mathbf{X}') - \frac{1}{2}\,\mathbb{E}_{X\sim \widetilde P_X}|\bar w(X) - \widetilde w(X)|. \tag{22}
\]

Moreover, with Ẽ^P = {p^{w̃}_0(X_0; D^pool) ≤ α}, the informative coverage probability based on data pooling satisfies

\[
\mathbb{P}\left\{ Y_0 \in \widetilde{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \mid \widetilde{E}^P \right\} \ge 1 - \frac{\alpha + d_{\mathrm{TV}}(\widetilde{\mathbf{X}}, \mathbf{X}') + \frac{1}{2}\,\mathbb{E}_{X\sim\widetilde P_X}|\bar w(X) - \widetilde w(X)|}{\mathbb{P}(\widetilde{E}^P)}. \tag{23}
\]

See Appendix D.5 for a proof of Corollaries 1 and 2.

D Proofs

D.1 Theorem 1

Proof. Combining equation (4) and equation (5) yields equation (6). Equation (7) follows from equation (5) and Proposition 1 in Lei and Candès (2021), which states that when E_{X∼P_X}[w(X)^{1+δ}] < ∞ for some δ ≥ 1, there exists a universal constant C_1(δ) depending on δ such that

\[
\mathbb{P}\left( Y_0 \in \widehat{C}_n(X_0; 1-\alpha, D) \right) \le 1 - \alpha + \frac{C_1(\delta)}{n^{\delta/(1+\delta)}}.
\]

D.2 Remark 2

Proof.

\[
\begin{aligned}
\mathbb{P}(E) &= \mathbb{P}\left\{ w(X_0) + \frac{n\alpha}{1-\alpha}\left(1 - \frac{1}{n}\sum_{i=1}^{n} w(X_i)\right) \le \frac{n\alpha}{1-\alpha} \right\} \\
&\ge \mathbb{P}\left\{ w(X_0) \le \frac{n\alpha}{2(1-\alpha)} \right\} \cdot \mathbb{P}\left\{ \frac{1}{n}\sum_{i=1}^{n} w(X_i) \ge \frac{1}{2} \right\} \quad (24) \\
&\ge \mathbb{E}_{X\sim P_X}\left[ w(X)\,\mathbf{1}\left\{ w(X) \le \frac{n\alpha}{2-2\alpha} \right\} \right] \cdot \mathbb{P}\left( \left| \frac{1}{n}\sum_{i=1}^{n} w(X_i) - 1 \right| \le \frac{1}{2} \right).
\end{aligned}
\]

The first inequality follows from the independence of X_0 and D, while the second inequality follows from

\[
\mathbb{P}\left( w(X_0) \le \frac{n\alpha}{2(1-\alpha)} \right) = \mathbb{E}\,\mathbf{1}\left\{ w(X_0) \le \frac{n\alpha}{2(1-\alpha)} \right\} = \mathbb{E}_{X\sim P_X}\left[ w(X)\,\mathbf{1}\left\{ w(X) \le \frac{n\alpha}{2(1-\alpha)} \right\} \right].
\]

By the monotone convergence theorem, we have

\[
\lim_{n\to\infty} \mathbb{E}_{X\sim P_X}\left[ w(X)\,\mathbf{1}\left\{ w(X) \le \frac{n\alpha}{2-2\alpha} \right\} \right] = \mathbb{E}_{X\sim P_X}\left[ \lim_{n\to\infty} w(X)\,\mathbf{1}\left\{ w(X) \le \frac{n\alpha}{2-2\alpha} \right\} \right] = \mathbb{E}_{X\sim P_X}[w(X)] = 1.
\]

Note that {w(X_i) : i ∈ [n]} are i.i.d. random variables, so we can use a concentration inequality to control P(|(1/n)Σ_{i=1}^n w(X_i) − 1| < 1/2). According to the proof of Theorem 4 in Lei and Candès (2021), we conclude that there exists a constant C_2(δ) such that

\[
\mathbb{P}\left( \left| \frac{1}{n}\sum_{i=1}^{n} w(X_i) - 1 \right| \ge \frac{1}{2} \right) \le \frac{C_2(\delta)}{n^{\delta'}},
\]

where δ' = δ/2 + min{1/2, δ/2}.

D.3 Theorem 2

Proof. Given D_tr^all = D_tr^(0) ∪ (∪_{k∈[K]} D_tr^(k)) and D_cal^all = ∪_{k∈[K]} D_cal^(k), we have

\[
\begin{aligned}
\mathbb{P}\left( Y_0 \notin \widehat{C}^{B}(X_0; 1-\alpha, G) \right)
&= \mathbb{E}\,\mathbf{1}\left\{ Y_0 \notin \widehat{C}^{B}(X_0; 1-\alpha, G) \right\} \\
&\overset{(i)}{\le} \mathbb{E}\left\{ \sum_{k \in G} \mathbf{1}\left\{ Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|G|, D^{(k)}_{\mathrm{cal}}) \right\} \right\} \\
&= \mathbb{E}\left\{ \sum_{k \in G} \mathbb{E}\left[ \mathbf{1}\left\{ Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|G|, D^{(k)}_{\mathrm{cal}}) \right\} \,\middle|\, D^{\mathrm{all}}_{\mathrm{tr}} \right] \right\} \\
&= \mathbb{E}\left\{ \sum_{k \in G} \mathbb{P}\left( Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|G|, D^{(k)}_{\mathrm{cal}}) \,\middle|\, D^{\mathrm{all}}_{\mathrm{tr}} \right) \right\}
\overset{(ii)}{\le} \mathbb{E}\left\{ |G| \cdot \frac{\alpha}{|G|} \right\} = \alpha.
\end{aligned}
\]

Inequality (i) follows from the construction of Ĉ^B(x; 1−α, G): if Y_0 ∉ Ĉ^B(X_0; 1−α, G), then Y_0 ∉ Ĉ^(k)(X_0; 1−α/|G|, D_cal^(k)) for at least one k ∈ G. Inequality (ii) makes use of the independence between D_tr^all and D_cal^all, as well as the fact that

\[
\mathbb{P}\left( Y_0 \in \widehat{C}^{(k)}(X_0; 1-\alpha/|G|, D^{(k)}_{\mathrm{cal}}) \,\middle|\, D^{\mathrm{all}}_{\mathrm{tr}} \right) \ge 1 - \alpha/|G|.
\]

Lastly, equation (12) follows from the equivalence:

\[
\{ \widehat{C}^{B}(X_0; 1-\alpha, G) \text{ is a finite prediction interval} \} \iff \text{event } E^B \text{ happens}.
\]

D.4 Theorem 3

Proof. To begin with, we define some notation. Given datasets D^pool and D' = {(X_i', Y_i') : i ∈ [n]}, where (X_i', Y_i') ∼ i.i.d. P̃_X × P_{Y|X}, define the response vectors

\[
\widetilde{\mathbf{Y}} = (\widetilde Y_1, \ldots, \widetilde Y_n) \quad \text{and} \quad \mathbf{Y}' = (Y'_1, \ldots, Y'_n).
\]

With a slight abuse of notation, we let D = {(X_j, Y_j) : j ∈ [n]}, where

\[
(X_{n_1 + \cdots + n_{k-1} + i},\ Y_{n_1 + \cdots + n_{k-1} + i}) = (X_{k,i}, Y_{k,i}) \quad \text{for } i \in [n_k].
\]

We use Π to denote the set of permutations on {1, ..., n}. Therefore, there exists a permutation π ∈ Π such that D^pool = π(D), i.e., for i ∈ [n], (X̃_i, Ỹ_i) = (X_{π(i)}, Y_{π(i)}). Now we start our proof by applying Proposition 1, which yields

\[
\mathbb{P}\left( Y_0 \in \widehat{C}^{P}(X_0; 1-\alpha, D') \right) \ge 1 - \alpha. \tag{25}
\]

By further conditioning on (X_0, Y_0), we derive

\[
\begin{aligned}
\mathbb{P}\left( Y_0 \in \widehat{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \mid X_0, Y_0 \right)
&\overset{(i)}{\ge} \mathbb{P}\left( Y_0 \in \widehat{C}^{P}(X_0; 1-\alpha, D') \mid X_0, Y_0 \right) - d_{\mathrm{TV}}\left( (\widetilde{\mathbf{X}}, \widetilde{\mathbf{Y}}), (\mathbf{X}', \mathbf{Y}') \right) \\
&= \mathbb{P}\left( Y_0 \in \widehat{C}^{P}(X_0; 1-\alpha, D') \mid X_0, Y_0 \right) - d_{\mathrm{TV}}\left( \widetilde{\mathbf{X}}, \mathbf{X}' \right). \quad (26)
\end{aligned}
\]

Inequality (i) follows from the definition of total variation distance and the independence between (X_0, Y_0) and D', D^pool. To show the validity of equation (26), we use equation (10) in Berrett et al. (2020), according to which it suffices to show that

\[
(\mathbf{Y}' \mid \mathbf{X}' = \mathbf{x}) \overset{d}{=} (\widetilde{\mathbf{Y}} \mid \widetilde{\mathbf{X}} = \mathbf{x}) \quad \text{for any } \mathbf{x}^\top \in \mathbb{R}^{nd}. \tag{27}
\]

For simplicity, we prove equation (27) for the case where P_{Y|X=x} is a discrete distribution for all x ∈ R^d and define h(y|x) = P_{U∼P_{Y|X=x}}(U = y). Let x = (x_1^⊤, ..., x_n^⊤) and y = (y_1, ..., y_n), where x_i ∈ R^d and y_i ∈ R for i ∈ [n]. We have

\[
\begin{aligned}
\mathbb{P}\left( \widetilde{\mathbf{Y}} = \mathbf{y} \mid \widetilde{\mathbf{X}} = \mathbf{x} \right)
&= \frac{\mathbb{P}( \widetilde{\mathbf{Y}} = \mathbf{y}, \widetilde{\mathbf{X}} = \mathbf{x} )}{\mathbb{P}( \widetilde{\mathbf{X}} = \mathbf{x} )}
= \frac{\sum_{\pi \in \Pi} \mathbb{P}( \widetilde{\mathbf{Y}} = \mathbf{y}, \widetilde{\mathbf{X}} = \mathbf{x}, D^{\mathrm{pool}} = \pi(D) )}{\sum_{\pi \in \Pi} \mathbb{P}( \widetilde{\mathbf{X}} = \mathbf{x}, D^{\mathrm{pool}} = \pi(D) )} \\
&\overset{(ii)}{=} \frac{\sum_{\pi \in \Pi} \mathbb{P}( \widetilde{\mathbf{Y}} = \mathbf{y}, \widetilde{\mathbf{X}} = \mathbf{x} \mid D^{\mathrm{pool}} = \pi(D) )}{\sum_{\pi \in \Pi} \mathbb{P}( \widetilde{\mathbf{X}} = \mathbf{x} \mid D^{\mathrm{pool}} = \pi(D) )}
= \frac{\sum_{\pi \in \Pi} \mathbb{P}( \forall i \in [n] : Y_{\pi(i)} = y_i, X_{\pi(i)} = x_i )}{\sum_{\pi \in \Pi} \mathbb{P}( \forall i \in [n] : X_{\pi(i)} = x_i )}. \quad (28)
\end{aligned}
\]

Equality (ii) follows from π ∼ Uniform(Π). By the independence between observations in D and D', we have

\[
\mathbb{P}\left( \forall i \in [n] : Y_{\pi(i)} = y_i, X_{\pi(i)} = x_i \right) = \mathbb{P}\left( \forall i \in [n] : X_{\pi(i)} = x_i \right) \cdot \prod_{i\in[n]} h(y_i | x_i) \tag{29}
\]

and

\[
\mathbb{P}(\mathbf{Y}' = \mathbf{y} \mid \mathbf{X}' = \mathbf{x}) = \prod_{i\in[n]} h(y_i | x_i). \tag{30}
\]

Combining equations (28), (29), and (30), we conclude

\[
\mathbb{P}\left( \widetilde{\mathbf{Y}} = \mathbf{y} \mid \widetilde{\mathbf{X}} = \mathbf{x} \right) = \mathbb{P}(\mathbf{Y}' = \mathbf{y} \mid \mathbf{X}' = \mathbf{x}),
\]

which proves equation (27) and establishes equation (26). Subsequently, by taking the expectation over (X_0, Y_0) and combining with equation (25), we derive

\[
\mathbb{P}\left( Y_0 \in \widehat{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \right) \ge 1 - \alpha - d_{\mathrm{TV}}\left( \widetilde{\mathbf{X}}, \mathbf{X}' \right).
\]

Lastly, we complete the proof by observing

\[
\{ \widehat{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \text{ is a finite prediction interval} \} \iff \text{event } E^P \text{ happens}.
\]

D.5 Corollaries 1 & 2

Note that given D_tr^(k) and D_tr^(0), ŵ_k can be viewed as known. The same applies to w̃ given D_tr^all and D_tr^(0).

Proof of Corollary 1.

\[
\begin{aligned}
\mathbb{P}\left( Y_0 \notin \widetilde{C}^{B}(X_0; 1-\alpha, G) \right)
&= \mathbb{E}\,\mathbf{1}\left\{ Y_0 \notin \widetilde{C}^{B}(X_0; 1-\alpha, G) \right\} \\
&\le \mathbb{E}\left\{ \sum_{k\in G} \mathbf{1}\left\{ Y_0 \notin \widetilde{C}^{(k)}(X_0; 1-\alpha/|G|, D^{(k)}_{\mathrm{cal}}) \right\} \right\} \\
&= \mathbb{E}\left\{ \sum_{k\in G} \mathbb{E}\left[ \mathbf{1}\left\{ Y_0 \notin \widetilde{C}^{(k)}(X_0; 1-\alpha/|G|, D^{(k)}_{\mathrm{cal}}) \right\} \,\middle|\, D^{\mathrm{all}}_{\mathrm{tr}} \right] \right\} \\
&= \mathbb{E}\left\{ \sum_{k\in G} \mathbb{P}\left( Y_0 \notin \widetilde{C}^{(k)}(X_0; 1-\alpha/|G|, D^{(k)}_{\mathrm{cal}}) \,\middle|\, D^{\mathrm{all}}_{\mathrm{tr}} \right) \right\} \\
&\overset{(i)}{\le} \mathbb{E}\left\{ \sum_{k\in G} \left( \frac{\alpha}{|G|} + \frac{1}{2}\mathrm{Err}_k \right) \right\} \le \alpha + \frac{1}{2}\,\mathbb{E}\sum_{k\in G} \mathrm{Err}_k.
\end{aligned}
\]

Inequality (i) follows from Lemma 1 together with the fact that

\[
\mathbb{P}\left( Y_0 \in \widetilde{C}^{(k)}(X_0; 1-\alpha/|G|, D^{(k)}_{\mathrm{cal}}) \,\middle|\, D^{\mathrm{all}}_{\mathrm{tr}} \right) \ge 1 - \alpha/|G| - \frac{1}{2}\mathrm{Err}_k.
\]

Hence we complete the proof of equation (20). Lastly, we prove equation (21) by observing the following equivalence:

\[
\{ \widetilde{C}^{B}(X_0; 1-\alpha, G) \text{ is a finite prediction interval} \} \iff \text{event } \widetilde{E}^B \text{ happens}.
\]

Proof of Corollary 2. We adopt the same set of notations as in Theorem 3. Following the proof of Theorem 3, we have

\[
\begin{aligned}
\mathbb{P}\left( Y_0 \in \widetilde{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \mid D^{\mathrm{all}}_{\mathrm{tr}}, D^{(0)}_{\mathrm{tr}}, X_0, Y_0 \right)
&\overset{(i)}{\ge} \mathbb{P}\left( Y_0 \in \widetilde{C}^{P}(X_0; 1-\alpha, D') \mid D^{\mathrm{all}}_{\mathrm{tr}}, D^{(0)}_{\mathrm{tr}}, X_0, Y_0 \right) - d_{\mathrm{TV}}\left( (\widetilde{\mathbf{X}}, \widetilde{\mathbf{Y}}), (\mathbf{X}', \mathbf{Y}') \right) \\
&= \mathbb{P}\left( Y_0 \in \widetilde{C}^{P}(X_0; 1-\alpha, D') \mid D^{\mathrm{all}}_{\mathrm{tr}}, D^{(0)}_{\mathrm{tr}}, X_0, Y_0 \right) - d_{\mathrm{TV}}\left( \widetilde{\mathbf{X}}, \mathbf{X}' \right).
\end{aligned}
\]

Therefore, we have

\[
\mathbb{P}\left( Y_0 \in \widetilde{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \right) \ge \mathbb{P}\left( Y_0 \in \widetilde{C}^{P}(X_0; 1-\alpha, D') \right) - d_{\mathrm{TV}}\left( \widetilde{\mathbf{X}}, \mathbf{X}' \right). \tag{31}
\]

An application of Lemma 1 yields

\[
\mathbb{P}\left( Y_0 \in \widetilde{C}^{P}(X_0; 1-\alpha, D') \,\middle|\, D^{\mathrm{all}}_{\mathrm{tr}}, D^{(0)}_{\mathrm{tr}} \right) \ge 1 - \alpha - \frac{1}{2}\,\mathbb{E}_{X\sim\widetilde P_X}\left[ |\bar w(X) - \widetilde w(X)| \,\middle|\, D^{\mathrm{all}}_{\mathrm{tr}}, D^{(0)}_{\mathrm{tr}} \right]. \tag{32}
\]

Combining equation (31) and equation (32), we conclude

\[
\mathbb{P}\left( Y_0 \in \widetilde{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \right) \ge 1 - \alpha - d_{\mathrm{TV}}\left( \widetilde{\mathbf{X}}, \mathbf{X}' \right) - \frac{1}{2}\,\mathbb{E}_{X\sim\widetilde P_X}\left[ |\bar w(X) - \widetilde w(X)| \right],
\]

which completes the proof of equation (22). Subsequently, we prove equation (23) by noting

\[
\{ \widetilde{C}^{P}(X_0; 1-\alpha, D^{\mathrm{pool}}) \text{ is a finite prediction interval} \} \iff \text{event } \widetilde{E}^P \text{ happens}.
\]

E Simulation details

E.1 Informative WCP intervals

Details of Figure 1

We use the absolute residual as our score function, i.e., s(f(x), y) = |f(x) − y|. Here, the function f is obtained from the following ingredients (a code sketch follows the list):

• data: a set of i.i.d. pre-training data points {(X_i^pre, Y_i^pre)} of size 100 with

\[
X^{\mathrm{pre}}_i \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(-20, 20) \quad \text{and} \quad Y^{\mathrm{pre}}_i \mid X^{\mathrm{pre}}_i \sim N(\mathrm{sigmoid}(X^{\mathrm{pre}}_i), 0.01)
\]

• model: a Gaussian process model (Rasmussen, 2003) with a radial basis function (RBF) kernel, implemented using the Python package GPy
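A minimal sketch of this pre-training step; the GPy calls reflect that package's standard regression API, but the snippet is our reconstruction rather than the authors' script:

```python
import numpy as np
import GPy

rng = np.random.default_rng(0)
X_pre = rng.uniform(-20, 20, size=(100, 1))
Y_pre = rng.normal(1.0 / (1.0 + np.exp(-X_pre)), 0.1)   # sd 0.1 -> variance 0.01

kernel = GPy.kern.RBF(input_dim=1)
gp = GPy.models.GPRegression(X_pre, Y_pre, kernel)
gp.optimize()                                            # fit kernel hyperparameters

f = lambda x: gp.predict(np.atleast_2d(x).T)[0].ravel()  # posterior mean predictor
```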

Alternate example with varying variance of P_X

In Figure 1, we examine the scenario where both Q_X and P_X have the same variance and we vary the mean of P_X to observe changes in their overlap. Now, we explore variation in the variance of P_X while keeping the mean at zero. With Q_X and P_{Y|X} specified in the caption of Figure 1, and using the same pre-trained model as above, we consider the distribution

\[
P_X = N(0, \sigma^2) \quad \text{with } \sigma \in [0.5, 2.5].
\]

A similar set of plots is generated at the specified level 0.9:

Figure 6. Left: empirical marginal coverage probability. Middle: empirical probability of obtaining finite prediction intervals. Right: empirical informative coverage probability. We consider sample sizes n ∈ {10, 50, 100} and run simulations across 10000 replications.

Visualization of two pre-trained models

Figure 7. Model 1 is trained using data with less noise, and Model 2 is trained using noisier data.

The effect of the pre-trained model

We also notice that the informative coverage probability is influenced by the quality of the pre-trained model. To explore this, we consider a Gaussian process model trained on noisier data. The training dataset {(X_i^pre, Y_i^pre)} now comprises 100 observations with the following distribution:

\[
X^{\mathrm{pre}}_i \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(-20, 20) \quad \text{and} \quad Y^{\mathrm{pre}}_i \mid X^{\mathrm{pre}}_i \sim N(\mathrm{sigmoid}(X^{\mathrm{pre}}_i), 1).
\]

We visualize these pre-trained models in Figure 7. Furthermore, we generate figures similar to Figure 1 and Figure 6 and present them in Figure 8. Notably, the informative coverage plot in Figure 6 and the one for varying variance in Figure 8 exhibit a significant difference. This discrepancy arises from using models of different accuracy: the magnitude of the scores is inflated by using a less accurate model, leading to an increased length of the prediction intervals. To demonstrate, we conduct a simulation comparing the lengths of prediction intervals in the varying variance case and present the results in Figure 9.

Figure 8: Same setup as Figure 1 and Figure 6 with noisier training data.

From Figure 9, we observe that the average lengths of finite prediction intervals obtained using the more accurate model are smaller. This explains why the informative coverage probability in Figure 8 increases for the varying variance case: the interval lengths are inflated by using a less accurate model.

E.2 Covariate shift: d = 1 with known likelihood ratios

Simulation details

We continue to use the absolute residual as our score function and utilize the Gaussian process model trained as described in Section E.1 with less noisy data (i.e., Model 1 in Figure 7). In the experiments, we generate calibration data with sample sizes n₁ = n₂ = 100, sample one observation (X_0, Y_0) from Q_X × P_{Y|X}, and then compute the corresponding quantities. To obtain the results in Table 1, we run experiments independently with 5000 replications.

Figure 9. Left: average length of finite prediction intervals using Model 1. Right: average length of finite prediction intervals using Model 2.

Scatter plots and prediction bands for σ² = 1:

Figure 10: Visualization of observed groups with σ² = 1.

Figure 11: Prediction bands by WCP on observed groups with σ² = 1.

Figure 12. First row: prediction bands by WCP based on the selective Bonferroni procedure and WCP based on data pooling with σ² = 1. Second row: prediction bands obtained by selecting the shorter WCP interval among the 2 groups, with σ² = 1 and σ² = 4.

Simulation details with multiple groups

Under the same setup, we consider group numbers K > 2. More specifically, we consider the following data generating procedure:

• Generate group number K ∼ Uniform({3, ..., 10})
• Set sample size n_k ∼ Uniform({10, 11, ..., 100}) and covariate mean E_{P_X^(k)}[X] ∼ Uniform(−6, 6) for k ∈ [K]
• Set Sd(P_X^(k)) ∼ Uniform(0.5, 1.5) or Sd(P_X^(k)) ∼ Uniform(1.5, 2.5) for k ∈ [K]
• Prepare D_cal^(k) and D_tr^(k) for k ∈ [K] and D_tr^(0)
• Sample (X_0, Y_0) ∼ Q_X × P_{Y|X} and carry out the analysis

Table 3 is obtained by running the experiment with 5000 replications.

Table 3: Method comparison with K > 2

                       Sd(P_X^(k)) ∼ Uniform(0.5, 1.5)   Sd(P_X^(k)) ∼ Uniform(1.5, 2.5)
                       MCP     IP      ICP     AIL       MCP     IP      ICP     AIL
WCP-SB (Kinit = 1)     0.932   0.584   0.884   0.370     0.917   0.881   0.905   0.376
WCP-SB (Kinit = 3)     0.963   0.747   0.950   0.445     0.945   0.949   0.942   0.432
WCP-P                  0.915   0.916   0.907   0.361     0.902   0.990   0.901   0.359
WCP-SS                 0.846   0.889   0.827   0.311     0.808   0.984   0.805   0.294

E.3 Covariate shift: higher dimension with unknown likelihood ratios

We consider covariate dimensions d ∈ {5, 10, 20, 50}.

Data generating and pre-trained model

We continue to use the absolute residual as our score function and utilize GPy to implement a pre-trained Gaussian process model with an RBF kernel. Specifically, we obtain the model using a set of i.i.d. pre-training data points {(X_i^pre, Y_i^pre)} of size 100 · d:

\[
X^{\mathrm{pre}}_{i,j} \overset{\text{i.i.d.}}{\sim} \mathrm{Uniform}(-3, 3) \quad \text{and} \quad Y^{\mathrm{pre}}_i \mid X^{\mathrm{pre}}_i \sim N\left(4 \cdot \mathrm{sigmoid}(X^{\mathrm{pre}}_{i,1}) \cdot \mathrm{sigmoid}(X^{\mathrm{pre}}_{i,2}),\ 0.01\right).
\]

The test covariate distribution is set to be a d-dimensional standard Gaussian vector, Q_X = N(0, I_d). To produce multi-group data, we generate the following quantities randomly:

• group number K ∼ Uniform({2, ..., 10})
• sample sizes n_k ∼ Uniform(100, 500)
• covariate means µ_k = E_{P_X^(k)}[X] ∼ Uniform([−1, 1]^d) for k ∈ [K]
• σ_k = Sd(P_X^(k)) ∼ Uniform(0.8, 1) for k ∈ [K]
• ρ_k ∼ Uniform(0, 0.2) or ρ_k ∼ Uniform(0.7, 0.9) for k ∈ [K]

Subsequently, we generate D_cal^(k) and D_tr^(k) for k ∈ [K], and D_tr^(0) with n_0 = Σ_{k∈[K]} n_k. Lastly, we sample (X_0, Y_0) ∼ Q_X × P_{Y|X} as a test data point, for which we compute the weighted prediction interval and evaluate the coverage probability and the length of the prediction interval.

Estimating likelihood ratios

Motivated by Section 2.3 in Tibshirani et al. (2019), we utilize a random forest classifier implemented in the Python package scikit-learn to estimate the likelihood ratio. It is important to note that, for estimating the likelihood ratio w_k, we use D_tr^(k) and a randomly sampled subset of D_tr^(0) of size n_k to ensure balanced data representation between the k-th group and the test group. For the pooling method, we combine the training data from the observed groups and train a classifier based on D_tr^all and D_tr^(0). When implementing Algorithm 1, we add an extra condition: in the while-loop, if the k_j-th row sum of the matrix M is less than λ · n_0, the group selection process is terminated. In other words, if the remaining groups cannot make a sufficient contribution to providing a finite prediction interval, those groups are not selected. Here λ is a user-specified tuning parameter; in the simulations, we set λ = 0.01.
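Concretely, the classifier-based estimate follows the standard recipe referenced above: train a probabilistic classifier to distinguish source covariates (label 0) from test covariates (label 1) and convert the predicted odds into a likelihood ratio. The sketch below is our reconstruction with illustrative names, not the authors' exact code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def estimate_likelihood_ratio(X_source, X_test, seed=0):
    """Estimate w(x) = dQ_X/dP_X via a random forest classifier.

    X_source : covariates from the source group (e.g., D_tr^(k))
    X_test   : covariates from the test group (a size-matched subset of D_tr^(0))
    """
    X = np.vstack([X_source, X_test])
    y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_test))])
    clf = RandomForestClassifier(random_state=seed).fit(X, y)
    n_s, n_t = len(X_source), len(X_test)

    def w_hat(x):
        p = clf.predict_proba(np.atleast_2d(x))[:, 1]
        p = np.clip(p, 1e-6, 1 - 1e-6)           # avoid division by zero
        return (p / (1 - p)) * (n_s / n_t)       # odds times sampling-rate correction
    return w_hat
```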

Tables

Tables 2, 4, 5, and 6 are obtained by running the experiment with 5000 replications.

Table 4: Method comparison with d = 5 and K ≥ 2

                               ρk ∼ Uniform(0, 0.2)             ρk ∼ Uniform(0.7, 0.9)
                               MCP     IP      ICP     AIL      MCP     IP      ICP     AIL
WCP-SB (Kinit = 1)             0.890   0.995   0.890   0.360    0.937   0.727   0.913   0.393
WCP-SB (Kinit = min{K, 3})     0.897   0.998   0.897   0.404    0.964   0.722   0.950   0.470
WCP-P                          0.890   0.999   0.890   0.355    0.897   0.944   0.890   0.360
WCP-SS                         0.754   1.000   0.754   0.279    0.890   0.941   0.883   0.351

Table 5: Method comparison with d = 20 and K ≥ 2

                               ρk ∼ Uniform(0, 0.2)             ρk ∼ Uniform(0.7, 0.9)
                               MCP     IP      ICP     AIL      MCP     IP      ICP     AIL
WCP-SB (Kinit = 1)             0.919   0.961   0.916   0.500    0.979   0.391   0.945   0.565
WCP-SB (Kinit = min{K, 3})     0.969   0.965   0.968   0.617    0.979   0.390   0.945   0.565
WCP-P                          0.905   0.999   0.905   0.471    0.920   0.912   0.912   0.479
WCP-SS                         0.881   0.999   0.881   0.444    0.956   0.693   0.937   0.548

Table 6: Method comparison with d = 50 and K ≥ 2

                               ρk ∼ Uniform(0, 0.2)             ρk ∼ Uniform(0.7, 0.9)
                               MCP     IP      ICP     AIL      MCP     IP      ICP     AIL
WCP-SB (Kinit = 1)             0.903   0.960   0.899   4.020    0.982   0.236   0.924   4.491
WCP-SB (Kinit = min{K, 3})     0.955   0.937   0.952   4.577    0.982   0.236   0.924   4.491
WCP-P                          0.913   0.999   0.913   3.907    0.902   0.906   0.892   3.840
WCP-SS                         0.800   0.999   0.800   3.094    0.968   0.457   0.931   4.461

Covariate vector with weak correlation. Note that selecting the shortest WCP interval among those based on each single group achieves the highest IP. However, this method fails to provide a valid coverage probability: both MCP and ICP fall below the target level of 0.9. On the other hand, WCP based on the selective Bonferroni procedure with K_init = 1 performs similarly to WCP based on data pooling in this setup, though the data pooling method exhibits a slightly larger IP and shorter AIL. WCP based on the selective Bonferroni procedure with K_init = min{K, 3} is more conservative, which improves MCP, IP, and ICP at the cost of inflating the lengths of the informative prediction intervals. It is important to note that when d = 50, WCP-SB with K_init = min{K, 3} even has a smaller IP.

Covariate vector with strong correlation. When the covariate vector has strong correlation, methods other than WCP based on data pooling show a significant decrease in IP as the dimension d increases. Meanwhile, WCP based on data pooling maintains MCP and ICP close to the target level while also achieving the highest IP and shortest AIL. Note that when the dimension is high, WCP-SB with K_init = min{K, 3} has nearly the same performance as WCP-SB with K_init = 1, indicating that predominantly only one group is selected even with the specified K_init > 1.
