Informativeness of Weighted Conformal Prediction
Abstract
Weighted conformal prediction (WCP), a recently proposed framework, provides uncertainty quantification with the flexibility to accommodate different covariate distributions between training and test data. However, as we point out in this paper, the effectiveness of WCP relies heavily on the overlap between covariate distributions; insufficient overlap can lead to uninformative prediction intervals. To enhance the informativeness of WCP, we propose two methods for scenarios involving multiple sources with varied covariate distributions. We establish theoretical guarantees for our proposed methods and demonstrate their efficacy through simulations.
1 Introduction
In recent years, there has been an extraordinary surge in computational power and sophisticated machine learning models, revolutionizing various fields, from artificial intelligence to scientific research and beyond. These machine learning models are trained on vast amounts of data to comprehend and predict complex phenomena such as weather forecasting and disease diagnostics. However, as problems grow in complexity, it is crucial not only to provide accurate predictions but also to quantify the associated uncertainties.
Conformal prediction, a methodology for constructing prediction intervals, has gained significant attention and popularity for its ability to assess the uncertainties of machine learning models (Vovk et al., 1999; Papadopoulos et al., 2002; Vovk et al., 2005; Lei et al., 2013; Lei and Wasserman, 2015; Angelopoulos et al., 2023). One reason for the prominence of conformal prediction is its capacity to provide nonasymptotic coverage guarantees for any black-box algorithm, regardless of the underlying distribution. This remarkable feature is achieved by relying on the exchangeability of the data points. In practice, however, data points are not guaranteed to be exchangeable, one notable example being covariate shift between training and test distributions in supervised learning tasks (Quiñonero-Candela et al., 2022). A recent framework, weighted conformal prediction (Tibshirani et al., 2019), offers a solution in the regression setup by incorporating knowledge of the likelihood ratio between the training and test covariate distributions.
While weighted conformal prediction has demonstrated successful applications in diverse domains such as experimental design, survival analysis, and causal inference (e.g., see Fannjiang et al. (2022); Lei and Candès (2021); Candès et al. (2023)), the effectiveness of this framework heavily depends on the overlap of covariate distributions between training and test. In Figure 1, a simple example demonstrates that the constructed WCP intervals can be uninformative in certain cases. We examine a regression example with $Q_X = N(0, 9)$ representing the covariate distribution of test data and $P_X = N(\mu, 9)$ representing the covariate distribution of training data. Three sample sizes, $n = 10, 50, 100$, are considered, and the empirical results for the constructed WCP intervals are obtained from 10000 replications.¹ We reduce the overlap of the covariate distributions $P_X$ and $Q_X$ by increasing $\mu$, the mean of $P_X$.

¹The codes used in this paper are available on our Github page [link].

Figure 1: Marginal coverage of WCP intervals (left), probability of getting finite WCP intervals (middle), and informative coverage of WCP intervals (right) for $n = 10, 50, 100$, plotted against $\mu$, the mean of the covariate distribution.
Although WCP intervals provide marginal coverage above the target level 0.9 (left panel of Figure 1), this coverage guarantee is accomplished at the cost of an increasing probability of uninformative prediction intervals, $(-\infty, \infty)$, as the overlap decreases (middle panel of Figure 1). Furthermore, conditional on the prediction interval being finite, the right panel of Figure 1 shows that the conditional coverage, referred to as the informative coverage probability, decreases and falls below the target level as the overlap decreases. The decrease in coverage appears to be more significant for smaller sample sizes. Motivated by this example, when evaluating the efficacy of WCP intervals, one should assess the probability of obtaining informative prediction intervals and the informative coverage probability as more direct metrics than the marginal coverage probability.
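To see why the overlap matters, note that for the Gaussian setup of Figure 1 the likelihood ratio is available in closed form:

$$w(x) = \frac{dQ_X}{dP_X}(x) = \frac{\exp(-x^2/18)}{\exp(-(x-\mu)^2/18)} = \exp\!\left(\frac{\mu^2 - 2\mu x}{18}\right).$$

A finite WCP interval requires, roughly, $w(X_0) \lesssim \alpha n/(1-\alpha)$ (see the event $E$ defined below); since $X_0 \sim N(0, 9)$, the exponent $(\mu^2 - 2\mu X_0)/18$ grows quadratically in $\mu$ at typical test points, so this condition fails with increasing probability as $\mu$ grows, matching the middle panel of Figure 1.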
In addition to the issue of uninformativeness that WCP intervals may present, another practical concern with the WCP framework arises when dealing with training data sourced from multiple groups with varied covariate distributions. This scenario is quite common in practice, particularly in medical studies aimed at predicting treatment effects for patients using various covariates such as age, gender, and medical history. Data are often collected from different hospitals or clinics, each possessing its own distinct patient population and covariate distribution. Although in theory we can apply the generalized WCP techniques from Tibshirani et al. (2019), the resulting weight functions are complex and, consequently, not practical, as noted in Lei and Candès (2021). A recent work, Bhattacharyya and Barber (2024), focuses on achieving a marginal coverage guarantee in a special scenario where both training and test data can be viewed as collected via stratified sampling. Specifically, the covariate $X$ is represented as $X = (X^0, X^1)$, with $X^0 \in [K]$ encoding the group information, and the test distribution exhibits covariate shift only in $X^0$. The aim of this paper is to address a more general scenario where covariates do not contain explicit group information.
Motivated by the preceding discussions, two metrics, the probability of obtaining informative prediction intervals and the informative coverage probability, are important for evaluating the informativeness of WCP-based procedures. When multiple sources (we will use the words groups and sources interchangeably) are present in the training data with covariate shifts, it is crucial to adapt the WCP framework to handle multiple varied covariate distributions and thereby enhance informativeness.
Contribution of this work. This paper focuses on improving the informativeness of WCP when multiple sources with covariate shifts are available. Two procedures, WCP based on selective Bonferroni and WCP based on data pooling, are proposed to integrate information from different sources. The proposed approaches aim to increase the probability of obtaining a finite prediction interval, thereby ensuring that the informative coverage probability closely approximates the target coverage probability. We establish theoretical guarantees for these methods and provide empirical evidence of their effectiveness in numerical experiments.
where covariates from the test group have marginal distribution $Q_X$ and outcomes $\{Y_{0,i} : i \in [n_0]\}$ are not observed. We assume $Q_X$ is known (i.e., an unlabeled dataset is available for training purposes, which we denote as $\mathcal{D}_{tr}^{(0)} = \{X_{0,i} : i \in [n_0]\}$). Note that $Q_X$ can differ from the covariate distributions of the observed groups. Without further explanation, we assume in the following discussion that the covariate distributions $Q_X$ and $\{P_X^{(k)} : k \in [K]\}$ are pairwise absolutely continuous with respect to each other.
Lastly, it is worth mentioning that conformal prediction aims to create a prediction interval for $Y_{0,i}$ with the following guarantee:

$$\mathbb{P}\left\{Y_{0,i} \in \widehat{C}_n(X_{0,i})\right\} \ge 1 - \alpha,$$

where $\widehat{C}_n(x)$ is a prediction band constructed based on the available data sources. Beyond ensuring a theoretical guarantee for the marginal coverage probability, our aim in this study is to leverage multiple sources to increase the probability of obtaining a finite prediction interval and to improve the informative coverage probability.
Remark 1. The two-layer hierarchical model studied in Dunn et al. (2023) assumes exchangeability between the covariate distributions of the observed groups and the covariate distribution of the test group (i.e., these covariate distributions are drawn independently and identically distributed from a certain distribution). In this work, we do not make such an assumption.
Two special cases. Suppose the observed groups have separated supports; in such cases, a combination of Mondrian conformal prediction (Vovk et al., 2005) and weighted conformal prediction (Tibshirani et al., 2019) could be effective. When the observed groups have varying levels of overlap among themselves, the problem becomes more challenging. With overlapping supports, we also consider the special scenario where $Q_X$ can be expressed as a mixture of $\{P_X^{(k)} : k \in [K]\}$. In this case, the idea of group-weighted conformal prediction (Bhattacharyya and Barber, 2024) can be useful. However, when covariates do not explicitly contain group information, such a mixture structure can impose practical limitations and suffer from certain identification issues. More details on these two cases are given in Appendices A and B.
$$p_0^w(x; \mathcal{D}) = \frac{w(x)}{w(x) + \sum_{j=1}^{n} w(X_j)} \quad \text{and} \quad p_i^w(x; \mathcal{D}) = \frac{w(X_i)}{w(x) + \sum_{j=1}^{n} w(X_j)}, \tag{3}$$

where $w = dQ_X/dP_X$ denotes the likelihood ratio.
By reweighting the scores based on the dataset $\mathcal{D}$, one can attain a finite sample guarantee when the likelihood ratio is known. While equation (4) ensures a marginal coverage guarantee for $Y_0$, the prediction interval is notably conservative, as demonstrated in Figure 1, when $P_X$ and $Q_X$ have large total variation distance, denoted by $d_{TV}(P_X, Q_X) = \sup_A |P_X(A) - Q_X(A)|$. Specifically, when the event $E = \{p_0^w(X_0; \mathcal{D}) \le \alpha\}$ does not happen, the resulting WCP interval is uninformative: $\widehat{C}_n(X_0; 1-\alpha, \mathcal{D}) = (-\infty, \infty)$.
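To make the construction concrete, here is a minimal sketch of a WCP interval with the absolute-residual score $s(x, y) = |f(x) - y|$; the helper names and signatures are ours, not from the paper.

```python
import numpy as np

def wcp_interval(x0, f, scores, w, X_cal, alpha=0.1):
    """Weighted conformal prediction interval at x0 (absolute-residual score).

    scores: conformal scores |f(X_i) - Y_i| on the calibration data D
    w:      likelihood ratio function dQ_X / dP_X (vectorized)
    Returns (lo, hi); (-inf, inf) when the interval is uninformative.
    """
    w_cal, w0 = w(X_cal), w(x0)
    total = w0 + w_cal.sum()
    p_cal = w_cal / total              # p_i^w(x0; D), i = 1, ..., n
    p0 = w0 / total                    # p_0^w(x0; D), mass placed at +infinity
    if p0 > alpha:                     # event E fails: quantile hits the mass at +infinity
        return -np.inf, np.inf
    order = np.argsort(scores)         # weighted (1 - alpha)-quantile of the scores
    cum = np.cumsum(p_cal[order])
    q = scores[order][np.searchsorted(cum, 1 - alpha)]
    return f(x0) - q, f(x0) + q
```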
For this reason, we decompose the marginal coverage probability in equation (4) as:

$$\mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D})\right\} = 1 - \mathbb{P}(E) + \mathbb{P}(E) \cdot \mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D}) \mid E\right\}. \tag{5}$$

We refer to the conditional coverage probability in equation (5) as the informative coverage probability and present its properties in Theorem 1.
Theorem 1. Under the same assumptions as in Proposition 1, it holds that

$$\mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D}) \mid E\right\} \ge 1 - \frac{\alpha}{\mathbb{P}(E)}. \tag{6}$$
With $Q_X$ fixed, $\mathbb{P}(E)$ depends on the likelihood ratio $w$ and the sample size of $\mathcal{D}$. This dependency and the finite sample performance can be studied through the lower bound provided in equation (8). When there exists some $\delta > 0$ such that $\mathbb{E}_{X\sim P_X}[w(X)^{1+\delta}] < \infty$, the second term on the right side of equation (8) can be controlled using concentration inequalities. As $n$ grows sufficiently large, this lower bound approaches 1. Detailed discussions of equation (8) can be found in Appendix D.2.
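As a quick empirical check (a sketch we add for illustration, using the closed-form likelihood ratio of the Gaussian example from Figure 1; the constant 18 is $2\sigma^2$ with $\sigma^2 = 9$), one can estimate $\mathbb{P}(E)$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, mu, n, reps = 0.1, 4.0, 50, 10_000

# likelihood ratio dQ_X/dP_X for Q_X = N(0, 9) and P_X = N(mu, 9)
w = lambda x: np.exp((mu**2 - 2 * mu * x) / 18)

hits = 0
for _ in range(reps):
    X = rng.normal(mu, 3, size=n)   # calibration covariates drawn from P_X
    x0 = rng.normal(0, 3)           # test covariate drawn from Q_X
    p0 = w(x0) / (w(x0) + w(X).sum())
    hits += p0 <= alpha             # event E: the WCP interval is finite
print(f"estimated P(E) = {hits / reps:.3f}")
```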
Algorithm 1 Group selection
Input: number of groups $K_{init}$ to be selected, likelihood ratios $\{w_k : k \in [K]\}$, training data set $\mathcal{D}_{tr}^{all}$
Procedure:
1: Initialize a list $\mathcal{G} = \{\}$ and a 0-1 matrix $M$ of dimension $K \times n_0$ (the $(k,i)$-th element estimates whether group $k$ can provide a finite prediction interval for $X_{0,i}$ at level $1 - \alpha/K_{init}$)
2: Compute the $(k,i)$-th element of $M$ as the indicator of

$$\frac{w_k(X_{0,i})}{w_k(X_{0,i}) + \sum_{j \in I_{tr}^{(k)}} w_k(X_{k,j})} \le \alpha/K_{init}.$$
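The sketch below mirrors the two steps shown above; since the remaining steps of Algorithm 1 are not reproduced here, the selection rule we use (greedily adding the group that covers the most yet-uncovered test covariates, up to $K_{init}$ groups) is our own assumption for illustration.

```python
import numpy as np

def select_groups(w_list, X0_tr, X_tr_list, alpha=0.1, K_init=2):
    """Sketch of Algorithm 1: pick groups that can give finite intervals.

    w_list:    likelihood ratio functions w_k = dQ_X / dP_X^(k) (vectorized)
    X0_tr:     unlabeled test covariates D_tr^(0)
    X_tr_list: training covariates of each observed group
    """
    K, n0 = len(w_list), len(X0_tr)
    M = np.zeros((K, n0), dtype=bool)
    for k, (wk, Xk) in enumerate(zip(w_list, X_tr_list)):
        denom = wk(X0_tr) + wk(Xk).sum()
        M[k] = wk(X0_tr) / denom <= alpha / K_init   # finite at level 1 - alpha/K_init
    # assumed greedy selection for the unshown steps of the algorithm
    G, covered = [], np.zeros(n0, dtype=bool)
    for _ in range(K_init):
        gains = (M & ~covered).sum(axis=1)
        k = int(gains.argmax())
        if gains[k] == 0:
            break
        G.append(k)
        covered |= M[k]
    return G
```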
Equipped with prediction intervals obtained from different groups, our goal is to combine them to increase the probability of obtaining finite prediction intervals and to improve the informative coverage probability. First consider the majority vote procedure (Gasparin and Ramdas, 2024) at $x$:

$$\left\{ y : \frac{1}{K}\sum_{k=1}^{K} \mathbb{1}\left\{ y \in \widehat{C}^{(k)}(x; 1-\alpha, \mathcal{D}_{cal}^{(k)}) \right\} > 1/2 \right\},$$

which includes all $y$ voted for by at least half of the WCP intervals. When there are more than two groups ($K > 2$), such a construction is not effective. Consider a scenario where one group has significant overlap between its $P_X^{(k)}$ and $Q_X$, resulting in a finite prediction interval, while the other groups fail to adequately quantify the uncertainty for the new group and produce the prediction interval $(-\infty, \infty)$ with high probability. In such cases, the majority vote procedure leads to uninformative prediction intervals with high probability. To remedy this issue, we consider Bonferroni's correction following a group selection step. The following ingredients are required for the proposed method:
• $K_{init}$: the initial guess of the number of groups required to encompass $Q_X$
• Algorithm 1: algorithm to perform group selection based on $\mathcal{D}_{tr}^{all} = \mathcal{D}_{tr}^{(0)} \cup (\cup_{k\in[K]} \mathcal{D}_{tr}^{(k)})$

Note that $\mathcal{G}$ is the list of groups selected by Algorithm 1 based on the training data, and $\widehat{C}^B(x; 1-\alpha, \mathcal{G})$ is the intersection of the prediction bands $\widehat{C}^{(k)}(x; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})$ over the selected groups $\mathcal{G}$.
Theorem 2. Let $\mathcal{G}$ be the set of groups selected by Algorithm 1 with inputs $K_{init}$, likelihood ratios $\{w_k : k \in [K]\}$, and training data $\mathcal{D}_{tr}^{all}$. Then the prediction interval defined above satisfies $\mathbb{P}\{Y_0 \in \widehat{C}^B(X_0; 1-\alpha, \mathcal{G})\} \ge 1 - \alpha$.
Moreover, the corresponding informative coverage probability satisfies

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^B(X_0; 1-\alpha, \mathcal{G}) \mid E^B\right\} \ge 1 - \frac{\alpha}{\mathbb{P}(E^B)}, \tag{12}$$

where

$$E^B = \left\{\text{there exists a group } k \in \mathcal{G} \text{ such that } p_0^{w_k}(X_0; \mathcal{D}_{cal}^{(k)}) \le \alpha/|\mathcal{G}|\right\}.$$
See Appendix D.3 for a proof of Theorem 2. Note that WCP based on the selective Bonferroni procedure mitigates the occurrence of infinite prediction intervals, albeit at the expense of tightening the level of the individual WCP intervals. Hence, Bonferroni's correction may result in a wider prediction interval, making it preferable to have fewer groups capable of encompassing the covariate distribution. Algorithm 1 is designed for this purpose, estimating the probability of obtaining finite prediction intervals through $\mathcal{D}_{tr}^{all}$. When computational power is sufficient, one can examine a sequence of values of $K_{init}$ and determine a final group list of smaller size that still ensures finite prediction intervals for $\mathcal{D}_{tr}^{(0)}$ based on $\mathcal{D}_{tr}^{all}$.
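A sketch of the resulting selective Bonferroni interval, reusing wcp_interval from the earlier sketch (the dictionary layout of group_data is our own convention):

```python
import numpy as np

def wcp_sb_interval(x0, f, group_data, G, alpha=0.1):
    """WCP-SB sketch: intersect per-group WCP intervals at level 1 - alpha/|G|.

    group_data: dict k -> (w_k, scores_k, X_cal_k) for the selected groups G
    """
    los, his = [], []
    for k in G:
        w_k, scores_k, X_k = group_data[k]
        lo, hi = wcp_interval(x0, f, scores_k, w_k, X_k, alpha=alpha / len(G))
        los.append(lo)
        his.append(hi)
    lo, hi = max(los), min(his)        # intersection of the |G| intervals
    return (lo, hi) if lo <= hi else (np.nan, np.nan)
```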
$$\mathcal{D}_{cal}^{all} = \{(X_{1,1}, Y_{1,1}), \ldots, (X_{1,n_1}, Y_{1,n_1}), \ldots, (X_{K,1}, Y_{K,1}), \ldots, (X_{K,n_K}, Y_{K,n_K})\}$$

$$\mathcal{D}^{pool} = \left\{(\widetilde{X}_1, \widetilde{Y}_1), \ldots, (\widetilde{X}_n, \widetilde{Y}_n)\right\}.$$
Note that, after the data permutation step, observations share a common covariate distribution $\widetilde{P}_X$, as indicated in equation (13), enabling the use of the WCP framework. However, the marginal coverage guarantee of WCP breaks down due to the correlation within the dataset $\mathcal{D}^{pool}$. Nonetheless, in cases where correlations within $\mathcal{D}^{pool}$ are minimal, the difference between $\mathcal{D}^{pool}$ and its i.i.d. version is insignificant. To summarize, we have the following theorem:
Theorem 3. Suppose the pooled dataset $\mathcal{D}^{pool} = \{(\widetilde{X}_i, \widetilde{Y}_i) : i \in [n]\}$ is obtained and the likelihood ratio $\bar{w} = dQ_X/d\widetilde{P}_X$ is known. Then, it holds that

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool})\right\} \ge 1 - \alpha - d_{TV}(\widetilde{\mathbf{X}}, \mathbf{X}'), \tag{14}$$

where

$$\widehat{C}^P(x; 1-\alpha, \mathcal{D}^{pool}) = \left\{ y \in \mathbb{R} : s(x,y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n} p_i^{\bar{w}}(x; \mathcal{D}^{pool})\delta_{s(\widetilde{X}_i,\widetilde{Y}_i)} + p_0^{\bar{w}}(x; \mathcal{D}^{pool})\delta_{\infty}\right) \right\},$$

$\widetilde{\mathbf{X}} = (\widetilde{X}_1^\top, \ldots, \widetilde{X}_n^\top)$, $\mathbf{X}' = (X_1'^\top, \ldots, X_n'^\top)$, and $X_i' \overset{i.i.d.}{\sim} \widetilde{P}_X$. Moreover, the informative coverage probability of WCP based on data pooling satisfies

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool}) \mid E^P\right\} \ge 1 - \frac{\alpha + d_{TV}(\widetilde{\mathbf{X}}, \mathbf{X}')}{\mathbb{P}(E^P)}. \tag{15}$$
See Appendix D.4 for a proof of Theorem 3.
The above theorem ensures that weighted conformal prediction can provide almost valid coverage when $\widetilde{\mathbf{X}}$ is close to its i.i.d. version $\mathbf{X}'$ in total variation distance. Note that equation (14) is equivalent to replacing $\mathcal{D}^{pool}$ by $\mathcal{D}_{cal}^{all}$. While the total variation distance can be bounded using Pinsker's inequality when $\widetilde{P}_X$ follows a normal distribution, it is generally challenging to control in other cases.
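As a concrete illustration of the pooling step, here is a minimal sketch. It assumes, following the permutation description above, that $\widetilde{P}_X$ is the mixture $\sum_k (n_k/n) P_X^{(k)}$ and that the component densities are available; the function names are ours.

```python
import numpy as np

def pool_and_weight(groups, q_density, p_densities, rng):
    """Data pooling sketch: permute the concatenated calibration data and
    form the likelihood ratio w_bar = dQ_X / dP~_X.

    groups:      list of (X_k, Y_k) calibration arrays, one per observed group
    q_density:   density of Q_X
    p_densities: densities of P_X^(1), ..., P_X^(K)
    """
    X = np.concatenate([Xk for Xk, _ in groups])
    Y = np.concatenate([Yk for _, Yk in groups])
    n = len(X)
    perm = rng.permutation(n)                  # the data permutation step
    X_pool, Y_pool = X[perm], Y[perm]
    props = np.array([len(Xk) for Xk, _ in groups]) / n
    # assumed: P~_X is the mixture sum_k (n_k / n) P_X^(k)
    w_bar = lambda x: q_density(x) / sum(
        p * pk(x) for p, pk in zip(props, p_densities))
    return X_pool, Y_pool, w_bar
```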
Alternatively, imposing a two-layer data generating mechanism can remove the coverage gap posed by the total variation distance. Specifically, one can assume the dataset $\{(g, X_{g,i}, Y_{g,i}) : g \in [K], i \in [n_g]\}$ consists of i.i.d. random variables generated from the following process:

$$\begin{cases} g \sim \mathrm{Multinomial}(q_1, \ldots, q_K), \\ (X_{g,i}, Y_{g,i}) \mid (g = k) \sim P_X^{(k)} \times P_{Y|X}. \end{cases}$$

Consequently, $\{(X_{k,i}, Y_{k,i}) : k \in [K], i \in [n_k]\}$ can be viewed as a realization of i.i.d. random variables distributed as $(\sum_{1\le k\le K} q_k P_X^{(k)}) \times P_{Y|X}$ by ignoring the group information. Without estimating the mixture weights, one can work with the marginal mixture distribution $\sum_{1\le k\le K} q_k P_X^{(k)}$ in the WCP framework. Besides, one can consider designing a weighted population that depends on $\{P_X^{(k)} : k \in [K]\}$ and overlaps well with $Q_X$, as mentioned in Lei and Candès (2021). However, with the available datasets $\{\mathcal{D}_{cal}^{(k)} : k \in [K]\}$, sampling enough i.i.d. data from this population may be challenging, especially when the sample sizes $\{n_k : k \in [K]\}$ are imbalanced.
Note that we present Theorem 2 and Theorem 3 with known likelihood ratios. In practice, however, likelihood ratios are often unknown and need to be estimated. In Appendix C, we provide modified versions of Theorem 2 and Theorem 3 with estimated likelihood ratios, using Theorem 3 in Lei and Candès (2021). Additionally, in Appendix E.3, we discuss how we estimate the likelihood ratios in our numerical experiments.
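For readers who want a concrete starting point, below is one standard density-ratio estimator based on a probabilistic classifier; this is our illustration and not necessarily the exact estimator used in Appendix E.3. The final rescaling matches the normalization $\mathbb{E}_{X\sim P_X}[\widehat{w}(X)] = 1$ assumed in the corollaries.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_likelihood_ratio(X_train, X_test_cov):
    """Estimate w = dQ_X/dP_X by classifying test vs. training covariates."""
    X = np.vstack([X_train, X_test_cov])
    z = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test_cov))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)

    def w_raw(x):
        p = clf.predict_proba(np.atleast_2d(x))[:, 1]
        return (len(X_train) / len(X_test_cov)) * p / (1 - p)

    c = w_raw(X_train).mean()          # empirical normalization over P_X
    return lambda x: w_raw(x) / c
```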
5 Numerical experiments
In this section, simulations are conducted to demonstrate the finite sample performance of the proposed approaches: WCP based on the selective Bonferroni procedure, denoted WCP-SB, and WCP based on the data pooling technique, denoted WCP-P. These methods are compared with a naive alternative, denoted WCP-SS, which selects the shortest WCP interval among the WCP intervals obtained from each single group. Numerical performance is evaluated by four measurements: the marginal coverage probability (MCP), the probability of obtaining informative prediction intervals (IP), the informative coverage probability (ICP), and the average length of finite prediction intervals (AIL). In Section 5.1, we begin with an example of two groups with a one-dimensional covariate and known likelihood ratios. In Section 5.2, more complex numerical examples with a higher covariate dimension, more groups, and unknown likelihood ratios are conducted. For simplicity, we consider homoscedastic errors and use the absolute residual as our score function. To address heteroscedastic errors, one can use methods such as conformalized quantile regression (Romano et al., 2019).

Figure 2: Visualization of data and the pre-trained model with $\sigma^2 = 4$. Left panel: observed Group 1 and Group 2 with the fitted GP model; right panel: test Group 0 with $E(Y|X)$.
where $\mathrm{sigmoid}(x) = \exp(x)/(1 + \exp(x))$. Two settings, $\sigma^2 \in \{1, 4\}$, are considered to demonstrate different scenarios of overlap between the observed groups, with $n_1 = n_2 = 100$. As shown in Figure 2, where $\sigma^2 = 4$, each observed group, either Group 1 in blue or Group 2 in red (left panel), has only partial overlap with the new test group (right panel) and
therefore does not have sufficient information to provide uncertainty quantification for the test Group 0.

Figure 3: Prediction bands by WCP based on Group 1 alone (left) and Group 2 alone (right) with $\sigma^2 = 4$. Each panel shows $E(Y|X)$, the oracle 90% CI, and the informative and uninformative parts of the prediction band.

Figure 4: WCP bands based on the selective Bonferroni procedure and on data pooling with $\sigma^2 = 4$.

Figure 3 demonstrates that WCP based on a single group can lead to uninformative
prediction intervals, i.e., $(-\infty, \infty)$. The prediction bands of the proposed methods, WCP-SB and WCP-P, are provided in Figure 4; both methods reduce the chance of getting an infinite prediction interval, and the constructed intervals are reasonably close to the oracle 90% confidence intervals.
Table 1 summarizes the simulation results based on 5000 replications. WCP-SB is relatively more conservative, with the lowest IP and the largest AIL. WCP-P has the largest IP when $\sigma^2 = 1$, demonstrating its effectiveness in providing informative prediction intervals when groups have relatively separated covariates. For the naive approach, WCP-SS, the MCP is close to the target coverage when the covariates of the two groups are relatively well separated ($\sigma^2 = 1$); the coverage gap becomes more significant when the covariates of the two groups have better overlap ($\sigma^2 = 4$). Moreover, we also conduct experiments involving multiple groups with $K \sim \mathrm{Uniform}(\{3, \ldots, 10\})$. For each group, we set the size of the calibration data and the mean and variance of $P_X^{(k)}$ randomly. See the appendices for the detailed implementation of the experiments and additional results.
where $\sigma_k^2 > 0$, $\rho_k \in [0, 1]$, $\mathbf{1}_d$ is the all-one vector, $k = 1, \ldots, K$, and $I_d$ is the $d$-dimensional identity matrix. We consider the conditional distribution where $X_{k,i,1}$ and $X_{k,i,2}$ denote the first and second coordinates of $X_{k,i}$, respectively. Differently from Section 5.1, where we changed the variance of the covariate distributions, we specify different covariate shifts by varying the correlation $\rho_k$. Two scenarios are considered: weakly correlated, with $\rho_k \in [0, 0.2]$, and strongly correlated, with $\rho_k \in [0.7, 0.9]$. We also implement two different initial settings for WCP-SB: $K_{init} = 1$ and $K_{init} = \min\{K, 3\}$.
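For concreteness, here is a sketch of how such covariates could be generated; the equicorrelation form of $\Sigma_k$ below is our assumption, suggested by the symbols $\mathbf{1}_d$ and $I_d$ named above, and is not guaranteed to match the paper's exact covariance specification.

```python
import numpy as np

def sample_group_covariates(n_k, d, sigma_k, rho_k, rng):
    """Draw n_k covariates from N(0, Sigma_k) with an assumed
    equicorrelation structure:
    Sigma_k = sigma_k^2 * ((1 - rho_k) * I_d + rho_k * 1_d 1_d^T)."""
    one = np.ones((d, 1))
    Sigma = sigma_k**2 * ((1 - rho_k) * np.eye(d) + rho_k * (one @ one.T))
    return rng.multivariate_normal(np.zeros(d), Sigma, size=n_k)
```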
The simulation results based on 5000 replications are summarized in Table 2 for the case $d = 10$. The observations are consistent with those in Section 5.1. With a stronger correlation, $\rho_k \in [0.7, 0.9]$, the data pooling method WCP-P demonstrates the best performance, with MCP and ICP close to 0.9, the largest IP, and the smallest AIL. Additionally, under strong covariate correlation and higher covariate dimension, WCP-SB with $K_{init} = \min\{K, 3\}$ yields a smaller IP than utilizing only one group's information. In Appendix E.3, we provide more simulation details and tables summarizing the simulation results for $d = 5, 20, 50$.
6 Discussion
In this paper, we demonstrate that constructed WCP intervals can be uninformative. The event linked with obtaining an informative prediction interval is explicitly formulated. When multiple sources are available, two approaches are introduced to enhance the informativeness of WCP, and theoretical results are developed for both: WCP based on the selective Bonferroni procedure and WCP based on data pooling. The selective Bonferroni procedure produces relatively conservative prediction intervals; additionally, when the dimension of the covariates increases, IP can decrease with a larger $K_{init}$. The data pooling method, on the other hand, generally outperforms the other alternatives in our numerical experiments. Note that the lower bound in Theorem 3 is relatively conservative; an interesting direction for future work is to explore a sharper lower bound for the data pooling method. Additionally, we leave extensions to distribution shift and conformal risk control in scenarios involving multiple groups for future work.
References
Angelopoulos, A. N., Bates, S., et al. (2023). Conformal prediction: A gentle introduction. Foundations
and Trends® in Machine Learning, 16(4):494–591.
Benjamini, Y. (2010). Simultaneous and selective inference: Current successes and future challenges.
Biometrical Journal, 52(6):708–721.
Berrett, T. B., Wang, Y., Barber, R. F., and Samworth, R. J. (2020). The conditional permutation
test for independence while controlling for confounders. Journal of the Royal Statistical Society
Series B: Statistical Methodology, 82(1):175–197.
Bhattacharyya, A. and Barber, R. F. (2024). Group-weighted conformal prediction. arXiv preprint
arXiv:2401.17452.
Candès, E., Lei, L., and Ren, Z. (2023). Conformalized survival analysis. Journal of the Royal Statistical
Society Series B: Statistical Methodology, 85(1):24–45.
Dunn, R., Wasserman, L., and Ramdas, A. (2023). Distribution-free prediction sets for two-layer
hierarchical models. Journal of the American Statistical Association, 118(544):2491–2502.
Fannjiang, C., Bates, S., Angelopoulos, A. N., Listgarten, J., and Jordan, M. I. (2022). Conformal
prediction under feedback covariate shift for biomolecular design. Proceedings of the National
Academy of Sciences, 119(43):e2204569119.
Gasparin, M. and Ramdas, A. (2024). Merging uncertainty sets via majority vote. arXiv preprint
arXiv:2401.09379.
Lei, J., Robins, J., and Wasserman, L. (2013). Distribution-free prediction sets. Journal of the
American Statistical Association, 108(501):278–287.
Lei, J. and Wasserman, L. (2015). Distribution-free prediction bands for nonparametric regression.
Quality control and applied statistics, 60(1):109–110.
Lei, L. and Candès, E. J. (2021). Conformal inference of counterfactuals and individual treatment
effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(5):911–938.
Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. (2002). Inductive confidence machines
for regression. In Machine learning: ECML 2002: 13th European conference on machine learning
Helsinki, Finland, August 19–23, 2002 proceedings 13, pages 345–356. Springer.
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2022). Dataset shift in machine learning. MIT Press.
Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer school on machine
learning, pages 63–71. Springer.
Romano, Y., Patterson, E., and Candes, E. (2019). Conformalized quantile regression. Advances in
neural information processing systems, 32.
Taylor, J. and Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings of the
National Academy of Sciences, 112(25):7629–7634.
Taylor, J. E. (2018). A selective survey of selective inference. In Proceedings of the International
Congress of Mathematicians: Rio de Janeiro 2018, pages 3019–3038. World Scientific.
Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019). Conformal prediction under
covariate shift. Advances in neural information processing systems, 32.
Vovk, V., Gammerman, A., and Saunders, C. (1999). Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning.
Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic learning in a random world, volume 29.
Springer.
Appendix
Table of Contents
A Mondrianize weighted conformal prediction
D Proofs
  D.1 Theorem 1
  D.2 Remark 2
  D.3 Theorem 2
  D.4 Theorem 3
  D.5 Corollary 1 & 2
E Simulation details
  E.1 Informative WCP intervals
  E.2 Covariate shift: d = 1 with known likelihood ratios
  E.3 Covariate shift: higher dimension with unknown likelihood ratios
$\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for $i \ne j$.

Let $Q_X^{(k)}$ denote the conditional distribution of $X_0$ on the region $\mathcal{X}_k$, where $X_0 \sim Q_X$, and assume that $Q_X^{(k)}$ is absolutely continuous with respect to $P_X^{(k)}$. Let $\bar{w}_k = dQ_X^{(k)}/dP_X^{(k)}$ be the conditional likelihood ratio and set $q_k = \mathbb{P}(X_0 \in \mathcal{X}_k)$.
With the above notations in place, we are now ready to utilize $\{\bar{w}_k : k \in [K]\}$ and $\{\mathcal{D}_{cal}^{(k)} : k \in [K]\}$ to construct a prediction band at $x$. For $k \in [K]$, define

$$C^{(k)}(x; 1-\alpha, \mathcal{D}_{cal}^{(k)}) = \left\{ y : s(x,y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n_k} p_i^{\bar{w}_k}(x; \mathcal{D}_{cal}^{(k)})\delta_{s(X_{k,i},Y_{k,i})} + p_0^{\bar{w}_k}(x; \mathcal{D}_{cal}^{(k)})\delta_{\infty}\right) \right\},$$
where $\mathcal{D}_{cal}^{all} = \cup_{k\in[K]} \mathcal{D}_{cal}^{(k)}$. We point out that when there is no data in the regions outside $\cup_{k\in[K]} \mathcal{X}_k$, there is no way to quantify uncertainty at an $X_0$ lying in those regions without making distributional assumptions about $P_{Y|X}$. Therefore, we use the uninformative interval $(-\infty, \infty)$ when $X_0 \notin \cup_{k\in[K]} \mathcal{X}_k$; that is,

$$\widehat{C}_n(x; 1-\alpha, \mathcal{D}_{cal}^{all}) = \begin{cases} C^{(k)}(x; 1-\alpha, \mathcal{D}_{cal}^{(k)}) & \text{if } x \in \mathcal{X}_k, \\ (-\infty, \infty) & \text{if } x \notin \cup_{k\in[K]} \mathcal{X}_k. \end{cases} \tag{16}$$

With prediction interval (16) at hand, we can show that
$$\begin{aligned}
\mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D}_{cal}^{all})\right\}
&= \sum_{k=1}^{K} \mathbb{P}(X_0 \in \mathcal{X}_k) \cdot \mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D}_{cal}^{all}) \mid X_0 \in \mathcal{X}_k\right\} \\
&\quad + \mathbb{P}\left(X_0 \notin \cup_{k\in[K]}\mathcal{X}_k\right) \cdot \mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D}_{cal}^{all}) \mid X_0 \notin \cup_{k\in[K]}\mathcal{X}_k\right\} \\
&\overset{(i)}{=} \sum_{k=1}^{K} q_k \cdot \mathbb{P}\left\{Y_0 \in C^{(k)}(X_0; 1-\alpha, \mathcal{D}_{cal}^{(k)}) \mid X_0 \in \mathcal{X}_k\right\} \\
&\quad + \left(1 - \sum_{k=1}^{K} q_k\right) \cdot \mathbb{P}\left\{Y_0 \in (-\infty,\infty) \mid X_0 \notin \cup_{k\in[K]}\mathcal{X}_k\right\} \\
&= \sum_{k=1}^{K} q_k \cdot \mathbb{P}\left\{Y_0 \in C^{(k)}(X_0; 1-\alpha, \mathcal{D}_{cal}^{(k)}) \mid X_0 \in \mathcal{X}_k\right\} + 1 - \sum_{k=1}^{K} q_k \\
&\overset{(ii)}{\ge} 1 - \alpha + \alpha \cdot \mathbb{P}\left(X_0 \notin \cup_{k\in[K]}\mathcal{X}_k\right).
\end{aligned}$$

Equality (i) follows from the construction of $\widehat{C}_n(X_0; 1-\alpha, \mathcal{D}_{cal}^{all})$ in (16), and an application of weighted conformal prediction for each $k \in [K]$ yields inequality (ii).
where $\delta_{s(X_{k,i},Y_{k,i})}$ denotes the point mass at $s(X_{k,i}, Y_{k,i})$. Then, the level $1-\alpha$ prediction band at $x$ can be calculated by

$$\widehat{C}_n(x) = \{y : s(x,y) \le \widehat{q}\} \quad \text{where} \quad \widehat{q} = \mathrm{Quantile}\left(1-\alpha;\ \widehat{P}_{\mathrm{score}}\right), \tag{18}$$

where $\widehat{P}_{\mathrm{score}} = \sum_{k=1}^{K} q_k \widehat{P}_{\mathrm{score}}^{(k)}$. In the following proposition, we provide a modified version of Theorem 4.1 in Bhattacharyya and Barber (2024).
Proposition 2. Suppose $\{(X_{k,i}, Y_{k,i})\}_{k\in[K], i\in[n_k]}$ are distributed as in (1). Assume assumption (17) holds and let $(X_0, Y_0)$ be drawn independently from the distribution in (2). Then, the prediction interval defined in (18) satisfies

$$\mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0)\right\} \ge 1 - \alpha - \max_k \frac{q_k}{n_k}. \tag{19}$$
Some observations are in order regarding Proposition 2. Assumption (17) requires that the covariate distribution $Q_X$ can be represented as a mixture of the covariate distributions of the observed groups. This assumption may limit the practicality of applying Proposition 2. One limitation arises from the fact that the test covariate distribution typically differs from the mixture of the covariate distributions of the observed groups. Furthermore, in cases where there is substantial covariate overlap between observed groups, identification issues associated with such a mixture structure may arise. To demonstrate, the idea of GWCP potentially applies only to test group 3 and is not applicable to test groups 1 and 2.
Lemma 1 (Theorem 3 in Lei and Candès (2021)). Assume the same set of assumptions as in Proposition 1 holds. Let $\widehat{w}$ be the estimated likelihood ratio, which is independent of $\mathcal{D}$ and satisfies $\mathbb{E}_{X\sim P_X}\widehat{w}(X) = 1$. Then

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}_n(X_0; 1-\alpha, \mathcal{D})\right\} \ge 1 - \alpha - \frac{1}{2}\mathbb{E}_{X\sim P_X}|w(X) - \widehat{w}(X)|,$$

where

$$\widetilde{C}_n(x; 1-\alpha, \mathcal{D}) = \left\{ y : s(x,y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n} p_i^{\widehat{w}}(x; \mathcal{D})\delta_{s(X_i,Y_i)} + p_0^{\widehat{w}}(x; \mathcal{D})\delta_{\infty}\right) \right\}.$$
For the selective Bonferroni procedure and the data pooling method, we can define

$$\widetilde{C}^B(x; 1-\alpha, \mathcal{G}) = \left\{ y : \sum_{k\in\mathcal{G}} \mathbb{1}\left\{ y \in \widetilde{C}^{(k)}(x; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)}) \right\} = |\mathcal{G}| \right\},$$

$$\widetilde{C}^P(x; 1-\alpha, \mathcal{D}^{pool}) = \left\{ y \in \mathbb{R} : s(x,y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n} p_i^{\widetilde{w}}(x; \mathcal{D}^{pool})\delta_{s(\widetilde{X}_i,\widetilde{Y}_i)} + p_0^{\widetilde{w}}(x; \mathcal{D}^{pool})\delta_{\infty}\right) \right\},$$

where the estimated likelihood ratio $\widehat{w}_k$ is obtained by using $\mathcal{D}_{tr}^{(k)}$ and $\mathcal{D}_{tr}^{(0)}$, while $\widetilde{w}$ is obtained using $\mathcal{D}_{tr}^{all}$ and $\mathcal{D}_{tr}^{(0)}$. With these notations in place, we provide modified versions of Theorem 2 and Theorem 3 with estimated likelihood ratios.
Corollary 1. Let $\mathcal{G}$ be the set of groups selected by Algorithm 1 with inputs $K_{init}$, training data $\mathcal{D}_{tr}^{all}$, and estimated likelihood ratios $\{\widehat{w}_k : k \in [K]\}$. Assuming the estimated likelihood ratios satisfy $\mathbb{E}_{X\sim P_X^{(k)}}[\widehat{w}_k(X) \mid \mathcal{D}_{tr}^{(k)}, \mathcal{D}_{tr}^{(0)}] = 1$, we have

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}^B(X_0; 1-\alpha, \mathcal{G})\right\} \ge 1 - \alpha - \frac{1}{2}\mathbb{E}\sum_{k\in\mathcal{G}} \mathrm{Err}_k, \tag{20}$$

where $\mathrm{Err}_k = \mathbb{E}_{X\sim P_X^{(k)}}[|w_k(X) - \widehat{w}_k(X)| \mid \mathcal{D}_{tr}^{all}]$. Moreover, the corresponding informative coverage probability satisfies

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}^B(X_0; 1-\alpha, \mathcal{G}) \mid \widetilde{E}^B\right\} \ge 1 - \frac{\alpha + \frac{1}{2}\mathbb{E}\sum_{k\in\mathcal{G}} \mathrm{Err}_k}{\mathbb{P}(\widetilde{E}^B)}, \tag{21}$$

where

$$\widetilde{E}^B = \left\{\text{there exists a group } k \in \mathcal{G} \text{ such that } p_0^{\widehat{w}_k}(X_0; \mathcal{D}_{cal}^{(k)}) \le \alpha/|\mathcal{G}|\right\}.$$
Corollary 2. Let $\widetilde{w}$ be the estimated likelihood ratio for $dQ_X/d\widetilde{P}_X$, obtained by using $\mathcal{D}_{tr}^{all}$ and $\mathcal{D}_{tr}^{(0)}$ and satisfying $\mathbb{E}_{X\sim\widetilde{P}_X}[\widetilde{w}(X) \mid \mathcal{D}_{tr}^{all}, \mathcal{D}_{tr}^{(0)}] = 1$. Then, under the same set of assumptions as in Theorem 3,

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool})\right\} \ge 1 - \alpha - d_{TV}(\widetilde{\mathbf{X}}, \mathbf{X}') - \frac{1}{2}\mathbb{E}_{X\sim\widetilde{P}_X}|\bar{w}(X) - \widetilde{w}(X)|. \tag{22}$$
D Proofs
D.1 Theorem 1
Proof. Combining equation (4) and equation (5) yields equation (6). Equation (7) follows from equation (5) and Proposition 1 in Lei and Candès (2021), which states that when $\mathbb{E}_{X\sim P_X}[w(X)^{1+\delta}] < \infty$ for some $\delta \ge 1$, there exists a universal constant $C_1(\delta)$ depending on $\delta$ such that

$$\mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D})\right\} \le 1 - \alpha + \frac{C_1(\delta)}{n^{\delta/(1+\delta)}}.$$
D.2 Remark 2
Proof.

$$\begin{aligned}
\mathbb{P}(E) &= \mathbb{P}\left\{ w(X_0) + \frac{n\alpha}{1-\alpha}\left(1 - \frac{1}{n}\sum_{i=1}^{n} w(X_i)\right) \le \frac{n\alpha}{1-\alpha} \right\} \\
&\ge \mathbb{P}\left\{ w(X_0) \le \frac{n\alpha}{2(1-\alpha)} \right\} \cdot \mathbb{P}\left\{ \frac{1}{n}\sum_{i=1}^{n} w(X_i) \ge \frac{1}{2} \right\} \qquad (24) \\
&\ge \mathbb{E}_{X\sim P_X}\left[ w(X)\,\mathbb{1}\left\{w(X) \le \frac{n\alpha}{2-2\alpha}\right\} \right] \cdot \mathbb{P}\left\{ \left|\frac{1}{n}\sum_{i=1}^{n} w(X_i) - 1\right| \le \frac{1}{2} \right\}
\end{aligned}$$
The first inequality follows from the independence of $X_0$ and $\mathcal{D}$, while the second inequality follows from

$$\mathbb{P}\left\{w(X_0) \le \frac{n\alpha}{2(1-\alpha)}\right\} = \mathbb{E}\,\mathbb{1}\left\{w(X_0) \le \frac{n\alpha}{2(1-\alpha)}\right\} = \mathbb{E}_{X\sim P_X}\left[w(X)\,\mathbb{1}\left\{w(X) \le \frac{n\alpha}{2(1-\alpha)}\right\}\right].$$

By the monotone convergence theorem, we have

$$\lim_{n\to\infty} \mathbb{E}_{X\sim P_X}\left[w(X)\,\mathbb{1}\left\{w(X) \le \frac{n\alpha}{2-2\alpha}\right\}\right] = \mathbb{E}_{X\sim P_X}\left[\lim_{n\to\infty} w(X)\,\mathbb{1}\left\{w(X) \le \frac{n\alpha}{2-2\alpha}\right\}\right] = \mathbb{E}_{X\sim P_X}[w(X)] = 1.$$

Note that $\{w(X_i) : i \in [n]\}$ are i.i.d. random variables, so we can use a concentration inequality to control $\mathbb{P}\left\{\left|\frac{1}{n}\sum_{i=1}^{n} w(X_i) - 1\right| < \frac{1}{2}\right\}$. According to the proof of Theorem 4 in Lei and Candès (2021), we conclude that there exists a constant $C_2(\delta)$ such that

$$\mathbb{P}\left\{\left|\frac{1}{n}\sum_{i=1}^{n} w(X_i) - 1\right| \ge \frac{1}{2}\right\} \le \frac{C_2(\delta)}{n^{\delta'}}.$$
D.3 Theorem 2
Proof. Given $\mathcal{D}_{tr}^{all} = \mathcal{D}_{tr}^{(0)} \cup (\cup_{k\in[K]}\mathcal{D}_{tr}^{(k)})$ and $\mathcal{D}_{cal}^{all} = \cup_{k\in[K]}\mathcal{D}_{cal}^{(k)}$ at hand, we have

$$\begin{aligned}
\mathbb{P}\left\{Y_0 \notin \widehat{C}^B(X_0; 1-\alpha, \mathcal{G})\right\} &= \mathbb{E}\,\mathbb{1}\left\{Y_0 \notin \widehat{C}^B(X_0; 1-\alpha, \mathcal{G})\right\} \\
&\overset{(i)}{\le} \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{1}\left\{Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})\right\}\right\} \\
&= \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{E}\left[\mathbb{1}\left\{Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})\right\} \,\middle|\, \mathcal{D}_{tr}^{all}\right]\right\} \\
&= \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{P}\left[Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)}) \,\middle|\, \mathcal{D}_{tr}^{all}\right]\right\} \overset{(ii)}{\le} \mathbb{E}\left\{|\mathcal{G}| \cdot \frac{\alpha}{|\mathcal{G}|}\right\} = \alpha.
\end{aligned}$$

Inequality (i) follows from the construction of $\widehat{C}^B(x; 1-\alpha, \mathcal{G})$: if $Y_0 \notin \widehat{C}^B(X_0; 1-\alpha, \mathcal{G})$, then $Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})$ for at least one $k \in \mathcal{G}$. Inequality (ii) makes use of the independence between $\mathcal{D}_{tr}^{all}$ and $\mathcal{D}_{cal}^{all}$, as well as the fact that

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)}) \,\middle|\, \mathcal{D}_{tr}^{all}\right\} \ge 1 - \alpha/|\mathcal{G}|.$$
D.4 Theorem 3
Proof. To begin with, we define some notations. Given datasets $\mathcal{D}^{pool}$ and $\mathcal{D}' = \{(X_i', Y_i') : i \in [n]\}$, where $(X_i', Y_i') \overset{i.i.d.}{\sim} \widetilde{P}_X \times P_{Y|X}$, define the response vectors

$$\widetilde{\mathbf{Y}} = (\widetilde{Y}_1, \ldots, \widetilde{Y}_n) \quad \text{and} \quad \mathbf{Y}' = (Y_1', \ldots, Y_n').$$

We use $\Pi$ to denote the set of permutations on $\{1, \ldots, n\}$; there exists a permutation $\pi \in \Pi$ such that $\mathcal{D}^{pool} = \pi(\mathcal{D})$, i.e., for $i \in [n]$, $(\widetilde{X}_i, \widetilde{Y}_i) = (X_{\pi(i)}, Y_{\pi(i)})$. Now we start our proof by applying Proposition 1, which yields

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^P(X_0; 1-\alpha, \mathcal{D}')\right\} \ge 1 - \alpha. \tag{25}$$
Inequality (i) follows from the definition of total variation distance and the independence between $(X_0, Y_0)$ and $\mathcal{D}', \mathcal{D}^{pool}$. To show the validity of equation (26), we use equation (10) in Berrett et al. (2020), according to which it suffices to show that

$$(\mathbf{Y}' \mid \mathbf{X}' = \mathbf{x}) \overset{d}{=} (\widetilde{\mathbf{Y}} \mid \widetilde{\mathbf{X}} = \mathbf{x}) \quad \text{for any } \mathbf{x}^\top \in \mathbb{R}^{nd}. \tag{27}$$

For simplicity, we prove equation (27) for the case when $P_{Y|X=x}$ is a discrete distribution for all $x \in \mathbb{R}^d$, and define $h(y|x) = \mathbb{P}_{U\sim P_{Y|X=x}}(U = y)$. Let $\mathbf{x} = (x_1^\top, \ldots, x_n^\top)$ and $\mathbf{y} = (y_1, \ldots, y_n)$, where $x_i \in \mathbb{R}^d$.
Equality (ii) follows from $\pi \sim \mathrm{Uniform}(\Pi)$. By the independence between observations in $\mathcal{D}$ and $\mathcal{D}'$, we have

$$\mathbb{P}\left(\forall i \in [n] : Y_{\pi(i)} = y_i,\ X_{\pi(i)} = x_i\right) = \mathbb{P}\left(\forall i \in [n] : X_{\pi(i)} = x_i\right) \cdot \prod_{i\in[n]} h(y_i|x_i) \tag{29}$$

$$\text{and} \quad \mathbb{P}\left(\mathbf{Y}' = \mathbf{y} \mid \mathbf{X}' = \mathbf{x}\right) = \prod_{i\in[n]} h(y_i|x_i), \tag{30}$$

which proves equation (27) and establishes equation (26). Subsequently, by taking the expectation over $(X_0, Y_0)$ and combining equation (25), we derive

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool})\right\} \ge 1 - \alpha - d_{TV}\left(\widetilde{\mathbf{X}}, \mathbf{X}'\right).$$
D.5 Corollary 1 & 2
Note that given $\mathcal{D}_{tr}^{(k)}$ and $\mathcal{D}_{tr}^{(0)}$, $\widehat{w}_k$ can be viewed as known. The same applies to $\widetilde{w}$ given $\mathcal{D}_{tr}^{all}$ and $\mathcal{D}_{tr}^{(0)}$.

Proof of Corollary 1.

$$\begin{aligned}
\mathbb{P}\left\{Y_0 \notin \widetilde{C}^B(X_0; 1-\alpha, \mathcal{G})\right\} &= \mathbb{E}\,\mathbb{1}\left\{Y_0 \notin \widetilde{C}^B(X_0; 1-\alpha, \mathcal{G})\right\} \\
&\le \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{1}\left\{Y_0 \notin \widetilde{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})\right\}\right\} \\
&= \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{E}\left[\mathbb{1}\left\{Y_0 \notin \widetilde{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})\right\} \,\middle|\, \mathcal{D}_{tr}^{all}\right]\right\} \\
&= \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{P}\left[Y_0 \notin \widetilde{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)}) \,\middle|\, \mathcal{D}_{tr}^{all}\right]\right\} \\
&\overset{(i)}{\le} \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \left(\frac{\alpha}{|\mathcal{G}|} + \frac{1}{2}\mathrm{Err}_k\right)\right\} \le \alpha + \frac{1}{2}\mathbb{E}\sum_{k\in\mathcal{G}} \mathrm{Err}_k.
\end{aligned}$$
Proof of Corollary 2. We adopt the same set of notations as in Theorem 3. Following the proof of Theorem 3, we have

$$\begin{aligned}
\mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool}) \,\middle|\, \mathcal{D}_{tr}^{(0)}, \mathcal{D}_{tr}^{all}, X_0, Y_0\right\}
&\overset{(i)}{\ge} \mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}') \,\middle|\, \mathcal{D}_{tr}^{(0)}, \mathcal{D}_{tr}^{all}, X_0, Y_0\right\} - d_{TV}\left((\widetilde{\mathbf{X}}, \widetilde{\mathbf{Y}}), (\mathbf{X}', \mathbf{Y}')\right) \\
&= \mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}') \,\middle|\, \mathcal{D}_{tr}^{(0)}, \mathcal{D}_{tr}^{all}, X_0, Y_0\right\} - d_{TV}\left(\widetilde{\mathbf{X}}, \mathbf{X}'\right).
\end{aligned}$$

Therefore, we have

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool})\right\} \ge \mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}')\right\} - d_{TV}\left(\widetilde{\mathbf{X}}, \mathbf{X}'\right). \tag{31}$$

By Lemma 1, $\mathbb{P}\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}')\} \ge 1 - \alpha - \frac{1}{2}\mathbb{E}_{X\sim\widetilde{P}_X}|\bar{w}(X) - \widetilde{w}(X)|$, and combining this with (31) yields

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool})\right\} \ge 1 - \alpha - d_{TV}\left(\widetilde{\mathbf{X}}, \mathbf{X}'\right) - \frac{1}{2}\mathbb{E}_{X\sim\widetilde{P}_X}|\bar{w}(X) - \widetilde{w}(X)|,$$

which completes the proof of equation (22). Subsequently, we prove equation (23) by noting
E Simulation details
E.1 Informative WCP intervals
Details of Figure 1

We use the absolute residual as our score function, i.e., $s(f(x), y) = |f(x) - y|$. Here, the function $f$ is obtained by (see the sketch after this list):

• data: a set of i.i.d. pre-training data points $\{(X_i^{pre}, Y_i^{pre})\}$ of size 100 with

$$X_i^{pre} \overset{i.i.d.}{\sim} \mathrm{Uniform}(-20, 20) \quad \text{and} \quad Y_i^{pre} \mid X_i^{pre} \sim N(\mathrm{sigmoid}(X_i^{pre}), 0.01)$$

• model: a Gaussian process model (Rasmussen, 2003) with a radial basis function (RBF) kernel, implemented using the python package GPy
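A minimal sketch of this pre-training step (the variable names are ours; GPy is the package named above):

```python
import numpy as np
import GPy

rng = np.random.default_rng(0)

# pre-training data as described above: X ~ Uniform(-20, 20),
# Y | X ~ N(sigmoid(X), 0.01), i.e., standard deviation 0.1
X_pre = rng.uniform(-20, 20, size=(100, 1))
sigmoid = lambda x: np.exp(x) / (1 + np.exp(x))
Y_pre = rng.normal(sigmoid(X_pre), 0.1)

# Gaussian process regression with an RBF kernel via GPy
kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPRegression(X_pre, Y_pre, kernel)
model.optimize()

# posterior mean used as the pre-trained regression function f
f = lambda x: model.predict(np.atleast_2d(x))[0]
```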
Figure 6: Marginal coverage of WCP intervals (left), probability of getting finite WCP intervals (middle), and informative coverage of WCP intervals (right) for $n = 10, 50, 100$, plotted against $\sigma$, the standard deviation of the covariate distribution.
Figure 7: Visualization of the two pre-trained models together with $E(Y|X)$. Model 1 is trained using data with less noise, and Model 2 is trained using noisier data.
The effect of the pre-trained model

We also notice that the informative coverage probability is influenced by the quality of the pre-trained model. To explore this, we consider a Gaussian process model trained on noisier data. The training dataset $\{(X_i^{pre}, Y_i^{pre})\}$ now comprises 100 observations with the following distribution:

$$X_i^{pre} \overset{i.i.d.}{\sim} \mathrm{Uniform}(-20, 20) \quad \text{and} \quad Y_i^{pre} \mid X_i^{pre} \sim N(\mathrm{sigmoid}(X_i^{pre}), 1).$$

We visualize these pre-trained models in Figure 7. Furthermore, we generate figures similar to Figure 1 and Figure 6 and present them in Figure 8. Notably, the informative coverage plot in Figure 6 and the corresponding varying-variance plot in Figure 8 exhibit a significant difference. This discrepancy arises from using models of different accuracy: the magnitude of the scores is inflated by using a less accurate model, leading to increased lengths of the prediction intervals. To demonstrate, we conduct a simulation comparing the lengths of prediction intervals in the varying variance case and present the results in Figure 9.
Figure 8: Same setup as Figure 1 and Figure 6 with noisier training data. Top row: marginal coverage, probability of finite WCP intervals, and informative coverage against $\mu$, the mean of the covariate distribution. Bottom row: the same quantities against $\sigma$, the standard deviation of the covariate distribution.
From Figure 9, we can observe that the average lengths of finite prediction intervals obtained using the more accurate model are smaller. This indicates that the increase in informative coverage probability in Figure 8 for the varying variance case is due to the inflation of interval lengths caused by the less accurate model.
Figure 9: Average length of finite WCP intervals for $n = 10, 50, 100$ against $\sigma$, the standard deviation of the covariate distribution. Left: using Model 1. Right: using Model 2.
E.2 Covariate shift: d = 1 with known likelihood ratios

Figure 10: Visualization of observed groups with $\sigma^2 = 1$.
Figure 11: Prediction bands by WCP on the observed groups with $\sigma^2 = 1$.
Figure 12: First row: prediction bands by WCP based on the selective Bonferroni procedure and WCP based on data pooling with $\sigma^2 = 1$. Second row: prediction bands obtained by selecting the shorter WCP interval among the two groups, with $\sigma^2 = 1$ and $\sigma^2 = 4$.
E.3 Covariate shift: higher dimension with unknown likelihood ratios

We consider covariate dimension $d \in \{5, 10, 20, 50\}$.

• $\sigma_k = \mathrm{Sd}(P_X^{(k)}) \sim \mathrm{Uniform}(0.8, 1)$ for $k \in [K]$
• $\rho_k \sim \mathrm{Uniform}(0, 0.2)$ or $\rho_k \sim \mathrm{Uniform}(0.7, 0.9)$ for $k \in [K]$

Subsequently, we generate $\mathcal{D}_{cal}^{(k)}$ and $\mathcal{D}_{tr}^{(k)}$ for $k \in [K]$, and $\mathcal{D}_{tr}^{(0)}$ with $n_0 = \sum_{k\in[K]} n_k$. Lastly, we sample $(X_0, Y_0) \sim Q_X \times P_{Y|X}$ as a test data point, for which we compute the weighted prediction interval and evaluate the coverage probability and the length of the prediction interval.

Tables

Tables 2, 4, 5, and 6 are obtained by running the experiment with 5000 replications.
Table 5: Method comparison with d = 20 and K ≥ 2

                                  ρk ~ Uniform(0, 0.2)        ρk ~ Uniform(0.7, 0.9)
Method                            MCP    IP     ICP    AIL    MCP    IP     ICP    AIL
WCP-SB (Kinit = 1)                0.919  0.961  0.916  0.500  0.979  0.391  0.945  0.565
WCP-SB (Kinit = min{K, 3})        0.969  0.965  0.968  0.617  0.979  0.390  0.945  0.565
WCP-P                             0.905  0.999  0.905  0.471  0.920  0.912  0.912  0.479
WCP-SS                            0.881  0.999  0.881  0.444  0.956  0.693  0.937  0.548
Covariate vector with weak correlation. Note that selecting the shortest WCP interval among those based on each single group achieves the highest IP. However, this method fails to provide a valid coverage probability: both MCP and ICP fall below the target level of 0.9. On the other hand, WCP based on the selective Bonferroni procedure with $K_{init} = 1$ performs similarly to WCP based on data pooling in this setup, though the data pooling method exhibits a slightly larger IP and shorter AIL. WCP based on the selective Bonferroni procedure with $K_{init} = \min\{K, 3\}$ is more conservative, which improves MCP, IP, and ICP at the cost of inflating the lengths of the informative prediction intervals. It is important to note that when $d = 50$, WCP-SB with $K_{init} = \min\{K, 3\}$ even has a smaller IP.

Covariate vector with strong correlation. When the covariate vector is strongly correlated, methods other than WCP based on data pooling show a significant decrease in IP as the dimension $d$ increases. Meanwhile, WCP based on data pooling maintains MCP and ICP close to the target level, while also achieving the highest IP and shortest AIL. Note that when the dimension is high, WCP-SB with $K_{init} = \min\{K, 3\}$ has nearly the same performance as WCP-SB with $K_{init} = 1$, indicating that predominantly only one group is selected even with the specified $K_{init} > 1$.