Informativeness of Weighted Conformal Prediction
Abstract
Weighted conformal prediction (WCP), a recently proposed framework, provides uncertainty quantification with the flexibility to accommodate different covariate distributions between training and test data. However, as we point out in this paper, the effectiveness of WCP relies heavily on the overlap between covariate distributions; insufficient overlap can lead to uninformative prediction intervals. To enhance the informativeness of WCP, we propose two methods for scenarios involving multiple sources with varied covariate distributions. We establish theoretical guarantees for our proposed methods and demonstrate their efficacy through simulations.
1 Introduction
In recent years, there has been an extraordinary surge in computational power and sophisticated machine learning models, revolutionizing various fields, from artificial intelligence to scientific research and beyond. These machine learning models are trained on vast amounts of data to comprehend and predict complex phenomena such as weather forecasting and disease diagnostics. However, as problems grow in complexity, it is crucial not only to provide accurate predictions but also to quantify the associated uncertainties.
Conformal prediction, a methodology for constructing prediction intervals, has gained significant attention and popularity for its ability to assess the uncertainties of machine learning models (Vovk et al., 1999; Papadopoulos et al., 2002; Vovk et al., 2005; Lei et al., 2013; Lei and Wasserman, 2015; Angelopoulos et al., 2023). One reason for the prominence of conformal prediction is its capacity to provide nonasymptotic coverage guarantees for any black-box algorithm, regardless of the underlying distribution. This remarkable feature is achieved by relying on the exchangeability of the data points. In practice, however, data points are not guaranteed to be exchangeable, one notable example being covariate shift between training and test distributions in supervised learning tasks (Quiñonero-Candela et al., 2022). A recent framework, weighted conformal prediction (Tibshirani et al., 2019), offers a solution in the regression setup by incorporating knowledge of the likelihood ratio between the training and test covariate distributions.
While weighted conformal prediction has demonstrated successful applications in diverse domains such as experimental design, survival analysis, and causal inference (e.g., see Fannjiang et al. (2022); Lei and Candès (2021); Candès et al. (2023)), the effectiveness of this framework heavily depends on the overlap of covariate distributions between training and test. In Figure 1, a simple example demonstrates that the constructed WCP intervals can be uninformative in certain cases. We examine a regression example with $Q_X = N(0, 9)$ representing the covariate distribution of test data and $P_X = N(\mu, 9)$ representing the covariate distribution of training data. Three sample sizes, $n = 10, 50, 100$, are considered, and the empirical results for the constructed WCP intervals are obtained from 10000 replications.¹ We reduce the overlap of the covariate distributions $P_X$ and $Q_X$ by increasing $\mu$, the mean of $P_X$.

¹The codes used in this paper are available on our Github page [link].

Figure 1: Marginal coverage of WCP intervals (left), probability of getting finite WCP intervals (middle), and informative coverage of WCP intervals (right) for $n = 10, 50, 100$, plotted against $\mu$, the mean of the covariate distribution.
Although WCP intervals provide marginal coverage above the target level 0.9 (left panel of Figure 1), this coverage guarantee is accomplished at the cost of an increasing probability of uninformative prediction intervals, $(-\infty, \infty)$, as the overlap decreases (middle panel of Figure 1). Furthermore, conditional on the prediction interval being finite, the right panel of Figure 1 shows that the conditional coverage, referred to as the informative coverage probability, decreases and falls below the target level as the overlap decreases. The decrease in coverage appears to be more significant for smaller sample sizes. Motivated by this example, when evaluating the efficacy of WCP intervals, one should assess the probability of obtaining informative prediction intervals and the informative coverage probability as more direct metrics than the marginal coverage probability.
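To see why the overlap matters, note that for the Gaussian setup of Figure 1 the likelihood ratio is available in closed form:

$$w(x) = \frac{dQ_X}{dP_X}(x) = \frac{\exp(-x^2/18)}{\exp(-(x-\mu)^2/18)} = \exp\!\left(\frac{\mu^2 - 2\mu x}{18}\right).$$

A finite WCP interval requires, roughly, $w(X_0) \lesssim \alpha n/(1-\alpha)$ (see the event $E$ defined below); since $X_0 \sim N(0, 9)$, the exponent $(\mu^2 - 2\mu X_0)/18$ grows quadratically in $\mu$ at typical test points, so this condition fails with increasing probability as $\mu$ grows, matching the middle panel of Figure 1.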
In addition to the issue of uninformativeness that WCP intervals may present, another practical concern with the WCP framework arises when dealing with training data sourced from multiple groups with varied covariate distributions. This scenario is quite common in practice, particularly in medical studies aimed at predicting treatment effects for patients using various covariates such as age, gender, and medical history. Data are often collected from different hospitals or clinics, each possessing its own distinct patient population and covariate distribution. Although in theory we can apply the generalized WCP techniques from Tibshirani et al. (2019), the resulting weight functions are complex and, consequently, not practical, as noted in Lei and Candès (2021). A recent work, Bhattacharyya and Barber (2024), focuses on achieving a marginal coverage guarantee in a special scenario where both training and test data can be viewed as collected via stratified sampling. Specifically, the covariate $X$ is represented as $X = (X^0, X^1)$, with $X^0 \in [K]$ encoding the group information, and the test distribution exhibits covariate shift only in $X^0$. The aim of this paper is to address a more general scenario where covariates do not contain explicit group information.
Motivated by the preceding discussions, two metrics, the probability of obtaining informative prediction intervals and the informative coverage probability, are important for evaluating the informativeness of WCP-based procedures. When multiple sources (we will use the words groups and sources interchangeably) are present in the training data with covariate shifts, it is crucial to adapt the WCP framework to handle multiple varied covariate distributions and thereby enhance informativeness.
Contribution of this work. This paper focuses on improving the informativeness of WCP when multiple sources with covariate shifts are available. Two procedures, WCP based on selective Bonferroni and WCP based on data pooling, are proposed to integrate information from different sources. The proposed approaches aim to increase the probability of obtaining a finite prediction interval, thereby ensuring that the informative coverage probability closely approximates the target coverage probability. We establish theoretical guarantees for these methods and provide empirical evidence of their effectiveness in numerical experiments.
where covariates from the test group have marginal distribution $Q_X$ and outcomes $\{Y_{0,i} : i \in [n_0]\}$ are not observed. We assume $Q_X$ is known (i.e., an unlabeled dataset is available for training purposes, which we denote as $\mathcal{D}_{tr}^{(0)} = \{X_{0,i} : i \in [n_0]\}$). Note that $Q_X$ can differ from the covariate distributions of the observed groups. Without further explanation, we assume in the following discussion that the covariate distributions $Q_X$ and $\{P_X^{(k)} : k \in [K]\}$ are pairwise absolutely continuous with respect to each other.
Lastly, it is worth mentioning that conformal prediction aims to create a prediction interval for $Y_{0,i}$ with the following guarantee:

$$\mathbb{P}\left\{Y_{0,i} \in \widehat{C}_n(X_{0,i})\right\} \ge 1 - \alpha,$$

where $\widehat{C}_n(x)$ is a prediction band constructed based on the available data sources. Beyond ensuring a theoretical guarantee for the marginal coverage probability, our aim in this study is to leverage multiple sources to increase the probability of obtaining a finite prediction interval and to improve the informative coverage probability.
Remark 1. The two-layer hierarchical model studied in Dunn et al. (2023) assumes exchangeability between the covariate distributions of the observed groups and the covariate distribution of the test group (i.e., these covariate distributions are drawn independently and identically distributed from a certain distribution). In this work, we do not make such an assumption.
Two special cases. Suppose the observed groups have separated supports; in such cases, a combination of Mondrian conformal prediction (Vovk et al., 2005) and weighted conformal prediction (Tibshirani et al., 2019) could be effective. When the observed groups have varying levels of overlap among themselves, the problem becomes more challenging. With overlapping supports, we also consider the special scenario where $Q_X$ can be expressed as a mixture of $\{P_X^{(k)} : k \in [K]\}$. In this case, the idea of group-weighted conformal prediction (Bhattacharyya and Barber, 2024) can be useful. However, when covariates do not explicitly contain group information, such a mixture structure can impose practical limitations and suffer from certain identification issues. More details on these two cases are given in Appendices A and B.
$$p_0^w(x; \mathcal{D}) = \frac{w(x)}{w(x) + \sum_{j=1}^{n} w(X_j)} \quad \text{and} \quad p_i^w(x; \mathcal{D}) = \frac{w(X_i)}{w(x) + \sum_{j=1}^{n} w(X_j)}, \tag{3}$$

where $w = dQ_X/dP_X$ denotes the likelihood ratio.
By reweighting the scores based on the dataset $\mathcal{D}$, one can attain a finite sample guarantee when the likelihood ratio is known. While equation (4) ensures a marginal coverage guarantee for $Y_0$, the prediction interval is notably conservative, as demonstrated in Figure 1, when $P_X$ and $Q_X$ have large total variation distance, denoted by $d_{TV}(P_X, Q_X) = \sup_A |P_X(A) - Q_X(A)|$. Specifically, when the event $E = \{p_0^w(X_0; \mathcal{D}) \le \alpha\}$ does not happen, the resulting WCP interval is uninformative: $\widehat{C}_n(X_0; 1-\alpha, \mathcal{D}) = (-\infty, \infty)$.
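To make the construction concrete, here is a minimal sketch of a WCP interval with the absolute-residual score $s(x, y) = |f(x) - y|$; the helper names and signatures are ours, not from the paper.

```python
import numpy as np

def wcp_interval(x0, f, scores, w, X_cal, alpha=0.1):
    """Weighted conformal prediction interval at x0 (absolute-residual score).

    scores: conformal scores |f(X_i) - Y_i| on the calibration data D
    w:      likelihood ratio function dQ_X / dP_X (vectorized)
    Returns (lo, hi); (-inf, inf) when the interval is uninformative.
    """
    w_cal, w0 = w(X_cal), w(x0)
    total = w0 + w_cal.sum()
    p_cal = w_cal / total              # p_i^w(x0; D), i = 1, ..., n
    p0 = w0 / total                    # p_0^w(x0; D), mass placed at +infinity
    if p0 > alpha:                     # event E fails: quantile hits the mass at +infinity
        return -np.inf, np.inf
    order = np.argsort(scores)         # weighted (1 - alpha)-quantile of the scores
    cum = np.cumsum(p_cal[order])
    q = scores[order][np.searchsorted(cum, 1 - alpha)]
    return f(x0) - q, f(x0) + q
```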
For this reason, we decompose the marginal coverage probability in equation (4) as:

$$\mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D})\right\} = 1 - \mathbb{P}(E) + \mathbb{P}(E) \cdot \mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D}) \mid E\right\}. \tag{5}$$

We refer to the conditional coverage probability in equation (5) as the informative coverage probability and present its properties in Theorem 1.
Theorem 1. Under the same assumptions as in Proposition 1, it holds that

$$\mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D}) \mid E\right\} \ge 1 - \frac{\alpha}{\mathbb{P}(E)}. \tag{6}$$
With $Q_X$ fixed, $\mathbb{P}(E)$ depends on the likelihood ratio $w$ and the sample size of $\mathcal{D}$. This dependency and the finite sample performance can be studied through the lower bound provided in equation (8). When there exists some $\delta > 0$ such that $\mathbb{E}_{X\sim P_X}[w(X)^{1+\delta}] < \infty$, the second term on the right side of equation (8) can be controlled using concentration inequalities. As $n$ grows sufficiently large, this lower bound approaches 1. Detailed discussions of equation (8) can be found in Appendix D.2.
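As a quick empirical check (a sketch we add for illustration, using the closed-form likelihood ratio of the Gaussian example from Figure 1; the constant 18 is $2\sigma^2$ with $\sigma^2 = 9$), one can estimate $\mathbb{P}(E)$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, mu, n, reps = 0.1, 4.0, 50, 10_000

# likelihood ratio dQ_X/dP_X for Q_X = N(0, 9) and P_X = N(mu, 9)
w = lambda x: np.exp((mu**2 - 2 * mu * x) / 18)

hits = 0
for _ in range(reps):
    X = rng.normal(mu, 3, size=n)   # calibration covariates drawn from P_X
    x0 = rng.normal(0, 3)           # test covariate drawn from Q_X
    p0 = w(x0) / (w(x0) + w(X).sum())
    hits += p0 <= alpha             # event E: the WCP interval is finite
print(f"estimated P(E) = {hits / reps:.3f}")
```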
Algorithm 1 Group selection
Input: number of groups $K_{init}$ to be selected, likelihood ratios $\{w_k : k \in [K]\}$, training data set $\mathcal{D}_{tr}^{all}$
Procedure:
1: Initialize a list $\mathcal{G} = \{\}$ and a 0-1 matrix $M$ of dimension $K \times n_0$ (the $(k,i)$-th element estimates whether group $k$ can provide a finite prediction interval for $X_{0,i}$ at level $1 - \alpha/K_{init}$)
2: Compute the $(k,i)$-th element of $M$ as the indicator of

$$\frac{w_k(X_{0,i})}{w_k(X_{0,i}) + \sum_{j \in I_{tr}^{(k)}} w_k(X_{k,j})} \le \alpha/K_{init}.$$
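The sketch below mirrors the two steps shown above; since the remaining steps of Algorithm 1 are not reproduced here, the selection rule we use (greedily adding the group that covers the most yet-uncovered test covariates, up to $K_{init}$ groups) is our own assumption for illustration.

```python
import numpy as np

def select_groups(w_list, X0_tr, X_tr_list, alpha=0.1, K_init=2):
    """Sketch of Algorithm 1: pick groups that can give finite intervals.

    w_list:    likelihood ratio functions w_k = dQ_X / dP_X^(k) (vectorized)
    X0_tr:     unlabeled test covariates D_tr^(0)
    X_tr_list: training covariates of each observed group
    """
    K, n0 = len(w_list), len(X0_tr)
    M = np.zeros((K, n0), dtype=bool)
    for k, (wk, Xk) in enumerate(zip(w_list, X_tr_list)):
        denom = wk(X0_tr) + wk(Xk).sum()
        M[k] = wk(X0_tr) / denom <= alpha / K_init   # finite at level 1 - alpha/K_init
    # assumed greedy selection for the unshown steps of the algorithm
    G, covered = [], np.zeros(n0, dtype=bool)
    for _ in range(K_init):
        gains = (M & ~covered).sum(axis=1)
        k = int(gains.argmax())
        if gains[k] == 0:
            break
        G.append(k)
        covered |= M[k]
    return G
```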
Equipped with prediction intervals obtained from different groups, our goal is to combine them to increase the probability of obtaining finite prediction intervals and to improve the informative coverage probability. First consider the majority vote procedure (Gasparin and Ramdas, 2024) at $x$:

$$\left\{ y : \frac{1}{K}\sum_{k=1}^{K} \mathbb{1}\left\{ y \in \widehat{C}^{(k)}(x; 1-\alpha, \mathcal{D}_{cal}^{(k)}) \right\} > 1/2 \right\},$$

which includes all $y$ voted for by at least half of the WCP intervals. When there are more than two groups ($K > 2$), such a construction is not effective. Consider a scenario where one group has significant overlap between its $P_X^{(k)}$ and $Q_X$, resulting in a finite prediction interval, while the other groups fail to adequately quantify the uncertainty for the new group and produce the prediction interval $(-\infty, \infty)$ with high probability. In such cases, the majority vote procedure leads to uninformative prediction intervals with high probability. To remedy this issue, we consider Bonferroni's correction following a group selection step. The following ingredients are required for the proposed method:
• $K_{init}$: the initial guess of the number of groups required to encompass $Q_X$
• Algorithm 1: algorithm to perform group selection based on $\mathcal{D}_{tr}^{all} = \mathcal{D}_{tr}^{(0)} \cup (\cup_{k\in[K]} \mathcal{D}_{tr}^{(k)})$

Note that $\mathcal{G}$ is the list of groups selected by Algorithm 1 based on the training data, and $\widehat{C}^B(x; 1-\alpha, \mathcal{G})$ is the intersection of the prediction bands $\widehat{C}^{(k)}(x; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})$ over the selected groups $\mathcal{G}$.
Theorem 2. Let $\mathcal{G}$ be the set of groups selected by Algorithm 1 with inputs $K_{init}$, likelihood ratios $\{w_k : k \in [K]\}$, and training data $\mathcal{D}_{tr}^{all}$. Then the prediction interval defined above satisfies $\mathbb{P}\{Y_0 \in \widehat{C}^B(X_0; 1-\alpha, \mathcal{G})\} \ge 1 - \alpha$.
Moreover, the corresponding informative coverage probability satisfies

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^B(X_0; 1-\alpha, \mathcal{G}) \mid E^B\right\} \ge 1 - \frac{\alpha}{\mathbb{P}(E^B)}, \tag{12}$$

where

$$E^B = \left\{\text{there exists a group } k \in \mathcal{G} \text{ such that } p_0^{w_k}(X_0; \mathcal{D}_{cal}^{(k)}) \le \alpha/|\mathcal{G}|\right\}.$$
See Appendix D.3 for a proof of Theorem 2. Note that WCP based on the selective Bonferroni procedure mitigates the occurrence of infinite prediction intervals, albeit at the expense of tightening the level of the individual WCP intervals. Hence, Bonferroni's correction may result in a wider prediction interval, making it preferable to have fewer groups capable of encompassing the covariate distribution. Algorithm 1 is designed for this purpose, estimating the probability of obtaining finite prediction intervals through $\mathcal{D}_{tr}^{all}$. When computational power is sufficient, one can examine a sequence of values of $K_{init}$ and determine a final group list of smaller size that still ensures finite prediction intervals for $\mathcal{D}_{tr}^{(0)}$ based on $\mathcal{D}_{tr}^{all}$.
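A sketch of the resulting selective Bonferroni interval, reusing wcp_interval from the earlier sketch (the dictionary layout of group_data is our own convention):

```python
import numpy as np

def wcp_sb_interval(x0, f, group_data, G, alpha=0.1):
    """WCP-SB sketch: intersect per-group WCP intervals at level 1 - alpha/|G|.

    group_data: dict k -> (w_k, scores_k, X_cal_k) for the selected groups G
    """
    los, his = [], []
    for k in G:
        w_k, scores_k, X_k = group_data[k]
        lo, hi = wcp_interval(x0, f, scores_k, w_k, X_k, alpha=alpha / len(G))
        los.append(lo)
        his.append(hi)
    lo, hi = max(los), min(his)        # intersection of the |G| intervals
    return (lo, hi) if lo <= hi else (np.nan, np.nan)
```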
$$\mathcal{D}_{cal}^{all} = \{(X_{1,1}, Y_{1,1}), \ldots, (X_{1,n_1}, Y_{1,n_1}), \ldots, (X_{K,1}, Y_{K,1}), \ldots, (X_{K,n_K}, Y_{K,n_K})\}$$

$$\mathcal{D}^{pool} = \left\{(\widetilde{X}_1, \widetilde{Y}_1), \ldots, (\widetilde{X}_n, \widetilde{Y}_n)\right\}.$$
Note that, after the data permutation step, observations share a common covariate distribution $\widetilde{P}_X$, as indicated in equation (13), enabling the use of the WCP framework. However, the marginal coverage guarantee of WCP breaks down due to the correlation within the dataset $\mathcal{D}^{pool}$. Nonetheless, in cases where correlations within $\mathcal{D}^{pool}$ are minimal, the difference between $\mathcal{D}^{pool}$ and its i.i.d. version is insignificant. To summarize, we have the following theorem:
Theorem 3. Suppose the pooled dataset $\mathcal{D}^{pool} = \{(\widetilde{X}_i, \widetilde{Y}_i) : i \in [n]\}$ is obtained and the likelihood ratio $\bar{w} = dQ_X/d\widetilde{P}_X$ is known. Then, it holds that

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool})\right\} \ge 1 - \alpha - d_{TV}(\widetilde{\mathbf{X}}, \mathbf{X}'), \tag{14}$$

where

$$\widehat{C}^P(x; 1-\alpha, \mathcal{D}^{pool}) = \left\{ y \in \mathbb{R} : s(x,y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n} p_i^{\bar{w}}(x; \mathcal{D}^{pool})\delta_{s(\widetilde{X}_i,\widetilde{Y}_i)} + p_0^{\bar{w}}(x; \mathcal{D}^{pool})\delta_{\infty}\right) \right\},$$

$\widetilde{\mathbf{X}} = (\widetilde{X}_1^\top, \ldots, \widetilde{X}_n^\top)$, $\mathbf{X}' = (X_1'^\top, \ldots, X_n'^\top)$, and $X_i' \overset{i.i.d.}{\sim} \widetilde{P}_X$. Moreover, the informative coverage probability of WCP based on data pooling satisfies

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool}) \mid E^P\right\} \ge 1 - \frac{\alpha + d_{TV}(\widetilde{\mathbf{X}}, \mathbf{X}')}{\mathbb{P}(E^P)}. \tag{15}$$
See Appendix D.4 for a proof of Theorem 3.
The above theorem ensures that weighted conformal prediction can provide almost valid coverage when $\widetilde{\mathbf{X}}$ is close to its i.i.d. version $\mathbf{X}'$ in total variation distance. Note that equation (14) is equivalent to replacing $\mathcal{D}^{pool}$ by $\mathcal{D}_{cal}^{all}$. While the total variation distance can be bounded using Pinsker's inequality when $\widetilde{P}_X$ follows a normal distribution, it is generally challenging to control in other cases.
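As a concrete illustration of the pooling step, here is a minimal sketch. It assumes, following the permutation description above, that $\widetilde{P}_X$ is the mixture $\sum_k (n_k/n) P_X^{(k)}$ and that the component densities are available; the function names are ours.

```python
import numpy as np

def pool_and_weight(groups, q_density, p_densities, rng):
    """Data pooling sketch: permute the concatenated calibration data and
    form the likelihood ratio w_bar = dQ_X / dP~_X.

    groups:      list of (X_k, Y_k) calibration arrays, one per observed group
    q_density:   density of Q_X
    p_densities: densities of P_X^(1), ..., P_X^(K)
    """
    X = np.concatenate([Xk for Xk, _ in groups])
    Y = np.concatenate([Yk for _, Yk in groups])
    n = len(X)
    perm = rng.permutation(n)                  # the data permutation step
    X_pool, Y_pool = X[perm], Y[perm]
    props = np.array([len(Xk) for Xk, _ in groups]) / n
    # assumed: P~_X is the mixture sum_k (n_k / n) P_X^(k)
    w_bar = lambda x: q_density(x) / sum(
        p * pk(x) for p, pk in zip(props, p_densities))
    return X_pool, Y_pool, w_bar
```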
Alternatively, imposing a two-layer data generating mechanism can remove the coverage gap posed by the total variation distance. Specifically, one can assume the dataset $\{(g, X_{g,i}, Y_{g,i}) : g \in [K], i \in [n_g]\}$ consists of i.i.d. random variables generated from the following process:

$$\begin{cases} g \sim \mathrm{Multinomial}(q_1, \ldots, q_K), \\ (X_{g,i}, Y_{g,i}) \mid (g = k) \sim P_X^{(k)} \times P_{Y|X}. \end{cases}$$

Consequently, $\{(X_{k,i}, Y_{k,i}) : k \in [K], i \in [n_k]\}$ can be viewed as a realization of i.i.d. random variables distributed as $(\sum_{1\le k\le K} q_k P_X^{(k)}) \times P_{Y|X}$ by ignoring the group information. Without estimating the mixture weights, one can work with the marginal mixture distribution $\sum_{1\le k\le K} q_k P_X^{(k)}$ in the WCP framework. Besides, one can consider designing a weighted population that depends on $\{P_X^{(k)} : k \in [K]\}$ and overlaps well with $Q_X$, as mentioned in Lei and Candès (2021). However, with the available datasets $\{\mathcal{D}_{cal}^{(k)} : k \in [K]\}$, sampling enough i.i.d. data from this population may be challenging, especially when the sample sizes $\{n_k : k \in [K]\}$ are imbalanced.
Note that we present Theorem 2 and Theorem 3 with known likelihood ratios. In practice, however, likelihood ratios are often unknown and need to be estimated. In Appendix C, we provide modified versions of Theorem 2 and Theorem 3 with estimated likelihood ratios, using Theorem 3 in Lei and Candès (2021). Additionally, in Appendix E.3, we discuss how we estimate the likelihood ratios in our numerical experiments.
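For readers who want a concrete starting point, below is one standard density-ratio estimator based on a probabilistic classifier; this is our illustration and not necessarily the exact estimator used in Appendix E.3. The final rescaling matches the normalization $\mathbb{E}_{X\sim P_X}[\widehat{w}(X)] = 1$ assumed in the corollaries.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_likelihood_ratio(X_train, X_test_cov):
    """Estimate w = dQ_X/dP_X by classifying test vs. training covariates."""
    X = np.vstack([X_train, X_test_cov])
    z = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test_cov))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)

    def w_raw(x):
        p = clf.predict_proba(np.atleast_2d(x))[:, 1]
        return (len(X_train) / len(X_test_cov)) * p / (1 - p)

    c = w_raw(X_train).mean()          # empirical normalization over P_X
    return lambda x: w_raw(x) / c
```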
5 Numerical experiments
In this section, simulations are conducted to demonstrate the finite sample performance of the proposed approaches: WCP based on the selective Bonferroni procedure, denoted WCP-SB, and WCP based on the data pooling technique, denoted WCP-P. These methods are compared with a naive alternative, denoted WCP-SS, which selects the shortest WCP interval among the WCP intervals obtained from each single group. Numerical performance is evaluated by four measurements: the marginal coverage probability (MCP), the probability of obtaining informative prediction intervals (IP), the informative coverage probability (ICP), and the average length of finite prediction intervals (AIL). In Section 5.1, we begin with an example of two groups with a one-dimensional covariate and known likelihood ratios. In Section 5.2, more complex numerical examples with a higher covariate dimension, more groups, and unknown likelihood ratios are conducted. For simplicity, we consider homoscedastic errors and use the absolute residual as our score function. To address heteroscedastic errors, one can use methods such as conformalized quantile regression (Romano et al., 2019).

Figure 2: Visualization of data and the pre-trained model with $\sigma^2 = 4$. Left panel: observed Group 1 and Group 2 with the fitted GP model; right panel: test Group 0 with $E(Y|X)$.
where $\mathrm{sigmoid}(x) = \exp(x)/(1 + \exp(x))$. Two settings, $\sigma^2 \in \{1, 4\}$, are considered to demonstrate different scenarios of overlap between the observed groups, with $n_1 = n_2 = 100$. As shown in Figure 2, where $\sigma^2 = 4$, each observed group, either Group 1 in blue or Group 2 in red (left panel), has only partial overlap with the new test group (right panel) and
therefore does not have sufficient information to provide uncertainty quantification for the test Group 0.

Figure 3: Prediction bands by WCP based on Group 1 alone (left) and Group 2 alone (right) with $\sigma^2 = 4$. Each panel shows $E(Y|X)$, the oracle 90% CI, and the informative and uninformative parts of the prediction band.

Figure 4: WCP bands based on the selective Bonferroni procedure and on data pooling with $\sigma^2 = 4$.

Figure 3 demonstrates that WCP based on a single group can lead to uninformative
prediction intervals, i.e., $(-\infty, \infty)$. The prediction bands of the proposed methods, WCP-SB and WCP-P, are provided in Figure 4; both methods reduce the chance of getting an infinite prediction interval, and the constructed intervals are reasonably close to the oracle 90% confidence intervals.
Table 1 summarizes the simulation results based on 5000 replications. WCP-SB is relatively more conservative, with the lowest IP and the largest AIL. WCP-P has the largest IP when $\sigma^2 = 1$, demonstrating its effectiveness in providing informative prediction intervals when groups have relatively separated covariates. For the naive approach, WCP-SS, the MCP is close to the target coverage when the covariates of the two groups are relatively well separated ($\sigma^2 = 1$); the coverage gap becomes more significant when the covariates of the two groups have better overlap ($\sigma^2 = 4$). Moreover, we also conduct experiments involving multiple groups with $K \sim \mathrm{Uniform}(\{3, \ldots, 10\})$. For each group, we set the size of the calibration data and the mean and variance of $P_X^{(k)}$ randomly. See the appendices for the detailed implementation of the experiments and additional results.
where $\sigma_k^2 > 0$, $\rho_k \in [0, 1]$, $\mathbf{1}_d$ is the all-one vector, $k = 1, \ldots, K$, and $I_d$ is the $d$-dimensional identity matrix. We consider the conditional distribution where $X_{k,i,1}$ and $X_{k,i,2}$ denote the first and second coordinates of $X_{k,i}$, respectively. Differently from Section 5.1, where we changed the variance of the covariate distributions, we specify different covariate shifts by varying the correlation $\rho_k$. Two scenarios are considered: weakly correlated, with $\rho_k \in [0, 0.2]$, and strongly correlated, with $\rho_k \in [0.7, 0.9]$. We also implement two different initial settings for WCP-SB: $K_{init} = 1$ and $K_{init} = \min\{K, 3\}$.
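For concreteness, here is a sketch of how such covariates could be generated; the equicorrelation form of $\Sigma_k$ below is our assumption, suggested by the symbols $\mathbf{1}_d$ and $I_d$ named above, and is not guaranteed to match the paper's exact covariance specification.

```python
import numpy as np

def sample_group_covariates(n_k, d, sigma_k, rho_k, rng):
    """Draw n_k covariates from N(0, Sigma_k) with an assumed
    equicorrelation structure:
    Sigma_k = sigma_k^2 * ((1 - rho_k) * I_d + rho_k * 1_d 1_d^T)."""
    one = np.ones((d, 1))
    Sigma = sigma_k**2 * ((1 - rho_k) * np.eye(d) + rho_k * (one @ one.T))
    return rng.multivariate_normal(np.zeros(d), Sigma, size=n_k)
```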
The simulation results based on 5000 replications are summarized in Table 2 for the case $d = 10$. The observations are consistent with those in Section 5.1. With a stronger correlation, $\rho_k \in [0.7, 0.9]$, the data pooling method WCP-P demonstrates the best performance, with MCP and ICP close to 0.9, the largest IP, and the smallest AIL. Additionally, under strong covariate correlation and higher covariate dimension, WCP-SB with $K_{init} = \min\{K, 3\}$ yields a smaller IP than utilizing only one group's information. In Appendix E.3, we provide more simulation details and tables summarizing the simulation results for $d = 5, 20, 50$.
6 Discussion
In this paper, we demonstrate that constructed WCP intervals can be uninformative. The event linked with obtaining an informative prediction interval is explicitly formulated. When multiple sources are available, two approaches are introduced to enhance the informativeness of WCP, and theoretical results are developed for both: WCP based on the selective Bonferroni procedure and WCP based on data pooling. The selective Bonferroni procedure produces relatively conservative prediction intervals; additionally, when the dimension of the covariates increases, IP can decrease with a larger $K_{init}$. The data pooling method, on the other hand, generally outperforms the other alternatives in our numerical experiments. Note that the lower bound in Theorem 3 is relatively conservative; an interesting direction for future work is to explore a sharper lower bound for the data pooling method. Additionally, we leave extensions to distribution shift and conformal risk control in scenarios involving multiple groups for future work.
References
Angelopoulos, A. N., Bates, S., et al. (2023). Conformal prediction: A gentle introduction. Foundations
and Trends® in Machine Learning, 16(4):494–591.
Benjamini, Y. (2010). Simultaneous and selective inference: Current successes and future challenges.
Biometrical Journal, 52(6):708–721.
Berrett, T. B., Wang, Y., Barber, R. F., and Samworth, R. J. (2020). The conditional permutation
test for independence while controlling for confounders. Journal of the Royal Statistical Society
Series B: Statistical Methodology, 82(1):175–197.
Bhattacharyya, A. and Barber, R. F. (2024). Group-weighted conformal prediction. arXiv preprint
arXiv:2401.17452.
Candès, E., Lei, L., and Ren, Z. (2023). Conformalized survival analysis. Journal of the Royal Statistical
Society Series B: Statistical Methodology, 85(1):24–45.
Dunn, R., Wasserman, L., and Ramdas, A. (2023). Distribution-free prediction sets for two-layer
hierarchical models. Journal of the American Statistical Association, 118(544):2491–2502.
Fannjiang, C., Bates, S., Angelopoulos, A. N., Listgarten, J., and Jordan, M. I. (2022). Conformal
prediction under feedback covariate shift for biomolecular design. Proceedings of the National
Academy of Sciences, 119(43):e2204569119.
Gasparin, M. and Ramdas, A. (2024). Merging uncertainty sets via majority vote. arXiv preprint
arXiv:2401.09379.
Lei, J., Robins, J., and Wasserman, L. (2013). Distribution-free prediction sets. Journal of the
American Statistical Association, 108(501):278–287.
Lei, J. and Wasserman, L. (2015). Distribution-free prediction bands for nonparametric regression.
Quality control and applied statistics, 60(1):109–110.
Lei, L. and Candès, E. J. (2021). Conformal inference of counterfactuals and individual treatment
effects. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(5):911–938.
Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. (2002). Inductive confidence machines
for regression. In Machine learning: ECML 2002: 13th European conference on machine learning
Helsinki, Finland, August 19–23, 2002 proceedings 13, pages 345–356. Springer.
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. (2022). Dataset shift in machine learning. MIT Press.
Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer school on machine
learning, pages 63–71. Springer.
Romano, Y., Patterson, E., and Candes, E. (2019). Conformalized quantile regression. Advances in
neural information processing systems, 32.
Taylor, J. and Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings of the
National Academy of Sciences, 112(25):7629–7634.
Taylor, J. E. (2018). A selective survey of selective inference. In Proceedings of the International
Congress of Mathematicians: Rio de Janeiro 2018, pages 3019–3038. World Scientific.
Tibshirani, R. J., Foygel Barber, R., Candes, E., and Ramdas, A. (2019). Conformal prediction under
covariate shift. Advances in neural information processing systems, 32.
Vovk, V., Gammerman, A., and Saunders, C. (1999). Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning.
Vovk, V., Gammerman, A., and Shafer, G. (2005). Algorithmic learning in a random world, volume 29.
Springer.
Appendix
Table of Contents
A Mondrianize weighted conformal prediction
D Proofs
  D.1 Theorem 1
  D.2 Remark 2
  D.3 Theorem 2
  D.4 Theorem 3
  D.5 Corollary 1 & 2
E Simulation details
  E.1 Informative WCP intervals
  E.2 Covariate shift: d = 1 with known likelihood ratios
  E.3 Covariate shift: higher dimension with unknown likelihood ratios
$\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for $i \ne j$.

Let $Q_X^{(k)}$ denote the conditional distribution of $X_0$ on the region $\mathcal{X}_k$, where $X_0 \sim Q_X$, and assume that $Q_X^{(k)}$ is absolutely continuous with respect to $P_X^{(k)}$. Let $\bar{w}_k = dQ_X^{(k)}/dP_X^{(k)}$ be the conditional likelihood ratio and set $q_k = \mathbb{P}(X_0 \in \mathcal{X}_k)$.
With the above notations in place, we are now ready to utilize $\{\bar{w}_k : k \in [K]\}$ and $\{\mathcal{D}_{cal}^{(k)} : k \in [K]\}$ to construct a prediction band at $x$. For $k \in [K]$, define

$$C^{(k)}(x; 1-\alpha, \mathcal{D}_{cal}^{(k)}) = \left\{ y : s(x,y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n_k} p_i^{\bar{w}_k}(x; \mathcal{D}_{cal}^{(k)})\delta_{s(X_{k,i},Y_{k,i})} + p_0^{\bar{w}_k}(x; \mathcal{D}_{cal}^{(k)})\delta_{\infty}\right) \right\},$$
where $\mathcal{D}_{cal}^{all} = \cup_{k\in[K]} \mathcal{D}_{cal}^{(k)}$. We point out that when there is no data in the regions outside $\cup_{k\in[K]} \mathcal{X}_k$, there is no way to quantify uncertainty at an $X_0$ lying in those regions without making distributional assumptions about $P_{Y|X}$. Therefore, we use the uninformative interval $(-\infty, \infty)$ when $X_0 \notin \cup_{k\in[K]} \mathcal{X}_k$; that is,

$$\widehat{C}_n(x; 1-\alpha, \mathcal{D}_{cal}^{all}) = \begin{cases} C^{(k)}(x; 1-\alpha, \mathcal{D}_{cal}^{(k)}) & \text{if } x \in \mathcal{X}_k, \\ (-\infty, \infty) & \text{if } x \notin \cup_{k\in[K]} \mathcal{X}_k. \end{cases} \tag{16}$$

With prediction interval (16) at hand, we can show that
$$\begin{aligned}
\mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D}_{cal}^{all})\right\}
&= \sum_{k=1}^{K} \mathbb{P}(X_0 \in \mathcal{X}_k) \cdot \mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D}_{cal}^{all}) \mid X_0 \in \mathcal{X}_k\right\} \\
&\quad + \mathbb{P}\left(X_0 \notin \cup_{k\in[K]}\mathcal{X}_k\right) \cdot \mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D}_{cal}^{all}) \mid X_0 \notin \cup_{k\in[K]}\mathcal{X}_k\right\} \\
&\overset{(i)}{=} \sum_{k=1}^{K} q_k \cdot \mathbb{P}\left\{Y_0 \in C^{(k)}(X_0; 1-\alpha, \mathcal{D}_{cal}^{(k)}) \mid X_0 \in \mathcal{X}_k\right\} \\
&\quad + \left(1 - \sum_{k=1}^{K} q_k\right) \cdot \mathbb{P}\left\{Y_0 \in (-\infty,\infty) \mid X_0 \notin \cup_{k\in[K]}\mathcal{X}_k\right\} \\
&= \sum_{k=1}^{K} q_k \cdot \mathbb{P}\left\{Y_0 \in C^{(k)}(X_0; 1-\alpha, \mathcal{D}_{cal}^{(k)}) \mid X_0 \in \mathcal{X}_k\right\} + 1 - \sum_{k=1}^{K} q_k \\
&\overset{(ii)}{\ge} 1 - \alpha + \alpha \cdot \mathbb{P}\left(X_0 \notin \cup_{k\in[K]}\mathcal{X}_k\right).
\end{aligned}$$

Equality (i) follows from the construction of $\widehat{C}_n(X_0; 1-\alpha, \mathcal{D}_{cal}^{all})$ in (16), and an application of weighted conformal prediction for each $k \in [K]$ yields inequality (ii).
where $\delta_{s(X_{k,i},Y_{k,i})}$ denotes the point mass at $s(X_{k,i}, Y_{k,i})$. Then, the level $1-\alpha$ prediction band at $x$ can be calculated by

$$\widehat{C}_n(x) = \{y : s(x,y) \le \widehat{q}\} \quad \text{where} \quad \widehat{q} = \mathrm{Quantile}\left(1-\alpha;\ \widehat{P}_{\mathrm{score}}\right), \tag{18}$$

where $\widehat{P}_{\mathrm{score}} = \sum_{k=1}^{K} q_k \widehat{P}_{\mathrm{score}}^{(k)}$. In the following proposition, we provide a modified version of Theorem 4.1 in Bhattacharyya and Barber (2024).
Proposition 2. Suppose $\{(X_{k,i}, Y_{k,i})\}_{k\in[K], i\in[n_k]}$ are distributed as in (1). Assume assumption (17) holds and let $(X_0, Y_0)$ be drawn independently from the distribution in (2). Then, the prediction interval defined in (18) satisfies

$$\mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0)\right\} \ge 1 - \alpha - \max_k \frac{q_k}{n_k}. \tag{19}$$
Some observations are in order regarding Proposition 2. Assumption (17) requires that the covariate distribution $Q_X$ can be represented as a mixture of the covariate distributions of the observed groups. This assumption may limit the practicality of applying Proposition 2. One limitation arises from the fact that the test covariate distribution typically differs from the mixture of the covariate distributions of the observed groups. Furthermore, in cases where there is substantial covariate overlap between observed groups, identification issues associated with such a mixture structure may arise. To demonstrate, the idea of GWCP potentially applies only to test group 3 and is not applicable to test groups 1 and 2.
Lemma 1 (Theorem 3 in Lei and Candès (2021)). Assume the same set of assumptions as in Proposition 1 holds. Let $\widehat{w}$ be the estimated likelihood ratio, which is independent of $\mathcal{D}$ and satisfies $\mathbb{E}_{X\sim P_X}\widehat{w}(X) = 1$. Then

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}_n(X_0; 1-\alpha, \mathcal{D})\right\} \ge 1 - \alpha - \frac{1}{2}\mathbb{E}_{X\sim P_X}|w(X) - \widehat{w}(X)|,$$

where

$$\widetilde{C}_n(x; 1-\alpha, \mathcal{D}) = \left\{ y : s(x,y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n} p_i^{\widehat{w}}(x; \mathcal{D})\delta_{s(X_i,Y_i)} + p_0^{\widehat{w}}(x; \mathcal{D})\delta_{\infty}\right) \right\}.$$
For the selective Bonferroni procedure and the data pooling method, we can define

$$\widetilde{C}^B(x; 1-\alpha, \mathcal{G}) = \left\{ y : \sum_{k\in\mathcal{G}} \mathbb{1}\left\{ y \in \widetilde{C}^{(k)}(x; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)}) \right\} = |\mathcal{G}| \right\},$$

$$\widetilde{C}^P(x; 1-\alpha, \mathcal{D}^{pool}) = \left\{ y \in \mathbb{R} : s(x,y) \le \mathrm{Quantile}\left(1-\alpha;\ \sum_{i=1}^{n} p_i^{\widetilde{w}}(x; \mathcal{D}^{pool})\delta_{s(\widetilde{X}_i,\widetilde{Y}_i)} + p_0^{\widetilde{w}}(x; \mathcal{D}^{pool})\delta_{\infty}\right) \right\},$$

where the estimated likelihood ratio $\widehat{w}_k$ is obtained by using $\mathcal{D}_{tr}^{(k)}$ and $\mathcal{D}_{tr}^{(0)}$, while $\widetilde{w}$ is obtained using $\mathcal{D}_{tr}^{all}$ and $\mathcal{D}_{tr}^{(0)}$. With these notations in place, we provide modified versions of Theorem 2 and Theorem 3 with estimated likelihood ratios.
Corollary 1. Let $\mathcal{G}$ be the set of groups selected by Algorithm 1 with inputs $K_{init}$, training data $\mathcal{D}_{tr}^{all}$, and estimated likelihood ratios $\{\widehat{w}_k : k \in [K]\}$. Assuming the estimated likelihood ratios satisfy $\mathbb{E}_{X\sim P_X^{(k)}}[\widehat{w}_k(X) \mid \mathcal{D}_{tr}^{(k)}, \mathcal{D}_{tr}^{(0)}] = 1$, we have

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}^B(X_0; 1-\alpha, \mathcal{G})\right\} \ge 1 - \alpha - \frac{1}{2}\mathbb{E}\sum_{k\in\mathcal{G}} \mathrm{Err}_k, \tag{20}$$

where $\mathrm{Err}_k = \mathbb{E}_{X\sim P_X^{(k)}}[|w_k(X) - \widehat{w}_k(X)| \mid \mathcal{D}_{tr}^{all}]$. Moreover, the corresponding informative coverage probability satisfies

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}^B(X_0; 1-\alpha, \mathcal{G}) \mid \widetilde{E}^B\right\} \ge 1 - \frac{\alpha + \frac{1}{2}\mathbb{E}\sum_{k\in\mathcal{G}} \mathrm{Err}_k}{\mathbb{P}(\widetilde{E}^B)}, \tag{21}$$

where

$$\widetilde{E}^B = \left\{\text{there exists a group } k \in \mathcal{G} \text{ such that } p_0^{\widehat{w}_k}(X_0; \mathcal{D}_{cal}^{(k)}) \le \alpha/|\mathcal{G}|\right\}.$$
Corollary 2. Let $\widetilde{w}$ be the estimated likelihood ratio for $dQ_X/d\widetilde{P}_X$, obtained by using $\mathcal{D}_{tr}^{all}$ and $\mathcal{D}_{tr}^{(0)}$ and satisfying $\mathbb{E}_{X\sim\widetilde{P}_X}[\widetilde{w}(X) \mid \mathcal{D}_{tr}^{all}, \mathcal{D}_{tr}^{(0)}] = 1$. Then, under the same set of assumptions as in Theorem 3,

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool})\right\} \ge 1 - \alpha - d_{TV}(\widetilde{\mathbf{X}}, \mathbf{X}') - \frac{1}{2}\mathbb{E}_{X\sim\widetilde{P}_X}|\bar{w}(X) - \widetilde{w}(X)|. \tag{22}$$
D Proofs
D.1 Theorem 1
Proof. Combining equation (4) and equation (5) yields equation (6). Equation (7) follows from equation (5) and Proposition 1 in Lei and Candès (2021), which states that when $\mathbb{E}_{X\sim P_X}[w(X)^{1+\delta}] < \infty$ for some $\delta \ge 1$, there exists a universal constant $C_1(\delta)$ depending on $\delta$ such that

$$\mathbb{P}\left\{Y_0 \in \widehat{C}_n(X_0; 1-\alpha, \mathcal{D})\right\} \le 1 - \alpha + \frac{C_1(\delta)}{n^{\delta/(1+\delta)}}.$$
D.2 Remark 2
Proof.

$$\begin{aligned}
\mathbb{P}(E) &= \mathbb{P}\left\{ w(X_0) + \frac{n\alpha}{1-\alpha}\left(1 - \frac{1}{n}\sum_{i=1}^{n} w(X_i)\right) \le \frac{n\alpha}{1-\alpha} \right\} \\
&\ge \mathbb{P}\left\{ w(X_0) \le \frac{n\alpha}{2(1-\alpha)} \right\} \cdot \mathbb{P}\left\{ \frac{1}{n}\sum_{i=1}^{n} w(X_i) \ge \frac{1}{2} \right\} \qquad (24) \\
&\ge \mathbb{E}_{X\sim P_X}\left[ w(X)\,\mathbb{1}\left\{w(X) \le \frac{n\alpha}{2-2\alpha}\right\} \right] \cdot \mathbb{P}\left\{ \left|\frac{1}{n}\sum_{i=1}^{n} w(X_i) - 1\right| \le \frac{1}{2} \right\}
\end{aligned}$$
The first inequality follows from the independence of $X_0$ and $\mathcal{D}$, while the second inequality follows from

$$\mathbb{P}\left\{w(X_0) \le \frac{n\alpha}{2(1-\alpha)}\right\} = \mathbb{E}\,\mathbb{1}\left\{w(X_0) \le \frac{n\alpha}{2(1-\alpha)}\right\} = \mathbb{E}_{X\sim P_X}\left[w(X)\,\mathbb{1}\left\{w(X) \le \frac{n\alpha}{2(1-\alpha)}\right\}\right].$$

By the monotone convergence theorem, we have

$$\lim_{n\to\infty} \mathbb{E}_{X\sim P_X}\left[w(X)\,\mathbb{1}\left\{w(X) \le \frac{n\alpha}{2-2\alpha}\right\}\right] = \mathbb{E}_{X\sim P_X}\left[\lim_{n\to\infty} w(X)\,\mathbb{1}\left\{w(X) \le \frac{n\alpha}{2-2\alpha}\right\}\right] = \mathbb{E}_{X\sim P_X}[w(X)] = 1.$$

Note that $\{w(X_i) : i \in [n]\}$ are i.i.d. random variables, so we can use a concentration inequality to control $\mathbb{P}\left\{\left|\frac{1}{n}\sum_{i=1}^{n} w(X_i) - 1\right| < \frac{1}{2}\right\}$. According to the proof of Theorem 4 in Lei and Candès (2021), we conclude that there exists a constant $C_2(\delta)$ such that

$$\mathbb{P}\left\{\left|\frac{1}{n}\sum_{i=1}^{n} w(X_i) - 1\right| \ge \frac{1}{2}\right\} \le \frac{C_2(\delta)}{n^{\delta'}}.$$
D.3 Theorem 2
Proof. Given $\mathcal{D}_{tr}^{all} = \mathcal{D}_{tr}^{(0)} \cup (\cup_{k\in[K]}\mathcal{D}_{tr}^{(k)})$ and $\mathcal{D}_{cal}^{all} = \cup_{k\in[K]}\mathcal{D}_{cal}^{(k)}$ at hand, we have

$$\begin{aligned}
\mathbb{P}\left\{Y_0 \notin \widehat{C}^B(X_0; 1-\alpha, \mathcal{G})\right\} &= \mathbb{E}\,\mathbb{1}\left\{Y_0 \notin \widehat{C}^B(X_0; 1-\alpha, \mathcal{G})\right\} \\
&\overset{(i)}{\le} \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{1}\left\{Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})\right\}\right\} \\
&= \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{E}\left[\mathbb{1}\left\{Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})\right\} \,\middle|\, \mathcal{D}_{tr}^{all}\right]\right\} \\
&= \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{P}\left[Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)}) \,\middle|\, \mathcal{D}_{tr}^{all}\right]\right\} \overset{(ii)}{\le} \mathbb{E}\left\{|\mathcal{G}| \cdot \frac{\alpha}{|\mathcal{G}|}\right\} = \alpha.
\end{aligned}$$

Inequality (i) follows from the construction of $\widehat{C}^B(x; 1-\alpha, \mathcal{G})$: if $Y_0 \notin \widehat{C}^B(X_0; 1-\alpha, \mathcal{G})$, then $Y_0 \notin \widehat{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})$ for at least one $k \in \mathcal{G}$. Inequality (ii) makes use of the independence between $\mathcal{D}_{tr}^{all}$ and $\mathcal{D}_{cal}^{all}$, as well as the fact that

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)}) \,\middle|\, \mathcal{D}_{tr}^{all}\right\} \ge 1 - \alpha/|\mathcal{G}|.$$
D.4 Theorem 3
Proof. To begin with, we define some notations. Given datasets $\mathcal{D}^{pool}$ and $\mathcal{D}' = \{(X_i', Y_i') : i \in [n]\}$, where $(X_i', Y_i') \overset{i.i.d.}{\sim} \widetilde{P}_X \times P_{Y|X}$, define the response vectors

$$\widetilde{\mathbf{Y}} = (\widetilde{Y}_1, \ldots, \widetilde{Y}_n) \quad \text{and} \quad \mathbf{Y}' = (Y_1', \ldots, Y_n').$$

We use $\Pi$ to denote the set of permutations on $\{1, \ldots, n\}$; there exists a permutation $\pi \in \Pi$ such that $\mathcal{D}^{pool} = \pi(\mathcal{D})$, i.e., for $i \in [n]$, $(\widetilde{X}_i, \widetilde{Y}_i) = (X_{\pi(i)}, Y_{\pi(i)})$. Now we start our proof by applying Proposition 1, which yields

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^P(X_0; 1-\alpha, \mathcal{D}')\right\} \ge 1 - \alpha. \tag{25}$$
Inequality (i) follows from the definition of total variation distance and the independence between $(X_0, Y_0)$ and $\mathcal{D}', \mathcal{D}^{pool}$. To show the validity of equation (26), we use equation (10) in Berrett et al. (2020), according to which it suffices to show that

$$(\mathbf{Y}' \mid \mathbf{X}' = \mathbf{x}) \overset{d}{=} (\widetilde{\mathbf{Y}} \mid \widetilde{\mathbf{X}} = \mathbf{x}) \quad \text{for any } \mathbf{x}^\top \in \mathbb{R}^{nd}. \tag{27}$$

For simplicity, we prove equation (27) for the case when $P_{Y|X=x}$ is a discrete distribution for all $x \in \mathbb{R}^d$, and define $h(y|x) = \mathbb{P}_{U\sim P_{Y|X=x}}(U = y)$. Let $\mathbf{x} = (x_1^\top, \ldots, x_n^\top)$ and $\mathbf{y} = (y_1, \ldots, y_n)$, where $x_i \in \mathbb{R}^d$.
Equality (ii) follows from $\pi \sim \mathrm{Uniform}(\Pi)$. By the independence between observations in $\mathcal{D}$ and $\mathcal{D}'$, we have

$$\mathbb{P}\left(\forall i \in [n] : Y_{\pi(i)} = y_i,\ X_{\pi(i)} = x_i\right) = \mathbb{P}\left(\forall i \in [n] : X_{\pi(i)} = x_i\right) \cdot \prod_{i\in[n]} h(y_i|x_i) \tag{29}$$

$$\text{and} \quad \mathbb{P}\left(\mathbf{Y}' = \mathbf{y} \mid \mathbf{X}' = \mathbf{x}\right) = \prod_{i\in[n]} h(y_i|x_i), \tag{30}$$

which proves equation (27) and establishes equation (26). Subsequently, by taking the expectation over $(X_0, Y_0)$ and combining equation (25), we derive

$$\mathbb{P}\left\{Y_0 \in \widehat{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool})\right\} \ge 1 - \alpha - d_{TV}\left(\widetilde{\mathbf{X}}, \mathbf{X}'\right).$$
D.5 Corollary 1 & 2
Note that given $\mathcal{D}_{tr}^{(k)}$ and $\mathcal{D}_{tr}^{(0)}$, $\widehat{w}_k$ can be viewed as known. The same applies to $\widetilde{w}$ given $\mathcal{D}_{tr}^{all}$ and $\mathcal{D}_{tr}^{(0)}$.

Proof of Corollary 1.

$$\begin{aligned}
\mathbb{P}\left\{Y_0 \notin \widetilde{C}^B(X_0; 1-\alpha, \mathcal{G})\right\} &= \mathbb{E}\,\mathbb{1}\left\{Y_0 \notin \widetilde{C}^B(X_0; 1-\alpha, \mathcal{G})\right\} \\
&\le \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{1}\left\{Y_0 \notin \widetilde{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})\right\}\right\} \\
&= \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{E}\left[\mathbb{1}\left\{Y_0 \notin \widetilde{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)})\right\} \,\middle|\, \mathcal{D}_{tr}^{all}\right]\right\} \\
&= \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \mathbb{P}\left[Y_0 \notin \widetilde{C}^{(k)}(X_0; 1-\alpha/|\mathcal{G}|, \mathcal{D}_{cal}^{(k)}) \,\middle|\, \mathcal{D}_{tr}^{all}\right]\right\} \\
&\overset{(i)}{\le} \mathbb{E}\left\{\sum_{k\in\mathcal{G}} \left(\frac{\alpha}{|\mathcal{G}|} + \frac{1}{2}\mathrm{Err}_k\right)\right\} \le \alpha + \frac{1}{2}\mathbb{E}\sum_{k\in\mathcal{G}} \mathrm{Err}_k.
\end{aligned}$$
Proof of Corollary 2. We adopt the same set of notations as in Theorem 3. Following the proof of Theorem 3, we have

$$\begin{aligned}
\mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool}) \,\middle|\, \mathcal{D}_{tr}^{(0)}, \mathcal{D}_{tr}^{all}, X_0, Y_0\right\}
&\overset{(i)}{\ge} \mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}') \,\middle|\, \mathcal{D}_{tr}^{(0)}, \mathcal{D}_{tr}^{all}, X_0, Y_0\right\} - d_{TV}\left((\widetilde{\mathbf{X}}, \widetilde{\mathbf{Y}}), (\mathbf{X}', \mathbf{Y}')\right) \\
&= \mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}') \,\middle|\, \mathcal{D}_{tr}^{(0)}, \mathcal{D}_{tr}^{all}, X_0, Y_0\right\} - d_{TV}\left(\widetilde{\mathbf{X}}, \mathbf{X}'\right).
\end{aligned}$$

Therefore, we have

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool})\right\} \ge \mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}')\right\} - d_{TV}\left(\widetilde{\mathbf{X}}, \mathbf{X}'\right). \tag{31}$$

By Lemma 1, $\mathbb{P}\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}')\} \ge 1 - \alpha - \frac{1}{2}\mathbb{E}_{X\sim\widetilde{P}_X}|\bar{w}(X) - \widetilde{w}(X)|$, and combining this with (31) yields

$$\mathbb{P}\left\{Y_0 \in \widetilde{C}^P(X_0; 1-\alpha, \mathcal{D}^{pool})\right\} \ge 1 - \alpha - d_{TV}\left(\widetilde{\mathbf{X}}, \mathbf{X}'\right) - \frac{1}{2}\mathbb{E}_{X\sim\widetilde{P}_X}|\bar{w}(X) - \widetilde{w}(X)|,$$

which completes the proof of equation (22). Subsequently, we prove equation (23) by noting
E Simulation details
E.1 Informative WCP intervals
Details of Figure 1

We use the absolute residual as our score function, i.e., $s(f(x), y) = |f(x) - y|$. Here, the function $f$ is obtained by (see the sketch after this list):

• data: a set of i.i.d. pre-training data points $\{(X_i^{pre}, Y_i^{pre})\}$ of size 100 with

$$X_i^{pre} \overset{i.i.d.}{\sim} \mathrm{Uniform}(-20, 20) \quad \text{and} \quad Y_i^{pre} \mid X_i^{pre} \sim N(\mathrm{sigmoid}(X_i^{pre}), 0.01)$$

• model: a Gaussian process model (Rasmussen, 2003) with a radial basis function (RBF) kernel, implemented using the python package GPy
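A minimal sketch of this pre-training step (the variable names are ours; GPy is the package named above):

```python
import numpy as np
import GPy

rng = np.random.default_rng(0)

# pre-training data as described above: X ~ Uniform(-20, 20),
# Y | X ~ N(sigmoid(X), 0.01), i.e., standard deviation 0.1
X_pre = rng.uniform(-20, 20, size=(100, 1))
sigmoid = lambda x: np.exp(x) / (1 + np.exp(x))
Y_pre = rng.normal(sigmoid(X_pre), 0.1)

# Gaussian process regression with an RBF kernel via GPy
kernel = GPy.kern.RBF(input_dim=1)
model = GPy.models.GPRegression(X_pre, Y_pre, kernel)
model.optimize()

# posterior mean used as the pre-trained regression function f
f = lambda x: model.predict(np.atleast_2d(x))[0]
```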
Figure 6: Marginal coverage of WCP intervals (left), probability of getting finite WCP intervals (middle), and informative coverage of WCP intervals (right) for $n = 10, 50, 100$, plotted against $\sigma$, the standard deviation of the covariate distribution.
Figure 7: Visualization of the two pre-trained models together with $E(Y|X)$. Model 1 is trained using data with less noise, and Model 2 is trained using noisier data.
The effect of the pre-trained model

We also notice that the informative coverage probability is influenced by the quality of the pre-trained model. To explore this, we consider a Gaussian process model trained on noisier data. The training dataset $\{(X_i^{pre}, Y_i^{pre})\}$ now comprises 100 observations with the following distribution:

$$X_i^{pre} \overset{i.i.d.}{\sim} \mathrm{Uniform}(-20, 20) \quad \text{and} \quad Y_i^{pre} \mid X_i^{pre} \sim N(\mathrm{sigmoid}(X_i^{pre}), 1).$$

We visualize these pre-trained models in Figure 7. Furthermore, we generate figures similar to Figure 1 and Figure 6 and present them in Figure 8. Notably, the informative coverage plot in Figure 6 and the corresponding varying-variance plot in Figure 8 exhibit a significant difference. This discrepancy arises from using models of different accuracy: the magnitude of the scores is inflated by using a less accurate model, leading to increased lengths of the prediction intervals. To demonstrate, we conduct a simulation comparing the lengths of prediction intervals in the varying variance case and present the results in Figure 9.
Figure 8: Same setup as Figure 1 and Figure 6 with noisier training data. Top row: marginal coverage, probability of finite WCP intervals, and informative coverage against $\mu$, the mean of the covariate distribution. Bottom row: the same quantities against $\sigma$, the standard deviation of the covariate distribution.
From Figure 9, we can observe that the average lengths of finite prediction intervals obtained using the more accurate model are smaller. This indicates that the increase in informative coverage probability in Figure 8 for the varying variance case is due to the inflation of interval lengths caused by the less accurate model.
Figure 9: Average length of finite WCP intervals for $n = 10, 50, 100$ against $\sigma$, the standard deviation of the covariate distribution. Left: using Model 1. Right: using Model 2.
E.2 Covariate shift: d = 1 with known likelihood ratios

Figure 10: Visualization of observed groups with $\sigma^2 = 1$.
Figure 11: Prediction bands by WCP on the observed groups with $\sigma^2 = 1$.
Figure 12: First row: prediction bands by WCP based on the selective Bonferroni procedure and WCP based on data pooling with $\sigma^2 = 1$. Second row: prediction bands obtained by selecting the shorter WCP interval among the two groups, with $\sigma^2 = 1$ and $\sigma^2 = 4$.
E.3 Covariate shift: higher dimension with unknown likelihood ratios

We consider covariate dimension $d \in \{5, 10, 20, 50\}$.

• $\sigma_k = \mathrm{Sd}(P_X^{(k)}) \sim \mathrm{Uniform}(0.8, 1)$ for $k \in [K]$
• $\rho_k \sim \mathrm{Uniform}(0, 0.2)$ or $\rho_k \sim \mathrm{Uniform}(0.7, 0.9)$ for $k \in [K]$

Subsequently, we generate $\mathcal{D}_{cal}^{(k)}$ and $\mathcal{D}_{tr}^{(k)}$ for $k \in [K]$, and $\mathcal{D}_{tr}^{(0)}$ with $n_0 = \sum_{k\in[K]} n_k$. Lastly, we sample $(X_0, Y_0) \sim Q_X \times P_{Y|X}$ as a test data point, for which we compute the weighted prediction interval and evaluate the coverage probability and the length of the prediction interval.

Tables

Tables 2, 4, 5, and 6 are obtained by running the experiment with 5000 replications.
Table 5: Method comparison with d = 20 and K ≥ 2

                                  ρk ~ Uniform(0, 0.2)        ρk ~ Uniform(0.7, 0.9)
Method                            MCP    IP     ICP    AIL    MCP    IP     ICP    AIL
WCP-SB (Kinit = 1)                0.919  0.961  0.916  0.500  0.979  0.391  0.945  0.565
WCP-SB (Kinit = min{K, 3})        0.969  0.965  0.968  0.617  0.979  0.390  0.945  0.565
WCP-P                             0.905  0.999  0.905  0.471  0.920  0.912  0.912  0.479
WCP-SS                            0.881  0.999  0.881  0.444  0.956  0.693  0.937  0.548
Covariate vector with weak correlation. Note that selecting the shortest WCP interval among those based on each single group achieves the highest IP. However, this method fails to provide a valid coverage probability: both MCP and ICP fall below the target level of 0.9. On the other hand, WCP based on the selective Bonferroni procedure with $K_{init} = 1$ performs similarly to WCP based on data pooling in this setup, though the data pooling method exhibits a slightly larger IP and shorter AIL. WCP based on the selective Bonferroni procedure with $K_{init} = \min\{K, 3\}$ is more conservative, which improves MCP, IP, and ICP at the cost of inflating the lengths of the informative prediction intervals. It is important to note that when $d = 50$, WCP-SB with $K_{init} = \min\{K, 3\}$ even has a smaller IP.

Covariate vector with strong correlation. When the covariate vector is strongly correlated, methods other than WCP based on data pooling show a significant decrease in IP as the dimension $d$ increases. Meanwhile, WCP based on data pooling maintains MCP and ICP close to the target level, while also achieving the highest IP and shortest AIL. Note that when the dimension is high, WCP-SB with $K_{init} = \min\{K, 3\}$ has nearly the same performance as WCP-SB with $K_{init} = 1$, indicating that predominantly only one group is selected even with the specified $K_{init} > 1$.