Biometrics - 2020 - Williamson - Nonparametric variable importance assessment using machine learning techniques

This document discusses a nonparametric variable importance measure that can be applied across various regression techniques, allowing for valid statistical inference and a population-level interpretation. The proposed measure is a generalization of the analysis of variance (ANOVA) variable importance and is designed to assess the contribution of features in predicting outcomes without being tied to a specific estimation method. The authors provide a framework for constructing efficient estimators and demonstrate the measure's application through simulations and a real data analysis of cardiovascular disease risk factors.


Received: 21 April 2018 Revised: 20 March 2019 Accepted: 22 March 2019

DOI: 10.1111/biom.13392

BIOMETRIC METHODOLOGY

Nonparametric variable importance assessment using machine learning techniques

Brian D. Williamson (1), Peter B. Gilbert (1,2), Marco Carone (1,2), Noah Simon (1)

1 Department of Biostatistics, University of Washington, Seattle, Washington, USA
2 Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA

Correspondence
Brian D. Williamson, Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.
Email: [email protected]

Funding information
National Institute of Allergy and Infectious Diseases, Grant/Award Numbers: UM1AI068635, F31AI140836, DP5OD019820, R01AI029168

Abstract
In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often only resort to one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often suboptimal for predicting the response. Additionally, because the variable importance measures native to different regression techniques generally have a different interpretation, comparisons across techniques can be difficult. In this work, we study a variable importance measure that can be used with any regression technique, and whose interpretation is agnostic to the technique used. This measure is a property of the true data-generating mechanism. Specifically, we discuss a generalization of the analysis of variance variable importance measure and discuss how it facilitates the use of machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. The importance of each feature or group of features in the data can then be described individually, using this measure. We describe how to construct an efficient estimator of this measure as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of risk factors for cardiovascular disease in South Africa.

KEYWORDS
machine learning, nonparametric 𝑅², statistical inference, targeted learning, variable importance

1 INTRODUCTION

Suppose that we have independent observations 𝑍1, … , 𝑍𝑛 drawn from an unknown distribution 𝑃0, known only to lie in a potentially rich class of distributions ℳ. We refer to ℳ as our model. Further, suppose that each observation 𝑍𝑖 consists of (𝑋𝑖, 𝑌𝑖), where 𝑋𝑖 ∶= (𝑋𝑖1, … , 𝑋𝑖𝑝) ∈ ℝᵖ is a covariate vector and 𝑌𝑖 ∈ ℝ is the outcome of interest. It is often of interest to understand the association between 𝑌 and 𝑋 under 𝑃0. To do this, we generally consider the conditional mean function μ0 ∶= μ𝑃0, where for each 𝑃 ∈ ℳ we define

μ𝑃(𝑥) ∶= 𝐸𝑃(𝑌 ∣ 𝑋 = 𝑥). (1)

© 2020 The International Biometric Society

Biometrics. 2021;77:9–22. wileyonlinelibrary.com/journal/biom


15410420, 2021, 1, Downloaded from https://onlinelibrary.wiley.com/doi/10.1111/biom.13392 by South University Of Science, Wiley Online Library on [19/03/2025]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
10 WILLIAMSON et al.

Estimation of μ0 is the canonical “predictive modeling” problem. There are many tools for estimating μ0: classical parametric techniques (eg, linear regression), and more flexible nonparametric or semiparametric methods, including random forests (Breiman, 2001), generalized additive models (Hastie and Tibshirani, 1990), loess smoothing (Cleveland, 1979), and artificial neural networks (Barron, 1989), among many others. Once a good estimate of μ0 is obtained, it is often of scientific interest to identify the features that contribute most to the variation in μ0. For any given set 𝑠 ⊆ {1, … , 𝑝} and distribution 𝑃 ∈ ℳ, we may define the reduced conditional mean

μ𝑃,𝑠(𝑥) ∶= 𝐸𝑃(𝑌 ∣ 𝑋−𝑠 = 𝑥−𝑠), (2)

where for any vector 𝑣 and set 𝑟 of indices the symbol 𝑣−𝑟 denotes the vector of all components of 𝑣 with index not in 𝑟. Here, the set 𝑠 can represent a single element or a group of elements. The importance of the elements in 𝑠 can be evaluated by comparing μ0 and μ0,𝑠 ∶= μ𝑃0,𝑠. This strategy will be leveraged in this paper.

The analysis of variance (ANOVA) decomposition is the main classical tool for evaluating variable importance. There, μ0 is assumed to have a simple parametric form. While this facilitates the task at hand considerably, the conclusions drawn can be misleading in view of the high risk of model misspecification. For this reason, it is increasingly common to use nonparametric or machine learning-based regression methods to estimate μ0; in such cases, classical ANOVA results do not necessarily apply.

Recent work on evaluating variable importance without relying on overly strong modeling assumptions can generally be categorized as being either (i) intimately tied to a specific estimation technique for the conditional mean function or (ii) agnostic to the estimation technique used. The former category includes, for example, variable importance measures for random forests (Breiman, 2001; Ishwaran, 2007; Strobl et al., 2007; Grömping, 2009) and neural networks (see, eg, Olden et al., 2004), and ANOVA in linear models. Among these, ANOVA alone appears to readily allow valid statistical inference. Additionally, it is generally not possible to directly compare the importance assessments stemming from different methods: they usually measure different quantities and thus have different interpretations. The latter category includes, for example, nonparametric extensions of 𝑅² for kernel-based estimators, local polynomial regression, and functional regression (Doksum and Samarov, 1995; Yao et al., 2005; Huang and Chen, 2008); the marginalized mean difference, 𝐸𝑃0{𝐸𝑃0(𝑌 ∣ 𝑋 = 𝑥1, 𝑊) − 𝐸𝑃0(𝑌 ∣ 𝑋 = 𝑥0, 𝑊)} (van der Laan, 2006; Chambaz et al., 2012; Sapp et al., 2014), where 𝑥1 and 𝑥0 are two meaningful reference levels of 𝑋, and 𝑊 represents adjustment variables; and the mean difference in absolute deviations, 𝐸𝑃0{|𝑌 − μ0(𝑋)| − |𝑌 − μ0,𝑠(𝑋)|} (Lei et al., 2017). Methods in this latter category allow valid inference and have broad potential applicability. The appropriate measure to use depends on the scientific context.

We are interested in studying a variable importance measure that (i) is entirely agnostic to the estimation technique, (ii) allows valid inference, and (iii) provides a population-level interpretation that is well suited to scientific applications. In this work, we study a variable importance measure that satisfies each of these criteria, adding to the class of technique-agnostic measures referenced above. In particular, we consider the ANOVA-based variable importance measure

𝜓0,𝑠 ∶= ∫ {μ0(𝑥) − μ0,𝑠(𝑥)}² 𝑑𝑃0(𝑥) / 𝑣𝑎𝑟𝑃0(𝑌). (3)

For a vector 𝑣 and a subset 𝑟 of indices, we denote by 𝑣𝑟 the vector of all components of 𝑣 with index in 𝑟. Then, we may interpret (3) as the additional proportion of variability in the outcome explained by including 𝑋𝑠 in the conditional mean. This follows from the fact that we can express 𝜓0,𝑠 as

[1 − 𝐸𝑃0{𝑌 − μ0(𝑋)}² / 𝑣𝑎𝑟𝑃0(𝑌)] − [1 − 𝐸𝑃0{𝑌 − μ0,𝑠(𝑋)}² / 𝑣𝑎𝑟𝑃0(𝑌)],

the difference in the population 𝑅² obtained using the full set of covariates as compared to the reduced set of covariates only. Thus, the parameter we focus on is a simple generalization of the classical 𝑅² measure of importance to a nonparametric model and is useful in any setting in which the mean squared error is a scientifically relevant population measure of predictiveness. This parameter is a function of 𝑃0 alone, in that it describes a property of the true data-generating mechanism and not of any particular estimation method. In this work, we provide a framework for building a nonparametric efficient estimator of 𝜓0,𝑠 that permits valid statistical inference.

We emphasize that the purpose of the variable importance measure we study here is not to offer insight into the characteristics of any particular algorithm, but rather to describe the importance of variables in predicting the outcome in the population. This is in contrast to common algorithm-specific measures of variable importance. If a tool for interpreting black-box algorithms is desired, other approaches to variable importance may be preferred, as referenced above.
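The agreement between the ANOVA representation (3) and the difference-in-population-𝑅² representation can be checked numerically whenever the true conditional means are known. The sketch below is our own illustration, not from the paper: the simulated Gaussian mechanism and all variable names are assumptions chosen so that μ0 and μ0,𝑠 are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + 2 * x2 + rng.normal(size=n)

# True conditional means, known by construction (X1 and X2 independent):
mu_full = x1 + 2 * x2   # mu_0(x) = E(Y | X1, X2)
mu_red = x1             # mu_{0,s}(x) = E(Y | X1) for s = {2}

var_y = np.var(y)

# ANOVA representation (3): mean of {mu_0 - mu_{0,s}}^2, over var(Y)
psi_anova = np.mean((mu_full - mu_red) ** 2) / var_y

# Difference-in-R^2 representation: R^2(full) - R^2(reduced)
r2_full = 1 - np.mean((y - mu_full) ** 2) / var_y
r2_red = 1 - np.mean((y - mu_red) ** 2) / var_y
psi_r2 = r2_full - r2_red

# Analytically, psi_{0,{2}} = 4 var(X2) / var(Y) = 4/6 here.
```

Both quantities approach the analytic value 2/3 as the Monte Carlo sample grows; the two representations differ only by an empirical cross term that vanishes in expectation.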

Care must be taken in building point and interval estimators for 𝜓0,𝑠 when μ0 and μ0,𝑠 are not known to belong to simple parametric families. In particular, when μ0 and μ0,𝑠 are estimated using flexible methods, simply plugging estimates of these regression functions into (3) will not yield a regular and asymptotically linear, let alone efficient, estimator of 𝜓0,𝑠. In this paper, we propose a simple method that, given sufficiently accurate estimators of μ0 and μ0,𝑠, yields an efficient point estimator for 𝜓0,𝑠 and a confidence interval with asymptotically correct coverage. We show that this method—based on ideas from semiparametric theory—is equivalent to simply plugging estimates of μ0 and μ0,𝑠 into the difference in 𝑅² values. In Williamson et al. (2020), we generalize this phenomenon and provide results for plug-in estimators of a large class of variable importance measures.

We note that, while variable importance is related to variable selection, these paradigms may have distinct goals. In variable selection, it is typically of interest to create the best predictive model based on the current data, and this model may include only a subset of the available variables. There are many contributions to both technique-specific (see, eg, Breiman, 2001; Friedman, 2001; Loh, 2002) and nonparametric (see, eg, Doksum et al., 2008) selection. The goal in variable importance is to assess the extent to which (subsets of) features contribute to improving the population-level predictive power of the best possible outcome predictor based on all available features. Of course, variable importance can be used as part of the process of variable selection. To highlight the distinction between importance and selection, it may be useful to consider a scenario in which two perfectly correlated covariates 𝑋1 and 𝑋2 are available. Neither covariate has importance relative to the other, but the variables may be highly important as a pair. A variable importance procedure considering individual and grouped features would identify this, whereas a variable selection procedure would likely choose only one of 𝑋1 or 𝑋2 for use in prediction.

This paper is organized as follows. We present some properties of the parameter we consider and give our proposed estimator in Section 2. In Section 3, we provide empirical evidence that our proposed estimator outperforms both the naive plug-in ANOVA-based estimator and an ordinary least squares-based estimator in settings where the covariate vector is low- or moderate-dimensional and the data-generating mechanism is nonlinear. In Section 4, we apply our method to data from a retrospective study of heart disease in South African men. We provide concluding remarks in Section 5. Technical details and an illustration based on the landmark Boston housing data are provided in the Supporting Information.

2 VARIABLE IMPORTANCE IN A NONPARAMETRIC MODEL

2.1 Parameter of interest

We work in a nonparametric model ℳ with the only restriction that, under each distribution 𝑃 in ℳ, the distribution of 𝑌 given 𝑋 = 𝑥 must have a finite second moment for 𝑃-almost every 𝑥. For given 𝑠 ⊆ {1, … , 𝑝} and 𝑃 ∈ ℳ, based on the conditional means (1) and (2), we define the statistical functional

Ψ𝑠(𝑃) ∶= ∫ {μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)}² 𝑑𝑃(𝑥) / 𝑣𝑎𝑟𝑃(𝑌) (4)
= [1 − 𝐸𝑃{𝑌 − μ𝑃(𝑋)}² / 𝑣𝑎𝑟𝑃(𝑌)] − [1 − 𝐸𝑃{𝑌 − μ𝑃,𝑠(𝑋)}² / 𝑣𝑎𝑟𝑃(𝑌)]. (5)

This is the nonparametric measure of variable importance we focus on. The value of Ψ𝑠(𝑃) measures the importance of the variables in the set {𝑋𝑗}𝑗∈𝑠 relative to the entire covariate vector for predicting the outcome 𝑌 under the data-generating mechanism 𝑃. Using observations 𝑍1, … , 𝑍𝑛 independently drawn from the true, unknown joint distribution 𝑃0 ∈ ℳ, we aim to make efficient inference about the true value 𝜓0,𝑠 = Ψ𝑠(𝑃0).

This parameter is a nonparametric extension of the usual ANOVA-derived measure of variable importance in parametric models. We first note that 𝜓0,𝑠 ∈ [0, 1]. Furthermore, 𝜓0,𝑠 = 0 if and only if 𝑌 is conditionally uncorrelated with every transformation of 𝑋𝑠 given 𝑋−𝑠. In addition, the value of 𝜓0,𝑠 is invariant to linear transformations of the outcome and to a large class of transformations of the feature vector, as detailed in the Supporting Information. As such, common data normalization steps may be performed without impact on 𝜓0,𝑠. Finally, 𝜓0,𝑠 can be seen as a ratio of the extra sum of squares, averaged over the joint feature distribution, to the total sum of squares. The value of 𝜓0,𝑠 is thus precisely the improvement in predictive performance, in terms of standardized mean squared error, that can be expected if we build a model using all of 𝑋 versus only 𝑋−𝑠. If we assume simple linear regression models for μ0 and μ0,𝑠, then 𝜓0,𝑠 is precisely the usual difference in 𝑅² between nested models.

We want to reiterate here that, in contrast to simple parametric approaches to variable importance, our functional Ψ𝑠 simply maps any candidate data-generating mechanism to a nonnegative number. This definition does not require a parametric specification of μ0 or μ0,𝑠. While this is usual for non- or semiparametric inference problems, it is different from classical approaches to variable importance.
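The claimed invariance of 𝜓0,𝑠 to linear transformations of the outcome can be seen directly from (4): replacing 𝑌 by 𝑎 + 𝑏𝑌 multiplies both the numerator and 𝑣𝑎𝑟𝑃(𝑌) by 𝑏². A quick numerical check of this property (our own sketch; the simulated mechanism, the known conditional means, and the constants 𝑎, 𝑏 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = x1 + 2 * x2 + rng.normal(size=n)

mu_full = x1 + 2 * x2   # E(Y | X1, X2), known by construction
mu_red = x1             # E(Y | X1), for s = {2}

def psi(y, mu_full, mu_red):
    # Difference-in-R^2 representation of Psi_s(P), evaluated empirically
    return (np.mean((y - mu_red) ** 2) - np.mean((y - mu_full) ** 2)) / np.var(y)

a, b = 10.0, -3.0
psi_orig = psi(y, mu_full, mu_red)
# Under Y' = a + bY the conditional means become a + b*mu, and psi is unchanged
psi_trans = psi(a + b * y, a + b * mu_full, a + b * mu_red)
```

Both the numerator and the variance are scaled by 𝑏² = 9 here, so the two evaluations agree up to floating-point error.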

For building an efficient estimator of 𝜓0,𝑠, it is critical to consider the differentiability of Ψ𝑠 as a functional. Specifically, we have that (4) is pathwise differentiable with respect to the unrestricted model (see, eg, Bickel et al., 1998). Pathwise differentiable functionals generally admit a convenient functional Taylor expansion that can be used to characterize the asymptotic behavior of plug-in estimators. An analysis of the pathwise derivative allows us to determine the efficient influence function (EIF) of the functional relative to the statistical model (Bickel et al., 1998). The EIF plays a key role in establishing efficiency bounds for regular and asymptotically linear estimators of the true parameter value, and most importantly, in the construction of efficient estimators, as we will highlight below. For convenience, we will denote the numerator of Ψ𝑠(𝑃) by Θ𝑠(𝑃) ∶= ∫ {μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)}² 𝑑𝑃(𝑥). The EIFs of Θ𝑠 and of Ψ𝑠 relative to the nonparametric model ℳ are provided in the following lemma.

Lemma 1. The parameters Θ𝑠 and Ψ𝑠 are pathwise differentiable at each 𝑃 ∈ ℳ relative to ℳ, with EIFs 𝜑𝑃,𝑠 and 𝜑*𝑃,𝑠 relative to ℳ, respectively, given by

𝜑𝑃,𝑠 ∶ 𝑧 ↦ 2{𝑦 − μ𝑃(𝑥)}{μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)} + {μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)}² − Θ𝑠(𝑃),

𝜑*𝑃,𝑠 ∶ 𝑧 ↦ [2{𝑦 − μ𝑃(𝑥)}{μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)} + {μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)}²] / 𝑣𝑎𝑟𝑃(𝑌) − Θ𝑠(𝑃) [{𝑦 − 𝐸𝑃(𝑌)} / 𝑣𝑎𝑟𝑃(𝑌)]².

A linearization of the evaluation of Θ𝑠 at 𝑃 ∈ ℳ around 𝑃0 can be expressed as

Θ𝑠(𝑃) = Θ𝑠(𝑃0) + ∫ 𝜑𝑃,𝑠(𝑧) 𝑑(𝑃 − 𝑃0)(𝑧) + 𝑅𝑠(𝑃, 𝑃0), (6)

where 𝑅𝑠(𝑃, 𝑃0) is a remainder term from this first-order expansion around 𝑃0. The explicit form of 𝑅𝑠(𝑃, 𝑃0) is provided in Section 2.3 and can be used to algebraically verify this representation. For any given estimator 𝑃̂ ∈ ℳ of 𝑃0, we can write

Θ𝑠(𝑃̂) − Θ𝑠(𝑃0) = ∫ 𝜑𝑃̂,𝑠(𝑧) 𝑑(𝑃̂ − 𝑃0)(𝑧) + 𝑅𝑠(𝑃̂, 𝑃0)
= ∫ 𝜑𝑃̂,𝑠(𝑧) 𝑑(ℙ𝑛 − 𝑃0)(𝑧) + 𝑅𝑠(𝑃̂, 𝑃0) − (1/𝑛) ∑𝑖 𝜑𝑃̂,𝑠(𝑍𝑖)
= (1/𝑛) ∑𝑖 𝜑𝑃0,𝑠(𝑍𝑖) + 𝐻𝑠,𝑛(𝑃̂, 𝑃0) + 𝑅𝑠(𝑃̂, 𝑃0) − (1/𝑛) ∑𝑖 𝜑𝑃̂,𝑠(𝑍𝑖), (7)

where ℙ𝑛 is the empirical distribution based on 𝑍1, … , 𝑍𝑛, 𝐻𝑠,𝑛(𝑃, 𝑃0) ∶= ∫ {𝜑𝑃,𝑠(𝑧) − 𝜑𝑃0,𝑠(𝑧)} 𝑑(ℙ𝑛 − 𝑃0)(𝑧) is an empirical process term, and we have made repeated use of the fact that 𝜑𝑃,𝑠(𝑍) has mean zero under 𝑃 for any 𝑃 ∈ ℳ. This representation is critical for characterizing the behavior of the plug-in estimator Θ𝑠(𝑃̂). The four terms on the right-hand side in (7) can be studied separately. The first term is an empirical average of mean-zero transformations of 𝑍1, … , 𝑍𝑛. The second term is an empirical process term, and the third term is a remainder term. Both of these second-order terms can be shown to be asymptotically negligible under certain conditions on 𝑃̂. The fourth term can be thought of as the bias incurred from flexibly estimating the conditional means (1) and (2) and will generally tend to zero slowly. This bias term motivates our choice of estimator for 𝜓0,𝑠 in Section 2.2. We will employ one particular bias correction method, and the large-sample properties of our proposed estimator will be determined by the first term in (7).

2.2 Estimation procedure

Writing the numerator Θ𝑠 of the parameter of interest as a statistical functional suggests a natural estimation procedure. If we have estimators μ̂ and μ̂𝑠 of μ0 and μ0,𝑠, respectively—obtained through any method that we choose, including machine learning techniques—a natural plug-in estimator of 𝜃0,𝑠 ∶= Θ𝑠(𝑃0) is given by

𝜃̂naive,s ∶= ∫ {μ̂(𝑥) − μ̂𝑠(𝑥)}² 𝑑ℙ𝑛(𝑥) = (1/𝑛) ∑𝑖 {μ̂(𝑋𝑖) − μ̂𝑠(𝑋𝑖)}². (8)

In turn, this suggests using, with 𝑌̄𝑛 denoting the empirical mean of 𝑌1, … , 𝑌𝑛,

𝜓̂naive,s ∶= 𝜃̂naive,s / 𝑣𝑎𝑟ℙ𝑛(𝑌) = [(1/𝑛) ∑𝑖 {μ̂(𝑋𝑖) − μ̂𝑠(𝑋𝑖)}²] / [(1/𝑛) ∑𝑖 (𝑌𝑖 − 𝑌̄𝑛)²]

as a simple estimator of 𝜓0,𝑠. We refer to this as the naive estimator. This simple estimator involves hidden trade-offs. On the one hand, it is easy to construct given estimators μ̂ and μ̂𝑠. On the other hand, it does not generally enjoy good inferential properties. If a flexible technique is used to estimate μ0 and μ0,𝑠, constructing μ̂ and μ̂𝑠 usually entails selecting tuning parameter values to achieve an optimal bias-variance trade-off for μ0 and μ0,𝑠, respectively. This is generally not the optimal bias-variance trade-off for estimating the parameter of interest 𝜓0,𝑠, a key fact from

non- and semiparametric theory. The estimator 𝜓̂naive,s has a variance decreasing at a parametric rate, with little sensitivity to the tuning of μ̂ and μ̂𝑠, because of the involved marginalization over the feature distribution. However, it inherits much of the bias from μ̂ and μ̂𝑠. Some form of debiasing is thus needed, as we discuss below. In particular, the estimator 𝜓̂naive,s is generally overly biased, in the sense that its bias does not tend to zero sufficiently fast to allow consistency at rate 𝑛^{−1/2}, let alone efficiency. This is problematic, in particular, because it renders the construction of valid confidence intervals difficult, if not impossible.

Instead, we consider the simple one-step correction estimator

𝜃̂𝑛,𝑠 ∶= 𝜃̂naive,s + (1/𝑛) ∑𝑖 𝜑𝑃̂,𝑠(𝑍𝑖)

of 𝜃0,𝑠, which, in view of (7), is asymptotically efficient under certain regularity conditions. This estimator is obtained by correcting for the excess bias of the naive plug-in estimator 𝜃̂naive,s using the empirical average of the estimated EIF as a first-order approximation of (minus) this bias (see, eg, Pfanzagl, 1982). We note that to compute 𝜃̂𝑛,𝑠 it is not necessary to obtain an estimator 𝑃̂ of the entire distribution 𝑃0. Instead, estimators μ̂ and μ̂𝑠 of μ0 and μ0,𝑠 suffice. The variance of 𝑌 under 𝑃0 may simply be estimated using the empirical variance. It is easy to verify that the resulting estimator of 𝜓0,𝑠 simplifies to

𝜓̂𝑛,𝑠 ∶= 𝜃̂𝑛,𝑠 / 𝑣𝑎𝑟ℙ𝑛(𝑌) = 𝜓̂naive,s + [∑𝑖 2{𝑌𝑖 − μ̂(𝑋𝑖)}{μ̂(𝑋𝑖) − μ̂𝑠(𝑋𝑖)}] / [∑𝑖 (𝑌𝑖 − 𝑌̄𝑛)²]. (9)

This estimator adjusts for the inadequate bias-variance trade-off performed when flexible estimators μ̂ and μ̂𝑠 are tuned to be good estimators of μ0 and μ0,𝑠 rather than being tuned for the end objective of estimating 𝜓0,𝑠. Simple algebraic manipulations yield that 𝜓̂𝑛,𝑠 is equivalent to the plug-in estimator

[1 − (1/𝑛) ∑𝑖 {𝑌𝑖 − μ̂(𝑋𝑖)}² / 𝑣𝑎𝑟ℙ𝑛(𝑌)] − [1 − (1/𝑛) ∑𝑖 {𝑌𝑖 − μ̂𝑠(𝑋𝑖)}² / 𝑣𝑎𝑟ℙ𝑛(𝑌)] (10)

obtained by viewing 𝜓0,𝑠 as a difference in population 𝑅² values, as in (5). As indicated above, semiparametric theory indicates that plug-in estimators based on flexible regression algorithms typically require bias correction if the latter are not tuned towards the target of inference, as in (9). Interestingly, as we note from above, this is needed when constructing a plug-in estimator based on the ANOVA representation (4) of 𝜓0,𝑠 but not based on its difference-in-𝑅² representation (5).

While we are not constrained to any particular estimation method to construct μ̂ and μ̂𝑠, we have found one particular strategy to work well in practice. Using any specific predictive modeling technique to regress the outcome 𝑌 on the full covariate vector 𝑋 and then on the reduced covariate vector 𝑋−𝑠 does not take into account that the two conditional means are related and will generally result in incompatible estimates. Specifically, we have that μ0,𝑠(𝑥) = 𝐸𝑃0{μ0(𝑋) ∣ 𝑋−𝑠 = 𝑥−𝑠}, which we can take advantage of to produce the following sequential regression estimating procedure: (i) regress 𝑌 on 𝑋 to obtain an estimate μ̂ of μ0, and then (ii) regress μ̂(𝑋) on 𝑋−𝑠 to obtain an estimate μ̂𝑠 of μ0,𝑠.

The final estimation procedure we recommend for 𝜓0,𝑠 consists of estimator 𝜓̂𝑛,𝑠, where the conditional means involved are estimated using flexible regression estimators based on the sequential regression approach; see Algorithm 1 for explicit details.

Algorithm 1 Estimating 𝜓0,𝑠
1: Choose a technique to estimate the conditional means μ0 and μ0,𝑠, eg, ensemble learning with various predictive modeling algorithms (Wolpert, 1992);
2: μ̂ ← Regress 𝑌 on 𝑋 using the technique from step (1) to estimate μ0;
3: μ̂𝑠 ← Regress μ̂(𝑋) on 𝑋−𝑠 using the technique from step (1) to estimate μ0,𝑠;
4: 𝜓̂𝑛,𝑠 ← [(1/𝑛) ∑𝑖 {𝑌𝑖 − μ̂𝑠(𝑋𝑖)}² − (1/𝑛) ∑𝑖 {𝑌𝑖 − μ̂(𝑋𝑖)}²] / [(1/𝑛) ∑𝑖 (𝑌𝑖 − 𝑌̄𝑛)²], as in Equation (10).

This may also be embedded in a split-sample validation scheme, by first creating training and validation sets, then obtaining μ̂ and μ̂𝑠 on the training set as outlined above, and finally, obtaining an estimator of 𝜓0,𝑠 by using the validation data along with predictions from the conditional mean estimators on the validation data. This can be extended to a cross-fitted procedure given in Algorithm 2 and discussed in the Supporting Information.

2.3 Asymptotic behavior of the proposed estimator

By studying the remainder term 𝑅𝑠(𝑃̂, 𝑃0) and the empirical process term 𝐻𝑠,𝑛(𝑃̂, 𝑃0), we can establish appropriate conditions on μ̂ and μ̂𝑠 under which the proposed estimator 𝜓̂𝑛,𝑠 is asymptotically efficient. This allows us to determine the asymptotic distribution of the proposed estimator and, therefore, to propose procedures for performing valid

inference on 𝜓0,𝑠. Below, we will make reference to the following conditions, in which we have defined the conditional outcome variance 𝜏0² ∶ 𝑥 ↦ 𝑣𝑎𝑟𝑃0(𝑌 ∣ 𝑋 = 𝑥).

(A1) max[∫ {μ̂(𝑥) − μ0(𝑥)}² 𝑑𝑃0(𝑥), ∫ {μ̂𝑠(𝑥) − μ0,𝑠(𝑥)}² 𝑑𝑃0(𝑥)] = 𝑜𝑃(𝑛^{−1/2});
(A2) there exists a 𝑃0-Donsker class ℱ0 such that 𝑃0(𝜑𝑃̂,𝑠 ∈ ℱ0) ⟶ 1;
(A3) there exists a constant 𝐾 > 0 such that each of μ0, μ̂, μ̂𝑠, and 𝜏0² has range contained uniformly in (−𝐾, +𝐾) with probability tending to one as sample size tends to +∞.

First, it is straightforward to verify that linearization (6) holds with second-order remainder term 𝑅𝑠(𝑃, 𝑃0) = ∫ {μ𝑃,𝑠(𝑥) − μ0,𝑠(𝑥)}² 𝑑𝑃0(𝑥) − ∫ {μ𝑃(𝑥) − μ0(𝑥)}² 𝑑𝑃0(𝑥). It follows then that condition (A1) suffices to ensure that 𝑅𝑠(𝑃̂, 𝑃0) is asymptotically negligible, that is, that 𝑅𝑠(𝑃̂, 𝑃0) = 𝑜𝑃(𝑛^{−1/2}). Each of the second-order terms in condition (A1) can feasibly be made negligible, even while using flexible regression techniques, including generalized additive models (Hastie and Tibshirani, 1990), to estimate the conditional mean functions. We thus turn our attention to 𝐻𝑠,𝑛(𝑃̂, 𝑃0). By empirical process theory, we have that 𝐻𝑠,𝑛(𝑃̂, 𝑃0) = 𝑜𝑃(𝑛^{−1/2}) provided, for example, ∫ {𝜑𝑃̂,𝑠(𝑧) − 𝜑𝑃0,𝑠(𝑧)}² 𝑑𝑃0(𝑧) tends to zero in probability and condition (A2) holds (Lemma 19.24 of van der Vaart, 2000). For the former, uniform consistency of μ̂ and μ̂𝑠 under 𝐿2(𝑃0) suffices under condition (A3). We note that if there is a known bound on the outcome support, condition (A3) will readily be satisfied provided the learning algorithms used incorporate this knowledge. For the latter, the set of possible realizations of μ̂ and μ̂𝑠 must become sufficiently restricted with probability tending to one as sample size grows. This condition is satisfied if, for example, the uniform sectional variation norm of 𝜑𝑃̂,𝑠 is bounded with probability tending to one (Gill et al., 1995). When using flexible machine learning-based regression estimators, there may be reason for concern regarding the validity of condition (A2). In such cases, using the cross-fitted estimator 𝜓̂cv𝑛,𝑠 may circumvent this condition. While this cross-fitted estimator is only slightly more complex, we restrict attention here to studying the simpler estimator and leave study of the cross-fitted estimator to the Supporting Information.

Algorithm 2 Estimating 𝜓0,𝑠 using 𝑉-fold cross fitting
1: Choose a technique to estimate the conditional means μ0 and μ0,𝑠;
2: Generate a random vector 𝐵𝑛 ∈ {1, … , 𝑉}ⁿ by sampling uniformly from {1, … , 𝑉} with replacement, and denote by 𝐷𝑗 the subset of observations with index in {𝑖 ∶ 𝐵𝑛,𝑖 = 𝑗} for 𝑗 = 1, … , 𝑉.
3: for 𝑣 = 1, … , 𝑉 do
4: μ̂𝑣 ← Regress 𝑌 on 𝑋 using the data in ∪𝑗≠𝑣𝐷𝑗 using the technique from step (1) to estimate μ0;
5: μ̂𝑠,𝑣 ← Regress μ̂𝑣(𝑋) on 𝑋−𝑠 using the data in ∪𝑗≠𝑣𝐷𝑗 to estimate μ0,𝑠;
6: 𝜓̂𝑛,𝑠,𝑣 ← [∑𝑖∈𝐷𝑣 {𝑌𝑖 − μ̂𝑠,𝑣(𝑋𝑖)}² − ∑𝑖∈𝐷𝑣 {𝑌𝑖 − μ̂𝑣(𝑋𝑖)}²] / ∑𝑖∈𝐷𝑣 (𝑌𝑖 − 𝑌̄𝑛)², as in Equation (10);
7: end for
8: 𝜓̂cv𝑛,𝑠 ← (1/𝑉) ∑𝑣 𝜓̂𝑛,𝑠,𝑣.

The following theorem describes the asymptotic behavior of the proposed estimator.

Theorem 1. Provided conditions (A1)–(A3) hold, 𝜓̂𝑛,𝑠 is asymptotically linear with influence function 𝜑*𝑃0,𝑠. In particular, this implies that (a) 𝜓̂𝑛,𝑠 tends to 𝜓0,𝑠 in probability, and, if 𝜓0,𝑠 ∈ (0, 1), (b) 𝑛^{1/2}(𝜓̂𝑛,𝑠 − 𝜓0,𝑠) tends in distribution to a mean-zero Gaussian random variable with variance 𝜎0,𝑠² ∶= ∫ 𝜑*𝑃0,𝑠(𝑧)² 𝑑𝑃0(𝑧).

A natural plug-in estimator of 𝜎0,𝑠² is given by 𝜎̂𝑛,𝑠² ∶= (1/𝑛) ∑𝑖 𝜑̂*𝑃0,𝑠(𝑍𝑖)², where 𝜑̂*𝑃0,𝑠 is any consistent estimator of 𝜑*𝑃0,𝑠. For example, 𝜑̂*𝑃0,𝑠 may be taken to be 𝜑*𝑃0,𝑠 with μ0, μ0,𝑠, 𝐸𝑃0(𝑌), 𝑣𝑎𝑟𝑃0(𝑌), and 𝜃0,𝑠 replaced by μ̂, μ̂𝑠, 𝑌̄𝑛, 𝑣𝑎𝑟ℙ𝑛(𝑌), and 𝜃̂𝑛,𝑠, respectively. In view of the asymptotic normality of 𝑛^{1/2}(𝜓̂𝑛,𝑠 − 𝜓0,𝑠), an asymptotically valid (1 − 𝛼) × 100% Wald-type confidence interval for 𝜓0,𝑠 ∈ (0, 1) can be obtained as 𝜓̂𝑛,𝑠 ± 𝑞1−𝛼/2 𝜎̂𝑛,𝑠 𝑛^{−1/2}, where 𝑞𝛽 is the 𝛽-quantile of the standard normal distribution.

To underscore the importance of using the proposed debiased procedure, we recall that, in contrast to 𝜓̂𝑛,𝑠, the naive ANOVA-based estimator is generally not asymptotically linear when flexible (eg, machine learning based) estimators of the involved regressions are used. It will usually be overly biased, resulting in a rate of convergence slower than 𝑛^{−1/2}. Constructing valid confidence intervals based on the naive estimator can thus be difficult. It may be tempting to consider bootstrap resampling as a remedy. However, this is not advisable since, besides the computational burden of such an approach, there is little theory to justify using the standard nonparametric bootstrap in this context, particularly for the naive ANOVA-based estimator (Shao, 1994).

2.4 Behavior under the zero-importance null hypothesis

This work focuses on efficient estimation of a population-level algorithm-agnostic variable importance measure using flexible estimation techniques and on describing how valid inference may be drawn when the set 𝑠 of features under evaluation does not have degenerate
importance. Specifically, we have restricted our attention to cases in which 𝜓0,𝑠 ∈ (0, 1) strictly and provided confidence intervals valid in such cases. It may be of interest, however, to test the null hypothesis 𝜓0,𝑠 = 0 of zero importance. Developing valid and powerful tests of this particular null hypothesis is difficult. Because the null hypothesis is on the boundary of the parameter space, 𝜑𝑃0,𝑠 is identically zero under this null, and a higher order expansion may be required to construct and characterize the behavior of an appropriately-regularized estimator of 𝜃0,𝑠—and thus of 𝜓0,𝑠—with good power. However, the parameters Θ𝑠 and Ψ𝑠 are generally not second-order pathwise differentiable in nonparametric models, and so higher order expansions cannot easily be constructed. There may be hope in using approximate second-order gradients, as outlined in Carone et al. (2018), though this remains an open problem. A crude alternative solution based on sample splitting is investigated in Williamson et al. (2020). To highlight the difficulties that arise under this particular null hypothesis, we conducted a simulation study for a setting in which one of the variables has zero importance. The results from this study are provided in the next section.

3 EXPERIMENTS ON SIMULATED DATA

We now present empirical results describing the performance of the proposed estimator (9) compared to that of the naive plug-in estimator (8). In all implementations, we use the sequential regression estimating procedure described in Algorithm 1 for each feature or group of interest to compute compatible estimates of the required regression functions, and we compute nominal 95% Wald-type confidence intervals as outlined in Section 2.3.

3.1 Low-dimensional vector of features

We consider here data generated according to the following specification:

𝑋1, 𝑋2 ∼ Uniform(−1, 1) (iid) and
𝜖 ∼ 𝑁(0, 1) independent of (𝑋1, 𝑋2)
𝑌 = 𝑋1²(𝑋1 + 7∕5) + (25∕9)𝑋2² + 𝜖.

We generated 1000 random datasets of size 𝑛 ∈ {100, 300, 500, 700, 1000, 2000, … , 8000} and considered in each case the importance of 𝑋1 and of 𝑋2. The true value of the variable importance measures implied by this data-generating mechanism can be shown to be 𝜓0,1 ≈ 0.158 and 𝜓0,2 ≈ 0.342. This nonlinear setting helps to highlight the drawbacks of relying on a simple parametric model to estimate the conditional means.

To obtain μ̂, μ̂1, and μ̂2, we fit locally constant loess smoothing using the R function loess with tuning selected to minimize a fivefold cross-validated estimate of the empirical risk based on the squared error loss function. Loess smoothing was chosen because it is a data-adaptive algorithm with an efficient implementation, and it satisfies the minimum convergence rate condition outlined in Section 2.3, allowing us to numerically verify our theoretical results. Because we obtained the same trends using locally constant kernel regression, we do not report summaries from these additional simulations here. This fact nevertheless highlights the ease of comparing results from two different estimation techniques.

We computed the naive and proposed estimators and respective confidence intervals for each replication and compared these to a parametric difference in 𝑅² based on simple linear regression using ordinary least squares (OLS). Because a simple asymptotic distribution for the naive estimator is unavailable, a percentile bootstrap approach with 1000 bootstrap samples was used in an attempt to obtain approximate confidence intervals based on 𝜓̂ naive,j. For each estimator, we then computed the empirical bias scaled by 𝑛1∕2 and the empirical variance scaled by 𝑛. Our output for the estimated bias includes confidence intervals for the true bias based on the resulting draws from the bootstrap sampling distribution. Finally, we computed the empirical coverage of the nominal 95% confidence intervals constructed.

Figure 1 displays the results of this simulation. In the left panel, we note that the scaled empirical bias of the proposed estimator decreases towards zero as 𝑛 tends to infinity, regardless of which feature we remove. Also, we see that both the naive estimator and the OLS estimator have substantial bias that does not tend to zero faster than 𝑛−1∕2. This coincides with our expectations: the naive estimator involves an inadequate bias-variance trade-off with respect to the parameter of interest and does not include any debiasing; the OLS estimator is based on a misspecified mean model. Though there is very substantial bias reduction from using the proposed estimator, we see that its scaled bias appears to dip slightly below zero for large 𝑛. We expect for larger 𝑛 to see this scaled bias for the proposed estimator get closer to zero; numerical error in our computations may explain why this does not exactly happen. These results provide empirical evidence that the debiasing step is necessary to account for the slow rates of convergence in estimation of 𝜓0,𝑠 introduced because μ0 and μ0,𝑠 are flexibly estimated.
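The true values reported above can be checked directly by Monte Carlo, since the conditional mean functions implied by this mechanism are available in closed form. The sketch below uses the known true regressions rather than the loess-based estimators studied in the simulation, so it only verifies the population quantities 𝜓0,1 and 𝜓0,2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
y = x1**2 * (x1 + 7/5) + (25/9) * x2**2 + rng.normal(size=n)

# Conditional means implied by the data-generating mechanism:
mu = x1**2 * (x1 + 7/5) + (25/9) * x2**2   # E[Y | X1, X2]
mu_s1 = 7/15 + (25/9) * x2**2              # E[Y | X2]; note E[X1^3 + (7/5)X1^2] = 7/15
mu_s2 = x1**2 * (x1 + 7/5) + 25/27         # E[Y | X1]; note E[(25/9)X2^2] = 25/27

# R^2-difference (equivalently, ANOVA-type) importance of each feature
var_y = y.var()
psi_1 = (mu.var() - mu_s1.var()) / var_y
psi_2 = (mu.var() - mu_s2.var()) / var_y
print(round(psi_1, 3), round(psi_2, 3))    # approximately 0.158 and 0.342
```

With 10⁶ draws, the Monte Carlo error is on the order of 10⁻³, so the output agrees with the reported values to roughly three decimals.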
FIGURE 1 Empirical bias (scaled by 𝑛1∕2) with Monte Carlo error bars, empirical variance (scaled by 𝑛), and empirical coverage of nominal 95% confidence intervals for the proposed, naive, and OLS estimators for either feature, using loess smoothing with cross-validation tuning (in the case of the proposed and naive estimators). Circles, filled diamonds, and filled squares denote that we have removed 𝑋1; stars, crossed diamonds, and empty squares denote that we have removed 𝑋2. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
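For concreteness, the three summaries plotted in Figure 1 (scaled empirical bias, scaled empirical variance, and empirical coverage) can be computed from replicated estimates as follows. The replicated estimates here are simulated stand-ins drawn from a normal distribution with a hypothetical standard error, not output of the actual estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, psi_true = 1000, 500, 0.158
se = 0.5 / np.sqrt(n)                       # hypothetical standard error at sample size n
est = rng.normal(psi_true, se, size=reps)   # stand-in estimates across replications

scaled_bias = np.sqrt(n) * (est.mean() - psi_true)   # empirical bias scaled by n^(1/2)
scaled_var = n * est.var(ddof=1)                     # empirical variance scaled by n
low, high = est - 1.96 * se, est + 1.96 * se         # nominal 95% Wald-type intervals
coverage = ((low <= psi_true) & (psi_true <= high)).mean()
print(round(scaled_bias, 2), round(scaled_var, 2), round(coverage, 3))
```

Because the stand-in estimator is unbiased with variance of order 1∕𝑛, the scaled bias hovers near zero, the scaled variance stabilizes near a constant, and the coverage sits near the nominal 95% level, mirroring the behavior expected of the proposed estimator.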
In the middle panel of Figure 1, we see that the variance of the proposed estimator is close to that of the naive estimator—we have thus not suffered much from removing excess bias in our estimation procedure. The variance of the OLS estimator is the smallest of the three: using a parametric model tends to result in a smaller variance. The ratio of the variance of the naive estimator to that of the proposed estimator is near one for all 𝑛 considered and ranges between approximately 0.8 and 1.2 in our simulation study. Finally, in the right-hand panel, we see that as sample size grows, coverage increases for the confidence interval based on the proposed estimator and approaches the nominal level. In contrast, the coverage of intervals based on both the naive estimator and the OLS estimator decreases instead and quickly becomes poor.

3.2 Testing the zero-importance null hypothesis

We now consider data generated according to the following specification:

𝑋1, 𝑋2 ∼ Uniform(−1, 1) (iid) and
𝜖 ∼ 𝑁(0, 1) independent of (𝑋1, 𝑋2); 𝑌 = (25∕9)𝑋1² + 𝜖.

We generated 1000 random datasets of size 𝑛 ∈ {100, 300, 500, 700, 1000, 2000, 3000} and again considered in each case the importance of 𝑋1 and of 𝑋2. The true value of the variable importance measures implied by this data-generating mechanism can be shown to be 𝜓0,1 ≈ 0.407 and 𝜓0,2 = 0. We estimated the conditional means and summarized the results of this simulation as in the previous simulation.

Figure 2 displays the results of this simulation. In the left-hand panel, we observe that the proposed estimator has smaller scaled bias in magnitude than the naive estimator when we remove the feature with nonzero importance. However, when we remove the feature with zero importance, the proposed estimator has slightly higher bias. While this is somewhat surprising, it likely is due to the additive correction in the one-step construction being slightly too large. The scaled bias of the proposed estimator tends to zero as 𝑛 increases for both features, which is not true of the naive estimator. In the middle panel, we see that we have not incurred excess variance by using the proposed estimator. In the right-hand panel, we see that both estimators have close to zero coverage for the parameter under the null hypothesis, but that the proposed estimator has higher coverage than the naive estimator for the predictive feature. These results highlight that more work needs to be done for valid testing and estimation under this boundary null hypothesis. While our current proposal yields valid results for the predictive feature, even in the presence of a null feature, ensuring valid inference for null features themselves remains an important challenge.

3.3 Moderate-dimensional vector of features

We consider one setting in which the features are independent and a second in which groups of features are correlated. In setting 𝐴, we generated data according to the
FIGURE 2 Empirical bias (scaled by 𝑛1∕2) with Monte Carlo error bars, empirical variance (scaled by 𝑛), and empirical coverage of nominal 95% confidence intervals for the proposed and naive estimators for either feature, using loess smoothing with cross-validation tuning. Circles and filled diamonds denote that we have removed 𝑋1, while stars and crossed diamonds denote that we have removed 𝑋2. We operate under the null hypothesis for 𝑋2, that is, 𝜓0,2 = 0. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
following specification:

𝑋1, 𝑋2, … , 𝑋15 ∼ 𝑁(0, 4) (iid) and
𝜖 ∼ 𝑁(0, 1) independent of (𝑋1, 𝑋2, … , 𝑋15)
𝑌 = 𝐼(−2,+2)(𝑋1) ⋅ ⌊𝑋1⌋ + 𝐼(−∞,0](𝑋2) + 𝐼(0,+∞)(𝑋3) + |𝑋6∕4|³ + |𝑋7∕4|⁵ + (7∕3) cos(𝑋11∕2) + 𝜖.

In setting 𝐵, the covariate distribution was modified to include clustering. Specifically, we generated (𝑋1, 𝑋2, … , 𝑋15) ∼ 𝑀𝑉𝑁15(μ, Σ), where the mean vector is

μ = 3 × (0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0) − 2 × (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

and the variance–covariance matrix is block-diagonal with blocks

⎡ 1    0.15 0.15⎤   ⎡ 1   0.5 0.5⎤       ⎡ 1    0.85 0.85⎤
⎢0.15  1    0.15⎥ , ⎢0.5  1   0.5⎥ , and ⎢0.85  1    0.85⎥
⎣0.15  0.15 1   ⎦   ⎣0.5  0.5 1  ⎦       ⎣0.85  0.85 1   ⎦

and all other off-diagonal entries equal to zero. The random error 𝜖 and the outcome 𝑌 are then generated as in setting 𝐴. In both settings, we generated 500 random datasets of size 𝑛 ∈ {100, 300, 500, 1000} and considered the importance of the feature sets {1, 2, 3, 4, 5}, {6, … , 10}, and {11, … , 15} for each sample size. The true value of the variable importance measures corresponding to each of the considered groups in both settings is given in Table 1. Results for the analysis of additional groupings are provided in the Supporting Information.

TABLE 1 Approximate values of 𝜓0,𝑠 for each simulation setting and group considered in the moderate-dimensional simulations in Section 3.3

Group                  Setting 𝐴   Setting 𝐵
(𝑋1, 𝑋2, … , 𝑋5)       0.295       0.281
(𝑋6, 𝑋7, … , 𝑋10)      0.240       0.314
(𝑋11, 𝑋12, … , 𝑋15)    0.242       0.179

For each scenario considered, we estimated the conditional mean functions with gradient-boosted trees (Friedman, 2001) fit using GradientBoostingRegressor in the sklearn module in Python. Gradient-boosted trees were used due to their generally favorable prediction performance and large degree of flexibility, with full knowledge that they are not guaranteed to satisfy the minimum rate condition outlined in Section 2.3. We used fivefold cross-validation to select the optimal number of trees with one node as well as the optimal learning rate for the algorithm. We summarized the results of these simulations in the same manner as in the low-dimensional simulations.

The results for setting 𝐴 are presented in Figure 3. From the top row, we note that as sample size increases, the scaled empirical bias of the proposed estimator approaches zero, whereas that of the naive estimator increases in
FIGURE 3 Top row: empirical bias for the proposed and naive estimators scaled by 𝑛1∕2 for setting 𝐴, based on gradient-boosted trees. Bottom row: empirical coverage of nominal 95% confidence intervals for the proposed and naive estimators for setting 𝐴, using gradient-boosted trees. We consider all 𝑠 combinations from Table 1. Diamonds denote the naive estimator, and circles denote the proposed estimator. Monte Carlo error bars are displayed vertically. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
magnitude across all subsets considered. From the bottom row, we observe that the empirical coverage of intervals based on the proposed estimator increases toward the nominal level as sample size increases and is uniformly higher than the empirical coverage of bootstrap intervals based on the naive estimator.

The results for setting 𝐵 are presented in Figure 4. From the top row, we note some residual bias in the proposed estimator for 𝑠 = {11, … , 15}. Larger samples may be needed to observe more thorough bias reduction—indeed, this group of features is that with the highest within-group correlation. Nevertheless, the scaled empirical bias of the proposed estimator approaches zero as sample size increases for both 𝑠 = {1, … , 5} and 𝑠 = {6, … , 10}. In all cases, the scaled empirical bias of the naive estimator increases in magnitude as sample size increases. In the bottom row, we again see that intervals based on the proposed estimator have uniformly higher coverage than those based on the naive estimator.

The proposed estimator performs substantially better than the naive estimator in these simulations, though higher levels of correlation appear to be associated with relatively poorer point and interval estimator performance. This suggests that it may be wise to consider in practice the importance of entire groups of correlated predictors rather than that of individual features. This is a sensible approach for dealing with correlated features, which necessarily render variable importance assessment challenging. In our simulations, the empirical coverage of proposed intervals for the importance of a group of highly correlated features approaches the nominal level as sample size increases, indicating that the proposed approach does yield good results in such cases.

Use of the proposed estimator results in better point and interval estimation performance than the naive estimator in the presence of null features. This is illustrated, for example, when evaluating the importance of the group (𝑋1, … , 𝑋5), in which case most other features (ie,
FIGURE 4 Top row: empirical bias for the proposed and naive estimators scaled by 𝑛1∕2 for setting 𝐵, using gradient-boosted trees. Bottom row: empirical coverage of nominal 95% confidence intervals for the proposed and naive estimators for setting 𝐵, using gradient-boosted trees. We consider all 𝑠 combinations from Table 1. Diamonds denote the naive estimator, and circles denote the proposed estimator. Monte Carlo error bars are displayed vertically. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
𝑋8, 𝑋9, 𝑋10, 𝑋12, … , 𝑋15) have null importance. However, as before, we expect the behavior of point and interval estimators for the variable importance of null features to be poorer. Additional work on valid estimation and testing under this null hypothesis is necessary.

4 RESULTS FROM THE SOUTH AFRICAN HEART DISEASE STUDY DATA

We consider a subset of the data from the Coronary Risk Factor Study (Rosseauw et al., 1983), a retrospective cross-sectional sample of 462 white males aged 15–64 in a region of the Western Cape, South Africa; these data are publicly available in Hastie et al. (2009). The primary aim of this study was to establish the prevalence of ischemic heart disease risk factors in this high-incidence region. For each participant, the presence or absence of myocardial infarction (MI) at the time of the survey is recorded, yielding 160 cases of MI. In addition, measurements on two sets of features are available: behavioral features, including cumulative tobacco consumption, current alcohol consumption, and type A behavior, a behavioral pattern linked to stress (Friedman and Rosenman, 1971); and biological features, including systolic blood pressure, low-density lipoprotein (LDL) cholesterol, adiposity (similar to body mass index), family history of heart disease, obesity, and age.

We considered the importance of each feature separately, as well as that of these two groups of features, when predicting the presence or absence of MI. We estimated the conditional means using the sequential regression estimating procedure outlined in Section 2.2 and using the Super Learner (van der Laan et al., 2007) via the SuperLearner R package. The Super Learner is a particular implementation of stacking (Wolpert, 1992), and the
FIGURE 5 Estimates from the South African heart disease study for the proposed and naive estimators of the variable importance parameter, on left and right, respectively. The Super Learner, with a library including the elastic net, generalized additive models, gradient-boosted trees, and random forests, was used
resulting estimator is guaranteed to have the same risk as the oracle estimator, asymptotically, along with finite-sample guarantees (van der Laan et al., 2007). Our library of candidate learners consists of boosted trees, generalized additive models, elastic net, and random forests implemented in the R packages gbm, gam, glmnet, and randomForest, respectively, each with varying tuning parameters. Tenfold cross-validation was used to determine the optimal convex combination of these learners chosen to minimize the cross-validated mean-squared error. This process allowed the Super Learner to determine the optimal tuning parameters for the individual algorithms as part of its optimal combination, and our resulting estimator of the conditional means is the optimal convex combination of the individual algorithms. Finally, we produced confidence intervals based on the proposed estimator alone, since as we have seen earlier, intervals based on the naive estimator are generally invalid.

The results are presented in Figure 5. The ordering is slightly different in the two plots; this is not surprising, since the one-step procedure should eliminate excess bias in the naive estimator introduced by estimating the conditional means using flexible learners. We find that biological factors are more important than behavioral factors. The most important individual feature is family history of heart disease; family history has been found to be a risk factor of MI in previous studies. It appears scientifically sensible that both groups of features are more important than any individual feature besides family history.

We compared these results to a logistic regression model fit to these data. Based on the absolute values of 𝑧-statistics, logistic regression picks age as most important, followed by family history. This slight difference is captured in our uncertainty estimates (Figure 5): there, we see that the point estimates for age and family history are close, and their confidence intervals largely overlap. We find the same pattern for LDL cholesterol and tobacco consumption, the third- and fourth-ranked variables by logistic regression. While our results match closely with the simplest approach to analyzing variable importance in these data, our proposed method is not dependent on a single estimation technique, such as logistic regression. The use of more flexible learners to estimate 𝜓0,𝑠, as we have done in this analysis, renders our findings less likely to be driven by potential model misspecification.

5 CONCLUSION

We have obtained novel results for a familiar measure of variable importance, interpreted as the additional proportion of variability in the outcome explained by including a subset of the features in the conditional mean outcome relative to the entire covariate vector. This parameter can be readily seen as a nonparametric extension of the classical 𝑅²-based measure, and it provides a description of the true relationship between the outcome and covariates rather than an algorithm-specific measure of association. We have also studied the properties of this parameter and derived its nonparametric EIF. We found that the form of the variable importance measure under consideration can have a dramatic impact on the ease with which efficient estimators may be constructed—for example, debiasing is needed for ANOVA-based plug-in estimators using flexible learners, but not for plug-in estimators based on the difference in 𝑅² values. We provide general results
describing this phenomenon in Williamson et al. (2020). Leveraging tools from semiparametric theory, we have described the construction of an asymptotically efficient estimator of the true variable importance measure built upon flexible, data-adaptive learners. We have studied the properties of this estimator, notably providing distributional results, and described the construction of asymptotically valid confidence intervals. In simulations, we found the proposed estimator to have good practical performance, particularly as compared to a naive estimator of the proposed variable importance measure. However, we found this performance to depend very much on whether or not the true variable importance measure equals zero. When it does, a limiting distribution is not readily available, and significant theoretical developments seem needed in order to perform valid and powerful inference. However, for those features with true importance, the behavior of point and interval estimates is not influenced by the presence of null features. While the parameter we have studied has broad interpretability, alternative measures of variable importance may also be useful in certain settings (eg, difference in the area under the receiver operating characteristic curve in the context of a binary outcome). We study such measures in Williamson et al. (2020).

For each candidate set of variables, the estimation procedure we proposed requires estimation of two conditional mean functions. To guarantee that our estimator has good properties, these conditional means must be estimated well. For this reason, and as was illustrated in our work, we recommend the use of model stacking with a wide range of candidate learners, ranging from parametric to fully nonparametric algorithms. This flexibility mitigates concerns regarding model misspecification. Additionally, we suggest the use of sequential regressions to minimize any incompatibility between the two conditional means estimated.

A multiple testing issue arises when inference is desired on many feature subsets. Of course, a Bonferroni approach may be easily implemented. Alternatively, we could use a consistent estimator of the variance-covariance matrix for the importance of all subsets of features under study, obtained using the influence functions exhibited in this paper. This alternative multiple-testing adjustment has improved power over a Bonferroni-type approach. Strategies based on this approach are described, for example, in Dudoit and van der Laan (2007).

ACKNOWLEDGMENTS
The authors thank Prof. Antoine Chambaz for insightful comments that improved this manuscript. This work was supported by the National Institutes of Health (NIH) through awards F31AI140836, DP5OD019820, R01AI029168 and UM1AI068635. The opinions expressed in this article are those of the authors and do not necessarily represent the official views of the NIH.

ORCID
Brian D. Williamson https://orcid.org/0000-0002-7024-548X
Marco Carone https://orcid.org/0000-0003-2106-0953

REFERENCES
Barron, A. (1989) Statistical properties of artificial neural networks. In Proceedings of the 28th IEEE Conference on Decision and Control. Piscataway, NJ: IEEE, pp. 280–285.
Bickel, P., Klaasen, C., Ritov, Y. and Wellner, J. (1998) Efficient and Adaptive Estimation for Semiparametric Models. Berlin: Springer.
Breiman, L. (2001) Random forests. Machine Learning, 45, 5–32.
Carone, M., Diaz, I. and van der Laan, M. (2018) Higher-order targeted loss-based estimation. In Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. Berlin: Springer, pp. 483–510.
Chambaz, A., Neuvial, P. and van der Laan, M. (2012) Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic Journal of Statistics, 6, 1059–1099.
Cleveland, W. (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836.
Doksum, K. and Samarov, A. (1995) Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. The Annals of Statistics, 23, 1443–1473.
Doksum, K., Tang, S. and Tsui, K.-W. (2008) Nonparametric variable selection: the EARTH algorithm. Journal of the American Statistical Association, 103, 1609–1620.
Dudoit, S. and van der Laan, M. (2007) Multiple Testing Procedures with Applications to Genomics. Berlin: Springer Science & Business Media.
Friedman, J. (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29, 1189–1232.
Friedman, M. and Rosenman, R. (1971) Type A behavior pattern: its association with coronary heart disease. Annals of Clinical Research, 3, 300–312.
Gill, R., van der Laan, M. and Wellner, J. (1995) Inefficient estimators of the bivariate survival function for three models. Annales de l'Institut Henri Poincaré Probabilités et Statistiques, 31, 545–597.
Grömping, U. (2009) Variable importance in regression: linear regression versus random forest. The American Statistician, 63, 308–319.
Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models, volume 43. Boca Raton, FL: CRC Press.
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Berlin: Springer.
Huang, L. and Chen, J. (2008) Analysis of variance, coefficient of determination and F-test for local polynomial regression. The Annals of Statistics, 36, 2085–2109.
Ishwaran, H. (2007) Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1, 519–537.
Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. and Wasserman, L. (2018) Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113, 1094–1111.
Loh, W.-Y. (2002) Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361–386.
Olden, J., Joy, M. and Death, R. (2004) An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological Modelling, 173, 389–397.
Pfanzagl, J. (1982) Contributions to a General Asymptotic Statistical Theory. Berlin: Springer.
Rosseauw, J., Du Plessis, J., Benade, A., Jordann, P., et al. (1983) Coronary risk factor screening in three rural communities. South African Medical Journal, 64, 430–436.
Sapp, S., van der Laan, M. and Page, K. (2014) Targeted estimation of binary variable importance measures with interval-censored outcomes. The International Journal of Biostatistics, 10, 77–97.
Shao, J. (1994) Bootstrap sample size in nonregular cases. Proceedings of the American Mathematical Society, 122, 1251–1262.
Strobl, C., Boulesteix, A., Zeileis, A. and Hothorn, T. (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics, 8, 1.
van der Laan, M. (2006) Statistical inference for variable importance. The International Journal of Biostatistics.
van der Laan, M., Polley, E. and Hubbard, A. (2007) Super learner. Statistical Applications in Genetics and Molecular Biology, 6, Online Article 25.
van der Vaart, A. (2000) Asymptotic Statistics, volume 3. Cambridge, UK: Cambridge University Press.
Williamson, B., Gilbert, P., Simon, N. and Carone, M. (2020) A unified approach for inference on algorithm-agnostic variable importance. arXiv:2004.03683.
Wolpert, D. (1992) Stacked generalization. Neural Networks, 5, 241–259.
Yao, F., Müller, H. and Wang, J. (2005) Functional linear regression analysis for longitudinal data. The Annals of Statistics, 33, 2873–2903.

SUPPORTING INFORMATION
Web Appendices, Tables, and Figures referenced in Sections 1 and 2 are available with this paper at the Biometrics website on Wiley Online Library. These include all technical details, and additional simulation results and data analyses. Code to reproduce all results is available online at https://github.com/bdwilliamson/vimpaper_supplement. All methods are implemented in R package vimp and Python package vimpy, both available on CRAN (https://cran.r-project.org/web/packages/vimp/index.html) and PyPI (https://pypi.org/project/vimpy/), respectively.

How to cite this article: Williamson BD, Gilbert PB, Carone M, Simon N. Nonparametric variable importance assessment using machine learning techniques. Biometrics. 2021;77:9–22. https://doi.org/10.1111/biom.13392