Biometrics - 2020 - Williamson - Nonparametric variable importance assessment using machine learning techniques

This document discusses a nonparametric variable importance measure that can be applied across various regression techniques, allowing for valid statistical inference and a population-level interpretation. The proposed measure is a generalization of the analysis of variance (ANOVA) variable importance and is designed to assess the contribution of features in predicting outcomes without being tied to a specific estimation method. The authors provide a framework for constructing efficient estimators and demonstrate the measure's application through simulations and a real data analysis of cardiovascular disease risk factors.


Received: 21 April 2018 Revised: 20 March 2019 Accepted: 22 March 2019

DOI: 10.1111/biom.13392

BIOMETRIC METHODOLOGY

Nonparametric variable importance assessment using machine learning techniques

Brian D. Williamson (1), Peter B. Gilbert (1,2), Marco Carone (1,2), Noah Simon (1)

1 Department of Biostatistics, University of Washington, Seattle, Washington, USA
2 Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA

Correspondence
Brian D. Williamson, Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.
Email: [email protected]

Funding information
National Institute of Allergy and Infectious Diseases, Grant/Award Numbers: UM1AI068635, F31AI140836, DP5OD019820, R01AI029168

Abstract
In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often only resort to one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often suboptimal for predicting the response. Additionally, because the variable importance measures native to different regression techniques generally have a different interpretation, comparisons across techniques can be difficult. In this work, we study a variable importance measure that can be used with any regression technique, and whose interpretation is agnostic to the technique used. This measure is a property of the true data-generating mechanism. Specifically, we discuss a generalization of the analysis of variance variable importance measure and discuss how it facilitates the use of machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. The importance of each feature or group of features in the data can then be described individually, using this measure. We describe how to construct an efficient estimator of this measure as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of risk factors for cardiovascular disease in South Africa.

KEYWORDS
machine learning, nonparametric 𝑅², statistical inference, targeted learning, variable importance

1 INTRODUCTION

Suppose that we have independent observations 𝑍1, … , 𝑍𝑛 drawn from an unknown distribution 𝑃0, known only to lie in a potentially rich class of distributions ℳ. We refer to ℳ as our model. Further, suppose that each observation 𝑍𝑖 consists of (𝑋𝑖, 𝑌𝑖), where 𝑋𝑖 ∶= (𝑋𝑖1, … , 𝑋𝑖𝑝) ∈ ℝᵖ is a covariate vector and 𝑌𝑖 ∈ ℝ is the outcome of interest. It is often of interest to understand the association between 𝑌 and 𝑋 under 𝑃0. To do this, we generally consider the conditional mean function μ0 ∶= μ𝑃0, where for each 𝑃 ∈ ℳ we define

μ𝑃(𝑥) ∶= 𝐸𝑃(𝑌 ∣ 𝑋 = 𝑥). (1)

© 2020 The International Biometric Society

Biometrics. 2021;77:9–22. wileyonlinelibrary.com/journal/biom


15410420, 2021, 1, Downloaded from https://onlinelibrary.wiley.com/doi/10.1111/biom.13392 by South University Of Science, Wiley Online Library on [19/03/2025]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
10 WILLIAMSON et al.

Estimation of μ0 is the canonical “predictive modeling” problem. There are many tools for estimating μ0: classical parametric techniques (eg, linear regression), and more flexible nonparametric or semiparametric methods, including random forests (Breiman, 2001), generalized additive models (Hastie and Tibshirani, 1990), loess smoothing (Cleveland, 1979), and artificial neural networks (Barron, 1989), among many others. Once a good estimate of μ0 is obtained, it is often of scientific interest to identify the features that contribute most to the variation in μ0. For any given set 𝑠 ⊆ {1, … , 𝑝} and distribution 𝑃 ∈ ℳ, we may define the reduced conditional mean

μ𝑃,𝑠(𝑥) ∶= 𝐸𝑃(𝑌 ∣ 𝑋−𝑠 = 𝑥−𝑠), (2)

where for any vector 𝑣 and set 𝑟 of indices the symbol 𝑣−𝑟 denotes the vector of all components of 𝑣 with index not in 𝑟. Here, the set 𝑠 can represent a single element or a group of elements. The importance of the elements in 𝑠 can be evaluated by comparing μ0 and μ0,𝑠 ∶= μ𝑃0,𝑠. This strategy will be leveraged in this paper.

The analysis of variance (ANOVA) decomposition is the main classical tool for evaluating variable importance. There, μ0 is assumed to have a simple parametric form. While this facilitates the task at hand considerably, the conclusions drawn can be misleading in view of the high risk of model misspecification. For this reason, it is increasingly common to use nonparametric or machine learning-based regression methods to estimate μ0; in such cases, classical ANOVA results do not necessarily apply.

Recent work on evaluating variable importance without relying on overly strong modeling assumptions can generally be categorized as being either (i) intimately tied to a specific estimation technique for the conditional mean function or (ii) agnostic to the estimation technique used. The former category includes, for example, variable importance measures for random forests (Breiman, 2001; Ishwaran, 2007; Strobl et al., 2007; Grömping, 2009) and neural networks (see, eg, Olden et al., 2004), and ANOVA in linear models. Among these, ANOVA alone appears to readily allow valid statistical inference. Additionally, it is generally not possible to directly compare the importance assessments stemming from different methods: they usually measure different quantities and thus have different interpretations. The latter category includes, for example, nonparametric extensions of 𝑅² for kernel-based estimators, local polynomial regression, and functional regression (Doksum and Samarov, 1995; Yao et al., 2005; Huang and Chen, 2008); the marginalized mean difference, 𝐸𝑃0{𝐸𝑃0(𝑌 ∣ 𝑋 = 𝑥1, 𝑊) − 𝐸𝑃0(𝑌 ∣ 𝑋 = 𝑥0, 𝑊)} (van der Laan, 2006; Chambaz et al., 2012; Sapp et al., 2014), where 𝑥1 and 𝑥0 are two meaningful reference levels of 𝑋, and 𝑊 represents adjustment variables; and the mean difference in absolute deviations, 𝐸𝑃0{|𝑌 − μ0(𝑋)| − |𝑌 − μ0,𝑠(𝑋)|} (Lei et al., 2017). Methods in this latter category allow valid inference and have broad potential applicability. The appropriate measure to use depends on the scientific context.

We are interested in studying a variable importance measure that (i) is entirely agnostic to the estimation technique, (ii) allows valid inference, and (iii) provides a population-level interpretation that is well suited to scientific applications. In this work, we study a variable importance measure that satisfies each of these criteria, adding to the class of technique-agnostic measures referenced above. In particular, we consider the ANOVA-based variable importance measure

𝜓0,𝑠 ∶= ∫ {μ0(𝑥) − μ0,𝑠(𝑥)}² 𝑑𝑃0(𝑥) / 𝑣𝑎𝑟𝑃0(𝑌). (3)

For a vector 𝑣 and a subset 𝑟 of indices, we denote by 𝑣𝑟 the vector of all components of 𝑣 with index in 𝑟. Then, we may interpret (3) as the additional proportion of variability in the outcome explained by including 𝑋𝑠 in the conditional mean. This follows from the fact that we can express 𝜓0,𝑠 as

[1 − 𝐸𝑃0{𝑌 − μ0(𝑋)}² / 𝑣𝑎𝑟𝑃0(𝑌)] − [1 − 𝐸𝑃0{𝑌 − μ0,𝑠(𝑋)}² / 𝑣𝑎𝑟𝑃0(𝑌)],

the difference in the population 𝑅² obtained using the full set of covariates as compared to the reduced set of covariates only. Thus, the parameter we focus on is a simple generalization of the classical 𝑅² measure of importance to a nonparametric model and is useful in any setting in which the mean squared error is a scientifically relevant population measure of predictiveness. This parameter is a function of 𝑃0 alone, in that it describes a property of the true data-generating mechanism and not of any particular estimation method. In this work, we provide a framework for building a nonparametric efficient estimator of 𝜓0,𝑠 that permits valid statistical inference.

We emphasize that the purpose of the variable importance measure we study here is not to offer insight into the characteristics of any particular algorithm, but rather to describe the importance of variables in predicting the outcome in the population. This is in contrast to common algorithm-specific measures of variable importance. If a tool for interpreting black-box algorithms is desired, other approaches to variable importance may be preferred, as referenced above.
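The agreement between the ANOVA representation (3) and the difference-in-population-𝑅² representation can be checked numerically whenever the true conditional means are known. The sketch below is our own illustration, not from the paper: the simulated Gaussian mechanism and all variable names are assumptions chosen so that μ0 and μ0,𝑠 are available in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + 2 * x2 + rng.normal(size=n)

# True conditional means, known by construction (X1 and X2 independent):
mu_full = x1 + 2 * x2   # mu_0(x) = E(Y | X1, X2)
mu_red = x1             # mu_{0,s}(x) = E(Y | X1) for s = {2}

var_y = np.var(y)

# ANOVA representation (3): mean of {mu_0 - mu_{0,s}}^2, over var(Y)
psi_anova = np.mean((mu_full - mu_red) ** 2) / var_y

# Difference-in-R^2 representation: R^2(full) - R^2(reduced)
r2_full = 1 - np.mean((y - mu_full) ** 2) / var_y
r2_red = 1 - np.mean((y - mu_red) ** 2) / var_y
psi_r2 = r2_full - r2_red

# Analytically, psi_{0,{2}} = 4 var(X2) / var(Y) = 4/6 here.
```

Both quantities approach the analytic value 2/3 as the Monte Carlo sample grows; the two representations differ only by an empirical cross term that vanishes in expectation.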

Care must be taken in building point and interval estimators for 𝜓0,𝑠 when μ0 and μ0,𝑠 are not known to belong to simple parametric families. In particular, when μ0 and μ0,𝑠 are estimated using flexible methods, simply plugging estimates of these regression functions into (3) will not yield a regular and asymptotically linear, let alone efficient, estimator of 𝜓0,𝑠. In this paper, we propose a simple method that, given sufficiently accurate estimators of μ0 and μ0,𝑠, yields an efficient point estimator for 𝜓0,𝑠 and a confidence interval with asymptotically correct coverage. We show that this method—based on ideas from semiparametric theory—is equivalent to simply plugging estimates of μ0 and μ0,𝑠 into the difference in 𝑅² values. In Williamson et al. (2020), we generalize this phenomenon and provide results for plug-in estimators of a large class of variable importance measures.

We note that, while variable importance is related to variable selection, these paradigms may have distinct goals. In variable selection, it is typically of interest to create the best predictive model based on the current data, and this model may include only a subset of the available variables. There are many contributions to both technique-specific (see, eg, Breiman, 2001; Friedman, 2001; Loh, 2002) and nonparametric (see, eg, Doksum et al., 2008) selection. The goal in variable importance is to assess the extent to which (subsets of) features contribute to improving the population-level predictive power of the best possible outcome predictor based on all available features. Of course, variable importance can be used as part of the process of variable selection. To highlight the distinction between importance and selection, it may be useful to consider a scenario in which two perfectly correlated covariates 𝑋1 and 𝑋2 are available. Neither covariate has importance relative to the other, but the variables may be highly important as a pair. A variable importance procedure considering individual and grouped features would identify this, whereas a variable selection procedure would likely choose only one of 𝑋1 or 𝑋2 for use in prediction.

This paper is organized as follows. We present some properties of the parameter we consider and give our proposed estimator in Section 2. In Section 3, we provide empirical evidence that our proposed estimator outperforms both the naive plug-in ANOVA-based estimator and an ordinary least squares-based estimator in settings where the covariate vector is low- or moderate-dimensional and the data-generating mechanism is nonlinear. In Section 4, we apply our method to data from a retrospective study of heart disease in South African men. We provide concluding remarks in Section 5. Technical details and an illustration based on the landmark Boston housing data are provided in the Supporting Information.

2 VARIABLE IMPORTANCE IN A NONPARAMETRIC MODEL

2.1 Parameter of interest

We work in a nonparametric model ℳ with the only restriction that, under each distribution 𝑃 in ℳ, the distribution of 𝑌 given 𝑋 = 𝑥 must have a finite second moment for 𝑃-almost every 𝑥. For given 𝑠 ⊆ {1, … , 𝑝} and 𝑃 ∈ ℳ, based on the conditional means (1) and (2), we define the statistical functional

Ψ𝑠(𝑃) ∶= ∫ {μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)}² 𝑑𝑃(𝑥) / 𝑣𝑎𝑟𝑃(𝑌) (4)
= [1 − 𝐸𝑃{𝑌 − μ𝑃(𝑋)}² / 𝑣𝑎𝑟𝑃(𝑌)] − [1 − 𝐸𝑃{𝑌 − μ𝑃,𝑠(𝑋)}² / 𝑣𝑎𝑟𝑃(𝑌)]. (5)

This is the nonparametric measure of variable importance we focus on. The value of Ψ𝑠(𝑃) measures the importance of the variables in the set {𝑋𝑗}𝑗∈𝑠 relative to the entire covariate vector for predicting the outcome 𝑌 under the data-generating mechanism 𝑃. Using observations 𝑍1, … , 𝑍𝑛 independently drawn from the true, unknown joint distribution 𝑃0 ∈ ℳ, we aim to make efficient inference about the true value 𝜓0,𝑠 = Ψ𝑠(𝑃0).

This parameter is a nonparametric extension of the usual ANOVA-derived measure of variable importance in parametric models. We first note that 𝜓0,𝑠 ∈ [0, 1]. Furthermore, 𝜓0,𝑠 = 0 if and only if 𝑌 is conditionally uncorrelated with every transformation of 𝑋𝑠 given 𝑋−𝑠. In addition, the value of 𝜓0,𝑠 is invariant to linear transformations of the outcome and to a large class of transformations of the feature vector, as detailed in the Supporting Information. As such, common data normalization steps may be performed without impact on 𝜓0,𝑠. Finally, 𝜓0,𝑠 can be seen as a ratio of the extra sum of squares, averaged over the joint feature distribution, to the total sum of squares. The value of 𝜓0,𝑠 is thus precisely the improvement in predictive performance, in terms of standardized mean squared error, that can be expected if we build a model using all of 𝑋 versus only 𝑋−𝑠. If we assume simple linear regression models for μ0 and μ0,𝑠, then 𝜓0,𝑠 is precisely the usual difference in 𝑅² between nested models.

We want to reiterate here that, in contrast to simple parametric approaches to variable importance, our functional Ψ𝑠 simply maps any candidate data-generating mechanism to a nonnegative number. This definition does not require a parametric specification of μ0 or μ0,𝑠. While this is usual for non- or semiparametric inference problems, it is different from classical approaches to variable importance.
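The claimed invariance of 𝜓0,𝑠 to linear transformations of the outcome can be seen directly from (4): replacing 𝑌 by 𝑎 + 𝑏𝑌 multiplies both the numerator and 𝑣𝑎𝑟𝑃(𝑌) by 𝑏². A quick numerical check of this property (our own sketch; the simulated mechanism, the known conditional means, and the constants 𝑎, 𝑏 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = x1 + 2 * x2 + rng.normal(size=n)

mu_full = x1 + 2 * x2   # E(Y | X1, X2), known by construction
mu_red = x1             # E(Y | X1), for s = {2}

def psi(y, mu_full, mu_red):
    # Difference-in-R^2 representation of Psi_s(P), evaluated empirically
    return (np.mean((y - mu_red) ** 2) - np.mean((y - mu_full) ** 2)) / np.var(y)

a, b = 10.0, -3.0
psi_orig = psi(y, mu_full, mu_red)
# Under Y' = a + bY the conditional means become a + b*mu, and psi is unchanged
psi_trans = psi(a + b * y, a + b * mu_full, a + b * mu_red)
```

Both the numerator and the variance are scaled by 𝑏² = 9 here, so the two evaluations agree up to floating-point error.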

For building an efficient estimator of 𝜓0,𝑠, it is critical to consider the differentiability of Ψ𝑠 as a functional. Specifically, we have that (4) is pathwise differentiable with respect to the unrestricted model (see, eg, Bickel et al., 1998). Pathwise differentiable functionals generally admit a convenient functional Taylor expansion that can be used to characterize the asymptotic behavior of plug-in estimators. An analysis of the pathwise derivative allows us to determine the efficient influence function (EIF) of the functional relative to the statistical model (Bickel et al., 1998). The EIF plays a key role in establishing efficiency bounds for regular and asymptotically linear estimators of the true parameter value, and most importantly, in the construction of efficient estimators, as we will highlight below. For convenience, we will denote the numerator of Ψ𝑠(𝑃) by Θ𝑠(𝑃) ∶= ∫ {μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)}² 𝑑𝑃(𝑥). The EIFs of Θ𝑠 and of Ψ𝑠 relative to the nonparametric model ℳ are provided in the following lemma.

Lemma 1. The parameters Θ𝑠 and Ψ𝑠 are pathwise differentiable at each 𝑃 ∈ ℳ relative to ℳ, with EIFs 𝜑𝑃,𝑠 and 𝜑*𝑃,𝑠 relative to ℳ, respectively, given by

𝜑𝑃,𝑠 ∶ 𝑧 ↦ 2{𝑦 − μ𝑃(𝑥)}{μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)} + {μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)}² − Θ𝑠(𝑃),

𝜑*𝑃,𝑠 ∶ 𝑧 ↦ [2{𝑦 − μ𝑃(𝑥)}{μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)} + {μ𝑃(𝑥) − μ𝑃,𝑠(𝑥)}²] / 𝑣𝑎𝑟𝑃(𝑌) − Θ𝑠(𝑃) [{𝑦 − 𝐸𝑃(𝑌)} / 𝑣𝑎𝑟𝑃(𝑌)]².

A linearization of the evaluation of Θ𝑠 at 𝑃 ∈ ℳ around 𝑃0 can be expressed as

Θ𝑠(𝑃) = Θ𝑠(𝑃0) + ∫ 𝜑𝑃,𝑠(𝑧) 𝑑(𝑃 − 𝑃0)(𝑧) + 𝑅𝑠(𝑃, 𝑃0), (6)

where 𝑅𝑠(𝑃, 𝑃0) is a remainder term from this first-order expansion around 𝑃0. The explicit form of 𝑅𝑠(𝑃, 𝑃0) is provided in Section 2.3 and can be used to algebraically verify this representation. For any given estimator 𝑃̂ ∈ ℳ of 𝑃0, we can write

Θ𝑠(𝑃̂) − Θ𝑠(𝑃0) = ∫ 𝜑𝑃̂,𝑠(𝑧) 𝑑(𝑃̂ − 𝑃0)(𝑧) + 𝑅𝑠(𝑃̂, 𝑃0)
= ∫ 𝜑𝑃̂,𝑠(𝑧) 𝑑(ℙ𝑛 − 𝑃0)(𝑧) + 𝑅𝑠(𝑃̂, 𝑃0) − (1/𝑛) ∑𝑖 𝜑𝑃̂,𝑠(𝑍𝑖)
= (1/𝑛) ∑𝑖 𝜑𝑃0,𝑠(𝑍𝑖) + 𝐻𝑠,𝑛(𝑃̂, 𝑃0) + 𝑅𝑠(𝑃̂, 𝑃0) − (1/𝑛) ∑𝑖 𝜑𝑃̂,𝑠(𝑍𝑖), (7)

where ℙ𝑛 is the empirical distribution based on 𝑍1, … , 𝑍𝑛, 𝐻𝑠,𝑛(𝑃, 𝑃0) ∶= ∫ {𝜑𝑃,𝑠(𝑧) − 𝜑𝑃0,𝑠(𝑧)} 𝑑(ℙ𝑛 − 𝑃0)(𝑧) is an empirical process term, and we have made repeated use of the fact that 𝜑𝑃,𝑠(𝑍) has mean zero under 𝑃 for any 𝑃 ∈ ℳ. This representation is critical for characterizing the behavior of the plug-in estimator Θ𝑠(𝑃̂). The four terms on the right-hand side in (7) can be studied separately. The first term is an empirical average of mean-zero transformations of 𝑍1, … , 𝑍𝑛. The second term is an empirical process term, and the third term is a remainder term. Both of these second-order terms can be shown to be asymptotically negligible under certain conditions on 𝑃̂. The fourth term can be thought of as the bias incurred from flexibly estimating the conditional means (1) and (2) and will generally tend to zero slowly. This bias term motivates our choice of estimator for 𝜓0,𝑠 in Section 2.2. We will employ one particular bias correction method, and the large-sample properties of our proposed estimator will be determined by the first term in (7).

2.2 Estimation procedure

Writing the numerator Θ𝑠 of the parameter of interest as a statistical functional suggests a natural estimation procedure. If we have estimators μ̂ and μ̂𝑠 of μ0 and μ0,𝑠, respectively—obtained through any method that we choose, including machine learning techniques—a natural plug-in estimator of 𝜃0,𝑠 ∶= Θ𝑠(𝑃0) is given by

𝜃̂naive,s ∶= ∫ {μ̂(𝑥) − μ̂𝑠(𝑥)}² 𝑑ℙ𝑛(𝑥) = (1/𝑛) ∑𝑖 {μ̂(𝑋𝑖) − μ̂𝑠(𝑋𝑖)}². (8)

In turn, this suggests using, with 𝑌̄𝑛 denoting the empirical mean of 𝑌1, … , 𝑌𝑛,

𝜓̂naive,s ∶= 𝜃̂naive,s / 𝑣𝑎𝑟ℙ𝑛(𝑌) = [(1/𝑛) ∑𝑖 {μ̂(𝑋𝑖) − μ̂𝑠(𝑋𝑖)}²] / [(1/𝑛) ∑𝑖 (𝑌𝑖 − 𝑌̄𝑛)²]

as a simple estimator of 𝜓0,𝑠. We refer to this as the naive estimator. This simple estimator involves hidden trade-offs. On the one hand, it is easy to construct given estimators μ̂ and μ̂𝑠. On the other hand, it does not generally enjoy good inferential properties. If a flexible technique is used to estimate μ0 and μ0,𝑠, constructing μ̂ and μ̂𝑠 usually entails selecting tuning parameter values to achieve an optimal bias-variance trade-off for μ0 and μ0,𝑠, respectively. This is generally not the optimal bias-variance trade-off for estimating the parameter of interest 𝜓0,𝑠, a key fact from

non- and semiparametric theory. The estimator 𝜓̂naive,s has a variance decreasing at a parametric rate, with little sensitivity to the tuning of μ̂ and μ̂𝑠, because of the involved marginalization over the feature distribution. However, it inherits much of the bias from μ̂ and μ̂𝑠. Some form of debiasing is thus needed, as we discuss below. In particular, the estimator 𝜓̂naive,s is generally overly biased, in the sense that its bias does not tend to zero sufficiently fast to allow consistency at rate 𝑛^{−1/2}, let alone efficiency. This is problematic, in particular, because it renders the construction of valid confidence intervals difficult, if not impossible.

Instead, we consider the simple one-step correction estimator

𝜃̂𝑛,𝑠 ∶= 𝜃̂naive,s + (1/𝑛) ∑𝑖 𝜑𝑃̂,𝑠(𝑍𝑖)

of 𝜃0,𝑠, which, in view of (7), is asymptotically efficient under certain regularity conditions. This estimator is obtained by correcting for the excess bias of the naive plug-in estimator 𝜃̂naive,s using the empirical average of the estimated EIF as a first-order approximation of (minus) this bias (see, eg, Pfanzagl, 1982). We note that to compute 𝜃̂𝑛,𝑠 it is not necessary to obtain an estimator 𝑃̂ of the entire distribution 𝑃0. Instead, estimators μ̂ and μ̂𝑠 of μ0 and μ0,𝑠 suffice. The variance of 𝑌 under 𝑃0 may simply be estimated using the empirical variance. It is easy to verify that the resulting estimator of 𝜓0,𝑠 simplifies to

𝜓̂𝑛,𝑠 ∶= 𝜃̂𝑛,𝑠 / 𝑣𝑎𝑟ℙ𝑛(𝑌) = 𝜓̂naive,s + [∑𝑖 2{𝑌𝑖 − μ̂(𝑋𝑖)}{μ̂(𝑋𝑖) − μ̂𝑠(𝑋𝑖)}] / [∑𝑖 (𝑌𝑖 − 𝑌̄𝑛)²]. (9)

This estimator adjusts for the inadequate bias-variance trade-off performed when flexible estimators μ̂ and μ̂𝑠 are tuned to be good estimators of μ0 and μ0,𝑠 rather than being tuned for the end objective of estimating 𝜓0,𝑠. Simple algebraic manipulations yield that 𝜓̂𝑛,𝑠 is equivalent to the plug-in estimator

[1 − (1/𝑛) ∑𝑖 {𝑌𝑖 − μ̂(𝑋𝑖)}² / 𝑣𝑎𝑟ℙ𝑛(𝑌)] − [1 − (1/𝑛) ∑𝑖 {𝑌𝑖 − μ̂𝑠(𝑋𝑖)}² / 𝑣𝑎𝑟ℙ𝑛(𝑌)] (10)

obtained by viewing 𝜓0,𝑠 as a difference in population 𝑅² values, as in (5). As indicated above, semiparametric theory indicates that plug-in estimators based on flexible regression algorithms typically require bias correction if the latter are not tuned towards the target of inference, as in (9). Interestingly, as we note from above, this is needed when constructing a plug-in estimator based on the ANOVA representation (4) of 𝜓0,𝑠 but not based on its difference-in-𝑅² representation (5).

While we are not constrained to any particular estimation method to construct μ̂ and μ̂𝑠, we have found one particular strategy to work well in practice. Using any specific predictive modeling technique to regress the outcome 𝑌 on the full covariate vector 𝑋 and then on the reduced covariate vector 𝑋−𝑠 does not take into account that the two conditional means are related and will generally result in incompatible estimates. Specifically, we have that μ0,𝑠(𝑥) = 𝐸𝑃0{μ0(𝑋) ∣ 𝑋−𝑠 = 𝑥−𝑠}, which we can take advantage of to produce the following sequential regression estimating procedure: (i) regress 𝑌 on 𝑋 to obtain an estimate μ̂ of μ0, and then (ii) regress μ̂(𝑋) on 𝑋−𝑠 to obtain an estimate μ̂𝑠 of μ0,𝑠.

The final estimation procedure we recommend for 𝜓0,𝑠 consists of estimator 𝜓̂𝑛,𝑠, where the conditional means involved are estimated using flexible regression estimators based on the sequential regression approach; see Algorithm 1 for explicit details.

Algorithm 1 Estimating 𝜓0,𝑠
1: Choose a technique to estimate the conditional means μ0 and μ0,𝑠, eg, ensemble learning with various predictive modeling algorithms (Wolpert, 1992);
2: μ̂ ← Regress 𝑌 on 𝑋 using the technique from step (1) to estimate μ0;
3: μ̂𝑠 ← Regress μ̂(𝑋) on 𝑋−𝑠 using the technique from step (1) to estimate μ0,𝑠;
4: 𝜓̂𝑛,𝑠 ← [(1/𝑛) ∑𝑖 {𝑌𝑖 − μ̂𝑠(𝑋𝑖)}² − (1/𝑛) ∑𝑖 {𝑌𝑖 − μ̂(𝑋𝑖)}²] / [(1/𝑛) ∑𝑖 (𝑌𝑖 − 𝑌̄𝑛)²], as in Equation (10).

This may also be embedded in a split-sample validation scheme, by first creating training and validation sets, then obtaining μ̂ and μ̂𝑠 on the training set as outlined above, and finally, obtaining an estimator of 𝜓0,𝑠 by using the validation data along with predictions from the conditional mean estimators on the validation data. This can be extended to a cross-fitted procedure given in Algorithm 2 and discussed in the Supporting Information.

2.3 Asymptotic behavior of the proposed estimator

By studying the remainder term 𝑅𝑠(𝑃̂, 𝑃0) and the empirical process term 𝐻𝑠,𝑛(𝑃̂, 𝑃0), we can establish appropriate conditions on μ̂ and μ̂𝑠 under which the proposed estimator 𝜓̂𝑛,𝑠 is asymptotically efficient. This allows us to determine the asymptotic distribution of the proposed estimator and, therefore, to propose procedures for performing valid

inference on 𝜓0,𝑠. Below, we will make reference to the following conditions, in which we have defined the conditional outcome variance 𝜏0² ∶ 𝑥 ↦ 𝑣𝑎𝑟𝑃0(𝑌 ∣ 𝑋 = 𝑥).

(A1) max[∫ {μ̂(𝑥) − μ0(𝑥)}² 𝑑𝑃0(𝑥), ∫ {μ̂𝑠(𝑥) − μ0,𝑠(𝑥)}² 𝑑𝑃0(𝑥)] = 𝑜𝑃(𝑛^{−1/2});
(A2) there exists a 𝑃0-Donsker class ℱ0 such that 𝑃0(𝜑𝑃̂,𝑠 ∈ ℱ0) ⟶ 1;
(A3) there exists a constant 𝐾 > 0 such that each of μ0, μ̂, μ̂𝑠, and 𝜏0² has range contained uniformly in (−𝐾, +𝐾) with probability tending to one as sample size tends to +∞.

First, it is straightforward to verify that linearization (6) holds with second-order remainder term 𝑅𝑠(𝑃, 𝑃0) = ∫ {μ𝑃,𝑠(𝑥) − μ0,𝑠(𝑥)}² 𝑑𝑃0(𝑥) − ∫ {μ𝑃(𝑥) − μ0(𝑥)}² 𝑑𝑃0(𝑥). It follows then that condition (A1) suffices to ensure that 𝑅𝑠(𝑃̂, 𝑃0) is asymptotically negligible, that is, that 𝑅𝑠(𝑃̂, 𝑃0) = 𝑜𝑃(𝑛^{−1/2}). Each of the second-order terms in condition (A1) can feasibly be made negligible, even while using flexible regression techniques, including generalized additive models (Hastie and Tibshirani, 1990), to estimate the conditional mean functions. We thus turn our attention to 𝐻𝑠,𝑛(𝑃̂, 𝑃0). By empirical process theory, we have that 𝐻𝑠,𝑛(𝑃̂, 𝑃0) = 𝑜𝑃(𝑛^{−1/2}) provided, for example, ∫ {𝜑𝑃̂,𝑠(𝑧) − 𝜑𝑃0,𝑠(𝑧)}² 𝑑𝑃0(𝑧) tends to zero in probability and condition (A2) holds (Lemma 19.24 of van der Vaart, 2000). For the former, uniform consistency of μ̂ and μ̂𝑠 under 𝐿2(𝑃0) suffices under condition (A3). We note that if there is a known bound on the outcome support, condition (A3) will readily be satisfied provided the learning algorithms used incorporate this knowledge. For the latter, the set of possible realizations of μ̂ and μ̂𝑠 must become sufficiently restricted with probability tending to one as sample size grows. This condition is satisfied if, for example, the uniform sectional variation norm of 𝜑𝑃̂,𝑠 is bounded with probability tending to one (Gill et al., 1995). When using flexible machine learning-based regression estimators, there may be reason for concern regarding the validity of condition (A2). In such cases, using the cross-fitted estimator 𝜓̂cv𝑛,𝑠 may circumvent this condition. While this cross-fitted estimator is only slightly more complex, we restrict attention here to studying the simpler estimator and leave study of the cross-fitted estimator to the Supporting Information.

Algorithm 2 Estimating 𝜓0,𝑠 using 𝑉-fold cross fitting
1: Choose a technique to estimate the conditional means μ0 and μ0,𝑠;
2: Generate a random vector 𝐵𝑛 ∈ {1, … , 𝑉}ⁿ by sampling uniformly from {1, … , 𝑉} with replacement, and denote by 𝐷𝑗 the subset of observations with index in {𝑖 ∶ 𝐵𝑛,𝑖 = 𝑗} for 𝑗 = 1, … , 𝑉.
3: for 𝑣 = 1, … , 𝑉 do
4: μ̂𝑣 ← Regress 𝑌 on 𝑋 using the data in ∪𝑗≠𝑣𝐷𝑗 using the technique from step (1) to estimate μ0;
5: μ̂𝑠,𝑣 ← Regress μ̂𝑣(𝑋) on 𝑋−𝑠 using the data in ∪𝑗≠𝑣𝐷𝑗 to estimate μ0,𝑠;
6: 𝜓̂𝑛,𝑠,𝑣 ← [∑𝑖∈𝐷𝑣 {𝑌𝑖 − μ̂𝑠,𝑣(𝑋𝑖)}² − ∑𝑖∈𝐷𝑣 {𝑌𝑖 − μ̂𝑣(𝑋𝑖)}²] / ∑𝑖∈𝐷𝑣 (𝑌𝑖 − 𝑌̄𝑛)², as in Equation (10);
7: end for
8: 𝜓̂cv𝑛,𝑠 ← (1/𝑉) ∑𝑣 𝜓̂𝑛,𝑠,𝑣.

The following theorem describes the asymptotic behavior of the proposed estimator.

Theorem 1. Provided conditions (A1)–(A3) hold, 𝜓̂𝑛,𝑠 is asymptotically linear with influence function 𝜑*𝑃0,𝑠. In particular, this implies that (a) 𝜓̂𝑛,𝑠 tends to 𝜓0,𝑠 in probability, and, if 𝜓0,𝑠 ∈ (0, 1), (b) 𝑛^{1/2}(𝜓̂𝑛,𝑠 − 𝜓0,𝑠) tends in distribution to a mean-zero Gaussian random variable with variance 𝜎0,𝑠² ∶= ∫ 𝜑*𝑃0,𝑠(𝑧)² 𝑑𝑃0(𝑧).

A natural plug-in estimator of 𝜎0,𝑠² is given by 𝜎̂𝑛,𝑠² ∶= (1/𝑛) ∑𝑖 𝜑̂*𝑃0,𝑠(𝑍𝑖)², where 𝜑̂*𝑃0,𝑠 is any consistent estimator of 𝜑*𝑃0,𝑠. For example, 𝜑̂*𝑃0,𝑠 may be taken to be 𝜑*𝑃0,𝑠 with μ0, μ0,𝑠, 𝐸𝑃0(𝑌), 𝑣𝑎𝑟𝑃0(𝑌), and 𝜃0,𝑠 replaced by μ̂, μ̂𝑠, 𝑌̄𝑛, 𝑣𝑎𝑟ℙ𝑛(𝑌), and 𝜃̂𝑛,𝑠, respectively. In view of the asymptotic normality of 𝑛^{1/2}(𝜓̂𝑛,𝑠 − 𝜓0,𝑠), an asymptotically valid (1 − 𝛼) × 100% Wald-type confidence interval for 𝜓0,𝑠 ∈ (0, 1) can be obtained as 𝜓̂𝑛,𝑠 ± 𝑞1−𝛼/2 𝜎̂𝑛,𝑠 𝑛^{−1/2}, where 𝑞𝛽 is the 𝛽-quantile of the standard normal distribution.

To underscore the importance of using the proposed debiased procedure, we recall that, in contrast to 𝜓̂𝑛,𝑠, the naive ANOVA-based estimator is generally not asymptotically linear when flexible (eg, machine learning based) estimators of the involved regressions are used. It will usually be overly biased, resulting in a rate of convergence slower than 𝑛^{−1/2}. Constructing valid confidence intervals based on the naive estimator can thus be difficult. It may be tempting to consider bootstrap resampling as a remedy. However, this is not advisable since, besides the computational burden of such an approach, there is little theory to justify using the standard nonparametric bootstrap in this context, particularly for the naive ANOVA-based estimator (Shao, 1994).

2.4 Behavior under the zero-importance null hypothesis

This work focuses on efficient estimation of a population-level algorithm-agnostic variable importance measure using flexible estimation techniques and on describing how valid inference may be drawn when the set 𝑠 of features under evaluation does not have degenerate
importance. Specifically, we have restricted our attention to cases in which 𝜓0,𝑠 ∈ (0, 1) strictly and provided confidence intervals valid in such cases. It may be of interest, however, to test the null hypothesis 𝜓0,𝑠 = 0 of zero importance. Developing valid and powerful tests of this particular null hypothesis is difficult. Because the null hypothesis is on the boundary of the parameter space, 𝜑𝑃0,𝑠 is identically zero under this null, and a higher order expansion may be required to construct and characterize the behavior of an appropriately-regularized estimator of 𝜃0,𝑠—and thus of 𝜓0,𝑠—with good power. However, the parameters Θ𝑠 and Ψ𝑠 are generally not second-order pathwise differentiable in nonparametric models, and so higher order expansions cannot easily be constructed. There may be hope in using approximate second-order gradients, as outlined in Carone et al. (2018), though this remains an open problem. A crude alternative solution based on sample splitting is investigated in Williamson et al. (2020). To highlight the difficulties that arise under this particular null hypothesis, we conducted a simulation study for a setting in which one of the variables has zero importance. The results from this study are provided in the next section.

3 EXPERIMENTS ON SIMULATED DATA

We now present empirical results describing the performance of the proposed estimator (9) compared to that of the naive plug-in estimator (8). In all implementations, we use the sequential regression estimating procedure described in Algorithm 1 for each feature or group of interest to compute compatible estimates of the required regression functions, and we compute nominal 95% Wald-type confidence intervals as outlined in Section 2.3.

3.1 Low-dimensional vector of features

We consider here data generated according to the following specification:

𝑋1, 𝑋2 ∼ Uniform(−1, 1) (iid) and
𝜖 ∼ 𝑁(0, 1) independent of (𝑋1, 𝑋2)
𝑌 = 𝑋1²(𝑋1 + 7∕5) + (25∕9)𝑋2² + 𝜖.

We generated 1000 random datasets of size 𝑛 ∈ {100, 300, 500, 700, 1000, 2000, … , 8000} and considered in each case the importance of 𝑋1 and of 𝑋2. The true value of the variable importance measures implied by this data-generating mechanism can be shown to be 𝜓0,1 ≈ 0.158 and 𝜓0,2 ≈ 0.342. This nonlinear setting helps to highlight the drawbacks of relying on a simple parametric model to estimate the conditional means.

To obtain μ̂, μ̂1, and μ̂2, we fit locally constant loess smoothing using the R function loess with tuning selected to minimize a fivefold cross-validated estimate of the empirical risk based on the squared error loss function. Loess smoothing was chosen because it is a data-adaptive algorithm with an efficient implementation, and it satisfies the minimum convergence rate condition outlined in Section 2.3, allowing us to numerically verify our theoretical results. Because we obtained the same trends using locally constant kernel regression, we do not report summaries from these additional simulations here. This fact nevertheless highlights the ease of comparing results from two different estimation techniques.

We computed the naive and proposed estimators and respective confidence intervals for each replication and compared these to a parametric difference in 𝑅² based on simple linear regression using ordinary least squares (OLS). Because a simple asymptotic distribution for the naive estimator is unavailable, a percentile bootstrap approach with 1000 bootstrap samples was used in an attempt to obtain approximate confidence intervals based on 𝜓̂ naive,j. For each estimator, we then computed the empirical bias scaled by 𝑛1∕2 and the empirical variance scaled by 𝑛. Our output for the estimated bias includes confidence intervals for the true bias based on the resulting draws from the bootstrap sampling distribution. Finally, we computed the empirical coverage of the nominal 95% confidence intervals constructed.

Figure 1 displays the results of this simulation. In the left panel, we note that the scaled empirical bias of the proposed estimator decreases towards zero as 𝑛 tends to infinity, regardless of which feature we remove. Also, we see that both the naive estimator and the OLS estimator have substantial bias that does not tend to zero faster than 𝑛−1∕2. This coincides with our expectations: the naive estimator involves an inadequate bias-variance trade-off with respect to the parameter of interest and does not include any debiasing; the OLS estimator is based on a misspecified mean model. Though there is very substantial bias reduction from using the proposed estimator, we see that its scaled bias appears to dip slightly below zero for large 𝑛. We expect for larger 𝑛 to see this scaled bias for the proposed estimator get closer to zero; numerical error in our computations may explain why this does not exactly happen. These results provide empirical evidence that the debiasing step is necessary to account for the slow rates of convergence in estimation of 𝜓0,𝑠 introduced because μ0 and μ0,𝑠 are flexibly estimated.
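The true values reported above can be checked directly by Monte Carlo, since the conditional mean functions implied by this mechanism are available in closed form. The sketch below uses the known true regressions rather than the loess-based estimators studied in the simulation, so it only verifies the population quantities 𝜓0,1 and 𝜓0,2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
y = x1**2 * (x1 + 7/5) + (25/9) * x2**2 + rng.normal(size=n)

# Conditional means implied by the data-generating mechanism:
mu = x1**2 * (x1 + 7/5) + (25/9) * x2**2   # E[Y | X1, X2]
mu_s1 = 7/15 + (25/9) * x2**2              # E[Y | X2]; note E[X1^3 + (7/5)X1^2] = 7/15
mu_s2 = x1**2 * (x1 + 7/5) + 25/27         # E[Y | X1]; note E[(25/9)X2^2] = 25/27

# R^2-difference (equivalently, ANOVA-type) importance of each feature
var_y = y.var()
psi_1 = (mu.var() - mu_s1.var()) / var_y
psi_2 = (mu.var() - mu_s2.var()) / var_y
print(round(psi_1, 3), round(psi_2, 3))    # approximately 0.158 and 0.342
```

With 10⁶ draws, the Monte Carlo error is on the order of 10⁻³, so the output agrees with the reported values to roughly three decimals.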
FIGURE 1 Empirical bias (scaled by 𝑛1∕2) with Monte Carlo error bars, empirical variance (scaled by 𝑛), and empirical coverage of nominal 95% confidence intervals for the proposed, naive, and OLS estimators for either feature, using loess smoothing with cross-validation tuning (in the case of the proposed and naive estimators). Circles, filled diamonds, and filled squares denote that we have removed 𝑋1; stars, crossed diamonds, and empty squares denote that we have removed 𝑋2. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
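For concreteness, the three summaries plotted in Figure 1 (scaled empirical bias, scaled empirical variance, and empirical coverage) can be computed from replicated estimates as follows. The replicated estimates here are simulated stand-ins drawn from a normal distribution with a hypothetical standard error, not output of the actual estimators:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, psi_true = 1000, 500, 0.158
se = 0.5 / np.sqrt(n)                       # hypothetical standard error at sample size n
est = rng.normal(psi_true, se, size=reps)   # stand-in estimates across replications

scaled_bias = np.sqrt(n) * (est.mean() - psi_true)   # empirical bias scaled by n^(1/2)
scaled_var = n * est.var(ddof=1)                     # empirical variance scaled by n
low, high = est - 1.96 * se, est + 1.96 * se         # nominal 95% Wald-type intervals
coverage = ((low <= psi_true) & (psi_true <= high)).mean()
print(round(scaled_bias, 2), round(scaled_var, 2), round(coverage, 3))
```

Because the stand-in estimator is unbiased with variance of order 1∕𝑛, the scaled bias hovers near zero, the scaled variance stabilizes near a constant, and the coverage sits near the nominal 95% level, mirroring the behavior expected of the proposed estimator.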
In the middle panel of Figure 1, we see that the variance of the proposed estimator is close to that of the naive estimator—we have thus not suffered much from removing excess bias in our estimation procedure. The variance of the OLS estimator is the smallest of the three: using a parametric model tends to result in a smaller variance. The ratio of the variance of the naive estimator to that of the proposed estimator is near one for all 𝑛 considered and ranges between approximately 0.8 and 1.2 in our simulation study. Finally, in the right-hand panel, we see that as sample size grows, coverage increases for the confidence interval based on the proposed estimator and approaches the nominal level. In contrast, the coverage of intervals based on both the naive estimator and the OLS estimator decreases instead and quickly becomes poor.

3.2 Testing the zero-importance null hypothesis

We now consider data generated according to the following specification:

𝑋1, 𝑋2 ∼ Uniform(−1, 1) (iid) and
𝜖 ∼ 𝑁(0, 1) independent of (𝑋1, 𝑋2); 𝑌 = (25∕9)𝑋1² + 𝜖.

We generated 1000 random datasets of size 𝑛 ∈ {100, 300, 500, 700, 1000, 2000, 3000} and again considered in each case the importance of 𝑋1 and of 𝑋2. The true value of the variable importance measures implied by this data-generating mechanism can be shown to be 𝜓0,1 ≈ 0.407 and 𝜓0,2 = 0. We estimated the conditional means and summarized the results of this simulation as in the previous simulation.

Figure 2 displays the results of this simulation. In the left-hand panel, we observe that the proposed estimator has smaller scaled bias in magnitude than the naive estimator when we remove the feature with nonzero importance. However, when we remove the feature with zero importance, the proposed estimator has slightly higher bias. While this is somewhat surprising, it likely is due to the additive correction in the one-step construction being slightly too large. The scaled bias of the proposed estimator tends to zero as 𝑛 increases for both features, which is not true of the naive estimator. In the middle panel, we see that we have not incurred excess variance by using the proposed estimator. In the right-hand panel, we see that both estimators have close to zero coverage for the parameter under the null hypothesis, but that the proposed estimator has higher coverage than the naive estimator for the predictive feature. These results highlight that more work needs to be done for valid testing and estimation under this boundary null hypothesis. While our current proposal yields valid results for the predictive feature, even in the presence of a null feature, ensuring valid inference for null features themselves remains an important challenge.

3.3 Moderate-dimensional vector of features

We consider one setting in which the features are independent and a second in which groups of features are correlated. In setting 𝐴, we generated data according to the
FIGURE 2 Empirical bias (scaled by 𝑛1∕2) with Monte Carlo error bars, empirical variance (scaled by 𝑛), and empirical coverage of nominal 95% confidence intervals for the proposed and naive estimators for either feature, using loess smoothing with cross-validation tuning. Circles and filled diamonds denote that we have removed 𝑋1, while stars and crossed diamonds denote that we have removed 𝑋2. We operate under the null hypothesis for 𝑋2, that is, 𝜓0,2 = 0. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
following specification:

𝑋1, 𝑋2, … , 𝑋15 ∼ 𝑁(0, 4) (iid) and
𝜖 ∼ 𝑁(0, 1) independent of (𝑋1, 𝑋2, … , 𝑋15)
𝑌 = 𝐼(−2,+2)(𝑋1) ⋅ ⌊𝑋1⌋ + 𝐼(−∞,0](𝑋2) + 𝐼(0,+∞)(𝑋3) + |𝑋6∕4|³ + |𝑋7∕4|⁵ + (7∕3) cos(𝑋11∕2) + 𝜖.

In setting 𝐵, the covariate distribution was modified to include clustering. Specifically, we generated (𝑋1, 𝑋2, … , 𝑋15) ∼ 𝑀𝑉𝑁15(μ, Σ), where the mean vector is

μ = 3 × (0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0) − 2 × (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

and the variance–covariance matrix is block-diagonal with blocks

⎡ 1    0.15 0.15⎤   ⎡ 1   0.5 0.5⎤       ⎡ 1    0.85 0.85⎤
⎢0.15  1    0.15⎥ , ⎢0.5  1   0.5⎥ , and ⎢0.85  1    0.85⎥
⎣0.15  0.15 1   ⎦   ⎣0.5  0.5 1  ⎦       ⎣0.85  0.85 1   ⎦

and all other off-diagonal entries equal to zero. The random error 𝜖 and the outcome 𝑌 are then generated as in setting 𝐴. In both settings, we generated 500 random datasets of size 𝑛 ∈ {100, 300, 500, 1000} and considered the importance of the feature sets {1, 2, 3, 4, 5}, {6, … , 10}, and {11, … , 15} for each sample size. The true value of the variable importance measures corresponding to each of the considered groups in both settings is given in Table 1. Results for the analysis of additional groupings are provided in the Supporting Information.

TABLE 1 Approximate values of 𝜓0,𝑠 for each simulation setting and group considered in the moderate-dimensional simulations in Section 3.3

Group                  Setting 𝐴   Setting 𝐵
(𝑋1, 𝑋2, … , 𝑋5)       0.295       0.281
(𝑋6, 𝑋7, … , 𝑋10)      0.240       0.314
(𝑋11, 𝑋12, … , 𝑋15)    0.242       0.179

For each scenario considered, we estimated the conditional mean functions with gradient-boosted trees (Friedman, 2001) fit using GradientBoostingRegressor in the sklearn module in Python. Gradient-boosted trees were used due to their generally favorable prediction performance and large degree of flexibility, with full knowledge that they are not guaranteed to satisfy the minimum rate condition outlined in Section 2.3. We used fivefold cross-validation to select the optimal number of trees with one node as well as the optimal learning rate for the algorithm. We summarized the results of these simulations in the same manner as in the low-dimensional simulations.

The results for setting 𝐴 are presented in Figure 3. From the top row, we note that as sample size increases, the scaled empirical bias of the proposed estimator approaches zero, whereas that of the naive estimator increases in
FIGURE 3 Top row: empirical bias for the proposed and naive estimators scaled by 𝑛1∕2 for setting 𝐴, based on gradient-boosted trees. Bottom row: empirical coverage of nominal 95% confidence intervals for the proposed and naive estimators for setting 𝐴, using gradient-boosted trees. We consider all 𝑠 combinations from Table 1. Diamonds denote the naive estimator, and circles denote the proposed estimator. Monte Carlo error bars are displayed vertically. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
magnitude across all subsets considered. From the bottom row, we observe that the empirical coverage of intervals based on the proposed estimator increases toward the nominal level as sample size increases and is uniformly higher than the empirical coverage of bootstrap intervals based on the naive estimator.

The results for setting 𝐵 are presented in Figure 4. From the top row, we note some residual bias in the proposed estimator for 𝑠 = {11, … , 15}. Larger samples may be needed to observe more thorough bias reduction—indeed, this group of features is that with the highest within-group correlation. Nevertheless, the scaled empirical bias of the proposed estimator approaches zero as sample size increases for both 𝑠 = {1, … , 5} and 𝑠 = {6, … , 10}. In all cases, the scaled empirical bias of the naive estimator increases in magnitude as sample size increases. In the bottom row, we again see that intervals based on the proposed estimator have uniformly higher coverage than those based on the naive estimator.

The proposed estimator performs substantially better than the naive estimator in these simulations, though higher levels of correlation appear to be associated with relatively poorer point and interval estimator performance. This suggests that it may be wise to consider in practice the importance of entire groups of correlated predictors rather than that of individual features. This is a sensible approach for dealing with correlated features, which necessarily render variable importance assessment challenging. In our simulations, the empirical coverage of proposed intervals for the importance of a group of highly correlated features approaches the nominal level as sample size increases, indicating that the proposed approach does yield good results in such cases.

Use of the proposed estimator results in better point and interval estimation performance than the naive estimator in the presence of null features. This is illustrated, for example, when evaluating the importance of the group (𝑋1, … , 𝑋5), in which case most other features (ie,
FIGURE 4 Top row: empirical bias for the proposed and naive estimators scaled by 𝑛1∕2 for setting 𝐵, using gradient-boosted trees. Bottom row: empirical coverage of nominal 95% confidence intervals for the proposed and naive estimators for setting 𝐵, using gradient-boosted trees. We consider all 𝑠 combinations from Table 1. Diamonds denote the naive estimator, and circles denote the proposed estimator. Monte Carlo error bars are displayed vertically. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
𝑋8, 𝑋9, 𝑋10, 𝑋12, … , 𝑋15) have null importance. However, as before, we expect the behavior of point and interval estimators for the variable importance of null features to be poorer. Additional work on valid estimation and testing under this null hypothesis is necessary.

4 RESULTS FROM THE SOUTH AFRICAN HEART DISEASE STUDY DATA

We consider a subset of the data from the Coronary Risk Factor Study (Rosseauw et al., 1983), a retrospective cross-sectional sample of 462 white males aged 15–64 in a region of the Western Cape, South Africa; these data are publicly available in Hastie et al. (2009). The primary aim of this study was to establish the prevalence of ischemic heart disease risk factors in this high-incidence region. For each participant, the presence or absence of myocardial infarction (MI) at the time of the survey is recorded, yielding 160 cases of MI. In addition, measurements on two sets of features are available: behavioral features, including cumulative tobacco consumption, current alcohol consumption, and type A behavior, a behavioral pattern linked to stress (Friedman and Rosenman, 1971); and biological features, including systolic blood pressure, low-density lipoprotein (LDL) cholesterol, adiposity (similar to body mass index), family history of heart disease, obesity, and age.

We considered the importance of each feature separately, as well as that of these two groups of features, when predicting the presence or absence of MI. We estimated the conditional means using the sequential regression estimating procedure outlined in Section 2.2 and using the Super Learner (van der Laan et al., 2007) via the SuperLearner R package. The Super Learner is a particular implementation of stacking (Wolpert, 1992), and the
FIGURE 5 Estimates from the South African heart disease study for the proposed and naive estimators of the variable importance parameter, on left and right, respectively. The Super Learner, with a library including the elastic net, generalized additive models, gradient-boosted trees, and random forests, was used
resulting estimator is guaranteed to have the same risk as the oracle estimator, asymptotically, along with finite-sample guarantees (van der Laan et al., 2007). Our library of candidate learners consists of boosted trees, generalized additive models, elastic net, and random forests implemented in the R packages gbm, gam, glmnet, and randomForest, respectively, each with varying tuning parameters. Tenfold cross-validation was used to determine the optimal convex combination of these learners chosen to minimize the cross-validated mean-squared error. This process allowed the Super Learner to determine the optimal tuning parameters for the individual algorithms as part of its optimal combination, and our resulting estimator of the conditional means is the optimal convex combination of the individual algorithms. Finally, we produced confidence intervals based on the proposed estimator alone, since as we have seen earlier, intervals based on the naive estimator are generally invalid.

The results are presented in Figure 5. The ordering is slightly different in the two plots; this is not surprising, since the one-step procedure should eliminate excess bias in the naive estimator introduced by estimating the conditional means using flexible learners. We find that biological factors are more important than behavioral factors. The most important individual feature is family history of heart disease; family history has been found to be a risk factor of MI in previous studies. It appears scientifically sensible that both groups of features are more important than any individual feature besides family history.

We compared these results to a logistic regression model fit to these data. Based on the absolute values of 𝑧-statistics, logistic regression picks age as most important, followed by family history. This slight difference is captured in our uncertainty estimates (Figure 5): there, we see that the point estimates for age and family history are close, and their confidence intervals largely overlap. We find the same pattern for LDL cholesterol and tobacco consumption, the third- and fourth-ranked variables by logistic regression. While our results match closely with the simplest approach to analyzing variable importance in these data, our proposed method is not dependent on a single estimation technique, such as logistic regression. The use of more flexible learners to estimate 𝜓0,𝑠, as we have done in this analysis, renders our findings less likely to be driven by potential model misspecification.

5 CONCLUSION

We have obtained novel results for a familiar measure of variable importance, interpreted as the additional proportion of variability in the outcome explained by including a subset of the features in the conditional mean outcome relative to the entire covariate vector. This parameter can be readily seen as a nonparametric extension of the classical 𝑅²-based measure, and it provides a description of the true relationship between the outcome and covariates rather than an algorithm-specific measure of association. We have also studied the properties of this parameter and derived its nonparametric EIF. We found that the form of the variable importance measure under consideration can have a dramatic impact on the ease with which efficient estimators may be constructed—for example, debiasing is needed for ANOVA-based plug-in estimators using flexible learners, but not for plug-in estimators based on the difference in 𝑅² values. We provide general results
describing this phenomenon in Williamson et al. (2020). Leveraging tools from semiparametric theory, we have described the construction of an asymptotically efficient estimator of the true variable importance measure built upon flexible, data-adaptive learners. We have studied the properties of this estimator, notably providing distributional results, and described the construction of asymptotically valid confidence intervals. In simulations, we found the proposed estimator to have good practical performance, particularly as compared to a naive estimator of the proposed variable importance measure. However, we found this performance to depend very much on whether or not the true variable importance measure equals zero. When it does, a limiting distribution is not readily available, and significant theoretical developments seem needed in order to perform valid and powerful inference. However, for those features with true importance, the behavior of point and interval estimates is not influenced by the presence of null features. While the parameter we have studied has broad interpretability, alternative measures of variable importance may also be useful in certain settings (eg, difference in the area under the receiver operating characteristic curve in the context of a binary outcome). We study such measures in Williamson et al. (2020).

For each candidate set of variables, the estimation procedure we proposed requires estimation of two conditional mean functions. To guarantee that our estimator has good properties, these conditional means must be estimated well. For this reason, and as was illustrated in our work, we recommend the use of model stacking with a wide range of candidate learners, ranging from parametric to fully nonparametric algorithms. This flexibility mitigates concerns regarding model misspecification. Additionally, we suggest the use of sequential regressions to minimize any incompatibility between the two conditional means estimated.

A multiple testing issue arises when inference is desired on many feature subsets. Of course, a Bonferroni approach may be easily implemented. Alternatively, we could use a consistent estimator of the variance-covariance matrix for the importance of all subsets of features under study, obtained using the influence functions exhibited in this paper. This alternative multiple-testing adjustment has improved power over a Bonferroni-type approach. Strategies based on this approach are described, for example, in Dudoit and van der Laan (2007).

ACKNOWLEDGMENTS
The authors thank Prof. Antoine Chambaz for insightful comments that improved this manuscript. This work was supported by the National Institutes of Health (NIH) through awards F31AI140836, DP5OD019820, R01AI029168 and UM1AI068635. The opinions expressed in this article are those of the authors and do not necessarily represent the official views of the NIH.

ORCID
Brian D. Williamson https://orcid.org/0000-0002-7024-548X
Marco Carone https://orcid.org/0000-0003-2106-0953

REFERENCES
Barron, A. (1989) Statistical properties of artificial neural networks. In Proceedings of the 28th IEEE Conference on Decision and Control. Piscataway, NJ: IEEE, pp. 280–285.
Bickel, P., Klaasen, C., Ritov, Y. and Wellner, J. (1998) Efficient and Adaptive Estimation for Semiparametric Models. Berlin: Springer.
Breiman, L. (2001) Random forests. Machine Learning, 45, 5–32.
Carone, M., Diaz, I. and van der Laan, M. (2018) Higher-order targeted loss-based estimation. In Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. Berlin: Springer, pp. 483–510.
Chambaz, A., Neuvial, P. and van der Laan, M. (2012) Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic Journal of Statistics, 6, 1059–1099.
Cleveland, W. (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836.
Doksum, K. and Samarov, A. (1995) Nonparametric estimation of global functionals and a measure of the explanatory power of covariates in regression. The Annals of Statistics, 23, 1443–1473.
Doksum, K., Tang, S. and Tsui, K.-W. (2008) Nonparametric variable selection: the EARTH algorithm. Journal of the American Statistical Association, 103, 1609–1620.
Dudoit, S. and van der Laan, M. (2007) Multiple Testing Procedures with Applications to Genomics. Berlin: Springer Science & Business Media.
Friedman, J. (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29, 1189–1232.
Friedman, M. and Rosenman, R. (1971) Type A behavior pattern: its association with coronary heart disease. Annals of Clinical Research, 3, 300–312.
Gill, R., van der Laan, M. and Wellner, J. (1995) Inefficient estimators of the bivariate survival function for three models. Annales de l'Institut Henri Poincaré Probabilités et Statistiques, 31, 545–597.
Grömping, U. (2009) Variable importance in regression: linear regression versus random forest. The American Statistician, 63, 308–319.
Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models, volume 43. Boca Raton, FL: CRC Press.
Hastie, T., Tibshirani, R. and Friedman, J. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Berlin: Springer.
Huang, L. and Chen, J. (2008) Analysis of variance, coefficient of determination and F-test for local polynomial regression. The Annals of Statistics, 36, 2085–2109.
Ishwaran, H. (2007) Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1, 519–537.
Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. and Wasserman, L. (2018) Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113, 1094–1111.
Loh, W.-Y. (2002) Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12, 361–386.
Olden, J., Joy, M. and Death, R. (2004) An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecological Modelling, 173, 389–397.
Pfanzagl, J. (1982) Contributions to a General Asymptotic Statistical Theory. Berlin: Springer.
Rosseauw, J., Du Plessis, J., Benade, A., Jordann, P., et al. (1983) Coronary risk factor screening in three rural communities. South African Medical Journal, 64, 430–436.
Sapp, S., van der Laan, M. and Page, K. (2014) Targeted estimation of binary variable importance measures with interval-censored outcomes. The International Journal of Biostatistics, 10, 77–97.
Shao, J. (1994) Bootstrap sample size in nonregular cases. Proceedings of the American Mathematical Society, 122, 1251–1262.
Strobl, C., Boulesteix, A., Zeileis, A. and Hothorn, T. (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics, 8, 1.
van der Laan, M. (2006) Statistical inference for variable importance. The International Journal of Biostatistics.
van der Laan, M., Polley, E. and Hubbard, A. (2007) Super learner. Statistical Applications in Genetics and Molecular Biology, 6, Online Article 25.
van der Vaart, A. (2000) Asymptotic Statistics, volume 3. Cambridge, UK: Cambridge University Press.
Williamson, B., Gilbert, P., Simon, N. and Carone, M. (2020) A unified approach for inference on algorithm-agnostic variable importance. arXiv:2004.03683.
Wolpert, D. (1992) Stacked generalization. Neural Networks, 5, 241–259.
Yao, F., Müller, H. and Wang, J. (2005) Functional linear regression analysis for longitudinal data. The Annals of Statistics, 33, 2873–2903.

SUPPORTING INFORMATION
Web Appendices, Tables, and Figures referenced in Sections 1 and 2 are available with this paper at the Biometrics website on Wiley Online Library. These include all technical details, and additional simulation results and data analyses. Code to reproduce all results is available online at https://github.com/bdwilliamson/vimpaper_supplement. All methods are implemented in R package vimp and Python package vimpy, both available on CRAN (https://cran.r-project.org/web/packages/vimp/index.html) and PyPI (https://pypi.org/project/vimpy/), respectively.

How to cite this article: Williamson BD, Gilbert PB, Carone M, Simon N. Nonparametric variable importance assessment using machine learning techniques. Biometrics. 2021;77:9–22. https://doi.org/10.1111/biom.13392