
Transplant International

REVIEW

Five myths about variable selection


Georg Heinze & Daniela Dunkler

Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria

SUMMARY

Multivariable regression models are often used in transplantation research to identify or to confirm baseline variables which have an independent association, causal or only evidenced by statistical correlation, with transplantation outcome. Although sound theory is lacking, variable selection is a popular statistical method which seemingly reduces the complexity of such models. However, in fact, variable selection often complicates analysis as it invalidates common tools of statistical inference such as P-values and confidence intervals. This is a particular problem in transplantation research, where sample sizes are often only small to moderate. Furthermore, variable selection requires computer-intensive stability investigations and a particularly cautious interpretation of results. We discuss how five common misconceptions often lead to inappropriate application of variable selection. We emphasize that variable selection, and all problems related to it, can often be avoided by the use of expert knowledge.

Correspondence
Georg Heinze PhD, Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Spitalgasse 23, 1090 Vienna, Austria. Tel.: +4314040066880; fax: +4314040066870; e-mail: georg.heinze@meduniwien.ac.at

Transplant International 2017; 30: 6–10. doi:10.1111/tri.12895

Key words
association, explanatory models, multivariable modeling, prediction, statistical analysis

Received: 12 September 2016; Revision requested: 14 October 2016; Accepted: 25 November 2016

Observational studies are often conducted in a prognostic or etiologic research context [1]. To this end, many papers dealing with observational studies use multivariable regression approaches to identify important predictors of an outcome [2–4] or to assess effects of new markers adjusted for known clinical predictors [5–7], respectively. As the number of candidate predictor variables or known confounders can be large, a “full” model including all candidate predictors as explanatory variables is often considered impractical for clinical use or, in the extreme case, is even impossible to estimate because of multicollinearity. Therefore, variable selection approaches are often employed, mostly based on evaluating P-values for testing regression coefficients against zero. For example, Martinez-Selles et al. [8] used univariate screening of effects to build a multivariable model for survival after heart transplantation. After univariate selection, Rodriguez-Peralvarez et al. [9] employed “backward elimination”, which means that first a multivariable model was built with all predictors selected by univariate screening in a first step; then, nonsignificant predictor variables were sequentially eliminated and the models re-estimated until all variables remaining in the model showed a significant association with the outcome. The technique of backward selection is sometimes also applied directly to a set of predictor variables [10–13], or the process of variable selection is reversed by “forward selection” [5,14], meaning that candidate predictors are sequentially included in a model if their association with the outcome variable, on top of the set of variables already in the model, is significant. Box 1 provides an overview of the most common approaches of variable selection and explains some statistical notions commonly used in this context.
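To make the mechanics concrete, the following minimal Python sketch (an illustration only, not the procedure used in the cited studies; the data, variable names, and the 0.05 threshold are invented) runs P-value-based backward elimination for a logistic regression on simulated data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated data set: 500 patients, 6 candidate predictors, binary outcome.
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 6)),
                 columns=[f"x{i}" for i in range(1, 7)])
# Only x1 and x2 truly influence the outcome in this simulation.
lin_pred = 0.8 * X["x1"] - 0.6 * X["x2"]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin_pred)))

def backward_elimination(X, y, alpha=0.05):
    """Drop the least significant predictor until all P-values are below alpha."""
    kept = list(X.columns)
    while kept:
        fit = sm.Logit(y, sm.add_constant(X[kept])).fit(disp=0)
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < alpha:
            return fit, kept
        kept.remove(worst)          # eliminate the weakest variable and refit
    return None, kept

fit, selected = backward_elimination(X, y)
print("selected:", selected)
print(fit.params.round(3))
```

As discussed below, the threshold and the set of candidate variables should be fixed before looking at the data, and the selected model should not be reported as if it had been prespecified.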

Box 1. Glossary

AIC: Akaike information criterion. A number based on information theory expressing the model fit (log likelihood) discounted by the number of unknown parameters.

Augmented backward elimination: An extension of backward elimination recently proposed by Dunkler et al. [23], which can consider the change-in-estimate as an additional selection criterion. Usually leads to selection of more variables than standard backward elimination. Preferable method for etiologic models.

Backward elimination: Remove insignificant predictors from a model one-by-one until all remaining variables are significant. The preferable method for prognostic models if enough data are available.

Bayes factor: Ratio of the likelihoods of two competing models.

Change-in-estimate: The magnitude by which the regression coefficient of a variable X changes if a variable Z is removed from a multivariable model.

EPV: Events per variable. A simple measure of the amount of information in a data set (the number of events, or the sample size) relative to the number of regression coefficients to be estimated (the number of variables). Note that nonselection of a variable corresponds to an estimated regression coefficient of zero, and thus this calculation should always consider all candidate variables.

Etiologic models: Statistical models used to explain the role of a risk factor or treatment in its (possibly causal) effect on patient outcome.

Forward selection: Add significant candidate predictors to a model one-by-one until no further predictors can be added. Usually leads to inferior results compared with backward elimination.

Likelihood: The probability of the data to be observed under a given model.

Maximum likelihood: Statistical estimation technique which sets unknown model parameters (e.g., regression coefficients) to those values under which the observed data are most plausibly explained.

Multicollinearity: Almost perfect correlation of explanatory variables. Causes ambiguity in the estimation of regression coefficients and in the selection of variables. Can be a problem if regression coefficients should be interpretable, that is, in etiologic research contexts.

Prognostic models: Statistical models used to prognosticate a patient's outcome (e.g., graft loss or death) by the values of some variables available at the time point at which the prediction is made.

Univariable prefiltering (bivariable analysis): Each candidate variable is evaluated in its association with the outcome; only significant variables are entered into a multivariable model. Often used by researchers, but should be avoided.

Variable, independent or dependent: Independent variable: a variable considered as predicting or explaining patient outcome; also termed predictor or explanatory variable. Dependent variable: the variable representing the outcome under study, for example, occurrence of rejections, patient survival, or graft survival.
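As a small illustration of the AIC and likelihood entries above, the sketch below (made-up data and variable names) fits two nested logistic models and recomputes the AIC from the log likelihood; the model with the lower AIC offers the better trade-off between fit and complexity.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 400
df = pd.DataFrame({"age": rng.normal(50, 10, n),
                   "marker": rng.normal(size=n)})
eta = -2.0 + 0.04 * df["age"]                 # outcome depends on age only
df["event"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

small = sm.Logit(df["event"], sm.add_constant(df[["age"]])).fit(disp=0)
large = sm.Logit(df["event"], sm.add_constant(df[["age", "marker"]])).fit(disp=0)

for name, fit in [("age only", small), ("age + marker", large)]:
    k = fit.df_model + 1                      # parameters including intercept
    aic_by_hand = -2 * fit.llf + 2 * k        # AIC = -2 log L + 2 k
    print(f"{name}: logL={fit.llf:.1f}  AIC={fit.aic:.1f}  by hand={aic_by_hand:.1f}")
```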

Whatever technique is applied, the approach of letting statistics decide which variables should be included in a model is popular among scientists. Among all 89 clinical research articles published in Transplant International in 2015, 49 applied multivariable regression modeling, and variable selection was used in 30 (65%) of those 49 articles. However, it is not commonly known that there hardly exists any statistical theory which justifies the use of these techniques. This means that quantities such as regression coefficients, hazard ratios or odds ratios, P-values or confidence intervals may suffer from systematic biases if variable selection was applied, and usually, the magnitude or direction of these biases is unpredictable [15]. This is in sharp contrast to the simplicity of application of variable selection approaches, which are ready to use in standard statistical software such as IBM SPSS [16] or SAS [17].

The popularity of variable selection approaches is based on five myths, that is, beliefs lacking theoretical foundation. Before discussing these myths in this review, it should be noted that no variable selection approach can guard against general errors in setting up a statistical modeling problem, for example, putting the “cause” as outcome variable and the “effect” as independent variable, adjusting effects for later outcomes, or the like. Moreover, it is assumed here that, without looking into the data, a set of candidate predictors has already been preselected by clinical expertise, for example, a prior belief that those predictors could be related to the outcome. Availability in a data set alone is not a sufficient basis for considering a variable for a model.

Myth 1: “The number of variables in a model should be reduced until there are 10 events per variable.” No!

Simulation studies have revealed that multivariable models become very unstable with too low events-per-variable (EPV) ratios. Current recommendations suggest that a minimum of 5–15 EPV should be available, depending on context [15,18].
However, practitioners often overlook that this recommendation refers to a priori fixed models which do not result from earlier selection [15]. If variable selection is considered, the rule should consider the number of candidate variables with which the selection process is initialized, and probably even much higher values such as 50 EPV are needed to obtain approximately stable results, as the selection adds another source of uncertainty to estimation [19]. If the number of candidate variables seems to be too large, background knowledge obtained from analyses of former studies or from theoretical considerations should be applied to prefilter variables in order to meet the EPV-rule requirements of the problem. Causal diagrams such as directed acyclic graphs can also help in discarding candidate variables before statistically analyzing the data [20,21].
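For illustration, an EPV check is a one-line calculation; the numbers below are invented, and the point is that the denominator counts all candidate coefficients (including dummy variables for categorical predictors), not only the variables that end up being selected.

```python
# Illustrative events-per-variable (EPV) check with made-up numbers.
n_events = 90                 # e.g., graft losses observed in the cohort
candidate_coefficients = {    # one entry per estimated regression coefficient
    "age": 1,
    "donor_type": 2,          # 3-level factor -> 2 dummy coefficients
    "cold_ischemia_time": 1,
    "hla_mismatches": 1,
    "diabetes": 1,
}
epv = n_events / sum(candidate_coefficients.values())
print(f"EPV = {epv:.1f}")     # 90 / 6 = 15.0; far more is advisable if selection follows
```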

Myth 2: “Only variables with proven univariable-model significance should be included in a model.” No!

While it is true that regression coefficients are often larger in univariable models than in multivariable ones, the opposite may also occur, in particular if some variables (with all positive effects on the outcome) are negatively correlated. Moreover, univariable prefiltering, sometimes also referred to as “bivariable analysis”, does not add stability to the selection process, as it is based on stochastic quantities, and it can lead to overlooking important adjustment variables needed for control in an etiologic model. Although univariable prefiltering is traceable and easy to do with standard software, one should completely forget about it, as it is neither a prerequisite for, nor does it provide any benefit when, building multivariable models [22].
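The scenario sketched above (predictors with positive effects that are negatively correlated with each other) is easy to reproduce. In the hedged simulation below (invented data and effect sizes), x1 typically fails a univariable significance screen even though it clearly matters once x2 is adjusted for; the exact P-values depend on the random draw.

```python
import numpy as np
import statsmodels.api as sm

# Two predictors with positive effects on y but strong negative mutual correlation.
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = -x1 + rng.normal(scale=0.6, size=n)                  # negatively correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(scale=1.5, size=n)   # both true effects are +1

uni = sm.OLS(y, sm.add_constant(x1)).fit()
both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Marginally, x1 usually looks unrelated to y; adjusted for x2 it clearly matters.
print("univariable   P(x1) = %.3f" % uni.pvalues[1])
print("multivariable P(x1) = %.3f" % both.pvalues[1])
```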
Myth 3: “Insignificant effects should be eliminated from a model.” No!

Eliminating a variable from a model means setting its regression coefficient to exactly zero – even if the likeliest value for it, given the data, is different. In this way, one moves away from the maximum likelihood solution (which has a theoretical foundation) and reports a model which is suboptimal by intention. Eliminating weak effects may also be dangerous, as in etiologic studies bias could result from falsely omitting an important confounder. This is because regression coefficients generally depend on which other variables are in a model, and consequently, they change their value if one of the other variables is omitted from the model. This “change-in-estimate” [23] can be positive or negative, that is, away from or toward zero. Hence, it may happen that, after eliminating a potential confounder, another adjustment variable's coefficient moves closer to zero, changing from “significant” to “nonsignificant” and hence leading to the elimination of that variable in a later step. However, despite its usually detrimental effects on bias, elimination of very weak predictors from a model can sometimes decrease the variance (uncertainty) of the remaining regression coefficients. Dunkler et al. [23] have proposed “augmented backward elimination”, a selection algorithm which leaves insignificant effects in a model if their elimination would cause a change in the estimate of another variable. Thus, their proposal extends purely “significance”-based sequential elimination of variables (“backward elimination”) and is of particular interest in etiologic modeling.
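To show what the change-in-estimate criterion measures, here is a small sketch with a simulated confounding scenario (names and coefficients invented): the coefficient of the exposure x is compared with and without the potential confounder z in the model. Augmented backward elimination, as described above, would keep z if this change exceeded a prespecified threshold, even if z itself were not significant.

```python
import numpy as np
import statsmodels.api as sm

# Simulated confounding: z influences both the exposure x and the outcome y.
rng = np.random.default_rng(5)
n = 400
z = rng.normal(size=n)
x = 0.7 * z + rng.normal(size=n)
y = 0.5 * x + 0.8 * z + rng.normal(size=n)

adjusted = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
crude = sm.OLS(y, sm.add_constant(x)).fit()

b_adj = adjusted.params[1]          # effect of x, adjusted for z
b_crude = crude.params[1]           # effect of x after removing z from the model
change = abs(b_crude - b_adj) / abs(b_adj)
print(f"adjusted b={b_adj:.2f}, crude b={b_crude:.2f}, change-in-estimate={change:.0%}")
# A change of this size would argue for keeping z even if its P-value were large.
```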
Myth 4: “The reported P-value quantifies the type I error of a variable being falsely selected.” No!

First, while the probability of a type I error mainly depends on the significance level, a P-value is a result of data collection and analysis and quantifies the plausibility of the observed data under the null hypothesis. Therefore, the P-value does not quantify the type I error [24]. Second, after a sequence of elimination or selection steps, standard software reports P-values only from the finally estimated model. Any quantities from this last model are unreliable, as they do not “remember” which steps have been performed before. Therefore, P-values are biased low (as only those P-values are reported which fall below a certain threshold), and confidence intervals are often too narrow, claiming, for example, a confidence level of 95% while they actually cover the true value with a much lower probability [15]. On the other hand, there is also the danger of false elimination of variables, the possibility of which is not quantified at all by just reporting the final model of a variable selection procedure. To overcome these problems, statisticians have argued in favor of using resampling techniques or relative AIC- or Bayes factor-based approaches to explore alternative models and their likelihood to fit the data, and to use averages over competing models instead of just selecting one ultimate model [25]. Such analyses may provide valuable insight into how stable models are and how many and which competing models would be selected how often. Resampling can also be used to quantify selection probabilities of variables or of pairs of correlated, competing variables [26]. Unfortunately, with the exception of SAS/PROC GLMSELECT software [17], we do not know of any implementations of these very useful approaches in standard software. For example, while simple bootstrap resampling is indeed implemented in IBM SPSS software [16], its application is restricted to a model with a fixed set of independent variables rather than to evaluating the stability of variable selection across resamples.
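Such a stability check can, however, be scripted by hand. The sketch below (illustrative only, not the procedure of [26] or a vendor implementation) repeats P-value-based backward elimination on bootstrap resamples of a simulated data set and reports how often each candidate variable is selected.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=list("abcde"))
y = pd.Series(rng.binomial(1, 1 / (1 + np.exp(-(0.9 * X["a"] - 0.7 * X["b"])))))

def backward(X, y, alpha=0.157):
    """P-value-based backward elimination (same idea as in the earlier sketch)."""
    kept = list(X.columns)
    while kept:
        p = sm.Logit(y, sm.add_constant(X[kept])).fit(disp=0).pvalues.drop("const")
        if p.max() < alpha:
            break
        kept.remove(p.idxmax())
    return kept

counts = pd.Series(0, index=X.columns)
n_boot = 200
for _ in range(n_boot):
    idx = rng.integers(0, n, n)                          # bootstrap resample
    sel = backward(X.iloc[idx].reset_index(drop=True),
                   y.iloc[idx].reset_index(drop=True))
    counts[sel] += 1

print((counts / n_boot).sort_values(ascending=False))    # selection frequencies
```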
Myth 5: “Variable selection simplifies analysis.” No!

While a smaller model may be easier to use and – at first glance – to report, there are many problems to be solved when variable selection techniques are considered. First, an appropriate variable selection method has to be selected for the problem at hand. Statisticians have recommended backward elimination as the most reliable method among those that can be easily implemented with standard software [27]. Second, an often arbitrary choice has to be made about the selection parameter, that is, the significance level used to decide whether an effect should be retained in a model. While smaller values such as 0.05 or 0.01 are only recommended for very large sample sizes (EPV of 100 or above), in the vast majority of applications a value of 0.2 or 0.157 (corresponding to selection based on AIC), or even 0.5 (resulting in very mild selection), will be a better choice. Third, as selection is a “hard decision” that is often based on vague quantities, investigations of model stability should accompany any applied variable selection to justify the decision for the model finally reported, or at least to quantify the uncertainty related to the selection of the variables. This has to be done with resampling methods, which, until robust implementations are available in standard software, are still cumbersome to implement. Such evaluations are also needed (and are even more computationally demanding) for best subset searches, that is, letting the computer evaluate all different models that can be thought of using a given set of candidate predictors. Similar stability investigations should be carried out if modern variable selection methods such as the LASSO [28] or boosting [29] are used.
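As an aside, the seemingly odd value of 0.157 follows from the AIC: adding one parameter lowers the AIC exactly when the likelihood-ratio chi-squared statistic exceeds 2, and the corresponding tail probability of a chi-squared distribution with one degree of freedom is about 0.157. A two-line check:

```python
from scipy.stats import chi2

# AIC prefers the larger model when the 1-df likelihood-ratio statistic exceeds 2;
# the matching significance threshold is P(chi2_1 > 2):
print(round(chi2.sf(2, df=1), 3))   # -> 0.157
```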
By way of conclusion, we see that five myths have obscured the problems of variable selection that have been identified in recent decades by statisticians. While variable selection methods seem simple to use and handy for building multivariable models, issues such as selection uncertainty or bias in the reported quantities have too often been overlooked by practitioners. Before using variable selection techniques, one should critically reflect on whether such methods are needed in a particular study at all, and if yes, whether there are enough data available to justify elimination or inclusion of variables in a model just by “letting the data speak”. By contrast, expert background knowledge, for example formalized by directed acyclic graphs [20,21], is usually a much better guide to robust multivariable models. At the very least, such knowledge (and not data-driven methods!) should be used to restrict the number of candidate variables competing for selection to a number compatible with published EPV rules. In line with many other statisticians, we think that for prognostic models backward elimination with a selection criterion of 0.157 and without preceding univariable prefiltering is a good starting point, but sometimes other choices for the selection criterion may be more appropriate. For etiologic models, “augmented backward elimination” [23], preceded by a careful preselection based on assumptions about the causal roles of variables [20,21], is a reasonable approach. Whenever investigators decide to use statistical variable selection approaches, they should use them with care and should add sensitivity and robustness analyses. Ideally, such analyses are conducted using resampling techniques, but often it will already be helpful if the robustness of the basic conclusions of a study is demonstrated by comparing the main results with those obtained after eliminating some variables from the main model or including additional ones. The key messages of this review are summarized in Box 2.

Box 2. Key points to remember

1. The five myths presented here are all misconceptions about variable selection.
2. Often there is no scientific reason to perform variable selection. In particular, variable selection methods require a much larger sample size than estimation of a multivariable model with a fixed set of predictors based on clinical expertise.
3. If a researcher needs to perform variable selection, and the sample size is large enough and the candidate predictors have been carefully selected based on prior knowledge, then backward elimination with a P-value criterion of 0.157 is a good choice for prognostic models, and augmented backward elimination for etiologic models.
4. Variable selection should always be accompanied by sensitivity analyses to avoid wrong conclusions.

Funding

The authors have declared no funding.

Conflict of interest

The authors have declared no conflicts of interest.

Acknowledgements

We acknowledge Isabell Gläser's and Thomas Hillebrand's help with conducting the systematic review of the use of variable selection methods in Transplant International. Furthermore, we would like to thank two anonymous reviewers for helpful comments on a previous version of this manuscript.

REFERENCES
1. Tripepi G, Jager KJ, Dekker FW, Zoccali C. Testing for causality and prognosis: etiological and prognostic models. Kidney Int 2008; 74: 1512.
2. Von Düring ME, Jenssen T, Bollerslev J, et al. Visceral fat is better related to impaired glucose metabolism than body mass index after kidney transplantation. Transpl Int 2015; 28: 1162.
3. Bhat M, Hathcock M, Kremers WK, et al. Portal vein encasement predicts neoadjuvant therapy response in liver transplantation for perihilar cholangiocarcinoma protocol. Transpl Int 2015; 28: 1383.
4. Pianta TJ, Peake PW, Pickering JW, Kelleher M, Buckley NA, Endre ZH. Evaluation of biomarkers of cell cycle arrest and inflammation in prediction of dialysis or recovery after kidney transplantation. Transpl Int 2015; 28: 1392.
5. Rompianesi G, Montalti R, Cautero N, et al. Neurological complications after liver transplantation as a consequence of immunosuppression: univariate and multivariate analysis of risk factors. Transpl Int 2015; 28: 864.
6. Zijlstra LE, Constantinescu AA, Manintveld O, et al. Improved long-term survival in Dutch heart transplant patients despite increasing donor age: the Rotterdam experience. Transpl Int 2015; 28: 962.
7. Fernandez-Ruiz M, Arias M, Campistol JM, et al. Cytomegalovirus prevention strategies in seropositive kidney transplant recipients: an insight into current clinical practice. Transpl Int 2015; 28: 1042.
8. Martinez-Selles M, Almenar L, Paniagua-Martin MJ, et al. Donor/recipient sex mismatch and survival after heart transplantation: only an issue in male recipients? An analysis of the Spanish Heart Transplantation Registry. Transpl Int 2015; 28: 305.
9. Rodríguez-Perálvarez M, García-Caparrós C, Tsochatzis E, et al. Lack of agreement for defining 'clinical suspicion of rejection' in liver transplantation: a model to select candidates for liver biopsy. Transpl Int 2015; 28: 455.
10. Pezawas T, Grimm M, Ristl R, et al. Primary preventive cardioverter-defibrillator implantation (Pro-ICD) in patients awaiting heart transplantation. A prospective, randomized, controlled 12-year follow-up study. Transpl Int 2015; 28: 34.
11. Prasad GVR, Huang M, Silver SA, et al. Metabolic syndrome definitions and components in predicting major adverse cardiovascular events after kidney transplantation. Transpl Int 2015; 28: 79.
12. Tripon S, Francoz C, Albuquerque A, et al. Interactions between virus-related factors and post-transplant ascites in patients with hepatitis C and no cirrhosis: role of cryoglobulinemia. Transpl Int 2015; 28: 162.
13. Somers J, Ruttens D, Verleden SE, et al. A decade of extended-criteria lung donors in a single center: was it justified? Transpl Int 2015; 28: 170.
14. Nagai S, Mangus RS, Anderson E, et al. Post-transplant persistent lymphopenia is a strong predictor of late survival in isolated intestine and multivisceral transplantation. Transpl Int 2015; 28: 1195.
15. Harrell FE Jr. Regression Modeling Strategies, 2nd edn. Switzerland: Springer, 2015.
16. IBM Corp. IBM SPSS Statistics for Windows, 23.0 edn. New York, NY: IBM Corp, 2013.
17. SAS Institute Inc. SAS/STAT, 9.4 edn. Cary, NC: SAS Institute Inc., 2012.
18. Vittinghoff E, McCulloch CE. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol 2007; 165: 710.
19. Steyerberg EW. Clinical Prediction Models. New York, NY: Springer, 2009.
20. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology 1999; 10: 37.
21. VanderWeele TJ, Shpitser I. A new criterion for confounder selection. Biometrics 2011; 67: 1406.
22. Sun GW, Shook TL, Kay GL. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epidemiol 1996; 49: 907.
23. Dunkler D, Plischke M, Leffondré K, Heinze G. Augmented backward elimination: a pragmatic and purposeful way to develop statistical models. PLoS One 2014; 9: e113677. doi:10.1371/journal.pone.0113677.
24. Goodman S. A dirty dozen: twelve P-value misconceptions. Semin Hematol 2008; 45: 135.
25. Burnham KP, Anderson DR. Model Selection and Multimodel Inference, 2nd edn. New York, NY: Springer, 2002.
26. Sauerbrei W, Schumacher M. A bootstrap resampling procedure for model building: application to the Cox regression model. Stat Med 1992; 11: 2093.
27. Royston P, Sauerbrei W. Multivariable Model-Building – A Pragmatic Approach to Regression Analysis based on Fractional Polynomials for Modelling Continuous Variables. Chichester: John Wiley & Sons Ltd, 2008.
28. Tibshirani R. Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met 1996; 58: 267.
29. Breiman L. Arcing classifiers. Ann Stat 1998; 26: 801.
