Five myths about variable selection
REVIEW
Observational studies are often conducted in a prognostic or etiologic research context [1]. To this end, many papers dealing with observational studies use multivariable regression approaches to identify important predictors of an outcome [2–4] or to assess effects of new markers adjusted for known clinical predictors [5–7], respectively. As the number of candidate predictor variables or known confounders can be large, a "full" model including all candidate predictors as explanatory variables is often considered impractical for clinical use or, in the extreme case, is even impossible to estimate because of multicollinearity. Therefore, variable selection approaches are often employed, mostly based on evaluating P-values for testing regression coefficients against zero. For example, Martinez-Selles et al. [8] used univariate screening of effects to build a multivariable model for survival after heart transplantation. After univariate selection, Rodriguez-Peralvarez et al. [9] employed "backward elimination", which means that first a multivariable model was built with all predictors selected by univariate screening in a first step. Then, nonsignificant predictor variables were sequentially eliminated and the model re-estimated until all variables remaining in the model showed a significant association with the outcome. The technique of backward selection is sometimes also applied directly to a set of predictor variables [10–13], or the process of variable selection is reversed by "forward selection" [5,14], meaning that candidate predictors are sequentially included in a model if their association with the outcome variable, on top of the set of variables already in the model, is significant. Box 1 provides an overview of the most common approaches to variable selection and explains some statistical notions commonly used in this context.

Whatever technique is applied, the approach of letting statistics decide which variables should be included in a model is popular among scientists. Among all 89 clinical research articles published in Transplant
Box 1. Glossary

AIC: Akaike information criterion. A number based on information theory expressing the model fit (log likelihood) discounted by the number of unknown parameters.

Augmented backward elimination: An extension of backward elimination recently proposed by Dunkler et al. [23], which can consider the change-in-estimate as an additional selection criterion. Usually leads to the selection of more variables than standard backward elimination. The preferable method for etiologic models.

Backward elimination: Remove insignificant predictors from a model one by one until all remaining variables are significant. The preferable method for prognostic models if enough data are available.

Bayes factor: Ratio of the likelihoods of two competing models.

Change-in-estimate: The magnitude by which the regression coefficient of a variable X changes if a variable Z is removed from a multivariable model.

EPV: Events per variable. A simple measure of the amount of information in a data set (the number of events or the sample size) relative to the number of regression coefficients to be estimated (the number of variables). Note that nonselection of a variable corresponds to an estimated regression coefficient of zero, and thus this measure should always count all candidate variables.

Etiologic models: Statistical models used to explain the role of a risk factor or treatment in its (possibly causal) effect on patient outcome.

Forward selection: Add significant candidate predictors to a model one by one until no further predictors can be added. Usually leads to inferior results compared with backward elimination.

Likelihood: The probability of the observed data under a given model.

Maximum likelihood: A statistical estimation technique which sets the unknown model parameters (e.g., regression coefficients) to those values under which the observed data are most plausibly explained.

Multicollinearity: Almost perfect correlation among explanatory variables. Causes ambiguity in the estimation of regression coefficients and in the selection of variables. Can be a problem if regression coefficients should be interpretable, that is, in etiologic research contexts.

Prognostic models: Statistical models used to prognosticate a patient's outcome (e.g., graft loss or death) from the values of some variables available at the time point at which the prediction is made.

Univariable prefiltering (bivariable analysis): Each candidate variable is evaluated in its association with the outcome, and only the significant variables are entered into a multivariable model. Often used by researchers, but should be avoided.

Variable, independent or dependent: An independent variable is a variable considered as predicting or explaining patient outcome, also termed a predictor or explanatory variable. The dependent variable is the variable representing the outcome under study, for example, occurrence of rejections, patient survival, or graft survival.
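The sequential procedures defined in Box 1 share a simple control flow. The sketch below is a schematic illustration, not any particular package's implementation; the `refit_pvalues` callback is a hypothetical stand-in for re-estimating the model and extracting one P-value per variable.

```python
def backward_elimination(candidates, refit_pvalues, alpha=0.05):
    """Schematic backward elimination: at each step, refit the model on
    the current variable set, find the least significant variable, and
    remove it; stop once every remaining variable is significant.

    `refit_pvalues` is a hypothetical callback that re-estimates the
    model on the given variable list and returns {variable: P-value}.
    """
    model = list(candidates)
    while model:
        p = refit_pvalues(model)                # re-estimate after each removal
        worst = max(model, key=lambda v: p[v])  # least significant variable
        if p[worst] <= alpha:                   # all remaining are significant
            break
        model.remove(worst)
    return model
```

Forward selection reverses this loop: starting from an empty model, the candidate whose addition is most significant is entered until no further candidate qualifies.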
International in 2015, 49 applied multivariable regression modeling, and variable selection was used in 30 (65%) of those 49 articles. However, it is not commonly known that there hardly exists any statistical theory which justifies the use of these techniques. This means that quantities such as regression coefficients, hazard ratios or odds ratios, P-values, or confidence intervals may suffer from systematic biases if variable selection was applied, and usually the magnitude or direction of these biases is unpredictable [15]. This is in sharp contrast to the simplicity of applying variable selection approaches, which are ready to use in standard statistical software such as IBM SPSS [16] or SAS [17].

The popularity of variable selection approaches rests on five myths, that is, "beliefs" lacking theoretical foundation. Before discussing these myths in this review, it should be noted that no variable selection approach can guard against general errors in setting up the statistical modeling problem, for example, putting the "cause" as outcome variable and the "effect" as independent variable, adjusting effects for later outcomes, or the like. Moreover, it is assumed here that, without looking into the data, a set of candidate predictors has already been preselected by clinical expertise, for example, by a prior belief that those predictors could be related to the outcome. Availability in a data set alone is not a sufficient basis for considering a variable for a model.

Myth 1: "The number of variables in a model should be reduced until there are 10 events per variable." No!

Simulation studies have revealed that multivariable models become very unstable with too low events-per-variable (EPV) ratios. Current recommendations suggest that a minimum of 5–15 EPV should be available, depending on context [15,18]. However, practitioners often overlook that this recommendation refers to a priori fixed models which do not result from earlier selection [15]. If variable selection is considered, the rule should count the number of candidate variables with which the selection process is initialized, and probably even much higher values such as 50 EPV are needed to obtain approximately stable results, as the selection adds another source of uncertainty to the estimation [19]. If the number of candidate variables seems too large, background knowledge obtained from analyses of former studies or from theoretical considerations should be applied to prefilter variables in order to meet the EPV-rule requirements of a problem. Causal diagrams such as directed acyclic graphs can also help in discarding candidate variables before statistically analyzing the data [20,21].

Myth 2: "Only variables with proven univariable-model significance should be included in a model." No!

While it is true that regression coefficients are often larger in univariable models than in multivariable ones, the opposite may also occur, in particular if some variables (all with positive effects on the outcome) are negatively correlated. Moreover, univariable prefiltering, sometimes also referred to as "bivariable analysis," does not add stability to the selection process, as it is based on stochastic quantities, and it can lead to overlooking important adjustment variables needed for control in an etiologic model. Although univariable prefiltering is traceable and easy to do with standard software, it is best abandoned altogether, as it is neither a prerequisite for nor a benefit to building multivariable models [22].

Myth 3: "Insignificant effects should be eliminated from a model." No!

Eliminating a variable from a model means setting its regression coefficient to exactly zero, even if the likeliest value for it, given the data, is different. In this way, one moves away from a maximum likelihood solution (which has theoretical foundation) and reports a model which is suboptimal by intention. Eliminating weak effects may also be dangerous, as in etiologic studies bias could result from falsely omitting an important confounder. This is because regression coefficients generally depend on which other variables are in a model, and consequently they change their value if one of the other variables is omitted from the model. This "change-in-estimate" [23] can be positive or negative, that is, away from or toward zero. Hence, it may happen that, after eliminating a potential confounder, another adjustment variable's coefficient moves closer to zero, changing from "significant" to "nonsignificant" and hence leading to the elimination of that variable in a later step. However, despite its usually detrimental effect on bias, the elimination of very weak predictors from a model can sometimes decrease the variance (uncertainty) of the remaining regression coefficients. Dunkler et al. [23] have proposed "augmented backward elimination," a selection algorithm which leaves insignificant effects in a model if their elimination would cause a change in the estimate of another variable. Thus, their proposal extends pure "significance"-based sequential elimination of variables ("backward elimination") and is of particular interest in etiologic modeling.

Myth 4: "The reported P-value quantifies the type I error of a variable being falsely selected." No!

First, while the probability of a type I error mainly depends on the significance level, a P-value is a result of data collection and analysis and quantifies the plausibility of the observed data under the null hypothesis. Therefore, the P-value does not quantify the type I error [24]. Second, after a sequence of elimination or selection steps, standard software reports P-values only from the finally estimated model. Any quantities from this last model are unreliable, as they do not "remember" which steps have been performed before. Therefore, P-values are biased low (as only those P-values are reported which fall below a certain threshold), and confidence intervals are often too narrow, claiming, for example, a confidence level of 95% while they actually cover the true value with a much lower probability [15]. On the other hand, there is also the danger of false elimination of variables, the possibility of which is not quantified at all by just reporting the final model of a variable selection procedure. To overcome these problems, statisticians have argued in favor of using resampling techniques or relative AIC- or Bayes factor-based approaches to explore alternative models and their likelihood to fit the data, and of using averages over competing models instead of just selecting one ultimate model [25]. Such analyses may provide valuable insight into how stable models are and how many and which competing models would be selected how often. Resampling can also be used to quantify selection probabilities of variables or of pairs of correlated, competitive variables [26]. Unfortunately, with the exception of SAS PROC GLMSELECT [17], we do not know of any implementations of these very useful approaches in standard software.
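The inflation behind Myth 4 is easy to reproduce with a small simulation (an illustration added here, not part of the original analysis). Under the null hypothesis a valid P-value is uniformly distributed on (0, 1), so the P-values of truly irrelevant candidate variables can be drawn directly as uniform random numbers; with 10 such candidates screened at the 5% level, at least one appears "significant" in about 1 - 0.95**10 ≈ 40% of data sets, even though each individual test keeps its nominal 5% error rate.

```python
import random

def selection_rate(n_candidates=10, alpha=0.05, n_sim=20_000, seed=1):
    """Fraction of simulated studies in which univariable screening
    selects at least one variable, when no candidate has any true
    effect. Under H0 each P-value is Uniform(0, 1), so the P-values
    are drawn directly instead of fitting models."""
    rng = random.Random(seed)
    hits = sum(
        min(rng.random() for _ in range(n_candidates)) < alpha
        for _ in range(n_sim)
    )
    return hits / n_sim

# With 10 null candidates the rate is near 0.40, not 0.05; the minimum
# P-value reported from such a screen is correspondingly biased low.
```

The same logic explains why the P-values printed for a final selected model are too small: only the "winners" of the screening steps are ever reported.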
Transplant International 2017; 30: 6–10
© 2016 Steunstichting ESOT
REFERENCES

1. Tripepi G, Jager KJ, Dekker FW, Zoccali C. Testing for causality and prognosis: etiological and prognostic models. Kidney Int 2008; 74: 1512.
2. von Düring ME, Jenssen T, Bollerslev J, et al. Visceral fat is better related to impaired glucose metabolism than body mass index after kidney transplantation. Transpl Int 2015; 28: 1162.
3. Bhat M, Hathcock M, Kremers WK, et al. Portal vein encasement predicts neoadjuvant therapy response in liver transplantation for perihilar cholangiocarcinoma protocol. Transpl Int 2015; 28: 1383.
4. Pianta TJ, Peake PW, Pickering JW, Kelleher M, Buckley NA, Endre ZH. Evaluation of biomarkers of cell cycle arrest and inflammation in prediction of dialysis or recovery after kidney transplantation. Transpl Int 2015; 28: 1392.
5. Rompianesi G, Montalti R, Cautero N, et al. Neurological complications after liver transplantation as a consequence of immunosuppression: univariate and multivariate analysis of risk factors. Transpl Int 2015; 28: 864.
6. Zijlstra LE, Constantinescu AA, Manintveld O, et al. Improved long-term survival in Dutch heart transplant patients despite increasing donor age: the Rotterdam experience. Transpl Int 2015; 28: 962.
7. Fernandez-Ruiz M, Arias M, Campistol JM, et al. Cytomegalovirus prevention strategies in seropositive kidney transplant recipients: an insight into current clinical practice. Transpl Int 2015; 28: 1042.
8. Martinez-Selles M, Almenar L, Paniagua-Martin MJ, et al. Donor/recipient sex mismatch and survival after heart transplantation: only an issue in male recipients? An analysis of the Spanish Heart Transplantation Registry. Transpl Int 2015; 28: 305.
9. Rodríguez-Perálvarez M, García-Caparrós C, Tsochatzis E, et al. Lack of agreement for defining 'clinical suspicion of rejection' in liver transplantation: a model to select candidates for liver biopsy. Transpl Int 2015; 28: 455.
10. Pezawas T, Grimm M, Ristl R, et al. Primary preventive cardioverter-defibrillator implantation (Pro-ICD) in patients awaiting heart transplantation. A prospective, randomized, controlled 12-year follow-up study. Transpl Int 2015; 28: 34.
11. Prasad GVR, Huang M, Silver SA, et al. Metabolic syndrome definitions and components in predicting major adverse cardiovascular events after kidney transplantation. Transpl Int 2015; 28: 79.
12. Tripon S, Francoz C, Albuquerque A, et al. Interactions between virus-related factors and post-transplant ascites in patients with hepatitis C and no cirrhosis: role of cryoglobulinemia. Transpl Int 2015; 28: 162.
13. Somers J, Ruttens D, Verleden SE, et al. A decade of extended-criteria lung donors in a single center: was it justified? Transpl Int 2015; 28: 170.
14. Nagai S, Mangus RS, Anderson E, et al. Post-transplant persistent lymphopenia is a strong predictor of late survival in isolated intestine and multivisceral transplantation. Transpl Int 2015; 28: 1195.
15. Harrell FE Jr. Regression Modeling Strategies, 2nd edn. Switzerland: Springer, 2015.
16. IBM Corp. IBM SPSS Statistics for Windows, 23.0 edn. New York, NY: IBM Corp, 2013.
17. SAS Institute Inc. SAS/STAT, 9.4 edn. Cary, NC: SAS Institute Inc., 2012.
18. Vittinghoff E, McCulloch CE. Relaxing the rule of ten events per variable in logistic and Cox regression. Am J Epidemiol 2007; 165: 710.
19. Steyerberg EW. Clinical Prediction Models. New York, NY: Springer, 2009.
20. Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology 1999; 10: 37.
21. VanderWeele TJ, Shpitser I. A new criterion for confounder selection. Biometrics 2011; 67: 1406.
22. Sun G-W, Shook TL, Kay GL. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epidemiol 1996; 49: 907.
23. Dunkler D, Plischke M, Leffondre K, Heinze G. Augmented backward elimination: a pragmatic and purposeful way to develop statistical models. PLoS One 2014; 9: e113677. doi:10.1371/journal.pone.0113677
24. Goodman S. A dirty dozen: twelve P-value misconceptions. Semin Hematol 2008; 45: 135.
25. Burnham KP, Anderson DR. Model Selection and Multimodel Inference, 2nd edn. New York, NY: Springer, 2002.
26. Sauerbrei W, Schumacher M. A bootstrap resampling procedure for model building: application to the Cox regression model. Stat Med 1992; 11: 2093.
27. Royston P, Sauerbrei W. Multivariable Model-Building: A Pragmatic Approach to Regression Analysis based on Fractional Polynomials for Modelling Continuous Variables. Chichester: John Wiley & Sons Ltd, 2008.
28. Tibshirani R. Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met 1996; 58: 267.
29. Breiman L. Arcing classifiers. Ann Stat 1998; 26: 801.