Social Indicators Research 45: 253–275, 1998.
© 1998 Kluwer Academic Publishers. Printed in the Netherlands.

D. ROLAND THOMAS, EDWARD HUGHES and BRUNO D. ZUMBO

ON VARIABLE IMPORTANCE IN LINEAR REGRESSION

ABSTRACT. The paper examines in detail one particular measure of variable
importance for linear regression that was theoretically justified by Pratt (1987),
but which has since been criticized by Bring (1996) for producing “counterintu-
itive” results in certain situations, and by other authors for failing to guarantee
that importance be non-negative. In the article, the “counterintuitive” result is
explored and shown to be a defensible characteristic of an importance measure.
It is also shown that negative importance of large magnitude can only occur in
the presence of multicollinearity of the explanatory variables, and methods for
applying Pratt’s measure in such cases are described. The objective of the article
is to explain and to clarify the characteristics of Pratt’s measure, and thus to assist
practitioners who have to choose from among the many methods available for
assessing variable importance in linear regression.

KEY WORDS: variable importance, least squares geometry, discriminant ratio coefficients, negative importance, Pratt's importance measures, multicollinearity

1. INTRODUCTION

The concept of relative importance pervades the scientific literature,
as documented by Kruskal and Majors (1989). In the literature
of applied methodology and statistics, much of the interest has
focussed on the relative importance of explanatory variables in
multiple regression. According to Healy (1990), the question of the
relative importance of two variables in a regression analysis is one
of the most common questions asked of a statistical consultant, more
common even than the question of statistical significance. Numer-
ous measures of relative importance for linear regression have been
proposed and discussed, as illustrated by the work of Green, Carroll
and DeSarbo (1978), Pratt (1987), Kruskal (1987), Budescu (1993),
Bring (1994, 1996) and the other references cited in these articles.
The related issue of variable importance in MANOVA has also been
extensively discussed in the methodological literature of applied
psychology and education, see for example Thomas (1992, 1997),
Huberty (1994), and Thomas and Zumbo (1996). Recently, Thomas,
Hughes and Zumbo (1996) and Bring (1996) used the geometry of
least squares to explain and provide a motivation for some specific
measures of variable importance used in multiple regression. The
article by Thomas et al. (1996) focussed on one particular measure
previously used by many authors, but that had lacked a proper ratio-
nale until its axiomatic justification by Pratt (1987). Termed the
“product measure” by Bring (1996), this measure assigns impor-
tance to a variable in proportion to the product of its standardized
regression coefficient and its simple correlation with the response
variable. It will be referred to in this paper as “Pratt’s measure”
in recognition of his theoretical justification. Thomas et al. (1996)
addressed some apparent practical shortcomings of Pratt’s measure,
and provided some arguments in its favour. Thomas (1992) earlier
used a geometric argument to derive a similar measure of vari-
able importance for MANOVA, and the geometric approach has
been recently used by Thomas and Zumbo (1997) to develop an
importance measure for logistic regression. Bring (1996) provided a
geometric interpretation of several measures of variable importance,
and illustrated geometrically a situation in which Pratt’s measure
gives what he considered to be a counterintuitive result. Bring noted
that there is no unique definition of variable importance for regres-
sion analysis, and was of the opinion that “no clear-cut answer can
be given as to which measure to use”.
This opinion will not be challenged in this article. However,
it will be argued that Bring’s “counterintuitive” result is in fact a
justifiable characteristic of an importance measure, and that Pratt’s
measure provides a readily interpretable result in this case. Pratt’s
measure has also been criticized because it can in certain cases
take on negative values, a potentially troubling characteristic for
an importance measure. However, it will be shown in this article
that “large” negative importances can only occur in multicollinear
situations, when all measures of variable importance exhibit inter-
pretational difficulties. These results will be summarized and illus-
trated on a dataset where the multicollinearity can be resolved by
the omission of a variable. It will also be shown that the geometric
approach can be used to apply Pratt’s measure to ridge regression
(Hoerl and Kennard, 1970), a more general approach to multicollinearity whereby instability of regression coefficients is reduced
at the expense of introducing a small amount of bias into the
regression coefficient estimates.
Despite the emphasis on Pratt’s approach, it must be stressed that
the objective of this article is not to propose Pratt’s (1987) measure
as the “clear-cut answer” to variable importance in regression,
but rather to ensure that its merits and demerits are appropriately
weighted by practitioners when a method is being selected.
Pratt’s measure will be described in Section 2, and a summary
of its axiomatic derivation will be presented to ensure that the
theoretical basis of the method is properly recognized. A brief
outline of the geometric interpretation is also provided in Section 2.
The criticisms of the method raised by Bring (1996) and others will
be described and resolved in Section 3; some of the more technical
material is given in an appendix.
Examples are presented and analyzed in Section 4, and a discus-
sion and some concluding remarks are provided in Section 5.

2. PRATT’S MEASURES OF RELATIVE IMPORTANCE

2.1 The Axiomatic Approach


Pratt (1987) considered a linear regression equation of the form
(2.1) y = b0 + b1 x1 + . . . + bp xp + u,
where the disturbance term u is uncorrelated with x1 , . . . , xp ,
and is distributed with mean zero and variance σ 2 . He sought a
measure of relative importance that satisfied a minimal set of natural
requirements, the first two of which were:
(1) that relative importance can depend only on the means, vari-
ances and correlations of y, x1 , . . . , xp ;
(2) that a linear transformation of any variable does not affect its
relative importance.
For the two variable case, p = 2, it follows from the above two
assumptions that relative importance depends only on β1 , β2 , ρ12
and σ 2 , where βj is the standardized regression coefficient corre-
sponding to xj , j = 1, 2, and ρ12 is the correlation between x1
and x2 . Pratt (1987) then examined the situation where the two
regressors x1 and x2 were constructed from a symmetric set of M
equicorrelated explanatory variables, Xi , i = 1, . . . , M, each having
equal variances and equal regression coefficients. He set x1 = X1 +
. . . + Xm and x2 = Xm+1 + . . . + Xm+n , where m + n = M, and in
this totally symmetric situation he further assumed that:
(3) the relative importance of x1 to x2 is as m to n.
Under this framework, Pratt then established that for any regression
involving only two explanatory variables, the relative importance of
x1 to x2 is uniquely defined by the ratio of β1 ρ1 to β2 ρ2 (provided
both products are positive), where ρj , j = 1, 2, is the standard-
ized regression coefficient in the regression of y on xj alone, i.e.,
the simple correlation between y and xj . Essentially, his approach
consisted of proving that any two variable regression with specific
values of β1 , β2 , ρ12 and σ 2 could be constructed from the postu-
lated set of symmetric variables, Xi , i = 1, . . . , (m+n), by a suitable
choice of m and n. For variables composed of sums of symmetric
variables, assumption (3) is entirely natural.
Both βj and ρj are population parameters. Since β1 ρ1 + . . . + βp ρp
represents (in standardized form) the variance of the linear regres-
sion b1 x1 + . . . + bp xp , Pratt’s rule for p = 2 equates relative
importance to proportion of variance explained, provided that the
explained variance attributed to xj is βj ρj . The extension of this
rule, i.e., the use of the product βj ρj as a measure of variable
importance, to the general case p > 2 required only a few addi-
tional natural assumptions, principally the extension of assumption
(2) to non-singular linear transformations of the subset xq+1 , . . . ,
xp into x′q+1 , . . . , x′p , and the assumption that a pure noise variable
(independent of y and x1 , . . . , xp ) added to the list of explanatory
variables does not change the relative importance of the xj ’s. With
these and one other minor assumption (for which several natural
alternatives are available), Pratt (1987) showed that his measure is
the only one that satisfies his assumptions.
In the general case, p > 2, Pratt’s measure allows importance to
be defined additively, i.e., the importance of a subset of variables
x1 , . . . , xq (q < p), is the sum of their individual importances, given
by βj ρj , j = 1, . . . , q. It should be noted that other commonly used
measures do not lend themselves to an additive definition of variable
importance.
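To make the two variable rule concrete, consider a small worked example. The coefficient values are illustrative only (they are not taken from any dataset discussed in this paper); the identities ρ1 = β1 + ρ12 β2 and ρ2 = β2 + ρ12 β1 follow from the normal equations for standardized variables. Taking β1 = 0.5, β2 = 0.3 and ρ12 = 0.4,

ρ1 = 0.5 + 0.4 × 0.3 = 0.62,   β1 ρ1 = 0.310,
ρ2 = 0.3 + 0.4 × 0.5 = 0.50,   β2 ρ2 = 0.150,

so that the explained variance is β1 ρ1 + β2 ρ2 = 0.46. The relative importance of x1 to x2 is then as 0.310 to 0.150: x1 accounts for a share of 0.310/0.46 ≈ 0.67 of the explained variance, and x2 for the remaining 0.33.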

2.2 The Geometric Approach


The geometric interpretation of Pratt’s measure will be reviewed
in this section in order that the criticisms levelled by Bring (1996)
and others can be fully addressed. The review is based on the work
of Thomas et al. (1996), who first used the geometry to develop
an intuitively appealing measure of variable importance, and then
showed their measure to be a sample version of Pratt’s measure. For
readers not familiar with the geometric approach to regression, an
easily read introduction is given by Saville and Wood (1986).
For the case of multiple linear regression, consider a sample of N
observations to be fitted to a model of the form (2.1). The geometric
approach is based on a representation of the observed variables y,
x1 , . . . , xp as vectors in an N-dimensional space, denoted R N . For
the regression problem it will be assumed that all variables have a
mean of zero, which in vector notation means that y′1 = x′1 1 =
. . . = x′p 1 = 0, where 1 denotes an N × 1 vector of ones. It will also be
convenient to standardize the variables to have unit length, in which
case the dependent and independent variables will satisfy |y| = 1,
|xj | = 1, j = 1, . . . , p, where | · | denotes the length of a vector,
i.e., |y| = (y′y)^1/2 . Neither the centering nor the standardization
results in any loss of generality. Given this setup, the vector ŷ, which
denotes the least squares fitted value of y, can be represented as the
projection of y onto the subspace X of R N spanned by (i.e., defined
by) the xj , j = 1, . . . , p. The corresponding algebraic expression
for ŷ is
(2.2) ŷ = β̂1 x1 + . . . + β̂p xp ,
where the β̂j ’s are least squares estimates of the population regres-
sion coefficients, bj , j = 1, . . . , p, in standardized form. Note
that the β̂j ’s can be expressed in terms of unstandardized regres-
sion coefficient estimates as β̂j = b̂j |xj |/|y|. The subspace X (i.e.,
the model subspace) is illustrated in Figure 1 for a two variable
system, in which appropriate multiples of x1 and x2 (given by β̂1
and β̂2 , respectively) sum geometrically to ŷ, the projection of the
dependent variable vector y from R N onto the model subspace X.
Figure 2 illustrates the geometric interpretation of Pratt’s impor-
tance measures. The heavy lines represent the vector projection
of each β̂j xj onto ŷ, while the heavy dashed lines represent the
[Figure 1. The geometry of least squares.]

corresponding projections orthogonal to ŷ. Clearly, the orthogonal
components sum to zero. Thus it is natural to use the (signed)
lengths of the individual projections in the ŷ direction (which sum
to ŷ) as measures of the contribution of each xj to ŷ. Ratios of the
signed lengths of these projections to the length of ŷ then provide
convenient measures of the relative importance of individual vari-
ables, which will be denoted dj . By construction, these measures of
relative importance will sum to one.
Note that Bring (1996) described Pratt’s measure slightly differ-
ently. Instead of projections onto ŷ in the model space, he used
projections onto the dependent variable y in the full space R N . He
then stated that “Despite the ease of finding a geometrical interpreta-
tion, it is still difficult to comprehend the meaning of this measure.”
In contrast, the treatment described above, which is based entirely
on projections in the model space, provides an intuitive and easily
comprehended interpretation of Pratt’s measure.
To develop an expression for the importance measures dj , an
expression for the projection of β̂j xj onto ŷ is needed. This is given
by

(2.3) Pŷ (β̂j xj ) = (β̂j (ŷ′xj )/|ŷ|²) ŷ,

which has a signed length of

(2.4) |Pŷ (β̂j xj )|s = β̂j (ŷ′xj )/|ŷ|.


[Figure 2. Importance measures as projections: heavy lines are the projections of the β̂j xj in the ŷ direction, dashed lines the projections orthogonal to ŷ.]

The geometric measure of importance, dj , namely the ratio of this
signed length to the total length of ŷ, then becomes
(2.5) dj = β̂j (ŷ′xj )/|ŷ|², j = 1, . . . , p.

Since ŷ and y differ only by a residual vector which is orthogonal
to all vectors xj in the model space (see Saville and Wood, 1986), it
follows that ŷ′xj = y′xj = ρ̂j |xj ||y| = ρ̂j , where ρ̂j is the sample
estimate of the simple correlation between y and xj . Also, since R²,
the proportion of sample variance explained by the regression, is
given by R² = |ŷ|²/|y|² = |ŷ|², it follows that equation (2.5) can be
expressed in the form

(2.6) dj = β̂j ρ̂j /R², j = 1, . . . , p,
which is a sample estimate of Pratt’s (1987) measure, divided by
the total (standardized) variance explained. The dj will henceforth
be referred to as "standardized Pratt measures". The importance
measures (discriminant ratio coefficients) developed by Thomas
(1992) for use in MANOVA and descriptive discriminant analysis
are equivalent to the dj ’s.
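For readers who wish to compute the standardized Pratt measures directly from data, the following sketch (in Python with NumPy; the function name pratt_measures is ours, not drawn from any package) implements equation (2.6) after centering the variables and scaling them to unit length, as in the setup of this section.

```python
import numpy as np

def pratt_measures(X, y):
    """Standardized Pratt measures d_j = beta_j * rho_j / R^2 (equation 2.6).
    A sketch: X is an N x p array of regressors, y an N-vector."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # centre the variables and scale them to unit length (Section 2.2)
    Xs = X - X.mean(axis=0)
    Xs /= np.linalg.norm(Xs, axis=0)
    ys = y - y.mean()
    ys /= np.linalg.norm(ys)
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)  # standardized coefficients
    rho = Xs.T @ ys                                 # simple correlations with y
    r2 = float(beta @ rho)                          # R^2 = sum of beta_j * rho_j
    return beta * rho / r2                          # the d_j, which sum to one
```

Any software that reports standardized coefficients and simple correlations can be used instead; the sketch merely makes the projections of Section 2.2 explicit.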

3. CRITICISMS OF PRATT’S MEASURES, WITH REMEDIES

3.1 Bring’s “Counterintuitive Result”


Consider the two variable example shown in Figure 3, in which x1
and y are orthogonal, and both x1 and y are correlated with x2 . It is
clear from the model space representation that the projection of x1
onto ŷ is zero. Thus Pratt’s measure assigns zero importance to x1 ,
despite the fact that the regression coefficients of x1 and x2 are of
similar magnitude. In terms of the standardized Pratt measures, dj ,
we have d1 = 0 and d2 = 1. Bring (1996) considered this allocation
of importance to be counterintuitive. It will be shown below that this
result has an intuitively reasonable interpretation.
Consider y and x1 as above, together with a set of explanatory
variables x2 , . . . , xp . We will assume that at least some of the xj , j =
2, . . . , p are correlated both with y and x1 . Thus β1 will in general
be non-zero (comparable in magnitude to at least some of the βj ,
j = 2, . . . , p), while d1 = 0 and d2 + . . . + dp = 1, i.e., all importance
(in the Pratt sense) will be shared among the last p − 1 variables.
The meaning of the zero importance of x1 can best be understood
by expressing the fitted equation (2.2) in terms of the components
of the xj ’s orthogonal to x1 . These components are given by

(3.1) xᵒj = xj − γ̂j x1 , j = 2, . . . , p,


where γ̂j is the regression coefficient in the regression of xj on x1 ,
j = 2, . . . , p. Thus the fitted equation (2.2) can in this case be
represented as
(3.2) ŷ = γ̂1 x1 + β̂2 xᵒ2 + . . . + β̂p xᵒp ,

where

(3.3) γ̂1 = β̂1 + β̂2 γ̂2 + . . . + β̂p γ̂p .
[Figure 3. A suppressor variable: x1 is orthogonal to y, so the projection of β̂1 x1 onto ŷ has zero length.]

Since all the xᵒj , j = 2, . . . , p, are orthogonal to x1 , γ̂1 is equal to
the coefficient in the regression of y on x1 alone. Thus γ̂1 = 0, since
y is orthogonal to x1 , and the fitted equation can then be written as
(3.4) ŷ = β̂2 xᵒ2 + . . . + β̂p xᵒp .

Standardized versions of Pratt's importance measures, denoted dᵒj ,
j = 2, . . . , p, can be defined via projections of the xᵒj as before, and
it is easily shown that

(3.5) dᵒj = dj , j = 2, . . . , p.
Equation (3.5) demonstrates that when y and x1 are orthogonal,
all variable importance in the Pratt sense is assigned to the sub-
space orthogonal to x1 , with relative importance assessed in terms
of projections of the components xᵒj onto ŷ. Thus Pratt's measure
reflects the fact that x1 and the remaining xj ’s contribute to the
regression in completely different ways. The effect of variable x1
is indirect, being made entirely via the other variables.
This aspect of Pratt’s approach to variable importance was
discussed by Thomas et al. (1996), who referred to variables such
as x1 as suppressor variables. In the above discussion, only one
explanatory variable was considered to be uncorrelated with y, but
the argument can clearly be extended to more than one suppressor.
Thomas et al. (1996) noted that suppressors can be identified as
variables having small (not necessarily zero) values of dj , but with
values of |β̂j | that are comparable with the values exhibited by the
explanatory variables deemed to be important.
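The identification rule just described is easily mechanized. In the sketch below, the thresholds d_tol and beta_frac are illustrative choices of ours, not values prescribed by the paper; the idea is simply to flag variables whose |dj | is small while |β̂j | remains comparable to the largest coefficients. Applied to the Morrison MBA data of Section 4.1, these particular thresholds single out KNOWLDGE.

```python
import numpy as np

def flag_suppressors(beta, d, d_tol=0.1, beta_frac=0.4):
    """Heuristic screen for suppressors: small standardized Pratt measure
    |d_j| but |beta_j| comparable to the largest coefficients.
    The two thresholds are illustrative, not prescribed by the paper."""
    beta = np.asarray(beta, dtype=float)
    d = np.asarray(d, dtype=float)
    comparable = np.abs(beta) >= beta_frac * np.abs(beta).max()
    return (np.abs(d) < d_tol) & comparable

# Values from Table I (Section 4.1); only KNOWLDGE is flagged.
print(flag_suppressors([0.249, 0.221, 0.271, 0.520],
                       [0.205, 0.070, 0.224, 0.501]))
```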
Once it is recognized that suppressor and non-suppressor vari-
ables contribute to the regression in different ways, then Bring’s
“counterintuitive result” can be seen to be an intuitive and entirely
reasonable approach to measuring the relative importance of the
non-suppressor variables. However, when a regression equation
contains suppressor variables, a complete description of variable
importance should always include a separate assessment of their
contribution, relative to the non-suppressors. A natural measure
of their relative importance is given by (R² − R²NS)/R², where
R²NS denotes the sample variance explained by the non-suppressors
alone. This application of the sequential or stepwise method of
importance assessment (see, for example, Bring, 1996) does not
suffer from the problem of order dependence normally associated
with this method, given that there are only two classes of variables
to compare, and that the focus is on the relative importance of the
suppressors, a priori.
In practice, when suppressor variables have small but non-zero
standardized Pratt measures, the dj ’s for non-suppressors will not
be precisely equal to the corresponding values of the dᵒj 's. Nevertheless, since the importance of a variable xj relative to another variable xj′ is as dj to dj′ , the relative importances of the non-suppressors, one to another, are not affected by the suppressor variables. For convenience, the dj 's for the non-suppressors can be re-defined to sum to one, i.e.,

(3.6) djNS = dj / Σk∈SNS dk , j ∈ SNS ,

where SNS denotes the set of indices corresponding to the non-
suppressors. Alternatively, exact values of the orthogonalized
importance measures, djo , for the non-suppressor variables can be
readily obtained. For the case of one suppressor variable, x1 , they
can be obtained from the regression of y − γ̂1 x1 on the xᵒj , j =
2, . . . , p. For more than one suppressor, the same strategy can be
used, with orthogonalized versions of the dependent variable and
the non-suppressor variables each being obtained as the residuals
of regressions on the set of identified suppressor variables. This is
equivalent to the strategy of conditioning on suppressor variables
recommended by Thomas and Zumbo (1996) for the related case of
variable importance in MANOVA. The values of the orthogonalized
importance measures will always be similar to the corresponding
scaled importances, djNS , as illustrated in Section 4.1.
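The conditioning strategy can be sketched as follows, building on the pratt_measures sketch of Section 2.2 (the function name and interface here are ours): the dependent variable and the non-suppressor columns are each replaced by their residuals from least squares regressions on the identified suppressors, and the standardized Pratt measures are then recomputed.

```python
import numpy as np

def orthogonalized_pratt(X, y, suppressor_idx):
    """Orthogonalized importance measures of Section 3.1: residualize y and
    the non-suppressor columns of X on the suppressor columns (a list of
    column indices), then recompute standardized Pratt measures via the
    pratt_measures sketch of Section 2.2. X and y are assumed centred."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    S = X[:, suppressor_idx]                          # suppressor columns
    keep = [j for j in range(X.shape[1]) if j not in set(suppressor_idx)]
    # residual from the least squares regression of v on the suppressors
    resid = lambda v: v - S @ np.linalg.lstsq(S, v, rcond=None)[0]
    X_o = np.column_stack([resid(X[:, j]) for j in keep])
    return keep, pratt_measures(X_o, resid(y))
```

For a single suppressor x1 this reproduces the regression of y − γ̂1 x1 on the xᵒj described above.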

3.2 Negativity of the Pratt Measures


If the independent variables xj are orthogonal, β̂j = ρ̂j , in which
case it can be seen from equation (2.6) that all standardized Pratt
measures will be non-negative, i.e., dj ≥ 0, j = 1, . . . , p. In general,
the dj need not be positive, a property that potentially detracts
from their utility as measures of variable importance. It is therefore
important to examine in detail the conditions under which negative
importance can occur.
In the Appendix, a negative lower bound for the relative impor-
tance, dj , of variable xj is derived in terms of a measure of the
non-orthogonality (or multicollinearity) of the xj ’s. This bound is
given by
(3.7) dj ≥ −(1/2)[(VIFj)^1/2 − 1], j = 1, . . . , p,

where VIFj , the variance inflation factor (Wetherill, 1986, p. 87)
for variable xj , is given by VIFj = 1/(1 − R²(j)), and R²(j) is the
squared multiple correlation from the regression of variable xj on
the remaining x’s. Clearly, a negative dj of “large” magnitude can
occur only if the variance inflation factor (a standard measure of
multicollinearity) for the j ’th variable is large.
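In practice the bound is easily checked: the sketch below (our own helper; regressor columns are assumed centred) computes each R²(j) by regressing one column on the rest, and returns both the variance inflation factors and the right-hand side of (3.7).

```python
import numpy as np

def vif_and_bound(X):
    """Variance inflation factors VIF_j = 1/(1 - R2_(j)) and the lower
    bound (3.7) on each d_j. A sketch; columns of X assumed centred."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    vif = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2_j = 1.0 - (resid @ resid) / (X[:, j] @ X[:, j])
        vif[j] = 1.0 / (1.0 - r2_j)
    bound = -0.5 * (np.sqrt(vif) - 1.0)    # equation (3.7)
    return vif, bound
```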
In some cases, multicollinearity can be easily resolved by omit-
ting one or more variables from the regression (see Wetherill,
1986, p. 106). In cases where this strategy may not be appropriate,
the technique of ridge regression (Hoerl and Kennard, 1970) may
provide a remedy. It will be shown in the next sub-section that
the geometric interpretation of Pratt’s measure can also be used to
develop a measure of variable importance for ridge regression.

3.3 Ridge Regression


The technique of ridge regression (Hoerl and Kennard, 1970) uses
biased estimates of the regression parameters that have smaller
mean squared errors (i.e., more stability) than the unbiased estimates
obtained via least squares.
In their simplest form, ridge estimates are given by

(3.8) β̂R = (X′X + kI)−1 X′y,

where β̂R is the p × 1 vector of ridge regression estimates with
elements β̂jR , j = 1, . . . , p, X is the N × p matrix with columns
xj , j = 1, . . . , p, and k is a positive constant. This formulation is
equivalent to the addition of the constant k to each of the eigenvalues
of X′X. An alternative approach, referred to as generalized ridge
regression, adds a different constant ki , i = 1, . . . , p, to each of
the eigenvalues of X′X, denoted λi , i = 1, . . . , p. The advantage
of the latter approach is that the effect of small λi (which result in
large parameter variances) can be damped by the addition of large ki ,
while much smaller biasing constants can be added to the larger λi ,
which do not contribute as much to variance inflation. Generalized
ridge regression is expressed most succinctly in terms of orthogonalized
variables. The eigenvalue decomposition of X′X yields X′X =
QΛQ′, where Λ is the diagonal matrix of eigenvalues, λi , i = 1, . . . ,
p, in decreasing order, and Q is the p × p matrix of corresponding
eigenvectors. The orthogonalized regressors can then be expressed
as Z = XQ, with corresponding parameter vector α = Q′β, where
β is the p × 1 vector of standardized regression coefficients βj , j =
1, . . . , p. It then follows that the least squares estimate of α is given
by

(3.9) α̂ = Λ−1 Z′y.

To obtain generalized ridge regression estimates of α, the eigenvalue
matrix Λ in equation (3.9) is replaced by Λ + K, where K
is the diagonal p × p matrix of biasing constants ki , i = 1, . . . , p.
Since β̂ = Qα̂, the generalized ridge estimates of the standardized
regression coefficients β are defined as

(3.10) β̂G = Q(Λ + K)−1 Q′X′y.

The fitted value of y corresponding to the generalized ridge estimates
is given by ŷG = Xβ̂G , or alternatively

(3.11) ŷG = β̂1G x1 + . . . + β̂pG xp .

Fitted values under the ridge estimates (3.8) are given by equation
(3.11), with the β̂jG 's replaced by β̂jR 's.
One of the challenges of ridge regression is the choice of the ridge
constant k, or in the generalized case, the choice of the matrix K. In
the example given in Section 4.2, a method proposed by Hocking,
Speed and Lynn (1976) is used. Details are given by Montgomery
and Peck (1982), along with a discussion of alternative approaches.
The geometric argument of Section 2 can be used to define
measures of relative importance based on ridge and generalized
ridge estimates. As before, the importance measures are defined as
the ratios of the signed lengths of the projected components of β̂jR xj
(or β̂jG xj ) to the length of ŷR (or ŷG ). The resulting measures are
given by equation (2.5), with ŷ replaced by ŷR (or ŷG ), and β̂j by
β̂jR (or β̂jG ). A more easily evaluated expression for the importance
measures under generalized ridge regression, for example, is given
by

(3.12) djG = β̂jG (X′Xβ̂G)j / (β̂G′X′Xβ̂G), j = 1, . . . , p,
where (·)j denotes the j ’th element of a vector. A corresponding
expression holds for the djR , the importances under regular ridge
regression. It should be noted that for the centred and standardized
variables used in this analysis, X′X corresponds to the sample corre-
lation matrix of the explanatory variables. Importance measures
under generalized ridge regression are illustrated in Section 4.2.
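The calculations of this subsection can be collected into a short sketch (our own; X and y are assumed centred and scaled to unit length as in Section 2.2, so that X′X is the sample correlation matrix). The vector K of biasing constants must be supplied by the user, for example via the Hocking, Speed and Lynn (1976) rule; setting K to zero recovers least squares.

```python
import numpy as np

def generalized_ridge_pratt(X, y, K):
    """Generalized ridge coefficients, equation (3.10), and the associated
    importance measures d_j^G, equation (3.12). A sketch: X, y centred and
    scaled to unit length; K is the vector of biasing constants k_i,
    aligned with the eigenvalues in decreasing order."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    lam, Q = np.linalg.eigh(X.T @ X)      # eigenvalues in ascending order
    lam, Q = lam[::-1], Q[:, ::-1]        # reorder to decreasing, as in the text
    alpha = (Q.T @ (X.T @ y)) / (lam + np.asarray(K, dtype=float))
    beta_G = Q @ alpha                    # equation (3.10)
    w = X.T @ (X @ beta_G)                # X'X beta_G
    d_G = beta_G * w / (beta_G @ w)       # equation (3.12)
    return beta_G, d_G
```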

4. EXAMPLES

4.1 Variable Importance and Suppressor Effects


This example is based on some SPSS computer output presented
by Stevens (1992, pp. 83-87). The data were originally from Morri-
son (1983), and relate to a student evaluation of an MBA course.
The dependent variable is the evaluation rating given the instructor
(INSTEVAL), with five candidate explanatory variables consisting
of the course evaluation score (COUEVAL), instructor's knowledge
(KNOWLDGE), stimulation (STIMUL), clarity of presentation
(CLARITY), and interest (INTEREST). Stevens reports the results
of two sequential selection algorithms, a forward stepwise proce-
dure and a backward elimination procedure. The results reported
in Table I relate to the backward elimination procedure. Only the
INTEREST variable was eliminated, leaving four explanatory vari-
ables in the final model, resulting in an R 2 of 0.894. Values of the
standardized regression coefficients, the importance measures, dj ,
and the t-statistics, for each of the four variables, are shown in the
first three columns of Table I. An inspection of the dj ’s reveals that
the most important variable is CLARITY, with an importance rating
of 0.501. Variables COUEVAL and STIMUL are effectively equal
in importance, with relative importances of 0.205 and 0.224, respec-
tively. The variable KNOWLDGE has a relatively small value of dj ,
though its standardized regression coefficient is similar in magni-
tude to that of COUEVAL and STIMUL. This suggests that the
TABLE I
The Morrison MBA data: Suppressor effects

Variable    β̂j      dj      tj      djNS    dᵒj
COUEVAL     0.249   0.205   2.453   0.221   0.222
KNOWLDGE    0.221   0.070   3.518   –       –
STIMUL      0.271   0.224   3.335   0.241   0.239
CLARITY     0.520   0.501   5.698   0.538   0.539

R² = 0.894;  (R² − R²NS)/R² = 0.054

variable KNOWLDGE contributes to the regression in a suppressor
role. No hint of the suppressor status of KNOWLDGE is provided
either by its standardized regression coefficient or its t-statistic, both
of which are often used as measures of variable importance (Bring,
1996).
For the three non-suppressor variables COUEVAL, STIMUL and
CLARITY, Table I also displays the scaled importance measures
djNS (see equation 3.6) together with the importances dᵒj corre-
sponding to the components of the non-suppressors orthogonal to
the suppressor variable KNOWLDGE. As noted in Section 3.1,
these alternative measures of the relative importances of the non-
suppressor variables are numerically very similar. They demonstrate
very clearly that CLARITY is about as important a direct contrib-
utor to the regression as COUEVAL and STIMUL together. To
complete the assessment of variable importance, the importance
of the suppressor variable KNOWLDGE relative to the three
non-suppressor variables must be measured. For this example, the
measure (R² − R²NS)/R² proposed in Section 3.1 yields a relative
importance of 0.054, i.e., the suppressor variable KNOWLDGE
makes a relatively small indirect contribution to the regression.

4.2 Variable Importance and Multicollinearity


The data for this multiple regression example are taken from a study
by Kumar and Kumar (1992) of R&D projects undertaken by a
sample of technology companies. The variables used in the exam-
ple have been selected to illustrate the behaviour of the importance
measure dj in the presence of multicollinearity of the explanatory
TABLE II
The project cost data: Multicollinearity effects

Variable    β̂j       dj       tj       VIFj
SIZE        −0.354   −0.330   −1.737   8.00
SALES        0.420    0.438    1.974   8.73
RISKY        0.573    0.818    4.859   2.683
RDINTEN      0.105    0.074    1.242   1.42

R² = 0.466

variables. A complete analysis of the data is described in the original
report referred to above. The dependent variable in the regression
is COST, the log of the total cost of the sampled project, and the
explanatory variables are: the log of the annual sales of the company
running the project (SALES); the log of the number of company
employees (SIZE); the log of the total dollar expenditure by the
company on projects that have less than a 50% chance of success
(RISKY); and a measure of the R&D intensiveness of the company
running the project (RDINTEN). The regression of COST on these
four explanatory variables yielded an R 2 of 0.466. The standardized
regression coefficients, the importance measures dj , the t-statistics,
and the corresponding variance inflation factors are shown in Table
II. It can immediately be seen from the table that the importance
measure for SIZE is negative, i.e., d1 = −0.33. Given that the dj ’s
sum to one, this is too “large” a negative value of importance to
be ignored. It is clear from the corresponding value of the vari-
ance inflation factor, i.e., VIF1 = 8.0, that this "large" negative
value is associated with multicollinearity. It should also be noted
that variable SALES has a large variance inflation factor of 8.73,
though its importance measure is positive. These results illustrate
the implication of the lower bound shown in equation (3.7), namely
that “large” negative importance implies multicollinearity. It is not
surprising that the variables SIZE and SALES are involved in a
collinear relationship, since both are measures of the size of the
company sponsoring the project.
It is natural to resolve the multicollinearity problem in this case
by dropping either the SIZE or SALES variable from the regression.
The first three columns of Table III display the regression results
TABLE III
The project cost data: Multicollinearity resolved

Variable    β̂j      dj      tj      β̂jG     djG
SIZE        –       –       –       0.014   0.015
SALES       0.101   0.109   0.931   0.061   0.068
RISKY       0.557   0.823   4.693   0.565   0.870
RDINTEN     0.095   0.068   1.103   0.067   0.047

R² = 0.450

obtained when SIZE, the variable exhibiting negative importance,
is dropped. It is clear from the dj values that in the three variable
regression, RISKY is the only explanatory variable of any impor-
tance. Note also that the importance of SALES in Table III (d1 =
0.109) is very similar to the joint importance of SIZE and SALES
in Table II, i.e., d1 (SIZE) + d2 (SALES) = 0.108, a clear illustra-
tion of the effects of multicollinearity coupled with additivity of the
importance measures.
As noted earlier, dropping variables is not always an appropriate
way of dealing with multicollinearity. Thus generalized ridge esti-
mates of regression coefficients and associated importance measures
have also been evaluated (see equations 3.10 and 3.12), and are
displayed in the fourth and fifth columns of Table III, respec-
tively. From the table, it can be seen that the estimates of both the
standardized regression coefficients and the corresponding impor-
tances are similar to those obtained by dropping the SIZE variable.
In particular, the ridge coefficient for the SIZE variable and the
corresponding importance measure are negligible (0.014 and 0.015,
respectively), confirming that in this example, the simple strategy
of dropping the variable exhibiting “large” negative importance was
appropriate. This need not be the case in general.

5. SUMMARY AND CONCLUSIONS

This article has been concerned with a commonly used but contro-
versial measure of variable importance in regression (the "independent
contribution measure", Green et al., 1978; the "product
measure", Bring, 1996). This measure has been shown by Pratt
(1987) to be the unique measure of relative importance satisfy-
ing a set of natural assumptions, or axioms, and in this article
we have used the appellation "Pratt's measure" in recognition of
Pratt’s (1987) theoretical justification. An overview of the axiomatic
derivation has been presented in order to highlight the intuitive
appeal of Pratt’s assumptions, and to emphasize that, mild though
these assumptions are, only Pratt’s measure satisfies them all. An
overview of the geometric interpretation of Pratt’s measure has also
been provided, given that much of the analysis is based on the
geometry of least squares regression.
The primary focus of the article has been on two particular
criticisms of Pratt’s measure. First, Bring (1996) illustrated geomet-
rically what he called a “counterintuitive” situation where a variable
is accorded zero importance by Pratt’s measure, despite the fact
that its inclusion in the regression equation results in an increase
in R 2 . Variables exhibiting this behaviour have been categorized as
suppressor variables in this article, namely variables which have no
direct relationship to the response variable, but which make their
contribution to the regression model through their relationship with
the other predictors (see also Thomas, 1992; Thomas and Zumbo,
1996). It has been demonstrated that when suppressor variables are
included in the regression model, all variable importance in the Pratt
sense is assigned to the subspace of explanatory variables ortho-
gonal to the suppressors. As illustrated by the Morrison MBA data
in Section 4.1, suppressors can be identified in practice as variables
having small standardized Pratt measures (namely Pratt measures
scaled to sum to one) together with standardized regression coeffi-
cients that are comparable in magnitude to those exhibited by the
“important” explanatory variables. It was shown in the example
that the relative importance of the non-suppressors can be assessed
directly using the original standardized Pratt measures, or alter-
natively by using Pratt measures based on the components of the
non-suppressors orthogonal to the suppressors. Numerically, the two
approaches are very similar, so the former is recommended in view
of its simplicity.
Once it is recognized that suppressor and non-suppressor vari-
ables contribute to the regression in different ways, it can be
seen that the Pratt measures provide an entirely natural and justi-
fiable approach to assessing the relative importance of the non-
suppressors. In other words, the distinction between suppressor and
non-suppressor variables renders Bring’s (1996) “counterintuitive”
example entirely “intuitive”. Of course, a complete assessment of
variable importance for a given regression equation also requires
some method for measuring the importance of the suppressor vari-
ables relative to the set of non-suppressors. It has been suggested
in the article that this relative importance can be unambiguously
assessed using the increment in R 2 contributed by the suppressors.
The separate treatment of suppressor and non-suppressor variables
recommended in this article is consistent with some comments
made by Budescu (1993), who developed a method of assessing
the relative importance of variables in terms of the “dominance” of
certain subset regression models. He noted that “It is important to
distinguish between those cases in which a ranking, or scaling, by
importance is feasible and meaningful and those instances in which
such a ranking is futile and meaningless". In Budescu's terms, an
assessment of the relative importance of the non-suppressors in an
equation is “feasible and meaningful” and can be routinely effected
using the Pratt measures. On the other hand, any attempt to assess
relative importance without first identifying which, if any, of the
variables are suppressors is likely to be “futile and meaningless”
irrespective of the method employed.
The second criticism of Pratt’s importance measure addressed in
this article relates to its possible negativity. Though Pratt (1987)
considered negative importance to signify “a situation too complex
for a single measure, not a defect in the definition”, it does appear
that prospective users of the measure are troubled by the idea of
negative importance. However, a lower bound for an individual
standardized Pratt measure has been derived (see Appendix) that
shows that negative importance of “large” magnitude can only occur
in the presence of multicollinearity of the explanatory variables.
Furthermore, the geometric view of relative importance has been
used to extend Pratt’s measure to regression estimates obtained
using generalized ridge regression (Hoerl and Kennard, 1970; Hock-
ing et al., 1976), an approach to parameter estimation that yields
lower mean squared errors than does least squares. This approach to
importance assessment has been illustrated in Section 4.2. For the
example studied, the importance measures obtained via generalized
ridge estimation were very similar to those obtained by the simple
strategy of dropping the variable that exhibited “large” negative
importance, though it should be noted that this will not always be the
case. The generalized ridge version of Pratt’s measure, on the other
hand, will always provide a viable approach when multicollinearity
is encountered, so that the possibility of negative importance need
not be considered a deficiency. Practitioners should also note that
multicollinearity does not necessarily result in “large” negative Pratt
indices, since negativity is a sufficient but not necessary indicator
of multicollinearity. Nevertheless, unresolved multicollinearity will
always result in unreliable assessments of variable importance irre-
spective of the measures being used, so that standard checks for
multicollinearity should be routinely employed.
In summary, Pratt’s measure of relative importance is the only
measure that satisfies a set of natural requirements, comprised
primarily of symmetry and invariance to linear transformations.
An important side benefit of the measure is that it allows for
an additive definition of variable importance. Pratt’s measure can
be used, in conjunction with the standardized regression coeffi-
cients, to identify any suppressor variables that may be present
in the regression equation. Because suppressor and non-suppressor
variables contribute to the regression equation in completely differ-
ent ways, their contributions must be separately assessed. Pratt’s
measure yields unambiguous measures of the relative importance
of the non-suppressors. The relative importance of the suppressor
variables to the non-suppressors can be obtained in terms of their
additional contribution to R 2 . Though Pratt’s measures can be
negative, it has been shown that negative importance of “large”
magnitude signals multicollinearity of the explanatory variables.
Whenever multicollinearity is identified, it must be resolved irre-
spective of the importance measure to be used. Pratt’s measure can
be easily adapted for use with generalized ridge regression, so that
multicollinearity need no longer deter practitioners from using the
measure.
Regarding the choice of importance measure, Bring (1996) was
of the opinion that “no clear-cut answer can be given as to which
measure to use". The authors agree that the choice rests with the
individual practitioner. However, it is hoped that this exposition of the
advantages of Pratt’s measure, and the resolution of the criticisms
levelled against it, will simplify this choice.

APPENDIX

A Lower Bound For dj


The vectors of explanatory variables, xj , j = 1, . . . , p, are a basis for the
model space X. Consider the dual basis for X defined by the vectors qj ,
j = 1, . . . , p, which satisfy the relations
(A.1) q′j xj′ = δjj′ , j, j′ = 1, . . . , p,

i.e., qj is orthogonal to xj′ for j ≠ j′ , and q′j xj = 1, j = 1, . . . , p. Let X(j)
be the subspace generated by all the x's except xj . Then qj is orthogonal to
X(j) , and X(j) and either xj or qj generate X. Let φj be the angle between
qj and xj . Then

(A.2) cos φj = q′j xj /(|qj ||xj |) = 1/|qj |.
Since the xj are orthogonal if and only if φj = 0, j = 1, . . . , p, i.e., if
and only if qj = xj , j = 1, . . . , p, the cos φj provide a measure of non-
orthogonality. A small value of cos φj implies collinearity between xj and
the remaining x’s.
From equations (2.2) and (A.1), it follows that
(A.3) q′j ŷ = q′j (Σj′ β̂j′ xj′ ) = β̂j q′j xj = β̂j ,

so that the importance dj given in equation (2.5) can be written

(A.4) dj = (q′j ŷ)(ŷ′xj )/|ŷ|² ,
(A.5)    = (q′j y̌)(y̌′xj )/|ŷ|² ,

where y̌ is the projection of ŷ onto the plane of qj and xj . Clearly, the
absolute value of dj will be greatest if ŷ lies in the plane of qj and xj . To
get a lower bound for dj , we must choose ŷ to minimize the (negative) value
of dj , and in view of the preceding remark, the search can be confined to
the plane of qj and xj . Since equation (A.4) is homogeneous in ŷ, the
minimization problem can be stated as

(A.6) Minimize (y′qj )(x′j y) = (1/2) y′(qj x′j + xj q′j )y = y′My, subject to y′y = 1.
It can be shown that the two non-zero eigenvalues of M are (1/2)[x′j qj ±
|xj ||qj |] = (1/2)[1 ± |qj |]. The desired (negative) minimum of dj is then
given by the eigenvalue corresponding to the minus sign, i.e.,

(A.7) dj ≥ (1/2)[1 − 1/cos φj ].

Equality is achieved if ŷ coincides with the eigenvector of M corresponding
to the negative eigenvalue. Clearly, the magnitude of this negative
lower bound increases as cos φj decreases, i.e., as the degree of collinearity
between xj and the other x's increases. The lower bound (A.7) can be
expressed in terms of the variance inflation factor VIFj = 1/(1 − R²(j)),
where R²(j) is the squared multiple correlation in the regression of xj on
the other explanatory variables. By definition, the angle between xj and
the hyperplane X(j) is π/2 − φj , so that regressing xj on the other variables
yields R²(j) = cos²(π/2 − φj ) = 1 − cos²φj . Thus cos φj = (VIFj)^−1/2 , and the
lower bound then becomes

(A.8) dj ≥ −(1/2)[(VIFj)^1/2 − 1].
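As a numerical sanity check on (A.8), one can simulate a deliberately collinear design and verify that no dj falls below the bound. The sketch below reuses the pratt_measures and vif_and_bound helpers sketched in Sections 2.2 and 3.2; the data are artificial.

```python
import numpy as np

rng = np.random.default_rng(1)

# Artificial data with strong collinearity between the first two columns.
N, p = 200, 4
Z = rng.standard_normal((N, p))
Z[:, 0] = Z[:, 1] + 0.25 * Z[:, 0]
y = Z @ np.array([1.0, -1.0, 0.5, 0.2]) + rng.standard_normal(N)

Zc = Z - Z.mean(axis=0)
d = pratt_measures(Zc, y)              # standardized Pratt measures
vif, bound = vif_and_bound(Zc)         # VIFs and the bound (3.7)/(A.8)
assert np.all(d >= bound - 1e-9)       # the bound always holds
```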

REFERENCES

Bring, J.: 1994, 'How to standardize regression coefficients', The American
Statistician 48, pp. 209–213.
Bring, J.: 1996, ‘A geometric approach to compare variables in a regression
model’, The American Statistician 50, pp. 57–62.
Budescu, D. V.: 1993, ‘Dominance analysis: A new approach to the problem of
relative importance of predictors in multiple regression’, Psychological Bulletin
114, pp. 542–551.
Green, P. E., J. D. Carroll and W. S. DeSarbo: 1978, ‘A new measure of predictor
variable importance in multiple regression’, Journal of Marketing Research 15,
pp. 356–360.
Healy, M. J. R.: 1990, 'Measuring importance', Statistics in Medicine 9, pp. 633–
637.
Hocking, R. R., F. M. Speed and M. J. Lynn: 1976, ‘A class of biased estimators
in linear regression’, Technometrics 18, pp. 425–437.
Hoerl, A. E. and R. W. Kennard: 1970, ‘Ridge regression: Biased estimation for
nonorthogonal problems’, Technometrics 12, pp. 55–67.
Huberty, C. J.: 1994, Applied Discriminant Analysis (Wiley, New York).
Kruskal, W.: 1987, 'Relative importance by averaging over orderings', The
American Statistician 41, pp. 6–10.
Kruskal, W. and R. Majors: 1989, 'Concepts of relative importance in recent
scientific literature’, The American Statistician 43, pp. 2–6.
Kumar, V. and U. Kumar: 1992, 'Technological innovation in Canadian manufac-
turing industry: An investigation of the speed and cost of innovation’, Report
for Industry Canada (Carleton University, Ottawa).
Montgomery, D. C. and E. A. Peck: 1982, Introduction to Linear Regression
Analysis (Wiley, New York).
Morrison, D. F.: 1983, Applied Linear Statistical Methods (Prentice-Hall, Engle-
wood Cliffs, NJ).
Pratt, J. W.: 1987, ‘Dividing the indivisible: Using simple symmetry to partition
variance explained’, in T. Pukkila and S. Puntanen (eds.), Proceedings of the
Second International Conference in Statistics (University of Tampere, Tampere,
Finland) pp. 245–260.
Saville, D. J. and G. R. Wood: 1986, ‘A method of teaching statistics using
N-dimensional geometry', The American Statistician 40, pp. 205–214.
Stevens, J.: 1992, Applied Multivariate Statistics for the Social Sciences (2nd
edn.) (Lawrence Erlbaum Associates, Hillsdale NJ).
Thomas, D. R.: 1992, ‘Interpreting discriminant functions: A data analytic
approach’, Multivariate Behavioural Research 27, pp. 335–362.
Thomas, D. R.: 1997, ‘A note on Huberty and Wisenbaker’s “Views of variable
importance” ’, Journal of Educational and Behavioral Statistics 22, pp. 309–
322.
Thomas, D. R., E. Hughes and B. D. Zumbo: 1996, ‘Variable importance
in regression and related analyses’, Working Paper, WPS 96-01, School of
Business, Carleton University, Ottawa, Canada.
Thomas, D. R. and B. D. Zumbo: 1996, ‘Using a measure of variable impor-
tance to investigate the standardization of discriminant coefficients’, Journal of
Educational and Behavioural Statistics 21, pp. 110–130.
Thomas, D. R. and B. D. Zumbo: 1997, ‘Variable importance in logistic regression
based on partitioning an R 2 measure’, Working Paper, WPS 97-01, School of
Business, Carleton University, Ottawa, Canada.
Wetherill, G. G.: 1986, Regression Analysis with Applications (Chapman and
Hall, London).

School of Business
Carleton University
Ottawa, ON K1S 5B6
Canada
