
Technometrics
ISSN: 0040-1706 (Print), 1537-2723 (Online). Journal homepage: www.tandfonline.com/journals/utch20

To cite this article: C. L. Mallows (2000), "Some Comments on Cp," Technometrics, 42:1, 87-94, DOI: 10.1080/00401706.2000.10485984
To link to this article: https://doi.org/10.1080/00401706.2000.10485984

Published online: 12 Mar 2012.
© 1973 American Statistical Association and the American Society for Quality. Reprinted as Technometrics, February 2000, Vol. 42, No. 1, 87-94.

Some Comments on Cp
C. L. MALLOWS
Bell Laboratories
Murray Hill, New Jersey
(AT&T Research Laboratories, Florham Park, NJ 07932)

We discuss the interpretation of Cp-plots and show how they can be calibrated in several ways. We
comment on the practice of using the display as a basis for formal selection of a subset-regression
model, and extend the range of application of the device to encompass arbitrary linear estimates
of the regression coefficients, for example Ridge estimates.

KEY WORDS: Linear regression; Selection of variables; Ridge regression.

1. INTRODUCTION

Suppose that we have data consisting of n observations on each of k + 1 variables, namely k independent variables x_1, ..., x_k and one dependent variable, y. Write x_0 = 1, x(1 × (k+1)) = (x_0, x_1, ..., x_k), y(n × 1) = (y_1, ..., y_n)^T, X(n × (k+1)) = (x_{uj}). A model of the form

    y_u = η(x_u) + e_u,    u = 1, 2, ..., n    (1)

is to be entertained (if x_0 is absent the development is entirely similar; we assume throughout that X has rank k + 1), with the residuals e_1, ..., e_n being regarded (tentatively) as being independent random variables with mean zero and unknown common variance σ². The x's are not to be regarded as being sampled randomly from some population, but rather are to be taken as fixed design variables. We suppose that the statistician is interested in choosing an estimate β̂ = (β̂_0, ..., β̂_k), with the idea that for any point x in the general vicinity of the data at hand, the value xβ̂ will be a good estimate of η(x). In particular he may be interested in choosing a "subset least-squares" estimate in which some components of β̂ are set at zero and the remainder estimated by least squares.

The Cp-plot is a graphical display device that helps the analyst to examine his data with this framework in mind. Consider a subset P of the set of indices K+ = {0, 1, 2, ..., k}; let Q be the complementary subset. Suppose the numbers of elements in P, Q are |P| = p, |Q| = q, so that p + q = k + 1. Denote by β̂_P the vector of estimates that is obtained when the coefficients with subscripts in P are estimated by least squares, the remaining coefficients being set equal to zero; i.e.

    β̂_P = X_P⁺ y

where X_P⁺ is the (Moore-Penrose) generalized inverse of X_P, which in turn is obtained from X by replacing the columns having subscripts in Q by columns of zeroes. (Thus X_P⁺ has zeroes in the rows corresponding to Q, and the remaining rows contain the matrix (Z_P^T Z_P)⁻¹ Z_P^T, where Z_P is obtained from X by deleting the columns corresponding to Q.) Let RSS_P denote the corresponding residual sum of squares, i.e.

    RSS_P = Σ_u (y_u − x_u β̂_P)².

For any such estimate β̂_P, a measure of adequacy for prediction is the "scaled sum of squared errors"

    J_P = σ⁻² Σ_u (x_u β̂_P − η(x_u))²,

the expectation of which is easily found to be

    E(J_P) = V_P + σ⁻² B_P

where V_P, B_P are respectively "variance" and "bias" contributions given by

    V_P = p,    B_P = β_Q^T X^T (I − M_P) X β_Q    (2)

and β_Q is β with the elements corresponding to P replaced by zeroes, and M_P = X X_P⁺ = X_P X_P⁺ = Z_P (Z_P^T Z_P)⁻¹ Z_P^T. The Cp statistic is defined to be

    C_P = RSS_P / σ̂² − n + 2p    (3)

where σ̂² is an estimate of σ². Clearly (as has been remarked by Kennard (1971)), Cp is a simple function of RSS_P, as are the multiple correlation coefficient defined by 1 − R_P² = RSS_P / TSS (where TSS is the total sum of squares) and the "adjusted" version of this. However the form (3) has the advantage (as has been shown by Gorman and Toman (1966), Daniel and Wood (1971), and Godfrey (1972)) that, since under the above assumptions

    E(RSS_P) = (n − p)σ² + B_P,    (4)

Cp is an estimate of E(J_P), and is suitably standardized for graphical display, plotted against p.
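Equation (3) is straightforward to compute for all subsets at once. The following is a minimal Python sketch (ours, not the paper's; it assumes numpy is available, takes the first column of X to be the constant term, and estimates σ̂² by the full-model residual mean square, as in section 3 below); the function name is hypothetical.

    from itertools import combinations

    import numpy as np

    def cp_all_subsets(X, y):
        """C_P = RSS_P/sigma2 - n + 2p for every subset P containing column 0.

        X is the n x (k+1) design matrix whose first column is all ones;
        sigma2 is the residual mean square from the full least-squares fit.
        """
        n, k1 = X.shape
        beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss_full = float(((y - X @ beta_full) ** 2).sum())
        sigma2 = rss_full / (n - k1)            # sigma-hat squared
        cp = {}
        for extra in range(k1):                 # number of non-constant terms
            for P in combinations(range(1, k1), extra):
                cols = (0,) + P                 # the constant term is always kept
                b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
                rss = float(((y - X[:, cols] @ b) ** 2).sum())
                cp[cols] = rss / sigma2 - n + 2 * len(cols)
        return cp                               # plot cp[P] against p = len(P)

With this choice of σ̂², the full subset K+ gives C_{K+} = k + 1 exactly, as noted in section 2.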
Graphical presentation of the various regression sums of squares themselves against p was advocated by Watts (1965). For k not too large it is feasible (Furnival 1971; Garside 1965; Schatzoff, Tsao, and Fienberg 1968) to compute and display all the 2^{k+1} values of C_P; for larger values one can use algorithms of Beale, Kendall, and Mann (1967), Hocking and Leslie (1967), and La Motte and Hocking (1970) to compute only the more interesting (smaller) values.

In section 2 we describe some of the configurations that can arise; in section 3 we provide some formal calibration for the display and in section 4 comment on the practice of using it as a basis for formal selection. The approach is extended in section 5 to handle arbitrary linear estimates of the regression coefficients.

The approach can also be extended to handle multivariate response data and to deal with an arbitrary weight function w(x) in factor-space, describing a region of interest different from that indicated by the configuration of the data currently to hand. In each case, the derivation is exactly parallel to that given above. In the former case, one obtains a matrix analog of C_P in the form Σ̂⁻¹ RSS_P − (n − 2p)I, where Σ̂ is an estimate of the residual covariance matrix and RSS_P is Σ_u (y_u − x_u B̂_P)^T (y_u − x_u B̂_P). One or more measures of the "size" of C_P (such as the trace, or largest eigenvalue) can be plotted against p. In the latter case, with the matrix A = (A_{ij}) defined by A_{ij} = ∫ x_i x_j w(x) dx, one arrives at a statistic of the form

    C_P^A = σ̂⁻² (β̂_P − β̂)^T A (β̂_P − β̂) − V_{K+}^A + 2 V_P^A

where V_{K+}^A = trace(A (X^T X)⁻¹), V_P^A = trace(A (X_P^T X_P)⁺), and we can plot C_P^A against V_P^A. This reduces to the Cp-plot when A = X^T X. If interest is concentrated at a single point x, we have A = x^T x, and the statistic σ̂² C_P^A is equivalent to that suggested by Allen (1971); his equation (9) equals σ̂² (C_P^A − x (X^T X)⁻¹ x^T).
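As a concrete reading of this generalization, the sketch below (ours) computes C_P^A and V_P^A for a single subset. The use of the Moore-Penrose inverse of the zero-padded cross-product matrix for V_P^A is our reconstruction of the badly garbled original display, not a verified formula, though it does reduce to (C_P, p) when A = X^T X and σ̂² is the full-model residual mean square.

    import numpy as np

    def cp_weighted(X, y, cols, A, sigma2):
        """Weighted statistic C_P^A and its variance term V_P^A (reconstruction)."""
        beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
        XP = np.zeros_like(X)
        XP[:, list(cols)] = X[:, list(cols)]     # columns with subscripts in Q zeroed
        beta_P = np.linalg.pinv(XP) @ y          # subset least squares, zero-padded
        d = beta_P - beta_full
        V_K = float(np.trace(A @ np.linalg.inv(X.T @ X)))
        V_P = float(np.trace(A @ np.linalg.pinv(XP.T @ XP)))
        cpa = float(d @ A @ d) / sigma2 - V_K + 2.0 * V_P
        return cpa, V_P                          # plot C_P^A against V_P^A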
2. SOME CONFIGURATIONS ON Cp-PLOTS

From (2), (3), (4) we see that if β_Q = 0, so that the P-subset model is in fact completely appropriate, then E(RSS_P) = (n − p)σ² and E(C_P) ≈ p. If σ̂² is taken as RSS_{K+}/(n − k − 1), then C_{K+} = |K+| = k + 1 exactly.

Notice that if P* is a (p + 1)-element subset which contains P, then

    C_{P*} − C_P = 2 − SS/σ̂²    (5)

where SS is the one-d.f. contribution to the regression sum of squares due to the (p + 1)-th variable, so that SS/σ̂² is a t² statistic that could be used in a stepwise testing algorithm. If the additional variable is unimportant, i.e. if the bias contribution B_P − B_{P*} is small, then E(SS) ≈ σ² and so

    E(C_{P*} − C_P) ≈ 1.

Mantel (1970) has discussed the use of stepwise procedures, and how they behave in the face of various patterns of correlation amongst the independent variables. It is illuminating to consider how patterns similar to those he describes would show up on a Cp-plot.

First, suppose the independent variables are not highly correlated, that β = β_P, and that every non-zero element of β is large (relative to the standard error of its least-squares estimate). Then the Cp-plot will look something like Figure 1 (drawn for the case p = k − 2, K+ − P = {1, 2, 3}). Notice the approximately linear diagonal configuration of points corresponding to the well-fitting subsets of variables.

Now, suppose x1, x2, x3 are highly correlated with each other, with each being about equally correlated with y. Then any two of these variables, but not all three, can be deleted from the model without much effect. In this case the relevant points on the Cp-plot will look something like Figure 2a if no other variables are of importance, or like Figure 2b if some other subset P is also needed. (In all these examples we are assuming that the constant term β_0 is always needed.) Notice that now the diagonal pattern is incomplete. In an intermediate case, when x1, x2, x3 have moderate correlations, a picture intermediate between Figures 1 and 2b will be obtained.

Thirdly, suppose x1, x2 are individually unimportant but jointly are quite effective in reducing the residual sum of squares; suppose some further subset P of variables is also needed. Mantel gives an explicit example of this behavior. Figure 3 shows the resulting configuration in the case |P| = k − 4. Notice that even if C_{P,1,2} is the smallest Cp-value for subsets of size p + 2, there might be subsets P'_1, P'_2 (not containing P) with |P'| = p or p + 1 that gave smaller values of Cp than those for P, {P, 1}, {P, 2}. In this case an upward stepwise testing algorithm might be led to include variables in these subsets and so not get to the subset {P, 1, 2}. Mantel describes a situation where this would happen. The second of these configurations is easy to reproduce by simulation, as in the sketch following the figure captions below.

[Figure 1. Cp-plot: P Is an Adequate Subset.]
[Figure 2a. Cp-plot: Variables 1, 2, 3 Are Highly Explanatory, Also Highly Correlated.]
[Figure 2b. Cp-plot: Same as 2a Except That Variables in P Are Also Explanatory.]
[Figure 3. Cp-plot: Two Variables That Jointly Are Explanatory But Separately Are Not.]
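A sketch of such a simulation (ours; the synthetic numbers are arbitrary, and cp_all_subsets is the function from the sketch in section 1):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    z = rng.normal(size=n)
    # three nearly collinear variables, each about equally correlated with y
    X = np.column_stack([np.ones(n)] +
                        [z + 0.05 * rng.normal(size=n) for _ in range(3)])
    y = z + rng.normal(size=n)

    cp = cp_all_subsets(X, y)
    for P in sorted(cp, key=cp.get)[:4]:
        print(P, round(cp[P], 2))   # subsets keeping any one of x1, x2, x3 fit well;
                                    # dropping all three does not
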


3. CALIBRATION

To derive bench marks for more formal interpretation of Cp-plots, we assume that the model (1) is in fact exactly appropriate, with the residuals e_1, ..., e_n being independent and Normal (0, σ²). Suppose σ² is estimated by RSS_{K+}/ν where ν = n − k − 1, the residual degrees of freedom. We do not of course recommend that the following distributional results be used blindly without careful inspection of the empirical residuals y_i − η̂(x_i), i = 1, ..., n. However, they should give pause to workers who are tempted to assign significance to quantities of the magnitude of a few units or even fractions of a unit on the Cp scale.

First, notice that the increment C_{P*} − C_P (in (5) above) is distributed as 2 − t², where the t statistic (on ν degrees of freedom) is central if β = β_{P*}. In this case this increment has mean and variance of approximately 1 and 2 respectively. Similarly,

    C_{K+} − C_P = k + 1 − C_P = q(2 − F_{q,ν})    (6)

where q = k + 1 − p and the F statistic is central if β = β_P; thus if ν is large compared with q this increment has mean and variance approximately q and 2q respectively. The variance of the slope of the line joining the points (p, C_P), (k + 1, k + 1) is thus 2/q, so that the slope of a diagonal configuration such as is shown in Figure 1 will vary considerably about 45°. The following tables (derived from (6)) give values of C_P − p that will be exceeded with probability α when the subset P is in fact adequate (i.e. when β = β_P so that B_P = 0), for the cases ν = n − k − 1 = 30, ∞. The value tabulated is q(F_{q,ν}(α) − 1).

Table 1a. Values of C_P − p That Are Exceeded With Probability α When β = β_P; q = k + 1 − p, ν = 30

    q = k+1−p   1     2     3     4     5     6     7     8     9    10    15    20    30
    α = .10    1.88  2.98  3.83  4.57  5.25  5.88  6.49  7.07  7.64  8.20 10.83 13.35 18.19
        .05    3.17  4.63  5.77  6.76  7.67  8.52  9.34 10.13 10.90 11.65 15.22 18.63 25.23
        .01    6.56  8.78 10.53 12.07 13.50 14.84 16.13 17.38 18.60 19.79 25.20 30.97 41.58

Table 1b. Values of C_P − p That Are Exceeded With Probability α When β = β_P; q = k + 1 − p, ν = ∞

    q = k+1−p   1     2     3     4     5     6     7     8     9    10    15    20    30
    α = .10    1.71  2.61  3.25  3.78  4.24  4.65  5.02  5.36  5.68  5.99  7.31  8.41 10.26
        .05    2.84  3.99  4.82  5.49  6.07  6.59  7.07  7.51  7.92  8.31 10.00 11.41 13.77
        .01    5.63  7.21  8.34  9.28 10.09 10.81 11.48 12.09 12.67 13.21 15.58 17.57 20.89

For comparing two Cp-values corresponding to subsets P, P' with P ∩ P' = B, P = A ∪ B, P' = A' ∪ B, it is straightforward to derive the results, valid under the null hypothesis that each of P and P' is an adequate subset,

    E(C_P − C_{P'}) = |P| − |P'| = |A| − |A'|
    Var(C_P − C_{P'}) ≈ 2(|A| + |A'| − 2R²)

where R² is the sum of squares of the canonical correlations between the sets of variables X_A and X_{A'}, after partialling out the variables X_B. (Thus if |B| = |P| − 1, Var(C_P − C_{P'}) = 4(1 − ρ²) where ρ is the partial correlation coefficient ρ_{AA'·B}.) [Srikantan (1970) has proposed the average, rather than the sum, of the squared canonical correlations as an overall measure of association. This measure has the property that its value is changed when a new variable, completely uncorrelated with all the previous ones, is added to one of the sets of variates.]
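The entries of Tables 1a and 1b can be recomputed directly; a sketch (ours) using scipy, where the ν = ∞ rows reduce to chi-squared quantiles since qF_{q,∞} is distributed as χ²_q:

    from scipy.stats import chi2, f

    def cp_benchmark(q, nu, alpha):
        """Value of C_P - p exceeded with probability alpha when beta = beta_P."""
        if nu == float("inf"):
            return chi2.ppf(1.0 - alpha, q) - q      # limit of q*(F_{q,nu}(alpha) - 1)
        return q * (f.ppf(1.0 - alpha, q, nu) - 1.0)

    print(round(cp_benchmark(1, 30, 0.10), 2))            # 1.88, as in Table 1a
    print(round(cp_benchmark(3, float("inf"), 0.01), 2))  # 8.34, as in Table 1b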


We now use the Scheffé confidence ellipsoid to derive a different kind of result. Let us write β̂^T = (β̂_0, β̂_K^T) for the least-squares estimate of β^T = (β_0, β_K^T), and let D_K denote the matrix of centered sums of squares and products of the independent variables: if X^T X is partitioned as

    X^T X = [ n    m^T ]
            [ m    E   ],

then D_K = E − m m^T / n. Then the Scheffé 100α% confidence ellipsoid for the elements of β_K is the region

    S_α = {β_K : (β_K − β̂_K)^T D_K (β_K − β̂_K) ≤ k σ̂² F_α}    (7)

where F_α is the upper 100α% quantile of the F distribution on k, n − k − 1 degrees of freedom. Notice that S_α can be written

    S_α = {β_K : σ̂⁻¹ (β_K − β̂_K) ∈ S*_α}

where S*_α is a fixed ellipsoid centered at the origin:

    S*_α = {γ : γ^T D_K γ ≤ k F_α}.

Let P−, Q be any complementary subsets of K, P = {0, P−}. The following lemma is proved in the Appendix.

Lemma. The following statements are equivalent:
(i) the region S_α intersects the coordinate hyperplane H_P = {β_K : β_Q = 0};
(ii) the projection of S_α onto the H_Q hyperplane contains the origin;
(iii) the subset least squares estimate β̂_P = (β̂_0, β̂_{P−}) has β̂_{P−} in S_α;
(iv) C_P ≤ 2p − k − 1 + kF_α;
(v) RSS_P − RSS_{K+} ≤ k σ̂² F_α.

Now consider any hypothesis that specifies the value of β_K, and the corresponding 100α% acceptance region

    T_{β_K} = {β̂_K : σ̂⁻¹ (β̂_K − β_K) ∈ S*_α}    (8)

(clearly Pr(β̂_K ∈ T_{β_K} | β_K) is in fact equal to α; this is just the confidence property of the Scheffé ellipsoid (7)). Starting from this family of acceptance regions for hypotheses that specify β_K completely, a natural acceptance region for a composite hypothesis of the form β_Q = 0 is given by the union of all regions T_{β_K} for values of β_K such that β_Q = 0; the reasoning is that the hypothesis β_Q = 0 cannot be rejected if there is any β_K with β_Q = 0 that is acceptable according to the corresponding test in the family, i.e., if there is any β_K with β_Q = 0 lying within the confidence ellipsoid S_α. By the Lemma, the corresponding acceptable subsets {0, P−} are just those that have

    C_P ≤ 2p − k − 1 + kF_α.    (9)

We state the property formally:

A subset P = {0, P−} satisfies (9) if and only if there is some vector of coefficients β having β_Q = 0 that lies within the Scheffé ellipsoid (7), i.e. if and only if there is some vector of this form that is accepted by the corresponding test with acceptance region of the form (8).

As an example, consider the 10-variable data studied by Gorman and Toman (1966). Taking α = 0.10, k = 10, ν = 25, we find that among the 58 subsets for which Gorman and Toman computed Cp-values, there are 39 that satisfy (9), in number 7, 13, 9, 10 with p = 7, 8, 9, 10 respectively. This result gives little support to the view that this set of data is sending a clear message regarding the relative importance of the variables under consideration.

Notice that if the true coefficient vector β* has β*_Q = 0, then Pr{for all P containing P*, C_P ≤ 2p − k − 1 + kF_α} ≥ α, with equality only if p* = 1 (i.e. P* = {0}). This property of the procedure is not completely satisfying since it is not an equality; also the form of the boundary in the Cp-plot is inflexible. In theory, one way of getting a better result is the following. Given any subset P* and a sequence of constants c_1, c_2, ..., c_k (and the matrix D_K) one could compute the probability Pr{for all P containing P*, C_P ≤ c_p}; this probability depends on c_1, ..., c_k, P* and D_K, but not on any other parameters. One could then adjust c_1, ..., c_k so as to make the minimum of this probability over all choices of P* (or possibly only over all choices with p* ≥ some p_0) equal to some desired level α. The computation would presumably be done by simulation.

Starting from the Scheffé ellipsoid, Spjøtvoll (1972) has developed a multiple-comparison approach that provides confidence intervals for arbitrary quadratic functions of the unknown regression parameters, for example B_P − B_{P'} for two subsets P, P'.
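Criterion (9) is a one-line filter on the output of the enumeration sketch in section 1; a sketch (ours, with hypothetical names) that, applied with α = 0.10, k = 10, ν = 25 to the Gorman-Toman Cp values, should reproduce the count of 39 acceptable subsets quoted above:

    from scipy.stats import f

    def acceptable_subsets(cp_values, k, nu, alpha):
        """Subsets P (tuples of indices including 0) not rejected by criterion (9):
        C_P <= 2p - k - 1 + k*F_alpha, F_alpha the upper-alpha quantile of F(k, nu).
        """
        f_alpha = f.ppf(1.0 - alpha, k, nu)
        return {P: cp for P, cp in cp_values.items()
                if cp <= 2 * len(P) - k - 1 + k * f_alpha}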
4. FORMAL SELECTION OF SUBSET REGRESSIONS

Many authors have studied the problem of giving formal rules for the selection of predictors; Kennedy and Bancroft (1971) give many references. Lindley (1968) presents a Bayesian formulation of the problem. The discussion in section 3 above does not lend any support to the practice of taking the lowest point on a Cp-plot as defining a "best" subset of terms. The present author feels that the greatest value of the device is that it helps the statistician to examine some aspects of the structure of his data and helps him to recognize the ambiguities that confront him. The device cannot be expected to provide a single "best" equation when the data are intrinsically inadequate to support such a strong inference.

To make these remarks more precise and objective, we shall compute (in a special case) a measure of the performance to be expected of the rule "choose the subset that minimizes Cp, and fit it by least-squares." We shall use as a figure of merit of an arbitrary estimator η̂(x) the same quantity as was used in setting up the Cp-plot, namely the sum of predictive squared errors

    J_η̂ = σ⁻² Σ_{u=1}^n (η̂(x_u) − η(x_u))².

We can handle in detail only the case of orthogonal regressors, and so now assume X^T X = nI. In this case we see from (5) that Cp is minimized when P contains just those terms for which t_j² > 2, where t_j = √n b_j / σ̂ is the t-statistic for the j-th regression coefficient, and b_j is the least squares estimate, b_j = Σ_u x_{uj} y_u / n. Thus in this case the "minimize Cp" rule is equivalent to a stepwise regression algorithm in which all critical t-values are set at √2 and σ̂² is kept at the full-equation value throughout.

Now let us assume that n is sufficiently large that variation in σ̂ can be ignored; then t_0, t_1, ..., t_k will be independent Normal variables with unit variances and with means τ_0, ..., τ_k where τ_j = √n β_j / σ. Let d(t) be the function that equals 0 for |t| ≤ √2, and equals 1 otherwise; then J for the "minimum-Cp subset least squares" estimate can be written

    J_MinCp = Σ_{j=0}^k (t_j d(t_j) − τ_j)².

Hence

    E(J_MinCp) = Σ_{j=0}^k m(τ_j)

where m(τ) = E[((u + τ) d(u + τ) − τ)²] (where u is a standard Normal variable), and is the function displayed in Figure 4 (labelled "16%", since Pr{|u| > √2} = .1573). If the constant term is always to be included in the selected subset, the corresponding result is

    E(J_MinCp{0,P−}) = 1 + Σ_{j=1}^k m(τ_j).

Notice that the function m(τ) is less than 1 only for |τ| < .78, and rises to a maximum value of 1.65 at |τ| = 1.88. It exceeds 1.25 for 1.05 < |τ| < 3.05.

We reiterate that in this case of orthogonal regressors with n very large, the "minimum Cp" rule is equivalent to a stepwise regression algorithm with all critical levels set at 15.73%. Also shown in Figure 4 are the m-functions corresponding to several other critical levels; when all k + 1 terms are infallibly included (the "full-l.s." rule), m(τ) = 1 for all τ, so that E(J_full l.s.) = k + 1. We see that the "minimum Cp" rule will give a smaller value for E(J) than the "full-l.s." rule only when rather more of the true regression coefficients satisfy |τ| < .78 than satisfy |τ| > 1; in the worst case with |τ_j| = 1.88 for j = 1, ..., k, E(J) for the "minimize Cp" rule is 165% of that for the "full-l.s." rule. Similarly for rejection rules with other critical levels; in particular, a rule with a nominal level of 5% (two tailed) gives an E(J) at worst 246% of that of the "full-l.s." rule.

Thus using the "minimum Cp" rule to select a subset of terms for least-squares fitting cannot be recommended universally. Notice however that by examining the Cp-plot in the light of the distributional results of the previous section one can see whether or not a single best subset is uniquely indicated; the ambiguous cases where the "minimum Cp" rule will give bad results are exactly those where a large number of subsets are close competitors for the honor. With such data no selection rule can be expected to perform reliably.

[Figure 4. m-functions; horizontal axis: standardized regression coefficient, τ.]
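The function m(τ) can be evaluated by direct quadrature against the Normal density; a sketch (ours) that checks the landmarks quoted above:

    import numpy as np
    from scipy.stats import norm

    def m(tau, half_width=8.0, steps=16001):
        """m(tau) = E[((u + tau) d(u + tau) - tau)^2], d(t) = 1{|t| > sqrt(2)}."""
        u = np.linspace(-half_width, half_width, steps)
        du = u[1] - u[0]
        t = u + tau
        est = np.where(np.abs(t) > np.sqrt(2.0), t, 0.0)   # t d(t)
        return float((((est - tau) ** 2) * norm.pdf(u)).sum() * du)

    print(round(m(0.78), 2))   # about 1.00: the crossing point
    print(round(m(1.88), 2))   # about 1.65: the maximum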


5. CL-PLOTS

We now extend the Cp-plot device to handle general linear estimators. With the same set-up as in the Introduction, consider an estimate of the form

    β̃_0 = ȳ,    β̃_K = Ly    (10)

where ȳ is the mean ȳ = Σ_u y_u / n, and L is a k × n matrix of constants. We shall assume that L1_n = 0 (where 1_n^T = (1, 1, ..., 1)) so that a change in the origin of measurement of y affects only β̃_0 and not β̃_1, ..., β̃_k. Examples of estimators of this class are: full least-squares; subset-least-squares; and Bayes estimates under multinormal specifications with a multinormal prior, a special case of which is the class of "Ridge" estimates advocated by Hoerl and Kennard (1970a,b) (see also Theil (1963), section 2.3):

    L = L_f = (X^T X + fI)⁻¹ X^T    (11)

where f is a (small) scalar parameter (Hoerl and Kennard used k), and in this section we are writing X for the n × k matrix of independent variables, which we are now assuming have been standardized to have zero means and unit variances. Thus 1_n^T X = 0, diag(X^T X) = I.

As a measure of adequacy for prediction we again use the scaled summed mean square error, which in the present notation is

    J_L = σ⁻² (‖Xβ̃_K − Xβ_K‖² + n(β̃_0 − β_0)²)

and which has expectation

    E(J_L) = V_L + σ⁻² B_L

where

    V_L = 1 + tr(X^T X L L^T)
    B_L = β_K^T (LX − I)^T X^T X (LX − I) β_K.

The sum of squares about the fitted regression is

    RSS_L = ‖y − ȳ1_n − Xβ̃_K‖²

which has expectation

    E(RSS_L) = σ² V_L* + B_L

where

    V_L* = n − 1 − 2 tr(XL) + tr(X^T X L L^T).

Thus we have an estimator of E(J_L), namely

    C_L = RSS_L / σ̂² − n + 2 + 2 tr(XL).    (12)

By analogy with the Cp development, we propose that values of C_L (for various choices of L) should be plotted against values of V_L. Notice that when L is a matrix corresponding to subset least squares, C_L, V_L reduce to C_P, p respectively.

For computing values of C_L, V_L for Ridge estimates (11), the following steps can be taken. First, find H (orthogonal) and Λ = diagonal(λ_1, λ_2, ..., λ_k) so that X^T X = H^T Λ H. Compute z = H X^T y. Then

    tr(XL_f) = Σ_{i=1}^k λ_i / (f + λ_i)

    RSS_{L_f} − RSS_{L_0} = Σ_{i=1}^k f² z_i² / (λ_i (f + λ_i)²).    (13)

Figure 5 gives the resulting plot for the set of 10-variable data analyzed by Gorman and Toman (1966) and by Hoerl and Kennard (1970b). Shown are (p, C_P) points corresponding to various subset-least-squares estimates and a continuous arc of (V_L, C_L) points corresponding to Ridge estimates with values of f varying from zero at (11, 11) and increasing to the left. For this example, Hoerl and Kennard (1970b) suggested that a value of f in the interval (0.2, 0.3) would "undoubtedly" give estimated coefficients "closer to β and more stable for prediction than the least-squares coefficients or some subset of them." On the other hand from Figure 5 one would be inclined to suggest a value of f nearer to .02 than to 0.2.

[Figure 5. C_L Plot for Gorman-Toman Data: Subset (C_P) Values and Ridge (C_{L_f}) Values.]
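The computation of the Ridge arc in Figure 5 follows (11)-(13) directly. A sketch (ours), assuming X is the n × k standardized predictor matrix of full column rank, sigma2 an estimate of σ², and V_L for Ridge evaluated as 1 + Σ λ_i²/(f + λ_i)², which follows from V_L = 1 + tr(X^T X L L^T):

    import numpy as np

    def ridge_cl_arc(X, y, sigma2, f_grid):
        """(V_L, C_L) points for Ridge estimates L_f = (X'X + fI)^{-1} X'."""
        n, k = X.shape
        lam, H = np.linalg.eigh(X.T @ X)     # X'X = H diag(lam) H'; lam > 0 assumed
        z = H.T @ (X.T @ y)                  # rotated cross-products
        beta_ls = H @ (z / lam)              # full least-squares coefficients
        rss0 = float(((y - y.mean() - X @ beta_ls) ** 2).sum())
        arc = []
        for f in f_grid:
            rss = rss0 + float((f**2 * z**2 / (lam * (f + lam) ** 2)).sum())  # (13)
            tr_xl = float((lam / (f + lam)).sum())
            cl = rss / sigma2 - n + 2.0 + 2.0 * tr_xl                         # (12)
            vl = 1.0 + float((lam**2 / (f + lam) ** 2).sum())                 # V_L
            arc.append((vl, cl))
        return arc                           # plot C_L against V_L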


One obvious suggestion is to choose f to minimize C_{L_f}. Some insight into the effect of this choice can be gained as follows. First consider the case of orthogonal regressors, and now assume X^T X = I. Notice that in this case our risk function E(J) is equivalent to the quantity Σ_{j=1}^k E(β̃_j − β_j)² used by Hoerl and Kennard (1970a). We may take H = I, so that z_i = b_i, the least-squares estimate of β_i. From (12) and (13) we see that C_{L_f} is a minimum when f satisfies

    (1 + f)/f = Σ_{j=1}^k b_j² / (k σ̂²);

the adjusted estimates are then given by

    β̃_i = (1 − k σ̂² / Σ_j b_j²) b_i,    i = 1, ..., k.    (14)

It is interesting that this set of estimates is of the form suggested by Stein (1960) for the problem of estimating regression coefficients in a multivariate Normal distribution. James and Stein (1961) showed that for k ≥ 3 the vector of estimates β̃** obtained by replacing the multiplier k in (14) by any number between 0 and 2k − 4 has the property that E(J**) is less than the full-least-squares value k + 1 (see Hoerl and Kennard 1970b), for all values of the true regression coefficients. Thus our "minimize C_{L_f}" rule dominates full least-squares for k ≥ 4. This result stands in interesting contrast to the disappointing result found above for the "minimize Cp" rule.

Now, consider the case of equi-correlated regressors, with X^T X = I + ρ(11^T − I). In this case the least-squares estimate b̄ of β̄ = Σ_j β_j / k has variance 1/[k(1 − ρ + kρ)], and the vector of deviations (b_i − b̄) has covariance matrix (I − k⁻¹ 11^T)/(1 − ρ). Thus when ρ is large, these deviations become very unstable. It is found that for ρ near unity, C_{L_f} is minimized when f is near (1 − ρ)g, where

    (1 + g)/g = (1 − ρ) Σ_j (b_j − b̄)² / ((k − 1) σ̂²).

The adjusted estimates are given approximately by

    β̃_i ≈ b̄ + (1 − (k − 1) σ̂² / [(1 − ρ) Σ_j (b_j − b̄)²]) (b_i − b̄).

Thus here the "minimize C_{L_f}" rule leads to shrinking the least-squares estimates towards their average. While the details have not been fully worked out, one expects that this rule will dominate full least-squares for k ≥ 5.
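In the orthonormal case the "minimize C_{L_f}" estimate (14) is one line of code; a sketch (ours), with the James-Stein variant obtained by swapping the multiplier k for another constant in (0, 2k − 4):

    import numpy as np

    def minimize_clf_estimate(b, sigma2, mult=None):
        """Estimate (14) under X'X = I; b holds the least-squares coefficients.

        mult defaults to the paper's multiplier k; any value in (0, 2k - 4)
        gives a James-Stein-type shrinkage estimate.
        """
        k = len(b)
        c = float(k if mult is None else mult)
        return (1.0 - c * sigma2 / float((b ** 2).sum())) * b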

6. ACKNOWLEDGMENTS

It is a great personal pleasure to recall that the idea for the Cp-plot arose in the course of some discussions with Cuthbert Daniel around Christmas 1963. The use of the letter C is intended to do him honor. The device was described publicly in Mallows (1964) and again in Mallows (1966) (with the extensions described at the end of section 1 above) and has appeared in several unpublished manuscripts. Impetus for preparing the present exposition was gained in the course of delivering a series of lectures at the University of California at Berkeley in February 1972; their support is gratefully acknowledged.

APPENDIX
Proof of the Lemma

The key to these results is the identity, true for any subset P that includes 0, i.e. P = {0, P−},

    RSS_P − RSS_{K+} = (β̂_{P−} − β̂_K)^T D_K (β̂_{P−} − β̂_K)

where β̂^T = (β̂_0, β̂_K^T) is the vector of least-squares estimates of all the coefficients in the model, and (β̂_0, β̂_{P−}) is the vector of subset-least-squares estimates. From the form of S_α (7) it now follows that (iii) β̂_{P−} is in S_α if and only if (v) RSS_P − RSS_{K+} ≤ k σ̂² F_α, which is directly equivalent (if σ̂² = RSS_{K+}/(n − k − 1)) to (iv) C_P ≤ 2p − k − 1 + kF_α. Clearly (iii) implies (i); to prove the converse we remark that for any vector β^T = (β_0, β_K^T) with β_K in the hyperplane H_P = {β_K : β_Q = 0}, we have

    ‖Xβ − Xβ̂‖² = ‖Xβ − Xβ̂_P‖² + ‖Xβ̂_P − Xβ̂‖²,

the cross-product term vanishing by definition of β̂_P. Thus if any point of H_P is in S_α, β̂_{P−} must be. Finally, (i) is directly equivalent to (ii) by a simple geometrical argument.

To handle the case of non-orthogonal regressors, Sclove (1968) has suggested transforming to orthogonality before applying a shrinkage factor. A composite procedure with much intuitive appeal for this writer would be to use the Cp plot or some similar device to identify the terms that should certainly be included (since they appear in all subsets that give reasonably good fits to the data), to fit these by least squares, and to adjust the remaining estimates by orthogonalizing and shrinking towards zero as in La Motte and Hocking (1970).

[Received June 1972. Revised October 1972.]


REFERENCES

Allen, D. M. (1971), "Mean Square Error of Prediction as a Criterion for Selecting Variables," Technometrics, 13, 469-475.
Beale, E. M. L., Kendall, M. G., and Mann, D. W. (1967), "The Discarding of Variables in Multivariate Analysis," Biometrika, 54, 357-366.
Daniel, C., and Wood, F. S. (1971), Fitting Equations to Data, New York: Wiley-Interscience.
Furnival, G. M. (1971), "All Possible Regressions With Less Computation," Technometrics, 13, 403-408.
Garside, M. J. (1965), "The Best Subset in Multiple Regression Analysis," Applied Statistics, 14, 196-200.
Godfrey, M. B. (1972), "Relations Between Cp, RSS, and Mean Square Residual," unpublished manuscript submitted to Technometrics.
Gorman, J. W., and Toman, R. J. (1966), "Selection of Variables for Fitting Equations to Data," Technometrics, 8, 27-51.
Hocking, R. R., and Leslie, R. N. (1967), "Selection of the Best Subset in Regression Analysis," Technometrics, 9, 531-540.
Hoerl, A. E., and Kennard, R. W. (1970a), "Ridge Regression: Biased Estimation for Nonorthogonal Problems," Technometrics, 12, 55-67.
——— (1970b), "Ridge Regression: Applications to Nonorthogonal Problems," Technometrics, 12, 69-82.
James, W., and Stein, C. (1961), "Estimation With Quadratic Loss," in Proceedings of the Fourth Berkeley Symposium, Berkeley: University of California Press, pp. 361-379.
Kennard, R. W. (1971), "A Note on the Cp Statistic," Technometrics, 13, 899-900.
Kennedy, W. J., and Bancroft, T. A. (1971), "Model-Building for Prediction in Regression Based on Repeated Significance Tests," The Annals of Mathematical Statistics, 42, 1273-1284.
La Motte, L. R., and Hocking, R. R. (1970), "Computational Efficiency in the Selection of Regression Variables," Technometrics, 12, 83-93.
Lindley, D. V. (1968), "The Choice of Variables in Multiple Regression," Journal of the Royal Statistical Society, Ser. B, 30, 31-53 (Discussion, 54-66).
Mallows, C. L. (1964), "Choosing Variables in a Linear Regression: A Graphical Aid," unpublished paper presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, KS, May 7-9.
——— (1966), "Choosing a Subset Regression," unpublished paper presented at the Annual Meeting of the American Statistical Association, Los Angeles, August 15-19.
Mantel, N. (1970), "Why Stepdown Procedures in Variable Selection," Technometrics, 12, 621-625.
Schatzoff, M., Tsao, R., and Fienberg, S. (1968), "Efficient Calculation of All Possible Regressions," Technometrics, 10, 769-779.
Sclove, S. L. (1968), "Improved Estimators for Coefficients in Linear Regression," Journal of the American Statistical Association, 63, 596-606.
Spjøtvoll, E. (1972), "Multiple Comparison of Regression Functions," The Annals of Mathematical Statistics, 43, 1076-1088.
Srikantan, K. S. (1970), "Canonical Association Between Nominal Measurements," Journal of the American Statistical Association, 65, 284-292.
Stein, C. (1960), "Multiple Regression," in Contributions to Probability and Statistics, ed. I. Olkin, Stanford, CA: Stanford University Press, pp. 424-443.
Theil, H. (1963), "On the Use of Incomplete Prior Information in Regression Analysis," Journal of the American Statistical Association, 58, 401-414.
Watts, H. W. (1965), "The Test-o-Gram; A Pedagogical and Presentational Device," The American Statistician, 19, 22-28.
