
Estimating Optimal Transformations for Multiple Regression and Correlation

Author(s): Leo Breiman and Jerome H. Friedman


Source: Journal of the American Statistical Association, Vol. 80, No. 391 (Sep., 1985), pp. 580-598
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association
Stable URL: https://www.jstor.org/stable/2288473
Accessed: 27-08-2024 22:13 UTC

Estimating Optimal Transformations for Multiple
Regression and Correlation
LEO BREIMAN and JEROME H. FRIEDMAN*

In regression analysis the response variable Y and the predictor variables X1, . . . , Xp are often replaced by functions θ(Y) and φ1(X1), . . . , φp(Xp). We discuss a procedure for estimating those functions θ* and φ1*, . . . , φp* that minimize e² = E{[θ(Y) − Σj φj(Xj)]²}/var[θ(Y)], given only a sample {(yk, xk1, . . . , xkp), 1 ≤ k ≤ N} and making minimal assumptions concerning the data distribution or the form of the solution functions. For the bivariate case, p = 1, θ* and φ* satisfy ρ* = ρ(θ*, φ*) = max over θ, φ of ρ[θ(Y), φ(X)], where ρ is the product moment correlation coefficient and ρ* is the maximal correlation between X and Y. Our procedure thus also provides a method for estimating the maximal correlation between two variables.

KEY WORDS: Smoothing; ACE.

1. INTRODUCTION

Nonlinear transformation of variables is a common practice in regression problems. Two common goals are stabilization of error variance and symmetrization/normalization of the error distribution. A more comprehensive goal, and the one we adopt, is to find those transformations that produce the best-fitting additive model. Knowledge of such transformations aids in the interpretation and understanding of the relationship between the response and the predictors.

Let Y, X1, . . . , Xp be random variables with Y the response and X1, . . . , Xp the predictors. Let θ(Y), φ1(X1), . . . , φp(Xp) be arbitrary measurable mean-zero functions of the corresponding random variables. The fraction of variance not explained (e²) by a regression of θ(Y) on Σi φi(Xi) is

    e²(θ, φ1, . . . , φp) = E{[θ(Y) − Σi φi(Xi)]²} / Eθ²(Y).   (1.1)

Then define optimal transformations as functions θ*, φ1*, . . . , φp* that minimize (1.1); that is,

    e²(θ*, φ1*, . . . , φp*) = min over θ, φ1, . . . , φp of e²(θ, φ1, . . . , φp).   (1.2)

We show in Section 5 that optimal transformations exist and satisfy a complex system of integral equations. The heart of our approach is that there is a simple iterative algorithm, using only bivariate conditional expectations, that converges to an optimal solution. When the conditional expectations are estimated from a finite data set, use of the algorithm results in estimates of the optimal transformations.

This method has some powerful characteristics. It can be applied in situations where the response or the predictors involve arbitrary mixtures of continuous ordered variables and categorical variables (ordered or unordered). The functions θ, φ1, . . . , φp are real-valued. If the original variable is categorical, the application of θ or φi assigns a real-valued score to each of its categorical values.

The procedure is nonparametric. The optimal transformation estimates are based solely on the data sample {(yk, xk1, . . . , xkp), 1 ≤ k ≤ N}, with minimal assumptions concerning the data distribution and the form of the optimal transformations. In particular, we do not require the transformation functions to be from a particular parameterized family or even monotone. (Later we illustrate situations in which the optimal transformations are not monotone.)

It is applicable to at least three situations:

1. random designs in regression
2. autoregressive schemes in stationary ergodic time series
3. controlled designs in regression.

In the first of these, we assume the data (yk, xk), k = 1, . . . , N, are independent samples from the distribution of Y, X1, . . . , Xp. In the second, a stationary mean-zero ergodic time series X1, X2, . . . is assumed, the optimal transformations are defined to be the functions that minimize

    E[θ(Xp+1) − Σj φj(Xp+1−j)]² / Eθ²(Xp+1),

and the data consist of N + p consecutive observations x1, . . . , xN+p. This is put in standard data form by defining

    yk = xk+p,  xk = (xk+p−1, . . . , xk),  k = 1, . . . , N.

In the controlled design situation, a distribution P(dy | x) for the response variable Y is specified for every point x = (x1, . . . , xp) in the design space. The Nth-order design consists of a specification of N points x1, . . . , xN in the design space, and the data consist of these points together with measurements on the response variables y1, . . . , yN. The {yk} are assumed independent, with yk drawn from the distribution P(dy | xk). Denote by PN(dx) the empirical distribution that gives mass 1/N to each of the points x1, . . . , xN. Assume further that PN converges to P, where P(dx) is a probability measure on the design space. Then P(dy | x) and P(dx) determine the distribution of random variables Y, X1, . . . , Xp, and the optimal transformations are defined as in (1.2).

For the bivariate case, p = 1, the optimal transformations θ*(Y), φ*(X) satisfy

    ρ*(X, Y) = ρ(θ*, φ*) = max over θ, φ of ρ[θ(Y), φ(X)],   (1.3)

* Leo Breiman is Professor, Department of Statistics, University of California, Berkeley, CA 94720. Jerome H. Friedman is Professor, Department of Statistics and Stanford Linear Accelerator Center, Stanford University, Stanford, CA 94305. This work was supported by Office of Naval Research Contracts N00014-82-K-0054 and N00014-81-K-0340.

© 1985 American Statistical Association, Journal of the American Statistical Association, September 1985, Vol. 80, No. 391, Theory and Methods.

where ρ is the product-moment correlation coefficient. The quantity ρ*(X, Y) is known as the maximal correlation between X and Y, and it is used as a general measure of dependence (Gebelein 1947; also see Renyi 1959; Sarmanov 1958a,b; and Lancaster 1958). The maximal correlation has the following properties (Renyi 1959):

1. 0 ≤ ρ*(X, Y) ≤ 1.
2. ρ*(X, Y) = 0 if and only if X and Y are independent.
3. If there exists a relation of the form u(X) = v(Y), where u and v are Borel-measurable functions with var[u(X)] > 0, then ρ*(X, Y) = 1.

Therefore, in the bivariate case our procedure can also be regarded as a method for estimating the maximal correlation between two variables, providing as a by-product estimates of the functions θ*, φ* that achieve the maximum.

In the next section, we describe our procedure for finding optimal transformations using algorithmic notation, deferring mathematical justifications to Section 5 and Appendix A. We then illustrate the procedure in Section 3 by applying it to a simulated data set in which the optimal transformations are known; the estimates are surprisingly good. Our algorithm is also applied to the Boston housing data of Harrison and Rubinfeld (1978) as listed in Belsley et al. (1980). The transformations found by the algorithm generally differ from those applied in the original analysis. Finally, we apply the procedure to a multiple time series arising from an air pollution study. A FORTRAN implementation of our algorithm is available from either author. Section 4 presents a general discussion and relates this procedure to other empirical methods for finding transformations.

Section 5 and Appendix A provide some theoretical framework for the algorithm. In Section 5, under weak conditions on the joint distribution of Y, X1, . . . , Xp, it is shown that optimal transformations exist and are generally unique up to a change of sign. The optimal transformations are characterized as the eigenfunctions of a set of linear integral equations whose kernels involve bivariate distributions. We then show that our procedure converges to optimal transformations.

Appendix A discusses the algorithm as applied to finite data sets. The results are dependent on the type of data smooth employed to estimate the bivariate conditional expectations. Convergence of the algorithm is proven only for a restricted class of data smooths. However, in more than 1,000 applications of the algorithm on a variety of data sets using three different types of data smoothers, only one (very contrived) instance of nonconvergence has been found.

Appendix A also contains proof of a consistency result. Under fairly general conditions, as the sample size increases, the finite-data transformations converge in a "weak" sense to the distributional-space optimal transformations. The essential condition of the theorem involves the asymptotic consistency of a sequence of data smooths. In the case of iid data there are known results concerning the consistency of various smooths. Stone's (1977) pioneering paper established consistency for k-nearest-neighbor smoothing. Devroye and Wagner (1980) and, independently, Spiegelman and Sacks (1980) gave weak conditions for consistency of kernel smooths. See Stone (1977) and Devroye (1981) for a review of the literature. There are no analogous results, however, for stationary ergodic series or controlled designs. To remedy this, we show that there are sequences of data smooths that have the requisite properties in all three cases.

This article is presented in two distinct parts. Sections 1-4 give a fairly nontechnical overview of the method and discuss its application to data. Section 5 and Appendix A are, of necessity, more technical, presenting the theoretical foundation for the procedure.

There is relevant previous work. Closest in spirit to the ACE algorithm we develop is the MORALS algorithm of Young et al. (1976) (also see de Leeuw et al. 1976). It uses an alternating least squares fit, but it restricts transformations on discrete ordered variables to be monotonic, and transformations on continuous variables to be linear or polynomial. No theoretical framework for MORALS is given.

Renyi (1959) gave a proof of the existence of optimal transformations in the bivariate case under conditions similar to ours in the general case. He also derived integral equations satisfied by θ* and φ*, with kernels depending on the bivariate density of X and Y, and concentrated on finding solutions assuming this density known. The equations seem generally intractable, with only a few known solutions. He did not consider the problem of estimating θ*, φ* from data.

Kolmogorov (see Sarmanov and Zaharov 1960 and Lancaster 1969) proved that if Y1, . . . , Yq, X1, . . . , Xp have a joint normal distribution, then the functions θ(Y1, . . . , Yq), φ(X1, . . . , Xp) having maximum correlation are linear. It follows from this that in the regression model

    θ(Y) = Σi φi(Xi) + Z,   (1.4)

if the φi(Xi), i = 1, . . . , p, have a joint normal distribution and Z is an independent N(0, σ²), then the optimal transformations as defined in (1.2) are θ, φ1, . . . , φp. Generally, for a model of the form (1.4) with Z independent of (X1, . . . , Xp), the optimal transformations are not equal to θ, φ1, . . . , φp. But in examples with simulated data generated from models of the form (1.4) with non-normal {φi(Xi)}, the estimated optimal transformations were always close to θ, φ1, . . . , φp.

Finally, we note the work in a different direction by Kimeldorf et al. (1982), who constructed a linear-programming-type algorithm to find the monotone transformations θ(Y), φ(X) that maximize the sample correlation coefficient in the bivariate case p = 1.

2. THE ALGORITHM

Our procedure for finding θ*, φ1*, . . . , φp* is iterative. Assume first a known distribution for the variables Y, X1, . . . , Xp. Without loss of generality, let Eθ²(Y) = 1, and assume that all functions have expectation zero.

To illustrate, we first look at the bivariate case:

    e²(θ, φ) = E[θ(Y) − φ(X)]².   (2.1)

Consider the minimization of (2.1) with respect to θ(Y) for a given function φ(X), keeping Eθ² = 1. The solution is

    θ1(Y) = E[φ(X) | Y]/‖E[φ(X) | Y]‖   (2.2)

with ‖·‖ = [E(·)²]^(1/2). Next, consider the unrestricted minimization of (2.1) with respect to φ(X) for a given θ(Y). The solution is

    φ1(X) = E[θ(Y) | X].   (2.3)

Equations (2.2) and (2.3) form the basis of an iterative optimization procedure involving alternating conditional expectations (ACE).

Basic ACE Algorithm

    Set θ(Y) = Y/‖Y‖;
    Iterate until e²(θ, φ) fails to decrease:
        φ1(X) = E[θ(Y) | X];
        replace φ(X) with φ1(X);
        θ1(Y) = E[φ(X) | Y]/‖E[φ(X) | Y]‖;
        replace θ(Y) with θ1(Y);
    End Iteration Loop;
    θ and φ are the solutions θ* and φ*;
End Algorithm.

This algorithm decreases (2.1) at each step by alternately minimizing with respect to one function while holding the other fixed at its previous evaluation. Each iteration (execution of the iteration loop) performs one pair of these single-function minimizations. The process begins with an initial guess for one of the functions (θ = Y/‖Y‖) and ends when a complete iteration pass fails to decrease e². In Section 5, we prove that the algorithm converges to optimal transformations θ*, φ*.
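A minimal Python sketch may make the alternation concrete. This is our illustration, not the authors' FORTRAN implementation: the conditional expectations are estimated by a crude symmetric nearest-neighbor running mean, the function names are ours, and a fixed iteration count stands in for the "fails to decrease" test.

    import numpy as np

    def nn_smooth(t, u, M=20):
        """Estimate E[t | u] by averaging t over roughly the 2M nearest
        neighbors in the ordering of u (a crude stand-in for the data
        smooths discussed in Section 3 and Appendix A)."""
        order = np.argsort(u)
        ts = t[order]
        n = len(ts)
        out = np.empty(n)
        for k in range(n):
            lo, hi = max(0, k - M), min(n, k + M + 1)
            out[k] = ts[lo:hi].mean()
        result = np.empty(n)
        result[order] = out          # map back to the original ordering
        return result

    def ace_bivariate(x, y, n_iter=50):
        """Alternating conditional expectations for p = 1, per (2.2)-(2.3)."""
        theta = (y - y.mean()) / y.std()     # initial guess theta = Y/||Y||
        phi = np.zeros_like(theta)
        for _ in range(n_iter):
            phi = nn_smooth(theta, x)        # (2.3): phi <- E[theta | x]
            phi -= phi.mean()
            theta = nn_smooth(phi, y)        # (2.2): theta <- E[phi | y],
            theta -= theta.mean()            # then renormalize
            theta /= theta.std()
        return theta, phi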
Now consider the more general case of multiple predictors X1, . . . , Xp. We proceed in direct analogy with the basic ACE algorithm. We minimize

    e²(θ, φ1, . . . , φp) = E[θ(Y) − Σj φj(Xj)]²,   (2.4)

holding Eθ² = 1 and Eθ = Eφ1 = · · · = Eφp = 0, through a series of single-function minimizations involving bivariate conditional expectations. For a given set of functions φ1(X1), . . . , φp(Xp), minimization of (2.4) with respect to θ(Y) yields

    θ1(Y) = E[Σi φi(Xi) | Y] / ‖E[Σi φi(Xi) | Y]‖.   (2.5)

The next step is to minimize (2.4) with respect to φ1(X1), . . . , φp(Xp), given θ(Y). This is obtained through another iterative algorithm. Consider the minimization of (2.4) with respect to a single function φk(Xk), for given θ(Y) and a given set φ1, . . . , φk−1, φk+1, . . . , φp. The solution is

    φk,1(Xk) = E[θ(Y) − Σ(i≠k) φi(Xi) | Xk].   (2.6)

The corresponding iterative algorithm is as follows:

    Set φ1(X1), . . . , φp(Xp) = 0;
    Iterate until e²(θ, φ1, . . . , φp) fails to decrease:
        For k = 1 to p Do:
            φk,1(Xk) = E[θ(Y) − Σ(i≠k) φi(Xi) | Xk];
            replace φk(Xk) with φk,1(Xk);
        End For Loop;
    End Iteration Loop;
    φ1, . . . , φp are the solution functions.

Each execution of the inner For loop minimizes e² (2.4) with respect to the function φk(Xk), k = 1, . . . , p, with all other functions fixed at their previous evaluations. The loop is iterated until one complete pass over the predictor variables fails to decrease e² (2.4).

Substituting this procedure for the corresponding single-function optimization in the bivariate ACE algorithm gives rise to the full ACE algorithm for minimizing e² (2.4).

ACE Algorithm

    Set θ(Y) = Y/‖Y‖ and φ1(X1), . . . , φp(Xp) = 0;
    Iterate until e²(θ, φ1, . . . , φp) fails to decrease:
        Iterate until e²(θ, φ1, . . . , φp) fails to decrease:
            For k = 1 to p Do:
                φk,1(Xk) = E[θ(Y) − Σ(i≠k) φi(Xi) | Xk];
                replace φk(Xk) with φk,1(Xk);
            End For Loop;
        End Inner Iteration Loop;
        θ1(Y) = E[Σi φi(Xi) | Y]/‖E[Σi φi(Xi) | Y]‖;
        replace θ(Y) with θ1(Y);
    End Outer Iteration Loop;
    θ, φ1, . . . , φp are the solutions θ*, φ1*, . . . , φp*;
End ACE Algorithm.

In Section 5, we prove that the ACE algorithm converges to optimal transformations.
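Extending the earlier sketch to several predictors requires only the inner backfitting loop. Again this is our hedged illustration, reusing nn_smooth from above, with fixed iteration counts in place of the convergence tests:

    def ace(X, y, n_outer=20, n_inner=10):
        """Full ACE sketch for the p predictors in the columns of X."""
        N, p = X.shape
        theta = (y - y.mean()) / y.std()
        phis = [np.zeros(N) for _ in range(p)]
        for _ in range(n_outer):
            for _ in range(n_inner):                 # inner loop, (2.6)
                for k in range(p):
                    resid = theta - (sum(phis) - phis[k])
                    phis[k] = nn_smooth(resid, X[:, k])
                    phis[k] -= phis[k].mean()
            theta = nn_smooth(sum(phis), y)          # outer step, (2.5)
            theta -= theta.mean()
            theta /= theta.std()
        e2 = np.mean((theta - sum(phis)) ** 2)       # sample e^2
        return theta, phis, e2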
3. APPLICATIONS

In the previous section, the ACE algorithm was developed in the context of known distributions. In practice, data distributions are seldom known. Instead, one has a data set {(yk, xk1, . . . , xkp), 1 ≤ k ≤ N} that is presumed to be a sample from Y, X1, . . . , Xp. The goal is to estimate the optimal transformation functions θ(Y), φ1(X1), . . . , φp(Xp) from the data. This can be accomplished by applying the ACE algorithm to the data with the quantities e², ‖·‖, and the conditional expectations replaced by suitable estimates. The resulting functions θ*, φ1*, . . . , φp* are then taken as estimates of the corresponding optimal transformations.

The estimate for e² is the usual mean squared error for regression:

    e²(θ, φ1, . . . , φp) = (1/N) Σ(k=1 to N) [θ(yk) − Σj φj(xkj)]².

If g(y, x1, . . . , xp) is a function defined for all data values, then ‖g‖² is replaced by

    ‖g‖²_N = (1/N) Σ(k=1 to N) g²(yk, xk1, . . . , xkp).

For the case of categorical variables, the conditional expectation estimates are straightforward: If the data are {(xk, zk)}, k = 1, . . . , N, and Z is categorical, then

    E[X | Z = z] = Σ(zk=z) xk / Σ(zk=z) 1,

where X is real-valued and the sums are over the subset of observations having (categorical) value Z = z.
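In code, this categorical estimate is just a within-category mean, evaluated back at each observation; a small sketch (ours):

    def cat_cond_mean(x, z):
        """E[X | Z = z] at each observation: the mean of x over the
        observations sharing that categorical value of z."""
        means = {zk: x[z == zk].mean() for zk in np.unique(z)}
        return np.array([means[zk] for zk in z])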

For variables that can assume many ordered values, the estimation is based on smoothing techniques. Such procedures have been the subject of considerable study (e.g., see Gasser and Rosenblatt 1979, Cleveland 1979, and Craven and Wahba 1979). Since the smoother is repeatedly applied in the algorithm, high speed is desirable, as well as adaptability to local curvature. We use a smoother employing local linear fits with varying window width determined by local cross-validation (the "supersmoother"; see Friedman and Stuetzle 1982).

The algorithm evaluates θ*, φ1*, . . . , φp* at all the corresponding data values; that is, θ*(y) is evaluated at the set of data values {yk}, k = 1, . . . , N. The simplest way to understand the shape of the transformations is by means of a plot of each function versus the corresponding data values, that is, through the plots of θ*(yk) versus yk and of φ1*, . . . , φp* versus the data values of x1, . . . , xp, respectively.

In this section, we illustrate the ACE procedure by applying it to various data sets. In order to evaluate performance on finite samples, the procedure is first applied to simulated data for which the optimal transformations are known. We next apply it to the Boston housing data of Harrison and Rubinfeld (1978) as listed in Belsley et al. (1980), contrasting the ACE transformations with those used in the original analysis. For our last example, we apply the ACE procedure to a multiple time series, to study the relation between air pollution (ozone) and various meteorological quantities.

Our first example consists of 200 bivariate observations {(yk, xk), 1 ≤ k ≤ 200} generated from the model

    yk = exp[xk³ + εk],

with the xk and the εk drawn independently from a standard normal distribution N(0, 1). Figure 1(a) shows a scatterplot of these data. Figures 1(b)-1(d) show the results of applying the ACE algorithm to the data. The estimated optimal transformation θ*(y) is shown in Figure 1(b)'s plot of θ*(yk) versus yk, 1 ≤ k ≤ 200. Figure 1(c) is a plot of φ*(xk) versus xk. These plots suggest the transformations θ(y) = log(y) and φ(x) = x³, which are optimal for the parent distribution. Figure 1(d) is a plot of θ*(yk) versus φ*(xk). This plot indicates a more linear relation between the transformed variables than that between the untransformed ones.
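This first fit is easy to reproduce with the toy sketches from Section 2 (a hypothetical driver; ace_bivariate is our construction, not the authors' program):

    rng = np.random.default_rng(0)
    x = rng.standard_normal(200)
    y = np.exp(x ** 3 + rng.standard_normal(200))
    theta, phi = ace_bivariate(x, y)
    # A plot of theta against y should trace log(y) up to centering and
    # scale, and phi against x should trace x**3, as in Figures 1(b)-1(c).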
The next issue we address is how much the algorithm overfits the data, due to the repeated smoothings, resulting in inflated estimates of the maximal correlation ρ* and of R*² = 1 − e*². The answer, on the simulated data sets we have generated, is surprisingly little.

To illustrate this, we contrast two estimates of ρ* and R*²
Figure 1. First Example: (a) Original Data; (b) Transform on y; (c) Transform on x; (d) Transformed Data.

using the above model. The known optimal transformations are θ(y) = log y, φ(x) = x³. Therefore, we define the direct estimate ρ̂ for ρ*, given any data set generated as above, by the sample correlation between log yk and xk³, and set R² = ρ̂². The ACE algorithm produces the estimates

    ρ* = (1/N) Σ(k=1 to N) θ*(yk) φ*(xk)

and R*² = 1 − e*² = ρ*². In this model ρ* = .707 and R*² = .5.
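Continuing the hypothetical simulation driver above, the two estimates are a line each (standardizing φ before taking the product moment is our choice):

    rho_direct = np.corrcoef(np.log(y), x ** 3)[0, 1]   # direct estimate
    phi_std = (phi - phi.mean()) / phi.std()
    rho_ace = np.mean(theta * phi_std)                  # ACE estimate
    R2_ace = rho_ace ** 2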
median value of owner-occupied homes in each of the 506
- .5.
census tracts in the Boston Standard Metropolitan Statistical
For 100 data sets, each of size 200, generated from the above
Area to air pollution (as reflected in concentration of nitrogen
model, the means and standard deviations of the p* estimates
oxides) and to 12 other variables that are thought to affect
are in Table 1. The means and standard deviations of the R *2
housing prices. This equation was estimated by trying to de-
estimates are in Table 2.
termine the best-fitting functional form of housing price on
We also computed the differences p* - p and R*2 - R2
these 13 variables. By experimenting with a number of possible
for the 100 data sets. The means and standard deviations are
transformations of the 14 variables (response and 13 predictors),
in Table 3.
Harrison and Rubinfeld settled on an equation of the form
The preceding experiment was duplicated for smaller sample
size N = 100. In this case we obtained the differences in log(MV) = al + a2(RM)2 + a3AGE
Table 4.
We next show an application of the procedure to simulated + a4log(DIS) + a5log(RAD) + a6TAX
data generated from the model
+ a7PTRATIO + a8(B - .63)2
Yk = exp[sin(27tXk) + Ck12], 1 ? k ? 200, + aglog(LSTAT) + ajOCRIM + aj1ZN
with the Xk sampled from a uniform distribution U(0, 1) and
+ a12INDUS + a13CHAS + a14(NOX)P + c.
the Ck drawn independently of the Xk from a standard normal
distribution N(0, 1). Figure 2(a) shows a scatterplot of these A brief description of each variable is given in Appendix B.
data. Figures 2(b) and 2(c) show the optimal transformation (For a more complete description, see Harrison and Rubinfeld
estimates 0*(y) and +*(x). Although log(y) and sin(2irx) are 1978, table 4.) The coefficients al, . . . , a14 were determined
not the optimal transformations for this model [owing to the by a least squares fit to measurements of the 14 variables for
non-normal distribution of sin(2irx)], these transformations are the 506 census tracts. The best value for the exponent p was
still clearly suggested by the resulting estimates. found to be 2.0, by a numerical optimization (grid search). This
Our next example consists of a sample of 200 triples {Yk, "basic equation" was used to generate estimates for the will-
Xkl, Xk2), 1 ' k ' 200} drawn from the model Y = XIX2, with ingness to pay for and the marginal benefits of clean air. Har-
XI and X2 generated independently from a uniform distribution rison and Rubinfeld (1978) noted that the results are highly
U(- 1, 1). Note that 0(Y) = log(Y) and Oj(Xj) = log Xj sensitive to the particular specification of the form of the hous-
(j = 1, 2) cannot be solutions here, since Y, XI, and X2 all ing price equation.
assume negative values. Figure 3(a) shows a plot of 0*(Yk) We applied the ACE algorithm to the transformed measure-
versus Yk, and Figures 3(b) and 3(c) show corresponding plots ments (y', xl .. x13) (using p = 2 for NOX) appearing in the
of j* (Xkl) and 45(Xk2) (1 ' k ' 200). All three solution basic equation. To the extent that these transformations are close
transformation functions are seen to be double-valued. The to the optimal ones, the algorithm will produce almost linear
optimal transformations for this problem are 0*(Y) = log|Y| functions. Departures from linearity indicate transformations
and 4j(Xj) = loglXjl (j = 1, 2). The estimates clearly reflect that can improve the quality of the fit.
this structure except near the origin, where the smoother cannot In this (and the following) example we apply the procedure
reproduce the infinite discontinuity in the derivative. in a forward stepwise manner. For the first pass we consider

Table 2. Comparison of R*2 Estimates Table 4. Estimate Differences, Sample Size 100

Standard Standard
Estimate Mean Deviation Estimate Mean Deviation
R*2 direct .492 .047 p* - p .029 .034
ACE .503 .050 R*- R2.042 .051

the 13 bivariate problems (p = 1) involving the response y' with each of the predictor variables xk' (1 ≤ k ≤ 13) in turn. The predictor xk1' that maximizes ρ²[θ(y'), φk1(xk1')] is included in the model. The second pass (over the remaining 12 predictors) includes the 12 trivariate problems (p = 2) involving y', xk1', and each remaining xk'; the predictor that maximizes the resulting ρ² is included in the model. This forward selection procedure is continued until the best predictor of the next pass increases the ρ² of the previous pass by less than .01. The resulting final model involved four predictors and an R² of .89. Applying ACE simultaneously to all 13 predictors results in an increase in R² of only .02.
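A sketch of this greedy driver in terms of the earlier toy ace() (our construction; the threshold and loop structure are assumptions, not the authors' code):

    def forward_stepwise_ace(X, y, threshold=0.01):
        """Forward selection: repeatedly add the predictor giving the
        largest R^2 = 1 - e^2; stop when the gain drops below threshold."""
        selected, remaining, best_r2 = [], list(range(X.shape[1])), 0.0
        while remaining:
            gains = []
            for j in remaining:
                _, _, e2 = ace(X[:, selected + [j]], y)
                gains.append((1.0 - e2, j))
            r2, j_best = max(gains)
            if r2 - best_r2 < threshold:
                break
            selected.append(j_best)
            remaining.remove(j_best)
            best_r2 = r2
        return selected, best_r2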


Figure 4. Boston Housing Data: (a) Transformed log(MV); (b) Transformed MV; (c) Transformed RM² (σ = .492); (d) Transformed log(LSTAT) (σ = .417); (e) Transformed PTRATIO (σ = .147); (f) Transformed TAX (σ = .122); (g) Transformed NOX² (σ = .09); (h) Transformed y Versus Predictor of Transformed y.

Figure 4(a) shows a plot of the solution response transformation θ(y'). This function is seen to have positive curvature for central values of y', connecting two straight-line segments of different slope on either side. This suggests that the logarithmic transformation may be too severe. Figure 4(b) shows the transformation θ(y) resulting when the (forward stepwise) ACE algorithm is applied to the original untransformed census measurements. (The same predictor variable set appears in this model.) This analysis indicates that, if anything, a mild transformation, involving positive curvature, is most appropriate for the response variable.

Figures 4(c)-4(f) show the ACE transformations φk1*(xk1'), . . . , φk4*(xk4') for the (transformed) predictor variables x' appearing in the final model. The standard deviation σ(φj*) is indicated in each graph. This provides a measure of how strongly each φj*(xj') enters into the model for θ*(y'). [Note that σ(θ) = 1.] The two terms that enter most strongly involve the number of rooms squared [Figure 4(c)] and the logarithm of the fraction of the population that is of lower status [Figure 4(d)]. The nearly linear shape of the latter transformation suggests that the original logarithmic transformation was appropriate for this variable. The transformation on the number-of-rooms-squared variable is far from linear, however, indicating that a simple quadratic does not adequately capture its relationship to housing value. For fewer than six rooms, housing value is roughly independent of room number, whereas for larger values there is a strong increasing linear dependence. The remaining two variables that enter into this model are pupil-teacher ratio and property tax rate. The solution transformation for the former, Figure 4(e), is seen to be approximately linear, whereas that for the latter, Figure 4(f), has considerable nonlinear structure. For tax rates of up to $320, housing price seems to fall rapidly with increasing tax, whereas for larger rates the association is roughly constant.

Although the variable (NOX)² was not selected by our stepwise procedure, we can try to estimate its marginal effect on median home value by including it with the four selected variables and running ACE with the resulting five predictor variables. The increase in R² over the four-predictor model was .006. The solution transformations on the response and the original four predictors changed very little. The solution transformation for (NOX)² is shown in Figure 4(g). This curve is a nonmonotonic function of NOX², not well approximated by a linear (or monotone) function. This makes it difficult to formulate a simple interpretation of the willingness to pay for clean air from these data. For low concentration values, housing prices seem to increase with increasing (NOX)², whereas for higher values this trend is substantially reversed.

Figure 4(h) shows a scatterplot of θ*(yk) versus Σj φj*(xkj) for the four-predictor model. This plot shows no evidence of additional structure not captured in the model

    θ(y) = Σ(j=1 to 4) φj*(xj) + ε.

The e*² resulting from the use of the ACE transformations was .11, as compared with the e² value of .20 produced by the Harrison and Rubinfeld (1978) transformations involving all 14 variables.

For our final example, we use the ACE algorithm to study the relationship between atmospheric ozone concentration and meteorology in the Los Angeles basin. The data consist of daily measurements of ozone concentration (maximum one-hour average) and eight meteorological quantities for 330 days of 1976. Appendix C lists the variables used in the study. The ACE algorithm was applied here in the same forward stepwise manner as in the previous (housing data) example. Four variables were selected; these are the first four listed in Appendix C. The resulting R² was .78. Running the ACE algorithm with all eight predictor variables produces an R² of .79.

In order to assess the extent to which these meteorological variables capture the daily variation of the ozone level, the variable day-of-the-year was added and the ACE algorithm was run with it and the four selected meteorological variables. This can detect possible seasonal effects not captured by the meteorological variables. The resulting R² was .82. Figures 5(a)-5(f) show the optimal transformation estimates.

The solution for the response transformation, Figure 5(a), shows that, at most, a very mild transformation with negative curvature is indicated. Similarly, Figure 5(b) indicates that there is no compelling necessity to consider a transformation on the most influential predictor variable, Sandburg Air Force Base temperature. The solution transformation estimates for the remaining variables, however, are all highly nonlinear (and nonmonotonic). For example, Figure 5(d) suggests that the ozone concentration is much more influenced by the magnitude than by the sign of the pressure gradient.

The solution for the day-of-the-year variable, Figure 5(f), indicates a substantial seasonal effect after accounting for the meteorological variables. This effect is minimum at the year boundaries and has a broad maximum peaking at about May 1. This can be compared with the dependence of ozone pollution on day-of-the-year alone, without taking into account the meteorological variables. Figure 5(g) shows a smooth of ozone concentration on day-of-the-year. This smooth has an R² of .38 and is seen to peak about three months later (August 3).

The fact that the day-of-the-year transformation peaked at the beginning of May was initially puzzling to us, since the highest pollution days occur from July to September. This latter fact is confirmed by the day-of-the-year transformation with the meteorological variables removed. Our current belief is that, with the meteorological variables entered, day-of-the-year becomes a partial surrogate for hours of daylight before and during the morning commuter rush. The decline past May 1 may then be explained by the fact that daylight saving time goes into effect in Los Angeles on the last Sunday in April.

These data illustrate that ACE is useful in uncovering interesting and suggestive relationships. The form of the dependence on the Daggett pressure gradient and on the day-of-the-year would be extremely difficult to find by any previous methodology.

4. DISCUSSION

The ACE algorithm provides a fully automated method for estimating optimal transformations in multiple regression. It also provides a method for estimating maximal correlation between random variables. It differs from other empirical methods for finding transformations (Box and Tidwell 1962; Anscombe and Tukey 1963; Box and Cox 1964; Kruskal 1964, 1965; Fraser 1967; Box and Hill 1974; Linsey 1972, 1974; Wood


[Figure 5. Ozone Data: estimated transformations, including θ*(UP03), φ*(IBHT), φ*(DGPG), φ*(VSTY), and φ*(Day of Year).]
1974; Mosteller and Tukey 1977; and Tukey 1982) in that the "best" transformations of the response and predictor variables are unambiguously defined and estimated without use of ad hoc heuristics, restrictive distributional assumptions, or restriction of the transformations to a particular parametric family.

The algorithm is reasonably computer efficient. On the Boston housing data set, comprising 506 data points with 14 variables each, the run took 12 seconds of central processing unit (CPU) time on an IBM 3081 computer. Our guess is that this translates into 2.5 minutes on a VAX 11/750 computer. To extrapolate to other problems, use the estimate that running time is proportional to (number of variables) × (sample size).

A strong advantage of the ACE procedure is its ability to incorporate variables of quite different type, in terms of the set of values they can assume. The transformation functions θ(y), φ1(x1), . . . , φp(xp) assume values on the real line. Their arguments can, however, assume values on any set. For example, ordered real, periodic (circularly valued) real, and ordered or unordered categorical variables can be incorporated in the same regression equation. For periodic variables, the smoother window need only wrap around the boundaries. For categorical variables, the procedure can be regarded as estimating optimal scores for each of their values. (The special case of a categorical response and a single categorical predictor variable is known as canonical analysis (see Kendall and Stuart 1967, p. 568), and the optimal scores can, in this case, also be obtained by solution of a matrix eigenvector problem.)

The ACE procedure can also handle variables of mixed type. For example, a variable indicating present marital status might take on an integer value (number of years married) or one of several categorical values (N = never, D = divorced, W = widowed, etc.). This presents no additional complication in estimating conditional expectations. This ability provides a straightforward way to handle missing data values (Young et al. 1976): in addition to the regular set of values realized by a variable, it can also take on the value "missing."

In some situations the analyst, after running ACE, may want to estimate values of y, rather than of θ*(y), given a specific value of x. One method for doing this is to attempt to compute θ*⁻¹(Σj φj*(xj)). Letting Z = Σ(j=1 to p) φj*(Xj), however, we know that the best least squares predictor of Y of the form z(Z) is given by E(Y | Z). This is implemented in the current ACE program by predicting y as the function of Σ(j=1 to p) φj*(xj) obtained by smoothing the data values of y on the data values of Σj φj*(xj). We are grateful to Arthur Owens for suggesting this simple and elegant prediction procedure.
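In terms of the earlier sketches, the suggested predictor is a single smoothing pass (our rendering; nn_smooth is the toy smoother from Section 2):

    def predict_y(y, phis):
        """Estimate E(Y | Z) at each observation, where Z is the sum of
        the fitted predictor transformations."""
        z = sum(phis)
        return nn_smooth(y, z)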
The solution functions θ*(y) and φ1*(x1), . . . , φp*(xp) can be stored as a set of values associated with each observation (yk, xk1, . . . , xkp), 1 ≤ k ≤ N. Since θ(y) and φ(x), however, are usually smooth (for continuous y, x), they can be easily approximated and stored as cubic spline functions (deBoor 1978) with a few knots.

As a tool for data analysis, the ACE procedure provides graphical output to indicate a need for transformations as well as to guide in their choice. If a particular plot suggests a familiar functional form for a transformation, then the data can be pretransformed using this functional form and the ACE algorithm can be rerun. The linearity (or nonlinearity) of the resulting ACE transformation on the variable in question gives an indication of how good the analyst's guess is. We have found that the plots themselves often give surprising new insights into the relationship between the response and predictor variables.

As with any regression procedure, a high degree of association between predictor variables can sometimes cause the individual transformation estimates to be highly variable, even though the complete model is reasonably stable. When this is suspected, running the algorithm on randomly selected subsets of the data, or on bootstrap samples (Efron 1979), can assist in assessing the variability.

The ACE method has generality beyond that exploited here. An immediate generalization would involve multiple response variables Y1, . . . , Yq. The generalized algorithm would estimate optimal transformations θ1*, . . . , θq*, φ1*, . . . , φp* that minimize

    E[Σ(i=1 to q) θi(Yi) − Σ(j=1 to p) φj(Xj)]²

subject to Eθi = 0, i = 1, . . . , q, Eφj = 0, j = 1, . . . , p, and ‖Σi θi(Yi)‖² = 1.

This extension generalizes the ACE procedure in a sense similar to that in which canonical correlation generalizes linear regression.

The ACE algorithm (Section 2) is easily modified to incorporate this extension: an inner loop over the response variables, analogous to that for the predictor variables, replaces the single-function minimization.
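A sketch of that modification, under the stated constraints (ours, with iteration counts again replacing convergence tests):

    def ace_multi(Y, X, n_outer=20, n_inner=10):
        """Multiple-response ACE sketch: backfit the phis against
        sum_i theta_i, then the thetas against sum_j phi_j, rescaling
        so that ||sum_i theta_i|| = 1."""
        N, q = Y.shape
        p = X.shape[1]
        thetas = [(Y[:, 0] - Y[:, 0].mean()) / Y[:, 0].std()]
        thetas += [np.zeros(N) for _ in range(q - 1)]
        phis = [np.zeros(N) for _ in range(p)]
        for _ in range(n_outer):
            for _ in range(n_inner):              # loop over predictors
                for j in range(p):
                    r = sum(thetas) - (sum(phis) - phis[j])
                    phis[j] = nn_smooth(r, X[:, j])
                    phis[j] -= phis[j].mean()
            for _ in range(n_inner):              # loop over responses
                for i in range(q):
                    r = sum(phis) - (sum(thetas) - thetas[i])
                    thetas[i] = nn_smooth(r, Y[:, i])
                    thetas[i] -= thetas[i].mean()
            scale = sum(thetas).std()             # enforce the constraint
            thetas = [t / scale for t in thetas]
        return thetas, phis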
5. OPTIMAL TRANSFORMATIONS IN FUNCTION SPACE

5.1 Introduction

In this section, we first prove the existence of optimal transformations (Theorem 5.2). Then we show that the ACE algorithm converges to an optimal transformation (Theorems 5.4 and 5.5).

Define random variables to take values either in the reals or in a finite or countable unordered set. Given a set of random variables Y, X1, . . . , Xp, a transformation is defined by a set of real-valued measurable functions (θ, φ1, . . . , φp) = (θ, φ), each function defined on the range of the corresponding random variable, such that

    Eθ(Y) = 0,  Eφj(Xj) = 0,  j = 1, . . . , p,
    Eθ²(Y) < ∞,  Eφj²(Xj) < ∞,  j = 1, . . . , p.   (5.1)

Use the notation

    φ(X) = Σ(i=1 to p) φi(Xi).   (5.2)

Denote the set of all transformations by W.

Definition 5.1. A transformation (θ*, φ*) is optimal for regression if E(θ*)² = 1 and

    e*² = E[θ*(Y) − φ*(X)]² = inf {E[θ(Y) − φ(X)]²; Eθ² = 1}.

Definition 5.2. A transformation (θ**, φ**) is optimal for
correlation if E(θ**)² = 1, E(φ**)² = 1, and

    ρ* = E[θ**(Y)φ**(X)] = sup {E[θ(Y)φ(X)]; E(φ)² = 1, Eθ² = 1}.

Theorem 5.1. If (θ**, φ**) is optimal for correlation, then θ* = θ**, φ* = ρ*φ** is optimal for regression, and the converse. Furthermore, e*² = 1 − ρ*².

Proof. Write

    E(θ − φ)² = 1 − 2Eθφ + Eφ² = 1 − 2E(θφ̄)·√(Eφ²) + Eφ²,

where φ̄ = φ/√(Eφ²). Hence

    E(θ − φ)² ≥ 1 − 2ρ*√(Eφ²) + Eφ²   (5.3)

with equality only if E(θφ̄) = ρ*. The minimum of the right side of (5.3) over Eφ² is at Eφ² = (ρ*)², where it is equal to 1 − (ρ*)². Then (e*)² = 1 − (ρ*)²; and if (θ**, φ**) is optimal for correlation, then θ* = θ**, φ* = ρ*φ** is optimal for regression. The argument is reversible. (A similar result appears in Csaki and Fisher 1963.)

5.2 Existence of Optimal Transformations

To show existence of optimal transformations, two additional assumptions are needed.

Assumption 5.1. The only functions satisfying (5.1) such that

    θ(Y) + Σj φj(Xj) = 0 a.s.

are individually a.s. zero.

To formulate the second assumption, we use Definition 5.3.

Definition 5.3. Define the Hilbert spaces H2(Y), H2(X1), . . . , H2(Xp) as the sets of functions satisfying (5.1) with the usual inner product; that is, H2(Xj) is the set of all measurable φj such that Eφj(Xj) = 0, Eφj²(Xj) < ∞, with (φj', φj) = E[φj'(Xj)φj(Xj)].

Assumption 5.2. The conditional expectation operators

    E(φj(Xj) | Y): H2(Xj) → H2(Y),  all j,
    E(φj(Xj) | Xi): H2(Xj) → H2(Xi),  i ≠ j,
    E(θ(Y) | Xi): H2(Y) → H2(Xi)

are all compact.

Assumption 5.2 is satisfied in most cases of interest. A sufficient condition is given by the following. Let X, Y be random variables with joint density f(X,Y) and marginals fX, fY. Then the conditional expectation operator H2(Y) → H2(X) is compact if

    ∬ [f²(X,Y)/(fX·fY)] dx dy < ∞.
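As a quick check of this sufficient condition (our worked example; it is not carried out in the text): for a bivariate normal pair with correlation ρ, Mehler's expansion gives

$$\frac{f_{X,Y}(x,y)}{f_X(x)\,f_Y(y)} = \sum_{k=0}^{\infty} \rho^k\,\bar H_k(x)\,\bar H_k(y), \qquad \bar H_k = H_k/\sqrt{k!},$$

with the normalized Hermite polynomials orthonormal under the standard normal weight, so that

$$\iint \frac{f_{X,Y}^2}{f_X f_Y}\,dx\,dy = \sum_{k=0}^{\infty} \rho^{2k} = \frac{1}{1-\rho^2} < \infty \quad \text{for } |\rho| < 1,$$

and the operator is compact (indeed Hilbert-Schmidt) except in the degenerate case |ρ| = 1.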
Theorem 5.2. Under Assumptions 5.1 and 5.2, optimal transformations exist.

Some machinery is needed.

Proposition 5.1. The set of all functions f of the form

    f(Y, X) = θ(Y) + Σj φj(Xj),  θ ∈ H2(Y), φj ∈ H2(Xj),

with the inner product and norm

    (g, f) = E[gf],  ‖f‖² = Ef²,

is a Hilbert space, denoted by H2. The subspace of all functions φ of the form

    φ(X) = Σj φj(Xj),  φj ∈ H2(Xj),

is a closed linear subspace, denoted by H2(X). So are H2(Y), H2(X1), . . . , H2(Xp).

Proposition 5.1 follows from Proposition 5.2.

Proposition 5.2. Under Assumptions 5.1 and 5.2, there are constants 0 < c1 ≤ c2 < ∞ such that

    c1(‖θ‖² + Σj ‖φj‖²) ≤ ‖θ + Σj φj‖² ≤ c2(‖θ‖² + Σj ‖φj‖²).

Proof. The right-hand inequality is immediate. If the left side does not hold, we can find a sequence fn = θn + Σj φnj such that ‖θn‖² + Σj ‖φnj‖² = 1 but ‖fn‖² → 0. There is a subsequence n' such that θn' → θ and φn'j → φj in the sense of weak convergence in H2(Y), H2(X1), . . . , H2(Xp), respectively. Write

    E[φn'j(Xj)φn'i(Xi)] = E[φn'j(Xj)·E(φn'i(Xi) | Xj)]

to see that Assumption 5.2 implies Eφn'jφn'i → Eφjφi (i ≠ j), and similarly for Eθn'φn'j. Furthermore, ‖φj‖ ≤ lim inf ‖φn'j‖ and ‖θ‖ ≤ lim inf ‖θn'‖. Thus, defining f = θ + Σj φj,

    ‖f‖² = ‖θ + Σj φj‖² ≤ lim inf ‖fn'‖² = 0,

which implies, by Assumption 5.1, that θ = φ1 = · · · = φp = 0. On the other hand,

    ‖fn'‖² = ‖θn'‖² + Σj ‖φn'j‖² + 2 Σj (θn', φn'j) + 2 Σ(i≠j) (φn'j, φn'i).

Hence, if f = 0, then lim inf ‖fn'‖² ≥ 1, contradicting ‖fn‖² → 0.

Corollary 5.1. If fn → f weakly in H2, then θn → θ weakly in H2(Y) and φnj → φj weakly in H2(Xj), j = 1, . . . , p, and the converse.

Proof. If fn = θn + Σj φnj → θ + Σj φj weakly, then by Proposition 5.2, lim sup ‖θn‖ < ∞ and lim sup ‖φnj‖ < ∞. Take n' such that θn' → θ', φn'j → φj', and let f' = θ' + Σj φj'. Then for any g ∈ H2, (g, fn') → (g, f'), so (g, f) = (g, f') for all g. The converse is easier.

Definition 5.4. In H2, let PY, Pj, and PX denote the projection operators onto H2(Y), H2(Xj), and H2(X), respectively.
On H2(Xi), Pj (j ≠ i) is the conditional expectation operator, and similarly for PY.

Proposition 5.3. PY is compact as an operator H2(X) → H2(Y), and PX is compact as an operator H2(Y) → H2(X).

Proof. Take φn ∈ H2(X) with φn → φ weakly. This implies, by Corollary 5.1, that φnj → φj weakly. By Assumption 5.2, PYφnj → PYφj, so that PYφn → PYφ. Now take θ ∈ H2(Y), φ ∈ H2(X); then (θ, PYφ) = (θ, φ) = (PXθ, φ). Thus PX: H2(Y) → H2(X) is the adjoint of PY and hence compact.

Now, to complete the proof of Theorem 5.2, consider the functional ‖θ − φ‖² on the set of all (θ, φ) with ‖θ‖² = 1. For any θ, φ,

    ‖θ − φ‖² ≥ ‖θ − PXθ‖².

If there is a θ* that achieves the minimum of ‖θ − PXθ‖² over ‖θ‖² = 1, then an optimal transformation is (θ*, PXθ*). On ‖θ‖² = 1,

    ‖θ − PXθ‖² = 1 − ‖PXθ‖².

Let s = sup {‖PXθ‖; ‖θ‖ = 1}. Take θn such that ‖θn‖² = 1, θn → θ weakly, and ‖PXθn‖ → s. By the compactness of PX, ‖PXθn‖ → ‖PXθ‖ = s. Furthermore, ‖θ‖ ≤ 1. If ‖θ‖ < 1, then for θ' = θ/‖θ‖ we get the contradiction ‖PXθ'‖ > s. Hence ‖θ‖ = 1, and (θ, PXθ) is an optimal transformation. This argument assumes that s > 0. If s = 0, then ‖θ − PXθ‖ = 1 for all θ with ‖θ‖ = 1, and any (θ, 0) is optimal.

5.3 Characterization of Optimal Transformations

Define two operators, U: H2(Y) → H2(Y) and V: H2(X) → H2(X), by

    Uθ = PY·PX·θ,  Vφ = PX·PY·φ.

Proposition 5.4. U and V are compact, self-adjoint, and non-negative definite. They have the same eigenvalues, and there is a 1-1 correspondence between eigenspaces for a given positive eigenvalue, specified by

    φ = PXθ/‖PXθ‖,  θ = PYφ/‖PYφ‖.

Proof. Direct verification.

Let the largest eigenvalue be denoted by λ, λ = ‖U‖ = ‖V‖. In the sequel we add the assumption that there is at least one θ(Y) such that ‖PXθ‖ > 0. Then λ > 0, and Theorem 5.3 follows.

Theorem 5.3. If (θ*, φ*) is an optimal transformation for regression, then

    λθ* = Uθ*,  λφ* = Vφ*.

Conversely, if θ satisfies λθ = Uθ, ‖θ‖ = 1, then (θ, PXθ) is optimal for regression. If φ satisfies λφ = Vφ, then θ = PYφ/‖PYφ‖ and λφ/‖PYφ‖ are optimal for regression. In addition,

    (e*)² = 1 − λ.

Proof. Let (θ*, φ*) be optimal. Then φ* = PXθ*. Write

    ‖θ* − φ*‖² = 1 − 2(θ*, φ*) + ‖φ*‖².

Note that (θ*, φ*) = (θ*, PYφ*) ≤ ‖PYφ*‖, with equality only if θ* = c·PYφ*, c constant. Therefore, θ* = PYφ*/‖PYφ*‖. This implies

    ‖PYφ*‖θ* = Uθ*,  ‖PYφ*‖φ* = Vφ*,

so that ‖PYφ*‖ is an eigenvalue λ* of U, V. Computing gives ‖θ* − φ*‖² = 1 − λ*. Now take θ any eigenfunction of U corresponding to λ, with ‖θ‖ = 1. Let φ = PXθ; then ‖θ − φ‖² = 1 − λ. This shows that θ*, φ* are not optimal unless λ* = λ. The rest of the theorem is straightforward verification.

Corollary 5.2. If λ has multiplicity one, then the optimal transformation is unique up to a sign change. In any case, the set of optimal transformations is finite dimensional.

5.4 Alternating Conditional Methods

Direct solution of the equations λθ = Uθ or λφ = Vφ is formidable. Attempting to use data to directly estimate the solutions is just as difficult. In the bivariate case, if X, Y are categorical, then λθ = Uθ becomes a matrix eigenvalue problem and is tractable. This is the case treated in Kendall and Stuart (1967).

The ACE algorithm is founded on the observation that there is an iterative method for finding optimal transformations. We illustrate this in the bivariate case. The goal is to minimize ‖θ(Y) − φ(X)‖² with ‖θ‖² = 1. Denote PXθ = E(θ | X), PYφ = E(φ | Y). Start with any first-guess function θ0(Y) having a nonzero projection on the eigenspace of the largest eigenvalue of U. Then define a sequence of functions by

    φ0 = PXθ0,
    θ1 = PYφ0/‖PYφ0‖,
    φ1 = PXθ1,

and in general φn+1 = PXθn, θn+1 = PYφn+1/‖PYφn+1‖. It is clear that at each step in the iteration ‖θ − φ‖² is decreased. It is not hard to show that, in general, θn, φn converge to an optimal transformation.
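In the categorical bivariate case just mentioned, the operators are finite matrices and this iteration is ordinary power iteration. A self-contained sketch (ours) for a joint probability table P[i, j] = Pr(Y = i, X = j):

    import numpy as np

    def max_correlation_categorical(P, n_iter=200):
        """Optimal scores and maximal correlation for a categorical pair
        via alternating conditional expectations (power iteration on U)."""
        py, px = P.sum(axis=1), P.sum(axis=0)
        theta = np.arange(P.shape[0], dtype=float)   # nonconstant start
        theta -= py @ theta                          # mean zero under py
        theta /= np.sqrt(py @ theta**2)              # ||theta|| = 1
        for _ in range(n_iter):
            phi = (P.T @ theta) / px                 # E[theta(Y) | X]
            theta = (P @ phi) / py                   # E[phi(X) | Y]
            theta -= py @ theta
            theta /= np.sqrt(py @ theta**2)
        phi = (P.T @ theta) / px
        lam = theta @ P @ phi                        # = E[theta*phi] = lambda
        return theta, phi, np.sqrt(lam)              # rho* = sqrt(lambda)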


The preceding method of alternating conditionals extends to the general multivariate case. The analog is clear: given θn, φn, the next iteration is

    φn+1 = PXθn,  θn+1 = PYφn+1/‖PYφn+1‖.

However, there is an additional issue: how can PXθ be computed using only the conditional expectation operators Pj (j = 1, . . . , p)? This is done by starting with some function φ0 and iteratively subtracting off the projections of θ − φn on the subspaces H2(X1), . . . , H2(Xp), until we get a function φ such that the projection of θ − φ on each of the H2(Xj) is zero. This leads to the double-loop algorithm.

The Double-Loop Algorithm

The Outer Loop. (a) Start with an initial guess θ0(Y). (b) Put φn+1 = PXθn, θn+1 = PYφn+1/‖PYφn+1‖, and repeat until convergence.

Let PEθ0 be the projection of θ0 on the eigenspace E of U corresponding to λ.

Theorem 5.4. If ‖PEθ0‖ ≠ 0, define an optimal transformation by θ* = PEθ0/‖PEθ0‖, φ* = PXθ*. Then ‖θn − θ*‖ → 0 and ‖φn − φ*‖ → 0.
Proof. Notice that θn+1 = Uθn/‖Uθn‖. For any n, θn = an·θ* + gn, where gn ⊥ E, because, if it is true for n, then

    θn+1 = (an·λθ* + U·gn)/‖an·λθ* + U·gn‖

and U·gn is ⊥ to E. For any g ⊥ E, ‖Ug‖ ≤ τ‖g‖, where τ < λ. Since an+1 = λan/‖Uθn‖ and gn+1 = U·gn/‖Uθn‖,

    ‖gn+1‖/an+1 = ‖U·gn‖/(λan) ≤ (τ/λ)·‖gn‖/an.

Thus ‖gn‖/an ≤ c(τ/λ)ⁿ. But ‖θn‖ = 1 and an² + ‖gn‖² = 1, implying an² → 1. Since a0 > 0, then an > 0; so an → 1. Now use ‖θn − θ*‖² = (1 − an)² + ‖gn‖² to reach the conclusion. Since ‖φn+1 − φ*‖ = ‖PXθn − PXθ*‖ ≤ ‖θn − θ*‖, the theorem follows.

The Inner Loop. (a) Start with functions θ, φ(0). (b) If, after m stages of iteration, the functions are φj(m), then define, for j = 1, 2, . . . , p,

    φj(m+1) = Pj(θ − Σ(i<j) φi(m+1) − Σ(i>j) φi(m)).

Theorem 5.5. Let φm = Σj φj(m). Then ‖PXθ − φm‖ → 0.

Proof. Define the operator T by

    T = (I − Pp)(I − Pp−1) · · · (I − P1).

Then the iteration in the inner loop is expressed as

    θ − φm+1 = T(θ − φm) = T^(m+1)(θ − φ0).   (5.5)

Write θ − φ0 = (θ − PXθ) + (PXθ − φ0). Noting that T(θ − PXθ) = θ − PXθ, (5.5) becomes

    φm+1 = PXθ − T^(m+1)(PXθ − φ0).

The theorem is then proven by Proposition 5.5.

Proposition 5.5. For any φ ∈ H2(X), ‖T^m φ‖ → 0.

Proof. ‖(I − Pj)φ‖² = ‖φ‖² − ‖Pjφ‖² ≤ ‖φ‖². Thus ‖T‖ ≤ 1. There is no φ ≠ 0 such that ‖Tφ‖ = ‖φ‖. If there were, then ‖Pjφ‖ = 0 for all j; then, for φ = Σj φj,

    (φ, φ) = Σj (φ, φj) = Σj (Pjφ, φj) = 0.

The operator T can be decomposed as I + W, where W is compact. Now we claim that ‖T^m W‖ → 0 on H2(X). To prove this, let γ > 0 and define

    G(γ) = sup {‖TWφ‖/‖Wφ‖; ‖φ‖ ≤ 1, ‖Wφ‖ ≥ γ}.

Take φn → φ weakly, ‖φn‖ ≤ 1, ‖Wφn‖ ≥ γ, so that ‖TWφn‖/‖Wφn‖ → G(γ). Then ‖φ‖ ≤ 1, ‖Wφ‖ ≥ γ, and G(γ) = ‖TWφ‖/‖Wφ‖. Thus G(γ) < 1 for all γ > 0, and G is clearly nonincreasing in γ. Then

    ‖T^m Wφ‖ = ‖TW(T^(m−1)φ)‖ ≤ G(‖T^(m−1)Wφ‖)·‖T^(m−1)Wφ‖.

Put γ0 = ‖W‖ and γm = G(γm−1)γm−1; then ‖T^m W‖ ≤ γm, and clearly γm ↓ 0.

The range of W is dense in H2(X). Otherwise, there is a φ' ≠ 0 such that (φ', Wφ) = 0 for all φ. This implies (W*φ', φ) = 0, or W*φ' = 0. Then ‖T*φ'‖ = ‖φ'‖, and a repetition of the argument given before leads to φ' = 0. For any φ and ε > 0, take Wφ1 so that ‖φ − Wφ1‖ ≤ ε. Then ‖T^m φ‖ ≤ ε + ‖T^m Wφ1‖, which completes the proof.

There are two versions of the double loop. In the first, the initial functions φ0 are the limiting functions produced by the preceding inner loop. This is called the restart version. In the second, the initial functions are φ0 ≡ 0. This is the fresh start version. The main theoretical difference is that a stronger consistency result holds for the fresh start. Restart is a faster-running algorithm, and it is embodied in the ACE code.

The Single-Loop Algorithm

The original implementation of ACE combined a single iteration of the inner loop with an iteration of the outer loop. Thus it is summarized by the following.

1. Start with θ0, φ0 = 0.
2. If the current functions are θn, φn, define φn+1 by θn − φn+1 = T(θn − φn).
3. Let θn+1 = PYφn+1/‖PYφn+1‖. Run to convergence.

This is a cleaner algorithm than the double loop, and its implementation on data runs at least twice as fast as the double loop and requires only a single convergence test. Unfortunately, we have been unable to prove that it converges in function space. Assuming convergence, it can be shown that the limiting θ is an eigenfunction of U. But giving conditions for θ to correspond to λ, or even showing that θ will correspond to λ "almost always," seems difficult. For this reason, we adopted the double-loop algorithm instead.

APPENDIX A: THE ACE ALGORITHM ON FINITE DATA SETS

A.1 Introduction

The ACE algorithm is implemented on finite data sets by replacing conditional expectations, given continuous variables, by data smooths. In the theoretical results concerning the convergence and consistency properties of the ACE algorithm, the critical element is the properties of the data smooth used. The results are fragmentary. Convergence of the algorithm is proven only for a restricted class of smooths. In practice, in more than 1,000 runs of ACE on a wide variety of data sets and using three different types of smooths, we have seen only one instance of failure to converge. A fairly general, but weak, consistency proof is given. We conjecture the form of a stronger consistency result.

A.2 Data Smooths

Define a data set D to be a set {x1, . . . , xN} of N points in p-dimensional space; that is, xk = (xk1, . . . , xkp). Let D_N be the collection of all such data sets. For fixed D, define F(x) as the space of all real-valued functions φ defined on D; that is, φ ∈ F(x) is defined by the N real numbers {φ(x1), . . . , φ(xN)}. Define F(xj), j = 1, . . . , p, as the space of all real-valued functions defined on the set {x1j, x2j, . . . , xNj}.

Definition A.1. A data smooth S of x on xj is a mapping S: F(x) → F(xj) defined for every D in D_N. If φ ∈ F(x), denote the corresponding element in F(xj) by S(φ | xj) and its values by S(φ | xkj).

Let x. be any one of x1, . . . , xp. Some examples of data smooths are the following.

1. Histogram. Divide the real axis into disjoint intervals {I_s}. If x_k ∈ I_s, define

S(φ | x_k) = (1/n_s) Σ_{x_m ∈ I_s} φ(x_m),

where n_s is the number of data points in I_s.

2. Nearest Neighbor. Fix M < N/2. Order the x_i, getting x_1 < x_2 < ··· < x_N (assume no ties) and corresponding φ(x_1), ..., φ(x_N). Put

S(φ | x_k) = (1/2M) Σ_{m=−M, m≠0}^{M} φ(x_{k+m}).

If M points are not available on one side, make up the deficiency on the other side.

3. Kernel. Take K(x) defined on the reals with maximum at x = 0. Then

S(φ | x_k) = Σ_m φ(x_m)K(x_m − x_k) / Σ_m K(x_m − x_k).

4. Regression. Fix M and order the x_k as in example 2. At x_k, regress the values of φ(x_{k−M}), ..., φ(x_{k+M}), excluding φ(x_k), on x_{k−M}, ..., x_{k+M}, excluding x_k, getting a regression line L(x). Put S(φ | x_k) = L(x_k). If M points are not available on each side of x_k, make up the deficiency on the other side.

5. Supersmoother. See Friedman and Stuetzle (1982).

Some properties that are relevant to the behavior of smoothers are given next. These properties hold only if they are true for all D ∈ D_N.

1. Linearity. A smooth is linear if

S(αφ_1 + βφ_2) = αSφ_1 + βSφ_2

for all φ_1, φ_2 ∈ F(x) and all constants α, β.

2. Constant Preserving. If φ ∈ F(x) is constant (φ ≡ c), then Sφ ≡ c.

To give a further property, introduce the inner product (·, ·)_N on F(x) defined by

(φ, φ′)_N = (1/N) Σ_k φ(x_k)φ′(x_k)

and the corresponding norm ‖·‖_N.

3. Boundedness. S is bounded by M if

‖Sφ‖_N ≤ M‖φ‖_N, all φ ∈ F(x),

where ‖Sφ‖_N is defined on F(x_j) exactly as ‖φ‖_N is defined on F(x).

In these examples of smooths, all are linear, except the supersmoother. This implies they can be represented as an N × N matrix operator varying with D. All are constant preserving. Histograms and the nearest neighbor are bounded by 2. Regression is unbounded due to end effects, but in Section A.5 we introduce a modified regression smooth that is bounded by 2. The bound for kernel smooths is more complicated.
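To make Definition A.1 and the examples above concrete, the following Python sketch (ours, not the authors'; the Gaussian kernel, the bandwidth h, and all names are illustrative assumptions) implements the histogram and kernel smooths, along with the mean-centering modification (A.1) introduced in the next section.

```python
import numpy as np

def histogram_smooth(x, phi, edges):
    """Example 1: S(phi | x_k) is the average of phi over the cell I_s containing x_k."""
    cell = np.digitize(x, edges)            # index of the interval I_s containing each x_k
    out = np.empty(len(x))
    for s in np.unique(cell):
        in_s = (cell == s)
        out[in_s] = phi[in_s].mean()        # (1/n_s) * sum over x_m in I_s of phi(x_m)
    return out

def kernel_smooth(x, phi, h=1.0):
    """Example 3: S(phi | x_k) = sum_m phi(x_m) K(x_m - x_k) / sum_m K(x_m - x_k)."""
    K = np.exp(-0.5 * ((x[None, :] - x[:, None]) / h) ** 2)   # K[k, m] = K(x_m - x_k)
    return (K @ phi) / K.sum(axis=1)

def modify(smoothed_values):
    """Modification (A.1): subtract the mean, so constants are taken into zero."""
    return smoothed_values - smoothed_values.mean()
```

Both of these smooths are linear in φ and constant preserving, so each can be written as an N × N matrix applied to the vector (φ(x_1), ..., φ(x_N)), as noted above.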
A.3 Convergence of ACE

Let the data be of the form (y_k, x_k) = (y_k, x_{k1}, ..., x_{kp}), k = 1, ..., N. Assume that ȳ = x̄_1 = ··· = x̄_p = 0. Define smooths S_y, S_1, ..., S_p, where S_y: F(y, x) → F(y) and S_j: F(y, x) → F(x_j). Let H²(y, x) be the set of all functions in F(y, x) with zero mean, and let H²(y), H²(x_j) be the corresponding subspaces.

It is essential to modify the smooths so that the resulting functions have zero means. This is done by subtracting the mean; thus the modified S_j is defined by

S̃_jφ = S_jφ − Av(S_jφ).   (A.1)

Henceforth, we use only modified smooths and assume the original smooth to be constant preserving, so that the modified smooths take constants into zero.

The ACE algorithm is defined by the following.

1. θ^(0)(y_k) = y_k, φ_j^(0)(x_{kj}) = 0.

(The inner loop)

2. At the nth stage of the outer loop, start with θ^(n), φ_j^(0). For every m ≥ 1 and j = 1, ..., p, define

φ_j^(m+1) = S̃_j(θ^(n) − Σ_{i<j} φ_i^(m+1) − Σ_{i>j} φ_i^(m)).

Keep increasing m until convergence to φ̃_j.

(The outer loop)

3. Set θ^(n+1) = S̃_y(Σ_j φ̃_j)/‖S̃_y(Σ_j φ̃_j)‖_N. Go back to the inner loop with φ_j^(0) = φ̃_j (restart) or φ_j^(0) = 0 (fresh start). Continue until convergence.
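The double loop just defined is short enough to state directly in code. This sketch is ours rather than the authors' implementation: smooth_j(resid, j) and smooth_y(resid) stand for any modified (mean-centered) data smooths S̃_j and S̃_y, the tolerance and iteration caps are arbitrary, and the restart convention (carrying each φ_j over between outer stages) is used.

```python
import numpy as np

def ace(y, X, smooth_j, smooth_y, tol=1e-6, outer_max=50, inner_max=200):
    """Double-loop ACE on data (y_k, x_k1, ..., x_kp)."""
    N, p = X.shape
    theta = y - y.mean()                          # step 1: theta^(0)(y_k) = y_k, centered
    theta = theta / np.sqrt((theta ** 2).mean())  # normalize so ||theta||_N = 1
    phi = np.zeros((N, p))                        # phi_j^(0) = 0
    for _ in range(outer_max):
        for _ in range(inner_max):                # inner loop over m
            delta = 0.0
            for j in range(p):                    # phi_j <- S~_j(theta - sum_{i != j} phi_i)
                resid = theta - phi.sum(axis=1) + phi[:, j]
                new = smooth_j(resid, j)
                delta = max(delta, float(np.abs(new - phi[:, j]).max()))
                phi[:, j] = new
            if delta < tol:
                break
        u = smooth_y(phi.sum(axis=1))             # outer loop: theta <- S~_y(sum_j phi_j)
        u = u / np.sqrt((u ** 2).mean())          # normalized in ||.||_N
        if float(np.abs(u - theta).max()) < tol:
            return u, phi
        theta = u
    return theta, phi
```

Taking smooth_j and smooth_y to be linear, constant-preserving smooths such as those sketched earlier reproduces the iteration whose convergence is analyzed next.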
To formalize this algorithm, introduce the space H²(θ, φ) with elements (θ, φ_1, ..., φ_p), θ ∈ H²(y), φ_j ∈ H²(x_j), and subspaces H²(θ) with elements (θ, 0, 0, ..., 0) = θ and H²(φ) with elements (0, φ_1, ..., φ_p) = φ.

For f = (f_0, f_1, ..., f_p) in H²(θ, φ), define S_i: H²(θ, φ) → H²(θ, φ) by

(S_i f)_j = 0, j ≠ i,
(S_i f)_i = f_i + S_i(Σ_{l≠i} f_l),

where the sum over l ≠ i includes l = 0.

Starting with θ = (θ, 0, 0, ..., 0), φ^(m) = (0, φ^(m)), one complete cycle in the inner loop is described by

θ − φ^(m+1) = (I − S_p)(I − S_{p−1}) ··· (I − S_1)(θ − φ^(m)).   (A.2)

Define T on H²(θ, φ) → H²(θ, φ) as the product operator in (A.2). Then

φ^(m) = θ − T^m(θ − φ^(0)).   (A.3)

If, for a given θ, the inner loop converges, then the limiting φ satisfies

S_j(θ − φ) = 0, j = 1, ..., p.   (A.4)

That is, the smooth of the residuals on any predictor variable is zero. Adding

θ = S_yφ/‖S_yφ‖_N   (A.5)

to (A.4) gives a set of equations satisfied by the estimated optimal transformations.

Assume, for the remainder of this section, that the smooths are linear. Then (A.4) can be written as

S_jφ = S_jθ, j = 1, ..., p.   (A.6)

Let sp(S_j) denote the spectrum of the matrix S_j. Assume 1 ∉ sp(S_j). (The number 1 is in the spectrum for constant-preserving smooths but not for modified smooths.) Define matrices A_j by A_j = S_j(I − S_j)^{−1} and the matrix A as Σ_j A_j. Assume further that −1 ∉ sp(A). Then (A.6) has the unique solution

φ_j = A_j(I + A)^{−1}θ, j = 1, ..., p.   (A.7)

The element φ = (0, φ_1, ..., φ_p) given by (A.7) will be denoted by Pθ. Rewrite (A.3), using (I − T)(θ − Pθ) = 0, as

φ^(m) = Pθ − T^m(Pθ − φ^(0)).   (A.8)

Therefore, the inner loop converges if it can be shown that T^m f → 0 for all f ∈ H²(φ). What we can show is Theorem A.1.

Theorem A.1. If det[I + A] ≠ 0 and if the spectral radii of S_1, ..., S_p are all less than one, a necessary and sufficient condition for
T^m f → 0 for all f ∈ H²(φ) is that

det[λI − (I − S_p/λ)^{−1}(I − S_p) ··· (I − S_1/λ)^{−1}(I − S_1)]   (A.9)

has no zeros in |λ| ≥ 1 except λ = 1.

Proof. For T^m f → 0, all f ∈ H²(φ), it is necessary and sufficient that the spectral radius of T be less than one. The equation Tf = λf in component form is

λf_j = −S_j(λ Σ_{i<j} f_i + Σ_{i>j} f_i), j = 1, ..., p.   (A.10)

Let s = Σ_i f_i and rewrite (A.10) as

(λI − S_j)f_j = S_j((1 − λ) Σ_{i<j} f_i − s).   (A.11)

If λ = 1, (A.11) becomes (I − S_j)f_j = −S_j s, or s = −As. By assumption, this implies that s = 0, and hence f_j = 0, for all j. This rules out λ = 1 as an eigenvalue of T. For λ ≠ 1, but |λ| greater than the maximum of the spectral radii of the S_j (j = 1, ..., p), define g_j = (1 − λ) Σ_{i<j} f_i − s. Then f_j = (g_{j+1} − g_j)/(1 − λ), so

(λI − S_j)(g_{j+1} − g_j) = (1 − λ)S_j g_j

or

g_{j+1} = (I − S_j/λ)^{−1}(I − S_j)g_j.   (A.12)

Since g_{p+1} = −λs, g_1 = −s, (A.12) leads to

λs = (I − S_p/λ)^{−1}(I − S_p) ··· (I − S_1/λ)^{−1}(I − S_1)s.   (A.13)

If (A.13) has no nonzero solutions, then s = 0 and g_j = 0, j = 1, ..., p, implying all f_j = 0. Conversely, if (A.13) has a solution s ≠ 0, it leads to a solution of (A.10).

Unfortunately, condition (A.9) is difficult to verify for general linear smooths. If the S_j are self-adjoint and non-negative definite, such that all elements in the unmodified smooth matrix are non-negative, then all spectral radii of the S_j are less than one, and (A.9) can be shown to hold by verifying that (A.13) has no solutions λ with |λ| > 1 and then ruling out solutions with |λ| = 1.

Assuming that the inner loop converges to Pθ, the outer loop iteration is given by

θ^(n+1) = S_y P θ^(n)/‖S_y P θ^(n)‖_N.

Put the matrix S_y P = U, so that

θ^(n+1) = Uθ^(n)/‖Uθ^(n)‖_N.   (A.14)

If the eigenvalue λ of U having largest absolute value is real and positive, then θ^(n+1) converges to the projection of θ^(0) on the eigenspace of λ. The limiting θ, Pθ is a solution of (A.4) and (A.5). If λ is not real and positive, then θ^(n) oscillates and does not converge. If the smooths are self-adjoint and non-negative definite, then S_y P is the product of two self-adjoint non-negative definite matrices; hence it has only real non-negative eigenvalues. We are unable to find conditions guaranteeing this for more general smooths.

It can be easily shown that, with modifications near the endpoints, the nearest neighbor smooth satisfies the preceding conditions. Our current research indicates a possibility that other types of common smooths can also be modified into self-adjoint, non-negative definite smooths with non-negative matrix elements. For these, ACE convergence is guaranteed by the preceding arguments.

ACE, however, has invariably converged using a variety of non-self-adjoint smooths (with one exception, found using an odd type of kernel smooth). We conjecture that for most data sets, reasonable smooths are "close" enough to being self-adjoint so that their largest eigenvalue is real, positive, and less than one.
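When the smooths are linear, the inner-loop limit (A.7) and the outer-loop iteration (A.14) can be computed with explicit matrices. The NumPy sketch below is ours, purely for illustration: S_list holds arbitrary N × N modified smoother matrices assumed to satisfy 1 ∉ sp(S_j) and −1 ∉ sp(A).

```python
import numpy as np

def inner_loop_limit(S_list, theta):
    """Closed form (A.7): phi_j = A_j (I + A)^{-1} theta, A_j = S_j (I - S_j)^{-1}."""
    N = len(theta)
    I = np.eye(N)
    A_list = [S @ np.linalg.inv(I - S) for S in S_list]   # assumes 1 not in sp(S_j)
    A = sum(A_list)
    resid = np.linalg.solve(I + A, theta)                 # equals theta - sum_j phi_j
    return [A_j @ resid for A_j in A_list]

def outer_step(S_y, S_list, theta):
    """One outer-loop step (A.14): theta <- U theta / ||U theta||_N with U = S_y P."""
    phi_sum = sum(inner_loop_limit(S_list, theta))
    u = S_y @ phi_sum
    return u / np.sqrt((u ** 2).mean())
```

Each φ_j returned satisfies the inner-loop stationarity φ_j = S_j(θ − Σ_{i≠j} φ_i), since S_j(I + A_j) = A_j; iterating outer_step is the power iteration whose behavior is governed by the largest eigenvalue of U.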
A.4 Consistency of ACE

For φ_0, φ_1, ..., φ_p any functions in H²(Y), H²(X_1), ..., H²(X_p), and any data set D ∈ D_N, define functions P_j(φ_i | x_j) by

P_j(φ_i | x_{kj}) = E(φ_i(X_i) | X_j = x_{kj}).   (A.15)

Let φ̃_j in H²(x_j) be defined as the restriction of φ_j to the set of data values {x_{1j}, ..., x_{Nj}}, minus its mean value over the data values. Assume that the N data vectors (y_k, x_k) are samples from the distribution of (Y, X_1, ..., X_p), not necessarily independent or even random (see Section A.5).

Definition A.2. Let S_y^(N), S_j^(N) be any sequence of data smooths. They are mean squared consistent if

E‖S_j^(N)(φ_i | x_j) − P_j(φ_i | x_j)‖²_N → 0

for all φ_0, ..., φ_p as above, with the analogous definition for S_y^(N).

Whether or not the algorithm converges, a weak consistency result can be given under general conditions for the fresh-start algorithm. Start with θ_0 ∈ H²(Y). On each data set, run the inner-loop iteration m times; that is, define

φ^(n+1) = θ^(n) − T^m(θ^(n)).

Then set

θ^(n+1) = S_y φ^(n+1)/‖S_y φ^(n+1)‖_N.

Repeat the outer loop l times, getting the final functions θ_N(y; m, l), φ_{jN}(x_j; m, l). Do the analogous thing in function space starting with θ_0, getting functions whose restriction to the data set D are denoted by θ(y; m, l), φ_j(x_j; m, l).

Theorem A.2. For the fresh-start algorithm, if the smooths S_y^(N), S_j^(N) are mean squared consistent, linear, and uniformly bounded as N → ∞, and if for any θ ∈ L₂(Y), ‖θ‖²_N → ‖θ‖², E‖θ‖²_N → ‖θ‖², then

E‖θ_N(y; m, l) − θ(y; m, l)‖²_N → 0,
E‖φ_{jN}(x_j; m, l) − φ_j(x_j; m, l)‖²_N → 0.

If θ* is the optimal transformation P_E θ_0/‖P_E θ_0‖ and φ* = P_X θ*, then as m, l → ∞ in any way,

‖θ(·; m, l) − θ*‖ → 0, ‖φ_j(·; m, l) − φ_j*‖ → 0.

Proof. First note that for any product of smooths S_{i1}^(N) ··· S_{ik}^(N),

E‖S_{i1}^(N) ··· S_{ik}^(N) φ_0 − P_{i1} ··· P_{ik} φ_0‖²_N → 0.

This is illustrated with S_i^(N)S_j^(N)φ_0 (i ≠ j). Since E‖S_j^(N)φ_0 − P_jφ_0‖²_N → 0, then S_j^(N)φ_0 = P_jφ_0 + φ_{j,N}, where E‖φ_{j,N}‖²_N → 0. Therefore

S_i^(N)(S_j^(N)φ_0) = S_i^(N)P_jφ_0 + S_i^(N)φ_{j,N}.

By assumption, ‖S_i^(N)φ_{j,N}‖_N ≤ M‖φ_{j,N}‖_N, where M does not depend on N. Therefore E‖S_i^(N)φ_{j,N}‖²_N → 0. By assumption, E‖S_i^(N)P_jφ_0 − P_iP_jφ_0‖²_N → 0, so that E‖S_i^(N)S_j^(N)φ_0 − P_iP_jφ_0‖²_N → 0.

Proposition A.1. If θ_N is defined in H²(y) for all data sets D, and θ ∈ H²(Y) is such that

E‖θ_N(y) − θ(y)‖²_N → 0,

then

E‖θ_N/‖θ_N‖_N − θ/‖θ‖‖²_N → 0.

Proof. Write θ/‖θ‖ = θ/‖θ‖_N + θ(1/‖θ‖ − 1/‖θ‖_N). Then two parts are needed: first, to show that

E‖θ_N/‖θ_N‖_N − θ/‖θ‖_N‖²_N → 0,

and second, to show that

E[(1 − ‖θ‖_N/‖θ‖)²] → 0.

For the first part, let

S_N = (1/N) Σ_k (θ_N(y_k)/‖θ_N‖_N − θ(y_k)/‖θ‖_N)² = 2(1 − (θ_N, θ)_N/(‖θ_N‖_N‖θ‖_N)).

Then S_N ≤ 4, so it is enough to show that S_N → 0 in probability to get ES_N → 0. Let

V_N = (1/N) Σ_k (θ_N(y_k) − θ(y_k))² = ‖θ_N‖²_N + ‖θ‖²_N − 2(θ_N, θ)_N = (‖θ_N‖_N − ‖θ‖_N)² + 2(‖θ‖_N‖θ_N‖_N − (θ_N, θ)_N).

Both terms are positive, and since EV_N → 0, E(‖θ_N‖_N − ‖θ‖_N)² → 0 and E(‖θ‖_N‖θ_N‖_N − (θ_N, θ)_N) → 0. By assumption, ‖θ‖²_N → ‖θ‖², resulting in S_N → 0 in probability.

Now look at

W_N = (1/N) Σ_k θ²(y_k)[1/‖θ‖_N − 1/‖θ‖]² = ‖θ‖²_N(1/‖θ‖_N − 1/‖θ‖)² = (1 − ‖θ‖_N/‖θ‖)².

Then EW_N → 0 follows from the assumptions.

Using Proposition A.1, it follows that E‖θ_N(y; m, l) − θ(y; m, l)‖²_N → 0 and, in consequence, that E‖φ_{jN}(x_j; m, l) − φ_j(x_j; m, l)‖²_N → 0.

In function space, define

P^(m)θ = θ − T^mθ,  U_m = P_Y P^(m).

Then

θ(·; m, l) = U_m^l θ_0/‖U_m^l θ_0‖.

The last step in the proof is showing that

‖U_m^l θ_0/‖U_m^l θ_0‖ − θ*‖ → 0

as m, l go to infinity. Begin with Proposition A.2.

Proposition A.2. As m → ∞, U_m → U in the uniform operator norm.

Proof. ‖U_mθ − Uθ‖ = ‖P_Y T^m P_X θ‖ ≤ ‖T^m P_X θ‖. Now on H²(Y), ‖T^m P_X‖ → 0. If not, take θ_m, ‖θ_m‖ = 1, such that ‖T^m P_X θ_m‖ ≥ δ, all m. Let θ_{m′} → θ weakly; then P_X θ_{m′} → P_X θ strongly, and

‖T^{m′}P_Xθ_{m′}‖ ≤ ‖T^{m′}P_X(θ_{m′} − θ)‖ + ‖T^{m′}P_Xθ‖ ≤ ‖P_X(θ_{m′} − θ)‖ + ‖T^{m′}P_Xθ‖.

By Proposition 5.5 the right-hand side goes to zero.

The operator U_m is not necessarily self-adjoint, but it is compact. By Proposition A.2, if O(sp(U)) is any open set containing sp(U), then for m sufficiently large, sp(U_m) ⊂ O(sp(U)). Suppose, for simplicity, that the eigenspace E_λ corresponding to the largest eigenvalue λ of U is one-dimensional. (The proof goes through if E_λ is higher-dimensional, but it is more complicated.) Then for any open neighborhood O of λ and m sufficiently large, there is only one eigenvalue λ_m of U_m in O, λ_m → λ, and the projection P^(m) of U_m corresponding to λ_m converges to P_{E_λ} in the uniform operator topology. Moreover, λ_m can be taken as the eigenvalue of U_m having largest absolute value. If λ′ is the second largest eigenvalue of U and λ′_m is the eigenvalue of U_m having the second highest absolute value, then (assuming E_λ is one-dimensional) λ′_m → λ′.

Write

W_m = U_m − λ_m P^(m),  W = U − λP_{E_λ};

so ‖W_m − W‖ → 0 again. Now,

U_m^l θ_0 = λ_m^l P^(m)θ_0 + W_m^l θ_0,
U^l θ_0 = λ^l P_{E_λ}θ_0 + W^l θ_0.   (A.16)

For any ε > 0 we will show that there exist m_0, l_0 such that for m ≥ m_0, l ≥ l_0,

‖W_m^l θ_0‖/λ_m^l ≤ ε,  ‖W^l θ_0‖/λ^l ≤ ε.   (A.17)

Take r = (λ + λ′)/2 and select m_0 such that r > max(λ′, |λ′_m|; m ≥ m_0). Denote by R(μ, W_m) the resolvent of W_m. Then

W_m^l = (1/2πi) ∮_{|μ|=r} μ^l R(μ, W_m) dμ

and

‖W_m^l‖ ≤ (r^l/2π) ∮_{|μ|=r} ‖R(μ, W_m)‖ d|μ|,

where d|μ| is arc length along |μ| = r. On |μ| = r, for m ≥ m_0, ‖R(μ, W_m)‖ is continuous and bounded. Furthermore, ‖R(μ, W_m)‖ → ‖R(μ, W)‖ uniformly. If M(r) = max_{|μ|=r}‖R(μ, W)‖, then

‖W_m^l‖ ≤ r^{l+1} M(r)(1 + Δ_m),

where Δ_m → 0 as m → ∞. Certainly,

‖W^l‖ ≤ r^{l+1} M(r).

Fix δ > 0 such that (1 + δ)r < λ. Take m′ such that for m ≥ max(m_0, m′), λ_m ≥ (1 + δ)r. Then

‖W_m^l‖/λ_m^l ≤ (1/(1 + δ))^l r M(r)(1 + Δ_m)

and

‖W^l‖/λ^l ≤ (1/(1 + δ))^l r M(r).

Now choose a new m_0 and l_0 such that (A.17) is satisfied.

Using (A.17),

U_m^l θ_0/λ_m^l = P^(m)θ_0 + δ_{m,l},  U^l θ_0/λ^l = P_{E_λ}θ_0 + δ′_{m,l},

where ‖δ_{m,l}‖, ‖δ′_{m,l}‖ → 0 as m, l → ∞. Thus

‖U_m^l θ_0/‖U_m^l θ_0‖ − θ*‖ ≤ ε_{m,l} + ‖P^(m)θ_0 − P_{E_λ}θ_0‖/‖P_{E_λ}θ_0‖,

where ε_{m,l} → 0, and the right side goes to zero as m, l → ∞.

The term weak consistency is used above because we have in mind a desirable stronger result. We conjecture that for reasonable smooths, the set C_N = {(y_1, x_1), ..., (y_N, x_N); the algorithm converges} satisfies P(C_N) → 1 and that for θ_N, the limit on C_N starting from a fixed θ_0,

E[I_{C_N}‖θ_N − θ*‖²_N] → 0.

We also conjecture that such a theorem will be difficult to prove. A weaker, but probably much easier, result would be to assume the use of self-adjoint, non-negative definite smooths with non-negative matrix elements. Then we know that the algorithm converges to some θ_N, and we conjecture that E[‖θ_N − θ*‖²_N] → 0.

A.5 Mean Squared Consistency of Nearest Neighbor Smooths

To show that the ACE algorithm is applicable in a situation, we need to verify that the assumptions of Theorem A.2 can be satisfied. We do this, first assuming that the data (Y_1, X_1), ..., (Y_N, X_N) are samples from a two-dimensional stationary, ergodic process. Then the ergodic theorem implies that for any θ ∈ L₂(Y), ‖θ‖²_N → ‖θ‖² and, trivially, E‖θ‖²_N → ‖θ‖².

To show that we can get a bounded, linear sequence of smooths that are mean squared consistent, we use the nearest neighbor smooths.

Theorem A.3. Let (Y_1, X_1), ..., (Y_N, X_N) be samples from a stationary ergodic process such that the distribution of X has no atoms. Then there exists a mean squared consistent sequence of nearest-neighbor smooths of Y on X.

The proof begins with Lemma A.1.

Lemma A.1. Suppose that P(dx) has no atoms, and let P_N(dx) → P(dx). Take δ_N > 0, δ_N → δ > 0; define J(x; ε) = [x − ε, x + ε];

ε_N(x) = min{ε; P_N(J(x, ε)) ≥ δ_N}

and

ε(x) = min{ε; P(J(x, ε)) ≥ δ}.

Then, using Δ to denote symmetric difference,

P_N(J(x, ε_N(x)) Δ J(x, ε(x))) → 0 uniformly in x   (A.18)

and

lim sup_N sup_{(x,y); |x−y|≤h} P_N(J(x, ε(x)) Δ J(y, ε(y))) ≤ s_1(h),   (A.19)

where s_1(h) → 0 as h → 0.

Proof. Let F_N(x), F(x) be the cumulative df's corresponding to P_N, P. Since F_N → F and F is continuous, it follows that

sup_x |F_N(x) − F(x)| → 0.

To prove (A.18), note that

P_N(J(x, ε_N) Δ J(x, ε)) ≤ |P_N(J(x, ε_N)) − P_N(J(x, ε))| ≤ |δ_N − P_N(J(x, ε_N))| + |δ_N − δ| + |F_N(x + ε(x)) − F(x + ε(x))| + |F_N(x − ε(x)) − F(x − ε(x))|,

which does it. To prove (A.19), it is sufficient to show that

sup_{x,y; |x−y|≤h} P(J(x, ε(x)) Δ J(y, ε(y))) ≤ s_1(h).

First, note that

|ε(x) − ε(y)| ≤ |x − y|.

If J(x, ε(x)), J(y, ε(y)) overlap, then their symmetric difference consists of two intervals I_1, I_2 such that |I_1| ≤ 2|x − y|, |I_2| ≤ 2|x − y|. There is an h_0 > 0 such that if |x − y| ≤ h_0, the two neighborhoods always overlap. Otherwise there is a sequence {x_n}, with ε(x_n) → 0 and P(J(x_n, ε(x_n))) = δ, which is impossible, since P has no atoms. Then for h ≤ h_0,

sup_{x,y; |x−y|≤h} P(J(x, ε(x)) Δ J(y, ε(y))) ≤ 2 sup_{|I|≤2h} P(I),

and the right-hand side goes to zero as h → 0.

The lemma is applied as follows: Let g(y) be any bounded function in L₂(Y). Define P_δ(g | x), using I(·) to denote the indicator function, as

P_δ(g | x) = (1/δ) ∫ g(y) I(x′ ∈ J(x, ε(x))) P(dy, dx′) = (1/δ) ∫ P_X(g | x′) I(x′ ∈ J(x, ε(x))) P(dx′).

Note that P_δ is bounded and continuous in x. Denote by S_δ^N the smooths with M = [Nδ]. Proposition A.3 follows.

Proposition A.3. E‖S_δ^N g − P_δ g‖²_N → 0 for fixed δ.

Proof. By (A.18), with probability one,

S_δ^N(g | x) = (1/[Nδ]) Σ_j g(y_j) I(x_j ∈ J(x, ε_N(x)))

can be replaced for all x by

g_N(x, ω) = (1/[Nδ]) Σ_j g(y_j) I(x_j ∈ J(x, ε(x))),

where ω is a sample sequence. By the ergodic theorem, for a countable {x_n} dense on the real line, and ω ∈ W′, P(W′) = 1,

Δ_N(x_n, ω) = g_N(x_n, ω) − P_δ(g | x_n) → 0.

Use (A.19) to establish that for any bounded interval J and any ω ∈ W′, Δ_N(x, ω) → 0 uniformly for x ∈ J. Then write

‖Δ_N(x, ω)‖²_N = (1/N) Σ_{k=1}^N Δ²_N(x_k, ω) I(x_k ∈ J) + (1/N) Σ_{k=1}^N Δ²_N(x_k, ω) I(x_k ∉ J).

The first term is bounded and goes to zero for ω ∈ W′; hence its expectation goes to zero. The expectation of the second term is bounded by cP(X ∉ J). Since J can be taken arbitrarily large, this completes the proof.

Using the inequality

E‖S_δ^N g − P_X g‖²_N ≤ 2E‖S_δ^N g − P_δ g‖²_N + 2‖P_δ g − P_X g‖²

gives

lim sup_N E‖S_δ^N g − P_X g‖²_N ≤ 2‖P_δ g − P_X g‖².

Proposition A.4. For any φ(x) ∈ L₂(X), lim_{δ→0} ‖P_δφ − φ‖ = 0.

Proof. For φ bounded and continuous,

(1/δ) ∫ φ(x′) I(x′ ∈ J(x, ε(x))) P(dx′) → φ(x)

as δ → 0 for every x. Since sup|P_δφ| ≤ c for all δ, then ‖P_δφ − φ‖ → 0. The proposition follows if it can be shown that for every φ ∈ L₂(X), lim sup_δ ‖P_δφ‖ < ∞. But

‖P_δφ‖² = ∫ [(1/δ) ∫ φ(x′) I(x′ ∈ J(x, ε(x))) P(dx′)]² P(dx) ≤ ∫ φ²(x′) [(1/δ) ∫ I(x′ ∈ J(x, ε(x))) P(dx)] P(dx′).

Suppose that x′ is such that there are numbers ε_+, ε_− with P([x′, x′ + ε_+]) = δ, P([x′ − ε_−, x′]) = δ. Then x′ ∈ J(x, ε(x)) implies x′ − ε_− ≤ x ≤ x′ + ε_+, and

(1/δ) ∫ I(x′ ∈ J(x, ε(x))) P(dx) ≤ 2.   (A.20)

If, say, P([x′, ∞)) < δ, then x ≥ x′ − ε_− and (A.20) still holds, and similarly if P((−∞, x′]) < δ.

Take {θ_n} to be a countable set of functions dense in L₂(Y). By Propositions A.3 and A.4, for any ε > 0, we can select δ(ε, n), N(δ, n) so that for all n,

E‖S_δ^N θ_n − P_X θ_n‖²_N ≤ ε for δ ≤ δ(ε, n), N ≥ N(δ, n).

Let ε_M ↓ 0 as M → ∞; define δ_M = min_{n≤M} δ(ε_M, n) and N(M) = max_{n≤M} N(δ_M, n). Then

E‖S_{δ_M}^N θ_n − P_X θ_n‖²_N ≤ ε_M for n ≤ M, N ≥ N(M).

Put M(N) = max{M; N ≥ max(M, N(M))}. Then M(N) → ∞ as N → ∞, and the sequence of smooths S_{δ_{M(N)}}^N is mean squared consistent for all θ_n. Noting that for θ ∈ L₂(Y),

E‖S^N θ − P_X θ‖²_N ≤ 3E‖S^N θ_n − P_X θ_n‖²_N + 9‖θ − θ_n‖²

completes the proof of the theorem.

The fact that ACE uses modified smooths S̃^N g = S^N g − Av(S^N g) and functions g such that Eg = 0 causes no problems, since

‖Av(S^N g)‖²_N = (Av(S^N g))²

and Av(S^N g) can be handled directly, using the notation of Proposition A.3.

Assume g is bounded, and write

Av(g_N) = (1/N) Σ_k [g_N(x_k, ω) − P_δ(g | x_k)] + (1/N) Σ_k P_δ(g | x_k).

By the ergodic theorem, the second term goes a.s. to EP_δ(g | X), and an argument mimicking the proof of Proposition A.3 shows that the first term goes to zero a.s. Finally, write

|EP_δ(g | X)| = |EP_δ(g | X) − EP_X g| ≤ ‖P_δφ − φ‖,

where φ = P_X g. Thus, Theorem A.3 can be easily changed to account for modified smooths.

In the controlled experiment situation, the {x_k} are not random, but the condition P_N(dx) → P(dx) is imposed. Additional assumptions are necessary.

Assumption A.1. For θ(y) any bounded function in L₂(Y), E(θ(Y) | X = x) is continuous in x.

Assumption A.2. For i ≠ j and φ(x) any bounded continuous function, E(φ(X_i) | X_j = x) is continuous in x.

A necessary result is Proposition A.5.

Proposition A.5. For θ(y) bounded in L₂(Y) and φ(x) bounded and continuous,

(1/N) Σ_{j=1}^N θ(y_j)φ(x_j) → Eθ(Y)φ(X) a.s.

Proof. Let T_N = Σ_{j=1}^N θ(y_j)φ(x_j). Then ET_N = Σ_j g(x_j)φ(x_j), g(x) = E[θ(Y) | X = x]. By hypothesis, ET_N/N → Eθ(Y)φ(X). Furthermore,

σ²_N = var(T_N) = Σ_j E[θ(y_j) − g(x_j)]² φ²(x_j) = Σ_j h(x_j)φ²(x_j),

where h(x) = E[(θ(Y) − g(X))² | X = x]. Since hφ² is continuous and bounded, σ²_N/N → Eh(X)φ²(X). Now the application of Kolmogorov's exponential bound gives

T_N/N − ET_N/N → 0 a.s.,

proving the proposition.

In Theorem A.2 we add the restriction that θ_0 be a bounded function in L₂(Y). Then the condition on θ may be relaxed to the following: For θ any bounded function in L₂(Y), ‖θ‖²_N → ‖θ‖², E‖θ‖²_N → ‖θ‖². These follow from Proposition A.5 and its proof. Furthermore, because of Assumptions A.1 and A.2, mean squared consistency of the smooths can be relaxed to the following requirements.

Assumption A.3. For i ≠ j and every bounded continuous function φ(x_i),

E‖S_jφ − P_jφ‖²_N → 0.

Assumption A.4. For every bounded function θ(y) ∈ L₂(Y),

E‖S_jθ − P_jθ‖²_N → 0.

Assumption A.5. For every bounded continuous function φ(x_j),

E‖S_yφ − P_yφ‖²_N → 0.

The existence of sequences of nearest-neighbor smooths satisfying Assumptions A.3, A.4, and A.5 can be proven in a fashion similar to the proof of Theorem A.3. Assumption A.3 is proven using Lemma A.1 and Proposition A.4. Assumptions A.4 and A.5 require Proposition A.5 in addition.

If the data are iid, stronger results can be obtained. For instance, mean squared consistency can be proven for a modified regression smooth similar to the supersmoother. For x any point, let J(x) be the indexes of the M points in {x_k} directly above x plus the M below. If there are only M′ < M above (below), then include the M + (M − M′) directly below (above). For a regression smooth,

S(φ | x) = φ̄_x + [Γ_x(φ, x)/σ²_x](x − x̄_x),   (A.21)

where φ̄_x, x̄_x are the averages of φ(y_j), x_j over the indexes in J(x), and Γ_x(φ, x), σ²_x are the covariance between φ(y_k), x_k and the variance of x_k over the indexes in J(x).

Write the second term in (A.21) as

[Γ_x(φ, x)/σ_x][(x − x̄_x)/σ_x].

If there are M points above and below in J(x), it is not hard to show that

|(x − x̄_x)/σ_x| ≤ 1.

This is not true near endpoints, where (x − x̄_x)/σ_x can become arbitrarily large as M gets large. This endpoint behavior keeps regression from being uniformly bounded. To remedy this, define a function

[x]_1 = x, |x| ≤ 1
      = sign(x), |x| > 1,

and define the modified regression smooth by

S(φ | x) = φ̄_x + [Γ_x(φ, x)/σ_x][(x − x̄_x)/σ_x]_1.   (A.22)

This modified smooth is bounded by 2.

Theorem A.4. If, as N → ∞, M → ∞, M/N → 0, and P(dx) has no atoms, then the modified regression smooths are mean squared consistent.

The proof is in Breiman and Friedman (1982). We are almost certain that the modified regression smooths are also mean squared consistent for stationary ergodic time series, and in the weaker sense for controlled experiments, but under less definitive conditions on the rates at which M → ∞.
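A Python sketch (ours, for illustration only; it assumes no ties in the x_k and N > 2M) of the modified regression smooth (A.21)-(A.22), with the standardized deviation clipped by [·]_1 so the smooth stays bounded:

```python
import numpy as np

def modified_regression_smooth(x, phi, M):
    """Local least-squares fit over the 2M neighbors J(x_k), with the term
    (x - xbar)/sigma clipped to [-1, 1] as in (A.22); bounded by 2."""
    order = np.argsort(x)
    xs, ps = x[order], phi[order]
    N = len(x)
    out = np.empty(N)
    for r in range(N):
        lo = max(0, min(r - M, N - 2 * M - 1))   # M above and M below; deficiencies
        J = np.r_[np.arange(lo, r), np.arange(r + 1, lo + 2 * M + 1)]  # made up on the other side
        xbar, pbar = xs[J].mean(), ps[J].mean()
        sigma = np.sqrt(((xs[J] - xbar) ** 2).mean())        # sigma_x over J(x)
        gamma = ((xs[J] - xbar) * (ps[J] - pbar)).mean()     # Gamma_x(phi, x) over J(x)
        z = np.clip((xs[r] - xbar) / sigma, -1.0, 1.0)       # [.]_1 applied to (x - xbar)/sigma
        out[r] = pbar + (gamma / sigma) * z
    result = np.empty(N)
    result[order] = out                                      # undo the sort
    return result
```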
APPENDIX B: VARIABLES USED IN THE HOUSING VALUE EQUATION OF HARRISON AND RUBINFELD (1978)

MV-median value of owner-occupied homes
RM-average number of rooms in owner units
AGE-proportion of owner units built prior to 1940
DIS-weighted distances to five employment centers in the Boston region
RAD-index of accessibility to radial highways
TAX-full property tax rate ($/$10,000)
PTRATIO-pupil-teacher ratio by town school district
B-black proportion of population
LSTAT-proportion of population that is lower status
CRIM-crime rate by town
ZN-proportion of town's residential land zoned for lots greater than 25,000 square feet
INDUS-proportion of nonretail business acres per town
CHAS-Charles River dummy: 1 if tract bounds the Charles River, 0 otherwise
NOX-nitrogen oxide concentration in pphm

APPENDIX C: VARIABLES USED IN THE OZONE-POLLUTION EXAMPLE

SBTP-Sandburg Air Force Base temperature (°C)
IBHT-inversion base height (ft.)
DGPG-Daggett pressure gradient (mm Hg)
VSTY-visibility (miles)
VDHT-Vandenberg 500 millibar height (m)
HMDT-humidity (percent)

IBTP-inversion base temperature (°F)
WDSP-wind speed (mph)

Dependent Variable:

UPO3-Upland ozone concentration (ppm)

[Received August 1982. Revised July 1984.]

REFERENCES

Anscombe, F. J., and Tukey, J. W. (1963), "The Examination and Analysis of Residuals," Technometrics, 5, 141-160.
Belsley, D. A., Kuh, E., and Welsch, R. E. (1980), Regression Diagnostics, New York: John Wiley.
Box, G. E. P., and Cox, D. R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society, Ser. B, 26, 211-252.
Box, G. E. P., and Hill, W. J. (1974), "Correcting Inhomogeneity of Variance With Power Transformation Weighting," Technometrics, 16, 385-389.
Box, G. E. P., and Tidwell, P. W. (1962), "Transformations of the Independent Variables," Technometrics, 4, 531-550.
Breiman, L., and Friedman, J. (1982), "Estimating Optimal Transformations for Multiple Regression and Correlation," Technical Report 9, University of California, Berkeley, Dept. of Statistics.
Cleveland, W. S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74, 828-836.
Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377-403.
Csaki, P., and Fischer, J. (1963), "On the General Notion of Maximal Correlation," Magyar Tudomanyos Akademia, Budapest, Matematikai Kutato Intezet, Kozlemenyei, 8, 27-51.
de Boor, C. (1978), A Practical Guide to Splines, New York: Springer-Verlag.
De Leeuw, J., Young, F. W., and Takane, Y. (1976), "Additive Structure in Qualitative Data: An Alternating Least Squares Method With Optimal Scaling Features," Psychometrika, 41, 471-503.
Devroye, L. (1981), "On the Almost Everywhere Convergence of Nonparametric Regression Function Estimates," The Annals of Statistics, 9, 1310-1319.
Devroye, L., and Wagner, T. J. (1980), "Distribution-Free Consistency Results in Nonparametric Discrimination and Regression Function Estimation," The Annals of Statistics, 8, 231-239.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, 7, 1-26.
Fraser, D. A. S. (1967), "Data Transformations and the Linear Model," Annals of Mathematical Statistics, 38, 1456-1465.
Friedman, J. H., and Stuetzle, W. (1982), "Smoothing of Scatterplots," Technical Report ORION006, Stanford University, Dept. of Statistics.
Gasser, T., and Rosenblatt, M. (eds.) (1979), Smoothing Techniques for Curve Estimation, Lecture Notes in Mathematics, No. 757, New York: Springer-Verlag.
Gebelein, H. (1947), "Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichsrechnung," Zeitschrift fuer Angewandte Mathematik und Mechanik, 21, 364-379.
Harrison, D., and Rubinfeld, D. L. (1978), "Hedonic Housing Prices and the Demand for Clean Air," Journal of Environmental Economics and Management, 5, 81-102.
Kendall, M. G., and Stuart, A. (1967), The Advanced Theory of Statistics (Vol. 2), New York: Hafner Publishing.
Kimeldorf, G., May, J. H., and Sampson, A. R. (1982), "Concordant and Discordant Monotone Correlations and Their Evaluations by Nonlinear Optimization," in Studies in the Management Sciences (Vol. 19): Optimization in Statistics, eds. S. H. Zanakis and J. S. Rustagi, Amsterdam: North-Holland, pp. 117-130.
Kruskal, J. B. (1964), "Nonmetric Multidimensional Scaling: A Numerical Method," Psychometrika, 29, 115-129.
Kruskal, J. B. (1965), "Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data," Journal of the Royal Statistical Society, Ser. B, 27, 251-263.
Lancaster, H. O. (1958), "The Structure of Bivariate Distributions," Annals of Mathematical Statistics, 29, 719-736.
Lancaster, H. O. (1969), The Chi-Squared Distribution, New York: John Wiley.
Lindsey, J. K. (1972), "Fitting Response Surfaces With Power Transformations," Journal of the Royal Statistical Society, Ser. C, 21, 234-237.
Lindsey, J. K. (1974), "Construction and Comparison of Statistical Models," Journal of the Royal Statistical Society, Ser. B, 36, 418-425.
Mosteller, F., and Tukey, J. W. (1977), Data Analysis and Regression, Reading, MA: Addison-Wesley.
Renyi, A. (1959), "On Measures of Dependence," Acta Mathematica Academiae Scientiarum Hungaricae, 10, 441-451.
Sarmanov, O. V. (1958a), "The Maximal Correlation Coefficient (Symmetric Case)," Doklady Akademii Nauk SSSR, 120, 715-718.
Sarmanov, O. V. (1958b), "The Maximal Correlation Coefficient (Nonsymmetric Case)," Doklady Akademii Nauk SSSR, 121, 52-55.
Sarmanov, O. V., and Zaharov, V. K. (1960), "Maximum Coefficients of Multiple Correlation," Doklady Akademii Nauk SSSR, 130, 269-271.
Spiegelman, C., and Sacks, J. (1980), "Consistent Window Estimation in Nonparametric Regression," The Annals of Statistics, 8, 240-246.
Stone, C. J. (1977), "Consistent Nonparametric Regression," The Annals of Statistics, 5, 595-645.
Tukey, J. W. (1982), "The Use of Smelting in Guiding Re-Expression," in Modern Data Analysis, eds. J. Launer and A. Siegel, New York: Academic Press.
Wood, J. T. (1974), "An Extension of the Analysis of Transformations of Box and Cox," Journal of the Royal Statistical Society, Ser. C, 23, 278-283.
Young, F. W., de Leeuw, J., and Takane, Y. (1976), "Regression With Qualitative and Quantitative Variables: An Alternating Least Squares Method With Optimal Scaling Features," Psychometrika, 41, 505-529.

Comment

DARYL PREGIBON and YEHUDA VARDI*

In data analysis, the choice of transformations is often done subjectively. ACE is a major attempt to bring objectivity to this area. As Breiman and Friedman have demonstrated with their examples, and as we have experienced with our own, ACE is a powerful tool indeed. Our comments are sometimes critical in nature and reflect our view that there is much more to be done on the subject. We consider the methodology a significant contribution to statistics, however, and would like to compliment the authors for attacking an important problem, for narrowing the gap between mathematical statistics and data analysis, and for providing the data analyst with a useful tool.

1. ACE IN THEORY: HOW MEANINGFUL IS MAXIMAL CORRELATION?

To keep our discussion simple we limit it here to the bivariate case, though the issues that we raise are equally relevant to the general case. The basis of ACE lies in the properties of maximal correlation.

* Daryl Pregibon and Yehuda Vardi are Members of Technical Staff, AT&T Bell Laboratories, Murray Hill, NJ 07974.
